Issac Buenrostro created PARQUET-224:
----------------------------------------
Summary: Implement writing Parquet files into Cassandra natively
Key: PARQUET-224
URL: https://issues.apache.org/jira/browse/PARQUET-224
Project: Parquet
Issue Type: New Feature
Reporter: Issac Buenrostro
Priority: Minor
Writing Parquet files into Cassandra could allow parallel writes of multiple
pages into different cells, as well as low-latency reads over a persistent
connection to C*.
Each page could be written to a separate C* cell, with page metadata kept in a
separate column family.
One possible way to implement this:
- Abstract ParquetFileWriter into a ParquetDataWriter base class;
writeDictionaryPage and writeDataPage become abstract methods (see the writer
sketch after this list).
- ParquetFileWriter extends ParquetDataWriter, writing the data to
Hadoop-compatible files.
- ParquetCassandraWriter extends ParquetDataWriter, writing data to Cassandra
(sketched below)
-- for each page, metadata is written to the Metadata CF, with key
<parquet-file-name>:<row-chunk>:<column>:<page>
-- for each page, data is written to the Data CF, with the same key
-- the footer is written to the Metadata CF, with key <parquet-file-name>
- Abstract ParquetFileReader into a ParquetDataReader base class;
readNextRowGroup and readFooter become abstract methods, and Chunk will also
need to be abstracted (see the reader sketch below).
- ParquetFileReader extends ParquetDataReader, reading from Hadoop-compatible
files.
- ParquetCassandraReader extends ParquetDataReader, reading from Cassandra.
- ParquetDataWriter and ParquetDataReader are instantiated through reflection,
so the backend can be chosen by configuration (see the factory sketch below).
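
A minimal sketch of the writer abstraction, with simplified signatures (the
real parquet-mr writeDataPage/writeDictionaryPage also take encodings,
statistics, and codec details, omitted here):

{code:java}
import java.io.IOException;
import parquet.bytes.BytesInput;
import parquet.column.page.DictionaryPage;

// Base class: shared bookkeeping (row-group boundaries, column order,
// footer assembly) stays here; only page placement is backend-specific.
public abstract class ParquetDataWriter {
  public abstract void writeDictionaryPage(DictionaryPage page) throws IOException;
  public abstract void writeDataPage(BytesInput bytes, int valueCount) throws IOException;
}

// The existing writer becomes one backend among several.
public class ParquetFileWriter extends ParquetDataWriter {
  @Override
  public void writeDictionaryPage(DictionaryPage page) throws IOException {
    // append to the Hadoop-compatible output stream, as today
  }
  @Override
  public void writeDataPage(BytesInput bytes, int valueCount) throws IOException {
    // append to the Hadoop-compatible output stream, as today
  }
}
{code}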
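
And a sketch of the Cassandra side, using the DataStax Java driver with CQL
tables standing in for the Data/Metadata CFs; the keyspace, table, and column
names are illustrative only, not part of the proposal:

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import com.datastax.driver.core.Session;
import parquet.bytes.BytesInput;
import parquet.column.page.DictionaryPage;

// Illustrative schema:
//   CREATE TABLE parquet.data     (key text PRIMARY KEY, bytes blob);
//   CREATE TABLE parquet.metadata (key text PRIMARY KEY, value_count int, bytes blob);
public class ParquetCassandraWriter extends ParquetDataWriter {
  private final Session session;   // persistent connection to C*
  private final String fileName;
  private int rowChunk = 0, column = 0, page = 0;

  public ParquetCassandraWriter(Session session, String fileName) {
    this.session = session;
    this.fileName = fileName;
  }

  // Key scheme from the proposal: <parquet-file-name>:<row-chunk>:<column>:<page>
  private String pageKey() {
    return fileName + ":" + rowChunk + ":" + column + ":" + page;
  }

  @Override
  public void writeDataPage(BytesInput bytes, int valueCount) throws IOException {
    String key = pageKey();
    // Page bytes go to the Data CF under the page key...
    session.execute("INSERT INTO parquet.data (key, bytes) VALUES (?, ?)",
        key, ByteBuffer.wrap(bytes.toByteArray()));
    // ...and the per-page metadata goes to the Metadata CF under the same key.
    session.execute("INSERT INTO parquet.metadata (key, value_count) VALUES (?, ?)",
        key, valueCount);
    page++;
  }

  @Override
  public void writeDictionaryPage(DictionaryPage page) throws IOException {
    // analogous to writeDataPage, keyed by pageKey()
  }

  // The footer lands in the Metadata CF under <parquet-file-name> alone.
  public void writeFooter(ByteBuffer serializedFooter) {
    session.execute("INSERT INTO parquet.metadata (key, bytes) VALUES (?, ?)",
        fileName, serializedFooter);
  }
}
{code}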
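
The reader abstraction could mirror the writer; the signatures below are
approximations of the existing ParquetFileReader API:

{code:java}
import java.io.Closeable;
import java.io.IOException;
import parquet.column.page.PageReadStore;
import parquet.hadoop.metadata.ParquetMetadata;

public abstract class ParquetDataReader implements Closeable {
  // Footer location is backend-specific: end of file vs. the Metadata CF.
  public abstract ParquetMetadata readFooter() throws IOException;

  // Returns the pages of the next row group, or null when exhausted.
  public abstract PageReadStore readNextRowGroup() throws IOException;

  // Chunk (the unit a column's pages are read from) also becomes
  // backend-specific: a byte range in a file for ParquetFileReader,
  // a set of cells for ParquetCassandraReader.
}
{code}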
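
Reflection-based instantiation could hang off a Hadoop Configuration property;
the key name parquet.data.writer.class below is made up for illustration:

{code:java}
import org.apache.hadoop.conf.Configuration;

public final class ParquetDataWriterFactory {
  // Hypothetical property, not an existing parquet-mr config key.
  public static final String WRITER_CLASS_KEY = "parquet.data.writer.class";

  public static ParquetDataWriter newWriter(Configuration conf) {
    // Default to the Hadoop-file-backed writer when nothing is configured.
    Class<? extends ParquetDataWriter> cls = conf.getClass(
        WRITER_CLASS_KEY, ParquetFileWriter.class, ParquetDataWriter.class);
    try {
      // Assumes each backend exposes a no-arg constructor; a real hook
      // would probably pass the Configuration (or a builder) through.
      return cls.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException("Could not instantiate " + cls.getName(), e);
    }
  }
}
{code}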