All,
We have a use case to build a data ingestion app that reads data from Kafka, 
transforms it, and writes it to HDFS as Parquet files. I am trying to implement 
a Parquet file output operator which supports partitions (defined by input 
fields). I would appreciate the community's input on the following.

Staging data
Parquet stores data organized by column instead of by record. Because it 
keeps data in contiguous chunks per column, appending new records to a dataset 
requires rewriting substantial portions of an existing file or buffering 
records to create a new file (data compaction). So while Parquet may have 
storage and query benefits, it may not make sense to write it directly from a 
record stream; records need to be staged and written out in batches.
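For illustration, here is a rough, untested sketch of the kind of staging I 
have in mind, using the parquet-avro writer. The roll threshold and output 
path are placeholders, and a real Apex operator would also need to flush at 
endWindow/checkpoint for fault tolerance:

import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class BufferedParquetSink {
  private static final int ROLL_THRESHOLD = 100000;  // placeholder; tune per use case

  private final Schema schema;
  private final List<GenericRecord> buffer = new ArrayList<GenericRecord>();
  private int fileIndex = 0;

  public BufferedParquetSink(Schema schema) {
    this.schema = schema;
  }

  // Called for every record coming off the Kafka stream.
  public void add(GenericRecord record) throws Exception {
    buffer.add(record);
    if (buffer.size() >= ROLL_THRESHOLD) {
      flush();
    }
  }

  // Write the buffered records out as one new Parquet file, then clear the buffer.
  private void flush() throws Exception {
    Path path = new Path("/data/staging/part-" + (fileIndex++) + ".parquet");
    ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(path).withSchema(schema).build();
    try {
      for (GenericRecord r : buffer) {
        writer.write(r);
      }
    } finally {
      writer.close();
    }
  }
}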

Partition and sorting strategy implementation
Our use case is for immutable, read-only data sets. We plan to use Impala to 
access the data once it is built, so the partition layout should line up with 
the fields our Impala queries filter on.
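For example, with Kite a partition strategy over a couple of hypothetical 
fields might look like the following (the field names are made up; the real 
ones would come from the input record schema):

import org.kitesdk.data.PartitionStrategy;

public class EventPartitioning {
  // Partition by a field value, then by event time.
  public static PartitionStrategy strategy() {
    return new PartitionStrategy.Builder()
        .identity("customer_id")  // one directory level per distinct customer_id
        .year("event_ts")         // year/month/day derived from a timestamp field
        .month("event_ts")
        .day("event_ts")
        .build();
  }
}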

Frameworks
I'm leaning towards using kite-sdk ( http://kitesdk.org/ ) as it provides 
APIs for staging, complex data types, and partitioning.
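As a rough sketch of how the operator could use Kite (the dataset URI, schema 
file, and class names are placeholders, not a final design; the partition 
strategy is the one sketched above):

import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.DatasetWriter;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.Formats;

public class EventDatasetDemo {
  public static void main(String[] args) throws Exception {
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schemaUri("resource:event.avsc")  // assumption: Avro schema on the classpath
        .format(Formats.PARQUET)
        .partitionStrategy(EventPartitioning.strategy())
        .build();

    // Create the partitioned Parquet dataset on HDFS.
    Dataset<GenericRecord> events =
        Datasets.create("dataset:hdfs:/data/events", descriptor, GenericRecord.class);

    DatasetWriter<GenericRecord> writer = events.newWriter();
    try {
      // writer.write(record);  -- Kite routes each record to its partition directory
    } finally {
      writer.close();
    }
  }
}

The appeal is that Kite would handle the directory layout and the Parquet 
writing, so the operator itself would mostly be batching and fault tolerance.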

  *   General thoughts about the approach and related ideas.
  *   If any of you have faced similar issues or done something like this, 
please share your thoughts, obstacles, and code samples if possible.
  *   Apex dev: if something like this is already planned in Malhar, please 
let us know.

Thanks,
Sunil
