All,

We have a use case to build a data ingestion app that reads data from Kafka, transforms it, and writes it to HDFS as Parquet files. I am trying to implement a Parquet file output operator that supports partitions (defined by input fields). I would appreciate the community's input on the following.

Staging data
Parquet stores data organized by column instead of by record. Because it keeps data in contiguous chunks per column, appending new records to a dataset requires either rewriting substantial portions of an existing file or buffering records and writing them out as a new file (compaction). So while Parquet has storage and query benefits, it may not make sense to write it directly from a record stream; some staging/buffering layer is needed.

Partition and sorting strategy implementation
Our use case is for immutable, read-only data sets. We plan to use Impala to access the data once it's built, so the partition layout needs to be one Impala can work with.

Frameworks
I'm leaning towards kite-sdk ( http://kitesdk.org/ ), as it provides APIs for staging, complex data types, and partitioning. A rough sketch of the kind of Kite usage I have in mind follows.

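To make the Frameworks point concrete, here is a minimal, untested sketch of the Kite API usage I have in mind (assuming Kite SDK ~1.0; the Event schema, field names, and dataset URI are made up for illustration):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.DatasetWriter;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.Formats;
import org.kitesdk.data.PartitionStrategy;

public class ParquetDatasetSketch {
  public static void main(String[] args) {
    // Hypothetical event schema; in the real operator this would come from
    // the Kafka payload / application configuration.
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredLong("event_id")
        .requiredString("country")
        .requiredString("payload")
        .endRecord();

    // Partitions defined by input fields: identity on "country", plus a
    // hash bucket on "event_id" to bound the number of files per partition.
    PartitionStrategy strategy = new PartitionStrategy.Builder()
        .identity("country")
        .hash("event_id", 16)
        .build();

    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schema(schema)
        .format(Formats.PARQUET)
        .partitionStrategy(strategy)
        .build();

    // The dataset URI is made up; a "dataset:hive:..." URI would register
    // the table in the metastore so Impala can query it directly.
    Dataset<GenericRecord> events = Datasets.create(
        "dataset:hdfs:/data/events", descriptor, GenericRecord.class);

    // Kite routes each record to the right partition directory and manages
    // the underlying Parquet files; the operator would buffer records from
    // Kafka and write them here, closing the writer to roll files.
    DatasetWriter<GenericRecord> writer = events.newWriter();
    try {
      GenericRecord r = new GenericData.Record(schema);
      r.put("event_id", 1L);
      r.put("country", "US");
      r.put("payload", "...");
      writer.write(r);
    } finally {
      writer.close();
    }
  }
}

My rough idea is that the output operator would wrap something like this, tying the DatasetWriter lifecycle to Apex windows (flush/close per window or per size threshold) so that file rolling lines up with checkpointing; I haven't worked through the fault-tolerance story yet.
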
* In general, thoughts about the approach and ideas.
* If any of you have faced similar issues or done something like this, please share your thoughts, obstacles, and code samples if possible.
* Apex dev: if something like this is already planned in Malhar, please let us know.

Thanks,
Sunil