Hi Rafeeq,

I've added answers below.

On 08/12/2014 12:28 AM, rafeeq s wrote:
> I am new to Parquet and am using the Parquet format to store Spark
> streaming data in HDFS.
>
> Questions:
> 1. Is it possible to merge two small Parquet files?

There isn't a quick solution like concatenating the files, if that's what you're looking for: each Parquet file carries its own footer metadata, so you can't simply append one file to another. You'd have to rewrite the two files as a single new one.
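If it helps, here's a rough sketch of that rewrite using the parquet-avro API. The paths and schema file are made up, and the constructors shown are the pre-Apache ones from parquet-mr 1.x, so adjust for your version:

    import java.util.Arrays;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;

    // pre-Apache package name, circa parquet-mr 1.x
    import parquet.avro.AvroParquetReader;
    import parquet.avro.AvroParquetWriter;

    public class MergeParquetFiles {
      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            MergeParquetFiles.class.getResourceAsStream("/event.avsc"));

        // Rewrite both small files as one new file; the combined
        // footer is written when the writer is closed.
        AvroParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(
            new Path("hdfs:/data/events-merged.parquet"), schema);

        for (Path file : Arrays.asList(
            new Path("hdfs:/data/events-1.parquet"),
            new Path("hdfs:/data/events-2.parquet"))) {
          AvroParquetReader<GenericRecord> reader =
              new AvroParquetReader<GenericRecord>(file);
          GenericRecord record;
          while ((record = reader.read()) != null) {
            writer.write(record);  // copy each record into the new file
          }
          reader.close();
        }

        writer.close();
      }
    }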

As long as you're rewriting files anyway, you might want to consider staging the data in a different format and then compacting it into Parquet periodically. Avro, for example, lets you use flush and sync methods to guarantee records are on disk, and a later conversion to Parquet would still give you Parquet's I/O and encoding benefits.
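Here's a sketch of that staging step, assuming a Hadoop 2 FileSystem and a made-up schema and path. DataFileWriter.flush() pushes buffered records to the output stream, and hsync() asks HDFS to persist them:

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AvroStaging {
      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            AvroStaging.class.getResourceAsStream("/event.avsc"));

        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("hdfs:/staging/events.avro"));

        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, out);

        // append records as they arrive from the stream:
        // writer.append(record);

        writer.flush();  // flush buffered records to the output stream
        out.hsync();     // durably persist the bytes in HDFS

        writer.close();
      }
    }

A periodic job can then read the closed Avro files and rewrite them as Parquet, much like the merge example above.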

> 2. Partitioning directory structure: Is it possible to partition the
> Parquet file directory based on date?

Creating files in a partitioned structure isn't supported in Parquet itself, but once the files are in a partitioned structure, the input format will walk the directory tree and find all of them.

In parquet-avro, you pass in a path when you create a Parquet file, so while you'd have to build the date-based layout yourself, it isn't too difficult. I'm most familiar with the parquet-avro API, but you could probably build a partitioned structure with the other modules, too.
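For instance, a minimal date-partitioned write might look like this (the Hive-style date=... layout and the paths are just an illustration):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;

    import parquet.avro.AvroParquetWriter;

    public class PartitionedWrite {
      public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            PartitionedWrite.class.getResourceAsStream("/event.avsc"));

        // Encode the partition value in the directory name so the
        // input format can find the file later by walking the tree.
        String date = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
        Path file = new Path("hdfs:/data/events/date=" + date + "/part-00000.parquet");

        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<GenericRecord>(file, schema);
        // writer.write(record);
        writer.close();
      }
    }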

(Disclosure: I work on Kite...) It sounds like what you're looking for is probably a library built on top of parquet that acts more like a data store than a file format. You might want to check out the Kite project [1], which does both of the things you're asking about. Specifically, you can use a config file to define your partition layout and a simple API to automatically select partitions.
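As a rough sketch (the dataset URI, schema, and field name here are made up; see the guide for the real configuration options), a daily-partitioned Parquet dataset in Kite looks something like:

    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.Dataset;
    import org.kitesdk.data.DatasetDescriptor;
    import org.kitesdk.data.Datasets;
    import org.kitesdk.data.Formats;
    import org.kitesdk.data.PartitionStrategy;

    public class KiteEvents {
      public static void main(String[] args) throws Exception {
        // Partition by year/month/day derived from a "timestamp" field.
        PartitionStrategy strategy = new PartitionStrategy.Builder()
            .year("timestamp")
            .month("timestamp")
            .day("timestamp")
            .build();

        DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
            .schema(KiteEvents.class.getResourceAsStream("/event.avsc"))
            .partitionStrategy(strategy)
            .format(Formats.PARQUET)
            .build();

        // Kite creates the directory layout and routes each record
        // to the right partition on write.
        Dataset<GenericRecord> events = Datasets.create(
            "dataset:hdfs:/data/events", descriptor, GenericRecord.class);

        // events.newWriter().write(record);
      }
    }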

rb

[1]: http://kitesdk.org/docs/current/guide/

--
Ryan Blue
Software Engineer
Cloudera, Inc.
