Hi,

For my application I want to write files with records that are efficient in
size and easy to read from my application code, so I want to write
something like Parquet or ORC from a Beam application.

I found https://github.com/apache/beam/pull/1851 for Parquet and decided to
try to make this actually work.

While working on this I was confronted with the full impact of the choice
that in Beam (DataFlow?) you cannot specify how many parallel instances of
a step you want.

I found that formats like TextIO first write all the data to temporary
files (with a UUID in each filename) and then rewrite all of that data
into the desired number of shards.
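
For reference, this is roughly what the fixed-sharding variant looks like
with the Java SDK (just a sketch, untested; the output path, suffix and
shard count are made up):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class FixedShardsExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of("record-1", "record-2", "record-3"))
     // Each bundle is first written to a temporary file with a unique name;
     // the finalize step then redistributes the data into exactly 5 shards.
     .apply(TextIO.write()
         .to("/tmp/output/records")
         .withSuffix(".txt")
         .withNumShards(5));

    p.run().waitUntilFinish();
  }
}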
Doing this for Parquet files seems too big a hurdle given the complexity
of the file format.
If you do not do this, you get a varying number of files, and you then
actually need something like a UUID in the filenames to ensure they are
unique.
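
To make that concrete, what I have in mind is something like the DoFn
below (just a rough sketch, untested; the output directory and the schema
handling are simplified), which opens a new Parquet file per bundle and
uses a UUID to keep the filename unique:

import java.util.UUID;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

/** Writes one Parquet file per bundle; the UUID keeps filenames unique across workers. */
class WriteParquetPerBundleFn extends DoFn<GenericRecord, Void> {
  private final String outputDir;
  private final String schemaJson;   // Avro Schema is not Serializable, so ship it as JSON
  private transient ParquetWriter<GenericRecord> writer;

  WriteParquetPerBundleFn(String outputDir, Schema schema) {
    this.outputDir = outputDir;
    this.schemaJson = schema.toString();
  }

  @StartBundle
  public void startBundle() throws Exception {
    // One file per bundle; how many bundles (and thus files) exist is up to the runner.
    String filename = outputDir + "/part-" + UUID.randomUUID() + ".parquet";
    writer = AvroParquetWriter.<GenericRecord>builder(new Path(filename))
        .withSchema(new Schema.Parser().parse(schemaJson))
        .build();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    writer.write(c.element());
  }

  @FinishBundle
  public void finishBundle() throws Exception {
    writer.close();
  }
}

The obvious downside is that the number and size of the output files then
depends entirely on how the runner happens to bundle the data.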

From my perspective the core problem here is the fact that Beam does (in
general) automatic scaling of the steps in a pipeline, which is really good
in most scenarios ... except in scenarios like this.

I would like advice on how to proceed in this case.
At this point I'm really tempted to switch back to Flink, as support for
these file formats is readily available there and works as expected.


-- 
Best regards,

Niels Basjes
