Apache Beam supports writing a fixed number of shards, but discourages it because it limits the runner's ability to auto-tune and scale the write; letting the runner choose the sharding makes it easier for users to build good, scalable pipelines.
Some users do require a fixed number of shards, and several classes such as TextIO support fixed sharding. If you're trying to always use a fixed number of shards, take a look at how TextIO#withNumShards <https://github.com/apache/beam/blob/8b540d2dd43af2a0495884d8152472a7ebff8a8e/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L793> works, look at the internal implementation details of WriteFiles#withNumShards <https://github.com/apache/beam/blob/8b540d2dd43af2a0495884d8152472a7ebff8a8e/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L253>, and re-use them for ParquetIO as needed. (A small usage sketch is included below the quoted message.)

On Tue, Sep 5, 2017 at 1:04 AM, Niels Basjes <[email protected]> wrote:
> Hi,
>
> For my application I want to write files with records that are efficient
> in size and easy to read from my application code, so I want to write
> something like Parquet or ORC from a Beam application.
>
> I found https://github.com/apache/beam/pull/1851 for Parquet and decided
> to try to make this actually work.
>
> While working on this I was confronted with the full impact of the choice
> that in Beam (Dataflow?) you cannot specify how many parallel instances
> you want of something.
>
> I found that formats like TextIO first write all the data to temporary
> files, where a UUID is used in the filename, and then all data is rewritten
> into the desired number of shards.
> Doing this for Parquet files seems to be too big a hurdle given the
> complexity of the file format.
> If you do not do this, then you get a varying number of files, where you
> actually need something like a UUID to ensure the file names are unique.
>
> From my perspective the core problem here is the fact that Beam does (in
> general) automatic scaling of steps in a flow, which is really good in most
> scenarios ... except in scenarios like this.
>
> I would like advice on how to proceed in this case.
> At this point I'm really tempted to switch back to Flink, as the support
> for these file formats is readily available there and works as expected.
>
>
> --
> Best regards,
>
> Niels Basjes
>
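To make the suggestion above concrete, here is a minimal sketch (mine, not from the original thread) of forcing a fixed shard count with TextIO; the output prefix, suffix, and shard count are placeholders I made up:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class FixedShardWriteExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply("CreateRecords", Create.of("record-1", "record-2", "record-3"))
        .apply("WriteFixedShards",
            TextIO.write()
                .to("/tmp/records/part")   // placeholder output prefix
                .withSuffix(".txt")
                .withNumShards(5));        // exactly 5 output files, at the cost of an extra shuffle

    pipeline.run().waitUntilFinish();
  }
}

Internally, TextIO.Write delegates to WriteFiles, which is where the fixed sharding is actually implemented. A ParquetIO write transform would need to plumb a similar withNumShards setting down to WriteFiles; the exact wiring depends on how that sink ends up being implemented.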
