Apache Beam supports writing a fixed number of shards, but discourages it because it limits the runner's ability to auto-tune and scale the write; letting the runner choose the sharding makes it easier for users to build good, scalable pipelines.
Some users do require a fixed number of shards, and several classes such as TextIO support fixed sharding. If you're trying to always use a fixed number of shards, take a look at how TextIO#withNumShards <https://github.com/apache/beam/blob/8b540d2dd43af2a0495884d8152472a7ebff8a8e/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L793> works, look at the internal implementation details of WriteFiles#withNumShards <https://github.com/apache/beam/blob/8b540d2dd43af2a0495884d8152472a7ebff8a8e/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L253>, and re-use them for ParquetIO as needed. (A small usage sketch is included below the quoted message.)

On Tue, Sep 5, 2017 at 1:04 AM, Niels Basjes <[email protected]> wrote:
> Hi,
>
> For my application I want to write files with records that are efficient
> in size and easy to read from my application code, so I want to write
> something like Parquet or ORC from a Beam application.
>
> I found https://github.com/apache/beam/pull/1851 for Parquet and decided
> to try to make this actually work.
>
> While working on this I was confronted with the full impact of the choice
> that in Beam (Dataflow?) you cannot specify how many parallel instances
> you want of something.
>
> I found that formats like TextIO first write all the data to temporary
> files, where a UUID is used in the filename, and then all data is rewritten
> into the desired number of shards.
> Doing this for Parquet files seems to be too big a hurdle given the
> complexity of the file format.
> If you do not do this, then you get a varying number of files, where you
> actually need something like a UUID to ensure the file names are unique.
>
> From my perspective the core problem here is the fact that Beam does (in
> general) automatic scaling of steps in a flow, which is really good in most
> scenarios ... except in scenarios like this.
>
> I would like advice on how to proceed in this case.
> At this point I'm really tempted to switch back to Flink, as the support
> for these file formats is readily available there and works as expected.
>
>
> --
> Best regards,
>
> Niels Basjes
>
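To make the suggestion above concrete, here is a minimal sketch (mine, not from the original thread) of forcing a fixed shard count with TextIO; the output prefix, suffix, and shard count are placeholders I made up:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class FixedShardWriteExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply("CreateRecords", Create.of("record-1", "record-2", "record-3"))
        .apply("WriteFixedShards",
            TextIO.write()
                .to("/tmp/records/part")   // placeholder output prefix
                .withSuffix(".txt")
                .withNumShards(5));        // exactly 5 output files, at the cost of an extra shuffle

    pipeline.run().waitUntilFinish();
  }
}

Internally, TextIO.Write delegates to WriteFiles, which is where the fixed sharding is actually implemented. A ParquetIO write transform would need to plumb a similar withNumShards setting down to WriteFiles; the exact wiring depends on how that sink ends up being implemented.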
