Looks like an interesting possibility, will take a look, ty!

From: Chamikara Jayalath <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 29, 2020 at 1:41 PM
To: dev <[email protected]>
Cc: Sunil Mishra <[email protected]>
Subject: Re: ParquetIO - max file size?
You can consider using dynamic destinations [1] and providing a destination function [2] that keeps track of the sizes of elements already written to a given destination. Note that this might have performance implications (due to the extra computation needed to track element sizes). You are correct about the default behaviour: the number of shards a sink writes is determined by the runner based on the parallelism of the corresponding step.

Thanks,
Cham

[1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L222
[2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L988

On Wed, Jul 29, 2020 at 12:28 PM [email protected] <[email protected]> wrote:

We would like to use ParquetIO but limit individual files written out to a maximum size. We don't see any easy way to do this, and it seems like the default behavior is to split based on parallelism. Does anyone have any guidance on this?
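For readers landing on this thread later, here is a minimal sketch of the approach Cham describes: FileIO.writeDynamic() with a destination function that tracks how many bytes have been routed to the current destination. The SizeBucketFn class, the 512 MB threshold, and the toString()-based size estimate are illustrative assumptions, not part of the thread; the running byte count lives in the worker-local function instance, so the cap is approximate rather than exact.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

public class SizeCappedParquetWrite {

  /**
   * Hypothetical destination function: routes records to an integer "bucket"
   * and advances to the next bucket once the estimated bytes sent to the
   * current bucket exceed maxBytes. The counters are per worker instance,
   * so the resulting file sizes are only approximately capped.
   */
  static class SizeBucketFn implements SerializableFunction<GenericRecord, Integer> {
    private final long maxBytes;
    private transient long bytesInBucket = 0;
    private transient int bucket = 0;

    SizeBucketFn(long maxBytes) {
      this.maxBytes = maxBytes;
    }

    @Override
    public Integer apply(GenericRecord record) {
      long estimated = record.toString().length(); // crude per-record size estimate
      bytesInBucket += estimated;
      if (bytesInBucket > maxBytes) {
        bucket++;                // roll over to a new destination
        bytesInBucket = estimated;
      }
      return bucket;
    }
  }

  static void write(PCollection<GenericRecord> records, Schema schema, String outputDir) {
    records.apply(
        FileIO.<Integer, GenericRecord>writeDynamic()
            .by(new SizeBucketFn(512L * 1024 * 1024)) // ~512 MB per bucket (illustrative)
            .withDestinationCoder(VarIntCoder.of())
            .via(ParquetIO.sink(schema))
            .to(outputDir)
            .withNaming(bucket -> FileIO.Write.defaultNaming("part-" + bucket, ".parquet"))
            .withNumShards(1)); // one file per bucket, so bucket size ~= file size
  }
}
```

Using withNumShards(1) ties each bucket to a single output file; without it the runner may still split a bucket across shards, which loosens the size cap further.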
