In that case you should be able to leave sharding unspecified, and you
won't incur the extra shuffle. Specifying explicit sharding is generally
necessary only for streaming.

On Wed, Sep 18, 2019 at 2:06 PM Shannon Duncan <joseph.dun...@liveramp.com>
wrote:

> batch on dataflowRunner.
>
> On Wed, Sep 18, 2019 at 4:05 PM Reuven Lax <re...@google.com> wrote:
>
>> Are you using streaming or batch? Also which runner are you using?
>>
>> On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> So I followed up on why TextIO shuffles and dug into the code some. It
>>> is using the shards and getting all the values into a keyed group to write
>>> to a single file.
>>>
>>> However... I wonder if there is way to just take the records that are on
>>> a worker and write them out. Thus not needing a shard number and doing
>>> this. Closer to how hadoop handle's writes.
>>>
>>> Maybe just a regular pardo and on bundleSetup it creates a writer and
>>> processElement reuses that writter to write to the same file for all
>>> elements within a bundle?
>>>
>>> I feel like this goes beyond scope of simple user mailing list so I'm
>>> expanding it to dev as well.
>>> +dev <dev@beam.apache.org>
>>>
>>> Finding a solution that prevents quadrupling shuffle costs when simply
>>> writing out a file is a necessity for large scale jobs that work with 100+
>>> TB of data. If anyone has any ideas I'd love to hear them.
>>>
>>> Thanks,
>>> Shannon Duncan
>>>
>>> On Wed, Sep 18, 2019 at 1:06 PM Shannon Duncan <
>>> joseph.dun...@liveramp.com> wrote:
>>>
>>>> We have been using Beam for a bit now. However we just turned on the
>>>> dataflow shuffle service and were very surprised that the shuffled data
>>>> amounts were quadruple the amounts we expected.
>>>>
>>>> Turns out that the file writing TextIO is doing shuffles within itself.
>>>>
>>>> Is there a way to prevent shuffling in the writing phase?
>>>>
>>>> Thanks,
>>>> Shannon Duncan
>>>>
>>>

Reply via email to