On Wed, Sep 18, 2019 at 2:12 PM Shannon Duncan <joseph.dun...@liveramp.com>
wrote:

> I will attempt to do without sharding (though I believe we did do a run
> without shards and it incurred the extra shuffle costs).
>

It shouldn't. There will be a shuffle, but that shuffle should contain a
small amount of data (essentially a list of filenames).

>
> Pipeline is simple.
>
> The only shuffle that is explicitly defined is the shuffle after merging
> files together into a single PCollection (Flatten Transform).
>
> So it's a Read > Flatten > Shuffle > Map (Format) > Write. We expected to
> pay for shuffles on the middle shuffle but were surprised to see that the
> output data from the Flatten was quadrupled in the reflected shuffled GB
> shown in Dataflow. Which lead me down this path of finding things.
>
> [image: image.png]
>
> On Wed, Sep 18, 2019 at 4:08 PM Reuven Lax <re...@google.com> wrote:
>
>> In that case you should be able to leave sharding unspecified, and you
>> won't incur the extra shuffle. Specifying explicit sharding is generally
>> necessary only for streaming.
>>
>> On Wed, Sep 18, 2019 at 2:06 PM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> batch on dataflowRunner.
>>>
>>> On Wed, Sep 18, 2019 at 4:05 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> Are you using streaming or batch? Also which runner are you using?
>>>>
>>>> On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan <
>>>> joseph.dun...@liveramp.com> wrote:
>>>>
>>>>> So I followed up on why TextIO shuffles and dug into the code some. It
>>>>> is using the shards and getting all the values into a keyed group to write
>>>>> to a single file.
>>>>>
>>>>> However... I wonder if there is way to just take the records that are
>>>>> on a worker and write them out. Thus not needing a shard number and doing
>>>>> this. Closer to how hadoop handle's writes.
>>>>>
>>>>> Maybe just a regular pardo and on bundleSetup it creates a writer and
>>>>> processElement reuses that writter to write to the same file for all
>>>>> elements within a bundle?
>>>>>
>>>>> I feel like this goes beyond scope of simple user mailing list so I'm
>>>>> expanding it to dev as well.
>>>>> +dev <dev@beam.apache.org>
>>>>>
>>>>> Finding a solution that prevents quadrupling shuffle costs when simply
>>>>> writing out a file is a necessity for large scale jobs that work with 100+
>>>>> TB of data. If anyone has any ideas I'd love to hear them.
>>>>>
>>>>> Thanks,
>>>>> Shannon Duncan
>>>>>
>>>>> On Wed, Sep 18, 2019 at 1:06 PM Shannon Duncan <
>>>>> joseph.dun...@liveramp.com> wrote:
>>>>>
>>>>>> We have been using Beam for a bit now. However we just turned on the
>>>>>> dataflow shuffle service and were very surprised that the shuffled data
>>>>>> amounts were quadruple the amounts we expected.
>>>>>>
>>>>>> Turns out that the file writing TextIO is doing shuffles within
>>>>>> itself.
>>>>>>
>>>>>> Is there a way to prevent shuffling in the writing phase?
>>>>>>
>>>>>> Thanks,
>>>>>> Shannon Duncan
>>>>>>
>>>>>

Reply via email to