Yes.

*From: *Lukasz Cwik <lc...@google.com>
*Date: *Fri, May 10, 2019 at 4:19 PM
*To: *dev
*Cc: * <u...@beam.apache.org>

When you had X gzip files and were not using Reshuffle, did you see X
> workers read and process the files?
>
> On Fri, May 10, 2019 at 1:17 PM Allie Chen <yifangc...@google.com> wrote:
>
>> Yes, I do see the data after reshuffle are processed in parallel. But
>> Reshuffle transform itself takes hours or even days to run, according to
>> one test (24 gzip files, 17 million lines in total) I did.
>>
>> The file format for our users are mostly gzip format, since uncompressed
>> files would be too costly to store (It could be in hundreds of GB).
>>
>> Thanks,
>>
>> Allie
>>
>>
>> *From: *Lukasz Cwik <lc...@google.com>
>> *Date: *Fri, May 10, 2019 at 4:07 PM
>> *To: *dev, <u...@beam.apache.org>
>>
>> +u...@beam.apache.org <u...@beam.apache.org>
>>>
>>> Reshuffle on Google Cloud Dataflow for a bounded pipeline waits till all
>>> the data has been read before the next transforms can run. After the
>>> reshuffle, the data should have been processed in parallel across the
>>> workers. Did you see this?
>>>
>>> Are you able to change the input of your pipeline to use an uncompressed
>>> file or many compressed files?
>>>
>>> On Fri, May 10, 2019 at 1:03 PM Allie Chen <yifangc...@google.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> I am trying to load a gzip file to BigQuey using Dataflow. Since the
>>>> compressed file is not splittable, one worker is allocated to read the
>>>> file. The same worker will do all the other transforms since Dataflow fused
>>>> all transforms together.  There are a large amount of data in the file, and
>>>> I expect to see more workers spinning up after reading transforms. I tried
>>>> to use Reshuffle Transform
>>>> <https://github.com/apache/beam/blob/release-2.3.0/sdks/python/apache_beam/transforms/util.py#L516>
>>>> to prevent the fusion, but it is not scalable since it won’t proceed until
>>>> all data arrived at this point.
>>>>
>>>> Is there any other ways to allow more workers working on all the other
>>>> transforms after reading?
>>>>
>>>> Thanks,
>>>>
>>>> Allie
>>>>
>>>>

Reply via email to