Yes, that is correct.

*From: *Allie Chen <[email protected]>
*Date: *Fri, May 10, 2019 at 4:21 PM
*To: * <[email protected]>
*Cc: * <[email protected]>
Yes.

> *From: *Lukasz Cwik <[email protected]>
> *Date: *Fri, May 10, 2019 at 4:19 PM
> *To: *dev
> *Cc: * <[email protected]>
>
>> When you had X gzip files and were not using Reshuffle, did you see X
>> workers read and process the files?
>>
>> On Fri, May 10, 2019 at 1:17 PM Allie Chen <[email protected]> wrote:
>>
>>> Yes, I do see the data after the reshuffle being processed in parallel.
>>> But the Reshuffle transform itself takes hours or even days to run,
>>> according to one test (24 gzip files, 17 million lines in total) I did.
>>>
>>> The file format our users provide is mostly gzip, since uncompressed
>>> files would be too costly to store (they could be hundreds of GB).
>>>
>>> Thanks,
>>>
>>> Allie
>>>
>>> *From: *Lukasz Cwik <[email protected]>
>>> *Date: *Fri, May 10, 2019 at 4:07 PM
>>> *To: *dev, <[email protected]>
>>>
>>>> [email protected] <[email protected]>
>>>>
>>>> Reshuffle on Google Cloud Dataflow for a bounded pipeline waits until
>>>> all the data has been read before the next transforms can run. After the
>>>> reshuffle, the data should have been processed in parallel across the
>>>> workers. Did you see this?
>>>>
>>>> Are you able to change the input of your pipeline to use an
>>>> uncompressed file or many compressed files?
>>>>
>>>> On Fri, May 10, 2019 at 1:03 PM Allie Chen <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to load a gzip file into BigQuery using Dataflow. Since
>>>>> the compressed file is not splittable, only one worker is allocated to
>>>>> read the file. The same worker then performs all the other transforms,
>>>>> since Dataflow fused all the transforms together. There is a large
>>>>> amount of data in the file, and I expected to see more workers spinning
>>>>> up after the read transform.
>>>>> I tried to use the Reshuffle transform
>>>>> <https://github.com/apache/beam/blob/release-2.3.0/sdks/python/apache_beam/transforms/util.py#L516>
>>>>> to prevent the fusion, but it does not scale, since it won't proceed
>>>>> until all the data has arrived at that point.
>>>>>
>>>>> Are there any other ways to allow more workers to work on all the
>>>>> other transforms after the read?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Allie
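
[Editor's note] The "not splittable" point in the original question can be seen with the Python standard library alone: a DEFLATE/gzip stream must be decompressed from the start, so a reader cannot seek to a byte offset in the middle of a .gz file and begin emitting records. That is why Dataflow can hand the whole file to only one worker. A small stand-alone illustration (plain Python, not Beam):

```python
import gzip
import zlib

# Compress some sample "records".
data = b"\n".join(b"record-%05d" % i for i in range(10_000))
compressed = gzip.compress(data)

# Reading from the start works fine.
assert gzip.decompress(compressed) == data

# But starting mid-stream fails: the decoder state built up from the
# earlier bytes is required, so there is no valid split point to hand
# to a second worker.
try:
    # wbits=31 tells zlib to expect a gzip header, which a mid-stream
    # slice does not have.
    zlib.decompress(compressed[len(compressed) // 2:], wbits=31)
    mid_stream_ok = True
except zlib.error:
    mid_stream_ok = False
assert not mid_stream_ok
```

An uncompressed text file, by contrast, can be split at newline boundaries, which is why Dataflow parallelizes reads of plain files but not gzip.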
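
[Editor's note] One way to act on the "many compressed files" suggestion above is a one-off preprocessing step that re-shards a large .gz into many smaller .gz files; decompression of the source is still sequential, but each output shard can then be read by a different worker. A minimal stdlib sketch (the function name and the shard-by-line-count policy are illustrative, not anything provided by Beam or Dataflow):

```python
import gzip
import os

def split_gzip(src_path, out_dir, lines_per_shard):
    """Re-compress one large .gz file into many smaller .gz shards."""
    shard_paths = []

    def flush(shard_index, lines):
        path = os.path.join(out_dir, "shard-%05d.gz" % shard_index)
        with gzip.open(path, "wt") as out:
            out.writelines(lines)
        shard_paths.append(path)

    shard, buf = 0, []
    # Sequential read of the source -- unavoidable for gzip -- but the
    # outputs are independently readable files.
    with gzip.open(src_path, "rt") as src:
        for line in src:
            buf.append(line)
            if len(buf) >= lines_per_shard:
                flush(shard, buf)
                shard, buf = shard + 1, []
    if buf:
        flush(shard, buf)
    return shard_paths
```

The shard paths can then be fed to the pipeline as a glob (e.g. `shard-*.gz`) so each file becomes its own read task.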
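
[Editor's note] The Reshuffle transform linked above is, roughly, "attach a random key, group by key, drop the key". The group-by-key step is what both breaks fusion and forces the wait described in the thread: on a bounded input, no group can be emitted until every element has been read. A toy single-process model of that behavior (my own sketch, not the Beam implementation):

```python
import random
from collections import defaultdict

def toy_reshuffle(elements, num_keys=16):
    # 1. Pair every element with a random key.
    keyed = [(random.randrange(num_keys), e) for e in elements]
    # 2. Group by key: this loop is a barrier -- the dict is only
    #    complete after the *entire* input has been consumed, which is
    #    why a bounded Reshuffle waits for all reads to finish.
    groups = defaultdict(list)
    for k, e in keyed:
        groups[k].append(e)
    # 3. Drop the keys; downstream sees the same elements, but now
    #    spread across up to num_keys bundles that can run in parallel.
    return [e for bundle in groups.values() for e in bundle]
```

The upside is that the post-shuffle bundles are distributed across workers; the downside, as reported above, is the materialization cost of the barrier itself.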
