Yes. *From: *Lukasz Cwik <lc...@google.com> *Date: *Fri, May 10, 2019 at 4:19 PM *To: *dev *Cc: * <u...@beam.apache.org>
When you had X gzip files and were not using Reshuffle, did you see X > workers read and process the files? > > On Fri, May 10, 2019 at 1:17 PM Allie Chen <yifangc...@google.com> wrote: > >> Yes, I do see the data after reshuffle are processed in parallel. But >> Reshuffle transform itself takes hours or even days to run, according to >> one test (24 gzip files, 17 million lines in total) I did. >> >> The file format for our users are mostly gzip format, since uncompressed >> files would be too costly to store (It could be in hundreds of GB). >> >> Thanks, >> >> Allie >> >> >> *From: *Lukasz Cwik <lc...@google.com> >> *Date: *Fri, May 10, 2019 at 4:07 PM >> *To: *dev, <u...@beam.apache.org> >> >> +u...@beam.apache.org <u...@beam.apache.org> >>> >>> Reshuffle on Google Cloud Dataflow for a bounded pipeline waits till all >>> the data has been read before the next transforms can run. After the >>> reshuffle, the data should have been processed in parallel across the >>> workers. Did you see this? >>> >>> Are you able to change the input of your pipeline to use an uncompressed >>> file or many compressed files? >>> >>> On Fri, May 10, 2019 at 1:03 PM Allie Chen <yifangc...@google.com> >>> wrote: >>> >>>> Hi, >>>> >>>> >>>> I am trying to load a gzip file to BigQuey using Dataflow. Since the >>>> compressed file is not splittable, one worker is allocated to read the >>>> file. The same worker will do all the other transforms since Dataflow fused >>>> all transforms together. There are a large amount of data in the file, and >>>> I expect to see more workers spinning up after reading transforms. I tried >>>> to use Reshuffle Transform >>>> <https://github.com/apache/beam/blob/release-2.3.0/sdks/python/apache_beam/transforms/util.py#L516> >>>> to prevent the fusion, but it is not scalable since it won’t proceed until >>>> all data arrived at this point. >>>> >>>> Is there any other ways to allow more workers working on all the other >>>> transforms after reading? >>>> >>>> Thanks, >>>> >>>> Allie >>>> >>>>