Hi,

I am trying to load a gzip file into BigQuery using Dataflow. Since the
compressed file is not splittable, only one worker is allocated to read
it, and because Dataflow fuses all the transforms together, that same
worker also runs every downstream transform. The file contains a large
amount of data, and I expected more workers to spin up after the read
step. I tried using the Reshuffle transform
<https://github.com/apache/beam/blob/release-2.3.0/sdks/python/apache_beam/transforms/util.py#L516>
to prevent the fusion, but it does not scale well, since the pipeline
cannot proceed past that point until all the data has arrived there.

Is there any other way to get more workers working on the downstream
transforms after the read?

Thanks,

Allie
