Thanks a lot, Cham! Yes, it looks like we need a ReadAll transform similar
to the ones in TextIO and AvroIO :) We'll implement this.
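
For anyone following along, here is a minimal sketch of the ReadAll pattern
as it exists for TextIO today (the bucket and file pattern below are
hypothetical, and a VcfSource equivalent does not exist yet):

    import apache_beam as beam
    from apache_beam.io.textio import ReadAllFromText

    with beam.Pipeline() as p:
        _ = (
            p
            # Hypothetical GCS pattern; file patterns flow through the
            # pipeline as ordinary elements, so glob expansion and splitting
            # happen on the workers rather than in the job submission
            # request that hits the message size limit.
            | 'CreatePatterns' >> beam.Create(['gs://my-bucket/vcf/*.vcf'])
            | 'ReadAllLines' >> ReadAllFromText())

A ReadAll for VcfSource would presumably be built the same way, returning
parsed variant records instead of raw lines.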

On Tue, Nov 21, 2017 at 1:05 PM, Chamikara Jayalath <[email protected]>
wrote:

> I suspect that you might be hitting the Dataflow API limit for messages during
> initial splitting of the source. Some details are available under "Total
> number of BoundedSource objects" at the link below (you should see a similar
> message in the worker logs, but the exact error message might be out of date).
> https://cloud.google.com/dataflow/pipelines/troubleshooting-your-pipeline
>
> The exact number of files you can support depends on the size of the
> generated splits (usually about 400k for TextIO).
>
> One solution for this is to develop a ReadAll() transform for VcfSource,
> similar to the one available for TextIO:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py#L409
>
> Thanks,
> Cham
>
>
> On Tue, Nov 21, 2017 at 8:04 AM Asha Rostamianfar
> <[email protected]> wrote:
>
> > Hi,
> >
> > I'm wondering whether anyone has tried processing a large number (~150K)
> > of files using DataflowRunner? We are seeing a behavior where the Dataflow
> > job starts but never attaches any workers. After 1h, it cancels the job due
> > to being "stuck". See logs here
> > <https://02931532374587840286.googlegroups.com/attach/3d44192c94959/log?part=0.1&view=1&vt=ANaJVrFF9hay-Htd06tIuxol3aQb6meA9h2pVoe4tjOwcG71IT9FCqTSWkGMUWnW_lxBuN6Daq8XzmnSUZaNHU-PLvSF3jHinYGwCE13Jg9o0W3AulQy7U4>.
> > It works fine for a smaller number of files (e.g. 1k).
> >
> > We have tried setting num_workers, max_num_workers, etc. Are there any
> > other settings that we can try?
> >
> > Context: the pipeline is using the Python Apache Beam SDK and running the
> > code at https://github.com/googlegenomics/gcp-variant-transforms. It's
> > using VcfSource, which is based on TextSource. See this thread
> > <https://groups.google.com/d/msg/google-genomics-discuss/LUgqh1s56SY/WUnJkkHUAwAJ>
> > for more context.
> >
> > Thanks,
> > Asha
> >
>
