Re: Issues processing 150K files with DataflowRunner

Chamikara Jayalath Wed, 22 Nov 2017 08:32:13 -0800

Thanks. Note that shards generated by the ReadAll transform will not support
dynamic work rebalancing, but this should not matter when the number of shards
is large. The long-term solution is Splittable DoFn, which is in the works.
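
To make the ReadAll idea concrete, here is a minimal plain-Python sketch of
the three stages such a transform usually has: expand file patterns, split
each file into fixed byte ranges, and read each range. It deliberately does
not use Beam, and every name in it (expand_patterns, split_into_ranges,
read_range, read_all, desired_bundle_size) is hypothetical. In a real
pipeline each stage would presumably run on the workers over a collection of
file patterns, so the service would not have to receive all of the splits up
front.

```python
import glob
import os


def expand_patterns(patterns):
    """Stage 1: turn each file pattern into concrete file paths."""
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            yield path


def split_into_ranges(path, desired_bundle_size):
    """Stage 2: split one file into fixed (path, start, end) byte ranges.

    These ranges are decided once, which is why they cannot be
    rebalanced dynamically later on.
    """
    size = os.path.getsize(path)
    start = 0
    while start < size:
        end = min(start + desired_bundle_size, size)
        yield (path, start, end)
        start = end


def read_range(path, start, end):
    """Stage 3: yield every line that *starts* inside [start, end)."""
    with open(path, 'rb') as f:
        if start > 0:
            # Step back one byte and read to the next newline. If the range
            # begins mid-line, this skips the partial line, which belongs to
            # the previous range; if it begins right after a newline, this
            # consumes only that newline.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            yield line.rstrip(b'\n').decode('utf-8')


def read_all(patterns, desired_bundle_size=64 * 1024 * 1024):
    """Chain the three stages; hypothetical stand-in for a ReadAll transform."""
    for path in expand_patterns(patterns):
        for path_, start, end in split_into_ranges(path, desired_bundle_size):
            for line in read_range(path_, start, end):
                yield line
```

A line that straddles a range boundary is read by the range in which it
starts, so each line is produced exactly once regardless of the bundle size.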


- Cham

On Wed, Nov 22, 2017 at 8:23 AM Asha Rostamianfar
<arost...@google.com.invalid> wrote:

> Thanks a lot, Cham! Yes, it looks like we need a ReadAll transform similar
> to the ones for TextIO and AvroIO :) We'll implement this.
>
> On Tue, Nov 21, 2017 at 1:05 PM, Chamikara Jayalath <chamikar...@gmail.com>
> wrote:
>
> > I suspect that you might be hitting the Dataflow API limit for messages
> > during the initial splitting of the source. Some details are available
> > under "Total number of BoundedSource objects" at the link below (you
> > should see a similar message in the worker logs, but the exact error
> > message might be out of date).
> >
> > https://cloud.google.com/dataflow/pipelines/troubleshooting-your-pipeline
> >
> > The exact number of files you can support depends on the size of the
> > generated splits (usually about 400k for TextIO).
> >
> > One solution for this is to develop a ReadAll() transform for VcfSource
> > similar to the following one available for TextIO.
> > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py#L409
> >
> > Thanks,
> > Cham
> >
> >
> > On Tue, Nov 21, 2017 at 8:04 AM Asha Rostamianfar
> > <arost...@google.com.invalid> wrote:
> >
> > > Hi,
> > >
> > > I'm wondering whether anyone has tried processing a large number (~150K)
> > > of files using DataflowRunner? We are seeing a behavior where the
> > > Dataflow job starts, but never attaches any workers. After 1h, it cancels
> > > the job due to being "stuck". See logs here
> > > <https://02931532374587840286.googlegroups.com/attach/3d44192c94959/log?part=0.1&view=1&vt=ANaJVrFF9hay-Htd06tIuxol3aQb6meA9h2pVoe4tjOwcG71IT9FCqTSWkGMUWnW_lxBuN6Daq8XzmnSUZaNHU-PLvSF3jHinYGwCE13Jg9o0W3AulQy7U4>.
> > > It works fine for a smaller number of files (e.g. 1k).
> > >
> > > We have tried setting num_workers, max_num_workers, etc. Are there any
> > > other settings that we can try?
> > >
> > > Context: the pipeline is using the Python Apache Beam SDK and running the
> > > code at https://github.com/googlegenomics/gcp-variant-transforms. It's
> > > using the VcfSource, which is based on TextSource. See this thread
> > > <https://groups.google.com/d/msg/google-genomics-discuss/LUgqh1s56SY/WUnJkkHUAwAJ>
> > > for more context.
> > >
> > > Thanks,
> > > Asha
> > >
> >
>
