I suspect that you might be hitting the Dataflow API limit for messages
during the initial splitting of the source. Some details are available under
"Total number of BoundedSource objects" below (you should see a similar
message in the worker logs, though the exact error message might be out of
date).

The exact number of files you can support depends on the size of the
generated splits (usually about 400k for TextIO).

One solution is to develop a ReadAll() transform for VcfSource, similar to
the one available for TextIO.
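
As a rough sketch of the idea (not the actual VcfSource code): instead of
splitting every file at pipeline construction time, the file patterns become
elements of a PCollection and are expanded and read by the workers at
execution time. In the Python SDK, TextIO exposes this as ReadAllFromText.
Everything below is an assumption for illustration: expand_pattern,
read_records, and read_all are hypothetical names, the functions only handle
local paths (GCS would need Beam's FileSystems API), and skipping '#' lines
stands in for real VCF parsing.

```python
import glob


def expand_pattern(pattern):
    """Return the files matching a glob pattern (local paths only)."""
    return sorted(glob.glob(pattern))


def read_records(path):
    """Yield the data lines of one file.

    Hypothetical stand-in for VCF parsing: header lines starting with '#'
    and empty lines are skipped; every other line is emitted as-is.
    """
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line and not line.startswith("#"):
                yield line


def read_all(pipeline, patterns):
    """Attach a ReadAll-style read to an existing Beam pipeline.

    The patterns are ordinary pipeline data, so no per-file source objects
    are created when the job is submitted.
    """
    import apache_beam as beam  # imported lazily; needed only to build the pipeline

    return (
        pipeline
        | "Patterns" >> beam.Create(patterns)
        | "ExpandPatterns" >> beam.FlatMap(expand_pattern)
        # Redistribute file names so the reads can be spread across workers.
        | "Reshuffle" >> beam.Reshuffle()
        | "ReadFiles" >> beam.FlatMap(read_records)
    )
```

With this shape, each file is read as a single unit rather than being split
further, which is usually acceptable when there are many files.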


On Tue, Nov 21, 2017 at 8:04 AM Asha Rostamianfar
<arost...@google.com.invalid> wrote:

> Hi,
> I'm wondering whether anyone has tried processing a large number (~150K) of
> files using DataflowRunner? We are seeing a behavior where the dataflow job
> starts, but never attaches any workers. After 1h, it cancels the job due to
> being "stuck". See logs here
> <https://02931532374587840286.googlegroups.com/attach/3d44192c94959/log?part=0.1&view=1&vt=ANaJVrFF9hay-Htd06tIuxol3aQb6meA9h2pVoe4tjOwcG71IT9FCqTSWkGMUWnW_lxBuN6Daq8XzmnSUZaNHU-PLvSF3jHinYGwCE13Jg9o0W3AulQy7U4>.
> It works fine for smaller number of files (e.g. 1k).
> We have tried setting num_workers, max_num_workers, etc. Are there any
> other settings that we can try?
> Context: the pipeline is using the python Apache Beam SDK and running the
> code at https://github.com/googlegenomics/gcp-variant-transforms. It's
> using the VcfSource, which is based on TextSource. See this thread
> <https://groups.google.com/d/msg/google-genomics-discuss/LUgqh1s56SY/WUnJkkHUAwAJ>
> for more context.
> Thanks,
> Asha
