Thanks a lot, Cham! Yes, it looks like we need a ReadAll transform for VcfSource similar to the ones in TextIO and AvroIO :) We'll implement this.
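For reference, below is a rough sketch of the shape we have in mind, modeled on ReadAllFromText in apache_beam/io/textio.py (the transform Cham linked). To be clear, ReadAllFromVcf and create_vcf_source are placeholder names, not existing APIs, and the ReadAllFiles constructor arguments follow the current filebasedsource.py and may differ between SDK versions:

# Rough sketch only: ReadAllFromVcf and create_vcf_source are placeholder
# names, modeled on ReadAllFromText in apache_beam/io/textio.py.
from functools import partial

import apache_beam as beam
from apache_beam.io.filebasedsource import ReadAllFiles
from apache_beam.io.filesystem import CompressionTypes


def create_vcf_source(file_name, compression_type):
  # Placeholder: construct a single-file VcfSource here, analogous to
  # _create_text_source in textio.py. Left unimplemented in this sketch.
  raise NotImplementedError


class ReadAllFromVcf(beam.PTransform):
  """Reads a PCollection of VCF file patterns and emits their records."""

  DEFAULT_DESIRED_BUNDLE_SIZE = 64 * 1024 * 1024  # 64 MB, same as TextIO.

  def __init__(self,
               desired_bundle_size=DEFAULT_DESIRED_BUNDLE_SIZE,
               compression_type=CompressionTypes.AUTO,
               **kwargs):
    super(ReadAllFromVcf, self).__init__(**kwargs)
    source_from_file = partial(
        create_vcf_source, compression_type=compression_type)
    # Argument order follows the ReadAllFiles signature in
    # filebasedsource.py at the time of writing; it may change.
    self._read_all_files = ReadAllFiles(
        True,                 # splittable
        compression_type,
        desired_bundle_size,
        0,                    # min_bundle_size
        source_from_file)

  def expand(self, pvalue):
    return pvalue | 'ReadAllFiles' >> self._read_all_files

The file patterns then become pipeline data, e.g.:

  with beam.Pipeline(options=options) as p:
    records = (p
               | beam.Create(['gs://my-bucket/vcfs/*.vcf'])  # placeholder pattern
               | ReadAllFromVcf())

so the ~150K files are expanded and split by the workers rather than during initial source splitting, which is what hits the message-size limit.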
On Tue, Nov 21, 2017 at 1:05 PM, Chamikara Jayalath <[email protected]> wrote:

> I suspect that you might be hitting the Dataflow API limit for messages
> during initial splitting of the source. Some details are available under
> "Total number of BoundedSource objects" below (you should see a similar
> message in the worker logs, but the exact error message might be out of
> date).
> https://cloud.google.com/dataflow/pipelines/troubleshooting-your-pipeline
>
> The exact number of files you can support depends on the size of the
> generated splits (usually about 400k for TextIO).
>
> One solution for this is to develop a ReadAll() transform for VcfSource
> similar to the following, available for TextIO:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py#L409
>
> Thanks,
> Cham
>
>
> On Tue, Nov 21, 2017 at 8:04 AM Asha Rostamianfar <[email protected]> wrote:
>
> > Hi,
> >
> > I'm wondering whether anyone has tried processing a large number (~150K)
> > of files using DataflowRunner? We are seeing a behavior where the
> > Dataflow job starts but never attaches any workers. After 1h, it cancels
> > the job due to being "stuck". See the logs here:
> > <https://02931532374587840286.googlegroups.com/attach/3d44192c94959/log?part=0.1&view=1&vt=ANaJVrFF9hay-Htd06tIuxol3aQb6meA9h2pVoe4tjOwcG71IT9FCqTSWkGMUWnW_lxBuN6Daq8XzmnSUZaNHU-PLvSF3jHinYGwCE13Jg9o0W3AulQy7U4>
> > It works fine for a smaller number of files (e.g. 1k).
> >
> > We have tried setting num_workers, max_num_workers, etc. Are there any
> > other settings that we can try?
> >
> > Context: the pipeline is using the Python Apache Beam SDK and running the
> > code at https://github.com/googlegenomics/gcp-variant-transforms. It's
> > using VcfSource, which is based on TextSource. See this thread for more
> > context:
> > <https://groups.google.com/d/msg/google-genomics-discuss/LUgqh1s56SY/WUnJkkHUAwAJ>
> >
> > Thanks,
> > Asha
