Hi,

I'm wondering whether anyone has tried processing a large number of files
(~150K) using DataflowRunner? We are seeing behavior where the Dataflow job
starts but never attaches any workers, and after one hour the service
cancels the job as "stuck". See logs here
<https://02931532374587840286.googlegroups.com/attach/3d44192c94959/log?part=0.1&view=1&vt=ANaJVrFF9hay-Htd06tIuxol3aQb6meA9h2pVoe4tjOwcG71IT9FCqTSWkGMUWnW_lxBuN6Daq8XzmnSUZaNHU-PLvSF3jHinYGwCE13Jg9o0W3AulQy7U4>.
The same pipeline works fine with a smaller number of files (e.g. 1K).

We have tried setting num_workers, max_num_workers, etc., without success.
Are there any other settings we could try?
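
For reference, here is roughly how we are passing the worker options (the
project, region, and bucket names below are placeholders, not our actual
values):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project/bucket names; the real job uses our GCP settings.
    options = PipelineOptions(
        runner='DataflowRunner',
        project='our-project',
        region='us-central1',
        temp_location='gs://our-bucket/temp',
        staging_location='gs://our-bucket/staging',
        num_workers=10,        # initial worker count
        max_num_workers=50,    # autoscaling ceiling
    )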

Context: the pipeline uses the Python Apache Beam SDK and runs the code at
https://github.com/googlegenomics/gcp-variant-transforms. It uses
VcfSource, which is based on TextSource. See this thread
<https://groups.google.com/d/msg/google-genomics-discuss/LUgqh1s56SY/WUnJkkHUAwAJ>
for more context.

Thanks,
Asha
