Re: Issues processing 150K files with DataflowRunner

2017-11-22 Thread Chamikara Jayalath
Thanks. Note that shards generated by the ReadAll transform will not support dynamic work rebalancing, but this should not matter when the number of shards is large. The long-term solution is Splittable DoFn, which is in the works. - Cham
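For context, a minimal sketch (not Beam's actual implementation; the transform name below is hypothetical) of why ReadAll-style shards cannot be dynamically rebalanced: each matched file is read inside an ordinary DoFn, and pre-Splittable-DoFn runners cannot split a DoFn element mid-flight:

    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    class ReadWholeFile(beam.DoFn):
        def process(self, metadata):
            # One file per element; the runner must process it atomically,
            # so a straggler file cannot be handed off to another worker.
            with FileSystems.open(metadata.path) as f:
                for line in f:
                    yield line

    class MyReadAll(beam.PTransform):  # hypothetical name
        def expand(self, patterns):
            return (patterns
                    | 'Match' >> beam.FlatMap(
                        lambda p: FileSystems.match([p])[0].metadata_list)
                    | 'Reshuffle' >> beam.Reshuffle()  # spread files across workers
                    | 'Read' >> beam.ParDo(ReadWholeFile()))

With many shards, the static distribution from the Reshuffle evens out the load well enough, which is why the limitation matters little at large shard counts.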

Re: Issues processing 150K files with DataflowRunner

2017-11-22 Thread Asha Rostamianfar
Thanks a lot, Cham! Yes, it looks like we need a ReadAll transform similar to TextIO's and AvroIO's :) We'll implement this.
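For reference, the Python SDK's ReadAllFromText illustrates the pattern being proposed: file patterns flow through the pipeline as ordinary elements, so no huge up-front source split has to be sent to the Dataflow service. A minimal sketch (the bucket and pattern are placeholders):

    import apache_beam as beam
    from apache_beam.io.textio import ReadAllFromText

    with beam.Pipeline() as p:
        (p
         | 'Patterns' >> beam.Create(['gs://my-bucket/data/*.txt'])
         | 'ReadAll' >> ReadAllFromText()
         | 'Print' >> beam.Map(print))

AvroIO offers the analogous ReadAllFromAvro.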

Re: Issues processing 150K files with DataflowRunner

2017-11-21 Thread Chamikara Jayalath
I suspect that you might be hitting the Dataflow API limit for messages during initial splitting of the source. Some details are available under "Total number of BoundedSource objects" below (you should see a similar message in the worker logs, but the exact error message might be out of date). https://cloud.googl
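One way to see whether a job is at risk, since initial splitting typically produces at least one BoundedSource per matched file: count the files the pattern matches before submitting. A short sketch using the Beam FileSystems API (the pattern is a placeholder):

    from apache_beam.io.filesystems import FileSystems

    # Each matched file becomes at least one BoundedSource during the
    # initial split, and the serialized split response counts against
    # the Dataflow API message-size limit.
    match_results = FileSystems.match(['gs://my-bucket/data/*.txt'])
    print('Pattern matches %d files' % len(match_results[0].metadata_list))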

Issues processing 150K files with DataflowRunner

2017-11-21 Thread Asha Rostamianfar
Hi, I'm wondering whether anyone has tried processing a large number (~150K) of files using DataflowRunner? We are seeing a behavior where the Dataflow job starts but never attaches any workers. After 1h, it cancels the job due to being "stuck". See logs here
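The original job's code isn't shown; the following is a minimal hypothetical reconstruction of the setup being described, with a single glob-based read over many files (bucket, pattern, and pipeline options are placeholders):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',            # placeholder
        temp_location='gs://my-bucket/tmp')
    with beam.Pipeline(options=options) as p:
        (p
         # A glob matching ~150K files forces the service to split the
         # source into one BoundedSource per file up front.
         | 'Read' >> beam.io.ReadFromText('gs://my-bucket/data/*.txt')
         | 'Count' >> beam.combiners.Count.Globally())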