I opened a support case with Google and they helped me get the pipeline to a state where it can keep up with the 10k events/second input. The job was bottlenecked on disk I/O, so we switched the workers to SSD persistent disks and provisioned enough disk capacity to max out the available disk throughput per worker.
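
For anyone hitting the same wall, that change boils down to a couple of Dataflow worker options. A minimal sketch with Beam's Java SDK is below; the project, zone, and disk size are placeholders, not the actual values from our job:

    // Sketch only: pointing Dataflow workers at SSD persistent disks.
    // The project ID, zone, and disk size below are placeholders.
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class WorkerDiskOptionsSketch {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

        // Use SSD persistent disks instead of the default standard (HDD) disks.
        options.setWorkerDiskType(
            "compute.googleapis.com/projects/my-project/zones/us-central1-f/diskTypes/pd-ssd");

        // Persistent disk throughput scales with disk size, so size the disk
        // to reach the per-worker throughput cap, not just to hold the data.
        options.setDiskSizeGb(500);
      }
    }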
On Wed, Jan 30, 2019 at 3:03 AM Kaymak, Tobias <[email protected]> wrote:
> Hi,
>
> I am currently playing around with BigQueryIO options, and I am not an
> expert on it, but 60 workers sounds like a lot to me (or expensive
> computation) for 10k records hitting 2 tables each.
> Could you maybe share the code of your pipeline?
>
> Cheers,
> Tobi
>
> On Tue, Jan 22, 2019 at 9:28 PM Jeff Klukas <[email protected]> wrote:
>
>> I'm attempting to deploy a fairly simple job on the Dataflow runner that
>> reads from PubSub and writes to BigQuery using file loads, but I have so
>> far not been able to tune it to keep up with the incoming data rate.
>>
>> I have configured BigQueryIO.write to trigger loads every 5 minutes, and
>> I've let the job autoscale up to a max of 60 workers (which it has done).
>> I'm using dynamic destinations to hit 2 field-partitioned tables. Incoming
>> data per table is ~10k events/second, so every 5 minutes each table should
>> be ingesting on the order of 200k records of ~20 kB apiece.
>>
>> We don't get many knobs to turn in BigQueryIO. I have tested numShards
>> between 10 and 1000, but haven't seen obvious differences in performance.
>>
>> Potentially relevant: I see a high rate of warnings on the shuffler. They
>> consist mostly of LevelDB warnings about "Too many L0 files". There are
>> occasionally some other warnings relating to memory as well. Would using
>> larger workers potentially help?
>>
>> Does anybody have experience with tuning BigQueryIO writing? It's quite a
>> complicated transform under the hood and it looks like there are several
>> steps of grouping and shuffling data that could be limiting throughput.
>>
>
> --
> Tobias Kaymak
> Data Engineer
>
> [email protected]
> www.ricardo.ch
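
For readers skimming the archive, a minimal sketch of the kind of BigQueryIO file-loads setup described in the quoted message follows. The event field, table names, and shard count are hypothetical placeholders, not the actual pipeline code:

    // Sketch only: periodic file loads from a streaming pipeline into two
    // pre-created, field-partitioned tables, roughly matching the settings
    // discussed in this thread. Field and table names are hypothetical.
    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.ValueInSingleWindow;
    import org.joda.time.Duration;

    public class BigQueryFileLoadsSketch {
      static void writeEvents(PCollection<TableRow> rows) {
        rows.apply(
            "WriteToBigQuery",
            BigQueryIO.writeTableRows()
                // Dynamic destinations: route each row to one of two tables
                // based on a field in the row.
                .to(
                    (ValueInSingleWindow<TableRow> row) ->
                        new TableDestination(
                            "my-project:my_dataset.events_"
                                + row.getValue().get("event_type"),
                            null))
                // Periodic batch load jobs instead of streaming inserts.
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                // Kick off a load roughly every 5 minutes.
                .withTriggeringFrequency(Duration.standardMinutes(5))
                // Files written per destination per trigger; this is the
                // numShards knob referred to in the thread.
                .withNumFileShards(100)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
      }
    }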
