I opened a support case with Google, and they helped me get the pipeline to
a state where it can keep up with the 10k events/second input. The job was
bottlenecked on disk I/O, so we switched the workers to SSD persistent disks
and provisioned enough disk capacity to reach the maximum disk throughput
per worker.
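
For reference, here is a minimal sketch of the worker disk settings
involved (Java SDK; the 500 GB size is illustrative, not the exact value we
landed on):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class DiskOptionsSketch {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args)
                .as(DataflowPipelineOptions.class);
        // SSD persistent disks instead of the default pd-standard.
        options.setWorkerDiskType(
            "compute.googleapis.com/projects//zones//diskTypes/pd-ssd");
        // PD throughput scales with provisioned size; 500 GB is illustrative.
        options.setDiskSizeGb(500);
        // ... construct and run the pipeline with these options ...
      }
    }

The equivalent command-line flags are --workerDiskType and --diskSizeGb.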

On Wed, Jan 30, 2019 at 3:03 AM Kaymak, Tobias <[email protected]>
wrote:

> Hi,
>
> I am currently playing around with the BigQueryIO options myself, and I am
> not an expert on it, but 60 workers sounds like a lot to me (or like a very
> expensive computation) for 10k records/second hitting 2 tables.
> Could you maybe share the code of your pipeline?
>
> Cheers,
> Tobi
>
> On Tue, Jan 22, 2019 at 9:28 PM Jeff Klukas <[email protected]> wrote:
>
>> I'm attempting to deploy a fairly simple job on the Dataflow runner that
>> reads from PubSub and writes to BigQuery using file loads, but I have so
>> far not been able to tune it to keep up with the incoming data rate.
>>
>> I have configured BigQueryIO.write to trigger loads every 5 minutes, and
>> I've let the job autoscale up to a max of 60 workers (which it has done).
>> I'm using dynamic destinations to write to 2 field-partitioned tables.
>> Incoming data per table is ~10k events/second, so every 5 minutes each
>> table should be ingesting on the order of 3 million records of ~20 kB
>> apiece.
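>>
>> For concreteness, the write is configured roughly like this (a sketch,
>> not the exact production code; the routing field "doc_type", the
>> project/dataset names, and "rows" as the PCollection<TableRow> parsed
>> from PubSub are all placeholders):
>>
>>   import com.google.api.services.bigquery.model.TableRow;
>>   import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
>>   import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
>>   import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
>>   import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
>>   import org.apache.beam.sdk.values.ValueInSingleWindow;
>>   import org.joda.time.Duration;
>>
>>   rows.apply(
>>       "WriteToBigQuery",
>>       BigQueryIO.writeTableRows()
>>           .to(
>>               (ValueInSingleWindow<TableRow> row) -> {
>>                 // Route each record to one of the 2 destination tables.
>>                 String table = (String) row.getValue().get("doc_type");
>>                 return new TableDestination(
>>                     "myproject:mydataset." + table, null);
>>               })
>>           .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>>           .withTriggeringFrequency(Duration.standardMinutes(5))
>>           .withNumFileShards(100) // the numShards knob mentioned below
>>           // Tables are pre-created with field-based partitioning.
>>           .withCreateDisposition(CreateDisposition.CREATE_NEVER)
>>           .withWriteDisposition(WriteDisposition.WRITE_APPEND));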
>>
>> We don't get many knobs to turn in BigQueryIO. I have tested numShards
>> between 10 and 1000, but haven't seen obvious differences in performance.
>>
>> Potentially relevant: I see a high rate of warnings from the shuffler,
>> consisting mostly of LevelDB warnings about "Too many L0 files". There
>> are occasionally other memory-related warnings as well. Would using
>> larger workers potentially help?
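>>
>> (If larger workers are worth a try, I assume that comes down to the
>> workerMachineType pipeline option, e.g. --workerMachineType=n1-highmem-8,
>> where n1-highmem-8 is just an example of a memory-heavy machine type,
>> not a recommendation.)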
>>
>> Does anybody have experience with tuning BigQueryIO writing? It's quite a
>> complicated transform under the hood and it looks like there are several
>> steps of grouping and shuffling data that could be limiting throughput.
>>
>
>
> --
> Tobias Kaymak
> Data Engineer
>
> [email protected]
> www.ricardo.ch
>
