Thanks Heejong. I agree that writing to a service using 50 unlimited thread pools sounds excessive and can flood that service (BigQuery in this case) in error scenarios where we should back off. Determining a suitable, limited amount of parallelism (50 in this case) sounds good to me.
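
For anyone following along, here is a minimal sketch of the difference between an unbounded and a bounded executor. It is not the code from the PR. The class name BoundedInsertSketch, the helper peakConcurrency, and the 20 ms sleep standing in for an insert request are all made up for illustration; only the java.util.concurrent calls are real.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedInsertSketch {

  // Submits `tasks` simulated insert calls to the executor and returns the
  // highest number of calls that were observed running at the same time.
  static int peakConcurrency(ExecutorService executor, int tasks) {
    AtomicInteger inFlight = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<Future<?>> futures = new ArrayList<>();
    for (int i = 0; i < tasks; i++) {
      futures.add(executor.submit(() -> {
        int now = inFlight.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);
        try {
          Thread.sleep(20); // hypothetical stand-in for a remote insert request
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          inFlight.decrementAndGet();
        }
      }));
    }
    try {
      for (Future<?> f : futures) {
        f.get(); // wait for every simulated insert to finish
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      executor.shutdown();
    }
    return peak.get();
  }

  public static void main(String[] args) {
    // Unbounded: the cached pool may create a new thread per pending task.
    System.out.println("cached pool peak: "
        + peakConcurrency(Executors.newCachedThreadPool(), 50));
    // Bounded: at most 4 tasks run at once; the rest wait in the queue.
    System.out.println("fixed(4) pool peak: "
        + peakConcurrency(Executors.newFixedThreadPool(4), 50));
    // Single thread: tasks run strictly one after another.
    System.out.println("single-thread peak: "
        + peakConcurrency(Executors.newSingleThreadExecutor(), 50));
  }
}
```

The single-thread executor always reports a peak of 1 and the fixed pool never exceeds its size, while the cached pool's peak depends on scheduling and can approach the number of submitted tasks. Multiply that by 50 shards and it is easy to see how the rate limit gets hit.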

Thanks,
Cham

On Wed, Jan 16, 2019 at 6:53 PM Heejong Lee <[email protected]> wrote:

> Hi,
>
> I want to suggest the change[1] of the thread pool type in BigQuery
> streaming insert for Java SDK (BEAM-6443). When we insert small data into
> BigQuery very fast by using BigQueryIO.write, it generates lots of rate
> limit exceeded errors in a log file. It's mainly because the number of
> threads to be used for the inserting job is just too large (50 shards *
> dozens of futures executed by unlimited thread pool per each bundle). I've
> conducted some benchmarks[2] and could see that the change from unlimited
> thread pool to single thread pool reduces the number of (same repeated,
> possibly meaningless) error messages by 1/4 while retaining the same
> performance. I think that this change will not break any important
> performance measure but if anybody has any concerns about this change
> please let me know.
>
> Thanks,
>
> [1] https://github.com/apache/beam/pull/7547
> [2]
> https://docs.google.com/document/d/1EhRNWLevm86GD_QtvlrTauHITVMwQBzuemyp-w4Z_ck/edit#heading=h.c0angyd9tn21
>
