Re: BigQueryIO streaming inserts - poor performance with multiple tables
Could you please keep writing here the findings you make? I'm very interested in this issue as well. Thanks! On Thu, Mar 1, 2018 at 9:45 AM Joshwrote: > Hi Cham, > > Thanks, I have emailed the dataflow-feedback email address with the > details. > > Best regards, > Josh > > On Thu, Mar 1, 2018 at 12:26 AM, Chamikara Jayalath > wrote: > >> Could be a DataflowRunner specific issue. Would you mind reporting this >> with corresponding Dataflow job IDs to either Dataflow stackoverflow >> channel [1] or dataflow-feedb...@google.com ? >> >> I suspect Dataflow split writing to multiple tables into multiple workers >> which may be keep all workers busy but we have to look at the job to >> confirm. >> >> Thanks, >> Cham >> >> [1] https://stackoverflow.com/questions/tagged/google-cloud-dataflow >> >> On Tue, Feb 27, 2018 at 11:56 PM Josh wrote: >> >>> Hi all, >>> >>> We are using BigQueryIO.write() to stream data into BigQuery, and are >>> seeing very poor performance in terms of number of writes per second per >>> worker. >>> >>> We are currently using *32* x *n1-standard-4* workers to stream ~15,000 >>> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely the >>> number of workers and worker CPU utilisation remains constant at ~90% even >>> when the rate of input fluctuates down to below 10,000 writes/sec. The job >>> always keeps up with the stream (no backlog). >>> >>> I've seen BigQueryIO benchmarks which show ~20k writes/sec being >>> achieved with a single node, when streaming data into a *single* BQ >>> table... So my theory is that writing to multiple tables is somehow causing >>> the performance issue. Our writes are spread (unevenly) across 200+ tables. >>> The job itself does very little processing, and looking at the Dataflow >>> metrics pretty much all of the wall time is spent in the >>> *StreamingWrite* step of BigQueryIO. The Beam version is 2.2.0. >>> >>> Our code looks like this: >>> >>> stream.apply(BigQueryIO.write() >>> .to(new ToDestination()) >>> .withFormatFunction(new FormatForBigQuery()) >>> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER) >>> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)); >>> >>> where ToDestination is a: >>> >>> SerializableFunction >>> >>> which returns a: >>> >>> new TableDestination(tableName, "") >>> >>> where tableName looks like "myproject:dataset.tablename$20180228" >>> >>> Has as anyone else seen this kind of poor performance when streaming writes >>> to multiple BQ tables? Is there anything here that sounds wrong, or any >>> optimisations we can make? >>> >>> Thanks for any advice! >>> >>> Josh >>> >> >
Re: BigQueryIO streaming inserts - poor performance with multiple tables
Hi Cham, Thanks, I have emailed the dataflow-feedback email address with the details. Best regards, Josh On Thu, Mar 1, 2018 at 12:26 AM, Chamikara Jayalathwrote: > Could be a DataflowRunner specific issue. Would you mind reporting this > with corresponding Dataflow job IDs to either Dataflow stackoverflow > channel [1] or dataflow-feedb...@google.com ? > > I suspect Dataflow split writing to multiple tables into multiple workers > which may be keep all workers busy but we have to look at the job to > confirm. > > Thanks, > Cham > > [1] https://stackoverflow.com/questions/tagged/google-cloud-dataflow > > On Tue, Feb 27, 2018 at 11:56 PM Josh wrote: > >> Hi all, >> >> We are using BigQueryIO.write() to stream data into BigQuery, and are >> seeing very poor performance in terms of number of writes per second per >> worker. >> >> We are currently using *32* x *n1-standard-4* workers to stream ~15,000 >> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely the >> number of workers and worker CPU utilisation remains constant at ~90% even >> when the rate of input fluctuates down to below 10,000 writes/sec. The job >> always keeps up with the stream (no backlog). >> >> I've seen BigQueryIO benchmarks which show ~20k writes/sec being achieved >> with a single node, when streaming data into a *single* BQ table... So >> my theory is that writing to multiple tables is somehow causing the >> performance issue. Our writes are spread (unevenly) across 200+ tables. The >> job itself does very little processing, and looking at the Dataflow metrics >> pretty much all of the wall time is spent in the *StreamingWrite* step >> of BigQueryIO. The Beam version is 2.2.0. >> >> Our code looks like this: >> >> stream.apply(BigQueryIO.write() >> .to(new ToDestination()) >> .withFormatFunction(new FormatForBigQuery()) >> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER) >> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)); >> >> where ToDestination is a: >> >> SerializableFunction >> >> which returns a: >> >> new TableDestination(tableName, "") >> >> where tableName looks like "myproject:dataset.tablename$20180228" >> >> Has as anyone else seen this kind of poor performance when streaming writes >> to multiple BQ tables? Is there anything here that sounds wrong, or any >> optimisations we can make? >> >> Thanks for any advice! >> >> Josh >> >
Re: BigQueryIO streaming inserts - poor performance with multiple tables
Could be a DataflowRunner specific issue. Would you mind reporting this with corresponding Dataflow job IDs to either Dataflow stackoverflow channel [1] or dataflow-feedb...@google.com ? I suspect Dataflow split writing to multiple tables into multiple workers which may be keep all workers busy but we have to look at the job to confirm. Thanks, Cham [1] https://stackoverflow.com/questions/tagged/google-cloud-dataflow On Tue, Feb 27, 2018 at 11:56 PM Joshwrote: > Hi all, > > We are using BigQueryIO.write() to stream data into BigQuery, and are > seeing very poor performance in terms of number of writes per second per > worker. > > We are currently using *32* x *n1-standard-4* workers to stream ~15,000 > writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely the > number of workers and worker CPU utilisation remains constant at ~90% even > when the rate of input fluctuates down to below 10,000 writes/sec. The job > always keeps up with the stream (no backlog). > > I've seen BigQueryIO benchmarks which show ~20k writes/sec being achieved > with a single node, when streaming data into a *single* BQ table... So my > theory is that writing to multiple tables is somehow causing the > performance issue. Our writes are spread (unevenly) across 200+ tables. The > job itself does very little processing, and looking at the Dataflow metrics > pretty much all of the wall time is spent in the *StreamingWrite* step of > BigQueryIO. The Beam version is 2.2.0. > > Our code looks like this: > > stream.apply(BigQueryIO.write() > .to(new ToDestination()) > .withFormatFunction(new FormatForBigQuery()) > .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER) > .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)); > > where ToDestination is a: > > SerializableFunction > > which returns a: > > new TableDestination(tableName, "") > > where tableName looks like "myproject:dataset.tablename$20180228" > > Has as anyone else seen this kind of poor performance when streaming writes > to multiple BQ tables? Is there anything here that sounds wrong, or any > optimisations we can make? > > Thanks for any advice! > > Josh >