Re: BigQueryIO streaming inserts - poor performance with multiple tables

2018-03-06 Thread Carlos Alonso
Could you please keep posting your findings here?

I'm very interested in this issue as well.

Thanks!

On Thu, Mar 1, 2018 at 9:45 AM Josh  wrote:

> Hi Cham,
>
> Thanks, I have emailed the dataflow-feedback email address with the
> details.
>
> Best regards,
> Josh
>
> On Thu, Mar 1, 2018 at 12:26 AM, Chamikara Jayalath 
> wrote:
>
>> Could be a DataflowRunner specific issue. Would you mind reporting this
>> with corresponding Dataflow job IDs to either Dataflow stackoverflow
>> channel [1] or dataflow-feedb...@google.com ?
>>
>> I suspect Dataflow splits writing to multiple tables across multiple workers,
>> which may keep all workers busy, but we have to look at the job to confirm.
>>
>> Thanks,
>> Cham
>>
>> [1] https://stackoverflow.com/questions/tagged/google-cloud-dataflow
>>
>> On Tue, Feb 27, 2018 at 11:56 PM Josh  wrote:
>>
>>> Hi all,
>>>
>>> We are using BigQueryIO.write() to stream data into BigQuery, and are
>>> seeing very poor performance in terms of number of writes per second per
>>> worker.
>>>
>>> We are currently using *32* x *n1-standard-4* workers to stream ~15,000
>>> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely, the
>>> number of workers and the worker CPU utilisation remain constant at ~90% even
>>> when the rate of input fluctuates down to below 10,000 writes/sec. The job
>>> always keeps up with the stream (no backlog).
>>>
>>> I've seen BigQueryIO benchmarks which show ~20k writes/sec being
>>> achieved with a single node, when streaming data into a *single* BQ
>>> table... So my theory is that writing to multiple tables is somehow causing
>>> the performance issue. Our writes are spread (unevenly) across 200+ tables.
>>> The job itself does very little processing, and looking at the Dataflow
>>> metrics pretty much all of the wall time is spent in the
>>> *StreamingWrite* step of BigQueryIO. The Beam version is 2.2.0.
>>>
>>> Our code looks like this:
>>>
>>> stream.apply(BigQueryIO.write()
>>> .to(new ToDestination())
>>> .withFormatFunction(new FormatForBigQuery())
>>> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
>>>
>>> where ToDestination is a:
>>>
>>> SerializableFunction<ValueInSingleWindow<T>, TableDestination>
>>>
>>> which returns a:
>>>
>>> new TableDestination(tableName, "")
>>>
>>> where tableName looks like "myproject:dataset.tablename$20180228"
>>>
>>> Has anyone else seen this kind of poor performance when streaming writes
>>> to multiple BQ tables? Is there anything here that sounds wrong, or any
>>> optimisations we can make?
>>>
>>> Thanks for any advice!
>>>
>>> Josh
>>>
>>
>
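
For readers who want to reproduce the setup described above, here is a minimal sketch of what the two helper classes might look like. The MyEvent element type, its fields, and the project/dataset names are placeholders for illustration only; the actual classes from Josh's pipeline were not posted.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Hypothetical element type standing in for whatever the pipeline actually carries.
class MyEvent implements java.io.Serializable {
  String table;      // e.g. "tablename"
  String eventDate;  // e.g. "20180228"
  String payload;
}

// Routes each element to a date-partitioned table, producing specs like
// "myproject:dataset.tablename$20180228" as described in the thread.
class ToDestination
    implements SerializableFunction<ValueInSingleWindow<MyEvent>, TableDestination> {
  @Override
  public TableDestination apply(ValueInSingleWindow<MyEvent> input) {
    MyEvent event = input.getValue();
    String tableSpec = "myproject:dataset." + event.table + "$" + event.eventDate;
    return new TableDestination(tableSpec, "");  // empty table description
  }
}

// Converts each element into a TableRow for the streaming insert.
class FormatForBigQuery implements SerializableFunction<MyEvent, TableRow> {
  @Override
  public TableRow apply(MyEvent event) {
    return new TableRow().set("payload", event.payload);
  }
}

With a non-TableRow element type like this, the write transform would typically be instantiated as BigQueryIO.<MyEvent>write() so that the destination and format functions type-check.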


Re: BigQueryIO streaming inserts - poor performance with multiple tables

2018-03-01 Thread Josh
Hi Cham,

Thanks, I have emailed the dataflow-feedback email address with the details.

Best regards,
Josh

On Thu, Mar 1, 2018 at 12:26 AM, Chamikara Jayalath 
wrote:

> Could be a DataflowRunner specific issue. Would you mind reporting this
> with corresponding Dataflow job IDs to either Dataflow stackoverflow
> channel [1] or dataflow-feedb...@google.com ?
>
> I suspect Dataflow splits writing to multiple tables across multiple workers,
> which may keep all workers busy, but we have to look at the job to confirm.
>
> Thanks,
> Cham
>
> [1] https://stackoverflow.com/questions/tagged/google-cloud-dataflow
>
> On Tue, Feb 27, 2018 at 11:56 PM Josh  wrote:
>
>> Hi all,
>>
>> We are using BigQueryIO.write() to stream data into BigQuery, and are
>> seeing very poor performance in terms of number of writes per second per
>> worker.
>>
>> We are currently using *32* x *n1-standard-4* workers to stream ~15,000
>> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely, the
>> number of workers and the worker CPU utilisation remain constant at ~90% even
>> when the rate of input fluctuates down to below 10,000 writes/sec. The job
>> always keeps up with the stream (no backlog).
>>
>> I've seen BigQueryIO benchmarks which show ~20k writes/sec being achieved
>> with a single node, when streaming data into a *single* BQ table... So
>> my theory is that writing to multiple tables is somehow causing the
>> performance issue. Our writes are spread (unevenly) across 200+ tables. The
>> job itself does very little processing, and looking at the Dataflow metrics
>> pretty much all of the wall time is spent in the *StreamingWrite* step
>> of BigQueryIO. The Beam version is 2.2.0.
>>
>> Our code looks like this:
>>
>> stream.apply(BigQueryIO.write()
>> .to(new ToDestination())
>> .withFormatFunction(new FormatForBigQuery())
>> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
>>
>> where ToDestination is a:
>>
>> SerializableFunction<ValueInSingleWindow<T>, TableDestination>
>>
>> which returns a:
>>
>> new TableDestination(tableName, "")
>>
>> where tableName looks like "myproject:dataset.tablename$20180228"
>>
>> Has anyone else seen this kind of poor performance when streaming writes
>> to multiple BQ tables? Is there anything here that sounds wrong, or any
>> optimisations we can make?
>>
>> Thanks for any advice!
>>
>> Josh
>>
>


Re: BigQueryIO streaming inserts - poor performance with multiple tables

2018-02-28 Thread Chamikara Jayalath
Could be a DataflowRunner specific issue. Would you mind reporting this
with corresponding Dataflow job IDs to either Dataflow stackoverflow
channel [1] or dataflow-feedb...@google.com ?

I suspect Dataflow splits writing to multiple tables across multiple workers,
which may keep all workers busy, but we have to look at the job to confirm.

Thanks,
Cham

[1] https://stackoverflow.com/questions/tagged/google-cloud-dataflow

On Tue, Feb 27, 2018 at 11:56 PM Josh  wrote:

> Hi all,
>
> We are using BigQueryIO.write() to stream data into BigQuery, and are
> seeing very poor performance in terms of number of writes per second per
> worker.
>
> We are currently using *32* x *n1-standard-4* workers to stream ~15,000
> writes/sec to BigQuery. Each worker has ~90% CPU utilisation. Strangely, the
> number of workers and the worker CPU utilisation remain constant at ~90% even
> when the rate of input fluctuates down to below 10,000 writes/sec. The job
> always keeps up with the stream (no backlog).
>
> I've seen BigQueryIO benchmarks which show ~20k writes/sec being achieved
> with a single node, when streaming data into a *single* BQ table... So my
> theory is that writing to multiple tables is somehow causing the
> performance issue. Our writes are spread (unevenly) across 200+ tables. The
> job itself does very little processing, and looking at the Dataflow metrics
> pretty much all of the wall time is spent in the *StreamingWrite* step of
> BigQueryIO. The Beam version is 2.2.0.
>
> Our code looks like this:
>
> stream.apply(BigQueryIO.write()
> .to(new ToDestination())
> .withFormatFunction(new FormatForBigQuery())
> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
>
> where ToDestination is a:
>
> SerializableFunction<ValueInSingleWindow<T>, TableDestination>
>
> which returns a:
>
> new TableDestination(tableName, "")
>
> where tableName looks like "myproject:dataset.tablename$20180228"
>
> Has anyone else seen this kind of poor performance when streaming writes
> to multiple BQ tables? Is there anything here that sounds wrong, or any
> optimisations we can make?
>
> Thanks for any advice!
>
> Josh
>