Re: Losing records when using BigQuery IO Connector

2023-05-03 Thread XQ Hu via user
https://github.com/apache/beam/issues/26515 tracks this issue, and the fix
has been merged. Thanks a lot for reporting it, Binh!


Re: Losing records when using BigQuery IO Connector

2023-04-17 Thread Binh Nguyen Van
Hi,

I tested with streaming inserts and file loads, and both worked as expected.
But the Storage Write API looks like the way to go now, so I want to test it
too. I am using Apache Beam v2.46.0 and running the pipeline on Google Cloud
Dataflow.
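
For reference, the only change between those test runs was the write method on
the same transform, roughly like this (a sketch reusing the names from my
pipeline snippet further down the thread; Event is a stand-in for the actual
element type):

// Write is immutable, so each withMethod call returns a new transform; each
// variant was applied in a separate test run.
BigQueryIO.Write<Event> base = BigQueryIO.<Event>write()
    .to(options.getTable())
    .withFormatFunction(TableRowMappers::toRow)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);

BigQueryIO.Write<Event> streamingInserts = base.withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS); // worked
BigQueryIO.Write<Event> fileLoads = base.withMethod(BigQueryIO.Write.Method.FILE_LOADS);               // worked
BigQueryIO.Write<Event> storageWriteApi = base.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API);  // loses records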

Thanks
-Binh


Re: Losing records when using BigQuery IO Connector

2023-04-17 Thread Reuven Lax via user
What version of Beam are you using? There are no known data-loss bugs in the
connector; however, if there has been a regression, we would like to address
it with high priority.


Re: Losing records when using BigQuery IO Connector

2023-04-17 Thread XQ Hu via user
Does FILE_LOADS (
https://beam.apache.org/releases/javadoc/2.46.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS)
work for your case?
As for STORAGE_WRITE_API, it has been actively improved. If the latest SDK
still shows this issue, I highly recommend creating a Google Cloud support
ticket.
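
Switching to FILE_LOADS is mostly a one-line change on the write shown further
down the thread; roughly (a sketch reusing the names from that snippet, with
Event as a stand-in element type):

// Same write as before, but using batch load jobs instead of the Storage Write API.
BigQueryIO.Write<Event> fileLoads = BigQueryIO.<Event>write()
    .to(options.getTable())
    .withFormatFunction(TableRowMappers::toRow)
    .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);

// Only if the input PCollection is unbounded: FILE_LOADS then also needs a
// load frequency and a shard count. Drop these for a bounded (batch) input.
fileLoads = fileLoads
    .withTriggeringFrequency(org.joda.time.Duration.standardMinutes(5))
    .withNumFileShards(10);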


Losing records when using BigQuery IO Connector

2023-04-17 Thread Binh Nguyen Van
Hi,

I have a job that uses the BigQuery IO connector to write to a BigQuery table.
When I test it with a small number of records (100) it works as expected, but
when I test it with a larger number of records (1), I don't see all of the
records in the output table, only part of them. The number written changes
from run to run, but so far it has never been more than 1000 records.

There are WARNING log entries in the job log like this:

by client #0 failed with error, operations will be retried.
com.google.cloud.bigquery.storage.v1.Exceptions$OffsetAlreadyExists:
ALREADY_EXISTS: The offset is within stream, expected offset 933,
received 0

And there is one log entry like this:

Finalize of stream [stream-name] finished with row_count: 683

If I sum all of the numbers reported in the WARNING messages with the
row_count in the finalize-of-stream entry above, I get 1, which is exactly the
number of input records.

My pipeline uses one worker, and it looks like this:

WriteResult writeResult = inputCollection.apply(
    "Save Events To BigQuery",
    BigQueryIO.write()
        .to(options.getTable())
        .withFormatFunction(TableRowMappers::toRow)
        .withMethod(Method.STORAGE_WRITE_API)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withExtendedErrorInfo());

writeResult
    .getFailedStorageApiInserts()
    .apply("Log failed inserts", ParDo.of(new PrintFn<>()));

There are no log entries for the failed inserts.
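
For context, PrintFn is just a small logging DoFn along these lines (a rough
sketch; the full class is omitted here):

import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logs every element it sees and passes it through unchanged. For
// getFailedStorageApiInserts() the elements are BigQueryStorageApiInsertError
// values, which carry the failed row and an error message.
public class PrintFn<T> extends DoFn<T, T> {
  private static final Logger LOG = LoggerFactory.getLogger(PrintFn.class);

  @ProcessElement
  public void processElement(@Element T element, OutputReceiver<T> out) {
    LOG.warn("Failed BigQuery insert: {}", element);
    out.output(element);
  }
}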

Is there anything wrong with my pipeline code, or is it a bug in the BigQuery
IO connector?

Thanks
-Binh