Re: [Question] Beam Java Dataflow v1 Runner Oracle JDK

2023-04-17 Thread hardip singh
Hi Bruno,

Yep, it is indeed based on the OpenJDK source code. However, it looks to be 
provided by Oracle (and hence falls under the Oracle licence):

% docker run -it --entrypoint '/bin/bash' 
gcr.io/cloud-dataflow/v1beta3/beam-java-streaming:2.46.0
WARNING: The requested image's platform (linux/amd64) does not match the 
detected host platform (linux/arm64/v8) and no specific platform was requested
root@e8dfe675ad85:/# java  -XshowSettings:properties -version 2>&1 | grep vendor
java.specification.vendor = Oracle Corporation
java.vendor = Oracle Corporation
java.vendor.url = http://java.oracle.com/
java.vendor.url.bug = http://bugreport.sun.com/bugreport/
java.vm.specification.vendor = Oracle Corporation
java.vm.vendor = Oracle Corporation

(The Eclipse runtime's vendor, as used in the beam_javaX_sdk images, is set to 
Temurin, with Adoptium URLs.)
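
For reference, the equivalent check against a Beam SDK harness image would be 
something along these lines (the Docker Hub tag apache/beam_java8_sdk:2.46.0 is 
an assumption on my part; adjust it to whichever SDK image your jobs pull):

% docker run -it --entrypoint '/bin/bash' apache/beam_java8_sdk:2.46.0
# java -XshowSettings:properties -version 2>&1 | grep vendor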

I’m unable to find the Dockerfile to see exactly where the runtime is being 
sourced from.

Thanks

Hardip


> On 17 Apr 2023, at 19:00, Bruno Volpato via user  wrote:
> 
> Hello Hardip,
> 
> If you are using Beam 2.46.0, it should be using OpenJDK already (not 
> Oracle's JRE as before).
> 
> No need for the sources; you can check the images directly from your 
> terminal if you have Docker installed:
> 
> $ docker run -it --entrypoint '/bin/bash' 
> gcr.io/cloud-dataflow/v1beta3/beam-java-streaming:2.46.0 
> 
> # java -version
> openjdk version "1.8.0_322"
> OpenJDK Runtime Environment (build 1.8.0_322-b06)
> OpenJDK 64-Bit Server VM (build 25.322-b06, mixed mode)
> 
> 
> Best,
> Bruno
> 
> 
> 
> 
> 
> 
> 
> On Mon, Apr 17, 2023 at 1:29 PM hardip singh wrote:
>> Hi,
>> 
>> I was hoping someone could shed some light on, and potentially offer a solution 
>> to, a problem I face with my usage of the v1 runner.
>> 
>> Due to the Oracle Java SE licensing changes for older Java versions, I am looking 
>> to move to the Eclipse Temurin OpenJDK runtime, which I can see has been 
>> updated in the container used by the V2 runner as of version 2.46.0 
>> (https://github.com/apache/beam/blob/master/sdks/java/container/Dockerfile#L19).
>> My code is running on Java 8.
>> 
>> My code is currently running on the V1 runner, which looks to be using an 
>> Oracle-provided OpenJDK runtime.
>> 
>> Upgrading the pipelines to the V2 runner causes a significant degradation of 
>> performance, which I am not in a position to work through straight away.
>> 
>> So is it possible to use the v1 runner with a non-Oracle-provided JVM?
>> 
>> I cannot seem to find the source of the Docker container for 
>> beam-java-batch/streaming (I was looking to see whether I could update it to 
>> the same Java runtime as the V2 runner).
>> 
>> Any guidance would be gratefully received.
>> 
>> Thanks
>> 
>> Hardip



Re: [Question] Beam Java Dataflow v1 Runner Oracle JDK

2023-04-17 Thread Bruno Volpato via user
Hello Hardip,

If you are using Beam 2.46.0, it should be using OpenJDK already (not
Oracle's JRE as before).

No need for the sources; you can check the images directly from your
terminal if you have Docker installed:


$ docker run -it --entrypoint '/bin/bash'
gcr.io/cloud-dataflow/v1beta3/beam-java-streaming:2.46.0

# java -version

openjdk version "1.8.0_322"
OpenJDK Runtime Environment (build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM (build 25.322-b06, mixed mode)



Best,
Bruno







On Mon, Apr 17, 2023 at 1:29 PM hardip singh  wrote:

> Hi,
>
> I was hoping someone could shed some light on, and potentially offer a solution
> to, a problem I face with my usage of the v1 runner.
>
> Due to the Oracle Java SE licensing changes for older Java versions, I am
> looking to move to the Eclipse Temurin OpenJDK runtime, which I can see
> has been updated in the container used by the V2 runner as of version 2.46.0 (
> https://github.com/apache/beam/blob/master/sdks/java/container/Dockerfile#L19).
> My code is running on Java 8.
>
> My code is currently running on the V1 runner, which looks to be using an
> Oracle-provided OpenJDK runtime.
>
> Upgrading the pipelines to the V2 runner causes a significant degradation
> of performance, which I am not in a position to work through straight away.
>
> So is it possible to use the v1 runner with a non-Oracle-provided JVM?
>
> I cannot seem to find the source of the Docker container for
> beam-java-batch/streaming (I was looking to see whether I could update it to
> the same Java runtime as the V2 runner).
>
> Any guidance would be gratefully received.
>
> Thanks
>
> Hardip
>


[Question] Beam Java Dataflow v1 Runner Oracle JDK

2023-04-17 Thread hardip singh
Hi,

I was hoping someone could shed some light on, and potentially offer a solution 
to, a problem I face with my usage of the v1 runner.

Due to the Oracle Java SE licensing changes for older Java versions, I am looking to 
move to the Eclipse Temurin OpenJDK runtime, which I can see has been updated 
in the container used by the V2 runner as of version 2.46.0 
(https://github.com/apache/beam/blob/master/sdks/java/container/Dockerfile#L19).
My code is running on Java 8.

My code is currently running on the V1 runner, which looks to be using an 
Oracle-provided OpenJDK runtime.

Upgrading the pipelines to the V2 runner causes a significant degradation of 
performance, which I am not in a position to work through straight away.

So is it possible to use the v1 runner with a non-Oracle-provided JVM?

I cannot seem to find the source of the Docker container for 
beam-java-batch/streaming (I was looking to see whether I could update it to the 
same Java runtime as the V2 runner).

Any guidance would be gratefully received.

Thanks

Hardip

Re: Losing records when using BigQuery IO Connector

2023-04-17 Thread Binh Nguyen Van
Hi,

I tested with streaming inserts and file loads, and they both worked as
expected. But it looks like the Storage Write API is the new way to go, so I want
to test it too. I am using Apache Beam v2.46.0 and running it with Google Dataflow.

Thanks
-Binh


On Mon, Apr 17, 2023 at 9:53 AM Reuven Lax via user 
wrote:

> What version of Beam are you using? There are no known data-loss bugs in
> the connector; however, if there has been a regression, we would like to
> address it with high priority.
>
> On Mon, Apr 17, 2023 at 12:47 AM Binh Nguyen Van 
> wrote:
>
>> Hi,
>>
>> I have a job that uses BigQuery IO Connector to write to a BigQuery
>> table. When I test it with a small number of records (100) it works as
>> expected but when I tested it with a larger number of records (1), I
>> don’t see all of the records written to the output table but only part of
>> it. It changes from run to run but no more than 1000 records as I have seen
>> so far.
>>
>> There are WARNING log entries in the job log like this
>>
>> by client #0 failed with error, operations will be retried.
>> com.google.cloud.bigquery.storage.v1.Exceptions$OffsetAlreadyExists: 
>> ALREADY_EXISTS: The offset is within stream, expected offset 933, received 0
>>
>> And one log entry like this
>>
>> Finalize of stream [stream-name] finished with row_count: 683
>>
>> If I sum all the numbers reported in the WARNING message with the one in
>> the finalize of stream above, I get 1, which is exactly the number
>> of input records.
>>
>> My pipeline uses 1 worker and it looks like this
>>
>> WriteResult writeResult = inputCollection.apply(
>>     "Save Events To BigQuery",
>>     BigQueryIO.write()
>>         .to(options.getTable())
>>         .withFormatFunction(TableRowMappers::toRow)
>>         .withMethod(Method.STORAGE_WRITE_API)
>>         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>>         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
>>         .withExtendedErrorInfo());
>>
>> writeResult
>>     .getFailedStorageApiInserts()
>>     .apply("Log failed inserts", ParDo.of(new PrintFn<>()));
>>
>> There are no log entries for the failed inserts.
>>
>> Is there anything wrong with my pipeline code or is it a bug in BigQuery
>> IO Connector?
>>
>> Thanks
>> -Binh
>>
>


Re: Losing records when using BigQuery IO Connector

2023-04-17 Thread Reuven Lax via user
What version of Beam are you using? There are no known data-loss bugs in
the connector; however, if there has been a regression, we would like to
address it with high priority.

On Mon, Apr 17, 2023 at 12:47 AM Binh Nguyen Van  wrote:

> Hi,
>
> I have a job that uses BigQuery IO Connector to write to a BigQuery table.
> When I test it with a small number of records (100) it works as expected
> but when I tested it with a larger number of records (1), I don’t see
> all of the records written to the output table but only part of it. It
> changes from run to run but no more than 1000 records as I have seen so far.
>
> There are WARNING log entries in the job log like this
>
> by client #0 failed with error, operations will be retried.
> com.google.cloud.bigquery.storage.v1.Exceptions$OffsetAlreadyExists: 
> ALREADY_EXISTS: The offset is within stream, expected offset 933, received 0
>
> And one log entry like this
>
> Finalize of stream [stream-name] finished with row_count: 683
>
> If I sum all the numbers reported in the WARNING message with the one in
> the finalize of stream above, I get 1, which is exactly the number of
> input records.
>
> My pipeline uses 1 worker and it looks like this
>
> WriteResult writeResult = inputCollection.apply(
>     "Save Events To BigQuery",
>     BigQueryIO.write()
>         .to(options.getTable())
>         .withFormatFunction(TableRowMappers::toRow)
>         .withMethod(Method.STORAGE_WRITE_API)
>         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
>         .withExtendedErrorInfo());
>
> writeResult
>     .getFailedStorageApiInserts()
>     .apply("Log failed inserts", ParDo.of(new PrintFn<>()));
>
> There are no log entries for the failed inserts.
>
> Is there anything wrong with my pipeline code or is it a bug in BigQuery
> IO Connector?
>
> Thanks
> -Binh
>


Re: Losing records when using BigQuery IO Connector

2023-04-17 Thread XQ Hu via user
Does FILE_LOADS (
https://beam.apache.org/releases/javadoc/2.46.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS)
work for your case?
STORAGE_WRITE_API has been actively improved. If the latest SDK
still has this issue, I highly recommend creating a Google Cloud
support ticket.
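
For illustration, a minimal sketch of what switching the write in the pipeline
quoted below to FILE_LOADS could look like (this is not from the original
pipeline: the table, format function and dispositions are carried over from the
quoted snippet, MyEvent is a placeholder for the pipeline's element type, and
the triggering frequency and shard count are placeholder values that FILE_LOADS
requires for an unbounded input):

// Sketch only: the same write, but batched through BigQuery load jobs.
// Duration here is org.joda.time.Duration.
WriteResult writeResult = inputCollection.apply(
    "Save Events To BigQuery",
    BigQueryIO.<MyEvent>write()                      // MyEvent = placeholder element type
        .to(options.getTable())
        .withFormatFunction(TableRowMappers::toRow)
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        .withTriggeringFrequency(Duration.standardMinutes(5))  // required for unbounded input
        .withNumFileShards(1)                        // or withAutoSharding()
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));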

On Mon, Apr 17, 2023 at 3:47 AM Binh Nguyen Van  wrote:

> Hi,
>
> I have a job that uses BigQuery IO Connector to write to a BigQuery table.
> When I test it with a small number of records (100) it works as expected
> but when I tested it with a larger number of records (1), I don’t see
> all of the records written to the output table but only part of it. It
> changes from run to run but no more than 1000 records as I have seen so far.
>
> There are WARNING log entries in the job log like this
>
> by client #0 failed with error, operations will be retried.
> com.google.cloud.bigquery.storage.v1.Exceptions$OffsetAlreadyExists: 
> ALREADY_EXISTS: The offset is within stream, expected offset 933, received 0
>
> And one log entry like this
>
> Finalize of stream [stream-name] finished with row_count: 683
>
> If I sum all the numbers reported in the WARNING message with the one in
> the finalize of stream above, I get 1, which is exactly the number of
> input records.
>
> My pipeline uses 1 worker and it looks like this
>
> WriteResult writeResult = inputCollection.apply(
>     "Save Events To BigQuery",
>     BigQueryIO.write()
>         .to(options.getTable())
>         .withFormatFunction(TableRowMappers::toRow)
>         .withMethod(Method.STORAGE_WRITE_API)
>         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
>         .withExtendedErrorInfo());
>
> writeResult
>     .getFailedStorageApiInserts()
>     .apply("Log failed inserts", ParDo.of(new PrintFn<>()));
>
> There are no log entries for the failed inserts.
>
> Is there anything wrong with my pipeline code or is it a bug in BigQuery
> IO Connector?
>
> Thanks
> -Binh
>


Losing records when using BigQuery IO Connector

2023-04-17 Thread Binh Nguyen Van
Hi,

I have a job that uses the BigQuery IO connector to write to a BigQuery table.
When I test it with a small number of records (100) it works as expected,
but when I test it with a larger number of records (1), I don't see
all of the records written to the output table, only part of them. The number
written changes from run to run, but it has been no more than 1000 records so far.

There are WARNING log entries in the job log like this

by client #0 failed with error, operations will be retried.
com.google.cloud.bigquery.storage.v1.Exceptions$OffsetAlreadyExists:
ALREADY_EXISTS: The offset is within stream, expected offset 933,
received 0

And one log entry like this

Finalize of stream [stream-name] finished with row_count: 683

If I sum all the numbers reported in the WARNING message with the one in
the finalize of stream above, I get 1, which is exactly the number of
input records.

My pipeline uses 1 worker and it looks like this

WriteResult writeResult = inputCollection.apply(
    "Save Events To BigQuery",
    BigQueryIO.write()
        .to(options.getTable())
        .withFormatFunction(TableRowMappers::toRow)
        .withMethod(Method.STORAGE_WRITE_API)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .withExtendedErrorInfo());

writeResult
    .getFailedStorageApiInserts()
    .apply("Log failed inserts", ParDo.of(new PrintFn<>()));

There are no log entries for the failed inserts.
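
For reference, PrintFn itself is not shown above; a minimal stand-in of the kind
it presumably is (the class name and logging choices here are illustrative, not
the actual PrintFn) would be:

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryStorageApiInsertError;
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative replacement for PrintFn: logs every failed Storage Write API
// insert (the element carries the failed row and an error message) at WARN level.
class LogFailedInsertFn extends DoFn<BigQueryStorageApiInsertError, Void> {
  private static final Logger LOG = LoggerFactory.getLogger(LogFailedInsertFn.class);

  @ProcessElement
  public void processElement(@Element BigQueryStorageApiInsertError error) {
    LOG.warn("Failed BigQuery insert: {}", error);
  }
}

It would be wired up the same way as PrintFn above, via
ParDo.of(new LogFailedInsertFn()).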

Is there anything wrong with my pipeline code, or is it a bug in the BigQuery IO
connector?

Thanks
-Binh