Abacn opened a new issue, #26264:
URL: https://github.com/apache/beam/issues/26264

   ### What happened?
   
   A BigQuery export read pipeline that worked on Beam <= 2.42.0 fails in newer 
versions. Error message:
   
   ```
   Error message from worker: java.lang.IllegalArgumentException: Total size of 
the BoundedSource objects generated by split() operation is larger than the 
allowable limit. When splitting 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@536a6236 into bundles 
of 565244165161 bytes it generated 5000 BoundedSource objects with total 
serialized size of 23719241 bytes which is larger than the limit 20971520
   ```
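
   For context, here is a minimal sketch of the kind of pipeline affected (the 
table name and options are placeholders, not the actual failing job):
   ```
   import org.apache.beam.sdk.Pipeline;
   import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
   import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;
   import org.apache.beam.sdk.options.PipelineOptionsFactory;

   public class BigQueryExportRead {
     public static void main(String[] args) {
       Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
       // EXPORT (the default) extracts the table to Avro files on GCS and reads
       // them back; on Dataflow the resulting BigQueryTableSource is split into
       // BoundedSource bundles, which is where the error above is thrown.
       p.apply(
           BigQueryIO.readTableRows()
               .from("my-project:my_dataset.my_table") // placeholder table
               .withMethod(TypedRead.Method.EXPORT));
       p.run();
     }
   }
   ```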
   
   From the logs, the serialized size of the BoundedSource objects has increased:
   
   Failed job (v2.45.0):
   ```
   Splitting source 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@6dfae6d4 into bundles 
of estimated size 5018826027 bytes produced 5000 bundles, which have total 
serialized size 1176766077 bytes. As this is too large for the Google Cloud 
Dataflow API, retrying splitting once with increased desiredBundleSizeBytes 
563_238_545_888 to reduce the number of splits.
   Splitting with desiredBundleSizeBytes 563238545888 produced 5000 bundles 
with total serialized size 1176766077 bytes
   Splitting source 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@6dfae6d4 into bundles 
of estimated size 563238545888 bytes produced 5000 bundles. Rebundling into 100 
bundles.
   Splitting source 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@6dfae6d4 produced 100 
bundles with total serialized response size 23625229
   ```
   Succeeded job (v2.42.0):
   ```
   Splitting source 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@3fe148e1 into bundles 
of estimated size 5018826027 bytes produced 5000 bundles, which have total 
serialized size 612350329 bytes. As this is too large for the Google Cloud 
Dataflow API, retrying splitting once with increased desiredBundleSizeBytes 
293_090_798_266 to reduce the number of splits.
   Splitting with desiredBundleSizeBytes 293090798266 produced 5000 bundles 
with total serialized size 612350329 bytes
   Splitting source 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@3fe148e1 into bundles 
of estimated size 293090798266 bytes produced 5000 bundles. Rebundling into 100 
bundles.
   Splitting source 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@3fe148e1 produced 100 
bundles with total serialized response size 12673589
   ```
   
   ----
   
   ## Problem on the runner code
   
   In both the succeeded and the failed job, retrying the split with a larger 
desiredBundleSizeBytes left the bundle count and the total serialized size 
unchanged:
   
https://github.com/apache/beam/blob/837733e862007cbeba0a57f682109d9debec9340/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/WorkerCustomSources.java#L239
   
   Only the second adjustment took effect, reducing the number of bundles to 
100: 
https://github.com/apache/beam/blob/837733e862007cbeba0a57f682109d9debec9340/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/WorkerCustomSources.java#L253
   
   This does not guarantee that the serialized-size limit is satisfied. There 
must be some problem in splitAndValidate preventing it from working as expected.
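
   As a sanity check on the numbers above: even after rebundling into 100 
bundles, the failed job's serialized response still exceeds the 20971520-byte 
(20 MiB) limit, while the succeeded job's fits. A minimal sketch of that 
arithmetic (variable names are illustrative, not the identifiers used in 
WorkerCustomSources):
   ```
   public class SplitSizeCheck {
     public static void main(String[] args) {
       // The limit quoted in the error message: 20 MiB.
       long limitBytes = 20L * 1024 * 1024; // 20971520

       long failed = 23_625_229L;    // v2.45.0 log: 100 bundles after rebundling
       long succeeded = 12_673_589L; // v2.42.0 log: 100 bundles after rebundling

       System.out.println(failed > limitBytes);    // true  -> IllegalArgumentException
       System.out.println(succeeded > limitBytes); // false -> response accepted
     }
   }
   ```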
   
   ----
   
   ## Problem on the serialized size
   
   Between the two jobs, the total serialized size increased from `612350329` to 
`1176766077` bytes, a 92% increase.
   
   I ran a minimal example pipeline (reading a simple-schema BigQuery table 
using export) and checked the logged serialized size of BigQueryTableSource. 
Strangely, it shows the opposite trend: the serialized size becomes smaller in 
Beam 2.43.0 and onwards:
   ```
   v2.41.0: org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@1ab4652b 
produced 1 bundles with total serialized response size 3487
   v2.42.0: org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@7072c138 
produced 1 bundles with total serialized response size 3491
   v2.43.0: org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@3648c304 
produced 1 bundles with total serialized response size 2399
   v2.46.0: org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@634fa80a 
produced 1 bundles with total serialized response size 2439
   ```
   It looks like the serialized size of the source is not tightly controlled 
across versions.
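
   For anyone reproducing this, one rough way to probe a source's serialized 
size directly is Java serialization via SerializableUtils (a sketch; the 
worker's "serialized response size" is measured on the encoded split response, 
so this is only an approximation):
   ```
   import java.io.Serializable;
   import org.apache.beam.sdk.util.SerializableUtils;

   public class SourceSizeProbe {
     // Approximates the serialized footprint of a source object. Note this is
     // not identical to the "serialized response size" the Dataflow worker
     // reports, which is measured on the encoded split response.
     static int serializedSizeBytes(Serializable source) {
       return SerializableUtils.serializeToByteArray(source).length;
     }
   }
   ```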
   
   
   
   
   ### Issue Priority
   
   Priority: 1 (data loss / total loss of function)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [X] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [X] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

