Abacn opened a new issue, #26264: URL: https://github.com/apache/beam/issues/26264
### What happened?

A BigQuery export read pipeline worked for Beam <= 2.42.0 but is failing in newer versions. Error message:

```
Error message from worker: java.lang.IllegalArgumentException: Total size of the BoundedSource objects generated by split() operation is larger than the allowable limit. When splitting org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@536a6236 into bundles of 565244165161 bytes it generated 5000 BoundedSource objects with total serialized size of 23719241 bytes which is larger than the limit 20971520
```

From the logging, the serialized size of the BoundedSource objects has grown between versions.

Failed job (v2.45.0):

```
Splitting source org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@6dfae6d4 into bundles of estimated size 5018826027 bytes produced 5000 bundles, which have total serialized size 1176766077 bytes. As this is too large for the Google Cloud Dataflow API, retrying splitting once with increased desiredBundleSizeBytes 563238545888 to reduce the number of splits.
Splitting with desiredBundleSizeBytes 563238545888 produced 5000 bundles with total serialized size 1176766077 bytes
Splitting source org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@6dfae6d4 into bundles of estimated size 563238545888 bytes produced 5000 bundles. Rebundling into 100 bundles.
Splitting source org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@6dfae6d4 produced 100 bundles with total serialized response size 23625229
```

Note that even after rebundling into 100 bundles, the total serialized size (23625229 bytes) still exceeds the 20971520-byte limit, hence the failure.

Succeeded job (v2.42.0):

```
Splitting source org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@3fe148e1 into bundles of estimated size 5018826027 bytes produced 5000 bundles, which have total serialized size 612350329 bytes. As this is too large for the Google Cloud Dataflow API, retrying splitting once with increased desiredBundleSizeBytes 293090798266 to reduce the number of splits.
Splitting with desiredBundleSizeBytes 293090798266 produced 5000 bundles with total serialized size 612350329 bytes
Splitting source org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@3fe148e1 into bundles of estimated size 293090798266 bytes produced 5000 bundles. Rebundling into 100 bundles.
Splitting source org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@3fe148e1 produced 100 bundles with total serialized response size 12673589
```

----

## Problem on the runner code

In both the succeeded job and the failed job, the first adjustment (retrying the split with an increased desiredBundleSizeBytes) reduced neither the number of bundles nor the total serialized size:

https://github.com/apache/beam/blob/837733e862007cbeba0a57f682109d9debec9340/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/WorkerCustomSources.java#L239

Only the second adjustment took effect and reduced the number of bundles to 100:

https://github.com/apache/beam/blob/837733e862007cbeba0a57f682109d9debec9340/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/WorkerCustomSources.java#L253

This cannot guarantee that the serialized-size limit is satisfied. There must be some problem in splitAndValidate preventing it from working as expected.

----

## Problem on the serialized size

The logs show that the total serialized size increased from `612350329` to `1176766077` bytes, a 92% increase.
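For context, the minimum example mentioned below can be sketched roughly as follows. This is a sketch only: the table reference, pipeline options, and downstream transform are hypothetical placeholders, not the exact job from this report. It forces BigQueryIO's export-based read path, which goes through BigQueryTableSource and the split() logic discussed above.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;

public class BigQueryExportReadExample {
  public static void main(String[] args) {
    // Pass --runner=DataflowRunner plus --project/--region/--tempLocation on the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    p.apply(
            "ReadFromBigQuery",
            BigQueryIO.readTableRows()
                // Hypothetical table reference; substitute a real project:dataset.table.
                .from("my-project:my_dataset.my_table")
                // EXPORT is the default read method; it is spelled out here because the
                // regression is specific to the export read path (BigQueryTableSource).
                .withMethod(TypedRead.Method.EXPORT))
        // Trivial downstream transform so the read is not dead code.
        .apply("CountRows", Count.<TableRow>globally());

    p.run().waitUntilFinish();
  }
}
```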
I ran a minimum example pipeline (reading a BigQuery table with a simple schema using export, along the lines of the sketch above) and checked the log output for the serialized BigQueryTableSource. The strange thing is that it shows the opposite: the serialized size becomes smaller in Beam 2.43.0 and onwards:

```
v2.41.0: org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@1ab4652b produced 1 bundles with total serialized response size 3487
v2.42.0: org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@7072c138 produced 1 bundles with total serialized response size 3491
v2.43.0: org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@3648c304 produced 1 bundles with total serialized response size 2399
v2.46.0: org.apache.beam.sdk.io.gcp.bigquery.BigQueryTableSource@634fa80a produced 1 bundles with total serialized response size 2439
```

It looks like the serialized size of the source is not well controlled.

### Issue Priority

Priority: 1 (data loss / total loss of function)

### Issue Components

- [ ] Component: Python SDK
- [X] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [X] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
