TobiasBredow opened a new issue, #32155:
URL: https://github.com/apache/beam/issues/32155

   ### What happened?
   
   I noticed some behavioral differences when switching ingestion from 
STREAMING_INSERTS to STORAGE_WRITE_API in the WriteToBigQuery transform, 
using the Python API.
   
   Namely, with the old ingestion method it is possible to omit repeated fields 
entirely and they default to an empty list. However, this fails as soon as the 
newer STORAGE_WRITE_API is used.
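   For context, here is a minimal sketch of the kind of input that is accepted by STREAMING_INSERTS but fails with STORAGE_WRITE_API (the schema and field names are hypothetical, just to illustrate the shape):

   ```python
   # Hypothetical BigQuery table schema with a repeated RECORD field.
   schema = {
       "fields": [
           {"name": "id", "type": "STRING", "mode": "REQUIRED"},
           {
               "name": "tags",
               "type": "RECORD",
               "mode": "REPEATED",
               "fields": [{"name": "key", "type": "STRING", "mode": "NULLABLE"}],
           },
       ]
   }

   # With STREAMING_INSERTS this row is accepted and "tags" defaults to [].
   # With STORAGE_WRITE_API the same row fails during Beam Row conversion.
   row_missing_repeated = {"id": "a"}

   # Current workaround: always materialize the empty list explicitly,
   # which inflates every element on a high-frequency source.
   row_with_workaround = {"id": "a", "tags": []}
   ```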
   
   It seems that because the input is converted to a Beam Row before being sent 
to the Java API, it runs into an error in 
[beam_row_from_dict](https://github.com/apache/beam/blob/2f93d8bc19917f83d15f531bcbbfb7f36e21ff88/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L1567).
 Fields that are not present are converted to None, but if such a field is a 
repeated STRUCT or RECORD, the conversion then fails when trying to iterate 
over None in line 
[1601](https://github.com/apache/beam/blob/2f93d8bc19917f83d15f531bcbbfb7f36e21ff88/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L1601).
 Is this intended new behavior? It forces us to always add an empty list to the 
dict before sending it to the WriteToBigQuery transform. Especially with 
multiple such fields in a high-frequency source, this adds to the 
data-processed costs in Dataflow.
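   A minimal reproduction of the underlying failure, independent of Beam (this is my reading of the linked code; the function names here are illustrative, not the actual helpers in bigquery_tools.py):

   ```python
   def convert_repeated_field(row, field_name):
       """Mimics the failing pattern: a missing key becomes None, and
       the repeated-field branch then tries to iterate over it."""
       value = row.get(field_name)  # -> None when the key is absent
       return [str(v) for v in value]  # TypeError: 'NoneType' is not iterable


   def convert_repeated_field_fixed(row, field_name):
       """Sketch of the suggested fix: default a missing repeated
       field to an empty list before iterating."""
       value = row.get(field_name) or []
       return [str(v) for v in value]


   try:
       convert_repeated_field({"id": "a"}, "tags")
   except TypeError:
       print("fails when the repeated field is absent")

   print(convert_repeated_field_fixed({"id": "a"}, "tags"))  # -> []
   ```

   The fixed variant matches what STREAMING_INSERTS effectively does today: an absent repeated field behaves like an empty list.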
   
   I would also be happy to fix this myself, since it looks like a small and 
easy change, provided that failing on absent repeated fields is not by design.
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

