jbandoro commented on issue #31422:
URL: https://github.com/apache/beam/issues/31422#issuecomment-3033835318

   Similar to the OP but with the Python SDK: we're reading messages from a 
Pub/Sub subscription (with `beam.io.gcp.pubsub.ReadFromPubSub`), parsing the 
records, and then writing to BigQuery with:
   
   ```python
           parsed_messages | "Write to BigQuery" >> beam.io.WriteToBigQuery(
               table=FULL_TABLE_ID,
               schema=TABLE_SCHEMA,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
               with_auto_sharding=True,
           )
   ```
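
   For context, a minimal sketch of the upstream steps that produce 
`parsed_messages` (the subscription path and the parse function here are 
placeholders, not our actual code):

    ```python
    import json

    import apache_beam as beam

    def parse_message(message: bytes) -> dict:
        # Placeholder parse step; the real pipeline does more validation here.
        return json.loads(message.decode("utf-8"))

    parsed_messages = (
        pipeline
        | "Read from Pub/Sub" >> beam.io.gcp.pubsub.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub"  # placeholder
        )
        | "Parse messages" >> beam.Map(parse_message)
    )
    ```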
   
   We're running `apache-beam==2.66.0` with the DataflowRunner on a streaming 
pipeline in `streaming_mode_at_least_once` mode, using a custom Dataflow 
container image built following the instructions in the 
[docs](https://cloud.google.com/dataflow/docs/guides/build-container-image).
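
   Roughly how we pass the options, as a sketch; the project, region, bucket, 
and image values are placeholders, and the at-least-once mode is set via 
`dataflow_service_options`:

    ```python
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder option values; the custom image is passed via sdk_container_image.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # placeholder
        region="us-central1",                 # placeholder
        temp_location="gs://my-bucket/temp",  # placeholder
        streaming=True,
        dataflow_service_options=["streaming_mode_at_least_once"],
        sdk_container_image="gcr.io/my-project/my-beam-image:latest",  # placeholder
    )

    with beam.Pipeline(options=options) as pipeline:
        ...  # ReadFromPubSub -> parse -> WriteToBigQuery as above
    ```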
   
   From my testing, setting `with_auto_sharding` to False still resulted in the 
errors I posted in the comment above. One `WriteToBigQuery` argument that 
seemed to reduce the frequency of the errors was setting `triggering_frequency` 
to a value above 5 seconds. I also added some custom metric counters, and it 
doesn't appear that we're losing any data: the worker retries succeed and all 
of the streaming data is landing in the BigQuery sink.
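
   A sketch of those two things, the `triggering_frequency` setting and a 
pass-through counter, with example counter names and an example 10-second 
value:

    ```python
    import apache_beam as beam
    from apache_beam.metrics import Metrics

    class CountElements(beam.DoFn):
        """Pass-through DoFn that increments a counter per element."""

        def __init__(self):
            # Example namespace/name, not our real counter names.
            self.counter = Metrics.counter("pubsub_to_bq", "parsed_messages")

        def process(self, element):
            self.counter.inc()
            yield element

    counted = parsed_messages | "Count messages" >> beam.ParDo(CountElements())

    counted | "Write to BigQuery" >> beam.io.WriteToBigQuery(
        table=FULL_TABLE_ID,
        schema=TABLE_SCHEMA,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
        triggering_frequency=10,  # seconds; values above ~5s reduced the errors for us
        with_auto_sharding=True,
    )
    ```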
   
   Let me know if there's anything else that would be helpful or steps I can 
take, thanks!
   
   

