kvudata commented on issue #27156: URL: https://github.com/apache/beam/issues/27156#issuecomment-1642724845
Yes, we are using TFRecordIO (specifically `beam.io.WriteToTFRecord`). > it sounds like your process and/or finish_bundle methods may be not idempotent. Idempotency is an important consideration when writing IO code. Note that we're only logging to Stackdriver aka Google Cloud Logging in our DoFn (we gather logs in `process()` and then log the batch in `finish_bundle()`), and we're ok with duplicate logs being generated. We're observing that Dataflow seems to perform some kind of retry if our `finish_bundle()` fails, and the later `WriteToTFRecord` (which doesn't depend on this DoFn in our pipeline) ends up writing duplicates. Another interesting observation is that the metrics for the number of items processed / written in the Dataflow UI is the number of elements we would expect if duplicates were not being written - the duplicates are only apparent from inspecting the output tfrecord(s). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
