Thanks for sharing the learnings Ahmed!

> The solution lies in keeping the retry of each step separate. A good
> example of this is in how steps 2 and 3 are implemented [3]. They are
> separated into different DoFns and step 3 can start only after step 2
> completes successfully. This way, any failure in step 3 does not go back to
> affect step 2.

Is it enough just that they're in different DoFns? I thought the key was that the DoFns are separated by a GroupByKey, so they will be in different fused stages, which are retried independently.

Brian

On Thu, Sep 1, 2022 at 1:43 PM Ahmed Abualsaud via dev <dev@beam.apache.org> wrote:

> Hi all,
>
> TLDR: When writing IO connectors, be wary of how bundle retries can affect
> the work flow.
>
> A faulty implementation of a step in BigQuery batch loads was discovered
> recently. I raised an issue [1] but also wanted to mention it here as a
> potentially helpful lesson for those developing new/existing IO connectors.
>
> For those unfamiliar with BigQueryIO file loads, a write that is too large
> for a single load job [2] looks roughly something like this:
>
> 1. Take input rows and write them to temporary files.
> 2. Load temporary files to temporary BQ tables.
> 3. Delete temporary files.
> 4. Copy the contents of temporary tables over to the final table.
> 5. Delete temporary tables.
>
> The faulty part here is that steps 4 and 5 are done in the same DoFn (4 in
> processElement and 5 in finishBundle). In the case a bundle fails in the
> middle of table deletion, let’s say an error occurs when deleting the nth
> table, the whole bundle will retry and we will perform the copy again. But
> tables 1~n have already been deleted and so we get stuck trying to copy
> from non-existent sources.
>
> The solution lies in keeping the retry of each step separate. A good
> example of this is in how steps 2 and 3 are implemented [3]. They are
> separated into different DoFns and step 3 can start only after step 2
> completes successfully. This way, any failure in step 3 does not go back to
> affect step 2.
>
> That's all, thanks for your attention :)
>
> Ahmed
>
> [1] https://github.com/apache/beam/issues/22920
> [2] https://github.com/apache/beam/blob/f921a2f1996cf906d994a9d62aeb6978bab09dd5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L100-L105
> [3] https://github.com/apache/beam/blob/149ed074428ff9b5344169da7d54e8ee271aaba1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L437-L454
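
For anyone who wants to see the failure mode from the thread in isolation, here is a toy simulation in plain Java. It is only a sketch and does not use the Beam or BigQuery APIs: `RetryDemo`, `FakeWarehouse`, `runWithRetries` and the table names are all hypothetical, a "retry unit" is just a `Runnable` that is re-run from the start on failure, and deletes are assumed to be idempotent. It also does not model fusion, so it says nothing about whether two DoFns actually end up as separate retry units on a given runner; it only shows why the copy and the delete need to be in separate ones.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RetryDemo {
    // Hypothetical temp table names, standing in for the temporary BQ tables.
    static final List<String> TEMP_TABLES = List.of("tmp_0", "tmp_1", "tmp_2");

    /** Toy stand-in for the warehouse; not the BigQuery client API. */
    static final class FakeWarehouse {
        final Set<String> existing = new HashSet<>(TEMP_TABLES);
        final Set<String> copiedToFinal = new HashSet<>();
        boolean failDeleteOnce = true; // inject exactly one transient failure

        void copy(String table) {
            // Copying from a table that was already deleted cannot succeed.
            if (!existing.contains(table)) {
                throw new IllegalStateException("copy source missing: " + table);
            }
            copiedToFinal.add(table);
        }

        void delete(String table) {
            if (failDeleteOnce && table.equals("tmp_2")) {
                failDeleteOnce = false;
                throw new RuntimeException("transient delete failure: " + table);
            }
            existing.remove(table); // assumed idempotent: deleting twice is a no-op
        }
    }

    /** Re-runs the whole unit from the start on any failure, like a bundle retry. */
    static boolean runWithRetries(Runnable unit, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                unit.run();
                return true;
            } catch (RuntimeException e) {
                // Fall through and retry the entire unit.
            }
        }
        return false;
    }

    /** Copy and delete share one retry unit (the faulty shape from the thread). */
    static boolean runFused() {
        FakeWarehouse w = new FakeWarehouse();
        return runWithRetries(() -> {
            TEMP_TABLES.forEach(w::copy);
            TEMP_TABLES.forEach(w::delete);
        }, 4);
    }

    /** Copy and delete are retried independently; delete starts only after copy succeeds. */
    static boolean runSeparated() {
        FakeWarehouse w = new FakeWarehouse();
        boolean copied = runWithRetries(() -> TEMP_TABLES.forEach(w::copy), 4);
        boolean deleted = copied && runWithRetries(() -> TEMP_TABLES.forEach(w::delete), 4);
        return deleted && w.existing.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("fused copy+delete succeeded: " + runFused());
        System.out.println("separated units succeeded: " + runSeparated());
    }
}
```

In the fused case the first attempt deletes `tmp_0` and `tmp_1` before the injected failure, so every retry dies on the copy of `tmp_0` and the unit never succeeds. In the separated case the retry only repeats the deletes, which are harmless to repeat, so it completes.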