Reuven Lax created BEAM-2858:
--------------------------------
Summary: temp file garbage collection in BigQuery sink should be
in a separate DoFn
Key: BEAM-2858
URL: https://issues.apache.org/jira/browse/BEAM-2858
Project: Beam
Issue Type: Bug
Components: sdk-java-gcp
Affects Versions: 2.1.0
Reporter: Reuven Lax
Assignee: Chamikara Jayalath
Fix For: 2.2.0
Currently the WriteTables transform deletes its input files as soon as
the load() job completes. This is incorrect: if the task fails partway
through deleting files (e.g. if the worker crashes), the task will be
retried, and the retry re-runs the load against a partially deleted set of
files. If the write disposition is WRITE_TRUNCATE, the previously written
table can be overwritten with incomplete data.
The resulting behavior depends on what BQ does if one of the input files is
missing (because we previously deleted it). In the best case, BQ fails
the load; the step then keeps failing until the runner finally
fails the entire job. If, however, BQ ignores the missing file, the load
overwrites the previously written table with the smaller set of files and the
job succeeds. This is the worst-case scenario, as it results in data
loss.
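
The failure mode above can be sketched with a small toy model. This is not
Beam or BigQuery code: load_truncate, delete_files, the in-memory "store" and
the file names are all hypothetical stand-ins, and the sketch assumes that in
the proposed design the load result is committed before cleanup starts, so a
cleanup failure retries only the deletion.

```python
class WorkerCrash(Exception):
    """Simulates a worker dying partway through a step."""


def load_truncate(table, store, files, ignore_missing):
    """Toy WRITE_TRUNCATE load: replace the table with rows from `files`."""
    missing = [name for name in files if name not in store]
    if missing and not ignore_missing:
        raise FileNotFoundError(f"missing input files: {missing}")
    # Truncate semantics: the old contents are discarded in every case.
    table[:] = [row for name in files if name in store for row in store[name]]


def delete_files(store, files, crash_after=None):
    """Delete temp files one at a time; optionally crash after N deletions."""
    for count, name in enumerate(files):
        if crash_after is not None and count == crash_after:
            raise WorkerCrash(f"crashed after deleting {count} file(s)")
        store.pop(name, None)  # idempotent: deleting a missing file is a no-op


def simulate_fused(ignore_missing):
    """Load and delete in one retried step (the behavior described above)."""
    store = {"a.json": [1, 2], "b.json": [3, 4]}
    files = list(store)
    table = []
    try:
        load_truncate(table, store, files, ignore_missing)
        delete_files(store, files, crash_after=1)
    except WorkerCrash:
        pass  # the runner retries the whole step, including the load
    load_truncate(table, store, files, ignore_missing)
    delete_files(store, files)
    return table


def simulate_separate(ignore_missing):
    """Cleanup as its own step (assumed design): only deletion is retried."""
    store = {"a.json": [1, 2], "b.json": [3, 4]}
    files = list(store)
    table = []
    load_truncate(table, store, files, ignore_missing)  # assumed committed here
    try:
        delete_files(store, files, crash_after=1)
    except WorkerCrash:
        pass  # the retry re-runs the deletion only, never the load
    delete_files(store, files)
    return table
```

In this model, simulate_fused(True) ends with only the rows of the surviving
file (the data-loss case), simulate_fused(False) raises on the retried load
(the keeps-failing case), and simulate_separate keeps all rows either way.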