Reuven Lax created BEAM-2858:
--------------------------------

             Summary: temp file garbage collection in BigQuery sink should be in a separate DoFn
                 Key: BEAM-2858
                 URL: https://issues.apache.org/jira/browse/BEAM-2858
             Project: Beam
          Issue Type: Bug
          Components: sdk-java-gcp
    Affects Versions: 2.1.0
            Reporter: Reuven Lax
            Assignee: Chamikara Jayalath
             Fix For: 2.2.0


Currently the WriteTables transform deletes the set of input files as soon as 
the load() job completes. However, this is incorrect: if the task fails 
partway through deleting files (e.g. if the worker crashes), the task will be 
retried, and the retry re-runs the load against a set of input files that is 
now incomplete. If the write disposition is WRITE_TRUNCATE, the retried load 
can replace a correctly written table.
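
A rough way to picture the separation proposed in the summary, as a plain-Java 
sketch rather than Beam code: cleanup becomes its own step that only deletes, 
and it treats an already-missing file as success, so retrying it never re-runs 
the load. The class and method names below are hypothetical, and local files 
stand in for the temp files the sink writes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TempFileCleanup {
    /**
     * Deletes every path in the list, treating an already-missing file as success.
     * Returns the number of files actually removed by this call.
     */
    public static int deleteIgnoringMissing(List<Path> files) throws IOException {
        int removed = 0;
        for (Path file : files) {
            // deleteIfExists returns false instead of throwing when the file is gone,
            // so a retry after a partial deletion completes normally.
            if (Files.deleteIfExists(file)) {
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("bq-temp");
        Path a = Files.createFile(dir.resolve("part-0.json"));
        Path b = Files.createFile(dir.resolve("part-1.json"));

        // Simulate a worker that crashed after deleting only the first file.
        Files.delete(a);

        // The retried cleanup still succeeds and removes the remaining file.
        int removed = deleteIgnoringMissing(List.of(a, b));
        System.out.println("removed on retry: " + removed);

        Files.delete(dir);
    }
}
```

Because the cleanup step has no side effect other than deletion, running it 
twice is harmless, which is the property the combined load-and-delete step 
lacks.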

The resulting behavior will depend on what BQ does if one of the input files 
is missing (because we had previously deleted it). In the best case, BQ will 
fail the load. In this case the step will keep failing until the runner 
finally fails the entire job. If, however, BQ ignores the missing file, the 
load will overwrite the previously written table with the contents of the 
smaller set of files and the job will succeed. This is the worst-case 
scenario, as it will result in data loss.
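
The worst case can be reproduced with a toy in-memory model. This is only a 
sketch under the assumption stated above (a load that silently skips missing 
files); it does not call BigQuery and says nothing about what BQ actually 
does. The class, method, and file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TruncateRetryModel {
    /**
     * Models a WRITE_TRUNCATE load: the returned rows replace the whole table.
     * Files absent from the store are skipped silently (the assumed worst case).
     */
    public static List<String> loadTruncate(List<String> fileNames,
                                            Map<String, List<String>> store) {
        List<String> rows = new ArrayList<>();
        for (String name : fileNames) {
            List<String> contents = store.get(name);
            if (contents != null) {
                rows.addAll(contents);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, List<String>> store = new HashMap<>();
        store.put("part-0", List.of("row1", "row2"));
        store.put("part-1", List.of("row3"));
        List<String> files = List.of("part-0", "part-1");

        // First attempt: the table holds all three rows.
        List<String> table = loadTruncate(files, store);
        System.out.println("after first load: " + table.size() + " rows");

        // The worker crashes after deleting only part-0, and the step is retried.
        store.remove("part-0");
        table = loadTruncate(files, store);

        // The retry "succeeds", but the table now holds only the rows from part-1.
        System.out.println("after retried load: " + table.size() + " rows");
    }
}
```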



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
