Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Robert Bradshaw
Pipeline stages must be retry-tolerant. E.g. the VM it's running on might get shut down. We should not be failing jobs in this case. It seems the current implementation could only produce bad results if (1) unrelated output files already existed and (2) the temporary files were either not written

Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Reuven Lax
On Fri, Feb 2, 2018 at 11:17 AM, Chamikara Jayalath wrote: > Currently, Python file-based sink is batch only. > Sure, but that won't be true forever. > > Regarding Raghu's question, stage/pipeline failure should not be > considered as a data loss but I prefer overriding

Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Raghu Angadi
On Fri, Feb 2, 2018 at 10:21 AM, Reuven Lax wrote: > However this code might run in streaming as well, right? > True. What is the best practices recommendation to handle it? Probably the author of the sink transform should consider the context and decide if needs to be retry

Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Chamikara Jayalath
Currently, Python file-based sink is batch only. Regarding Raghu's question, stage/pipeline failure should not be considered as a data loss but I prefer overriding existing output and completing a possibly expensive pipeline over failing the whole pipeline due to one or more existing files. -

Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Reuven Lax
However this code might run in streaming as well, right? On Fri, Feb 2, 2018 at 9:54 AM, Raghu Angadi wrote: > In a batch pipeline, is it considered a data loss if the the stage fails > (assuming it does not set IGNORE_MISSING_FILES and fails hard)? If not, it > might be

Re: Filesystems.copy and .rename behavior

2018-02-02 Thread Raghu Angadi
In a batch pipeline, is it considered a data loss if the the stage fails (assuming it does not set IGNORE_MISSING_FILES and fails hard)? If not, it might be better to favor correctness and fail in current implementation. On Thu, Feb 1, 2018 at 4:07 PM, Robert Bradshaw wrote:

Re: Filesystems.copy and .rename behavior

2018-02-01 Thread Robert Bradshaw
You could add a step to delete all of dest before a barrier and the step that does the rename as outlined. In that case, any dest file that exists must be good. On Thu, Feb 1, 2018 at 2:52 PM, Eugene Kirpichov wrote: > I think this is still unsafe in case exists(dst) (e.g.

Re: Filesystems.copy and .rename behavior

2018-02-01 Thread Eugene Kirpichov
I think this is still unsafe in case exists(dst) (e.g. this is a re-run of a pipeline) but src is missing due to some bad reason. However it's probably better than what we have (e.g. we currently certainly don't perform checksum checks). On Thu, Feb 1, 2018 at 2:45 PM Udi Meiri

Re: Filesystems.copy and .rename behavior

2018-02-01 Thread Udi Meiri
For GCS, I would do what I believe we already do. rename(src, dst): - if !exists(src) and exists(dst) return 0 - if !exists(src) and !exists(dst) return error - if exists(src) and exists(dst) { if checksum(src) == checksum(dst) return 0 else delete(dst) } - Start a GCS copy from src to dst. - Wait

Re: Filesystems.copy and .rename behavior

2018-02-01 Thread Eugene Kirpichov
Yes, IGNORE_MISSING_FILES is unsafe because it will - well - ignore files that are missing for more ominous reasons than just this being a non-first attempt at renaming src to dst. E.g. if there was a bug in constructing the filename to be renamed, or if we somehow messed up the order of rename vs

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
Eugene, if I get this right, you're saying that IGNORE_MISSING_FILES is unsafe because it will skip (src, dst) pairs where neither exist? (it only looks if src exists) For GCS, we can construct a safe retryable rename() operation, assuming that copy() and delete() are atomic for a single file or

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Raghu Angadi
On Wed, Jan 31, 2018 at 2:43 PM, Eugene Kirpichov wrote: > As far as I know, the current implementation of file sinks is the only > reason why the flag IGNORE_MISSING for copying even exists - there's no > other compelling reason to justify it. We implement "rename" as

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Eugene Kirpichov
I agree that using an atomic rename operation is even better. I'm mainly opposed to having a copy option that ignores missing files, and to our implementation of rename using that option, because it's unsafe. Unfortunately GCS doesn't have an atomic rename, so I'm not sure what's the best way to

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Chamikara Jayalath
Agree with what Robert said. We have a rename() operation in the FileSystem abstraction and some file-systems might be able to implement this more efficiently than copy+delete. Also note that the same issue could arise in any other usage of rename operation. So I agree that a retry-tolerant

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
I agree that overwriting is more in line with user expectations. I believe that the sink should not ignore errors from the filesystem layer. Instead, the FileSystem API should be more well defined. Examples: rename() and copy() should overwrite existing files at the destination, copy() should have

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Raghu Angadi
Original mail mentions that output from second run of word_count is ignored. That does not seem as safe as ignoring error from a second attempt of a step. How do we know second run didn't run on different output? Overwriting seems more accurate than ignoring. Does handling this error at sink level

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Udi Meiri
Yeah, another round of refactoring is due to move the rename via copy+delete logic up to the file-based sink level. On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath wrote: > Good point. There's always the chance of step that performs final rename > being retried. So we'll

Re: Filesystems.copy and .rename behavior

2018-01-31 Thread Chamikara Jayalath
Good point. There's always the chance of step that performs final rename being retried. So we'll have to ignore this error at the sink level. We don't necessarily have to do this at the FileSystem level though. I think the proper behavior might be to raise an error for the rename at the FileSystem

Re: Filesystems.copy and .rename behavior

2018-01-30 Thread Reuven Lax
I think the idea was to ignore "already exists" errors. The reason being that any step in Beam can be executed multiple times, including the rename step. If the rename step gets run twice, the second run should succeed vacuously. On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri

Filesystems.copy and .rename behavior

2018-01-30 Thread Udi Meiri
Hi, I've been working on HDFS code for the Python SDK and I've noticed some behaviors which are surprising. I wanted to know if these behaviors are known and intended. 1. When renaming files during finalize_write, rename errors are ignored