Re: Placement of temporary files by FileBasedSink

2016-10-27 Thread Chamikara Jayalath
Yeah, you are right. I was testing using 'gsutil' which behaves differently. Thanks, Cham On Thu, Oct 27, 2016 at 2:06 PM Eugene Kirpichov wrote: > Indeed IOChannelFactory uses GcsUtil for GCS, and GcsUtil in fact does not > recurse into subdirectories inside a

Re: Placement of temporary files by FileBasedSink

2016-10-27 Thread Eugene Kirpichov
I don't think your assessment of the behavior of glob patterns is correct, per https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames#directory-by-directory-vs-recursive-wildcards . I believe (and hope) that the behavior of IOChannelFactory.match() matches the behavior of gsutil. On Thu, Oct
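
To make the "directory-by-directory" wildcard behavior referenced above concrete, here is a minimal sketch of a matcher in which `*` never crosses a `/`. It is not the implementation of IOChannelFactory.match() or gsutil; the helper names and the example paths are hypothetical and chosen only for illustration.

```python
import re

def glob_to_regex(pattern):
    """Translate a glob so that `*` stays within one directory level."""
    parts = []
    for ch in pattern:
        if ch == "*":
            # Match any run of characters except the path separator.
            parts.append("[^/]*")
        else:
            parts.append(re.escape(ch))
    # \Z anchors the end so the whole path must match, not just a prefix.
    return re.compile("".join(parts) + r"\Z")

def matches(path, pattern):
    """Return True if `path` matches `pattern` without recursing into subdirectories."""
    return glob_to_regex(pattern).match(path) is not None
```

Under these semantics a file placed in a sub-directory of the output directory would not be picked up by a glob over the output directory itself, which is the property a temporary sub-directory would rely on.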

Re: Placement of temporary files by FileBasedSink

2016-10-27 Thread Chamikara Jayalath
BTW I'm in favor of using a sub-directory and possibly asking users to update their glob pattern while also allowing users to optionally specify a temporary path in the future, as you propose. Thanks, Cham On Thu, Oct 27, 2016 at 1:45 PM Chamikara Jayalath wrote: > On

Re: Placement of temporary files by FileBasedSink

2016-10-27 Thread Chamikara Jayalath
On Thu, Oct 27, 2016 at 1:27 PM Eugene Kirpichov wrote: > Getting back to this. I noticed that the original user's job mentioned in > > http://stackoverflow.com/questions/39822859/temp-files-remain-in-gcs-after-a-dataflow-job-succeeded > is > configured to write to

Re: Placement of temporary files by FileBasedSink

2016-10-20 Thread Kenneth Knowles
@Eugene, we can make breaking changes. But if we really don't want to, we can add it under a new name easily. That particular inheritance hierarchy is not precious IMO. On Thu, Oct 20, 2016, 14:03 Eugene Kirpichov wrote: > @Cham - this addresses temporary files

Re: Placement of temporary files by FileBasedSink

2016-10-20 Thread Eugene Kirpichov
@Cham - this addresses temporary files that were written by successful bundles, but not by failed bundles (and not the case when the entire pipeline fails), so it's not sufficient. @Dan - yes, there are situations when it's impossible to create a sibling. In that case, we'd need a fallback -

Re: Placement of temporary files by FileBasedSink

2016-10-20 Thread Robert Bradshaw
Another option would be to just use /path/to/temp-foo-$uid to avoid matching /path/to/foo-* (hoping of course the temp- or whatever prefix doesn't match anything). I see #2 causing all sorts of issues, and #3 would be a significant reduction in usability. I would lean towards doing
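
A small sketch of why a distinct prefix helps, assuming a downstream reader uses the glob /path/to/foo-*. The concrete file names below are hypothetical, and Python's `fnmatch` stands in for whatever matcher the reader actually uses.

```python
from fnmatch import fnmatchcase

# Hypothetical names, for illustration only.
output_glob = "/path/to/foo-*"
final_shard = "/path/to/foo-00000-of-00003"
sibling_temp = "/path/to/foo-temp-1234"    # temp file that shares the output prefix
prefixed_temp = "/path/to/temp-foo-1234"   # temp file with a distinct "temp-" prefix

def matches_output(path, pattern=output_glob):
    """Return True if a reader's glob over the outputs would pick up this path."""
    return fnmatchcase(path, pattern)
```

Here `matches_output(sibling_temp)` is true while `matches_output(prefixed_temp)` is false, so only the prefixed name stays out of the reader's way. As noted in the message, this holds only as long as the chosen prefix does not itself match something the user globs for.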

Re: Placement of temporary files by FileBasedSink

2016-10-20 Thread Lukasz Cwik
The issue manifests when a completely different pipeline uses the output of the last pipeline as input to the new pipeline and then these temporary files are matched in the glob expression. This happens because FileBasedSource is responsible for creating the temporary paths which occurs while

Re: Placement of temporary files by FileBasedSink

2016-10-20 Thread Dan Halperin
This thread is conflating many issues.
* Putting temp files where they will not match the glob for the desired output files
* Dealing with eventually-consistent filesystems (s3, GCS, ...)
* Properly cleaning up all temp files
They all need to get solved, but for now I think we only need to solve

Re: Placement of temporary files by FileBasedSink

2016-10-20 Thread Chamikara Jayalath
Can this be prevented by moving temporary files (copy + delete individually) at finalization instead of copying all of them and performing a bulk delete? You can support task failures by ignoring renames when the destination exists. The Python SDK currently does this (and puts temp files in a
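
The per-file move described above could be sketched roughly as follows. This is not the Python SDK's code; `finalize_temp_files` and its `(temp_path, final_path)` pairs are hypothetical, and the local filesystem stands in for the real storage backend.

```python
import os
import shutil

def finalize_temp_files(moves):
    """Move each temp file to its destination one at a time.

    `moves` is a list of (temp_path, final_path) pairs. A rename is ignored
    when the destination already exists, so a retried finalization is safe.
    """
    for temp_path, final_path in moves:
        if os.path.exists(final_path):
            # An earlier attempt already produced this output: skip the
            # rename and only clean up the leftover temp file, if any.
            if os.path.exists(temp_path):
                os.remove(temp_path)
            continue
        if not os.path.exists(temp_path):
            # Neither source nor destination exists; nothing to move.
            continue
        # Copy first, then delete this one temp file immediately.
        shutil.copyfile(temp_path, final_path)
        os.remove(temp_path)
```

Because each temp file is deleted right after its own copy, a retry sees only the files that still need moving. The sketch does not handle a copy that was interrupted halfway, nor temp files left behind by failed bundles.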

Placement of temporary files by FileBasedSink

2016-10-19 Thread Eugene Kirpichov
Hello, This is a continuation of the discussion on PR https://github.com/apache/incubator-beam/pull/1050 which turned out to be more complex than expected. Short summary: Currently FileBasedSink, when writing to /path/to/foo (in practice, /path/to/foo-x-of-y where y is the total number of