Yeah, you are right. I was testing using 'gsutil' which behaves differently.
Thanks,
Cham
On Thu, Oct 27, 2016 at 2:06 PM Eugene Kirpichov
wrote:
> Indeed IOChannelFactory uses GcsUtil for GCS, and GcsUtil in fact does not
> recurse into subdirectories inside a
I don't think your assessment of the behavior of glob patterns is correct, per
https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames#directory-by-directory-vs-recursive-wildcards
.
I believe (and hope) that the behavior of IOChannelFactory.match() matches the
behavior of gsutil.
On Thu, Oct
BTW I'm in favor of using a sub-directory and possibly asking users to
update their glob pattern while also allowing users to optionally specify a
temporary path in the future, as you propose.
Thanks,
Cham
On Thu, Oct 27, 2016 at 1:45 PM Chamikara Jayalath
wrote:
> On
On Thu, Oct 27, 2016 at 1:27 PM Eugene Kirpichov
wrote:
> Getting back to this. I noticed that the original user's job mentioned in
>
> http://stackoverflow.com/questions/39822859/temp-files-remain-in-gcs-after-a-dataflow-job-succeeded
> is
> configured to write to
@Eugene, we can make breaking changes. But if we really don't want to, we
can add it under a new name easily. That particular inheritance hierarchy
is not precious IMO.
On Thu, Oct 20, 2016, 14:03 Eugene Kirpichov
wrote:
> @Cham - this addresses temporary files
@Cham - this addresses temporary files that were written by successful
bundles, but not those written by failed bundles (nor the case where the
entire pipeline fails), so it's not sufficient.
@Dan - yes, there are situations when it's impossible to create a sibling.
In that case, we'd need a fallback -
Another option would be to just use /path/to/temp-foo-$uid to avoid
matching /path/to/foo-* (hoping of course the temp- or whatever prefix
doesn't match anything).
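
To illustrate the prefix idea, here is a minimal sketch. It uses Python's fnmatch as a stand-in for the filesystem's glob matching (an assumption; the real matching is done by the IO layer and may differ in detail), and the concrete file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Glob a downstream pipeline would use to read the final output.
OUTPUT_GLOB = "/path/to/foo-*"


def matched_by_output_glob(path, pattern=OUTPUT_GLOB):
    """Return True if `path` would be picked up by the output glob."""
    return fnmatchcase(path, pattern)


# Hypothetical names, for illustration only.
candidates = [
    "/path/to/foo-00000-of-00003",  # a final output shard
    "/path/to/foo-temp-1234",       # temp name that shares the output prefix
    "/path/to/temp-foo-1234",       # temp name with a leading "temp-" prefix
]

for path in candidates:
    print(path, matched_by_output_glob(path))
```

Only the name with the leading "temp-" falls outside the glob, which is the point of the proposal. It still relies on users not having globs that start with "temp-".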
I see #2 causing all sorts of issues, and #3 would be a significant
reduction in usability. I would lean towards doing
The issue manifests when a completely different pipeline reads the output of
the previous pipeline as its input and the glob expression also matches these
temporary files.
This happens because FileBasedSource is responsible for creating the
temporary paths which occurs while
This thread is conflating many issues.
* Putting temp files where they will not match the glob for the desired
output files
* Dealing with eventually-consistent filesystems (S3, GCS, ...)
* Properly cleaning up all temp files
They all need to get solved, but for now I think we only need to solve
Can this be prevented by moving temporary files (copy + delete
individually) at finalization instead of copying all of them and performing
a bulk delete? You can support task failures by ignoring renames when the
destination exists. Python SDK currently does this (and puts temp files in
a
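
A rough sketch of the per-file move at finalization described above, assuming a local filesystem and os.rename as stand-ins for the real filesystem operations. The function name and the temp-to-final mapping are hypothetical and are not the Python SDK's actual code.

```python
import os
import tempfile


def finalize(temp_to_final):
    """Move each temp file to its final name, one file at a time.

    A rename is skipped when the destination already exists, so a retried
    finalization after a task failure does not fail or overwrite output.
    """
    for temp_path, final_path in temp_to_final.items():
        if os.path.exists(final_path):
            # An earlier attempt already produced this output; ignore it.
            continue
        os.rename(temp_path, final_path)


def demo():
    """Run finalize twice to mimic a retry; return (final_exists, temp_exists)."""
    with tempfile.TemporaryDirectory() as root:
        temp_path = os.path.join(root, "temp-foo-0")
        final_path = os.path.join(root, "foo-0")
        with open(temp_path, "w") as f:
            f.write("data")
        mapping = {temp_path: final_path}
        finalize(mapping)
        finalize(mapping)  # retry: destination exists, so the rename is skipped
        return os.path.exists(final_path), os.path.exists(temp_path)
```

Note that this does nothing for temp files left behind by bundles that failed before finalization; those would still need separate cleanup.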
Hello,
This is a continuation of the discussion on PR
https://github.com/apache/incubator-beam/pull/1050 which turned out more
complex than expected.
Short summary:
Currently FileBasedSink, when writing to /path/to/foo (in practice,
/path/to/foo-x-of-y where y is the total number of