I think the idea was to ignore "already exists" errors. The reason is
that any step in Beam can be executed multiple times, including the rename
step. If the rename step runs twice, the second run should succeed
vacuously.
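
To make that concrete, here is a minimal sketch of a rename that tolerates being re-run. It uses the local filesystem as a stand-in for HDFS. The rename_once helper and the file names are hypothetical, not what filebasedsink.py actually does. It treats "source gone, destination present" as a retry that already succeeded, and still raises when both paths exist or when neither does.

```python
import os
import tempfile


def rename_once(src, dst):
    """Hypothetical helper: rename src to dst, tolerating a repeated call."""
    if os.path.exists(src):
        if os.path.exists(dst):
            # Both exist: this is not a retry of the same rename, so report it.
            raise FileExistsError(dst)
        os.rename(src, dst)
        return "renamed"
    if os.path.exists(dst):
        # Source is gone and destination is present: an earlier attempt
        # already performed the move, so succeed vacuously.
        return "already done"
    # Neither path exists: the output really is missing.
    raise FileNotFoundError(src)


def demo():
    """Run the rename twice, as a retried step would."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "temp-shard")
        dst = os.path.join(d, "counts-00000-of-00001")
        with open(src, "w") as f:
            f.write("data")
        first = rename_once(src, dst)
        second = rename_once(src, dst)
        return first, second
```

With a check like this, a retried step passes quietly, while the "ran the pipeline twice" case (temp file and final file both present) is still reported instead of swallowed.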

On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri <eh...@google.com> wrote:

> Hi,
> I've been working on HDFS code for the Python SDK and I've noticed some
> behaviors which are surprising. I wanted to know if these behaviors are
> known and intended.
>
> 1. When renaming files during finalize_write, rename errors are ignored
> <https://github.com/apache/beam/blob/3aa2bef87c93d2844dd7c8dbaf45db75ec607792/sdks/python/apache_beam/io/filebasedsink.py#L232>.
> For example, if I run wordcount twice using HDFS code I get a warning the
> second time because the file already exists:
>
> WARNING:root:Rename not successful: hdfs://beam-temp-counts2-
> 7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
> -> hdfs://counts2-00000-of-00001, libhdfs error in renaming
> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da2
> 45/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2 to
> hdfs://counts2-00000-of-00001 with exceptions Unable to rename
> '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da2
> 45/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2' to
> '/counts2-00000-of-00001'.
>
> For GCS and local files there are no rename errors (in this case), since
> the rename operation silently overwrites existing destination files.
> However, blindly ignoring these errors might cause the pipeline to report
> success even though output files are missing.
>
> 2. Output files (--output) overwrite existing files.
>
> 3. The Python SDK doesn't use Filesystems.copy(). The Java SDK doesn't use
> Filesystem.Rename().
>
> Thanks,
> - Udi
>
