Hi,
I've been working on HDFS code for the Python SDK and I've noticed some
behaviors which are surprising. I wanted to know if these behaviors are
known and intended.

1. When renaming files during finalize_write, rename errors are ignored
<https://github.com/apache/beam/blob/3aa2bef87c93d2844dd7c8dbaf45db75ec607792/sdks/python/apache_beam/io/filebasedsink.py#L232>.
For example, if I run wordcount twice using HDFS code I get a warning the
second time because the file already exists:

WARNING:root:Rename not successful:
hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
-> hdfs://counts2-00000-of-00001, libhdfs error in renaming
hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
to hdfs://counts2-00000-of-00001 with exceptions Unable to rename
'/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
to '/counts2-00000-of-00001'.

For GCS and local files there are no rename errors (in this case), since
the rename operation silently overwrites existing destination files.
However, blindly ignoring these errors might cause the pipeline to report
success even though output files are missing.

2. Output files (--output) overwrite existing files.

3. The Python SDK doesn't use FileSystems.copy(). The Java SDK doesn't use
FileSystems.rename().
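
To make point 1 concrete, here is a rough sketch of the behavior I would
have expected: attempt every rename, collect the failures, and raise at the
end instead of only logging a warning. This is not the Beam implementation.
RenameError, strict_rename and rename_all are hypothetical names, and
strict_rename only imitates the HDFS "destination exists" failure on local
files.

```python
import os
import tempfile


class RenameError(Exception):
    """Hypothetical error type carrying every failed rename."""

    def __init__(self, failures):
        super().__init__("%d rename(s) failed" % len(failures))
        self.failures = failures


def strict_rename(src, dst):
    # Imitates the HDFS behavior described above: refuse to overwrite
    # an existing destination instead of silently replacing it.
    if os.path.exists(dst):
        raise IOError("destination exists: %s" % dst)
    os.rename(src, dst)


def rename_all(pairs, rename_fn=strict_rename):
    """Attempt every (src, dst) rename, then raise if any of them failed."""
    failures = []
    for src, dst in pairs:
        try:
            rename_fn(src, dst)
        except (IOError, OSError) as e:
            failures.append((src, dst, str(e)))
    if failures:
        raise RenameError(failures)


def demo():
    # Simulate a second run: the final output file already exists.
    workdir = tempfile.mkdtemp()
    tmp = os.path.join(workdir, "temp-shard")
    final = os.path.join(workdir, "counts2-00000-of-00001")
    for path in (tmp, final):
        with open(path, "w") as f:
            f.write("x")
    try:
        rename_all([(tmp, final)])
    except RenameError as e:
        return e.failures
    return []
```

With something like this, the second wordcount run would fail loudly rather
than report success. Whether failing is the right policy (as opposed to
overwriting, as GCS and local files do) is exactly what I am asking about.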

Thanks,
- Udi
