Good point. There's always a chance that the step performing the final
rename will be retried, so we'll have to ignore this error at the sink level. We
don't necessarily have to do this at the FileSystem level though. I think
the proper behavior might be to raise an error for the rename at the
FileSystem level if the destination already exists (or source doesn't
exist) while ignoring that error (and possibly logging a warning) at the
sink level.
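
A minimal sketch of that split, using local files rather than HDFS. All
names here (strict_rename, sink_finalize_rename, and the two exception
classes) are hypothetical and only for illustration; they are not Beam APIs.

```python
import logging
import os


class SourceMissingError(IOError):
    """Hypothetical error: the rename source does not exist."""


class DestinationExistsError(IOError):
    """Hypothetical error: the rename destination already exists."""


def strict_rename(src, dst):
    """FileSystem-level rename: raise instead of overwriting or guessing.

    Note: the exists checks and the rename are not atomic, so this is
    only a sketch of the intended behavior, not a race-free version.
    """
    if not os.path.exists(src):
        raise SourceMissingError(src)
    if os.path.exists(dst):
        raise DestinationExistsError(dst)
    os.rename(src, dst)


def sink_finalize_rename(src, dst):
    """Sink-level rename: tolerate a retried finalize step.

    Returns True if the rename was performed, and False if the error
    was ignored (with a warning) because the step may have run before.
    """
    try:
        strict_rename(src, dst)
        return True
    except (SourceMissingError, DestinationExistsError) as e:
        logging.warning('Rename not successful: %s -> %s (%r)', src, dst, e)
        return False
```

With this split, callers that use the FileSystem directly see the failure,
while a second run of the sink's rename step only logs a warning.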

- Cham

On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax <[email protected]> wrote:

> I think the idea was to ignore "already exists" errors. The reason being
> that any step in Beam can be executed multiple times, including the rename
> step. If the rename step gets run twice, the second run should succeed
> vacuously.
>
>
> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri <[email protected]> wrote:
>
>> Hi,
>> I've been working on HDFS code for the Python SDK and I've noticed some
>> behaviors which are surprising. I wanted to know if these behaviors are
>> known and intended.
>>
>> 1. When renaming files during finalize_write, rename errors are ignored
>> <https://github.com/apache/beam/blob/3aa2bef87c93d2844dd7c8dbaf45db75ec607792/sdks/python/apache_beam/io/filebasedsink.py#L232>.
>> For example, if I run wordcount twice using HDFS code I get a warning the
>> second time because the file already exists:
>>
>> WARNING:root:Rename not successful:
>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>> -> hdfs://counts2-00000-of-00001, libhdfs error in renaming
>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>> to hdfs://counts2-00000-of-00001 with exceptions Unable to rename
>> '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
>> to '/counts2-00000-of-00001'.
>>
>> For GCS and local files there are no rename errors (in this case), since
>> the rename operation silently overwrites existing destination files.
>> However, blindly ignoring these errors might cause the pipeline to report
>> success even though output files are missing.
>>
>> 2. Output files (--output) overwrite existing files.
>>
>> 3. The Python SDK doesn't use Filesystems.copy(). The Java SDK doesn't
>> use Filesystem.Rename().
>>
>> Thanks,
>> - Udi
>>
>
>
