Yeah, another round of refactoring is due to move the rename via
copy+delete logic up to the file-based sink level.
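
For illustration only, here is a minimal sketch of what "rename via copy+delete" could look like on a local filesystem. The helper name `rename_via_copy_delete` is hypothetical, and the sketch uses only the Python standard library rather than Beam's filesystem classes:

```python
import os
import shutil


def rename_via_copy_delete(src, dst):
    """Hypothetical helper: emulate a rename as a copy followed by a delete."""
    # shutil.copyfile silently overwrites dst if it already exists.
    shutil.copyfile(src, dst)
    # Remove the source only after the copy has completed.
    os.remove(src)
```

Note that the two steps are not atomic: if this were retried after the delete, the copy would fail because the source is gone, which is the retry case discussed below.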

On Wed, Jan 31, 2018, 10:42 Chamikara Jayalath <chamik...@google.com> wrote:

> Good point. There's always a chance that the step performing the final rename
> will be retried. So we'll have to ignore this error at the sink level. We
> don't necessarily have to do this at the FileSystem level though. I think
> the proper behavior might be to raise an error for the rename at the
> FileSystem level if the destination already exists (or source doesn't
> exist) while ignoring that error (and possibly logging a warning) at the
> sink level.
>
> - Cham
>
>
> On Tue, Jan 30, 2018 at 6:47 PM Reuven Lax <re...@google.com> wrote:
>
>> I think the idea was to ignore "already exists" errors. The reason being
>> that any step in Beam can be executed multiple times, including the rename
>> step. If the rename step gets run twice, the second run should succeed
>> vacuously.
>>
>>
>> On Tue, Jan 30, 2018 at 6:19 PM, Udi Meiri <eh...@google.com> wrote:
>>
>>> Hi,
>>> I've been working on HDFS code for the Python SDK and I've noticed some
>>> behaviors which are surprising. I wanted to know if these behaviors are
>>> known and intended.
>>>
>>> 1. When renaming files during finalize_write, rename errors are ignored
>>> <https://github.com/apache/beam/blob/3aa2bef87c93d2844dd7c8dbaf45db75ec607792/sdks/python/apache_beam/io/filebasedsink.py#L232>.
>>> For example, if I run wordcount twice using HDFS code I get a warning the
>>> second time because the file already exists:
>>>
>>> WARNING:root:Rename not successful:
>>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>>> -> hdfs://counts2-00000-of-00001, libhdfs error in renaming
>>> hdfs://beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2
>>> to hdfs://counts2-00000-of-00001 with exceptions Unable to rename
>>> '/beam-temp-counts2-7cb0a78005f211e8b6a08851fb5da245/1059f870-d64f-4f63-b1de-e4bd20fcd70a.counts2'
>>> to '/counts2-00000-of-00001'.
>>>
>>> For GCS and local files there are no rename errors (in this case), since
>>> the rename operation silently overwrites existing destination files.
>>> However, blindly ignoring these errors might cause the pipeline to report
>>> success even though output files are missing.
>>>
>>> 2. Output files (--output) overwrite existing files.
>>>
>>> 3. The Python SDK doesn't use Filesystems.copy(). The Java SDK doesn't
>>> use Filesystem.Rename().
>>>
>>> Thanks,
>>> - Udi
>>>
>>
>>
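
The split proposed in this thread (a strict rename at the FileSystem level, a tolerant wrapper at the sink level) can be sketched roughly as follows. This is a local-filesystem illustration using only the standard library. `RenameError`, `strict_rename`, and `finalize_rename` are hypothetical names, not Beam APIs. The sketch also narrows "ignore the error" to the case where the source is gone and the destination is present, which is an assumption about what a completed retry looks like, not something the thread settled:

```python
import logging
import os


class RenameError(Exception):
    """Hypothetical error type for this sketch; not a Beam class."""


def strict_rename(src, dst):
    """FileSystem-level rename: raise instead of overwriting or guessing."""
    if not os.path.exists(src):
        raise RenameError('source does not exist: %s' % src)
    if os.path.exists(dst):
        raise RenameError('destination already exists: %s' % dst)
    os.rename(src, dst)


def finalize_rename(src, dst):
    """Sink-level wrapper: tolerate a retried rename, re-raise anything else.

    Returns True if the rename was performed, False if it was skipped.
    """
    try:
        strict_rename(src, dst)
    except RenameError as e:
        # Assumption: a retry of an already-completed rename leaves the
        # destination present and the source gone.
        if os.path.exists(dst) and not os.path.exists(src):
            logging.warning('Rename already done, ignoring: %s', e)
            return False
        raise
    return True
```

With this shape, running the finalize step twice succeeds vacuously the second time, while a missing output file (neither source nor destination present) still surfaces as an error instead of a warning.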
