Niel, that's a good point -- I don't think there's any restriction on the
filesystem of PipelineOptions#tempLocation; I was able to run a job on
DirectRunner with PipelineOptions#tempLocation set to a local path and my
TextIO.Write#outputDirectory set to a remote filesystem.

But there's currently no check that would catch
AvroIO.write(...).to(<filesystem-1-path>).withTempDirectory(<filesystem-2-path>)
at graph construction time either (it will fail at runtime instead). I
think the FileSystems API
<https://github.com/apache/beam/blob/3a3a84483ba1d279d3d5ff53ecf6f0b925cece3d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L477>
has some logic for performing this check that we could extract into a
utility method and use in PTransform#expand?
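
To make the idea concrete, here is a rough sketch of what such a
construction-time check might look like. It is only an illustration under
some assumptions: TempDirSchemeCheck, schemeOf and checkSameScheme are
hypothetical names (not existing Beam APIs), it compares plain path strings
rather than ResourceIds, and it treats a path without a "scheme://" prefix
as local. The real logic in FileSystems may differ.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Beam: compares the schemes of two path specs. */
final class TempDirSchemeCheck {
  private TempDirSchemeCheck() {}

  /**
   * Returns the lower-cased scheme of a path spec such as "gs://bucket/out".
   * Assumption: a spec without a "scheme://" prefix is a local path ("file").
   */
  static String schemeOf(String spec) {
    int idx = spec.indexOf("://");
    if (idx <= 0) {
      return "file";
    }
    return spec.substring(0, idx).toLowerCase(Locale.ROOT);
  }

  /** Throws if the output path and the temp directory are on different filesystems. */
  static void checkSameScheme(String outputSpec, String tempDirSpec) {
    String outputScheme = schemeOf(outputSpec);
    String tempScheme = schemeOf(tempDirSpec);
    if (!outputScheme.equals(tempScheme)) {
      throw new IllegalArgumentException(
          String.format(
              "Temp directory %s (scheme %s) and output %s (scheme %s) are on different filesystems",
              tempDirSpec, tempScheme, outputSpec, outputScheme));
    }
  }
}
```

Calling something like checkSameScheme(outputPath, tempDirPath) from
expand() would turn the runtime failure into an IllegalArgumentException
at graph construction time.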

- Claire

On Fri, Sep 10, 2021 at 4:11 AM Niel Markwick <[email protected]> wrote:

> I have heard of an intermittent fault in AvroIO where two independent
> pipelines using the same output directory deleted each other's temp files
> while they were being written.
>
> I cannot reproduce the problem, nor have I found a code path that could
> cause it in FileIO (from a superficial look), but it has been reported more
> than once...
>
>
>
> Back to the original question: is it possible that the tempDirectory from
> PipelineOptions points to a different filesystem than the one used for the
> FileIO output? This could break transforms that write to a temp file in the
> destination filesystem and then rename the temp file to the final output
> filename when writing is complete.
>
> On Fri, 10 Sep 2021, 07:31 Reuven Lax, <[email protected]> wrote:
>
>> While this makes sense, I can also see a risk of breaking existing users
>> by changing the default like that. In addition it might be a
>> less-performant default. A file rename that takes place within the same
>> location might be performed via a fast rename operation, where otherwise it
>> would be forced to generate an expensive copy + delete. This could cause
>> strange and hard-to-debug performance problems when people upgrade to the
>> newer Beam version.
>>
>> On Thu, Sep 9, 2021 at 8:15 PM Ahmet Altay <[email protected]> wrote:
>>
>>> Adding relevant folks +Chamikara Jayalath <[email protected]> +Pablo
>>> Estrada <[email protected]>
>>>
>>> This proposal makes sense to me. It makes it easier for users to reason
>>> about why a temp directory is chosen, and would lead to unified code
>>> across all IOs that do this.
>>>
>>> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <
>>> [email protected]> wrote:
>>>
>>>> Hi Beam devs,
>>>>
>>>> I have a question/proposal about the default tempDirectory setting for
>>>> file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
>>>> optional tempDirectory setter, and when the transforms are expanded,
>>>> tempDirectory will default to the value of the final output directory if
>>>> null [AvroIO
>>>> <https://github.com/apache/beam/blob/1d4a9ccd11c14ac6f0a2de1cc438a881244ede0a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1634-L1637>
>>>> /FileIO
>>>> <https://github.com/apache/beam/blob/f759a5c7fe34c3d9e39cc21bb78cdc5da0a13eb1/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1284-L1290>
>>>> /TextIO
>>>> <https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L992-L995>
>>>> ].
>>>>
>>>> I think it would make sense to default to the value of
>>>> PipelineOptions#getTempLocation instead, which is accessible inside
>>>> the expand(PCollection<T> input) method; it seems reasonable for the
>>>> user to expect that their PipelineOptions#getTempLocation will be
>>>> honored, and additionally, their final output locations may
>>>> have locks/retention policies set that make the temp file renaming step
>>>> fail. Plus, this pattern looks like it's already being used in
>>>> BigQueryIO
>>>> <https://github.com/apache/beam/blob/e76b4db30a90d8f351e807cb247a707e7a3c566c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L944>
>>>> .
>>>>
>>>> What do you think?
>>>>
>>>> Thanks!
>>>> Claire
>>>>
>>>>
>>>>
