Niel, that's a good point -- I don't think there's any restriction on the filesystem of PipelineOptions#tempLocation. I was able to run a job on DirectRunner with PipelineOptions#tempLocation set to a local path and my TextIO.Write#outputDirectory set to a remote filesystem.
But currently there's also no check that would catch AvroIO.write(...).to(<filesystem-1-path>).withTempDirectory(<filesystem-2-path>) at graph construction time, either (it will fail at runtime instead). I think the FileSystems API <https://github.com/apache/beam/blob/3a3a84483ba1d279d3d5ff53ecf6f0b925cece3d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L477> has some logic for performing this check that we could extract into a utility method and use in PTransform#expand?

- Claire

On Fri, Sep 10, 2021 at 4:11 AM Niel Markwick wrote:

> I have heard of an intermittent fault in AvroIO where two independent
> pipelines using the same output directory deleted each other's temp files
> while they were being written.
>
> I cannot reproduce the problem, nor have I found a code path that could
> cause it in fileio (with a superficial look), but it has been reported more
> than once...
>
> Back to the original question, is it possible that the tempDirectory from
> PipelineOptions points to a different filesystem than used for the fileio
> output? Because this could break transforms that write to a tempfile in the
> destination filesystem then rename the tempfile to the final output
> filename when writing is complete.
>
> On Fri, 10 Sep 2021, 07:31 Reuven Lax wrote:
>
>> While this makes sense, I can also see a risk of breaking existing users
>> by changing the default like that. In addition it might be a
>> less-performant default. A file rename that takes place within the same
>> location might be performed via a fast rename operation, where otherwise it
>> would be forced to generate an expensive copy + delete. This could cause
>> strange and hard-to-debug performance problems when people upgrade to the
>> newer Beam version.
>>
>> On Thu, Sep 9, 2021 at 8:15 PM Ahmet Altay wrote:
>>
>>> Adding relevant folks +Chamikara Jayalath +Pablo Estrada
>>>
>>> This proposal makes sense to me. It makes it easier for users to reason
>>> about why a temp directory is chosen, and would lead to unified code
>>> across all IOs that does this.
>>>
>>> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty wrote:
>>>
>>>> Hi Beam devs,
>>>>
>>>> I have a question/proposal about the default tempDirectory setting for
>>>> file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
>>>> optional tempDirectory setter, and when the transforms are expanded,
>>>> tempDirectory will default to the value of the final output directory if
>>>> null [AvroIO
>>>> <https://github.com/apache/beam/blob/1d4a9ccd11c14ac6f0a2de1cc438a881244ede0a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1634-L1637>
>>>> /FileIO
>>>> <https://github.com/apache/beam/blob/f759a5c7fe34c3d9e39cc21bb78cdc5da0a13eb1/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1284-L1290>
>>>> /TextIO
>>>> <https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L992-L995>
>>>> ].
>>>>
>>>> I think it would make sense to default to the value of
>>>> PipelineOptions#getTempLocation instead, which is accessible inside
>>>> the expand(PCollection<T> input) method; it seems reasonable for the
>>>> user to expect that their PipelineOptions#getTempLocation will be
>>>> honored, and additionally, their final output locations may
>>>> have locks/retention policies set that make the temp file renaming step
>>>> fail. Plus, this pattern looks like it's already being used in
>>>> BigQueryIO
>>>> <https://github.com/apache/beam/blob/e76b4db30a90d8f351e807cb247a707e7a3c566c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L944>
>>>> .
>>>>
>>>> What do you think?
>>>>
>>>> Thanks!
>>>> Claire
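
For illustration, the two ideas in this thread (a default order for the temp directory, and a construction-time check that the temp and output paths share a filesystem) could be sketched roughly as below. This is stand-alone plain Java, not Beam code: the class TempDirectorySketch, its method names, and the fallback to the output directory when no temp location is set are all assumptions made for the example, and the scheme comparison only approximates whatever FileSystems does internally.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of the Beam SDK. */
final class TempDirectorySketch {
  // Matches a leading URI-style scheme such as "gs:/" or "hdfs:/".
  private static final Pattern SCHEME =
      Pattern.compile("^([a-zA-Z][-a-zA-Z0-9+.]*):/.*");

  private TempDirectorySketch() {}

  /**
   * Picks the temp directory in the order proposed in the thread: an explicit
   * withTempDirectory value first, then the pipeline-level temp location.
   * Falling back to the output directory last is an assumption that mirrors
   * the current default.
   */
  static String resolveTempDirectory(
      String explicitTempDir, String pipelineTempLocation, String outputDir) {
    if (explicitTempDir != null) {
      return explicitTempDir;
    }
    if (pipelineTempLocation != null) {
      return pipelineTempLocation;
    }
    return outputDir;
  }

  /** Returns the lower-cased scheme of a path spec; scheme-less specs count as local files. */
  static String schemeOf(String spec) {
    Matcher m = SCHEME.matcher(spec);
    return m.matches() ? m.group(1).toLowerCase(Locale.ROOT) : "file";
  }

  /**
   * Fails fast when the two paths are on different filesystems, so the
   * mismatch surfaces at graph construction rather than at runtime.
   */
  static void checkSameFileSystem(String outputSpec, String tempSpec) {
    String outputScheme = schemeOf(outputSpec);
    String tempScheme = schemeOf(tempSpec);
    if (!outputScheme.equals(tempScheme)) {
      throw new IllegalArgumentException(
          String.format(
              "Temp directory %s (scheme %s) and output %s (scheme %s) are on different filesystems",
              tempSpec, tempScheme, outputSpec, outputScheme));
    }
  }
}
```

With this sketch, TempDirectorySketch.checkSameFileSystem("gs://bucket/out", "/tmp/beam") throws an IllegalArgumentException, while two gs:// paths pass. Matching schemes do not by themselves guarantee the fast rename path Reuven mentions; the check only rules out the cross-filesystem case.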
