Adding relevant folks +Chamikara Jayalath <[email protected]> +Pablo Estrada <[email protected]>
This proposal makes sense to me. It makes it easier for users to reason about why a temp directory is chosen, and it would lead to unified code across all IOs that do this.

On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty <[email protected]> wrote:

> Hi Beam devs,
>
> I have a question/proposal about the default tempDirectory setting for
> file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an
> optional tempDirectory setter, and when the transforms are expanded,
> tempDirectory will default to the value of the final output directory if
> null [AvroIO
> <https://github.com/apache/beam/blob/1d4a9ccd11c14ac6f0a2de1cc438a881244ede0a/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L1634-L1637>
> /FileIO
> <https://github.com/apache/beam/blob/f759a5c7fe34c3d9e39cc21bb78cdc5da0a13eb1/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1284-L1290>
> /TextIO
> <https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L992-L995>
> ].
>
> I think it would make sense to default to the value of
> PipelineOptions#getTempLocation instead, which is accessible inside the
> expand(PCollection<T> input) method; it seems reasonable for the user to
> expect that their PipelineOptions#getTempLocation will be honored, and
> additionally, their final output locations may have locks/retention
> policies set that make the temp file renaming step fail. Plus, this
> pattern looks like it's already being used in BigQueryIO
> <https://github.com/apache/beam/blob/e76b4db30a90d8f351e807cb247a707e7a3c566c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L944>
> .
>
> What do you think?
>
> Thanks!
> Claire
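
To make the proposed default concrete, here is a minimal plain-Java sketch of the resolution order as I read it. It is not Beam code: the class `TempDirectoryDefaults`, the method `resolveTempDirectory`, and the bucket paths are all hypothetical, and the `pipelineTempLocation` argument only stands in for the value of PipelineOptions#getTempLocation. Keeping the output directory as a last-resort fallback when no temp location is set is my assumption, not something the proposal states.

```java
/** Hypothetical sketch of the proposed default; none of these names exist in Beam. */
public class TempDirectoryDefaults {

    /**
     * Picks a temp directory in this order:
     *   1. the tempDirectory set explicitly on the transform, if any;
     *   2. the pipeline-wide temp location (stand-in for
     *      PipelineOptions#getTempLocation), if set;
     *   3. the final output directory (today's default; keeping it as a
     *      fallback is an assumption of this sketch).
     */
    public static String resolveTempDirectory(
            String explicitTempDirectory,
            String pipelineTempLocation,
            String outputDirectory) {
        if (explicitTempDirectory != null) {
            return explicitTempDirectory;
        }
        if (pipelineTempLocation != null && !pipelineTempLocation.isEmpty()) {
            return pipelineTempLocation;
        }
        return outputDirectory;
    }

    public static void main(String[] args) {
        // No explicit setter call: the pipeline temp location is used.
        System.out.println(
            resolveTempDirectory(null, "gs://example-bucket/tmp", "gs://example-bucket/output"));
        // An explicit tempDirectory still takes priority.
        System.out.println(
            resolveTempDirectory("gs://example-bucket/custom-tmp", "gs://example-bucket/tmp",
                "gs://example-bucket/output"));
        // Neither is set: fall back to the output directory.
        System.out.println(resolveTempDirectory(null, null, "gs://example-bucket/output"));
    }
}
```

If something like this lived in one shared helper, AvroIO, FileIO, and TextIO could all call it from expand(), which is the unification mentioned above.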
