Thanks Cham and Robert! I think having a .withPipelineTempDirectory builder option in the file-based sink Builders would be helpful for our use case. I'm happy to put up a PR if there's consensus on this.
Thanks, Claire On Fri, Sep 10, 2021 at 1:20 PM Chamikara Jayalath <[email protected]> wrote: > > > On Fri, Sep 10, 2021 at 9:24 AM Robert Bradshaw <[email protected]> > wrote: > >> Being able to (cheaply) rename files is precisely why the temporary >> files are colocated with the final output destination. There are other >> benefits as well like automatically inheriting the right permissions, >> not having to worry about having sufficient disk/quota etc. in both >> the temporary and final place, etc. I don't think it makes sense to >> change this default. (I don't think the analogy with BigqueryIO holds, >> as there we have no "local" place to collate with. >> >> We could introduce a new withPipelineTempDirectory() method that would >> invoke withTempDirectory with the PipelineOptions.getTempLocation to >> make this easier. >> > > As others pointed out, getTempLocation can be a completely different > bucket or maybe even a different region for some users. So changing the > default could result in performance regressions for some users. > > BigQueryIO sink (in export mode) is a bit different since we don't > perform a renaming of files to a final location. We just write temp files > and load them to BigQuery using load jobs. > > Also note that we create a uniquely named subdirectory within the temp > directory for each run of a pipeline. So there should not be conflicts > between different runs when using the same output directory: > https://github.com/apache/beam/blob/95c3811312ad41f98dd5ef0674ef30ce30448746/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java#L525 > > Temp output directory can be overridden today using the > "withTempDirectory" option. I'm also OK with introducing a pipeline option > to make this easier if needed. > > Thanks, > Cham > > >> >> On Fri, Sep 10, 2021 at 8:12 AM Claire McGinty >> <[email protected]> wrote: >> > >> > Niel, that's a good point -- I don't think there's any restriction on >> the filesystem of PipelineOptions#tempLocation, I was able to run a job on >> DirectRunner with PipelineOptions#tempLocation set to a local path & my >> TextIO.Write#outputDirectory set to a remote filesystem. >> > >> > But currently there's also no check that would catch >> AvroIO.write(...).to(<filesystem-1-path>).withTempDirectory(<filesystem-2-path>) >> at graph construction time, either (it will fail at runtime instead). I >> think the FileSystems api has some logic for performing this check that we >> could extract into a utility method and use in PTransform#expand? >> > >> > - Claire >> > >> > On Fri, Sep 10, 2021 at 4:11 AM Niel Markwick <[email protected]> wrote: >> >> >> >> I have heard of an intermittent fault in avroIO where two independent >> pipelines using the same output directory deleted each others temp files >> while they were being written. >> >> >> >> I cannot reproduce the problem, nor have I found a code path that >> could cause it in fileio (with a superficial look), but it has been >> reported more than once... >> >> >> >> >> >> >> >> Back to the original question, is it possible that the tempDirectory >> from PipelineOptions points to a different filesystem than used for the >> fileio output? Because this could break transforms that write to a tempfile >> in the destination filesystem then rename the tempfile to the final output >> filename when writing is complete. >> >> >> >> On Fri, 10 Sep 2021, 07:31 Reuven Lax, <[email protected]> wrote: >> >>> >> >>> While this makes sense, I can also see a risk of breaking existing >> users by changing the default like that. In addition it might be a >> less-performant default. A file rename that takes place within the same >> location might be performed via a fast rename operation, where otherwise it >> would be forced to generate an expensive copy + delete. This could cause >> strange and hard-to-debug performance problems when people upgrade to the >> newer Beam version. >> >>> >> >>> On Thu, Sep 9, 2021 at 8:15 PM Ahmet Altay <[email protected]> wrote: >> >>>> >> >>>> Adding relevant folks +Chamikara Jayalath +Pablo Estrada >> >>>> >> >>>> This proposal makes sense to me. It makes it easier for users to >> reason about why a temp directory is chosen, and would lead to a unified >> code across all IOs that does this. >> >>>> >> >>>> On Thu, Sep 9, 2021 at 11:37 AM Claire McGinty < >> [email protected]> wrote: >> >>>>> >> >>>>> Hi Beam devs, >> >>>>> >> >>>>> I have a question/proposal about the default tempDirectory setting >> for file-based IOs. AvroIO, FileIO, TextIO all provide Builders with an >> optional tempDirectory setter, and when the transforms are expanded, >> tempDirectory will default to the value of the final output directory if >> null [AvroIO/FileIO/TextIO]. >> >>>>> >> >>>>> I think it would make sense to default to the value of >> PipelineOptions#getTempLocation instead, which is accessible inside the >> expand(PCollection<T> input) method; it seems reasonable for the user to >> expect that their PipelineOptions#getTempLocation will be honored, and >> additionally, their final output locations may have locks/retention >> policies set that make the temp file renaming step fail. Plus, this pattern >> looks like it's already being used in BigQueryIO. >> >>>>> >> >>>>> What do you think? >> >>>>> >> >>>>> Thanks! >> >>>>> Claire >> >>>>> >> >>>>> >> >
