I would go for #1 since it's a better user experience, especially for new users who don't understand every step involved in staging and deploying. It's just another (unnecessary) mental concept they shouldn't have to be aware of. Anything that moves us closer to only requiring the `--runner` flag, without any additional flags (by default, but configurable if necessary), is a good thing in my opinion.
AutoML already auto-creates a GCS bucket (not configurable, with a global name, which has its own downsides). Other products are already doing this to simplify the user experience. I think as long as there's an explicit logging statement, it should be fine:

- If the bucket was not specified and was created: "No --temp_location specified, created gs://..."
- If the bucket was not specified and was found: "No --temp_location specified, found gs://..."
- If the bucket was specified, the logging can be omitted, since it's already explicit from the command-line arguments.

On Tue, Jul 23, 2019 at 10:25 AM Chamikara Jayalath <[email protected]> wrote:

> Do we clean up auto-created GCS buckets?
>
> If there's no good way to clean up, I think it might be better to make
> this opt-in.
>
> Thanks,
> Cham
>
> On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <[email protected]> wrote:
>
>> I think having a single, default, auto-created temporary bucket per
>> project for use in GCP (when running on Dataflow, or running elsewhere
>> but using GCS, such as for this BQ load-files example), though not
>> ideal, is the best user experience. If we don't want to be
>> automatically creating such things for users by default, another
>> option would be a single flag that opts in to such auto-creation
>> (which could include other resources in the future).
>>
>> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <[email protected]> wrote:
>> >
>> > Hello all,
>> > I recently worked on a transform to load data into BigQuery by
>> > writing files to GCS and issuing load-file jobs to BQ. I did this
>> > for the Python SDK [1].
>> >
>> > This option requires the user to provide a GCS bucket to write the
>> > files:
>> >
>> > - If the user provides a bucket to the transform, the SDK will use
>> >   that bucket.
>> > - If the user does not provide a bucket:
>> >   - When running on Dataflow, the SDK will borrow the temp_location
>> >     of the pipeline.
>> >   - When running on other runners, the pipeline will fail.
>> >
>> > The Java SDK has had functionality for file loads into BQ for a long
>> > time; in particular, when users do not provide a bucket, it attempts
>> > to create a default bucket [2], and this bucket is used as
>> > temp_location (which is then used by the BQ file loads transform).
>> >
>> > I do not really like creating GCS buckets on behalf of users. In
>> > Java, the outcome is that users do not have to pass a --tempLocation
>> > parameter when submitting jobs to Dataflow - which is a nice
>> > convenience, but I'm not sure it is in line with users'
>> > expectations.
>> >
>> > Currently, the options are:
>> >
>> > 1. Adding support for bucket auto-creation to the Python SDK.
>> > 2. Deprecating support for bucket auto-creation in the Java SDK, and
>> >    printing a warning.
>> >
>> > I am personally inclined toward #1. But what do others think?
>> >
>> > Best
>> > -P.
>> >
>> > [1] https://github.com/apache/beam/pull/7892
>> > [2] https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
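For concreteness, a minimal sketch of what the get-or-create behavior with the logging proposed above could look like on the Python side, using the google-cloud-storage client library. This is not from the PR; the `resolve_temp_location` helper and the bucket-naming scheme are hypothetical.

```python
# Sketch only: mirrors the proposed logging for the three cases above.
# The helper name and bucket-naming scheme are hypothetical, not Beam's.
import logging

from google.cloud import storage
from google.cloud.exceptions import NotFound


def resolve_temp_location(temp_location, project, region):
    """Returns a GCS path to use as temp_location, creating a default
    per-project bucket if none was specified."""
    if temp_location:
        # The user was explicit; no extra logging needed.
        return temp_location

    client = storage.Client(project=project)
    bucket_name = 'beam-temp-%s-%s' % (project, region)  # hypothetical scheme
    try:
        client.get_bucket(bucket_name)
        logging.info(
            'No --temp_location specified, found gs://%s', bucket_name)
    except NotFound:
        client.create_bucket(bucket_name, project=project, location=region)
        logging.info(
            'No --temp_location specified, created gs://%s', bucket_name)
    return 'gs://%s/temp' % bucket_name
```

Keying the default bucket on project and region keeps it deterministic, so repeated runs find the same bucket instead of creating new ones (which also sidesteps most of the cleanup concern raised above).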
