I would go for #1 since it's a better user experience, especially for new
users who don't understand every step involved in staging/deploying. A
temp bucket is just another (unnecessary) mental concept they shouldn't
have to be aware of. Anything that gets us closer to only providing the
`--runner` flag without any additional flags (by default, but configurable
if necessary) is a good thing in my opinion.
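
For illustration, the ideal invocation would then look something like the
following (the script name and project are placeholders, and --project may
or may not be the only other required flag):

    python my_pipeline.py --runner DataflowRunner --project my-project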

AutoML already auto-creates a GCS bucket (not configurable, with a global
name, which has its own drawbacks). Other products are already doing this
to simplify the user experience. I think as long as there's an explicit log
statement it should be fine:

If the bucket was not specified and was created: "No --temp_location
specified, created gs://..."

If the bucket was not specified and was found: "No --temp_location
specified, found gs://..."

If the bucket was specified, the logging could be omitted since it's
already explicit from the command line arguments.
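
For concreteness, here's a minimal sketch of what that get-or-create logic
could look like (the helper name, the bucket naming scheme, and the use of
the google-cloud-storage client are all hypothetical here, not the actual
SDK implementation):

    import logging

    from google.cloud import storage
    from google.cloud.exceptions import NotFound

    def get_or_create_temp_bucket(project, region, temp_location=None):
        """Resolve a temp bucket, logging only when none was specified."""
        if temp_location:
            # Explicitly given on the command line; no logging needed.
            return temp_location
        client = storage.Client(project=project)
        # Hypothetical deterministic per-project name, so repeated runs
        # find the same bucket instead of creating new ones.
        name = 'beam-temp-%s-%s' % (project, region)
        try:
            client.get_bucket(name)
            logging.info(
                'No --temp_location specified, found gs://%s', name)
        except NotFound:
            client.create_bucket(name)
            logging.info(
                'No --temp_location specified, created gs://%s', name)
        return 'gs://%s' % name

Using a deterministic name rather than a random one means repeated runs
reuse a single bucket per project, which also keeps any eventual cleanup
bounded.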

On Tue, Jul 23, 2019 at 10:25 AM Chamikara Jayalath <[email protected]>
wrote:

> Do we clean up auto-created GCS buckets?
>
> If there's no good way to clean up, I think it might be better to make
> this opt-in.
>
> Thanks,
> Cham
>
> On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <[email protected]>
> wrote:
>
>> I think having a single, default, auto-created temporary bucket per
>> project for use in GCP (when running on Dataflow, or running elsewhere
>> but using GCS such as for this BQ load files example), though not
>> ideal, is the best user experience. If we don't want to be
>> automatically creating such things for users by default, another
>> option would be a single flag that opts in to such auto-creation
>> (which could include other resources in the future).
>>
>> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <[email protected]> wrote:
>> >
>> > Hello all,
>> > I recently worked on a transform to load data into BigQuery by writing
>> files to GCS and issuing file load jobs to BQ. I did this for the Python
>> SDK[1].
>> >
>> > This option requires the user to provide a GCS bucket to write the
>> files to:
>> >
>> > - If the user provides a bucket to the transform, the SDK will use
>> that bucket.
>> > - If the user does not provide a bucket:
>> >   - When running on Dataflow, the SDK will borrow the temp_location
>> of the pipeline.
>> >   - When running on other runners, the pipeline will fail.
>> >
>> > The Java SDK has had functionality for file loads into BQ for a long
>> time; in particular, when users do not provide a bucket, it attempts to
>> create a default bucket[2], which is then used as the temp_location (and,
>> in turn, by the BQ file loads transform).
>> >
>> > I do not really like creating GCS buckets on behalf of users. In Java,
>> the outcome is that users do not have to pass a --tempLocation parameter
>> when submitting jobs to Dataflow, which is a nice convenience, but I'm
>> not sure that this is in line with users' expectations.
>> >
>> > Currently, the options are:
>> >
>> > 1. Adding support for bucket autocreation in the Python SDK.
>> > 2. Deprecating support for bucket autocreation in the Java SDK, and
>> printing a warning.
>> >
>> > I am personally inclined toward #1. But what do others think?
>> >
>> > Best
>> > -P.
>> >
>> > [1] https://github.com/apache/beam/pull/7892
>> > [2]
>> https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
>>
>
