On Wed, Apr 24, 2019 at 8:55 PM Chengxuan Wang <[email protected]> wrote:
> We are using DataflowRunner. Sorry about pointing to the wrong place.
> But still, it would be a better experience if Beam could support cleanup
> even when the pipeline fails.

Unfortunately there's no general cleanup mechanism for Beam. So we do this
through a regular Beam step, which might not execute if the Beam pipeline
fails.

> Another thing is:
> Could you let customers specify the temporary dataset themselves instead
> of having Beam generate one every time? Right now the problem is that
> there are a lot of temporary datasets in the BigQuery console, which is
> messy. My team wants to group all temporary tables under the same
> dataset. But it seems Beam doesn't provide any APIs to do this right now
> (correct me if I am wrong).

Yeah, currently there's no API for this. Note that we also generate a table
inside the dataset. So even if the user specifies a dataset, we need to do
some cleanup (which might not execute if the job fails). BTW, have you
determined why your pipelines fail regularly? Addressing that should also
address the cleanup issue.

Thanks,
Cham

> Thanks,
> Chengxuan
>
> On Wed, Apr 24, 2019 at 5:42 PM Chamikara Jayalath <[email protected]> wrote:
>
>> On Wed, Apr 24, 2019 at 5:38 PM Chengxuan Wang <[email protected]> wrote:
>>
>>> On Tue, Apr 23, 2019 at 11:58 AM Chengxuan Wang <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using the Apache Beam Python SDK (apache-beam==2.11.0) to run a
>>>> Dataflow job with BigQuerySource. I checked the code, and
>>>> BigQueryReader should delete the temporary dataset after the query is
>>>> done:
>>>> https://github.com/apache/beam/blob/1ad61fd384bcd1edd11086a3cf9d7dddb154d934/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L900
>>>>
>>>> But I still see some temporary datasets in my GCP console. Could you
>>>> help me look into it?
>>>
>>> Did some of your pipelines fail while reading from BigQuery? If so,
>>> it's possible that the pipeline failed before running the cleanup step.
>>>
>>> The cleanup is in __exit__ and we run the reader with a `with`
>>> statement, so this should clean up the dataset even if the pipeline
>>> fails, right?
>>
>> Which runner are you using? Please note that what you are looking at is
>> the implementation for the DirectRunner. The implementation for the
>> DataflowRunner is in the Dataflow service (given that BQ is a native
>> source).
>>
>>> On Wed, Apr 24, 2019 at 5:24 PM Chamikara Jayalath <[email protected]> wrote:
>>>>
>>>> On Tue, Apr 23, 2019 at 11:58 AM Chengxuan Wang <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am using the Apache Beam Python SDK (apache-beam==2.11.0) to run a
>>>>> Dataflow job with BigQuerySource. I checked the code, and
>>>>> BigQueryReader should delete the temporary dataset after the query is
>>>>> done:
>>>>> https://github.com/apache/beam/blob/1ad61fd384bcd1edd11086a3cf9d7dddb154d934/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L900
>>>>>
>>>>> But I still see some temporary datasets in my GCP console. Could you
>>>>> help me look into it?
>>>>
>>>> Did some of your pipelines fail while reading from BigQuery? If so,
>>>> it's possible that the pipeline failed before running the cleanup step.
>>>>
>>>>> Another thing: is it possible to set an expiration for the temporary
>>>>> dataset? Right now I see it is set to never.
>>>>
>>>> The issue is that this would end up being an upper bound on the total
>>>> execution time of the job. It's possible to set this to a very large
>>>> value (multiple days or weeks), but I'm not sure if this would help.
>>>>>
>>>>> Thanks,
>>>>> Chengxuan
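Since there is no Beam API for this cleanup, leaked temporary datasets can
be removed out of band with the google-cloud-bigquery client. The sketch
below is a hypothetical standalone script, not part of Beam; the
`temp_dataset_` prefix is an assumption, so check what the leaked datasets
in your project are actually named before running anything that deletes:

# Hypothetical cleanup script (not a Beam API): deletes leaked temporary
# datasets left behind by failed pipelines. The prefix is an assumption;
# verify the real names in your project first, since this permanently
# deletes the datasets and everything in them.
from google.cloud import bigquery

TEMP_DATASET_PREFIX = 'temp_dataset_'  # assumed prefix; confirm in console

def cleanup_temp_datasets(project):
    client = bigquery.Client(project=project)
    for item in client.list_datasets():
        if item.dataset_id.startswith(TEMP_DATASET_PREFIX):
            # delete_contents=True also removes the temporary table inside.
            client.delete_dataset(
                item.reference, delete_contents=True, not_found_ok=True)
            print('Deleted %s' % item.dataset_id)

if __name__ == '__main__':
    cleanup_temp_datasets('my-gcp-project')  # hypothetical project id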
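Along the same lines, while Beam doesn't expose an expiration setting, a
default table expiration can be set manually on a user-managed dataset with
the BigQuery client. A minimal sketch follows (the project and dataset ids
are hypothetical); per Cham's caveat, the expiration becomes an upper bound
on total job runtime, so it has to exceed the longest possible run:

# Sketch using the google-cloud-bigquery client (not a Beam API): give a
# user-managed dataset a default table expiration so leftover tables are
# garbage-collected. Tables could expire mid-job if the value is too small.
from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical project and dataset ids.
dataset = client.get_dataset('my-gcp-project.my_temp_dataset')

# Expire tables 7 days after creation (value is in milliseconds).
dataset.default_table_expiration_ms = 7 * 24 * 60 * 60 * 1000
client.update_dataset(dataset, ['default_table_expiration_ms'])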

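To illustrate why the __exit__-based cleanup discussed above isn't a
guarantee: a `with` statement does run __exit__ when the block raises, but
only in the process that executes the block. When the read is executed by
the Dataflow service (BQ being a native source), or a worker is killed
outright, the Python __exit__ never runs. A toy sketch, not Beam's actual
implementation:

# Toy illustration, not Beam's implementation: `with` guarantees __exit__
# runs when the block raises, but only in the process executing the block.
# A reader running inside the Dataflow service, or a worker killed
# outright, never reaches __exit__ -- hence the leaked datasets.
class TempDatasetReader(object):
    def __enter__(self):
        print('creating temporary dataset')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs even when the block raises an exception.
        print('deleting temporary dataset')
        return False  # don't swallow the exception

try:
    with TempDatasetReader():
        raise RuntimeError('pipeline step failed')
except RuntimeError:
    pass  # __exit__ already ran, so the dataset was cleaned up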