OK, thank you for your help, Cham. We will look into the pipeline failure.

Thanks,
Chengxuan
On Thu, Apr 25, 2019 at 10:34 AM Chamikara Jayalath <[email protected]> wrote:
>
> On Wed, Apr 24, 2019 at 8:55 PM Chengxuan Wang <[email protected]> wrote:
>> We are using DataflowRunner. Sorry about pointing to the wrong place.
>> Still, it would be a better experience if Beam supported cleanup even
>> when the pipeline fails.
>
> Unfortunately, there's no general cleanup mechanism for Beam, so we do
> this through a regular Beam step, which might not execute if the Beam
> pipeline fails.
>
>> Another thing:
>> Could you let customers specify the temporary dataset themselves
>> instead of having Beam generate one every time? Right now the problem
>> is that there are a lot of temporary datasets in the BigQuery console,
>> which is messy. My team wants to group all temporary tables under the
>> same dataset, but it seems Beam doesn't provide any API to do this
>> right now. (Correct me if I am wrong.)
>
> Yeah, currently there's no API for this. Note that we also generate a
> table inside the dataset, so even if the user specifies a dataset we
> would need to do some cleanup (which might not execute if the job
> fails). BTW, have you determined why your pipelines fail regularly?
> Addressing that should also address the cleanup issue.
>
> Thanks,
> Cham
>
>> Thanks,
>> Chengxuan
>>
>> On Wed, Apr 24, 2019 at 5:42 PM Chamikara Jayalath <[email protected]> wrote:
>>>
>>> On Wed, Apr 24, 2019 at 5:38 PM Chengxuan Wang <[email protected]> wrote:
>>>>
>>>> On Tue, Apr 23, 2019 at 11:58 AM Chengxuan Wang <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am using the Apache Beam Python SDK (apache-beam==2.11.0) to run
>>>>> a Dataflow job with BigQuerySource. I checked the code, and
>>>>> BigQueryReader should delete the temporary dataset after the query
>>>>> is done:
>>>>> https://github.com/apache/beam/blob/1ad61fd384bcd1edd11086a3cf9d7dddb154d934/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L900
>>>>>
>>>>> But I still see some temporary datasets in my GCP console. Could
>>>>> you help me look into it?
>>>>
>>>> Did some of your pipelines fail while reading from BigQuery? If so,
>>>> it's possible that the pipeline failed before running the cleanup
>>>> step.
>>>>
>>>> The cleanup is in __exit__ and we run it with a `with` statement,
>>>> so this should clean up the dataset even if the pipeline fails,
>>>> right?
>>>
>>> Which runner are you using? Please note that what you are looking at
>>> is the implementation for the DirectRunner. The implementation for
>>> DataflowRunner is in the Dataflow service (given that BQ is a native
>>> source).
>>>
>>>> On Wed, Apr 24, 2019 at 5:24 PM Chamikara Jayalath <[email protected]> wrote:
>>>>>
>>>>> On Tue, Apr 23, 2019 at 11:58 AM Chengxuan Wang <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am using the Apache Beam Python SDK (apache-beam==2.11.0) to
>>>>>> run a Dataflow job with BigQuerySource. I checked the code, and
>>>>>> BigQueryReader should delete the temporary dataset after the
>>>>>> query is done:
>>>>>> https://github.com/apache/beam/blob/1ad61fd384bcd1edd11086a3cf9d7dddb154d934/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L900
>>>>>>
>>>>>> But I still see some temporary datasets in my GCP console. Could
>>>>>> you help me look into it?
>>>>>
>>>>> Did some of your pipelines fail while reading from BigQuery? If
>>>>> so, it's possible that the pipeline failed before running the
>>>>> cleanup step.
>>>>>
>>>>>> Another thing: is it possible to set an expiration for the
>>>>>> temporary dataset? Right now I see it is never.
>>>>>
>>>>> The issue is, this would end up being an upper bound on the total
>>>>> execution time of the job. It's possible to set this to a very
>>>>> large value (multiple days or weeks) but I'm not sure if this will
>>>>> help.
>>>>>
>>>>>> Thanks,
>>>>>> Chengxuan
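Since, as discussed above, Beam has no general cleanup mechanism and the cleanup step may not run when a pipeline fails, leftover temporary datasets can be garbage-collected out of band with the google-cloud-bigquery client. This is a minimal sketch, not part of Beam: the `temp_dataset` name prefix is an assumption based on the Python SDK's BigQueryWrapper (the Dataflow service may use a different naming scheme), and the project id is hypothetical, so verify the actual dataset names in your console before deleting anything.

```python
# Sketch: garbage-collect leftover Beam temporary BigQuery datasets.
# ASSUMPTION: temp datasets are named with the `temp_dataset` prefix, as in
# the Python SDK's BigQueryWrapper; confirm in your console first, since the
# Dataflow service may name them differently.
from google.cloud import bigquery

PROJECT = "my-project"      # hypothetical project id
PREFIX = "temp_dataset"     # assumed prefix; confirm before deleting

client = bigquery.Client(project=PROJECT)
for item in client.list_datasets():
    if item.dataset_id.startswith(PREFIX):
        # delete_contents=True also removes the temp table Beam creates
        # inside the dataset.
        client.delete_dataset(
            item.reference, delete_contents=True, not_found_ok=True)
        print("Deleted", item.dataset_id)
```

Running such a script on a schedule (for example, from cron) keeps the console tidy without relying on the pipeline finishing successfully.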

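On the expiration question: Beam itself does not expose a way to set an expiration on its temporary datasets, but as a safety net one could apply a default table expiration to a leftover dataset directly with the BigQuery client. This is a sketch with a hypothetical project and dataset id; per Cham's caveat above, the expiration effectively caps total job execution time, so it must be set generously large (e.g. a week).

```python
# Sketch: set a default table expiration on a leftover temporary dataset so
# orphaned tables eventually disappear on their own. The dataset id below is
# hypothetical. Keep the expiration well above any possible job runtime, or
# it can break pipelines that are still running.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id
dataset = client.get_dataset("my-project.temp_datasetabc123")  # hypothetical

dataset.default_table_expiration_ms = 7 * 24 * 60 * 60 * 1000  # one week
client.update_dataset(dataset, ["default_table_expiration_ms"])
```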