OK, thank you for your help, Cham. We will look into the pipeline failure.

Thanks,
Chengxuan
On Thu, Apr 25, 2019 at 10:34 AM Chamikara Jayalath <[email protected]> wrote:
>
> On Wed, Apr 24, 2019 at 8:55 PM Chengxuan Wang <[email protected]> wrote:
>> We are using DataflowRunner. Sorry about pointing to the wrong place.
>> Still, it would be a better experience if Beam supported cleanup even
>> when the pipeline fails.
>
> Unfortunately, there's no general cleanup mechanism for Beam, so we do
> this through a regular Beam step, which might not execute if the Beam
> pipeline fails.
>
>> Another thing:
>> Could you let customers specify the temporary dataset themselves
>> instead of having Beam generate one every time? Right now the problem
>> is that there are a lot of temporary datasets in the BigQuery console,
>> which is messy. My team wants to group all temporary tables under the
>> same dataset, but it seems Beam doesn't provide any API to do this
>> right now. (Correct me if I am wrong.)
>
> Yeah, currently there's no API for this. Note that we also generate a
> table inside the dataset, so even if the user specifies a dataset we
> would need to do some cleanup (which might not execute if the job
> fails). BTW, have you determined why your pipelines fail regularly?
> Addressing that should also address the cleanup issue.
>
> Thanks,
> Cham
>
>> Thanks,
>> Chengxuan
>>
>> On Wed, Apr 24, 2019 at 5:42 PM Chamikara Jayalath <[email protected]> wrote:
>>>
>>> On Wed, Apr 24, 2019 at 5:38 PM Chengxuan Wang <[email protected]> wrote:
>>>>
>>>> On Tue, Apr 23, 2019 at 11:58 AM Chengxuan Wang <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am using the Apache Beam Python SDK (apache-beam==2.11.0) to run
>>>>> a Dataflow job with BigQuerySource. I checked the code, and
>>>>> BigQueryReader should delete the temporary dataset after the query
>>>>> is done:
>>>>> https://github.com/apache/beam/blob/1ad61fd384bcd1edd11086a3cf9d7dddb154d934/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L900
>>>>>
>>>>> But I still see some temporary datasets in my GCP console. Could
>>>>> you help me look into it?
>>>>
>>>> Did some of your pipelines fail while reading from BigQuery? If so,
>>>> it's possible that the pipeline failed before running the cleanup
>>>> step.
>>>>
>>>> The cleanup is in __exit__ and we run it with a `with` statement,
>>>> so this should clean up the dataset even if the pipeline fails,
>>>> right?
>>>
>>> Which runner are you using? Please note that what you are looking at
>>> is the implementation for the DirectRunner. The implementation for
>>> DataflowRunner is in the Dataflow service (given that BQ is a native
>>> source).
>>>
>>>> On Wed, Apr 24, 2019 at 5:24 PM Chamikara Jayalath <[email protected]> wrote:
>>>>>
>>>>> On Tue, Apr 23, 2019 at 11:58 AM Chengxuan Wang <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am using the Apache Beam Python SDK (apache-beam==2.11.0) to
>>>>>> run a Dataflow job with BigQuerySource. I checked the code, and
>>>>>> BigQueryReader should delete the temporary dataset after the
>>>>>> query is done:
>>>>>> https://github.com/apache/beam/blob/1ad61fd384bcd1edd11086a3cf9d7dddb154d934/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L900
>>>>>>
>>>>>> But I still see some temporary datasets in my GCP console. Could
>>>>>> you help me look into it?
>>>>>
>>>>> Did some of your pipelines fail while reading from BigQuery? If
>>>>> so, it's possible that the pipeline failed before running the
>>>>> cleanup step.
>>>>>
>>>>>> Another thing: is it possible to set an expiration for the
>>>>>> temporary dataset? Right now I see it is never.
>>>>>
>>>>> The issue is, this would end up being an upper bound on the total
>>>>> execution time of the job. It's possible to set this to a very
>>>>> large value (multiple days or weeks) but I'm not sure if this will
>>>>> help.
>>>>>
>>>>>> Thanks,
>>>>>> Chengxuan
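Since, as discussed above, Beam has no general cleanup mechanism and the cleanup step may not run when a pipeline fails, leftover temporary datasets can be garbage-collected out of band with the google-cloud-bigquery client. This is a minimal sketch, not part of Beam: the `temp_dataset` name prefix is an assumption based on the Python SDK's BigQueryWrapper (the Dataflow service may use a different naming scheme), and the project id is hypothetical, so verify the actual dataset names in your console before deleting anything.

```python
# Sketch: garbage-collect leftover Beam temporary BigQuery datasets.
# ASSUMPTION: temp datasets are named with the `temp_dataset` prefix, as in
# the Python SDK's BigQueryWrapper; confirm in your console first, since the
# Dataflow service may name them differently.
from google.cloud import bigquery

PROJECT = "my-project"      # hypothetical project id
PREFIX = "temp_dataset"     # assumed prefix; confirm before deleting

client = bigquery.Client(project=PROJECT)
for item in client.list_datasets():
    if item.dataset_id.startswith(PREFIX):
        # delete_contents=True also removes the temp table Beam creates
        # inside the dataset.
        client.delete_dataset(
            item.reference, delete_contents=True, not_found_ok=True)
        print("Deleted", item.dataset_id)
```

Running such a script on a schedule (for example, from cron) keeps the console tidy without relying on the pipeline finishing successfully.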

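On the expiration question: Beam itself does not expose a way to set an expiration on its temporary datasets, but as a safety net one could apply a default table expiration to a leftover dataset directly with the BigQuery client. This is a sketch with a hypothetical project and dataset id; per Cham's caveat above, the expiration effectively caps total job execution time, so it must be set generously large (e.g. a week).

```python
# Sketch: set a default table expiration on a leftover temporary dataset so
# orphaned tables eventually disappear on their own. The dataset id below is
# hypothetical. Keep the expiration well above any possible job runtime, or
# it can break pipelines that are still running.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id
dataset = client.get_dataset("my-project.temp_datasetabc123")  # hypothetical

dataset.default_table_expiration_ms = 7 * 24 * 60 * 60 * 1000  # one week
client.update_dataset(dataset, ["default_table_expiration_ms"])
```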