On Wed, Apr 24, 2019 at 8:55 PM Chengxuan Wang <[email protected]> wrote:
> We are using DataflowRunner. Sorry about pointing to the wrong place.
> But still, it would be a better experience if Beam could support cleanup
> even when the pipeline fails.

Unfortunately there's no general cleanup mechanism for Beam. So we do this
through a regular Beam step, which might not execute if the Beam pipeline
fails.

> Another thing is:
> Could you let customers specify the temporary dataset themselves instead
> of having Beam generate one every time? Right now the problem is that
> there are a lot of temporary datasets in the BigQuery console, which is
> messy. My team wants to group all temporary tables under the same
> dataset. But it seems Beam doesn't provide any APIs to do this right now
> (correct me if I am wrong).

Yeah, currently there's no API for this. Note that we also generate a table
inside the dataset. So even if the user specifies a dataset, we need to do
some cleanup (which might not execute if the job fails). BTW, have you
determined why your pipelines fail regularly? Addressing that should also
address the cleanup issue.

Thanks,
Cham

> Thanks,
> Chengxuan
>
> On Wed, Apr 24, 2019 at 5:42 PM Chamikara Jayalath <[email protected]> wrote:
>
>> On Wed, Apr 24, 2019 at 5:38 PM Chengxuan Wang <[email protected]> wrote:
>>
>>> On Tue, Apr 23, 2019 at 11:58 AM Chengxuan Wang <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using the Apache Beam Python SDK (apache-beam==2.11.0) to run a
>>>> Dataflow job with BigQuerySource. I checked the code, and
>>>> BigQueryReader should delete the temporary dataset after the query is
>>>> done:
>>>> https://github.com/apache/beam/blob/1ad61fd384bcd1edd11086a3cf9d7dddb154d934/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L900
>>>>
>>>> But I still see some temporary datasets in my GCP console. Could you
>>>> help me look into it?
>>>
>>> Did some of your pipelines fail while reading from BigQuery? If so,
>>> it's possible that the pipeline failed before running the cleanup step.
>>>
>>> The cleanup is in __exit__ and we run the reader with a `with`
>>> statement, so this should clean up the dataset even if the pipeline
>>> fails, right?
>>
>> Which runner are you using? Please note that what you are looking at is
>> the implementation for the DirectRunner. The implementation for the
>> DataflowRunner is in the Dataflow service (given that BQ is a native
>> source).
>>
>>> On Wed, Apr 24, 2019 at 5:24 PM Chamikara Jayalath <[email protected]> wrote:
>>>>
>>>> On Tue, Apr 23, 2019 at 11:58 AM Chengxuan Wang <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am using the Apache Beam Python SDK (apache-beam==2.11.0) to run a
>>>>> Dataflow job with BigQuerySource. I checked the code, and
>>>>> BigQueryReader should delete the temporary dataset after the query is
>>>>> done:
>>>>> https://github.com/apache/beam/blob/1ad61fd384bcd1edd11086a3cf9d7dddb154d934/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L900
>>>>>
>>>>> But I still see some temporary datasets in my GCP console. Could you
>>>>> help me look into it?
>>>>
>>>> Did some of your pipelines fail while reading from BigQuery? If so,
>>>> it's possible that the pipeline failed before running the cleanup step.
>>>>
>>>>> Another thing: is it possible to set an expiration for the temporary
>>>>> dataset? Right now I see it is set to never.
>>>>
>>>> The issue is that this would end up being an upper bound on the total
>>>> execution time of the job. It's possible to set this to a very large
>>>> value (multiple days or weeks), but I'm not sure if this would help.
>>>>>
>>>>> Thanks,
>>>>> Chengxuan
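Since there is no Beam API for this cleanup, leaked temporary datasets can
be removed out of band with the google-cloud-bigquery client. The sketch
below is a hypothetical standalone script, not part of Beam; the
`temp_dataset_` prefix is an assumption, so check what the leaked datasets
in your project are actually named before running anything that deletes:

# Hypothetical cleanup script (not a Beam API): deletes leaked temporary
# datasets left behind by failed pipelines. The prefix is an assumption;
# verify the real names in your project first, since this permanently
# deletes the datasets and everything in them.
from google.cloud import bigquery

TEMP_DATASET_PREFIX = 'temp_dataset_'  # assumed prefix; confirm in console

def cleanup_temp_datasets(project):
    client = bigquery.Client(project=project)
    for item in client.list_datasets():
        if item.dataset_id.startswith(TEMP_DATASET_PREFIX):
            # delete_contents=True also removes the temporary table inside.
            client.delete_dataset(
                item.reference, delete_contents=True, not_found_ok=True)
            print('Deleted %s' % item.dataset_id)

if __name__ == '__main__':
    cleanup_temp_datasets('my-gcp-project')  # hypothetical project id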
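Along the same lines, while Beam doesn't expose an expiration setting, a
default table expiration can be set manually on a user-managed dataset with
the BigQuery client. A minimal sketch follows (the project and dataset ids
are hypothetical); per Cham's caveat, the expiration becomes an upper bound
on total job runtime, so it has to exceed the longest possible run:

# Sketch using the google-cloud-bigquery client (not a Beam API): give a
# user-managed dataset a default table expiration so leftover tables are
# garbage-collected. Tables could expire mid-job if the value is too small.
from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical project and dataset ids.
dataset = client.get_dataset('my-gcp-project.my_temp_dataset')

# Expire tables 7 days after creation (value is in milliseconds).
dataset.default_table_expiration_ms = 7 * 24 * 60 * 60 * 1000
client.update_dataset(dataset, ['default_table_expiration_ms'])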

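To illustrate why the __exit__-based cleanup discussed above isn't a
guarantee: a `with` statement does run __exit__ when the block raises, but
only in the process that executes the block. When the read is executed by
the Dataflow service (BQ being a native source), or a worker is killed
outright, the Python __exit__ never runs. A toy sketch, not Beam's actual
implementation:

# Toy illustration, not Beam's implementation: `with` guarantees __exit__
# runs when the block raises, but only in the process executing the block.
# A reader running inside the Dataflow service, or a worker killed
# outright, never reaches __exit__ -- hence the leaked datasets.
class TempDatasetReader(object):
    def __enter__(self):
        print('creating temporary dataset')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs even when the block raises an exception.
        print('deleting temporary dataset')
        return False  # don't swallow the exception

try:
    with TempDatasetReader():
        raise RuntimeError('pipeline step failed')
except RuntimeError:
    pass  # __exit__ already ran, so the dataset was cleaned up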