We are using DataflowRunner. Sorry for pointing to the wrong place.
Still, it would be a better experience if Beam supported cleanup even
when the pipeline fails.
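
For reference, this is why we expected the cleanup to run even on
failure: the reader is used as a context manager, and __exit__ runs even
when the body raises. A minimal sketch (illustrative names only, not the
actual Beam internals):

    class Reader(object):
        def __enter__(self):
            print('create temp dataset')
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            # Runs even when the block below raises an exception.
            print('clean up temp dataset')

    try:
        with Reader() as reader:
            raise RuntimeError('query failed')
    except RuntimeError:
        pass  # 'clean up temp dataset' was still printed

But as you pointed out, this Python code path only applies to the
DirectRunner; with DataflowRunner the read (and any cleanup) happens
inside the Dataflow service.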

Another thing: could you let customers specify the temporary dataset
themselves instead of having Beam generate one every time? Right now the
problem is that there are a lot of temporary datasets in the BigQuery
console, which is messy. My team wants to group all temporary tables
under the same dataset, but it seems Beam doesn't expose any API to do
this right now (correct me if I am wrong). In the meantime we are
cleaning up the leaked datasets manually, as sketched below.
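
Here is roughly the cleanup we run today. This is a sketch under two
assumptions (please correct me if either is wrong): the temporary
datasets keep the temp_dataset_ prefix we see in the console, and the
project id 'my-project' is just a placeholder:

    from google.cloud import bigquery

    client = bigquery.Client(project='my-project')  # placeholder project id
    for dataset in client.list_datasets():
        # The temp_dataset_ prefix is an assumption based on what we
        # see in the BigQuery console.
        if dataset.dataset_id.startswith('temp_dataset_'):
            client.delete_dataset(dataset.reference, delete_contents=True)

A user-specified dataset (or at least a configurable prefix) would make
this kind of housekeeping unnecessary and keep the console tidy.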


Thanks,
Chengxuan

Chamikara Jayalath <[email protected]> wrote on Wed, Apr 24, 2019 at 5:42 PM:

>
>
> On Wed, Apr 24, 2019 at 5:38 PM Chengxuan Wang <[email protected]> wrote:
>
>> On Tue, Apr 23, 2019 at 11:58 AM Chengxuan Wang <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I am using the Apache Beam Python SDK (apache-beam==2.11.0) to run a
>>> Dataflow job with BigQuerySource. From reading the code, BigQueryReader
>>> should delete the temporary dataset after the query is done:
>>> https://github.com/apache/beam/blob/1ad61fd384bcd1edd11086a3cf9d7dddb154d934/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L900
>>>
>>> But I still see some temporary datasets in my GCP console. Could you help
>>> me look into it?
>>>
>>
>> Did some of your pipelines fail while reading from BigQuery? If so, it's
>> possible that the pipeline failed before running the cleanup step.
>>
>> The cleanup is in __exit__ and the reader is used in a `with` statement,
>> so this should clean up the dataset even if the pipeline fails, right?
>>
>
> Which runner are you using? Please note that what you are looking at is
> the implementation for the DirectRunner. The implementation for
> DataflowRunner is in the Dataflow service (given that BQ is a native
> source).
>
>
>>
>>
>> Chamikara Jayalath <[email protected]> wrote on Wed, Apr 24, 2019 at 5:24 PM:
>>
>>>
>>>
>>> On Tue, Apr 23, 2019 at 11:58 AM Chengxuan Wang <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using the Apache Beam Python SDK (apache-beam==2.11.0) to run a
>>>> Dataflow job with BigQuerySource. From reading the code, BigQueryReader
>>>> should delete the temporary dataset after the query is done:
>>>> https://github.com/apache/beam/blob/1ad61fd384bcd1edd11086a3cf9d7dddb154d934/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L900
>>>>
>>>> But I still see some temporary datasets in my GCP console. Could you
>>>> help me look into it?
>>>>
>>>
>>> Did some of your pipelines fail while reading from BigQuery? If so,
>>> it's possible that the pipeline failed before running the cleanup step.
>>>
>>>
>>>>
>>>> Another thing: is it possible to set an expiration for the temporary
>>>> dataset? Right now I see it is set to never.
>>>>
>>>
>>> The issue is that this would end up being an upper bound on the total
>>> execution time of the job. It's possible to set it to a very large value
>>> (multiple days or weeks), but I'm not sure that would help.
>>>
>>>
>>>>
>>>> Thanks,
>>>> Chengxuan
>>>>
>>>
