[jira] [Commented] (BEAM-6831) python sdk WriteToBigQuery excessive usage of metered API

2020-03-03 Thread Keiji Yoshida (Jira)


[ https://issues.apache.org/jira/browse/BEAM-6831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050880#comment-17050880 ]

Keiji Yoshida commented on BEAM-6831:
-------------------------------------

That is because the bigquery.tables.get API is called every time a bundle of a 
PCollection is processed in Apache Beam 2.10.0 
([code|https://github.com/apache/beam/blob/v2.10.0/sdks/python/apache_beam/io/gcp/bigquery.py#L1365-L1367]).

In the latest version of Apache Beam (2.19.0), the bigquery.tables.get API is 
not called as long as `create_disposition` is set to `CREATE_NEVER` 
([code|https://github.com/apache/beam/blob/v2.19.0/sdks/python/apache_beam/io/gcp/bigquery.py#L989-L1009]).
So, you can avoid the rate limit error by upgrading to Apache Beam 2.19.0 and 
setting `create_disposition` to `CREATE_NEVER`.
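
A minimal sketch of that workaround (assuming Apache Beam 2.19.0 with the GCP 
extras installed; the table name below is a placeholder, and the table must 
already exist):

{code:python}
# Sketch: write to a pre-existing table with CREATE_NEVER so that
# bigquery.tables.get is not called for every bundle.
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "MakeRows" >> beam.Create([{"id": 1, "name": "example"}])
     | "WriteToBQ" >> beam.io.WriteToBigQuery(
         table="my-project:my_dataset.my_table",  # placeholder table
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS))
{code}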

> python sdk WriteToBigQuery excessive usage of metered API
> ----------------------------------------------------------
>
> Key: BEAM-6831
> URL: https://issues.apache.org/jira/browse/BEAM-6831
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.10.0
>Reporter: Pesach Weinstock
>Assignee: Pablo Estrada
>Priority: Major
>  Labels: bigquery, dataflow, gcp, python
> Attachments: apache-beam-py-sdk-gcp-bq-api-issue.png
>
>
> Right now, there is a potential issue with the python sdk where 
> {{beam.io.gcp.bigquery.WriteToBigQuery}} calls the following api more often 
> than needed:
> [https://www.googleapis.com/bigquery/v2/projects/<project-name>/datasets/<dataset-name>/tables/<table-name>?alt=json|https://www.googleapis.com/bigquery/v2/projects/%3Cproject-name%3E/datasets/%3Cdataset-name%3E/tables/%3Ctable-name%3E?alt=json]
> The above request counts against specific BigQuery API quotas that are 
> separate from the BigQuery streaming-insert quotas. When used in a streaming 
> pipeline, we hit this quota quickly and cannot write any further data to 
> BigQuery.
> The dispositions being used are:
>  * create_disposition: {{beam.io.BigQueryDisposition.CREATE_NEVER}}
>  * write_disposition: {{beam.io.BigQueryDisposition.WRITE_APPEND}}
> This currently blocks us from using BigQueryIO in a streaming pipeline to 
> write to BigQuery, and it required us to formally request an API quota 
> increase from Google to temporarily correct the situation.
> Our pipeline uses the DataflowRunner. The error is shown below and in the 
> attached screenshot of the Stackdriver trace.
> {code:java}
>   "errors": [
> {
>   "message": "Exceeded rate limits: too many api requests per user per 
> method for this user_method. For more information, see 
> https://cloud.google.com/bigquery/troubleshooting-errors",
>   "domain": "usageLimits",
>   "reason": "rateLimitExceeded"
> }
>   ],
> {code}
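
For context on the quoted report: the metered request it describes is the 
bigquery.tables.get REST method. A hypothetical way to issue that same call 
manually (using the google-api-python-client, which is an assumption here, 
not the SDK's internal client; all names are placeholders):

{code:python}
# Hypothetical sketch: the bigquery.tables.get request that the SDK was
# issuing once per bundle. Requires google-api-python-client and
# application default credentials.
from googleapiclient.discovery import build

service = build("bigquery", "v2")

# GET https://www.googleapis.com/bigquery/v2/projects/<project>/datasets/<dataset>/tables/<table>
table = service.tables().get(
    projectId="my-project",  # placeholder
    datasetId="my_dataset",  # placeholder
    tableId="my_table",      # placeholder
).execute()
print(table["schema"])
{code}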





[jira] [Commented] (BEAM-2879) Implement and use an Avro coder rather than the JSON one for intermediary files to be loaded in BigQuery

2020-01-31 Thread Keiji Yoshida (Jira)


[ https://issues.apache.org/jira/browse/BEAM-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027389#comment-17027389 ]

Keiji Yoshida commented on BEAM-2879:
-------------------------------------

Thanks for your cooperation!

> Implement and use an Avro coder rather than the JSON one for intermediary 
> files to be loaded in BigQuery
> --------------------------------------------------------------------------
>
> Key: BEAM-2879
> URL: https://issues.apache.org/jira/browse/BEAM-2879
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Black Phoenix
>Assignee: Steve Niemitz
>Priority: Minor
>  Labels: starter
> Fix For: 2.17.0
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Before being loaded into BigQuery, temporary files are created and encoded 
> in JSON, which is a costly solution compared to an Avro alternative.
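
To illustrate the cost gap the quoted description refers to, here is a small, 
hypothetical comparison (not Beam code; it assumes the third-party fastavro 
package) serializing the same rows as newline-delimited JSON and as an Avro 
file:

{code:python}
# Hypothetical illustration of the size gap: newline-delimited JSON repeats
# every field name in every row, while an Avro file stores the schema once
# in its header and writes compact binary values.
import io
import json

from fastavro import parse_schema, writer  # assumed dependency

schema = parse_schema({
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})
rows = [{"id": i, "name": "example"} for i in range(10000)]

json_bytes = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

avro_buffer = io.BytesIO()
writer(avro_buffer, schema, rows)

print("JSON size:", len(json_bytes))
print("Avro size:", len(avro_buffer.getvalue()))
{code}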





[jira] [Commented] (BEAM-2879) Implement and use an Avro coder rather than the JSON one for intermediary files to be loaded in BigQuery

2020-01-30 Thread Keiji Yoshida (Jira)


[ https://issues.apache.org/jira/browse/BEAM-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026672#comment-17026672 ]

Keiji Yoshida commented on BEAM-2879:
-------------------------------------

Hi, I noticed that this improvement had been merged and released in 2.18.0.

So, I think it's better to close this ticket and include this improvement in 
[the release notes of 
2.18.0|https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12346383&projectId=12319527].

> Implement and use an Avro coder rather than the JSON one for intermediary 
> files to be loaded in BigQuery
> --------------------------------------------------------------------------
>
> Key: BEAM-2879
> URL: https://issues.apache.org/jira/browse/BEAM-2879
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Black Phoenix
>Assignee: Steve Niemitz
>Priority: Minor
>  Labels: starter
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Before being loaded into BigQuery, temporary files are created and encoded 
> in JSON, which is a costly solution compared to an Avro alternative.


