[ https://issues.apache.org/jira/browse/BEAM-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Graham Polley updated BEAM-6910:
--------------------------------
    Description:
When using the BigQuery source with a query in a pipeline, the "processing location" is not taken into consideration and the pipeline fails.

For example, consider the following, which uses `BigQuerySource` to read from BigQuery using some SQL. The BigQuery dataset and tables are located in "australia-southeast1". The query is submitted successfully ([Beam works out the processing location by examining the first table referenced in the query and sets it accordingly|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L221]), but when Beam attempts to poll for the job status after it has been submitted, it fails because it doesn't set the `location` to be "australia-southeast1", which is required by BigQuery:

{code:java}
p | 'read' >> beam.io.Read(beam.io.BigQuerySource(use_standard_sql=True, query='SELECT * from `a_project_id.dataset_in_australia.table_in_australia`'))
{code}

{code:java}
HttpNotFoundError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/a_project_id/queries/5ad9cc803baa432290b6cd0203f556d9?alt=json&maxResults=10000>: response: <{'status': '404', 'content-length': '328', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Tue, 26 Mar 2019 03:11:32 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443"; ma=2592000; v="46,44,43,39"', 'content-type': 'application/json; charset=UTF-8'}>, content <{
  "error": {
    "code": 404,
    "message": "Not found: Job a_project_id:5ad9cc803baa432290b6cd0203f556d9",
    "errors": [
      {
        "message": "Not found: Job a_project_id:5ad9cc803baa432290b6cd0203f556d9",
        "domain": "global",
        "reason": "notFound"
      }
    ],
    "status": "NOT_FOUND"
  }
}
{code}

The problem can be seen/found here:
[https://github.com/apache/beam/blob/v2.11.0/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L571]
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L357]

The location of the job (in this case "australia-southeast1") needs to be set/inferred (or exposed via the API), otherwise it fails.

For reference, Airflow had the same bug/problem:
[https://github.com/apache/airflow/pull/4695]

  was:
When using the BigQuery source with a query in a pipeline, the "processing location" is not taken into consideration and the pipeline fails.

For example, consider the following, which uses `BigQuerySource` to read from BigQuery using some SQL. The BigQuery dataset and tables are located in "australia-southeast1". The query is submitted successfully ([Beam works out the processing location by examining the first table referenced in the query and sets it accordingly|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L221]), but when Beam attempts to poll for the job status after it has been submitted, it fails because it doesn't set the `location` to be "australia-southeast1", which is required by BigQuery:

{code:java}
p | 'read' >> beam.io.Read(beam.io.BigQuerySource(use_standard_sql=True, query='SELECT * from `a_project_id.dataset_in_australia.table_in_australia`'))
{code}

{code:java}
HttpNotFoundError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/a_project_id/queries/5ad9cc803baa432290b6cd0203f556d9?alt=json&maxResults=10000>: response: <{'status': '404', 'content-length': '328', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Tue, 26 Mar 2019 03:11:32 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443"; ma=2592000; v="46,44,43,39"', 'content-type': 'application/json; charset=UTF-8'}>, content <{
  "error": {
    "code": 404,
    "message": "Not found: Job a_project_id:5ad9cc803baa432290b6cd0203f556d9",
    "errors": [
      {
        "message": "Not found: Job a_project_id:5ad9cc803baa432290b6cd0203f556d9",
        "domain": "global",
        "reason": "notFound"
      }
    ],
    "status": "NOT_FOUND"
  }
}
{code}

The problem can be seen here:
[https://github.com/apache/beam/blob/v2.11.0/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L571]
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L357]

The location of the job (in this case "australia-southeast1") needs to be set/inferred (or exposed via the API), otherwise it fails.

For reference, Airflow had the same bug/problem:
https://github.com/apache/airflow/pull/4695


> Beam does not consider BigQuery's processing location when getting query
> results
> --------------------------------------------------------------------------------
>
>                 Key: BEAM-6910
>                 URL: https://issues.apache.org/jira/browse/BEAM-6910
>             Project: Beam
>          Issue Type: Bug
>          Components: dependencies, runner-dataflow, sdk-py-core
>    Affects Versions: 2.11.0
>         Environment: Python
>            Reporter: Graham Polley
>            Priority: Major
>
> When using the BigQuery source with a query in a pipeline, the "processing
> location" is not taken into consideration and the pipeline fails.
> For example, consider the following, which uses `BigQuerySource` to read from
> BigQuery using some SQL. The BigQuery dataset and tables are located in
> "australia-southeast1". The query is submitted successfully ([Beam works out
> the processing location by examining the first table referenced in the query
> and sets it
> accordingly|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L221]),
> but when Beam attempts to poll for the job status after it has been
> submitted, it fails because it doesn't set the `location` to be
> "australia-southeast1", which is required by BigQuery:
>
> {code:java}
> p | 'read' >> beam.io.Read(beam.io.BigQuerySource(use_standard_sql=True,
> query='SELECT * from
> `a_project_id.dataset_in_australia.table_in_australia`'))
> {code}
>
> {code:java}
> HttpNotFoundError: HttpError accessing
> <https://www.googleapis.com/bigquery/v2/projects/a_project_id/queries/5ad9cc803baa432290b6cd0203f556d9?alt=json&maxResults=10000>:
> response: <{'status': '404', 'content-length': '328', 'x-xss-protection':
> '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding':
> 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF',
> '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Tue, 26 Mar
> 2019 03:11:32 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443";
> ma=2592000; v="46,44,43,39"', 'content-type': 'application/json;
> charset=UTF-8'}>, content <{
>   "error": {
>     "code": 404,
>     "message": "Not found: Job a_project_id:5ad9cc803baa432290b6cd0203f556d9",
>     "errors": [
>       {
>         "message": "Not found: Job
> a_project_id:5ad9cc803baa432290b6cd0203f556d9",
>         "domain": "global",
>         "reason": "notFound"
>       }
>     ],
>     "status": "NOT_FOUND"
>   }
> }
> {code}
>
> The problem can be seen/found here:
> [https://github.com/apache/beam/blob/v2.11.0/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L571]
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L357]
> The location of the job (in this case "australia-southeast1") needs to be
> set/inferred (or exposed via the API), otherwise it fails.
> For reference, Airflow had the same bug/problem:
> [https://github.com/apache/airflow/pull/4695]

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
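
To illustrate the failure described in this issue, here is a minimal sketch of how the polling request differs once the job's location is passed along. It relies on the BigQuery REST API's `jobs.getQueryResults` method accepting an optional `location` query parameter. The helper `query_results_url` is hypothetical and written only for this illustration; it is not Beam code, and an actual fix in `bigquery_tools.py` may look quite different.

```python
from urllib.parse import urlencode

BIGQUERY_V2 = "https://www.googleapis.com/bigquery/v2"


def query_results_url(project_id, job_id, location=None, max_results=10000):
    """Build a jobs.getQueryResults URL (hypothetical helper, not part of Beam)."""
    params = {"alt": "json", "maxResults": max_results}
    if location:
        # Assumption drawn from the issue: jobs that ran in a region such as
        # "australia-southeast1" are only found when the location is supplied.
        params["location"] = location
    return f"{BIGQUERY_V2}/projects/{project_id}/queries/{job_id}?{urlencode(params)}"


JOB_ID = "5ad9cc803baa432290b6cd0203f556d9"

# The request shape seen in the error above: no location, so BigQuery answers 404.
without_location = query_results_url("a_project_id", JOB_ID)

# The same request with the job's processing location attached.
with_location = query_results_url(
    "a_project_id", JOB_ID, location="australia-southeast1"
)

print(without_location)
print(with_location)
```

The first URL matches the one in the `HttpNotFoundError` above; the second differs only by the trailing `location` parameter, which is the value the issue asks Beam to set or infer when polling.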