Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
potiuk closed pull request #45610: BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly URL: https://github.com/apache/airflow/pull/45610 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
potiuk commented on PR #45610: URL: https://github.com/apache/airflow/pull/45610#issuecomment-2637130049 Closing -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
MaksYermak commented on PR #45610: URL: https://github.com/apache/airflow/pull/45610#issuecomment-2630731893 > > > I'll be happy to approve and merge, after handling the unit test and the CI issue (probably merging from `main` shold help). @EugeneYushin - Could you please try @MaksYermak 's suggestion? > > > > > > Hi! I am afraid we can't merge this code. We are working rn on actually deprecating both BigQueryCreateExternalTableOperator and BigQueryCreateEmptyTableOperator since there is a lot of misunderstanding because of those deprecations of parameters. So there will be a new BigQueryCreateTableOperator which will support current logic from both those operators > > @VladaZakharova What do you think would be the best way to handle this PR? Should I just close it with empty resolution? @EugeneYushin it's better to close this PR, without resolution. Soon we will create a new operator `BigQueryCreateTableOperator` and deprecate `BigQueryCreateExternalTableOperator` and `BigQueryCreateEmptyTableOperator` operators. As a solution for `autodetect`'s problem you can pass the `table_resource` parameter to the Operator as a `dict` or `Table` object. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
EugeneYushin commented on PR #45610: URL: https://github.com/apache/airflow/pull/45610#issuecomment-2629521673 > > I'll be happy to approve and merge, after handling the unit test and the CI issue (probably merging from `main` shold help). @EugeneYushin - Could you please try @MaksYermak 's suggestion? > > Hi! I am afraid we can't merge this code. We are working rn on actually deprecating both BigQueryCreateExternalTableOperator and BigQueryCreateEmptyTableOperator since there is a lot of misunderstanding because of those deprecations of parameters. So there will be a new BigQueryCreateTableOperator which will support current logic from both those operators @VladaZakharova What do you think would be the best way to handle this PR? Should I just close it with empty resolution? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
EugeneYushin commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1938559346
##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
"tableId": table_id,
},
"labels": self.labels,
-"schema": {"fields": schema_fields},
"externalDataConfiguration": external_data_configuration,
"location": self.location,
"encryptionConfiguration": self.encryption_configuration,
}
+if self.schema_fields:
Review Comment:
Hi @MaksYermak
I believe it will work with non-deprecated API because it works in
pass-through mode then.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
VladaZakharova commented on PR #45610: URL: https://github.com/apache/airflow/pull/45610#issuecomment-2629456348 > I'll be happy to approve and merge, after handling the unit test and the CI issue (probably merging from `main` shold help). @EugeneYushin - Could you please try @MaksYermak 's suggestion? Hi! I am afraid we can't merge this code. We are working rn on actually deprecating both BigQueryCreateExternalTableOperator and BigQueryCreateEmptyTableOperator since there is a lot of misunderstanding because of those deprecations of parameters. So there will be a new BigQueryCreateTableOperator which will support current logic from both those operators -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
MaksYermak commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1923378670
##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
"tableId": table_id,
},
"labels": self.labels,
-"schema": {"fields": schema_fields},
"externalDataConfiguration": external_data_configuration,
"location": self.location,
"encryptionConfiguration": self.encryption_configuration,
}
+if self.schema_fields:
Review Comment:
> Hey @MaksYermak Indeed, it should not reference `self.schema_fields` but
`schema_fields` instead. I'll be happy to add unit test for that, but passing
anything except `table_resource` raises `AirflowProviderDeprecationWarning`. Do
you know if there's a chance to overcome that in CI/CD for particular tests?
>
> ```python
>
@mock.patch("airflow.providers.google.cloud.operators.bigquery.BigQueryHook")
> def test_execute_with_schema_fields_not_set(self, mock_hook:
MagicMock):
> operator = BigQueryCreateExternalTableOperator(
> task_id=TASK_ID,
> bucket=TEST_GCS_BUCKET,
> source_objects=TEST_GCS_CSV_DATA,
> schema_object="gcs://dummy",
> autodetect=True,
>
destination_project_dataset_table=f"{TEST_GCP_PROJECT_ID}.{TEST_DATASET}.{TEST_TABLE_ID}",
> )
>
> mock_hook.return_value.split_tablename.return_value = (
> TEST_GCP_PROJECT_ID,
> TEST_DATASET,
> TEST_TABLE_ID,
> )
>
> with
mock.patch("airflow.providers.google.cloud.operators.bigquery.GCSHook") as
gcs_hook:
> gcs_hook.return_value.download.return_value = b'{"c1": "int"}'
> operator.execute(context=MagicMock())
>
> assert "schema" in
mock_hook.return_value.create_empty_table.call_args.kwargs["table_resource"]
> ```
>
> ```shell
> Launching pytest with arguments --disable-forbidden-warnings
test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_schema_fields_not_set
--no-header --no-summary -q in airflow/providers/tests/google/cloud/operators
>
> = test session starts
==
> collecting ... collected 1 item
>
>
test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_schema_fields_not_set
>
> = 1 passed, 1 warning in 1.53s
=
> ```
Hello @EugeneYushin,
I have checked the code for `BigQueryCreateExternalTableOperator` operator
one more time and I noticed that the `autodetect` parameter is located in the
deprecated code. And it's a reason why you couldn't add a new Unit test for
this code. In my opinion it doesn't make sense to update this code, because in
the near future it will be removed.
About your problem with unworked `autodetect`. I think it can work with
non-deprecated configuration. I have updated the code from your example. Could
you please check this code on your environment and share the result?
```python
bq_op = BigQueryCreateExternalTableOperator(
gcp_conn_id="gcp_pp",
location="us-east4",
task_id="bq_op",
table_resource={
"tableReference": {
"projectId": PROJECT_ID,
"datasetId": "user_eyushin",
"tableId": "airflow_test_orig",
},
"externalDataConfiguration": {
"sourceFormat": "CSV",
"autodetect": True,
"compression": "NONE",
"sourceUris": [f"gs://{GCS_BUCKET}/test.csv"],
"csvOptions": {
"skipLeadingRows": 1,
}
},
"location": "us-east4",
},
)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
EugeneYushin commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1918462144
##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
"tableId": table_id,
},
"labels": self.labels,
-"schema": {"fields": schema_fields},
"externalDataConfiguration": external_data_configuration,
"location": self.location,
"encryptionConfiguration": self.encryption_configuration,
}
+if self.schema_fields:
Review Comment:
@VladaZakharova I'm using the following DAG for e2e tests locally:
```python
bq_op = BigQueryCreateExternalTableOperator(
gcp_conn_id="gcp_pp",
task_id="bq_op",
bucket=GCS_BUCKET,
source_objects=["test.csv"],
destination_project_dataset_table=f"{PROJECT_ID}.user_eyushin.airflow_test_orig",
#
destination_project_dataset_table=f"{PROJECT_ID}.user_eyushin.airflow_test_new",
source_format="CSV",
autodetect=True,
skip_leading_rows=1,
location="us-east4",
)
```
Original behavior:
https://github.com/user-attachments/assets/0b3136bd-df99-4ce4-9000-783af326de24";
/>
https://github.com/user-attachments/assets/5199bdd8-1026-4cc0-9218-7e5cbb46856b";
/>
Adjusted behavior:
https://github.com/user-attachments/assets/ac3db8e0-6a6f-4d1a-8080-85e1fa7a0aab";
/>
https://github.com/user-attachments/assets/d7385ccf-ed56-4c37-b34f-4d66f99209b5";
/>
https://github.com/user-attachments/assets/864147bc-42b3-4b05-a606-fd4d5318494c";
/>
In the other words, sending empty schema to GCP focres BQ to bypass
autodetect mechanics.
You can also check https://github.com/apache/airflow/issues/45512 for some
context about the issue.
I'll be happy to iterate on that if you have other questions.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
VladaZakharova commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1918194170
##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
"tableId": table_id,
},
"labels": self.labels,
-"schema": {"fields": schema_fields},
"externalDataConfiguration": external_data_configuration,
"location": self.location,
"encryptionConfiguration": self.encryption_configuration,
}
+if self.schema_fields:
Review Comment:
Hi there! Thanks for contributing to google-provider. Can you please provide
some screenshots of the running system test for this operator to show what was
the problem with this filed and how your fix actually fixes the problem? Thanks!
##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
"tableId": table_id,
},
"labels": self.labels,
-"schema": {"fields": schema_fields},
"externalDataConfiguration": external_data_configuration,
"location": self.location,
"encryptionConfiguration": self.encryption_configuration,
}
+if self.schema_fields:
Review Comment:
Hi there! Thanks for contributing to google-provider. Can you please provide
some screenshots of the running system test for this operator to show what was
the problem with this field and how your fix actually fixes the problem? Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
EugeneYushin commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1918118012
##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
"tableId": table_id,
},
"labels": self.labels,
-"schema": {"fields": schema_fields},
"externalDataConfiguration": external_data_configuration,
"location": self.location,
"encryptionConfiguration": self.encryption_configuration,
}
+if self.schema_fields:
Review Comment:
https://github.com/apache/airflow/pull/45610/commits/c3c07ea8aa644058a885b1aae8af10716bcd93a8
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
EugeneYushin commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1918116468
##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
"tableId": table_id,
},
"labels": self.labels,
-"schema": {"fields": schema_fields},
"externalDataConfiguration": external_data_configuration,
"location": self.location,
"encryptionConfiguration": self.encryption_configuration,
}
+if self.schema_fields:
Review Comment:
Hey @MaksYermak
Indeed, it should not reference `self.schema_fields` but `schema_fields`
instead. I'll be happy to add unit test for that, but passing anything except
`table_resource` raises `AirflowProviderDeprecationWarning`. Do you know if
there's a chance to overcome that in CI/CD for particular tests?
```python
@mock.patch("airflow.providers.google.cloud.operators.bigquery.BigQueryHook")
def test_execute_with_schema_fields_not_set(self, mock_hook: MagicMock):
operator = BigQueryCreateExternalTableOperator(
task_id=TASK_ID,
bucket=TEST_GCS_BUCKET,
source_objects=TEST_GCS_CSV_DATA,
schema_object="gcs://dummy",
autodetect=True,
destination_project_dataset_table=f"{TEST_GCP_PROJECT_ID}.{TEST_DATASET}.{TEST_TABLE_ID}",
)
mock_hook.return_value.split_tablename.return_value = (
TEST_GCP_PROJECT_ID,
TEST_DATASET,
TEST_TABLE_ID,
)
with
mock.patch("airflow.providers.google.cloud.operators.bigquery.GCSHook") as
gcs_hook:
gcs_hook.return_value.download.return_value = b'{"c1": "int"}'
operator.execute(context=MagicMock())
assert "schema" in
mock_hook.return_value.create_empty_table.call_args.kwargs["table_resource"]
```
```bash
Launching pytest with arguments --disable-forbidden-warnings
test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_schema_fields_not_set
--no-header --no-summary -q in airflow/providers/tests/google/cloud/operators
= test session starts
==
collecting ... collected 1 item
test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_schema_fields_not_set
= 1 passed, 1 warning in 1.53s
=
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
MaksYermak commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1914517397
##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
"tableId": table_id,
},
"labels": self.labels,
-"schema": {"fields": schema_fields},
"externalDataConfiguration": external_data_configuration,
"location": self.location,
"encryptionConfiguration": self.encryption_configuration,
}
+if self.schema_fields:
Review Comment:
Hello @EugeneYushin ,
I think your changes broke this part of the code
https://github.com/apache/airflow/blob/main/providers/src/airflow/providers/google/cloud/operators/bigquery.py#L1679C8-L1686C14
for me it looks like that the `schema_fields`, from this peace of code, will
never assign to `table_resource` if we merge your changes.
Also I think, maybe this code is a reason why `autodetect` does not work.
For me it looks like `schema_fields` is always present in the `table_resource`.
Could you please check and prove that case which cover by this code
https://github.com/apache/airflow/blob/main/providers/src/airflow/providers/google/cloud/operators/bigquery.py#L1679C8-L1686C14
is working with your changes?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
boring-cyborg[bot] commented on PR #45610: URL: https://github.com/apache/airflow/pull/45610#issuecomment-2586813833 Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst) Here are some useful points: - Pay attention to the quality of your code (ruff, mypy and type annotations). Our [pre-commits]( https://github.com/apache/airflow/blob/main/contributing-docs/08_static_code_checks.rst#prerequisites-for-pre-commit-hooks) will help you with that. - In case of a new feature add useful documentation (in docstrings or in `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/main/docs/apache-airflow/howto/custom-operator.rst) Consider adding an example DAG that shows how users should use it. - Consider using [Breeze environment](https://github.com/apache/airflow/blob/main/dev/breeze/doc/README.rst) for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations. - Be patient and persistent. It might take some time to get a review or get the final approval from Committers. - Please follow [ASF Code of Conduct](https://www.apache.org/foundation/policies/conduct) for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack. - Be sure to read the [Airflow Coding style]( https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#coding-style-and-best-practices). - Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits. Apache Airflow is a community-driven project and together we are making it better 🚀. In case of doubts contact the developers at: Mailing List: [email protected] Slack: https://s.apache.org/airflow-slack -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
[PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]
EugeneYushin opened a new pull request, #45610: URL: https://github.com/apache/airflow/pull/45610 Closes https://github.com/apache/airflow/issues/45512 This PR fixes the case when schema `autodetect` is used in `BigQueryCreateExternalTableOperator`. Currently, the table created in BigQuery can't be queried when empty `schema.fields` are passed to GCP. Simply omitting empty `schema.fields` to be passed in request body to GCP allows schema auto-detection flow to prosper. This part of API is marked as deprecated, hence [unit tests](https://github.com/apache/airflow/commit/ffe09d62ed15bbc09aeba19ebd5d1b2a2cc8e0ed) I was able to come up with fail: ``` E airflow.exceptions.AirflowProviderDeprecationWarning: Passing table parameters via keywords arguments will be deprecated. Please provide table definition using `table_resource` parameter. ``` I was able to run that test with `--disable-forbidden-warnings ` flag: ``` pytest --disable-forbidden-warnings providers/tests/google/cloud/operators/test_bigquery.py::TestBigQueryCreateExternalTableOperator ... providers/tests/google/cloud/operators/test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_csv_format PASSED [ 25%] providers/tests/google/cloud/operators/test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_parquet_format PASSED [ 50%] providers/tests/google/cloud/operators/test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_get_openlineage_facets_on_complete PASSED [ 75%] providers/tests/google/cloud/operators/test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_schema_fields_not_set PASSED [100%] ... 4 passed, 1 warning in 1.83s ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
