Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-02-05 Thread via GitHub


potiuk closed pull request #45610: BigQueryCreateExternalTableOperator pass 
schema_fields through if only it set explicitly
URL: https://github.com/apache/airflow/pull/45610


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-02-05 Thread via GitHub


potiuk commented on PR #45610:
URL: https://github.com/apache/airflow/pull/45610#issuecomment-2637130049

   Closing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-02-03 Thread via GitHub


MaksYermak commented on PR #45610:
URL: https://github.com/apache/airflow/pull/45610#issuecomment-2630731893

   > > > I'll be happy to approve and merge, after handling the unit test and 
the CI issue (probably merging from `main` shold help). @EugeneYushin - Could 
you please try @MaksYermak 's suggestion?
   > > 
   > > 
   > > Hi! I am afraid we can't merge this code. We are working rn on actually 
deprecating both BigQueryCreateExternalTableOperator and 
BigQueryCreateEmptyTableOperator since there is a lot of misunderstanding 
because of those deprecations of parameters. So there will be a new 
BigQueryCreateTableOperator which will support current logic from both those 
operators
   > 
   > @VladaZakharova What do you think would be the best way to handle this PR? 
Should I just close it with empty resolution?
   
   @EugeneYushin it's better to close this PR, without resolution. Soon we will 
create a new operator `BigQueryCreateTableOperator` and deprecate 
`BigQueryCreateExternalTableOperator` and `BigQueryCreateEmptyTableOperator` 
operators. As a solution for `autodetect`'s  problem you can pass the 
`table_resource` parameter to the Operator as a `dict` or `Table` object.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-02-02 Thread via GitHub


EugeneYushin commented on PR #45610:
URL: https://github.com/apache/airflow/pull/45610#issuecomment-2629521673

   > > I'll be happy to approve and merge, after handling the unit test and the 
CI issue (probably merging from `main` shold help). @EugeneYushin - Could you 
please try @MaksYermak 's suggestion?
   > 
   > Hi! I am afraid we can't merge this code. We are working rn on actually 
deprecating both BigQueryCreateExternalTableOperator and 
BigQueryCreateEmptyTableOperator since there is a lot of misunderstanding 
because of those deprecations of parameters. So there will be a new 
BigQueryCreateTableOperator which will support current logic from both those 
operators
   
   @VladaZakharova What do you think would be the best way to handle this PR? 
Should I just close it with empty resolution?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-02-02 Thread via GitHub


EugeneYushin commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1938559346


##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
 "tableId": table_id,
 },
 "labels": self.labels,
-"schema": {"fields": schema_fields},
 "externalDataConfiguration": external_data_configuration,
 "location": self.location,
 "encryptionConfiguration": self.encryption_configuration,
 }
 
+if self.schema_fields:

Review Comment:
   Hi @MaksYermak 
   I believe it will work with non-deprecated API because it works in 
pass-through mode then.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-02-02 Thread via GitHub


VladaZakharova commented on PR #45610:
URL: https://github.com/apache/airflow/pull/45610#issuecomment-2629456348

   > I'll be happy to approve and merge, after handling the unit test and the 
CI issue (probably merging from `main` shold help). @EugeneYushin - Could you 
please try @MaksYermak 's suggestion?
   
   Hi! I am afraid we can't merge this code. We are working rn on actually 
deprecating both BigQueryCreateExternalTableOperator and 
BigQueryCreateEmptyTableOperator since there is a lot of misunderstanding 
because of those deprecations of parameters. So there will be a new 
BigQueryCreateTableOperator which will support current logic from both those 
operators


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-01-21 Thread via GitHub


MaksYermak commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1923378670


##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
 "tableId": table_id,
 },
 "labels": self.labels,
-"schema": {"fields": schema_fields},
 "externalDataConfiguration": external_data_configuration,
 "location": self.location,
 "encryptionConfiguration": self.encryption_configuration,
 }
 
+if self.schema_fields:

Review Comment:
   > Hey @MaksYermak Indeed, it should not reference `self.schema_fields` but 
`schema_fields` instead. I'll be happy to add unit test for that, but passing 
anything except `table_resource` raises `AirflowProviderDeprecationWarning`. Do 
you know if there's a chance to overcome that in CI/CD for particular tests?
   > 
   > ```python
   > 
@mock.patch("airflow.providers.google.cloud.operators.bigquery.BigQueryHook")
   > def test_execute_with_schema_fields_not_set(self, mock_hook: 
MagicMock):
   > operator = BigQueryCreateExternalTableOperator(
   > task_id=TASK_ID,
   > bucket=TEST_GCS_BUCKET,
   > source_objects=TEST_GCS_CSV_DATA,
   > schema_object="gcs://dummy",
   > autodetect=True,
   > 
destination_project_dataset_table=f"{TEST_GCP_PROJECT_ID}.{TEST_DATASET}.{TEST_TABLE_ID}",
   > )
   > 
   > mock_hook.return_value.split_tablename.return_value = (
   > TEST_GCP_PROJECT_ID,
   > TEST_DATASET,
   > TEST_TABLE_ID,
   > )
   > 
   > with 
mock.patch("airflow.providers.google.cloud.operators.bigquery.GCSHook") as 
gcs_hook:
   > gcs_hook.return_value.download.return_value = b'{"c1": "int"}'
   > operator.execute(context=MagicMock())
   > 
   > assert "schema" in 
mock_hook.return_value.create_empty_table.call_args.kwargs["table_resource"]
   > ```
   > 
   > ```shell
   > Launching pytest with arguments --disable-forbidden-warnings 
test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_schema_fields_not_set
 --no-header --no-summary -q in airflow/providers/tests/google/cloud/operators
   > 
   > = test session starts 
==
   > collecting ... collected 1 item
   > 
   > 
test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_schema_fields_not_set
 
   > 
   > = 1 passed, 1 warning in 1.53s 
=
   > ```
   
   Hello @EugeneYushin,
   
   I have checked the code for `BigQueryCreateExternalTableOperator` operator 
one more time and I noticed that the `autodetect` parameter is located in the 
deprecated code. And it's a reason why you couldn't add a new Unit test for 
this code. In my opinion it doesn't make sense to update this code, because in 
the near future it will be  removed.
   
   About your problem with unworked `autodetect`. I think it can work with 
non-deprecated configuration. I have updated the code from your example. Could 
you please check this code on your environment and share the result?
   ```python
   bq_op = BigQueryCreateExternalTableOperator(
   gcp_conn_id="gcp_pp",
   location="us-east4",
   task_id="bq_op",
   table_resource={
   "tableReference": {
   "projectId": PROJECT_ID,
   "datasetId": "user_eyushin",
   "tableId": "airflow_test_orig",
   },
   "externalDataConfiguration": {
   "sourceFormat": "CSV",
   "autodetect": True,
   "compression": "NONE",
   "sourceUris": [f"gs://{GCS_BUCKET}/test.csv"],
   "csvOptions": {
   "skipLeadingRows": 1,
   }
   },
   "location": "us-east4",
   },
   )
   ```
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-01-16 Thread via GitHub


EugeneYushin commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1918462144


##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
 "tableId": table_id,
 },
 "labels": self.labels,
-"schema": {"fields": schema_fields},
 "externalDataConfiguration": external_data_configuration,
 "location": self.location,
 "encryptionConfiguration": self.encryption_configuration,
 }
 
+if self.schema_fields:

Review Comment:
   @VladaZakharova I'm using the following DAG for e2e tests locally:
   ```python
   bq_op = BigQueryCreateExternalTableOperator(
   gcp_conn_id="gcp_pp",
   task_id="bq_op",
   bucket=GCS_BUCKET,
   source_objects=["test.csv"],
   
destination_project_dataset_table=f"{PROJECT_ID}.user_eyushin.airflow_test_orig",
   # 
destination_project_dataset_table=f"{PROJECT_ID}.user_eyushin.airflow_test_new",
   source_format="CSV",
   autodetect=True,
   skip_leading_rows=1,
   location="us-east4",
   )
   ```
   
   Original behavior:
   https://github.com/user-attachments/assets/0b3136bd-df99-4ce4-9000-783af326de24";
 />
   https://github.com/user-attachments/assets/5199bdd8-1026-4cc0-9218-7e5cbb46856b";
 />
   
   Adjusted behavior:
   https://github.com/user-attachments/assets/ac3db8e0-6a6f-4d1a-8080-85e1fa7a0aab";
 />
   https://github.com/user-attachments/assets/d7385ccf-ed56-4c37-b34f-4d66f99209b5";
 />
   
   https://github.com/user-attachments/assets/864147bc-42b3-4b05-a606-fd4d5318494c";
 />
   
   In the other words, sending empty schema to GCP focres BQ to bypass 
autodetect mechanics.
   You can also check https://github.com/apache/airflow/issues/45512 for some 
context about the issue.
   I'll be happy to iterate on that if you have other questions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-01-16 Thread via GitHub


VladaZakharova commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1918194170


##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
 "tableId": table_id,
 },
 "labels": self.labels,
-"schema": {"fields": schema_fields},
 "externalDataConfiguration": external_data_configuration,
 "location": self.location,
 "encryptionConfiguration": self.encryption_configuration,
 }
 
+if self.schema_fields:

Review Comment:
   Hi there! Thanks for contributing to google-provider. Can you please provide 
some screenshots of the running system test for this operator to show what was 
the problem with this filed and how your fix actually fixes the problem? Thanks!



##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
 "tableId": table_id,
 },
 "labels": self.labels,
-"schema": {"fields": schema_fields},
 "externalDataConfiguration": external_data_configuration,
 "location": self.location,
 "encryptionConfiguration": self.encryption_configuration,
 }
 
+if self.schema_fields:

Review Comment:
   Hi there! Thanks for contributing to google-provider. Can you please provide 
some screenshots of the running system test for this operator to show what was 
the problem with this field and how your fix actually fixes the problem? Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-01-16 Thread via GitHub


EugeneYushin commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1918118012


##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
 "tableId": table_id,
 },
 "labels": self.labels,
-"schema": {"fields": schema_fields},
 "externalDataConfiguration": external_data_configuration,
 "location": self.location,
 "encryptionConfiguration": self.encryption_configuration,
 }
 
+if self.schema_fields:

Review Comment:
   
https://github.com/apache/airflow/pull/45610/commits/c3c07ea8aa644058a885b1aae8af10716bcd93a8



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-01-16 Thread via GitHub


EugeneYushin commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1918116468


##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
 "tableId": table_id,
 },
 "labels": self.labels,
-"schema": {"fields": schema_fields},
 "externalDataConfiguration": external_data_configuration,
 "location": self.location,
 "encryptionConfiguration": self.encryption_configuration,
 }
 
+if self.schema_fields:

Review Comment:
   Hey @MaksYermak 
   Indeed, it should not reference `self.schema_fields` but `schema_fields` 
instead. I'll be happy to add unit test for that, but passing anything except 
`table_resource` raises `AirflowProviderDeprecationWarning`. Do you know if 
there's a chance to overcome that in CI/CD for particular tests?
   
   ```python
   
@mock.patch("airflow.providers.google.cloud.operators.bigquery.BigQueryHook")
   def test_execute_with_schema_fields_not_set(self, mock_hook: MagicMock):
   operator = BigQueryCreateExternalTableOperator(
   task_id=TASK_ID,
   bucket=TEST_GCS_BUCKET,
   source_objects=TEST_GCS_CSV_DATA,
   schema_object="gcs://dummy",
   autodetect=True,
   
destination_project_dataset_table=f"{TEST_GCP_PROJECT_ID}.{TEST_DATASET}.{TEST_TABLE_ID}",
   )
   
   mock_hook.return_value.split_tablename.return_value = (
   TEST_GCP_PROJECT_ID,
   TEST_DATASET,
   TEST_TABLE_ID,
   )
   
   with 
mock.patch("airflow.providers.google.cloud.operators.bigquery.GCSHook") as 
gcs_hook:
   gcs_hook.return_value.download.return_value = b'{"c1": "int"}'
   operator.execute(context=MagicMock())
   
   assert "schema" in 
mock_hook.return_value.create_empty_table.call_args.kwargs["table_resource"]
   ```
   
   ```bash
   Launching pytest with arguments --disable-forbidden-warnings 
test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_schema_fields_not_set
 --no-header --no-summary -q in airflow/providers/tests/google/cloud/operators
   
   = test session starts 
==
   collecting ... collected 1 item
   
   
test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_schema_fields_not_set
 
   
   = 1 passed, 1 warning in 1.53s 
=
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-01-14 Thread via GitHub


MaksYermak commented on code in PR #45610:
URL: https://github.com/apache/airflow/pull/45610#discussion_r1914517397


##
providers/src/airflow/providers/google/cloud/operators/bigquery.py:
##
@@ -1717,12 +1717,14 @@ def execute(self, context: Context) -> None:
 "tableId": table_id,
 },
 "labels": self.labels,
-"schema": {"fields": schema_fields},
 "externalDataConfiguration": external_data_configuration,
 "location": self.location,
 "encryptionConfiguration": self.encryption_configuration,
 }
 
+if self.schema_fields:

Review Comment:
   Hello @EugeneYushin ,
   I think your changes broke this part of the code 
https://github.com/apache/airflow/blob/main/providers/src/airflow/providers/google/cloud/operators/bigquery.py#L1679C8-L1686C14
 for me it looks like that the `schema_fields`, from this peace of code, will 
never assign to `table_resource` if we merge your changes.
   Also I think, maybe this code is a reason why `autodetect` does not work. 
For me it looks like `schema_fields` is always present in the `table_resource`.
   Could you please check and prove that case which cover by this code 
https://github.com/apache/airflow/blob/main/providers/src/airflow/providers/google/cloud/operators/bigquery.py#L1679C8-L1686C14
 is working with your changes?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-01-13 Thread via GitHub


boring-cyborg[bot] commented on PR #45610:
URL: https://github.com/apache/airflow/pull/45610#issuecomment-2586813833

   Congratulations on your first Pull Request and welcome to the Apache Airflow 
community! If you have any issues or are unsure about any anything please check 
our Contributors' Guide 
(https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (ruff, mypy and type 
annotations). Our [pre-commits]( 
https://github.com/apache/airflow/blob/main/contributing-docs/08_static_code_checks.rst#prerequisites-for-pre-commit-hooks)
 will help you with that.
   - In case of a new feature add useful documentation (in docstrings or in 
`docs/` directory). Adding a new operator? Check this short 
[guide](https://github.com/apache/airflow/blob/main/docs/apache-airflow/howto/custom-operator.rst)
 Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze 
environment](https://github.com/apache/airflow/blob/main/dev/breeze/doc/README.rst)
 for testing locally, it's a heavy docker but it ships with a working Airflow 
and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get 
the final approval from Committers.
   - Please follow [ASF Code of 
Conduct](https://www.apache.org/foundation/policies/conduct) for all 
communication including (but not limited to) comments on Pull Requests, Mailing 
list and Slack.
   - Be sure to read the [Airflow Coding style]( 
https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#coding-style-and-best-practices).
   - Always keep your Pull Requests rebased, otherwise your build might fail 
due to changes not related to your commits.
   Apache Airflow is a community-driven project and together we are making it 
better 🚀.
   In case of doubts contact the developers at:
   Mailing List: [email protected]
   Slack: https://s.apache.org/airflow-slack
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



[PR] BigQueryCreateExternalTableOperator pass schema_fields through if only it set explicitly [airflow]

2025-01-13 Thread via GitHub


EugeneYushin opened a new pull request, #45610:
URL: https://github.com/apache/airflow/pull/45610

   Closes https://github.com/apache/airflow/issues/45512
   
   This PR fixes the case when schema `autodetect` is used in 
`BigQueryCreateExternalTableOperator`. Currently, the table created in BigQuery 
can't be queried when empty `schema.fields` are passed to GCP. Simply omitting 
empty `schema.fields` to be passed in request body to GCP allows schema 
auto-detection flow to prosper.
   
   This part of API is marked as deprecated, hence [unit 
tests](https://github.com/apache/airflow/commit/ffe09d62ed15bbc09aeba19ebd5d1b2a2cc8e0ed)
 I was able to come up with fail:
   ```
   E   airflow.exceptions.AirflowProviderDeprecationWarning: Passing table 
parameters via keywords arguments will be deprecated. Please provide table 
definition using `table_resource` parameter.
   ```
   
   I was able to run that test with `--disable-forbidden-warnings ` flag:
   ```
   pytest --disable-forbidden-warnings 
providers/tests/google/cloud/operators/test_bigquery.py::TestBigQueryCreateExternalTableOperator
   ...
   
   
providers/tests/google/cloud/operators/test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_csv_format
 PASSED 
 [ 25%]
   
providers/tests/google/cloud/operators/test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_parquet_format
 PASSED  [ 
50%]
   
providers/tests/google/cloud/operators/test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_get_openlineage_facets_on_complete
 PASSED   [ 75%]
   
providers/tests/google/cloud/operators/test_bigquery.py::TestBigQueryCreateExternalTableOperator::test_execute_with_schema_fields_not_set
 PASSED   [100%]
   ...
   

 4 passed, 1 warning in 1.83s 

   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]