kacpermuda commented on code in PR #44872:
URL: https://github.com/apache/airflow/pull/44872#discussion_r1887989193
##########
providers/src/airflow/providers/google/cloud/openlineage/mixins.py:
##########
@@ -167,6 +211,21 @@ def _deduplicate_outputs(self, outputs: list[OutputDataset | None]) -> list[Outp
# if the rowCount or size can be summed together.
if single_output.outputFacets:
single_output.outputFacets.pop("outputStatistics", None)
+
+ # If both outputs contain Column Level Lineage Facet - merge the facets
Review Comment:
When BigQuery receives two or more statements in a single query, it treats the
query as a SCRIPT type job and executes each statement in a separate child job.
In that case, we extract CLL per child job, parsing only the part of the whole
query that the child job ran.
If we submit the following query to BQ, it will execute 3 child jobs and we
will receive `my-project.test.new_table` 3 times, each with different CLL, so
to get the whole picture we need to merge them.
```
CREATE OR REPLACE TABLE `my-project.test.new_table` AS
SELECT
id,
name,
age
FROM
`my-project.test.source`;
INSERT INTO `my-project.test.new_table`
SELECT
id,
name,
age
FROM
`my-project.test.import_table`;
INSERT INTO `new_table`
SELECT
NULL,
name,
age
FROM
`copy_result`;
```
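To illustrate the merge, here is a minimal sketch and not the code in this PR.
It assumes each child job's CLL has already been reduced to a plain dict mapping
an output column to a list of `(source_table, source_column)` pairs. That shape
and the `merge_column_lineage` helper are hypothetical and stand in for the real
OpenLineage facet classes.

```python
def merge_column_lineage(facets):
    """Merge per-child-job column lineage dicts into a single dict.

    Hypothetical input shape (an assumption, not the OpenLineage classes):
    {output_column: [(source_table, source_column), ...]}
    """
    merged = {}
    for facet in facets:
        for column, sources in facet.items():
            bucket = merged.setdefault(column, [])
            for source in sources:
                # Keep first-seen order and skip exact duplicates.
                if source not in bucket:
                    bucket.append(source)
    return merged


# One dict per child job of the script above, all for the same output table.
child_job_facets = [
    {
        "id": [("my-project.test.source", "id")],
        "name": [("my-project.test.source", "name")],
        "age": [("my-project.test.source", "age")],
    },
    {
        "id": [("my-project.test.import_table", "id")],
        "name": [("my-project.test.import_table", "name")],
        "age": [("my-project.test.import_table", "age")],
    },
    {
        # The third statement inserts NULL for id, so id has no source here.
        "name": [("copy_result", "name")],
        "age": [("copy_result", "age")],
    },
]

merged = merge_column_lineage(child_job_facets)
```

With this input, `merged["id"]` lists two source columns while `merged["name"]`
and `merged["age"]` list three each, which is the "whole picture" that any
single child job alone would not give.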