kacpermuda commented on code in PR #44872:
URL: https://github.com/apache/airflow/pull/44872#discussion_r1887989193


##########
providers/src/airflow/providers/google/cloud/openlineage/mixins.py:
##########
@@ -167,6 +211,21 @@ def _deduplicate_outputs(self, outputs: list[OutputDataset | None]) -> list[Outp
             # if the rowCount or size can be summed together.
             if single_output.outputFacets:
                 single_output.outputFacets.pop("outputStatistics", None)
+
+            # If both outputs contain Column Level Lineage Facet - merge the facets

Review Comment:
   When BigQuery receives two or more statements in a single query, it treats the query as a SCRIPT type job and executes each statement in a separate child job. In that case, we extract column-level lineage (CLL) per child job, parsing only the part of the whole query that ran in that child job.
   
   If we submit the following query to BigQuery, it executes three child jobs, and we receive `my-project.test.new_table` as an output three times, each time with different CLL. To get the whole picture, we need to merge them.
   ```
   CREATE OR REPLACE TABLE `my-project.test.new_table` AS
   SELECT
     id,
     name,
     age
   FROM
     `my-project.test.source`;
   
   INSERT INTO `my-project.test.new_table`
   SELECT
     id,
     name,
     age
   FROM
     `my-project.test.import_table`;
   
   INSERT INTO `new_table`
   SELECT
     NULL,
     name,
     age
   FROM
     `copy_result`;
   ```
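
To illustrate the merge described above, here is a minimal sketch. It assumes each child job's CLL is a plain dict shaped like the OpenLineage column lineage facet (`fields` -> column -> `inputFields`). The helper names `merge_column_lineage` and `source` and the `bigquery` namespace are hypothetical. This is not the code from the PR, and it ignores transformation metadata.

```python
from __future__ import annotations


def merge_column_lineage(facets: list[dict]) -> dict:
    """Union the input fields of each output column across several facets.

    Hypothetical helper: facets are plain dicts here, not the provider's
    facet classes, and only `inputFields` is carried over.
    """
    merged: dict[str, list[dict]] = {}
    for facet in facets:
        for column, lineage in facet.get("fields", {}).items():
            inputs = merged.setdefault(column, [])
            for input_field in lineage.get("inputFields", []):
                # Skip input fields already recorded for this column.
                if input_field not in inputs:
                    inputs.append(input_field)
    return {
        "fields": {col: {"inputFields": inputs} for col, inputs in merged.items()}
    }


def source(table: str, column: str) -> dict:
    """Build one input-field entry (the namespace value is an assumption)."""
    return {"namespace": "bigquery", "name": table, "field": column}


# One facet per child job, all describing `my-project.test.new_table`.
child_facets = [
    {"fields": {c: {"inputFields": [source("my-project.test.source", c)]}
                for c in ("id", "name", "age")}},
    {"fields": {c: {"inputFields": [source("my-project.test.import_table", c)]}
                for c in ("id", "name", "age")}},
    # The third statement selects NULL for `id`, so `id` has no input here.
    {"fields": {c: {"inputFields": [source("copy_result", c)]}
                for c in ("name", "age")}},
]

merged = merge_column_lineage(child_facets)
```

With these inputs, `id` ends up with two input fields (from `source` and `import_table`), while `name` and `age` each end up with three.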


