michalmodras commented on PR #44477:
URL: https://github.com/apache/airflow/pull/44477#issuecomment-2541737829

   To make sure I understand the target state:
   - With Kacper's changes, some additional metadata about the parent job (the 
Airflow DAG / task in this case) will be passed to the Spark job, and the Spark 
job itself will emit it in an OpenLineage event to an OpenLineage-supporting 
catalog.
   - Regardless of that, an OpenLineage event at the Airflow level can/will 
still be emitted.
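
A minimal sketch of the first point, assuming the parent metadata reaches the 
Spark job as Spark properties. The property names below are the ones the 
OpenLineage Spark integration uses for parent job information, to my 
understanding; the helper name, its arguments and the `dag_id.task_id` job name 
format are hypothetical and not taken from the PR.

```python
def parent_job_spark_properties(dag_id, task_id, parent_run_id, namespace="default"):
    """Build Spark properties describing the Airflow task as the parent job.

    Hypothetical helper: the real operator may assemble these values differently.
    """
    return {
        # Namespace of the Airflow deployment that owns the parent job.
        "spark.openlineage.parentJobNamespace": namespace,
        # Assumed naming convention for the parent job: "<dag_id>.<task_id>".
        "spark.openlineage.parentJobName": f"{dag_id}.{task_id}",
        # Run id of the Airflow task instance that triggered the Spark job.
        "spark.openlineage.parentRunId": parent_run_id,
    }


props = parent_job_spark_properties(
    "etl", "load", "01941f29-7c00-7087-8906-40e512c257bd"
)
```

The Spark job would then attach these values to its own events, so a catalog can 
link the Spark run back to the Airflow task.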
   
   I think it's fair for each orchestration layer to emit the metadata it has 
access to: for the Spark job, low-level information about the Spark execution 
(depending on the job type/implementation), and for Airflow, information about 
the DAG, task and Airflow deployment. 
   
   For Airflow itself to construct such a lineage event, it needs to be aware 
of the input/output assets (as long as we cannot link lineage events by process 
identifier alone). SQL parsing can be a way to get this information for 
SQL-like jobs. For other types of jobs (not necessarily Spark jobs) we can, for 
example, query the service the operator is integrated with: for BigQuery jobs, 
we could query the BigQuery API to get that information and emit an event 
linking the input/output assets with the BigQuery job id and the 
DAG/task/Airflow deployment id.
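
To illustrate the last point, here is a rough sketch of what such an 
Airflow-level event could look like as a plain dictionary. The top-level keys 
follow the OpenLineage run event shape, and `externalQuery` is the standard run 
facet for an external job/query id. The function name, the argument layout and 
the producer URL are hypothetical. The required facet fields `_producer` and 
`_schemaURL` are left out for brevity, and fetching the datasets from the 
BigQuery API is not shown.

```python
import uuid
from datetime import datetime, timezone


def build_airflow_level_event(dag_id, task_id, external_job_id, inputs, outputs,
                              namespace="default"):
    """Return an OpenLineage-shaped COMPLETE event for one Airflow task.

    `inputs` and `outputs` are lists of (dataset_namespace, dataset_name)
    tuples, assumed to have been obtained elsewhere (e.g. from the service API).
    """
    def as_datasets(pairs):
        return [{"namespace": ns, "name": name} for ns, name in pairs]

    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Placeholder producer URI, not a real endpoint.
        "producer": "https://example.com/hypothetical-airflow-producer",
        "run": {
            "runId": str(uuid.uuid4()),
            "facets": {
                # Links this run to the job id in the external service.
                "externalQuery": {
                    "externalQueryId": external_job_id,
                    "source": "bigquery",
                },
            },
        },
        # Assumed job naming convention: "<dag_id>.<task_id>".
        "job": {"namespace": namespace, "name": f"{dag_id}.{task_id}"},
        "inputs": as_datasets(inputs),
        "outputs": as_datasets(outputs),
    }


event = build_airflow_level_event(
    "etl", "load", "bq-job-123",
    inputs=[("bigquery", "project.dataset.source_table")],
    outputs=[("bigquery", "project.dataset.target_table")],
)
```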


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
