michalmodras commented on PR #44477:
URL: https://github.com/apache/airflow/pull/44477#issuecomment-2541737829
To make sure I understand the target state:
- With Kacper's changes, some additional metadata about the parent job (the Airflow DAG / task in this case) will be passed to the Spark job and emitted by the Spark job itself, in an OpenLineage event, to an OpenLineage-supporting catalog.
- Regardless of that, an OpenLineage event can/will also be emitted at the Airflow level.

I think it's fair for each layer of orchestration to emit the metadata it has access to: for example, low-level information about Spark execution (depending on the Spark job type/implementation), or, in the case of Airflow, information about the DAG/task/Airflow deployment.

For Airflow itself to construct such a lineage event, it needs to be aware of the input/output assets (as long as we cannot link lineage events by process identifier alone). SQL parsing can be a way to get this information for SQL-like jobs. For other types of jobs (not necessarily Spark jobs), we can, for example, query the service the operator is integrated with. E.g. for BigQuery jobs, we could query the BigQuery API to get that information and emit an event linking the input/output assets with the BigQuery job id and the DAG/task/Airflow deployment id.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org