kacpermuda opened a new pull request, #44477: URL: https://github.com/apache/airflow/pull/44477
This PR introduces a completely new feature to the OpenLineage integration. **It will NOT impact users who are not using OpenLineage or who have not explicitly enabled this feature (it is `False` by default).**

## TL;DR

When explicitly enabled by the user for supported operators, we will automatically inject parent job information into the Spark job properties. For example, when submitting a Spark job using the `DataprocSubmitJobOperator`, we will include details about the Airflow task that triggered it, so that the OpenLineage Spark integration can include them in the `parentRunFacet`.

## Why?

To enable full pipeline visibility and track dependencies between jobs in OpenLineage, we use the `parentRunFacet`. This facet stores the identifier of the parent job that triggered the current job, and the approach works across integrations: for example, you can pass Airflow's job identifier to a Spark application that was triggered by an Airflow operator.

Currently, this requires manual configuration by the user, such as leveraging [macros](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/macros.html):

```python
DataprocSubmitJobOperator(
    task_id="my_task",
    # ...
    job={
        # ...
        "spark.openlineage.parentJobNamespace": "{{ macros.OpenLineageProviderPlugin.lineage_job_namespace() }}",
        "spark.openlineage.parentJobName": "{{ macros.OpenLineageProviderPlugin.lineage_job_name(task_instance) }}",
        "spark.openlineage.parentRunId": "{{ macros.OpenLineageProviderPlugin.lineage_run_id(task_instance) }}",
    },
)
```

Understanding how various Airflow operators configure Spark allows us to inject the parent job information automatically.

## Controlling the Behavior

We provide users with a flexible mechanism to control this injection, combining per-operator enablement with a global fallback configuration. The design is inspired by the `deferrable` argument in Airflow:

```python
ol_inject_parent_job_info: bool = conf.getboolean(
    "openlineage", "spark_inject_parent_job_info", fallback=False
)
```

Each supported operator will include an argument like `ol_inject_parent_job_info`, which defaults to the global configuration value of `openlineage.spark_inject_parent_job_info`. This approach allows users to:

1. Control behavior on a per-job basis by explicitly setting the argument.
2. Rely on a consistent default configuration for all jobs if the argument is not set.
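A minimal usage sketch of how the per-task flag interacts with the global fallback. The argument name `ol_inject_parent_job_info` and all values are illustrative, taken from this description rather than the final merged code:

```python
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# The global default can be set in airflow.cfg under [openlineage] as
# spark_inject_parent_job_info = True, or via the environment variable
# AIRFLOW__OPENLINEAGE__SPARK_INJECT_PARENT_JOB_INFO=true.
DataprocSubmitJobOperator(
    task_id="my_task",
    project_id="my-project",  # illustrative values
    region="europe-west1",
    job={
        "reference": {"project_id": "my-project"},
        "placement": {"cluster_name": "my-cluster"},
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/job.py"},
    },
    # Per-task opt-in; overrides the global fallback for this task only.
    # No manual spark.openlineage.* macro configuration is needed.
    ol_inject_parent_job_info=True,
)
```

With this in place, the operator injects `spark.openlineage.parentJobNamespace`, `spark.openlineage.parentJobName` and `spark.openlineage.parentRunId` at execution time, producing the same result as the manual macro-based setup shown above.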
This design ensures both flexibility and ease of use, enabling users to fine-tune their workflows while minimizing repetitive configuration. I am aware that adding an OpenLineage-related argument to an operator will affect all users, even those not using OpenLineage, but since it defaults to `False` and can be ignored, I hope this will not pose any issues.

## How?

The implementation is divided into three parts for better organization and clarity:

1. **Operator's code (including the `execute` method):** Contains minimal logic, to avoid overwhelming users who are not actively working with OpenLineage.
2. **Google provider's OpenLineage utils file:** Handles the logic for accessing the Spark properties specific to a given operator or job.
3. **OpenLineage provider's utils:** Responsible for creating/extracting all the necessary information in a format compatible with the OpenLineage Spark integration. The Spark properties are also modified here (see the sketch at the end of this description).

## Next steps

1. **Expand operator coverage:** Support additional operators by extending the parent job information injection to more cases.
2. **Automate transport configuration:** Implement similar automation for transport configuration, starting with HTTP, to streamline the integration process.

# TODO

- Add more documentation and a description of how to use this feature.
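To make part 3 above concrete, here is a hedged sketch of what the OpenLineage provider's utils could look like. The function name and the skip-if-already-configured check are illustrative assumptions, not the exact code from this PR; only the macro helpers (`lineage_job_namespace`, `lineage_job_name`, `lineage_run_id`) come from the existing OpenLineage provider:

```python
from airflow.providers.openlineage.plugins.macros import (
    lineage_job_name,
    lineage_job_namespace,
    lineage_run_id,
)


def inject_parent_job_information(properties: dict, task_instance) -> dict:
    """Return a copy of the Spark properties with parent job info added.

    Hypothetical helper: if the user has already set any
    spark.openlineage.parent* property manually, the properties are left
    untouched so that explicit user configuration always wins.
    """
    if any(key.startswith("spark.openlineage.parent") for key in properties):
        return properties
    return {
        **properties,
        "spark.openlineage.parentJobNamespace": lineage_job_namespace(),
        "spark.openlineage.parentJobName": lineage_job_name(task_instance),
        "spark.openlineage.parentRunId": lineage_run_id(task_instance),
    }
```

Under this split, the operator's `execute` method only needs a small conditional call into such a helper, which keeps part 1 minimal as described above.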