kacpermuda opened a new pull request, #44477:
URL: https://github.com/apache/airflow/pull/44477

   This PR introduces a completely new feature to the OpenLineage integration. **It will NOT impact users who do not use OpenLineage or have not explicitly enabled this feature (it is `False` by default).**
   
   ## TL;DR
   When explicitly enabled by the user for supported operators, we will 
automatically inject parent job information into the Spark job properties. For 
example, when submitting a Spark job using the DataprocSubmitJobOperator, we 
will include details about the Airflow task that triggered it so that the 
OpenLineage Spark integration can include them in parentRunFacet.
   
   ## Why?
   
   To enable full pipeline visibility and track dependencies between jobs in OpenLineage, we use the parentRunFacet. This facet stores the identifier of the parent job that triggered the current job. The approach works across integrations: for example, you can pass Airflow's job identifier to a Spark application that was triggered by an Airflow operator. Currently, the user has to configure this manually, for instance with
[macros](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/macros.html):
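   ```python
   from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

   DataprocSubmitJobOperator(
       task_id="my_task",
       # ...
       job={
           # ...
           "spark_job": {
               # ...
               "properties": {
                   # Parent job identifiers, rendered by the OpenLineage provider macros.
                   "spark.openlineage.parentJobNamespace": "{{ macros.OpenLineageProviderPlugin.lineage_job_namespace() }}",
                   "spark.openlineage.parentJobName": "{{ macros.OpenLineageProviderPlugin.lineage_job_name(task_instance) }}",
                   "spark.openlineage.parentRunId": "{{ macros.OpenLineageProviderPlugin.lineage_run_id(task_instance) }}",
               },
           },
       },
   )
   ```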
   Understanding how various Airflow operators configure Spark allows us to 
automatically inject parent job information.
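
To make the idea concrete, here is a minimal sketch of the injection step itself. The function name `inject_parent_job_info` and the rule of skipping injection when the user has already set a parent key are assumptions made for illustration; they are not taken from this PR's code. Only the three `spark.openlineage.parent*` property names come from the example above.

```python
from __future__ import annotations


def inject_parent_job_info(
    properties: dict[str, str],
    namespace: str,
    job_name: str,
    run_id: str,
) -> dict[str, str]:
    """Return a copy of ``properties`` with the OpenLineage parent job keys added.

    Hypothetical helper for illustration only, not the provider's actual function.
    """
    # Assumption: if the user already set any parent key by hand, leave it alone.
    if any(key.startswith("spark.openlineage.parent") for key in properties):
        return dict(properties)

    # Merge the parent identifiers into a new dict so the input is not mutated.
    return {
        **properties,
        "spark.openlineage.parentJobNamespace": namespace,
        "spark.openlineage.parentJobName": job_name,
        "spark.openlineage.parentRunId": run_id,
    }


# Placeholder values; in Airflow these would come from the running task instance.
injected = inject_parent_job_info(
    {"spark.executor.memory": "2g"},
    namespace="default",
    job_name="my_dag.my_task",
    run_id="run-id-placeholder",
)
```

Returning a new dict rather than mutating the input keeps the user's original job definition intact, which makes the behavior easier to reason about on task retries.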
   
   ## Controlling the Behavior
   
   We provide users with a flexible control mechanism to manage this injection, 
combining per-operator enablement with a global fallback configuration. This 
design is inspired by the `deferrable` argument in Airflow.
   
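   ```python
   from airflow.configuration import conf

   # Operator argument; its default is read from the global OpenLineage config.
   ol_inject_parent_job_info: bool = conf.getboolean(
       "openlineage", "spark_inject_parent_job_info", fallback=False
   )
   ```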
   Each supported operator will include an argument like 
`ol_inject_parent_job_info`, which defaults to the global configuration value 
of `openlineage.spark_inject_parent_job_info`. This approach allows users to:
   
   1. Control behavior on a per-job basis by explicitly setting the argument.
   2. Rely on a consistent default configuration for all jobs if the argument 
is not set.
   
   This design ensures both flexibility and ease of use, enabling users to 
fine-tune their workflows while minimizing repetitive configuration. I am aware 
that adding an OpenLineage-related argument to the operator will affect all 
users, even those not using OpenLineage, but since it defaults to False and can 
be ignored, I hope this will not pose any issues.
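
The interplay between the global option and the per-operator argument can be sketched without Airflow installed. In the sketch below, the standard library's `configparser` stands in for Airflow's `conf` object, and `SketchSubmitJobOperator` is a hypothetical class that shows only the argument handling; neither is part of this PR.

```python
from __future__ import annotations

from configparser import ConfigParser

# Stand-in for Airflow's configuration object (assumption for this sketch).
# Here the global option is switched on.
conf = ConfigParser()
conf.read_string("[openlineage]\nspark_inject_parent_job_info = true\n")


class SketchSubmitJobOperator:
    """Hypothetical operator showing only how the argument default is resolved."""

    def __init__(
        self,
        task_id: str,
        # The default is taken from the global config, falling back to False.
        ol_inject_parent_job_info: bool = conf.getboolean(
            "openlineage", "spark_inject_parent_job_info", fallback=False
        ),
    ) -> None:
        self.task_id = task_id
        self.ol_inject_parent_job_info = ol_inject_parent_job_info


# No argument given: the global configuration decides.
uses_global = SketchSubmitJobOperator(task_id="a")
# Explicit argument: this task opts out regardless of the global setting.
opted_out = SketchSubmitJobOperator(task_id="b", ol_inject_parent_job_info=False)

print(uses_global.ol_inject_parent_job_info, opted_out.ol_inject_parent_job_info)
# True False
```

Note that a default written this way is evaluated once, when the class is defined, so the configuration is read at import time rather than on every instantiation.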
   
   ## How?
   The implementation is divided into three parts for better organization and 
clarity:
   
   1. **Operator's Code (including the `execute` method):**  
      Contains minimal logic to avoid overwhelming users who are not actively 
working with OpenLineage.
   
   2. **Google's Provider OpenLineage Utils File:**  
      Handles the logic for accessing Spark properties specific to a given 
operator or job.
   
   3. **OpenLineage Provider's Utils:**  
      Responsible for creating or extracting all necessary information in a format compatible with the OpenLineage Spark integration. The Spark properties themselves are also modified here.
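
The three layers above could fit together roughly as follows. This is a sketch under stated assumptions: every function and class name is hypothetical, the list of Dataproc job types is an assumption, and the parent information is passed in as a ready-made dict rather than derived from a task instance.

```python
from __future__ import annotations

import copy

# Assumed list of Dataproc job types that carry a Spark "properties" map.
SPARK_JOB_TYPES = ("spark_job", "pyspark_job", "spark_sql_job", "spark_r_job")


def add_parent_info(properties: dict, parent: dict) -> dict:
    """Layer 3 (OpenLineage provider utils): merge parent keys into properties."""
    return {**properties, **parent}


def inject_into_dataproc_job(job: dict, parent: dict) -> dict:
    """Layer 2 (Google provider utils): locate the Spark properties in a job."""
    job = copy.deepcopy(job)  # do not mutate the user's job definition
    for job_type in SPARK_JOB_TYPES:
        if job_type in job:
            section = job[job_type]
            section["properties"] = add_parent_info(
                section.get("properties", {}), parent
            )
            break
    return job


class SketchOperator:
    """Layer 1 (operator): only a flag check and a single call."""

    def __init__(self, job: dict, ol_inject_parent_job_info: bool = False) -> None:
        self.job = job
        self.ol_inject_parent_job_info = ol_inject_parent_job_info

    def execute(self, parent: dict) -> dict:
        if self.ol_inject_parent_job_info:
            self.job = inject_into_dataproc_job(self.job, parent)
        return self.job  # a real operator would submit the job here


job = {"spark_job": {"main_class": "org.example.Main"}}
parent = {"spark.openlineage.parentJobName": "my_dag.my_task"}
result = SketchOperator(job, ol_inject_parent_job_info=True).execute(parent)
```

Keeping the operator's part down to one conditional call is what lets users who do not work with OpenLineage ignore the feature entirely.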
   
   ## Next steps
   1. **Expand Operator Coverage:**  
      Increase support for additional operators by extending the parent job 
information injection to cover more cases.
   
   2. **Automate Transport Configuration:**  
      Implement similar automation for transport configurations, starting with 
HTTP, to streamline the integration process.
   
   
   ## TODO
   - Add more documentation and a description of how to use this feature
   
   