[PR] feat: automatically inject OL info into spark job in DataprocSubmitJobOperator [airflow]

via GitHub Fri, 29 Nov 2024 06:06:14 -0800


kacpermuda opened a new pull request, #44477:
URL: https://github.com/apache/airflow/pull/44477

This PR introduces a completely new feature to OpenLineage integration. **It
will NOT impact users that are not using OpenLineage or have not explicitly
enabled this feature (False by default).**

## TL;DR
When explicitly enabled by the user for supported operators, we will
automatically inject parent job information into the Spark job properties. For
example, when submitting a Spark job using the DataprocSubmitJobOperator, we
will include details about the Airflow task that triggered it so that the
OpenLineage Spark integration can include them in parentRunFacet.

## Why?

To enable full pipeline visibility and track dependencies between jobs in
OpenLineage, we utilize the parentRunFacet. This facet stores the identifier of
the parent job that triggered the current job. This approach works across
various integrations; for example, you can pass Airflow’s job identifier to a Spark
application if it was triggered by an Airflow operator. Currently, this process
requires manual configuration by the user, such as leveraging
[macros](https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/macros.html):
```
DataprocSubmitJobOperator(
    task_id="my_task",
    # ...
    job={
        # ...
        "spark.openlineage.parentJobNamespace": "{{ macros.OpenLineageProviderPlugin.lineage_job_namespace() }}",
        "spark.openlineage.parentJobName": "{{ macros.OpenLineageProviderPlugin.lineage_job_name(task_instance) }}",
        "spark.openlineage.parentRunId": "{{ macros.OpenLineageProviderPlugin.lineage_run_id(task_instance) }}",
    },
)
```
Understanding how various Airflow operators configure Spark allows us to
automatically inject parent job information.

## Controlling the Behavior

We provide users with a flexible control mechanism to manage this injection,
combining per-operator enablement with a global fallback configuration. This
design is inspired by the `deferrable` argument in Airflow.

```python
ol_inject_parent_job_info: bool = conf.getboolean(
    "openlineage", "spark_inject_parent_job_info", fallback=False
)
```
Each supported operator will include an argument like
`ol_inject_parent_job_info`, which defaults to the global configuration value
of `openlineage.spark_inject_parent_job_info`. This approach allows users to:

1. Control behavior on a per-job basis by explicitly setting the argument.
2. Rely on a consistent default configuration for all jobs if the argument
is not set.

This design ensures both flexibility and ease of use, enabling users to
fine-tune their workflows while minimizing repetitive configuration. I am aware
that adding an OpenLineage-related argument to the operator will affect all
users, even those not using OpenLineage, but since it defaults to False and can
be ignored, I hope this will not pose any issues.
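
To make the precedence concrete, here is a minimal sketch of how an argument default can fall back to a global setting. It is not the operator's actual code: `SketchOperator` is a hypothetical class, and a standard-library `ConfigParser` stands in for Airflow's `conf` object so the example runs without Airflow installed.

```python
from configparser import ConfigParser

# Stand-in for Airflow's `conf` (assumption: used here only so the sketch is
# runnable on its own). The value mirrors the feature's default of False.
conf = ConfigParser()
conf.read_dict({"openlineage": {"spark_inject_parent_job_info": "False"}})


class SketchOperator:
    """Hypothetical operator used for illustration only."""

    def __init__(
        self,
        task_id: str,
        # The default is read from the global config; an explicit argument wins.
        ol_inject_parent_job_info: bool = conf.getboolean(
            "openlineage", "spark_inject_parent_job_info", fallback=False
        ),
    ) -> None:
        self.task_id = task_id
        self.ol_inject_parent_job_info = ol_inject_parent_job_info


# No argument given: the global configuration value applies.
uses_global = SketchOperator(task_id="uses_global")

# Argument given: the per-operator value overrides the global one.
opted_in = SketchOperator(task_id="opted_in", ol_inject_parent_job_info=True)
```

Note that in this pattern Python evaluates the default once, when the class body is executed, so the global value is fixed at import time rather than looked up on each instantiation.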

## How?
The implementation is divided into three parts for better organization and
clarity:

1. **Operator's Code (including the `execute` method):**
   Contains minimal logic to avoid overwhelming users who are not actively
   working with OpenLineage.

2. **Google's Provider OpenLineage Utils File:**
   Handles the logic for accessing Spark properties specific to a given
   operator or job.

3. **OpenLineage Provider's Utils:**
   Responsible for creating or extracting all necessary information in a
   format compatible with the OpenLineage Spark integration. The Spark
   properties are also modified here.
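
The three parts above could fit together roughly as follows. This is a simplified sketch, not the code in this PR: the function names, the `parent` dictionary, the `spark_job` / `properties` job layout, and the choice to skip injection when the user has already set a parent property are all assumptions made for illustration. Only the three property names come from the manual example earlier in this description.

```python
from __future__ import annotations

from copy import deepcopy

# Property names as used in the manual macro-based configuration.
PARENT_KEYS = {
    "namespace": "spark.openlineage.parentJobNamespace",
    "job_name": "spark.openlineage.parentJobName",
    "run_id": "spark.openlineage.parentRunId",
}


def inject_parent_job_info(
    properties: dict[str, str], parent: dict[str, str]
) -> dict[str, str]:
    """Return a copy of `properties` with the parent job properties added.

    Assumption: if any parent property is already present, the user has
    configured it manually, so the properties are returned unchanged.
    """
    if any(key.startswith("spark.openlineage.parent") for key in properties):
        return dict(properties)
    injected = dict(properties)
    for field, key in PARENT_KEYS.items():
        injected[key] = parent[field]
    return injected


def prepare_job(job: dict, parent: dict[str, str], enabled: bool) -> dict:
    """Operator-side step: a no-op unless the feature flag is enabled."""
    if not enabled:
        return job
    updated = deepcopy(job)  # avoid mutating the caller's job definition
    # Assumed job layout: {"spark_job": {"properties": {...}}}
    spark_job = updated.setdefault("spark_job", {})
    spark_job["properties"] = inject_parent_job_info(
        spark_job.get("properties", {}), parent
    )
    return updated


# Placeholder values; in Airflow these would come from the running task.
parent = {
    "namespace": "default",
    "job_name": "my_dag.my_task",
    "run_id": "example-run-id",
}
job = {"spark_job": {"properties": {"spark.executor.memory": "2g"}}}

result = prepare_job(job, parent, enabled=True)
print(result["spark_job"]["properties"])
```

Keeping the flag check in the operator and the property handling in separate helpers mirrors the split described above: the operator stays small, and the injection logic can be reused by other operators.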

## Next steps
1. **Expand Operator Coverage:**
   Increase support for additional operators by extending the parent job
   information injection to cover more cases.

2. **Automate Transport Configuration:**
   Implement similar automation for transport configurations, starting with
   HTTP, to streamline the integration process.

## TODO
- Add more documentation and a description of how to use this feature

