kacpermuda opened a new pull request, #61535: URL: https://github.com/apache/airflow/pull/61535
## TLDR: Add hook-level lineage (HLL) reporting to SQL hooks via `send_sql_hook_lineage`

This PR introduces a standardized mechanism for SQL hooks to report execution metadata (SQL text, query parameters, job IDs, row counts, default database/schema) to the hook lineage collector using `add_extra`.

I also bumped the required sql-common version for all modified providers, so that HLL is actually emitted.

I've also added tests for most hooks that use `DbApiHook` as a base class, to make sure that hook-level lineage is still sent even if some methods are overridden in the future. For now this mostly tests the `DbApiHook` implementation multiple times, but if some database hook later overrides `run()`, its test should fail unless the new implementation also calls the HLL collector.

### Important context

The HLL collector is a no-op unless a collector is registered (e.g. by the OpenLineage provider). This means no runtime overhead for users who don't use lineage collection.

### Motivation

Black-box operators (e.g. `PythonOperator` calling `PostgresHook.run(sql)`) currently produce no lineage. With this change, any registered collector can capture the SQL being executed, parse it for input/output datasets, and attach query IDs to lineage events, dramatically improving lineage quality without requiring operator-level changes.

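
The reporting flow and the override-guarding tests described above can be illustrated with a small, self-contained sketch. It does not import Airflow and does not reproduce the PR's code: `NoOpCollector`, `RecordingCollector`, `report_sql_lineage`, `SketchSqlHook`, `ForgetfulHook`, `emits_lineage`, and the `"sql_execution"` key are all hypothetical names chosen for illustration. The real `send_sql_hook_lineage` signature is not shown in this description, so the helper below only approximates the idea of passing execution metadata to a collector through an `add_extra`-style call.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


class NoOpCollector:
    """Hypothetical stand-in for 'no collector registered': drops everything."""

    def add_extra(self, context: Any, key: str, value: Any) -> None:
        return None


@dataclass
class RecordingCollector:
    """Hypothetical registered collector that keeps what hooks report."""

    extras: list[tuple[str, Any]] = field(default_factory=list)

    def add_extra(self, context: Any, key: str, value: Any) -> None:
        self.extras.append((key, value))


def report_sql_lineage(collector, hook, sql, parameters=None, row_count=None):
    """Illustrative helper (not the PR's function): send SQL metadata as one extra."""
    details = {
        "sql": sql,
        "parameters": parameters,
        "row_count": row_count,
        "default_db": hook.default_db,
    }
    # Only report fields that are actually known for this execution.
    reported = {k: v for k, v in details.items() if v is not None}
    collector.add_extra(context=hook, key="sql_execution", value=reported)


class SketchSqlHook:
    """Toy hook: run() executes SQL, then reports to whichever collector it holds."""

    default_db = "analytics"  # assumed value, for illustration only

    def __init__(self, collector=None):
        self.collector = collector if collector is not None else NoOpCollector()

    def _execute(self, sql, parameters):
        return 0  # placeholder for a real database call; pretend 0 rows changed

    def run(self, sql, parameters=None):
        rows = self._execute(sql, parameters)
        report_sql_lineage(self.collector, self, sql, parameters, row_count=rows)
        return rows


class ForgetfulHook(SketchSqlHook):
    """A subclass that overrides run() and forgets to report lineage."""

    def run(self, sql, parameters=None):
        return self._execute(sql, parameters)


def emits_lineage(hook_cls) -> bool:
    """The kind of per-hook check the PR describes: did run() reach the collector?"""
    collector = RecordingCollector()
    hook_cls(collector).run("SELECT 1")
    return len(collector.extras) == 1
```

In this sketch, a per-hook test asserting `emits_lineage(SomeHook)` would pass for `SketchSqlHook` and fail for `ForgetfulHook`, which is the regression the repeated per-provider tests are meant to catch. A hook constructed without a collector falls back to `NoOpCollector`, so `run()` behaves the same and nothing is recorded.
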
### Follow-up PRs

- OpenLineage consumer: modify the OL provider to consume these extras, parse SQL for datasets, and attach query_id to OL events.
- BigQueryHook insert_job: mix of SQL and non-SQL lineage, will be done in a separate PR.
- Additional non-SQL hooks: extend HLL to more hooks beyond SQL.

---

##### Was generative AI tooling used to co-author this PR?

- [X] Yes (please specify the tool below)

Co-authored by: Cursor following [the guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)
