shahar1 commented on PR #61535:
URL: https://github.com/apache/airflow/pull/61535#issuecomment-3860772266

   > ## TLDR:
   > Add hook-level lineage (HLL) reporting to SQL hooks via 
`send_sql_hook_lineage`. This PR introduces a standardized mechanism for SQL 
hooks to report execution metadata (SQL text, query parameters, job IDs, row 
counts, default database/schema) to the hook lineage collector using `add_extra`.
   > 
   > I also bumped the required sql-common version for all modified providers, 
so that HLL is actually emitted.
   > 
   > I've also added tests for most hooks that use DbApiHook as a base class, 
to make sure that hook-level lineage is still sent even if some methods are 
overridden in the future. For now this mostly tests the DbApiHook 
implementation multiple times, but if some database hook later overrides 
`run()`, I need the test to fail unless the new implementation also calls the 
HLL collector.
   > 
   > ### Important context
   > The HLL collector is a no-op unless a collector is registered (e.g. by the 
OpenLineage provider). This means no runtime overhead for users who don't use 
lineage collection.
   > 
   > ### Motivation
   > Black-box operators (e.g. PythonOperator calling PostgresHook.run(sql)) 
currently produce no lineage. With this change, any registered collector can 
capture the SQL being executed, parse it for input/output datasets, and attach 
query IDs to lineage events - dramatically improving lineage quality without 
requiring operator-level changes.
   > 
   > ### Follow-up PRs
   > * OpenLineage consumer: modify the OL provider to consume these extras, 
parse SQL for datasets, and attach query_id to OL events
   > * BigQueryHook insert_job: mixes SQL and non-SQL lineage, so it will be 
handled in a separate PR.
   > * Additional non-SQL hooks: extend HLL to more hooks beyond SQL
   > 
   > ##### Was generative AI tooling used to co-author this PR?
   > * [x]  Yes (please specify the tool below)
   > 
   > Co-authored by: Cursor following [the 
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)
   > 
   > * Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)**
 for more information. Note: commit author/co-author name and email in commits 
become permanently public when merged.
   > * For fundamental code changes, an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals))
 is needed.
   > * When adding dependency, check compliance with the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   > * For significant user-facing changes create newsfragment: 
`{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in 
[airflow-core/newsfragments](https://github.com/apache/airflow/tree/main/airflow-core/newsfragments).
   
   CI currently fails :(
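
For readers skimming the quoted description, the no-op behaviour and the `add_extra` reporting it mentions can be sketched roughly as below. This is a minimal, self-contained illustration and not Airflow's implementation: `NoOpCollector`, `RecordingCollector`, `SketchSqlHook`, the `"sql_hook_lineage"` key, and the payload shape are all hypothetical names chosen for the example. Only the idea of calling `add_extra` on a collector comes from the description.

```python
from dataclasses import dataclass, field
from typing import Any


class NoOpCollector:
    """Hypothetical stand-in for the case where no collector is registered."""

    def add_extra(self, context: Any, key: str, value: Any) -> None:
        # Nothing is stored, so reporting costs almost nothing.
        return None


@dataclass
class RecordingCollector:
    """Hypothetical stand-in for a registered collector (e.g. a lineage provider)."""

    extras: list = field(default_factory=list)

    def add_extra(self, context: Any, key: str, value: Any) -> None:
        # Keep the reporting hook's class name, the key, and the payload.
        self.extras.append((type(context).__name__, key, value))


class SketchSqlHook:
    """Hypothetical hook; it does not subclass or imitate Airflow's DbApiHook API."""

    def __init__(self, collector: Any) -> None:
        self.collector = collector

    def run(self, sql: str, parameters: Any = None) -> None:
        # A real hook would execute the statement here; the sketch only reports it.
        self.collector.add_extra(
            context=self,
            key="sql_hook_lineage",  # assumed key name, for illustration only
            value={"sql": sql, "parameters": parameters},
        )


recording = RecordingCollector()
SketchSqlHook(recording).run("INSERT INTO t SELECT * FROM s")
SketchSqlHook(NoOpCollector()).run("SELECT 1")  # silently dropped
```

A test in the spirit of the description would assert that `recording.extras` is non-empty after `run()`, so a subclass that overrides `run()` without reporting would make the test fail.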


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
