kaxil commented on issue #67515: URL: https://github.com/apache/airflow/issues/67515#issuecomment-4548272652
Coming in late. I'd like to push back on the framing of this sweep, and propose we wait until Python 3.15 support, which adds [PEP 810](https://peps.python.org/pep-0810/) -- [lazy imports](https://docs.python.org/3.15/whatsnew/3.15.html#whatsnew315-lazy-imports). ### The core problem: most of this is noise An Operator uses a Hook. The Hook imports the SDK. When a task runs, all three get loaded. Moving `from neo4j import GraphDatabase` from `Neo4jHook` module-level into `_create_driver()` doesn't change that chain. It just moves *when* the import happens. The worker that runs the Neo4j task pays the import cost either way: before, at parse; after, at first execute. Total wall-clock, peak memory, pod size: identical. The scheduler also does not process raw Dag files, that's just factually incorrect in Airflow 3 land! The only processes that actually save are ones that import the hook module but never call into the SDK: | Process | Calls into SDK? | Save | |---|---|---| | Dag-processor | No, never executes tasks | ~60 MB RSS, ~1s | | Worker running a non-Neo4j task in a DAG that mentions Neo4jOperator | No | ~60 MB, ~1s | | Worker running the Neo4j task itself | Yes | **0** | On `KubernetesExecutor` (one task per pod), the third row is the common case. End users perceive no change. The dag-processor benefit is real but small. The issue's headline numbers (e.g. 550ms/168 MB for `google`) come from measuring `import the whole provider` in isolation. Real DAG files do `from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator` and pay a fraction of that. And the heaviest google.* imports (`DEFAULT_RETRY`, `QueryJob`, `Row`, `Conflict`) are runtime values, not type-only, so they can't move to `TYPE_CHECKING` without restructuring the operator. ### Fail-fast loss is a real regression Module-level imports surface install/version errors at parse time via the `import_error` table and the UI banner. Moving driver imports into `_create_driver()` defers the error to first task execution, where it lands in worker logs scattered across pods. For the realistic failure mode (version skew between provider and SDK extra), that's a UX regression we're paying for a saving end users don't see. ### PEP 810 is the right path PEP 810 (Explicit Lazy Imports) was accepted Nov 3, 2025, targeting Python 3.15: https://peps.python.org/pep-0810/ & https://docs.python.org/3.15/whatsnew/3.15.html#whatsnew315-lazy-imports Two things we can do without method-internal refactors -- but let's wait until we support Python 3.15: 1. **`__lazy_modules__` polyfill, safe today:** ```python __lazy_modules__ = ['neo4j'] from neo4j import Driver, GraphDatabase # unchanged ``` Pythons <3.15 ignore the attribute. 3.15+ treats the imports as lazy. No mock-path changes, no inline imports, no fail-fast loss on current Pythons. 2. **Global flag in dag-processor (when we adopt 3.15):** `PYTHON_LAZY_IMPORTS=all` plus `sys.set_lazy_imports_filter(func)`. Every provider becomes lazy-loaded with zero per-provider PRs. The keyword form (`lazy from neo4j import GraphDatabase`) isn't usable in provider source until we drop <3.15, realistically 2028+. ### What I'd suggest 1. **Land `TYPE_CHECKING` guards on type-only imports.** Pure win, no trade-off. 2. **Lazy-load only conditionally-used deps** -- e.g. `pandas_gbq` inside `BigQueryHook.get_pandas_df`. The fail-fast loss is scoped to "users who never call this method never discover the install is broken." 3. **Skip the broad inline-import sweep.** The Operator-Hook-SDK chain means it's mostly cost-shifting, not cost-saving. 4. **Plan dag-processor `-X lazy_imports=all` adoption** with the 3.15 support cycle. Thanks to everyone who's already put work into this, the engineering on the PRs is fine, this is feedback on whether the broader campaign is worth doing. If someone has dag-processor RSS numbers from a representative parse load showing larger savings, I'd reconsider -- but isolated-import benchmarks aren't the right measurement. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
