Hello everyone,

As we work on finishing off the code-level separation of Task SDK and Core 
(scheduler etc) we have come across some situations where we would like to 
share code between these.

However it’s not as straightforward as “just put it in a common dist they both 
depend upon” because one of the goals of the Task SDK separation was to have 
100% complete version independence between the two, ideally even if they are 
built into the same image and venv. Most of the reason why this isn’t 
straightforward comes down to backwards compatibility - if we make a change to 
the common/shared distribution 


We’ve listed the options we have thought about in 
https://github.com/apache/airflow/issues/51545 (but that covers some more 
things that I don’t want to get into in this discussion, such as possibly 
separating operators and executors out of a single provider dist).

To give a concrete example of some code I would like to share, the logging 
config: 
https://github.com/apache/airflow/blob/84897570bf7e438afb157ba4700768ea74824295/airflow-core/src/airflow/_logging/structlog.py
Another thing we will want to share will be the AirflowConfigParser class from 
airflow.configuration (but notably: only the parser class, _not_ the default 
config values; again, let’s not dwell on the specifics of that).

So to bring the options listed in the issue here for discussion, broadly 
speaking there are two high-level approaches: 

1. A single shared distribution
2. No shared package and copy/duplicate code

The advantage of Approach 1 is that we only have the code in one place. The 
downside for me, at least in this specific case of the logging config or the 
AirflowConfigParser class, is that backwards compatibility is much, much harder.

The main advantage of Approach 2 is that the code is released with/embedded in 
the dist (i.e. apache-airflow-task-sdk would contain the right version of the 
logging config and ConfigParser etc). The downside is that either the code will 
need to be duplicated in the repo, or better yet it would live in a single 
place in the repo, but some tooling (TBD) will automatically handle the 
duplication, either at commit time or, my preference, at release time.

For this kind of shared “utility” code I am very strongly leaning towards 
option 2 with automation, as otherwise I think the backwards compatibility 
requirements would make it unworkable (very quickly over time the combinations 
we would have to test would just be unreasonable) and I don’t feel confident we 
can have things as stable as we need to really deliver the version 
separation/independence I want to deliver with AIP-72.
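
To put a rough shape on the "combinations" point, here is a small sketch that 
counts the test matrix under each approach. The version numbers are invented 
purely for illustration and are not real release numbers; the assumption is 
that with a shared dist every supported Core/SDK pair would need testing 
against every supported release of the shared dist, whereas with vendoring 
each dist ships its own copy.

```python
from itertools import product

# Hypothetical supported versions, for illustration only.
core_versions = ["3.1", "3.2", "3.3"]
sdk_versions = ["1.1", "1.2", "1.3"]
shared_versions = ["0.1", "0.2", "0.3", "0.4"]

# Approach 1: a separately versioned shared dist adds a third axis.
shared_dist_matrix = list(product(core_versions, sdk_versions, shared_versions))

# Approach 2: the shared code is embedded in each dist, so only the
# Core/SDK pairs remain.
vendored_matrix = list(product(core_versions, sdk_versions))

print(len(shared_dist_matrix), len(vendored_matrix))  # prints: 36 9
```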

So unless someone feels very strongly about this, I will come up with a draft 
PR for further discussion that will implement code sharing via “vendoring” it 
at build time. I have an idea of how I can achieve this so we have a single 
version in the repo and it’ll work there, but in the shipped dist the code is 
vendored in, so at runtime it lives at something like `airflow.sdk._vendor` etc.
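
As a very rough sketch of what build-time vendoring could look like (this is 
not the tooling from the draft PR, which doesn't exist yet): copy the shared 
package under the dist's `_vendor` package and rewrite its absolute `from` 
imports to point at the vendored location. The function name 
`vendor_package`, the package name `airflow_common` and the directory layout 
used below are all assumptions for illustration only.

```python
import re
import shutil
from pathlib import Path


def vendor_package(
    shared_src: Path, dest_root: Path, top_level: str, vendored_prefix: str
) -> list[Path]:
    """Copy shared_src/top_level under dest_root and rewrite `from` imports.

    Hypothetical helper, not an existing Airflow tool. Returns the files
    whose imports were rewritten.
    """
    source_pkg = shared_src / top_level
    target_pkg = dest_root.joinpath(*vendored_prefix.split("."), top_level)
    if target_pkg.exists():
        shutil.rmtree(target_pkg)  # always start from a clean copy
    shutil.copytree(source_pkg, target_pkg)

    # Only `from <pkg> ...` / `from <pkg>.<sub> ...` statements are handled.
    # Plain `import <pkg>` is left alone, since rewriting it would change
    # the name bound in the importing module.
    pattern = re.compile(
        rf"^([ \t]*)from[ \t]+{re.escape(top_level)}(?=[ \t.])", re.MULTILINE
    )
    replacement = rf"\1from {vendored_prefix}.{top_level}"

    rewritten = []
    for py_file in sorted(target_pkg.rglob("*.py")):
        text = py_file.read_text(encoding="utf-8")
        new_text = pattern.sub(replacement, text)
        if new_text != text:
            py_file.write_text(new_text, encoding="utf-8")
            rewritten.append(py_file)
    return rewritten
```

Called as `vendor_package(Path("airflow-common/src"), Path("task-sdk/src"), 
"airflow_common", "airflow.sdk._vendor")`, this would place the copy at 
`task-sdk/src/airflow/sdk/_vendor/airflow_common/`. A real implementation 
would more likely hook into the build backend and use a proper syntax-aware 
rewrite rather than a regex.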

In terms of repo layout, this likely means we would end up with:

airflow-core/pyproject.toml
airflow-core/src/
airflow-core/tests/
task-sdk/pyproject.toml
task-sdk/src/
task-sdk/tests/
airflow-common/src/
airflow-common/tests/
# Possibly no airflow-common/pyproject.toml, as deps would be included in the 
# downstream projects. TBD.

Thoughts and feedback welcomed.
