> My discussion was not so much about this specific use case of the logging
config, but of sharing code in general which we know we need to do, as we
currently have imports from core to sdk and from sdk to core, and we need
to break that up.

Yep, I understand that - and my answer (which is what I tried to
explain) is that it depends on what we are trying to share, and that the
answer might be different in different cases. That was essentially what
I wrote.

On Wed, Jul 2, 2025 at 6:32 PM Ash Berlin-Taylor <a...@apache.org> wrote:

> My discussion was not so much about this specific use case of the logging
> config, but of sharing code in general which we know we need to do, as we
> currently have imports from core to sdk and from sdk to core, and we need
> to break that up.
>
> -a
>
> > On 2 Jul 2025, at 14:44, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > I think the answer **might** be different for different functionalities.
> > For example, I see that "config" is likely a better candidate for a
> > "shared distribution" than "logging": for config we could use option 1,
> > whereas for logging we could use a form of option 2.
> >
> > On a high level (and looking at the code): how much of the logging code
> > actually needs to be shared? I do not think much.
> >
> > Some assumptions:
> >
> > 1) For the regular scheduler, triggerer, api_server etc. "application"
> > logs - I would say we should absolutely minimise our own code. There are
> > plenty of logging utilities that enhance the standard logging framework
> > and I doubt we can do a better job than them - even in integration with
> > external logging systems. We do not **really** need shared logging code
> > for "task logs" and "application logs". We already cannot share it now -
> > our "log handlers" implemented in various providers and our "remote
> > logging" are only "task oriented". And there is - I think - no particular
> > reason to change that. So I'd say both the configuration and the
> > implementation of "task logs" and "application logs" can be done using
> > completely different libraries and even have completely different
> > configuration - there is virtually no overlap or shared code there as
> > far as I can see.
> >
> > 2) Task logs are a completely different story. Here we want custom code
> > and full control; structlog is a good choice and we should bank more on
> > it. But when you look closely, we really need two things there:
> >
> > a) the ability to write logs from workers and triggerers (where task.sdk
> > will be (and should be) the only dependency)
> > b) the ability to read logs in the UI (where more / airflow dependencies
> > are needed)
> >
> > And while those two have a "data structure" in common, the reading and
> > writing code can be completely separate. I will base the example on the
> > Amazon provider, which also provides the CloudWatch integration (one of
> > our log handlers).
> >
> > 1) We could have log writing (the current write part of the log handler)
> > in "task-sdk" + "amazon-provider".
> > 2) We could keep log writing in providers, and extract log reading into
> > new distributions for remote logs - say a
> > "providers-airflow-amazon-logging" distribution with the minimal number
> > of dependencies that the "ui" needs to read Amazon logs (and do the same
> > with the other log handlers).
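To make the write/read split concrete, here is a minimal runnable sketch. All names in it (`remote_log_key`, `write_task_log`, `read_task_log`) are hypothetical illustrations, not actual Airflow APIs; the point is that the only thing the two sides would share is the key/path convention:

```python
"""Sketch of the write/read split for remote task logs.

All names here are hypothetical illustrations, not actual Airflow
APIs. The writer and reader share only the log-path convention (the
"data structure" mentioned above), not any implementation code.
"""


def remote_log_key(dag_id: str, run_id: str, task_id: str, try_number: int) -> str:
    """The shared convention - essentially the whole contract between sides."""
    return f"dag_id={dag_id}/run_id={run_id}/task_id={task_id}/attempt={try_number}.log"


def write_task_log(store: dict, dag_id: str, run_id: str, task_id: str,
                   try_number: int, lines: list) -> None:
    """Writer side: would ship with task-sdk + the provider.

    A dict stands in for the remote store (CloudWatch/S3/...) so the
    sketch is runnable without any provider dependencies.
    """
    store[remote_log_key(dag_id, run_id, task_id, try_number)] = "\n".join(lines)


def read_task_log(store: dict, dag_id: str, run_id: str, task_id: str,
                  try_number: int) -> list:
    """Reader side: would ship in a separate, UI-facing distribution."""
    raw = store.get(remote_log_key(dag_id, run_id, task_id, try_number), "")
    return raw.splitlines()
```

Because the contract is just the key layout, strict compatibility rules would only need to cover that one small convention, not two codebases.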
> >
> > Very little code from those two implementations would have to be copied,
> > and we could define very strict rules to make sure that the data structs
> > / paths etc. between the readers and writers remain compatible. It also
> > has the nice effect that we could make sure the "ui" does not need
> > providers at all. If we do it this way, pass rendered links via the
> > database (we already do), and figure out a simple way of sharing
> > connection definitions (via the database) - we could very easily make
> > the "api_server" "provider-less".
> >
> > The problem is - of course - the Triggerer and the Dag Processor,
> > because if we have a single "airflow-core" package that should also run
> > the triggerer, the triggerer already has to "also" depend on "task.sdk"
> > and "providers" (and that one is not as easy to get rid of). And both
> > have to have some way of sending logs that will be "task" and "dag"
> > oriented and visible in the UI (the Triggerer more than the Dag
> > Processor, I think).
> >
> > But that can be done in one of two ways:
> >
> > * (option 1) - separating out the Triggerer / Dag Processor into
> > separate packages, assuming that a lot of the triggerer / dag processor
> > code is independent from other parts of the Airflow code. I am not sure
> > we want that, or that it is doable now (it would likely necessitate
> > another common package shared between airflow-core and
> > airflow-triggerer). Maybe that can be a long-term solution, but not
> > likely something we want to invest in now.
> > * (option 2) - continuing what we do today - making sure that
> > "airflow-core" + "task.sdk" + "providers" + (new) "log providers" can
> > all be installed together. Option 2 will place even more importance on
> > our CI / dependency scrutiny - including constraints generation (but we
> > have made great progress on that, and I think with the new tooling it's
> > quite a possible option).
> >
> > That would be my proposal for logging. Note that my thinking is
> > different for configuration. I think configuration could rather easily
> > be moved to a "common.config" package - we could likely agree on and
> > take care of backwards compatibility there etc. And for other "parts" of
> > the code that we want to share, the answer might be different as well.
> >
> > J.
> >
> >
> >
> > On Wed, Jul 2, 2025 at 2:36 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> >
> >> Hello everyone,
> >>
> >> As we work on finishing off the code-level separation of Task SDK and
> >> Core (scheduler etc.) we have come across some situations where we
> >> would like to share code between the two.
> >>
> >> However it’s not as straightforward as “just put it in a common dist
> >> they both depend upon”, because one of the goals of the Task SDK
> >> separation was to have 100% complete version independence between the
> >> two, ideally even if they are built into the same image and venv. Most
> >> of the reason why this isn’t straightforward comes down to backwards
> >> compatibility - if we make a change to the common/shared distribution,
> >> it has to keep working with every combination of Core and Task SDK
> >> versions that depends on it.
> >>
> >>
> >> We’ve listed the options we have thought about in
> >> https://github.com/apache/airflow/issues/51545 (but that issue covers
> >> some more things that I don’t want to get into in this discussion, such
> >> as possibly separating operators and executors out of a single provider
> >> dist.)
> >>
> >> To give a concrete example of some code I would like to share:
> >> https://github.com/apache/airflow/blob/84897570bf7e438afb157ba4700768ea74824295/airflow-core/src/airflow/_logging/structlog.py
> >> — the logging config. Another thing we will want to share is the
> >> AirflowConfigParser class from airflow.configuration (but notably: only
> >> the parser class, _not_ the default config values; again, let’s not
> >> dwell on the specifics of that).
> >>
> >> So to bring the options listed in the issue here for discussion, broadly
> >> speaking there are two high-level approaches:
> >>
> >> 1. A single shared distribution
> >> 2. No shared package and copy/duplicate code
> >>
> >> The advantage of Approach 1 is that we only have the code in one place.
> >> However for me, at least in this specific case of the logging config or
> >> the AirflowConfigParser class, the problem is that backwards
> >> compatibility becomes much, much harder.
> >>
> >> The main advantage of Approach 2 is that the code is released
> >> with/embedded in the dist (i.e. apache-airflow-task-sdk would contain
> >> the right version of the logging config, ConfigParser etc.). The
> >> downside is that either the code will need to be duplicated in the
> >> repo, or, better yet, it would live in a single place in the repo and
> >> some tooling (TBD) would automatically handle the duplication, either
> >> at commit time or, my preference, at release time.
> >>
> >> For this kind of shared “utility” code I am very strongly leaning
> >> towards Approach 2 with automation, as otherwise I think the backwards
> >> compatibility requirements would make it unworkable (very quickly the
> >> combinations we would have to test would become unreasonable) and I
> >> don’t feel confident we can keep things as stable as we need to really
> >> deliver the version separation/independence I want to deliver with
> >> AIP-72.
> >>
> >> So unless someone feels very strongly about this, I will come up with a
> >> draft PR for further discussion that will implement code sharing via
> >> “vendoring” it at build time. I have an idea of how I can achieve this
> >> so that we have a single version in the repo and it will work there,
> >> but in the shipped dist we vendor it in, so that at runtime it lives at
> >> something like `airflow.sdk._vendor` etc.
> >>
> >> In terms of repo layout, this likely means we would end up with:
> >>
> >> airflow-core/pyproject.toml
> >> airflow-core/src/
> >> airflow-core/tests/
> >> task-sdk/pyproject.toml
> >> task-sdk/src/
> >> task-sdk/tests/
> >> airflow-common/src
> >> airflow-common/tests/
> >> # Possibly no airflow-common/pyproject.toml, as deps would be included
> >> # in the downstream projects. TBD.
> >>
> >> Thoughts and feedback welcomed.
>
>