My discussion was not so much about this specific use case of the logging 
config, but about sharing code in general, which we know we need to do: we 
currently have imports from core to sdk and from sdk to core, and we need to 
break that up.
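
As an illustration only, the build-time "vendoring" floated at the end of the
quoted mail below could be sketched roughly like this. This is a hypothetical
sketch, not the actual implementation: the `airflow_common` source package,
the `rewrite_imports` helper, and the `airflow.sdk._vendor` target are all
assumptions, not existing code.

```python
# Hypothetical sketch of build-time vendoring: copy a shared source tree
# into the dist under airflow.sdk._vendor, rewriting import paths as we go.
# The package names "airflow_common" and "airflow.sdk._vendor" are
# assumptions for illustration.
import re
from pathlib import Path


def rewrite_imports(source: str) -> str:
    """Point imports of the shared package at its vendored location."""
    return re.sub(r"\bairflow_common\b", "airflow.sdk._vendor", source)


def vendor_tree(src_root: Path, dest_root: Path) -> None:
    """Copy every shared module into the dist, rewriting its imports."""
    for path in src_root.rglob("*.py"):
        dest = dest_root / path.relative_to(src_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(rewrite_imports(path.read_text()))
```

In practice something like this would hang off a build hook so the repo keeps
a single copy while each released dist gets its own embedded one.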

-a

> On 2 Jul 2025, at 14:44, Jarek Potiuk <ja...@potiuk.com> wrote:
> 
> I think the answer **might** be different for different functionalities.
> For example I see that "config" is likely a better candidate for "shared
> distribution" than "logging"; for config we could use option 1, whereas for
> logging we could use a form of option 2.
> 
> On a high level (and looking at the code): how much of the logging code
> actually needs to be shared? I do not think much.
> 
> Some assumptions:
> 
> 1) For the regular scheduler, triggerer, api_server etc. "application" logs,
> I would say we should absolutely minimise our own code. There are plenty of
> logging utilities that enhance the standard logging framework, and I doubt
> we can do a better job than them - even in integration with external
> logging systems. We do not **really** need shared logging code for "task
> logs" and "application logs". We already cannot share it now - our "log
> handlers" implemented in various providers and our "remote logging" are
> "task oriented" only. And there is - I think - no particular reason to
> change that. So I'd say both configuration and implementation of "task
> logs" and "application logs" can be implemented using completely different
> libraries and even have completely different configuration - there is
> virtually no overlap or shared code there as far as I can see.
> 
> 2) Task logs are a completely different story. Here we want custom
> code and full control; structlog is a good choice and we should bank more
> on it. But when you look closely, we really need two things there:
> 
> a) ability to write logs from workers and triggerers (where task.sdk will be
> (and should be) the only dependency)
> b) ability to read logs by the UI (where more / airflow dependencies are
> needed)
> 
> And while those two have a "data structure" in common, the reading and
> writing code can be completely separate. I will base the example on the
> Amazon provider, which also provides CloudWatch integration (one of the log
> handlers).
> 
> 1) We could have log writing (the current write part of the log_handler) in
> "task-sdk" + "amazon-provider".
> 2) We could keep writing logs in providers, and extract reading logs into
> new distributions for remote logs - say a "providers-airflow-amazon-logging"
> distribution with the minimal number of dependencies that the "ui" needs
> to read Amazon logs (and do the same with the other log handlers).
> 
> Very little code from those two implementations would have to be copied, and
> we could define very strict rules to make sure that the data structs/paths
> etc. between the readers and writers remain compatible. It also has the
> nice effect that we could make sure the "ui" does not need providers at all.
> If we do it this way, pass rendered links via the database (we already do),
> and figure out a simple way of sharing connection definitions (via the
> database), we could very easily make the "api_server" "provider-less".
> 
> The problem is - of course - the Triggerer and Dag Processor, because if we
> have a single "airflow-core" package that should also run the triggerer, the
> triggerer already has to "also" depend on "task.sdk" and "providers" (and
> the latter is not as easy to get rid of). And both have to have some way of
> sending logs that will be "task" and "dag" oriented and visible in the UI
> (the Triggerer more than the Dag Processor, I think).
> 
> But that can be done in one of two ways:
> 
> * (option 1) - separating the Triggerer / Dag Processor out into separate
> packages, assuming that a lot of triggerer / dag processor code is
> independent of the rest of the Airflow code. I am not sure we want that,
> and doing it now would likely necessitate yet another common package
> shared between airflow-core and airflow-triggerer. Maybe that can be a
> long-term solution, but it is not likely something we want to invest in now.
> * (option 2) - continuing what we do today - making sure that
> "airflow-core" + "task.sdk" + "providers" + (new) "log providers" can all
> be installed together. Option 2 will put even more weight on our CI /
> dependency scrutiny - including constraints generation (but we have made
> great progress on that, and I think with the new tooling it is quite a
> possible option).
> 
> That would be my proposal for logging. Note that my thinking is different
> for configuration. I think configuration could be rather easily moved to a
> "common.config" package - we could likely agree on and take care of
> backwards compatibility there etc. And for other "parts" of code that we
> want to share, the answer might be different as well.
> 
> J.
> 
> 
> 
> On Wed, Jul 2, 2025 at 2:36 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> 
>> Hello everyone,
>> 
>> As we work on finishing off the code-level separation of Task SDK and Core
>> (scheduler etc) we have come across some situations where we would like to
>> share code between these.
>> 
>> However it’s not as straightforward as “just put it in a common dist they
>> both depend upon”, because one of the goals of the Task SDK separation was
>> to have 100% complete version independence between the two, ideally even if
>> they are built into the same image and venv. Most of the reason why this
>> isn’t straightforward comes down to backwards compatibility - if we make
>> a change to the common/shared distribution, it has to remain compatible
>> with every version of Core and the Task SDK it might be installed alongside.
>> 
>> 
>> We’ve listed the options we have thought about in
>> https://github.com/apache/airflow/issues/51545 (but that covers some more
>> things that I don’t want to get into in this discussion such as possibly
>> separating operators and executors out of a single provider dist.)
>> 
>> To give a concrete example of some code I would like to share
>> https://github.com/apache/airflow/blob/84897570bf7e438afb157ba4700768ea74824295/airflow-core/src/airflow/_logging/structlog.py
>> — logging config. Another thing we will want to share will be the
>> AirflowConfigParser class from airflow.configuration (but notably: only the
>> parser class, _not_ the default config values; again, let’s not dwell on the
>> specifics of that)
>> 
>> So to bring the options listed in the issue here for discussion, broadly
>> speaking there are two high-level approaches:
>> 
>> 1. A single shared distribution
>> 2. No shared package and copy/duplicate code
>> 
>> The advantage of Approach 1 is that we only have the code in one place.
>> However, for me, at least in this specific case of the logging config or
>> the AirflowConfigParser class, the downside is that backwards compatibility
>> is much, much harder.
>> 
>> The main advantage of Approach 2 is that the code is released with/embedded
>> in the dist (i.e. apache-airflow-task-sdk would contain the right version
>> of the logging config and ConfigParser etc). The downside is that either
>> the code will need to be duplicated in the repo, or, better yet, it would
>> live in a single place in the repo but some tooling (TBD) would
>> automatically handle the duplication, either at commit time or, my
>> preference, at release time.
>> 
>> For this kind of shared “utility” code I am very strongly leaning towards
>> Approach 2 with automation, as otherwise I think the backwards compatibility
>> requirements would make it unworkable (very quickly over time the
>> combinations we would have to test would just be unreasonable), and I don’t
>> feel confident we can keep things as stable as we need to really deliver
>> the version separation/independence I want to deliver with AIP-72.
>> 
>> So unless someone feels very strongly about this, I will come up with a
>> draft PR for further discussion that will implement code sharing via
>> “vendoring” it at build time. I have an idea of how I can achieve this so
>> that we have a single copy in the repo and it’ll work there, while in the
>> shipped dist we vendor it so it lives at something like
>> `airflow.sdk._vendor` etc.
>> 
>> In terms of repo layout, this likely means we would end up with:
>> 
>> airflow-core/pyproject.toml
>> airflow-core/src/
>> airflow-core/tests/
>> task-sdk/pyproject.toml
>> task-sdk/src/
>> task-sdk/tests/
>> airflow-common/src
>> airflow-common/tests/
>> # Possibly no airflow-common/pyproject.toml, as deps would be included in
>> the downstream projects. TBD.
>> 
>> Thoughts and feedback welcomed.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org
