> The interface between the SDK and coordinator is currently minimal, and I
can’t imagine it being changed without also changing how Python tasks are
run—which would require a major version bump anyway. Therefore, the
coordinator packages at the moment should be completely forward compatible
for all minor releases on both sides.

I've seen this multiple times, and it is never actually the case. When you
have an API / contract between independently released parties, you need a
mechanism to keep compatibility (like Cadwyn). Unless you have some
mechanism to achieve it, or something to prevent drift, the two sides will
drift for sure, and quickly - if only because we might find security bugs.
For example, we might change how logging is passed; unless we have a clear
API and a mechanism to keep it in sync, there is no "minimal API that does
not change".
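The kind of mechanism I mean can be as small as an explicit version
handshake at startup, so an incompatible pairing fails fast instead of
drifting silently. A minimal, purely illustrative sketch in Python - the
`ApiVersion` class and the version numbers are made up for this example and
are not part of any Airflow package:

```python
# Hypothetical sketch (NOT the actual Task SDK API): a minimal version
# handshake the bridge/coordinator could perform against the task-sdk.
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiVersion:
    major: int
    minor: int

    def is_compatible_with(self, server: "ApiVersion") -> bool:
        # Same major version required; the server (task-sdk) must offer at
        # least the minor-version features the client (bridge) was built for.
        return self.major == server.major and self.minor <= server.minor


SDK_API_VERSION = ApiVersion(2, 2)    # what this task-sdk release speaks
BRIDGE_BUILT_FOR = ApiVersion(2, 1)   # what the bridge was released against

if not BRIDGE_BUILT_FOR.is_compatible_with(SDK_API_VERSION):
    # Fail fast with a clear message instead of drifting silently.
    raise RuntimeError(
        f"bridge speaks API {BRIDGE_BUILT_FOR}, task-sdk offers {SDK_API_VERSION}"
    )
```

Whether it is a handshake like this or a full versioning layer like Cadwyn
is secondary; the point is that *some* explicit mechanism has to exist.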


> Bundling the first implementation entirely in the SDK is IMO asking for
more trouble since you would need to keep the implementation in it until
SDK 2.0 (perhaps with a deprecation warning). The providers were initially
bundled in apache-airflow since the actual classes already existed prior to
them being put in airflow.providers, and are only MOVED in apache-airflow
releases. In this situation, the coordinator implementation does not exist
in apache-airflow-sdk yet, and IMO it would be more productive to release
the coordinators separately than to ADD it.

Not really. I proposed an internal package that can be changed **any
time**. Users aren't supposed to use those items. We can clearly mark them
with "_" and also describe them thoroughly in the public API documentation.
And no, initially providers were **not** in Airflow at all - you started
from step 2. Step 1 is that they were added at some point in time, long
before my time. Hooks and operators as an "API" were created quite early in
the concept of Airflow, and the first implementations were added then.
Then, after common patterns emerged, those hooks and operators were grouped
into providers (they were not grouped initially), and only moved out after
quite some time. As I see it, you even admit yourself that things will look
different for different languages, and maybe we will not even need bridges
for some of them at all. So why should we introduce a new concept if we
already know it currently applies only to the "JDK"? I fail to see why we
should proceed if we already know the patterns are unlikely to be reusable
in their current form.

> Logs emitted by user code in Python tasks are NOT JSON RIGHT NOW. They
are WRAPPED in a JSON format in the task runner, but the logs emitted by
the user themselves are plain strings. Emitting JSON logs from Java is
incompatible with the existing Task SDK implementation. It’s not whether
it’s easy or not, it’s not within the Java SDK or the coordinator’s
decision to change this.

If Ash and Amogh, who wrestled with the new logs, are fine with it, that's
also fine with me. But I would love to hear from them that this is indeed
the case.

*Ash, Amogh* - what do you think? Are you OK with logs being exchanged with
the task-sdk via stdin/stdout? If yes, I am fine as well.


>> OL is mostly not an issue since Airflow’s integration is mostly at the
operator and hook level, but the Java SDK does not have those concepts. We
can introduce interfaces to help users emit their own lineage info from
Java code, but again that should be added later.
>
> Similarly, I'm unsure if OpenLineage requires specific metadata to be
passed as all necessary information might already be present in the context
sent through the bridge. All I'm saying is we should talk to Maciej and
Kacper and drop a paragraph of what is needed (or explicitly state that we
considered it and here is the answer).

> From what I can tell, the task runner does not pass anything through into
the child (task code) process. It does add context when it receives things
from the child, but that is entirely isolated from the coordinator and
child process’s control. I may be wrong, but if the relevant people who
understand the topic do not speak up, we assume they are OK as-is.

I guess it is easy to just ask - we know who they are:


*Daniel, Christos,*

Would we need to pass some special metadata to such a Java "subprocess" in
the worker for OTel to be able to emit metrics from such Java tasks (and is
it needed at all)?

*Maciej, Kacper,*

How about you? Is something "special" needed to enable OpenLineage data to
be published from such a Java subprocess? Is it needed?
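To make the question concrete, this is the kind of "pass-through" I have in
mind: the Python bridge forwarding whatever observability context it has
into the Java subprocess's environment. The variable names below are
hypothetical, invented for this sketch - they are not a proposed contract
for either OTel or OpenLineage:

```python
# Hypothetical sketch: forward trace/lineage context to the Java child
# process via environment variables. All variable names are illustrative.
import os
import subprocess


def build_task_env(context: dict[str, str]) -> dict[str, str]:
    """Merge observability context into the child process environment."""
    env = dict(os.environ)
    env["AIRFLOW_TRACE_ID"] = context.get("trace_id", "")      # OTel span correlation
    env["AIRFLOW_SPAN_ID"] = context.get("span_id", "")
    env["OPENLINEAGE_JOB_NAME"] = context.get("ol_job", "")    # OpenLineage job identity
    return env


def run_java_task(jar_path: str, task_id: str, context: dict[str, str]) -> int:
    # The bridge itself stays a dumb pass-through: it launches `java` and
    # hands over the context; the Java side decides what to do with it.
    cmd = ["java", "-jar", jar_path, "--task", task_id]
    return subprocess.run(cmd, env=build_task_env(context)).returncode
```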




On Wed, May 13, 2026 at 6:00 AM Tzu-ping Chung via dev <
[email protected]> wrote:

>
>
> > On 12 May 2026, at 20:48, Jarek Potiuk <[email protected]> wrote:
> >
> > First of all, changing "coordinator" to "bridge," which I proposed is
> not something I strongly favor; it simply allows me to build a better
> mental model and understand it better (because I now see it really as a
> "Java bridge"). I will use "coordinator/bridge" later; we can decide on the
> actual name and packaging approach once we fully understand the scope.
> >>
> >> There is still a concept of parsing in JARs since we still need a
> mechanism for the Java code to define which task is being implemented. This
> is still called Dag and Task in the Java SDK to match the terminology in
> Python. But Airflow does not find Airflow dags in JARs. I’ll edit out the
> DAG File reference you mentioned (and some other use of “DAG parsing” to
> avoid confusion since this is not the same concept in the Airflow dag
> processor sense).
> >
> > Yeah. - but - if I understand correctly, the "coordinator/bridge" simply
> passes the "context" and "task" information to run. If I read it correctly,
> from Airflow's perspective, the "runner" simply starts the jar and tells it
> to "execute that task." The JAR contains all the code necessary to map the
> task name to the actual class that should run. If I understand correctly,
> the Scheduler, Parser, and even the Python worker running on any executor
> know nothing about the mapping of classes and task names to the specific
> Java class to run. How this code works, how the bridge works, and how it
> passes task-SDK messages are purely implementation details of the "Java
> Bridge." Do we have no plans to use this same architecture for other
> languages now? It **may** happen later, but this is not the goal for now?
> Does the presence of "Dag" and "Task" in Java merely serve to identify
> which class should run the task?
> >
> > Is that the correct understanding?
>
> Yes your understanding is correct. Whether it’s suitable for other
> languages—I am not entirely sure, to be honest. “Finding something to
> execute” can be quite different across languages (more accurately
> runtimes). For example, Go (likely the next SDK to use the coordinator
> architecture) compiles directly to a standalone binary, so you won’t have
> an environment like in Python or Java to point to, nor an equivalent to JAR
> metadata for the coordinator to identify “this binary contains tasks”
> without trying to execute it.
>
> There’s no doubt it CAN work somehow (worst case we require some
> additional configs on the coordinator), but it’s unclear HOW, so the
> details are currently left out for each coordinator to figure out. We can
> extract common patterns later.
>
>
> >
> >
> >> The JAR is not stored in a serialised dag (nor does a JAR contain
> serialised dags). It is configured in the worker environment. (JARs are
> similar to executing Python packages; the 'java' command knows where to
> look, or you can specify it in the coordinator with arguments.)
> >
> > I was referring more to the JAR path, not the JAR itself? In the
> previous (now removed) version of the AIP we had this:
> >
> >     ┌──────────────▼───────────────┐
> >     │  Metadata DB                 │
> >     │                              │
> >     │  serialized_dag: {           │  Stored as-is from the language
> runtime's
> >     │    "relative_fileloc":       │  SDK Serialized DAG JSON
> >     │       "path/to/example.jar"  │
> >     │  }                           │
> >     │  task_instance.queue         │
> >     └──────────────┬───────────────┘
> >
> > So my question about the JAR path stemmed from that. But I understand
> now, it's gone entirely. So I understand now that we completely remove any
> JAR reference and decouple from how compiled Java classes are made
> available to the "java" command executed by the subprocess.Popen (or
> whatever way we call it from the Python coordinator/bridge code)?
>
> The Java classes (how the tasks are implemented) are completely decoupled
> from the dag structure, yes. (Hopefully I understand what you mean above
> correctly.)
>
>
> >>
> >> The above only happens on execution time, so the JARs (technically)
> don’t need to be in the dag bundle, only somewhere in the execution
> environment that the coordinator can locate. In practice maybe it should be
> included in the bundle since the implementation affects dag versioning, but
> on the other hand a JAR may be difficult to version since it is
> fundamentally a derived binary (technically an extension to ZIP).
> Personally I think we need some real world user experience to know the best
> approach here, so the AIP does not provide an opinion on this topic. The
> best practice needs to be researched and codified later.
> >
> > So, does this mean users will have to distribute and share their Java
> code completely independently from what they are used to with DAGs now? I
> think this should be explicitly spelled out as a design decision because
> it's not something our users are used to. It's more akin to distributing
> packaged plugins currently, and I think we need to help our users map this
> into their mental model. So far, users had the "DAG" in a "bundle," and
> everything they needed to run was in that "bundle". Now they will need both
> "Python" code for the Dag in the bundle and "compiled Java JARs" elsewhere,
> distributed differently during deployment.  I have no problem with that as
> long as we clearly define it as a deployment model change for Dags.
>
> I can add a line to non-goals to specify AIP-108 does not cover deployment.
>
>
> >
> >> I don’t agree with putting the package under the airflow.sdk namespace
> since it complicates distribution a lot down the road. We already have
> interest for other languages, and the airflow.sdk namespace already
> contains a lot of Python SDK things as-is, it is not a good idea to also
> open it up for potential language runtimes. Furthermore, the coordinator
> layer is not a part of the DAG authoring interface, which airflow.sdk is
> supposed to contain. You could maybe argue it can go into a nested package
> such as airflow.sdk.execution_time.coordinators, but that is just too
> unnecessarily long. A separate package i.e. airflow.coordinators is the way
> to go in my opinion.
> >
> > To be honest, the Python package name is secondary. I think more
> important is the question of compatibility and related coupling - and
> question if increasing operational complexity (by introducing new types of
> distributions **today** as opposed to **when needed**) is justified. We
> should first agree on the relation between "task-sdk" and "java
> coordinator/bridge" code before deciding on "distribution names" and
> "python packages". Let me simplify it with an example:
> >
> > * task-sdk 2.1 <-> bridge/coordinator 1.1
> > * task-sdk 2.2 <-> bridge/coordinator 1.2 (is 1.1 possible as well?)
> >
> > What's the relation? Do we assume that the bridge/coordinator and the
> task-sdk will always need to be released and pinned together?
> >
> > Or should we handle the situation where "task-sdk 2.2" is used with
> "bridge/coordinator 1.1," or perhaps where "task-sdk 2.1" is used with
> "bridge/coordinator 1.2"?
> >
> > In other words - how strongly coupled is the bridge/coordinator with
> task-sdk? Do we have a compatibility mechanism - or (for now) do we assume
> we always release them in sync and pin with ==? Or do we have any plans to
> do it (like we do with Cadwyn on airflow-core <> task-sdk)?
> >
> > I am not against separate distributions (as you might be aware); I just
> want to understand what coupling we have and whether splitting it out
> offers any benefits. Yes, it might make task-sdk smaller, but not by much.
> We are only saving a bit of disk space for unused Python classes if there
> is such a strong coupling and no plans to decouple it in the mid-term. And
> we save the operational complexity of having to build and release even more
> - different kinds of packages.
> >
> > Also quoting your own:
> >
> > > The best practice needs to be researched and codified later.
> >
> > Precisely - embedding the "coordinator/bridge" in task-sdk allows us
> to do that, without introducing new concepts that are not yet well defined.
> Unless we want to build compatibility and allow installing different
> versions of those distributions, there is virtually no benefit to
> splitting, IMHO. We can still gain the same learning and see what works,
> what we need to fix much faster - with fewer distributions and less
> operational complexity. We can then implement the "coordinator" concept
> learning from that experience while already handling multiple languages
> (which can also be added more simply by following the same "embedded"
> way).  I often have this saying: "In order to make things reusable, we have
> to make them usable first." This is a classic example of it—we seem to be
> trying to implement a new concept with full-blown separate distributions,
> naming etc. without actually knowing how it will work, especially when
> generalized to different languages.
> >
> > This is exactly what was done "right" with providers: we kept them in
> the main "apache-airflow" package as a single monolith distribution - for
> quite some time, only separating them into "provider distributions" after
> the concept matured and we had many providers in the single distribution -
> that is, exactly when we had figured out how to decouple them, what
> compatibility we would provide, and how. I have a feeling that in this
> case we are trying too hard to generalize things that we do not even
> have yet.
> >
> > We can make it "airflow.sdk._bridge.jdk" and call it a day. And refactor
> it later into a coordinator distribution when we also have
> "airflow.sdk._bridge.golang" and "airflow.sdk._bridge.typescript" - and we
> can change it all at any time in the future, even changing the name,
> etc.—because this will be coupled with the "sdk" version anyway, and it's
> an internal detail for the users.
>
> The interface between the SDK and coordinator is currently minimal, and I
> can’t imagine it being changed without also changing how Python tasks are
> run—which would require a major version bump anyway. Therefore, the
> coordinator packages at the moment should be completely forward compatible
> for all minor releases on both sides.
>
> Bundling the first implementation entirely in the SDK is IMO asking for more
> trouble since you would need to keep the implementation in it until SDK 2.0
> (perhaps with a deprecation warning). The providers were initially bundled
> in apache-airflow since the actual classes already existed prior to them
> being put in airflow.providers, and are only MOVED in apache-airflow
> releases. In this situation, the coordinator implementation does not exist
> in apache-airflow-sdk yet, and IMO it would be more productive to release
> the coordinators separately than to ADD it.
>
>
> >
> >>
> >> The supervisor process does use structured logging, but logs emitted
> from USER CODE are basically plain strings. Technically it’s wrapped in a
> JSON to include some extra information (file location, stdout or stderr,
> etc.) but ultimately the user can only emit plain strings to Airflow.
> >>
> >> Java has its own logging architecture, but Airflow can only care about
> what Airflow understands. We can discuss what the best practice should be
> when you write Java tasks, but this AIP shouldn’t go beyond stdout and
> stderr streams.
> >
> > I think we need the same source code/location information from day one
> for operationalizing it. However, that does not change the concept and can
> be seen as an implementation detail. Adding a Java logger that emits JSON
> in the same format as the Python code should be **easy**. I am simply
> afraid we are trying to rediscover the same buffering stdin/stdout issues
> that caused us to change stdout/stderr to line-delimited JSON in
> "task/supervisor". We had, I believe, very good reasons for doing it. And I
> think it does not add more complexity. But yes - this might also be seen as
> an internal detail of the "Java bridge" implementation so I agree it does
> not have to be part of the AIP.
>
> Logs emitted by user code in Python tasks are NOT JSON RIGHT NOW. They are
> WRAPPED in a JSON format in the task runner, but the logs emitted by the
> user themselves are plain strings. Emitting JSON logs from Java is
> incompatible with the existing Task SDK implementation. It’s not whether
> it’s easy or not, it’s not within the Java SDK or the coordinator’s
> decision to change this.
>
>
> >>
> >> Even in the pure Python scenario, the supervisor layer does not receive
> OTel messages; they are sent directly from the worker process. The question
> would be entirely about the interface, which we can always add later.
> >
> > Correct. I agree it's not something the bridge should "handle."  However
> I think the "worker" process receives some information from the supervisor
> (span IDs?..) that allow the emitted metrics to be properly matched with
> spans. So the bridge is more of a "pass-through". I think all the AIP needs
> to state, "Yes, we thought about it; we will pass all the necessary
> information for OTEL to work well and here are the items we are passing:
> ......".
> >
> >>
> >> OL is mostly not an issue since Airflow’s integration is mostly at the
> operator and hook level, but the Java SDK does not have those concepts. We
> can introduce interfaces to help users emit their own lineage info from Java
> code, but again that should be added later.
> >
> > Similarly, I'm unsure if OpenLineage requires specific metadata to be
> passed as all necessary information might already be present in the context
> sent through the bridge. All I'm saying is we should talk to Maciej and
> Kacper and drop a paragraph of what is needed (or explicitly state that we
> considered it and here is the answer).
>
> From what I can tell, the task runner does not pass anything through into
> the child (task code) process. It does add context when it receives things
> from the child, but that is entirely isolated from the coordinator and
> child process’s control. I may be wrong, but if the relevant people who
> understand the topic do not speak up, we assume they are OK as-is.
>
>
> TP
>
>
>
