Re: [DISCUSS] Java SDK and the Language Coordinator Layer [was: LAZY CONSENSUS]

Zhe-You(Jason) Liu Mon, 27 Apr 2026 20:13:31 -0700

Hi Jens,

The ADRs are now available here [1] , hope that helps clarify some of your
concerns.


> As I understood Java is static compiled JAR files, no on-the-fly compile
from Java source tree (correct?)

That's correct -- the user will compile themselves and set the `[java]
bundles_folder` config to point to the directory of those JAR files.

> so actually the Dag parsing concept then is quite "static" and once
generated actually no need to re-parse the Dag in Java mode?
> Still if static JAR deployed how does the deploy lifecacly look like?
Would you need to restart with new deploy the Dag Parsr and respective
workers?

The lifecycle for dag-parsing will be the same as how the current
`DagFileProcessorProcess` acts. The coordinator comes into play before we
start the actual parse file entrypoint [2]. Regardless of what language the
parse file subprocess is implemented in, as long as it returns a valid
serialized Dag JSON in msgpack over IPC, the behavior will remain the same.
So there is no need to restart the `airflow dag-processor`, and the
`airflow worker` will not be involved in dag processing at all.

> Is it really realistic that a "LocalExecutor" needs to be supported or
can we limit it to e-g- only remote executors to reduce coupling and
complexity of operating the core?

The coordinator is the interface that decides how we want to launch the
subprocess for both dag-processing and workload-execution. This means it
will support **any** executor out of the box, as we integrate the
coordinator at the TaskSDK level for workload-execution [3].

> What overhead does the "Coordinator layer" generate compared to a Java
specific supervisor implementation?

The "Coordinator layer" is the interface for Airflow-Core to interact with
the target language subprocess. We still need a Java-specific supervisor
implementation, which is the first PR I mentioned in another thread -- Java
SDK [4].

> Is it not only a new / additional process but also IPC involved then. And
at least I saw also performance problems e.g. using very large XComs where
even heartbeats are lost due to long running IPC

I would consider this out of scope of the "Java SDK and the Coordinator
Layer" AIP, as the current Python-native TaskSDK supervisor will encounter
the same issue. The multi-language support here follows the same protocol
that the current TaskSDK uses.

[1]
https://github.com/apache/airflow/pull/65956/changes/876179ab55d3b31486ee52f4c27abd4e215b0fd0
[2]
https://github.com/apache/airflow/pull/65958/changes#diff-564fd0a8fbe4cc47864a8043fcc1389b33120c88bb35852b26f45c36b902f70bR540-R573
[3]
https://github.com/apache/airflow/pull/65958/changes#diff-5bef10ab2956abf7360dbf9b509b6e1113407874d24abcc1b276475051f13abfR1991-R2001
[4] https://github.com/apache/airflow/pull/65956

Thanks.

Best,
Jason

Best,
Jason

On Tue, Apr 28, 2026 at 3:03 AM Jens Scheffler <[email protected]> wrote:

> Thanks TP for raising this!
>
> I would need a sleep-over the block of information in the described
> details and might have some detail questions just to ensure I understood
> right. So in a ADR or AIP document might be better to comment than in an
> email thread.
>
> Things that jump into my head but there would be more coming thinking
> about it:
>
>   * As I understood Java is static compiled JAR files, no on-the-fly
>     compile from Java source tree (correct?) - in this case also the
>     Dags are "static until re-deploy" - so actually the Dag parsing
>     concept then is quite "static" and once generated actually no need
>     to re-parse the Dag in Java mode?
>   * Still if static JAR deployed how does the deploy lifecacly look
>     like? Would you need to restart with new deploy the Dag Parsr and
>     respective workers?
>   * Is it really realistic that a "LocalExecutor" needs to be supported
>     or can we limit it to e-g- only remote executors to reduce coupling
>     and complexity of operating the core?
>   * What overhead does the "Coordinator layer" generate compared to a
>     Java specific supervisor implementation? Is it not only a new /
>     additional process but also IPC involved then. And at least I saw
>     also performance problems e.g. using very large XComs where even
>     heartbeats are lost due to long running IPC
>     (https://github.com/apache/airflow/issues/64628)
>   * (There might be more coming :-) )
>
> Jens
>
> P.S.: Questions here also do not mean rejection but like to understand
> which complexity and overhead we have adding all this.
>
> On 27.04.26 15:21, Jarek Potiuk wrote:
> > Also I would like to point out one thing.
> >
> > This should not be `LAZY CONSENSUS` just yet, this is quite a big thing
> to
> > discuss. I missed the subject already had it.
> >
> > At this stage this is really a discussion (I renamed the thread. Because
> > ... we have never discussed it before.
> >
> > LAZY CONSENSUS should be really called for after initial discussion (on
> > devlist) points to us actually reaching the consensus.
> >
> > While this one is unlikely to cause much controversy (I think),
> sufficient
> > time for people to discuss and digest it before calling for lazy
> consensus
> > is a necessary prerequisite. We simply need time to build consensus.
> >
> > We have a few processes where we build a "general" consensus first, and
> > then we apply it only to particular cases (such as new providers). But in
> > most cases when we have a "big" thing to discuss, we need to build
> > consensus on devlist first. While in some cases people discussed things
> > off-list and came to some conclusions (whch is perfectly fine) - bringing
> > it to the list as a consensus, where we do not know if we achieved it
> yet,
> > is - I think - a bit premature.
> >
> > See the lazy consensus explanation [1] and "consenesus building [2]
> >
> > [1] Lazy consensus -
> >
> https://community.apache.org/committers/decisionMaking.html#lazy-consensus
> > [2]  Consensus building -
> >
> https://community.apache.org/committers/decisionMaking.html#consensus-building
> >
> > J.
> >
> > On Mon, Apr 27, 2026 at 1:27 PM Aritra Basu<[email protected]>
> > wrote:
> >
> >> Hey TP,
> >>
> >> Overall +1, This is quite an interesting implementation. A couple
> >> questions, is provider the right place for the coordinator? Don't have
> >> strong opinions or alternatives, but I am curious.
> >>
> >> Also for the parser wanted to understand a bit better how it works? I
> tried
> >> going through the SDK but wasn't able to fully understand it. Also +1 to
> >> Jarek's recommendation for documentation.
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Aritra Basu
> >>
> >> On Mon, 27 Apr 2026, 11:39 am Tzu-ping Chung via dev, <
> >> [email protected]> wrote:
> >>
> >>> Hi all,
> >>>
> >>> As mentioned in the latest dev call, we have been developing a Java SDK
> >>> with changes to Airflow in a separate fork[1]. We plan to start merging
> >> the
> >>> Java SDK work back into the OSS repository.
> >>>
> >>> We see this as a natural step following initial work in AIP-72[2],
> which
> >>> created “a clean language agnostic interface for task execution, with
> >>> support for multiple language bindings” (quoted from the proposal).
> >>>
> >>> The Java SDK also uses Ash’s addition of @task.stub[3] for the Go SDK,
> to
> >>> declare a task in a DAG to be “implemented elsewhere” (not in the
> >> annotated
> >>> function). Similar to the Go SDK, we also created a Java library that
> >> users
> >>> can use to write task implementations for Airflow to execute at
> runtime.
> >>>
> >>> [1]:https://github.com/astronomer/airflow/tree/feature/java-all
> >>> [2]:https://cwiki.apache.org/confluence/x/xgmTEg
> >>> [3]:https://github.com/apache/airflow/pull/56055
> >>>
> >>> The user-facing syntax for a stub task would be the same as implemented
> >> by
> >>> the Go SDK:
> >>>
> >>>      @task.stub(queue="java-tasks")
> >>>      def my_task(): ...
> >>>
> >>> With a new configuration option to map tasks in a pool to be executed
> by
> >> a
> >>> specific SDK:
> >>>
> >>>      [sdk]
> >>>      queue_to_sdk = {"java-tasks": "java"}
> >>>
> >>> The configuration is needed for some executors the Go SDK currently
> does
> >>> not support. The Go SDK currently relies on each executor worker
> process
> >> to
> >>> specify which queues they listen to, but this is not always viable,
> since
> >>> some executors—LocalExecutor, for example—do not have the concept of
> >> worker
> >>> processes.
> >>>
> >>> The Coordinator Layer
> >>> =====================
> >>>
> >>> When the Go SDK was implemented, it left out Runtime Airflow plugins
> as a
> >>> future topic. This includes custom XCom backends, secrets backends
> lookup
> >>> for connections and variables, etc. These components are implemented in
> >>> Python, and a Java task cannot easily use the feature unless we also
> >>> implement the lookup logic in Java. We don’t want to do that since it
> >>> introduces significant overhead to writing plugins, and the overhead
> >>> multiplies with each new language SDK.
> >>>
> >>> Fortunately, the current execution-time task runner already uses a
> >>> two-layer design. When an executor wants to run a task, it starts a
> >>> (Python) task runner process that talks to Airflow Core through the
> >>> Execution API, and *forks* another (Python) process, which talks to the
> >>> task runner through TCP, to run the actual task code. Airflow plugins
> >>> simply go into the task runner process.
> >>>
> >>> This design works well for us since it keeps all the Airflow plugins in
> >>> Python. The only thing missing is an abstraction for the task runner
> >>> process to run tasks in any language. We are calling this new layer the
> >>> **Coordinator**.
> >>>
> >>> When a DAG bundle is loaded, it not only tells Airflow how to find the
> >>> DAGs (and the tasks in them), but also how to *run* each task. Current
> >>> Python tasks use the Python Coordinator, running tasks by forking as
> >>> previously described. A new JVM Coordinator will instruct the task
> runner
> >>> how to run tasks packaged in JAR files.
> >>>
> >>> Each coordinator implements a base interface (BaseRuntimeCoordinator)
> >> that
> >>> handles three concerns:
> >>>
> >>> - Discovery: determining whether a given file belongs to this
> coordinator
> >>> (e.g. JAR files for Java).
> >>> - DAG parsing: returning a runtime-specific subprocess command to parse
> >>> DAG files in the target language.
> >>> - Task execution: returning a runtime-specific subprocess command to
> >>> execute tasks in the target runtime.
> >>>
> >>> The base class owns the full bridge lifecycle—TCP servers, subprocess
> >>> management, and cleanup—so language providers only need to implement
> >> these
> >>> three methods.
> >>>
> >>> The coordinator translates a DagFileParseRequest (for DAG parsing) and
> >>> StartupDetails (for Task execution) data model (as declared in Airflow)
> >>> into the appropriate commands for the target runtime. For example, a
> >> “java
> >>> -classpath ... /path/to/MainClass ...” subprocess command that points
> to
> >>> the correct JAR file and main class in this case.
> >>>
> >>> Coordinators as Airflow Providers
> >>> =================================
> >>>
> >>> The base coordinator interface and the Python coordinator will live in
> >>> “airflow.sdk.execution_time”. Other coordinators (for foreign
> languages)
> >>> are registered through the existing Airflow provider mechanism. Each
> SDK
> >>> provider declares its coordinator in its provider.yaml under a
> >>> “coordinators” extension point. Both ProvidersManager (airflow-core)
> and
> >>> ProvidersManagerTaskRuntime (task-sdk) discover coordinators through
> this
> >>> extension point. This means adding a new language runtime requires
> only a
> >>> provider package. No changes to Airflow Core are needed.
> >>>
> >>> The new JVM-based coordinator will live in the namespace
> >>> “airflow.providers.sdk.java”. This is not the most accurate name
> >>> (technically it should be “jvm” instead), but in practice most users
> will
> >>> recognize it, and (from my understanding) other JVM language users
> (e.g.
> >>> Kotlin, Scala) are already well-versed enough dealing with Java
> >>> interoperability to understand “java” means JVM in this context.
> >>>
> >>> Writing DAGs in Java
> >>> ====================
> >>>
> >>> This is not strictly connected to AIP-72, but considered by us as a
> >>> natural next step since we can now implement tasks in a foreign
> language.
> >>> Being able to define the DAG in the same language as the task
> >>> implementation is useful since writing Python, even if only with
> minimal
> >>> syntax, is still a hurdle for those not already familiar with, or even
> >>> allowed to run it. There are mainly three things we need on top of the
> >> task
> >>> implementation interface:
> >>>
> >>> - DAG flags (e.g. schedule, max_active_tasks)
> >>> - Task flags (e.g. trigger_rule, weight_rule)
> >>> - Task dependencies
> >>>
> >>> A proof-of-concept implementation is included with other changes
> proposed
> >>> elsewhere in this document.
> >>>
> >>> Lazy Consensus Topics
> >>> =====================
> >>>
> >>> We’re calling for lazy consensus for the following topics
> >>>
> >>> - A new “queue_to_sdk” configuration option to route tasks to a
> specific
> >>> language SDK
> >>> - A new coordinator layer in the SDK to route implementations at
> >> execution
> >>> time.
> >>> - New providers under airflow.providers.sdk to provide additional
> >> language
> >>> support.
> >>> - Develop the Go SDK to support the proposed model and a provider
> package
> >>> for the coordinator. (Existing features stay as-is; no breaking
> changes.)
> >>> - Add the new Java SDK and the corresponding provider package.
> >>>
> >>> TP
> >>>
> >>>
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail:[email protected]
> >>> For additional commands, e-mail:[email protected]
> >>>
> >>>

Re: [DISCUSS] Java SDK and the Language Coordinator Layer [was: LAZY CONSENSUS]

Reply via email to