jason810496 opened a new issue, #67111:
URL: https://github.com/apache/airflow/issues/67111

   ### Background
   
   In the AIP-108 dev@ thread ([2026-05-13 reply][maciej-reply]), Maciej
   Obuchowski outlined what OpenLineage needs from the Java SDK:
   
   - **Generic task lifecycle events** — when OL is enabled, every task
     execution emits OL start / complete / failure events via the listener
     framework, driven by the Python task runner. For Java tasks, the listener
     calls still fire from the Python side around the Java subprocess (e.g.
     near `on_task_instance_running` / `on_task_instance_success` in
     `task-sdk/src/airflow/sdk/execution_time/task_runner.py:1196` and
     `:1921`). No code change is required to keep this working for the
     "task ran, task succeeded" signal.
   - **Operator/hook-specific lineage data** — the part that currently relies
     on Python operators/hooks being in-process and reading state from the
     `TaskInstance` after `execute()`. This does **not** work for a Java task
     because the user code runs in a JVM subprocess that the Python listener
     cannot introspect.
   
   Maciej's conclusion: v1 of the Java SDK does not need OL emission from
   inside the Java task, but the IPC and Java-side API must **not block** a
   future lineage interface from being added. Concretely:
   
   > "able to send serialized data back from the task execution to Python;
   > and an API in the Java SDK for users to be able to specify that data."
   
   ### What needs to happen
   
   1. **Reserve a lineage channel on the coordinator IPC.** When the Java
      subprocess returns task results to the supervisor, the protocol should
      allow an optional serialized lineage payload alongside the existing
      result message. The `BaseCoordinator` interface needs to expose a hook
      that the supervisor calls with that payload (no-op by default).
   2. **Expose a Java SDK API for users to declare lineage data.** The API
      should be minimal and mirror how Python tasks can attach lineage to a
      `TaskInstance` today. Its exact shape is to be decided once the IPC
      channel exists, but it should be:
      - opt-in (no overhead for tasks that don't use it),
      - language-idiomatic on the Java side,
      - resolvable to whatever serialized form the Python listener expects.
   3. **Wire the payload into the existing listener pipeline.** The Python
      supervisor should forward the serialized lineage data into the same
      `get_listener_manager().hook.*` call chain that already runs around
      Java task execution, so OL providers don't need any Java-specific code
      path.
   4. **Document the v1 boundary clearly.** The coordinator user guide should
      state that OL start/complete events fire for Java tasks today, but
      Java-side lineage extraction is a follow-up.
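
To make steps 1 and 3 concrete, here is a minimal sketch of an optional lineage field plus a no-op hook. Everything in it is an assumption for illustration: the `TaskResult` message, the JSON/base64 encoding, the `on_lineage_payload` hook name, and `BaseCoordinatorSketch`, which only stands in for the real `BaseCoordinator` (whose shape is still being settled in #66838). It is not the actual coordinator protocol.

```python
import base64
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    """Hypothetical result message sent by the Java subprocess."""

    state: str
    # Opaque serialized lineage payload; absent unless the task opts in.
    lineage: Optional[bytes] = None

    def to_wire(self) -> bytes:
        body = {"state": self.state}
        if self.lineage is not None:
            # The field is only written when lineage exists, so tasks that
            # emit none produce the same bytes as before the field existed.
            body["lineage"] = base64.b64encode(self.lineage).decode("ascii")
        return json.dumps(body, sort_keys=True).encode("utf-8")

    @classmethod
    def from_wire(cls, raw: bytes) -> "TaskResult":
        body = json.loads(raw)
        encoded = body.get("lineage")  # a missing key is valid
        lineage = base64.b64decode(encoded) if encoded is not None else None
        return cls(state=body["state"], lineage=lineage)


class BaseCoordinatorSketch:
    """Stand-in for the coordinator base class (hypothetical shape)."""

    def on_lineage_payload(self, payload: bytes) -> None:
        """Called by the supervisor with the raw payload; no-op by default."""

    def handle_result(self, raw: bytes) -> TaskResult:
        result = TaskResult.from_wire(raw)
        if result.lineage is not None:
            self.on_lineage_payload(result.lineage)
        return result


class RecordingCoordinator(BaseCoordinatorSketch):
    """Example subclass that keeps every payload it receives."""

    def __init__(self) -> None:
        self.received: List[bytes] = []

    def on_lineage_payload(self, payload: bytes) -> None:
        self.received.append(payload)
```

Keeping the payload as opaque bytes means the coordinator never has to understand the lineage format; only the listener that consumes it does.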
   
   ### Acceptance criteria
   
   - A user enabling OpenLineage sees start/complete/failure events for Java
     stub tasks the same way as for Python tasks (no regression from current
     behavior).
   - The coordinator IPC has a documented optional lineage field; a Java
     task that does not emit lineage produces exactly the same wire traffic
     as today.
   - A Java SDK user can attach lineage data from a task and see it land in
     the Python supervisor as a serialized payload available to OL listeners.
   - The OL provider's listener does not need a Java-specific branch — the
     Java payload reaches it through the same listener hooks as Python.
   - The coordinator user guide states the v1 boundary and links to this
     issue for follow-up work.
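
The "no Java-specific branch" criterion could be checked with a test along these lines. This is only a sketch under stated assumptions: `TaskInstanceStub`, its `lineage_payload` attribute, `ListenerHooksStub`, and `finish_task` are hypothetical stand-ins, not Airflow's listener manager or supervisor API. Only the hook name `on_task_instance_success` comes from the issue text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TaskInstanceStub:
    """Hypothetical task instance; `lineage_payload` is an assumed attribute."""

    task_id: str
    lineage_payload: Optional[bytes] = None


class ListenerHooksStub:
    """Tiny stand-in for the listener hook chain (not the real manager)."""

    def __init__(self) -> None:
        self._on_success: List[Callable[[TaskInstanceStub], None]] = []

    def register_success_listener(
        self, listener: Callable[[TaskInstanceStub], None]
    ) -> None:
        self._on_success.append(listener)

    def on_task_instance_success(self, task_instance: TaskInstanceStub) -> None:
        for listener in self._on_success:
            listener(task_instance)


def finish_task(
    hooks: ListenerHooksStub,
    task_instance: TaskInstanceStub,
    lineage: Optional[bytes] = None,
) -> None:
    """Supervisor-side step: attach lineage if present, then fire the hook.

    The same hook is called whether the task ran in Python or in a JVM.
    """
    if lineage is not None:
        task_instance.lineage_payload = lineage
    hooks.on_task_instance_success(task_instance)


def make_recording_listener(
    seen: List[Tuple[str, Optional[bytes]]]
) -> Callable[[TaskInstanceStub], None]:
    """Build a listener that reads lineage without knowing the task language."""

    def listener(task_instance: TaskInstanceStub) -> None:
        seen.append((task_instance.task_id, task_instance.lineage_payload))

    return listener
```

A listener registered this way receives a Java task's payload and a Python task's missing payload through the same call, which is the property the criterion asks for.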
   
   ### Context
   
   - Dev@ thread:
     <https://lists.apache.org/thread/gjot4bxj9kygj2fk76kx6tyg8s4hr057>
     (Maciej Obuchowski's reply on 2026-05-13, "Generic task information and
     specific lineage data ..."). Jarek Potiuk's prior message on the same
     day pinged Maciej and Kacper for OL input.
   - AIP-108 wiki:
     <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-108+Language+Task+SDK+and+the+Language+Coordinator+Layer>
   - Originating PRs: apache/airflow#65956 (Java SDK), apache/airflow#65958
     (Coordinator layer).
   - Related work:
     - #66543 — Java-based task and Dag-level callbacks. Callbacks may be
       the mechanism (or share machinery) for shipping lineage data back to
       Python.
     - #66838 — Pluggable communication channels. The IPC reservation
       in step 1 should be designed against whatever shape `BaseCoordinator`
       settles on there.
     - #66590 — Compatible protocol between coordinator and lang-SDK. The
       optional lineage field needs to fit the forward-compat contract being
       defined there.
   
   [maciej-reply]: https://lists.apache.org/thread/gjot4bxj9kygj2fk76kx6tyg8s4hr057

