Re: [DISCUSS] Move connection testing to workers

Kaxil Naik Thu, 26 Feb 2026 16:32:43 -0800

Sounds good, thanks. Feel free to handle them in separate PRs if needed.


On Thu, 26 Feb 2026 at 22:27, Anish Giri <[email protected]> wrote:

> Thanks Kaxil, Amogh. This is really helpful feedback.
> My earlier POC in #60618 was along these lines; a dedicated
> TestConnection workload type with its own scheduler dispatch, worker
> side timeout enforcement, and a DB backed state machine (PENDING →
> RUNNING → SUCCESS/FAILED). I moved to the ExecutorCallback-based
> approach in #62343 to reduce the scope, but it sounds like the
> dedicated workload was heading towards the right direction. That said,
> #60618 didn't address several of the concerns raised here and it had
> no queue routing (hardcoded first executor), no concurrency budget,
> and no reaper for stuck requests. So this isn't just a revert; it's
> building on that foundation with the operational controls you're
> describing. And based on the discussion, here is the scope I plan to
> implement in the updated PR:
>
> 1. Dedicated ConnectionTest workload: not transported as ExecutorCallback
> 2. Save & Test flow: connection must be saved; worker fetches via
> connection_id through secrets
> 3. Explicit routing: optional queue parameter on the API,
> configuration driven default. Team based routing for multi team setups
> as a natural follow up once that infrastructure lands (per Jarek's
> point that team is a connection attribute and routing should happen
> automatically)
> 4. Dispatch budget: configurable concurrency cap so connection tests
> can't starve task execution
> 5. Strict timeout + stale-work reaper: worker side timeout
> enforcement, scheduler side reaper for stuck RUNNING requests
> 6. Fail fast on unsupported executors: upfront validation rather than
> runtime failure
>
> If there are no objections or anything I've missed, I'll update #62343
> accordingly.
>
> Anish
>
> On Thu, Feb 26, 2026 at 1:52 PM Kaxil Naik <[email protected]> wrote:
> >
> > +1 on moving connection testing off the API server and onto workers for
> the
> > security/isolation reasons already discussed.
> >
> > I also think the queue concern is critical.
> >
> > Queues are a core routing abstraction in Airflow for directing work to
> > specific worker populations (runtime/network/location). For some of our
> > enterprise customer's deployments, some workers can reach certain
> > third-party systems while others cannot, so queue placement directly
> > affects the correctness of a connection test result.
> >
> > My main concern is architectural: connection testing is currently being
> > transported as an ExecutorCallback. Conceptually, connection testing is
> not
> > a callback; it is an ad-hoc workload. That mismatch creates two concrete
> > risks in the current model:
> >
> > 1) Routing correctness:
> >    Default/fallback queue behavior can run tests in the wrong worker
> > environment
> >
> > 2) Fairness and isolation:
> >    Callbacks are correctness-critical for state progression. Connection
> > tests are user-initiated and can hang; they should not compete in the
> same
> > priority lane.
> >
> > I agree callbacks are critical for correctness (they drive state/failure
> > progression), so we should preserve their ability to run promptly. At the
> > same time, connection testing is different in nature (user-initiated
> ad-hoc
> > workload, potentially long-running/hanging) and should not compete in the
> > same unbounded priority lane. A hanging connection test can occupy a
> worker
> > slot for a long time, so this needs strict timeout/reaper behavior in
> > addition to dispatch controls.
> >
> > So my suggestion is to require a dedicated ConnectionTest workload before
> > merge, with:
> > - explicit routing semantics (queue/executor/team) -- and the defaults
> can
> > be config driven,
> > - dedicated dispatch budget/concurrency control,
> > - strict timeout + stale-work reaper,
> > - fail-fast behavior on unsupported executors.
> >
> > This keeps the security goal while avoiding routing regressions and
> > scheduler starvation risk, and keeps callback lanes reserved for callback
> > semantics.
> >
> > Regards,
> > Kaxil
> >
> > On Thu, 26 Feb 2026 at 07:59, Amogh Desai <[email protected]> wrote:
> >
> > > Good discussion, Anish. +1 on moving connection testing to workers --
> > > running user supplied
> > > driver code on API server is the right thing to move away from.
> > >
> > > Strongly agree with Ephraim and Jarek here. If we require the
> connection to
> > > be saved before testing
> > > (so the worker can fetch it via connection_id through the standard
> secrets
> > > path), the UI button should
> > > make that explicit. "Save & Test" is exactly that and it will avoid any
> > > surprise of overwriting an existing
> > > connection while *just testing :)*
> > >
> > > Re queue routing: I think an optional queue parameter on the test
> > > connection API should be part of the
> > > initial implementation itself rather than follow up. The default can be
> > > *default*, which covers the common case.
> > > But there are some other scenarios too:
> > >
> > > * Workers on different queues deployed in different network zones
> meaning
> > > that the test would pass or fail
> > > depending on which queue makes it run.
> > >
> > > * Setups where different queues have different secrets backend, the
> worker
> > > might not even be able to resolve the
> > > connection at all.
> > >
> > > The cost of having an optional queue parameter is kinda low and I
> think it
> > > avoids shipping something that
> > > gives wrong answers in these setups, so if the cost of it is relatively
> > > low, can we consider doing it?
> > >
> > >
> > >
> > > Thanks & Regards,
> > > Amogh Desai
> > >
> > >
> > > On Tue, Feb 24, 2026 at 10:29 PM Jarek Potiuk <[email protected]>
> wrote:
> > >
> > > > > Maybe changing the button to 'save & test' would suffix.
> > > >
> > > > +10
> > > >
> > > > On Tue, Feb 24, 2026 at 1:17 PM Ephraim Anierobi <
> > > > [email protected]>
> > > > wrote:
> > > >
> > > > > Cool idea.
> > > > >
> > > > > On saving the connection automatically, I think we should make it
> > > > explicit
> > > > > in the UI that testing the connection will save the connection.
> This
> > > will
> > > > > help the users to know that they are not just testing but also
> creating
> > > > the
> > > > > connection with the entered credentials. Without this being
> explicit, I
> > > > > think users may unknowingly replace a connection while just trying
> if a
> > > > new
> > > > > connection would work.
> > > > >
> > > > > Maybe changing the button to 'save & test' would suffix.
> > > > >
> > > > > On 2026/02/22 03:50:11 Anish Giri wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > I'd like to discuss moving connection testing off the API server
> and
> > > > > > onto workers. Jarek suggested this direction in a comment on
> #59643
> > > > > > [1], and I think the Callback infrastructure being built for
> running
> > > > > > callbacks on executors is the right foundation for it.
> > > > > >
> > > > > > Since 2.7.0, test_connection has been disabled by default
> (#32052).
> > > > > > Running it on the API server has two problems: the API server
> > > > > > shouldn't be executing user-supplied driver code (Jarek
> described the
> > > > > > ODBC/JDBC risks in detail on #59643), and workers typically have
> > > > > > network access to external systems that API servers don't, so
> test
> > > > > > results from the API server can be misleading.
> > > > > >
> > > > > > Ramit's generic Callback model (#54796 [2]) and Ferruzzi's
> > > > > > in-progress executor dispatch (#61153 [3]) together give us most
> of
> > > > > > what's needed. The flow would be:
> > > > > >
> > > > > > 1. UI calls POST /connections/test
> > > > > > 2. API server Fernet-encrypts the connection URI, creates an
> > > > > > ExecutorCallback pointing to the test function, returns an ID
> > > > > > 3. Scheduler dispatch loop (from #61153) picks it up, sends it
> > > > > > to the executor
> > > > > > 4. Worker decrypts the URI, builds a transient Connection, calls
> > > > > > test_connection(), reports result through the callback path
> > > > > > 5. UI polls GET /connections/test/{id} until it gets a terminal
> > > > > > state
> > > > > >
> > > > > > The connection-testing-specific code would be small: a POST
> endpoint
> > > > > > to queue the test, a GET endpoint to poll for results, and the
> worker
> > > > > > function that decrypts and runs test_connection().
> > > > > >
> > > > > > One thing I noticed: #61153's _enqueue_executor_callbacks
> currently
> > > > > > requires dag_run_id in the callback data dict, and
> > > ExecuteCallback.make
> > > > > > needs a DagRun for bundle info. Connection tests don't have a
> DagRun.
> > > > > > It would be a small change to make that optional. The dispatch
> query
> > > > > > itself is already generic (selects all PENDING
> ExecutorCallbacks). I
> > > > > > can take a look at decoupling that if it would be useful.
> > > > > >
> > > > > > A couple of other open questions:
> > > > > >
> > > > > > 1. The connection test needs to store an encrypted URI,
> conn_type,
> > > and
> > > > > > some timestamps. Is the Callback.data JSON column the right place
> > > > > > for that, or does it warrant its own small table?
> > > > > >
> > > > > > 2. Stale requests: if a worker crashes mid-test, the record stays
> > > > > > in a non-terminal state. Should there be a scheduler-side reaper
> > > > > > similar to zombie task detection, or is client-side timeout (60s
> > > > > > in the UI) enough?
> > > > > >
> > > > > > I explored this earlier in #60618 [4] with a self-contained
> > > > > > implementation. Now that the ExecutorCallback dispatch is taking
> > > shape
> > > > > > in #61153, building on top of will be in right direction.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > > > Anish
> > > > > >
> > > > > > [1] https://github.com/apache/airflow/pull/59643
> > > > > > [2] https://github.com/apache/airflow/pull/54796
> > > > > > [3] https://github.com/apache/airflow/pull/61153
> > > > > > [4] https://github.com/apache/airflow/pull/60618
> > > > > >
> > > > > >
> ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > For additional commands, e-mail: [email protected]
> > > > > >
> > > > > >
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: [email protected]
> > > > > For additional commands, e-mail: [email protected]
> > > > >
> > > > >
> > > >
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: [DISCUSS] Move connection testing to workers

Reply via email to