Hi all,
I'd like to discuss moving connection testing off the API server and
onto workers. Jarek suggested this direction in a comment on #59643
[1], and I think the Callback infrastructure being built for running
callbacks on executors is the right foundation for it.
Since 2.7.0, test_connection has been disabled by default (#32052).
Running it on the API server has two problems: the API server
shouldn't be executing user-supplied driver code (Jarek described the
ODBC/JDBC risks in detail on #59643), and workers typically have
network access to external systems that API servers don't, so test
results from the API server can be misleading.
Ramit's generic Callback model (#54796 [2]) and Ferruzzi's
in-progress executor dispatch (#61153 [3]) together give us most of
what's needed. The flow would be:
1. UI calls POST /connections/test
2. API server Fernet-encrypts the connection URI, creates an
ExecutorCallback pointing to the test function, returns an ID
3. Scheduler dispatch loop (from #61153) picks it up, sends it
to the executor
4. Worker decrypts the URI, builds a transient Connection, calls
test_connection(), reports result through the callback path
5. UI polls GET /connections/test/{id} until it gets a terminal
state
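Roughly, step 4 on the worker side could look like the sketch below. All names here are stand-ins, and base64 substitutes for Fernet purely so the snippet is self-contained; the real code would use Airflow's Fernet key and the actual Connection class, whose test_connection() returns a (bool, str) tuple.

```python
import base64

def encrypt_uri(uri: str) -> str:
    # Placeholder for the API server's Fernet encryption step.
    return base64.urlsafe_b64encode(uri.encode()).decode()

def run_connection_test(callback_data: dict) -> dict:
    """Hypothetical worker-side function: decrypt the URI, build a
    transient Connection, run test_connection(), report the result."""
    uri = base64.urlsafe_b64decode(callback_data["encrypted_uri"]).decode()
    # In Airflow this would be:
    #     conn = Connection(uri=uri)
    #     status, message = conn.test_connection()
    # Stubbed here so the sketch runs standalone:
    status, message = uri.startswith("postgres://"), "simulated"
    return {"status": "success" if status else "failed", "message": message}

data = {"encrypted_uri": encrypt_uri("postgres://user:pw@host:5432/db"),
        "conn_type": "postgres"}
result = run_connection_test(data)
```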
The connection-testing-specific code would be small: a POST endpoint
to queue the test, a GET endpoint to poll for results, and the worker
function that decrypts and runs test_connection().
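To make the queue/poll contract concrete, here is a minimal in-memory sketch (plain functions rather than real API routes, and a dict standing in for the ExecutorCallback table; all names are placeholders):

```python
import uuid

# In-memory stand-in for persisted ExecutorCallback rows.
_tests: dict = {}

def post_connection_test(encrypted_uri: str, conn_type: str) -> str:
    """POST /connections/test: queue a test, return an ID to poll."""
    test_id = str(uuid.uuid4())
    _tests[test_id] = {"state": "pending",
                       "encrypted_uri": encrypted_uri,
                       "conn_type": conn_type,
                       "result": None}
    return test_id

def get_connection_test(test_id: str) -> dict:
    """GET /connections/test/{id}: the UI polls this until the
    state is terminal (success/failed)."""
    record = _tests[test_id]
    return {"state": record["state"], "result": record["result"]}

def report_result(test_id: str, ok: bool, message: str) -> None:
    """What the callback result path would do once the worker finishes."""
    _tests[test_id]["state"] = "success" if ok else "failed"
    _tests[test_id]["result"] = message
```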
One thing I noticed: #61153's _enqueue_executor_callbacks currently
requires dag_run_id in the callback data dict, and ExecuteCallback.make
needs a DagRun for bundle info. Connection tests don't have a DagRun.
It would be a small change to make that optional. The dispatch query
itself is already generic (selects all PENDING ExecutorCallbacks). I
can take a look at decoupling that if it would be useful.
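I don't know what final shape #61153 will take, but the decoupling I have in mind is roughly the following (a self-contained sketch with hypothetical names, not the PR's actual classes):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BundleInfo:
    name: str
    version: Optional[str] = None

@dataclass
class ExecutorCallbackSketch:
    data: dict
    bundle: Optional[BundleInfo] = None

    @classmethod
    def make(cls, data: dict, dag_run=None) -> "ExecutorCallbackSketch":
        # Today the dispatch path assumes a DagRun is present for
        # bundle info; making it optional lets DagRun-less callbacks
        # (like a connection test) flow through the same machinery.
        bundle = None
        if dag_run is not None:
            bundle = BundleInfo(name=dag_run.bundle_name,
                                version=dag_run.bundle_version)
        return cls(data=data, bundle=bundle)
```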
A couple of other open questions:
1. The connection test needs to store an encrypted URI, conn_type, and
some timestamps. Is the Callback.data JSON column the right place
for that, or does it warrant its own small table?
2. Stale requests: if a worker crashes mid-test, the record stays
in a non-terminal state. Should there be a scheduler-side reaper
similar to zombie task detection, or is client-side timeout (60s
in the UI) enough?
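For question 2, a scheduler-side reaper could be a periodic sweep over non-terminal records older than some threshold. Sketching it over an in-memory list (in practice this would be a DB query in the scheduler loop; the threshold and field names are made up):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TERMINAL_STATES = {"success", "failed"}
STALE_AFTER = timedelta(seconds=120)  # assumed threshold

def reap_stale_tests(records: list, now: Optional[datetime] = None) -> int:
    """Mark non-terminal connection-test records older than
    STALE_AFTER as failed, mirroring zombie task detection."""
    now = now or datetime.now(timezone.utc)
    reaped = 0
    for rec in records:
        if rec["state"] not in TERMINAL_STATES \
                and now - rec["created_at"] > STALE_AFTER:
            rec["state"] = "failed"
            rec["result"] = "timed out (worker may have crashed)"
            reaped += 1
    return reaped
```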
I explored this earlier in #60618 [4] with a self-contained
implementation. Now that the ExecutorCallback dispatch is taking shape
in #61153, building on top of it seems like the right direction.
Thoughts?
Anish
[1] https://github.com/apache/airflow/pull/59643
[2] https://github.com/apache/airflow/pull/54796
[3] https://github.com/apache/airflow/pull/61153
[4] https://github.com/apache/airflow/pull/60618
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]