timsaucer opened a new pull request, #1546:
URL: https://github.com/apache/datafusion-python/pull/1546

   # Which issue does this PR close?
   
   No associated issue. **PR 3 of 4** stacked on 
[#1545](https://github.com/apache/datafusion-python/pull/1545) (which is itself 
stacked on [#1544](https://github.com/apache/datafusion-python/pull/1544)). The 
diff against \`main\` is cumulative until the prior PRs merge — review the 
commits on \`pr3-toggle-sender-strict\` directly for the PR3 delta.
   
   # Rationale for this change
   
   PRs 1 and 2 ship Python UDFs inline through the codec. Two follow-on needs:
   
   1. **Cross-language wire bytes.** A producer that wants its serialized 
expression to round-trip through a non-Python decoder (e.g. a Rust-only 
datafusion-distributed worker) needs UDFs to travel by name, not as a 
cloudpickle blob.
   2. **Untrusted-input decoding.** A receiver that may read 
\`Expr.from_bytes\` input from an untrusted source must refuse to invoke 
\`cloudpickle.loads\` on the inline payload. (\`pickle.loads\` on untrusted 
input is still unsafe regardless of this toggle — see the security note in the 
docstrings.)
   
   Both needs are served by the same on/off switch at the codec level. The 
codec already sits on every session, so the toggle is naturally per-session.
   
   # What changes are included in this PR?
   
   **Codec layer.** \`PythonLogicalCodec\` and \`PythonPhysicalCodec\` gain a 
\`python_udf_inlining: bool\` (default \`true\`) plus a 
\`with_python_udf_inlining(enabled)\` builder. Encode paths short-circuit to 
\`inner\` when the toggle is off (UDFs travel by name); decode paths return an 
\`Execution\` error instead of invoking \`cloudpickle.loads\` if they recognize 
a \`DFPY*\` family magic on a strict codec. The refusal message names both the 
UDF and the wire family so an operator can immediately see whether to re-encode 
the bytes upstream or register the UDF on the receiver.
   
   **Session layer.** \`PySessionContext::with_python_udf_inlining(enabled)\` 
returns a new session whose stacked logical + physical codecs both carry the 
toggle. The \`Arc<SessionState>\` is shared (cheap clone), only the codec pair 
is rebuilt, so registrations and config stay attached. 
\`SessionContext.with_python_udf_inlining(*, enabled)\` is the Python wrapper. 
\`enabled\` is keyword-only because positional booleans at the call site read 
as opaque.
   
   **Sender-side context.** \`datafusion.ipc\` gains \`set_sender_ctx\` / 
\`get_sender_ctx\` / \`clear_sender_ctx\` thread-locals. \`Expr.__reduce__\` 
now consults \`get_sender_ctx()\` to pick the codec for outbound pickles — 
without that hook, \`pickle.dumps\` always invokes \`Expr.to_bytes()\` with no 
context, so a strict session would never affect the pickle path. 
\`Expr.to_bytes(ctx)\` calls with an explicit \`ctx\` are unaffected.
   
   **Tests.** \`test_pickle_expr.py\` picks up:
   - \`TestPythonUdfInliningToggle\` — round-trips through a strict session, 
asserts the strict-side refusal error, exercises the explicit-ctx fast path, 
and covers an off-then-on toggle to ensure the field is not sticky.
   - \`TestWorkerCtxLifecycle\` and \`TestSenderCtxLifecycle\` — 
set/clear/threading semantics for the two thread-locals.
   
   \`test_pickle_multiprocessing.py\` (new) plus 
\`_pickle_multiprocessing_helpers.py\` (new) exercise the full driver → worker 
round-trip on a \`multiprocessing.Pool\` with \`set_worker_ctx\` installed in 
the initializer and \`set_sender_ctx\` on the driver.
   
   **CI.** \`.github/workflows/test.yml\` gets a 30-minute \`timeout-minutes\` 
backstop so a hung pickle worker (e.g. during a future regression) cannot block 
the matrix indefinitely.
   
   # Are there any user-facing changes?
   
   Yes:
   
   - \`SessionContext.with_python_udf_inlining(*, enabled: bool)\` is a new 
public method. Use \`enabled=False\` to either (a) produce cross-language wire 
bytes or (b) refuse to deserialize inline Python payloads from untrusted bytes.
   - \`datafusion.ipc.set_sender_ctx\` / \`get_sender_ctx\` / 
\`clear_sender_ctx\` are new public functions for propagating a configured 
session through \`pickle.dumps\`.
   
   No breaking changes — the toggle defaults to \`true\`, matching pre-PR 
behavior. \`api change\` label not added.
   
   The user-guide page documenting the full pattern (and the multiprocessing / 
Ray runnable examples) lands in PR 4 of this series.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to