Tian Gao created SPARK-54559:
--------------------------------
Summary: Refactor the UDF protocol
Key: SPARK-54559
URL: https://issues.apache.org/jira/browse/SPARK-54559
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Tian Gao
We have a very ad-hoc UDF protocol now. People just add random `read_int` calls
to the UDF reading logic, which hurts maintainability.
We have requests to use gRPC for UDF workers, but that is too much for a
single change.
Nonetheless, we should still try to make the current code more maintainable and
move towards a "defined protocol" (it doesn't have to be gRPC, but we should
have some validation for malformed messages instead of hanging at random places).
The ultimate goal for now is to separate the whole socket communication from
the UDF logic - read everything at once and validate it, then use it for
UDF/UDTF creation.
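The split described above can be sketched roughly as follows. This is only an illustration under an assumed wire layout (an eval type, an offset count, then the offsets, all as big-endian 32-bit integers). `UDFHeader`, `read_header`, and `build_udf` are hypothetical names, and the `read_int` here is a local stand-in, not the helper used in worker.py.

```python
import io
import struct
from dataclasses import dataclass
from typing import BinaryIO, Callable, List


@dataclass(frozen=True)
class UDFHeader:
    """Everything read from the socket before any data arrives (hypothetical)."""
    eval_type: int
    arg_offsets: List[int]


def read_int(stream: BinaryIO) -> int:
    """Read a big-endian signed 32-bit integer, failing loudly on a short read."""
    raw = stream.read(4)
    if len(raw) != 4:
        raise EOFError(f"expected 4 bytes, got {len(raw)}")
    return struct.unpack("!i", raw)[0]


def read_header(stream: BinaryIO) -> UDFHeader:
    """Phase 1: all stream I/O happens here, followed by validation."""
    eval_type = read_int(stream)
    num_offsets = read_int(stream)
    if num_offsets < 0:
        raise ValueError(f"negative offset count: {num_offsets}")
    offsets = [read_int(stream) for _ in range(num_offsets)]
    return UDFHeader(eval_type, offsets)


def build_udf(header: UDFHeader, func: Callable) -> Callable:
    """Phase 2: pure construction from the validated header, no I/O."""
    def wrapped(row):
        return func(*(row[i] for i in header.arg_offsets))
    return wrapped


# Usage with an in-memory stream standing in for the socket.
# The eval type value 100 is arbitrary and only for illustration.
payload = struct.pack("!iiii", 100, 2, 0, 2)
header = read_header(io.BytesIO(payload))
add = build_udf(header, lambda a, b: a + b)
result = add([1, 10, 20])  # picks columns 0 and 2
```

Because `build_udf` never touches the stream, a malformed message fails in `read_header` with a clear error rather than partway through function construction.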
We will divide this effort into the following steps (to make sure code sync is
not a disaster).
# Create a single data structure for runner_conf
# Unify the protocol for runner_conf so every UDF (specifically vanilla
Python) sends a runner_conf (it can have a length of 0).
# Unify how the argument offsets are passed - they should use the same
protocol so it's not determined based on eval type.
# Move all the socket reading from read_udf/read_udtf to a unified place. We
read the whole thing (except for data) and build functions based on it.
# Add some sanity validation to our data (for example, for an integer value,
we expect a sane number; a very large number suggests a protocol issue).
# Potential cleanups for some other stuff that can be unified (it would be
clear after we put all the logic together).
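As a rough illustration of the first, second, and fifth steps, the sketch below reads a runner_conf into one data structure and rejects implausible integers. The wire layout (an entry count followed by length-prefixed UTF-8 key/value strings), the limits, and the names `RunnerConf`, `ProtocolError`, `read_bounded_int`, and `encode_conf` are all assumptions made for this example, not the existing protocol.

```python
import io
import struct
from dataclasses import dataclass, field
from typing import BinaryIO, Dict

# Assumed limits; real values would need to be chosen by the implementers.
MAX_CONF_ENTRIES = 10_000
MAX_STRING_BYTES = 1 << 20


class ProtocolError(Exception):
    """Raised when the stream does not look like a valid message."""


def read_bounded_int(stream: BinaryIO, name: str, low: int, high: int) -> int:
    """Read a big-endian int32 and check that it falls in [low, high]."""
    raw = stream.read(4)
    if len(raw) != 4:
        raise ProtocolError(f"{name}: expected 4 bytes, got {len(raw)}")
    value = struct.unpack("!i", raw)[0]
    if not low <= value <= high:
        raise ProtocolError(f"{name}: {value} is outside [{low}, {high}]")
    return value


def read_utf8(stream: BinaryIO, name: str) -> str:
    """Read a length-prefixed UTF-8 string."""
    length = read_bounded_int(stream, f"{name} length", 0, MAX_STRING_BYTES)
    raw = stream.read(length)
    if len(raw) != length:
        raise ProtocolError(f"{name}: expected {length} bytes, got {len(raw)}")
    return raw.decode("utf-8")


@dataclass(frozen=True)
class RunnerConf:
    """Single container for runner configuration; may be empty."""
    entries: Dict[str, str] = field(default_factory=dict)

    @classmethod
    def read_from(cls, stream: BinaryIO) -> "RunnerConf":
        count = read_bounded_int(stream, "runner_conf size", 0, MAX_CONF_ENTRIES)
        entries = {}
        for _ in range(count):
            key = read_utf8(stream, "conf key")
            entries[key] = read_utf8(stream, "conf value")
        return cls(entries)


def encode_conf(conf: Dict[str, str]) -> bytes:
    """Sender side of the same assumed layout, used here only for the demo."""
    parts = [struct.pack("!i", len(conf))]
    for key, value in conf.items():
        for text in (key, value):
            data = text.encode("utf-8")
            parts.append(struct.pack("!i", len(data)) + data)
    return b"".join(parts)


# A zero-length runner_conf is still a valid message.
empty = RunnerConf.read_from(io.BytesIO(encode_conf({})))
conf = RunnerConf.read_from(
    io.BytesIO(encode_conf({"spark.sql.session.timeZone": "UTC"}))
)
```

With bounds like these, a desynchronized stream tends to fail immediately with a named field in the error message, instead of attempting to read gigabytes and hanging.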
We expect most of the changes to be in worker.py, with some minor changes on
the JVM side to unify protocols.
Once this is done, we should be able to easily convert it to any fancy RPC
protocol if we need to.