Tian Gao created SPARK-54559:
--------------------------------

             Summary: Refactor the UDF protocol
                 Key: SPARK-54559
                 URL: https://issues.apache.org/jira/browse/SPARK-54559
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Tian Gao


We have a very ad-hoc UDF protocol now. People just add random `read_int` calls 
to the UDF reading logic, which hurts maintainability.

We have requests regarding using grpc for UDF workers, but it's too much for a 
single change.

Nonetheless, we should still try to make the current code more maintainable and 
move towards a "defined protocol" (it doesn't have to be grpc, but we should 
have some validation for wrong messages instead of hanging at random places).

The ultimate goal for now is to separate the whole socket communication from 
the UDF logic - read everything at once and validate it, then use it for 
UDF/UDTF creation.
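
A minimal sketch of that separation, under stated assumptions: the message 
layout (eval type, offset count, offsets), the names `UDFSpec`, 
`read_udf_spec`, `validate` and `build_udf`, and the eval type values are all 
hypothetical and are not the current worker.py code. The local `read_int` is a 
stand-in that reads a 4-byte big-endian integer.

```python
import io
import struct
from dataclasses import dataclass
from typing import BinaryIO, Callable, List


def read_int(stream: BinaryIO) -> int:
    """Read a 4-byte big-endian signed int; fail loudly on a short read."""
    data = stream.read(4)
    if len(data) != 4:
        raise EOFError("stream ended while reading an int")
    return struct.unpack("!i", data)[0]


@dataclass(frozen=True)
class UDFSpec:  # hypothetical container for everything read from the socket
    eval_type: int
    arg_offsets: List[int]


KNOWN_EVAL_TYPES = {100, 200}  # placeholder values for this sketch only


def read_udf_spec(stream: BinaryIO) -> UDFSpec:
    """Phase 1: every socket read happens here and nowhere else."""
    eval_type = read_int(stream)
    num_offsets = read_int(stream)
    offsets = [read_int(stream) for _ in range(num_offsets)]
    return UDFSpec(eval_type, offsets)


def validate(spec: UDFSpec) -> None:
    """Phase 2: reject malformed messages before any UDF is built."""
    if spec.eval_type not in KNOWN_EVAL_TYPES:
        raise ValueError(f"unknown eval type: {spec.eval_type}")
    if any(offset < 0 for offset in spec.arg_offsets):
        raise ValueError(f"negative argument offset in {spec.arg_offsets}")


def build_udf(spec: UDFSpec, func: Callable) -> Callable:
    """Phase 3: build the callable from the validated spec, with no I/O."""
    return lambda row: func(*[row[offset] for offset in spec.arg_offsets])


# Usage: an in-memory stream stands in for the worker socket.
payload = struct.pack("!iiii", 100, 2, 0, 2)
spec = read_udf_spec(io.BytesIO(payload))
validate(spec)
add = build_udf(spec, lambda a, b: a + b)
print(add((1, 10, 5)))  # picks columns 0 and 2
```

With this shape, the UDF-building code never touches the stream, so a 
malformed message fails in one place instead of stalling a later read.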

We will divide this effort into the following steps (to make sure code sync is 
not a disaster).
 # Create a single data structure for runner_conf
 # Unify the protocol for runner_conf so every UDF (specifically vanilla 
Python) sends a runner_conf (it can have a length of 0).
 # Unify how the argument offsets are passed - they should use the same 
protocol so it's not determined based on eval type.
 # Move all the socket reading from read_udf/read_udtf to a unified place. We 
read the whole thing (except for data) and build functions based on it.
 # Add some sanity validation to our data (for example, an integer value should 
be within a sane range; a very large number suggests a protocol issue).
 # Potential cleanups for some other stuff that can be unified (it would be 
clear after we put all the logic together).
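
The first, second and fifth steps could look roughly like the sketch below. 
It assumes a wire format of an entry count followed by length-prefixed UTF-8 
key/value pairs; that format, the `RunnerConf` class, the helper names and the 
two upper bounds are illustrative assumptions, not the existing protocol.

```python
import io
import struct
from dataclasses import dataclass, field
from typing import BinaryIO, Dict, Optional

MAX_CONF_ENTRIES = 10_000   # assumed sanity bound, chosen for illustration
MAX_STRING_BYTES = 1 << 20  # assumed sanity bound, chosen for illustration


class ProtocolError(Exception):
    """Raised when the byte stream does not match the expected layout."""


def read_exact(stream: BinaryIO, n: int) -> bytes:
    data = stream.read(n)
    if len(data) != n:
        raise ProtocolError(f"expected {n} bytes, got {len(data)}")
    return data


def read_bounded_int(stream: BinaryIO, name: str, upper: int) -> int:
    """Read a 4-byte big-endian int and reject values outside [0, upper]."""
    (value,) = struct.unpack("!i", read_exact(stream, 4))
    if not 0 <= value <= upper:
        raise ProtocolError(f"{name}={value} is outside [0, {upper}]")
    return value


def read_utf8(stream: BinaryIO, name: str) -> str:
    length = read_bounded_int(stream, f"{name} length", MAX_STRING_BYTES)
    return read_exact(stream, length).decode("utf-8")


@dataclass(frozen=True)
class RunnerConf:  # hypothetical single structure for runner_conf
    entries: Dict[str, str] = field(default_factory=dict)

    @classmethod
    def read_from(cls, stream: BinaryIO) -> "RunnerConf":
        # A count of 0 is valid: every UDF sends a conf, possibly empty.
        count = read_bounded_int(stream, "runner_conf size", MAX_CONF_ENTRIES)
        entries = {}
        for _ in range(count):
            key = read_utf8(stream, "conf key")
            entries[key] = read_utf8(stream, "conf value")
        return cls(entries)

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        return self.entries.get(key, default)


# Usage: an empty conf parses cleanly, while a huge count fails immediately.
empty = RunnerConf.read_from(io.BytesIO(struct.pack("!i", 0)))
print(empty.entries)
try:
    RunnerConf.read_from(io.BytesIO(struct.pack("!i", 2**31 - 1)))
except ProtocolError as err:
    print("rejected:", err)
```

Checking each integer against a bound at read time turns a misaligned stream 
into an immediate, named error rather than a blocked read on the socket.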

We expect most of the changes to be in worker.py, with some minor changes on 
the JVM side to unify protocols.

Once this is done, we should be able to easily convert it to any fancy RPC 
protocol if we need to.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
