[
https://issues.apache.org/jira/browse/SPARK-54598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yicong Huang updated SPARK-54598:
---------------------------------
Description:
Currently we always fetch UDFs (the function and its arguments) inside the
logic of each different UDF, which is redundant.
The current implementation has redundant UDF reading logic scattered throughout
`read_udfs()`:
**Single UDF pattern** (repeated in multiple branches):
```python
arg_offsets, f = read_single_udf(
    pickleSer, infile, eval_type, runner_conf, udf_index=0, profiler=profiler
)
parsed_offsets = extract_key_value_indexes(arg_offsets)  # when needed
```
**Multiple UDFs pattern** (repeated in multiple branches):
```python
udfs = []
for i in range(num_udfs):
    udfs.append(
        read_single_udf(
            pickleSer, infile, eval_type, runner_conf, udf_index=i,
            profiler=profiler
        )
    )
```
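One way the two patterns above might be consolidated is to read every UDF once, up front, and let each branch consume the result. The following is only a sketch under stated assumptions: `read_all_udfs` is a hypothetical name that does not come from the issue, and the `read_single_udf` stub merely stands in for the real PySpark function so the example runs on its own.

```python
# Stand-in for PySpark's read_single_udf (assumption: the real function
# deserializes a UDF from `infile`; this stub just fabricates a result).
def read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index, profiler=None):
    arg_offsets = [udf_index, udf_index + 1]
    return arg_offsets, (lambda *args: args)


def read_all_udfs(pickleSer, infile, eval_type, runner_conf, num_udfs, profiler=None):
    """Hypothetical helper: read every UDF once, before any branching."""
    return [
        read_single_udf(
            pickleSer, infile, eval_type, runner_conf, udf_index=i, profiler=profiler
        )
        for i in range(num_udfs)
    ]


# Multi-UDF branches use the whole list.
udfs = read_all_udfs(None, None, eval_type=0, runner_conf={}, num_udfs=2)

# A single-UDF branch unpacks the first entry instead of reading again.
arg_offsets, f = udfs[0]
```

With a helper of this shape, each branch of `read_udfs()` would only decide how to invoke the UDFs, not how to fetch them.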
was:
Currently we always fetch UDFs (function and its arguments)
Single UDF:
```
arg_offsets, f = read_single_udf(
    pickleSer, infile, eval_type, runner_conf, udf_index=0, profiler=profiler
)
parsed_offsets=extract_key_value_indexes(arg_offsets)
```
Multi UDFs:
{code:python}
udfs = []
for i in range(num_udfs):
    udfs.append(
        read_single_udf(
            pickleSer, infile, eval_type, runner_conf, udf_index=i,
            profiler=profiler
        )
    )
{code}
> Refactor UDF fetching logic out from invocation
> -----------------------------------------------
>
> Key: SPARK-54598
> URL: https://issues.apache.org/jira/browse/SPARK-54598
> Project: Spark
> Issue Type: Task
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Priority: Major
>
> Currently we always fetch UDFs (the function and its arguments) inside the
> logic of each different UDF, which is redundant.
> The current implementation has redundant UDF reading logic scattered
> throughout `read_udfs()`:
> **Single UDF pattern** (repeated in multiple branches):
> ```python
> arg_offsets, f = read_single_udf(
>     pickleSer, infile, eval_type, runner_conf, udf_index=0, profiler=profiler
> )
> parsed_offsets = extract_key_value_indexes(arg_offsets)  # when needed
> ```
> **Multiple UDFs pattern** (repeated in multiple branches):
> ```python
> udfs = []
> for i in range(num_udfs):
>     udfs.append(
>         read_single_udf(
>             pickleSer, infile, eval_type, runner_conf, udf_index=i,
>             profiler=profiler
>         )
>     )
> ```
>