Yicong Huang created SPARK-55197:
------------------------------------
Summary: Extract _insert_stream_start helper to deduplicate
START_ARROW_STREAM signal logic
Key: SPARK-55197
URL: https://issues.apache.org/jira/browse/SPARK-55197
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
Multiple Arrow serializers repeat the same pattern for writing
{{START_ARROW_STREAM}} before dumping batches:
{code:python}
first = next(iterator, None)
if first is None:
    return
write_int(SpecialLengths.START_ARROW_STREAM, stream)
# then chain first with rest...
{code}
This pattern appears in {{ArrowStreamUDFSerializer}},
{{ArrowStreamPandasUDFSerializer}}, {{ArrowStreamArrowUDFSerializer}}, and
{{ApplyInPandasWithStateSerializer}}.
Proposal: Extract a {{_insert_stream_start}} helper in
{{ArrowStreamSerializer}} to centralize this logic.
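For illustration only, a minimal standalone sketch of what such a helper might look like. It is not the actual patch: {{insert_stream_start}} is written here as a free function rather than a method on {{ArrowStreamSerializer}}, and {{write_int}} and {{START_ARROW_STREAM}} are local stand-ins for the PySpark definitions so the snippet runs without Spark.

```python
import io
import struct

# Stand-in for SpecialLengths.START_ARROW_STREAM (assumed value, for illustration only).
START_ARROW_STREAM = -6


def write_int(value, stream):
    # Stand-in for pyspark.serializers.write_int: a big-endian 4-byte signed int.
    stream.write(struct.pack("!i", value))


def insert_stream_start(batches, stream):
    """Yield batches unchanged, writing the start marker before the first one.

    Writes nothing if the iterator is empty. As in the repeated pattern,
    a None item is indistinguishable from exhaustion. Because this is a
    generator, the marker is written only once consumption begins.
    """
    iterator = iter(batches)
    first = next(iterator, None)
    if first is None:
        return
    write_int(START_ARROW_STREAM, stream)
    yield first
    yield from iterator


# Usage: wrap the batch iterator before handing it to the batch writer.
buf = io.BytesIO()
passed_through = list(insert_stream_start(["batch-0", "batch-1"], buf))
```

Each serializer's {{dump_stream}} could then wrap its batch iterator with the helper instead of repeating the peek-and-chain logic inline.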