[
https://issues.apache.org/jira/browse/SPARK-55583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ruifeng Zheng reassigned SPARK-55583:
-------------------------------------
Assignee: Yicong Huang
> Wrap Arrow VectorLoader IllegalArgumentException as PYTHON_DATA_SOURCE_ERROR
> in Python Data Source read
> -------------------------------------------------------------------------------------------------------
>
> Key: SPARK-55583
> URL: https://issues.apache.org/jira/browse/SPARK-55583
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Assignee: Yicong Huang
> Priority: Major
> Labels: pull-request-available
>
> h2. Problem
> When a Python Data Source reader returns Arrow batches whose buffer layout
> does not match the expected schema, Arrow {{VectorLoader.load()}} throws a
> raw {{IllegalArgumentException}}:
> {code}
> java.lang.IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed.
> nodes: [ArrowFieldNode [length=1, nullCount=1]]
> buffers: [ArrowBuf[1595], ArrowBuf[1596]]
> variadicBufferCounts: []
> at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:98)
> at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:161)
> at o.a.s.sql.execution.python.ArrowOutputProcessorImpl.loadBatch(PythonArrowOutput.scala)
> {code}
> This error message is cryptic and gives no hint that the data source's
> output is at fault, so it can be mistaken for an internal system error.
> h2. Proposal
> Catch {{IllegalArgumentException}} in the Python Data Source-specific code
> path and wrap it as {{PYTHON_DATA_SOURCE_ERROR}}, including the expected
> output schema as context. The fix should affect only Python Data Source
> reader execution and leave other Python execution paths (UDFs, Pandas API,
> etc.) unchanged.
> The improved error message should provide:
> - A clear indication that this is a Python Data Source error
> - The expected output schema
> - The original Arrow error details for debugging
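
The wrapping pattern proposed above can be sketched in a language-neutral way. The actual change would live in Spark's JVM-side code, so the following Python snippet is only an illustration under stated assumptions: `PythonDataSourceError`, `load_batch_for_data_source`, and the message wording are hypothetical names invented here, and `ValueError` stands in for Java's `IllegalArgumentException`.

```python
class PythonDataSourceError(Exception):
    """Hypothetical stand-in for the PYTHON_DATA_SOURCE_ERROR error class."""

    def __init__(self, expected_schema, cause):
        self.expected_schema = expected_schema
        # Assumed message format; the real wording is up to the implementation.
        super().__init__(
            "[PYTHON_DATA_SOURCE_ERROR] Failed to read from Python data source: "
            "the returned Arrow batch does not match the expected output schema. "
            f"Expected schema: {expected_schema}. "
            f"Original Arrow error: {cause}"
        )


def load_batch_for_data_source(load_batch, expected_schema):
    """Call load_batch() and wrap layout errors.

    Only the data source read path would go through this helper, so other
    execution paths keep their existing behavior.
    """
    try:
        return load_batch()
    except ValueError as e:  # stand-in for java.lang.IllegalArgumentException
        # Chain the original error so its details stay available for debugging.
        raise PythonDataSourceError(expected_schema, e) from e


def failing_loader():
    # Simulates the VectorLoader failure quoted in the issue.
    raise ValueError(
        "not all nodes, buffers and variadicBufferCounts were consumed."
    )


try:
    load_batch_for_data_source(failing_loader, "id INT, name STRING")
except PythonDataSourceError as err:
    wrapped = err
    print(err)
```

The sketch covers the three requested properties: the error names the data source as the origin, carries the expected schema, and keeps the original Arrow message both in the text and as the chained cause.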
--
This message was sent by Atlassian Jira
(v8.20.10#820010)