Yicong Huang created SPARK-55583:
------------------------------------
Summary: Wrap Arrow VectorLoader IllegalArgumentException as
PYTHON_DATA_SOURCE_ERROR in Python Data Source read
Key: SPARK-55583
URL: https://issues.apache.org/jira/browse/SPARK-55583
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
When a Python Data Source reader returns Arrow batches whose buffer layout does
not match the expected schema, Arrow {{VectorLoader.load()}} throws a raw
{{IllegalArgumentException}}:
{code}
java.lang.IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed.
nodes: [ArrowFieldNode [length=1, nullCount=1]]
buffers: [ArrowBuf[1595], ArrowBuf[1596]]
variadicBufferCounts: []
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:98)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:161)
    at org.apache.spark.sql.execution.python.ArrowOutputProcessorImpl.loadBatch(PythonArrowOutput.scala)
{code}
This error message is cryptic and not user-friendly: it says nothing about a schema mismatch in the user's data source and can easily be mistaken for an internal Spark error.
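For context, here is a minimal, standalone sketch (Arrow Java API, written in Scala, outside of Spark) of the kind of layout mismatch that makes {{VectorLoader.load()}} fail this way: the batch carries more nodes and buffers than the target schema expects. The column names and types are illustrative only.
{code}
import java.nio.charset.StandardCharsets

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, VarCharVector, VectorLoader, VectorSchemaRoot, VectorUnloader}

object VectorLoaderMismatch {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator()

    // Producer side: a batch with two columns (a: int, b: string).
    val a = new IntVector("a", allocator)
    a.allocateNew(1)
    a.setSafe(0, 42)
    a.setValueCount(1)
    val b = new VarCharVector("b", allocator)
    b.allocateNew()
    b.setSafe(0, "x".getBytes(StandardCharsets.UTF_8))
    b.setValueCount(1)
    val produced = VectorSchemaRoot.of(a, b)
    produced.setRowCount(1)
    val batch = new VectorUnloader(produced).getRecordBatch

    // Consumer side: a root whose schema only expects one column (a: int).
    val expected = VectorSchemaRoot.of(new IntVector("a", allocator))

    // Leaves b's node and buffers unconsumed and throws:
    //   java.lang.IllegalArgumentException: not all nodes, buffers and
    //   variadicBufferCounts were consumed. ...
    new VectorLoader(expected).load(batch)
  }
}
{code}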
The fix catches {{IllegalArgumentException}} in {{PythonPartitionReaderFactory}} (the entry point specific to Python Data Source reads) and wraps it as {{PYTHON_DATA_SOURCE_ERROR}} together with the expected output schema, so users get a clear message like:
{code}
[PYTHON_DATA_SOURCE_ERROR] Failed to read from Python data source reader:
The Arrow batch returned by the Python data source does not match the expected
output schema.
Expected: StructType(...). <original Arrow error message>
{code}
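A rough sketch of the wrapping described above, not the actual patch: the error class name and the message text come from this ticket, while the helper name, the message-parameter keys, and the exact call site are assumptions.
{code}
import org.apache.spark.SparkException
import org.apache.spark.sql.types.StructType

// Hypothetical helper illustrating the wrapping; in the actual fix this logic
// sits in PythonPartitionReaderFactory, around the code that loads the Arrow
// batches streamed back from the Python data source reader.
def wrapArrowSchemaMismatch[T](expectedSchema: StructType)(loadBatch: => T): T = {
  try {
    loadBatch // e.g. the ArrowStreamReader.loadNextBatch() call from the stack trace
  } catch {
    case e: IllegalArgumentException =>
      // The message-parameter keys below are assumptions about the
      // PYTHON_DATA_SOURCE_ERROR error-class template.
      throw new SparkException(
        errorClass = "PYTHON_DATA_SOURCE_ERROR",
        messageParameters = Map(
          "action" -> "read from",
          "type" -> "reader",
          "msg" -> ("The Arrow batch returned by the Python data source does not match " +
            s"the expected output schema. Expected: $expectedSchema. ${e.getMessage}")),
        cause = e)
  }
}
{code}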