[
https://issues.apache.org/jira/browse/SPARK-55583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yicong Huang updated SPARK-55583:
---------------------------------
Description:
h2. Problem
When a Python Data Source reader returns Arrow batches whose buffer layout does
not match the expected schema, Arrow {{VectorLoader.load()}} throws a raw
{{IllegalArgumentException}}:
{code}
java.lang.IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed.
nodes: [ArrowFieldNode [length=1, nullCount=1]]
buffers: [ArrowBuf[1595], ArrowBuf[1596]]
variadicBufferCounts: []
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:98)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:161)
    at o.a.s.sql.execution.python.ArrowOutputProcessorImpl.loadBatch(PythonArrowOutput.scala)
{code}
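This error message exposes Arrow internals (nodes, buffers) without saying that
the data returned by the Python Data Source does not match the expected schema.
Users can mistake it for an internal system error.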
h2. Proposal
Catch {{IllegalArgumentException}} in the Python Data Source specific code path
and wrap it as {{PYTHON_DATA_SOURCE_ERROR}}, including the expected output
schema as context. The fix should affect only Python Data Source reader
execution and leave other Python execution paths (UDFs, Pandas API, etc.)
unchanged.
The improved error message should provide:
- A clear indication that this is a Python Data Source error
- The expected output schema
- The original Arrow error details for debugging
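The wrapping pattern proposed above can be sketched in a few lines. This is only an illustrative model in plain Python, not Spark's actual JVM-side code: {{PythonDataSourceError}}, {{load_batch}}, {{read_data_source_batch}}, the dictionary-based batch, and the message wording are all hypothetical names and choices made for this example, and {{ValueError}} stands in for the {{IllegalArgumentException}} thrown by Arrow.

```python
class PythonDataSourceError(Exception):
    """Hypothetical stand-in for Spark's PYTHON_DATA_SOURCE_ERROR."""


def load_batch(batch):
    """Hypothetical stand-in for the Arrow loading step."""
    if batch.get("buffers_left"):
        # Models Arrow's raw failure (IllegalArgumentException on the JVM).
        raise ValueError(
            "not all nodes, buffers and variadicBufferCounts were consumed."
        )
    return batch["rows"]


def read_data_source_batch(batch, expected_schema):
    """Wrap the raw loader error, but only on the data source read path."""
    try:
        return load_batch(batch)
    except ValueError as exc:
        # Carry the three items the proposal lists: the error kind, the
        # expected output schema, and the original Arrow error details.
        raise PythonDataSourceError(
            "[PYTHON_DATA_SOURCE_ERROR] The Python Data Source returned "
            "Arrow data that does not match the expected output schema "
            f"'{expected_schema}'. Original error: {exc}"
        ) from exc
```

Because the wrapping lives in {{read_data_source_batch}} rather than in {{load_batch}}, other callers of the loader in this sketch still see the unmodified exception, which mirrors the requirement that UDF and Pandas API paths stay unaffected. Chaining with {{from exc}} keeps the original error available for debugging.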
was:
When a Python Data Source reader returns Arrow batches whose buffer layout does
not match the expected schema, Arrow {{VectorLoader.load()}} throws a raw
{{IllegalArgumentException}}:
{code}
java.lang.IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed.
nodes: [ArrowFieldNode [length=1, nullCount=1]]
buffers: [ArrowBuf[1595], ArrowBuf[1596]]
variadicBufferCounts: []
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:98)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:161)
    at o.a.s.sql.execution.python.ArrowOutputProcessorImpl.loadBatch(PythonArrowOutput.scala)
{code}
This error message is cryptic and not user-friendly. It can be mistaken for an
internal system error.
*Note: This issue has been resolved using a different approach than originally
proposed. The fix was implemented by catching the exception at a different
layer in the codebase.*
> Wrap Arrow VectorLoader IllegalArgumentException as PYTHON_DATA_SOURCE_ERROR
> in Python Data Source read
> -------------------------------------------------------------------------------------------------------
>
> Key: SPARK-55583
> URL: https://issues.apache.org/jira/browse/SPARK-55583
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Priority: Major
> Labels: pull-request-available