[ 
https://issues.apache.org/jira/browse/SPARK-55583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Huang updated SPARK-55583:
---------------------------------
    Description: 
h2. Problem

When a Python Data Source reader returns Arrow batches whose buffer layout does 
not match the expected schema, Arrow {{VectorLoader.load()}} throws a raw 
{{IllegalArgumentException}}:

{code}
java.lang.IllegalArgumentException: not all nodes, buffers and 
variadicBufferCounts were consumed.
  nodes: [ArrowFieldNode [length=1, nullCount=1]]
  buffers: [ArrowBuf[1595], ArrowBuf[1596]]
  variadicBufferCounts: []
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:98)
    at 
org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:161)
    at 
o.a.s.sql.execution.python.ArrowOutputProcessorImpl.loadBatch(PythonArrowOutput.scala)
{code}

This message exposes Arrow internals and never mentions the Python Data Source, 
so users can easily mistake it for an internal Spark error.

h2. Proposal

Catch {{IllegalArgumentException}} in the Python Data Source specific code path 
and wrap it as {{PYTHON_DATA_SOURCE_ERROR}} with the expected output schema as 
context. The fix should affect only Python Data Source reader execution and 
leave other Python execution paths (UDFs, Pandas API, etc.) unchanged.

The improved error message should provide:
- Clear indication this is a Python Data Source error
- Expected output schema
- Original Arrow error details for debugging
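
The wrapping described above could look roughly like the following sketch. It 
is illustrative only and is not Spark's implementation: 
{{ArrowLoadErrorWrapping}}, {{PythonDataSourceException}}, {{loadWithContext}} 
and the message wording are hypothetical names chosen for this example, and the 
simulated failure stands in for the real {{VectorLoader.load()}} call.

```java
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from the Spark codebase.
public class ArrowLoadErrorWrapping {

    /** Stand-in for an error raised under PYTHON_DATA_SOURCE_ERROR (assumed shape). */
    static class PythonDataSourceException extends RuntimeException {
        PythonDataSourceException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs the batch-loading step and converts an IllegalArgumentException
     * into a data-source-specific error that names the expected schema and
     * keeps the original Arrow message and cause.
     */
    static <T> T loadWithContext(String expectedSchema, Supplier<T> loadBatch) {
        try {
            return loadBatch.get();
        } catch (IllegalArgumentException e) {
            String message = "[PYTHON_DATA_SOURCE_ERROR] The Python data source "
                + "returned Arrow data that does not match the expected output schema. "
                + "Expected schema: " + expectedSchema + ". "
                + "Arrow error: " + e.getMessage();
            throw new PythonDataSourceException(message, e);
        }
    }

    public static void main(String[] args) {
        try {
            loadWithContext("id INT, name STRING", () -> {
                // Simulates the failure that VectorLoader.load() raises.
                throw new IllegalArgumentException(
                    "not all nodes, buffers and variadicBufferCounts were consumed.");
            });
        } catch (PythonDataSourceException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the wrapper is applied only where the Python Data Source reader loads 
its batches, other callers of the same Arrow loading code would keep their 
current behavior, and the original exception stays available as the cause.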

  was:
When a Python Data Source reader returns Arrow batches whose buffer layout does 
not match the expected schema, Arrow {{VectorLoader.load()}} throws a raw 
{{IllegalArgumentException}}:

{code}
java.lang.IllegalArgumentException: not all nodes, buffers and 
variadicBufferCounts were consumed.
  nodes: [ArrowFieldNode [length=1, nullCount=1]]
  buffers: [ArrowBuf[1595], ArrowBuf[1596]]
  variadicBufferCounts: []
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:98)
    at 
org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:161)
    at 
o.a.s.sql.execution.python.ArrowOutputProcessorImpl.loadBatch(PythonArrowOutput.scala)
{code}

This error message is cryptic and not user-friendly. It can be mistaken for an 
internal system error.

*Note: This issue has been resolved using a different approach than originally 
proposed. The fix was implemented by catching the exception at a different 
layer in the codebase.*


> Wrap Arrow VectorLoader IllegalArgumentException as PYTHON_DATA_SOURCE_ERROR 
> in Python Data Source read
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-55583
>                 URL: https://issues.apache.org/jira/browse/SPARK-55583
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Major
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
