[ 
https://issues.apache.org/jira/browse/SPARK-55583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Huang updated SPARK-55583:
---------------------------------
    Description: 
When a Python Data Source reader returns Arrow batches whose buffer layout does 
not match the expected schema, Arrow {{VectorLoader.load()}} throws a raw 
{{IllegalArgumentException}}:

{code}
java.lang.IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed.
  nodes: [ArrowFieldNode [length=1, nullCount=1]]
  buffers: [ArrowBuf[1595], ArrowBuf[1596]]
  variadicBufferCounts: []
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:98)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:161)
    at o.a.s.sql.execution.python.ArrowOutputProcessorImpl.loadBatch(PythonArrowOutput.scala)
{code}

This message exposes Arrow internals (field nodes and buffer counts) without 
naming the Python data source or the expected schema, so users can easily 
mistake it for an internal Spark error.

*Note: This issue has been resolved using a different approach than originally 
proposed. The fix was implemented by catching the exception at a different 
layer in the codebase.*

  was:
When a Python Data Source reader returns Arrow batches whose buffer layout does 
not match the expected schema, Arrow {{VectorLoader.load()}} throws a raw 
{{IllegalArgumentException}}:

{code}
java.lang.IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed.
  nodes: [ArrowFieldNode [length=1, nullCount=1]]
  buffers: [ArrowBuf[1595], ArrowBuf[1596]]
  variadicBufferCounts: []
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:98)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:161)
    at o.a.s.sql.execution.python.ArrowOutputProcessorImpl.loadBatch(PythonArrowOutput.scala)
{code}

This message exposes Arrow internals (field nodes and buffer counts) without 
naming the Python data source or the expected schema, so users can easily 
mistake it for an internal Spark error.

The fix catches {{IllegalArgumentException}} in 
{{PythonPartitionReaderFactory}} (the Python Data Source specific entry point) 
and wraps it as {{PYTHON_DATA_SOURCE_ERROR}} with the expected output schema 
context, so users get a clear message like:

{code}
[PYTHON_DATA_SOURCE_ERROR] Failed to read from Python data source reader:
The Arrow batch returned by the Python data source does not match the expected output schema.
Expected: StructType(...). <original Arrow error message>
{code}
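
The catch-and-wrap pattern in the originally proposed fix can be sketched as 
follows. This is only an illustration in Python, not Spark's code: the change 
described above lives in JVM code, and every name here 
({{PythonDataSourceError}}, {{load_batch}}, {{read_partition}}) is 
hypothetical. {{ValueError}} stands in for 
{{java.lang.IllegalArgumentException}}, and a simple buffer count check stands 
in for {{VectorLoader.load()}}.

```python
class PythonDataSourceError(Exception):
    """Hypothetical stand-in for the PYTHON_DATA_SOURCE_ERROR error class."""

    def __init__(self, expected_schema: str, cause: Exception):
        message = (
            "[PYTHON_DATA_SOURCE_ERROR] Failed to read from Python data source reader: "
            "The Arrow batch returned by the Python data source does not match "
            f"the expected output schema. Expected: {expected_schema}. {cause}"
        )
        super().__init__(message)
        self.expected_schema = expected_schema


def load_batch(buffer_count: int, expected_buffer_count: int) -> int:
    """Hypothetical loader that rejects a batch whose buffer count is unexpected."""
    if buffer_count != expected_buffer_count:
        # ValueError plays the role of java.lang.IllegalArgumentException here.
        raise ValueError(
            "not all nodes, buffers and variadicBufferCounts were consumed."
        )
    return buffer_count


def read_partition(
    buffer_count: int, expected_buffer_count: int, expected_schema: str
) -> int:
    """Wrap the low-level failure at the data source entry point."""
    try:
        return load_batch(buffer_count, expected_buffer_count)
    except ValueError as err:
        # Chain the original error so the Arrow message stays visible.
        raise PythonDataSourceError(expected_schema, err) from err
```

Wrapping at the data source entry point, rather than inside the shared loader, 
keeps the friendlier message scoped to Python Data Source reads and leaves 
other callers of the loader unaffected.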


> Wrap Arrow VectorLoader IllegalArgumentException as PYTHON_DATA_SOURCE_ERROR in Python Data Source read
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-55583
>                 URL: https://issues.apache.org/jira/browse/SPARK-55583
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Major
>              Labels: pull-request-available
>
> When a Python Data Source reader returns Arrow batches whose buffer layout 
> does not match the expected schema, Arrow {{VectorLoader.load()}} throws a 
> raw {{IllegalArgumentException}}:
> {code}
> java.lang.IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed.
>   nodes: [ArrowFieldNode [length=1, nullCount=1]]
>   buffers: [ArrowBuf[1595], ArrowBuf[1596]]
>   variadicBufferCounts: []
>     at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:98)
>     at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:161)
>     at o.a.s.sql.execution.python.ArrowOutputProcessorImpl.loadBatch(PythonArrowOutput.scala)
> {code}
> This message exposes Arrow internals (field nodes and buffer counts) without 
> naming the Python data source or the expected schema, so users can easily 
> mistake it for an internal Spark error.
> *Note: This issue has been resolved using a different approach than 
> originally proposed. The fix was implemented by catching the exception at a 
> different layer in the codebase.*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
