[ 
https://issues.apache.org/jira/browse/SPARK-33073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-33073:
---------------------------------
    Description: 
Currently, when converting from Pandas to Arrow for Pandas UDF return values or 
from createDataFrame(), PySpark will catch all ArrowExceptions and display info 
on how to disable the safe conversion config. This is displayed with the 
original error as a tuple:

{noformat}
('Exception thrown when converting pandas.Series (object) to Arrow Array 
(int32). It can be caused by overflows or other unsafe conversions warned by 
Arrow. Arrow safe type check can be disabled by using SQL config 
`spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could 
not convert a with type str: tried to convert to int'))
{noformat}

The problem is that this message is mainly meant for things like float 
truncation or overflow, but it also appears when the user supplies an invalid 
schema with incompatible types. In that case the extra information is 
confusing and the real error is buried.
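
For example, the confusing message can be triggered by a plain schema mismatch 
rather than an unsafe conversion. A hypothetical repro (the column name and 
values are illustrative, and it assumes Arrow is enabled for createDataFrame()):

{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Route createDataFrame() through the Arrow conversion path.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# The column holds strings, but the user-supplied schema declares int.
# This is a schema error, not an overflow, yet the resulting message
# still suggests disabling the safe type check config.
pdf = pd.DataFrame({"x": ["a", "b", "c"]})
df = spark.createDataFrame(pdf, schema="x int")  # raises the tuple-style error above
{code}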

This could be improved by printing the extra info on how to disable safe 
checking only when the config is actually set, and by using exception chaining 
to better show the original error. Also, any safe-check failures would be a 
ValueError, of which ArrowInvalid is a subclass, so the catch could be made 
narrower.
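
A minimal sketch of what that could look like (the helper name and surrounding 
code are assumptions, not the actual PySpark source): catch only ValueError, 
mention the config only when safe checking is enabled, and chain the original 
exception with "raise ... from ...":

{code:python}
import pyarrow as pa

# Hypothetical stand-in for PySpark's internal pandas-to-Arrow helper.
def _create_array(series, arrow_type, safecheck):
    try:
        return pa.Array.from_pandas(series, type=arrow_type, safe=safecheck)
    except ValueError as e:
        # ArrowInvalid subclasses ValueError, so this catch is narrower
        # than catching all ArrowExceptions.
        if safecheck:
            msg = ("Exception thrown when converting pandas.Series (%s) to "
                   "Arrow Array (%s). It can be caused by overflows or other "
                   "unsafe conversions warned by Arrow. Arrow safe type check "
                   "can be disabled by using SQL config "
                   "`spark.sql.execution.pandas.convertToArrowArraySafely`."
                   % (series.dtype, arrow_type))
            # Chain instead of packing both messages into one tuple,
            # so the original error stays visible as the cause.
            raise ValueError(msg) from e
        raise
{code}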

  was:
Currently, when converting from Pandas to Arrow for Pandas UDF return values or 
from createDataFrame(), PySpark will catch all ArrowExceptions and display info 
on how to disable the safe conversion config. This is displayed with the 
original error as a tuple:

{noformat}
('Exception thrown when converting pandas.Series (object) to Arrow Array 
(int32). It can be caused by overflows or other unsafe conversions warned by 
Arrow. Arrow safe type check can be disabled by using SQL config 
`spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could 
not convert a with type str: tried to convert to int'))
{noformat}

The problem is that this message is mainly meant for things like float 
truncation or overflow, but it also appears when the user supplies an invalid 
schema with incompatible types. In that case the extra information is 
confusing and the real error is buried.

This could be improved by printing the extra info on how to disable safe 
checking only when the config is actually set. Also, any safe-check failures 
would be a ValueError, of which ArrowInvalid is a subclass, so the catch could 
be made narrower.


> Improve error handling on Pandas to Arrow conversion failures
> -------------------------------------------------------------
>
>                 Key: SPARK-33073
>                 URL: https://issues.apache.org/jira/browse/SPARK-33073
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.0.1
>            Reporter: Bryan Cutler
>            Priority: Major
>
> Currently, when converting from Pandas to Arrow for Pandas UDF return values 
> or from createDataFrame(), PySpark will catch all ArrowExceptions and display 
> info on how to disable the safe conversion config. This is displayed with the 
> original error as a tuple:
> {noformat}
> ('Exception thrown when converting pandas.Series (object) to Arrow Array 
> (int32). It can be caused by overflows or other unsafe conversions warned by 
> Arrow. Arrow safe type check can be disabled by using SQL config 
> `spark.sql.execution.pandas.convertToArrowArraySafely`.', ArrowInvalid('Could 
> not convert a with type str: tried to convert to int'))
> {noformat}
> The problem is that this message is mainly meant for things like float 
> truncation or overflow, but it also appears when the user supplies an 
> invalid schema with incompatible types. In that case the extra information 
> is confusing and the real error is buried.
> This could be improved by printing the extra info on how to disable safe 
> checking only when the config is actually set, and by using exception 
> chaining to better show the original error. Also, any safe-check failures 
> would be a ValueError, of which ArrowInvalid is a subclass, so the catch 
> could be made narrower.


