[ 
https://issues.apache.org/jira/browse/SPARK-38183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-38183:
--------------------------------
    Description: 
Since pandas API on Spark follows the behavior of pandas, not SQL, some 
unexpected behavior can occur when "spark.sql.ansi.enabled" is True.
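
For reference, the examples below were produced with ANSI mode turned on for 
the session. A minimal sketch of that setup (assuming a plain local 
{{SparkSession}}; not part of the original report):
{code:python}
from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder.getOrCreate()
# ANSI mode defaults to false; the examples below assume it has been enabled.
spark.conf.set("spark.sql.ansi.enabled", True)
{code}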

For example:
 * It raises an exception when {{div}}- and {{mod}}-related methods would 
return null (e.g. {{DataFrame.rmod}})

{code:java}
>>> df
   angels  degress
0       0      360
1       3      180
2       4      360
>>> df.rmod(2)
Traceback (most recent call last):
...
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 32.0 
(TID 165) (172.30.1.44 executor driver): 
org.apache.spark.SparkArithmeticException: divide by zero. To return NULL 
instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false 
(except for ANSI interval type) to bypass this error.{code}
 * It raises an exception when the DataFrame passed to {{ps.melt}} does not 
have columns of the same type.

{code:java}
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
>>> ps.melt(df)
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve 'array(struct('A', A), 
struct('B', B), struct('C', C))' due to data type mismatch: input to function 
array should all be the same type, but it's 
[struct<variable:string,value:string>, struct<variable:string,value:bigint>, 
struct<variable:string,value:bigint>]
To fix the error, you might need to add explicit type casts. If necessary set 
spark.sql.ansi.enabled to false to bypass this error.;
'Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
__natural_order__#231L, explode(array(struct(variable, A, value, A#224), 
struct(variable, B, value, B#225L), struct(variable, C, value, C#226L))) AS 
pairs#269]
+- Project [__index_level_0__#223L, A#224, B#225L, C#226L, 
monotonically_increasing_id() AS __natural_order__#231L]
   +- LogicalRDD [__index_level_0__#223L, A#224, B#225L, C#226L], false{code}
 * It raises an exception when {{CategoricalIndex.remove_categories}} removes 
a category that still appears in the index data

{code:java}
>>> idx
CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c'], 
ordered=False, dtype='category')
>>> idx.remove_categories('b')
22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 215)
org.apache.spark.SparkNoSuchElementException: Key b does not exist. If 
necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.
...
...{code}
 * It raises an exception when the new categories passed to 
{{CategoricalIndex.set_categories}} do not cover every value in the index

{code:java}
>>> idx.set_categories(['b', 'c'])
22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 215)
org.apache.spark.SparkNoSuchElementException: Key a does not exist. If 
necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.
...
...{code}
 * It raises an exception when {{ps.to_numeric}} gets non-numeric data

{code:java}
>>> psser
0    apple
1      1.0
2        2
3       -3
dtype: object
>>> ps.to_numeric(psser)
22/02/14 09:22:36 ERROR Executor: Exception in task 2.0 in stage 63.0 (TID 328)
org.apache.spark.SparkNumberFormatException: invalid input syntax for type 
numeric: apple. To return NULL instead, use 'try_cast'. If necessary set 
spark.sql.ansi.enabled to false to bypass this error.
...{code}
 * It raises an exception when {{strings.StringMethods.rsplit}} (and likewise 
{{strings.StringMethods.split}}) with {{expand=True}} would return null columns

{code:java}
>>> s
0                       this is a regular sentence
1    https://docs.python.org/3/tutorial/index.html
2                                             None
dtype: object
>>> s.str.split(n=4, expand=True)
22/02/14 09:26:23 ERROR Executor: Exception in task 5.0 in stage 69.0 (TID 356)
org.apache.spark.SparkArrayIndexOutOfBoundsException: Invalid index: 1, 
numElements: 1. If necessary set spark.sql.ansi.strictIndexOperator to false to 
bypass this error.{code}
 * It raises an exception when {{astype}} is called with a 
{{CategoricalDtype}} whose categories do not match the data.

{code:java}
>>> psser
0    1994-01-31
1    1994-02-01
2    1994-02-02
dtype: object
>>> cat_type
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
>>> psser.astype(cat_type)
22/02/14 09:34:56 ERROR Executor: Exception in task 5.0 in stage 90.0 (TID 468)
org.apache.spark.SparkNoSuchElementException: Key 1994-02-01 does not exist. If 
necessary set spark.sql.ansi.strictIndexOperator to false to bypass this 
error.{code}
Beyond these example cases, an unexpected error may occur whenever the 
internal SQL function used to implement a pandas-on-Spark API behaves 
differently depending on the ANSI options.
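
As a rough illustration (not taken from the report above, and assuming an 
active {{SparkSession}} named {{spark}}), the same SQL expression can return 
NULL with ANSI mode off but raise under ANSI mode, so any pandas-on-Spark API 
built on such an expression inherits the difference:
{code:python}
# Illustrative sketch only; expected behavior is noted in the comments.
spark.conf.set("spark.sql.ansi.enabled", False)
spark.sql("SELECT 1 / 0 AS result").show()
# +------+
# |result|
# +------+
# |  null|
# +------+

spark.conf.set("spark.sql.ansi.enabled", True)
spark.sql("SELECT 1 / 0 AS result").show()
# raises org.apache.spark.SparkArithmeticException: divide by zero
{code}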

So we might need to show a proper warning message when creating a 
pandas-on-Spark session.
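
A possible shape for that warning (purely a hypothetical sketch; the helper 
name and wording are made up and do not come from the actual implementation):
{code:python}
import warnings

from pyspark.sql import SparkSession


def _warn_if_ansi_mode_enabled(spark: SparkSession) -> None:
    # Hypothetical helper: emit a standard Python warning when a
    # pandas-on-Spark session is created with ANSI mode enabled.
    if spark.conf.get("spark.sql.ansi.enabled", "false").lower() == "true":
        warnings.warn(
            "'spark.sql.ansi.enabled' is set to True. "
            "pandas API on Spark follows the behavior of pandas, not SQL, "
            "so some APIs may raise unexpected exceptions under ANSI mode.",
            UserWarning,
        )
{code}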

  was:
Since pandas API on Spark follows the behavior of pandas, not SQL, some 
unexpected behavior can occur when "spark.sql.ansi.enabled" is True.

 

So we might need to show a proper warning message when creating a 
pandas-on-Spark session.


> Show warning when creating pandas-on-Spark session under ANSI mode.
> -------------------------------------------------------------------
>
>                 Key: SPARK-38183
>                 URL: https://issues.apache.org/jira/browse/SPARK-38183
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Haejoon Lee
>            Priority: Major
>


