[ https://issues.apache.org/jira/browse/SPARK-38183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haejoon Lee updated SPARK-38183:
--------------------------------
    Description: 
Since the pandas API on Spark follows the behavior of pandas, not SQL, some unexpected behavior can occur when "spark.sql.ansi.enabled" is True. For example,
* It raises an exception when {{div}}- and {{mod}}-related methods (e.g. {{DataFrame.rmod}}) would return null
{code:java}
>>> df
   angels  degress
0       0      360
1       3      180
2       4      360
>>> df.rmod(2)
Traceback (most recent call last):
...
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 32.0 (TID 165) (172.30.1.44 executor driver): org.apache.spark.SparkArithmeticException: divide by zero. To return NULL instead, use 'try_divide'. If necessary set spark.sql.ansi.enabled to false (except for ANSI interval type) to bypass this error.{code}
* It raises an exception when the DataFrame passed to {{ps.melt}} does not have columns of the same type
{code:java}
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
>>> ps.melt(df)
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve 'array(struct('A', A), struct('B', B), struct('C', C))' due to data type mismatch: input to function array should all be the same type, but it's [struct<variable:string,value:string>, struct<variable:string,value:bigint>, struct<variable:string,value:bigint>]
To fix the error, you might need to add explicit type casts. If necessary set spark.sql.ansi.enabled to false to bypass this error.;
'Project [__index_level_0__#223L, A#224, B#225L, C#226L, __natural_order__#231L, explode(array(struct(variable, A, value, A#224), struct(variable, B, value, B#225L), struct(variable, C, value, C#226L))) AS pairs#269]
+- Project [__index_level_0__#223L, A#224, B#225L, C#226L, monotonically_increasing_id() AS __natural_order__#231L]
   +- LogicalRDD [__index_level_0__#223L, A#224, B#225L, C#226L], false{code}
* It raises an exception when {{CategoricalIndex.remove_categories}} does not remove all of the categories in the index
{code:java}
>>> idx
CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')
>>> idx.remove_categories('b')
22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 215)
org.apache.spark.SparkNoSuchElementException: Key b does not exist. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.
...{code}
* It raises an exception when {{CategoricalIndex.set_categories}} does not cover all of the categories in the index
{code:java}
>>> idx.set_categories(['b', 'c'])
22/02/14 09:16:14 ERROR Executor: Exception in task 2.0 in stage 41.0 (TID 215)
org.apache.spark.SparkNoSuchElementException: Key a does not exist. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.
...{code}
* It raises an exception when {{ps.to_numeric}} gets a non-numeric value
{code:java}
>>> psser
0    apple
1      1.0
2        2
3       -3
dtype: object
>>> ps.to_numeric(psser)
22/02/14 09:22:36 ERROR Executor: Exception in task 2.0 in stage 63.0 (TID 328)
org.apache.spark.SparkNumberFormatException: invalid input syntax for type numeric: apple. To return NULL instead, use 'try_cast'. If necessary set spark.sql.ansi.enabled to false to bypass this error.
...{code}
* It raises an exception when {{strings.StringMethods.rsplit}} (and {{strings.StringMethods.split}}) with {{expand=True}} would return null columns
{code:java}
>>> s
0       this is a regular sentence
1    https://docs.python.org/3/tutorial/index.html
2                             None
dtype: object
>>> s.str.split(n=4, expand=True)
22/02/14 09:26:23 ERROR Executor: Exception in task 5.0 in stage 69.0 (TID 356)
org.apache.spark.SparkArrayIndexOutOfBoundsException: Invalid index: 1, numElements: 1. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.{code}
* It raises an exception when {{astype}} is used with a {{CategoricalDtype}} whose categories do not match the data
{code:java}
>>> psser
0    1994-01-31
1    1994-02-01
2    1994-02-02
dtype: object
>>> cat_type
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
>>> psser.astype(cat_type)
22/02/14 09:34:56 ERROR Executor: Exception in task 5.0 in stage 90.0 (TID 468)
org.apache.spark.SparkNoSuchElementException: Key 1994-02-01 does not exist. If necessary set spark.sql.ansi.strictIndexOperator to false to bypass this error.{code}
These examples are not exhaustive: whenever the internal SQL function used to implement a pandas-on-Spark API behaves differently depending on the ANSI options, an unexpected error may occur.

So we might need to show a proper warning message when creating the pandas-on-Spark session.
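One minimal sketch of where such a warning could be emitted (illustration only, not the actual implementation; the {{_get_or_create_session}} helper below is a hypothetical stand-in for wherever pandas API on Spark obtains its SparkSession):
{code:python}
import warnings

from pyspark.sql import SparkSession


def _get_or_create_session() -> SparkSession:
    # Hypothetical hook standing in for the place where pandas API on Spark
    # first obtains the SparkSession it uses internally.
    spark = SparkSession.builder.getOrCreate()
    # RuntimeConfig values come back as strings, so compare case-insensitively.
    if spark.conf.get("spark.sql.ansi.enabled", "false").lower() == "true":
        warnings.warn(
            "'spark.sql.ansi.enabled' is set to True. pandas API on Spark "
            "follows pandas semantics rather than ANSI SQL, so some "
            "operations may raise unexpected exceptions in this mode.",
            UserWarning,
        )
    return spark
{code}
As the error messages above already point out, users who hit these failures can bypass them by setting {{spark.sql.ansi.enabled}} (or {{spark.sql.ansi.strictIndexOperator}}) to false for their session, at the cost of losing the ANSI checks.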
was:
Since pandas API on Spark follows the behavior of pandas, not SQL, some unexpected behavior can be occurred when "spark.sql.ansi.enable" is True.

So we might need to show proper warning message when creating pandas-on-Spark session.


> Show warning when creating pandas-on-Spark session under ANSI mode.
> -------------------------------------------------------------------
>
>                 Key: SPARK-38183
>                 URL: https://issues.apache.org/jira/browse/SPARK-38183
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Haejoon Lee
>            Priority: Major
>