[ https://issues.apache.org/jira/browse/SPARK-48085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-48085.
----------------------------------
    Resolution: Invalid

We actually made these all compatible in the master branch. I will avoid 
backporting the changes because they only affect tests.
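
As a minimal illustration (not the actual master-branch change), the central 
difference behind most of the failures below is that ANSI mode, on by default 
in 4.0, flips the behavior of error-prone expressions:

{code}
# Sketch only: toggle ANSI mode on a live session to see the difference.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT 1/0 AS v").show()  # non-ANSI (3.5 default): v is NULL
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT 1/0 AS v").show()  # ANSI (4.0 default): DIVIDE_BY_ZERO error
{code}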

> ANSI enabled by default brings different results in the tests in 3.5 client 
> <> 4.0 server
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-48085
>                 URL: https://issues.apache.org/jira/browse/SPARK-48085
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark, SQL
>    Affects Versions: 4.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> {code}
> ======================================================================
> FAIL [0.169s]: test_checking_csv_header 
> (pyspark.sql.tests.connect.test_parity_datasources.DataSourcesParityTests.test_checking_csv_header)
> ----------------------------------------------------------------------
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.SparkException) [FAILED_READ_FILE.NO_HINT] Encountered 
> error while reading file 
> file:///home/runner/work/spark/spark-3.5/python/target/38acabf5-710b-4c21-b359-f61619e2adc7/tmpm7qyq23g/part-00000-d6c8793b-772d-44e7-bcca-6eeae9cc0ec7-c000.csv.
>   SQLSTATE: KD001
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/test_datasources.py",
>  line 167, in test_checking_csv_header
>     self.assertRaisesRegex(
> AssertionError: "CSV header does not conform to the schema" does not match 
> "(org.apache.spark.SparkException) [FAILED_READ_FILE.NO_HINT] Encountered 
> error while reading file 
> file:///home/runner/work/spark/spark-3.5/python/target/38acabf5-710b-4c21-b359-f61619e2adc7/tmpm7qyq23g/part-00000-d6c8793b-772d-44e7-bcca-6eeae9cc0ec7-c000.csv.
>   SQLSTATE: KD001"
> {code}
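> The failure boils down to the error's shape changing: the header check still 
> fires, but a 4.0 server wraps it in FAILED_READ_FILE.NO_HINT, so the 3.5 
> message assertion no longer matches. A simplified sketch of what the test 
> exercises (not the test itself):
> {code}
> import shutil, tempfile
>
> path = tempfile.mkdtemp()
> shutil.rmtree(path)
> try:
>     spark.createDataFrame([(1, 1000)], "f1 int, f2 int").write.csv(path, header=True)
>     # Schema column order differs from the file header; with enforceSchema=False,
>     # Spark validates the header against the schema and errors on the mismatch.
>     df = spark.read.csv(path, header=True, schema="f2 int, f1 int", enforceSchema=False)
>     df.collect()  # 3.5: "CSV header does not conform to the schema"
>                   # 4.0: wrapped in FAILED_READ_FILE.NO_HINT
> finally:
>     shutil.rmtree(path, ignore_errors=True)
> {code}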
> {code}
> ======================================================================
> ERROR [0.059s]: test_large_variable_types 
> (pyspark.sql.tests.connect.test_parity_pandas_map.MapInPandasParityTests.test_large_variable_types)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/pandas/test_pandas_map.py",
>  line 115, in test_large_variable_types
>     actual = df.mapInPandas(func, "str string, bin binary").collect()
>              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", 
> line 1645, in collect
>     table, schema = self._session.client.to_table(query)
>                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 858, in to_table
>     table, schema, _, _, _ = self._execute_and_fetch(req)
>                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1283, in _execute_and_fetch
>     for response in self._execute_and_fetch_as_iterator(req):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1264, in _execute_and_fetch_as_iterator
>     self._handle_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1503, in _handle_error
>     self._handle_rpc_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1539, in _handle_rpc_error
>     raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.IllegalArgumentException: 
> [INVALID_PARAMETER_VALUE.CHARSET] The value of parameter(s) `charset` in 
> `encode` is invalid: expects one of the charsets 'US-ASCII', 'ISO-8859-1', 
> 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16', but got utf8. SQLSTATE: 
> 22023
> {code}
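> The charset failure is unrelated to the pandas plumbing itself: a 4.0 server 
> validates the charset argument of encode strictly, while 3.5 accepted Java 
> charset aliases such as utf8. Minimal sketch:
> {code}
> from pyspark.sql import functions as sf
>
> df = spark.range(1)
> # The canonical name works on both versions.
> df.select(sf.encode(sf.lit("abc"), "UTF-8")).collect()
> # The "utf8" alias works against a 3.5 server, but a 4.0 server rejects it
> # with INVALID_PARAMETER_VALUE.CHARSET.
> df.select(sf.encode(sf.lit("abc"), "utf8")).collect()
> {code}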
> {code}
> ======================================================================
> ERROR [0.024s]: test_assert_approx_equal_decimaltype_custom_rtol_pass 
> (pyspark.sql.tests.connect.test_utils.ConnectUtilsTests.test_assert_approx_equal_decimaltype_custom_rtol_pass)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/test_utils.py", 
> line 279, in test_assert_approx_equal_decimaltype_custom_rtol_pass
>     assertDataFrameEqual(df1, df2, rtol=1e-1)
>   File "/home/runner/work/spark/spark-3.5/python/pyspark/testing/utils.py", 
> line 595, in assertDataFrameEqual
>     actual_list = actual.collect()
>                   ^^^^^^^^^^^^^^^^
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/dataframe.py", 
> line 1645, in collect
>     table, schema = self._session.client.to_table(query)
>                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 858, in to_table
>     table, schema, _, _, _ = self._execute_and_fetch(req)
>                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1283, in _execute_and_fetch
>     for response in self._execute_and_fetch_as_iterator(req):
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1264, in _execute_and_fetch_as_iterator
>     self._handle_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1503, in _handle_error
>     self._handle_rpc_error(error)
>   File 
> "/home/runner/work/spark/spark-3.5/python/pyspark/sql/connect/client/core.py",
>  line 1539, in _handle_rpc_error
>     raise convert_exception(info, status.message) from None
> pyspark.errors.exceptions.connect.ArithmeticException: 
> [NUMERIC_VALUE_OUT_OF_RANGE.WITH_SUGGESTION]  83.14 cannot be represented as 
> Decimal(4, 3). If necessary set "spark.sql.ansi.enabled" to "false" to bypass 
> this error, and return NULL instead. SQLSTATE: 22003
> ----------------------------------------------------------------------
> {code}
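> Here the test data overflows the declared precision: with ANSI off the cast 
> degrades to NULL, with ANSI on it raises. Minimal sketch:
> {code}
> from pyspark.sql.types import DecimalType
>
> df = spark.createDataFrame([(83.14,)], "v double")
> # Decimal(4, 3) holds at most 9.999, so 83.14 overflows:
> # ANSI on (4.0 default): NUMERIC_VALUE_OUT_OF_RANGE; ANSI off: NULL.
> df.select(df.v.cast(DecimalType(4, 3))).collect()
> {code}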
> {code}
> File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/dataframe.py", 
> line 1057, in pyspark.sql.connect.dataframe.DataFrame.union
> Failed example:
>     df3.show()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/doctest.py", line 
> 1355, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.dataframe.DataFrame.union[10]>", 
> line 1, in <module>
>         df3.show()
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/dataframe.py", 
> line 996, in show
>         print(self._show_string(n, truncate, vertical))
>               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/dataframe.py", 
> line 753, in _show_string
>         ).toPandas()
>           ^^^^^^^^^^
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/dataframe.py", 
> line 1663, in toPandas
>         return self._session.client.to_pandas(query)
>                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/client/core.py", 
> line 873, in to_pandas
>         table, schema, metrics, observed_metrics, _ = self._execute_and_fetch(
>                                                       ^^^^^^^^^^^^^^^^^^^^^^^^
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/client/core.py", 
> line 1283, in _execute_and_fetch
>         for response in self._execute_and_fetch_as_iterator(req):
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/client/core.py", 
> line 1264, in _execute_and_fetch_as_iterator
>         self._handle_error(error)
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/client/core.py", 
> line 1503, in _handle_error
>         self._handle_rpc_error(error)
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/client/core.py", 
> line 1539, in _handle_rpc_error
>         raise convert_exception(info, status.message) from None
>     pyspark.errors.exceptions.connect.NumberFormatException: 
> [CAST_INVALID_INPUT] The value 'Alice' of the type "STRING" cannot be cast to 
> "BIGINT" because it is malformed. Correct the value as per the syntax, or 
> change its target type. Use `try_cast` to tolerate malformed input and return 
> NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass 
> this error. SQLSTATE: 22018
>     JVM stacktrace:
>     org.apache.spark.SparkNumberFormatException: [CAST_INVALID_INPUT] The 
> value 'Alice' of the type "STRING" cannot be cast to "BIGINT" because it is 
> malformed. Correct the value as per the syntax, or change its target type. 
> Use `try_cast` to tolerate malformed input and return NULL instead. If 
> necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. 
> SQLSTATE: 22018
>       at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.invalidInputInCastToNumberError(QueryExecutionErrors.scala:145)
>       at 
> org.apache.spark.sql.catalyst.util.UTF8StringUtils$.withException(UTF8StringUtils.scala:51)
>       at 
> org.apache.spark.sql.catalyst.util.UTF8StringUtils$.toLongExact(UTF8StringUtils.scala:31)
>       at 
> org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToLong$2(Cast.scala:770)
>       at 
> org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToLong$2$adapted(Cast.scala:770)
>       at 
> org.apache.spark.sql.catalyst.expressions.Cast.buildCast(Cast.scala:565)
>       at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castToLong...
> {code}
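> The union doctest fails for the same underlying reason: it ends up casting 
> the string 'Alice' to BIGINT, which ANSI turns from a NULL into an error. 
> The message itself names the two escape hatches:
> {code}
> # ANSI on (4.0 default): CAST_INVALID_INPUT; ANSI off (3.5 default): NULL.
> spark.sql("SELECT CAST('Alice' AS BIGINT) AS v").show()
> # TRY_CAST tolerates malformed input under either setting and returns NULL.
> spark.sql("SELECT TRY_CAST('Alice' AS BIGINT) AS v").show()
> {code}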
> {code}
> **********************************************************************
> File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/functions.py", 
> line 3546, in pyspark.sql.connect.functions.current_database
> Failed example:
>     spark.range(1).select(current_database()).show()
> Expected:
>     +------------------+
>     |current_database()|
>     +------------------+
>     |           default|
>     +------------------+
> Got:
>     +----------------+
>     |current_schema()|
>     +----------------+
>     |         default|
>     +----------------+
>     <BLANKLINE>
> **********************************************************************
> File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/functions.py", 
> line 3547, in pyspark.sql.connect.functions.current_schema
> Failed example:
>     spark.range(1).select(sf.current_schema()).show()
> Expected:
>     +------------------+
>     |current_database()|
>     +------------------+
>     |           default|
>     +------------------+
> Got:
>     +----------------+
>     |current_schema()|
>     +----------------+
>     |         default|
>     +----------------+
>     <BLANKLINE>
> **********************************************************************
> File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/functions.py", 
> line 3310, in pyspark.sql.connect.functions.to_unix_timestamp
> Failed example:
>     df.select(to_unix_timestamp(df.e).alias('r')).collect()
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/doctest.py", line 
> 1355, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.functions.to_unix_timestamp[6]>", 
> line 1, in <module>
>         df.select(to_unix_timestamp(df.e).alias('r')).collect()
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/dataframe.py", 
> line 1645, in collect
>         table, schema = self._session.client.to_table(query)
>                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/client/core.py", 
> line 858, in to_table
>         table, schema, _, _, _ = self._execute_and_fetch(req)
>                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/client/core.py", 
> line 1283, in _execute_and_fetch
>         for response in self._execute_and_fetch_as_iterator(req):
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/client/core.py", 
> line 1264, in _execute_and_fetch_as_iterator
>         self._handle_error(error)
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/client/core.py", 
> line 1503, in _handle_error
>         self._handle_rpc_error(error)
>       File 
> "/home/runner/work/spark/spark-35/python/pyspark/sql/connect/client/core.py", 
> line 1539, in _handle_rpc_error
>         raise convert_exception(info, status.message) from None
>     pyspark.errors.exceptions.connect.DateTimeException: 
> [CANNOT_PARSE_TIMESTAMP] Text '2016-04-08' could not be parsed at index 10. 
> If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. 
> SQLSTATE: 22007
>     JVM stacktrace:
>     org.apache.spark.SparkDateTimeException: [CANNOT_PARSE_TIMESTAMP] Text 
> '2016-04-08' could not be parsed at index 10. If necessary set 
> "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22007
>       at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.ansiDateTimeParseError(QueryExecutionErrors.scala:271)
>       at 
> org.apache.spark.sql.catalyst.expressions.ToTimestamp.eval(datetimeExpressions.scala:1300)
>       at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:159)
>       at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:89)
>       at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$48.$anonfun$applyOrElse$82(Optimizer.scala:2208)
>       at scala.collection.immutable.List.map(List.scala:247)
>       at scala.collection.immutable.List.map(List.scala:79)
>       at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$48.applyOrElse(Optimizer.scala:2208)
>       at org.apache.spark.sql.catalyst.optimizer...
> **********************************************************************
> {code}
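> The last block mixes two separate differences. First, current_database() now 
> renders its result column as current_schema() on a 4.0 server; the value is 
> unchanged. Second, to_unix_timestamp without an explicit format falls back to 
> "yyyy-MM-dd HH:mm:ss", which ANSI refuses to apply to a date-only string. 
> Minimal sketch:
> {code}
> from pyspark.sql import functions as sf
>
> # Same value, different result column name on 4.0.
> spark.range(1).select(sf.current_database()).show()
>
> df = spark.createDataFrame([("2016-04-08",)], ["e"])
> # An explicit pattern parses the date-only string on either version ...
> df.select(sf.to_unix_timestamp(df.e, sf.lit("yyyy-MM-dd")).alias("r")).collect()
> # ... but the default "yyyy-MM-dd HH:mm:ss" fails under ANSI with
> # CANNOT_PARSE_TIMESTAMP (non-ANSI returns NULL instead).
> df.select(sf.to_unix_timestamp(df.e).alias("r")).collect()
> {code}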


