[jira] [Commented] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527893#comment-17527893 ]

Apache Spark commented on SPARK-39015:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36351

> SparkRuntimeException when trying to get non-existent key in a map
> -------------------------------------------------------------------
>
>                 Key: SPARK-39015
>                 URL: https://issues.apache.org/jira/browse/SPARK-39015
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Raza Jafri
>            Priority: Major
>
> [~maxgekk] submitted a [commit|https://github.com/apache/spark/commit/bc8c264851457d8ef59f5b332c79296651ec5d1e] that tries to convert the key to a SQL literal, but that part of the code is blowing up.
> {code:java}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
>
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.types.StringType
> import org.apache.spark.sql.types.DataTypes
>
> val arrayStructureData = Seq(
>   Row(Map("hair"->"black", "eye"->"brown")),
>   Row(Map("hair"->"blond", "eye"->"blue")),
>   Row(Map()))
>
> val mapType = DataTypes.createMapType(StringType, StringType)
>
> val arrayStructureSchema = new StructType()
>   .add("properties", mapType)
>
> val mapTypeDF = spark.createDataFrame(
>   spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
>
> mapTypeDF.selectExpr("element_at(properties, 'hair')").show
>
> // Exiting paste mode, now interpreting.
>
> +----------------------------+
> |element_at(properties, hair)|
> +----------------------------+
> |                       black|
> |                       blond|
> |                        null|
> +----------------------------+
>
> scala> spark.conf.set("spark.sql.ansi.enabled", true)
>
> scala> mapTypeDF.selectExpr("element_at(properties, 'hair')").show
> 22/04/25 18:26:01 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> org.apache.spark.SparkRuntimeException: The feature is not supported: literal for 'hair' of class org.apache.spark.unsafe.types.UTF8String.
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:240) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:101) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:44) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue$(QueryErrorsBase.scala:43) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.toSQLValue(QueryExecutionErrors.scala:69) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
> {code}
> It seems to be trying to convert a UTF8String to a SQL literal.
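For readers reproducing SPARK-39015, a minimal sketch of a workaround against the repro above; it is not the fix from PR 36351. It assumes only functions present in the 3.3 line (map_keys, array_contains, when) and the mapTypeDF defined in the ticket, and it avoids the ANSI-mode error by never probing a missing key:

{code:scala}
// Hedged sketch: guard element_at with a key-existence check so the ANSI-mode
// lookup error (and its broken message rendering) is never triggered.
import org.apache.spark.sql.functions.{array_contains, col, element_at, map_keys, when}

val guarded = mapTypeDF.select(
  when(array_contains(map_keys(col("properties")), "hair"),
    element_at(col("properties"), "hair")).alias("hair"))

guarded.show()  // black / blond / null; no error is raised for the empty map
{code}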
[jira] [Assigned] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39015:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39015:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-39015:
---------------------------------
    Component/s: SQL
                 (was: Spark Core)
[jira] [Resolved] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-39014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39014.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 36348
[https://github.com/apache/spark/pull/36348]
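For readers tracking what SPARK-39014 enables, a minimal sketch; the option name is assumed to mirror the existing spark.sql.files.ignoreMissingFiles SQL conf, and the path is hypothetical:

{code:scala}
// Hedged sketch: a per-source override, honored by InMemoryFileIndex after this
// fix, instead of flipping the session-wide spark.sql.files.ignoreMissingFiles conf.
val events = spark.read
  .option("ignoreMissingFiles", "true")  // data source option (assumed spelling)
  .parquet("/data/events")               // hypothetical path

events.count()  // files deleted between listing and scanning are skipped, not fatal
{code}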
[jira] [Assigned] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-39014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-39014:
------------------------------------

    Assignee: Yaohua Cui
[jira] [Resolved] (SPARK-38976) spark-sql. overwrite. hive table-duplicate records
[ https://issues.apache.org/jira/browse/SPARK-38976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-38976.
----------------------------------
    Resolution: Invalid

> spark-sql. overwrite. hive table-duplicate records
> ---------------------------------------------------
>
>                 Key: SPARK-38976
>                 URL: https://issues.apache.org/jira/browse/SPARK-38976
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.1
>            Reporter: wesharn
>            Priority: Major
>
> Duplicate records occurred when spark-sql overwrote a Hive table. It happens when the Spark job has failure stages; the DataFrame then ends up with duplicate ids. When I run the job again, the result is correct. It confused me. Why?
[jira] [Commented] (SPARK-38976) spark-sql. overwrite. hive table-duplicate records
[ https://issues.apache.org/jira/browse/SPARK-38976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527865#comment-17527865 ]

Hyukjin Kwon commented on SPARK-38976:
--------------------------------------

[~wesharn] I think it's best to interact with the dev mailing list first. Let's file an issue if it's confirmed.
[jira] [Created] (SPARK-39017) Change Java8 datetime support to configurable
Weicheng Wang created SPARK-39017:
-------------------------------------

             Summary: Change Java8 datetime support to configurable
                 Key: SPARK-39017
                 URL: https://issues.apache.org/jira/browse/SPARK-39017
             Project: Spark
          Issue Type: Brainstorming
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Weicheng Wang

*Background:*

Spark 3.1.0 introduced an improvement that enables the Java 8 datetime API by default. It prevents users from setting this configuration, *spark.sql.datetime.java8API.enabled*, on the command line or in a configuration file when using the Spark SQL shell or the Spark Thrift Server. The only way to set it is in the SQL session, using a SET command like:

{code:java}
spark-sql> SET spark.sql.datetime.java8API.enabled=false
{code}

There are a few issues related to this improvement:
* [https://github.com/apache/iceberg/issues/2530]
* [https://github.com/delta-io/delta/issues/760]

There is a workaround for it in 3.2.0: *LocalDateConverter* uses *DateConverter*, and *DateConverter* handles both *LocalDate* and *Date* types.

*Discussion:*

I think we should give users back the ability to set this option in a configuration file and on the command line. Both of these changes defeat the reason for having the java8API option configurable in the first place.

Please advise. Thanks
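To make the flag's effect concrete, a minimal sketch of the observable difference in a spark-shell session (dates shown; timestamps behave analogously with Instant vs. java.sql.Timestamp):

{code:scala}
// With the Java 8 API enabled, collected rows carry java.time values;
// with it disabled, they carry the legacy java.sql values.
spark.conf.set("spark.sql.datetime.java8API.enabled", true)
println(spark.sql("SELECT DATE'2022-04-26' AS d").collect().head.get(0).getClass)
// class java.time.LocalDate

spark.conf.set("spark.sql.datetime.java8API.enabled", false)
println(spark.sql("SELECT DATE'2022-04-26' AS d").collect().head.get(0).getClass)
// class java.sql.Date
{code}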
[jira] [Comment Edited] (SPARK-38820) Support Index can hold arbitrary ExtensionArrays
[ https://issues.apache.org/jira/browse/SPARK-38820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525648#comment-17525648 ]

Yikun Jiang edited comment on SPARK-38820 at 4/26/22 3:18 AM:
--------------------------------------------------------------

[https://pandas.pydata.org/docs/whatsnew/v1.4.0.html#index-can-hold-arbitrary-extensionarrays]
https://github.com/pandas-dev/pandas/commit/e750c94bf1

was (Author: yikunkero):
https://pandas.pydata.org/docs/whatsnew/v1.4.0.html#index-can-hold-arbitrary-extensionarrays

> Support Index can hold arbitrary ExtensionArrays
> -------------------------------------------------
>
>                 Key: SPARK-38820
>                 URL: https://issues.apache.org/jira/browse/SPARK-38820
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Yikun Jiang
>            Priority: Major
>
> {code:java}
> ERROR [1.717s]: test_astype (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanExtensionOpsTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 121, in assertPandasEqual
>     assert_series_equal(
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 1019, in assert_series_equal
>     assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}")
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 506, in assert_attr_equal
>     raise_assert_detail(obj, msg, left_attr, right_attr)
> AssertionError: Attributes of Series are different
>
> Attribute "dtype" are different
> [left]:  CategoricalDtype(categories=[False, True], ordered=False)
> [right]: CategoricalDtype(categories=[False, True], ordered=False)
>
> The above exception was the direct cause of the following exception:
>
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 746, in test_astype
>     self.assert_eq(pser.astype("category"), psser.astype("category"))
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 229, in assert_eq
>     self.assertPandasEqual(lobj, robj, check_exact=check_exact)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 134, in assertPandasEqual
>     raise AssertionError(msg) from e
> AssertionError: Attributes of Series are different
>
> Attribute "dtype" are different
> [left]:  CategoricalDtype(categories=[False, True], ordered=False)
> [right]: CategoricalDtype(categories=[False, True], ordered=False)
>
> Left:
> Name: this, dtype: category
> Categories (2, boolean): [False, True]
> category
>
> Right:
> Name: this, dtype: category
> Categories (2, object): [False, True]
> category
> {code}
[jira] [Assigned] (SPARK-38700) Use error classes in the execution errors of save mode
[ https://issues.apache.org/jira/browse/SPARK-38700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38700:
------------------------------------

    Assignee: (was: Apache Spark)

> Use error classes in the execution errors of save mode
> --------------------------------------------------------
>
>                 Key: SPARK-38700
>                 URL: https://issues.apache.org/jira/browse/SPARK-38700
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * unsupportedSaveModeError
> to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryExecutionErrorsSuite.
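For readers unfamiliar with the error-class migration this ticket asks for, a rough sketch of its shape; the class name, error-class string, and message text here are assumed placeholders, not what the eventual PR settles on. SparkThrowable is the real interface the ticket names:

{code:scala}
import org.apache.spark.SparkThrowable

// Hedged sketch: an exception carrying an error class, as the ticket requests.
// "UNSUPPORTED_SAVE_MODE" and the message wording are assumed for illustration.
class UnsupportedSaveModeException(mode: String)
  extends RuntimeException(s"The save mode $mode is not supported.")
  with SparkThrowable {
  override def getErrorClass: String = "UNSUPPORTED_SAVE_MODE"
}
{code}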
[jira] [Commented] (SPARK-38700) Use error classes in the execution errors of save mode
[ https://issues.apache.org/jira/browse/SPARK-38700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527862#comment-17527862 ]

Apache Spark commented on SPARK-38700:
--------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36350
[jira] [Assigned] (SPARK-38700) Use error classes in the execution errors of save mode
[ https://issues.apache.org/jira/browse/SPARK-38700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38700:
------------------------------------

    Assignee: Apache Spark
[jira] [Updated] (SPARK-39016) Fix compilation warnings related to "`enum` will become a keyword in Scala 3"
[ https://issues.apache.org/jira/browse/SPARK-39016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie updated SPARK-39016:
-----------------------------
    Summary: Fix compilation warnings related to "`enum` will become a keyword in Scala 3"  (was: Fix compilation warnings related to "Wrap `enum` in backticks to use it as an identifier")
[jira] [Created] (SPARK-39016) Fix compilation warnings related to "Wrap `enum` in backticks to use it as an identifier"
Yang Jie created SPARK-39016:
---------------------------------

             Summary: Fix compilation warnings related to "Wrap `enum` in backticks to use it as an identifier"
                 Key: SPARK-39016
                 URL: https://issues.apache.org/jira/browse/SPARK-39016
             Project: Spark
          Issue Type: Improvement
          Components: Tests
    Affects Versions: 3.4.0
            Reporter: Yang Jie

[WARNING] spark-source/core/src/test/scala/org/apache/spark/internal/config/ConfigEntrySuite.scala:172: [deprecation @ | origin= | version=2.13.7] Wrap `enum` in backticks to use it as an identifier, it will become a keyword in Scala 3.

[WARNING] spark-source/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala:553: [deprecation @ | origin= | version=2.13.7] Wrap `enum` in backticks to use it as an identifier, it will become a keyword in Scala 3.
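The fix itself is mechanical; a before/after sketch of the pattern the warning asks for (the identifier below is a stand-in, not the actual code in the two suites):

{code:scala}
// Before: warns on Scala 2.13.7, since `enum` becomes a keyword in Scala 3
// val enum = "a plain identifier named enum"

// After: wrapped in backticks, the identifier compiles cleanly on 2.13 and 3.x
val `enum` = "a plain identifier named enum"
println(`enum`)
{code}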
[jira] [Resolved] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`
[ https://issues.apache.org/jira/browse/SPARK-38989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-38989.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 36306
[https://github.com/apache/spark/pull/36306]

> Implement `ignore_index` of `DataFrame/Series.sample`
> ------------------------------------------------------
>
>                 Key: SPARK-38989
>                 URL: https://issues.apache.org/jira/browse/SPARK-38989
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Assignee: Xinrong Meng
>            Priority: Major
>             Fix For: 3.4.0
[jira] [Assigned] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`
[ https://issues.apache.org/jira/browse/SPARK-38989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-38989:
------------------------------------

    Assignee: Xinrong Meng
[jira] [Updated] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raza Jafri updated SPARK-39015:
-------------------------------
    Description: updated to add the closing note that it seems to be trying to convert a UTF8String to a SQL literal (full text quoted above).
[jira] [Commented] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-39014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527829#comment-17527829 ]

Apache Spark commented on SPARK-39014:
--------------------------------------

User 'Yaohua628' has created a pull request for this issue:
https://github.com/apache/spark/pull/36348
[jira] [Assigned] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-39014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39014:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-39014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39014:
------------------------------------

    Assignee: Apache Spark
[jira] [Updated] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raza Jafri updated SPARK-39015:
-------------------------------
    Description: updated to wrap the repro in {code:java} blocks instead of ``` fences (content as quoted above).
[jira] [Created] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
Raza Jafri created SPARK-39015:
----------------------------------

             Summary: SparkRuntimeException when trying to get non-existent key in a map
                 Key: SPARK-39015
                 URL: https://issues.apache.org/jira/browse/SPARK-39015
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.3.0
            Reporter: Raza Jafri

(The original description is the repro quoted in full above; it was first posted with ``` fences and later reformatted into {code} blocks.)
[jira] [Created] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
Yaohua Zhao created SPARK-39014:
-----------------------------------

             Summary: Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
                 Key: SPARK-39014
                 URL: https://issues.apache.org/jira/browse/SPARK-39014
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Yaohua Zhao
[jira] [Commented] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527811#comment-17527811 ]

Apache Spark commented on SPARK-39001:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36346

> Document which options are unsupported in CSV and JSON functions
> ------------------------------------------------------------------
>
>                 Key: SPARK-39001
>                 URL: https://issues.apache.org/jira/browse/SPARK-39001
>             Project: Spark
>          Issue Type: Documentation
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.3.0, 3.4.0
>
> See https://github.com/apache/spark/pull/36294. Some CSV and JSON options don't work in the expression forms because some of them are plan-wise options, such as parseMode = DROPMALFORMED.
> We should document which options do not work; possibly we should also throw an exception.
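As a concrete illustration of the ticket's point, a sketch of a plan-wise option that the expression form cannot honor; the behavior comments restate the ticket, and the schema and rows are made up:

{code:scala}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import spark.implicits._

val raw = Seq("""{"a": 1}""", """not json""").toDF("s")
val schema = StructType(Seq(StructField("a", IntegerType)))

// PERMISSIVE (the default) parses the malformed row to null:
raw.select(from_json($"s", schema, Map("mode" -> "PERMISSIVE")).alias("j")).show()

// Map("mode" -> "DROPMALFORMED") cannot take effect here: a per-row expression
// cannot drop rows from the plan, which is exactly what this ticket documents.
{code}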
[jira] [Assigned] (SPARK-39008) Change ASF as a single author in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-39008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-39008:
------------------------------------

    Assignee: Hyukjin Kwon

> Change ASF as a single author in Spark distribution
> ----------------------------------------------------
>
>                 Key: SPARK-39008
>                 URL: https://issues.apache.org/jira/browse/SPARK-39008
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.3.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> We mention several original developers as authors in pom.xml and R/pkg/DESCRIPTION while the project is maintained under the ASF organization. We should probably remove them all and keep ASF as the single author.
[jira] [Resolved] (SPARK-39008) Change ASF as a single author in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-39008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39008.
----------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 36337
[https://github.com/apache/spark/pull/36337]
[jira] [Commented] (SPARK-39013) Parser changes to enforce `()` for creating table without any columns
[ https://issues.apache.org/jira/browse/SPARK-39013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527789#comment-17527789 ]

Apache Spark commented on SPARK-39013:
--------------------------------------

User 'jackierwzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36345

> Parser changes to enforce `()` for creating table without any columns
> -----------------------------------------------------------------------
>
>                 Key: SPARK-39013
>                 URL: https://issues.apache.org/jira/browse/SPARK-39013
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4
>            Reporter: Jackie Zhang
>            Priority: Major
>
> We would like to enforce the `()` for `CREATE TABLE` queries, to explicitly indicate that a table without any columns will be created.
> E.g. `CREATE TABLE table () USING DELTA`.
> The existing behavior of CTAS and CREATE external table at a location is not affected.
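A short sketch of the proposed surface; parquet stands in for the ticket's Delta example, and the rejection behavior is the proposal, not current behavior:

{code:scala}
// Proposed: an explicit empty column list for a zero-column table.
spark.sql("CREATE TABLE empty_t () USING parquet")

// Proposed to be rejected by the parser once this change lands:
// spark.sql("CREATE TABLE empty_t USING parquet")

// Unchanged per the ticket: CTAS and CREATE ... LOCATION behave as before.
{code}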
[jira] [Assigned] (SPARK-39013) Parser changes to enforce `()` for creating table without any columns
[ https://issues.apache.org/jira/browse/SPARK-39013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39013:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-39013) Parser changes to enforce `()` for creating table without any columns
[ https://issues.apache.org/jira/browse/SPARK-39013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39013:
------------------------------------

    Assignee: Apache Spark
[jira] [Updated] (SPARK-39013) Parser changes to enforce `()` for creating table without any columns
[ https://issues.apache.org/jira/browse/SPARK-39013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jackie Zhang updated SPARK-39013:
---------------------------------
    Summary: Parser changes to enforce `()` for creating table without any columns  (was: Parse changes to enforce `()` for creating table without any columns)
[jira] [Created] (SPARK-39013) Parse changes to enforce `()` for creating table without any columns
Jackie Zhang created SPARK-39013:
------------------------------------

             Summary: Parse changes to enforce `()` for creating table without any columns
                 Key: SPARK-39013
                 URL: https://issues.apache.org/jira/browse/SPARK-39013
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4
            Reporter: Jackie Zhang

(The description is quoted in full above.)
[jira] [Updated] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rui Wang updated SPARK-39012:
-----------------------------
    Description:

When Spark needs to infer a schema, it has to parse strings into types, and not all data types are supported on this path; binary, for example, is known not to be supported. If a user has a binary column and does not use a metastore, SparkSQL can fall back to schema inference and then fail during the table scan. This should be considered a bug, since schema inference is supported but some types are missing.

A string can be converted to any type except ARRAY, MAP, STRUCT, etc. Also, when converting from a string, a narrower type won't be identified if a wider type also matches (for example, short vs. long).

Based on the Spark SQL data types (https://spark.apache.org/docs/latest/sql-ref-datatypes.html), we can add support for the following types:

BINARY
BOOLEAN

And there are two types that I am not sure SparkSQL supports here:

YearMonthIntervalType
DayTimeIntervalType

> SparkSQL infer schema does not support all data types
> -------------------------------------------------------
>
>                 Key: SPARK-39012
>                 URL: https://issues.apache.org/jira/browse/SPARK-39012
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Rui Wang
>            Priority: Major
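Partition discovery is one path where such string-to-type parsing happens: partition values arrive as directory-name strings and must be inferred into Catalyst types. A hedged sketch with a hypothetical path:

{code:scala}
// Partition values like "flag=true" are plain strings in the path; whether the
// column comes back as boolean or string depends on what inference supports.
spark.range(2).selectExpr("id", "id % 2 = 0 AS flag")
  .write.partitionBy("flag").parquet("/tmp/spark39012")

spark.read.parquet("/tmp/spark39012").printSchema()
// flag: string today where boolean inference is unsupported, per this ticket
{code}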
[jira] [Updated] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-39012: - Description: When Spark needs to infer a schema, it needs to parse strings into types. Not all data types are supported so far in this path; for example, binary is known to not be supported. A string might be converted to any type except ARRAY, MAP, STRUCT, etc. Also, when converting from a string, a smaller-scale type won't be identified if a larger-scale type also matches; for example, short versus long. Based on the Spark SQL data types (https://spark.apache.org/docs/latest/sql-ref-datatypes.html), we can support the following types: BINARY BOOLEAN And there are two types that I am not sure SparkSQL supports: YearMonthIntervalType DayTimeIntervalType was: When Spark needs to infer a schema, it needs to parse strings into types. Not all data types are supported so far in this path; for example, binary is known to not be supported. A string might be converted to any type except ARRAY, MAP, STRUCT, etc. Spark SQL data types: https://spark.apache.org/docs/latest/sql-ref-datatypes.html > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer a schema, it needs to parse strings into types. Not all > data types are supported so far in this path; for example, binary is known to > not be supported. > A string might be converted to any type except ARRAY, MAP, STRUCT, etc. Also, > when converting from a string, a smaller-scale type won't be identified > if a larger-scale type also matches; for example, short versus long. > Based on the Spark SQL data types > (https://spark.apache.org/docs/latest/sql-ref-datatypes.html), we can support > the following types: > BINARY > BOOLEAN > And there are two types that I am not sure SparkSQL supports: > YearMonthIntervalType > DayTimeIntervalType -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527770#comment-17527770 ] Apache Spark commented on SPARK-39012: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/36344 > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39012: Assignee: Apache Spark > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39012: Assignee: (was: Apache Spark) > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527768#comment-17527768 ] Apache Spark commented on SPARK-39012: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/36344 > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527767#comment-17527767 ] Rui Wang commented on SPARK-39012: -- A PR is ready to support the binary type: https://github.com/apache/spark/pull/36344 > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39012) SparkSQL Infer schema path does not support all data types
Rui Wang created SPARK-39012: Summary: SparkSQL Infer schema path does not support all data types Key: SPARK-39012 URL: https://issues.apache.org/jira/browse/SPARK-39012 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Rui Wang When Spark needs to infer schema, it needs to parse string to a type. Not all data types are supported so far in this path. For example, binary is known to not be supported. string might be converted to all types except ARRAY, MAP, STRUCT, etc. Spark SQL data types: https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-39012: - Summary: SparkSQL infer schema does not support all data types (was: SparkSQL Infer schema path does not support all data types) > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35739) [Spark Sql] Add Java-compatible Dataset.join overloads
[ https://issues.apache.org/jira/browse/SPARK-35739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527751#comment-17527751 ] Apache Spark commented on SPARK-35739: -- User 'brandondahler' has created a pull request for this issue: https://github.com/apache/spark/pull/36343 > [Spark Sql] Add Java-compatible Dataset.join overloads > - > > Key: SPARK-35739 > URL: https://issues.apache.org/jira/browse/SPARK-35739 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0, 3.0.0 >Reporter: Brandon Dahler >Priority: Minor > > h2. Problem > When using Spark SQL with Java, the required syntax to utilize the following > two overloads is unnatural and not obvious to developers who haven't had to > interoperate with Scala before: > {code:java} > def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame > def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): > DataFrame > {code} > Examples: > Java 11 > {code:java} > Dataset<Row> dataset1 = ...; > Dataset<Row> dataset2 = ...; > // Overload with multiple usingColumns, no join type > dataset1 > .join(dataset2, JavaConverters.asScalaBuffer(List.of("column", "column2"))) > .show(); > // Overload with multiple usingColumns and a join type > dataset1 > .join( > dataset2, > JavaConverters.asScalaBuffer(List.of("column", "column2")), > "left") > .show(); > {code} > > Additionally, there is no overload that takes a single usingColumn and a > joinType, forcing the developer to use the Seq[String] overload regardless of > language. > Examples: > Scala > {code:java} > val dataset1: DataFrame = ...; > val dataset2: DataFrame = ...; > dataset1 > .join(dataset2, Seq("column"), "left") > .show(); > {code} > > Java 11 > {code:java} > Dataset<Row> dataset1 = ...; > Dataset<Row> dataset2 = ...; > dataset1 > .join(dataset2, JavaConverters.asScalaBuffer(List.of("column")), "left") > .show(); > {code} > h2. Proposed Improvement > Add 3 additional overloads to Dataset: > > {code:java} > def join(right: Dataset[_], usingColumn: List[String]): DataFrame > def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame > def join(right: Dataset[_], usingColumn: List[String], joinType: String): > DataFrame > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38954) Implement sharing of cloud credentials among driver and executors
[ https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38954: -- Affects Version/s: 3.4.0 (was: 3.2.1) > Implement sharing of cloud credentials among driver and executors > - > > Key: SPARK-38954 > URL: https://issues.apache.org/jira/browse/SPARK-38954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Parth Chandra >Priority: Major > > Currently Spark uses external implementations (e.g. hadoop-aws) to access > cloud services like S3. To access the actual service, these > implementations use credentials providers that obtain > credentials allowing access to the cloud service. > These credentials are typically session credentials, which means that they > expire after a fixed time. Sometimes, this expiry can be only an hour, and for > a Spark job that runs for many hours (or a Spark streaming job that runs > continuously), the credentials have to be renewed periodically. > In many organizations, the process of getting credentials may be multi-step. The > organization has an identity provider service that provides authentication > for the user, while the cloud service provider provides authorization for the > roles the user has access to. Once the user is authenticated and her role > verified, the credentials are generated for a new session. > In a large setup with hundreds of Spark jobs and thousands of executors, each > executor then spends a lot of time getting credentials, and this may put > unnecessary load on the backend authentication services. > To alleviate this, we can use Spark's architecture to obtain the credentials > once in the driver and push the credentials to the executors. In addition, > the driver can check the expiry of the credentials and push updated > credentials to the executors. This is relatively easy to do since the RPC > mechanism to implement this is already in place and is used similarly for > Kerberos delegation tokens. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
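To make the proposed flow concrete, a purely illustrative sketch (none of these names are Spark APIs; the ticket only describes the idea of driver-side renewal pushed over the existing RPC channel):
{code:java}
// Hypothetical driver-side loop: fetch session credentials once, push them to
// executors, and refresh them before expiry. All names below are invented.
import java.util.concurrent.{Executors, TimeUnit}

case class CloudCredentials(token: String, expiresAtMillis: Long)

def fetchFromIdentityProvider(): CloudCredentials = ???   // org-specific auth flow
def pushToExecutors(creds: CloudCredentials): Unit = ???  // would reuse Spark's RPC, like delegation tokens

val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleWithFixedDelay(() => {
  val creds = fetchFromIdentityProvider() // one call per renewal, not one per executor
  pushToExecutors(creds)                  // executors swap in the refreshed credentials
}, 0, 30, TimeUnit.MINUTES)
{code}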
[jira] [Resolved] (SPARK-38742) Move the tests `MISSING_COLUMN` to QueryCompilationErrorsSuite
[ https://issues.apache.org/jira/browse/SPARK-38742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-38742. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36280 [https://github.com/apache/spark/pull/36280] > Move the tests `MISSING_COLUMN` to QueryCompilationErrorsSuite > -- > > Key: SPARK-38742 > URL: https://issues.apache.org/jira/browse/SPARK-38742 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: panbingkun >Priority: Major > Fix For: 3.4.0 > > > Move tests for the error class MISSING_COLUMN from SQLQuerySuite to > QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38742) Move the tests `MISSING_COLUMN` to QueryCompilationErrorsSuite
[ https://issues.apache.org/jira/browse/SPARK-38742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-38742: Assignee: panbingkun > Move the tests `MISSING_COLUMN` to QueryCompilationErrorsSuite > -- > > Key: SPARK-38742 > URL: https://issues.apache.org/jira/browse/SPARK-38742 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: panbingkun >Priority: Major > > Move tests for the error class MISSING_COLUMN from SQLQuerySuite to > QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38939) Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax
[ https://issues.apache.org/jira/browse/SPARK-38939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-38939: Fix Version/s: 3.3.0 (was: 3.4.0) > Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax > - > > Key: SPARK-38939 > URL: https://issues.apache.org/jira/browse/SPARK-38939 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Jackie Zhang >Assignee: Jackie Zhang >Priority: Major > Fix For: 3.3.0 > > > Currently the `ALTER TABLE ... DROP COLUMN(s) ...` syntax will always throw an error > if the column doesn't exist. We would like to provide an (IF EXISTS) syntax > to give a better user experience for downstream handlers (such as Delta) > that support it, and to be consistent with other statements such as `DROP TABLE > (IF EXISTS)` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
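A minimal sketch of the intended usage (the table and column names are hypothetical, the placement of IF EXISTS is my reading of the ticket title and may differ from the final grammar, and the statements assume a catalog/format that supports dropping columns, such as a v2 catalog or Delta):
{code:java}
// Hypothetical spark-shell sketch of the proposed syntax:
spark.sql("ALTER TABLE t DROP COLUMN tmp")            // today: errors if `tmp` is absent
spark.sql("ALTER TABLE t DROP IF EXISTS COLUMN tmp")  // proposed: a no-op if `tmp` is absent
{code}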
[jira] [Resolved] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39001. -- Fix Version/s: 3.3.0 3.4.0 Resolution: Fixed Issue resolved by pull request 36339 [https://github.com/apache/spark/pull/36339] > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > See https://github.com/apache/spark/pull/36294. Some of the CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
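For context, a small spark-shell sketch of the behavior the ticket is about (the column name and data are made up; the point is that a plan-wise option has no row-dropping effect inside an expression):
{code:java}
// In the expression form, mode=DROPMALFORMED cannot drop rows: the malformed
// record simply parses to null instead of disappearing.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = new StructType().add("a", IntegerType)
val df = Seq("""{"a": 1}""", """not-json""").toDF("value")
df.select(from_json(col("value"), schema, Map("mode" -> "DROPMALFORMED")).as("parsed")).show()
// two rows come back: the parsed struct and a null, not one row
{code}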
[jira] [Assigned] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39001: Assignee: Hyukjin Kwon > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > See https://github.com/apache/spark/pull/36294. Some of the CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39007) Use double quotes for SQL configs in error messages
[ https://issues.apache.org/jira/browse/SPARK-39007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39007: - Fix Version/s: 3.3.0 > Use double quotes for SQL configs in error messages > --- > > Key: SPARK-39007 > URL: https://issues.apache.org/jira/browse/SPARK-39007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > All SQL configs should be printed in SQL style in error messages, and wrapped > in double quotes. For example, the config spark.sql.ansi.enabled should be > highlighted as "spark.sql.ansi.enabled" to make it more visible in error > messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
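As a tiny illustration of the convention (the helper below is a sketch; the actual method in Spark's QueryErrorsBase may differ in name and signature):
{code:java}
// Minimal sketch: quote a SQL config key for use in an error message.
def toSQLConf(conf: String): String = "\"" + conf + "\""

toSQLConf("spark.sql.ansi.enabled") // returns "spark.sql.ansi.enabled", quotes included
{code}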
[jira] [Resolved] (SPARK-38939) Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax
[ https://issues.apache.org/jira/browse/SPARK-38939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38939. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36252 [https://github.com/apache/spark/pull/36252] > Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax > - > > Key: SPARK-38939 > URL: https://issues.apache.org/jira/browse/SPARK-38939 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Jackie Zhang >Assignee: Jackie Zhang >Priority: Major > Fix For: 3.4.0 > > > Currently the `ALTER TABLE ... DROP COLUMN(s) ...` syntax will always throw an error > if the column doesn't exist. We would like to provide an (IF EXISTS) syntax > to give a better user experience for downstream handlers (such as Delta) > that support it, and to be consistent with other statements such as `DROP TABLE > (IF EXISTS)` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38939) Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax
[ https://issues.apache.org/jira/browse/SPARK-38939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38939: --- Assignee: Jackie Zhang > Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax > - > > Key: SPARK-38939 > URL: https://issues.apache.org/jira/browse/SPARK-38939 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Jackie Zhang >Assignee: Jackie Zhang >Priority: Major > > Currently the `ALTER TABLE ... DROP COLUMN(s) ...` syntax will always throw an error > if the column doesn't exist. We would like to provide an (IF EXISTS) syntax > to give a better user experience for downstream handlers (such as Delta) > that support it, and to be consistent with other statements such as `DROP TABLE > (IF EXISTS)` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527648#comment-17527648 ] jagadeesh commented on SPARK-25355: --- [~pedro.rossi] , we are running into a problem with this feature enabled in Spark 3.2 on K8s. Any insights? Appreciate your help. Here is the setup: * The service id is configured properly on the HDFS side: {code:java} hadoop.proxyuser.serviceid.groups = * hadoop.proxyuser.serviceid.hosts = * hadoop.proxyuser.serviceid.users = * {code} * Getting the service id's Kerberos ticket in the Spark client. * Running the Spark job without --proxy-user, connecting to the Kerberized HDFS cluster - {color:#00875a}WORKS AS EXPECTED.{color} * Running the Spark job with --proxy-user=, connecting to the Kerberized HDFS cluster - {color:#de350b}FAILS{color} {code:java} $SPARK_HOME/bin/spark-submit \ --master \ --deploy-mode cluster \ --proxy-user \ --name spark-javawordcount \ --class org.apache.spark.examples.JavaWordCount \ --conf spark.kubernetes.container.image=\ --conf spark.kubernetes.driver.podTemplateFile=driver.yaml \ --conf spark.kubernetes.executor.podTemplateFile=executor.yaml \ --conf spark.kubernetes.container.image.pullPolicy=Always \ --conf spark.kubernetes.driver.limit.cores=1 \ --conf spark.executor.instances=2 \ --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --conf spark.kubernetes.namespace= \ --conf spark.eventLog.enabled=true \ --conf spark.eventLog.dir=hdfs://:8020/scaas/shs_logs \ --conf spark.kubernetes.file.upload.path=hdfs://:8020/tmp \ $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar /user//input{code} * ERROR logs from the driver pod: {code:java} ++ id -u + myuid=185 ++ id -g + mygid=0 + set +e ++ getent passwd 185 + uidentry= + set -e + '[' -z '' ']' + '[' -w /etc/passwd ']' + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' + SPARK_CLASSPATH=':/opt/spark/jars/*' + env + grep SPARK_JAVA_OPT_ + sort -t_ -k4 -n + sed 's/[^=]*=\(.*\)/\1/g' + readarray -t SPARK_EXECUTOR_JAVA_OPTS + '[' -n '' ']' + '[' -z ']' + '[' -z ']' + '[' -n '' ']' + '[' -z x ']' + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*' + '[' -z x ']' + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*' + case "$1" in + shift 1 + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@") + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress= --deploy-mode client --proxy-user --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.JavaWordCount spark-internal /user//input WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor java.nio.DirectByteBuffer(long,int) WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release 22/04/21 17:50:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22/04/21 17:50:30 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
22/04/21 17:50:30 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:50:31 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:50:37 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:50:53 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:51:32 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:52:07 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:52:27 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN,
[jira] [Commented] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527611#comment-17527611 ] Apache Spark commented on SPARK-38879: -- User 'pralabhkumar' has created a pull request for this issue: https://github.com/apache/spark/pull/36342 > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Minor > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38879: Assignee: Apache Spark > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Apache Spark >Priority: Minor > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38879: Assignee: (was: Apache Spark) > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Minor > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37696) Optimizer exceeds max iterations
[ https://issues.apache.org/jira/browse/SPARK-37696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37696: - Affects Version/s: 3.2.1 > Optimizer exceeds max iterations > > > Key: SPARK-37696 > URL: https://issues.apache.org/jira/browse/SPARK-37696 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1 >Reporter: Denis Tarima >Priority: Minor > > A specific scenario causing Spark's failure in tests and a warning in > production: > 21/12/20 06:45:24 WARN BaseSessionStateBuilder$$anon$2: Max iterations (100) > reached for batch Operator Optimization before Inferring Filters, please set > 'spark.sql.optimizer.maxIterations' to a larger value. > 21/12/20 06:45:24 WARN BaseSessionStateBuilder$$anon$2: Max iterations (100) > reached for batch Operator Optimization after Inferring Filters, please set > 'spark.sql.optimizer.maxIterations' to a larger value. > > To reproduce, run the following commands in `spark-shell`: > {{// define case class for a struct type in an array}} > {{case class S(v: Int, v2: Int)}} > > {{// prepare a table with an array of structs}} > {{Seq((10, Seq(S(1, 2)))).toDF("i", "data").write.saveAsTable("tbl")}} > > {{// select using SQL and join with a dataset using "left_anti"}} > {{spark.sql("select i, data[size(data) - 1].v from > tbl").join(Seq(10).toDF("i"), Seq("i"), "left_anti").show()}} > > The following conditions are required: > # Having the additional `v2` field in `S` > # Using {{data[size(data) - 1]}} instead of {{element_at(data, -1)}} > # Using {{left_anti}} in the join operation > > The same behavior was observed in the `master` branch and `3.1.1`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38868) `assert_true` fails unconditionally after `left_outer` joins
[ https://issues.apache.org/jira/browse/SPARK-38868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527585#comment-17527585 ] Apache Spark commented on SPARK-38868: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/36341 > `assert_true` fails unconditionally after `left_outer` joins > > > Key: SPARK-38868 > URL: https://issues.apache.org/jira/browse/SPARK-38868 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.1, 3.1.2, 3.2.0, 3.2.1, 3.3.0, 3.4.0 >Reporter: Fabien Dubosson >Priority: Major > > When `assert_true` is used after a `left_outer` join, the assert exception is > raised even though all the rows meet the condition. Using an `inner` join > does not expose this issue. > > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql import functions as sf > session = SparkSession.builder.getOrCreate() > entries = session.createDataFrame( > [ > ("a", 1), > ("b", 2), > ("c", 3), > ], > ["id", "outcome_id"], > ) > outcomes = session.createDataFrame( > [ > (1, 12), > (2, 34), > (3, 32), > ], > ["outcome_id", "outcome_value"], > ) > # Inner join works as expected > ( > entries.join(outcomes, on="outcome_id", how="inner") > .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10)) > .filter(sf.col("valid").isNull()) > .show() > ) > # Left join fails with «'('outcome_value > 10)' is not true!» even though it > is the case > ( > entries.join(outcomes, on="outcome_id", how="left_outer") > .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10)) > .filter(sf.col("valid").isNull()) > .show() > ){code} > Reproduced on `pyspark` versions: `3.2.1`, `3.2.0`, `3.1.2` and `3.1.1`. I am > not sure if "native" Spark exposes this issue as well or not; I don't have > the knowledge/setup to test that. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39011) V2 Filter to ORC Predicate support
[ https://issues.apache.org/jira/browse/SPARK-39011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-39011: --- Summary: V2 Filter to ORC Predicate support (was: V2 Filter to ORC Filter support) > V2 Filter to ORC Predicate support > -- > > Key: SPARK-39011 > URL: https://issues.apache.org/jira/browse/SPARK-39011 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4 >Reporter: Huaxin Gao >Priority: Major > > add V2 filter to ORC predicate support -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39011) V2 Filter to ORC Filter support
Huaxin Gao created SPARK-39011: -- Summary: V2 Filter to ORC Filter support Key: SPARK-39011 URL: https://issues.apache.org/jira/browse/SPARK-39011 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4 Reporter: Huaxin Gao add V2 filter to ORC predicate support -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39010) V2 Filter to Parquet Predicate support
Huaxin Gao created SPARK-39010: -- Summary: V2 Filter to Parquet Predicate support Key: SPARK-39010 URL: https://issues.apache.org/jira/browse/SPARK-39010 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4 Reporter: Huaxin Gao Add support for V2 Filter to Parquet Predicate -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
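For orientation, a small sketch of the kind of DS V2 predicate these two subtasks translate from (built with the connector expression classes as I understand them; treat the exact constructors and factory methods as assumptions, not verified API):
{code:java}
// Hypothetical sketch: a V2 Predicate for `id = 1`, the kind of filter these
// subtasks would translate into a Parquet predicate or an ORC SearchArgument.
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, LiteralValue}
import org.apache.spark.sql.connector.expressions.filter.Predicate
import org.apache.spark.sql.types.IntegerType

val children: Array[Expression] =
  Array(Expressions.column("id"), LiteralValue(1, IntegerType))
val eq = new Predicate("=", children)
// A translator would pattern-match on eq.name() and eq.children() to build the
// file-format-specific predicate.
{code}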
[jira] [Resolved] (SPARK-38667) Optimizer generates error when using inner join along with sequence
[ https://issues.apache.org/jira/browse/SPARK-38667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars resolved SPARK-38667. -- Resolution: Resolved > Optimizer generates error when using inner join along with sequence > --- > > Key: SPARK-38667 > URL: https://issues.apache.org/jira/browse/SPARK-38667 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2 >Reporter: Lars >Priority: Major > > This issue occurred in a more complex scenario, so I've broken it down into a > simple case. > {*}Steps to reproduce{*}: Execute the following example. The code should run > without errors, but instead a *java.lang.IllegalArgumentException: Illegal > sequence boundaries: 4 to 2 by 1* is thrown. > {code:java} > package com.example > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions._ > object SparkIssue { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .master("local[*]") > .getOrCreate() > val dfA = spark > .createDataFrame(Seq((1, 1), (2, 4))) > .toDF("a1", "a2") > val dfB = spark > .createDataFrame(Seq((1, 5), (2, 2))) > .toDF("b1", "b2") > dfA.join(dfB, dfA("a1") === dfB("b1"), "inner") > .where(col("a2") < col("b2")) > .withColumn("x", explode(sequence(col("a2"), col("b2"), lit(1)))) > .show() > spark.stop() > } > } > {code} > When I look at the Optimized Logical Plan, I can see that the Inner Join and > the Filter are brought together, with an additional check for an empty > Sequence. The exception is thrown because the Sequence check is executed > before the Filter. > {code:java} > == Parsed Logical Plan == > 'Project [a1#4, a2#5, b1#12, b2#13, explode(sequence('a2, 'b2, Some(1), > None)) AS x#24] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Analyzed Logical Plan == > a1: int, a2: int, b1: int, b2: int, x: int > Project [a1#4, a2#5, b1#12, b2#13, x#25] > +- Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), > false, [x#25] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Optimized Logical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), false, > [x#25] > +- Join Inner, (((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), > true) > 0) AND (a2#5 < b2#13)) AND (a1#4 = b1#12)) > :- LocalRelation [a1#4, a2#5] > +- LocalRelation [b1#12, b2#13] > == Physical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), [a1#4, > a2#5, b1#12, b2#13], false, [x#25] > +- *(1) BroadcastHashJoin [a1#4], [b1#12], Inner, BuildRight, > ((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), true) > 0) AND > (a2#5 < b2#13)), false > :- *(1) LocalTableScan [a1#4, a2#5] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint)),false), [id=#15] > +- LocalTableScan [b1#12, b2#13] > {code} > > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38667) Optimizer generates error when using inner join along with sequence
[ https://issues.apache.org/jira/browse/SPARK-38667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527547#comment-17527547 ] Lars commented on SPARK-38667: -- Thanks all for pointing this out. Changed the affected version to 3.1.2 and resolved this issue. > Optimizer generates error when using inner join along with sequence > --- > > Key: SPARK-38667 > URL: https://issues.apache.org/jira/browse/SPARK-38667 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2 >Reporter: Lars >Priority: Major > > This issue occurred in a more complex scenario, so I've broken it down into a > simple case. > {*}Steps to reproduce{*}: Execute the following example. The code should run > without errors, but instead a *java.lang.IllegalArgumentException: Illegal > sequence boundaries: 4 to 2 by 1* is thrown. > {code:java} > package com.example > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions._ > object SparkIssue { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .master("local[*]") > .getOrCreate() > val dfA = spark > .createDataFrame(Seq((1, 1), (2, 4))) > .toDF("a1", "a2") > val dfB = spark > .createDataFrame(Seq((1, 5), (2, 2))) > .toDF("b1", "b2") > dfA.join(dfB, dfA("a1") === dfB("b1"), "inner") > .where(col("a2") < col("b2")) > .withColumn("x", explode(sequence(col("a2"), col("b2"), lit(1)))) > .show() > spark.stop() > } > } > {code} > When I look at the Optimized Logical Plan, I can see that the Inner Join and > the Filter are brought together, with an additional check for an empty > Sequence. The exception is thrown because the Sequence check is executed > before the Filter. > {code:java} > == Parsed Logical Plan == > 'Project [a1#4, a2#5, b1#12, b2#13, explode(sequence('a2, 'b2, Some(1), > None)) AS x#24] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Analyzed Logical Plan == > a1: int, a2: int, b1: int, b2: int, x: int > Project [a1#4, a2#5, b1#12, b2#13, x#25] > +- Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), > false, [x#25] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Optimized Logical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), false, > [x#25] > +- Join Inner, (((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), > true) > 0) AND (a2#5 < b2#13)) AND (a1#4 = b1#12)) > :- LocalRelation [a1#4, a2#5] > +- LocalRelation [b1#12, b2#13] > == Physical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), [a1#4, > a2#5, b1#12, b2#13], false, [x#25] > +- *(1) BroadcastHashJoin [a1#4], [b1#12], Inner, BuildRight, > ((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), true) > 0) AND > (a2#5 < b2#13)), false > :- *(1) LocalTableScan [a1#4, a2#5] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint)),false), [id=#15] > +- LocalTableScan [b1#12, b2#13] > {code} > > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38667) Optimizer generates error when using inner join along with sequence
[ https://issues.apache.org/jira/browse/SPARK-38667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars updated SPARK-38667: - Affects Version/s: 3.1.2 (was: 3.2.1) > Optimizer generates error when using inner join along with sequence > --- > > Key: SPARK-38667 > URL: https://issues.apache.org/jira/browse/SPARK-38667 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2 >Reporter: Lars >Priority: Major > > This issue occurred in a more complex scenario, so I've broken it down into a > simple case. > {*}Steps to reproduce{*}: Execute the following example. The code should run > without errors, but instead a *java.lang.IllegalArgumentException: Illegal > sequence boundaries: 4 to 2 by 1* is thrown. > {code:java} > package com.example > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions._ > object SparkIssue { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .master("local[*]") > .getOrCreate() > val dfA = spark > .createDataFrame(Seq((1, 1), (2, 4))) > .toDF("a1", "a2") > val dfB = spark > .createDataFrame(Seq((1, 5), (2, 2))) > .toDF("b1", "b2") > dfA.join(dfB, dfA("a1") === dfB("b1"), "inner") > .where(col("a2") < col("b2")) > .withColumn("x", explode(sequence(col("a2"), col("b2"), lit(1)))) > .show() > spark.stop() > } > } > {code} > When I look at the Optimized Logical Plan, I can see that the Inner Join and > the Filter are brought together, with an additional check for an empty > Sequence. The exception is thrown because the Sequence check is executed > before the Filter. > {code:java} > == Parsed Logical Plan == > 'Project [a1#4, a2#5, b1#12, b2#13, explode(sequence('a2, 'b2, Some(1), > None)) AS x#24] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Analyzed Logical Plan == > a1: int, a2: int, b1: int, b2: int, x: int > Project [a1#4, a2#5, b1#12, b2#13, x#25] > +- Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), > false, [x#25] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Optimized Logical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), false, > [x#25] > +- Join Inner, (((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), > true) > 0) AND (a2#5 < b2#13)) AND (a1#4 = b1#12)) > :- LocalRelation [a1#4, a2#5] > +- LocalRelation [b1#12, b2#13] > == Physical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), [a1#4, > a2#5, b1#12, b2#13], false, [x#25] > +- *(1) BroadcastHashJoin [a1#4], [b1#12], Inner, BuildRight, > ((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), true) > 0) AND > (a2#5 < b2#13)), false > :- *(1) LocalTableScan [a1#4, a2#5] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint)),false), [id=#15] > +- LocalTableScan [b1#12, b2#13] > {code} > > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures
[ https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527527#comment-17527527 ] Nicholas Chammas commented on SPARK-37222: -- Thanks for the detailed report, [~ssmith]. I am hitting this issue as well on Spark 3.2.1, and your minimal test case also reproduces the issue for me. How did you break down the optimization into its individual steps like that? That was very helpful. I was able to use your breakdown to work around the issue by excluding {{{}PushDownLeftSemiAntiJoin{}}}: {code:java} spark.conf.set( "spark.sql.optimizer.excludedRules", "org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin" ){code} If I run that before running the problematic query (including your test case), it seems to work around the issue. > Max iterations reached in Operator Optimization w/left_anti or left_semi join > and nested structures > --- > > Key: SPARK-37222 > URL: https://issues.apache.org/jira/browse/SPARK-37222 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2, 3.2.0, 3.2.1 > Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and > with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, > 2021. > The problem does not occur with Spark 3.0.1. > >Reporter: Shawn Smith >Priority: Major > > The query optimizer never reaches a fixed point when optimizing the query > below. This manifests as a warning: > > WARN: Max iterations (100) reached for batch Operator Optimization before > > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a > > larger value. > But the suggested fix won't help. The actual problem is that the optimizer > fails to make progress on each iteration and gets stuck in a loop. > In practice, Spark logs a warning but continues on and appears to execute the > query successfully, albeit perhaps sub-optimally. > To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and > 3.2.0 but not 3.0.1 it will throw an exception: > {noformat} > case class Nested(b: Boolean, n: Long) > case class Table(id: String, nested: Nested) > case class Identifier(id: String) > locally { > System.setProperty("spark.testing", "true") // Fail instead of logging a > warning > val df = List.empty[Table].toDS.cache() > val ids = List.empty[Identifier].toDS.cache() > df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi" > .select('id, 'nested("n")) > .explain() > } > {noformat} > Looking at the query plan as the optimizer iterates in > {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations: > {noformat} > Project [id#2, _gen_alias_108#108L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >:- Project [id#2, nested#3.n AS _gen_alias_108#108L] >: +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, > deserialized, 1 replicas) >:+- LocalTableScan , [id#2, nested#3] >+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- LocalTableScan , [id#18] > {noformat} > And here's the plan after one more iteration. You can see that all that has > changed is new aliases for the column in the nested column: > "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}". 
> {noformat} > Project [id#2, _gen_alias_109#109L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >:- Project [id#2, nested#3.n AS _gen_alias_109#109L] >: +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, > deserialized, 1 replicas) >:+- LocalTableScan , [id#2, nested#3] >+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- LocalTableScan , [id#18] > {noformat} > The optimizer continues looping and tweaking the alias until it hits the max > iteration count and bails out. > Here's an example that includes a stack trace: > {noformat} > $ bin/spark-shell > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :paste > // Entering paste mode (ctrl-D to finish) > case class Nested(b: Boolean, n: Long) > case class Table(id: String, nested: Nested) > case class Identifier(id: String) > locally { > System.setProperty("spark.testing", "true") // Fail instead of logging a > warning > val df = List.empty[Table].toDS.cache() > val ids = List.empty[Identifier].toDS.cache() > df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi" > .select('id,
[jira] [Updated] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures
[ https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37222: - Affects Version/s: 3.2.1 > Max iterations reached in Operator Optimization w/left_anti or left_semi join > and nested structures > --- > > Key: SPARK-37222 > URL: https://issues.apache.org/jira/browse/SPARK-37222 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2, 3.2.0, 3.2.1 > Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and > with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, > 2021. > The problem does not occur with Spark 3.0.1. > >Reporter: Shawn Smith >Priority: Major > > The query optimizer never reaches a fixed point when optimizing the query > below. This manifests as a warning: > > WARN: Max iterations (100) reached for batch Operator Optimization before > > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a > > larger value. > But the suggested fix won't help. The actual problem is that the optimizer > fails to make progress on each iteration and gets stuck in a loop. > In practice, Spark logs a warning but continues on and appears to execute the > query successfully, albeit perhaps sub-optimally. > To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and > 3.2.0 but not 3.0.1 it will throw an exception: > {noformat} > case class Nested(b: Boolean, n: Long) > case class Table(id: String, nested: Nested) > case class Identifier(id: String) > locally { > System.setProperty("spark.testing", "true") // Fail instead of logging a > warning > val df = List.empty[Table].toDS.cache() > val ids = List.empty[Identifier].toDS.cache() > df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi" > .select('id, 'nested("n")) > .explain() > } > {noformat} > Looking at the query plan as the optimizer iterates in > {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations: > {noformat} > Project [id#2, _gen_alias_108#108L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >:- Project [id#2, nested#3.n AS _gen_alias_108#108L] >: +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, > deserialized, 1 replicas) >:+- LocalTableScan , [id#2, nested#3] >+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- LocalTableScan , [id#18] > {noformat} > And here's the plan after one more iteration. You can see that all that has > changed is new aliases for the column in the nested column: > "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}". > {noformat} > Project [id#2, _gen_alias_109#109L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >:- Project [id#2, nested#3.n AS _gen_alias_109#109L] >: +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, > deserialized, 1 replicas) >:+- LocalTableScan , [id#2, nested#3] >+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- LocalTableScan , [id#18] > {noformat} > The optimizer continues looping and tweaking the alias until it hits the max > iteration count and bails out. > Here's an example that includes a stack trace: > {noformat} > $ bin/spark-shell > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> :paste > // Entering paste mode (ctrl-D to finish) > case class Nested(b: Boolean, n: Long) > case class Table(id: String, nested: Nested) > case class Identifier(id: String) > locally { > System.setProperty("spark.testing", "true") // Fail instead of logging a > warning > val df = List.empty[Table].toDS.cache() > val ids = List.empty[Identifier].toDS.cache() > df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi" > .select('id, 'nested("n")) > .explain() > } > // Exiting paste mode, now interpreting. > java.lang.RuntimeException: Max iterations (100) reached for batch Operator > Optimization before Inferring Filters, please set > 'spark.sql.optimizer.maxIterations' to a larger value. > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:246) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200) > at scala.collection.immutable.List.foreach(List.scala:431) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200) >
[jira] [Commented] (SPARK-38983) Pyspark throws AnalysisException with incorrect error message when using .grouping() or .groupingId() (AnalysisException: grouping() can only be used with GroupingSets
[ https://issues.apache.org/jira/browse/SPARK-38983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527512#comment-17527512 ] Chris Kimmel commented on SPARK-38983: -- Thanks for your comment, [~hyukjin.kwon]. This issue is about the misleading error message. I edited the ticket to clarify.
> Pyspark throws AnalysisException with incorrect error message when using
> .grouping() or .groupingId() (AnalysisException: grouping() can only be used
> with GroupingSets/Cube/Rollup;)
> -
>
> Key: SPARK-38983
> URL: https://issues.apache.org/jira/browse/SPARK-38983
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.1.2, 3.2.1
> Environment: I have reproduced this error in two environments. I would be happy to answer questions about either.
> h1. Environment 1
> I first encountered this error on my employer's Azure Databricks cluster, which runs Spark version 3.1.2. I have limited access to cluster configuration information, but I can ask if it will help.
> h1. Environment 2
> I reproduced the error by running the same code in the Pyspark shell from Spark 3.2.1 on my Chromebook (i.e. Crostini Linux). I have more access to environment information here. Running {{spark-submit --version}} produced the following output:
> {{Welcome to Spark version 3.2.1}}
> {{Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.14}}
> {{Branch HEAD}}
> {{Compiled by user hgao on 2022-01-20T19:26:14Z}}
> {{Revision 4f25b3f71238a00508a356591553f2dfa89f8290}}
> {{Url https://github.com/apache/spark}}
> Reporter: Chris Kimmel
> Priority: Minor
> Labels: cube, error_message_improvement, exception-handling, grouping, rollup
>
> h1. In a nutshell
> Pyspark emits an incorrect error message when the user commits a type error with the result of the {{grouping()}} function.
> h1. Code to reproduce
> {code:python}
> print(spark.version)  # My environment, Azure DataBricks, defines spark automatically.
> from pyspark.sql import functions as f
> from pyspark.sql import types as t
>
> l = [
>     ('a',),
>     ('b',),
> ]
> s = t.StructType([
>     t.StructField('col1', t.StringType())
> ])
> df = spark.createDataFrame(l, s)
> df.display()
>
> (  # This expression raises an AnalysisException()
>     df
>     .cube(f.col('col1'))
>     .agg(f.grouping('col1') & f.lit(True))
>     .collect()
> )
> {code}
> h1. Expected results
> The code produces an {{AnalysisException()}} with an error message along the lines of:
> {{AnalysisException: cannot resolve '(GROUPING(`col1`) AND true)' due to data type mismatch: differing types in '(GROUPING(`col1`) AND true)' (int and boolean).;}}
> h1. Actual results
> The code throws an {{AnalysisException()}} with error message
> {{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
> Python provides the following traceback:
> {code:python}
> ---
> AnalysisException                        Traceback (most recent call last)
>  in
>      15
>      16 ( # This expression raises an AnalysisException()
> ---> 17     df
>      18     .cube(f.col('col1'))
>      19     .agg(f.grouping('col1') & f.lit(True))
>
> /databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
>     116             # Columns
>     117             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
> --> 118             jdf = self._jgd.agg(exprs[0]._jc,
>     119                                 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
>     120         return DataFrame(jdf, self.sql_ctx)
>
> /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
>    1302
>    1303         answer = self.gateway_client.send_command(command)
> -> 1304         return_value = get_return_value(
>    1305             answer, self.gateway_client, self.target_id, self.name)
>    1306
>
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>     121                 # Hide where the exception came from that shows a non-Pythonic
>     122                 # JVM exception message.
> --> 123                 raise converted from None
>     124             else:
>     125                 raise
>
> AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;
> 'Aggregate [cube(col1#548)], [col1#548, (grouping(col1#548) AND true) AS (grouping(col1) AND true)#551
> {code}
[jira] [Updated] (SPARK-38983) Pyspark throws AnalysisException with incorrect error message when using .grouping() or .groupingId() (AnalysisException: grouping() can only be used with GroupingSets/C
[ https://issues.apache.org/jira/browse/SPARK-38983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Kimmel updated SPARK-38983: - Description:
h1. In a nutshell
Pyspark emits an incorrect error message when the user commits a type error with the result of the {{grouping()}} function.
h1. Code to reproduce
{code:python}
print(spark.version)  # My environment, Azure DataBricks, defines spark automatically.
from pyspark.sql import functions as f
from pyspark.sql import types as t

l = [
    ('a',),
    ('b',),
]
s = t.StructType([
    t.StructField('col1', t.StringType())
])
df = spark.createDataFrame(l, s)
df.display()

(  # This expression raises an AnalysisException()
    df
    .cube(f.col('col1'))
    .agg(f.grouping('col1') & f.lit(True))
    .collect()
)
{code}
h1. Expected results
The code produces an {{AnalysisException()}} with an error message along the lines of:
{{AnalysisException: cannot resolve '(GROUPING(`col1`) AND true)' due to data type mismatch: differing types in '(GROUPING(`col1`) AND true)' (int and boolean).;}}
h1. Actual results
The code throws an {{AnalysisException()}} with error message
{{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
Python provides the following traceback:
{code:python}
---
AnalysisException                        Traceback (most recent call last)
 in
     15
     16 ( # This expression raises an AnalysisException()
---> 17     df
     18     .cube(f.col('col1'))
     19     .agg(f.grouping('col1') & f.lit(True))

/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
    116             # Columns
    117             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
--> 118             jdf = self._jgd.agg(exprs[0]._jc,
    119                                 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
    120         return DataFrame(jdf, self.sql_ctx)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    121                 # Hide where the exception came from that shows a non-Pythonic
    122                 # JVM exception message.
--> 123                 raise converted from None
    124             else:
    125                 raise

AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;
'Aggregate [cube(col1#548)], [col1#548, (grouping(col1#548) AND true) AS (grouping(col1) AND true)#551]
+- LogicalRDD [col1#548], false
{code}
h1. Workaround
_Note:_ The reason I opened this ticket is that, when the user makes a particular type error, the resulting error message is misleading. The code snippet below shows how to fix that type error. It does not address the false-error-message bug, which is the focus of this ticket.
Cast the result of {{.grouping()}} to boolean type. That is, know _ab ovo_ that {{.grouping()}} produces an integer 0 or 1 rather than a boolean True or False.
{code:python}
(  # This expression does not raise an AnalysisException()
    df
    .cube(f.col('col1'))
    .agg(f.grouping('col1').cast(t.BooleanType()) & f.lit(True))
    .collect()
)
{code}
h1. Additional notes
The same error occurs if {{.cube()}} is replaced with {{.rollup()}} in "Code to reproduce".
The same error occurs if {{.grouping()}} is replaced with {{.grouping_id()}} in "Code to reproduce".
h1. Related tickets
https://issues.apache.org/jira/browse/SPARK-22748
h1. Relevant documentation
* [Spark SQL GROUPBY, ROLLUP, and CUBE semantics|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html]
* [DataFrame.cube()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.cube.html]
* [DataFrame.rollup()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.rollup.html]
* [DataFrame.agg()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.agg.html]
* [functions.grouping()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping.html]
* [functions.grouping_id()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping_id.html]

was: h1. Code to reproduce {{print(spark.version) # My environment, Azure DataBricks, defines spark automatically.}} {{from pyspark.sql
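Since the "Additional notes" say the same error occurs with {{.grouping_id()}}, the same cast-based workaround should apply there as well; a sketch reusing the {{df}}, {{f}}, and {{t}} names from the repro ({{grouping_id()}} likewise returns an integer, not a boolean):

{code:python}
(  # Sketch: mirror the cast workaround for grouping_id()
    df
    .cube(f.col('col1'))
    .agg(f.grouping_id('col1').cast(t.BooleanType()) & f.lit(True))
    .collect()
)
{code}

For cubes over several columns, comparing the id explicitly (for example {{f.grouping_id(...) != f.lit(0)}}) may be clearer than a cast, since the id is a bit vector rather than a single flag.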
[jira] [Commented] (SPARK-39007) Use double quotes for SQL configs in error messages
[ https://issues.apache.org/jira/browse/SPARK-39007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527501#comment-17527501 ] Apache Spark commented on SPARK-39007: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36340 > Use double quotes for SQL configs in error messages > --- > > Key: SPARK-39007 > URL: https://issues.apache.org/jira/browse/SPARK-39007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > All SQL configs should be printed in SQL style in error messages and wrapped > in double quotes. For example, the config spark.sql.ansi.enabled should be > highlighted as "spark.sql.ansi.enabled" to make it more visible in error > messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39009) Spark Log4j vul - CVE-2021-44228
[ https://issues.apache.org/jira/browse/SPARK-39009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-39009. -- Resolution: Duplicate https://issues.apache.org/jira/browse/SPARK-6305 But this is not how to use JIRA; read https://spark.apache.org/contributing.html > Spark Log4j vul - CVE-2021-44228 > > > Key: SPARK-39009 > URL: https://issues.apache.org/jira/browse/SPARK-39009 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Production >Reporter: Prakash Shankar >Priority: Major > > When can we expect the Spark 3.3 release? Can you please confirm whether > it'll fix the log4j issue? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37174) WARN WindowExec: No Partition Defined is being printed 4 times.
[ https://issues.apache.org/jira/browse/SPARK-37174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-37174: Attachment: (was: info.txt) > WARN WindowExec: No Partition Defined is being printed 4 times. > > > Key: SPARK-37174 > URL: https://issues.apache.org/jira/browse/SPARK-37174 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Hi, I use this code: > {code:python} > f01 = spark.read.json("/home/test_files/falk/flatted110721/F01.json/*.json") > pf01 = f01.to_pandas_on_spark() > pf01 = pf01.rename(columns=lambda x: re.sub(':P$', '', x)) > pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"] = > ps.to_datetime(pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"]) > pf01.info(){code} > > Sometimes it prints: > {code:java} > 21/10/31 20:38:04 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 20:38:04 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > 21/10/31 20:38:08 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > /opt/spark/python/pyspark/sql/pandas/conversion.py:214: PerformanceWarning: > DataFrame is highly fragmented. This is usually the result of calling > `frame.insert` many times, which has poor performance. Consider joining all > columns at once using pd.concat(axis=1) instead. To get a de-fragmented > frame, use `newframe = frame.copy()` > df[column_name] = series > /opt/spark/python/pyspark/pandas/utils.py:967: UserWarning: `to_pandas` > loads all data into the driver's memory. It should only be used if the > resulting pandas Series is expected to be small. > warnings.warn(message, UserWarning) > 21/10/31 20:38:16 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 20:38:18 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation.{code} > > and some other times it "just" prints: > > {code:java} > 21/10/31 21:24:13 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:16 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:22 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:24 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation.{code} > Why does it print "df[column_name] = series"? > > Can we remove the /opt/spark/python/pyspark/pandas/utils.py:967 line? > And the "warnings.warn(message, UserWarning)" line? > And 3 of the 4 "WARN WindowExec: No Partition Defined for Window operation! Moving > all data to a single partition, this can cause serious performance > degradation." messages? 
> > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
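The warning itself comes from window operations whose spec defines no partitioning, so all rows are shuffled to a single partition; pandas-on-Spark builds such windows internally, for example for its default index. A minimal sketch of the difference (assuming a running {{spark}} session):

{code:python}
from pyspark.sql import Window, functions as f

df = spark.range(10).withColumn("key", f.col("id") % 2)

# No partitionBy: WindowExec warns and moves all rows to one partition.
w_all = Window.orderBy("id")
df.withColumn("rn", f.row_number().over(w_all)).show()

# With partitionBy: no warning, the work is distributed per key.
w_key = Window.partitionBy("key").orderBy("id")
df.withColumn("rn", f.row_number().over(w_key)).show()
{code}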
[jira] [Commented] (SPARK-38988) Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get printed many times.
[ https://issues.apache.org/jira/browse/SPARK-38988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527432#comment-17527432 ] Bjørn Jørgensen commented on SPARK-38988: - I added a new file "warning printed.txt"; it shows that the behavior depends on the dataframe size. If you have a dataframe with "Int64Index: 34 entries, 0 to 33, Data columns (total 37 columns)", the warning won't get printed. If the dataframe has "Int64Index: 109 entries, 0 to 108, Data columns (total 112 columns)", then the warning is printed 13 times. > Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get > printed many times. > --- > > Key: SPARK-38988 > URL: https://issues.apache.org/jira/browse/SPARK-38988 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > Attachments: Untitled.html, info.txt, warning printed.txt > > > I added a file and a notebook with the info message I get when I run df.info() > Spark master build from 13.04.22. > df.shape > (763300, 224) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
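If the goal is just to quiet the duplicated output on the user side while the root cause is investigated: the message is pandas' own {{PerformanceWarning}}, raised during the internal pandas conversion, so a standard warnings filter applies (a sketch, not a fix for the underlying fragmentation):

{code:python}
import warnings
from pandas.errors import PerformanceWarning

# Suppress pandas' fragmentation warning emitted during conversion.
warnings.filterwarnings("ignore", category=PerformanceWarning)
{code}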
[jira] [Updated] (SPARK-38988) Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get printed many times.
[ https://issues.apache.org/jira/browse/SPARK-38988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-38988: Attachment: warning printed.txt > Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get > printed many times. > --- > > Key: SPARK-38988 > URL: https://issues.apache.org/jira/browse/SPARK-38988 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > Attachments: Untitled.html, info.txt, warning printed.txt > > > I added a file and a notebook with the info message I get when I run df.info() > Spark master build from 13.04.22. > df.shape > (763300, 224) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38965) Optimize RemoteBlockPushResolver with a memory pool
[ https://issues.apache.org/jira/browse/SPARK-38965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-38965: Summary: Optimize RemoteBlockPushResolver with a memory pool (was: Retry transfer blocks for exceptions listed in the error handler ) > Optimize RemoteBlockPushResolver with a memory pool > --- > > Key: SPARK-38965 > URL: https://issues.apache.org/jira/browse/SPARK-38965 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Minor > > For the push-based shuffle service, there are many > {{BLOCK_APPEND_COLLISION_DETECTED}} failures when there are many small map task > outputs. In {{RemoteBlockPushResolver}}, while one map task's pushed blocks are > being written, the other map tasks' pushed blocks will fail in the {{onComplete()}} > method. > And {{RemoteBlockPushResolver}} has no memory limit, so many executors will > OOM when there are many small pushed blocks waiting to be written to the > final data file. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
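The memory-pool idea can be sketched generically: cap the bytes buffered for deferred pushed blocks and fail fast (or defer) once the cap is reached, instead of buffering without bound. A toy illustration only, in Python for readability; the real change would live in the Java shuffle service code:

{code:python}
import threading

class BlockBufferPool:
    """Toy bounded pool: acquire() fails fast once the cap is reached."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0
        self.lock = threading.Lock()

    def acquire(self, n: int) -> bool:
        with self.lock:
            if self.used + n > self.max_bytes:
                return False  # caller defers or rejects the pushed block
            self.used += n
            return True

    def release(self, n: int) -> None:
        with self.lock:
            self.used = max(0, self.used - n)
{code}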
[jira] [Commented] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527402#comment-17527402 ] Apache Spark commented on SPARK-39001: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/36339 > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://github.com/apache/spark/pull/36294. Some CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
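For context, this is the kind of mismatch being described: an option that works in {{DataFrameReader}} is silently ignored inside an expression. A minimal sketch (assuming a running {{spark}} session; per the ticket, {{mode}} is a plan-wise option with no effect in {{from_csv}}):

{code:python}
from pyspark.sql import functions as f

df = spark.createDataFrame([("not_an_int,abc",)], ["value"])

# In DataFrameReader, mode=DROPMALFORMED would drop the malformed row.
# Inside the from_csv expression the option has no effect: the row is
# still returned, parsed permissively with a null field.
df.select(
    f.from_csv(f.col("value"), "a INT, b STRING", {"mode": "DROPMALFORMED"}).alias("parsed")
).show(truncate=False)
{code}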
[jira] [Assigned] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39001: Assignee: Apache Spark > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > See https://github.com/apache/spark/pull/36294. Some CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39001: Assignee: (was: Apache Spark) > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://github.com/apache/spark/pull/36294. Some CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527400#comment-17527400 ] Hyukjin Kwon commented on SPARK-39001: -- Actually, this is pretty straightforward. Let me just make a quick PR. > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://github.com/apache/spark/pull/36294. Some CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38965) Retry transfer blocks for exceptions listed in the error handler
[ https://issues.apache.org/jira/browse/SPARK-38965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-38965: Description: For the push-based shuffle service, there are many {{BLOCK_APPEND_COLLISION_DETECTED}} failures when there are many small map task outputs. In {{RemoteBlockPushResolver}}, while one map task's pushed blocks are being written, the other map tasks' pushed blocks will fail in the {{onComplete()}} method. And {{RemoteBlockPushResolver}} has no memory limit, so many executors will OOM when there are many small pushed blocks waiting to be written to the final data file. was: We should retry transfer blocks if *errorHandler.shouldRetryError(e)* returns true, even though that exception may not be an IOException, for example: {code:java} org.apache.spark.network.server.BlockPushNonFatalFailure: Block shufflePush_0_0_3316_5647 experienced merge collision on the server side {code} > Retry transfer blocks for exceptions listed in the error handler > - > > Key: SPARK-38965 > URL: https://issues.apache.org/jira/browse/SPARK-38965 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Minor > > For the push-based shuffle service, there are many > {{BLOCK_APPEND_COLLISION_DETECTED}} failures when there are many small map task > outputs. In {{RemoteBlockPushResolver}}, while one map task's pushed blocks are > being written, the other map tasks' pushed blocks will fail in the {{onComplete()}} > method. > And {{RemoteBlockPushResolver}} has no memory limit, so many executors will > OOM when there are many small pushed blocks waiting to be written to the > final data file. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39007) Use double quotes for SQL configs in error messages
[ https://issues.apache.org/jira/browse/SPARK-39007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39007. -- Resolution: Fixed Issue resolved by pull request 36335 [https://github.com/apache/spark/pull/36335] > Use double quotes for SQL configs in error messages > --- > > Key: SPARK-39007 > URL: https://issues.apache.org/jira/browse/SPARK-39007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > All SQL configs should be printed in SQL style in error messages and wrapped > in double quotes. For example, the config spark.sql.ansi.enabled should be > highlighted as "spark.sql.ansi.enabled" to make it more visible in error > messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38999) Refactor DataSourceScanExec code to
[ https://issues.apache.org/jira/browse/SPARK-38999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38999: --- Assignee: Utkarsh Agarwal > Refactor DataSourceScanExec code to > > > Key: SPARK-38999 > URL: https://issues.apache.org/jira/browse/SPARK-38999 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Utkarsh Agarwal >Assignee: Utkarsh Agarwal >Priority: Major > > Currently the code for the `FileSourceScanExec` class, the physical node for > file scans, is quite complex and lengthy. The class should be refactored into > a trait `FileSourceScanLike` which implements basic functionality like > metrics and file listing. The execution-specific code can then live inside > `FileSourceScanExec`, which will subclass `FileSourceScanLike`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38999) Refactor DataSourceScanExec code to
[ https://issues.apache.org/jira/browse/SPARK-38999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38999. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36327 [https://github.com/apache/spark/pull/36327] > Refactor DataSourceScanExec code to > > > Key: SPARK-38999 > URL: https://issues.apache.org/jira/browse/SPARK-38999 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Utkarsh Agarwal >Assignee: Utkarsh Agarwal >Priority: Major > Fix For: 3.4.0 > > > Currently the code for the `FileSourceScanExec` class, the physical node for > file scans, is quite complex and lengthy. The class should be refactored into > a trait `FileSourceScanLike` which implements basic functionality like > metrics and file listing. The execution-specific code can then live inside > `FileSourceScanExec`, which will subclass `FileSourceScanLike`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38981) Unexpected commutative property of udf/pandas_udf and filters
[ https://issues.apache.org/jira/browse/SPARK-38981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38981: - Priority: Major (was: Critical) > Unexpected commutative property of udf/pandas_udf and filters > - > > Key: SPARK-38981 > URL: https://issues.apache.org/jira/browse/SPARK-38981 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Maximilian Sackel >Priority: Major > Labels: beginner > Attachments: optimization_udf_filter.html, screenshot-1.png, > screenshot-2.png > > > Hello all, > When running the minimal working example in the attachments, the > order of the filter and the UDF is swapped by the optimizer. This can lead to > errors that are difficult to debug. In the documentation I have found no > reference to such behavior. > Is this a bug or a functionality which is poorly documented? > With kind regards, > Max -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38981) Unexpected commutative property of udf/pandas_udf and filters
[ https://issues.apache.org/jira/browse/SPARK-38981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38981: - Component/s: PySpark > Unexpected commutative property of udf/pandas_udf and filters > - > > Key: SPARK-38981 > URL: https://issues.apache.org/jira/browse/SPARK-38981 > Project: Spark > Issue Type: Bug > Components: Optimizer, PySpark >Affects Versions: 3.2.1 >Reporter: Maximilian Sackel >Priority: Major > Attachments: optimization_udf_filter.html, screenshot-1.png, > screenshot-2.png > > > Hello all, > When running the minimal working example in the attachments, the > order of the filter and the UDF is swapped by the optimizer. This can lead to > errors that are difficult to debug. In the documentation I have found no > reference to such behavior. > Is this a bug or a functionality which is poorly documented? > With kind regards, > Max -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38981) Unexpected commutative property of udf/pandas_udf and filters
[ https://issues.apache.org/jira/browse/SPARK-38981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38981: - Labels: (was: beginner) > Unexpected commutative property of udf/pandas_udf and filters > - > > Key: SPARK-38981 > URL: https://issues.apache.org/jira/browse/SPARK-38981 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Maximilian Sackel >Priority: Major > Attachments: optimization_udf_filter.html, screenshot-1.png, > screenshot-2.png > > > Hello all, > When running the minimal working example in the attachments, the > order of the filter and the UDF is swapped by the optimizer. This can lead to > errors that are difficult to debug. In the documentation I have found no > reference to such behavior. > Is this a bug or a functionality which is poorly documented? > With kind regards, > Max -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39009) Spark Log4j vul - CVE-2021-44228
Prakash Shankar created SPARK-39009: --- Summary: Spark Log4j vul - CVE-2021-44228 Key: SPARK-39009 URL: https://issues.apache.org/jira/browse/SPARK-39009 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.0.0 Environment: Production Reporter: Prakash Shankar When can we expect the Spark 3.3 release? Can you please confirm whether it'll fix the log4j issue? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38981) Unexpected commutative property of udf/pandas_udf and filters
[ https://issues.apache.org/jira/browse/SPARK-38981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527322#comment-17527322 ] Maximilian Sackel commented on SPARK-38981: --- To give the minimal working example some more context, I'll try to motivate it. In general, a function should be applied to a large table for a certain category type. The task is therefore divided into subtasks: a) For each row, determine the category type using a UDF. b) Filter rows by the searched category types. c) Calculate values for the category types using a UDF; if rows that do not correspond to the category are used in the calculation, an error is thrown. d) The error terminates the whole process. Simply adding the rule "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate" to the exclude rules does not seem to solve the problem. [~hyukjin.kwon], it would be really nice if you could refer me to the appropriate place in the documentation where I can start testing. The basic idea is to exclude the optimizer rules for the corresponding lines and then reactivate them, to make use of the optimizer algorithms again? > Unexpected commutative property of udf/pandas_udf and filters > - > > Key: SPARK-38981 > URL: https://issues.apache.org/jira/browse/SPARK-38981 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Maximilian Sackel >Priority: Critical > Labels: beginner > Attachments: optimization_udf_filter.html, screenshot-1.png, > screenshot-2.png > > > Hello all, > When running the minimal working example in the attachments, the > order of the filter and the UDF is swapped by the optimizer. This can lead to > errors that are difficult to debug. In the documentation I have found no > reference to such behavior. > Is this a bug or a functionality which is poorly documented? > With kind regards, > Max -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
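A compact sketch of the pattern described in the comment above (names are hypothetical; whether the swap actually occurs depends on the optimized plan). The hazard is that both Python UDFs can be collected into one evaluation batch, so the second UDF may run on rows the filter was written to remove:

{code:python}
from pyspark.sql import functions as f

@f.udf("string")
def category(x):
    # Step (a): classify each row.
    return "A" if x % 2 == 0 else "B"

@f.udf("double")
def value_for_a(x):
    # Step (c): only valid for category "A" rows; anything else errors.
    if x % 2 != 0:
        raise ValueError(f"row {x} is not category A")
    return float(x) * 10.0

df = spark.range(10).withColumn("x", f.col("id").cast("int"))

# Step (b): the filter sits between the two UDFs in the written order,
# but the optimizer may evaluate value_for_a before the filter prunes rows,
# reproducing steps (c) and (d) from the comment.
result = (
    df
    .withColumn("cat", category(f.col("x")))
    .filter(f.col("cat") == "A")
    .withColumn("val", value_for_a(f.col("x")))
)
result.show()
{code}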