[jira] [Commented] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527893#comment-17527893 ]

Apache Spark commented on SPARK-39015:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36351

> SparkRuntimeException when trying to get non-existent key in a map
> -------------------------------------------------------------------
>
>                 Key: SPARK-39015
>                 URL: https://issues.apache.org/jira/browse/SPARK-39015
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Raza Jafri
>            Priority: Major
>
> [~maxgekk] submitted a [commit|https://github.com/apache/spark/commit/bc8c264851457d8ef59f5b332c79296651ec5d1e] that tries to convert the key to a SQL literal, but that part of the code is blowing up.
> {code:java}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
>
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.types.StringType
> import org.apache.spark.sql.types.DataTypes
>
> val arrayStructureData = Seq(
>   Row(Map("hair"->"black", "eye"->"brown")),
>   Row(Map("hair"->"blond", "eye"->"blue")),
>   Row(Map()))
>
> val mapType = DataTypes.createMapType(StringType, StringType)
>
> val arrayStructureSchema = new StructType()
>   .add("properties", mapType)
>
> val mapTypeDF = spark.createDataFrame(
>   spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
>
> mapTypeDF.selectExpr("element_at(properties, 'hair')").show
>
> // Exiting paste mode, now interpreting.
>
> +----------------------------+
> |element_at(properties, hair)|
> +----------------------------+
> |                       black|
> |                       blond|
> |                        null|
> +----------------------------+
>
> scala> spark.conf.set("spark.sql.ansi.enabled", true)
>
> scala> mapTypeDF.selectExpr("element_at(properties, 'hair')").show
> 22/04/25 18:26:01 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> org.apache.spark.SparkRuntimeException: The feature is not supported: literal for 'hair' of class org.apache.spark.unsafe.types.UTF8String.
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError(QueryExecutionErrors.scala:240) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:101) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue(QueryErrorsBase.scala:44) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at org.apache.spark.sql.errors.QueryErrorsBase.toSQLValue$(QueryErrorsBase.scala:43) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.toSQLValue(QueryExecutionErrors.scala:69) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT]
> {code}
> It seems to be trying to convert a UTF8String to a SQL literal.
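For readers reproducing SPARK-39015, a minimal sketch of a workaround against the repro above; it is not the fix from PR 36351. It assumes only functions present in the 3.3 line (map_keys, array_contains, when) and the mapTypeDF defined in the ticket, and it avoids the ANSI-mode error by never probing a missing key:

{code:scala}
// Hedged sketch: guard element_at with a key-existence check so the ANSI-mode
// lookup error (and its broken message rendering) is never triggered.
import org.apache.spark.sql.functions.{array_contains, col, element_at, map_keys, when}

val guarded = mapTypeDF.select(
  when(array_contains(map_keys(col("properties")), "hair"),
    element_at(col("properties"), "hair")).alias("hair"))

guarded.show()  // black / blond / null; no error is raised for the empty map
{code}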
[jira] [Assigned] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39015:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39015:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-39015:
---------------------------------
    Component/s: SQL
                 (was: Spark Core)
[jira] [Resolved] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-39014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39014.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 36348
[https://github.com/apache/spark/pull/36348]
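For readers tracking what SPARK-39014 enables, a minimal sketch; the option name is assumed to mirror the existing spark.sql.files.ignoreMissingFiles SQL conf, and the path is hypothetical:

{code:scala}
// Hedged sketch: a per-source override, honored by InMemoryFileIndex after this
// fix, instead of flipping the session-wide spark.sql.files.ignoreMissingFiles conf.
val events = spark.read
  .option("ignoreMissingFiles", "true")  // data source option (assumed spelling)
  .parquet("/data/events")               // hypothetical path

events.count()  // files deleted between listing and scanning are skipped, not fatal
{code}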
[jira] [Assigned] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-39014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-39014:
------------------------------------

    Assignee: Yaohua Cui
[jira] [Resolved] (SPARK-38976) spark-sql. overwrite. hive table-duplicate records
[ https://issues.apache.org/jira/browse/SPARK-38976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-38976.
----------------------------------
    Resolution: Invalid

> spark-sql. overwrite. hive table-duplicate records
> ---------------------------------------------------
>
>                 Key: SPARK-38976
>                 URL: https://issues.apache.org/jira/browse/SPARK-38976
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.1
>            Reporter: wesharn
>            Priority: Major
>
> Duplicate records occurred when spark-sql overwrote a Hive table. It happens when the Spark job has failure stages; the DataFrame then ends up with duplicate ids. When I run the job again, the result is correct. It confused me. Why?
[jira] [Commented] (SPARK-38976) spark-sql. overwrite. hive table-duplicate records
[ https://issues.apache.org/jira/browse/SPARK-38976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527865#comment-17527865 ]

Hyukjin Kwon commented on SPARK-38976:
--------------------------------------

[~wesharn] I think it's best to interact with the dev mailing list first. Let's file an issue if it's confirmed.
[jira] [Created] (SPARK-39017) Change Java8 datetime support to configurable
Weicheng Wang created SPARK-39017:
-------------------------------------

             Summary: Change Java8 datetime support to configurable
                 Key: SPARK-39017
                 URL: https://issues.apache.org/jira/browse/SPARK-39017
             Project: Spark
          Issue Type: Brainstorming
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Weicheng Wang

*Background:*

Spark 3.1.0 introduced an improvement that enables the Java 8 datetime API by default. It prevents users from setting this configuration, *spark.sql.datetime.java8API.enabled*, on the command line or in a configuration file when using the Spark SQL shell or the Spark Thrift Server. The only way to set it is in the SQL session, using a SET command like:

{code:java}
spark-sql> SET spark.sql.datetime.java8API.enabled=false
{code}

There are a few issues related to this improvement:
* [https://github.com/apache/iceberg/issues/2530]
* [https://github.com/delta-io/delta/issues/760]

There is a workaround for it in 3.2.0: *LocalDateConverter* uses *DateConverter*, and *DateConverter* handles both *LocalDate* and *Date* types.

*Discussion:*

I think we should give users back the ability to set this option in a configuration file and on the command line. Both of these changes defeat the reason for having the java8API option configurable in the first place.

Please advise. Thanks
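To make the flag's effect concrete, a minimal sketch of the observable difference in a spark-shell session (dates shown; timestamps behave analogously with Instant vs. java.sql.Timestamp):

{code:scala}
// With the Java 8 API enabled, collected rows carry java.time values;
// with it disabled, they carry the legacy java.sql values.
spark.conf.set("spark.sql.datetime.java8API.enabled", true)
println(spark.sql("SELECT DATE'2022-04-26' AS d").collect().head.get(0).getClass)
// class java.time.LocalDate

spark.conf.set("spark.sql.datetime.java8API.enabled", false)
println(spark.sql("SELECT DATE'2022-04-26' AS d").collect().head.get(0).getClass)
// class java.sql.Date
{code}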
[jira] [Comment Edited] (SPARK-38820) Support Index can hold arbitrary ExtensionArrays
[ https://issues.apache.org/jira/browse/SPARK-38820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525648#comment-17525648 ]

Yikun Jiang edited comment on SPARK-38820 at 4/26/22 3:18 AM:
--------------------------------------------------------------

[https://pandas.pydata.org/docs/whatsnew/v1.4.0.html#index-can-hold-arbitrary-extensionarrays]
https://github.com/pandas-dev/pandas/commit/e750c94bf1

was (Author: yikunkero):
https://pandas.pydata.org/docs/whatsnew/v1.4.0.html#index-can-hold-arbitrary-extensionarrays

> Support Index can hold arbitrary ExtensionArrays
> -------------------------------------------------
>
>                 Key: SPARK-38820
>                 URL: https://issues.apache.org/jira/browse/SPARK-38820
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Yikun Jiang
>            Priority: Major
>
> {code:java}
> ERROR [1.717s]: test_astype (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanExtensionOpsTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 121, in assertPandasEqual
>     assert_series_equal(
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 1019, in assert_series_equal
>     assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}")
>   File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 506, in assert_attr_equal
>     raise_assert_detail(obj, msg, left_attr, right_attr)
> AssertionError: Attributes of Series are different
>
> Attribute "dtype" are different
> [left]:  CategoricalDtype(categories=[False, True], ordered=False)
> [right]: CategoricalDtype(categories=[False, True], ordered=False)
>
> The above exception was the direct cause of the following exception:
>
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 746, in test_astype
>     self.assert_eq(pser.astype("category"), psser.astype("category"))
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 229, in assert_eq
>     self.assertPandasEqual(lobj, robj, check_exact=check_exact)
>   File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 134, in assertPandasEqual
>     raise AssertionError(msg) from e
> AssertionError: Attributes of Series are different
>
> Attribute "dtype" are different
> [left]:  CategoricalDtype(categories=[False, True], ordered=False)
> [right]: CategoricalDtype(categories=[False, True], ordered=False)
>
> Left:
> Name: this, dtype: category
> Categories (2, boolean): [False, True]
> category
>
> Right:
> Name: this, dtype: category
> Categories (2, object): [False, True]
> category
> {code}
[jira] [Assigned] (SPARK-38700) Use error classes in the execution errors of save mode
[ https://issues.apache.org/jira/browse/SPARK-38700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38700:
------------------------------------

    Assignee: (was: Apache Spark)

> Use error classes in the execution errors of save mode
> --------------------------------------------------------
>
>                 Key: SPARK-38700
>                 URL: https://issues.apache.org/jira/browse/SPARK-38700
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * unsupportedSaveModeError
> to use error classes. Throw an implementation of SparkThrowable. Also write a test for every error in QueryExecutionErrorsSuite.
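For readers unfamiliar with the error-class migration this ticket asks for, a rough sketch of its shape; the class name, error-class string, and message text here are assumed placeholders, not what the eventual PR settles on. SparkThrowable is the real interface the ticket names:

{code:scala}
import org.apache.spark.SparkThrowable

// Hedged sketch: an exception carrying an error class, as the ticket requests.
// "UNSUPPORTED_SAVE_MODE" and the message wording are assumed for illustration.
class UnsupportedSaveModeException(mode: String)
  extends RuntimeException(s"The save mode $mode is not supported.")
  with SparkThrowable {
  override def getErrorClass: String = "UNSUPPORTED_SAVE_MODE"
}
{code}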
[jira] [Commented] (SPARK-38700) Use error classes in the execution errors of save mode
[ https://issues.apache.org/jira/browse/SPARK-38700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527862#comment-17527862 ]

Apache Spark commented on SPARK-38700:
--------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36350
[jira] [Assigned] (SPARK-38700) Use error classes in the execution errors of save mode
[ https://issues.apache.org/jira/browse/SPARK-38700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38700:
------------------------------------

    Assignee: Apache Spark
[jira] [Updated] (SPARK-39016) Fix compilation warnings related to "`enum` will become a keyword in Scala 3"
[ https://issues.apache.org/jira/browse/SPARK-39016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Jie updated SPARK-39016:
-----------------------------
    Summary: Fix compilation warnings related to "`enum` will become a keyword in Scala 3"  (was: Fix compilation warnings related to "Wrap `enum` in backticks to use it as an identifier")
[jira] [Created] (SPARK-39016) Fix compilation warnings related to "Wrap `enum` in backticks to use it as an identifier"
Yang Jie created SPARK-39016:
---------------------------------

             Summary: Fix compilation warnings related to "Wrap `enum` in backticks to use it as an identifier"
                 Key: SPARK-39016
                 URL: https://issues.apache.org/jira/browse/SPARK-39016
             Project: Spark
          Issue Type: Improvement
          Components: Tests
    Affects Versions: 3.4.0
            Reporter: Yang Jie

[WARNING] spark-source/core/src/test/scala/org/apache/spark/internal/config/ConfigEntrySuite.scala:172: [deprecation @ | origin= | version=2.13.7] Wrap `enum` in backticks to use it as an identifier, it will become a keyword in Scala 3.

[WARNING] spark-source/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala:553: [deprecation @ | origin= | version=2.13.7] Wrap `enum` in backticks to use it as an identifier, it will become a keyword in Scala 3.
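The fix itself is mechanical; a before/after sketch of the pattern the warning asks for (the identifier below is a stand-in, not the actual code in the two suites):

{code:scala}
// Before: warns on Scala 2.13.7, since `enum` becomes a keyword in Scala 3
// val enum = "a plain identifier named enum"

// After: wrapped in backticks, the identifier compiles cleanly on 2.13 and 3.x
val `enum` = "a plain identifier named enum"
println(`enum`)
{code}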
[jira] [Resolved] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`
[ https://issues.apache.org/jira/browse/SPARK-38989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-38989.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 36306
[https://github.com/apache/spark/pull/36306]

> Implement `ignore_index` of `DataFrame/Series.sample`
> ------------------------------------------------------
>
>                 Key: SPARK-38989
>                 URL: https://issues.apache.org/jira/browse/SPARK-38989
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Assignee: Xinrong Meng
>            Priority: Major
>             Fix For: 3.4.0
[jira] [Assigned] (SPARK-38989) Implement `ignore_index` of `DataFrame/Series.sample`
[ https://issues.apache.org/jira/browse/SPARK-38989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-38989:
------------------------------------

    Assignee: Xinrong Meng
[jira] [Updated] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raza Jafri updated SPARK-39015:
-------------------------------
    Description: updated to add the closing note that it seems to be trying to convert a UTF8String to a SQL literal (full text quoted above).
[jira] [Commented] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-39014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527829#comment-17527829 ]

Apache Spark commented on SPARK-39014:
--------------------------------------

User 'Yaohua628' has created a pull request for this issue:
https://github.com/apache/spark/pull/36348
[jira] [Assigned] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-39014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39014:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-39014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39014:
------------------------------------

    Assignee: Apache Spark
[jira] [Updated] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
[ https://issues.apache.org/jira/browse/SPARK-39015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raza Jafri updated SPARK-39015:
-------------------------------
    Description: updated to wrap the repro in {code:java} blocks instead of ``` fences (content as quoted above).
[jira] [Created] (SPARK-39015) SparkRuntimeException when trying to get non-existent key in a map
Raza Jafri created SPARK-39015:
----------------------------------

             Summary: SparkRuntimeException when trying to get non-existent key in a map
                 Key: SPARK-39015
                 URL: https://issues.apache.org/jira/browse/SPARK-39015
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.3.0
            Reporter: Raza Jafri

(The original description is the repro quoted in full above; it was first posted with ``` fences and later reformatted into {code} blocks.)
[jira] [Created] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
Yaohua Zhao created SPARK-39014:
-----------------------------------

             Summary: Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
                 Key: SPARK-39014
                 URL: https://issues.apache.org/jira/browse/SPARK-39014
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Yaohua Zhao
[jira] [Commented] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527811#comment-17527811 ]

Apache Spark commented on SPARK-39001:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36346

> Document which options are unsupported in CSV and JSON functions
> ------------------------------------------------------------------
>
>                 Key: SPARK-39001
>                 URL: https://issues.apache.org/jira/browse/SPARK-39001
>             Project: Spark
>          Issue Type: Documentation
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.3.0, 3.4.0
>
> See https://github.com/apache/spark/pull/36294. Some CSV and JSON options don't work in the expression forms because some of them are plan-wise options, such as parseMode = DROPMALFORMED.
> We should document which options do not work; possibly we should also throw an exception.
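As a concrete illustration of the ticket's point, a sketch of a plan-wise option that the expression form cannot honor; the behavior comments restate the ticket, and the schema and rows are made up:

{code:scala}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import spark.implicits._

val raw = Seq("""{"a": 1}""", """not json""").toDF("s")
val schema = StructType(Seq(StructField("a", IntegerType)))

// PERMISSIVE (the default) parses the malformed row to null:
raw.select(from_json($"s", schema, Map("mode" -> "PERMISSIVE")).alias("j")).show()

// Map("mode" -> "DROPMALFORMED") cannot take effect here: a per-row expression
// cannot drop rows from the plan, which is exactly what this ticket documents.
{code}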
[jira] [Assigned] (SPARK-39008) Change ASF as a single author in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-39008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-39008:
------------------------------------

    Assignee: Hyukjin Kwon

> Change ASF as a single author in Spark distribution
> ----------------------------------------------------
>
>                 Key: SPARK-39008
>                 URL: https://issues.apache.org/jira/browse/SPARK-39008
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 3.3.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> We mention several original developers as authors in pom.xml and R/pkg/DESCRIPTION while the project is maintained under the ASF organization. We should probably remove them all and keep ASF as the single author.
[jira] [Resolved] (SPARK-39008) Change ASF as a single author in Spark distribution
[ https://issues.apache.org/jira/browse/SPARK-39008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-39008.
----------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 36337
[https://github.com/apache/spark/pull/36337]
[jira] [Commented] (SPARK-39013) Parser changes to enforce `()` for creating table without any columns
[ https://issues.apache.org/jira/browse/SPARK-39013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527789#comment-17527789 ]

Apache Spark commented on SPARK-39013:
--------------------------------------

User 'jackierwzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36345

> Parser changes to enforce `()` for creating table without any columns
> -----------------------------------------------------------------------
>
>                 Key: SPARK-39013
>                 URL: https://issues.apache.org/jira/browse/SPARK-39013
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4
>            Reporter: Jackie Zhang
>            Priority: Major
>
> We would like to enforce the `()` for `CREATE TABLE` queries, to explicitly indicate that a table without any columns will be created.
> E.g. `CREATE TABLE table () USING DELTA`.
> The existing behavior of CTAS and CREATE external table at a location is not affected.
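A short sketch of the proposed surface; parquet stands in for the ticket's Delta example, and the rejection behavior is the proposal, not current behavior:

{code:scala}
// Proposed: an explicit empty column list for a zero-column table.
spark.sql("CREATE TABLE empty_t () USING parquet")

// Proposed to be rejected by the parser once this change lands:
// spark.sql("CREATE TABLE empty_t USING parquet")

// Unchanged per the ticket: CTAS and CREATE ... LOCATION behave as before.
{code}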
[jira] [Assigned] (SPARK-39013) Parser changes to enforce `()` for creating table without any columns
[ https://issues.apache.org/jira/browse/SPARK-39013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39013:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-39013) Parser changes to enforce `()` for creating table without any columns
[ https://issues.apache.org/jira/browse/SPARK-39013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39013:
------------------------------------

    Assignee: Apache Spark
[jira] [Updated] (SPARK-39013) Parser changes to enforce `()` for creating table without any columns
[ https://issues.apache.org/jira/browse/SPARK-39013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jackie Zhang updated SPARK-39013:
---------------------------------
    Summary: Parser changes to enforce `()` for creating table without any columns  (was: Parse changes to enforce `()` for creating table without any columns)
[jira] [Created] (SPARK-39013) Parse changes to enforce `()` for creating table without any columns
Jackie Zhang created SPARK-39013:
------------------------------------

             Summary: Parse changes to enforce `()` for creating table without any columns
                 Key: SPARK-39013
                 URL: https://issues.apache.org/jira/browse/SPARK-39013
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4
            Reporter: Jackie Zhang

(The description is quoted in full above.)
[jira] [Updated] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rui Wang updated SPARK-39012:
-----------------------------
    Description:

When Spark needs to infer a schema, it has to parse strings into types, and not all data types are supported on this path; binary, for example, is known not to be supported. If a user has a binary column and does not use a metastore, SparkSQL can fall back to schema inference and then fail during the table scan. This should be considered a bug, since schema inference is supported but some types are missing.

A string can be converted to any type except ARRAY, MAP, STRUCT, etc. Also, when converting from a string, a narrower type won't be identified if a wider type also matches (for example, short vs. long).

Based on the Spark SQL data types (https://spark.apache.org/docs/latest/sql-ref-datatypes.html), we can add support for the following types:

BINARY
BOOLEAN

And there are two types that I am not sure SparkSQL supports here:

YearMonthIntervalType
DayTimeIntervalType

> SparkSQL infer schema does not support all data types
> -------------------------------------------------------
>
>                 Key: SPARK-39012
>                 URL: https://issues.apache.org/jira/browse/SPARK-39012
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Rui Wang
>            Priority: Major
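Partition discovery is one path where such string-to-type parsing happens: partition values arrive as directory-name strings and must be inferred into Catalyst types. A hedged sketch with a hypothetical path:

{code:scala}
// Partition values like "flag=true" are plain strings in the path; whether the
// column comes back as boolean or string depends on what inference supports.
spark.range(2).selectExpr("id", "id % 2 = 0 AS flag")
  .write.partitionBy("flag").parquet("/tmp/spark39012")

spark.read.parquet("/tmp/spark39012").printSchema()
// flag: string today where boolean inference is unsupported, per this ticket
{code}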
[jira] [Updated] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-39012: - Description: When Spark needs to infer a schema, it needs to parse strings into types. Not all data types are supported so far in this path; for example, binary is known to not be supported. A string might be converted to any type except ARRAY, MAP, STRUCT, etc. Also, when converting from a string, a smaller-scale type won't be identified if a larger-scale type also matches; for example, short versus long. Based on the Spark SQL data types (https://spark.apache.org/docs/latest/sql-ref-datatypes.html), we can support the following types: BINARY BOOLEAN And there are two types that I am not sure SparkSQL supports: YearMonthIntervalType DayTimeIntervalType was: When Spark needs to infer a schema, it needs to parse strings into types. Not all data types are supported so far in this path; for example, binary is known to not be supported. A string might be converted to any type except ARRAY, MAP, STRUCT, etc. Spark SQL data types: https://spark.apache.org/docs/latest/sql-ref-datatypes.html > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer a schema, it needs to parse strings into types. Not all > data types are supported so far in this path; for example, binary is known to > not be supported. > A string might be converted to any type except ARRAY, MAP, STRUCT, etc. Also, > when converting from a string, a smaller-scale type won't be identified > if a larger-scale type also matches; for example, short versus long. > Based on the Spark SQL data types > (https://spark.apache.org/docs/latest/sql-ref-datatypes.html), we can support > the following types: > BINARY > BOOLEAN > And there are two types that I am not sure SparkSQL supports: > YearMonthIntervalType > DayTimeIntervalType -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527770#comment-17527770 ] Apache Spark commented on SPARK-39012: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/36344 > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39012: Assignee: Apache Spark > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39012: Assignee: (was: Apache Spark) > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527768#comment-17527768 ] Apache Spark commented on SPARK-39012: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/36344 > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527767#comment-17527767 ] Rui Wang commented on SPARK-39012: -- A PR is ready to support the binary type: https://github.com/apache/spark/pull/36344 > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39012) SparkSQL Infer schema path does not support all data types
Rui Wang created SPARK-39012: Summary: SparkSQL Infer schema path does not support all data types Key: SPARK-39012 URL: https://issues.apache.org/jira/browse/SPARK-39012 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Rui Wang When Spark needs to infer schema, it needs to parse string to a type. Not all data types are supported so far in this path. For example, binary is known to not be supported. string might be converted to all types except ARRAY, MAP, STRUCT, etc. Spark SQL data types: https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39012) SparkSQL infer schema does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-39012: - Summary: SparkSQL infer schema does not support all data types (was: SparkSQL Infer schema path does not support all data types) > SparkSQL infer schema does not support all data types > - > > Key: SPARK-39012 > URL: https://issues.apache.org/jira/browse/SPARK-39012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > When Spark needs to infer schema, it needs to parse string to a type. Not all > data types are supported so far in this path. For example, binary is known > to not be supported. > string might be converted to all types except ARRAY, MAP, STRUCT, etc. > Spark SQL data types: > https://spark.apache.org/docs/latest/sql-ref-datatypes.html -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35739) [Spark Sql] Add Java-compatible Dataset.join overloads
[ https://issues.apache.org/jira/browse/SPARK-35739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527751#comment-17527751 ] Apache Spark commented on SPARK-35739: -- User 'brandondahler' has created a pull request for this issue: https://github.com/apache/spark/pull/36343 > [Spark Sql] Add Java-compatible Dataset.join overloads > - > > Key: SPARK-35739 > URL: https://issues.apache.org/jira/browse/SPARK-35739 > Project: Spark > Issue Type: Improvement > Components: Java API, SQL >Affects Versions: 2.0.0, 3.0.0 >Reporter: Brandon Dahler >Priority: Minor > > h2. Problem > When using Spark SQL with Java, the required syntax to utilize the following > two overloads is unnatural and not obvious to developers who haven't had to > interoperate with Scala before: > {code:java} > def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame > def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): > DataFrame > {code} > Examples: > Java 11 > {code:java} > Dataset<Row> dataset1 = ...; > Dataset<Row> dataset2 = ...; > // Overload with multiple usingColumns, no join type > dataset1 > .join(dataset2, JavaConverters.asScalaBuffer(List.of("column", "column2"))) > .show(); > // Overload with multiple usingColumns and a join type > dataset1 > .join( > dataset2, > JavaConverters.asScalaBuffer(List.of("column", "column2")), > "left") > .show(); > {code} > > Additionally, there is no overload that takes a single usingColumn and a > joinType, forcing the developer to use the Seq[String] overload regardless of > language. > Examples: > Scala > {code:java} > val dataset1: DataFrame = ...; > val dataset2: DataFrame = ...; > dataset1 > .join(dataset2, Seq("column"), "left") > .show(); > {code} > > Java 11 > {code:java} > Dataset<Row> dataset1 = ...; > Dataset<Row> dataset2 = ...; > dataset1 > .join(dataset2, JavaConverters.asScalaBuffer(List.of("column")), "left") > .show(); > {code} > h2. Proposed Improvement > Add 3 additional overloads to Dataset: > > {code:java} > def join(right: Dataset[_], usingColumn: List[String]): DataFrame > def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame > def join(right: Dataset[_], usingColumn: List[String], joinType: String): > DataFrame > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38954) Implement sharing of cloud credentials among driver and executors
[ https://issues.apache.org/jira/browse/SPARK-38954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38954: -- Affects Version/s: 3.4.0 (was: 3.2.1) > Implement sharing of cloud credentials among driver and executors > - > > Key: SPARK-38954 > URL: https://issues.apache.org/jira/browse/SPARK-38954 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Parth Chandra >Priority: Major > > Currently Spark uses external implementations (e.g. hadoop-aws) to access > cloud services like S3. To access the actual service, these > implementations use credentials providers that obtain > credentials allowing access to the cloud service. > These credentials are typically session credentials, which means that they > expire after a fixed time. Sometimes, this expiry can be only an hour, and for > a Spark job that runs for many hours (or a Spark streaming job that runs > continuously), the credentials have to be renewed periodically. > In many organizations, the process of getting credentials may be multi-step. The > organization has an identity provider service that provides authentication > for the user, while the cloud service provider provides authorization for the > roles the user has access to. Once the user is authenticated and her role > verified, the credentials are generated for a new session. > In a large setup with hundreds of Spark jobs and thousands of executors, each > executor then spends a lot of time getting credentials, and this may put > unnecessary load on the backend authentication services. > To alleviate this, we can use Spark's architecture to obtain the credentials > once in the driver and push the credentials to the executors. In addition, > the driver can check the expiry of the credentials and push updated > credentials to the executors. This is relatively easy to do since the RPC > mechanism to implement this is already in place and is used similarly for > Kerberos delegation tokens. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
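To make the proposed flow concrete, a purely illustrative sketch (none of these names are Spark APIs; the ticket only describes the idea of driver-side renewal pushed over the existing RPC channel):
{code:java}
// Hypothetical driver-side loop: fetch session credentials once, push them to
// executors, and refresh them before expiry. All names below are invented.
import java.util.concurrent.{Executors, TimeUnit}

case class CloudCredentials(token: String, expiresAtMillis: Long)

def fetchFromIdentityProvider(): CloudCredentials = ???   // org-specific auth flow
def pushToExecutors(creds: CloudCredentials): Unit = ???  // would reuse Spark's RPC, like delegation tokens

val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleWithFixedDelay(() => {
  val creds = fetchFromIdentityProvider() // one call per renewal, not one per executor
  pushToExecutors(creds)                  // executors swap in the refreshed credentials
}, 0, 30, TimeUnit.MINUTES)
{code}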
[jira] [Resolved] (SPARK-38742) Move the tests `MISSING_COLUMN` to QueryCompilationErrorsSuite
[ https://issues.apache.org/jira/browse/SPARK-38742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-38742. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36280 [https://github.com/apache/spark/pull/36280] > Move the tests `MISSING_COLUMN` to QueryCompilationErrorsSuite > -- > > Key: SPARK-38742 > URL: https://issues.apache.org/jira/browse/SPARK-38742 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: panbingkun >Priority: Major > Fix For: 3.4.0 > > > Move tests for the error class MISSING_COLUMN from SQLQuerySuite to > QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38742) Move the tests `MISSING_COLUMN` to QueryCompilationErrorsSuite
[ https://issues.apache.org/jira/browse/SPARK-38742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-38742: Assignee: panbingkun > Move the tests `MISSING_COLUMN` to QueryCompilationErrorsSuite > -- > > Key: SPARK-38742 > URL: https://issues.apache.org/jira/browse/SPARK-38742 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: panbingkun >Priority: Major > > Move tests for the error class MISSING_COLUMN from SQLQuerySuite to > QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38939) Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax
[ https://issues.apache.org/jira/browse/SPARK-38939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-38939: Fix Version/s: 3.3.0 (was: 3.4.0) > Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax > - > > Key: SPARK-38939 > URL: https://issues.apache.org/jira/browse/SPARK-38939 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Jackie Zhang >Assignee: Jackie Zhang >Priority: Major > Fix For: 3.3.0 > > > Currently the `ALTER TABLE ... DROP COLUMN(s) ...` syntax will always throw an error > if the column doesn't exist. We would like to provide an (IF EXISTS) syntax > to give a better user experience for downstream handlers (such as Delta) > that support it, and to be consistent with other statements such as `DROP TABLE > (IF EXISTS)` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
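A minimal sketch of the intended usage (the table and column names are hypothetical, the placement of IF EXISTS is my reading of the ticket title and may differ from the final grammar, and the statements assume a catalog/format that supports dropping columns, such as a v2 catalog or Delta):
{code:java}
// Hypothetical spark-shell sketch of the proposed syntax:
spark.sql("ALTER TABLE t DROP COLUMN tmp")            // today: errors if `tmp` is absent
spark.sql("ALTER TABLE t DROP IF EXISTS COLUMN tmp")  // proposed: a no-op if `tmp` is absent
{code}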
[jira] [Resolved] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39001. -- Fix Version/s: 3.3.0 3.4.0 Resolution: Fixed Issue resolved by pull request 36339 [https://github.com/apache/spark/pull/36339] > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > See https://github.com/apache/spark/pull/36294. Some of the CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
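For context, a small spark-shell sketch of the behavior the ticket is about (the column name and data are made up; the point is that a plan-wise option has no row-dropping effect inside an expression):
{code:java}
// In the expression form, mode=DROPMALFORMED cannot drop rows: the malformed
// record simply parses to null instead of disappearing.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = new StructType().add("a", IntegerType)
val df = Seq("""{"a": 1}""", """not-json""").toDF("value")
df.select(from_json(col("value"), schema, Map("mode" -> "DROPMALFORMED")).as("parsed")).show()
// two rows come back: the parsed struct and a null, not one row
{code}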
[jira] [Assigned] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39001: Assignee: Hyukjin Kwon > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > See https://github.com/apache/spark/pull/36294. Some of the CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39007) Use double quotes for SQL configs in error messages
[ https://issues.apache.org/jira/browse/SPARK-39007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39007: - Fix Version/s: 3.3.0 > Use double quotes for SQL configs in error messages > --- > > Key: SPARK-39007 > URL: https://issues.apache.org/jira/browse/SPARK-39007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > All SQL configs should be printed in SQL style in error messages, and wrapped > in double quotes. For example, the config spark.sql.ansi.enabled should be > highlighted as "spark.sql.ansi.enabled" to make it more visible in error > messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
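As a tiny illustration of the convention (the helper below is a sketch; the actual method in Spark's QueryErrorsBase may differ in name and signature):
{code:java}
// Minimal sketch: quote a SQL config key for use in an error message.
def toSQLConf(conf: String): String = "\"" + conf + "\""

toSQLConf("spark.sql.ansi.enabled") // returns "spark.sql.ansi.enabled", quotes included
{code}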
[jira] [Resolved] (SPARK-38939) Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax
[ https://issues.apache.org/jira/browse/SPARK-38939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38939. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36252 [https://github.com/apache/spark/pull/36252] > Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax > - > > Key: SPARK-38939 > URL: https://issues.apache.org/jira/browse/SPARK-38939 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Jackie Zhang >Assignee: Jackie Zhang >Priority: Major > Fix For: 3.4.0 > > > Currently the `ALTER TABLE ... DROP COLUMN(s) ...` syntax will always throw an error > if the column doesn't exist. We would like to provide an (IF EXISTS) syntax > to give a better user experience for downstream handlers (such as Delta) > that support it, and to be consistent with other statements such as `DROP TABLE > (IF EXISTS)` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38939) Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax
[ https://issues.apache.org/jira/browse/SPARK-38939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38939: --- Assignee: Jackie Zhang > Support ALTER TABLE ... DROP [IF EXISTS] COLUMN .. syntax > - > > Key: SPARK-38939 > URL: https://issues.apache.org/jira/browse/SPARK-38939 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Jackie Zhang >Assignee: Jackie Zhang >Priority: Major > > Currently the `ALTER TABLE ... DROP COLUMN(s) ...` syntax will always throw an error > if the column doesn't exist. We would like to provide an (IF EXISTS) syntax > to give a better user experience for downstream handlers (such as Delta) > that support it, and to be consistent with other statements such as `DROP TABLE > (IF EXISTS)` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527648#comment-17527648 ] jagadeesh commented on SPARK-25355: --- [~pedro.rossi] , we are running into a problem with this feature enabled in Spark 3.2 on K8s. Any insights? Appreciate your help. Here is the setup: * The service id is configured properly on the HDFS side: {code:java} hadoop.proxyuser.serviceid.groups = * hadoop.proxyuser.serviceid.hosts = * hadoop.proxyuser.serviceid.users = * {code} * Getting the service id's Kerberos ticket in the Spark client. * Running the Spark job without --proxy-user, connecting to the Kerberized HDFS cluster - {color:#00875a}WORKS AS EXPECTED.{color} * Running the Spark job with --proxy-user=, connecting to the Kerberized HDFS cluster - {color:#de350b}FAILS{color} {code:java} $SPARK_HOME/bin/spark-submit \ --master \ --deploy-mode cluster \ --proxy-user \ --name spark-javawordcount \ --class org.apache.spark.examples.JavaWordCount \ --conf spark.kubernetes.container.image=\ --conf spark.kubernetes.driver.podTemplateFile=driver.yaml \ --conf spark.kubernetes.executor.podTemplateFile=executor.yaml \ --conf spark.kubernetes.container.image.pullPolicy=Always \ --conf spark.kubernetes.driver.limit.cores=1 \ --conf spark.executor.instances=2 \ --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --conf spark.kubernetes.namespace= \ --conf spark.eventLog.enabled=true \ --conf spark.eventLog.dir=hdfs://:8020/scaas/shs_logs \ --conf spark.kubernetes.file.upload.path=hdfs://:8020/tmp \ $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar /user//input{code} * ERROR logs from the driver pod: {code:java} ++ id -u + myuid=185 ++ id -g + mygid=0 + set +e ++ getent passwd 185 + uidentry= + set -e + '[' -z '' ']' + '[' -w /etc/passwd ']' + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' + SPARK_CLASSPATH=':/opt/spark/jars/*' + env + grep SPARK_JAVA_OPT_ + sort -t_ -k4 -n + sed 's/[^=]*=\(.*\)/\1/g' + readarray -t SPARK_EXECUTOR_JAVA_OPTS + '[' -n '' ']' + '[' -z ']' + '[' -z ']' + '[' -n '' ']' + '[' -z x ']' + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*' + '[' -z x ']' + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*' + case "$1" in + shift 1 + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@") + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress= --deploy-mode client --proxy-user --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.JavaWordCount spark-internal /user//input WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor java.nio.DirectByteBuffer(long,int) WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release 22/04/21 17:50:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 22/04/21 17:50:30 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
22/04/21 17:50:30 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:50:31 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:50:37 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:50:53 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:51:32 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:52:07 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] 22/04/21 17:52:27 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN,
[jira] [Commented] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527611#comment-17527611 ] Apache Spark commented on SPARK-38879: -- User 'pralabhkumar' has created a pull request for this issue: https://github.com/apache/spark/pull/36342 > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Minor > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38879: Assignee: Apache Spark > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Apache Spark >Priority: Minor > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38879: Assignee: (was: Apache Spark) > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Minor > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37696) Optimizer exceeds max iterations
[ https://issues.apache.org/jira/browse/SPARK-37696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37696: - Affects Version/s: 3.2.1 > Optimizer exceeds max iterations > > > Key: SPARK-37696 > URL: https://issues.apache.org/jira/browse/SPARK-37696 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1 >Reporter: Denis Tarima >Priority: Minor > > A specific scenario causing Spark's failure in tests and a warning in > production: > 21/12/20 06:45:24 WARN BaseSessionStateBuilder$$anon$2: Max iterations (100) > reached for batch Operator Optimization before Inferring Filters, please set > 'spark.sql.optimizer.maxIterations' to a larger value. > 21/12/20 06:45:24 WARN BaseSessionStateBuilder$$anon$2: Max iterations (100) > reached for batch Operator Optimization after Inferring Filters, please set > 'spark.sql.optimizer.maxIterations' to a larger value. > > To reproduce, run the following commands in `spark-shell`: > {{// define case class for a struct type in an array}} > {{case class S(v: Int, v2: Int)}} > > {{// prepare a table with an array of structs}} > {{Seq((10, Seq(S(1, 2)))).toDF("i", "data").write.saveAsTable("tbl")}} > > {{// select using SQL and join with a dataset using "left_anti"}} > {{spark.sql("select i, data[size(data) - 1].v from > tbl").join(Seq(10).toDF("i"), Seq("i"), "left_anti").show()}} > > The following conditions are required: > # Having the additional `v2` field in `S` > # Using {{data[size(data) - 1]}} instead of {{element_at(data, -1)}} > # Using {{left_anti}} in the join operation > > The same behavior was observed in the `master` branch and `3.1.1`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38868) `assert_true` fails unconditionally after `left_outer` joins
[ https://issues.apache.org/jira/browse/SPARK-38868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527585#comment-17527585 ] Apache Spark commented on SPARK-38868: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/36341 > `assert_true` fails unconditionally after `left_outer` joins > > > Key: SPARK-38868 > URL: https://issues.apache.org/jira/browse/SPARK-38868 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.1, 3.1.2, 3.2.0, 3.2.1, 3.3.0, 3.4.0 >Reporter: Fabien Dubosson >Priority: Major > > When `assert_true` is used after a `left_outer` join, the assert exception is > raised even though all the rows meet the condition. Using an `inner` join > does not expose this issue. > > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql import functions as sf > session = SparkSession.builder.getOrCreate() > entries = session.createDataFrame( > [ > ("a", 1), > ("b", 2), > ("c", 3), > ], > ["id", "outcome_id"], > ) > outcomes = session.createDataFrame( > [ > (1, 12), > (2, 34), > (3, 32), > ], > ["outcome_id", "outcome_value"], > ) > # Inner join works as expected > ( > entries.join(outcomes, on="outcome_id", how="inner") > .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10)) > .filter(sf.col("valid").isNull()) > .show() > ) > # Left join fails with «'('outcome_value > 10)' is not true!» even though it > is the case > ( > entries.join(outcomes, on="outcome_id", how="left_outer") > .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10)) > .filter(sf.col("valid").isNull()) > .show() > ){code} > Reproduced on `pyspark` versions: `3.2.1`, `3.2.0`, `3.1.2` and `3.1.1`. I am > not sure if "native" Spark exposes this issue as well or not; I don't have > the knowledge/setup to test that. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39011) V2 Filter to ORC Predicate support
[ https://issues.apache.org/jira/browse/SPARK-39011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-39011: --- Summary: V2 Filter to ORC Predicate support (was: V2 Filter to ORC Filter support) > V2 Filter to ORC Predicate support > -- > > Key: SPARK-39011 > URL: https://issues.apache.org/jira/browse/SPARK-39011 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4 >Reporter: Huaxin Gao >Priority: Major > > add V2 filter to ORC predicate support -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39011) V2 Filter to ORC Filter support
Huaxin Gao created SPARK-39011: -- Summary: V2 Filter to ORC Filter support Key: SPARK-39011 URL: https://issues.apache.org/jira/browse/SPARK-39011 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4 Reporter: Huaxin Gao add V2 filter to ORC predicate support -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39010) V2 Filter to Parquet Predicate support
Huaxin Gao created SPARK-39010: -- Summary: V2 Filter to Parquet Predicate support Key: SPARK-39010 URL: https://issues.apache.org/jira/browse/SPARK-39010 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4 Reporter: Huaxin Gao Add support for V2 Filter to Parquet Predicate -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
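For orientation, a small sketch of the kind of DS V2 predicate these two subtasks translate from (built with the connector expression classes as I understand them; treat the exact constructors and factory methods as assumptions, not verified API):
{code:java}
// Hypothetical sketch: a V2 Predicate for `id = 1`, the kind of filter these
// subtasks would translate into a Parquet predicate or an ORC SearchArgument.
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, LiteralValue}
import org.apache.spark.sql.connector.expressions.filter.Predicate
import org.apache.spark.sql.types.IntegerType

val children: Array[Expression] =
  Array(Expressions.column("id"), LiteralValue(1, IntegerType))
val eq = new Predicate("=", children)
// A translator would pattern-match on eq.name() and eq.children() to build the
// file-format-specific predicate.
{code}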
[jira] [Resolved] (SPARK-38667) Optimizer generates error when using inner join along with sequence
[ https://issues.apache.org/jira/browse/SPARK-38667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars resolved SPARK-38667. -- Resolution: Resolved > Optimizer generates error when using inner join along with sequence > --- > > Key: SPARK-38667 > URL: https://issues.apache.org/jira/browse/SPARK-38667 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2 >Reporter: Lars >Priority: Major > > This issue occurred in a more complex scenario, so I've broken it down into a > simple case. > {*}Steps to reproduce{*}: Execute the following example. The code should run > without errors, but instead a *java.lang.IllegalArgumentException: Illegal > sequence boundaries: 4 to 2 by 1* is thrown. > {code:java} > package com.example > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions._ > object SparkIssue { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .master("local[*]") > .getOrCreate() > val dfA = spark > .createDataFrame(Seq((1, 1), (2, 4))) > .toDF("a1", "a2") > val dfB = spark > .createDataFrame(Seq((1, 5), (2, 2))) > .toDF("b1", "b2") > dfA.join(dfB, dfA("a1") === dfB("b1"), "inner") > .where(col("a2") < col("b2")) > .withColumn("x", explode(sequence(col("a2"), col("b2"), lit(1)))) > .show() > spark.stop() > } > } > {code} > When I look at the Optimized Logical Plan, I can see that the Inner Join and > the Filter are brought together, with an additional check for an empty > Sequence. The exception is thrown because the Sequence check is executed > before the Filter. > {code:java} > == Parsed Logical Plan == > 'Project [a1#4, a2#5, b1#12, b2#13, explode(sequence('a2, 'b2, Some(1), > None)) AS x#24] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Analyzed Logical Plan == > a1: int, a2: int, b1: int, b2: int, x: int > Project [a1#4, a2#5, b1#12, b2#13, x#25] > +- Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), > false, [x#25] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Optimized Logical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), false, > [x#25] > +- Join Inner, (((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), > true) > 0) AND (a2#5 < b2#13)) AND (a1#4 = b1#12)) > :- LocalRelation [a1#4, a2#5] > +- LocalRelation [b1#12, b2#13] > == Physical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), [a1#4, > a2#5, b1#12, b2#13], false, [x#25] > +- *(1) BroadcastHashJoin [a1#4], [b1#12], Inner, BuildRight, > ((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), true) > 0) AND > (a2#5 < b2#13)), false > :- *(1) LocalTableScan [a1#4, a2#5] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint)),false), [id=#15] > +- LocalTableScan [b1#12, b2#13] > {code} > > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38667) Optimizer generates error when using inner join along with sequence
[ https://issues.apache.org/jira/browse/SPARK-38667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527547#comment-17527547 ] Lars commented on SPARK-38667: -- Thanks all for pointing this out. Changed the affected version to 3.1.2 and resolved this issue. > Optimizer generates error when using inner join along with sequence > --- > > Key: SPARK-38667 > URL: https://issues.apache.org/jira/browse/SPARK-38667 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2 >Reporter: Lars >Priority: Major > > This issue occurred in a more complex scenario, so I've broken it down into a > simple case. > {*}Steps to reproduce{*}: Execute the following example. The code should run > without errors, but instead a *java.lang.IllegalArgumentException: Illegal > sequence boundaries: 4 to 2 by 1* is thrown. > {code:java} > package com.example > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions._ > object SparkIssue { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .master("local[*]") > .getOrCreate() > val dfA = spark > .createDataFrame(Seq((1, 1), (2, 4))) > .toDF("a1", "a2") > val dfB = spark > .createDataFrame(Seq((1, 5), (2, 2))) > .toDF("b1", "b2") > dfA.join(dfB, dfA("a1") === dfB("b1"), "inner") > .where(col("a2") < col("b2")) > .withColumn("x", explode(sequence(col("a2"), col("b2"), lit(1)))) > .show() > spark.stop() > } > } > {code} > When I look at the Optimized Logical Plan, I can see that the Inner Join and > the Filter are brought together, with an additional check for an empty > Sequence. The exception is thrown because the Sequence check is executed > before the Filter. > {code:java} > == Parsed Logical Plan == > 'Project [a1#4, a2#5, b1#12, b2#13, explode(sequence('a2, 'b2, Some(1), > None)) AS x#24] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Analyzed Logical Plan == > a1: int, a2: int, b1: int, b2: int, x: int > Project [a1#4, a2#5, b1#12, b2#13, x#25] > +- Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), > false, [x#25] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Optimized Logical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), false, > [x#25] > +- Join Inner, (((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), > true) > 0) AND (a2#5 < b2#13)) AND (a1#4 = b1#12)) > :- LocalRelation [a1#4, a2#5] > +- LocalRelation [b1#12, b2#13] > == Physical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), [a1#4, > a2#5, b1#12, b2#13], false, [x#25] > +- *(1) BroadcastHashJoin [a1#4], [b1#12], Inner, BuildRight, > ((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), true) > 0) AND > (a2#5 < b2#13)), false > :- *(1) LocalTableScan [a1#4, a2#5] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint)),false), [id=#15] > +- LocalTableScan [b1#12, b2#13] > {code} > > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38667) Optimizer generates error when using inner join along with sequence
[ https://issues.apache.org/jira/browse/SPARK-38667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lars updated SPARK-38667: - Affects Version/s: 3.1.2 (was: 3.2.1) > Optimizer generates error when using inner join along with sequence > --- > > Key: SPARK-38667 > URL: https://issues.apache.org/jira/browse/SPARK-38667 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2 >Reporter: Lars >Priority: Major > > This issue occurred in a more complex scenario, so I've broken it down into a > simple case. > {*}Steps to reproduce{*}: Execute the following example. The code should run > without errors, but instead a *java.lang.IllegalArgumentException: Illegal > sequence boundaries: 4 to 2 by 1* is thrown. > {code:java} > package com.example > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions._ > object SparkIssue { > def main(args: Array[String]): Unit = { > val spark = SparkSession > .builder() > .master("local[*]") > .getOrCreate() > val dfA = spark > .createDataFrame(Seq((1, 1), (2, 4))) > .toDF("a1", "a2") > val dfB = spark > .createDataFrame(Seq((1, 5), (2, 2))) > .toDF("b1", "b2") > dfA.join(dfB, dfA("a1") === dfB("b1"), "inner") > .where(col("a2") < col("b2")) > .withColumn("x", explode(sequence(col("a2"), col("b2"), lit(1)))) > .show() > spark.stop() > } > } > {code} > When I look at the Optimized Logical Plan, I can see that the Inner Join and > the Filter are brought together, with an additional check for an empty > Sequence. The exception is thrown because the Sequence check is executed > before the Filter. > {code:java} > == Parsed Logical Plan == > 'Project [a1#4, a2#5, b1#12, b2#13, explode(sequence('a2, 'b2, Some(1), > None)) AS x#24] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Analyzed Logical Plan == > a1: int, a2: int, b1: int, b2: int, x: int > Project [a1#4, a2#5, b1#12, b2#13, x#25] > +- Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), > false, [x#25] > +- Filter (a2#5 < b2#13) > +- Join Inner, (a1#4 = b1#12) > :- Project [_1#0 AS a1#4, _2#1 AS a2#5] > : +- LocalRelation [_1#0, _2#1] > +- Project [_1#8 AS b1#12, _2#9 AS b2#13] > +- LocalRelation [_1#8, _2#9] > == Optimized Logical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), false, > [x#25] > +- Join Inner, (((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), > true) > 0) AND (a2#5 < b2#13)) AND (a1#4 = b1#12)) > :- LocalRelation [a1#4, a2#5] > +- LocalRelation [b1#12, b2#13] > == Physical Plan == > Generate explode(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin))), [a1#4, > a2#5, b1#12, b2#13], false, [x#25] > +- *(1) BroadcastHashJoin [a1#4], [b1#12], Inner, BuildRight, > ((size(sequence(a2#5, b2#13, Some(1), Some(Europe/Berlin)), true) > 0) AND > (a2#5 < b2#13)), false > :- *(1) LocalTableScan [a1#4, a2#5] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint)),false), [id=#15] > +- LocalTableScan [b1#12, b2#13] > {code} > > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures
[ https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527527#comment-17527527 ] Nicholas Chammas commented on SPARK-37222: -- Thanks for the detailed report, [~ssmith]. I am hitting this issue as well on Spark 3.2.1, and your minimal test case also reproduces the issue for me. How did you break down the optimization into its individual steps like that? That was very helpful. I was able to use your breakdown to work around the issue by excluding {{{}PushDownLeftSemiAntiJoin{}}}: {code:java} spark.conf.set( "spark.sql.optimizer.excludedRules", "org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin" ){code} If I run that before running the problematic query (including your test case), it seems to work around the issue. > Max iterations reached in Operator Optimization w/left_anti or left_semi join > and nested structures > --- > > Key: SPARK-37222 > URL: https://issues.apache.org/jira/browse/SPARK-37222 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2, 3.2.0, 3.2.1 > Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and > with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, > 2021. > The problem does not occur with Spark 3.0.1. > >Reporter: Shawn Smith >Priority: Major > > The query optimizer never reaches a fixed point when optimizing the query > below. This manifests as a warning: > > WARN: Max iterations (100) reached for batch Operator Optimization before > > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a > > larger value. > But the suggested fix won't help. The actual problem is that the optimizer > fails to make progress on each iteration and gets stuck in a loop. > In practice, Spark logs a warning but continues on and appears to execute the > query successfully, albeit perhaps sub-optimally. > To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and > 3.2.0 but not 3.0.1 it will throw an exception: > {noformat} > case class Nested(b: Boolean, n: Long) > case class Table(id: String, nested: Nested) > case class Identifier(id: String) > locally { > System.setProperty("spark.testing", "true") // Fail instead of logging a > warning > val df = List.empty[Table].toDS.cache() > val ids = List.empty[Identifier].toDS.cache() > df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi" > .select('id, 'nested("n")) > .explain() > } > {noformat} > Looking at the query plan as the optimizer iterates in > {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations: > {noformat} > Project [id#2, _gen_alias_108#108L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >:- Project [id#2, nested#3.n AS _gen_alias_108#108L] >: +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, > deserialized, 1 replicas) >:+- LocalTableScan , [id#2, nested#3] >+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- LocalTableScan , [id#18] > {noformat} > And here's the plan after one more iteration. You can see that all that has > changed is new aliases for the column in the nested column: > "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}". 
> {noformat} > Project [id#2, _gen_alias_109#109L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >:- Project [id#2, nested#3.n AS _gen_alias_109#109L] >: +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, > deserialized, 1 replicas) >:+- LocalTableScan , [id#2, nested#3] >+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- LocalTableScan , [id#18] > {noformat} > The optimizer continues looping and tweaking the alias until it hits the max > iteration count and bails out. > Here's an example that includes a stack trace: > {noformat} > $ bin/spark-shell > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :paste > // Entering paste mode (ctrl-D to finish) > case class Nested(b: Boolean, n: Long) > case class Table(id: String, nested: Nested) > case class Identifier(id: String) > locally { > System.setProperty("spark.testing", "true") // Fail instead of logging a > warning > val df = List.empty[Table].toDS.cache() > val ids = List.empty[Identifier].toDS.cache() > df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi" > .select('id,
[jira] [Updated] (SPARK-37222) Max iterations reached in Operator Optimization w/left_anti or left_semi join and nested structures
[ https://issues.apache.org/jira/browse/SPARK-37222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37222: - Affects Version/s: 3.2.1 > Max iterations reached in Operator Optimization w/left_anti or left_semi join > and nested structures > --- > > Key: SPARK-37222 > URL: https://issues.apache.org/jira/browse/SPARK-37222 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.1.2, 3.2.0, 3.2.1 > Environment: I've reproduced the error on Spark 3.1.2, 3.2.0, and > with the current branch-3.2 HEAD (git commit 966c90c0b5) as of November 5, > 2021. > The problem does not occur with Spark 3.0.1. > >Reporter: Shawn Smith >Priority: Major > > The query optimizer never reaches a fixed point when optimizing the query > below. This manifests as a warning: > > WARN: Max iterations (100) reached for batch Operator Optimization before > > Inferring Filters, please set 'spark.sql.optimizer.maxIterations' to a > > larger value. > But the suggested fix won't help. The actual problem is that the optimizer > fails to make progress on each iteration and gets stuck in a loop. > In practice, Spark logs a warning but continues on and appears to execute the > query successfully, albeit perhaps sub-optimally. > To reproduce, paste the following into the Spark shell. With Spark 3.1.2 and > 3.2.0 but not 3.0.1 it will throw an exception: > {noformat} > case class Nested(b: Boolean, n: Long) > case class Table(id: String, nested: Nested) > case class Identifier(id: String) > locally { > System.setProperty("spark.testing", "true") // Fail instead of logging a > warning > val df = List.empty[Table].toDS.cache() > val ids = List.empty[Identifier].toDS.cache() > df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi" > .select('id, 'nested("n")) > .explain() > } > {noformat} > Looking at the query plan as the optimizer iterates in > {{RuleExecutor.execute()}}, here's an example of the plan after 49 iterations: > {noformat} > Project [id#2, _gen_alias_108#108L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >:- Project [id#2, nested#3.n AS _gen_alias_108#108L] >: +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, > deserialized, 1 replicas) >:+- LocalTableScan , [id#2, nested#3] >+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- LocalTableScan , [id#18] > {noformat} > And here's the plan after one more iteration. You can see that all that has > changed is new aliases for the column in the nested column: > "{{_gen_alias_108#108L}}" to "{{_gen_alias_109#109L}}". > {noformat} > Project [id#2, _gen_alias_109#109L AS nested.n#28L] > +- Join LeftAnti, (id#2 = id#18) >:- Project [id#2, nested#3.n AS _gen_alias_109#109L] >: +- InMemoryRelation [id#2, nested#3], StorageLevel(disk, memory, > deserialized, 1 replicas) >:+- LocalTableScan , [id#2, nested#3] >+- InMemoryRelation [id#18], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- LocalTableScan , [id#18] > {noformat} > The optimizer continues looping and tweaking the alias until it hits the max > iteration count and bails out. > Here's an example that includes a stack trace: > {noformat} > $ bin/spark-shell > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.2.0 > /_/ > Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> :paste > // Entering paste mode (ctrl-D to finish) > case class Nested(b: Boolean, n: Long) > case class Table(id: String, nested: Nested) > case class Identifier(id: String) > locally { > System.setProperty("spark.testing", "true") // Fail instead of logging a > warning > val df = List.empty[Table].toDS.cache() > val ids = List.empty[Identifier].toDS.cache() > df.join(ids, Seq("id"), "left_anti") // also fails with "left_semi" > .select('id, 'nested("n")) > .explain() > } > // Exiting paste mode, now interpreting. > java.lang.RuntimeException: Max iterations (100) reached for batch Operator > Optimization before Inferring Filters, please set > 'spark.sql.optimizer.maxIterations' to a larger value. > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:246) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200) > at scala.collection.immutable.List.foreach(List.scala:431) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200) >
[jira] [Commented] (SPARK-38983) Pyspark throws AnalysisException with incorrect error message when using .grouping() or .groupingId() (AnalysisException: grouping() can only be used with GroupingSets
[ https://issues.apache.org/jira/browse/SPARK-38983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527512#comment-17527512 ] Chris Kimmel commented on SPARK-38983: -- Thanks for your comment, [~hyukjin.kwon]. This issue is about the misleading error message. I edited the ticket to clarify.
> Pyspark throws AnalysisException with incorrect error message when using
> .grouping() or .groupingId() (AnalysisException: grouping() can only be used
> with GroupingSets/Cube/Rollup;)
> -
>
> Key: SPARK-38983
> URL: https://issues.apache.org/jira/browse/SPARK-38983
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.1.2, 3.2.1
> Environment: I have reproduced this error in two environments. I would be happy to answer questions about either.
> h1. Environment 1
> I first encountered this error on my employer's Azure Databricks cluster, which runs Spark version 3.1.2. I have limited access to cluster configuration information, but I can ask if it will help.
> h1. Environment 2
> I reproduced the error by running the same code in the Pyspark shell from Spark 3.2.1 on my Chromebook (i.e. Crostini Linux). I have more access to environment information here. Running {{spark-submit --version}} produced the following output:
> {{Welcome to Spark version 3.2.1}}
> {{Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.14}}
> {{Branch HEAD}}
> {{Compiled by user hgao on 2022-01-20T19:26:14Z}}
> {{Revision 4f25b3f71238a00508a356591553f2dfa89f8290}}
> {{Url https://github.com/apache/spark}}
> Reporter: Chris Kimmel
> Priority: Minor
> Labels: cube, error_message_improvement, exception-handling, grouping, rollup
>
> h1. In a nutshell
> Pyspark emits an incorrect error message when the user commits a type error with the result of the {{grouping()}} function.
> h1. Code to reproduce
> {code:python}
> print(spark.version)  # My environment, Azure DataBricks, defines spark automatically.
> from pyspark.sql import functions as f
> from pyspark.sql import types as t
>
> l = [
>     ('a',),
>     ('b',),
> ]
> s = t.StructType([
>     t.StructField('col1', t.StringType())
> ])
> df = spark.createDataFrame(l, s)
> df.display()
>
> (  # This expression raises an AnalysisException()
>     df
>     .cube(f.col('col1'))
>     .agg(f.grouping('col1') & f.lit(True))
>     .collect()
> )
> {code}
> h1. Expected results
> The code produces an {{AnalysisException()}} with an error message along the lines of:
> {{AnalysisException: cannot resolve '(GROUPING(`col1`) AND true)' due to data type mismatch: differing types in '(GROUPING(`col1`) AND true)' (int and boolean).;}}
> h1. Actual results
> The code throws an {{AnalysisException()}} with error message
> {{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
> Python provides the following traceback:
> {code:python}
> ---
> AnalysisException                        Traceback (most recent call last)
>  in
>      15
>      16 ( # This expression raises an AnalysisException()
> ---> 17     df
>      18     .cube(f.col('col1'))
>      19     .agg(f.grouping('col1') & f.lit(True))
>
> /databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
>     116             # Columns
>     117             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
> --> 118             jdf = self._jgd.agg(exprs[0]._jc,
>     119                                 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
>     120         return DataFrame(jdf, self.sql_ctx)
>
> /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
>    1302
>    1303         answer = self.gateway_client.send_command(command)
> -> 1304         return_value = get_return_value(
>    1305             answer, self.gateway_client, self.target_id, self.name)
>    1306
>
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>     121                 # Hide where the exception came from that shows a non-Pythonic
>     122                 # JVM exception message.
> --> 123                 raise converted from None
>     124             else:
>     125                 raise
>
> AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;
> 'Aggregate [cube(col1#548)], [col1#548, (grouping(col1#548) AND true) AS (grouping(col1) AND true)#551
> {code}
[jira] [Updated] (SPARK-38983) Pyspark throws AnalysisException with incorrect error message when using .grouping() or .groupingId() (AnalysisException: grouping() can only be used with GroupingSets/C
[ https://issues.apache.org/jira/browse/SPARK-38983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Kimmel updated SPARK-38983: - Description:
h1. In a nutshell
Pyspark emits an incorrect error message when the user commits a type error with the result of the {{grouping()}} function.
h1. Code to reproduce
{code:python}
print(spark.version)  # My environment, Azure DataBricks, defines spark automatically.
from pyspark.sql import functions as f
from pyspark.sql import types as t

l = [
    ('a',),
    ('b',),
]
s = t.StructType([
    t.StructField('col1', t.StringType())
])
df = spark.createDataFrame(l, s)
df.display()

(  # This expression raises an AnalysisException()
    df
    .cube(f.col('col1'))
    .agg(f.grouping('col1') & f.lit(True))
    .collect()
)
{code}
h1. Expected results
The code produces an {{AnalysisException()}} with an error message along the lines of:
{{AnalysisException: cannot resolve '(GROUPING(`col1`) AND true)' due to data type mismatch: differing types in '(GROUPING(`col1`) AND true)' (int and boolean).;}}
h1. Actual results
The code throws an {{AnalysisException()}} with error message
{{AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;}}
Python provides the following traceback:
{code:python}
---
AnalysisException                        Traceback (most recent call last)
 in
     15
     16 ( # This expression raises an AnalysisException()
---> 17     df
     18     .cube(f.col('col1'))
     19     .agg(f.grouping('col1') & f.lit(True))

/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
    116             # Columns
    117             assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
--> 118             jdf = self._jgd.agg(exprs[0]._jc,
    119                                 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
    120         return DataFrame(jdf, self.sql_ctx)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    121                 # Hide where the exception came from that shows a non-Pythonic
    122                 # JVM exception message.
--> 123                 raise converted from None
    124             else:
    125                 raise

AnalysisException: grouping() can only be used with GroupingSets/Cube/Rollup;
'Aggregate [cube(col1#548)], [col1#548, (grouping(col1#548) AND true) AS (grouping(col1) AND true)#551]
+- LogicalRDD [col1#548], false
{code}
h1. Workaround
_Note:_ The reason I opened this ticket is that, when the user makes a particular type error, the resulting error message is misleading. The code snippet below shows how to fix that type error. It does not address the false-error-message bug, which is the focus of this ticket.
Cast the result of {{.grouping()}} to boolean type. That is, know _ab ovo_ that {{.grouping()}} produces an integer 0 or 1 rather than a boolean True or False.
{code:python}
(  # This expression does not raise an AnalysisException()
    df
    .cube(f.col('col1'))
    .agg(f.grouping('col1').cast(t.BooleanType()) & f.lit(True))
    .collect()
)
{code}
h1. Additional notes
The same error occurs if {{.cube()}} is replaced with {{.rollup()}} in "Code to reproduce".
The same error occurs if {{.grouping()}} is replaced with {{.grouping_id()}} in "Code to reproduce".
h1. Related tickets
https://issues.apache.org/jira/browse/SPARK-22748
h1. Relevant documentation
* [Spark SQL GROUPBY, ROLLUP, and CUBE semantics|https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html]
* [DataFrame.cube()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.cube.html]
* [DataFrame.rollup()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.rollup.html]
* [DataFrame.agg()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.agg.html]
* [functions.grouping()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping.html]
* [functions.grouping_id()|https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.grouping_id.html]

was: h1. Code to reproduce {{print(spark.version) # My environment, Azure DataBricks, defines spark automatically.}} {{from pyspark.sql
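Since the "Additional notes" say the same error occurs with {{.grouping_id()}}, the same cast-based workaround should apply there as well; a sketch reusing the {{df}}, {{f}}, and {{t}} names from the repro ({{grouping_id()}} likewise returns an integer, not a boolean):

{code:python}
(  # Sketch: mirror the cast workaround for grouping_id()
    df
    .cube(f.col('col1'))
    .agg(f.grouping_id('col1').cast(t.BooleanType()) & f.lit(True))
    .collect()
)
{code}

For cubes over several columns, comparing the id explicitly (for example {{f.grouping_id(...) != f.lit(0)}}) may be clearer than a cast, since the id is a bit vector rather than a single flag.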
[jira] [Commented] (SPARK-39007) Use double quotes for SQL configs in error messages
[ https://issues.apache.org/jira/browse/SPARK-39007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527501#comment-17527501 ] Apache Spark commented on SPARK-39007: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/36340 > Use double quotes for SQL configs in error messages > --- > > Key: SPARK-39007 > URL: https://issues.apache.org/jira/browse/SPARK-39007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > All SQL configs should be printed in SQL style in error messages and wrapped > in double quotes. For example, the config spark.sql.ansi.enabled should be > highlighted as "spark.sql.ansi.enabled" to make it more visible in error > messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39009) Spark Log4j vul - CVE-2021-44228
[ https://issues.apache.org/jira/browse/SPARK-39009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-39009. -- Resolution: Duplicate https://issues.apache.org/jira/browse/SPARK-6305 But this is not how to use JIRA; read https://spark.apache.org/contributing.html > Spark Log4j vul - CVE-2021-44228 > > > Key: SPARK-39009 > URL: https://issues.apache.org/jira/browse/SPARK-39009 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.0.0 > Environment: Production >Reporter: Prakash Shankar >Priority: Major > > When can we expect the Spark 3.3 release? Can you please confirm whether > it'll fix the log4j issue? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37174) WARN WindowExec: No Partition Defined is being printed 4 times.
[ https://issues.apache.org/jira/browse/SPARK-37174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-37174: Attachment: (was: info.txt) > WARN WindowExec: No Partition Defined is being printed 4 times. > > > Key: SPARK-37174 > URL: https://issues.apache.org/jira/browse/SPARK-37174 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Hi, I use this code: > {code:python} > f01 = spark.read.json("/home/test_files/falk/flatted110721/F01.json/*.json") > pf01 = f01.to_pandas_on_spark() > pf01 = pf01.rename(columns=lambda x: re.sub(':P$', '', x)) > pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"] = > ps.to_datetime(pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"]) > pf01.info(){code} > > Sometimes it prints: > {code:java} > 21/10/31 20:38:04 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 20:38:04 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > 21/10/31 20:38:08 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > /opt/spark/python/pyspark/sql/pandas/conversion.py:214: PerformanceWarning: > DataFrame is highly fragmented. This is usually the result of calling > `frame.insert` many times, which has poor performance. Consider joining all > columns at once using pd.concat(axis=1) instead. To get a de-fragmented > frame, use `newframe = frame.copy()` > df[column_name] = series > /opt/spark/python/pyspark/pandas/utils.py:967: UserWarning: `to_pandas` > loads all data into the driver's memory. It should only be used if the > resulting pandas Series is expected to be small. > warnings.warn(message, UserWarning) > 21/10/31 20:38:16 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 20:38:18 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation.{code} > > and some other times it "just" prints: > > {code:java} > 21/10/31 21:24:13 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:16 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:22 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation. > 21/10/31 21:24:24 WARN WindowExec: No Partition Defined for Window > operation! Moving all data to a single partition, this can cause serious > performance degradation.{code} > Why does it print "df[column_name] = series"? > > Can we remove the /opt/spark/python/pyspark/pandas/utils.py:967 line? > And the "warnings.warn(message, UserWarning)" line? > And 3 of the 4 "WARN WindowExec: No Partition Defined for Window operation! Moving > all data to a single partition, this can cause serious performance > degradation." messages? 
> > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
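The warning itself comes from window operations whose spec defines no partitioning, so all rows are shuffled to a single partition; pandas-on-Spark builds such windows internally, for example for its default index. A minimal sketch of the difference (assuming a running {{spark}} session):

{code:python}
from pyspark.sql import Window, functions as f

df = spark.range(10).withColumn("key", f.col("id") % 2)

# No partitionBy: WindowExec warns and moves all rows to one partition.
w_all = Window.orderBy("id")
df.withColumn("rn", f.row_number().over(w_all)).show()

# With partitionBy: no warning, the work is distributed per key.
w_key = Window.partitionBy("key").orderBy("id")
df.withColumn("rn", f.row_number().over(w_key)).show()
{code}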
[jira] [Commented] (SPARK-38988) Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get printed many times.
[ https://issues.apache.org/jira/browse/SPARK-38988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527432#comment-17527432 ] Bjørn Jørgensen commented on SPARK-38988: - I added a new file "warning printed.txt"; it shows that the behavior depends on the dataframe size. If you have a dataframe with "Int64Index: 34 entries, 0 to 33, Data columns (total 37 columns)", the warning won't get printed. If the dataframe has "Int64Index: 109 entries, 0 to 108, Data columns (total 112 columns)", then the warning is printed 13 times. > Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get > printed many times. > --- > > Key: SPARK-38988 > URL: https://issues.apache.org/jira/browse/SPARK-38988 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > Attachments: Untitled.html, info.txt, warning printed.txt > > > I added a file and a notebook with the info message I get when I run df.info() > Spark master build from 13.04.22. > df.shape > (763300, 224) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
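If the goal is just to quiet the duplicated output on the user side while the root cause is investigated: the message is pandas' own {{PerformanceWarning}}, raised during the internal pandas conversion, so a standard warnings filter applies (a sketch, not a fix for the underlying fragmentation):

{code:python}
import warnings
from pandas.errors import PerformanceWarning

# Suppress pandas' fragmentation warning emitted during conversion.
warnings.filterwarnings("ignore", category=PerformanceWarning)
{code}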
[jira] [Updated] (SPARK-38988) Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get printed many times.
[ https://issues.apache.org/jira/browse/SPARK-38988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-38988: Attachment: warning printed.txt > Pandas API - "PerformanceWarning: DataFrame is highly fragmented." get > printed many times. > --- > > Key: SPARK-38988 > URL: https://issues.apache.org/jira/browse/SPARK-38988 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0, 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > Attachments: Untitled.html, info.txt, warning printed.txt > > > I added a file and a notebook with the info message I get when I run df.info() > Spark master build from 13.04.22. > df.shape > (763300, 224) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38965) Optimize RemoteBlockPushResolver with a memory pool
[ https://issues.apache.org/jira/browse/SPARK-38965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-38965: Summary: Optimize RemoteBlockPushResolver with a memory pool (was: Retry transfer blocks for exceptions listed in the error handler ) > Optimize RemoteBlockPushResolver with a memory pool > --- > > Key: SPARK-38965 > URL: https://issues.apache.org/jira/browse/SPARK-38965 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Minor > > For the push-based shuffle service, there are many > {{BLOCK_APPEND_COLLISION_DETECTED}} failures when there are many small map task > outputs. In {{RemoteBlockPushResolver}}, while one map task's pushed blocks are > being written, the other map tasks' pushed blocks will fail in the {{onComplete()}} > method. > And {{RemoteBlockPushResolver}} has no memory limit, so many executors will > OOM when there are many small pushed blocks waiting to be written to the > final data file. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
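The memory-pool idea can be sketched generically: cap the bytes buffered for deferred pushed blocks and fail fast (or defer) once the cap is reached, instead of buffering without bound. A toy illustration only, in Python for readability; the real change would live in the Java shuffle service code:

{code:python}
import threading

class BlockBufferPool:
    """Toy bounded pool: acquire() fails fast once the cap is reached."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0
        self.lock = threading.Lock()

    def acquire(self, n: int) -> bool:
        with self.lock:
            if self.used + n > self.max_bytes:
                return False  # caller defers or rejects the pushed block
            self.used += n
            return True

    def release(self, n: int) -> None:
        with self.lock:
            self.used = max(0, self.used - n)
{code}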
[jira] [Commented] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527402#comment-17527402 ] Apache Spark commented on SPARK-39001: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/36339 > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://github.com/apache/spark/pull/36294. Some CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
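For context, this is the kind of mismatch being described: an option that works in {{DataFrameReader}} is silently ignored inside an expression. A minimal sketch (assuming a running {{spark}} session; per the ticket, {{mode}} is a plan-wise option with no effect in {{from_csv}}):

{code:python}
from pyspark.sql import functions as f

df = spark.createDataFrame([("not_an_int,abc",)], ["value"])

# In DataFrameReader, mode=DROPMALFORMED would drop the malformed row.
# Inside the from_csv expression the option has no effect: the row is
# still returned, parsed permissively with a null field.
df.select(
    f.from_csv(f.col("value"), "a INT, b STRING", {"mode": "DROPMALFORMED"}).alias("parsed")
).show(truncate=False)
{code}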
[jira] [Assigned] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39001: Assignee: Apache Spark > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > See https://github.com/apache/spark/pull/36294. Some CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39001: Assignee: (was: Apache Spark) > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://github.com/apache/spark/pull/36294. Some CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39001) Document which options are unsupported in CSV and JSON functions
[ https://issues.apache.org/jira/browse/SPARK-39001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527400#comment-17527400 ] Hyukjin Kwon commented on SPARK-39001: -- Actually, this is pretty straightforward. Let me just make a quick PR. > Document which options are unsupported in CSV and JSON functions > > > Key: SPARK-39001 > URL: https://issues.apache.org/jira/browse/SPARK-39001 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://github.com/apache/spark/pull/36294. Some CSV and JSON options > don't work in expressions because some of them are plan-wise options like > parseMode = DROPMALFORMED. > We should document which options do not work; possibly we should > also throw an exception. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38965) Retry transfer blocks for exceptions listed in the error handler
[ https://issues.apache.org/jira/browse/SPARK-38965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-38965: Description: For the push-based shuffle service, there are many {{BLOCK_APPEND_COLLISION_DETECTED}} failures when there are many small map task outputs. In {{RemoteBlockPushResolver}}, while one map task's pushed blocks are being written, the other map tasks' pushed blocks will fail in the {{onComplete()}} method. And {{RemoteBlockPushResolver}} has no memory limit, so many executors will OOM when there are many small pushed blocks waiting to be written to the final data file. was: We should retry transfer blocks if *errorHandler.shouldRetryError(e)* returns true, even though that exception may not be an IOException, for example: {code:java} org.apache.spark.network.server.BlockPushNonFatalFailure: Block shufflePush_0_0_3316_5647 experienced merge collision on the server side {code} > Retry transfer blocks for exceptions listed in the error handler > - > > Key: SPARK-38965 > URL: https://issues.apache.org/jira/browse/SPARK-38965 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Minor > > For the push-based shuffle service, there are many > {{BLOCK_APPEND_COLLISION_DETECTED}} failures when there are many small map task > outputs. In {{RemoteBlockPushResolver}}, while one map task's pushed blocks are > being written, the other map tasks' pushed blocks will fail in the {{onComplete()}} > method. > And {{RemoteBlockPushResolver}} has no memory limit, so many executors will > OOM when there are many small pushed blocks waiting to be written to the > final data file. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39007) Use double quotes for SQL configs in error messages
[ https://issues.apache.org/jira/browse/SPARK-39007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39007. -- Resolution: Fixed Issue resolved by pull request 36335 [https://github.com/apache/spark/pull/36335] > Use double quotes for SQL configs in error messages > --- > > Key: SPARK-39007 > URL: https://issues.apache.org/jira/browse/SPARK-39007 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > All SQL configs should be printed in SQL style in error messages and wrapped > in double quotes. For example, the config spark.sql.ansi.enabled should be > highlighted as "spark.sql.ansi.enabled" to make it more visible in error > messages. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38999) Refactor DataSourceScanExec code to
[ https://issues.apache.org/jira/browse/SPARK-38999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38999: --- Assignee: Utkarsh Agarwal > Refactor DataSourceScanExec code to > > > Key: SPARK-38999 > URL: https://issues.apache.org/jira/browse/SPARK-38999 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Utkarsh Agarwal >Assignee: Utkarsh Agarwal >Priority: Major > > Currently the code for the `FileSourceScanExec` class, the physical node for > file scans, is quite complex and lengthy. The class should be refactored into > a trait `FileSourceScanLike` which implements basic functionality like > metrics and file listing. The execution-specific code can then live inside > `FileSourceScanExec`, which will subclass `FileSourceScanLike`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38999) Refactor DataSourceScanExec code to
[ https://issues.apache.org/jira/browse/SPARK-38999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38999. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36327 [https://github.com/apache/spark/pull/36327] > Refactor DataSourceScanExec code to > > > Key: SPARK-38999 > URL: https://issues.apache.org/jira/browse/SPARK-38999 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Utkarsh Agarwal >Assignee: Utkarsh Agarwal >Priority: Major > Fix For: 3.4.0 > > > Currently the code for the `FileSourceScanExec` class, the physical node for > file scans, is quite complex and lengthy. The class should be refactored into > a trait `FileSourceScanLike` which implements basic functionality like > metrics and file listing. The execution-specific code can then live inside > `FileSourceScanExec`, which will subclass `FileSourceScanLike`. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38981) Unexpected commutative property of udf/pandas_udf and filters
[ https://issues.apache.org/jira/browse/SPARK-38981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38981: - Priority: Major (was: Critical) > Unexpected commutative property of udf/pandas_udf and filters > - > > Key: SPARK-38981 > URL: https://issues.apache.org/jira/browse/SPARK-38981 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Maximilian Sackel >Priority: Major > Labels: beginner > Attachments: optimization_udf_filter.html, screenshot-1.png, > screenshot-2.png > > > Hello all, > When running the minimal working example in the attachments, the > order of the filter and the UDF is swapped by the optimizer. This can lead to > errors that are difficult to debug. In the documentation I have found no > reference to such behavior. > Is this a bug or a functionality which is poorly documented? > With kind regards, > Max -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38981) Unexpected commutative property of udf/pandas_udf and filters
[ https://issues.apache.org/jira/browse/SPARK-38981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38981: - Component/s: PySpark > Unexpected commutative property of udf/pandas_udf and filters > - > > Key: SPARK-38981 > URL: https://issues.apache.org/jira/browse/SPARK-38981 > Project: Spark > Issue Type: Bug > Components: Optimizer, PySpark >Affects Versions: 3.2.1 >Reporter: Maximilian Sackel >Priority: Major > Attachments: optimization_udf_filter.html, screenshot-1.png, > screenshot-2.png > > > Hello all, > When running the minimal working example in the attachments, the > order of the filter and the UDF is swapped by the optimizer. This can lead to > errors that are difficult to debug. In the documentation I have found no > reference to such behavior. > Is this a bug or a functionality which is poorly documented? > With kind regards, > Max -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38981) Unexpected commutative property of udf/pandas_udf and filters
[ https://issues.apache.org/jira/browse/SPARK-38981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38981: - Labels: (was: beginner) > Unexpected commutative property of udf/pandas_udf and filters > - > > Key: SPARK-38981 > URL: https://issues.apache.org/jira/browse/SPARK-38981 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Maximilian Sackel >Priority: Major > Attachments: optimization_udf_filter.html, screenshot-1.png, > screenshot-2.png > > > Hello all, > When running the minimal working example in the attachments, the > order of the filter and the UDF is swapped by the optimizer. This can lead to > errors that are difficult to debug. In the documentation I have found no > reference to such behavior. > Is this a bug or a functionality which is poorly documented? > With kind regards, > Max -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39009) Spark Log4j vul - CVE-2021-44228
Prakash Shankar created SPARK-39009: --- Summary: Spark Log4j vul - CVE-2021-44228 Key: SPARK-39009 URL: https://issues.apache.org/jira/browse/SPARK-39009 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.0.0 Environment: Production Reporter: Prakash Shankar When can we expect the Spark 3.3 release? Can you please confirm whether it'll fix the log4j issue? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38981) Unexpected commutative property of udf/pandas_udf and filters
[ https://issues.apache.org/jira/browse/SPARK-38981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527322#comment-17527322 ] Maximilian Sackel commented on SPARK-38981: --- To give the minimal working example some more context, I'll try to motivate it. In general, a function should be applied to a large table for a certain category type. The task is therefore divided into subtasks: a) For each row, determine the category type using a UDF. b) Filter rows by the searched category types. c) Calculate values for the category types using a UDF; if rows that do not correspond to the category are used in the calculation, an error is thrown. d) The error terminates the whole process. Simply adding the rule "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate" to the exclude rules does not seem to solve the problem. [~hyukjin.kwon], it would be really nice if you could refer me to the appropriate place in the documentation where I can start testing. The basic idea is to exclude the optimizer rules for the corresponding lines and then reactivate them, to make use of the optimizer algorithms again? > Unexpected commutative property of udf/pandas_udf and filters > - > > Key: SPARK-38981 > URL: https://issues.apache.org/jira/browse/SPARK-38981 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 >Reporter: Maximilian Sackel >Priority: Critical > Labels: beginner > Attachments: optimization_udf_filter.html, screenshot-1.png, > screenshot-2.png > > > Hello all, > When running the minimal working example in the attachments, the > order of the filter and the UDF is swapped by the optimizer. This can lead to > errors that are difficult to debug. In the documentation I have found no > reference to such behavior. > Is this a bug or a functionality which is poorly documented? > With kind regards, > Max -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
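A compact sketch of the pattern described in the comment above (names are hypothetical; whether the swap actually occurs depends on the optimized plan). The hazard is that both Python UDFs can be collected into one evaluation batch, so the second UDF may run on rows the filter was written to remove:

{code:python}
from pyspark.sql import functions as f

@f.udf("string")
def category(x):
    # Step (a): classify each row.
    return "A" if x % 2 == 0 else "B"

@f.udf("double")
def value_for_a(x):
    # Step (c): only valid for category "A" rows; anything else errors.
    if x % 2 != 0:
        raise ValueError(f"row {x} is not category A")
    return float(x) * 10.0

df = spark.range(10).withColumn("x", f.col("id").cast("int"))

# Step (b): the filter sits between the two UDFs in the written order,
# but the optimizer may evaluate value_for_a before the filter prunes rows,
# reproducing steps (c) and (d) from the comment.
result = (
    df
    .withColumn("cat", category(f.col("x")))
    .filter(f.col("cat") == "A")
    .withColumn("val", value_for_a(f.col("x")))
)
result.show()
{code}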