[jira] [Updated] (SPARK-46985) Move _NoValue from pyspark.* to pyspark.sql.*

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46985:
---
Labels: pull-request-available  (was: )

> Move _NoValue from pyspark.* to pyspark.sql.*
> -
>
> Key: SPARK-46985
> URL: https://issues.apache.org/jira/browse/SPARK-46985
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> _NoValue is only used in pyspark.sql and the pandas API on Spark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46679) Encoders with multiple inheritance - Key not found: T

2024-02-05 Thread Andoni Teso (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andoni Teso updated SPARK-46679:

Affects Version/s: 4.0.0

> Encoders with multiple inheritance - Key not found: T
> -
>
> Key: SPARK-46679
> URL: https://issues.apache.org/jira/browse/SPARK-46679
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 4.0.0
>Reporter: Andoni Teso
>Priority: Blocker
> Attachments: spark_test.zip
>
>
> Since version 3.4, I've been experiencing the following error when using 
> encoders.
> {code:java}
> Exception in thread "main" java.util.NoSuchElementException: key not found: T
>     at scala.collection.immutable.Map$Map1.apply(Map.scala:163)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:121)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53)
>     at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62)
>     at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179)
>     at org.apache.spark.sql.Encoders.bean(Encoders.scala)
>     at org.example.Main.main(Main.java:26) {code}
> I'm attaching the code I use to reproduce the error locally.  
> [^spark_test.zip]
> The issue is in the JavaTypeInference class when it tries to find the encoder 
> for a ParameterizedType whose value is Team. When 
> JavaTypeUtils.getTypeArguments(pt).asScala.toMap is run, it returns the type 
> variable T again, but this time mapped to a Company object, with pt.getRawType 
> being Team. This ends up putting a (Team, Company) entry in the typeVariables 
> map, which leads to errors when looking up TypeVariables.
> In my case, I've resolved this by doing the following:
> {code:java}
> case tv: TypeVariable[_] =>
>   encoderFor(typeVariables.head._2, seenTypeSet, typeVariables)
> case pt: ParameterizedType =>
>   encoderFor(pt.getRawType, seenTypeSet, typeVariables) {code}
> I haven't submitted a pull request because this doesn't seem to be the optimal 
> solution, and it might break some parts of the code. Additional validations or 
> conditions may need to be added.
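> For illustration, the kind of generic bean hierarchy involved looks roughly 
> like the sketch below. The class shapes are hypothetical (only the Team and 
> Company names come from the description above), and whether this exact shape 
> fails depends on the hierarchy in the attached [^spark_test.zip].
> {code:java}
> import org.apache.spark.sql.Encoder;
> import org.apache.spark.sql.Encoders;
>
> public class Main {
>     public static class Company implements java.io.Serializable {
>         private String name;
>         public String getName() { return name; }
>         public void setName(String name) { this.name = name; }
>     }
>
>     // A generic supertype forces JavaTypeInference to resolve the type
>     // variable T through its typeVariables map while deriving the encoder.
>     public abstract static class BaseEntity<T> implements java.io.Serializable {
>         private T owner;
>         public T getOwner() { return owner; }
>         public void setOwner(T owner) { this.owner = owner; }
>     }
>
>     public static class Team extends BaseEntity<Company> {
>         private String teamName;
>         public String getTeamName() { return teamName; }
>         public void setTeamName(String teamName) { this.teamName = teamName; }
>     }
>
>     public static void main(String[] args) {
>         // In the reproduction, Encoders.bean is the call where
>         // "java.util.NoSuchElementException: key not found: T" is thrown
>         // (Main.java:26 in the stack trace above).
>         Encoder<Team> encoder = Encoders.bean(Team.class);
>         System.out.println(encoder.schema().treeString());
>     }
> }
> {code}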



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46679) Encoders with multiple inheritance - Key not found: T

2024-02-05 Thread Andoni Teso (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andoni Teso updated SPARK-46679:

Priority: Critical  (was: Blocker)

> Encoders with multiple inheritance - Key not found: T
> -
>
> Key: SPARK-46679
> URL: https://issues.apache.org/jira/browse/SPARK-46679
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 4.0.0
>Reporter: Andoni Teso
>Priority: Critical
> Attachments: spark_test.zip
>
>
> Since version 3.4, I've been experiencing the following error when using 
> encoders.
> {code:java}
> Exception in thread "main" java.util.NoSuchElementException: key not found: T
>     at scala.collection.immutable.Map$Map1.apply(Map.scala:163)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:121)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53)
>     at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62)
>     at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179)
>     at org.apache.spark.sql.Encoders.bean(Encoders.scala)
>     at org.example.Main.main(Main.java:26) {code}
> I'm attaching the code I use to reproduce the error locally.  
> [^spark_test.zip]
> The issue is in the JavaTypeInference class when it tries to find the encoder 
> for a ParameterizedType whose value is Team. When 
> JavaTypeUtils.getTypeArguments(pt).asScala.toMap is run, it returns the type 
> variable T again, but this time mapped to a Company object, with pt.getRawType 
> being Team. This ends up putting a (Team, Company) entry in the typeVariables 
> map, which leads to errors when looking up TypeVariables.
> In my case, I've resolved this by doing the following:
> {code:java}
> case tv: TypeVariable[_] =>
>   encoderFor(typeVariables.head._2, seenTypeSet, typeVariables)
> case pt: ParameterizedType =>
>   encoderFor(pt.getRawType, seenTypeSet, typeVariables) {code}
> I haven't submitted a pull request because this doesn't seem to be the optimal 
> solution, and it might break some parts of the code. Additional validations or 
> conditions may need to be added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46984) Remove pyspark.copy_func

2024-02-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-46984:
-
Priority: Minor  (was: Major)

> Remove pyspark.copy_func
> 
>
> Key: SPARK-46984
> URL: https://issues.apache.org/jira/browse/SPARK-46984
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46985) Move _NoValue from pyspark.* to pyspark.sql.*

2024-02-05 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-46985:


 Summary: Move _NoValue from pyspark.* to pyspark.sql.*
 Key: SPARK-46985
 URL: https://issues.apache.org/jira/browse/SPARK-46985
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


_NoValue is only used in pyspark.sql and the pandas API on Spark.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46984) Remove pyspark.copy_func

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46984:
---
Labels: pull-request-available  (was: )

> Remove pyspark.copy_func
> 
>
> Key: SPARK-46984
> URL: https://issues.apache.org/jira/browse/SPARK-46984
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46984) Remove pyspark.copy_func

2024-02-05 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-46984:


 Summary: Remove pyspark.copy_func
 Key: SPARK-46984
 URL: https://issues.apache.org/jira/browse/SPARK-46984
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46983) Decouple module dependencies between PySpark modules

2024-02-05 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-46983:


 Summary: Decouple module dependencies between PySpark modules
 Key: SPARK-46983
 URL: https://issues.apache.org/jira/browse/SPARK-46983
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


We have unnecessary dependencies between the individual PySpark modules. We should 
remove them so that each package can be self-contained.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46170) Support inject adaptive query post planner strategy rules in SparkSessionExtensions

2024-02-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-46170:
-
Fix Version/s: 3.5.1

> Support inject adaptive query post planner strategy rules in 
> SparkSessionExtensions
> ---
>
> Key: SPARK-46170
> URL: https://issues.apache.org/jira/browse/SPARK-46170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46982) Remove _LEGACY_ERROR_TEMP_2187 in favor of CANNOT_RECOGNIZE_HIVE_TYPE

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46982:
---
Labels: pull-request-available  (was: )

> Remove _LEGACY_ERROR_TEMP_2187 in favor of CANNOT_RECOGNIZE_HIVE_TYPE
> -
>
> Key: SPARK-46982
> URL: https://issues.apache.org/jira/browse/SPARK-46982
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables

2024-02-05 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-46981:
--
Description: 
We have observed that a driver OOM happens in the query planning phase with empty 
tables when we run specific patterns of queries.
h2. Issue details

If we run the query with the where condition {{pt>='20231004' and pt<='20231004'}}, 
the query fails in the planning phase due to a driver OOM, more specifically 
"java.lang.OutOfMemoryError: GC overhead limit exceeded".

If we change the where condition from {{pt>='20231004' and pt<='20231004'}} to 
{{pt='20231004' or pt='20231005'}}, the SQL runs without any error.

This issue happens even with an empty table, and it happens before any actual data 
is loaded. This appears to be an issue on the Catalyst side.
h2. Reproduction steps

Attaching script and query to reproduce the issue.
 * create_sanitized_tables.py: Script to create table definitions
 ** No need to place any data files; the issue occurs even with an empty table location.
 * test_and_twodays_simplified.sql: Query to reproduce the issue
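
As a minimal sketch of the pattern being compared (the table and query below are 
hypothetical simplifications; the real tables and the failing query are in the 
attached files, and this trivial query is not by itself expected to reproduce the 
OOM):
{code:java}
import org.apache.spark.sql.SparkSession;

public class PartitionPredicateSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("SPARK-46981-sketch")
            .enableHiveSupport()
            .getOrCreate();

        // An empty partitioned table with a string partition column pt.
        spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (id BIGINT, v STRING) "
            + "PARTITIONED BY (pt STRING) STORED AS PARQUET");

        // Range form of the partition predicate -- the shape reported to OOM
        // the driver while planning the (much larger) attached query.
        spark.sql("SELECT * FROM demo_tbl "
            + "WHERE pt >= '20231004' AND pt <= '20231004'").explain();

        // Equality/OR form of the predicate -- reported to plan without error.
        spark.sql("SELECT * FROM demo_tbl "
            + "WHERE pt = '20231004' OR pt = '20231005'").explain();
    }
}
{code}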

Here's the typical stacktrace:

~at scala.collection.immutable.Vector.iterator(Vector.scala:100)~
~at scala.collection.immutable.Vector.iterator(Vector.scala:69)~
~at scala.collection.IterableLike.foreach(IterableLike.scala:74)~
~at scala.collection.IterableLike.foreach$(IterableLike.scala:73)~
~at scala.collection.AbstractIterable.foreach(Iterable.scala:56)~
~at 
scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)~
~at 
scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)~
~at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)~
~at 
org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)~
~at 
org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)~
~at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)~
~at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown
 Source)~
~at scala.Option.getOrElse(Option.scala:189)~
~at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)~
~at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)~
~at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)~
~at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown
 Source)~
~at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)~
~at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)~
~at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)~
~at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)~
~at 
org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)~
~at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)~
~at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown
 Source)~
~at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)~
~at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)~
~at scala.collection.Iterator.foreach(Iterator.scala:943)~
~at scala.collection.Iterator.foreach$(Iterator.scala:943)~
~at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)~
~at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)~
~at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)~
~at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)~
~GC overhead limit exceeded~
~java.lang.OutOfMemoryError: GC overhead limit exceeded~
~at scala.collection.immutable.Vector.iterator(Vector.scala:100)~
~at scala.collection.immutable.Vector.iterator(Vector.scala:69)~
~at scala.collection.IterableLike.foreach(IterableLike.scala:74)~
~at scala.collection.IterableLike.foreach$(IterableLike.scala:73)~
~at scala.collection.AbstractIterable.foreach(Iterable.scala:56)~
~at 
scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)~
~at 
scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)~
~at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)~
~at 
org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)~
~at 
org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)~
~at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)~
~at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown
 Source)~
~at scala.Option.getOrElse(Option.scala:189)~
~at 
org.apache.spark.sql.catalyst.planning.Physic

[jira] [Created] (SPARK-46982) Remove _LEGACY_ERROR_TEMP_2187 in favor of CANNOT_RECOGNIZE_HIVE_TYPE

2024-02-05 Thread Kent Yao (Jira)
Kent Yao created SPARK-46982:


 Summary: Remove _LEGACY_ERROR_TEMP_2187 in favor of 
CANNOT_RECOGNIZE_HIVE_TYPE
 Key: SPARK-46982
 URL: https://issues.apache.org/jira/browse/SPARK-46982
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables

2024-02-05 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-46981:
--
Description: 
We have observed that a driver OOM happens in the query planning phase with empty 
tables when we run specific patterns of queries.
h2. Issue details

If we run the query with the where condition {{pt>='20231004' and pt<='20231004'}}, 
the query fails in the planning phase due to a driver OOM, more specifically 
"java.lang.OutOfMemoryError: GC overhead limit exceeded".

If we change the where condition from {{pt>='20231004' and pt<='20231004'}} to 
{{pt='20231004' or pt='20231005'}}, the SQL runs without any error.

This issue happens even with an empty table, and it happens before any actual data 
is loaded. This appears to be an issue on the Catalyst side.
h2. Reproduction steps

Attaching script and query to reproduce the issue.
 * create_sanitized_tables.py: Script to create table definitions
 * test_and_twodays_simplified.sql: Query to reproduce the issue

Here's the typical stacktrace:

~at scala.collection.immutable.Vector.iterator(Vector.scala:100)~
~at scala.collection.immutable.Vector.iterator(Vector.scala:69)~
~at scala.collection.IterableLike.foreach(IterableLike.scala:74)~
~at scala.collection.IterableLike.foreach$(IterableLike.scala:73)~
~at scala.collection.AbstractIterable.foreach(Iterable.scala:56)~
~at 
scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)~
~at 
scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)~
~at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)~
~at 
org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)~
~at 
org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)~
~at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)~
~at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown
 Source)~
~at scala.Option.getOrElse(Option.scala:189)~
~at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)~
~at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)~
~at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)~
~at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown
 Source)~
~at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)~
~at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)~
~at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)~
~at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)~
~at 
org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)~
~at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)~
~at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown
 Source)~
~at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)~
~at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)~
~at scala.collection.Iterator.foreach(Iterator.scala:943)~
~at scala.collection.Iterator.foreach$(Iterator.scala:943)~
~at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)~
~at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)~
~at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)~
~at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)~
~GC overhead limit exceeded~
~java.lang.OutOfMemoryError: GC overhead limit exceeded~
~at scala.collection.immutable.Vector.iterator(Vector.scala:100)~
~at scala.collection.immutable.Vector.iterator(Vector.scala:69)~
~at scala.collection.IterableLike.foreach(IterableLike.scala:74)~
~at scala.collection.IterableLike.foreach$(IterableLike.scala:73)~
~at scala.collection.AbstractIterable.foreach(Iterable.scala:56)~
~at 
scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)~
~at 
scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)~
~at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)~
~at 
org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)~
~at 
org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)~
~at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)~
~at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown
 Source)~
~at scala.Option.getOrElse(Option.scala:189)~
~at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)~
~at 
org.apache.spark.sql.hive.

[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables

2024-02-05 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-46981:
--
Attachment: test_and_twodays_simplified.sql

> Driver OOM happens in query planning phase with empty tables
> 
>
> Key: SPARK-46981
> URL: https://issues.apache.org/jira/browse/SPARK-46981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: * OSS Spark 3.5.0
>  * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0)
>  * AWS Glue Spark 3.3.0 (Glue version 4.0)
>Reporter: Noritaka Sekiyama
>Priority: Major
> Attachments: create_sanitized_tables.py, 
> test_and_twodays_simplified.sql
>
>
> We have observed that a driver OOM happens in the query planning phase with 
> empty tables when we run specific patterns of queries.
> h2. Issue details
> If we run the query with the where condition {{pt>='20231004' and 
> pt<='20231004'}}, the query fails in the planning phase due to a driver OOM, 
> more specifically "java.lang.OutOfMemoryError: GC overhead limit exceeded".
> If we change the where condition from {{pt>='20231004' and pt<='20231004'}} 
> to {{pt='20231004' or pt='20231005'}}, the SQL runs without any error.
>  
> This issue happens even with an empty table, and it happens before any actual 
> data is loaded. This appears to be an issue on the Catalyst side.
> h2. Reproduction steps
> Attaching script and query to reproduce the issue.
>  * create_sanitized_tables.py: Script to create table definitions
>  * test_and_twodays_simplified.sql: Query to reproduce the issue
> Here's the typical stacktrace:
> {{  at scala.collection.immutable.Vector.iterator(Vector.scala:100)
> at scala.collection.immutable.Vector.iterator(Vector.scala:69)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> at 
> scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
> at 
> scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
> at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
> at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
> at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown
>  Source)
> at scala.Option.getOrElse(Option.scala:189)
> at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)
> at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
> at 
> org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown
>  Source)
> at 
> scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
> at 
> scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
> at scala.collection.Iterator.foreach(Iterator.scala:943)
> at scala.collection.Iterator.foreach$(Iterator.scala:943)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
> at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
> at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
> at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
> GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at scala.collection.immutable.Vector.iterator(Vector.scala:100)
> at scala.collection.immutable.Vector.iterator(Vector.scala:69)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collec

[jira] [Created] (SPARK-46981) Driver OOM happens in query planning phase with empty tables

2024-02-05 Thread Noritaka Sekiyama (Jira)
Noritaka Sekiyama created SPARK-46981:
-

 Summary: Driver OOM happens in query planning phase with empty 
tables
 Key: SPARK-46981
 URL: https://issues.apache.org/jira/browse/SPARK-46981
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
 Environment: * OSS Spark 3.5.0
 * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0)
 * AWS Glue Spark 3.3.0 (Glue version 4.0)
Reporter: Noritaka Sekiyama
 Attachments: create_sanitized_tables.py

We have observed that a driver OOM happens in the query planning phase with empty 
tables when we run specific patterns of queries.
h2. Issue details

If we run the query with the where condition {{pt>='20231004' and pt<='20231004'}}, 
the query fails in the planning phase due to a driver OOM, more specifically 
"java.lang.OutOfMemoryError: GC overhead limit exceeded".

If we change the where condition from {{pt>='20231004' and pt<='20231004'}} to 
{{pt='20231004' or pt='20231005'}}, the SQL runs without any error.

This issue happens even with an empty table, and it happens before any actual data 
is loaded. This appears to be an issue on the Catalyst side.
h2. Reproduction steps

Attaching script and query to reproduce the issue.
 * create_sanitized_tables.py: Script to create table definitions
 * test_and_twodays_simplified.sql: Query to reproduce the issue

Here's the typical stacktrace:

{{  at scala.collection.immutable.Vector.iterator(Vector.scala:100)
at scala.collection.immutable.Vector.iterator(Vector.scala:69)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
at 
scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
at 
org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
at 
org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown
 Source)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown
 Source)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at 
org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown
 Source)
at 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
at 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.Vector.iterator(Vector.scala:100)
at scala.collection.immutable.Vector.iterator(Vector.scala:69)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
at 
scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
at 
org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
  

[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables

2024-02-05 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-46981:
--
Attachment: create_sanitized_tables.py

> Driver OOM happens in query planning phase with empty tables
> 
>
> Key: SPARK-46981
> URL: https://issues.apache.org/jira/browse/SPARK-46981
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
> Environment: * OSS Spark 3.5.0
>  * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0)
>  * AWS Glue Spark 3.3.0 (Glue version 4.0)
>Reporter: Noritaka Sekiyama
>Priority: Major
> Attachments: create_sanitized_tables.py
>
>
> We have observed that a driver OOM happens in the query planning phase with 
> empty tables when we run specific patterns of queries.
> h2. Issue details
> If we run the query with the where condition {{pt>='20231004' and 
> pt<='20231004'}}, the query fails in the planning phase due to a driver OOM, 
> more specifically "java.lang.OutOfMemoryError: GC overhead limit exceeded".
> If we change the where condition from {{pt>='20231004' and pt<='20231004'}} 
> to {{pt='20231004' or pt='20231005'}}, the SQL runs without any error.
>  
> This issue happens even with an empty table, and it happens before any actual 
> data is loaded. This appears to be an issue on the Catalyst side.
> h2. Reproduction steps
> Attaching script and query to reproduce the issue.
>  * create_sanitized_tables.py: Script to create table definitions
>  * test_and_twodays_simplified.sql: Query to reproduce the issue
> Here's the typical stacktrace:
> {{  at scala.collection.immutable.Vector.iterator(Vector.scala:100)
> at scala.collection.immutable.Vector.iterator(Vector.scala:69)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> at 
> scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
> at 
> scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
> at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
> at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
> at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown
>  Source)
> at scala.Option.getOrElse(Option.scala:189)
> at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)
> at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown
>  Source)
> at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
> at 
> org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown
>  Source)
> at 
> scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
> at 
> scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
> at scala.collection.Iterator.foreach(Iterator.scala:943)
> at scala.collection.Iterator.foreach$(Iterator.scala:943)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
> at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
> at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
> at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
> GC overhead limit exceeded
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at scala.collection.immutable.Vector.iterator(Vector.scala:100)
> at scala.collection.immutable.Vector.iterator(Vector.scala:69)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.sc

[jira] [Resolved] (SPARK-46958) missing timezone to coerce default values

2024-02-05 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-46958.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45000
[https://github.com/apache/spark/pull/45000]

> missing timezone to coerce default values
> -
>
> Key: SPARK-46958
> URL: https://issues.apache.org/jira/browse/SPARK-46958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> ```
> create table src(key int, c string DEFAULT date'2018-11-17') using parquet;
> Time taken: 0.133 seconds
> spark-sql (default)> desc src;
> [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. 
> You hit a bug in Spark or the Spark plugins you use. Please, report this bug 
> to the corresponding communities or vendors, and provide the full stack trace.
> org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase 
> analysis failed with an internal error. You hit a bug in Spark or the Spark 
> plugins you use. Please, report this bug to the corresponding communities or 
> vendors, and provide the full stack trace.
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46934:
---
Labels: pull-request-available  (was: )

> Unable to create Hive View from certain Spark Dataframe StructType
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.2, 3.3.4
> Environment: Tested in Spark 3.3.0, 3.3.2.
>Reporter: Yu-Ting LIN
>Priority: Blocker
>  Labels: pull-request-available
>
> We are trying to create a Hive view using the following SQL command: "CREATE OR 
> REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as 
> "/". Here is the schema of this table.
> {code:java}
> contigName              string
> start                   bigint
> end                     bigint
> names                   array
> referenceAllele         string
> alternateAlleles        array
> qual                    double
> filters                 array
> splitFromMultiAllelic    boolean
> INFO_NCAMP              int
> INFO_ODDRATIO           double
> INFO_NM                 double
> INFO_DBSNP_CAF          array
> INFO_SPANPAIR           int
> INFO_TLAMP              int
> INFO_PSTD               double
> INFO_QSTD               double
> INFO_SBF                double
> INFO_AF                 array
> INFO_QUAL               double
> INFO_SHIFT3             int
> INFO_VARBIAS            string
> INFO_HICOV              int
> INFO_PMEAN              double
> INFO_MSI                double
> INFO_VD                 int
> INFO_DP                 int
> INFO_HICNT              int
> INFO_ADJAF              double
> INFO_SVLEN              int
> INFO_RSEQ               string
> INFO_MSigDb             array
> INFO_NMD                array
> INFO_ANN                
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS               string
> INFO_MQ                 double
> INFO_HIAF               double
> INFO_END                int
> INFO_SPLITREAD          int
> INFO_GDAMP              int
> INFO_LSEQ               string
> INFO_LOF                array
> INFO_SAMPLE             string
> INFO_AMPFLAG            int
> INFO_SN                 double
> INFO_SVTYPE             string
> INFO_TYPE               string
> INFO_MSILEN             double
> INFO_DUPRATE            double
> INFO_DBSNP_COMMON       int
> INFO_REFBIAS            string
> genotypes               
> array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
>  {code}
> You can see that the column INFO_ANN is an array of structs containing fields 
> whose names include "/", such as "cDNA_pos/cDNA_length". 
> We believe that this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT 
> INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN  o.a.spark.sql.catalyst.util.package - 
> Truncated the string representation of a plan since it was too large. This 
> behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>,
>  column: INFO_ANN
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType(HiveClientImpl.scala:1037)
>   at 
> org.apache.spark.sql.hive.client

[jira] [Updated] (SPARK-46979) Add support for defining state encoder for key/value and col family independently

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46979:
---
Labels: pull-request-available  (was: )

> Add support for defining state encoder for key/value and col family 
> independently
> -
>
> Key: SPARK-46979
> URL: https://issues.apache.org/jira/browse/SPARK-46979
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Anish Shrigondekar
>Priority: Major
>  Labels: pull-request-available
>
> Add support for defining state encoder for key/value and col family 
> independently



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType

2024-02-05 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814588#comment-17814588
 ] 

Kent Yao commented on SPARK-46934:
--

Hi [~yutinglin],  How can I create an element named `AA_pos/AA_length` with 
Hive DDLs? 

I tried to use Hive 2.3.9 in Spark, but it failed. 


{code:java}
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: 
Error: : expected at the position 8 of 'struct' but '/' is found.
{code}
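
For reference, a field name containing "/" does not need Hive DDL at all: Spark's 
own StructType accepts it, so a schema like the one in the report can come straight 
from a DataFrame. A minimal, hypothetical sketch (field and column names reused 
from the report, schema heavily simplified):
{code:java}
import java.util.Collections;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class SlashFieldSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().enableHiveSupport().getOrCreate();

        // "/" is accepted in Spark struct field names.
        StructType ann = new StructType()
            .add("Annotation_Impact", DataTypes.StringType)
            .add("cDNA_pos/cDNA_length", DataTypes.StringType);
        StructType schema = new StructType()
            .add("INFO_ANN", DataTypes.createArrayType(ann));

        Dataset<Row> df = spark.createDataFrame(Collections.<Row>emptyList(), schema);
        df.printSchema();  // the DataFrame side has no problem with such names

        // It is only when the schema has to be rendered as a Hive type string
        // (e.g. for the CREATE OR REPLACE VIEW in the report) that the
        // "Cannot recognize hive type string" error is raised.
    }
}
{code}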


> Unable to create Hive View from certain Spark Dataframe StructType
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.2, 3.3.4
> Environment: Tested in Spark 3.3.0, 3.3.2.
>Reporter: Yu-Ting LIN
>Priority: Blocker
>
> We are trying to create a Hive view using the following SQL command: "CREATE OR 
> REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as 
> "/". Here is the schema of this table.
> {code:java}
> contigName              string
> start                   bigint
> end                     bigint
> names                   array
> referenceAllele         string
> alternateAlleles        array
> qual                    double
> filters                 array
> splitFromMultiAllelic    boolean
> INFO_NCAMP              int
> INFO_ODDRATIO           double
> INFO_NM                 double
> INFO_DBSNP_CAF          array
> INFO_SPANPAIR           int
> INFO_TLAMP              int
> INFO_PSTD               double
> INFO_QSTD               double
> INFO_SBF                double
> INFO_AF                 array
> INFO_QUAL               double
> INFO_SHIFT3             int
> INFO_VARBIAS            string
> INFO_HICOV              int
> INFO_PMEAN              double
> INFO_MSI                double
> INFO_VD                 int
> INFO_DP                 int
> INFO_HICNT              int
> INFO_ADJAF              double
> INFO_SVLEN              int
> INFO_RSEQ               string
> INFO_MSigDb             array
> INFO_NMD                array
> INFO_ANN                
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS               string
> INFO_MQ                 double
> INFO_HIAF               double
> INFO_END                int
> INFO_SPLITREAD          int
> INFO_GDAMP              int
> INFO_LSEQ               string
> INFO_LOF                array
> INFO_SAMPLE             string
> INFO_AMPFLAG            int
> INFO_SN                 double
> INFO_SVTYPE             string
> INFO_TYPE               string
> INFO_MSILEN             double
> INFO_DUPRATE            double
> INFO_DBSNP_COMMON       int
> INFO_REFBIAS            string
> genotypes               
> array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
>  {code}
> You can see that the column INFO_ANN is an array of structs containing fields 
> whose names include "/", such as "cDNA_pos/cDNA_length". 
> We believe that this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT 
> INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN  o.a.spark.sql.catalyst.util.package - 
> Truncated the string representation of a plan since it was too large. This 
> behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>,
>  column: INFO_ANN
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLi

[jira] [Assigned] (SPARK-46960) Testing Multiple Input Streams for TransformWithState operator

2024-02-05 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-46960:


Assignee: Eric Marnadi

> Testing Multiple Input Streams for TransformWithState operator
> --
>
> Key: SPARK-46960
> URL: https://issues.apache.org/jira/browse/SPARK-46960
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Eric Marnadi
>Assignee: Eric Marnadi
>Priority: Major
>  Labels: pull-request-available
>
> Adding unit tests to ensure multiple input streams are supported for the 
> TransformWithState operator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46960) Testing Multiple Input Streams for TransformWithState operator

2024-02-05 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-46960.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45004
[https://github.com/apache/spark/pull/45004]

> Testing Multiple Input Streams for TransformWithState operator
> --
>
> Key: SPARK-46960
> URL: https://issues.apache.org/jira/browse/SPARK-46960
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Eric Marnadi
>Assignee: Eric Marnadi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Adding unit tests to ensure multiple input streams are supported for the 
> TransformWithState operator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46960) Testing Multiple Input Streams for TransformWithState operator

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46960:
---
Labels: pull-request-available  (was: )

> Testing Multiple Input Streams for TransformWithState operator
> --
>
> Key: SPARK-46960
> URL: https://issues.apache.org/jira/browse/SPARK-46960
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Eric Marnadi
>Priority: Major
>  Labels: pull-request-available
>
> Adding unit tests to ensure multiple input streams are supported for the 
> TransformWithState operator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45599:
---
Labels: correctness pull-request-available  (was: correctness)

> Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
> --
>
> Key: SPARK-45599
> URL: https://issues.apache.org/jira/browse/SPARK-45599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0
>Reporter: Robert Joseph Evans
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> I think this actually impacts all versions that have ever supported 
> percentile and it may impact other things because the bug is in OpenHashMap.
>  
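> As a small, hypothetical illustration (plain Java, independent of Spark) of why 
> a mix of -0.0 and 0.0 is hazardous for a hash map keyed on doubles: the two 
> values compare equal with ==, but their bit patterns (and hence typical hash 
> codes) differ.
> {code:java}
> public class SignedZero {
>     public static void main(String[] args) {
>         double pos = 0.0, neg = -0.0;
>         System.out.println(pos == neg);                       // true
>         System.out.println(Double.doubleToLongBits(pos));     // 0
>         System.out.println(Double.doubleToLongBits(neg));     // -9223372036854775808
>         System.out.println(Double.valueOf(pos).equals(neg));  // false
>         // A map that hashes one way and probes the other can split or merge
>         // the two zeros inconsistently -- the kind of inconsistency described
>         // above for OpenHashMap.
>     }
> }
> {code}
>  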
> I am really surprised that we caught this bug, because everything has to go 
> wrong in just the right way to make it happen. In python/pyspark, if you run
>  
> {code:python}
> from math import *
> from pyspark.sql.types import *
> data = [(1.779652973678931e+173,), (9.247723870123388e-295,), 
> (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), 
> (-3.085825028509117e+74,), (-1.9569489404314425e+128,), 
> (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), 
> (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), 
> (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563703283e-64,), (3.002803065141241e-139,), 
> (-1.1041009815645263e+203,), (1.8461539468514548e-225,), 
> (-5.620339412794757e-251,), (3.5103766991437114e-60,), 
> (2.4925669515657655e+165,), (3.217759099462207e+108,), 
> (-8.796717685143486e+203,), (2.037360925124577e+292,), 
> (-6.542279108216022e+206,), (-7.951172614280046e-74,), 
> (6.226527569272003e+152,), (-5.673977270111637e-84,), 
> (-1.0186016078084965e-281,), (1.7976931348623157e+308,), 
> (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), 
> (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), 
> (1.7976931348623157e+308,), (4.3214483342777574e-117,), 
> (-7.973642629411105e-89,), (-1.1028137694801181e-297,), 
> (2.9000325280299273e-39,), (-1.077534929323113e-264,), 
> (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), 
> (-1.831402251805194e+65,), (-2.664533698035492e+203,), 
> (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), 
> (-9.607772864590422e+217,), (3.437191836077251e+209,), 
> (1.9846569552093057e-137,), (-3.010452936419635e-233,), 
> (1.4309793775440402e-87,), (-2.9383643865423363e-103,), 
> (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), 
> (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), 
> (2.187766760184779e+306,), (7.679268835670585e+223,), 
> (6.3131466321042515e+153,), (1.779652973678931e+173,), 
> (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), 
> (1.9042708096454302e+195,), (-3.085825028509117e+74,), 
> (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), 
> (2.5212410617263588e-282,), (-2.646144697462316e-35,), 
> (-3.468683249247593e-196,), (nan,), (None,), (nan,), 
> (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), 
> (-5.682293414619055e+46,), (-4.585039307326895e+166,), 
> (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), 
> (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), 
> (-5.046677974902737e+132,), (-5.490780063080251e-09,), 
> (1.703824427218836e-55,), (-1.1961155424160076e+102,), 
> (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), 
> (5.120795466142678e-215,), (-9.01991342808203e+282,), 
> (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), 
> (3.4543959813437507e-304,), (-7.590734560275502e-63,), 
> (9.376528689861087e+117,), (-2.1696969883753554e-292,), 
> (7.227411393136537e+206,), (-2.428999624265911e-293,), 
> (5.741383583382542e-14,), (-1.4882040107841963e+286,), 
> (2.1973064836362255e-159,), (0.028096279323357867,), 
> (8.475809563
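(The reporter's dataset is truncated above. As an illustrative aside, the following is a hedged, hypothetical sketch rather than the reporter's exact script: it assumes a running SparkSession named `spark`, shows that -0.0 and 0.0 compare equal while carrying different bit patterns, and then runs percentile over a small column mixing the two zeros, which is the combination the report says can skew the result because the aggregation buckets values in OpenHashMap.)

{code:python}
# Hedged sketch, not the reporter's script; assumes a running SparkSession `spark`.
import struct

# -0.0 and 0.0 compare equal, yet their bit patterns differ, which is the kind
# of distinction a hash map keyed on raw double bits can mishandle.
print(-0.0 == 0.0)                    # True
print(struct.pack(">d", -0.0).hex())  # 8000000000000000
print(struct.pack(">d", 0.0).hex())   # 0000000000000000

from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([StructField("v", DoubleType(), True)])
df = spark.createDataFrame([(-0.0,), (0.0,), (-1.0,), (1.0,), (None,)], schema)

# Percentile over a column mixing -0.0 and 0.0; per the report, versions with
# the OpenHashMap bug can return a wrong answer for inputs like this.
df.selectExpr("percentile(v, array(0.25, 0.5, 0.75))").show(truncate=False)
{code}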

[jira] [Updated] (SPARK-46980) Avoid using internal APIs in dataframe end-to-end tests

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46980:
---
Labels: pull-request-available  (was: )

> Avoid using internal APIs in dataframe end-to-end tests
> ---
>
> Key: SPARK-46980
> URL: https://issues.apache.org/jira/browse/SPARK-46980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Mark Jarvin
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46980) Avoid using internal APIs in dataframe end-to-end tests

2024-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46980.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45034
[https://github.com/apache/spark/pull/45034]

> Avoid using internal APIs in dataframe end-to-end tests
> ---
>
> Key: SPARK-46980
> URL: https://issues.apache.org/jira/browse/SPARK-46980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Mark Jarvin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46980) Avoid using internal APIs in tests

2024-02-05 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-46980:
---

 Summary: Avoid using internal APIs in tests
 Key: SPARK-46980
 URL: https://issues.apache.org/jira/browse/SPARK-46980
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46980) Avoid using internal APIs in dataframe end-to-end tests

2024-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-46980:

Summary: Avoid using internal APIs in dataframe end-to-end tests  (was: 
Avoid using internal APIs in tests)

> Avoid using internal APIs in dataframe end-to-end tests
> ---
>
> Key: SPARK-46980
> URL: https://issues.apache.org/jira/browse/SPARK-46980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Mark Jarvin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39441) Speed up DeduplicateRelations

2024-02-05 Thread Mitesh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749321#comment-17749321
 ] 

Mitesh edited comment on SPARK-39441 at 2/5/24 11:28 PM:
-

After applying this fix to 3.3.2, I still see some slowness here with a very 
tall query tree: 
https://gist.github.com/MasterDDT/422f933d91f59becf3924f01d03d5456 (search for 
DeduplicateRelations). The same query plan works fine in 2.4.x

Is it safe to skip this analyzer rule? Or another way to speed it up? cc 
[~cloud_fan]


was (Author: masterddt):
After applying this fix to 3.3.2, I still see some slowness here with a very 
tall query tree: 
https://gist.github.com/MasterDDT/422f933d91f59becf3924f01d03d5456 (search for 
DeduplicateRelations)

Is it safe to skip this analyzer rule? Or another way to speed it up? cc 
[~cloud_fan]

> Speed up DeduplicateRelations
> -
>
> Key: SPARK-39441
> URL: https://issues.apache.org/jira/browse/SPARK-39441
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Speed up the Analyzer rule DeduplicateRelations



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46979) Add support for defining state encoder for key/value and col family independently

2024-02-05 Thread Anish Shrigondekar (Jira)
Anish Shrigondekar created SPARK-46979:
--

 Summary: Add support for defining state encoder for key/value and 
col family independently
 Key: SPARK-46979
 URL: https://issues.apache.org/jira/browse/SPARK-46979
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Anish Shrigondekar


Add support for defining state encoder for key/value and col family 
independently



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46977) A failed request to obtain a token from one NameNode should not block subsequent token requests

2024-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46977.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45030
[https://github.com/apache/spark/pull/45030]

> A failed request to obtain a token from one NameNode should not block 
> subsequent token requests
> ---
>
> Key: SPARK-46977
> URL: https://issues.apache.org/jira/browse/SPARK-46977
> Project: Spark
>  Issue Type: Improvement
>  Components: Security, Spark Core
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46977) A failed request to obtain a token from one NameNode should not block subsequent token requests

2024-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46977:
-

Assignee: Cheng Pan

> A failed request to obtain a token from one NameNode should not block 
> subsequent token requests
> ---
>
> Key: SPARK-46977
> URL: https://issues.apache.org/jira/browse/SPARK-46977
> Project: Spark
>  Issue Type: Improvement
>  Components: Security, Spark Core
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46972) Asymmetrical replacement for char/varchar in V2SessionCatalog.createTable

2024-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46972.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45019
[https://github.com/apache/spark/pull/45019]

> Asymmetrical replacement for char/varchar in V2SessionCatalog.createTable
> -
>
> Key: SPARK-46972
> URL: https://issues.apache.org/jira/browse/SPARK-46972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46978) Refine docstring of `sum_distinct/array_agg/count_if`

2024-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46978.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45031
[https://github.com/apache/spark/pull/45031]

> Refine docstring of `sum_distinct/array_agg/count_if`
> -
>
> Key: SPARK-46978
> URL: https://issues.apache.org/jira/browse/SPARK-46978
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46978) Refine docstring of `sum_distinct/array_agg/count_if`

2024-02-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46978:
-

Assignee: Yang Jie

> Refine docstring of `sum_distinct/array_agg/count_if`
> -
>
> Key: SPARK-46978
> URL: https://issues.apache.org/jira/browse/SPARK-46978
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39441) Speed up DeduplicateRelations

2024-02-05 Thread Mitesh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749321#comment-17749321
 ] 

Mitesh edited comment on SPARK-39441 at 2/5/24 7:01 PM:


After applying this fix to 3.3.2, I still see some slowness here with a very 
tall query tree: 
https://gist.github.com/MasterDDT/422f933d91f59becf3924f01d03d5456 (search for 
DeduplicateRelations)

Is it safe to skip this analyzer rule? Or another way to speed it up? cc 
[~cloud_fan]


was (Author: masterddt):
After applying this fix to 3.3.2, I still see some slowness here with a very 
large query tree: 
https://gist.github.com/MasterDDT/422f933d91f59becf3924f01d03d5456 (search for 
DeduplicateRelations)

Is it safe to skip this analyzer rule? Or another way to speed it up? cc 
[~cloud_fan]

> Speed up DeduplicateRelations
> -
>
> Key: SPARK-39441
> URL: https://issues.apache.org/jira/browse/SPARK-39441
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Speed up the Analyzer rule DeduplicateRelations



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39441) Speed up DeduplicateRelations

2024-02-05 Thread Mitesh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749321#comment-17749321
 ] 

Mitesh edited comment on SPARK-39441 at 2/5/24 7:00 PM:


After applying this fix to 3.3.2, I still see some slowness here with a very 
large query tree: 
https://gist.github.com/MasterDDT/422f933d91f59becf3924f01d03d5456 (search for 
DeduplicateRelations)

Is it safe to skip this analyzer rule? Or another way to speed it up? cc 
[~cloud_fan]


was (Author: masterddt):
After applying this fix to 3.3.2, I still see some slowness here with a very 
large query tree: 
https://gist.github.com/MasterDDT/422f933d91f59becf3924f01d03d5456 (search for 
DeduplicateRelations)

Is it safe to skip this analyzer rule? Or another way to speed it up?

> Speed up DeduplicateRelations
> -
>
> Key: SPARK-39441
> URL: https://issues.apache.org/jira/browse/SPARK-39441
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Speed up the Analyzer rule DeduplicateRelations



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-46032) connect: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f

2024-02-05 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-46032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814432#comment-17814432
 ] 

Gaétan CACACE edited comment on SPARK-46032 at 2/5/24 4:37 PM:
---

Hello there,

 

Just coming to give some more information.

With the spark-3.5.0-bin-hadoop3 version and Spark Connect (connected to a 
cluster), I encounter the same issue.

 

The problem appears when I try to use RDD functions like df.count() or 
df.collect(). The same happens with pandas_api() when I try to show the values 
of the DataFrame.

Note that I have no problem processing data; I just can't collect any values.

 

My Spark Connect setup is also quite simple:

 
{code:java}
/opt/spark/sbin/start-connect-server.sh --packages 
org.apache.spark:spark-connect_2.12:3.5.0 --master spark://spark-master:7077 
{code}
 

And the spark-defaults.conf
{code:java}
spark.executor.memory 12g
spark.executor.memoryOverhead 1g
spark.executor.cores 4
spark.executor.instances 1
spark.sql.execution.arrow.pyspark.enabled true
spark.driver.cores 2
spark.driver.memory 4g
spark.network.timeout 600s
spark.files.fetchTimeout 600s
spark.worker.cleanup.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.jars /opt/spark/jars/sqljdbc42.jar,/opt/spark/jars/pgjdbc_42.7.0.jar
 {code}
 

Do not hesitate to ask me if you want more information


was (Author: JIRAUSER304079):
Hello there,

 

Just coming to give some more information.

With spark-3.5.0-bin-hadoop3 version and spark connect (connected to a 
cluster). I encounter the same issue.

 

The problem appears when I try to use RDD function like df.count() or 
df.collect(). The same appears with pandas_api() when I try to show the values 
of the DataFrame.

 

Note that I have no problem for processing data, I just can't collect any 
values.

 

My spark connect is also quite simple:

 
{code:java}
/opt/spark/sbin/start-connect-server.sh --packages 
org.apache.spark:spark-connect_2.12:3.5.0 --master spark://spark-master:7077 
{code}
 

And the spark-defaults.conf
{code:java}
spark.executor.memory 12gspark.executor.memoryOverhead 1gspark.executor.cores 4
spark.executor.instances 1spark.sql.execution.arrow.pyspark.enabled 
truespark.driver.cores 2spark.driver.memory 4gspark.network.timeout 
600sspark.files.fetchTimeout 600s
spark.worker.cleanup.enabled truespark.serializer 
org.apache.spark.serializer.KryoSerializerspark.jars 
/opt/spark/jars/sqljdbc42.jar,/opt/spark/jars/pgjdbc_42.7.0.jar
 {code}
 

Do not hesitate to ask me if you want more information

> connect: cannot assign instance of java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f
> -
>
> Key: SPARK-46032
> URL: https://issues.apache.org/jira/browse/SPARK-46032
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Bobby Wang
>Priority: Major
>  Labels: pull-request-available
>
> I downloaded Spark 3.5 from the official Spark website, and then I started a 
> Spark Standalone cluster in which both the master and the only worker run on 
> the same node.
>  
> Then I started the connect server by 
> {code:java}
> start-connect-server.sh \
>     --master spark://10.19.183.93:7077 \
>     --packages org.apache.spark:spark-connect_2.12:3.5.0 \
>     --conf spark.executor.cores=12 \
>     --conf spark.task.cpus=1 \
>     --executor-memory 30G \
>     --conf spark.executor.resource.gpu.amount=1 \
>     --conf spark.task.resource.gpu.amount=0.08 \
>     --driver-memory 1G{code}
>  
> I can 100% confirm that the Spark standalone cluster, the connect server, and 
> the Spark driver are all started, as observed from the web UI.
>  
> Finally, I tried to run a very simple spark job 
> (spark.range(100).filter("id>2").collect()) from spark-connect-client using 
> pyspark, but I got the below error.
>  
> _pyspark --remote sc://localhost_
> _Python 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0] on linux_
> _Type "help", "copyright", "credits" or "license" for more information._
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
>       /_/
>  
> _Using Python version 3.10.0 (default, Mar  3 2022 09:58:08)_
> _Client connected to the Spark Connect server at localhost_
> _SparkSession available as 'spark'._
> _>>> spark.range(100).filter("id > 3").collect()_
> _Traceback (most recent call last):_
>   _File "<stdin>", line 1, in <module>_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/dataframe.py",
>  line 1645, in collect_
>     _table, 

[jira] [Commented] (SPARK-46032) connect: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f

2024-02-05 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-46032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814432#comment-17814432
 ] 

Gaétan CACACE commented on SPARK-46032:
---

Hello there,

 

Just coming to give some more information.

With the spark-3.5.0-bin-hadoop3 version and Spark Connect (connected to a 
cluster), I encounter the same issue.

 

The problem appears when I try to use RDD functions like df.count() or 
df.collect(). The same happens with pandas_api() when I try to show the values 
of the DataFrame.

Note that I have no problem processing data; I just can't collect any values.

 

My Spark Connect setup is also quite simple:

 
{code:java}
/opt/spark/sbin/start-connect-server.sh --packages 
org.apache.spark:spark-connect_2.12:3.5.0 --master spark://spark-master:7077 
{code}
 

And the spark-defaults.conf
{code:java}
spark.executor.memory 12g
spark.executor.memoryOverhead 1g
spark.executor.cores 4
spark.executor.instances 1
spark.sql.execution.arrow.pyspark.enabled true
spark.driver.cores 2
spark.driver.memory 4g
spark.network.timeout 600s
spark.files.fetchTimeout 600s
spark.worker.cleanup.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.jars /opt/spark/jars/sqljdbc42.jar,/opt/spark/jars/pgjdbc_42.7.0.jar
 {code}
 

Do not hesitate to ask me if you want more information
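
For readers who want to try the symptom described above, the following is a hedged, 
minimal client-side sketch, not the reporter's exact session: it assumes a Spark 
Connect server is already listening at sc://localhost (started with 
start-connect-server.sh as shown earlier) and issues the actions that the reports 
say fail, i.e. the ones that pull rows back to the client.

{code:python}
# Hedged sketch only; assumes a Spark Connect server is reachable at sc://localhost.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

df = spark.range(100).filter("id > 2")

# Building the plan works fine per the reports; the actions below, which fetch
# results back to the client, are what trigger the SerializedLambda error.
print(df.count())
print(df.collect()[:5])
{code}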

> connect: cannot assign instance of java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f
> -
>
> Key: SPARK-46032
> URL: https://issues.apache.org/jira/browse/SPARK-46032
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Bobby Wang
>Priority: Major
>  Labels: pull-request-available
>
> I downloaded Spark 3.5 from the official Spark website, and then I started a 
> Spark Standalone cluster in which both the master and the only worker run on 
> the same node.
>  
> Then I started the connect server by 
> {code:java}
> start-connect-server.sh \
>     --master spark://10.19.183.93:7077 \
>     --packages org.apache.spark:spark-connect_2.12:3.5.0 \
>     --conf spark.executor.cores=12 \
>     --conf spark.task.cpus=1 \
>     --executor-memory 30G \
>     --conf spark.executor.resource.gpu.amount=1 \
>     --conf spark.task.resource.gpu.amount=0.08 \
>     --driver-memory 1G{code}
>  
> I can 100% confirm that the Spark standalone cluster, the connect server, and 
> the Spark driver are all started, as observed from the web UI.
>  
> Finally, I tried to run a very simple spark job 
> (spark.range(100).filter("id>2").collect()) from spark-connect-client using 
> pyspark, but I got the below error.
>  
> _pyspark --remote sc://localhost_
> _Python 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0] on linux_
> _Type "help", "copyright", "credits" or "license" for more information._
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
>       /_/
>  
> _Using Python version 3.10.0 (default, Mar  3 2022 09:58:08)_
> _Client connected to the Spark Connect server at localhost_
> _SparkSession available as 'spark'._
> _>>> spark.range(100).filter("id > 3").collect()_
> _Traceback (most recent call last):_
>   _File "<stdin>", line 1, in <module>_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/dataframe.py",
>  line 1645, in collect_
>     _table, schema = self._session.client.to_table(query)_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py",
>  line 858, in to_table_
>     _table, schema, _, _, _ = self._execute_and_fetch(req)_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py",
>  line 1282, in _execute_and_fetch_
>     _for response in self._execute_and_fetch_as_iterator(req):_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py",
>  line 1263, in _execute_and_fetch_as_iterator_
>     _self._handle_error(error)_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py",
>  line 1502, in _handle_error_
>     _self._handle_rpc_error(error)_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py",
>  line 1538, in _handle_rpc_error_
>     _raise convert_exception(info, status.message) from None_
> _pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in 
> stage 0.0 f

[jira] [Assigned] (SPARK-46833) Using ICU library for collation tracking

2024-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46833:
---

Assignee: Aleksandar Tomic

> Using ICU library for collation tracking
> 
>
> Key: SPARK-46833
> URL: https://issues.apache.org/jira/browse/SPARK-46833
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46833) Using ICU library for collation tracking

2024-02-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46833.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44968
[https://github.com/apache/spark/pull/44968]

> Using ICU library for collation tracking
> 
>
> Key: SPARK-46833
> URL: https://issues.apache.org/jira/browse/SPARK-46833
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46810) Clarify error class terminology

2024-02-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814406#comment-17814406
 ] 

Nicholas Chammas commented on SPARK-46810:
--

[~cloud_fan], [~LuciferYang], [~beliefer], and [~dongjoon] - What are your 
thoughts on the 3 proposed options?

> Clarify error class terminology
> ---
>
> Key: SPARK-46810
> URL: https://issues.apache.org/jira/browse/SPARK-46810
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>
> We use inconsistent terminology when talking about error classes. I'd like to 
> get some clarity on that before contributing any potential improvements to 
> this part of the documentation.
> Consider 
> [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html].
>  It has several key pieces of hierarchical information that have inconsistent 
> names throughout our documentation and codebase:
>  * 42
>  ** K01
>  *** INCOMPLETE_TYPE_DEFINITION
>   ARRAY
>   MAP
>   STRUCT
> What are the names of these different levels of information?
> Some examples of inconsistent terminology:
>  * [Over 
> here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation]
>  we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION 
> we call that an "error class". So what exactly is a class, the 42 or the 
> INCOMPLETE_TYPE_DEFINITION?
>  * [Over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122]
>  we call K01 the "subclass". But [over 
> here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467]
>  we call the ARRAY, MAP, and STRUCT the subclasses. And on the main page for 
> INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". 
> So what exactly is a subclass?
>  * [On this 
> page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition]
>  we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other 
> places we refer to it as an "error class".
> I don't think we should leave this status quo as-is. I see a couple of ways 
> to fix this.
> h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition"
> One solution is to use the following terms:
>  * Error class: 42
>  * Error sub-class: K01
>  * Error state: 42K01
>  * Error condition: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-condition: ARRAY, MAP, STRUCT
> Pros: 
>  * This terminology seems (to me at least) the most natural and intuitive.
>  * It aligns most closely to the SQL standard.
> Cons:
>  * We use {{errorClass}} [all over our 
> codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30]
>  – literally in thousands of places – to refer to strings like 
> INCOMPLETE_TYPE_DEFINITION.
>  ** It's probably not practical to update all these usages to say 
> {{errorCondition}} instead, so if we go with this approach there will be a 
> divide between the terminology we use in user-facing documentation vs. what 
> the code base uses.
>  ** We can perhaps rename the existing {{error-classes.json}} to 
> {{error-conditions.json}} but clarify the reason for this divide between code 
> and user docs in the documentation for {{ErrorClassesJsonReader}} .
> h1. Option 2: 42 becomes an "Error Category"
> Another approach is to use the following terminology:
>  * Error category: 42
>  * Error sub-category: K01
>  * Error state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a "class" to a "category" is low impact and 
> may not show up in user-facing documentation at all. (See my side note below.)
> Cons:
>  * These terms do not align with the SQL standard.
>  * We will have to retire the term "error condition", which we have [already 
> used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md]
>  in user-facing documentation.
> h1. Option 3: "Error Class" and "State Class"
>  * SQL state class: 42
>  * SQL state sub-class: K01
>  * SQL state: 42K01
>  * Error class: INCOMPLETE_TYPE_DEFINITION
>  * Error sub-classes: ARRAY, MAP, STRUCT
> Pros:
>  * We continue to use "error class" as we do today in our code base.
>  * The change from calling "42" a "class" to

[jira] [Comment Edited] (SPARK-24815) Structured Streaming should support dynamic allocation

2024-02-05 Thread Krystal Mitchell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763940#comment-17763940
 ] 

Krystal Mitchell edited comment on SPARK-24815 at 2/5/24 3:32 PM:
--

Thank you [~pavan0831]. This draft PR will have a significant impact on some of 
the projects we are currently working on. Can't wait to see it over the line.


was (Author: JIRAUSER302183):
Thank you [~pavan0831]. This draft PR will impact some of the projects we are 
currently working on. 

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>  Labels: pull-request-available
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled
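
As an illustration of the description above, the snippet below is a hedged sketch 
only: the configuration keys are the standard batch dynamic-allocation settings and 
the values are placeholders, not recommendations. It shows how a structured 
streaming application started with these settings is currently governed by the 
batch scale-up/scale-down heuristics, since no streaming-specific algorithm exists 
yet.

{code:python}
# Hedged sketch: batch dynamic-allocation settings applied to a streaming app;
# values are placeholders, not recommendations.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("streaming-with-batch-dra")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "8")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)

# A streaming query from this session is still scaled by the batch heuristics:
# executors are added on task backlog and removed after idling.
stream = spark.readStream.format("rate").load()
query = stream.writeStream.format("console").start()
query.awaitTermination(30)  # run briefly for illustration
query.stop()
{code}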



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46978) Refine docstring of `sum_distinct/array_agg/count_if`

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46978:
---
Labels: pull-request-available  (was: )

> Refine docstring of `sum_distinct/array_agg/count_if`
> -
>
> Key: SPARK-46978
> URL: https://issues.apache.org/jira/browse/SPARK-46978
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46977) A failed request to obtain a token from one NameNode should not block subsequent token requests

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46977:
---
Labels: pull-request-available  (was: )

> A failed request to obtain a token from one NameNode should not block 
> subsequent token requests
> ---
>
> Key: SPARK-46977
> URL: https://issues.apache.org/jira/browse/SPARK-46977
> Project: Spark
>  Issue Type: Improvement
>  Components: Security, Spark Core
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46978) Refine docstring of `sum_distinct/array_agg/count_if`

2024-02-05 Thread Yang Jie (Jira)
Yang Jie created SPARK-46978:


 Summary: Refine docstring of `sum_distinct/array_agg/count_if`
 Key: SPARK-46978
 URL: https://issues.apache.org/jira/browse/SPARK-46978
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46977) A failed request to obtain a token from one NameNode should not block subsequent token requests

2024-02-05 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-46977:
-

 Summary: A failed request to obtain a token from one NameNode 
should not block subsequent token requests
 Key: SPARK-46977
 URL: https://issues.apache.org/jira/browse/SPARK-46977
 Project: Spark
  Issue Type: Improvement
  Components: Security, Spark Core
Affects Versions: 4.0.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46975) Move to_{hdf, feather, stata} to the fallback list

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46975:
--

Assignee: (was: Apache Spark)

> Move to_{hdf, feather, stata} to the fallback list
> --
>
> Key: SPARK-46975
> URL: https://issues.apache.org/jira/browse/SPARK-46975
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46975) Move to_{hdf, feather, stata} to the fallback list

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46975:
--

Assignee: Apache Spark

> Move to_{hdf, feather, stata} to the fallback list
> --
>
> Key: SPARK-46975
> URL: https://issues.apache.org/jira/browse/SPARK-46975
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46976) Implement `DataFrameGroupBy.corr`

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46976:
--

Assignee: (was: Apache Spark)

> Implement `DataFrameGroupBy.corr`
> -
>
> Key: SPARK-46976
> URL: https://issues.apache.org/jira/browse/SPARK-46976
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46976) Implement `DataFrameGroupBy.corr`

2024-02-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46976:
--

Assignee: Apache Spark

> Implement `DataFrameGroupBy.corr`
> -
>
> Key: SPARK-46976
> URL: https://issues.apache.org/jira/browse/SPARK-46976
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


