[jira] [Updated] (SPARK-46985) Move _NoValue from pyspark.* to pyspark.sql.*
[ https://issues.apache.org/jira/browse/SPARK-46985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46985: --- Labels: pull-request-available (was: ) > Move _NoValue from pyspark.* to pyspark.sql.* > - > > Key: SPARK-46985 > URL: https://issues.apache.org/jira/browse/SPARK-46985 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > _NoValue is only used in SQL and pandas API on Spark -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46679) Encoders with multiple inheritance - Key not found: T
[ https://issues.apache.org/jira/browse/SPARK-46679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andoni Teso updated SPARK-46679: Affects Version/s: 4.0.0
> Encoders with multiple inheritance - Key not found: T
> Key: SPARK-46679
> URL: https://issues.apache.org/jira/browse/SPARK-46679
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.2, 3.5.0, 4.0.0
> Reporter: Andoni Teso
> Priority: Blocker
> Attachments: spark_test.zip
>
> Since version 3.4, I've been experiencing the following error when using encoders.
> {code:java}
> Exception in thread "main" java.util.NoSuchElementException: key not found: T
>     at scala.collection.immutable.Map$Map1.apply(Map.scala:163)
>     at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:121)
>     at org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
>     at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>     at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
>     at org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
>     at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>     at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
>     at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60)
>     at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53)
>     at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62)
>     at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179)
>     at org.apache.spark.sql.Encoders.bean(Encoders.scala)
>     at org.example.Main.main(Main.java:26)
> {code}
> I'm attaching the code I use to reproduce the error locally: [^spark_test.zip]
> The issue is in the JavaTypeInference class when it tries to find the encoder for a ParameterizedType with the value Team. When running JavaTypeUtils.getTypeArguments(pt).asScala.toMap, it returns the type T again, but this time as a Company object, and pt.getRawType as Team. This ends up generating a tuple of (Team, Company) in the typeVariables map, leading to errors when looking up TypeVariables.
> In my case, I've resolved this by doing the following:
> {code:java}
> case tv: TypeVariable[_] =>
>   encoderFor(typeVariables.head._2, seenTypeSet, typeVariables)
> case pt: ParameterizedType =>
>   encoderFor(pt.getRawType, seenTypeSet, typeVariables)
> {code}
> I haven't submitted a pull request because this doesn't seem to be the optimal solution, and it might break other parts of the code; additional validations or conditions may need to be added.
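For context, the type-variable binding described in the report above can be observed with plain JDK reflection. This is only an illustrative sketch: the Company/Base/Team classes below are hypothetical stand-ins for the hierarchy in the attached reproducer (spark_test.zip), not code from the ticket.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;

// Hypothetical minimal shape of the failing hierarchy: a generic base bean
// and a subclass that fixes the type parameter.
class Company {}

class Base<T> {
    private T member;
    public T getMember() { return member; }
    public void setMember(T member) { this.member = member; }
}

class Team extends Base<Company> {}

public class TypeVarDemo {
    public static void main(String[] args) {
        // Team's generic superclass is the ParameterizedType Base<Company>.
        ParameterizedType pt = (ParameterizedType) Team.class.getGenericSuperclass();
        // The raw type (Base) declares the type variable T...
        TypeVariable<?> tv = ((Class<?>) pt.getRawType()).getTypeParameters()[0];
        // ...and the parameterized type binds it to Company.
        Type actual = pt.getActualTypeArguments()[0];
        // JavaTypeInference has to carry this binding (T -> Company) while it
        // recurses into Base's bean properties; losing or mis-keying it in the
        // typeVariables map is what surfaces as "key not found: T".
        System.out.println(tv.getName() + " -> " + actual.getTypeName());
    }
}
```

Saved as TypeVarDemo.java, this runs as a single file (java TypeVarDemo.java) and prints the T-to-Company binding that the encoder must thread through its recursion.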
[jira] [Updated] (SPARK-46679) Encoders with multiple inheritance - Key not found: T
[ https://issues.apache.org/jira/browse/SPARK-46679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andoni Teso updated SPARK-46679: Priority: Critical (was: Blocker)
[jira] [Updated] (SPARK-46984) Remove pyspark.copy_func
[ https://issues.apache.org/jira/browse/SPARK-46984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46984: - Priority: Minor (was: Major) > Remove pyspark.copy_func > > > Key: SPARK-46984 > URL: https://issues.apache.org/jira/browse/SPARK-46984 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 4.0.0 > Reporter: Hyukjin Kwon > Priority: Minor > Labels: pull-request-available
[jira] [Created] (SPARK-46985) Move _NoValue from pyspark.* to pyspark.sql.*
Hyukjin Kwon created SPARK-46985: Summary: Move _NoValue from pyspark.* to pyspark.sql.* Key: SPARK-46985 URL: https://issues.apache.org/jira/browse/SPARK-46985 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon _NoValue is only used in SQL and pandas API on Spark.
[jira] [Updated] (SPARK-46984) Remove pyspark.copy_func
[ https://issues.apache.org/jira/browse/SPARK-46984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46984: --- Labels: pull-request-available (was: )
[jira] [Created] (SPARK-46984) Remove pyspark.copy_func
Hyukjin Kwon created SPARK-46984: Summary: Remove pyspark.copy_func Key: SPARK-46984 URL: https://issues.apache.org/jira/browse/SPARK-46984 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-46983) Decouple module dependencies between PySpark modules
Hyukjin Kwon created SPARK-46983: Summary: Decouple module dependencies between PySpark modules Key: SPARK-46983 URL: https://issues.apache.org/jira/browse/SPARK-46983 Project: Spark Issue Type: Umbrella Components: PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon We have unnecessary dependencies between the PySpark modules. We should remove them so that each package can be self-contained.
[jira] [Updated] (SPARK-46170) Support inject adaptive query post planner strategy rules in SparkSessionExtensions
[ https://issues.apache.org/jira/browse/SPARK-46170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-46170: - Fix Version/s: 3.5.1 > Support inject adaptive query post planner strategy rules in SparkSessionExtensions > Key: SPARK-46170 > URL: https://issues.apache.org/jira/browse/SPARK-46170 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 4.0.0 > Reporter: XiDuo You > Assignee: XiDuo You > Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.1
[jira] [Updated] (SPARK-46982) Remove _LEGACY_ERROR_TEMP_2187 in favor of CANNOT_RECOGNIZE_HIVE_TYPE
[ https://issues.apache.org/jira/browse/SPARK-46982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46982: --- Labels: pull-request-available (was: ) > Remove _LEGACY_ERROR_TEMP_2187 in favor of CANNOT_RECOGNIZE_HIVE_TYPE > Key: SPARK-46982 > URL: https://issues.apache.org/jira/browse/SPARK-46982 > Project: Spark > Issue Type: Test > Components: SQL > Affects Versions: 4.0.0 > Reporter: Kent Yao > Priority: Major > Labels: pull-request-available
[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables
[ https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noritaka Sekiyama updated SPARK-46981: -- Description: We have observed that Driver OOM happens in the query planning phase with empty tables when we run specific patterns of queries.
h2. Issue details
If we run the query with the where condition {{pt>='20231004' and pt<='20231004'}}, the query fails in the planning phase due to Driver OOM, more specifically {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}. If we change the where condition from {{pt>='20231004' and pt<='20231004'}} to {{pt='20231004' or pt='20231005'}}, the SQL runs without any error.
This issue happened even with empty tables, and it happened before any actual data load. This seems like an issue on the Catalyst side.
h2. Reproduction steps
Attaching a script and a query to reproduce the issue.
* create_sanitized_tables.py: Script to create the table definitions
** No need to place any data files, as this happens with an empty location.
* test_and_twodays_simplified.sql: Query to reproduce the issue
Here's the typical stack trace:
{code:java}
at scala.collection.immutable.Vector.iterator(Vector.scala:100)
at scala.collection.immutable.Vector.iterator(Vector.scala:69)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219)
at scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211)
at scala.collection.AbstractTraversable.transpose(Traversable.scala:108)
at org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461)
at org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown Source)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
at org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown Source)
at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)
at scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)
at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431)
GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.Vector.iterator(Vector.scala:100)
{code}
[jira] [Created] (SPARK-46982) Remove _LEGACY_ERROR_TEMP_2187 in favor of CANNOT_RECOGNIZE_HIVE_TYPE
Kent Yao created SPARK-46982: Summary: Remove _LEGACY_ERROR_TEMP_2187 in favor of CANNOT_RECOGNIZE_HIVE_TYPE Key: SPARK-46982 URL: https://issues.apache.org/jira/browse/SPARK-46982 Project: Spark Issue Type: Test Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables
[ https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noritaka Sekiyama updated SPARK-46981: -- Description: We have observed that Driver OOM happens in the query planning phase with empty tables when we run specific patterns of queries.
[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables
[ https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noritaka Sekiyama updated SPARK-46981: -- Attachment: test_and_twodays_simplified.sql
[jira] [Created] (SPARK-46981) Driver OOM happens in query planning phase with empty tables
Noritaka Sekiyama created SPARK-46981: - Summary: Driver OOM happens in query planning phase with empty tables Key: SPARK-46981 URL: https://issues.apache.org/jira/browse/SPARK-46981 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Environment: * OSS Spark 3.5.0 * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0) * AWS Glue Spark 3.3.0 (Glue version 4.0) Reporter: Noritaka Sekiyama Attachments: create_sanitized_tables.py We have observed that Driver OOM happens in the query planning phase with empty tables when we run specific patterns of queries.
[jira] [Updated] (SPARK-46981) Driver OOM happens in query planning phase with empty tables
[ https://issues.apache.org/jira/browse/SPARK-46981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Noritaka Sekiyama updated SPARK-46981: -- Attachment: create_sanitized_tables.py > Driver OOM happens in query planning phase with empty tables > > > Key: SPARK-46981 > URL: https://issues.apache.org/jira/browse/SPARK-46981 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 > Environment: * OSS Spark 3.5.0 > * Amazon EMR Spark 3.3.0 (EMR release label 6.9.0) > * AWS Glue Spark 3.3.0 (Glue version 4.0) >Reporter: Noritaka Sekiyama >Priority: Major > Attachments: create_sanitized_tables.py > > > We have observed that a driver OOM happens in the query planning phase with > empty tables when we run specific patterns of queries. > h2. Issue details > If we run the query with the where condition {{pt>='20231004' and pt<='20231004'}}, > the query fails in the planning phase due to a driver OOM, more specifically > {{java.lang.OutOfMemoryError: GC overhead limit exceeded}}. > If we change the where condition from {{pt>='20231004' and pt<='20231004'}} > to {{pt='20231004' or pt='20231005'}}, the SQL runs without any error. > > This issue happens even with empty tables, and it happens before any actual data > is loaded. This appears to be an issue on the Catalyst side. > h2. Reproduction steps > Attaching a script and query to reproduce the issue. 
> * create_sanitized_tables.py: Script to create table definitions > * test_and_twodays_simplified.sql: Query to reproduce the issue > Here's the typical stacktrace: > {{ at scala.collection.immutable.Vector.iterator(Vector.scala:100) > at scala.collection.immutable.Vector.iterator(Vector.scala:69) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > scala.collection.generic.GenericTraversableTemplate.transpose(GenericTraversableTemplate.scala:219) > at > scala.collection.generic.GenericTraversableTemplate.transpose$(GenericTraversableTemplate.scala:211) > at scala.collection.AbstractTraversable.transpose(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:461) > at > org.apache.spark.sql.catalyst.plans.logical.Window.output(basicLogicalOperators.scala:1205) > at > org.apache.spark.sql.catalyst.planning.PhysicalOperation$.$anonfun$unapply$2(patterns.scala:119) > at > org.apache.spark.sql.catalyst.planning.PhysicalOperation$$$Lambda$1874/539825188.apply(Unknown > Source) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.catalyst.planning.PhysicalOperation$.unapply(patterns.scala:119) > at > org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:307) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2114/1104718965.apply(Unknown > Source) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93) > at > 
org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:70) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$Lambda$2117/2079515765.apply(Unknown > Source) > at > scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196) > at > scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199) > at scala.collection.TraversableOnce.foldLeft$(TraversableOnce.scala:192) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1431) > GC overhead limit exceeded > java.lang.OutOfMemoryError: GC overhead limit exceeded > at scala.collection.immutable.Vector.iterator(Vector.scala:100) > at scala.collection.immutable.Vector.iterator(Vector.scala:69) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.sc
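The report above notes that rewriting the range predicate {{pt>='20231004' and pt<='20231004'}} as an explicit OR of partition equalities avoids the planning-phase OOM. As a sketch of that workaround (the helper name and the yyyyMMdd partition format are assumptions from the reported query, not anything shipped with Spark), one can enumerate the partition values and build the OR predicate programmatically:

```python
from datetime import date, timedelta

def expand_partition_range(start: str, end: str, col: str = "pt") -> str:
    """Hypothetical workaround helper: rewrite a col>='start' and col<='end'
    range over yyyyMMdd partition strings into an explicit OR of equalities,
    which the report says planned without error."""
    d0 = date(int(start[:4]), int(start[4:6]), int(start[6:8]))
    d1 = date(int(end[:4]), int(end[4:6]), int(end[6:8]))
    days = (d1 - d0).days
    values = [(d0 + timedelta(days=i)).strftime("%Y%m%d") for i in range(days + 1)]
    return " or ".join(f"{col}='{v}'" for v in values)

print(expand_partition_range("20231004", "20231005"))
# pt='20231004' or pt='20231005'
```

The generated string can be spliced into the WHERE clause in place of the range condition; this only sidesteps the symptom, while the underlying blowup in Catalyst planning remains the bug being reported.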
[jira] [Resolved] (SPARK-46958) missing timezone to coerce default values
[ https://issues.apache.org/jira/browse/SPARK-46958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-46958. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45000 [https://github.com/apache/spark/pull/45000] > missing timezone to coerce default values > - > > Key: SPARK-46958 > URL: https://issues.apache.org/jira/browse/SPARK-46958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0, 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > ``` > create table src(key int, c string DEFAULT date'2018-11-17') using parquet; > Time taken: 0.133 seconds > spark-sql (default)> desc src; > [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. > You hit a bug in Spark or the Spark plugins you use. Please, report this bug > to the corresponding communities or vendors, and provide the full stack trace. > org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase > analysis failed with an internal error. You hit a bug in Spark or the Spark > plugins you use. Please, report this bug to the corresponding communities or > vendors, and provide the full stack trace. > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType
[ https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46934: --- Labels: pull-request-available (was: ) > Unable to create Hive View from certain Spark Dataframe StructType > -- > > Key: SPARK-46934 > URL: https://issues.apache.org/jira/browse/SPARK-46934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.2, 3.3.4 > Environment: Tested in Spark 3.3.0, 3.3.2. >Reporter: Yu-Ting LIN >Priority: Blocker > Labels: pull-request-available > > We are trying to create a Hive View using following SQL command "CREATE OR > REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810". > Our table_2611810 has certain columns contain special characters such as "/". > Here is the schema of this table. > {code:java} > contigName string > start bigint > end bigint > names array > referenceAllele string > alternateAlleles array > qual double > filters array > splitFromMultiAllelic boolean > INFO_NCAMP int > INFO_ODDRATIO double > INFO_NM double > INFO_DBSNP_CAF array > INFO_SPANPAIR int > INFO_TLAMP int > INFO_PSTD double > INFO_QSTD double > INFO_SBF double > INFO_AF array > INFO_QUAL double > INFO_SHIFT3 int > INFO_VARBIAS string > INFO_HICOV int > INFO_PMEAN double > INFO_MSI double > INFO_VD int > INFO_DP int > INFO_HICNT int > INFO_ADJAF double > INFO_SVLEN int > INFO_RSEQ string > INFO_MSigDb array > INFO_NMD array > INFO_ANN > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>> > INFO_BIAS string > INFO_MQ double > INFO_HIAF double > INFO_END int > INFO_SPLITREAD int > INFO_GDAMP int > INFO_LSEQ string > INFO_LOF array > INFO_SAMPLE string > INFO_AMPFLAG int > INFO_SN double > INFO_SVTYPE string > INFO_TYPE string > 
INFO_MSILEN double > INFO_DUPRATE double > INFO_DBSNP_COMMON int > INFO_REFBIAS string > genotypes > array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>> > {code} > You can see that column INFO_ANN is an array of struct and it contains column > which has "/" inside such as "cDNA_pos/cDNA_length", etc. > We believe that it is the root cause that cause the following SparkException: > {code:java} > scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT > INFO_ANN FROM table_2611810") > 24/01/31 07:50:02.658 [main] WARN o.a.spark.sql.catalyst.util.package - > Truncated the string representation of a plan since it was too large. This > behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. > org.apache.spark.SparkException: Cannot recognize hive type string: > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>, > column: INFO_ANN > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType(HiveClientImpl.scala:1037) > at > org.apache.spark.sql.hive.client
[jira] [Updated] (SPARK-46979) Add support for defining state encoder for key/value and col family independently
[ https://issues.apache.org/jira/browse/SPARK-46979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46979: --- Labels: pull-request-available (was: ) > Add support for defining state encoder for key/value and col family > independently > - > > Key: SPARK-46979 > URL: https://issues.apache.org/jira/browse/SPARK-46979 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Anish Shrigondekar >Priority: Major > Labels: pull-request-available > > Add support for defining state encoder for key/value and col family > independently -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType
[ https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814588#comment-17814588 ] Kent Yao commented on SPARK-46934: -- Hi [~yutinglin], How can I create an element named `AA_pos/AA_length` with Hive DDLs? I tried to use Hive 2.3.9 in Spark, but it failed. {code:java} FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct' but '/' is found. {code} > Unable to create Hive View from certain Spark Dataframe StructType > -- > > Key: SPARK-46934 > URL: https://issues.apache.org/jira/browse/SPARK-46934 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.2, 3.3.4 > Environment: Tested in Spark 3.3.0, 3.3.2. >Reporter: Yu-Ting LIN >Priority: Blocker > > We are trying to create a Hive View using following SQL command "CREATE OR > REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810". > Our table_2611810 has certain columns contain special characters such as "/". > Here is the schema of this table. 
> {code:java} > contigName string > start bigint > end bigint > names array > referenceAllele string > alternateAlleles array > qual double > filters array > splitFromMultiAllelic boolean > INFO_NCAMP int > INFO_ODDRATIO double > INFO_NM double > INFO_DBSNP_CAF array > INFO_SPANPAIR int > INFO_TLAMP int > INFO_PSTD double > INFO_QSTD double > INFO_SBF double > INFO_AF array > INFO_QUAL double > INFO_SHIFT3 int > INFO_VARBIAS string > INFO_HICOV int > INFO_PMEAN double > INFO_MSI double > INFO_VD int > INFO_DP int > INFO_HICNT int > INFO_ADJAF double > INFO_SVLEN int > INFO_RSEQ string > INFO_MSigDb array > INFO_NMD array > INFO_ANN > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>> > INFO_BIAS string > INFO_MQ double > INFO_HIAF double > INFO_END int > INFO_SPLITREAD int > INFO_GDAMP int > INFO_LSEQ string > INFO_LOF array > INFO_SAMPLE string > INFO_AMPFLAG int > INFO_SN double > INFO_SVTYPE string > INFO_TYPE string > INFO_MSILEN double > INFO_DUPRATE double > INFO_DBSNP_COMMON int > INFO_REFBIAS string > genotypes > array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>> > {code} > You can see that column INFO_ANN is an array of struct and it contains column > which has "/" inside such as "cDNA_pos/cDNA_length", etc. > We believe that it is the root cause that cause the following SparkException: > {code:java} > scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT > INFO_ANN FROM table_2611810") > 24/01/31 07:50:02.658 [main] WARN o.a.spark.sql.catalyst.util.package - > Truncated the string representation of a plan since it was too large. This > behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'. 
> org.apache.spark.SparkException: Cannot recognize hive type string: > array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>, > column: INFO_ANN > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLi
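Until the '/' characters are accepted (the comment above shows Hive's own type parser rejects them too), a pragmatic workaround is to sanitize the offending field names before handing the schema to Hive. The helper below is a hypothetical sketch, not part of Spark or Hive; it simply maps any character outside Hive's usual identifier set to an underscore:

```python
import re

def sanitize_field_name(name: str) -> str:
    """Hypothetical workaround: replace characters that Hive's type-string
    parser rejects (such as '/') with underscores, e.g.
    cDNA_pos/cDNA_length -> cDNA_pos_cDNA_length."""
    return re.sub(r"[^0-9A-Za-z_]", "_", name)

print(sanitize_field_name("cDNA_pos/cDNA_length"))  # cDNA_pos_cDNA_length
print(sanitize_field_name("ERRORS/WARNINGS/INFO"))  # ERRORS_WARNINGS_INFO
```

For a nested struct like INFO_ANN, the renaming would have to be applied recursively to every field of the struct before the view is created; top-level columns alone are not enough, since the failing names sit inside the array-of-struct element type.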
[jira] [Assigned] (SPARK-46960) Testing Multiple Input Streams for TransformWithState operator
[ https://issues.apache.org/jira/browse/SPARK-46960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-46960: Assignee: Eric Marnadi > Testing Multiple Input Streams for TransformWithState operator > -- > > Key: SPARK-46960 > URL: https://issues.apache.org/jira/browse/SPARK-46960 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Eric Marnadi >Assignee: Eric Marnadi >Priority: Major > Labels: pull-request-available > > Adding unit tests to ensure multiple input streams are supported for the > TransformWithState operator. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46960) Testing Multiple Input Streams for TransformWithState operator
[ https://issues.apache.org/jira/browse/SPARK-46960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-46960. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45004 [https://github.com/apache/spark/pull/45004] > Testing Multiple Input Streams for TransformWithState operator > -- > > Key: SPARK-46960 > URL: https://issues.apache.org/jira/browse/SPARK-46960 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Eric Marnadi >Assignee: Eric Marnadi >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Adding unit tests to ensure multiple input streams are supported for the > TransformWithState operator. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46960) Testing Multiple Input Streams for TransformWithState operator
[ https://issues.apache.org/jira/browse/SPARK-46960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46960: --- Labels: pull-request-available (was: ) > Testing Multiple Input Streams for TransformWithState operator > -- > > Key: SPARK-46960 > URL: https://issues.apache.org/jira/browse/SPARK-46960 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Eric Marnadi >Priority: Major > Labels: pull-request-available > > Adding unit tests to ensure multiple input streams are supported for the > TransformWithState operator. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45599) Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset
[ https://issues.apache.org/jira/browse/SPARK-45599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45599: --- Labels: correctness pull-request-available (was: correctness) > Percentile can produce a wrong answer if -0.0 and 0.0 are mixed in the dataset > -- > > Key: SPARK-45599 > URL: https://issues.apache.org/jira/browse/SPARK-45599 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.6.3, 3.3.0, 3.2.3, 3.5.0 >Reporter: Robert Joseph Evans >Priority: Critical > Labels: correctness, pull-request-available > > I think this actually impacts all versions that have ever supported > percentile and it may impact other things because the bug is in OpenHashMap. > > I am really surprised that we caught this bug because everything has to hit > just wrong to make it happen. in python/pyspark if you run > > {code:python} > from math import * > from pyspark.sql.types import * > data = [(1.779652973678931e+173,), (9.247723870123388e-295,), > (5.891823952773268e+98,), (inf,), (1.9042708096454302e+195,), > (-3.085825028509117e+74,), (-1.9569489404314425e+128,), > (2.0738138203216883e+201,), (inf,), (2.5212410617263588e-282,), > (-2.646144697462316e-35,), (-3.468683249247593e-196,), (nan,), (None,), > (nan,), (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), (-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > 
(7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), (-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563703283e-64,), (3.002803065141241e-139,), > (-1.1041009815645263e+203,), (1.8461539468514548e-225,), > (-5.620339412794757e-251,), (3.5103766991437114e-60,), > (2.4925669515657655e+165,), (3.217759099462207e+108,), > (-8.796717685143486e+203,), (2.037360925124577e+292,), > (-6.542279108216022e+206,), (-7.951172614280046e-74,), > (6.226527569272003e+152,), (-5.673977270111637e-84,), > (-1.0186016078084965e-281,), (1.7976931348623157e+308,), > (4.205809391029644e+137,), (-9.871721037428167e+119,), (None,), > (-1.6663254121185628e-256,), (1.0075153091760986e-236,), (-0.0,), (0.0,), > (1.7976931348623157e+308,), (4.3214483342777574e-117,), > (-7.973642629411105e-89,), (-1.1028137694801181e-297,), > (2.9000325280299273e-39,), (-1.077534929323113e-264,), > (-1.1847952892216515e+137,), (nan,), (7.849390806334983e+226,), > (-1.831402251805194e+65,), (-2.664533698035492e+203,), > (-2.2385155698231885e+285,), (-2.3016388448634844e-155,), > (-9.607772864590422e+217,), (3.437191836077251e+209,), > (1.9846569552093057e-137,), (-3.010452936419635e-233,), > (1.4309793775440402e-87,), (-2.9383643865423363e-103,), > (-4.696878567317712e-162,), (8.391630779050713e-135,), (nan,), > (-3.3885098786542755e-128,), (-4.5154178008513483e-122,), (nan,), (nan,), > (2.187766760184779e+306,), (7.679268835670585e+223,), > (6.3131466321042515e+153,), (1.779652973678931e+173,), > (9.247723870123388e-295,), (5.891823952773268e+98,), (inf,), > (1.9042708096454302e+195,), (-3.085825028509117e+74,), > (-1.9569489404314425e+128,), (2.0738138203216883e+201,), (inf,), > (2.5212410617263588e-282,), (-2.646144697462316e-35,), > (-3.468683249247593e-196,), (nan,), (None,), (nan,), > (1.822129180806602e-245,), (5.211702553315461e-259,), (-1.0,), > (-5.682293414619055e+46,), (-4.585039307326895e+166,), > (-5.936844510098297e-82,), 
(-5234708055733.116,), (4920675036.053339,), > (None,), (4.4501477170144023e-308,), (2.176024662699802e-210,), > (-5.046677974902737e+132,), (-5.490780063080251e-09,), > (1.703824427218836e-55,), (-1.1961155424160076e+102,), > (1.4403274475565667e+41,), (None,), (5.4470705929955455e-86,), > (5.120795466142678e-215,), (-9.01991342808203e+282,), > (4.051866849943636e-254,), (-3588518231990.927,), (-1.8891559842111865e+63,), > (3.4543959813437507e-304,), (-7.590734560275502e-63,), > (9.376528689861087e+117,), (-2.1696969883753554e-292,), > (7.227411393136537e+206,), (-2.428999624265911e-293,), > (5.741383583382542e-14,), (-1.4882040107841963e+286,), > (2.1973064836362255e-159,), (0.028096279323357867,), > (8.475809563
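The failure mode described above can be sketched without Spark. IEEE-754 gives -0.0 and 0.0 different bit patterns even though they compare equal, so a hash map that keys on the raw primitive bits (assumed here as a stand-in for the OpenHashMap behavior the reporter blames; the real Spark internals may differ) splits one logical value into two buckets, skewing the per-value counts that percentile relies on:

```python
import struct

def bits(x: float) -> int:
    # Raw IEEE-754 bit pattern of a double; -0.0 and 0.0 differ here
    # even though -0.0 == 0.0 is True.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

data = [0.0, -0.0, -0.0, 1.0]

# Buggy accounting: keying on the bit pattern splits 0.0 and -0.0,
# so the value counts feeding percentile are wrong.
buggy = {}
for x in data:
    buggy[bits(x)] = buggy.get(bits(x), 0) + 1

# Fixed accounting: normalize -0.0 to +0.0 before counting
# (adding +0.0 does this under IEEE-754 rules).
fixed = {}
for x in data:
    k = x + 0.0
    fixed[k] = fixed.get(k, 0) + 1

print(len(buggy))  # 3 distinct keys: 0.0 and -0.0 counted separately
print(len(fixed))  # 2 distinct keys: the zeros are merged
```

With the buggy accounting, a percentile over `data` can land on the wrong value because the three zeros are counted as two groups of 1 and 2 instead of one group of 3; normalizing the zero sign before hashing restores the correct counts.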
[jira] [Updated] (SPARK-46980) Avoid using internal APIs in dataframe end-to-end tests
[ https://issues.apache.org/jira/browse/SPARK-46980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46980: --- Labels: pull-request-available (was: ) > Avoid using internal APIs in dataframe end-to-end tests > --- > > Key: SPARK-46980 > URL: https://issues.apache.org/jira/browse/SPARK-46980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Mark Jarvin >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46980) Avoid using internal APIs in dataframe end-to-end tests
[ https://issues.apache.org/jira/browse/SPARK-46980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46980. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45034 [https://github.com/apache/spark/pull/45034] > Avoid using internal APIs in dataframe end-to-end tests > --- > > Key: SPARK-46980 > URL: https://issues.apache.org/jira/browse/SPARK-46980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Mark Jarvin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46980) Avoid using internal APIs in tests
Wenchen Fan created SPARK-46980: --- Summary: Avoid using internal APIs in tests Key: SPARK-46980 URL: https://issues.apache.org/jira/browse/SPARK-46980 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46980) Avoid using internal APIs in dataframe end-to-end tests
[ https://issues.apache.org/jira/browse/SPARK-46980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-46980: Summary: Avoid using internal APIs in dataframe end-to-end tests (was: Avoid using internal APIs in tests) > Avoid using internal APIs in dataframe end-to-end tests > --- > > Key: SPARK-46980 > URL: https://issues.apache.org/jira/browse/SPARK-46980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Assignee: Mark Jarvin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-39441) Speed up DeduplicateRelations
[ https://issues.apache.org/jira/browse/SPARK-39441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749321#comment-17749321 ] Mitesh edited comment on SPARK-39441 at 2/5/24 11:28 PM: - After applying this fix to 3.3.2, I still see some slowness here with a very tall query tree: https://gist.github.com/MasterDDT/422f933d91f59becf3924f01d03d5456 (search for DeduplicateRelations). The same query plan works fine in 2.4.x Is it safe to skip this analyzer rule? Or another way to speed it up? cc [~cloud_fan] was (Author: masterddt): After applying this fix to 3.3.2, I still see some slowness here with a very tall query tree: https://gist.github.com/MasterDDT/422f933d91f59becf3924f01d03d5456 (search for DeduplicateRelations) Is it safe to skip this analyzer rule? Or another way to speed it up? cc [~cloud_fan] > Speed up DeduplicateRelations > - > > Key: SPARK-39441 > URL: https://issues.apache.org/jira/browse/SPARK-39441 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.4.0 > > > Speed up the Analyzer rule DeduplicateRelations -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46979) Add support for defining state encoder for key/value and col family independently
Anish Shrigondekar created SPARK-46979: -- Summary: Add support for defining state encoder for key/value and col family independently Key: SPARK-46979 URL: https://issues.apache.org/jira/browse/SPARK-46979 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Anish Shrigondekar Add support for defining state encoder for key/value and col family independently -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46977) A failed request to obtain a token from one NameNode should not block subsequent token requests
[ https://issues.apache.org/jira/browse/SPARK-46977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46977. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45030 [https://github.com/apache/spark/pull/45030] > A failed request to obtain a token from one NameNode should not block > subsequent token requests > --- > > Key: SPARK-46977 > URL: https://issues.apache.org/jira/browse/SPARK-46977 > Project: Spark > Issue Type: Improvement > Components: Security, Spark Core >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46977) A failed request to obtain a token from one NameNode should not block subsequent token requests
[ https://issues.apache.org/jira/browse/SPARK-46977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46977: - Assignee: Cheng Pan > A failed request to obtain a token from one NameNode should not block > subsequent token requests > --- > > Key: SPARK-46977 > URL: https://issues.apache.org/jira/browse/SPARK-46977 > Project: Spark > Issue Type: Improvement > Components: Security, Spark Core >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46972) Asymmetrical replacement for char/varchar in V2SessionCatalog.createTable
[ https://issues.apache.org/jira/browse/SPARK-46972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46972. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45019 [https://github.com/apache/spark/pull/45019] > Asymmetrical replacement for char/varchar in V2SessionCatalog.createTable > - > > Key: SPARK-46972 > URL: https://issues.apache.org/jira/browse/SPARK-46972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46978) Refine docstring of `sum_distinct/array_agg/count_if`
[ https://issues.apache.org/jira/browse/SPARK-46978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46978. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45031 [https://github.com/apache/spark/pull/45031] > Refine docstring of `sum_distinct/array_agg/count_if` > - > > Key: SPARK-46978 > URL: https://issues.apache.org/jira/browse/SPARK-46978 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46978) Refine docstring of `sum_distinct/array_agg/count_if`
[ https://issues.apache.org/jira/browse/SPARK-46978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46978: - Assignee: Yang Jie > Refine docstring of `sum_distinct/array_agg/count_if` > - > > Key: SPARK-46978 > URL: https://issues.apache.org/jira/browse/SPARK-46978 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-39441) Speed up DeduplicateRelations
[ https://issues.apache.org/jira/browse/SPARK-39441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17749321#comment-17749321 ] Mitesh edited comment on SPARK-39441 at 2/5/24 7:01 PM: After applying this fix to 3.3.2, I still see some slowness here with a very tall query tree: https://gist.github.com/MasterDDT/422f933d91f59becf3924f01d03d5456 (search for DeduplicateRelations) Is it safe to skip this analyzer rule? Or another way to speed it up? cc [~cloud_fan] was (Author: masterddt): After applying this fix to 3.3.2, I still see some slowness here with a very large query tree: https://gist.github.com/MasterDDT/422f933d91f59becf3924f01d03d5456 (search for DeduplicateRelations) Is it safe to skip this analyzer rule? Or another way to speed it up? cc [~cloud_fan] > Speed up DeduplicateRelations > - > > Key: SPARK-39441 > URL: https://issues.apache.org/jira/browse/SPARK-39441 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.4.0 > > > Speed up the Analyzer rule DeduplicateRelations -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-46032) connect: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f
[ https://issues.apache.org/jira/browse/SPARK-46032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814432#comment-17814432 ] Gaétan CACACE edited comment on SPARK-46032 at 2/5/24 4:37 PM: --- Hello there, Just coming to give some more information. With spark-3.5.0-bin-hadoop3 version and spark connect (connected to a cluster). I encounter the same issue. The problem appears when I try to use RDD function like df.count() or df.collect(). The same appears with pandas_api() when I try to show the values of the DataFrame. Note that I have no problem for processing data, I just can't collect any values. My spark connect is also quite simple: {code:java} /opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0 --master spark://spark-master:7077 {code} And the spark-defaults.conf {code:java} spark.executor.memory 12g spark.executor.memoryOverhead 1g spark.executor.cores 4 spark.executor.instances 1 spark.sql.execution.arrow.pyspark.enabled true spark.driver.cores 2 spark.driver.memory 4g spark.network.timeout 600s spark.files.fetchTimeout 600s spark.worker.cleanup.enabled true spark.serializer org.apache.spark.serializer.KryoSerializer spark.jars /opt/spark/jars/sqljdbc42.jar,/opt/spark/jars/pgjdbc_42.7.0.jar {code} Do not hesitate to ask me if you want more information was (Author: JIRAUSER304079): Hello there, Just coming to give some more information. With spark-3.5.0-bin-hadoop3 version and spark connect (connected to a cluster). I encounter the same issue. The problem appears when I try to use RDD function like df.count() or df.collect(). The same appears with pandas_api() when I try to show the values of the DataFrame. Note that I have no problem for processing data, I just can't collect any values. 
My spark connect is also quite simple: {code:java} /opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0 --master spark://spark-master:7077 {code} And the spark-defaults.conf {code:java} spark.executor.memory 12g spark.executor.memoryOverhead 1g spark.executor.cores 4 spark.executor.instances 1 spark.sql.execution.arrow.pyspark.enabled true spark.driver.cores 2 spark.driver.memory 4g spark.network.timeout 600s spark.files.fetchTimeout 600s spark.worker.cleanup.enabled true spark.serializer org.apache.spark.serializer.KryoSerializer spark.jars /opt/spark/jars/sqljdbc42.jar,/opt/spark/jars/pgjdbc_42.7.0.jar {code} Do not hesitate to ask me if you want more information > connect: cannot assign instance of java.lang.invoke.SerializedLambda to field > org.apache.spark.rdd.MapPartitionsRDD.f > - > > Key: SPARK-46032 > URL: https://issues.apache.org/jira/browse/SPARK-46032 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Bobby Wang >Priority: Major > Labels: pull-request-available > > I downloaded spark 3.5 from the spark official website, and then I started a > Spark Standalone cluster in which both master and the only worker are in the > same node. > > Then I started the connect server by > {code:java} > start-connect-server.sh \ > --master spark://10.19.183.93:7077 \ > --packages org.apache.spark:spark-connect_2.12:3.5.0 \ > --conf spark.executor.cores=12 \ > --conf spark.task.cpus=1 \ > --executor-memory 30G \ > --conf spark.executor.resource.gpu.amount=1 \ > --conf spark.task.resource.gpu.amount=0.08 \ > --driver-memory 1G{code} > > I can 100% ensure the spark standalone cluster, the connect server and spark > driver are started observed from the webui. > > Finally, I tried to run a very simple spark job > (spark.range(100).filter("id>2").collect()) from spark-connect-client using > pyspark, but I got the below error. 
> > _pyspark --remote sc://localhost_ > _Python 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0] on linux_ > _Type "help", "copyright", "credits" or "license" for more information._ > _Welcome to_ > [Spark ASCII-art banner, version 3.5.0] > > _Using Python version 3.10.0 (default, Mar 3 2022 09:58:08)_ > _Client connected to the Spark Connect server at localhost_ > _SparkSession available as 'spark'._ > _>>> spark.range(100).filter("id > 3").collect()_ > _Traceback (most recent call last):_ > _File "", line 1, in _ > _File > "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/dataframe.py", > line 1645, in collect_ > _table,
[jira] [Commented] (SPARK-46032) connect: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f
[ https://issues.apache.org/jira/browse/SPARK-46032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814432#comment-17814432 ] Gaétan CACACE commented on SPARK-46032: --- Hello there, Just coming to give some more information. With spark-3.5.0-bin-hadoop3 version and spark connect (connected to a cluster). I encounter the same issue. The problem appears when I try to use RDD function like df.count() or df.collect(). The same appears with pandas_api() when I try to show the values of the DataFrame. Note that I have no problem for processing data, I just can't collect any values. My spark connect is also quite simple: {code:java} /opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0 --master spark://spark-master:7077 {code} And the spark-defaults.conf {code:java} spark.executor.memory 12g spark.executor.memoryOverhead 1g spark.executor.cores 4 spark.executor.instances 1 spark.sql.execution.arrow.pyspark.enabled true spark.driver.cores 2 spark.driver.memory 4g spark.network.timeout 600s spark.files.fetchTimeout 600s spark.worker.cleanup.enabled true spark.serializer org.apache.spark.serializer.KryoSerializer spark.jars /opt/spark/jars/sqljdbc42.jar,/opt/spark/jars/pgjdbc_42.7.0.jar {code} Do not hesitate to ask me if you want more information > connect: cannot assign instance of java.lang.invoke.SerializedLambda to field > org.apache.spark.rdd.MapPartitionsRDD.f > - > > Key: SPARK-46032 > URL: https://issues.apache.org/jira/browse/SPARK-46032 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Bobby Wang >Priority: Major > Labels: pull-request-available > > I downloaded spark 3.5 from the spark official website, and then I started a > Spark Standalone cluster in which both master and the only worker are in the > same node. 
> > Then I started the connect server by > {code:java} > start-connect-server.sh \ > --master spark://10.19.183.93:7077 \ > --packages org.apache.spark:spark-connect_2.12:3.5.0 \ > --conf spark.executor.cores=12 \ > --conf spark.task.cpus=1 \ > --executor-memory 30G \ > --conf spark.executor.resource.gpu.amount=1 \ > --conf spark.task.resource.gpu.amount=0.08 \ > --driver-memory 1G{code} > > I can 100% ensure the spark standalone cluster, the connect server and spark > driver are started observed from the webui. > > Finally, I tried to run a very simple spark job > (spark.range(100).filter("id>2").collect()) from spark-connect-client using > pyspark, but I got the below error. > > _pyspark --remote sc://localhost_ > _Python 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0] on linux_ > _Type "help", "copyright", "credits" or "license" for more information._ > _Welcome to_ > [Spark ASCII-art banner, version 3.5.0] > > _Using Python version 3.10.0 (default, Mar 3 2022 09:58:08)_ > _Client connected to the Spark Connect server at localhost_ > _SparkSession available as 'spark'._ > _>>> spark.range(100).filter("id > 3").collect()_ > _Traceback (most recent call last):_ > _File "", line 1, in _ > _File > "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/dataframe.py", > line 1645, in collect_ > _table, schema = self._session.client.to_table(query)_ > _File > "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py", > line 858, in to_table_ > _table, schema, _, _, _ = self._execute_and_fetch(req)_ > _File > "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py", > line 1282, in _execute_and_fetch_ > _for response in self._execute_and_fetch_as_iterator(req):_ > _File > 
"/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py", > line 1263, in _execute_and_fetch_as_iterator_ > _self._handle_error(error)_ > _File > "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py", > line 1502, in _handle_error_ > _self._handle_rpc_error(error)_ > _File > "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py", > line 1538, in _handle_rpc_error_ > _raise convert_exception(info, status.message) from None_ > _pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in > stage 0.0 f
[jira] [Assigned] (SPARK-46833) Using ICU library for collation tracking
[ https://issues.apache.org/jira/browse/SPARK-46833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46833: --- Assignee: Aleksandar Tomic > Using ICU library for collation tracking > > > Key: SPARK-46833 > URL: https://issues.apache.org/jira/browse/SPARK-46833 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46833) Using ICU library for collation tracking
[ https://issues.apache.org/jira/browse/SPARK-46833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46833. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44968 [https://github.com/apache/spark/pull/44968] > Using ICU library for collation tracking > > > Key: SPARK-46833 > URL: https://issues.apache.org/jira/browse/SPARK-46833 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46810) Clarify error class terminology
[ https://issues.apache.org/jira/browse/SPARK-46810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814406#comment-17814406 ] Nicholas Chammas commented on SPARK-46810: -- [~cloud_fan], [~LuciferYang], [~beliefer], and [~dongjoon] - What are your thoughts on the 3 proposed options? > Clarify error class terminology > --- > > Key: SPARK-46810 > URL: https://issues.apache.org/jira/browse/SPARK-46810 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > > We use inconsistent terminology when talking about error classes. I'd like to > get some clarity on that before contributing any potential improvements to > this part of the documentation. > Consider > [INCOMPLETE_TYPE_DEFINITION|https://spark.apache.org/docs/3.5.0/sql-error-conditions-incomplete-type-definition-error-class.html]. > It has several key pieces of hierarchical information that have inconsistent > names throughout our documentation and codebase: > * 42 > ** K01 > *** INCOMPLETE_TYPE_DEFINITION > ARRAY > MAP > STRUCT > What are the names of these different levels of information? > Some examples of inconsistent terminology: > * [Over > here|https://spark.apache.org/docs/latest/sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation] > we call 42 the "class". Yet on the main page for INCOMPLETE_TYPE_DEFINITION > we call that an "error class". So what exactly is a class, the 42 or the > INCOMPLETE_TYPE_DEFINITION? > * [Over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/README.md#L122] > we call K01 the "subclass". But [over > here|https://github.com/apache/spark/blob/26d3eca0a8d3303d0bb9450feb6575ed145bbd7e/common/utils/src/main/resources/error/error-classes.json#L1452-L1467] > we call the ARRAY, MAP, and STRUCT the subclasses. 
And on the main page for > INCOMPLETE_TYPE_DEFINITION we call those same things "derived error classes". > So what exactly is a subclass? > * [On this > page|https://spark.apache.org/docs/3.5.0/sql-error-conditions.html#incomplete_type_definition] > we call INCOMPLETE_TYPE_DEFINITION an "error condition", though in other > places we refer to it as an "error class". > I don't think we should leave this status quo as-is. I see a couple of ways > to fix this. > h1. Option 1: INCOMPLETE_TYPE_DEFINITION becomes an "Error Condition" > One solution is to use the following terms: > * Error class: 42 > * Error sub-class: K01 > * Error state: 42K01 > * Error condition: INCOMPLETE_TYPE_DEFINITION > * Error sub-condition: ARRAY, MAP, STRUCT > Pros: > * This terminology seems (to me at least) the most natural and intuitive. > * It aligns most closely to the SQL standard. > Cons: > * We use {{errorClass}} [all over our > codebase|https://github.com/apache/spark/blob/15c9ec7ca3b66ec413b7964a374cb9508a80/common/utils/src/main/scala/org/apache/spark/SparkException.scala#L30] > – literally in thousands of places – to refer to strings like > INCOMPLETE_TYPE_DEFINITION. > ** It's probably not practical to update all these usages to say > {{errorCondition}} instead, so if we go with this approach there will be a > divide between the terminology we use in user-facing documentation vs. what > the code base uses. > ** We can perhaps rename the existing {{error-classes.json}} to > {{error-conditions.json}} but clarify the reason for this divide between code > and user docs in the documentation for {{ErrorClassesJsonReader}} . > h1. Option 2: 42 becomes an "Error Category" > Another approach is to use the following terminology: > * Error category: 42 > * Error sub-category: K01 > * Error state: 42K01 > * Error class: INCOMPLETE_TYPE_DEFINITION > * Error sub-classes: ARRAY, MAP, STRUCT > Pros: > * We continue to use "error class" as we do today in our code base. 
> * The change from calling "42" a "class" to a "category" is low impact and > may not show up in user-facing documentation at all. (See my side note below.) > Cons: > * These terms do not align with the SQL standard. > * We will have to retire the term "error condition", which we have [already > used|https://github.com/apache/spark/blob/e7fb0ad68f73d0c1996b19c9e139d70dcc97a8c4/docs/sql-error-conditions.md] > in user-facing documentation. > h1. Option 3: "Error Class" and "State Class" > * SQL state class: 42 > * SQL state sub-class: K01 > * SQL state: 42K01 > * Error class: INCOMPLETE_TYPE_DEFINITION > * Error sub-classes: ARRAY, MAP, STRUCT > Pros: > * We continue to use "error class" as we do today in our code base. > * The change from calling "42" a "class" to
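The three options above shuffle the same five pieces of information under different names. As an illustration only (this mapping is not part of the Jira proposal; the dictionary keys below are just Option 1's proposed terms applied to the INCOMPLETE_TYPE_DEFINITION example from the issue description):

```python
# Illustrative sketch of Option 1's terminology for SQLSTATE 42K01.
# The values come from the INCOMPLETE_TYPE_DEFINITION example above;
# the dict itself is hypothetical, not a Spark data structure.
option_1 = {
    "error class": "42",          # SQLSTATE class (syntax error / access rule violation)
    "error sub-class": "K01",     # SQLSTATE subclass
    "error state": "42K01",       # class + sub-class, the full SQLSTATE
    "error condition": "INCOMPLETE_TYPE_DEFINITION",
    "error sub-conditions": ["ARRAY", "MAP", "STRUCT"],
}

# The error state is simply the class concatenated with the sub-class.
assert option_1["error state"] == option_1["error class"] + option_1["error sub-class"]
```

Under Option 2 and Option 3, the same five values would instead be labeled category/sub-category/state/class/sub-class and state-class/state-sub-class/state/class/sub-class, respectively; only the names change, not the data.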
[jira] [Comment Edited] (SPARK-24815) Structured Streaming should support dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763940#comment-17763940 ] Krystal Mitchell edited comment on SPARK-24815 at 2/5/24 3:32 PM: -- Thank you [~pavan0831]. This draft PR will have a significant impact on some of the projects we are currently working on. Can't wait to see it over the line. was (Author: JIRAUSER302183): Thank you [~pavan0831]. This draft PR will impact some of the projects we are currently working on. > Structured Streaming should support dynamic allocation > -- > > Key: SPARK-24815 > URL: https://issues.apache.org/jira/browse/SPARK-24815 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core, Structured Streaming >Affects Versions: 2.3.1 >Reporter: Karthik Palaniappan >Priority: Minor > Labels: pull-request-available > > For batch jobs, dynamic allocation is very useful for adding and removing > containers to match the actual workload. On multi-tenant clusters, it ensures > that a Spark job is taking no more resources than necessary. In cloud > environments, it enables autoscaling. > However, if you set spark.dynamicAllocation.enabled=true and run a structured > streaming job, the batch dynamic allocation algorithm kicks in. It requests > more executors if the task backlog is a certain size, and removes executors > if they idle for a certain period of time. > Quick thoughts: > 1) Dynamic allocation should be pluggable, rather than hardcoded to a > particular implementation in SparkContext.scala (this should be a separate > JIRA). > 2) We should make a structured streaming algorithm that's separate from the > batch algorithm. Eventually, continuous processing might need its own > algorithm. 
> 3) Spark should print a warning if you run a structured streaming job when > Core's dynamic allocation is enabled -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
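The issue above describes the batch dynamic allocation algorithm (scale up on task backlog, scale down on executor idle) applying unchanged to streaming queries. A minimal sketch of the batch-oriented settings involved, with purely illustrative values (only the property names are real Spark configuration keys; none of these values come from the issue):

```
# spark-defaults.conf sketch -- illustrative values only.
# With a structured streaming job, these batch-oriented knobs still
# drive executor scaling, which is the behavior the issue objects to.
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.minExecutors             1
spark.dynamicAllocation.maxExecutors             20
spark.dynamicAllocation.schedulerBacklogTimeout  1s    # scale up on task backlog
spark.dynamicAllocation.executorIdleTimeout      60s   # scale down on idle
```

The backlog-based scale-up heuristic is the crux of the problem for streaming: a healthy micro-batch workload alternates between bursts of tasks and idle gaps, so both triggers fire constantly.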
[jira] [Updated] (SPARK-46978) Refine docstring of `sum_distinct/array_agg/count_if`
[ https://issues.apache.org/jira/browse/SPARK-46978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46978: --- Labels: pull-request-available (was: ) > Refine docstring of `sum_distinct/array_agg/count_if` > - > > Key: SPARK-46978 > URL: https://issues.apache.org/jira/browse/SPARK-46978 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46977) A failed request to obtain a token from one NameNode should not block subsequent token requests
[ https://issues.apache.org/jira/browse/SPARK-46977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46977: --- Labels: pull-request-available (was: ) > A failed request to obtain a token from one NameNode should not block > subsequent token requests > --- > > Key: SPARK-46977 > URL: https://issues.apache.org/jira/browse/SPARK-46977 > Project: Spark > Issue Type: Improvement > Components: Security, Spark Core >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46978) Refine docstring of `sum_distinct/array_agg/count_if`
Yang Jie created SPARK-46978: Summary: Refine docstring of `sum_distinct/array_agg/count_if` Key: SPARK-46978 URL: https://issues.apache.org/jira/browse/SPARK-46978 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46977) A failed request to obtain a token from one NameNode should not block subsequent token requests
Cheng Pan created SPARK-46977: - Summary: A failed request to obtain a token from one NameNode should not block subsequent token requests Key: SPARK-46977 URL: https://issues.apache.org/jira/browse/SPARK-46977 Project: Spark Issue Type: Improvement Components: Security, Spark Core Affects Versions: 4.0.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46975) Move to_{hdf, feather, stata} to the fallback list
[ https://issues.apache.org/jira/browse/SPARK-46975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46975: -- Assignee: (was: Apache Spark) > Move to_{hdf, feather, stata} to the fallback list > -- > > Key: SPARK-46975 > URL: https://issues.apache.org/jira/browse/SPARK-46975 > Project: Spark > Issue Type: Sub-task > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46975) Move to_{hdf, feather, stata} to the fallback list
[ https://issues.apache.org/jira/browse/SPARK-46975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46975: -- Assignee: Apache Spark > Move to_{hdf, feather, stata} to the fallback list > -- > > Key: SPARK-46975 > URL: https://issues.apache.org/jira/browse/SPARK-46975 > Project: Spark > Issue Type: Sub-task > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46976) Implement `DataFrameGroupBy.corr`
[ https://issues.apache.org/jira/browse/SPARK-46976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46976: -- Assignee: (was: Apache Spark) > Implement `DataFrameGroupBy.corr` > - > > Key: SPARK-46976 > URL: https://issues.apache.org/jira/browse/SPARK-46976 > Project: Spark > Issue Type: Sub-task > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46976) Implement `DataFrameGroupBy.corr`
[ https://issues.apache.org/jira/browse/SPARK-46976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46976: -- Assignee: Apache Spark > Implement `DataFrameGroupBy.corr` > - > > Key: SPARK-46976 > URL: https://issues.apache.org/jira/browse/SPARK-46976 > Project: Spark > Issue Type: Sub-task > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org