[jira] [Commented] (SPARK-37285) Add Weight of Evidence and Information value to ml.feature
[ https://issues.apache.org/jira/browse/SPARK-37285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489301#comment-17489301 ]

zhengruifeng commented on SPARK-37285:
--------------------------------------

Were these metrics or algorithms implemented in scikit-learn or other libraries?

> Add Weight of Evidence and Information value to ml.feature
> ----------------------------------------------------------
>
>                 Key: SPARK-37285
>                 URL: https://issues.apache.org/jira/browse/SPARK-37285
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 3.2.0
>            Reporter: Simon Tao
>            Assignee: Apache Spark
>            Priority: Major
>
> The weight of evidence (WOE) and information value (IV) provide a useful
> framework for exploratory analysis and variable screening for binary
> classifiers. They help in several ways:
> 1. Check the linear relationship of a feature with the dependent variable
>    to be used in the model.
> 2. Provide a good variable transformation method for both continuous and
>    categorical features.
> 3. Avoid increasing model complexity, unlike one-hot encoding.
> 4. Detect linear and non-linear relationships.
> 5. Support feature selection.
> 6. Measure the predictive power of a feature and help point out suspicious
>    features.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
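[For readers unfamiliar with the metrics discussed above, here is a minimal pure-Python sketch of how per-bin WOE and total IV are conventionally computed from good/bad counts. This is an illustration of the math only, not the proposed ml.feature API:]

```python
import math

def woe_iv(bins):
    """Compute per-bin weight of evidence and total information value.

    bins: list of (n_good, n_bad) event counts per bucket of one feature.
    Returns (list of WOE values, IV). Assumes every bin has nonzero counts;
    real implementations smooth or merge empty bins.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        pct_good = g / total_good          # share of goods in this bin
        pct_bad = b / total_bad            # share of bads in this bin
        woe = math.log(pct_good / pct_bad) # WOE = ln(%good / %bad)
        woes.append(woe)
        iv += (pct_good - pct_bad) * woe   # IV = sum of contributions
    return woes, iv
```

A feature whose bins separate goods from bads strongly (e.g. one bin mostly goods, another mostly bads) yields a large IV, which is why IV is used as a screening score.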
[jira] [Updated] (SPARK-38120) HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value
[ https://issues.apache.org/jira/browse/SPARK-38120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-38120:
----------------------------------
    Affects Version/s: 3.0.3

> HiveExternalCatalog.listPartitions is failing when partition column name is
> upper case and dot in partition value
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-38120
>                 URL: https://issues.apache.org/jira/browse/SPARK-38120
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.3, 3.1.2, 3.2.1
>            Reporter: Khalid Mammadov
>            Priority: Minor
>
> The HiveExternalCatalog.listPartitions method call fails when a partition
> column name is upper case and the partition value contains a dot. It is
> related to this change:
> https://github.com/apache/spark/commit/f18b905f6cace7686ef169fda7de474079d0af23
> The test case in that PR does not reproduce the issue because its partition
> column name is lower case.
>
> Below is how to reproduce the issue:
>
> scala> import org.apache.spark.sql.catalyst.TableIdentifier
> import org.apache.spark.sql.catalyst.TableIdentifier
>
> scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY (partCol1 STRING, partCol2 STRING)")
>
> scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = 'i.j') VALUES (100, 'John')")
>
> scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), Some(Map("partCol2" -> "i.j"))).foreach(println)
> java.util.NoSuchElementException: key not found: partcol2
>   at scala.collection.immutable.Map$Map2.apply(Map.scala:227)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1(ExternalCatalogUtils.scala:205)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1$adapted(ExternalCatalogUtils.scala:202)
>   at scala.collection.immutable.Map$Map1.forall(Map.scala:196)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.isPartialPartitionSpec(ExternalCatalogUtils.scala:202)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6(HiveExternalCatalog.scala:1312)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6$adapted(HiveExternalCatalog.scala:1312)
>   at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$1(HiveExternalCatalog.scala:1312)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.withClientWrappingException(HiveExternalCatalog.scala:114)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1296)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:254)
>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:1251)
>   ... 47 elided
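[The failure mode in this report can be illustrated outside Spark with a small Python sketch (hypothetical helper names, not Spark's actual ExternalCatalogUtils code): the user's partition spec keys get lower-cased, the stored partition map keeps the original-case keys, and a case-sensitive lookup then throws. Normalizing key case on both sides before comparing avoids the mismatch:]

```python
# Hypothetical sketch of the lookup mismatch; not Spark's actual code.
partition = {"partCol1": "CA", "partCol2": "i.j"}  # stored with original case
spec = {"partcol2": "i.j"}                         # spec keys lower-cased

def is_partial_match_naive(partition, spec):
    # Case-sensitive lookup: raises KeyError ("key not found: partcol2").
    return all(partition[k] == v for k, v in spec.items())

def is_partial_match(partition, spec):
    # Normalize key case on both sides before comparing.
    normalized = {k.lower(): v for k, v in partition.items()}
    return all(normalized.get(k.lower()) == v for k, v in spec.items())
```

The dot in the partition value matters only because it sends the code down the path that does this per-key comparison; the root cause is the case-sensitive key lookup.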
[jira] [Updated] (SPARK-38120) HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value
[ https://issues.apache.org/jira/browse/SPARK-38120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-38120:
----------------------------------
    Affects Version/s: 3.1.2

> HiveExternalCatalog.listPartitions is failing when partition column name is
> upper case and dot in partition value
[jira] [Resolved] (SPARK-38153) Remove option newlines.topLevelStatements in scalafmt.conf
[ https://issues.apache.org/jira/browse/SPARK-38153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-38153.
-----------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 35455
https://github.com/apache/spark/pull/35455

> Remove option newlines.topLevelStatements in scalafmt.conf
> ----------------------------------------------------------
>
>                 Key: SPARK-38153
>                 URL: https://issues.apache.org/jira/browse/SPARK-38153
>             Project: Spark
>          Issue Type: Task
>          Components: Build
>    Affects Versions: 3.3.0
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 3.3.0
>
> The configuration
>     newlines.topLevelStatements = [before, after]
> adds a blank line before the first member and after the last member of a
> class. This is neither encouraged nor discouraged per
> https://github.com/databricks/scala-style-guide#blanklines
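[For context, the option being removed is a scalafmt setting; a minimal config fragment (surrounding settings from Spark's actual scalafmt.conf omitted) looks like:]

```hocon
# scalafmt.conf (fragment) -- the option removed by this task.
# [before, after] tells scalafmt to force a blank line before the first
# and after the last top-level statement of a template body.
newlines.topLevelStatements = [before, after]
```

Deleting the line reverts scalafmt to its default, which neither adds nor removes those blank lines, matching the style guide's neutral stance.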
[jira] [Commented] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent
[ https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489290#comment-17489290 ]

Apache Spark commented on SPARK-38033:
--------------------------------------

User 'LLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35458

> The structured streaming processing cannot be started because the commitId
> and offsetId are inconsistent
> --------------------------------------------------------------------------
>
>                 Key: SPARK-38033
>                 URL: https://issues.apache.org/jira/browse/SPARK-38033
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.4.6, 3.0.3
>            Reporter: LLiu
>            Priority: Major
>
> A streaming query could not start after an unexpected machine shutdown. The
> exception is as follows:
>
> {code:java}
> ERROR 22/01/12 02:48:36 MicroBatchExecution: Query streaming_4a026335eafd4bb498ee51752b49f7fb [id = 647ba9e4-16d2-4972-9824-6f9179588806, runId = 92385d5b-f31f-40d0-9ac7-bb7d9796d774] terminated with error
> java.lang.IllegalStateException: batch 113258 doesn't exist
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:255)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:169)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>   at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
>   at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
>   at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
> {code}
>
> I checked the checkpoint files on HDFS and found the latest offset is
> 113259, but the latest commit is 113257:
> {code:java}
> commits
> /tmp/streaming_/commits/113253
> /tmp/streaming_/commits/113254
> /tmp/streaming_/commits/113255
> /tmp/streaming_/commits/113256
> /tmp/streaming_/commits/113257
> offsets
> /tmp/streaming_/offsets/113253
> /tmp/streaming_/offsets/113254
> /tmp/streaming_/offsets/113255
> /tmp/streaming_/offsets/113256
> /tmp/streaming_/offsets/113257
> /tmp/streaming_/offsets/113259
> {code}
> Finally, I deleted the offset file /tmp/streaming_/offsets/113259 and the
> program started normally. I think there is a problem here: we should handle
> this exception, or at least suggest a resolution in the log.
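[The restart failure described in this report can be sketched in plain Python (a hypothetical simplification, not Spark's actual MicroBatchExecution logic): on recovery the engine reads the latest offset batch id and also needs the immediately preceding offset entry, so a gap left by a crash (offset 113258 never written, 113259 written) produces exactly the reported error:]

```python
def check_restart(offset_batch_ids):
    """Simulate recovery from a checkpoint's offsets/ directory.

    Returns the batch id to restart from, or raises if the offset log has
    a gap just before the latest batch (hypothetical simplification).
    """
    latest = max(offset_batch_ids)
    previous = latest - 1
    if previous >= min(offset_batch_ids) and previous not in offset_batch_ids:
        # Mirrors: java.lang.IllegalStateException: batch 113258 doesn't exist
        raise RuntimeError(f"batch {previous} doesn't exist")
    return latest

# Checkpoint state from the report: offsets 113253-113257 plus 113259;
# batch 113258's offset file was never written due to the crash.
reported_offsets = [113253, 113254, 113255, 113256, 113257, 113259]
```

Deleting the dangling latest offset file, as the reporter did, removes the gap, which is why the query then restarts from the last consistent batch.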
[jira] [Assigned] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent
[ https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38033:
------------------------------------
    Assignee: (was: Apache Spark)

> The structured streaming processing cannot be started because the commitId
> and offsetId are inconsistent
[jira] [Assigned] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent
[ https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38033:
------------------------------------
    Assignee: Apache Spark

> The structured streaming processing cannot be started because the commitId
> and offsetId are inconsistent
[jira] [Commented] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent
[ https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489289#comment-17489289 ]

Apache Spark commented on SPARK-38033:
--------------------------------------

User 'LLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35458

> The structured streaming processing cannot be started because the commitId
> and offsetId are inconsistent
[jira] [Assigned] (SPARK-36553) KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
[ https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36553:
------------------------------------
    Assignee: (was: Apache Spark)

> KMeans fails with NegativeArraySizeException for K = 50000 after issue
> #27758 was introduced
> ----------------------------------------------------------------------
>
>                 Key: SPARK-36553
>                 URL: https://issues.apache.org/jira/browse/SPARK-36553
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark
>    Affects Versions: 3.1.1
>            Reporter: Anders Rydbirk
>            Priority: Major
>
> We are running KMeans on approximately 350M rows of x, y, z coordinates
> using the following configuration:
> {code:java}
> KMeans(
>   featuresCol='features',
>   predictionCol='centroid_id',
>   k=50000,
>   initMode='k-means||',
>   initSteps=2,
>   tol=0.5,
>   maxIter=20,
>   seed=SEED,
>   distanceMeasure='euclidean'
> )
> {code}
> When using Spark 3.0.0 this worked fine, but after upgrading to 3.1.1 we
> consistently get errors unless we reduce K.
>
> Stack trace:
> {code:java}
> An error occurred while calling o167.fit.
> java.lang.NegativeArraySizeException: -897458648
>   at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194)
>   at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191)
>   at scala.Array$.ofDim(Array.scala:221)
>   at org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52)
>   at org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280)
>   at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231)
>   at org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354)
>   at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>   at scala.util.Try$.apply(Try.scala:213)
>   at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>   at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329)
>   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.base/java.lang.Thread.run(Unknown Source)
> {code}
>
> The issue was introduced by
> [#27758|#diff-725d4624ddf4db9cc51721c2ddaef50a1bc30e7b471e0439da28c5b5582efdfdR52]]
> which significantly reduces the maximum supported value of K. Snippet of
> the line that throws the error, from [DistanceMeasure.scala:|#L52]]
> {code:java}
> val packedValues = Array.ofDim[Double](k * (k + 1) / 2)
> {code}
>
> *What we have tried:*
>  * Reducing iterations
>  * Reducing input volume
>  * Reducing K
> Only reducing K has yielded success.
>
> *Possible workarounds:*
>  # Roll back to Spark 3.0.0, since a KMeansModel generated with 3.0.0
>    cannot be loaded in 3.1.1.
>  # Reduce K. Currently trying with 45000.
>
> *What we don't understand:*
> Given the line of code above, we do not understand why we would get an
> integer overflow. For K=50,000, packedValues should be allocated with a
> size of 1,250,025,000 < 2^31 and should not result in a negative array
> size.
>
> *Suggested resolution:*
> I'm not strong in the inner workings of KMeans, but my immediate thought
> would be to add a fallback to the previous logic for K larger than a set
> threshold if the optimisation is to stay in place, as it breaks
> compatibility from 3.0.0 to 3.1.1 for edge cases.
>
> Please let me know if more information is needed; this is my first time
> raising a bug for an open-source project.
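[The reporter's puzzlement is explained by evaluation order: in `k * (k + 1) / 2`, the JVM computes `k * (k + 1)` as a 32-bit Int before dividing, and 50000 * 50001 = 2,500,050,000 already exceeds Int.MaxValue (2,147,483,647). Simulating 32-bit wraparound in Python reproduces the exact negative size from the stack trace:]

```python
def to_int32(x):
    """Wrap a Python int to a signed 32-bit integer, like a JVM Int."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def packed_size_jvm(k):
    # Mirrors `k * (k + 1) / 2` in 32-bit Int arithmetic: the
    # multiplication overflows before the division ever happens.
    prod = to_int32(k * (k + 1))
    # JVM integer division truncates toward zero.
    return prod // 2 if prod >= 0 else -((-prod) // 2)

print(packed_size_jvm(50000))  # -> -897458648, the size in the exception
```

So the final packed size 1,250,025,000 would indeed fit in an Int; it is the intermediate product that overflows, which is why rewriting the expression (or widening to Long before multiplying) avoids the crash.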
[jira] [Commented] (SPARK-36553) KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
[ https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489281#comment-17489281 ] Apache Spark commented on SPARK-36553: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/35457 > KMeans fails with NegativeArraySizeException for K = 5 after issue #27758 > was introduced > > > Key: SPARK-36553 > URL: https://issues.apache.org/jira/browse/SPARK-36553 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 3.1.1 >Reporter: Anders Rydbirk >Priority: Major > > We are running KMeans on approximately 350M rows of x, y, z coordinates using > the following configuration: > {code:java} > KMeans( > featuresCol='features', > predictionCol='centroid_id', > k=5, > initMode='k-means||', > initSteps=2, > tol=0.5, > maxIter=20, > seed=SEED, > distanceMeasure='euclidean' > ) > {code} > When using Spark 3.0.0 this worked fine, but when upgrading to 3.1.1 we are > consistently getting errors unless we reduce K. 
> Stacktrace: > > {code:java} > An error occurred while calling o167.fit.An error occurred while calling > o167.fit.: java.lang.NegativeArraySizeException: -897458648 at > scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194) at > scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191) at > scala.Array$.ofDim(Array.scala:221) at > org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52) > at > org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280) > at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231) > at org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354) at > org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) > at scala.util.Try$.apply(Try.scala:213) at > org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) > at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown > Source) at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown > Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:282) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.base/java.lang.Thread.run(Unknown Source) > {code} > > The issue is introduced by > [#27758|#diff-725d4624ddf4db9cc51721c2ddaef50a1bc30e7b471e0439da28c5b5582efdfdR52]] > which significantly reduces the maximum value of K. 
Snippet of the line that > throws the error, from [DistanceMeasure.scala|#L52]: > {code:java} > val packedValues = Array.ofDim[Double](k * (k + 1) / 2) > {code} > > *What we have tried:* > * Reducing iterations > * Reducing input volume > * Reducing K > Only reducing K has yielded success. > > *Possible workaround:* > # Roll back to Spark 3.0.0, since a KMeansModel generated with 3.0.0 cannot > be loaded in 3.1.1. > # Reduce K. Currently trying with 45000. > > *What we don't understand*: > Given the line of code above, we do not understand why we would get an > integer overflow. > For K=50,000, packedValues should be allocated with the size of 1,250,025,000 > < (2^31) and not result in a negative array size. > > *Suggested resolution:* > I'm not strong in the inner workings of KMeans, but my immediate thought > would be to add a fallback to the previous logic for K larger than a set > threshold if the optimisation is to stay in place, as it breaks compatibility > from 3.0.0 to 3.1.1 for edge cases. > > Please let me know if more information is needed; this is my first time > raising a bug for an OSS project. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
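The overflow in the snippet above is easy to reproduce outside Spark. A minimal Java sketch (class name is illustrative, not Spark code) showing that k * (k + 1) wraps around a 32-bit int before the division by 2 is ever applied, and that widening one operand to long gives the size the reporter expected:

```java
public class PackedSizeOverflow {
    public static void main(String[] args) {
        int k = 50000;
        // k * (k + 1) = 2,500,050,000 exceeds Integer.MAX_VALUE (2,147,483,647),
        // so the intermediate product wraps to a negative value BEFORE the / 2.
        int overflowed = k * (k + 1) / 2;
        // Widening one operand to long keeps the intermediate product exact.
        long exact = (long) k * (k + 1) / 2;
        System.out.println(overflowed); // -897458648, the size in the exception
        System.out.println(exact);      // 1250025000
    }
}
```

Because one of k and k + 1 is always even, the long result is exact; the bug lies purely in the 32-bit intermediate product.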
[jira] [Assigned] (SPARK-36553) KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
[ https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36553: Assignee: Apache Spark > KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 > was introduced > > > Key: SPARK-36553 > URL: https://issues.apache.org/jira/browse/SPARK-36553 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 3.1.1 >Reporter: Anders Rydbirk >Assignee: Apache Spark >Priority: Major -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-38145: Parent: SPARK-38154 Issue Type: Sub-task (was: Improvement) > Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True > > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests fail when `spark.sql.ansi.enabled` is set to True. > For the stability of the project, tests should pass regardless of the option > value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36553) KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
[ https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489278#comment-17489278 ] zhengruifeng commented on SPARK-36553: -- it is an overflow: {code:java} scala> val k = 50000 val k: Int = 50000 scala> k * (k + 1) / 2 val res1: Int = -897458648 {code} > KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 > was introduced > > > Key: SPARK-36553 > URL: https://issues.apache.org/jira/browse/SPARK-36553 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 3.1.1 >Reporter: Anders Rydbirk >Priority: Major -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37960) A new framework to represent catalyst expressions in DS v2 APIs
[ https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-37960: --- Description: Spark needs a new framework to represent catalyst expressions in DS v2 APIs. CASE ... WHEN ... ELSE ... END is just the first use case. was: Currently, Spark supports pushing the aggregate SUM(column) down into JDBC data sources. SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. > A new framework to represent catalyst expressions in DS v2 APIs > --- > > Key: SPARK-37960 > URL: https://issues.apache.org/jira/browse/SPARK-37960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Spark needs a new framework to represent catalyst expressions in DS v2 APIs. > CASE ... WHEN ... ELSE ... END is just the first use case. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37960) A new framework to represent catalyst expressions in DS v2 APIs
[ https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-37960: --- Summary: A new framework to represent catalyst expressions in DS v2 APIs (was: Support aggregate push down with CASE ... WHEN ... ELSE ... END) > A new framework to represent catalyst expressions in DS v2 APIs > --- > > Key: SPARK-37960 > URL: https://issues.apache.org/jira/browse/SPARK-37960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports pushing the aggregate SUM(column) down into JDBC > data sources. > SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30661) KMeans blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489271#comment-17489271 ] zhengruifeng commented on SPARK-30661: -- ok, I will skip .mllib calling .ml here. We may re-org the impls in the future. I'm looking for a way to minimize the changes. > KMeans blockify input vectors > - > > Key: SPARK-30661 > URL: https://issues.apache.org/jira/browse/SPARK-30661 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Attachments: blockify_kmeans.png > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31007) KMeans optimization based on triangle-inequality
[ https://issues.apache.org/jira/browse/SPARK-31007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489270#comment-17489270 ] zhengruifeng edited comment on SPARK-31007 at 2/9/22, 6:05 AM: --- this case is not an OOM, but an overflow: {code:java} scala> val k = 50000 val k: Int = 50000 scala> k * (k + 1) / 2 val res0: Int = -897458648 scala> k * (k + 1) val res1: Int = -1794917296 scala> k / 2 * (k + 1) val res2: Int = 1250025000 scala> Array.ofDim[Double](k * (k + 1) / 2) java.lang.NegativeArraySizeException at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:273) at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:271) at scala.Array$.ofDim(Array.scala:323) ... 32 elided scala> Array.ofDim[Double](k / 2 * (k + 1)) java.lang.OutOfMemoryError: Java heap space at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:273) at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:271) at scala.Array$.ofDim(Array.scala:323) ... 29 elided scala> val k = 45000 val k: Int = 45000 scala> Array.ofDim[Double](k / 2 * (k + 1)) java.lang.OutOfMemoryError: Java heap space scala> k / 2 * (k + 1) val res6: Int = 1012522500 {code} I think we should add a limit on k for this optimization. The change should not be large; I will send a PR. > KMeans optimization based on triangle-inequality > > > Key: SPARK-31007 > URL: https://issues.apache.org/jira/browse/SPARK-31007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.1.0 > > Attachments: ICML03-022.pdf > > > In the current impl, the following Lemma is used in KMeans: > 0. Let x be a point, let c be a center and o be the origin; then d(x,c) >= > |d(x,o) - d(c,o)| = |norm(x)-norm(c)|. > This can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}. > According to [Using the Triangle Inequality to Accelerate > K-Means|https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf], we can go > further, and there are another two Lemmas that can be used: > 1. Let x be a point, and let b and c be centers. If d(b,c) >= 2d(x,b) then > d(x,c) >= d(x,b). > This can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}. > However, luckily for CosineDistance we can get a variant in the space of > radian/angle. > 2. Let x be a point, and let b and c be centers. Then d(x,c) >= max\{0, > d(x,b)-d(b,c)}. > This can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}. > The application of Lemma 2 is a little complex: it needs to cache/update the > distance/lower bounds to previous centers, and thus can only be applied in > training, not in prediction. > So this ticket is mainly for Lemma 1. Its idea is quite simple: if point x is > close enough to center b (less than a pre-computed radius), then we can say > point x belongs to center b without computing the distances between x and the > other centers. It can be used in both training and prediction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
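The Lemma 1 skip test described in the ticket can be sketched in a few lines of Java; this is an illustrative fragment with hypothetical helper names, not the actual ml.clustering implementation:

```java
public class TriangleInequalitySkip {
    // Plain Euclidean distance between two points.
    static double dist(double[] p, double[] q) {
        double s = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // Lemma 1: if d(b, c) >= 2 * d(x, b), then d(x, c) >= d(x, b), so center c
    // can never beat the current center b and d(x, c) need not be computed.
    static boolean canSkipCenter(double[] x, double[] b, double[] c) {
        return dist(b, c) >= 2.0 * dist(x, b);
    }

    public static void main(String[] args) {
        double[] x = {0.0, 0.0};
        double[] b = {1.0, 0.0};  // current closest center, d(x, b) = 1
        double[] c = {10.0, 0.0}; // d(b, c) = 9 >= 2 * 1, so c is skippable
        System.out.println(canSkipCenter(x, b, c)); // true
    }
}
```

The pairwise center distances d(b, c) can be precomputed once per iteration, which is exactly what makes the per-point skip test cheap.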
[jira] [Commented] (SPARK-31007) KMeans optimization based on triangle-inequality
[ https://issues.apache.org/jira/browse/SPARK-31007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489270#comment-17489270 ] zhengruifeng commented on SPARK-31007: -- this case is not an OOM, but an overflow: {code:java} scala> val k = 50000 val k: Int = 50000 scala> k * (k + 1) / 2 val res0: Int = -897458648 scala> k * (k + 1) val res1: Int = -1794917296 scala> k / 2 * (k + 1) val res2: Int = 1250025000 scala> Array.ofDim[Double](k * (k + 1) / 2) java.lang.NegativeArraySizeException at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:273) at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:271) at scala.Array$.ofDim(Array.scala:323) ... 32 elided scala> Array.ofDim[Double](k / 2 * (k + 1)) java.lang.OutOfMemoryError: Java heap space at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:273) at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:271) at scala.Array$.ofDim(Array.scala:323) ... 29 elided scala> val k = 45000 val k: Int = 45000 scala> Array.ofDim[Double](k / 2 * (k + 1)) java.lang.OutOfMemoryError: Java heap space scala> k / 2 * (k + 1) val res6: Int = 1012522500 {code} I think we should add a limit on k for this optimization. The change should not be large; I will send a PR. > KMeans optimization based on triangle-inequality > > > Key: SPARK-31007 > URL: https://issues.apache.org/jira/browse/SPARK-31007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.1.0 > > Attachments: ICML03-022.pdf -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
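A guard of the kind proposed in the comment above could look like the following sketch; the method name and the memory budget are assumptions for illustration, not the actual PR. The packed array holds k * (k + 1) / 2 doubles, so both the int overflow and the heap cost grow quadratically with k:

```java
public class StatisticsGuard {
    // Sketch: compute the packed size in long arithmetic and only enable the
    // precomputed-statistics optimization when the array is both representable
    // as a Java array and affordable under a memory budget.
    static boolean canUseStatistics(int k, long maxBytes) {
        long packedLength = (long) k * (k + 1) / 2;         // number of doubles
        if (packedLength > Integer.MAX_VALUE) return false; // not allocatable at all
        return packedLength * 8L <= maxBytes;               // 8 bytes per double
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30; // illustrative budget, not a Spark default
        System.out.println(canUseStatistics(1000, oneGiB));  // true: ~4 MB
        System.out.println(canUseStatistics(50000, oneGiB)); // false: ~10 GB
    }
}
```

For k = 50000 the packed length (1,250,025,000) is technically a legal array size, but the ~10 GB of doubles it implies is what produced the OutOfMemoryError in the scala session above.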
[jira] [Comment Edited] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489259#comment-17489259 ] Xinyi Yu edited comment on SPARK-38157 at 2/9/22, 6:00 AM: --- I will take this one. Kinda my first Spark PR! was (Author: xyyu): I will take this one. Kinda my first OSS PR! > Fix > /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite > under ANSI mode > > > Key: SPARK-38157 > URL: https://issues.apache.org/jira/browse/SPARK-38157 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > > ThriftServerQueryTestSuite will fail on {{timestampNTZ/timestamp.sql}} when > ANSI mode is on by default. This is because {{timestampNTZ/timestamp.sql}} > should only run with ANSI off according to the golden result file, but > neither ThriftServerQueryTestSuite nor the timestamp.sql test overrides the > default ANSI setting. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489259#comment-17489259 ] Xinyi Yu commented on SPARK-38157: -- I will take this one. Kinda my first OSS PR! > Fix > /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite > under ANSI mode > > > Key: SPARK-38157 > URL: https://issues.apache.org/jira/browse/SPARK-38157 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > > ThriftServerQueryTestSuite will fail on {{timestampNTZ/timestamp.sql}} when > ANSI mode is on by default. This is because {{timestampNTZ/timestamp.sql}} > should only run with ANSI off according to the golden result file, but > neither ThriftServerQueryTestSuite nor the timestamp.sql test overrides the > default ANSI setting. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38158) Rdd and shuffle blocks not migrated to new executors when decommission feature is enabled
[ https://issues.apache.org/jira/browse/SPARK-38158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohan Patidar updated SPARK-38158: -- Attachment: sparkConf_Decommission_to_alternate_storage.conf sparkConf_Decommission_to_executors.conf > Rdd and shuffle blocks not migrated to new executors when decommission > feature is enabled > - > > Key: SPARK-38158 > URL: https://issues.apache.org/jira/browse/SPARK-38158 > Project: Spark > Issue Type: Bug > Components: Block Manager, Kubernetes >Affects Versions: 3.2.0 > Environment: Spark on k8s >Reporter: Mohan Patidar >Priority: Major > Attachments: sparkConf_Decommission_to_alternate_storage.conf, > sparkConf_Decommission_to_executors.conf > > > We’re using Spark 3.2.0 and we have enabled the spark decommission feature. > As part of validating this feature, we wanted to check if the rdd blocks and > shuffle blocks from the decommissioned executors are migrated to other > executors. > However, we could not see this happening. Below is the configuration we used. 
> # *Spark Configuration used:* > ** spark.local.dir /mnt/spark-ldir > spark.decommission.enabled true > spark.storage.decommission.enabled true > spark.storage.decommission.rddBlocks.enabled true > spark.storage.decommission.shuffleBlocks.enabled true > spark.dynamicAllocation.enabled true > # *Brought up spark-driver and executors on the different nodes.* > NAME > READY STATUS NODE > decommission-driver > 1/1 Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-1 1/1 > Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-2 1/1 > Running *Node2* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-3 1/1 > Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-4 1/1 > Running *Node2* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-5 1/1 > Running *Node1* > # *Bring down Node2; the status of the pods is then as follows.* > NAME > READY STATUS NODE > decommission-driver > 1/1 Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-1 1/1 > Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-2 1/1 > Terminating *Node2* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-3 1/1 > Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-4 1/1 > Terminating *Node2* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-5 1/1 > Running *Node1* > # *Driver logs:* > {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.296Z", > "timezone":"UTC", "log":"Adding decommission script to lifecycle"} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.459Z", > "timezone":"UTC", "log":"Adding decommission script to lifecycle"} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.564Z", > "timezone":"UTC", "log":"Adding decommission script to lifecycle"} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.601Z", > "timezone":"UTC", "log":"Adding decommission script to lifecycle"} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.667Z", > "timezone":"UTC", "log":"Adding decommission script to lifecycle"} > {"type":"log", 
"level":"INFO", "time":"2022-01-12T08:58:21.885Z", > "timezone":"UTC", "log":"Notify executor 5 to decommissioning."} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", > "timezone":"UTC", "log":"Notify executor 1 to decommissioning."} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", > "timezone":"UTC", "log":"Notify executor 3 to decommissioning."} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", > "timezone":"UTC", "log":"Mark BlockManagers (BlockManagerId(5, X.X.X.X, > 33359, None), BlockManagerId(1, X.X.X.X, 38655, None), BlockManagerId(3, > X.X.X.X, 35797, None)) as being decommissioning."} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:59:24.426Z", > "timezone":"UTC", "log":"Executor 2 is removed. Remove reason statistics: > (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, > unexpectedly
[jira] [Created] (SPARK-38158) Rdd and shuffle blocks not migrated to new executors when decommission feature is enabled
Mohan Patidar created SPARK-38158: - Summary: Rdd and shuffle blocks not migrated to new executors when decommission feature is enabled Key: SPARK-38158 URL: https://issues.apache.org/jira/browse/SPARK-38158 Project: Spark Issue Type: Bug Components: Block Manager, Kubernetes Affects Versions: 3.2.0 Environment: Spark on k8s Reporter: Mohan Patidar We’re using Spark 3.2.0 and we have enabled the spark decommission feature. As part of validating this feature, we wanted to check if the rdd blocks and shuffle blocks from the decommissioned executors are migrated to other executors. However, we could not see this happening. Below is the configuration we used. # *Spark Configuration used:* ** spark.local.dir /mnt/spark-ldir spark.decommission.enabled true spark.storage.decommission.enabled true spark.storage.decommission.rddBlocks.enabled true spark.storage.decommission.shuffleBlocks.enabled true spark.dynamicAllocation.enabled true # *Brought up spark-driver and executors on the different nodes.* NAME READY STATUS NODE decommission-driver 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-1 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-2 1/1 Running *Node2* gzip-compression-test-ae0b0b7e4d7fbe40-exec-3 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-4 1/1 Running *Node2* gzip-compression-test-ae0b0b7e4d7fbe40-exec-5 1/1 Running *Node1* # *Bring down Node2; the status of the pods is then as follows.* NAME READY STATUS NODE decommission-driver 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-1 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-2 1/1 Terminating *Node2* gzip-compression-test-ae0b0b7e4d7fbe40-exec-3 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-4 1/1 Terminating *Node2* gzip-compression-test-ae0b0b7e4d7fbe40-exec-5 1/1 Running *Node1* # *Driver logs:* {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.296Z", "timezone":"UTC", "log":"Adding decommission script to 
lifecycle"} {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.459Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"} {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.564Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"} {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.601Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"} {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.667Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"} {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.885Z", "timezone":"UTC", "log":"Notify executor 5 to decommissioning."} {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", "timezone":"UTC", "log":"Notify executor 1 to decommissioning."} {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", "timezone":"UTC", "log":"Notify executor 3 to decommissioning."} {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", "timezone":"UTC", "log":"Mark BlockManagers (BlockManagerId(5, X.X.X.X, 33359, None), BlockManagerId(1, X.X.X.X, 38655, None), BlockManagerId(3, X.X.X.X, 35797, None)) as being decommissioning."} {"type":"log", "level":"INFO", "time":"2022-01-12T08:59:24.426Z", "timezone":"UTC", "log":"Executor 2 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 1)."} {"type":"log", "level":"INFO", "time":"2022-01-12T08:59:24.426Z", "timezone":"UTC", "log":"Executor 4 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 2)."} # *Verified by exec'ing into all live executors (1, 3, 5) and checking the location (/mnt/spark-ldir/): only one blockManager id is present, and we do not see any other blockManager id copied to this location.* Example: $ kubectl exec -it gzip-compression-test-ae0b0b7e4d7fbe40-exec-1 -n test bash $cd
[jira] [Commented] (SPARK-36714) bugs in MinHashLSH
[ https://issues.apache.org/jira/browse/SPARK-36714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489258#comment-17489258 ] zhengruifeng commented on SPARK-36714: -- [~sheng_1992] Since you have investigated this issue, feel free to send a PR and ping me on it > bugs in MinHashLSH > --- > > Key: SPARK-36714 > URL: https://issues.apache.org/jira/browse/SPARK-36714 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: shengzhang >Priority: Minor > > This is about the MinHashLSH algorithm. > To compute the similarity of dataframes DFA and DFB, I used the MinHashLSH > approxSimilarityJoin function. But some data is missing from the result. > The example in the documentation works fine: > [https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance] > > But when the data is on a distributed system (Hive, more than one node), > some data goes missing. > For example, vector a = vector b, but the pair is not in the result of > approxSimilarityJoin, even though the > "threshold" is greater than 1. 
> I think maybe the problem is in these codes > {code:java} > // part1 > override protected[ml] def createRawLSHModel(inputDim: Int): MinHashLSHModel1 > = { > require(inputDim <= MinHashLSH.HASH_PRIME, > s"The input vector dimension $inputDim exceeds the threshold > ${MinHashLSH.HASH_PRIME}.") > val rand = new Random($(seed)) > val randCoefs: Array[(Int, Int)] = Array.fill($(numHashTables)) { > (1 + rand.nextInt(MinHashLSH.HASH_PRIME - 1), > rand.nextInt(MinHashLSH.HASH_PRIME - 1)) > } > new MinHashLSHModel1(uid, randCoefs) > } > // part2 > @Since("2.1.0") > override protected[ml] val hashFunction: Vector => Array[Vector] = { > elems: Vector => { > require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") > val elemsList = elems.toSparse.indices.toList > val hashValues = randCoefficients.map { case (a, b) => > elemsList.map { elem: Int => > ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME > }.min.toDouble > } > // TODO: Output vectors of dimension numHashFunctions in SPARK-18450 > hashValues.map(Vectors.dense(_)) > } > {code} > val r1 = new scala.util.Random(1) > r1.nextInt(1000) // -> 985 > val r2 = new scala.util.Random(2) > r2.nextInt(1000) // -> 108 > val r3 = new scala.util.Random(1) > r3.nextInt(1000) // -> 985, because seeded just as `r1` > r3.nextInt(1000) // -> 588 > The reason may be the above: when the Random is initialized only once, each call to > random.nextInt() returns a different value, like r3 (985 first, then 588). > So moving the line val rand = new Random($(seed)) from createRawLSHModel into > hashFunction would be better: every worker would then initialize its own Random > from the seed and compute the same coefficients. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
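The seeding behavior the reporter describes can be checked outside Spark. Here is a minimal Python analogue (Python's random.Random is a different PRNG than scala.util.Random, so the concrete values 985/108/588 do not carry over, but the seed semantics are identical):

```python
import random


def draws(seed: int, n: int) -> list:
    """Draw n ints in [0, 1000) from a generator freshly seeded with `seed`."""
    r = random.Random(seed)
    return [r.randrange(1000) for _ in range(n)]


a = draws(1, 2)   # like r1/r3 above: a fresh generator seeded with 1
b = draws(1, 2)
print(a == b)     # True: same seed, same sequence
# Within one sequence the draws keep advancing (the r3 985-then-588 situation),
# which is why re-seeding inside hashFunction would make every worker
# regenerate identical coefficients instead of consuming one shared stream.
```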
[jira] [Created] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode
Xinyi Yu created SPARK-38157: Summary: Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode Key: SPARK-38157 URL: https://issues.apache.org/jira/browse/SPARK-38157 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Xinyi Yu ThriftServerQueryTestSuite will fail on {{timestampNTZ/timestamp.sql}} when ANSI mode is on by default. This is because {{timestampNTZ/timestamp.sql}} is only expected to pass with ANSI off, according to its golden result file, but neither ThriftServerQueryTestSuite nor the timestamp.sql test overrides the default ANSI setting. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38156) Support CREATE EXTERNAL TABLE LIKE syntax
[ https://issues.apache.org/jira/browse/SPARK-38156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489250#comment-17489250 ] Apache Spark commented on SPARK-38156: -- User 'yeshengm' has created a pull request for this issue: https://github.com/apache/spark/pull/35456 > Support CREATE EXTERNAL TABLE LIKE syntax > - > > Key: SPARK-38156 > URL: https://issues.apache.org/jira/browse/SPARK-38156 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yesheng Ma >Priority: Major > > Spark already has the syntax of `CREATE TABLE LIKE`. It's intuitive for users > to say `CREATE EXTERNAL TABLE a LIKE b LOCATION 'path'`. However this syntax > is not supported in Spark right now and we should make these CREATE TABLE > DDLs consistent. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38156) Support CREATE EXTERNAL TABLE LIKE syntax
[ https://issues.apache.org/jira/browse/SPARK-38156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489249#comment-17489249 ] Apache Spark commented on SPARK-38156: -- User 'yeshengm' has created a pull request for this issue: https://github.com/apache/spark/pull/35456 > Support CREATE EXTERNAL TABLE LIKE syntax > - > > Key: SPARK-38156 > URL: https://issues.apache.org/jira/browse/SPARK-38156 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yesheng Ma >Priority: Major > > Spark already has the syntax of `CREATE TABLE LIKE`. It's intuitive for users > to say `CREATE EXTERNAL TABLE a LIKE b LOCATION 'path'`. However this syntax > is not supported in Spark right now and we should make these CREATE TABLE > DDLs consistent. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38156) Support CREATE EXTERNAL TABLE LIKE syntax
[ https://issues.apache.org/jira/browse/SPARK-38156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38156: Assignee: (was: Apache Spark) > Support CREATE EXTERNAL TABLE LIKE syntax > - > > Key: SPARK-38156 > URL: https://issues.apache.org/jira/browse/SPARK-38156 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yesheng Ma >Priority: Major > > Spark already has the syntax of `CREATE TABLE LIKE`. It's intuitive for users > to say `CREATE EXTERNAL TABLE a LIKE b LOCATION 'path'`. However this syntax > is not supported in Spark right now and we should make these CREATE TABLE > DDLs consistent. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38156) Support CREATE EXTERNAL TABLE LIKE syntax
[ https://issues.apache.org/jira/browse/SPARK-38156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38156: Assignee: Apache Spark > Support CREATE EXTERNAL TABLE LIKE syntax > - > > Key: SPARK-38156 > URL: https://issues.apache.org/jira/browse/SPARK-38156 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yesheng Ma >Assignee: Apache Spark >Priority: Major > > Spark already has the syntax of `CREATE TABLE LIKE`. It's intuitive for users > to say `CREATE EXTERNAL TABLE a LIKE b LOCATION 'path'`. However this syntax > is not supported in Spark right now and we should make these CREATE TABLE > DDLs consistent. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38156) Support CREATE EXTERNAL TABLE LIKE syntax
Yesheng Ma created SPARK-38156: -- Summary: Support CREATE EXTERNAL TABLE LIKE syntax Key: SPARK-38156 URL: https://issues.apache.org/jira/browse/SPARK-38156 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: Yesheng Ma Spark already has the syntax of `CREATE TABLE LIKE`. It's intuitive for users to say `CREATE EXTERNAL TABLE a LIKE b LOCATION 'path'`. However, this syntax is not currently supported in Spark, and we should make these CREATE TABLE DDLs consistent. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38155) Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates
[ https://issues.apache.org/jira/browse/SPARK-38155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-38155: - Description: Block lateral subqueries in CheckAnalysis that contains DISTINCT aggregate and correlated non-equality predicates. This can lead to incorrect results as DISTINCT will be rewritten as Aggregate during the optimization phase. For example {code:java} CREATE VIEW t1(c1, c2) AS VALUES (0, 1) CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2) SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1) {code} The correct results should be (0, 1, 2) but currently, it gives [(0, 1, 2), (0, 1, 2)]. was: Block lateral subqueries in CheckAnalysis that contains DISTINCT aggregate and correlated non-equality predicates. This can lead to incorrect results as DISTINCT will be rewritten as Aggregate during the optimization phase. For example CREATE VIEW t1(c1, c2) AS VALUES (0, 1) CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2) SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1) The correct results should be (0, 1, 2) but currently, it gives [(0, 1, 2), (0, 1, 2)] > Disallow distinct aggregate in lateral subqueries with unsupported correlated > predicates > > > Key: SPARK-38155 > URL: https://issues.apache.org/jira/browse/SPARK-38155 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Priority: Major > > Block lateral subqueries in CheckAnalysis that contains DISTINCT aggregate > and correlated non-equality predicates. This can lead to incorrect results as > DISTINCT will be rewritten as Aggregate during the optimization phase. > For example > > {code:java} > CREATE VIEW t1(c1, c2) AS VALUES (0, 1) > CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2) > SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1) > {code} > > The correct results should be (0, 1, 2) but currently, it gives [(0, 1, 2), > (0, 1, 2)]. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38155) Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates
Allison Wang created SPARK-38155: Summary: Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates Key: SPARK-38155 URL: https://issues.apache.org/jira/browse/SPARK-38155 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Allison Wang Block lateral subqueries in CheckAnalysis that contain a DISTINCT aggregate and correlated non-equality predicates. This can lead to incorrect results, as DISTINCT will be rewritten as an Aggregate during the optimization phase. For example CREATE VIEW t1(c1, c2) AS VALUES (0, 1) CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2) SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1) The correct result should be (0, 1, 2), but currently it gives [(0, 1, 2), (0, 1, 2)]. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
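The intended semantics can be modeled in plain Python: LATERAL evaluates the subquery once per t1 row, and DISTINCT applies within that per-row evaluation. This models the correct answer, not Spark's faulty rewrite:

```python
t1 = [(0, 1)]
t2 = [(1, 2), (2, 2)]

# LATERAL join: run the DISTINCT subquery once per row of t1
result = []
for (c1, c2) in t1:
    # SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1
    distinct_c2 = sorted({b2 for (b1, b2) in t2 if b1 > c1})
    for v in distinct_c2:
        result.append((c1, c2, v))

print(result)  # [(0, 1, 2)]: a single row, not the duplicated [(0, 1, 2), (0, 1, 2)]
```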
[jira] [Created] (SPARK-38154) Set up a new GA job to run tests with ANSI mode
Wenchen Fan created SPARK-38154: --- Summary: Set up a new GA job to run tests with ANSI mode Key: SPARK-38154 URL: https://issues.apache.org/jira/browse/SPARK-38154 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
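A minimal sketch of what such a GitHub Actions job could look like. Both the job layout and the SPARK_ANSI_SQL_MODE environment switch are assumptions here, not taken from Spark's actual .github/workflows files:

```yaml
# Hypothetical excerpt of a GA job that reruns SQL tests with ANSI mode on
ansi-sql-tests:
  runs-on: ubuntu-latest
  env:
    SPARK_ANSI_SQL_MODE: "true"   # assumed: flips the default of spark.sql.ansi.enabled
  steps:
    - uses: actions/checkout@v2
    - name: Run SQL tests with ANSI mode enabled
      run: ./build/sbt "sql/test"
```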
[jira] [Resolved] (SPARK-38149) Upgrade joda-time to 2.10.13
[ https://issues.apache.org/jira/browse/SPARK-38149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38149. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35452 [https://github.com/apache/spark/pull/35452] > Upgrade joda-time to 2.10.13 > > > Key: SPARK-38149 > URL: https://issues.apache.org/jira/browse/SPARK-38149 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0 > > > joda-time 2.10.13 was released, which supports the latest TZ database of > 2021e. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38147) Upgrade shapeless to 2.3.7
[ https://issues.apache.org/jira/browse/SPARK-38147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38147. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35450 [https://github.com/apache/spark/pull/35450] > Upgrade shapeless to 2.3.7 > -- > > Key: SPARK-38147 > URL: https://issues.apache.org/jira/browse/SPARK-38147 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37507) Add the TO_BINARY() function
[ https://issues.apache.org/jira/browse/SPARK-37507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37507: Assignee: (was: Apache Spark) > Add the TO_BINARY() function > > > Key: SPARK-37507 > URL: https://issues.apache.org/jira/browse/SPARK-37507 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > to_binary(expr, fmt) is a common function available in many other systems to > provide a unified entry for string to binary data conversion, where fmt can > be utf8, base64, hex and base2 (or whatever the reverse operation > to_char()supports). > [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] > [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] > [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] > [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] > Related Spark functions: unbase64, unhex -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37507) Add the TO_BINARY() function
[ https://issues.apache.org/jira/browse/SPARK-37507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489238#comment-17489238 ] Apache Spark commented on SPARK-37507: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/35415 > Add the TO_BINARY() function > > > Key: SPARK-37507 > URL: https://issues.apache.org/jira/browse/SPARK-37507 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > to_binary(expr, fmt) is a common function available in many other systems to > provide a unified entry for string to binary data conversion, where fmt can > be utf8, base64, hex and base2 (or whatever the reverse operation > to_char()supports). > [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] > [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] > [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] > [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] > Related Spark functions: unbase64, unhex -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37507) Add the TO_BINARY() function
[ https://issues.apache.org/jira/browse/SPARK-37507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37507: Assignee: Apache Spark > Add the TO_BINARY() function > > > Key: SPARK-37507 > URL: https://issues.apache.org/jira/browse/SPARK-37507 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > to_binary(expr, fmt) is a common function available in many other systems to > provide a unified entry for string to binary data conversion, where fmt can > be utf8, base64, hex and base2 (or whatever the reverse operation > to_char()supports). > [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] > [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] > [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] > [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] > Related Spark functions: unbase64, unhex -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
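As a sketch of what a unified to_binary(expr, fmt) entry point does, here is a Python stand-in built on the standard library. The function name and the format set mirror the proposal's description (utf8, hex, base64, base2), not Spark's final API:

```python
import base64
import binascii


def to_binary(expr: str, fmt: str = "hex") -> bytes:
    """Convert a string to bytes, dispatching on a format name."""
    fmt = fmt.lower()
    if fmt in ("utf8", "utf-8"):
        return expr.encode("utf-8")
    if fmt == "hex":
        return binascii.unhexlify(expr)      # reverse of hex encoding (cf. unhex)
    if fmt == "base64":
        return base64.b64decode(expr)        # reverse of base64 encoding (cf. unbase64)
    if fmt == "base2":
        # interpret the string as a sequence of 8-bit groups
        return bytes(int(expr[8 * i: 8 * i + 8], 2) for i in range(len(expr) // 8))
    raise ValueError(f"unsupported format: {fmt}")


print(to_binary("Spark", "utf8"))      # b'Spark'
print(to_binary("537061726b", "hex"))  # b'Spark'
```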
[jira] [Assigned] (SPARK-38153) Remove option newlines.topLevelStatements in scalafmt.conf
[ https://issues.apache.org/jira/browse/SPARK-38153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38153: Assignee: Apache Spark (was: Gengliang Wang) > Remove option newlines.topLevelStatements in scalafmt.conf > -- > > Key: SPARK-38153 > URL: https://issues.apache.org/jira/browse/SPARK-38153 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > The configuration > newlines.topLevelStatements = [before,after] > is to add blank line before the first member or after the last member of the > class. > This is neither encouraged nor discouraged as per > https://github.com/databricks/scala-style-guide#blanklines -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38153) Remove option newlines.topLevelStatements in scalafmt.conf
[ https://issues.apache.org/jira/browse/SPARK-38153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38153: Assignee: Gengliang Wang (was: Apache Spark) > Remove option newlines.topLevelStatements in scalafmt.conf > -- > > Key: SPARK-38153 > URL: https://issues.apache.org/jira/browse/SPARK-38153 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > The configuration > newlines.topLevelStatements = [before,after] > is to add blank line before the first member or after the last member of the > class. > This is neither encouraged nor discouraged as per > https://github.com/databricks/scala-style-guide#blanklines -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38153) Remove option newlines.topLevelStatements in scalafmt.conf
[ https://issues.apache.org/jira/browse/SPARK-38153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489236#comment-17489236 ] Apache Spark commented on SPARK-38153: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/35455 > Remove option newlines.topLevelStatements in scalafmt.conf > -- > > Key: SPARK-38153 > URL: https://issues.apache.org/jira/browse/SPARK-38153 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > The configuration > newlines.topLevelStatements = [before,after] > is to add blank line before the first member or after the last member of the > class. > This is neither encouraged nor discouraged as per > https://github.com/databricks/scala-style-guide#blanklines -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38153) Remove option newlines.topLevelStatements in scalafmt.conf
Gengliang Wang created SPARK-38153: -- Summary: Remove option newlines.topLevelStatements in scalafmt.conf Key: SPARK-38153 URL: https://issues.apache.org/jira/browse/SPARK-38153 Project: Spark Issue Type: Task Components: Build Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang The configuration newlines.topLevelStatements = [before,after] adds a blank line before the first member and after the last member of a class. This is neither encouraged nor discouraged as per https://github.com/databricks/scala-style-guide#blanklines -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
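For reference, the entry being removed looks like this in a scalafmt config (an illustrative sketch; see the Spark repo for the actual file and surrounding keys). Deleting the key leaves blank-line placement around first/last class members to the author:

```hocon
# scalafmt: forces a blank line before the first and after the last member of a class
newlines.topLevelStatements = [before, after]
```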
[jira] [Commented] (SPARK-31007) KMeans optimization based on triangle-inequality
[ https://issues.apache.org/jira/browse/SPARK-31007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489235#comment-17489235 ] Sean R. Owen commented on SPARK-31007: -- Hm, can you use a simpler "unpacked" representation for those statistics? It's not the memory per se, but the size of a single array. Instead, an array of arrays? > KMeans optimization based on triangle-inequality > > > Key: SPARK-31007 > URL: https://issues.apache.org/jira/browse/SPARK-31007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.1.0 > > Attachments: ICML03-022.pdf > > > In the current impl, the following Lemma is used in KMeans: > 0, Let x be a point, let c be a center and o be the origin; then d(x,c) >= > |(d(x,o) - d(c,o))| = |norm(x)-norm(c)| > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}} > According to [Using the Triangle Inequality to Accelerate > K-Means|https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf], we can go > further, and there are another two Lemmas that can be used: > 1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then > d(x,c) >= d(x,b); > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}. > However, luckily for CosineDistance we can get a variant in the space of > radian/angle. > 2, Let x be a point, and let b and c be centers. Then d(x,c) >= max\{0, > d(x,b)-d(b,c)}; > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}} > The application of Lemma 2 is a little complex: it needs to cache/update the > distance/lower bounds to previous centers, and thus can only be applied in > training, not in prediction. > So this ticket is mainly for Lemma 1. 
Its idea is quite simple: if point x is > close enough to center b (within a pre-computed radius), then we can say > point x belongs to center b without computing the distances between x and > other centers. It can be used in both training and prediction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
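Lemma 1 above follows directly from the triangle inequality: d(b,c) <= d(x,b) + d(x,c), so d(x,c) >= d(b,c) - d(x,b) >= 2 d(x,b) - d(x,b) = d(x,b). A quick numeric spot-check on random points (illustrative only, not Spark's implementation):

```python
import math
import random

rng = random.Random(0)


def rand_point():
    return [rng.uniform(-1.0, 1.0) for _ in range(3)]


checked = 0
for _ in range(10_000):
    x, b, c = rand_point(), rand_point(), rand_point()
    if math.dist(b, c) >= 2 * math.dist(x, b):
        # Lemma 1: b is guaranteed at least as close to x as c is,
        # so d(x, c) never needs to be computed during assignment.
        assert math.dist(x, c) >= math.dist(x, b) - 1e-12
        checked += 1
print(f"lemma 1 held on {checked} qualifying triples")
```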
[jira] [Assigned] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38145: Assignee: (was: Apache Spark) > Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True > > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests are failed when the `spark.sql.ansi.enabled` is set to True. > For stability of project, it's always good to test passed regardless of the > option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489234#comment-17489234 ] Apache Spark commented on SPARK-38145: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/35454 > Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True > > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests are failed when the `spark.sql.ansi.enabled` is set to True. > For stability of project, it's always good to test passed regardless of the > option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38145: Assignee: Apache Spark > Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True > > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests are failed when the `spark.sql.ansi.enabled` is set to True. > For stability of project, it's always good to test passed regardless of the > option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38086) Make ArrowColumnVector Extendable
[ https://issues.apache.org/jira/browse/SPARK-38086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38086: - Assignee: Kazuyuki Tanimura > Make ArrowColumnVector Extendable > - > > Key: SPARK-38086 > URL: https://issues.apache.org/jira/browse/SPARK-38086 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Minor > > Some Spark extension libraries need to extend ArrowColumnVector.java. For > now, it is impossible as ArrowColumnVector class is final and the accessors > are all private. > For example, Rapids copies the entire ArrowColumnVector class in order to > work around the issue > [https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java] > Proposing to relax private/final restrictions to make ArrowColumnVector > extendable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38086) Make ArrowColumnVector Extendable
[ https://issues.apache.org/jira/browse/SPARK-38086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38086. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35393 [https://github.com/apache/spark/pull/35393] > Make ArrowColumnVector Extendable > - > > Key: SPARK-38086 > URL: https://issues.apache.org/jira/browse/SPARK-38086 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Minor > Fix For: 3.3.0 > > > Some Spark extension libraries need to extend ArrowColumnVector.java. For > now, it is impossible as ArrowColumnVector class is final and the accessors > are all private. > For example, Rapids copies the entire ArrowColumnVector class in order to > work around the issue > [https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java] > Proposing to relax private/final restrictions to make ArrowColumnVector > extendable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36714) bugs in MinHashLSH
[ https://issues.apache.org/jira/browse/SPARK-36714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489232#comment-17489232 ] zhengruifeng commented on SPARK-36714: -- Could you please provide a simple script to reproduce this issue? > bugs in MinHashLSH > --- > > Key: SPARK-36714 > URL: https://issues.apache.org/jira/browse/SPARK-36714 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: shengzhang >Priority: Minor > > This is about the MinHashLSH algorithm. > To compute the similarity of dataframes DFA and DFB, I used the MinHashLSH > approxSimilarityJoin function. But some data is missing from the result. > The example in the documentation works fine: > [https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance] > > But when the data is on a distributed system (Hive, more than one node), > some data goes missing. > For example, vector a = vector b, but the pair is not in the result of > approxSimilarityJoin, even though the > "threshold" is greater than 1. 
> I think maybe the problem is in these codes > {code:java} > // part1 > override protected[ml] def createRawLSHModel(inputDim: Int): MinHashLSHModel1 > = { > require(inputDim <= MinHashLSH.HASH_PRIME, > s"The input vector dimension $inputDim exceeds the threshold > ${MinHashLSH.HASH_PRIME}.") > val rand = new Random($(seed)) > val randCoefs: Array[(Int, Int)] = Array.fill($(numHashTables)) { > (1 + rand.nextInt(MinHashLSH.HASH_PRIME - 1), > rand.nextInt(MinHashLSH.HASH_PRIME - 1)) > } > new MinHashLSHModel1(uid, randCoefs) > } > // part2 > @Since("2.1.0") > override protected[ml] val hashFunction: Vector => Array[Vector] = { > elems: Vector => { > require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") > val elemsList = elems.toSparse.indices.toList > val hashValues = randCoefficients.map { case (a, b) => > elemsList.map { elem: Int => > ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME > }.min.toDouble > } > // TODO: Output vectors of dimension numHashFunctions in SPARK-18450 > hashValues.map(Vectors.dense(_)) > } > {code} > val r1 = new scala.util.Random(1) > r1.nextInt(1000) // -> 985 > val r2 = new scala.util.Random(2) > r2.nextInt(1000) // -> 108 > val r3 = new scala.util.Random(1) > r3.nextInt(1000) // -> 985, because seeded just as `r1` > r3.nextInt(1000) // -> 588 > The reason may be the above: when the Random is initialized only once, each call to > random.nextInt() returns a different value, like r3 (985 first, then 588). > So moving the line val rand = new Random($(seed)) from createRawLSHModel into > hashFunction would be better: every worker would then initialize its own Random > from the seed and compute the same coefficients. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-38145: Summary: Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True (was: Address the tests when "spark.sql.ansi.enabled" is True) > Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True > > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests fail when `spark.sql.ansi.enabled` is set to True. > For the stability of the project, tests should pass regardless of the > option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31007) KMeans optimization based on triangle-inequality
[ https://issues.apache.org/jira/browse/SPARK-31007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489230#comment-17489230 ] zhengruifeng commented on SPARK-31007: -- [~srowen] This optimization needs an array of size k * (k + 1) / 2 (val packedValues = Array.ofDim[Double](k * (k + 1) / 2)), which may cause failures when k is large. https://issues.apache.org/jira/browse/SPARK-36553 k=5 Should we: 1, revert this optimization; or 2, enable it only for small k? (but the impl would be more complex) > KMeans optimization based on triangle-inequality > > > Key: SPARK-31007 > URL: https://issues.apache.org/jira/browse/SPARK-31007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.1.0 > > Attachments: ICML03-022.pdf > > > In the current impl, the following lemma is used in KMeans: > 0, Let x be a point, let c be a center and o be the origin; then d(x,c) >= > |d(x,o) - d(c,o)| = |norm(x)-norm(c)| > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}} > According to [Using the Triangle Inequality to Accelerate > K-Means|https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf], we can go > further, and there are another two lemmas that can be used: > 1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then > d(x,c) >= d(x,b); > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}. > However, luckily for CosineDistance we can get a variant in the space of > radian/angle. > 2, Let x be a point, and let b and c be centers. Then d(x,c) >= max\{0, > d(x,b)-d(b,c)}; > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}} > The application of Lemma 2 is a little complex: it needs to cache/update the > distances/lower bounds to previous centers, and thus can only be applied in > training, not in prediction. > So this ticket is mainly for Lemma 1. 
Its idea is quite simple: if point x is > close enough to center b (closer than a pre-computed radius), then we can assign > point x to center b without computing the distances between x and the > other centers. It can be used in both training and prediction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
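As a hedged illustration of Lemma 1 (plain Python over 2-D points, not Spark's implementation; Spark precomputes the packed center-to-center distances once per iteration, whereas this sketch recomputes them for simplicity):

```python
import math

def dist(p, q):
    # Euclidean distance between 2-D points.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def closest_center(x, centers):
    """Assign x to its closest center, using Lemma 1 to prune candidates:
    if d(best, c) >= 2 * d(x, best), then d(x, c) >= d(x, best),
    so c can never beat the current best and dist(x, c) is never computed."""
    best, best_d = None, math.inf
    for c in centers:
        # In Spark this check is a lookup into the precomputed
        # center-to-center array, not a fresh distance computation.
        if best is not None and dist(best, c) >= 2 * best_d:
            continue  # pruned without computing dist(x, c)
        d = dist(x, c)
        if d < best_d:
            best, best_d = c, d
    return best, best_d

print(closest_center((1, 0), [(0, 0), (10, 0), (20, 0)]))  # -> ((0, 0), 1.0)
```

In the example, once (0, 0) is the best center at distance 1, both (10, 0) and (20, 0) are rejected by the 2*d(x, b) test alone.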
[jira] [Created] (SPARK-38150) Update comment of RelationConversions
angerszhu created SPARK-38150: - Summary: Update comment of RelationConversions Key: SPARK-38150 URL: https://issues.apache.org/jira/browse/SPARK-38150 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.1, 3.2.0 Reporter: angerszhu Fix For: 3.3.0 Current comment of RelationConversions is not correct -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38037) Spark MLlib FPGrowth not working with 40+ items in Frequent Item set
[ https://issues.apache.org/jira/browse/SPARK-38037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489225#comment-17489225 ] zhengruifeng commented on SPARK-38037: -- could you please provide a simple script to reproduce this issue? > Spark MLlib FPGrowth not working with 40+ items in Frequent Item set > > > Key: SPARK-38037 > URL: https://issues.apache.org/jira/browse/SPARK-38037 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.2.0 > Environment: Standalone Linux server > 32 GB RAM > 4 core > >Reporter: RJ >Priority: Major > > We have been using Spark FPGrowth and it works well with millions of > transactions (records) when the number of frequent items in the Frequent Itemset is > less than 25. Beyond 25 it runs into a computational limit. For 40+ items in > the Frequent Itemset the process never returns. > To reproduce, you can create a simple data set of 3 transactions with equal > items (40 of them) and run FPGrowth with 0.9 support; the process never > completes. Below is the sample data I have used to narrow down the problem: > |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40| > |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40| > |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40| > > While the computation grows as 2^n - 1 with each item in the Frequent Itemset, it > surely should be able to handle 40 or more items in a Frequent Itemset > > Is this an FPGrowth implementation limitation, or > are there any tuning parameters that I am missing? Thank you. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
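The 2^n - 1 growth the reporter mentions is easy to confirm: when n items are frequent together, every non-empty subset is itself a frequent itemset, so at n = 40 the output alone is astronomically large regardless of the algorithm. A quick Python check (illustrative only, independent of Spark):

```python
from itertools import combinations

def frequent_itemset_count(n):
    # Every non-empty subset of an n-item frequent itemset is frequent.
    return 2 ** n - 1

# Cross-check the closed form against explicit enumeration for a small n.
n = 5
enumerated = sum(1 for k in range(1, n + 1)
                 for _ in combinations(range(n), k))
assert enumerated == frequent_itemset_count(n)  # 31

print(frequent_itemset_count(25))  # 33554431 (~3.4e7): still feasible
print(frequent_itemset_count(40))  # ~1.1e12: materializing this output hangs
```

So the observed hang is less a bug than a consequence of asking for roughly a trillion itemsets; raising the support threshold or reducing the overlap between transactions is the practical way out.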
[jira] [Updated] (SPARK-38152) Upgrade Jetty version from 9.4.40 to 9.4.44
[ https://issues.apache.org/jira/browse/SPARK-38152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-38152: --- Description: Upgrade Jetty version to 9.4.44 in Spark 2.4.8 and 3.0.3 in order to fix vulnerability CVE-2021-28169, CVE-2021-34429, PRISMA-2021-0182 (was: Upgrade Jetty version to 9.4.44 in Spark 2.4.8 and 3.0.3 inn order to fix vulnerability CVE-2021-28169, CVE-2021-34429, PRISMA-2021-0182) > Upgrade Jetty version from 9.4.40 to 9.4.44 > --- > > Key: SPARK-38152 > URL: https://issues.apache.org/jira/browse/SPARK-38152 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.8, 3.0.3 >Reporter: SHOBHIT SHUKLA >Priority: Major > > Upgrade Jetty version to 9.4.44 in Spark 2.4.8 and 3.0.3 in order to fix > vulnerability CVE-2021-28169, CVE-2021-34429, PRISMA-2021-0182 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38152) Upgrade Jetty version from 9.4.40 to 9.4.44
SHOBHIT SHUKLA created SPARK-38152: -- Summary: Upgrade Jetty version from 9.4.40 to 9.4.44 Key: SPARK-38152 URL: https://issues.apache.org/jira/browse/SPARK-38152 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.3, 2.4.8 Reporter: SHOBHIT SHUKLA Upgrade Jetty version to 9.4.44 in Spark 2.4.8 and 3.0.3 in order to fix vulnerability CVE-2021-28169, CVE-2021-34429, PRISMA-2021-0182 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38150) Update comment of RelationConversions
[ https://issues.apache.org/jira/browse/SPARK-38150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489219#comment-17489219 ] Apache Spark commented on SPARK-38150: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/35453 > Update comment of RelationConversions > - > > Key: SPARK-38150 > URL: https://issues.apache.org/jira/browse/SPARK-38150 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Current comment of RelationConversions is not correct -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38150) Update comment of RelationConversions
[ https://issues.apache.org/jira/browse/SPARK-38150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38150: Assignee: (was: Apache Spark) > Update comment of RelationConversions > - > > Key: SPARK-38150 > URL: https://issues.apache.org/jira/browse/SPARK-38150 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Current comment of RelationConversions is not correct -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38150) Update comment of RelationConversions
[ https://issues.apache.org/jira/browse/SPARK-38150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38150: Assignee: Apache Spark > Update comment of RelationConversions > - > > Key: SPARK-38150 > URL: https://issues.apache.org/jira/browse/SPARK-38150 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > > Current comment of RelationConversions is not correct -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38030: -- Fix Version/s: 3.1.3 > Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1 > - > > Key: SPARK-38030 > URL: https://issues.apache.org/jira/browse/SPARK-38030 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: Shardul Mahadik >Assignee: Shardul Mahadik >Priority: Major > Fix For: 3.1.3, 3.3.0, 3.2.2 > > > One of our user queries failed in Spark 3.1.1 when using AQE with the > following stacktrace mentioned below (some parts of the plan have been > redacted, but the structure is preserved). > Debugging this issue, we found that the failure was within AQE calling > [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402]. > The query contains a cast over a column with non-nullable struct fields. > Canonicalization [removes nullability > information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45] > from the child {{AttributeReference}} of the Cast, however it does not > remove nullability information from the Cast's target dataType. This causes > the > [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290] > to return false because the child is now nullable and cast target data type > is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}. 
> {code:java} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: > Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232] > +- Union >:- Project [cast(columnA#30) as struct<...>] >: +- BatchScan[columnA#30] hive.tbl >+- Project [cast(columnA#35) as struct<...>] > +- BatchScan[columnA#35] hive.tbl > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464) > at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87) > at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) > at > 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279) > at
[jira] [Updated] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38030: -- Fix Version/s: 3.2.2 > Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1 > - > > Key: SPARK-38030 > URL: https://issues.apache.org/jira/browse/SPARK-38030 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: Shardul Mahadik >Assignee: Shardul Mahadik >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > One of our user queries failed in Spark 3.1.1 when using AQE with the > following stacktrace mentioned below (some parts of the plan have been > redacted, but the structure is preserved). > Debugging this issue, we found that the failure was within AQE calling > [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402]. > The query contains a cast over a column with non-nullable struct fields. > Canonicalization [removes nullability > information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45] > from the child {{AttributeReference}} of the Cast, however it does not > remove nullability information from the Cast's target dataType. This causes > the > [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290] > to return false because the child is now nullable and cast target data type > is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}. 
> {code:java} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: > Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232] > +- Union >:- Project [cast(columnA#30) as struct<...>] >: +- BatchScan[columnA#30] hive.tbl >+- Project [cast(columnA#35) as struct<...>] > +- BatchScan[columnA#35] hive.tbl > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464) > at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87) > at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) > at > 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279) > at
[jira] [Updated] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
[ https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38151: -- Description: - https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true {code} [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds) [info] 9131 did not equal 9130 Round trip of 9130 did not work in tz Pacific/Kanton (DateTimeUtilsSuite.scala:783) {code} was: - https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true {code} [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds) 19042 [info] 9131 did not equal 9130 Round trip of 9130 did not work in tz Pacific/Kanton (DateTimeUtilsSuite.scala:783) {code} > Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays` > -- > > Key: SPARK-38151 > URL: https://issues.apache.org/jira/browse/SPARK-38151 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true > {code} > [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds) > [info] 9131 did not equal 9130 Round trip of 9130 did not work in tz > Pacific/Kanton (DateTimeUtilsSuite.scala:783) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
Dongjoon Hyun created SPARK-38151: - Summary: Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays` Key: SPARK-38151 URL: https://issues.apache.org/jira/browse/SPARK-38151 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.3.0 Reporter: Dongjoon Hyun - https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true {code} [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds) 19042 [info] 9131 did not equal 9130 Round trip of 9130 did not work in tz Pacific/Kanton (DateTimeUtilsSuite.scala:783) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38149) Upgrade joda-time to 2.10.13
[ https://issues.apache.org/jira/browse/SPARK-38149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38149: Assignee: Apache Spark (was: Kousuke Saruta) > Upgrade joda-time to 2.10.13 > > > Key: SPARK-38149 > URL: https://issues.apache.org/jira/browse/SPARK-38149 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > joda-time 2.10.13 was released, which supports the latest TZ database of > 2021e. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38149) Upgrade joda-time to 2.10.13
[ https://issues.apache.org/jira/browse/SPARK-38149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489214#comment-17489214 ] Apache Spark commented on SPARK-38149: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/35452 > Upgrade joda-time to 2.10.13 > > > Key: SPARK-38149 > URL: https://issues.apache.org/jira/browse/SPARK-38149 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > joda-time 2.10.13 was released, which supports the latest TZ database of > 2021e. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38149) Upgrade joda-time to 2.10.13
[ https://issues.apache.org/jira/browse/SPARK-38149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38149: Assignee: Kousuke Saruta (was: Apache Spark) > Upgrade joda-time to 2.10.13 > > > Key: SPARK-38149 > URL: https://issues.apache.org/jira/browse/SPARK-38149 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > joda-time 2.10.13 was released, which supports the latest TZ database of > 2021e. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38149) Upgrade joda-time to 2.10.13
Kousuke Saruta created SPARK-38149: -- Summary: Upgrade joda-time to 2.10.13 Key: SPARK-38149 URL: https://issues.apache.org/jira/browse/SPARK-38149 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta joda-time 2.10.13 was released, which supports the latest TZ database of 2021e. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38148) Do not add dynamic partition pruning if there exists static partition pruning
[ https://issues.apache.org/jira/browse/SPARK-38148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489212#comment-17489212 ] Apache Spark commented on SPARK-38148: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/35451 > Do not add dynamic partition pruning if there exists static partition pruning > - > > Key: SPARK-38148 > URL: https://issues.apache.org/jira/browse/SPARK-38148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > Dynamic partition pruning adds a filter whenever the join condition contains > partition columns. But if another condition already provides static partition > pruning, it's unnecessary to add an extra dynamic partition > pruning filter. > For example: > {code:java} > CREATE TABLE t1 (c1 int) USING PARQUET PARTITIONED BY (p1 string); > CREATE TABLE t2 (c2 int) USING PARQUET PARTITIONED BY (p2 string); > SELECT * FROM t1 JOIN t2 ON t1.p1 = t2.p2 and t1.p1 = 'a' AND t2.c2 > 0; > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38148) Do not add dynamic partition pruning if there exists static partition pruning
[ https://issues.apache.org/jira/browse/SPARK-38148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38148: Assignee: (was: Apache Spark) > Do not add dynamic partition pruning if there exists static partition pruning > - > > Key: SPARK-38148 > URL: https://issues.apache.org/jira/browse/SPARK-38148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > Dynamic partition pruning adds a filter whenever the join condition contains > partition columns. But if another condition already provides static partition > pruning, it's unnecessary to add an extra dynamic partition > pruning filter. > For example: > {code:java} > CREATE TABLE t1 (c1 int) USING PARQUET PARTITIONED BY (p1 string); > CREATE TABLE t2 (c2 int) USING PARQUET PARTITIONED BY (p2 string); > SELECT * FROM t1 JOIN t2 ON t1.p1 = t2.p2 and t1.p1 = 'a' AND t2.c2 > 0; > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38148) Do not add dynamic partition pruning if there exists static partition pruning
[ https://issues.apache.org/jira/browse/SPARK-38148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489211#comment-17489211 ] Apache Spark commented on SPARK-38148: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/35451 > Do not add dynamic partition pruning if there exists static partition pruning > - > > Key: SPARK-38148 > URL: https://issues.apache.org/jira/browse/SPARK-38148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > Dynamic partition pruning adds a filter whenever the join condition contains > partition columns. But if another condition already provides static partition > pruning, it's unnecessary to add an extra dynamic partition > pruning filter. > For example: > {code:java} > CREATE TABLE t1 (c1 int) USING PARQUET PARTITIONED BY (p1 string); > CREATE TABLE t2 (c2 int) USING PARQUET PARTITIONED BY (p2 string); > SELECT * FROM t1 JOIN t2 ON t1.p1 = t2.p2 and t1.p1 = 'a' AND t2.c2 > 0; > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38147) Upgrade shapeless to 2.3.7
[ https://issues.apache.org/jira/browse/SPARK-38147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38147: - Assignee: Dongjoon Hyun > Upgrade shapeless to 2.3.7 > -- > > Key: SPARK-38147 > URL: https://issues.apache.org/jira/browse/SPARK-38147 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38148) Do not add dynamic partition pruning if there exists static partition pruning
[ https://issues.apache.org/jira/browse/SPARK-38148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38148: Assignee: Apache Spark > Do not add dynamic partition pruning if there exists static partition pruning > - > > Key: SPARK-38148 > URL: https://issues.apache.org/jira/browse/SPARK-38148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > Dynamic partition pruning adds a filter whenever the join condition contains > partition columns. But if another condition already provides static partition > pruning, it's unnecessary to add an extra dynamic partition > pruning filter. > For example: > {code:java} > CREATE TABLE t1 (c1 int) USING PARQUET PARTITIONED BY (p1 string); > CREATE TABLE t2 (c2 int) USING PARQUET PARTITIONED BY (p2 string); > SELECT * FROM t1 JOIN t2 ON t1.p1 = t2.p2 and t1.p1 = 'a' AND t2.c2 > 0; > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38147) Upgrade shapeless to 2.3.7
[ https://issues.apache.org/jira/browse/SPARK-38147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38147: Assignee: Apache Spark > Upgrade shapeless to 2.3.7 > -- > > Key: SPARK-38147 > URL: https://issues.apache.org/jira/browse/SPARK-38147 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38147) Upgrade shapeless to 2.3.7
[ https://issues.apache.org/jira/browse/SPARK-38147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38147: Assignee: (was: Apache Spark) > Upgrade shapeless to 2.3.7 > -- > > Key: SPARK-38147 > URL: https://issues.apache.org/jira/browse/SPARK-38147 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38147) Upgrade shapeless to 2.3.7
[ https://issues.apache.org/jira/browse/SPARK-38147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489210#comment-17489210 ] Apache Spark commented on SPARK-38147: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/35450 > Upgrade shapeless to 2.3.7 > -- > > Key: SPARK-38147 > URL: https://issues.apache.org/jira/browse/SPARK-38147 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38148) Do not add dynamic partition pruning if there exists static partition pruning
XiDuo You created SPARK-38148: - Summary: Do not add dynamic partition pruning if there exists static partition pruning Key: SPARK-38148 URL: https://issues.apache.org/jira/browse/SPARK-38148 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You Dynamic partition pruning adds a filter as long as the join condition contains partition columns. But if another condition already provides static partition pruning, it's unnecessary to add an extra dynamic partition pruning filter. For example: {code:java} CREATE TABLE t1 (c1 int) USING PARQUET PARTITIONED BY (p1 string); CREATE TABLE t2 (c2 int) USING PARQUET PARTITIONED BY (p2 string); SELECT * FROM t1 JOIN t2 ON t1.p1 = t2.p2 and t1.p1 = 'a' AND t2.c2 > 0; {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38147) Upgrade shapeless to 2.3.7
Dongjoon Hyun created SPARK-38147: - Summary: Upgrade shapeless to 2.3.7 Key: SPARK-38147 URL: https://issues.apache.org/jira/browse/SPARK-38147 Project: Spark Issue Type: Improvement Components: Build, MLlib Affects Versions: 3.3.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38146) UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column
[ https://issues.apache.org/jira/browse/SPARK-38146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489207#comment-17489207 ] Bruce Robbins edited comment on SPARK-38146 at 2/9/22, 2:23 AM: This affects master only and has a simple fix: {{BufferSetterGetterUtils}} needs case statements for {{TimestampNTZType}} that replicates what is done for {{{}TimestampType{}}}. Edit: Also, the filtering of {{TimestampNTZType}} [here in AggregationQuerySuite|https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala#L901] should be removed so that {{TimestampNTZType}} gets tested. was (Author: bersprockets): This affects master only and has a simple fix: {{BufferSetterGetterUtils}} needs case statements for {{TimestampNTZType}} that replicates what is done for {{{}TimestampType{}}}. > UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column > - > > Key: SPARK-38146 > URL: https://issues.apache.org/jira/browse/SPARK-38146 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Bruce Robbins >Priority: Major > > When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, > Spark throws the error: > {noformat} > 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.UnsupportedOperationException: null > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272) > 
~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46) > ~[scala-library.jar:?] > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > ~[scala-library.jar:?] > at > $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45) > ~[scala-library.jar:?] > at > org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197) > ~[spark-sql_2.12-3.3.0-SNAPSHO > {noformat} > This is because {{BufferSetterGetterUtils#createSetters}} does not have a > case statement for {{TimestampNTZType}}, so it generates a function that > tries to call {{UnsafeRow.update}}, which throws an > {{UnsupportedOperationException}}. > This reproduction example is mostly taken from {{AggregationQuerySuite}}: > {noformat} > import org.apache.spark.sql.expressions.{MutableAggregationBuffer, > UserDefinedAggregateFunction} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > class ScalaAggregateFunction(schema: StructType) extends > UserDefinedAggregateFunction { > def inputSchema: StructType = schema > def bufferSchema: StructType = schema > def dataType: DataType = schema > def deterministic: Boolean = true > def initialize(buffer: MutableAggregationBuffer): Unit = { > (0 until schema.length).foreach { i => > buffer.update(i, null) > } > } > def update(buffer: MutableAggregationBuffer, input: Row): Unit = { > if (!input.isNullAt(0) && input.getInt(0) == 50) { > (0 until schema.length).foreach { i => > buffer.update(i, input.get(i)) > } > } > } > def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { > if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) { > (0 until schema.length).foreach { i => > buffer1.update(i, 
buffer2.get(i)) > } > } > } > def evaluate(buffer: Row): Any = { > Row.fromSeq(buffer.toSeq) > } > } > import scala.util.Random > import java.time.LocalDateTime > val r = new Random(65676563L) > val data = Seq.tabulate(50) { x => > Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, > LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1)) > } > val schema = StructType.fromDDL("id int, col1 double, col2 bigint, col3 > timestamp_ntz") > val rdd = spark.sparkContext.parallelize(data, 1) > val df = spark.createDataFrame(rdd, schema) > val udaf = new ScalaAggregateFunction(df.schema) > val allColumns
[jira] [Commented] (SPARK-38146) UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column
[ https://issues.apache.org/jira/browse/SPARK-38146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489207#comment-17489207 ] Bruce Robbins commented on SPARK-38146: --- This affects master only and has a simple fix: {{BufferSetterGetterUtils}} needs case statements for {{TimestampNTZType}} that replicates what is done for {{{}TimestampType{}}}. > UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column > - > > Key: SPARK-38146 > URL: https://issues.apache.org/jira/browse/SPARK-38146 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Bruce Robbins >Priority: Major > > When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, > Spark throws the error: > {noformat} > 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.UnsupportedOperationException: null > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46) > ~[scala-library.jar:?] > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > ~[scala-library.jar:?] > at > $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45) > ~[scala-library.jar:?] 
> at > org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197) > ~[spark-sql_2.12-3.3.0-SNAPSHO > {noformat} > This is because {{BufferSetterGetterUtils#createSetters}} does not have a > case statement for {{TimestampNTZType}}, so it generates a function that > tries to call {{UnsafeRow.update}}, which throws an > {{UnsupportedOperationException}}. > This reproduction example is mostly taken from {{AggregationQuerySuite}}: > {noformat} > import org.apache.spark.sql.expressions.{MutableAggregationBuffer, > UserDefinedAggregateFunction} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > class ScalaAggregateFunction(schema: StructType) extends > UserDefinedAggregateFunction { > def inputSchema: StructType = schema > def bufferSchema: StructType = schema > def dataType: DataType = schema > def deterministic: Boolean = true > def initialize(buffer: MutableAggregationBuffer): Unit = { > (0 until schema.length).foreach { i => > buffer.update(i, null) > } > } > def update(buffer: MutableAggregationBuffer, input: Row): Unit = { > if (!input.isNullAt(0) && input.getInt(0) == 50) { > (0 until schema.length).foreach { i => > buffer.update(i, input.get(i)) > } > } > } > def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { > if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) { > (0 until schema.length).foreach { i => > buffer1.update(i, buffer2.get(i)) > } > } > } > def evaluate(buffer: Row): Any = { > Row.fromSeq(buffer.toSeq) > } > } > import scala.util.Random > import java.time.LocalDateTime > val r = new Random(65676563L) > val data = Seq.tabulate(50) { x => > Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, > LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1)) > } > val schema = StructType.fromDDL("id int, col1 
double, col2 bigint, col3 > timestamp_ntz") > val rdd = spark.sparkContext.parallelize(data, 1) > val df = spark.createDataFrame(rdd, schema) > val udaf = new ScalaAggregateFunction(df.schema) > val allColumns = df.schema.fields.map(f => col(f.name)) > df.groupBy().agg(udaf(allColumns: _*)).show(false) > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37414) Inline type hints for python/pyspark/ml/tuning.py
[ https://issues.apache.org/jira/browse/SPARK-37414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-37414: -- Assignee: Maciej Szymkiewicz > Inline type hints for python/pyspark/ml/tuning.py > - > > Key: SPARK-37414 > URL: https://issues.apache.org/jira/browse/SPARK-37414 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Inline type hints from python/pyspark/ml/tuning.pyi to > python/pyspark/ml/tuning.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37414) Inline type hints for python/pyspark/ml/tuning.py
[ https://issues.apache.org/jira/browse/SPARK-37414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-37414. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35406 [https://github.com/apache/spark/pull/35406] > Inline type hints for python/pyspark/ml/tuning.py > - > > Key: SPARK-37414 > URL: https://issues.apache.org/jira/browse/SPARK-37414 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Inline type hints from python/pyspark/ml/tuning.pyi to > python/pyspark/ml/tuning.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38146) UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column
Bruce Robbins created SPARK-38146: - Summary: UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column Key: SPARK-38146 URL: https://issues.apache.org/jira/browse/SPARK-38146 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Bruce Robbins When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, Spark throws the error: {noformat} 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.UnsupportedOperationException: null at org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] at org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217) ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] at org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215) ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] at org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272) ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] at $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46) ~[scala-library.jar:?] at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) ~[scala-library.jar:?] at $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45) ~[scala-library.jar:?] at org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197) ~[spark-sql_2.12-3.3.0-SNAPSHO {noformat} This is because {{BufferSetterGetterUtils#createSetters}} does not have a case statement for {{TimestampNTZType}}, so it generates a function that tries to call {{UnsafeRow.update}}, which throws an {{UnsupportedOperationException}}. 
This reproduction example is mostly taken from {{AggregationQuerySuite}}: {noformat} import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row class ScalaAggregateFunction(schema: StructType) extends UserDefinedAggregateFunction { def inputSchema: StructType = schema def bufferSchema: StructType = schema def dataType: DataType = schema def deterministic: Boolean = true def initialize(buffer: MutableAggregationBuffer): Unit = { (0 until schema.length).foreach { i => buffer.update(i, null) } } def update(buffer: MutableAggregationBuffer, input: Row): Unit = { if (!input.isNullAt(0) && input.getInt(0) == 50) { (0 until schema.length).foreach { i => buffer.update(i, input.get(i)) } } } def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) { (0 until schema.length).foreach { i => buffer1.update(i, buffer2.get(i)) } } } def evaluate(buffer: Row): Any = { Row.fromSeq(buffer.toSeq) } } import scala.util.Random import java.time.LocalDateTime val r = new Random(65676563L) val data = Seq.tabulate(50) { x => Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1)) } val schema = StructType.fromDDL("id int, col1 double, col2 bigint, col3 timestamp_ntz") val rdd = spark.sparkContext.parallelize(data, 1) val df = spark.createDataFrame(rdd, schema) val udaf = new ScalaAggregateFunction(df.schema) val allColumns = df.schema.fields.map(f => col(f.name)) df.groupBy().agg(udaf(allColumns: _*)).show(false) {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
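The fix described for SPARK-38146 — handle `TimestampNTZType` with the same `Long`-based setter as `TimestampType` — can be simulated in miniature. This is a hedged sketch of the dispatch problem only; the real code lives in Spark's `BufferSetterGetterUtils.createSetters` and operates on Catalyst data types, not strings.

```python
# Minimal simulation of the dispatch bug: without an explicit case for
# TIMESTAMP_NTZ, the type falls through to the generic update path, which
# UnsafeRow does not support.

def setter_for(data_type, patched=True):
    # Both timestamp types are stored internally as a Long (microseconds
    # since the epoch), so both should use the setLong-style setter.
    long_backed = {"TimestampType"}
    if patched:
        long_backed.add("TimestampNTZType")
    if data_type in long_backed:
        return "setLong"
    return "genericUpdate"  # raises UnsupportedOperationException on UnsafeRow

assert setter_for("TimestampNTZType", patched=False) == "genericUpdate"  # the bug
assert setter_for("TimestampNTZType", patched=True) == "setLong"         # the fix
```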
[jira] [Updated] (SPARK-38094) Parquet: enable matching schema columns by field id
[ https://issues.apache.org/jira/browse/SPARK-38094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jackie Zhang updated SPARK-38094: - Description: Field Id is a native field in the Parquet schema ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398]) After this PR, when the requested schema has field IDs, Parquet readers will first use the field ID to determine which Parquet columns to read, before falling back to using column names as before. It enables matching columns by field id for supported DWs like iceberg and Delta. This PR supports: * vectorized reader * Parquet-mr reader was: Field Id is a native field in the Parquet schema ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398]) After this PR, when the requested schema has field IDs, Parquet readers will first use the field ID to determine which Parquet columns to read, before falling back to using column names as before. It enables matching columns by field id for supported DWs like iceberg and Delta. This PR supports: * vectorized reader does not support: * Parquet-mr reader due to lack of field id support (needs a follow up ticket) > Parquet: enable matching schema columns by field id > --- > > Key: SPARK-38094 > URL: https://issues.apache.org/jira/browse/SPARK-38094 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Jackie Zhang >Priority: Major > > Field Id is a native field in the Parquet schema > ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398]) > After this PR, when the requested schema has field IDs, Parquet readers will > first use the field ID to determine which Parquet columns to read, before > falling back to using column names as before. It enables matching columns by > field id for supported DWs like iceberg and Delta. 
> This PR supports: > * vectorized reader > * Parquet-mr reader -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
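The matching order implied by the SPARK-38094 description — field ID first, falling back to column name — can be sketched with a hypothetical helper (illustration only, not the actual vectorized or parquet-mr reader code):

```python
# Hedged sketch of field-ID-based column resolution: if the requested schema
# carries a field ID, match the Parquet column by ID; otherwise fall back to
# matching by name, as before.

def resolve_column(requested_name, requested_field_id, file_columns):
    """file_columns: list of (name, field_id) tuples from the Parquet footer."""
    if requested_field_id is not None:
        for name, fid in file_columns:
            if fid == requested_field_id:
                return name
    for name, fid in file_columns:
        if name == requested_name:
            return name
    return None

cols = [("renamed_col", 1), ("other", 2)]
# A rename in the file does not break resolution when field IDs are present.
assert resolve_column("col", 1, cols) == "renamed_col"
# Without a field ID, resolution falls back to the name.
assert resolve_column("other", None, cols) == "other"
```

This is why field IDs matter for table formats like Iceberg and Delta, where columns may be renamed without rewriting data files.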
[jira] [Commented] (SPARK-38139) ml.recommendation.ALS doctests failures
[ https://issues.apache.org/jira/browse/SPARK-38139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489194#comment-17489194 ] Hyukjin Kwon commented on SPARK-38139: -- Yeah, I think we should fix as so! > ml.recommendation.ALS doctests failures > --- > > Key: SPARK-38139 > URL: https://issues.apache.org/jira/browse/SPARK-38139 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > In my dev setups, the ml.recommendation:ALS test consistently converges to a value > lower than expected and fails with: > {code:python} > File "/path/to/spark/python/pyspark/ml/recommendation.py", line 322, in > __main__.ALS > Failed example: > predictions[0] > Expected: > Row(user=0, item=2, newPrediction=0.69291...) > Got: > Row(user=0, item=2, newPrediction=0.6929099559783936) > {code} > I can correct for that, but it creates some noise, so if anyone else > experiences this, we could drop a digit from the results > {code} > diff --git a/python/pyspark/ml/recommendation.py > b/python/pyspark/ml/recommendation.py > index f0628fb922..b8e2a6097d 100644 > --- a/python/pyspark/ml/recommendation.py > +++ b/python/pyspark/ml/recommendation.py > @@ -320,7 +320,7 @@ class ALS(JavaEstimator, _ALSParams, JavaMLWritable, > JavaMLReadable): > >>> test = spark.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", > "item"]) > >>> predictions = sorted(model.transform(test).collect(), key=lambda r: > r[0]) > >>> predictions[0] > -Row(user=0, item=2, newPrediction=0.69291...) > +Row(user=0, item=2, newPrediction=0.6929...) > >>> predictions[1] > Row(user=1, item=0, newPrediction=3.47356...) > >>> predictions[2] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
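The proposed change in SPARK-38139 relies on how doctest's `ELLIPSIS` flag matches: `0.69291...` demands the first five decimal digits exactly, while `0.6929...` absorbs one more digit of numerical noise. This can be checked directly with the standard library's output checker:

```python
# Demonstrate why dropping one digit from the expected value makes the
# doctest tolerant of the convergence noise reported in the ticket.
import doctest

checker = doctest.OutputChecker()
got = "Row(user=0, item=2, newPrediction=0.6929099559783936)\n"

# The current expectation fails: the actual value has 0.69290..., not 0.69291...
assert not checker.check_output(
    "Row(user=0, item=2, newPrediction=0.69291...)\n", got, doctest.ELLIPSIS)
# The proposed expectation passes: one fewer fixed digit absorbs the noise.
assert checker.check_output(
    "Row(user=0, item=2, newPrediction=0.6929...)\n", got, doctest.ELLIPSIS)
```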
[jira] [Updated] (SPARK-37980) Extend METADATA column to support row indices for file based data sources
[ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37980: - Affects Version/s: 3.3.0 (was: 3.3) > Extend METADATA column to support row indices for file based data sources > - > > Key: SPARK-37980 > URL: https://issues.apache.org/jira/browse/SPARK-37980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Prakhar Jain >Priority: Major > > Spark recently added hidden metadata column support for File based > datasources as part of SPARK-37273. > We should extend it to support ROW_INDEX/ROW_POSITION also. > > Meaning of ROW_POSITION: > ROW_INDEX/ROW_POSITION is basically an index of a row within a file. E.g. 5th > row in the file will have ROW_INDEX 5. > > Use cases: > Row Indexes can be used in a variety of ways. A (fileName, rowIndex) tuple > uniquely identifies row in a table. This information can be used to mark rows > e.g. this can be used by indexer etc. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
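The uniqueness claim in SPARK-37980 — a (fileName, rowIndex) tuple identifies a row — can be illustrated with a toy enumeration. This is hypothetical illustration code; in Spark the index would be produced by the file readers as part of the hidden metadata column.

```python
# Toy illustration: assigning a per-file row index makes (file_name, row_index)
# a unique key across the table, even when row contents are duplicated.

def with_row_index(files):
    """files: dict of file_name -> list of rows; yields (file_name, row_index, row)."""
    for file_name, rows in files.items():
        for row_index, row in enumerate(rows):
            yield (file_name, row_index, row)

table = {"part-0.parquet": ["a", "a"], "part-1.parquet": ["a"]}
refs = list(with_row_index(table))
keys = [(f, i) for f, i, _ in refs]
# Row values repeat, but the (file, index) keys never collide.
assert len(set(keys)) == len(keys) == 3
```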
[jira] [Updated] (SPARK-37923) Generate partition transforms for BucketSpec inside parser
[ https://issues.apache.org/jira/browse/SPARK-37923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37923: - Affects Version/s: 3.3.0 (was: 3.3) > Generate partition transforms for BucketSpec inside parser > -- > > Key: SPARK-37923 > URL: https://issues.apache.org/jira/browse/SPARK-37923 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.3.0 > > > We currently generate partition transforms for BucketSpec in Analyzer. It's > cleaner to do this inside Parser. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38094) Parquet: enable matching schema columns by field id
[ https://issues.apache.org/jira/browse/SPARK-38094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38094: - Affects Version/s: 3.3.0 (was: 3.3) > Parquet: enable matching schema columns by field id > --- > > Key: SPARK-38094 > URL: https://issues.apache.org/jira/browse/SPARK-38094 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Jackie Zhang >Priority: Major > > Field Id is a native field in the Parquet schema > ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398]) > After this PR, when the requested schema has field IDs, Parquet readers will > first use the field ID to determine which Parquet columns to read, before > falling back to using column names as before. It enables matching columns by > field id for supported DWs like iceberg and Delta. > This PR supports: > * vectorized reader > does not support: > * Parquet-mr reader due to lack of field id support (needs a follow up > ticket) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
[ https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38032: - Affects Version/s: 3.3.0 (was: 3.3) > Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL > - > > Key: SPARK-38032 > URL: https://issues.apache.org/jira/browse/SPARK-38032 > Project: Spark > Issue Type: Test > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.3.0 > > > We should better test latest PyArrow version. Now 6.0.1 is release but we're > using < 5.0.0 for > https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38025) Improve test suite ExternalCatalogSuite
[ https://issues.apache.org/jira/browse/SPARK-38025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38025: - Affects Version/s: 3.3.0 (was: 3.3) > Improve test suite ExternalCatalogSuite > --- > > Key: SPARK-38025 > URL: https://issues.apache.org/jira/browse/SPARK-38025 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.3.0 >Reporter: Khalid Mammadov >Priority: Minor > > Test suite *ExternalCatalogSuite.scala* can be optimized by removing > repetitive code by replacing them with already available utility function > with some minor changes. This will reduce redundant code, simplify the suite > and improve readability. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38114) Spark build fails in Windows
[ https://issues.apache.org/jira/browse/SPARK-38114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38114: - Affects Version/s: 3.3.0 (was: 3.3) > Spark build fails in Windows > > > Key: SPARK-38114 > URL: https://issues.apache.org/jira/browse/SPARK-38114 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0 >Reporter: SOUVIK PAUL >Priority: Major > > java.lang.NoSuchMethodError: > org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream; > jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57) > jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27) > > A similar issue is being faced by the quarkus project with latest Maven. > [https://github.com/quarkusio/quarkus/issues/19491] > > Upgrading the scala-maven-plugin seems to resolve the issue but this ticket > can be a blocker > https://issues.apache.org/jira/browse/SPARK-36547 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38138) Materialize QueryPlan subqueries
[ https://issues.apache.org/jira/browse/SPARK-38138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38138: - Affects Version/s: 3.3.0 (was: 3.3) > Materialize QueryPlan subqueries > > > Key: SPARK-38138 > URL: https://issues.apache.org/jira/browse/SPARK-38138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38100) Remove unused method in `Decimal`
[ https://issues.apache.org/jira/browse/SPARK-38100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38100: - Fix Version/s: 3.3.0 (was: 3.3) > Remove unused method in `Decimal` > - > > Key: SPARK-38100 > URL: https://issues.apache.org/jira/browse/SPARK-38100 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Trivial > Fix For: 3.3.0, 3.2.2 > > > there is a unused method `overflowException` in > `org.apache.spark.sql.types.Decimal`. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38145) Address the tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38145: - Affects Version/s: 3.3.0 (was: 3.3) > Address the tests when "spark.sql.ansi.enabled" is True > --- > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests fail when `spark.sql.ansi.enabled` is set to True. > For the stability of the project, it's always good for tests to pass regardless of the > option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38145) Address the tests when "spark.sql.ansi.enabled" is True
Haejoon Lee created SPARK-38145: --- Summary: Address the tests when "spark.sql.ansi.enabled" is True Key: SPARK-38145 URL: https://issues.apache.org/jira/browse/SPARK-38145 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3 Reporter: Haejoon Lee The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but several tests fail when `spark.sql.ansi.enabled` is set to True. For the stability of the project, it's always good for tests to pass regardless of the option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37404) Inline type hints for python/pyspark/ml/evaluation.py
[ https://issues.apache.org/jira/browse/SPARK-37404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-37404. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35403 [https://github.com/apache/spark/pull/35403] > Inline type hints for python/pyspark/ml/evaluation.py > - > > Key: SPARK-37404 > URL: https://issues.apache.org/jira/browse/SPARK-37404 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Inline type hints from python/pyspark/ml/evaluation.pyi to > python/pyspark/ml/evaluation.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37404) Inline type hints for python/pyspark/ml/evaluation.py
[ https://issues.apache.org/jira/browse/SPARK-37404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-37404: -- Assignee: Maciej Szymkiewicz > Inline type hints for python/pyspark/ml/evaluation.py > - > > Key: SPARK-37404 > URL: https://issues.apache.org/jira/browse/SPARK-37404 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/evaluation.pyi to > python/pyspark/ml/evaluation.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
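For readers unfamiliar with the sub-task, a minimal sketch of what "inlining type hints" means, using an invented class rather than the actual `pyspark.ml.evaluation` surface: annotations that previously lived in a separate `.pyi` stub file are written directly in the `.py` source.

```python
# Before: annotations live only in a separate stub, evaluation.pyi:
#     class Evaluator:
#         def evaluate(self, dataset: DataFrame) -> float: ...
#
# After inlining, the .py module itself carries the annotations.
# `Evaluator` and `DataFrame` here are illustrative stand-ins.

class Evaluator:
    def evaluate(self, dataset: "DataFrame") -> float:
        """Evaluate a dataset; the annotations now live inline."""
        raise NotImplementedError
```

With the hints inline, the stub file can be deleted and tools such as mypy read the annotations straight from the implementation module.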
[jira] [Comment Edited] (SPARK-38042) Encoder cannot be found when a tuple component is a type alias for an Array
[ https://issues.apache.org/jira/browse/SPARK-38042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489167#comment-17489167 ] Johan Nyström-Persson edited comment on SPARK-38042 at 2/8/22, 11:50 PM: - My initial idea above was wrong. I ended up changing {code:java} val TypeRef(_, _, Seq(elementType)) = tpe{code} to {code:java} val TypeRef(_, _, Seq(elementType)) = tpe.dealias{code} and this seems to work. was (Author: JIRAUSER284274): My initial idea above was wrong, and instead I fixed this by changing {code:java} val TypeRef(_, _, Seq(elementType)) = tpe{code} to {code:java} val TypeRef(_, _, Seq(elementType)) = tpe.dealias{code} > Encoder cannot be found when a tuple component is a type alias for an Array > --- > > Key: SPARK-38042 > URL: https://issues.apache.org/jira/browse/SPARK-38042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Johan Nyström-Persson >Priority: Major > > ScalaReflection.dataTypeFor fails when Array[T] has been aliased for some T, > and then the alias is being used as a component of e.g. a product. 
> Minimal example, tested in version 3.1.2: > {code:java} > type Data = Array[Long] > val xs:List[(Data,Int)] = List((Array(1),1), (Array(2),2)) > sc.parallelize(xs).toDF("a", "b"){code} > This gives the following exception: > {code:java} > scala.MatchError: Data (of class > scala.reflect.internal.Types$AliasNoArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:104) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$6(ScalaReflection.scala:573) > > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:562) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:432) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:421) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:413) > > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55) > > at org.apache.spark.sql.Encoders$.product(Encoders.scala:285) > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:251) > > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:251) > > at > org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:32) > ... 48 elided{code} > At first glance, I think this could be fixed by changing e.g. > {code:java} > getClassNameFromType(tpe) to > getClassNameFromType(tpe.dealias) > {code} > in ScalaReflection.dataTypeFor. I will try to test that and submit a pull > request shortly. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail:
[jira] [Comment Edited] (SPARK-38042) Encoder cannot be found when a tuple component is a type alias for an Array
[ https://issues.apache.org/jira/browse/SPARK-38042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489167#comment-17489167 ] Johan Nyström-Persson edited comment on SPARK-38042 at 2/8/22, 11:49 PM: - My initial idea above was wrong, and instead I fixed this by changing {code:java} val TypeRef(_, _, Seq(elementType)) = tpe{code} to {code:java} val TypeRef(_, _, Seq(elementType)) = tpe.dealias{code} was (Author: JIRAUSER284274): My initial idea above was wrong, and instead I fixed this by changing {code:java} val TypeRef(_, _, Seq(elementType)) = tpe{code} to {code:java} val TypeRef(_, _, Seq(elementType)) = tpe.dealias{code} > Encoder cannot be found when a tuple component is a type alias for an Array > --- > > Key: SPARK-38042 > URL: https://issues.apache.org/jira/browse/SPARK-38042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Johan Nyström-Persson >Priority: Major > > ScalaReflection.dataTypeFor fails when Array[T] has been aliased for some T, > and then the alias is being used as a component of e.g. a product. 
> Minimal example, tested in version 3.1.2: > {code:java} > type Data = Array[Long] > val xs:List[(Data,Int)] = List((Array(1),1), (Array(2),2)) > sc.parallelize(xs).toDF("a", "b"){code} > This gives the following exception: > {code:java} > scala.MatchError: Data (of class > scala.reflect.internal.Types$AliasNoArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:104) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$6(ScalaReflection.scala:573) > > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:562) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:432) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:421) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:413) > > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55) > > at org.apache.spark.sql.Encoders$.product(Encoders.scala:285) > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:251) > > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:251) > > at > org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:32) > ... 48 elided{code} > At first glance, I think this could be fixed by changing e.g. > {code:java} > getClassNameFromType(tpe) to > getClassNameFromType(tpe.dealias) > {code} > in ScalaReflection.dataTypeFor. I will try to test that and submit a pull > request shortly. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe,
[jira] [Commented] (SPARK-38042) Encoder cannot be found when a tuple component is a type alias for an Array
[ https://issues.apache.org/jira/browse/SPARK-38042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489167#comment-17489167 ] Johan Nyström-Persson commented on SPARK-38042: --- My initial idea above was wrong, and instead I fixed this by changing {code:java} val TypeRef(_, _, Seq(elementType)) = tpe{code} to {code:java} val TypeRef(_, _, Seq(elementType)) = tpe.dealias{code} > Encoder cannot be found when a tuple component is a type alias for an Array > --- > > Key: SPARK-38042 > URL: https://issues.apache.org/jira/browse/SPARK-38042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Johan Nyström-Persson >Priority: Major > > ScalaReflection.dataTypeFor fails when Array[T] has been aliased for some T, > and then the alias is being used as a component of e.g. a product. > Minimal example, tested in version 3.1.2: > {code:java} > type Data = Array[Long] > val xs:List[(Data,Int)] = List((Array(1),1), (Array(2),2)) > sc.parallelize(xs).toDF("a", "b"){code} > This gives the following exception: > {code:java} > scala.MatchError: Data (of class > scala.reflect.internal.Types$AliasNoArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:104) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$6(ScalaReflection.scala:573) > > at > 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:562) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:432) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:421) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:413) > > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55) > > at org.apache.spark.sql.Encoders$.product(Encoders.scala:285) > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:251) > > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:251) > > at > 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:32) > ... 48 elided{code} > At first glance, I think this could be fixed by changing e.g. > {code:java} > getClassNameFromType(tpe) to > getClassNameFromType(tpe.dealias) > {code} > in ScalaReflection.dataTypeFor. I will try to test that and submit a pull > request shortly. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38069) improve structured streaming window of calculated
[ https://issues.apache.org/jira/browse/SPARK-38069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-38069. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35362 [https://github.com/apache/spark/pull/35362] > improve structured streaming window of calculated > - > > Key: SPARK-38069 > URL: https://issues.apache.org/jira/browse/SPARK-38069 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: nyingping >Assignee: nyingping >Priority: Trivial > Fix For: 3.3.0 > > > Structured Streaming computes the window from an intermediate result, the windowId, and > the windowId is in turn computed via CaseWhen. > We can instead use Flink's method of calculating the window, which is > easier to understand, simpler, and more efficient -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38069) improve structured streaming window of calculated
[ https://issues.apache.org/jira/browse/SPARK-38069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-38069: --- Assignee: nyingping > improve structured streaming window of calculated > - > > Key: SPARK-38069 > URL: https://issues.apache.org/jira/browse/SPARK-38069 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: nyingping >Assignee: nyingping >Priority: Trivial > > Structured Streaming computes the window from an intermediate result, the windowId, and > the windowId is in turn computed via CaseWhen. > We can instead use Flink's method of calculating the window, which is > easier to understand, simpler, and more efficient -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
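For context on the Flink approach the issue refers to: Flink assigns an event to its tumbling window with a single arithmetic expression (see `TimeWindow.getWindowStartWithOffset`) instead of an intermediate window id. A plain-Python sketch of that calculation, independent of either engine:

```python
# Sketch of Flink-style window-start assignment: the window containing a
# timestamp is found with one modulo expression, with no intermediate id
# or CaseWhen branching.

def window_start(timestamp_ms: int, window_size_ms: int, offset_ms: int = 0) -> int:
    """Return the start of the tumbling window containing timestamp_ms."""
    return timestamp_ms - (timestamp_ms - offset_ms + window_size_ms) % window_size_ms


# With 10-second tumbling windows, an event at t=25000 ms falls in
# the window [20000, 30000).
print(window_start(25_000, 10_000))  # 20000
```

Adding `window_size_ms` before the modulo keeps the result correct for timestamps before the offset (Python's `%` already yields a non-negative remainder for a positive divisor, so negative event times land in the right window as well).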
[jira] [Resolved] (SPARK-38144) Remove unused `spark.storage.safetyFraction` config
[ https://issues.apache.org/jira/browse/SPARK-38144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38144. --- Fix Version/s: 3.3.0 3.2.2 3.1.3 Resolution: Fixed Issue resolved by pull request 35447 [https://github.com/apache/spark/pull/35447] > Remove unused `spark.storage.safetyFraction` config > --- > > Key: SPARK-38144 > URL: https://issues.apache.org/jira/browse/SPARK-38144 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.1.3, 3.3.0, 3.2.2 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.3 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org