[jira] [Commented] (SPARK-37285) Add Weight of Evidence and Information value to ml.feature
[ https://issues.apache.org/jira/browse/SPARK-37285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489301#comment-17489301 ]

zhengruifeng commented on SPARK-37285:
--------------------------------------

Were these metrics or algorithms implemented in scikit-learn or other libraries?

> Add Weight of Evidence and Information value to ml.feature
> ----------------------------------------------------------
>
>                 Key: SPARK-37285
>                 URL: https://issues.apache.org/jira/browse/SPARK-37285
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 3.2.0
>            Reporter: Simon Tao
>            Assignee: Apache Spark
>            Priority: Major
>
> The weight of evidence (WOE) and information value (IV) provide a useful
> framework for exploratory analysis and variable screening for binary
> classifiers. They help in several ways:
> 1. Check the linear relationship of a feature with the dependent variable
>    to be used in the model.
> 2. Provide a good variable transformation method for both continuous and
>    categorical features.
> 3. Avoid increasing model complexity, unlike one-hot encoding.
> 4. Detect linear and non-linear relationships.
> 5. Support feature selection.
> 6. Measure the predictive power of a feature and help point out suspicious
>    features.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
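[For readers unfamiliar with the metrics discussed above, here is a minimal pure-Python sketch of how per-bin WOE and total IV are conventionally computed from good/bad counts. This is an illustration of the math only, not the proposed ml.feature API:]

```python
import math

def woe_iv(bins):
    """Compute per-bin weight of evidence and total information value.

    bins: list of (n_good, n_bad) event counts per bucket of one feature.
    Returns (list of WOE values, IV). Assumes every bin has nonzero counts;
    real implementations smooth or merge empty bins.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        pct_good = g / total_good          # share of goods in this bin
        pct_bad = b / total_bad            # share of bads in this bin
        woe = math.log(pct_good / pct_bad) # WOE = ln(%good / %bad)
        woes.append(woe)
        iv += (pct_good - pct_bad) * woe   # IV = sum of contributions
    return woes, iv
```

A feature whose bins separate goods from bads strongly (e.g. one bin mostly goods, another mostly bads) yields a large IV, which is why IV is used as a screening score.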
[jira] [Updated] (SPARK-38120) HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value
[ https://issues.apache.org/jira/browse/SPARK-38120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-38120:
----------------------------------
    Affects Version/s: 3.0.3

> HiveExternalCatalog.listPartitions is failing when partition column name is
> upper case and dot in partition value
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-38120
>                 URL: https://issues.apache.org/jira/browse/SPARK-38120
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.3, 3.1.2, 3.2.1
>            Reporter: Khalid Mammadov
>            Priority: Minor
>
> The HiveExternalCatalog.listPartitions method call fails when a partition
> column name is upper case and the partition value contains a dot. It is
> related to this change:
> https://github.com/apache/spark/commit/f18b905f6cace7686ef169fda7de474079d0af23
> The test case in that PR does not reproduce the issue because its partition
> column name is lower case.
>
> Below is how to reproduce the issue:
>
> scala> import org.apache.spark.sql.catalyst.TableIdentifier
> import org.apache.spark.sql.catalyst.TableIdentifier
>
> scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY (partCol1 STRING, partCol2 STRING)")
>
> scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = 'i.j') VALUES (100, 'John')")
>
> scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), Some(Map("partCol2" -> "i.j"))).foreach(println)
> java.util.NoSuchElementException: key not found: partcol2
>   at scala.collection.immutable.Map$Map2.apply(Map.scala:227)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1(ExternalCatalogUtils.scala:205)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1$adapted(ExternalCatalogUtils.scala:202)
>   at scala.collection.immutable.Map$Map1.forall(Map.scala:196)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.isPartialPartitionSpec(ExternalCatalogUtils.scala:202)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6(HiveExternalCatalog.scala:1312)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6$adapted(HiveExternalCatalog.scala:1312)
>   at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$1(HiveExternalCatalog.scala:1312)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.withClientWrappingException(HiveExternalCatalog.scala:114)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1296)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:254)
>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:1251)
>   ... 47 elided
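[The failure mode in this report can be illustrated outside Spark with a small Python sketch (hypothetical helper names, not Spark's actual ExternalCatalogUtils code): the user's partition spec keys get lower-cased, the stored partition map keeps the original-case keys, and a case-sensitive lookup then throws. Normalizing key case on both sides before comparing avoids the mismatch:]

```python
# Hypothetical sketch of the lookup mismatch; not Spark's actual code.
partition = {"partCol1": "CA", "partCol2": "i.j"}  # stored with original case
spec = {"partcol2": "i.j"}                         # spec keys lower-cased

def is_partial_match_naive(partition, spec):
    # Case-sensitive lookup: raises KeyError ("key not found: partcol2").
    return all(partition[k] == v for k, v in spec.items())

def is_partial_match(partition, spec):
    # Normalize key case on both sides before comparing.
    normalized = {k.lower(): v for k, v in partition.items()}
    return all(normalized.get(k.lower()) == v for k, v in spec.items())
```

The dot in the partition value matters only because it sends the code down the path that does this per-key comparison; the root cause is the case-sensitive key lookup.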
[jira] [Updated] (SPARK-38120) HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value
[ https://issues.apache.org/jira/browse/SPARK-38120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-38120:
----------------------------------
    Affects Version/s: 3.1.2

> HiveExternalCatalog.listPartitions is failing when partition column name is
> upper case and dot in partition value
[jira] [Resolved] (SPARK-38153) Remove option newlines.topLevelStatements in scalafmt.conf
[ https://issues.apache.org/jira/browse/SPARK-38153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-38153.
-----------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 35455
https://github.com/apache/spark/pull/35455

> Remove option newlines.topLevelStatements in scalafmt.conf
> ----------------------------------------------------------
>
>                 Key: SPARK-38153
>                 URL: https://issues.apache.org/jira/browse/SPARK-38153
>             Project: Spark
>          Issue Type: Task
>          Components: Build
>    Affects Versions: 3.3.0
>            Reporter: Gengliang Wang
>            Assignee: Gengliang Wang
>            Priority: Major
>             Fix For: 3.3.0
>
> The configuration
>     newlines.topLevelStatements = [before, after]
> adds a blank line before the first member and after the last member of a
> class. This is neither encouraged nor discouraged per
> https://github.com/databricks/scala-style-guide#blanklines
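[For context, the option being removed is a scalafmt setting; a minimal config fragment (surrounding settings from Spark's actual scalafmt.conf omitted) looks like:]

```hocon
# scalafmt.conf (fragment) -- the option removed by this task.
# [before, after] tells scalafmt to force a blank line before the first
# and after the last top-level statement of a template body.
newlines.topLevelStatements = [before, after]
```

Deleting the line reverts scalafmt to its default, which neither adds nor removes those blank lines, matching the style guide's neutral stance.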
[jira] [Commented] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent
[ https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489290#comment-17489290 ]

Apache Spark commented on SPARK-38033:
--------------------------------------

User 'LLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35458

> The structured streaming processing cannot be started because the commitId
> and offsetId are inconsistent
> --------------------------------------------------------------------------
>
>                 Key: SPARK-38033
>                 URL: https://issues.apache.org/jira/browse/SPARK-38033
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.4.6, 3.0.3
>            Reporter: LLiu
>            Priority: Major
>
> A streaming query could not start after an unexpected machine shutdown. The
> exception is as follows:
>
> {code:java}
> ERROR 22/01/12 02:48:36 MicroBatchExecution: Query streaming_4a026335eafd4bb498ee51752b49f7fb [id = 647ba9e4-16d2-4972-9824-6f9179588806, runId = 92385d5b-f31f-40d0-9ac7-bb7d9796d774] terminated with error
> java.lang.IllegalStateException: batch 113258 doesn't exist
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$4.apply(MicroBatchExecution.scala:256)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:255)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:169)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>   at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>   at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
>   at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
>   at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
> {code}
>
> I checked the checkpoint files on HDFS and found the latest offset is
> 113259, but the latest commit is 113257:
> {code:java}
> commits
> /tmp/streaming_/commits/113253
> /tmp/streaming_/commits/113254
> /tmp/streaming_/commits/113255
> /tmp/streaming_/commits/113256
> /tmp/streaming_/commits/113257
> offsets
> /tmp/streaming_/offsets/113253
> /tmp/streaming_/offsets/113254
> /tmp/streaming_/offsets/113255
> /tmp/streaming_/offsets/113256
> /tmp/streaming_/offsets/113257
> /tmp/streaming_/offsets/113259
> {code}
> Finally, I deleted the offset file /tmp/streaming_/offsets/113259 and the
> program started normally. I think there is a problem here: we should handle
> this exception, or at least suggest a resolution in the log.
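[The restart failure described in this report can be sketched in plain Python (a hypothetical simplification, not Spark's actual MicroBatchExecution logic): on recovery the engine reads the latest offset batch id and also needs the immediately preceding offset entry, so a gap left by a crash (offset 113258 never written, 113259 written) produces exactly the reported error:]

```python
def check_restart(offset_batch_ids):
    """Simulate recovery from a checkpoint's offsets/ directory.

    Returns the batch id to restart from, or raises if the offset log has
    a gap just before the latest batch (hypothetical simplification).
    """
    latest = max(offset_batch_ids)
    previous = latest - 1
    if previous >= min(offset_batch_ids) and previous not in offset_batch_ids:
        # Mirrors: java.lang.IllegalStateException: batch 113258 doesn't exist
        raise RuntimeError(f"batch {previous} doesn't exist")
    return latest

# Checkpoint state from the report: offsets 113253-113257 plus 113259;
# batch 113258's offset file was never written due to the crash.
reported_offsets = [113253, 113254, 113255, 113256, 113257, 113259]
```

Deleting the dangling latest offset file, as the reporter did, removes the gap, which is why the query then restarts from the last consistent batch.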
[jira] [Assigned] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent
[ https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38033:
------------------------------------
    Assignee: (was: Apache Spark)

> The structured streaming processing cannot be started because the commitId
> and offsetId are inconsistent
[jira] [Assigned] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent
[ https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-38033:
------------------------------------
    Assignee: Apache Spark

> The structured streaming processing cannot be started because the commitId
> and offsetId are inconsistent
[jira] [Commented] (SPARK-38033) The structured streaming processing cannot be started because the commitId and offsetId are inconsistent
[ https://issues.apache.org/jira/browse/SPARK-38033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489289#comment-17489289 ]

Apache Spark commented on SPARK-38033:
--------------------------------------

User 'LLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35458

> The structured streaming processing cannot be started because the commitId
> and offsetId are inconsistent
[jira] [Assigned] (SPARK-36553) KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
[ https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36553:
------------------------------------
    Assignee: (was: Apache Spark)

> KMeans fails with NegativeArraySizeException for K = 50000 after issue
> #27758 was introduced
> ----------------------------------------------------------------------
>
>                 Key: SPARK-36553
>                 URL: https://issues.apache.org/jira/browse/SPARK-36553
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark
>    Affects Versions: 3.1.1
>            Reporter: Anders Rydbirk
>            Priority: Major
>
> We are running KMeans on approximately 350M rows of x, y, z coordinates
> using the following configuration:
> {code:java}
> KMeans(
>   featuresCol='features',
>   predictionCol='centroid_id',
>   k=50000,
>   initMode='k-means||',
>   initSteps=2,
>   tol=0.5,
>   maxIter=20,
>   seed=SEED,
>   distanceMeasure='euclidean'
> )
> {code}
> When using Spark 3.0.0 this worked fine, but after upgrading to 3.1.1 we
> consistently get errors unless we reduce K.
>
> Stack trace:
> {code:java}
> An error occurred while calling o167.fit.
> java.lang.NegativeArraySizeException: -897458648
>   at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194)
>   at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191)
>   at scala.Array$.ofDim(Array.scala:221)
>   at org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52)
>   at org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280)
>   at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231)
>   at org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354)
>   at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>   at scala.util.Try$.apply(Try.scala:213)
>   at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>   at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329)
>   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.base/java.lang.Thread.run(Unknown Source)
> {code}
>
> The issue was introduced by
> [#27758|#diff-725d4624ddf4db9cc51721c2ddaef50a1bc30e7b471e0439da28c5b5582efdfdR52]]
> which significantly reduces the maximum supported value of K. Snippet of
> the line that throws the error, from [DistanceMeasure.scala:|#L52]]
> {code:java}
> val packedValues = Array.ofDim[Double](k * (k + 1) / 2)
> {code}
>
> *What we have tried:*
>  * Reducing iterations
>  * Reducing input volume
>  * Reducing K
> Only reducing K has yielded success.
>
> *Possible workarounds:*
>  # Roll back to Spark 3.0.0, since a KMeansModel generated with 3.0.0
>    cannot be loaded in 3.1.1.
>  # Reduce K. Currently trying with 45000.
>
> *What we don't understand:*
> Given the line of code above, we do not understand why we would get an
> integer overflow. For K=50,000, packedValues should be allocated with a
> size of 1,250,025,000 < 2^31 and should not result in a negative array
> size.
>
> *Suggested resolution:*
> I'm not strong in the inner workings of KMeans, but my immediate thought
> would be to add a fallback to the previous logic for K larger than a set
> threshold if the optimisation is to stay in place, as it breaks
> compatibility from 3.0.0 to 3.1.1 for edge cases.
>
> Please let me know if more information is needed; this is my first time
> raising a bug for an open-source project.
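[The reporter's puzzlement is explained by evaluation order: in `k * (k + 1) / 2`, the JVM computes `k * (k + 1)` as a 32-bit Int before dividing, and 50000 * 50001 = 2,500,050,000 already exceeds Int.MaxValue (2,147,483,647). Simulating 32-bit wraparound in Python reproduces the exact negative size from the stack trace:]

```python
def to_int32(x):
    """Wrap a Python int to a signed 32-bit integer, like a JVM Int."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def packed_size_jvm(k):
    # Mirrors `k * (k + 1) / 2` in 32-bit Int arithmetic: the
    # multiplication overflows before the division ever happens.
    prod = to_int32(k * (k + 1))
    # JVM integer division truncates toward zero.
    return prod // 2 if prod >= 0 else -((-prod) // 2)

print(packed_size_jvm(50000))  # -> -897458648, the size in the exception
```

So the final packed size 1,250,025,000 would indeed fit in an Int; it is the intermediate product that overflows, which is why rewriting the expression (or widening to Long before multiplying) avoids the crash.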
[jira] [Commented] (SPARK-36553) KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
[ https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489281#comment-17489281 ] Apache Spark commented on SPARK-36553: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/35457 > KMeans fails with NegativeArraySizeException for K = 5 after issue #27758 > was introduced > > > Key: SPARK-36553 > URL: https://issues.apache.org/jira/browse/SPARK-36553 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 3.1.1 >Reporter: Anders Rydbirk >Priority: Major > > We are running KMeans on approximately 350M rows of x, y, z coordinates using > the following configuration: > {code:java} > KMeans( > featuresCol='features', > predictionCol='centroid_id', > k=5, > initMode='k-means||', > initSteps=2, > tol=0.5, > maxIter=20, > seed=SEED, > distanceMeasure='euclidean' > ) > {code} > When using Spark 3.0.0 this worked fine, but when upgrading to 3.1.1 we are > consistently getting errors unless we reduce K. 
> Stacktrace: > > {code:java} > An error occurred while calling o167.fit.An error occurred while calling > o167.fit.: java.lang.NegativeArraySizeException: -897458648 at > scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194) at > scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191) at > scala.Array$.ofDim(Array.scala:221) at > org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52) > at > org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280) > at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231) > at org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354) at > org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) > at scala.util.Try$.apply(Try.scala:213) at > org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) > at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown > Source) at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown > Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:282) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:238) at > java.base/java.lang.Thread.run(Unknown Source) > {code} > > The issue is introduced by > [#27758|#diff-725d4624ddf4db9cc51721c2ddaef50a1bc30e7b471e0439da28c5b5582efdfdR52]] > which significantly reduces the maximum value of K. 
Snippet of the line that > throws the error, from [DistanceMeasure.scala|#L52]: > {code:java} > val packedValues = Array.ofDim[Double](k * (k + 1) / 2) > {code} > > *What we have tried:* > * Reducing iterations > * Reducing input volume > * Reducing K > Only reducing K has yielded success. > > *Possible workaround:* > # Roll back to Spark 3.0.0, since a KMeansModel generated with 3.0.0 cannot > be loaded in 3.1.1. > # Reduce K. Currently trying with 45000. > > *What we don't understand*: > Given the line of code above, we do not understand why we would get an > integer overflow. > For K=50,000, packedValues should be allocated with the size of 1,250,025,000 > < (2^31) and not result in a negative array size. > > *Suggested resolution:* > I'm not strong in the inner workings of KMeans, but my immediate thought > would be to add a fallback to the previous logic for K larger than a set > threshold if the optimisation is to stay in place, as it breaks compatibility > from 3.0.0 to 3.1.1 for edge cases. > > Please let me know if more information is needed; this is my first time > raising a bug for an OSS project. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
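The overflow in the snippet above is easy to reproduce outside Spark. A minimal Java sketch (class name is illustrative, not Spark code) showing that k * (k + 1) wraps around a 32-bit int before the division by 2 is ever applied, and that widening one operand to long gives the size the reporter expected:

```java
public class PackedSizeOverflow {
    public static void main(String[] args) {
        int k = 50000;
        // k * (k + 1) = 2,500,050,000 exceeds Integer.MAX_VALUE (2,147,483,647),
        // so the intermediate product wraps to a negative value BEFORE the / 2.
        int overflowed = k * (k + 1) / 2;
        // Widening one operand to long keeps the intermediate product exact.
        long exact = (long) k * (k + 1) / 2;
        System.out.println(overflowed); // -897458648, the size in the exception
        System.out.println(exact);      // 1250025000
    }
}
```

Because one of k and k + 1 is always even, the long result is exact; the bug lies purely in the 32-bit intermediate product.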
[jira] [Assigned] (SPARK-36553) KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
[ https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36553: Assignee: Apache Spark > KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 > was introduced > > > Key: SPARK-36553 > URL: https://issues.apache.org/jira/browse/SPARK-36553 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 3.1.1 >Reporter: Anders Rydbirk >Assignee: Apache Spark >Priority: Major -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-38145: Parent: SPARK-38154 Issue Type: Sub-task (was: Improvement) > Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True > > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests fail when `spark.sql.ansi.enabled` is set to True. > For the stability of the project, tests should pass regardless of the option > value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36553) KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
[ https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489278#comment-17489278 ] zhengruifeng commented on SPARK-36553: -- it is an overflow: {code:java} scala> val k = 50000 val k: Int = 50000 scala> k * (k + 1) / 2 val res1: Int = -897458648 {code} > KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 > was introduced > > > Key: SPARK-36553 > URL: https://issues.apache.org/jira/browse/SPARK-36553 > Project: Spark > Issue Type: Bug > Components: ML, MLlib, PySpark >Affects Versions: 3.1.1 >Reporter: Anders Rydbirk >Priority: Major -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37960) A new framework to represent catalyst expressions in DS v2 APIs
[ https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-37960: --- Description: Spark needs a new framework to represent catalyst expressions in DS v2 APIs. CASE ... WHEN ... ELSE ... END is just the first use case. was: Currently, Spark supports pushing the aggregate SUM(column) down into JDBC data sources. SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. > A new framework to represent catalyst expressions in DS v2 APIs > --- > > Key: SPARK-37960 > URL: https://issues.apache.org/jira/browse/SPARK-37960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Spark needs a new framework to represent catalyst expressions in DS v2 APIs. > CASE ... WHEN ... ELSE ... END is just the first use case. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37960) A new framework to represent catalyst expressions in DS v2 APIs
[ https://issues.apache.org/jira/browse/SPARK-37960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-37960: --- Summary: A new framework to represent catalyst expressions in DS v2 APIs (was: Support aggregate push down with CASE ... WHEN ... ELSE ... END) > A new framework to represent catalyst expressions in DS v2 APIs > --- > > Key: SPARK-37960 > URL: https://issues.apache.org/jira/browse/SPARK-37960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports pushing the aggregate SUM(column) down into JDBC > data sources. > SUM(CASE ... WHEN ... ELSE ... END) is very useful for users. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30661) KMeans blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489271#comment-17489271 ] zhengruifeng commented on SPARK-30661: -- ok, I will skip .mllib calling .ml here. We may re-org the impls in the future. I'm looking for a way to minimize the changes. > KMeans blockify input vectors > - > > Key: SPARK-30661 > URL: https://issues.apache.org/jira/browse/SPARK-30661 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Attachments: blockify_kmeans.png > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31007) KMeans optimization based on triangle-inequality
[ https://issues.apache.org/jira/browse/SPARK-31007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489270#comment-17489270 ] zhengruifeng edited comment on SPARK-31007 at 2/9/22, 6:05 AM: --- this case is not an OOM, but an overflow: {code:java} scala> val k = 50000 val k: Int = 50000 scala> k * (k + 1) / 2 val res0: Int = -897458648 scala> k * (k + 1) val res1: Int = -1794917296 scala> k / 2 * (k + 1) val res2: Int = 1250025000 scala> Array.ofDim[Double](k * (k + 1) / 2) java.lang.NegativeArraySizeException at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:273) at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:271) at scala.Array$.ofDim(Array.scala:323) ... 32 elided scala> Array.ofDim[Double](k / 2 * (k + 1)) java.lang.OutOfMemoryError: Java heap space at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:273) at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:271) at scala.Array$.ofDim(Array.scala:323) ... 29 elided scala> val k = 45000 val k: Int = 45000 scala> Array.ofDim[Double](k / 2 * (k + 1)) java.lang.OutOfMemoryError: Java heap space scala> k / 2 * (k + 1) val res6: Int = 1012522500 {code} I think we should add a limit on k for this optimization. The change should not be large; I will send a PR. > KMeans optimization based on triangle-inequality > > > Key: SPARK-31007 > URL: https://issues.apache.org/jira/browse/SPARK-31007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.1.0 > > Attachments: ICML03-022.pdf > > > In the current impl, the following Lemma is used in KMeans: > 0. Let x be a point, let c be a center and o be the origin; then d(x,c) >= > |d(x,o) - d(c,o)| = |norm(x)-norm(c)|. > This can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}. > According to [Using the Triangle Inequality to Accelerate > K-Means|https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf], we can go > further, and there are another two Lemmas that can be used: > 1. Let x be a point, and let b and c be centers. If d(b,c) >= 2d(x,b) then > d(x,c) >= d(x,b). > This can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}. > However, luckily for CosineDistance we can get a variant in the space of > radian/angle. > 2. Let x be a point, and let b and c be centers. Then d(x,c) >= max\{0, > d(x,b)-d(b,c)}. > This can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}. > The application of Lemma 2 is a little complex: it needs to cache/update the > distance/lower bounds to previous centers, and thus can only be applied in > training, not in prediction. > So this ticket is mainly for Lemma 1. Its idea is quite simple: if point x is > close enough to center b (less than a pre-computed radius), then we can say > point x belongs to center b without computing the distances between x and the > other centers. It can be used in both training and prediction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
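The Lemma 1 skip test described in the ticket can be sketched in a few lines of Java; this is an illustrative fragment with hypothetical helper names, not the actual ml.clustering implementation:

```java
public class TriangleInequalitySkip {
    // Plain Euclidean distance between two points.
    static double dist(double[] p, double[] q) {
        double s = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // Lemma 1: if d(b, c) >= 2 * d(x, b), then d(x, c) >= d(x, b), so center c
    // can never beat the current center b and d(x, c) need not be computed.
    static boolean canSkipCenter(double[] x, double[] b, double[] c) {
        return dist(b, c) >= 2.0 * dist(x, b);
    }

    public static void main(String[] args) {
        double[] x = {0.0, 0.0};
        double[] b = {1.0, 0.0};  // current closest center, d(x, b) = 1
        double[] c = {10.0, 0.0}; // d(b, c) = 9 >= 2 * 1, so c is skippable
        System.out.println(canSkipCenter(x, b, c)); // true
    }
}
```

The pairwise center distances d(b, c) can be precomputed once per iteration, which is exactly what makes the per-point skip test cheap.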
[jira] [Commented] (SPARK-31007) KMeans optimization based on triangle-inequality
[ https://issues.apache.org/jira/browse/SPARK-31007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489270#comment-17489270 ] zhengruifeng commented on SPARK-31007: -- this case is not an OOM, but an overflow: {code:java} scala> val k = 50000 val k: Int = 50000 scala> k * (k + 1) / 2 val res0: Int = -897458648 scala> k * (k + 1) val res1: Int = -1794917296 scala> k / 2 * (k + 1) val res2: Int = 1250025000 scala> Array.ofDim[Double](k * (k + 1) / 2) java.lang.NegativeArraySizeException at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:273) at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:271) at scala.Array$.ofDim(Array.scala:323) ... 32 elided scala> Array.ofDim[Double](k / 2 * (k + 1)) java.lang.OutOfMemoryError: Java heap space at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:273) at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:271) at scala.Array$.ofDim(Array.scala:323) ... 29 elided scala> val k = 45000 val k: Int = 45000 scala> Array.ofDim[Double](k / 2 * (k + 1)) java.lang.OutOfMemoryError: Java heap space scala> k / 2 * (k + 1) val res6: Int = 1012522500 {code} I think we should add a limit on k for this optimization. The change should not be large; I will send a PR. > KMeans optimization based on triangle-inequality > > > Key: SPARK-31007 > URL: https://issues.apache.org/jira/browse/SPARK-31007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.1.0 > > Attachments: ICML03-022.pdf -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
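A guard of the kind proposed in the comment above could look like the following sketch; the method name and the memory budget are assumptions for illustration, not the actual PR. The packed array holds k * (k + 1) / 2 doubles, so both the int overflow and the heap cost grow quadratically with k:

```java
public class StatisticsGuard {
    // Sketch: compute the packed size in long arithmetic and only enable the
    // precomputed-statistics optimization when the array is both representable
    // as a Java array and affordable under a memory budget.
    static boolean canUseStatistics(int k, long maxBytes) {
        long packedLength = (long) k * (k + 1) / 2;         // number of doubles
        if (packedLength > Integer.MAX_VALUE) return false; // not allocatable at all
        return packedLength * 8L <= maxBytes;               // 8 bytes per double
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30; // illustrative budget, not a Spark default
        System.out.println(canUseStatistics(1000, oneGiB));  // true: ~4 MB
        System.out.println(canUseStatistics(50000, oneGiB)); // false: ~10 GB
    }
}
```

For k = 50000 the packed length (1,250,025,000) is technically a legal array size, but the ~10 GB of doubles it implies is what produced the OutOfMemoryError in the scala session above.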
[jira] [Comment Edited] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489259#comment-17489259 ] Xinyi Yu edited comment on SPARK-38157 at 2/9/22, 6:00 AM: --- I will take this one. Kinda my first Spark PR! was (Author: xyyu): I will take this one. Kinda my first OSS PR! > Fix > /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite > under ANSI mode > > > Key: SPARK-38157 > URL: https://issues.apache.org/jira/browse/SPARK-38157 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > > ThriftServerQueryTestSuite will fail on {{timestampNTZ/timestamp.sql}} when > ANSI mode is on by default. This is because {{timestampNTZ/timestamp.sql}} > should only run with ANSI off according to the golden result file, but > neither ThriftServerQueryTestSuite nor the timestamp.sql test overrides the > default ANSI setting. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-38157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489259#comment-17489259 ] Xinyi Yu commented on SPARK-38157: -- I will take this one. Kinda my first OSS PR! > Fix > /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite > under ANSI mode > > > Key: SPARK-38157 > URL: https://issues.apache.org/jira/browse/SPARK-38157 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Xinyi Yu >Priority: Major > > ThriftServerQueryTestSuite will fail on {{timestampNTZ/timestamp.sql}} when > ANSI mode is on by default. This is because {{timestampNTZ/timestamp.sql}} > should only run with ANSI off according to the golden result file, but > neither ThriftServerQueryTestSuite nor the timestamp.sql test overrides the > default ANSI setting. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38158) Rdd and shuffle blocks not migrated to new executors when decommission feature is enabled
[ https://issues.apache.org/jira/browse/SPARK-38158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohan Patidar updated SPARK-38158: -- Attachment: sparkConf_Decommission_to_alternate_storage.conf sparkConf_Decommission_to_executors.conf > Rdd and shuffle blocks not migrated to new executors when decommission > feature is enabled > - > > Key: SPARK-38158 > URL: https://issues.apache.org/jira/browse/SPARK-38158 > Project: Spark > Issue Type: Bug > Components: Block Manager, Kubernetes >Affects Versions: 3.2.0 > Environment: Spark on k8s >Reporter: Mohan Patidar >Priority: Major > Attachments: sparkConf_Decommission_to_alternate_storage.conf, > sparkConf_Decommission_to_executors.conf > > > We’re using Spark 3.2.0 and we have enabled the spark decommission feature. > As part of validating this feature, we wanted to check if the rdd blocks and > shuffle blocks from the decommissioned executors are migrated to other > executors. > However, we could not see this happening. Below is the configuration we used. 
> # *Spark Configuration used:* > ** spark.local.dir /mnt/spark-ldir > spark.decommission.enabled true > spark.storage.decommission.enabled true > spark.storage.decommission.rddBlocks.enabled true > spark.storage.decommission.shuffleBlocks.enabled true > spark.dynamicAllocation.enabled true > # *Brought up spark-driver and executors on the different nodes.* > NAME > READY STATUS NODE > decommission-driver > 1/1 Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-1 1/1 > Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-2 1/1 > Running *Node2* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-3 1/1 > Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-4 1/1 > Running *Node2* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-5 1/1 > Running *Node1* > # *Bring down Node2; the status of the pods is then as follows.* > NAME > READY STATUS NODE > decommission-driver > 1/1 Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-1 1/1 > Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-2 1/1 > Terminating *Node2* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-3 1/1 > Running *Node1* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-4 1/1 > Terminating *Node2* > gzip-compression-test-ae0b0b7e4d7fbe40-exec-5 1/1 > Running *Node1* > # *Driver logs:* > {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.296Z", > "timezone":"UTC", "log":"Adding decommission script to lifecycle"} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.459Z", > "timezone":"UTC", "log":"Adding decommission script to lifecycle"} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.564Z", > "timezone":"UTC", "log":"Adding decommission script to lifecycle"} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.601Z", > "timezone":"UTC", "log":"Adding decommission script to lifecycle"} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.667Z", > "timezone":"UTC", "log":"Adding decommission script to lifecycle"} > {"type":"log", 
"level":"INFO", "time":"2022-01-12T08:58:21.885Z", > "timezone":"UTC", "log":"Notify executor 5 to decommissioning."} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", > "timezone":"UTC", "log":"Notify executor 1 to decommissioning."} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", > "timezone":"UTC", "log":"Notify executor 3 to decommissioning."} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", > "timezone":"UTC", "log":"Mark BlockManagers (BlockManagerId(5, X.X.X.X, > 33359, None), BlockManagerId(1, X.X.X.X, 38655, None), BlockManagerId(3, > X.X.X.X, 35797, None)) as being decommissioning."} > {"type":"log", "level":"INFO", "time":"2022-01-12T08:59:24.426Z", > "timezone":"UTC", "log":"Executor 2 is removed. Remove reason statistics: > (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, > unexpectedly
[jira] [Created] (SPARK-38158) Rdd and shuffle blocks not migrated to new executors when decommission feature is enabled
Mohan Patidar created SPARK-38158: - Summary: Rdd and shuffle blocks not migrated to new executors when decommission feature is enabled Key: SPARK-38158 URL: https://issues.apache.org/jira/browse/SPARK-38158 Project: Spark Issue Type: Bug Components: Block Manager, Kubernetes Affects Versions: 3.2.0 Environment: Spark on k8s Reporter: Mohan Patidar We’re using Spark 3.2.0 and we have enabled the spark decommission feature. As part of validating this feature, we wanted to check if the rdd blocks and shuffle blocks from the decommissioned executors are migrated to other executors. However, we could not see this happening. Below is the configuration we used. # *Spark Configuration used:* ** spark.local.dir /mnt/spark-ldir spark.decommission.enabled true spark.storage.decommission.enabled true spark.storage.decommission.rddBlocks.enabled true spark.storage.decommission.shuffleBlocks.enabled true spark.dynamicAllocation.enabled true # *Brought up spark-driver and executors on the different nodes.* NAME READY STATUS NODE decommission-driver 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-1 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-2 1/1 Running *Node2* gzip-compression-test-ae0b0b7e4d7fbe40-exec-3 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-4 1/1 Running *Node2* gzip-compression-test-ae0b0b7e4d7fbe40-exec-5 1/1 Running *Node1* # *Bring down Node2; the status of the pods is then as follows.* NAME READY STATUS NODE decommission-driver 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-1 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-2 1/1 Terminating *Node2* gzip-compression-test-ae0b0b7e4d7fbe40-exec-3 1/1 Running *Node1* gzip-compression-test-ae0b0b7e4d7fbe40-exec-4 1/1 Terminating *Node2* gzip-compression-test-ae0b0b7e4d7fbe40-exec-5 1/1 Running *Node1* # *Driver logs:* {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.296Z", "timezone":"UTC", "log":"Adding decommission script to 
lifecycle"} {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.459Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"} {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.564Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"} {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.601Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"} {"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.667Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"} {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.885Z", "timezone":"UTC", "log":"Notify executor 5 to decommissioning."} {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", "timezone":"UTC", "log":"Notify executor 1 to decommissioning."} {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", "timezone":"UTC", "log":"Notify executor 3 to decommissioning."} {"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", "timezone":"UTC", "log":"Mark BlockManagers (BlockManagerId(5, X.X.X.X, 33359, None), BlockManagerId(1, X.X.X.X, 38655, None), BlockManagerId(3, X.X.X.X, 35797, None)) as being decommissioning."} {"type":"log", "level":"INFO", "time":"2022-01-12T08:59:24.426Z", "timezone":"UTC", "log":"Executor 2 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 1)."} {"type":"log", "level":"INFO", "time":"2022-01-12T08:59:24.426Z", "timezone":"UTC", "log":"Executor 4 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 2)."} # *Verified by exec'ing into all live executors (1, 3, 5) and checking the location (/mnt/spark-ldir/): only one blockManager id is present, and we do not see any other blockManager id copied to this location.* Example: $ kubectl exec -it gzip-compression-test-ae0b0b7e4d7fbe40-exec-1 -n test bash $cd
[jira] [Commented] (SPARK-36714) bugs in MinHashLSH
[ https://issues.apache.org/jira/browse/SPARK-36714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489258#comment-17489258 ] zhengruifeng commented on SPARK-36714: -- [~sheng_1992] Since you have investigated this issue, feel free to send a PR and ping me on it > bugs in MinHashLSH > --- > > Key: SPARK-36714 > URL: https://issues.apache.org/jira/browse/SPARK-36714 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: shengzhang >Priority: Minor > > This is about the MinHashLSH algorithm. > To compute the similarity of dataframes DFA and DFB, I used the MinHashLSH > approxSimilarityJoin function. But some data is missing from the result. > The example in the documentation works fine: > [https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance] > > But when the data is on a distributed system (Hive, more than one node), > some data goes missing. > For example, vector a = vector b, but the pair is not in the result of > approxSimilarityJoin, even though the > "threshold" is greater than 1. 
> I think maybe the problem is in these codes > {code:java} > // part1 > override protected[ml] def createRawLSHModel(inputDim: Int): MinHashLSHModel1 > = { > require(inputDim <= MinHashLSH.HASH_PRIME, > s"The input vector dimension $inputDim exceeds the threshold > ${MinHashLSH.HASH_PRIME}.") > val rand = new Random($(seed)) > val randCoefs: Array[(Int, Int)] = Array.fill($(numHashTables)) { > (1 + rand.nextInt(MinHashLSH.HASH_PRIME - 1), > rand.nextInt(MinHashLSH.HASH_PRIME - 1)) > } > new MinHashLSHModel1(uid, randCoefs) > } > // part2 > @Since("2.1.0") > override protected[ml] val hashFunction: Vector => Array[Vector] = { > elems: Vector => { > require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") > val elemsList = elems.toSparse.indices.toList > val hashValues = randCoefficients.map { case (a, b) => > elemsList.map { elem: Int => > ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME > }.min.toDouble > } > // TODO: Output vectors of dimension numHashFunctions in SPARK-18450 > hashValues.map(Vectors.dense(_)) > } > {code} > val r1 = new scala.util.Random(1) > r1.nextInt(1000) // -> 985 > val r2 = new scala.util.Random(2) > r2.nextInt(1000) // -> 108 > val r3 = new scala.util.Random(1) > r3.nextInt(1000) // -> 985, because seeded just as `r1` > r3.nextInt(1000) // -> 588 > The reason may be the above: when the Random is initialized only once, each call to > random.nextInt() returns a different value, like r3 (985 first, then 588). > So moving the line val rand = new Random($(seed)) from createRawLSHModel into > hashFunction would be better: every worker would then initialize its own Random > from the seed and compute the same coefficients. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
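The seeding behavior the reporter describes can be checked outside Spark. Here is a minimal Python analogue (Python's random.Random is a different PRNG than scala.util.Random, so the concrete values 985/108/588 do not carry over, but the seed semantics are identical):

```python
import random


def draws(seed: int, n: int) -> list:
    """Draw n ints in [0, 1000) from a generator freshly seeded with `seed`."""
    r = random.Random(seed)
    return [r.randrange(1000) for _ in range(n)]


a = draws(1, 2)   # like r1/r3 above: a fresh generator seeded with 1
b = draws(1, 2)
print(a == b)     # True: same seed, same sequence
# Within one sequence the draws keep advancing (the r3 985-then-588 situation),
# which is why re-seeding inside hashFunction would make every worker
# regenerate identical coefficients instead of consuming one shared stream.
```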
[jira] [Created] (SPARK-38157) Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode
Xinyi Yu created SPARK-38157: Summary: Fix /sql/hive-thriftserver/org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite under ANSI mode Key: SPARK-38157 URL: https://issues.apache.org/jira/browse/SPARK-38157 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Xinyi Yu ThriftServerQueryTestSuite will fail on {{timestampNTZ/timestamp.sql}} when ANSI mode is on by default. This is because {{timestampNTZ/timestamp.sql}} is only expected to pass with ANSI off, according to its golden result file, but neither ThriftServerQueryTestSuite nor the timestamp.sql test overrides the default ANSI setting. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38156) Support CREATE EXTERNAL TABLE LIKE syntax
[ https://issues.apache.org/jira/browse/SPARK-38156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489250#comment-17489250 ] Apache Spark commented on SPARK-38156: -- User 'yeshengm' has created a pull request for this issue: https://github.com/apache/spark/pull/35456 > Support CREATE EXTERNAL TABLE LIKE syntax > - > > Key: SPARK-38156 > URL: https://issues.apache.org/jira/browse/SPARK-38156 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yesheng Ma >Priority: Major > > Spark already has the syntax of `CREATE TABLE LIKE`. It's intuitive for users > to say `CREATE EXTERNAL TABLE a LIKE b LOCATION 'path'`. However this syntax > is not supported in Spark right now and we should make these CREATE TABLE > DDLs consistent. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38156) Support CREATE EXTERNAL TABLE LIKE syntax
[ https://issues.apache.org/jira/browse/SPARK-38156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489249#comment-17489249 ] Apache Spark commented on SPARK-38156: -- User 'yeshengm' has created a pull request for this issue: https://github.com/apache/spark/pull/35456 > Support CREATE EXTERNAL TABLE LIKE syntax > - > > Key: SPARK-38156 > URL: https://issues.apache.org/jira/browse/SPARK-38156 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yesheng Ma >Priority: Major > > Spark already has the syntax of `CREATE TABLE LIKE`. It's intuitive for users > to say `CREATE EXTERNAL TABLE a LIKE b LOCATION 'path'`. However this syntax > is not supported in Spark right now and we should make these CREATE TABLE > DDLs consistent. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38156) Support CREATE EXTERNAL TABLE LIKE syntax
[ https://issues.apache.org/jira/browse/SPARK-38156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38156: Assignee: (was: Apache Spark) > Support CREATE EXTERNAL TABLE LIKE syntax > - > > Key: SPARK-38156 > URL: https://issues.apache.org/jira/browse/SPARK-38156 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yesheng Ma >Priority: Major > > Spark already has the syntax of `CREATE TABLE LIKE`. It's intuitive for users > to say `CREATE EXTERNAL TABLE a LIKE b LOCATION 'path'`. However this syntax > is not supported in Spark right now and we should make these CREATE TABLE > DDLs consistent. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38156) Support CREATE EXTERNAL TABLE LIKE syntax
[ https://issues.apache.org/jira/browse/SPARK-38156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38156: Assignee: Apache Spark > Support CREATE EXTERNAL TABLE LIKE syntax > - > > Key: SPARK-38156 > URL: https://issues.apache.org/jira/browse/SPARK-38156 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yesheng Ma >Assignee: Apache Spark >Priority: Major > > Spark already has the syntax of `CREATE TABLE LIKE`. It's intuitive for users > to say `CREATE EXTERNAL TABLE a LIKE b LOCATION 'path'`. However this syntax > is not supported in Spark right now and we should make these CREATE TABLE > DDLs consistent. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38156) Support CREATE EXTERNAL TABLE LIKE syntax
Yesheng Ma created SPARK-38156: -- Summary: Support CREATE EXTERNAL TABLE LIKE syntax Key: SPARK-38156 URL: https://issues.apache.org/jira/browse/SPARK-38156 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: Yesheng Ma Spark already has the syntax of `CREATE TABLE LIKE`. It's intuitive for users to say `CREATE EXTERNAL TABLE a LIKE b LOCATION 'path'`. However, this syntax is not currently supported in Spark, and we should make these CREATE TABLE DDLs consistent. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38155) Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates
[ https://issues.apache.org/jira/browse/SPARK-38155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-38155: - Description: Block lateral subqueries in CheckAnalysis that contains DISTINCT aggregate and correlated non-equality predicates. This can lead to incorrect results as DISTINCT will be rewritten as Aggregate during the optimization phase. For example {code:java} CREATE VIEW t1(c1, c2) AS VALUES (0, 1) CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2) SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1) {code} The correct results should be (0, 1, 2) but currently, it gives [(0, 1, 2), (0, 1, 2)]. was: Block lateral subqueries in CheckAnalysis that contains DISTINCT aggregate and correlated non-equality predicates. This can lead to incorrect results as DISTINCT will be rewritten as Aggregate during the optimization phase. For example CREATE VIEW t1(c1, c2) AS VALUES (0, 1) CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2) SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1) The correct results should be (0, 1, 2) but currently, it gives [(0, 1, 2), (0, 1, 2)] > Disallow distinct aggregate in lateral subqueries with unsupported correlated > predicates > > > Key: SPARK-38155 > URL: https://issues.apache.org/jira/browse/SPARK-38155 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Priority: Major > > Block lateral subqueries in CheckAnalysis that contains DISTINCT aggregate > and correlated non-equality predicates. This can lead to incorrect results as > DISTINCT will be rewritten as Aggregate during the optimization phase. > For example > > {code:java} > CREATE VIEW t1(c1, c2) AS VALUES (0, 1) > CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2) > SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1) > {code} > > The correct results should be (0, 1, 2) but currently, it gives [(0, 1, 2), > (0, 1, 2)]. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38155) Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates
Allison Wang created SPARK-38155: Summary: Disallow distinct aggregate in lateral subqueries with unsupported correlated predicates Key: SPARK-38155 URL: https://issues.apache.org/jira/browse/SPARK-38155 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Allison Wang Block lateral subqueries in CheckAnalysis that contain a DISTINCT aggregate and correlated non-equality predicates. This can lead to incorrect results, as DISTINCT will be rewritten as an Aggregate during the optimization phase. For example CREATE VIEW t1(c1, c2) AS VALUES (0, 1) CREATE VIEW t2(c1, c2) AS VALUES (1, 2), (2, 2) SELECT * FROM t1 JOIN LATERAL (SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1) The correct result should be (0, 1, 2), but currently it gives [(0, 1, 2), (0, 1, 2)]. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
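The intended semantics can be modeled in plain Python: LATERAL evaluates the subquery once per t1 row, and DISTINCT applies within that per-row evaluation. This models the correct answer, not Spark's faulty rewrite:

```python
t1 = [(0, 1)]
t2 = [(1, 2), (2, 2)]

# LATERAL join: run the DISTINCT subquery once per row of t1
result = []
for (c1, c2) in t1:
    # SELECT DISTINCT c2 FROM t2 WHERE c1 > t1.c1
    distinct_c2 = sorted({b2 for (b1, b2) in t2 if b1 > c1})
    for v in distinct_c2:
        result.append((c1, c2, v))

print(result)  # [(0, 1, 2)]: a single row, not the duplicated [(0, 1, 2), (0, 1, 2)]
```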
[jira] [Created] (SPARK-38154) Set up a new GA job to run tests with ANSI mode
Wenchen Fan created SPARK-38154: --- Summary: Set up a new GA job to run tests with ANSI mode Key: SPARK-38154 URL: https://issues.apache.org/jira/browse/SPARK-38154 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
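A minimal sketch of what such a GitHub Actions job could look like. Both the job layout and the SPARK_ANSI_SQL_MODE environment switch are assumptions here, not taken from Spark's actual .github/workflows files:

```yaml
# Hypothetical excerpt of a GA job that reruns SQL tests with ANSI mode on
ansi-sql-tests:
  runs-on: ubuntu-latest
  env:
    SPARK_ANSI_SQL_MODE: "true"   # assumed: flips the default of spark.sql.ansi.enabled
  steps:
    - uses: actions/checkout@v2
    - name: Run SQL tests with ANSI mode enabled
      run: ./build/sbt "sql/test"
```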
[jira] [Resolved] (SPARK-38149) Upgrade joda-time to 2.10.13
[ https://issues.apache.org/jira/browse/SPARK-38149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38149. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35452 [https://github.com/apache/spark/pull/35452] > Upgrade joda-time to 2.10.13 > > > Key: SPARK-38149 > URL: https://issues.apache.org/jira/browse/SPARK-38149 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0 > > > joda-time 2.10.13 was released, which supports the latest TZ database of > 2021e. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38147) Upgrade shapeless to 2.3.7
[ https://issues.apache.org/jira/browse/SPARK-38147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38147. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35450 [https://github.com/apache/spark/pull/35450] > Upgrade shapeless to 2.3.7 > -- > > Key: SPARK-38147 > URL: https://issues.apache.org/jira/browse/SPARK-38147 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37507) Add the TO_BINARY() function
[ https://issues.apache.org/jira/browse/SPARK-37507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37507: Assignee: (was: Apache Spark) > Add the TO_BINARY() function > > > Key: SPARK-37507 > URL: https://issues.apache.org/jira/browse/SPARK-37507 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > to_binary(expr, fmt) is a common function available in many other systems to > provide a unified entry for string to binary data conversion, where fmt can > be utf8, base64, hex and base2 (or whatever the reverse operation > to_char()supports). > [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] > [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] > [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] > [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] > Related Spark functions: unbase64, unhex -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37507) Add the TO_BINARY() function
[ https://issues.apache.org/jira/browse/SPARK-37507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489238#comment-17489238 ] Apache Spark commented on SPARK-37507: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/35415 > Add the TO_BINARY() function > > > Key: SPARK-37507 > URL: https://issues.apache.org/jira/browse/SPARK-37507 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > to_binary(expr, fmt) is a common function available in many other systems to > provide a unified entry for string to binary data conversion, where fmt can > be utf8, base64, hex and base2 (or whatever the reverse operation > to_char()supports). > [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] > [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] > [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] > [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] > Related Spark functions: unbase64, unhex -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37507) Add the TO_BINARY() function
[ https://issues.apache.org/jira/browse/SPARK-37507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37507: Assignee: Apache Spark > Add the TO_BINARY() function > > > Key: SPARK-37507 > URL: https://issues.apache.org/jira/browse/SPARK-37507 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > to_binary(expr, fmt) is a common function available in many other systems to > provide a unified entry for string to binary data conversion, where fmt can > be utf8, base64, hex and base2 (or whatever the reverse operation > to_char()supports). > [https://docs.aws.amazon.com/redshift/latest/dg/r_TO_VARBYTE.html] > [https://docs.snowflake.com/en/sql-reference/functions/to_binary.html] > [https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#format_string_as_bytes] > [https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/etRo5aTAY9n5fUPjxSEynw] > Related Spark functions: unbase64, unhex -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
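As a sketch of what a unified to_binary(expr, fmt) entry point does, here is a Python stand-in built on the standard library. The function name and the format set mirror the proposal's description (utf8, hex, base64, base2), not Spark's final API:

```python
import base64
import binascii


def to_binary(expr: str, fmt: str = "hex") -> bytes:
    """Convert a string to bytes, dispatching on a format name."""
    fmt = fmt.lower()
    if fmt in ("utf8", "utf-8"):
        return expr.encode("utf-8")
    if fmt == "hex":
        return binascii.unhexlify(expr)      # reverse of hex encoding (cf. unhex)
    if fmt == "base64":
        return base64.b64decode(expr)        # reverse of base64 encoding (cf. unbase64)
    if fmt == "base2":
        # interpret the string as a sequence of 8-bit groups
        return bytes(int(expr[8 * i: 8 * i + 8], 2) for i in range(len(expr) // 8))
    raise ValueError(f"unsupported format: {fmt}")


print(to_binary("Spark", "utf8"))      # b'Spark'
print(to_binary("537061726b", "hex"))  # b'Spark'
```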
[jira] [Assigned] (SPARK-38153) Remove option newlines.topLevelStatements in scalafmt.conf
[ https://issues.apache.org/jira/browse/SPARK-38153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38153: Assignee: Apache Spark (was: Gengliang Wang) > Remove option newlines.topLevelStatements in scalafmt.conf > -- > > Key: SPARK-38153 > URL: https://issues.apache.org/jira/browse/SPARK-38153 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > The configuration > newlines.topLevelStatements = [before,after] > is to add blank line before the first member or after the last member of the > class. > This is neither encouraged nor discouraged as per > https://github.com/databricks/scala-style-guide#blanklines -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38153) Remove option newlines.topLevelStatements in scalafmt.conf
[ https://issues.apache.org/jira/browse/SPARK-38153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38153: Assignee: Gengliang Wang (was: Apache Spark) > Remove option newlines.topLevelStatements in scalafmt.conf > -- > > Key: SPARK-38153 > URL: https://issues.apache.org/jira/browse/SPARK-38153 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > The configuration > newlines.topLevelStatements = [before,after] > is to add blank line before the first member or after the last member of the > class. > This is neither encouraged nor discouraged as per > https://github.com/databricks/scala-style-guide#blanklines -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38153) Remove option newlines.topLevelStatements in scalafmt.conf
[ https://issues.apache.org/jira/browse/SPARK-38153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489236#comment-17489236 ] Apache Spark commented on SPARK-38153: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/35455 > Remove option newlines.topLevelStatements in scalafmt.conf > -- > > Key: SPARK-38153 > URL: https://issues.apache.org/jira/browse/SPARK-38153 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > The configuration > newlines.topLevelStatements = [before,after] > is to add blank line before the first member or after the last member of the > class. > This is neither encouraged nor discouraged as per > https://github.com/databricks/scala-style-guide#blanklines -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38153) Remove option newlines.topLevelStatements in scalafmt.conf
Gengliang Wang created SPARK-38153: -- Summary: Remove option newlines.topLevelStatements in scalafmt.conf Key: SPARK-38153 URL: https://issues.apache.org/jira/browse/SPARK-38153 Project: Spark Issue Type: Task Components: Build Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang The configuration newlines.topLevelStatements = [before,after] adds a blank line before the first member and after the last member of a class. This is neither encouraged nor discouraged as per https://github.com/databricks/scala-style-guide#blanklines -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
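For reference, the entry being removed looks like this in a scalafmt config (an illustrative sketch; see the Spark repo for the actual file and surrounding keys). Deleting the key leaves blank-line placement around first/last class members to the author:

```hocon
# scalafmt: forces a blank line before the first and after the last member of a class
newlines.topLevelStatements = [before, after]
```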
[jira] [Commented] (SPARK-31007) KMeans optimization based on triangle-inequality
[ https://issues.apache.org/jira/browse/SPARK-31007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489235#comment-17489235 ] Sean R. Owen commented on SPARK-31007: -- Hm, can you use a simpler "unpacked" representation for those statistics? It's not the memory per se, but the size of a single array. Instead, an array of arrays? > KMeans optimization based on triangle-inequality > > > Key: SPARK-31007 > URL: https://issues.apache.org/jira/browse/SPARK-31007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.1.0 > > Attachments: ICML03-022.pdf > > > In the current impl, the following Lemma is used in KMeans: > 0, Let x be a point, let c be a center and o be the origin; then d(x,c) >= > |(d(x,o) - d(c,o))| = |norm(x)-norm(c)| > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}} > According to [Using the Triangle Inequality to Accelerate > K-Means|https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf], we can go > further, and there are another two Lemmas that can be used: > 1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then > d(x,c) >= d(x,b); > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}. > However, luckily for CosineDistance we can get a variant in the space of > radian/angle. > 2, Let x be a point, and let b and c be centers. Then d(x,c) >= max\{0, > d(x,b)-d(b,c)}; > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}} > The application of Lemma 2 is a little complex: it needs to cache/update the > distance/lower bounds to previous centers, and thus can only be applied in > training, not in prediction. > So this ticket is mainly for Lemma 1. 
Its idea is quite simple: if point x is > close enough to center b (within a pre-computed radius), then we can say > point x belongs to center b without computing the distances between x and > other centers. It can be used in both training and prediction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
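Lemma 1 above follows directly from the triangle inequality: d(b,c) <= d(x,b) + d(x,c), so d(x,c) >= d(b,c) - d(x,b) >= 2 d(x,b) - d(x,b) = d(x,b). A quick numeric spot-check on random points (illustrative only, not Spark's implementation):

```python
import math
import random

rng = random.Random(0)


def rand_point():
    return [rng.uniform(-1.0, 1.0) for _ in range(3)]


checked = 0
for _ in range(10_000):
    x, b, c = rand_point(), rand_point(), rand_point()
    if math.dist(b, c) >= 2 * math.dist(x, b):
        # Lemma 1: b is guaranteed at least as close to x as c is,
        # so d(x, c) never needs to be computed during assignment.
        assert math.dist(x, c) >= math.dist(x, b) - 1e-12
        checked += 1
print(f"lemma 1 held on {checked} qualifying triples")
```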
[jira] [Assigned] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38145: Assignee: (was: Apache Spark) > Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True > > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests are failed when the `spark.sql.ansi.enabled` is set to True. > For stability of project, it's always good to test passed regardless of the > option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489234#comment-17489234 ] Apache Spark commented on SPARK-38145: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/35454 > Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True > > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests are failed when the `spark.sql.ansi.enabled` is set to True. > For stability of project, it's always good to test passed regardless of the > option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38145: Assignee: Apache Spark > Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True > > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests are failed when the `spark.sql.ansi.enabled` is set to True. > For stability of project, it's always good to test passed regardless of the > option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38086) Make ArrowColumnVector Extendable
[ https://issues.apache.org/jira/browse/SPARK-38086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38086: - Assignee: Kazuyuki Tanimura > Make ArrowColumnVector Extendable > - > > Key: SPARK-38086 > URL: https://issues.apache.org/jira/browse/SPARK-38086 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Minor > > Some Spark extension libraries need to extend ArrowColumnVector.java. For > now, it is impossible as ArrowColumnVector class is final and the accessors > are all private. > For example, Rapids copies the entire ArrowColumnVector class in order to > work around the issue > [https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java] > Proposing to relax private/final restrictions to make ArrowColumnVector > extendable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38086) Make ArrowColumnVector Extendable
[ https://issues.apache.org/jira/browse/SPARK-38086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38086. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35393 [https://github.com/apache/spark/pull/35393] > Make ArrowColumnVector Extendable > - > > Key: SPARK-38086 > URL: https://issues.apache.org/jira/browse/SPARK-38086 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Minor > Fix For: 3.3.0 > > > Some Spark extension libraries need to extend ArrowColumnVector.java. For > now, it is impossible as ArrowColumnVector class is final and the accessors > are all private. > For example, Rapids copies the entire ArrowColumnVector class in order to > work around the issue > [https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java] > Proposing to relax private/final restrictions to make ArrowColumnVector > extendable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36714) bugs in MinHashLSH
[ https://issues.apache.org/jira/browse/SPARK-36714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489232#comment-17489232 ] zhengruifeng commented on SPARK-36714: -- Could you please provide a simple script to reproduce this issue? > bugs in MinHashLSH > --- > > Key: SPARK-36714 > URL: https://issues.apache.org/jira/browse/SPARK-36714 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: shengzhang >Priority: Minor > > This is about the MinHashLSH algorithm. > To compute the similarity of dataframes DFA and DFB, I used the MinHashLSH > approxSimilarityJoin function. But some data is missing from the result. > The example in the documentation works fine: > [https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance] > > But when the data is on a distributed system (Hive, more than one node), > some data goes missing. > For example, vector a = vector b, but the pair is not in the result of > approxSimilarityJoin, even though the > "threshold" is greater than 1. 
> I think maybe the problem is in these codes > {code:java} > // part1 > override protected[ml] def createRawLSHModel(inputDim: Int): MinHashLSHModel1 > = { > require(inputDim <= MinHashLSH.HASH_PRIME, > s"The input vector dimension $inputDim exceeds the threshold > ${MinHashLSH.HASH_PRIME}.") > val rand = new Random($(seed)) > val randCoefs: Array[(Int, Int)] = Array.fill($(numHashTables)) { > (1 + rand.nextInt(MinHashLSH.HASH_PRIME - 1), > rand.nextInt(MinHashLSH.HASH_PRIME - 1)) > } > new MinHashLSHModel1(uid, randCoefs) > } > // part2 > @Since("2.1.0") > override protected[ml] val hashFunction: Vector => Array[Vector] = { > elems: Vector => { > require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.") > val elemsList = elems.toSparse.indices.toList > val hashValues = randCoefficients.map { case (a, b) => > elemsList.map { elem: Int => > ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME > }.min.toDouble > } > // TODO: Output vectors of dimension numHashFunctions in SPARK-18450 > hashValues.map(Vectors.dense(_)) > } > {code} > val r1 = new scala.util.Random(1) > r1.nextInt(1000) // -> 985 > val r2 = new scala.util.Random(2) > r2.nextInt(1000) // -> 108 > val r3 = new scala.util.Random(1) > r3.nextInt(1000) // -> 985, because seeded just as `r1` > r3.nextInt(1000) // -> 588 > The reason may be the above: when the Random is initialized only once, each call to > random.nextInt() returns a different value, like r3 (985 first, then 588). > So moving the line val rand = new Random($(seed)) from createRawLSHModel into > hashFunction would be better: every worker would then initialize its own Random > from the seed and compute the same coefficients. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38145) Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-38145: Summary: Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True (was: Address the tests when "spark.sql.ansi.enabled" is True) > Address the 'pyspark' tagged tests when "spark.sql.ansi.enabled" is True > > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests fail when `spark.sql.ansi.enabled` is set to True. > For the stability of the project, tests should pass regardless of the > option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31007) KMeans optimization based on triangle-inequality
[ https://issues.apache.org/jira/browse/SPARK-31007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489230#comment-17489230 ] zhengruifeng commented on SPARK-31007: -- [~srowen] This optimization needs an array of size k * (k + 1) / 2 (val packedValues = Array.ofDim[Double](k * (k + 1) / 2)), which may cause failures when k is large. https://issues.apache.org/jira/browse/SPARK-36553 k=5 Should we: 1, revert this optimization; or 2, enable it only for small k? (but the impl would be more complex) > KMeans optimization based on triangle-inequality > > > Key: SPARK-31007 > URL: https://issues.apache.org/jira/browse/SPARK-31007 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.1.0 > > Attachments: ICML03-022.pdf > > > In the current impl, the following lemma is used in KMeans: > 0, Let x be a point, let c be a center and o be the origin; then d(x,c) >= > |d(x,o) - d(c,o)| = |norm(x)-norm(c)| > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}} > According to [Using the Triangle Inequality to Accelerate > K-Means|https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf], we can go > further, and there are another two lemmas that can be used: > 1, Let x be a point, and let b and c be centers. If d(b,c)>=2d(x,b) then > d(x,c) >= d(x,b); > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}}. > However, luckily for CosineDistance we can get a variant in the space of > radian/angle. > 2, Let x be a point, and let b and c be centers. Then d(x,c) >= max\{0, > d(x,b)-d(b,c)}; > this can be applied in {{EuclideanDistance}}, but not in {{CosineDistance}} > The application of Lemma 2 is a little complex: it needs to cache/update the > distances/lower bounds to previous centers, and thus can only be applied in > training, not in prediction. > So this ticket is mainly for Lemma 1. 
Its idea is quite simple: if point x is > close enough to center b (closer than a pre-computed radius), then we can assign > point x to center b without computing the distances between x and the > other centers. It can be used in both training and prediction. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
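As a hedged illustration of Lemma 1 (plain Python over 2-D points, not Spark's implementation; Spark precomputes the packed center-to-center distances once per iteration, whereas this sketch recomputes them for simplicity):

```python
import math

def dist(p, q):
    # Euclidean distance between 2-D points.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def closest_center(x, centers):
    """Assign x to its closest center, using Lemma 1 to prune candidates:
    if d(best, c) >= 2 * d(x, best), then d(x, c) >= d(x, best),
    so c can never beat the current best and dist(x, c) is never computed."""
    best, best_d = None, math.inf
    for c in centers:
        # In Spark this check is a lookup into the precomputed
        # center-to-center array, not a fresh distance computation.
        if best is not None and dist(best, c) >= 2 * best_d:
            continue  # pruned without computing dist(x, c)
        d = dist(x, c)
        if d < best_d:
            best, best_d = c, d
    return best, best_d

print(closest_center((1, 0), [(0, 0), (10, 0), (20, 0)]))  # -> ((0, 0), 1.0)
```

In the example, once (0, 0) is the best center at distance 1, both (10, 0) and (20, 0) are rejected by the 2*d(x, b) test alone.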
[jira] [Created] (SPARK-38150) Update comment of RelationConversions
angerszhu created SPARK-38150: - Summary: Update comment of RelationConversions Key: SPARK-38150 URL: https://issues.apache.org/jira/browse/SPARK-38150 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.1, 3.2.0 Reporter: angerszhu Fix For: 3.3.0 Current comment of RelationConversions is not correct -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38037) Spark MLlib FPGrowth not working with 40+ items in Frequent Item set
[ https://issues.apache.org/jira/browse/SPARK-38037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489225#comment-17489225 ] zhengruifeng commented on SPARK-38037: -- could you please provide a simple script to reproduce this issue? > Spark MLlib FPGrowth not working with 40+ items in Frequent Item set > > > Key: SPARK-38037 > URL: https://issues.apache.org/jira/browse/SPARK-38037 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.2.0 > Environment: Standalone Linux server > 32 GB RAM > 4 core > >Reporter: RJ >Priority: Major > > We have been using Spark FPGrowth and it works well with millions of > transactions (records) when the number of frequent items in the Frequent Itemset is > less than 25. Beyond 25 it runs into a computational limit. For 40+ items in > the Frequent Itemset the process never returns. > To reproduce, you can create a simple data set of 3 transactions with equal > items (40 of them) and run FPGrowth with 0.9 support; the process never > completes. Below is the sample data I have used to narrow down the problem: > |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40| > |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40| > |I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|I12|I13|I14|I15|I16|I17|I18|I19|I20|I21|I22|I23|I24|I25|I26|I27|I28|I29|I30|I31|I32|I33|I34|I35|I36|I37|I38|I39|I40| > > While the computation grows as 2^n - 1 with each item in the Frequent Itemset, it > surely should be able to handle 40 or more items in a Frequent Itemset > > Is this an FPGrowth implementation limitation, or > are there any tuning parameters that I am missing? Thank you. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
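The 2^n - 1 growth the reporter mentions is easy to confirm: when n items are frequent together, every non-empty subset is itself a frequent itemset, so at n = 40 the output alone is astronomically large regardless of the algorithm. A quick Python check (illustrative only, independent of Spark):

```python
from itertools import combinations

def frequent_itemset_count(n):
    # Every non-empty subset of an n-item frequent itemset is frequent.
    return 2 ** n - 1

# Cross-check the closed form against explicit enumeration for a small n.
n = 5
enumerated = sum(1 for k in range(1, n + 1)
                 for _ in combinations(range(n), k))
assert enumerated == frequent_itemset_count(n)  # 31

print(frequent_itemset_count(25))  # 33554431 (~3.4e7): still feasible
print(frequent_itemset_count(40))  # ~1.1e12: materializing this output hangs
```

So the observed hang is less a bug than a consequence of asking for roughly a trillion itemsets; raising the support threshold or reducing the overlap between transactions is the practical way out.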
[jira] [Updated] (SPARK-38152) Upgrade Jetty version from 9.4.40 to 9.4.44
[ https://issues.apache.org/jira/browse/SPARK-38152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-38152: --- Description: Upgrade Jetty version to 9.4.44 in Spark 2.4.8 and 3.0.3 in order to fix vulnerability CVE-2021-28169, CVE-2021-34429, PRISMA-2021-0182 (was: Upgrade Jetty version to 9.4.44 in Spark 2.4.8 and 3.0.3 inn order to fix vulnerability CVE-2021-28169, CVE-2021-34429, PRISMA-2021-0182) > Upgrade Jetty version from 9.4.40 to 9.4.44 > --- > > Key: SPARK-38152 > URL: https://issues.apache.org/jira/browse/SPARK-38152 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.8, 3.0.3 >Reporter: SHOBHIT SHUKLA >Priority: Major > > Upgrade Jetty version to 9.4.44 in Spark 2.4.8 and 3.0.3 in order to fix > vulnerability CVE-2021-28169, CVE-2021-34429, PRISMA-2021-0182 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38152) Upgrade Jetty version from 9.4.40 to 9.4.44
SHOBHIT SHUKLA created SPARK-38152: -- Summary: Upgrade Jetty version from 9.4.40 to 9.4.44 Key: SPARK-38152 URL: https://issues.apache.org/jira/browse/SPARK-38152 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.3, 2.4.8 Reporter: SHOBHIT SHUKLA Upgrade Jetty version to 9.4.44 in Spark 2.4.8 and 3.0.3 in order to fix vulnerability CVE-2021-28169, CVE-2021-34429, PRISMA-2021-0182 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38150) Update comment of RelationConversions
[ https://issues.apache.org/jira/browse/SPARK-38150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489219#comment-17489219 ] Apache Spark commented on SPARK-38150: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/35453 > Update comment of RelationConversions > - > > Key: SPARK-38150 > URL: https://issues.apache.org/jira/browse/SPARK-38150 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Current comment of RelationConversions is not correct -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38150) Update comment of RelationConversions
[ https://issues.apache.org/jira/browse/SPARK-38150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38150: Assignee: (was: Apache Spark) > Update comment of RelationConversions > - > > Key: SPARK-38150 > URL: https://issues.apache.org/jira/browse/SPARK-38150 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Current comment of RelationConversions is not correct -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38150) Update comment of RelationConversions
[ https://issues.apache.org/jira/browse/SPARK-38150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38150: Assignee: Apache Spark > Update comment of RelationConversions > - > > Key: SPARK-38150 > URL: https://issues.apache.org/jira/browse/SPARK-38150 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > > Current comment of RelationConversions is not correct -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38030: -- Fix Version/s: 3.1.3 > Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1 > - > > Key: SPARK-38030 > URL: https://issues.apache.org/jira/browse/SPARK-38030 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: Shardul Mahadik >Assignee: Shardul Mahadik >Priority: Major > Fix For: 3.1.3, 3.3.0, 3.2.2 > > > One of our user queries failed in Spark 3.1.1 when using AQE with the > following stacktrace mentioned below (some parts of the plan have been > redacted, but the structure is preserved). > Debugging this issue, we found that the failure was within AQE calling > [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402]. > The query contains a cast over a column with non-nullable struct fields. > Canonicalization [removes nullability > information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45] > from the child {{AttributeReference}} of the Cast, however it does not > remove nullability information from the Cast's target dataType. This causes > the > [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290] > to return false because the child is now nullable and cast target data type > is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}. 
> {code:java} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: > Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232] > +- Union >:- Project [cast(columnA#30) as struct<...>] >: +- BatchScan[columnA#30] hive.tbl >+- Project [cast(columnA#35) as struct<...>] > +- BatchScan[columnA#35] hive.tbl > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464) > at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87) > at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) > at > 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279) > at
[jira] [Updated] (SPARK-38030) Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-38030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38030: -- Fix Version/s: 3.2.2 > Query with cast containing non-nullable columns fails with AQE on Spark 3.1.1 > - > > Key: SPARK-38030 > URL: https://issues.apache.org/jira/browse/SPARK-38030 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: Shardul Mahadik >Assignee: Shardul Mahadik >Priority: Major > Fix For: 3.3.0, 3.2.2 > > > One of our user queries failed in Spark 3.1.1 when using AQE with the > following stacktrace mentioned below (some parts of the plan have been > redacted, but the structure is preserved). > Debugging this issue, we found that the failure was within AQE calling > [QueryPlan.canonicalized|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L402]. > The query contains a cast over a column with non-nullable struct fields. > Canonicalization [removes nullability > information|https://github.com/apache/spark/blob/91db9a36a9ed74845908f14d21227d5267591653/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala#L45] > from the child {{AttributeReference}} of the Cast, however it does not > remove nullability information from the Cast's target dataType. This causes > the > [checkInputDataTypes|https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L290] > to return false because the child is now nullable and cast target data type > is not, leading to {{resolved=false}} and hence the {{UnresolvedException}}. 
> {code:java} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: > Exchange RoundRobinPartitioning(1), REPARTITION_BY_NUM, [id=#232] > +- Union >:- Project [cast(columnA#30) as struct<...>] >: +- BatchScan[columnA#30] hive.tbl >+- Project [cast(columnA#35) as struct<...>] > +- BatchScan[columnA#35] hive.tbl > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:475) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:464) > at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:87) > at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:58) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:405) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:373) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:372) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:404) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$createQueryStages$2(AdaptiveSparkPlanExec.scala:447) > at > 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.createQueryStages(AdaptiveSparkPlanExec.scala:447) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:184) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:179) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:279) > at
[jira] [Updated] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
[ https://issues.apache.org/jira/browse/SPARK-38151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38151: -- Description: - https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true {code} [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds) [info] 9131 did not equal 9130 Round trip of 9130 did not work in tz Pacific/Kanton (DateTimeUtilsSuite.scala:783) {code} was: - https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true {code} [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds) 19042 [info] 9131 did not equal 9130 Round trip of 9130 did not work in tz Pacific/Kanton (DateTimeUtilsSuite.scala:783) {code} > Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays` > -- > > Key: SPARK-38151 > URL: https://issues.apache.org/jira/browse/SPARK-38151 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true > {code} > [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds) > [info] 9131 did not equal 9130 Round trip of 9130 did not work in tz > Pacific/Kanton (DateTimeUtilsSuite.scala:783) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38151) Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays`
Dongjoon Hyun created SPARK-38151: - Summary: Flaky Test: DateTimeUtilsSuite.`daysToMicros and microsToDays` Key: SPARK-38151 URL: https://issues.apache.org/jira/browse/SPARK-38151 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.3.0 Reporter: Dongjoon Hyun - https://github.com/dongjoon-hyun/spark/runs/5119322349?check_suite_focus=true {code} [info] - daysToMicros and microsToDays *** FAILED *** (620 milliseconds) 19042 [info] 9131 did not equal 9130 Round trip of 9130 did not work in tz Pacific/Kanton (DateTimeUtilsSuite.scala:783) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38149) Upgrade joda-time to 2.10.13
[ https://issues.apache.org/jira/browse/SPARK-38149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38149: Assignee: Apache Spark (was: Kousuke Saruta) > Upgrade joda-time to 2.10.13 > > > Key: SPARK-38149 > URL: https://issues.apache.org/jira/browse/SPARK-38149 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > joda-time 2.10.13 was released, which supports the latest TZ database of > 2021e. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38149) Upgrade joda-time to 2.10.13
[ https://issues.apache.org/jira/browse/SPARK-38149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489214#comment-17489214 ] Apache Spark commented on SPARK-38149: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/35452 > Upgrade joda-time to 2.10.13 > > > Key: SPARK-38149 > URL: https://issues.apache.org/jira/browse/SPARK-38149 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > joda-time 2.10.13 was released, which supports the latest TZ database of > 2021e. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38149) Upgrade joda-time to 2.10.13
[ https://issues.apache.org/jira/browse/SPARK-38149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38149: Assignee: Kousuke Saruta (was: Apache Spark) > Upgrade joda-time to 2.10.13 > > > Key: SPARK-38149 > URL: https://issues.apache.org/jira/browse/SPARK-38149 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > joda-time 2.10.13 was released, which supports the latest TZ database of > 2021e. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38149) Upgrade joda-time to 2.10.13
Kousuke Saruta created SPARK-38149: -- Summary: Upgrade joda-time to 2.10.13 Key: SPARK-38149 URL: https://issues.apache.org/jira/browse/SPARK-38149 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta joda-time 2.10.13 was released, which supports the latest TZ database of 2021e. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38148) Do not add dynamic partition pruning if there exists static partition pruning
[ https://issues.apache.org/jira/browse/SPARK-38148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489212#comment-17489212 ] Apache Spark commented on SPARK-38148: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/35451 > Do not add dynamic partition pruning if there exists static partition pruning > - > > Key: SPARK-38148 > URL: https://issues.apache.org/jira/browse/SPARK-38148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > Dynamic partition pruning adds a filter whenever the join condition contains > partition columns. But if another condition already provides static partition > pruning, it's unnecessary to add an extra dynamic partition > pruning filter. > For example: > {code:java} > CREATE TABLE t1 (c1 int) USING PARQUET PARTITIONED BY (p1 string); > CREATE TABLE t2 (c2 int) USING PARQUET PARTITIONED BY (p2 string); > SELECT * FROM t1 JOIN t2 ON t1.p1 = t2.p2 and t1.p1 = 'a' AND t2.c2 > 0; > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38148) Do not add dynamic partition pruning if there exists static partition pruning
[ https://issues.apache.org/jira/browse/SPARK-38148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38148: Assignee: (was: Apache Spark) > Do not add dynamic partition pruning if there exists static partition pruning > - > > Key: SPARK-38148 > URL: https://issues.apache.org/jira/browse/SPARK-38148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > Dynamic partition pruning adds a filter whenever the join condition contains > partition columns. But if another condition already provides static partition > pruning, it's unnecessary to add an extra dynamic partition > pruning filter. > For example: > {code:java} > CREATE TABLE t1 (c1 int) USING PARQUET PARTITIONED BY (p1 string); > CREATE TABLE t2 (c2 int) USING PARQUET PARTITIONED BY (p2 string); > SELECT * FROM t1 JOIN t2 ON t1.p1 = t2.p2 and t1.p1 = 'a' AND t2.c2 > 0; > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38148) Do not add dynamic partition pruning if there exists static partition pruning
[ https://issues.apache.org/jira/browse/SPARK-38148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489211#comment-17489211 ] Apache Spark commented on SPARK-38148: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/35451 > Do not add dynamic partition pruning if there exists static partition pruning > - > > Key: SPARK-38148 > URL: https://issues.apache.org/jira/browse/SPARK-38148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > Dynamic partition pruning adds a filter whenever the join condition contains > partition columns. But if another condition already provides static partition > pruning, it's unnecessary to add an extra dynamic partition > pruning filter. > For example: > {code:java} > CREATE TABLE t1 (c1 int) USING PARQUET PARTITIONED BY (p1 string); > CREATE TABLE t2 (c2 int) USING PARQUET PARTITIONED BY (p2 string); > SELECT * FROM t1 JOIN t2 ON t1.p1 = t2.p2 and t1.p1 = 'a' AND t2.c2 > 0; > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38147) Upgrade shapeless to 2.3.7
[ https://issues.apache.org/jira/browse/SPARK-38147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38147: - Assignee: Dongjoon Hyun > Upgrade shapeless to 2.3.7 > -- > > Key: SPARK-38147 > URL: https://issues.apache.org/jira/browse/SPARK-38147 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38148) Do not add dynamic partition pruning if there exists static partition pruning
[ https://issues.apache.org/jira/browse/SPARK-38148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38148: Assignee: Apache Spark > Do not add dynamic partition pruning if there exists static partition pruning > - > > Key: SPARK-38148 > URL: https://issues.apache.org/jira/browse/SPARK-38148 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > Dynamic partition pruning adds a filter whenever the join condition contains > partition columns. But if another condition already provides static partition > pruning, it's unnecessary to add an extra dynamic partition > pruning filter. > For example: > {code:java} > CREATE TABLE t1 (c1 int) USING PARQUET PARTITIONED BY (p1 string); > CREATE TABLE t2 (c2 int) USING PARQUET PARTITIONED BY (p2 string); > SELECT * FROM t1 JOIN t2 ON t1.p1 = t2.p2 and t1.p1 = 'a' AND t2.c2 > 0; > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38147) Upgrade shapeless to 2.3.7
[ https://issues.apache.org/jira/browse/SPARK-38147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38147: Assignee: Apache Spark > Upgrade shapeless to 2.3.7 > -- > > Key: SPARK-38147 > URL: https://issues.apache.org/jira/browse/SPARK-38147 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38147) Upgrade shapeless to 2.3.7
[ https://issues.apache.org/jira/browse/SPARK-38147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38147: Assignee: (was: Apache Spark) > Upgrade shapeless to 2.3.7 > -- > > Key: SPARK-38147 > URL: https://issues.apache.org/jira/browse/SPARK-38147 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38147) Upgrade shapeless to 2.3.7
[ https://issues.apache.org/jira/browse/SPARK-38147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489210#comment-17489210 ] Apache Spark commented on SPARK-38147: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/35450 > Upgrade shapeless to 2.3.7 > -- > > Key: SPARK-38147 > URL: https://issues.apache.org/jira/browse/SPARK-38147 > Project: Spark > Issue Type: Improvement > Components: Build, MLlib >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38148) Do not add dynamic partition pruning if there exists static partition pruning
XiDuo You created SPARK-38148: - Summary: Do not add dynamic partition pruning if there exists static partition pruning Key: SPARK-38148 URL: https://issues.apache.org/jira/browse/SPARK-38148 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You Dynamic partition pruning adds a filter as long as the join condition contains partition columns. But if another condition already provides static partition pruning, it's unnecessary to add an extra dynamic partition pruning filter. For example: {code:java} CREATE TABLE t1 (c1 int) USING PARQUET PARTITIONED BY (p1 string); CREATE TABLE t2 (c2 int) USING PARQUET PARTITIONED BY (p2 string); SELECT * FROM t1 JOIN t2 ON t1.p1 = t2.p2 and t1.p1 = 'a' AND t2.c2 > 0; {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38147) Upgrade shapeless to 2.3.7
Dongjoon Hyun created SPARK-38147: - Summary: Upgrade shapeless to 2.3.7 Key: SPARK-38147 URL: https://issues.apache.org/jira/browse/SPARK-38147 Project: Spark Issue Type: Improvement Components: Build, MLlib Affects Versions: 3.3.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38146) UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column
[ https://issues.apache.org/jira/browse/SPARK-38146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489207#comment-17489207 ] Bruce Robbins edited comment on SPARK-38146 at 2/9/22, 2:23 AM: This affects master only and has a simple fix: {{BufferSetterGetterUtils}} needs case statements for {{TimestampNTZType}} that replicates what is done for {{{}TimestampType{}}}. Edit: Also, the filtering of {{TimestampNTZType}} [here in AggregationQuerySuite|https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala#L901] should be removed so that {{TimestampNTZType}} gets tested. was (Author: bersprockets): This affects master only and has a simple fix: {{BufferSetterGetterUtils}} needs case statements for {{TimestampNTZType}} that replicates what is done for {{{}TimestampType{}}}. > UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column > - > > Key: SPARK-38146 > URL: https://issues.apache.org/jira/browse/SPARK-38146 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Bruce Robbins >Priority: Major > > When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, > Spark throws the error: > {noformat} > 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.UnsupportedOperationException: null > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272) > 
~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46) > ~[scala-library.jar:?] > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > ~[scala-library.jar:?] > at > $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45) > ~[scala-library.jar:?] > at > org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197) > ~[spark-sql_2.12-3.3.0-SNAPSHO > {noformat} > This is because {{BufferSetterGetterUtils#createSetters}} does not have a > case statement for {{TimestampNTZType}}, so it generates a function that > tries to call {{UnsafeRow.update}}, which throws an > {{UnsupportedOperationException}}. > This reproduction example is mostly taken from {{AggregationQuerySuite}}: > {noformat} > import org.apache.spark.sql.expressions.{MutableAggregationBuffer, > UserDefinedAggregateFunction} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > class ScalaAggregateFunction(schema: StructType) extends > UserDefinedAggregateFunction { > def inputSchema: StructType = schema > def bufferSchema: StructType = schema > def dataType: DataType = schema > def deterministic: Boolean = true > def initialize(buffer: MutableAggregationBuffer): Unit = { > (0 until schema.length).foreach { i => > buffer.update(i, null) > } > } > def update(buffer: MutableAggregationBuffer, input: Row): Unit = { > if (!input.isNullAt(0) && input.getInt(0) == 50) { > (0 until schema.length).foreach { i => > buffer.update(i, input.get(i)) > } > } > } > def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { > if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) { > (0 until schema.length).foreach { i => > buffer1.update(i, 
buffer2.get(i)) > } > } > } > def evaluate(buffer: Row): Any = { > Row.fromSeq(buffer.toSeq) > } > } > import scala.util.Random > import java.time.LocalDateTime > val r = new Random(65676563L) > val data = Seq.tabulate(50) { x => > Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, > LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1)) > } > val schema = StructType.fromDDL("id int, col1 double, col2 bigint, col3 > timestamp_ntz") > val rdd = spark.sparkContext.parallelize(data, 1) > val df = spark.createDataFrame(rdd, schema) > val udaf = new ScalaAggregateFunction(df.schema) > val allColumns
[jira] [Commented] (SPARK-38146) UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column
[ https://issues.apache.org/jira/browse/SPARK-38146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489207#comment-17489207 ] Bruce Robbins commented on SPARK-38146: --- This affects master only and has a simple fix: {{BufferSetterGetterUtils}} needs case statements for {{TimestampNTZType}} that replicates what is done for {{{}TimestampType{}}}. > UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column > - > > Key: SPARK-38146 > URL: https://issues.apache.org/jira/browse/SPARK-38146 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Bruce Robbins >Priority: Major > > When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, > Spark throws the error: > {noformat} > 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.UnsupportedOperationException: null > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218) > ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46) > ~[scala-library.jar:?] > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) > ~[scala-library.jar:?] > at > $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45) > ~[scala-library.jar:?] 
> at > org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) > ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197) > ~[spark-sql_2.12-3.3.0-SNAPSHO > {noformat} > This is because {{BufferSetterGetterUtils#createSetters}} does not have a > case statement for {{TimestampNTZType}}, so it generates a function that > tries to call {{UnsafeRow.update}}, which throws an > {{UnsupportedOperationException}}. > This reproduction example is mostly taken from {{AggregationQuerySuite}}: > {noformat} > import org.apache.spark.sql.expressions.{MutableAggregationBuffer, > UserDefinedAggregateFunction} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.Row > class ScalaAggregateFunction(schema: StructType) extends > UserDefinedAggregateFunction { > def inputSchema: StructType = schema > def bufferSchema: StructType = schema > def dataType: DataType = schema > def deterministic: Boolean = true > def initialize(buffer: MutableAggregationBuffer): Unit = { > (0 until schema.length).foreach { i => > buffer.update(i, null) > } > } > def update(buffer: MutableAggregationBuffer, input: Row): Unit = { > if (!input.isNullAt(0) && input.getInt(0) == 50) { > (0 until schema.length).foreach { i => > buffer.update(i, input.get(i)) > } > } > } > def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { > if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) { > (0 until schema.length).foreach { i => > buffer1.update(i, buffer2.get(i)) > } > } > } > def evaluate(buffer: Row): Any = { > Row.fromSeq(buffer.toSeq) > } > } > import scala.util.Random > import java.time.LocalDateTime > val r = new Random(65676563L) > val data = Seq.tabulate(50) { x => > Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, > LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1)) > } > val schema = StructType.fromDDL("id int, col1 
double, col2 bigint, col3 > timestamp_ntz") > val rdd = spark.sparkContext.parallelize(data, 1) > val df = spark.createDataFrame(rdd, schema) > val udaf = new ScalaAggregateFunction(df.schema) > val allColumns = df.schema.fields.map(f => col(f.name)) > df.groupBy().agg(udaf(allColumns: _*)).show(false) > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37414) Inline type hints for python/pyspark/ml/tuning.py
[ https://issues.apache.org/jira/browse/SPARK-37414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-37414: -- Assignee: Maciej Szymkiewicz > Inline type hints for python/pyspark/ml/tuning.py > - > > Key: SPARK-37414 > URL: https://issues.apache.org/jira/browse/SPARK-37414 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Inline type hints from python/pyspark/ml/tuning.pyi to > python/pyspark/ml/tuning.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37414) Inline type hints for python/pyspark/ml/tuning.py
[ https://issues.apache.org/jira/browse/SPARK-37414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-37414. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35406 [https://github.com/apache/spark/pull/35406] > Inline type hints for python/pyspark/ml/tuning.py > - > > Key: SPARK-37414 > URL: https://issues.apache.org/jira/browse/SPARK-37414 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Inline type hints from python/pyspark/ml/tuning.pyi to > python/pyspark/ml/tuning.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38146) UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column
Bruce Robbins created SPARK-38146: - Summary: UDAF fails with unsafe rows containing a TIMESTAMP_NTZ column Key: SPARK-38146 URL: https://issues.apache.org/jira/browse/SPARK-38146 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Bruce Robbins When using a UDAF against unsafe rows containing a TIMESTAMP_NTZ column, Spark throws the error: {noformat} 22/02/08 18:05:12 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.UnsupportedOperationException: null at org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:218) ~[spark-catalyst_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] at org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15(udaf.scala:217) ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] at org.apache.spark.sql.execution.aggregate.BufferSetterGetterUtils.$anonfun$createSetters$15$adapted(udaf.scala:215) ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] at org.apache.spark.sql.execution.aggregate.MutableAggregationBufferImpl.update(udaf.scala:272) ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] at $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.$anonfun$update$1(:46) ~[scala-library.jar:?] at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158) ~[scala-library.jar:?] at $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$ScalaAggregateFunction.update(:45) ~[scala-library.jar:?] at org.apache.spark.sql.execution.aggregate.ScalaUDAF.update(udaf.scala:458) ~[spark-sql_2.12-3.3.0-SNAPSHOT.jar:3.3.0-SNAPSHOT] at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:197) ~[spark-sql_2.12-3.3.0-SNAPSHO {noformat} This is because {{BufferSetterGetterUtils#createSetters}} does not have a case statement for {{TimestampNTZType}}, so it generates a function that tries to call {{UnsafeRow.update}}, which throws an {{UnsupportedOperationException}}. 
This reproduction example is mostly taken from {{AggregationQuerySuite}}: {noformat} import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction} import org.apache.spark.sql.types._ import org.apache.spark.sql.Row class ScalaAggregateFunction(schema: StructType) extends UserDefinedAggregateFunction { def inputSchema: StructType = schema def bufferSchema: StructType = schema def dataType: DataType = schema def deterministic: Boolean = true def initialize(buffer: MutableAggregationBuffer): Unit = { (0 until schema.length).foreach { i => buffer.update(i, null) } } def update(buffer: MutableAggregationBuffer, input: Row): Unit = { if (!input.isNullAt(0) && input.getInt(0) == 50) { (0 until schema.length).foreach { i => buffer.update(i, input.get(i)) } } } def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { if (!buffer2.isNullAt(0) && buffer2.getInt(0) == 50) { (0 until schema.length).foreach { i => buffer1.update(i, buffer2.get(i)) } } } def evaluate(buffer: Row): Any = { Row.fromSeq(buffer.toSeq) } } import scala.util.Random import java.time.LocalDateTime val r = new Random(65676563L) val data = Seq.tabulate(50) { x => Row((x + 1).toInt, (x + 2).toDouble, (x + 2).toLong, LocalDateTime.parse("2100-01-01T01:33:33.123").minusDays(x + 1)) } val schema = StructType.fromDDL("id int, col1 double, col2 bigint, col3 timestamp_ntz") val rdd = spark.sparkContext.parallelize(data, 1) val df = spark.createDataFrame(rdd, schema) val udaf = new ScalaAggregateFunction(df.schema) val allColumns = df.schema.fields.map(f => col(f.name)) df.groupBy().agg(udaf(allColumns: _*)).show(false) {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
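The fix described for SPARK-38146 — handle `TimestampNTZType` with the same `Long`-based setter as `TimestampType` — can be simulated in miniature. This is a hedged sketch of the dispatch problem only; the real code lives in Spark's `BufferSetterGetterUtils.createSetters` and operates on Catalyst data types, not strings.

```python
# Minimal simulation of the dispatch bug: without an explicit case for
# TIMESTAMP_NTZ, the type falls through to the generic update path, which
# UnsafeRow does not support.

def setter_for(data_type, patched=True):
    # Both timestamp types are stored internally as a Long (microseconds
    # since the epoch), so both should use the setLong-style setter.
    long_backed = {"TimestampType"}
    if patched:
        long_backed.add("TimestampNTZType")
    if data_type in long_backed:
        return "setLong"
    return "genericUpdate"  # raises UnsupportedOperationException on UnsafeRow

assert setter_for("TimestampNTZType", patched=False) == "genericUpdate"  # the bug
assert setter_for("TimestampNTZType", patched=True) == "setLong"         # the fix
```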
[jira] [Updated] (SPARK-38094) Parquet: enable matching schema columns by field id
[ https://issues.apache.org/jira/browse/SPARK-38094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jackie Zhang updated SPARK-38094: - Description: Field Id is a native field in the Parquet schema ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398]) After this PR, when the requested schema has field IDs, Parquet readers will first use the field ID to determine which Parquet columns to read, before falling back to using column names as before. It enables matching columns by field id for supported DWs like iceberg and Delta. This PR supports: * vectorized reader * Parquet-mr reader was: Field Id is a native field in the Parquet schema ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398]) After this PR, when the requested schema has field IDs, Parquet readers will first use the field ID to determine which Parquet columns to read, before falling back to using column names as before. It enables matching columns by field id for supported DWs like iceberg and Delta. This PR supports: * vectorized reader does not support: * Parquet-mr reader due to lack of field id support (needs a follow up ticket) > Parquet: enable matching schema columns by field id > --- > > Key: SPARK-38094 > URL: https://issues.apache.org/jira/browse/SPARK-38094 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Jackie Zhang >Priority: Major > > Field Id is a native field in the Parquet schema > ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398]) > After this PR, when the requested schema has field IDs, Parquet readers will > first use the field ID to determine which Parquet columns to read, before > falling back to using column names as before. It enables matching columns by > field id for supported DWs like iceberg and Delta. 
> This PR supports: > * vectorized reader > * Parquet-mr reader -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
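The matching order implied by the SPARK-38094 description — field ID first, falling back to column name — can be sketched with a hypothetical helper (illustration only, not the actual vectorized or parquet-mr reader code):

```python
# Hedged sketch of field-ID-based column resolution: if the requested schema
# carries a field ID, match the Parquet column by ID; otherwise fall back to
# matching by name, as before.

def resolve_column(requested_name, requested_field_id, file_columns):
    """file_columns: list of (name, field_id) tuples from the Parquet footer."""
    if requested_field_id is not None:
        for name, fid in file_columns:
            if fid == requested_field_id:
                return name
    for name, fid in file_columns:
        if name == requested_name:
            return name
    return None

cols = [("renamed_col", 1), ("other", 2)]
# A rename in the file does not break resolution when field IDs are present.
assert resolve_column("col", 1, cols) == "renamed_col"
# Without a field ID, resolution falls back to the name.
assert resolve_column("other", None, cols) == "other"
```

This is why field IDs matter for table formats like Iceberg and Delta, where columns may be renamed without rewriting data files.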
[jira] [Commented] (SPARK-38139) ml.recommendation.ALS doctests failures
[ https://issues.apache.org/jira/browse/SPARK-38139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489194#comment-17489194 ] Hyukjin Kwon commented on SPARK-38139: -- Yeah, I think we should fix as so! > ml.recommendation.ALS doctests failures > --- > > Key: SPARK-38139 > URL: https://issues.apache.org/jira/browse/SPARK-38139 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > In my dev setups, the ml.recommendation:ALS test consistently converges to a value > lower than expected and fails with: > {code:python} > File "/path/to/spark/python/pyspark/ml/recommendation.py", line 322, in > __main__.ALS > Failed example: > predictions[0] > Expected: > Row(user=0, item=2, newPrediction=0.69291...) > Got: > Row(user=0, item=2, newPrediction=0.6929099559783936) > {code} > I can correct for that, but it creates some noise, so if anyone else > experiences this, we could drop a digit from the results > {code} > diff --git a/python/pyspark/ml/recommendation.py > b/python/pyspark/ml/recommendation.py > index f0628fb922..b8e2a6097d 100644 > --- a/python/pyspark/ml/recommendation.py > +++ b/python/pyspark/ml/recommendation.py > @@ -320,7 +320,7 @@ class ALS(JavaEstimator, _ALSParams, JavaMLWritable, > JavaMLReadable): > >>> test = spark.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", > "item"]) > >>> predictions = sorted(model.transform(test).collect(), key=lambda r: > r[0]) > >>> predictions[0] > -Row(user=0, item=2, newPrediction=0.69291...) > +Row(user=0, item=2, newPrediction=0.6929...) > >>> predictions[1] > Row(user=1, item=0, newPrediction=3.47356...) > >>> predictions[2] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
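The proposed change in SPARK-38139 relies on how doctest's `ELLIPSIS` flag matches: `0.69291...` demands the first five decimal digits exactly, while `0.6929...` absorbs one more digit of numerical noise. This can be checked directly with the standard library's output checker:

```python
# Demonstrate why dropping one digit from the expected value makes the
# doctest tolerant of the convergence noise reported in the ticket.
import doctest

checker = doctest.OutputChecker()
got = "Row(user=0, item=2, newPrediction=0.6929099559783936)\n"

# The current expectation fails: the actual value has 0.69290..., not 0.69291...
assert not checker.check_output(
    "Row(user=0, item=2, newPrediction=0.69291...)\n", got, doctest.ELLIPSIS)
# The proposed expectation passes: one fewer fixed digit absorbs the noise.
assert checker.check_output(
    "Row(user=0, item=2, newPrediction=0.6929...)\n", got, doctest.ELLIPSIS)
```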
[jira] [Updated] (SPARK-37980) Extend METADATA column to support row indices for file based data sources
[ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37980: - Affects Version/s: 3.3.0 (was: 3.3) > Extend METADATA column to support row indices for file based data sources > - > > Key: SPARK-37980 > URL: https://issues.apache.org/jira/browse/SPARK-37980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Prakhar Jain >Priority: Major > > Spark recently added hidden metadata column support for File based > datasources as part of SPARK-37273. > We should extend it to support ROW_INDEX/ROW_POSITION also. > > Meaning of ROW_POSITION: > ROW_INDEX/ROW_POSITION is basically an index of a row within a file. E.g. 5th > row in the file will have ROW_INDEX 5. > > Use cases: > Row Indexes can be used in a variety of ways. A (fileName, rowIndex) tuple > uniquely identifies row in a table. This information can be used to mark rows > e.g. this can be used by indexer etc. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
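The uniqueness claim in SPARK-37980 — a (fileName, rowIndex) tuple identifies a row — can be illustrated with a toy enumeration. This is hypothetical illustration code; in Spark the index would be produced by the file readers as part of the hidden metadata column.

```python
# Toy illustration: assigning a per-file row index makes (file_name, row_index)
# a unique key across the table, even when row contents are duplicated.

def with_row_index(files):
    """files: dict of file_name -> list of rows; yields (file_name, row_index, row)."""
    for file_name, rows in files.items():
        for row_index, row in enumerate(rows):
            yield (file_name, row_index, row)

table = {"part-0.parquet": ["a", "a"], "part-1.parquet": ["a"]}
refs = list(with_row_index(table))
keys = [(f, i) for f, i, _ in refs]
# Row values repeat, but the (file, index) keys never collide.
assert len(set(keys)) == len(keys) == 3
```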
[jira] [Updated] (SPARK-37923) Generate partition transforms for BucketSpec inside parser
[ https://issues.apache.org/jira/browse/SPARK-37923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37923: - Affects Version/s: 3.3.0 (was: 3.3) > Generate partition transforms for BucketSpec inside parser > -- > > Key: SPARK-37923 > URL: https://issues.apache.org/jira/browse/SPARK-37923 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.3.0 > > > We currently generate partition transforms for BucketSpec in Analyzer. It's > cleaner to do this inside Parser. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38094) Parquet: enable matching schema columns by field id
[ https://issues.apache.org/jira/browse/SPARK-38094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38094: - Affects Version/s: 3.3.0 (was: 3.3) > Parquet: enable matching schema columns by field id > --- > > Key: SPARK-38094 > URL: https://issues.apache.org/jira/browse/SPARK-38094 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Jackie Zhang >Priority: Major > > Field Id is a native field in the Parquet schema > ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398]) > After this PR, when the requested schema has field IDs, Parquet readers will > first use the field ID to determine which Parquet columns to read, before > falling back to using column names as before. It enables matching columns by > field id for supported DWs like iceberg and Delta. > This PR supports: > * vectorized reader > does not support: > * Parquet-mr reader due to lack of field id support (needs a follow up > ticket) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38032) Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL
[ https://issues.apache.org/jira/browse/SPARK-38032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38032: - Affects Version/s: 3.3.0 (was: 3.3) > Upgrade Arrow version < 7.0.0 for Python UDF tests in SQL > - > > Key: SPARK-38032 > URL: https://issues.apache.org/jira/browse/SPARK-38032 > Project: Spark > Issue Type: Test > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.3.0 > > > We should better test latest PyArrow version. Now 6.0.1 is release but we're > using < 5.0.0 for > https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38025) Improve test suite ExternalCatalogSuite
[ https://issues.apache.org/jira/browse/SPARK-38025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38025: - Affects Version/s: 3.3.0 (was: 3.3) > Improve test suite ExternalCatalogSuite > --- > > Key: SPARK-38025 > URL: https://issues.apache.org/jira/browse/SPARK-38025 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.3.0 >Reporter: Khalid Mammadov >Priority: Minor > > Test suite *ExternalCatalogSuite.scala* can be optimized by removing > repetitive code by replacing them with already available utility function > with some minor changes. This will reduce redundant code, simplify the suite > and improve readability. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38114) Spark build fails in Windows
[ https://issues.apache.org/jira/browse/SPARK-38114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38114: - Affects Version/s: 3.3.0 (was: 3.3) > Spark build fails in Windows > > > Key: SPARK-38114 > URL: https://issues.apache.org/jira/browse/SPARK-38114 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.3.0 >Reporter: SOUVIK PAUL >Priority: Major > > java.lang.NoSuchMethodError: > org.fusesource.jansi.AnsiConsole.wrapOutputStream(Ljava/io/OutputStream;)Ljava/io/OutputStream; > jline.AnsiWindowsTerminal.detectAnsiSupport(AnsiWindowsTerminal.java:57) > jline.AnsiWindowsTerminal.(AnsiWindowsTerminal.java:27) > > A similar issue is being faced by the quarkus project with latest Maven. > [https://github.com/quarkusio/quarkus/issues/19491] > > Upgrading the scala-maven-plugin seems to resolve the issue but this ticket > can be a blocker > https://issues.apache.org/jira/browse/SPARK-36547 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38138) Materialize QueryPlan subqueries
[ https://issues.apache.org/jira/browse/SPARK-38138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38138: - Affects Version/s: 3.3.0 (was: 3.3) > Materialize QueryPlan subqueries > > > Key: SPARK-38138 > URL: https://issues.apache.org/jira/browse/SPARK-38138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38100) Remove unused method in `Decimal`
[ https://issues.apache.org/jira/browse/SPARK-38100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38100: - Fix Version/s: 3.3.0 (was: 3.3) > Remove unused method in `Decimal` > - > > Key: SPARK-38100 > URL: https://issues.apache.org/jira/browse/SPARK-38100 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Trivial > Fix For: 3.3.0, 3.2.2 > > > there is a unused method `overflowException` in > `org.apache.spark.sql.types.Decimal`. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38145) Address the tests when "spark.sql.ansi.enabled" is True
[ https://issues.apache.org/jira/browse/SPARK-38145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38145: - Affects Version/s: 3.3.0 (was: 3.3) > Address the tests when "spark.sql.ansi.enabled" is True > --- > > Key: SPARK-38145 > URL: https://issues.apache.org/jira/browse/SPARK-38145 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but > several tests fail when `spark.sql.ansi.enabled` is set to True. > For the stability of the project, it's always good for tests to pass regardless of the > option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38145) Address the tests when "spark.sql.ansi.enabled" is True
Haejoon Lee created SPARK-38145: --- Summary: Address the tests when "spark.sql.ansi.enabled" is True Key: SPARK-38145 URL: https://issues.apache.org/jira/browse/SPARK-38145 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3 Reporter: Haejoon Lee The default value of `spark.sql.ansi.enabled` is False in Apache Spark, but several tests fail when `spark.sql.ansi.enabled` is set to True. For the stability of the project, it's always good for tests to pass regardless of the option value. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37404) Inline type hints for python/pyspark/ml/evaluation.py
[ https://issues.apache.org/jira/browse/SPARK-37404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-37404. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35403 [https://github.com/apache/spark/pull/35403] > Inline type hints for python/pyspark/ml/evaluation.py > - > > Key: SPARK-37404 > URL: https://issues.apache.org/jira/browse/SPARK-37404 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Inline type hints from python/pyspark/ml/evaluation.pyi to > python/pyspark/ml/evaluation.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37404) Inline type hints for python/pyspark/ml/evaluation.py
[ https://issues.apache.org/jira/browse/SPARK-37404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz reassigned SPARK-37404: -- Assignee: Maciej Szymkiewicz > Inline type hints for python/pyspark/ml/evaluation.py > - > > Key: SPARK-37404 > URL: https://issues.apache.org/jira/browse/SPARK-37404 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > Inline type hints from python/pyspark/ml/evaluation.pyi to > python/pyspark/ml/evaluation.py. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
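For readers unfamiliar with the sub-task, a minimal sketch of what "inlining type hints" means, using an invented class rather than the actual `pyspark.ml.evaluation` surface: annotations that previously lived in a separate `.pyi` stub file are written directly in the `.py` source.

```python
# Before: annotations live only in a separate stub, evaluation.pyi:
#     class Evaluator:
#         def evaluate(self, dataset: DataFrame) -> float: ...
#
# After inlining, the .py module itself carries the annotations.
# `Evaluator` and `DataFrame` here are illustrative stand-ins.

class Evaluator:
    def evaluate(self, dataset: "DataFrame") -> float:
        """Evaluate a dataset; the annotations now live inline."""
        raise NotImplementedError
```

With the hints inline, the stub file can be deleted and tools such as mypy read the annotations straight from the implementation module.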
[jira] [Comment Edited] (SPARK-38042) Encoder cannot be found when a tuple component is a type alias for an Array
[ https://issues.apache.org/jira/browse/SPARK-38042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489167#comment-17489167 ] Johan Nyström-Persson edited comment on SPARK-38042 at 2/8/22, 11:50 PM: - My initial idea above was wrong. I ended up changing {code:java} val TypeRef(_, _, Seq(elementType)) = tpe{code} to {code:java} val TypeRef(_, _, Seq(elementType)) = tpe.dealias{code} and this seems to work. was (Author: JIRAUSER284274): My initial idea above was wrong, and instead I fixed this by changing {code:java} val TypeRef(_, _, Seq(elementType)) = tpe{code} to {code:java} val TypeRef(_, _, Seq(elementType)) = tpe.dealias{code} > Encoder cannot be found when a tuple component is a type alias for an Array > --- > > Key: SPARK-38042 > URL: https://issues.apache.org/jira/browse/SPARK-38042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Johan Nyström-Persson >Priority: Major > > ScalaReflection.dataTypeFor fails when Array[T] has been aliased for some T, > and then the alias is being used as a component of e.g. a product. 
> Minimal example, tested in version 3.1.2: > {code:java} > type Data = Array[Long] > val xs:List[(Data,Int)] = List((Array(1),1), (Array(2),2)) > sc.parallelize(xs).toDF("a", "b"){code} > This gives the following exception: > {code:java} > scala.MatchError: Data (of class > scala.reflect.internal.Types$AliasNoArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:104) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$6(ScalaReflection.scala:573) > > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:562) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:432) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:421) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:413) > > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55) > > at org.apache.spark.sql.Encoders$.product(Encoders.scala:285) > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:251) > > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:251) > > at > org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:32) > ... 48 elided{code} > At first glance, I think this could be fixed by changing e.g. > {code:java} > getClassNameFromType(tpe) to > getClassNameFromType(tpe.dealias) > {code} > in ScalaReflection.dataTypeFor. I will try to test that and submit a pull > request shortly. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail:
[jira] [Comment Edited] (SPARK-38042) Encoder cannot be found when a tuple component is a type alias for an Array
[ https://issues.apache.org/jira/browse/SPARK-38042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489167#comment-17489167 ] Johan Nyström-Persson edited comment on SPARK-38042 at 2/8/22, 11:49 PM: - My initial idea above was wrong, and instead I fixed this by changing {code:java} val TypeRef(_, _, Seq(elementType)) = tpe{code} to {code:java} val TypeRef(_, _, Seq(elementType)) = tpe.dealias{code} was (Author: JIRAUSER284274): My initial idea above was wrong, and instead I fixed this by changing {code:java} val TypeRef(_, _, Seq(elementType)) = tpe{code} to {code:java} val TypeRef(_, _, Seq(elementType)) = tpe.dealias{code} > Encoder cannot be found when a tuple component is a type alias for an Array > --- > > Key: SPARK-38042 > URL: https://issues.apache.org/jira/browse/SPARK-38042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Johan Nyström-Persson >Priority: Major > > ScalaReflection.dataTypeFor fails when Array[T] has been aliased for some T, > and then the alias is being used as a component of e.g. a product. 
> Minimal example, tested in version 3.1.2: > {code:java} > type Data = Array[Long] > val xs:List[(Data,Int)] = List((Array(1),1), (Array(2),2)) > sc.parallelize(xs).toDF("a", "b"){code} > This gives the following exception: > {code:java} > scala.MatchError: Data (of class > scala.reflect.internal.Types$AliasNoArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:104) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$6(ScalaReflection.scala:573) > > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:562) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:432) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:421) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:413) > > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55) > > at org.apache.spark.sql.Encoders$.product(Encoders.scala:285) > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:251) > > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:251) > > at > org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:32) > ... 48 elided{code} > At first glance, I think this could be fixed by changing e.g. > {code:java} > getClassNameFromType(tpe) to > getClassNameFromType(tpe.dealias) > {code} > in ScalaReflection.dataTypeFor. I will try to test that and submit a pull > request shortly. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe,
[jira] [Commented] (SPARK-38042) Encoder cannot be found when a tuple component is a type alias for an Array
[ https://issues.apache.org/jira/browse/SPARK-38042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489167#comment-17489167 ] Johan Nyström-Persson commented on SPARK-38042: --- My initial idea above was wrong, and instead I fixed this by changing {code:java} val TypeRef(_, _, Seq(elementType)) = tpe{code} to {code:java} val TypeRef(_, _, Seq(elementType)) = tpe.dealias{code} > Encoder cannot be found when a tuple component is a type alias for an Array > --- > > Key: SPARK-38042 > URL: https://issues.apache.org/jira/browse/SPARK-38042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Johan Nyström-Persson >Priority: Major > > ScalaReflection.dataTypeFor fails when Array[T] has been aliased for some T, > and then the alias is being used as a component of e.g. a product. > Minimal example, tested in version 3.1.2: > {code:java} > type Data = Array[Long] > val xs:List[(Data,Int)] = List((Array(1),1), (Array(2),2)) > sc.parallelize(xs).toDF("a", "b"){code} > This gives the following exception: > {code:java} > scala.MatchError: Data (of class > scala.reflect.internal.Types$AliasNoArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:104) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$6(ScalaReflection.scala:573) > > at > 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:562) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:432) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:421) > > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:904) > > at > org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:903) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49) > > at > org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:413) > > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:55) > > at org.apache.spark.sql.Encoders$.product(Encoders.scala:285) > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder(SQLImplicits.scala:251) > > at > org.apache.spark.sql.LowPrioritySQLImplicits.newProductEncoder$(SQLImplicits.scala:251) > > at > 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:32) > ... 48 elided{code} > At first glance, I think this could be fixed by changing e.g. > {code:java} > getClassNameFromType(tpe) to > getClassNameFromType(tpe.dealias) > {code} > in ScalaReflection.dataTypeFor. I will try to test that and submit a pull > request shortly. > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38069) improve structured streaming window of calculated
[ https://issues.apache.org/jira/browse/SPARK-38069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-38069. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35362 [https://github.com/apache/spark/pull/35362] > improve structured streaming window of calculated > - > > Key: SPARK-38069 > URL: https://issues.apache.org/jira/browse/SPARK-38069 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: nyingping >Assignee: nyingping >Priority: Trivial > Fix For: 3.3.0 > > > Structured Streaming computes the window from an intermediate result, the windowId, and > the windowId is in turn computed via CaseWhen. > We can instead use Flink's method of calculating the window, which is > easier to understand, simpler, and more efficient -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38069) improve structured streaming window of calculated
[ https://issues.apache.org/jira/browse/SPARK-38069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-38069: --- Assignee: nyingping > improve structured streaming window of calculated > - > > Key: SPARK-38069 > URL: https://issues.apache.org/jira/browse/SPARK-38069 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: nyingping >Assignee: nyingping >Priority: Trivial > > Structured Streaming computes the window from an intermediate result, the windowId, and > the windowId is in turn computed via CaseWhen. > We can instead use Flink's method of calculating the window, which is > easier to understand, simpler, and more efficient -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
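For context on the Flink approach the issue refers to: Flink assigns an event to its tumbling window with a single arithmetic expression (see `TimeWindow.getWindowStartWithOffset`) instead of an intermediate window id. A plain-Python sketch of that calculation, independent of either engine:

```python
# Sketch of Flink-style window-start assignment: the window containing a
# timestamp is found with one modulo expression, with no intermediate id
# or CaseWhen branching.

def window_start(timestamp_ms: int, window_size_ms: int, offset_ms: int = 0) -> int:
    """Return the start of the tumbling window containing timestamp_ms."""
    return timestamp_ms - (timestamp_ms - offset_ms + window_size_ms) % window_size_ms


# With 10-second tumbling windows, an event at t=25000 ms falls in
# the window [20000, 30000).
print(window_start(25_000, 10_000))  # 20000
```

Adding `window_size_ms` before the modulo keeps the result correct for timestamps before the offset (Python's `%` already yields a non-negative remainder for a positive divisor, so negative event times land in the right window as well).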
[jira] [Resolved] (SPARK-38144) Remove unused `spark.storage.safetyFraction` config
[ https://issues.apache.org/jira/browse/SPARK-38144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38144. --- Fix Version/s: 3.3.0 3.2.2 3.1.3 Resolution: Fixed Issue resolved by pull request 35447 [https://github.com/apache/spark/pull/35447] > Remove unused `spark.storage.safetyFraction` config > --- > > Key: SPARK-38144 > URL: https://issues.apache.org/jira/browse/SPARK-38144 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.1.3, 3.3.0, 3.2.2 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.3 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org