[jira] [Assigned] (SPARK-39591) Offset Management Improvements in Structured Streaming

2022-11-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39591:


Assignee: Apache Spark

> Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Assignee: Apache Spark
>Priority: Major
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage. At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage. For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is sensitive, we can allow users to 
> configure offset commits to be written asynchronously. This commit operation 
> then no longer contributes to the batch duration, effectively lowering the 
> overall latency of the pipeline.
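
A minimal sketch of what the user-facing knob could look like, assuming the
asynchronous progress tracking is exposed as DataStreamWriter options. The option
names asyncProgressTrackingEnabled and asyncProgressTrackingCheckpointIntervalMs
below, as well as the broker, topic, and checkpoint values, are illustrative
assumptions and are not taken from this ticket:

{code:java}
// Hypothetical sketch: enable asynchronous offset/commit-log writes for a
// latency-sensitive Kafka-to-Kafka pipeline that does not need exactly-once.
// The two asyncProgressTracking* option names are assumed for illustration.
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "in")
  .load()
  .select("key", "value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "out")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-to-kafka")
  .option("asyncProgressTrackingEnabled", "true")
  .option("asyncProgressTrackingCheckpointIntervalMs", "1000")
  .start()
{code}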






[jira] [Commented] (SPARK-39591) Offset Management Improvements in Structured Streaming

2022-11-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629252#comment-17629252
 ] 

Apache Spark commented on SPARK-39591:
--

User 'jerrypeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38517

> Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage. At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage. For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is sensitive, we can allow users to 
> configure offset commits to be written asynchronously. This commit operation 
> then no longer contributes to the batch duration, effectively lowering the 
> overall latency of the pipeline.






[jira] [Assigned] (SPARK-39591) Offset Management Improvements in Structured Streaming

2022-11-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39591:


Assignee: (was: Apache Spark)

> Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Boyang Jerry Peng
>Priority: Major
>
> Currently in Structured Streaming, at the beginning of every micro-batch the 
> offset to process up to for the current batch is persisted to durable 
> storage. At the end of every micro-batch, a marker indicating the completion 
> of the current micro-batch is persisted to durable storage. For pipelines 
> that, for example, read from Kafka and write to Kafka, where end-to-end 
> exactly-once is not supported and latency is sensitive, we can allow users to 
> configure offset commits to be written asynchronously. This commit operation 
> then no longer contributes to the batch duration, effectively lowering the 
> overall latency of the pipeline.






[jira] [Commented] (SPARK-41013) Submitting a job in cluster mode with spark-3.1.2 fails with Could not initialize class com.github.luben.zstd.ZstdOutputStream

2022-11-04 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629240#comment-17629240
 ] 

Yuming Wang commented on SPARK-41013:
-

Could you test Spark 3.3.1?

> Submitting a job in cluster mode with spark-3.1.2 fails with Could not 
> initialize class com.github.luben.zstd.ZstdOutputStream
> 
>
> Key: SPARK-41013
> URL: https://issues.apache.org/jira/browse/SPARK-41013
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: yutiantian
>Priority: Blocker
>  Labels: libzstd-jni, spark.shuffle.mapStatus.compression.codec, 
> zstd
>
> When submitting a job with spark-3.1.2 in cluster mode, it fails with
> Could not initialize class com.github.luben.zstd.ZstdOutputStream. The detailed log is as follows:
> Exception in thread "map-output-dispatcher-0" Exception in thread 
> "map-output-dispatcher-2" java.lang.ExceptionInInitializerError: Cannot 
> unpack libzstd-jni: No such file or directory at 
> java.io.UnixFileSystem.createFileExclusively(Native Method) at 
> java.io.File.createTempFile(File.java:2024) at 
> com.github.luben.zstd.util.Native.load(Native.java:97) at 
> com.github.luben.zstd.util.Native.load(Native.java:55) at 
> com.github.luben.zstd.ZstdOutputStream.(ZstdOutputStream.java:16) at 
> org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
>  at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
>  at 
> org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
>  at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Exception in thread 
> "map-output-dispatcher-7" Exception in thread "map-output-dispatcher-5" 
> java.lang.NoClassDefFoundError: Could not initialize class 
> com.github.luben.zstd.ZstdOutputStream at 
> org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
>  at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
>  at 
> org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
>  at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Exception in thread 
> "map-output-dispatcher-4" Exception in thread "map-output-dispatcher-3" 
> java.lang.NoClassDefFoundError: Could not initialize class 
> com.github.luben.zstd.ZstdOutputStream at 
> org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
>  at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
>  at 
> org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
>  at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) java.lang.NoClassDefFoundError: 
> Could not initialize class com.github.luben.zstd.ZstdOutputStream at 
> org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
>  at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
>  at 
> org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at 
> 
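
The failure happens while zstd-jni tries to unpack its native library (libzstd-jni)
into the temp directory when map statuses are compressed. A minimal, unverified
workaround sketch, based on the spark.shuffle.mapStatus.compression.codec label on
this ticket; switching the codec to lz4 only sidesteps the zstd native-library load
and is not a root-cause fix (checking that java.io.tmpdir is writable on the driver
is also worth doing):

{code:java}
// Sketch only: avoid the zstd native library for MapStatus compression by
// falling back to lz4. This works around, not fixes, the unpack failure.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.mapStatus.compression.codec", "lz4")
{code}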

[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase

2022-11-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629239#comment-17629239
 ] 

Apache Spark commented on SPARK-32380:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/38516

> sparksql cannot access hive table while data in hbase
> -
>
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
>Reporter: deyzhong
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> * step1: create hbase table
> {code:java}
>  hbase(main):001:0> create 'hbase_test', 'cf1'
>  hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123'
> {code}
>  * step2: create hive table related to hbase table
>  
> {code:java}
> hive> 
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '', 
>   `value` string COMMENT '')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>   'hbase.columns.mapping'=':key,cf1:v1', 
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
>  {code}
> * step3: query the hive table (data stored in hbase) with spark-sql
> {code:java}
> spark-sql --master yarn -e "select * from hivetest.hbase_test"
> {code}
>  
> The error log is as follows:
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>  at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>  at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at 

[jira] [Commented] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala

2022-11-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629230#comment-17629230
 ] 

Apache Spark commented on SPARK-41015:
--

User 'SandishKumarHN' has created a pull request for this issue:
https://github.com/apache/spark/pull/38515

> Failure of ProtobufCatalystDataConversionSuite.scala
> 
>
> Key: SPARK-41015
> URL: https://issues.apache.org/jira/browse/SPARK-41015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> To reproduce the issue, set the seed to 38:
> {code}
> diff --git 
> a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
>  
> b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> index 271c5b0fec..080bf1eb1f 100644
> --- 
> a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> +++ 
> b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> @@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite
>  StringType -> ("StringMsg", ""))
>testingTypes.foreach { dt =>
> -val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1)
> +val seed = 38
>  test(s"single $dt with seed $seed") {
>val (messageName, defaultValue) = 
> catalystTypesToProtoMessages(dt.fields(0).dataType)
> {code}
> and run the test:
> {code}
> build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite"
> {code}
> which fails with NPE:
> {code}
> [info] - single StructType(StructField(double_type,DoubleType,true)) with 
> seed 38 *** FAILED *** (10 milliseconds)
> [info]   java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:134)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> {code}






[jira] [Assigned] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala

2022-11-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41015:


Assignee: (was: Apache Spark)

> Failure of ProtobufCatalystDataConversionSuite.scala
> 
>
> Key: SPARK-41015
> URL: https://issues.apache.org/jira/browse/SPARK-41015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> To reproduce the issue, set the seed to 38:
> {code}
> diff --git 
> a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
>  
> b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> index 271c5b0fec..080bf1eb1f 100644
> --- 
> a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> +++ 
> b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> @@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite
>  StringType -> ("StringMsg", ""))
>testingTypes.foreach { dt =>
> -val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1)
> +val seed = 38
>  test(s"single $dt with seed $seed") {
>val (messageName, defaultValue) = 
> catalystTypesToProtoMessages(dt.fields(0).dataType)
> {code}
> and run the test:
> {code}
> build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite"
> {code}
> which fails with NPE:
> {code}
> [info] - single StructType(StructField(double_type,DoubleType,true)) with 
> seed 38 *** FAILED *** (10 milliseconds)
> [info]   java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:134)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> {code}






[jira] [Assigned] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala

2022-11-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41015:


Assignee: Apache Spark

> Failure of ProtobufCatalystDataConversionSuite.scala
> 
>
> Key: SPARK-41015
> URL: https://issues.apache.org/jira/browse/SPARK-41015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> To reproduce the issue, set the seed to 38:
> {code}
> diff --git 
> a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
>  
> b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> index 271c5b0fec..080bf1eb1f 100644
> --- 
> a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> +++ 
> b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> @@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite
>  StringType -> ("StringMsg", ""))
>testingTypes.foreach { dt =>
> -val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1)
> +val seed = 38
>  test(s"single $dt with seed $seed") {
>val (messageName, defaultValue) = 
> catalystTypesToProtoMessages(dt.fields(0).dataType)
> {code}
> and run the test:
> {code}
> build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite"
> {code}
> which fails with NPE:
> {code}
> [info] - single StructType(StructField(double_type,DoubleType,true)) with 
> seed 38 *** FAILED *** (10 milliseconds)
> [info]   java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:134)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> {code}






[jira] [Commented] (SPARK-41018) Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives

2022-11-04 Thread Nikesh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629220#comment-17629220
 ] 

Nikesh commented on SPARK-41018:


Attached the notebooks I used. When the data size is small (notebook: 
ZScoreWithKoalas_PandasOnSpark_SmallerDataset), both the koalas and pandas outputs 
match.

But when the data size is huge, the koalas output differs from the pandas output.

> Koalas.idxmin() is not picking the minimum value from a dataframe, but 
> pandas.idxmin() gives
> 
>
> Key: SPARK-41018
> URL: https://issues.apache.org/jira/browse/SPARK-41018
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.1
> Environment: databricks
>Reporter: Nikesh
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: ZScoreWithKoalas_PandasOnSpark_BiggerDataset.html, 
> ZScoreWithKoalas_PandasOnSpark_SmallerDataset.html
>
>
> Hi,
> I have a koalas dataframe with age and income. I calculated the z-score on age 
> and income, and then the norm is calculated from age_zscore and 
> income_zscore (the new column name is sq_dist). Then I tried to do an idxmin 
> on the new column, but it is not giving the minimum value.
> I did the same operations on a pandas dataframe, and it gives the minimum 
> value.
> Please find attached the notebooks for the step-by-step operations I performed.






[jira] [Created] (SPARK-41018) Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives

2022-11-04 Thread Nikesh (Jira)
Nikesh created SPARK-41018:
--

 Summary: Koalas.idxmin() is not picking the minimum value from a 
dataframe, but pandas.idxmin() gives
 Key: SPARK-41018
 URL: https://issues.apache.org/jira/browse/SPARK-41018
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.1
 Environment: databricks
Reporter: Nikesh
 Fix For: 3.3.1
 Attachments: ZScoreWithKoalas_PandasOnSpark_BiggerDataset.html, 
ZScoreWithKoalas_PandasOnSpark_SmallerDataset.html

Hi,
I have a koalas dataframe with age and income. I calculated the z-score on age 
and income, and then the norm is calculated from age_zscore and income_zscore 
(the new column name is sq_dist). Then I tried to do an idxmin on the new column, 
but it is not giving the minimum value.
I did the same operations on a pandas dataframe, and it gives the minimum value.

Please find attached the notebooks for the step-by-step operations I performed.






[jira] [Updated] (SPARK-41018) Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives

2022-11-04 Thread Nikesh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikesh updated SPARK-41018:
---
Attachment: ZScoreWithKoalas_PandasOnSpark_SmallerDataset.html
ZScoreWithKoalas_PandasOnSpark_BiggerDataset.html

> Koalas.idxmin() is not picking the minimum value from a dataframe, but 
> pandas.idxmin() gives
> 
>
> Key: SPARK-41018
> URL: https://issues.apache.org/jira/browse/SPARK-41018
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.1
> Environment: databricks
>Reporter: Nikesh
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: ZScoreWithKoalas_PandasOnSpark_BiggerDataset.html, 
> ZScoreWithKoalas_PandasOnSpark_SmallerDataset.html
>
>
> Hi,
> I have a koalas dataframe with age and income. I calculated the z-score on age 
> and income, and then the norm is calculated from age_zscore and 
> income_zscore (the new column name is sq_dist). Then I tried to do an idxmin 
> on the new column, but it is not giving the minimum value.
> I did the same operations on a pandas dataframe, and it gives the minimum 
> value.
> Please find attached the notebooks for the step-by-step operations I performed.






[jira] [Commented] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala

2022-11-04 Thread Raghu Angadi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629204#comment-17629204
 ] 

Raghu Angadi commented on SPARK-41015:
--

Thanks for filing this, [~maxgekk].

cc: [~sanysand...@gmail.com] 

> Failure of ProtobufCatalystDataConversionSuite.scala
> 
>
> Key: SPARK-41015
> URL: https://issues.apache.org/jira/browse/SPARK-41015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> To reproduce the issue, set the seed to 38:
> {code}
> diff --git 
> a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
>  
> b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> index 271c5b0fec..080bf1eb1f 100644
> --- 
> a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> +++ 
> b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> @@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite
>  StringType -> ("StringMsg", ""))
>testingTypes.foreach { dt =>
> -val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1)
> +val seed = 38
>  test(s"single $dt with seed $seed") {
>val (messageName, defaultValue) = 
> catalystTypesToProtoMessages(dt.fields(0).dataType)
> {code}
> and run the test:
> {code}
> build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite"
> {code}
> which fails with NPE:
> {code}
> [info] - single StructType(StructField(double_type,DoubleType,true)) with 
> seed 38 *** FAILED *** (10 milliseconds)
> [info]   java.lang.NullPointerException:
> [info]   at 
> org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:134)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> {code}






[jira] [Assigned] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13

2022-11-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-40950:
---

Assignee: Emil Ejbyfeldt

> isRemoteAddressMaxedOut performance overhead on scala 2.13
> --
>
> Key: SPARK-40950
> URL: https://issues.apache.org/jira/browse/SPARK-40950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.3.1
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Major
>
> On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, 
> while in 2.12 they would be backed by an ArrayBuffer. This means that 
> calculating the length of the blocks, which is done in isRemoteAddressMaxedOut 
> and other places, is now much more expensive. This is because in 2.13 `Seq` 
> can no longer be backed by a mutable collection.
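
A small, self-contained sketch of the cost difference being described. This is
plain Scala 2.13 collection behavior rather than Spark code, and the collection
size and iteration count are illustrative only:

{code:java}
// List.length walks the whole list (O(n)); ArrayBuffer.length is a field read (O(1)).
// Repeatedly asking a List-backed Seq for its length, as the description notes
// happens in isRemoteAddressMaxedOut and other places, is therefore much slower.
import scala.collection.mutable.ArrayBuffer

val asList: Seq[Int] = List.tabulate(1000000)(identity)
val asBuffer         = ArrayBuffer.tabulate(1000000)(identity)

def time[A](body: => A): Long = { val t = System.nanoTime(); body; System.nanoTime() - t }

val listNanos   = time((1 to 1000).foreach(_ => asList.length))   // O(n) per call
val bufferNanos = time((1 to 1000).foreach(_ => asBuffer.length)) // O(1) per call
println(s"List: ${listNanos / 1e6} ms, ArrayBuffer: ${bufferNanos / 1e6} ms")
{code}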






[jira] [Resolved] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13

2022-11-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-40950.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38427
[https://github.com/apache/spark/pull/38427]

> isRemoteAddressMaxedOut performance overhead on scala 2.13
> --
>
> Key: SPARK-40950
> URL: https://issues.apache.org/jira/browse/SPARK-40950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.3.1
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Major
> Fix For: 3.4.0
>
>
> On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, 
> while in 2.12 they would be backed by an ArrayBuffer. This means that 
> calculating the length of the blocks, which is done in isRemoteAddressMaxedOut 
> and other places, is now much more expensive. This is because in 2.13 `Seq` 
> can no longer be backed by a mutable collection.






[jira] [Commented] (SPARK-40903) Avoid reordering decimal Add for canonicalization

2022-11-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629165#comment-17629165
 ] 

Apache Spark commented on SPARK-40903:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38513

> Avoid reordering decimal Add for canonicalization
> -
>
> Key: SPARK-40903
> URL: https://issues.apache.org/jira/browse/SPARK-40903
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Avoid reordering Add for canonicalization if it is of decimal type.
> Expressions are canonicalized for comparisons and explanations. For a 
> non-decimal Add expression, the operands can be sorted by hashcode, and the 
> result is supposed to be the same.
> However, for an Add expression of Decimal type, the behavior is different: given 
> decimal (p1, s1) and another decimal (p2, s2), the integral part of the result 
> has `max(p1-s1, p2-s2) + 1` digits and the scale is `max(s1, s2)`. Thus the 
> result data type is `(max(p1-s1, p2-s2) + 1 + max(s1, s2), max(s1, s2))`.
> Thus the order matters:
> * For `(decimal(12,5) + decimal(12,6)) + decimal(3, 2)`, the first add 
> `decimal(12,5) + decimal(12,6)` results in `decimal(14, 6)`, and then 
> `decimal(14, 6) + decimal(3, 2)`  results in `decimal(15, 6)`
> * For `(decimal(12, 6) + decimal(3,2)) + decimal(12, 5)`, the first add 
> `decimal(12, 6) + decimal(3,2)` results in `decimal(13, 6)`, and then 
> `decimal(13, 6) + decimal(12, 5)` results in `decimal(14, 6)`
> In the following query:
> ```
> create table foo(a decimal(12, 5), b decimal(12, 6)) using orc
> select sum(coalesce(a+b+ 1.75, a)) from foo
> ```
> At first `coalesce(a+b+ 1.75, a)` is resolved as `coalesce(a+b+ 1.75, cast(a 
> as decimal(15, 6)))`. In the canonicalized version, the expression becomes 
> `coalesce(1.75+b+a, cast(a as decimal(15, 6)))`. As explained above, 
> `1.75+b+a` is of type decimal(14, 6), which is different from `cast(a as 
> decimal(15, 6))`. Thus the following error occurs:
> {code:java}
> java.lang.IllegalArgumentException: requirement failed: All input types must 
> be the same except nullable, containsNull, valueContainsNull flags. The input 
> types found are
>   DecimalType(14,6)
>   DecimalType(15,6)
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1149)
>   at 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1143)
>  {code}
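
A quick spark-shell check of the order dependence described above (a sketch; the
commented result types follow the precision/scale rule quoted in this description
and assume Spark's default decimal promotion settings):

{code:java}
// Sketch: the two groupings of the same three decimal operands yield different
// result types, which is why canonicalization must not reorder a decimal Add.
val t1 = spark.sql(
  "SELECT (CAST(1 AS DECIMAL(12,5)) + CAST(1 AS DECIMAL(12,6))) + CAST(1.75 AS DECIMAL(3,2))")
val t2 = spark.sql(
  "SELECT (CAST(1 AS DECIMAL(12,6)) + CAST(1.75 AS DECIMAL(3,2))) + CAST(1 AS DECIMAL(12,5))")
println(t1.schema.head.dataType) // per the rule above: DecimalType(15,6)
println(t2.schema.head.dataType) // per the rule above: DecimalType(14,6)
{code}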






[jira] [Commented] (SPARK-40903) Avoid reordering decimal Add for canonicalization

2022-11-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629164#comment-17629164
 ] 

Apache Spark commented on SPARK-40903:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/38513

> Avoid reordering decimal Add for canonicalization
> -
>
> Key: SPARK-40903
> URL: https://issues.apache.org/jira/browse/SPARK-40903
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Avoid reordering Add for canonicalization if it is of decimal type.
> Expressions are canonicalized for comparisons and explanations. For a 
> non-decimal Add expression, the operands can be sorted by hashcode, and the 
> result is supposed to be the same.
> However, for an Add expression of Decimal type, the behavior is different: given 
> decimal (p1, s1) and another decimal (p2, s2), the integral part of the result 
> has `max(p1-s1, p2-s2) + 1` digits and the scale is `max(s1, s2)`. Thus the 
> result data type is `(max(p1-s1, p2-s2) + 1 + max(s1, s2), max(s1, s2))`.
> Thus the order matters:
> * For `(decimal(12,5) + decimal(12,6)) + decimal(3, 2)`, the first add 
> `decimal(12,5) + decimal(12,6)` results in `decimal(14, 6)`, and then 
> `decimal(14, 6) + decimal(3, 2)`  results in `decimal(15, 6)`
> * For `(decimal(12, 6) + decimal(3,2)) + decimal(12, 5)`, the first add 
> `decimal(12, 6) + decimal(3,2)` results in `decimal(13, 6)`, and then 
> `decimal(13, 6) + decimal(12, 5)` results in `decimal(14, 6)`
> In the following query:
> ```
> create table foo(a decimal(12, 5), b decimal(12, 6)) using orc
> select sum(coalesce(a+b+ 1.75, a)) from foo
> ```
> At first `coalesce(a+b+ 1.75, a)` is resolved as `coalesce(a+b+ 1.75, cast(a 
> as decimal(15, 6)))`. In the canonicalized version, the expression becomes 
> `coalesce(1.75+b+a, cast(a as decimal(15, 6)))`. As explained above, 
> `1.75+b+a` is of type decimal(14, 6), which is different from `cast(a as 
> decimal(15, 6))`. Thus the following error occurs:
> {code:java}
> java.lang.IllegalArgumentException: requirement failed: All input types must 
> be the same except nullable, containsNull, valueContainsNull flags. The input 
> types found are
>   DecimalType(14,6)
>   DecimalType(15,6)
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1149)
>   at 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1143)
>  {code}






[jira] [Resolved] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes

2022-11-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40769.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38498
[https://github.com/apache/spark/pull/38498]

> Migrate type check failures of aggregate expressions onto error classes
> ---
>
> Key: SPARK-40769
> URL: https://issues.apache.org/jira/browse/SPARK-40769
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate 
> expressions:
> 1. Count (1):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59
> 2. CollectSet (1):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180
> 3. CountMinSketchAgg (4):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95
> 4. HistogramNumeric (3):
> https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96






[jira] [Commented] (SPARK-39740) vis-timeline @ 4.2.1 vulnerable to XSS attacks

2022-11-04 Thread Eugene Shinn (Truveta) (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629113#comment-17629113
 ] 

Eugene Shinn (Truveta) commented on SPARK-39740:


Any updates here?

> vis-timeline @ 4.2.1 vulnerable to XSS attacks
> --
>
> Key: SPARK-39740
> URL: https://issues.apache.org/jira/browse/SPARK-39740
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Eugene Shinn (Truveta)
>Priority: Major
>
> Spark UI includes visjs/vis-timeline package@4.2.1, which is vulnerable to 
> XSS attacks ([Cross-site Scripting in vis-timeline · CVE-2020-28487 · GitHub 
> Advisory Database|https://github.com/advisories/GHSA-9mrv-456v-pf22]). This 
> version should be replaced with the next non-vulnerable release - [Release 
> v7.4.4 · visjs/vis-timeline 
> (github.com)|https://github.com/visjs/vis-timeline/releases/tag/v7.4.4] or 
> higher.






[jira] [Commented] (SPARK-38564) Support collecting metrics from streaming sinks

2022-11-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629061#comment-17629061
 ] 

Apache Spark commented on SPARK-38564:
--

User 'FouadApp' has created a pull request for this issue:
https://github.com/apache/spark/pull/38512

> Support collecting metrics from streaming sinks
> ---
>
> Key: SPARK-38564
> URL: https://issues.apache.org/jira/browse/SPARK-38564
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Boyang Jerry Peng
>Assignee: Boyang Jerry Peng
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, only streaming sources have the capability to return custom 
> metrics, but sinks do not. Allowing streaming sinks to also return custom 
> metrics would be very useful.
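
For context, a rough sketch of the existing source-side hook that this ticket wants
mirrored on the sink side. The interface and method names below (ReportsSourceMetrics,
Offset) are written from memory of the DSv2 streaming API and should be treated as an
assumption; the metric name is purely illustrative:

{code:java}
import java.util.{Map => JMap, Optional}
import org.apache.spark.sql.connector.read.streaming.{Offset, ReportsSourceMetrics}

// Sketch: a streaming source mixes this in and the returned values surface under
// sources[i].metrics in StreamingQueryProgress. The proposal is an equivalent
// hook for sinks so their custom metrics show up in the progress events too.
trait MySourceMetrics extends ReportsSourceMetrics {
  override def metrics(latestConsumedOffset: Optional[Offset]): JMap[String, String] = {
    val m = new java.util.HashMap[String, String]()
    m.put("recordsBehindLatest", "0") // illustrative metric name
    m
  }
}
{code}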






[jira] [Assigned] (SPARK-41017) Do not push Filter through reference-only Project

2022-11-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41017:


Assignee: Wenchen Fan  (was: Apache Spark)

> Do not push Filter through reference-only Project
> -
>
> Key: SPARK-41017
> URL: https://issues.apache.org/jira/browse/SPARK-41017
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-41017) Do not push Filter through reference-only Project

2022-11-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41017:


Assignee: Apache Spark  (was: Wenchen Fan)

> Do not push Filter through reference-only Project
> -
>
> Key: SPARK-41017
> URL: https://issues.apache.org/jira/browse/SPARK-41017
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-41017) Do not push Filter through reference-only Project

2022-11-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629030#comment-17629030
 ] 

Apache Spark commented on SPARK-41017:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/38511

> Do not push Filter through reference-only Project
> -
>
> Key: SPARK-41017
> URL: https://issues.apache.org/jira/browse/SPARK-41017
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Resolved] (SPARK-40533) Extend type support for Spark Connect literals

2022-11-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40533.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38462
[https://github.com/apache/spark/pull/38462]

> Extend type support for Spark Connect literals
> --
>
> Key: SPARK-40533
> URL: https://issues.apache.org/jira/browse/SPARK-40533
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>
> Extend type support in Literal transformation: Missing support for Instant, 
> BigDecimal, LocalDate, LocalTimestamp, Duration, Period.






[jira] [Assigned] (SPARK-40533) Extend type support for Spark Connect literals

2022-11-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40533:


Assignee: Martin Grund

> Extend type support for Spark Connect literals
> --
>
> Key: SPARK-40533
> URL: https://issues.apache.org/jira/browse/SPARK-40533
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>
> Extend type support in Literal transformation: Missing support for Instant, 
> BigDecimal, LocalDate, LocalTimestamp, Duration, Period.






[jira] [Created] (SPARK-41017) Do not push Filter through reference-only Project

2022-11-04 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-41017:
---

 Summary: Do not push Filter through reference-only Project
 Key: SPARK-41017
 URL: https://issues.apache.org/jira/browse/SPARK-41017
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Created] (SPARK-41016) Identical expressions should not be used on both sides of a binary operator

2022-11-04 Thread Jira
Bjørn Jørgensen created SPARK-41016:
---

 Summary: Identical expressions should not be used on both sides of 
a binary operator
 Key: SPARK-41016
 URL: https://issues.apache.org/jira/browse/SPARK-41016
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Bjørn Jørgensen


Identical expressions should not be used on both sides of a binary operator.
There are a lot of these, 278 in total, such as

{code:java}
self.check_extension(pser != pser, psser != psser)
{code}


in the pandas API on Spark test code.

[Issues - spark-python - Bjørn Jørgensen 
(sonarcloud.io)|https://sonarcloud.io/project/issues?languages=py=false=python%3AS1764=spark-python]

We either need to fix this or remove them.






[jira] [Updated] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala

2022-11-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-41015:
-
Description: 
To reproduce the issue, set the seed to 38:

{code}
diff --git 
a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
 
b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
index 271c5b0fec..080bf1eb1f 100644
--- 
a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
+++ 
b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
@@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite
 StringType -> ("StringMsg", ""))

   testingTypes.foreach { dt =>
-val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1)
+val seed = 38
 test(s"single $dt with seed $seed") {

   val (messageName, defaultValue) = 
catalystTypesToProtoMessages(dt.fields(0).dataType)
{code}

and run the test:

{code}
build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite"
{code}
which fails with NPE:
{code}
[info] - single StructType(StructField(double_type,DoubleType,true)) with seed 
38 *** FAILED *** (10 milliseconds)
[info]   java.lang.NullPointerException:
[info]   at 
org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:134)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
{code}



  was:
To reproduce the issue, set the seed to 38:

{code:diff}
diff --git 
a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
 
b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
index 271c5b0fec..080bf1eb1f 100644
--- 
a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
+++ 
b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
@@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite
 StringType -> ("StringMsg", ""))

   testingTypes.foreach { dt =>
-val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1)
+val seed = 38
 test(s"single $dt with seed $seed") {

   val (messageName, defaultValue) = 
catalystTypesToProtoMessages(dt.fields(0).dataType)
{code}

and run the test:

{code}
build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite"
{code}
which fails with NPE:
{code}
[info] - single StructType(StructField(double_type,DoubleType,true)) with seed 
38 *** FAILED *** (10 milliseconds)
[info]   java.lang.NullPointerException:
[info]   at 
org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:134)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
{code}




> Failure of ProtobufCatalystDataConversionSuite.scala
> 
>
> Key: SPARK-41015
> URL: https://issues.apache.org/jira/browse/SPARK-41015
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> To reproduce the issue, set the seed to 38:
> {code}
> diff --git 
> a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
>  
> b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> index 271c5b0fec..080bf1eb1f 100644
> --- 
> a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> +++ 
> b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> @@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite
>  StringType -> ("StringMsg", ""))
>testingTypes.foreach { dt =>
> -val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1)
> +val seed = 38
>  test(s"single $dt with seed $seed") {
>val (messageName, defaultValue) = 
> catalystTypesToProtoMessages(dt.fields(0).dataType)
> {code}
> and run the test:
> {code}
> build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite"
> {code}
> which fails with NPE:
> {code}
> [info] - single StructType(StructField(double_type,DoubleType,true)) with 
> seed 38 *** FAILED *** (10 milliseconds)
> [info]   java.lang.NullPointerException:
> [info]   at 
> 

[jira] [Created] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala

2022-11-04 Thread Max Gekk (Jira)
Max Gekk created SPARK-41015:


 Summary: Failure of ProtobufCatalystDataConversionSuite.scala
 Key: SPARK-41015
 URL: https://issues.apache.org/jira/browse/SPARK-41015
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk


To reproduce the issue, set the seed to 38:

{code:diff}
diff --git 
a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
 
b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
index 271c5b0fec..080bf1eb1f 100644
--- 
a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
+++ 
b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
@@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite
 StringType -> ("StringMsg", ""))

   testingTypes.foreach { dt =>
-val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1)
+val seed = 38
 test(s"single $dt with seed $seed") {

   val (messageName, defaultValue) = 
catalystTypesToProtoMessages(dt.fields(0).dataType)
{code}

and run the test:

{code}
build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite"
{code}
which fails with NPE:
{code}
[info] - single StructType(StructField(double_type,DoubleType,true)) with seed 
38 *** FAILED *** (10 milliseconds)
[info]   java.lang.NullPointerException:
[info]   at 
org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:134)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
{code}








[jira] [Commented] (SPARK-40954) Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker

2022-11-04 Thread Anton Ippolitov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628904#comment-17628904
 ] 

Anton Ippolitov commented on SPARK-40954:
-

Thank you for the suggestion [~dcoliversun]!

The hyperkit driver is not supported on M1 
([https://github.com/kubernetes/minikube/issues/11885]); however, I managed to 
run the Spark on Kubernetes integration tests against Minikube with the 
experimental [qemu2|https://github.com/kubernetes/minikube/pull/13639] driver. I 
also had to use the experimental 
[socket_vmnet|https://minikube.sigs.k8s.io/docs/drivers/qemu/#networking] 
network in order for the [minikube 
service|https://github.com/apache/spark/blob/01014aa99fa851411262a6719058dde97319bbb3/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/backend/minikube/Minikube.scala#L111-L113]
 call to work:
{noformat}
minikube start --driver qemu2 --network socket_vmnet{noformat}
The tests pass now.

I think it would be good to document this as a workaround for running Minikube 
integration tests on M1? There is also the experimental 
[podman|https://minikube.sigs.k8s.io/docs/drivers/podman/] Minikube driver 
which is supported on M1 but I haven't tried it.

 

> Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker
> ---
>
> Key: SPARK-40954
> URL: https://issues.apache.org/jira/browse/SPARK-40954
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.1
> Environment: MacOS 12.6 (Mac M1)
> Minikube 1.27.1
> Docker 20.10.17
>Reporter: Anton Ippolitov
>Priority: Minor
> Attachments: TestProcess.scala
>
>
> h2. Description
> I tried running Kubernetes integration tests with the Minikube backend (+ 
> Docker driver) from commit c26d99e3f104f6603e0849d82eca03e28f196551 on 
> Spark's master branch. I ran them with the following command:
>  
> {code:java}
> mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \
> -Pkubernetes -Pkubernetes-integration-tests \
> -Phadoop-3 \
> -Dspark.kubernetes.test.imageTag=MY_IMAGE_TAG_HERE \
> -Dspark.kubernetes.test.imageRepo=docker.io/kubespark 
> \
> -Dspark.kubernetes.test.namespace=spark \
> -Dspark.kubernetes.test.serviceAccountName=spark \
> -Dspark.kubernetes.test.deployMode=minikube  {code}
> However the test suite got stuck literally for hours on my machine. 
>  
> h2. Investigation
> I ran {{jstack}} on the process that was running the tests and saw that it 
> was stuck here:
>  
> {noformat}
> "ScalaTest-main-running-KubernetesSuite" #1 prio=5 os_prio=31 
> tid=0x7f78d580b800 nid=0x2503 runnable [0x000304749000]
>    java.lang.Thread.State: RUNNABLE
>     at java.io.FileInputStream.readBytes(Native Method)
>     at java.io.FileInputStream.read(FileInputStream.java:255)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>     - locked <0x00076c0b6f40> (a 
> java.lang.UNIXProcess$ProcessPipeInputStream)
>     at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>     at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>     at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>     - locked <0x00076c0bb410> (a java.io.InputStreamReader)
>     at java.io.InputStreamReader.read(InputStreamReader.java:184)
>     at java.io.BufferedReader.fill(BufferedReader.java:161)
>     at java.io.BufferedReader.readLine(BufferedReader.java:324)
>     - locked <0x00076c0bb410> (a java.io.InputStreamReader)
>     at java.io.BufferedReader.readLine(BufferedReader.java:389)
>     at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2(ProcessUtils.scala:45)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2$adapted(ProcessUtils.scala:45)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$$$Lambda$322/20156341.apply(Unknown
>  Source)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.Utils$.tryWithResource(Utils.scala:49)
>     at 
> org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.executeProcess(ProcessUtils.scala:45)
>     at 
> 

[jira] [Resolved] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE

2022-11-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41012.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38508
[https://github.com/apache/spark/pull/38508]

> Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
> ---
>
> Key: SPARK-41012
> URL: https://issues.apache.org/jira/browse/SPARK-41012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.0
>
>
> Rename _LEGACY_ERROR_TEMP_1022 to a proper name.
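
For context, this is the error raised when an ORDER BY ordinal falls outside the 
select list. A minimal PySpark sketch (illustrative only, not taken from the linked 
PR; assumes a plain local SparkSession) that should hit it:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Only one column is selected, so ordinal 2 is out of range and analysis fails
# with the order-by-position error that this ticket renames.
spark.sql("SELECT id FROM range(10) ORDER BY 2").show()
{code}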



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE

2022-11-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41012:


Assignee: Haejoon Lee

> Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
> ---
>
> Key: SPARK-41012
> URL: https://issues.apache.org/jira/browse/SPARK-41012
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Rename _LEGACY_ERROR_TEMP_1022 to a proper name.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41014) Improve documentation and typing of applyInPandas for groupby and cogroup

2022-11-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41014:


Assignee: Apache Spark

> Improve documentation and typing of applyInPandas for groupby and cogroup
> -
>
> Key: SPARK-41014
> URL: https://issues.apache.org/jira/browse/SPARK-41014
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Assignee: Apache Spark
>Priority: Trivial
>
> Documentation of method `applyInPandas` for groupby and cogroup does not 
> mention in the main description that there are two allowed signatures for the 
> provided function. The Examples state that this is possible, and the 
> parameters section states it for cogroup.
> That should be made cleaner.
> Finally, type information for `PandasCogroupedMapFunction` does not mention 
> the three-argument callable alternative.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41014) Improve documentation and typing of applyInPandas for groupby and cogroup

2022-11-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628853#comment-17628853
 ] 

Apache Spark commented on SPARK-41014:
--

User 'EnricoMi' has created a pull request for this issue:
https://github.com/apache/spark/pull/38509

> Improve documentation and typing of applyInPandas for groupby and cogroup
> -
>
> Key: SPARK-41014
> URL: https://issues.apache.org/jira/browse/SPARK-41014
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Trivial
>
> Documentation of method `applyInPandas` for groupby and cogroup does not 
> mention in the main description that there are two allowed signatures for the 
> provided function. The Examples state that this is possible, and the 
> parameters section states it for cogroup.
> That should be made cleaner.
> Finally, type information for `PandasCogroupedMapFunction` does not mention 
> the three-argument callable alternative.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41014) Improve documentation and typing of applyInPandas for groupby and cogroup

2022-11-04 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41014:


Assignee: (was: Apache Spark)

> Improve documentation and typing of applyInPandas for groupby and cogroup
> -
>
> Key: SPARK-41014
> URL: https://issues.apache.org/jira/browse/SPARK-41014
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Trivial
>
> Documentation of method `applyInPandas` for groupby and cogroup does not 
> mention in the main description that there are two allowed signatures for the 
> provided function. The Examples state that this is possible, and the 
> parameters section states it for cogroup.
> That should be made cleaner.
> Finally, type information for `PandasCogroupedMapFunction` does not mention 
> the three-argument callable alternative.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41014) Improve documentation and typing of applyInPandas for groupby and cogroup

2022-11-04 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-41014:
--
Description: 
Documentation of method `applyInPandas` for groupby and cogroup does not mention 
in the main description that there are two allowed signatures for the provided 
function. The Examples state that this is possible, and the parameters section 
states it for cogroup.

That should be made cleaner.

Finally, type information for `PandasCogroupedMapFunction` does not mention the 
three-argument callable alternative.

  was:
Documentation of method `applyInPandas` for groupby and cogroup does not mention 
in the main description that there are two allowed signatures for the provided 
function. The Examples state this is possible, and the parameters section 
states it for cogroup.

That should be made cleaner.

Finally, type information for `PandasCogroupedMapFunction` does not mention the 
three-argument callable alternative.


> Improve documentation and typing of applyInPandas for groupby and cogroup
> -
>
> Key: SPARK-41014
> URL: https://issues.apache.org/jira/browse/SPARK-41014
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Trivial
>
> Documentation of method `applyInPandas` for groupby and cogroup does not 
> mention in the main description that there are two allowed signatures for the 
> provided function. The Examples state that this is possible, and the 
> parameters section states it for cogroup.
> That should be made cleaner.
> Finally, type information for `PandasCogroupedMapFunction` does not mention 
> the three-argument callable alternative.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41014) Improve documentation and typing of applyInPandas for groupby and cogroup

2022-11-04 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-41014:
-

 Summary: Improve documentation and typing of applyInPandas for 
groupby and cogroup
 Key: SPARK-41014
 URL: https://issues.apache.org/jira/browse/SPARK-41014
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Enrico Minack


Documentation of method `applyInPandas` for groupby and cogroup does not mention 
in the main description that there are two allowed signatures for the provided 
function. The Examples state this is possible, and the parameters section 
states it for cogroup.

That should be made cleaner.

Finally, type information for `PandasCogroupedMapFunction` does not mention the 
three-argument callable alternative.
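
For reference, the two call shapes in question look roughly like the following 
minimal sketch against the public PySpark API (illustrative only, not part of the 
documentation change itself; assumes pandas and pyarrow are installed):

{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

# groupby().applyInPandas() accepts a one-argument function ...
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

# ... or a two-argument function that also receives the grouping key (a tuple).
def subtract_mean_with_key(key, pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
df.groupby("id").applyInPandas(subtract_mean_with_key, schema="id long, v double").show()

# cogroup().applyInPandas() similarly takes (left, right) or (key, left, right).
df2 = spark.createDataFrame([(1, "x"), (2, "y")], ("id", "w"))

def merge_frames(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    return pd.merge(left, right, on="id")

df.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    merge_frames, schema="id long, v double, w string").show()
{code}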



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41013) spark-3.1.2 job submitted in cluster mode fails with Could not initialize class com.github.luben.zstd.ZstdOutputStream

2022-11-04 Thread yutiantian (Jira)
yutiantian created SPARK-41013:
--

 Summary: spark-3.1.2 job submitted in cluster mode fails with Could not 
initialize class com.github.luben.zstd.ZstdOutputStream
 Key: SPARK-41013
 URL: https://issues.apache.org/jira/browse/SPARK-41013
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: yutiantian


When submitting a job in cluster mode with spark-3.1.2, it fails with

Could not initialize class com.github.luben.zstd.ZstdOutputStream. The full log is as follows:

{code:java}
Exception in thread "map-output-dispatcher-0" Exception in thread "map-output-dispatcher-2" java.lang.ExceptionInInitializerError: Cannot unpack libzstd-jni: No such file or directory
    at java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.io.File.createTempFile(File.java:2024)
    at com.github.luben.zstd.util.Native.load(Native.java:97)
    at com.github.luben.zstd.util.Native.load(Native.java:55)
    at com.github.luben.zstd.ZstdOutputStream.(ZstdOutputStream.java:16)
    at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
    at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
    at org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72)
    at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
    at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "map-output-dispatcher-7" Exception in thread "map-output-dispatcher-5" java.lang.NoClassDefFoundError: Could not initialize class com.github.luben.zstd.ZstdOutputStream
    at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
    at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
    at org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72)
    at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
    at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "map-output-dispatcher-4" Exception in thread "map-output-dispatcher-3" java.lang.NoClassDefFoundError: Could not initialize class com.github.luben.zstd.ZstdOutputStream
    at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
    at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
    at org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72)
    at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
    at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
java.lang.NoClassDefFoundError: Could not initialize class com.github.luben.zstd.ZstdOutputStream
    at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
    at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
    at org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72)
    at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
    at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}

However, the same code runs fine when submitted in client mode.

As a temporary workaround for cluster mode, setting spark.shuffle.mapStatus.compression.codec 
to lz4 in spark-defaults.conf lets the job be submitted normally.

Why does zstd compression fail during shuffle when the job is submitted in cluster mode?

Any ideas would be greatly appreciated.
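
For reference, the workaround described above just overrides the map-status codec, 
either in spark-defaults.conf or per submission (shown here only as an illustration):

{noformat}
# spark-defaults.conf
spark.shuffle.mapStatus.compression.codec  lz4

# or on the spark-submit command line
spark-submit --deploy-mode cluster \
  --conf spark.shuffle.mapStatus.compression.codec=lz4 \
  ...
{noformat}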



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (SPARK-40749) Migrate type check failures of generators onto error classes

2022-11-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40749:


Assignee: BingKun Pan

> Migrate type check failures of generators onto error classes
> 
>
> Key: SPARK-40749
> URL: https://issues.apache.org/jira/browse/SPARK-40749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: BingKun Pan
>Priority: Major
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the generator 
> expressions:
> 1. Stack (3): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L163-L170
> 2. ExplodeBase (1): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L299
> 3. Inline (1):
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L441
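
For context, these checks fire when a generator is called with arguments of the wrong 
type. A minimal PySpark sketch (illustrative only, not from the PR) that should trip 
the Stack check, since stack() expects a foldable integer as its first argument:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The row count for stack() must be a constant positive integer; passing a
# string hits the generator type check this ticket migrates to error classes.
spark.sql("SELECT stack('two', 1, 2, 3, 4)").show()
{code}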



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40749) Migrate type check failures of generators onto error classes

2022-11-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40749.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38482
[https://github.com/apache/spark/pull/38482]

> Migrate type check failures of generators onto error classes
> 
>
> Key: SPARK-40749
> URL: https://issues.apache.org/jira/browse/SPARK-40749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.4.0
>
>
> Replace TypeCheckFailure by DataTypeMismatch in type checks in the generator 
> expressions:
> 1. Stack (3): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L163-L170
> 2. ExplodeBase (1): 
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L299
> 3. Inline (1):
> https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L441



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41004) Check error classes in InterceptorRegistrySuite

2022-11-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-41004.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38494
[https://github.com/apache/spark/pull/38494]

> Check error classes in InterceptorRegistrySuite
> ---
>
> Key: SPARK-41004
> URL: https://issues.apache.org/jira/browse/SPARK-41004
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> - CONNECT.INTERCEPTOR_CTOR_MISSING
>  - CONNECT.INTERCEPTOR_RUNTIME_ERROR



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite

2022-11-04 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-41004:


Assignee: BingKun Pan

> Check error classes in InterceptorRegistrySuite
> ---
>
> Key: SPARK-41004
> URL: https://issues.apache.org/jira/browse/SPARK-41004
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>
> - CONNECT.INTERCEPTOR_CTOR_MISSING
>  - CONNECT.INTERCEPTOR_RUNTIME_ERROR



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed

2022-11-04 Thread Yilun Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628700#comment-17628700
 ] 

Yilun Fan commented on SPARK-33349:
---

I also ran into this problem with Spark 3.2.1 and kubernetes-client 5.4.1.

 
{code:java}
ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is 
expected if the application is shutting down.) 
io.fabric8.kubernetes.client.WatcherException: too old resource version: 
63993943 (64057995) 
at 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103){code}
I think we have to add some retry logic in ExecutorPodsWatchSnapshotSource, 
especially when spark.kubernetes.executor.enableApiPolling is disabled, since then 
this watcher is the only way to receive executor pod status updates.

Just like what Spark already does in the submit client: 
[https://github.com/apache/spark/pull/29533/files] 

 

> ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
> --
>
> Key: SPARK-33349
> URL: https://issues.apache.org/jira/browse/SPARK-33349
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2, 3.1.0
>Reporter: Nicola Bova
>Priority: Critical
>
> I launch my spark application with the 
> [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator]
>  using the following yaml file:
> {code:yaml}
> apiVersion: sparkoperator.k8s.io/v1beta2
> kind: SparkApplication
> metadata:
>    name: spark-kafka-streamer-test
>    namespace: kafka2hdfs
> spec: 
>    type: Scala
>    mode: cluster
>    image: /spark:3.0.2-SNAPSHOT-2.12-0.1.0
>    imagePullPolicy: Always
>    timeToLiveSeconds: 259200
>    mainClass: path.to.my.class.KafkaStreamer
>    mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar
>    sparkVersion: 3.0.1
>    restartPolicy:
>  type: Always
>    sparkConf:
>  "spark.kafka.consumer.cache.capacity": "8192"
>  "spark.kubernetes.memoryOverheadFactor": "0.3"
>    deps:
>    jars:
>  - my
>  - jar
>  - list
>    hadoopConfigMap: hdfs-config
>    driver:
>  cores: 4
>  memory: 12g
>  labels:
>    version: 3.0.1
>  serviceAccount: default
>  javaOptions: 
> "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties"
>   executor:
>  instances: 4
>     cores: 4
>     memory: 16g
>     labels:
>   version: 3.0.1
>     javaOptions: 
> "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties"
> {code}
>  I have tried with both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the ["Restart 
> the watcher when we receive a version changed from 
> k8s"|https://github.com/apache/spark/pull/29533] patch.
> This is the driver log:
> {code}
> 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> ... // my app log, it's a structured streaming app reading from kafka and 
> writing to hdfs
> 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
> been closed (this is expected if the application is shutting down.)
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource 
> version: 1574101276 (1574213896)
>  at 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
>  at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
>  at 
> okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
>  at 
> okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
>  at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
>  at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
>  at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
>  at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
> Source)
>  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
> Source)
>  at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> The error above appears after roughly 50 minutes.
> After the exception above, no more logs are produced and the app hangs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org