[jira] [Assigned] (SPARK-39591) Offset Management Improvements in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39591:
Assignee: Apache Spark

> Offset Management Improvements in Structured Streaming
> ------------------------------------------------------
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Boyang Jerry Peng
> Assignee: Apache Spark
> Priority: Major
>
> Currently in Structured Streaming, at the beginning of every micro-batch the offset to process up to for the current batch is persisted to durable storage, and at the end of every micro-batch a marker indicating the completion of that micro-batch is persisted to durable storage. For pipelines that read from Kafka and write to Kafka, where end-to-end exactly-once is not supported and latency is critical, we can allow users to configure offset commits to be written asynchronously. The commit operation then no longer contributes to the batch duration, effectively lowering the overall latency of the pipeline.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
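The idea in the description can be illustrated with a toy sketch in plain Python. This is not the Spark API: `OffsetLog`, `run_batches`, and every other name here are invented for illustration. The point is that handing the offset commit to a background writer thread takes the durable write off the micro-batch critical path.

```python
# Toy model of synchronous vs. asynchronous offset commits (not Spark code).
import threading
import queue

class OffsetLog:
    """Stand-in for the durable offset log; records committed batch ids."""
    def __init__(self):
        self.committed = []
        self.lock = threading.Lock()

    def commit(self, batch_id):
        with self.lock:
            self.committed.append(batch_id)

def run_batches(n, log, async_commit):
    if async_commit:
        # Commits are queued and written by a background thread, so the
        # micro-batch loop never waits on the durable-storage write.
        q = queue.Queue()
        def writer():
            while True:
                item = q.get()
                if item is None:
                    break
                log.commit(item)
        t = threading.Thread(target=writer)
        t.start()
    for batch_id in range(n):
        pass  # process the micro-batch here
        if async_commit:
            q.put(batch_id)       # enqueue and return immediately
        else:
            log.commit(batch_id)  # blocks the batch until the write finishes
    if async_commit:
        q.put(None)  # signal the writer to stop, then drain it
        t.join()

log = OffsetLog()
run_batches(5, log, async_commit=True)
print(sorted(log.committed))  # [0, 1, 2, 3, 4]
```

The trade-off, as the description implies, is that async commits are only acceptable when the sink does not need end-to-end exactly-once guarantees.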
[jira] [Commented] (SPARK-39591) Offset Management Improvements in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629252#comment-17629252 ]

Apache Spark commented on SPARK-39591:
User 'jerrypeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38517

> Offset Management Improvements in Structured Streaming
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
[jira] [Assigned] (SPARK-39591) Offset Management Improvements in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39591:
Assignee: (was: Apache Spark)

> Offset Management Improvements in Structured Streaming
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
[jira] [Commented] (SPARK-41013) Submitting a job in cluster mode with spark-3.1.2 fails with Could not initialize class com.github.luben.zstd.ZstdOutputStream
[ https://issues.apache.org/jira/browse/SPARK-41013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629240#comment-17629240 ]

Yuming Wang commented on SPARK-41013:
Could you test with Spark 3.3.1?

> Submitting a job in cluster mode with spark-3.1.2 fails with Could not initialize class com.github.luben.zstd.ZstdOutputStream
> ------------------------------------------------------------
> Key: SPARK-41013
> URL: https://issues.apache.org/jira/browse/SPARK-41013
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2
> Reporter: yutiantian
> Priority: Blocker
> Labels: libzstd-jni, spark.shuffle.mapStatus.compression.codec, zstd
>
> Submitting a job with Spark 3.1.2 in cluster mode fails with "Could not initialize class com.github.luben.zstd.ZstdOutputStream". The log is as follows:
> Exception in thread "map-output-dispatcher-0" java.lang.ExceptionInInitializerError: Cannot unpack libzstd-jni: No such file or directory
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createTempFile(File.java:2024)
> at com.github.luben.zstd.util.Native.load(Native.java:97)
> at com.github.luben.zstd.util.Native.load(Native.java:55)
> at com.github.luben.zstd.ZstdOutputStream.(ZstdOutputStream.java:16)
> at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
> at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
> at org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72)
> at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
> at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Exception in thread "map-output-dispatcher-5" java.lang.NoClassDefFoundError: Could not initialize class com.github.luben.zstd.ZstdOutputStream
> at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223)
> at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910)
> at org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233)
> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72)
> at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230)
> at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> (The remaining "map-output-dispatcher" threads fail with this same NoClassDefFoundError stack trace; the last copy in the original report is truncated.)
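A possible workaround sketch, not taken from the ticket and to be verified against your environment: the initializer fails while zstd-jni tries to unpack its native library into `java.io.tmpdir`, so pointing driver and executors at an existing, writable temp directory, or switching the map-status codec away from zstd, may avoid the crash. `spark.shuffle.mapStatus.compression.codec` is the configuration named in the issue labels; the paths and jar name below are placeholders.

```shell
# Sketch only: confirm these settings against your Spark version's docs.
# Option 1: give driver and executors a usable temp dir for zstd-jni.
spark-submit \
  --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/tmp" \
  --conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=/tmp" \
  your-app.jar

# Option 2: avoid the zstd codec for map status serialization entirely.
spark-submit \
  --deploy-mode cluster \
  --conf spark.shuffle.mapStatus.compression.codec=lz4 \
  your-app.jar
```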
[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629239#comment-17629239 ]

Apache Spark commented on SPARK-32380:
User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/38516

> sparksql cannot access hive table while data in hbase
> -----------------------------------------------------
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Environment:
> ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
> Reporter: deyzhong
> Priority: Major
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> * step 1: create the HBase table (the original report mixes the names 'hbase_test1'/'hbase_test' and the columns 'cf1:c1'/'cf1:v1'; they are normalized here to match the Hive mapping below)
> {code:java}
> hbase(main):001:0> create 'hbase_test', 'cf1'
> hbase(main):002:0> put 'hbase_test', 'r1', 'cf1:v1', '123'
> {code}
> * step 2: create the Hive table mapped to the HBase table
> {code:java}
> hive>
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '',
>   `value` string COMMENT '')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.hbase.HBaseSerDe'
> STORED BY
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES (
>   'hbase.columns.mapping'=':key,cf1:v1',
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
> {code}
> * step 3: query the Hive table (with data in HBase) from Spark SQL
> {code:java}
> spark-sql --master yarn -e "select * from hivetest.hbase_test"
> {code}
>
> The error log is as follows:
> java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
> at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
> at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
> at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
> at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
> at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
> at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
> at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
> at org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
> at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at (truncated)
[jira] [Commented] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala
[ https://issues.apache.org/jira/browse/SPARK-41015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629230#comment-17629230 ]

Apache Spark commented on SPARK-41015:
User 'SandishKumarHN' has created a pull request for this issue: https://github.com/apache/spark/pull/38515

> Failure of ProtobufCatalystDataConversionSuite.scala
> ----------------------------------------------------
> Key: SPARK-41015
> URL: https://issues.apache.org/jira/browse/SPARK-41015
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Max Gekk
> Priority: Major
>
> To reproduce the issue, set the seed to 38:
> {code}
> diff --git a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> index 271c5b0fec..080bf1eb1f 100644
> --- a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> +++ b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala
> @@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite
>      StringType -> ("StringMsg", ""))
>    testingTypes.foreach { dt =>
> -    val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1)
> +    val seed = 38
>      test(s"single $dt with seed $seed") {
>        val (messageName, defaultValue) = catalystTypesToProtoMessages(dt.fields(0).dataType)
> {code}
> and run the test:
> {code}
> build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite"
> {code}
> which fails with an NPE:
> {code}
> [info] - single StructType(StructField(double_type,DoubleType,true)) with seed 38 *** FAILED *** (10 milliseconds)
> [info]   java.lang.NullPointerException:
> [info]   at org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:134)
> [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> {code}
[jira] [Assigned] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala
[ https://issues.apache.org/jira/browse/SPARK-41015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41015:
Assignee: (was: Apache Spark)

> Failure of ProtobufCatalystDataConversionSuite.scala
> Key: SPARK-41015
> URL: https://issues.apache.org/jira/browse/SPARK-41015
[jira] [Assigned] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala
[ https://issues.apache.org/jira/browse/SPARK-41015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-41015:
Assignee: Apache Spark

> Failure of ProtobufCatalystDataConversionSuite.scala
> Key: SPARK-41015
> URL: https://issues.apache.org/jira/browse/SPARK-41015
[jira] [Commented] (SPARK-41018) Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives
[ https://issues.apache.org/jira/browse/SPARK-41018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629220#comment-17629220 ]

Nikesh commented on SPARK-41018:
Attached the notebooks I used. When the data size is small (notebook ZScoreWithKoalas_PandasOnSpark_SmallerDataset), the Koalas and pandas outputs match. But when the data size is large, the Koalas output differs from the pandas output.

> Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives
> Key: SPARK-41018
> URL: https://issues.apache.org/jira/browse/SPARK-41018
[jira] [Created] (SPARK-41018) Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives
Nikesh created SPARK-41018:
Summary: Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives
Key: SPARK-41018
URL: https://issues.apache.org/jira/browse/SPARK-41018
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.3.1
Environment: databricks
Reporter: Nikesh
Fix For: 3.3.1
Attachments: ZScoreWithKoalas_PandasOnSpark_BiggerDataset.html, ZScoreWithKoalas_PandasOnSpark_SmallerDataset.html

Hi,
I have a Koalas dataframe with age and income columns. I calculated a z-score on age and on income, then computed the norm from age_zscore and income_zscore (the new column is named sq_dist). When I call idxmin() on the new column, it does not return the minimum value.
I did the same operations on a pandas dataframe, and it does return the minimum value.
Please find attached the notebooks with the step-by-step operations I performed.
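The reported computation can be sketched in plain Python, without Koalas or pandas, to pin down what idxmin() is expected to return. The `zscores` and `idxmin` helpers and the sample data below are invented for illustration; the column names (age, income, sq_dist) follow the report. The expectation being tested in the ticket is that idxmin() returns the index of the smallest sq_dist regardless of data size or partitioning.

```python
# Plain-Python model of the reporter's pipeline (not Koalas/pandas code).
from statistics import mean, pstdev

def zscores(xs):
    """Standardize a column: (x - mean) / population stddev."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def idxmin(xs):
    """Index of the smallest value, which is what idxmin() should return."""
    return min(range(len(xs)), key=lambda i: xs[i])

age = [25, 32, 47, 51, 62]
income = [30_000, 42_000, 58_000, 61_000, 79_000]
age_z, income_z = zscores(age), zscores(income)
sq_dist = [a * a + b * b for a, b in zip(age_z, income_z)]

i = idxmin(sq_dist)
print(i, sq_dist[i] == min(sq_dist))  # the returned index holds the minimum
```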
[jira] [Updated] (SPARK-41018) Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives
[ https://issues.apache.org/jira/browse/SPARK-41018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikesh updated SPARK-41018:
Attachment: ZScoreWithKoalas_PandasOnSpark_SmallerDataset.html
            ZScoreWithKoalas_PandasOnSpark_BiggerDataset.html

> Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives
> Key: SPARK-41018
> URL: https://issues.apache.org/jira/browse/SPARK-41018
[jira] [Commented] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala
[ https://issues.apache.org/jira/browse/SPARK-41015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629204#comment-17629204 ]

Raghu Angadi commented on SPARK-41015:
Thanks for filing this [~maxgekk]. cc: [~sanysand...@gmail.com]

> Failure of ProtobufCatalystDataConversionSuite.scala
> Key: SPARK-41015
> URL: https://issues.apache.org/jira/browse/SPARK-41015
[jira] [Assigned] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan reassigned SPARK-40950:
Assignee: Emil Ejbyfeldt

> isRemoteAddressMaxedOut performance overhead on scala 2.13
> ----------------------------------------------------------
> Key: SPARK-40950
> URL: https://issues.apache.org/jira/browse/SPARK-40950
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.2.2, 3.3.1
> Reporter: Emil Ejbyfeldt
> Assignee: Emil Ejbyfeldt
> Priority: Major
>
> On Scala 2.13 the blocks in FetchRequest are sometimes backed by a `List`, while in 2.12 they would be backed by an `ArrayBuffer`. This means that calculating the length of the blocks, which is done in isRemoteAddressMaxedOut and other places, is now much more expensive. This is because in 2.13 a `Seq` can no longer be backed by a mutable collection.
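The performance issue described above can be illustrated in a language-neutral way (plain Python; `Cons` and `length` are invented names): a cons-style immutable list, like Scala 2.13's `List`, has to walk every cell to compute its length, so repeatedly asking for the size inside a per-request check like isRemoteAddressMaxedOut turns O(1) bookkeeping into O(n) per call.

```python
class Cons:
    """Immutable singly linked list cell, analogous to Scala's List."""
    def __init__(self, head, tail=None):
        self.head, self.tail = head, tail

def length(cell):
    # Walks every cell: O(n), unlike an array-backed buffer whose
    # size is stored and returned in O(1).
    n = 0
    while cell is not None:
        n += 1
        cell = cell.tail
    return n

# Build a 1000-element list of block ids, head-first.
blocks = None
for i in range(1000):
    blocks = Cons(i, blocks)

print(length(blocks))  # 1000 -- and every call repeats the full traversal
```

A natural fix, in any language, is to compute the size once (or maintain a counter alongside the collection) instead of re-deriving it on every call.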
[jira] [Resolved] (SPARK-40950) isRemoteAddressMaxedOut performance overhead on scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-40950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan resolved SPARK-40950.
Fix Version/s: 3.4.0
Resolution: Fixed
Issue resolved by pull request 38427
[https://github.com/apache/spark/pull/38427]

> isRemoteAddressMaxedOut performance overhead on scala 2.13
> Key: SPARK-40950
> URL: https://issues.apache.org/jira/browse/SPARK-40950
[jira] [Commented] (SPARK-40903) Avoid reordering decimal Add for canonicalization
[ https://issues.apache.org/jira/browse/SPARK-40903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629165#comment-17629165 ]

Apache Spark commented on SPARK-40903:
User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/38513

> Avoid reordering decimal Add for canonicalization
> -------------------------------------------------
> Key: SPARK-40903
> URL: https://issues.apache.org/jira/browse/SPARK-40903
> Project: Spark
> Issue Type: Test
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Fix For: 3.4.0
>
> Avoid reordering Add for canonicalization if it is of decimal type.
> Expressions are canonicalized for comparisons and explanations. For a non-decimal Add expression, the operands can be sorted by hashcode, and the result is supposed to be the same.
> However, for an Add expression of decimal type, the behavior is different: given a decimal (p1, s1) and another decimal (p2, s2), the result's integral part is `max(p1-s1, p2-s2) + 1` and the result's decimal part is `max(s1, s2)`. Thus the result data type is `(max(p1-s1, p2-s2) + 1 + max(s1, s2), max(s1, s2))`.
> Thus the order matters:
> * For `(decimal(12,5) + decimal(12,6)) + decimal(3,2)`, the first add `decimal(12,5) + decimal(12,6)` results in `decimal(14,6)`, and then `decimal(14,6) + decimal(3,2)` results in `decimal(15,6)`.
> * For `(decimal(12,6) + decimal(3,2)) + decimal(12,5)`, the first add `decimal(12,6) + decimal(3,2)` results in `decimal(13,6)`, and then `decimal(13,6) + decimal(12,5)` results in `decimal(14,6)`.
> In the following query:
> {code}
> create table foo(a decimal(12, 5), b decimal(12, 6)) using orc
> select sum(coalesce(a+b+ 1.75, a)) from foo
> {code}
> At first `coalesce(a+b+1.75, a)` is resolved as `coalesce(a+b+1.75, cast(a as decimal(15, 6)))`. In the canonicalized version, the expression becomes `coalesce(1.75+b+a, cast(a as decimal(15, 6)))`.
> As explained above, `1.75+b+a` is of decimal(14, 6), which differs from `cast(a as decimal(15, 6))`. Thus the following error occurs:
> {code:java}
> java.lang.IllegalArgumentException: requirement failed: All input types must be the same except nullable, containsNull, valueContainsNull flags. The input types found are
>     DecimalType(14,6)
>     DecimalType(15,6)
>     at scala.Predef$.require(Predef.scala:281)
>     at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1149)
>     at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1143)
> {code}
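The precision/scale rule quoted in the description can be checked with a short plain-Python helper (`add_result_type` is an invented name; the formula is exactly the one stated above). It reproduces why the two association orders of the same three decimals yield different result types.

```python
# Result type of adding decimal(p1, s1) and decimal(p2, s2), per the rule
# quoted in the issue: (max(p1-s1, p2-s2) + 1 + max(s1, s2), max(s1, s2)).
def add_result_type(a, b):
    (p1, s1), (p2, s2) = a, b
    s = max(s1, s2)
    return (max(p1 - s1, p2 - s2) + 1 + s, s)

# (decimal(12,5) + decimal(12,6)) + decimal(3,2)
left = add_result_type(add_result_type((12, 5), (12, 6)), (3, 2))
# (decimal(12,6) + decimal(3,2)) + decimal(12,5)
right = add_result_type(add_result_type((12, 6), (3, 2)), (12, 5))

print(left, right)  # (15, 6) (14, 6) -- reordering changes the result type
```

This is why canonicalization must not reorder decimal Add: the hashcode-sorted order can produce a different data type than the original expression.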
[jira] [Resolved] (SPARK-40769) Migrate type check failures of aggregate expressions onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40769. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38498 [https://github.com/apache/spark/pull/38498] > Migrate type check failures of aggregate expressions onto error classes > --- > > Key: SPARK-40769 > URL: https://issues.apache.org/jira/browse/SPARK-40769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the aggregate > expressions: > 1. Count (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala#L59 > 2. CollectSet (1): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180 > 3. CountMinSketchAgg (4): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala#L87-L95 > 4. HistogramNumeric (3): > https://github.com/apache/spark/blob/08678456d16bacfa91ad5f718b6d3fa51b1f6cc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HistogramNumeric.scala#L92-L96 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39740) vis-timeline @ 4.2.1 vulnerable to XSS attacks
[ https://issues.apache.org/jira/browse/SPARK-39740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629113#comment-17629113 ] Eugene Shinn (Truveta) commented on SPARK-39740: Any updates here? > vis-timeline @ 4.2.1 vulnerable to XSS attacks > -- > > Key: SPARK-39740 > URL: https://issues.apache.org/jira/browse/SPARK-39740 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.2.1, 3.3.0 >Reporter: Eugene Shinn (Truveta) >Priority: Major > > Spark UI includes the visjs/vis-timeline package @ 4.2.1, which is vulnerable to > XSS attacks ([Cross-site Scripting in vis-timeline · CVE-2020-28487 · GitHub > Advisory Database|https://github.com/advisories/GHSA-9mrv-456v-pf22]). This > version should be replaced with the next non-vulnerable release, [Release > v7.4.4 · visjs/vis-timeline > (github.com)|https://github.com/visjs/vis-timeline/releases/tag/v7.4.4], or > higher. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38564) Support collecting metrics from streaming sinks
[ https://issues.apache.org/jira/browse/SPARK-38564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629061#comment-17629061 ] Apache Spark commented on SPARK-38564: -- User 'FouadApp' has created a pull request for this issue: https://github.com/apache/spark/pull/38512 > Support collecting metrics from streaming sinks > --- > > Key: SPARK-38564 > URL: https://issues.apache.org/jira/browse/SPARK-38564 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.1 >Reporter: Boyang Jerry Peng >Assignee: Boyang Jerry Peng >Priority: Major > Fix For: 3.4.0 > > > Currently, only streaming sources have the capability to return custom > metrics, not sinks. Allowing streaming sinks to also return custom metrics would be > very useful. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41017) Do not push Filter through reference-only Project
[ https://issues.apache.org/jira/browse/SPARK-41017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41017: Assignee: Wenchen Fan (was: Apache Spark) > Do not push Filter through reference-only Project > - > > Key: SPARK-41017 > URL: https://issues.apache.org/jira/browse/SPARK-41017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41017) Do not push Filter through reference-only Project
[ https://issues.apache.org/jira/browse/SPARK-41017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41017: Assignee: Apache Spark (was: Wenchen Fan) > Do not push Filter through reference-only Project > - > > Key: SPARK-41017 > URL: https://issues.apache.org/jira/browse/SPARK-41017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41017) Do not push Filter through reference-only Project
[ https://issues.apache.org/jira/browse/SPARK-41017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629030#comment-17629030 ] Apache Spark commented on SPARK-41017: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/38511 > Do not push Filter through reference-only Project > - > > Key: SPARK-41017 > URL: https://issues.apache.org/jira/browse/SPARK-41017 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40533) Extend type support for Spark Connect literals
[ https://issues.apache.org/jira/browse/SPARK-40533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40533. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38462 [https://github.com/apache/spark/pull/38462] > Extend type support for Spark Connect literals > -- > > Key: SPARK-40533 > URL: https://issues.apache.org/jira/browse/SPARK-40533 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Fix For: 3.4.0 > > > Extend type support in Literal transformation: Missing support for Instant, > BigDecimal, LocalDate, LocalTimestamp, Duration, Period. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40533) Extend type support for Spark Connect literals
[ https://issues.apache.org/jira/browse/SPARK-40533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40533: Assignee: Martin Grund > Extend type support for Spark Connect literals > -- > > Key: SPARK-40533 > URL: https://issues.apache.org/jira/browse/SPARK-40533 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > > Extend type support in Literal transformation: Missing support for Instant, > BigDecimal, LocalDate, LocalTimestamp, Duration, Period. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41017) Do not push Filter through reference-only Project
Wenchen Fan created SPARK-41017: --- Summary: Do not push Filter through reference-only Project Key: SPARK-41017 URL: https://issues.apache.org/jira/browse/SPARK-41017 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41016) Identical expressions should not be used on both sides of a binary operator
Bjørn Jørgensen created SPARK-41016: --- Summary: Identical expressions should not be used on both sides of a binary operator Key: SPARK-41016 URL: https://issues.apache.org/jira/browse/SPARK-41016 Project: Spark Issue Type: Bug Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Bjørn Jørgensen Identical expressions should not be used on both sides of a binary operator. There are a lot of these, 278 in total, like {code:java} self.check_extension(pser != pser, psser != psser) {code} in the pandas API test code. [Issues - spark-python - Bjørn Jørgensen (sonarcloud.io)|https://sonarcloud.io/project/issues?languages=py=false=python%3AS1764=spark-python] We either need to fix these or remove them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
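One caveat worth noting before bulk-fixing these (an observation, not something stated in the ticket): a self-comparison like `pser != pser` is a common idiom for detecting NaN, since NaN is the only float value that is not equal to itself, so some of the flagged occurrences may be deliberate:

```python
# NaN is the only value for which x != x holds, so a self-comparison
# such as `pser != pser` can be an intentional NaN check rather than a bug.
nan = float("nan")
print(nan != nan)   # True  -> NaN detected
print(1.0 != 1.0)   # False -> ordinary values compare equal to themselves
```

This is why SonarCloud rule S1764 hits may need case-by-case review rather than a blanket removal.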
[jira] [Updated] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala
[ https://issues.apache.org/jira/browse/SPARK-41015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-41015: - Description: To reproduce the issue, set the seed to 38: {code} diff --git a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala index 271c5b0fec..080bf1eb1f 100644 --- a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala +++ b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala @@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite StringType -> ("StringMsg", "")) testingTypes.foreach { dt => -val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1) +val seed = 38 test(s"single $dt with seed $seed") { val (messageName, defaultValue) = catalystTypesToProtoMessages(dt.fields(0).dataType) {code} and run the test: {code} build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite" {code} which fails with NPE: {code} [info] - single StructType(StructField(double_type,DoubleType,true)) with seed 38 *** FAILED *** (10 milliseconds) [info] java.lang.NullPointerException: [info] at org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:134) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) {code} was: To reproduce the issue, set the seed to 38: {code:diff} diff --git a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala index 271c5b0fec..080bf1eb1f 100644 --- 
a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala +++ b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala @@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite StringType -> ("StringMsg", "")) testingTypes.foreach { dt => -val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1) +val seed = 38 test(s"single $dt with seed $seed") { val (messageName, defaultValue) = catalystTypesToProtoMessages(dt.fields(0).dataType) {code} and run the test: {code} build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite" {code} which fails with NPE: {code} [info] - single StructType(StructField(double_type,DoubleType,true)) with seed 38 *** FAILED *** (10 milliseconds) [info] java.lang.NullPointerException: [info] at org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:134) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) {code} > Failure of ProtobufCatalystDataConversionSuite.scala > > > Key: SPARK-41015 > URL: https://issues.apache.org/jira/browse/SPARK-41015 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > To reproduce the issue, set the seed to 38: > {code} > diff --git > a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala > > b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala > index 271c5b0fec..080bf1eb1f 100644 > --- > a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala > +++ > b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala > @@ 
-123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite > StringType -> ("StringMsg", "")) >testingTypes.foreach { dt => > -val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1) > +val seed = 38 > test(s"single $dt with seed $seed") { >val (messageName, defaultValue) = > catalystTypesToProtoMessages(dt.fields(0).dataType) > {code} > and run the test: > {code} > build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite" > {code} > which fails with NPE: > {code} > [info] - single StructType(StructField(double_type,DoubleType,true)) with > seed 38 *** FAILED *** (10 milliseconds) > [info] java.lang.NullPointerException: > [info] at >
[jira] [Created] (SPARK-41015) Failure of ProtobufCatalystDataConversionSuite.scala
Max Gekk created SPARK-41015: Summary: Failure of ProtobufCatalystDataConversionSuite.scala Key: SPARK-41015 URL: https://issues.apache.org/jira/browse/SPARK-41015 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk To reproduce the issue, set the seed to 38: {code:diff} diff --git a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala index 271c5b0fec..080bf1eb1f 100644 --- a/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala +++ b/connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufCatalystDataConversionSuite.scala @@ -123,7 +123,7 @@ class ProtobufCatalystDataConversionSuite StringType -> ("StringMsg", "")) testingTypes.foreach { dt => -val seed = 1 + scala.util.Random.nextInt((1024 - 1) + 1) +val seed = 38 test(s"single $dt with seed $seed") { val (messageName, defaultValue) = catalystTypesToProtoMessages(dt.fields(0).dataType) {code} and run the test: {code} build/sbt "test:testOnly *ProtobufCatalystDataConversionSuite" {code} which fails with NPE: {code} [info] - single StructType(StructField(double_type,DoubleType,true)) with seed 38 *** FAILED *** (10 milliseconds) [info] java.lang.NullPointerException: [info] at org.apache.spark.sql.protobuf.ProtobufCatalystDataConversionSuite.$anonfun$new$2(ProtobufCatalystDataConversionSuite.scala:134) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40954) Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker
[ https://issues.apache.org/jira/browse/SPARK-40954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628904#comment-17628904 ] Anton Ippolitov commented on SPARK-40954: - Thank you for the suggestion [~dcoliversun]! The hyperkit driver is not supported on M1 ([https://github.com/kubernetes/minikube/issues/11885)] however I managed to run Minikube Spark on Kubernetes integration tests with the experimental [qemu2|https://github.com/kubernetes/minikube/pull/13639] driver. I also had to use the experimental [socket_vmnet|https://minikube.sigs.k8s.io/docs/drivers/qemu/#networking] network in order for the `[minikube service|https://github.com/apache/spark/blob/01014aa99fa851411262a6719058dde97319bbb3/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/backend/minikube/Minikube.scala#L111-L113]` call to work: {noformat} minikube start --driver qemu2 --network socket_vmnet{noformat} The tests pass now. I think it would be good to document this as a workaround for running Minikube integration tests on M1? There is also the experimental [podman|https://minikube.sigs.k8s.io/docs/drivers/podman/] Minikube driver which is supported on M1 but I haven't tried it. > Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker > --- > > Key: SPARK-40954 > URL: https://issues.apache.org/jira/browse/SPARK-40954 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 3.3.1 > Environment: MacOS 12.6 (Mac M1) > Minikube 1.27.1 > Docker 20.10.17 >Reporter: Anton Ippolitov >Priority: Minor > Attachments: TestProcess.scala > > > h2. Description > I tried running Kubernetes integration tests with the Minikube backend (+ > Docker driver) from commit c26d99e3f104f6603e0849d82eca03e28f196551 on > Spark's master branch. 
I ran them with the following command: > > {code:java} > mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \ > -Pkubernetes -Pkubernetes-integration-tests \ > -Phadoop-3 \ > -Dspark.kubernetes.test.imageTag=MY_IMAGE_TAG_HERE \ > -Dspark.kubernetes.test.imageRepo=docker.io/kubespark > \ > -Dspark.kubernetes.test.namespace=spark \ > -Dspark.kubernetes.test.serviceAccountName=spark \ > -Dspark.kubernetes.test.deployMode=minikube {code} > However the test suite got stuck literally for hours on my machine. > > h2. Investigation > I ran {{jstack}} on the process that was running the tests and saw that it > was stuck here: > > {noformat} > "ScalaTest-main-running-KubernetesSuite" #1 prio=5 os_prio=31 > tid=0x7f78d580b800 nid=0x2503 runnable [0x000304749000] > java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > - locked <0x00076c0b6f40> (a > java.lang.UNIXProcess$ProcessPipeInputStream) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > - locked <0x00076c0bb410> (a java.io.InputStreamReader) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.readLine(BufferedReader.java:324) > - locked <0x00076c0bb410> (a java.io.InputStreamReader) > at java.io.BufferedReader.readLine(BufferedReader.java:389) > at > scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at > 
org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2(ProcessUtils.scala:45) > at > org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2$adapted(ProcessUtils.scala:45) > at > org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$$$Lambda$322/20156341.apply(Unknown > Source) > at > org.apache.spark.deploy.k8s.integrationtest.Utils$.tryWithResource(Utils.scala:49) > at > org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.executeProcess(ProcessUtils.scala:45) > at >
[jira] [Resolved] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41012. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38508 [https://github.com/apache/spark/pull/38508] > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41012) Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE
[ https://issues.apache.org/jira/browse/SPARK-41012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41012: Assignee: Haejoon Lee > Rename _LEGACY_ERROR_TEMP_1022 to ORDER_BY_POS_OUT_OF_RANGE > --- > > Key: SPARK-41012 > URL: https://issues.apache.org/jira/browse/SPARK-41012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > Rename the _LEGACY_ERROR_TEMP_1022 to proper name. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41014) Improve documentation and typing of applyInPandas for groupby and cogroup
[ https://issues.apache.org/jira/browse/SPARK-41014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41014: Assignee: Apache Spark > Improve documentation and typing of applyInPandas for groupby and cogroup > - > > Key: SPARK-41014 > URL: https://issues.apache.org/jira/browse/SPARK-41014 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Assignee: Apache Spark >Priority: Trivial > > Documentation of method `applyInPandas` for grouby and cogroup does not > mention in the main description that there are two allowed signatures for the > provided function. The Examples state that this is possible, and the > parameters sections states that for cogroup. > That should be made cleaner. > Finally, type information for `PandasCogroupedMapFunction` does not mention > the three-argument callable alternative. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41014) Improve documentation and typing of applyInPandas for groupby and cogroup
[ https://issues.apache.org/jira/browse/SPARK-41014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628853#comment-17628853 ] Apache Spark commented on SPARK-41014: -- User 'EnricoMi' has created a pull request for this issue: https://github.com/apache/spark/pull/38509 > Improve documentation and typing of applyInPandas for groupby and cogroup > - > > Key: SPARK-41014 > URL: https://issues.apache.org/jira/browse/SPARK-41014 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Trivial > > Documentation of method `applyInPandas` for grouby and cogroup does not > mention in the main description that there are two allowed signatures for the > provided function. The Examples state that this is possible, and the > parameters sections states that for cogroup. > That should be made cleaner. > Finally, type information for `PandasCogroupedMapFunction` does not mention > the three-argument callable alternative. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41014) Improve documentation and typing of applyInPandas for groupby and cogroup
[ https://issues.apache.org/jira/browse/SPARK-41014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41014: Assignee: (was: Apache Spark) > Improve documentation and typing of applyInPandas for groupby and cogroup > - > > Key: SPARK-41014 > URL: https://issues.apache.org/jira/browse/SPARK-41014 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Trivial > > Documentation of method `applyInPandas` for grouby and cogroup does not > mention in the main description that there are two allowed signatures for the > provided function. The Examples state that this is possible, and the > parameters sections states that for cogroup. > That should be made cleaner. > Finally, type information for `PandasCogroupedMapFunction` does not mention > the three-argument callable alternative. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41014) Improve documentation and typing of applyInPandas for groupby and cogroup
[ https://issues.apache.org/jira/browse/SPARK-41014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-41014: -- Description: Documentation of method `applyInPandas` for grouby and cogroup does not mention in the main description that there are two allowed signatures for the provided function. The Examples state that this is possible, and the parameters sections states that for cogroup. That should be made cleaner. Finally, type information for `PandasCogroupedMapFunction` does not mention the three-argument callable alternative. was: Documentation of method `applyInPandas` for grouby and cogroup does not mention in the main description that there are two allowed signatures for the provided function. The Examples state this is possible, and the parameters sections states that for cogroup. That should be made cleaner. Finally, type information for `PandasCogroupedMapFunction` does not mention the three-argument callable alternative. > Improve documentation and typing of applyInPandas for groupby and cogroup > - > > Key: SPARK-41014 > URL: https://issues.apache.org/jira/browse/SPARK-41014 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Trivial > > Documentation of method `applyInPandas` for grouby and cogroup does not > mention in the main description that there are two allowed signatures for the > provided function. The Examples state that this is possible, and the > parameters sections states that for cogroup. > That should be made cleaner. > Finally, type information for `PandasCogroupedMapFunction` does not mention > the three-argument callable alternative. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41014) Improve documentation and typing of applyInPandas for groupby and cogroup
Enrico Minack created SPARK-41014: - Summary: Improve documentation and typing of applyInPandas for groupby and cogroup Key: SPARK-41014 URL: https://issues.apache.org/jira/browse/SPARK-41014 Project: Spark Issue Type: Documentation Components: PySpark Affects Versions: 3.4.0 Reporter: Enrico Minack Documentation of method `applyInPandas` for groupby and cogroup does not mention in the main description that there are two allowed signatures for the provided function. The Examples section states that this is possible, and the parameters section states it only for cogroup. That should be made clearer. Finally, type information for `PandasCogroupedMapFunction` does not mention the three-argument callable alternative. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41013) Submitting a job in cluster mode with spark-3.1.2 fails with Could not initialize class com.github.luben.zstd.ZstdOutputStream
yutiantian created SPARK-41013: -- Summary: Submitting a job in cluster mode with spark-3.1.2 fails with Could not initialize class com.github.luben.zstd.ZstdOutputStream Key: SPARK-41013 URL: https://issues.apache.org/jira/browse/SPARK-41013 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Reporter: yutiantian When submitting a job in cluster mode with spark-3.1.2, it fails with Could not initialize class com.github.luben.zstd.ZstdOutputStream. The log is as follows: Exception in thread "map-output-dispatcher-0" Exception in thread "map-output-dispatcher-2" java.lang.ExceptionInInitializerError: Cannot unpack libzstd-jni: No such file or directory at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.createTempFile(File.java:2024) at com.github.luben.zstd.util.Native.load(Native.java:97) at com.github.luben.zstd.util.Native.load(Native.java:55) at com.github.luben.zstd.ZstdOutputStream.<init>(ZstdOutputStream.java:16) at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910) at org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Exception in thread "map-output-dispatcher-7" Exception in thread "map-output-dispatcher-5" java.lang.NoClassDefFoundError: Could not initialize class com.github.luben.zstd.ZstdOutputStream at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223) at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910) at org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Exception in thread "map-output-dispatcher-4" Exception in thread "map-output-dispatcher-3" java.lang.NoClassDefFoundError: Could not initialize class com.github.luben.zstd.ZstdOutputStream at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910) at org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) java.lang.NoClassDefFoundError: Could not initialize class com.github.luben.zstd.ZstdOutputStream at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:223) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:910) at 
org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:233) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:230) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:466) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) However, the same code runs fine when submitted in client mode. As a temporary workaround for cluster mode, setting spark.shuffle.mapStatus.compression.codec lz4 in spark-defaults.conf lets the job run normally. Why does zstd compression fail during shuffle in cluster mode? Any ideas would be greatly appreciated.
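The workaround the reporter describes can be applied either in the config file or at submit time. A sketch (the `--master`/`--deploy-mode` flags and `your-app.jar` are placeholders for illustration, not taken from the report):

```shell
# Workaround from the report: compress MapStatus data with lz4 instead of
# zstd, so the driver never needs to unpack libzstd-jni.
# Option 1: add a line to conf/spark-defaults.conf:
#   spark.shuffle.mapStatus.compression.codec  lz4
# Option 2: pass the same setting at submit time:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.shuffle.mapStatus.compression.codec=lz4 \
  your-app.jar
```

Note this only sidesteps the libzstd-jni extraction failure; it does not explain why the native library cannot be unpacked on the cluster-mode driver host.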
[jira] [Assigned] (SPARK-40749) Migrate type check failures of generators onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40749: Assignee: BingKun Pan > Migrate type check failures of generators onto error classes > > > Key: SPARK-40749 > URL: https://issues.apache.org/jira/browse/SPARK-40749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: BingKun Pan >Priority: Major > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the generator > expressions: > 1. Stack (3): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L163-L170 > 2. ExplodeBase (1): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L299 > 3. Inline (1): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L441
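The migration above swaps a free-form failure string for a structured result that names an error class plus message parameters. A minimal sketch of that idea in Python (the error-class name and message template here are illustrative stand-ins, not Spark's actual error registry):

```python
from dataclasses import dataclass

# Hypothetical registry mapping error classes to message templates,
# modeling Spark's error-classes JSON file.
ERROR_CLASSES = {
    "DATATYPE_MISMATCH.NON_FOLDABLE_INPUT":
        "the input {arg} should be a foldable {type} expression",
}

@dataclass
class DataTypeMismatch:
    # Structured failure: machine-readable class name + parameters,
    # instead of an ad-hoc TypeCheckFailure message string.
    error_class: str
    message_parameters: dict

    def render(self) -> str:
        template = ERROR_CLASSES[self.error_class]
        return template.format(**self.message_parameters)

# What a generator expression's type check might now return:
err = DataTypeMismatch(
    error_class="DATATYPE_MISMATCH.NON_FOLDABLE_INPUT",
    message_parameters={"arg": "n", "type": "INT"},
)
print(err.render())  # the input n should be a foldable INT expression
```

The payoff is uniform formatting and testability: suites can assert on the error class and parameters rather than brittle message substrings.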
[jira] [Resolved] (SPARK-40749) Migrate type check failures of generators onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40749. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38482 [https://github.com/apache/spark/pull/38482] > Migrate type check failures of generators onto error classes > > > Key: SPARK-40749 > URL: https://issues.apache.org/jira/browse/SPARK-40749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: BingKun Pan >Priority: Major > Fix For: 3.4.0 > > > Replace TypeCheckFailure by DataTypeMismatch in type checks in the generator > expressions: > 1. Stack (3): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L163-L170 > 2. ExplodeBase (1): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L299 > 3. Inline (1): > https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala#L441
[jira] [Resolved] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41004. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38494 [https://github.com/apache/spark/pull/38494] > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR
[jira] [Assigned] (SPARK-41004) Check error classes in InterceptorRegistrySuite
[ https://issues.apache.org/jira/browse/SPARK-41004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41004: Assignee: BingKun Pan > Check error classes in InterceptorRegistrySuite > --- > > Key: SPARK-41004 > URL: https://issues.apache.org/jira/browse/SPARK-41004 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > > - CONNECT.INTERCEPTOR_CTOR_MISSING > - CONNECT.INTERCEPTOR_RUNTIME_ERROR
[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
[ https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628700#comment-17628700 ] Yilun Fan commented on SPARK-33349: --- I also met this problem in Spark 3.2.1, kubernetes-client 5.4.1. {code:java} ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.) io.fabric8.kubernetes.client.WatcherException: too old resource version: 63993943 (64057995) at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$TypedWatcherWebSocketListener.onMessage(WatchConnectionManager.java:103){code} I think we have to add some retry in ExecutorPodsWatchSnapshotSource. Especially when we close spark.kubernetes.executor.enableApiPolling, only this watcher can receive executor pod status. Just like what spark has done in the submit client. [https://github.com/apache/spark/pull/29533/files] > ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed > -- > > Key: SPARK-33349 > URL: https://issues.apache.org/jira/browse/SPARK-33349 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1, 3.0.2, 3.1.0 >Reporter: Nicola Bova >Priority: Critical > > I launch my spark application with the > [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator] > with the following yaml file: > {code:yaml} > apiVersion: sparkoperator.k8s.io/v1beta2 > kind: SparkApplication > metadata: > name: spark-kafka-streamer-test > namespace: kafka2hdfs > spec: > type: Scala > mode: cluster > image: /spark:3.0.2-SNAPSHOT-2.12-0.1.0 > imagePullPolicy: Always > timeToLiveSeconds: 259200 > mainClass: path.to.my.class.KafkaStreamer > mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar > sparkVersion: 3.0.1 > restartPolicy: > type: Always > sparkConf: > "spark.kafka.consumer.cache.capacity": "8192" > "spark.kubernetes.memoryOverheadFactor": "0.3" > deps: > jars: > - my > - jar > - list > 
hadoopConfigMap: hdfs-config > driver: > cores: 4 > memory: 12g > labels: > version: 3.0.1 > serviceAccount: default > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > executor: > instances: 4 > cores: 4 > memory: 16g > labels: > version: 3.0.1 > javaOptions: > "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties" > {code} > I have tried with both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the ["Restart > the watcher when we receive a version changed from > k8s"|https://github.com/apache/spark/pull/29533] patch. > This is the driver log: > {code} > 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > ... // my app log, it's a structured streaming app reading from kafka and > writing to hdfs > 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed (this is expected if the application is shutting down.) > io.fabric8.kubernetes.client.KubernetesClientException: too old resource > version: 1574101276 (1574213896) > at > io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) > at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) > at > okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) > at > okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) > at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at java.base/java.lang.Thread.run(Unknown Source) > {code} > The error above appears after 
roughly 50 minutes. > After the exception above, no more logs are produced and the app hangs.