[jira] [Commented] (SPARK-28934) Add `spark.sql.compatiblity.mode`
[ https://issues.apache.org/jira/browse/SPARK-28934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920611#comment-16920611 ] Wenchen Fan commented on SPARK-28934: - yea if we have pgsql mode, it's a good reason to have SPARK-28610 > Add `spark.sql.compatiblity.mode` > - > > Key: SPARK-28934 > URL: https://issues.apache.org/jira/browse/SPARK-28934 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > This issue aims to add `spark.sql.compatiblity.mode` whose values are `spark` > or `pgSQL` case-insensitively to control PostgreSQL compatibility features. > > Apache Spark 3.0.0 can start with `spark.sql.parser.ansi.enabled=false` and > `spark.sql.compatiblity.mode=spark`. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
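The proposal above amounts to a two-valued setting matched case-insensitively. As a toy sketch of such validation (plain Python, not Spark code; the helper name is hypothetical, and the config key's spelling follows the ticket title):

```python
# Toy sketch of the proposed setting (NOT Spark's implementation): a
# two-valued mode whose accepted values are "spark" or "pgSQL",
# matched case-insensitively.
VALID_MODES = {"spark": "spark", "pgsql": "pgSQL"}

def resolve_compatibility_mode(value: str) -> str:
    """Return the canonical mode name, or raise for an unknown value."""
    normalized = value.strip().lower()
    if normalized not in VALID_MODES:
        raise ValueError(
            f"spark.sql.compatiblity.mode must be 'spark' or 'pgSQL', got {value!r}")
    return VALID_MODES[normalized]
```

So `"PGSQL"`, `"pgSQL"`, and `"pgsql"` would all select the PostgreSQL compatibility features, matching the case-insensitive behavior described in the issue.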
[jira] [Created] (SPARK-28946) Add some more information about building SparkR on Windows
Hyukjin Kwon created SPARK-28946: Summary: Add some more information about building SparkR on Windows Key: SPARK-28946 URL: https://issues.apache.org/jira/browse/SPARK-28946 Project: Spark Issue Type: Test Components: Documentation, SparkR Affects Versions: 3.0.0 Reporter: Hyukjin Kwon We should mention: - it needs {{bash}} in {{PATH}} to build - supported JDK versions - building on Windows is not officially supported
[jira] [Updated] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koert kuipers updated SPARK-28945: -- Summary: Allow concurrent writes to different partitions with dynamic partition overwrite (was: Allow concurrent writes to unrelated partitions with dynamic partition overwrite) > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within the same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesn't seem that difficult. I suspect the only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true.
[jira] [Commented] (SPARK-28945) Allow concurrent writes to unrelated partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920602#comment-16920602 ] koert kuipers commented on SPARK-28945: --- See also: https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E > Allow concurrent writes to unrelated partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within the same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesn't seem that difficult. I suspect the only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true.
[jira] [Created] (SPARK-28945) Allow concurrent writes to unrelated partitions with dynamic partition overwrite
koert kuipers created SPARK-28945: - Summary: Allow concurrent writes to unrelated partitions with dynamic partition overwrite Key: SPARK-28945 URL: https://issues.apache.org/jira/browse/SPARK-28945 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: koert kuipers It is desirable to run concurrent jobs that write to different partitions within the same baseDir using partitionBy and dynamic partitionOverwriteMode. See for example here: https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning Or the discussion here: https://github.com/delta-io/delta/issues/9 This doesn't seem that difficult. I suspect the only changes needed are in org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling all committer activity (committer.setupJob, committer.commitJob, etc.) when dynamicPartitionOverwrite is true.
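The commit behavior under discussion can be illustrated with a toy sketch (plain Python, not Spark's HadoopMapReduceCommitProtocol; all names here are hypothetical). Each job stages its output separately and, on commit, replaces only the partitions it touched, so jobs writing disjoint partitions under the same base directory do not interfere:

```python
# Toy sketch of dynamic partition overwrite (NOT Spark code). Each job
# writes into its own staging directory and, on commit, replaces only the
# partition directories it produced under the shared base directory.
import os
import shutil
import tempfile

def commit_dynamic_overwrite(base_dir: str, staging_dir: str) -> None:
    """Move each staged partition into base_dir, overwriting only those."""
    for partition in os.listdir(staging_dir):
        dest = os.path.join(base_dir, partition)
        if os.path.isdir(dest):
            shutil.rmtree(dest)  # overwrite only the partitions this job wrote
        shutil.move(os.path.join(staging_dir, partition), dest)
    os.rmdir(staging_dir)

def run_job(base_dir: str, partition: str, filename: str, data: str) -> None:
    """Simulate one job: stage one file under one partition, then commit."""
    staging = tempfile.mkdtemp()  # per-job staging area, so jobs never collide
    os.makedirs(os.path.join(staging, partition))
    with open(os.path.join(staging, partition, filename), "w") as f:
        f.write(data)
    commit_dynamic_overwrite(base_dir, staging)

base = tempfile.mkdtemp()
run_job(base, "date=2019-09-01", "part-0", "a")   # job 1
run_job(base, "date=2019-09-02", "part-0", "b")   # job 2: different partition
run_job(base, "date=2019-09-01", "part-0", "a2")  # job 3: overwrites only 09-01
```

Because each commit only touches the partitions present in its own staging area, two jobs writing different `date=` partitions can commit in either order with the same result; real Spark would additionally need job-level setup/abort handling, which this sketch omits.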
[jira] [Comment Edited] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920601#comment-16920601 ] Qiang Wang edited comment on SPARK-28927 at 9/2/19 3:56 AM: I checked the commit list since May 31, 2017 (all commits before that date are included in 2.2.1) and found no related commit for this problem. !image-2019-09-02-11-55-33-596.png|width=457,height=325! was (Author: jerryhouse): I checked the commit list since May 31, 2017 (all commits before that date are included in 2.2.1) and found no related commit for this problem. !image-2019-09-02-11-55-33-596.png! > ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets > with 12 billion instances > --- > > Key: SPARK-28927 > URL: https://issues.apache.org/jira/browse/SPARK-28927 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Priority: Major > Attachments: image-2019-09-02-11-55-33-596.png > > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) 
at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {quote} > This exception happened sometimes. And we also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, > which was not stable for production
[jira] [Updated] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Wang updated SPARK-28927: --- Attachment: image-2019-09-02-11-55-33-596.png > ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets > with 12 billion instances > --- > > Key: SPARK-28927 > URL: https://issues.apache.org/jira/browse/SPARK-28927 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Priority: Major > Attachments: image-2019-09-02-11-55-33-596.png > > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {quote} > This exception happened sometimes. And we also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, > which was not stable for a production environment. > Dataset capacity: ~12 billion ratings > Here is our code: > val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, > y._2.toFloat))) > .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class > ALSData(user:Int, item:Int, rating:Float) extends Serializable > val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF() > val als = new ALS > val paramMap = ParamMap(als.alpha -> 25000). >
[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920601#comment-16920601 ] Qiang Wang commented on SPARK-28927: I checked the commit list since May 31, 2017 (all commits before that date are included in 2.2.1) and found no related commit for this problem. !image-2019-09-02-11-55-33-596.png! > ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets > with 12 billion instances > --- > > Key: SPARK-28927 > URL: https://issues.apache.org/jira/browse/SPARK-28927 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Priority: Major > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at 
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {quote} > This exception happened sometimes. And we also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, > which was not stable for a production environment. > Dataset capacity: ~12 billion ratings > Here is our code: > val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, > y._2.toFloat))) > .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class > ALSData(user:Int, item:Int, rating:Float) extends Serializable > val ratingData =
[jira] [Updated] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2
[ https://issues.apache.org/jira/browse/SPARK-28612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28612: - Fix Version/s: (was: 3.0.0) > DataSourceV2: Add new DataFrameWriter API for v2 > > > Key: SPARK-28612 > URL: https://issues.apache.org/jira/browse/SPARK-28612 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > > This tracks adding an API like the one proposed in SPARK-23521: > {code:lang=scala} > df.writeTo("catalog.db.table").append() // AppendData > df.writeTo("catalog.db.table").overwriteDynamic() // > OverwritePartitionsDynamic > df.writeTo("catalog.db.table").overwrite($"date" === '2019-01-01') // > OverwriteByExpression > df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS > df.writeTo("catalog.db.table").replace() // RTAS > {code}
[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page
[ https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920599#comment-16920599 ] zhengruifeng commented on SPARK-28373: -- [~smilegator] [~yumwang] I am afraid I have no time to do it this week. [~planga82] Could you please take it over? > Document JDBC/ODBC Server page > -- > > Key: SPARK-28373 > URL: https://issues.apache.org/jira/browse/SPARK-28373 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503! > > [https://github.com/apache/spark/pull/25062] added new columns CLOSE TIME > and EXECUTION TIME. It is hard to understand the difference. We need to > document them; otherwise, it is hard for end users to understand them. >
[jira] [Reopened] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2
[ https://issues.apache.org/jira/browse/SPARK-28612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-28612: -- Assignee: (was: Ryan Blue) Reverted at https://github.com/apache/spark/commit/bd3915e356b69897d580d9f655a97f781e8f1c83 > DataSourceV2: Add new DataFrameWriter API for v2 > > > Key: SPARK-28612 > URL: https://issues.apache.org/jira/browse/SPARK-28612 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > This tracks adding an API like the one proposed in SPARK-23521: > {code:lang=scala} > df.writeTo("catalog.db.table").append() // AppendData > df.writeTo("catalog.db.table").overwriteDynamic() // > OverwritePartitionsDynamic > df.writeTo("catalog.db.table").overwrite($"date" === '2019-01-01') // > OverwriteByExpression > df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS > df.writeTo("catalog.db.table").replace() // RTAS > {code}
[jira] [Updated] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors
[ https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-28933: -- Priority: Minor (was: Major) > Reduce unnecessary shuffle in ALS when initializing factors > --- > > Key: SPARK-28933 > URL: https://issues.apache.org/jira/browse/SPARK-28933 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 3.0.0 > > > When initializing factors in ALS, we should use {{mapPartitions}} instead of > the current {{map}}, so we can preserve the existing partitioning of the RDD of > {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id. > We don't change the partitioning when initializing factors.
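The partitioner-preservation point behind this issue can be modeled in miniature (plain Python, not Spark code; the ToyRDD class is hypothetical). An element-wise map may change keys, so it must drop the partitioner; a partition-wise map that declares it preserves partitioning keeps it, which is what lets a downstream co-partitioned operation avoid a shuffle:

```python
# Toy model of the optimization (NOT Spark code). A "ToyRDD" is just a
# partitioner tag plus a list of partitions of (key, value) pairs.
class ToyRDD:
    def __init__(self, partitions, partitioner=None):
        self.partitions = partitions    # list of partitions, each a list of (key, value)
        self.partitioner = partitioner  # e.g. "hash-by-src-block", or None

    def map(self, f):
        # Element-wise transform: keys may change, so the partitioner is discarded.
        return ToyRDD([[f(kv) for kv in part] for part in self.partitions])

    def map_partitions(self, f, preserves_partitioning=False):
        # Partition-wise transform: keep the partitioner only if the caller
        # promises the keys stay in place.
        new_parts = [list(f(iter(part))) for part in self.partitions]
        keep = self.partitioner if preserves_partitioning else None
        return ToyRDD(new_parts, keep)

in_blocks = ToyRDD([[(0, "in-block-0")], [(1, "in-block-1")]],
                   partitioner="hash-by-src-block")

def init_factors(it):
    # Attach per-key factors without touching the keys.
    return ((k, v + ":factors") for k, v in it)

via_map = in_blocks.map(lambda kv: (kv[0], kv[1] + ":factors"))
via_map_partitions = in_blocks.map_partitions(init_factors, preserves_partitioning=True)
```

In the `via_map` result the partitioner is gone, so a later operation keyed by src block id would have to re-shuffle; `via_map_partitions` keeps it, mirroring the shuffle this issue removes.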
[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920596#comment-16920596 ] Qiang Wang commented on SPARK-28927: I only tested it on version 2.2.1, which is compatible with our Spark version. Is it OK to use MLlib from the master branch while keeping Spark at version 2.2.1? > ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets > with 12 billion instances > --- > > Key: SPARK-28927 > URL: https://issues.apache.org/jira/browse/SPARK-28927 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Priority: Major > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at 
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {quote} > This exception happened sometimes. And we also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, > which was not stable for a production environment. > Dataset capacity: ~12 billion ratings > Here is our code: > val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, > y._2.toFloat))) > .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class > ALSData(user:Int, item:Int, rating:Float) extends Serializable > val ratingData = trainData.map(x => ALSData(x._1, x._2,
[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info
[ https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920594#comment-16920594 ] Kazuaki Ishizaki commented on SPARK-28906: -- Regarding the {{git}} command: the {{.git}} directory is deleted after {{git clone}} is executed. When I tentatively stopped deleting the {{.git}} directory, {{spark-version-info.properties}} could include the correct information, like: {code} version=2.3.4 user=ishizaki revision=8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210 branch=HEAD date=2019-09-02T02:31:25Z url=https://gitbox.apache.org/repos/asf/spark.git {code} > `bin/spark-submit --version` shows incorrect info > - > > Key: SPARK-28906 > URL: https://issues.apache.org/jira/browse/SPARK-28906 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, > 3.0.0, 2.4.3 >Reporter: Marcelo Vanzin >Priority: Minor > Attachments: image-2019-08-29-05-50-13-526.png > > > Since Spark 2.3.1, `spark-submit` shows wrong information. > {code} > $ bin/spark-submit --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.3 > /_/ > Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222 > Branch > Compiled by user on 2019-02-04T13:00:46Z > Revision > Url > Type --help for more information. > {code}
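The properties payload shown in the comment is a plain key=value file. As a toy sketch (plain Python; the parsing helper is hypothetical, not Spark code), the fields `spark-submit --version` prints can be recovered like this:

```python
# Toy sketch (NOT Spark code): parse the key=value lines of a
# spark-version-info.properties payload like the one shown above.
def parse_version_info(text: str) -> dict:
    """Parse simple key=value properties lines into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")  # split at the first '='
            info[key.strip()] = value.strip()
    return info

sample = """\
version=2.3.4
user=ishizaki
revision=8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210
branch=HEAD
date=2019-09-02T02:31:25Z
url=https://gitbox.apache.org/repos/asf/spark.git
"""
info = parse_version_info(sample)
```

With an empty `revision`, `branch`, and `url` (the situation in the bug report), the corresponding fields in the banner come out blank, which matches the output quoted in the issue.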
[jira] [Commented] (SPARK-28770) Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression failed
[ https://issues.apache.org/jira/browse/SPARK-28770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920587#comment-16920587 ] huangtianhua commented on SPARK-28770: -- [~wypoon], thank you for looking into this; so you suggest deleting the comparison of SparkListenerStageExecutorMetrics events for the two failed tests? > Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression > failed > --- > > Key: SPARK-28770 > URL: https://issues.apache.org/jira/browse/SPARK-28770 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.4.3 > Environment: Community jenkins and our arm testing instance. >Reporter: huangtianhua >Priority: Major > > Test > org.apache.spark.scheduler.ReplayListenerSuite.End-to-end replay with > compression failed; see > [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/267/testReport/junit/org.apache.spark.scheduler/ReplayListenerSuite/End_to_end_replay_with_compression/] > > The test also fails on our ARM instance; I sent an email to spark-dev > before, and we suspect there is something related to the commit > [https://github.com/apache/spark/pull/23767]. We tried reverting it and the > tests passed: > ReplayListenerSuite: > - ... > - End-to-end replay *** FAILED *** > "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622) > - End-to-end replay with compression *** FAILED *** > "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622) > > Not sure what's wrong; hope someone can help to figure it out, thanks very > much.
[jira] [Updated] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors
[ https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-28933: Fix Version/s: 3.0.0 > Reduce unnecessary shuffle in ALS when initializing factors > --- > > Key: SPARK-28933 > URL: https://issues.apache.org/jira/browse/SPARK-28933 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > When initializing factors in ALS, we should use {{mapPartitions}} instead of > the current {{map}}, so we can preserve the existing partitioning of the > {{InBlock}} RDD. The {{InBlock}} RDD is already partitioned by src block id; > we don't change the partitioning when initializing factors. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
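In RDD terms the fix is small; a minimal sketch of the idea (names like `inBlocks` and `randomFactor` are illustrative, not the actual ALS internals):

```scala
// Sketch only: `map` on a pair RDD drops the partitioner, forcing a shuffle
// when the factors are later joined with the in-blocks. mapPartitions with
// preservesPartitioning = true keeps the existing src-block partitioning.
val initialFactors = inBlocks.mapPartitions({ iter =>
  iter.map { case (srcBlockId, inBlock) =>
    (srcBlockId, randomFactor(rank, inBlock))
  }
}, preservesPartitioning = true)
```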
[jira] [Commented] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors
[ https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920584#comment-16920584 ] Liang-Chi Hsieh commented on SPARK-28933: - This issue was resolved by [https://github.com/apache/spark/pull/25639]. > Reduce unnecessary shuffle in ALS when initializing factors > --- > > Key: SPARK-28933 > URL: https://issues.apache.org/jira/browse/SPARK-28933 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > > When Initializing factors in ALS, we should use {{mapPartitions}} instead of > current {{map}}, so we can preserve existing partition of the RDD of > {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id. > We don't change the partition when initializing factors. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors
[ https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-28933. - Resolution: Resolved > Reduce unnecessary shuffle in ALS when initializing factors > --- > > Key: SPARK-28933 > URL: https://issues.apache.org/jira/browse/SPARK-28933 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > > When Initializing factors in ALS, we should use {{mapPartitions}} instead of > current {{map}}, so we can preserve existing partition of the RDD of > {{InBlock}}. The RDD of {{InBlock}} is already partitioned by src block id. > We don't change the partition when initializing factors. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28923) Deduplicate the codes 'multipartIdentifier' and 'identifierSeq'
[ https://issues.apache.org/jira/browse/SPARK-28923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin resolved SPARK-28923. - Resolution: Invalid > Deduplicate the codes 'multipartIdentifier' and 'identifierSeq' > --- > > Key: SPARK-28923 > URL: https://issues.apache.org/jira/browse/SPARK-28923 > Project: Spark > Issue Type: Request > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Minor > > In {{sqlbase.g4}}, {{multipartIdentifier}} and {{identifierSeq}} have the > same functionality. We'd better deduplicate them. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-28916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920577#comment-16920577 ] MOBIN commented on SPARK-28916: --- [~mgaido] thanks, the spark.sql.subexpressionElimination.enabled parameter solved my problem > Generated SpecificSafeProjection.apply method grows beyond 64 KB when use > SparkSQL > --- > > Key: SPARK-28916 > URL: https://issues.apache.org/jira/browse/SPARK-28916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.3 >Reporter: MOBIN >Priority: Major > > Can be reproduced by the following steps: > 1. Create a table with 5000 fields > 2. val data=spark.sql("select * from spark64kb limit 10"); > 3. data.describe() > Then the following error occurred: > {code:java} > WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, > executor 1): org.codehaus.janino.InternalCompilerException: failed to > compile: org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method > "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection" > grows beyond 64 KB > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) > at 
org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) > at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44) > at > org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199) > at > org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86) > at > org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method >
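For readers hitting the same 64 KB limit, the workaround from the comment above can be tried per session. A hedged sketch — it assumes the config is runtime-settable on your Spark version, and the value `false` is an assumption, since the comment does not state which setting was used:

```scala
// Toggle common-subexpression elimination in codegen before running the wide
// query; on very wide tables (5000 columns here) this changes how much code
// is generated per method.
spark.conf.set("spark.sql.subexpressionElimination.enabled", "false")
val data = spark.sql("select * from spark64kb limit 10")
data.describe().show()
```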
[jira] [Created] (SPARK-28944) Expose peak memory of executor in metrics for parameter tuning
deshanxiao created SPARK-28944: -- Summary: Expose peak memory of executor in metrics for parameter tuning Key: SPARK-28944 URL: https://issues.apache.org/jira/browse/SPARK-28944 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: deshanxiao Maybe we can collect the peak executor memory in the heartbeat, to help tune parameters such as spark.executor.memory. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27336) Incorrect DataSet.summary() result
[ https://issues.apache.org/jira/browse/SPARK-27336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920557#comment-16920557 ] daile commented on SPARK-27336: --- I will check this issue > Incorrect DataSet.summary() result > -- > > Key: SPARK-27336 > URL: https://issues.apache.org/jira/browse/SPARK-27336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > Attachments: test.csv > > > There is a single data point in the minimum_nights column that is 1.0E8 out > of 8k records, but .summary() says it is the 75% and the max. > I compared this with approxQuantile, and approxQuantile for 75% gave the > correct value of 30.0. > To reproduce: > {code:java} > scala> val df = > spark.read.format("csv").load("test.csv").withColumn("minimum_nights", > '_c0.cast("Int")) > df: org.apache.spark.sql.DataFrame = [_c0: string, minimum_nights: int] > scala> df.select("minimum_nights").summary().show() > +---+--+ > |summary|minimum_nights| > +---+--+ > | count| 7072| > | mean| 14156.35407239819| > | stddev|1189128.5444975856| > |min| 1| > |25%| 2| > |50%| 4| > |75%| 1| > |max| 1| > +---+--+ > scala> df.stat.approxQuantile("minimum_nights", Array(0.75), 0.1) > res1: Array[Double] = Array(30.0) > scala> df.stat.approxQuantile("minimum_nights", Array(0.75), 0.001) > res2: Array[Double] = Array(30.0) > scala> df.stat.approxQuantile("minimum_nights", Array(0.75), 0.0001) > res3: Array[Double] = Array(1.0E8) > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28935) Document SQL metrics for Details for Query Plan
[ https://issues.apache.org/jira/browse/SPARK-28935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920550#comment-16920550 ] Liang-Chi Hsieh commented on SPARK-28935: - Thanks! [~smilegator] It should be helpful. > Document SQL metrics for Details for Query Plan > --- > > Key: SPARK-28935 > URL: https://issues.apache.org/jira/browse/SPARK-28935 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > [https://github.com/apache/spark/pull/25349] shows the query plans but it > does not describe the meaning of each metric in the plan. For end users, they > might not understand the meaning of the metrics we output. > > !https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png! -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28790) Document CACHE TABLE statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-28790. - Fix Version/s: 3.0.0 Assignee: Huaxin Gao Resolution: Fixed > Document CACHE TABLE statement in SQL Reference. > - > > Key: SPARK-28790 > URL: https://issues.apache.org/jira/browse/SPARK-28790 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920534#comment-16920534 ] Jungtaek Lim commented on SPARK-28594: -- Thanks [~felixcheung] for reviewing and volunteering to shepherd this work! Could you also jump in on [https://github.com/apache/spark/pull/25577], which is coupled with this issue? > Allow event logs for running streaming apps to be rolled over. > -- > > Key: SPARK-28594 > URL: https://issues.apache.org/jira/browse/SPARK-28594 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: This has been reported on 2.0.2.22 but affects all > currently available versions. >Reporter: Stephen Levett >Priority: Major > > In all current Spark releases, when event logging is enabled for Spark > Streaming apps, the event logs grow massively. The files continue to grow until the > application is stopped or killed. > The Spark history server then has difficulty processing the files. > https://issues.apache.org/jira/browse/SPARK-8617 > addresses .inprogress files but not event log files of applications that are still running. > Can we identify a mechanism to set a "max file" size so that the file is rolled over > when it reaches this size? > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page
[ https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920515#comment-16920515 ] Xiao Li commented on SPARK-28373: - [~podongfeng] This is the last one. Could you help finish this? > Document JDBC/ODBC Server page > -- > > Key: SPARK-28373 > URL: https://issues.apache.org/jira/browse/SPARK-28373 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Web UI >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503! > > [https://github.com/apache/spark/pull/25062] added a new column CLOSE TIME > and EXECUTION TIME. It is hard to understand the difference. We need to > document them; otherwise, it is hard for end users to understand them > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28935) Document SQL metrics for Details for Query Plan
[ https://issues.apache.org/jira/browse/SPARK-28935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920507#comment-16920507 ] Xiao Li commented on SPARK-28935: - [https://docs.google.com/spreadsheets/d/11PV6SfkIQ8W_i98tNsMEIQ0QFiTcuQi-2roZDHvX0EM/edit?usp=sharing] Above is the summary draft I did today. Hopefully, it helps you. Thanks! [~viirya] > Document SQL metrics for Details for Query Plan > --- > > Key: SPARK-28935 > URL: https://issues.apache.org/jira/browse/SPARK-28935 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > [https://github.com/apache/spark/pull/25349] shows the query plans but it > does not describe the meaning of each metric in the plan. For end users, they > might not understand the meaning of the metrics we output. > > !https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png! -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28943) NoSuchMethodError: shaded.parquet.org.apache.thrift.EncodingUtils.setBit(BIZ)B
Michael Heuer created SPARK-28943: - Summary: NoSuchMethodError: shaded.parquet.org.apache.thrift.EncodingUtils.setBit(BIZ)B Key: SPARK-28943 URL: https://issues.apache.org/jira/browse/SPARK-28943 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Michael Heuer Since adapting our build for Spark 2.4.x, we are unable to run on Spark 2.2.0 provided by CDH. For more details, please see linked issue https://github.com/bigdatagenomics/adam/issues/2157 -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances
[ https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920480#comment-16920480 ] Liang-Chi Hsieh commented on SPARK-28927: - Does this only happen on 2.2.1? How about current master branch? > ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets > with 12 billion instances > --- > > Key: SPARK-28927 > URL: https://issues.apache.org/jira/browse/SPARK-28927 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Qiang Wang >Priority: Major > > The stack trace is below: > {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 > BlockManager: Block rdd_10916_493 could not be removed as it was not found on > disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for > task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) > java.lang.ArrayIndexOutOfBoundsException: 6741 at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460) > at > org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at > org.apache.spark.scheduler.Task.run(Task.scala:108) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {quote} > This exception happens sometimes. We also found that the AUC metric was > not stable when evaluating the inner product of the user factors and the item > factors with the same dataset and configuration: AUC varied from 0.60 to 0.67, > which is not stable enough for a production environment. > Dataset capacity: ~12 billion ratings > Here is our code: > val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, > y._2.toFloat))) > .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER) > case class > ALSData(user:Int, item:Int, rating:Float) extends Serializable > val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF() > val als = new ALS > val paramMap = ParamMap(als.alpha -> 25000). >
[jira] [Commented] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling
[ https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920479#comment-16920479 ] Felix Cheung commented on SPARK-27495: -- +1 on this. I've reviewed this. A few questions/comments: # in the description above there is a passage on "Spark internal use by catalyst" - looking at the rest of the material, google doc etc, is this out of scope? if so we should clarify. # "different resources in multiple RDDs that get combined into a single stage" - this merge can be complicated, and I'm not sure taking the max etc is going to be right at all time. At the least it will be very confusing to the user on how much resource is used etc. Instead of a heuristic, the max etc, how about in the event of mismatch involving multiple RDDs, we detect and fail (fail fast) and ask the user to do a "repartition" operation before that stage? # in later comment, "resource requirement as a hint" - I am actually unsure about that. in many ML or DL/tensorflow use cases where MPI or allreduce are involved, the strict number of GPU, process, machine are required or else they fail to start. I am in favor of a strict mode for that purpose. > SPIP: Support Stage level resource configuration and scheduling > --- > > Key: SPARK-27495 > URL: https://issues.apache.org/jira/browse/SPARK-27495 > Project: Spark > Issue Type: Epic > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > Objectives: > # Allow users to specify task and executor resource requirements at the > stage level. > # Spark will use the stage level requirements to acquire the necessary > resources/executors and schedule tasks based on the per stage requirements. 
> Many times users have different resource requirements for different stages of > their application so they want to be able to configure resources at the stage > level. For instance, you have a single job that has 2 stages. The first stage > does some ETL which requires a lot of tasks, each with a small amount of > memory and 1 core each. Then you have a second stage where you feed that ETL > data into an ML algorithm. The second stage only requires a few executors but > each executor needs a lot of memory, GPUs, and many cores. This feature > allows the user to specify the task and executor resource requirements for > the ETL Stage and then change them for the ML stage of the job. > Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and > extra Resources (GPU/FPGA/etc). It has the potential to allow for other > things like limiting the number of tasks per stage, specifying other > parameters for things like shuffle, etc. Initially I would propose we only > support resources as they are now. So Task resources would be cpu and other > resources (GPU, FPGA), that way we aren't adding in extra scheduling things > at this point. Executor resources would be cpu, memory, and extra > resources(GPU,FPGA, etc). Changing the executor resources will rely on > dynamic allocation being enabled. > Main use cases: > # ML use case where user does ETL and feeds it into an ML algorithm where > it’s using the RDD API. This should work with barrier scheduling as well once > it supports dynamic allocation. > # Spark internal use by catalyst. Catalyst could control the stage level > resources as it finds the need to change it between stages for different > optimizations. For instance, with the new columnar plugin to the query > planner we can insert stages into the plan that would change running > something on the CPU in row format to running it on the GPU in columnar > format. 
This API would allow the planner to make sure the stages that run on > the GPU get the corresponding GPU resources it needs to run. Another possible > use case for catalyst is that it would allow catalyst to add in more > optimizations to where the user doesn’t need to configure container sizes at > all. If the optimizer/planner can handle that for the user, everyone wins. > This SPIP focuses on the RDD API but we don’t exclude the Dataset API. I > think the DataSet API will require more changes because it specifically hides > the RDD from the users via the plans and catalyst can optimize the plan and > insert things into the plan. The only way I’ve found to make this work with > the Dataset API would be modifying all the plans to be able to get the > resource requirements down into where it creates the RDDs, which I believe > would be a lot of change. If other people know better options, it would be > great to hear them. > *Q2.* What problem is this
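To make the ETL-then-ML example above concrete, here is a purely hypothetical sketch of what stage-level requirements could look like from the RDD API; none of these names (`ResourceProfileBuilder`, `withResources`, `ExecutorResourceRequests`, `TaskResourceRequests`) are committed by the SPIP:

```scala
// Hypothetical API sketch for the two-stage job described above.
// Stage 1 (ETL): many small tasks, 1 core each, little memory.
val etlProfile = new ResourceProfileBuilder()
  .require(new ExecutorResourceRequests().cores(2).memory("2g"))
  .require(new TaskResourceRequests().cpus(1))
  .build()

// Stage 2 (ML): few executors, each with many cores, lots of memory, and GPUs.
val mlProfile = new ResourceProfileBuilder()
  .require(new ExecutorResourceRequests().cores(16).memory("24g").resource("gpu", 4))
  .require(new TaskResourceRequests().cpus(4).resource("gpu", 1))
  .build()

val cleaned = rawData.withResources(etlProfile).map(parseRecord)
val model   = cleaned.withResources(mlProfile).mapPartitions(trainPartition)
```

Under dynamic allocation, the scheduler would acquire executors matching each profile before running the corresponding stage.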
[jira] [Commented] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python
[ https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920478#comment-16920478 ] Junichi Koizumi commented on SPARK-28902: --- Could you tell us a little more about the workaround? It works fine on my version. pyspark: >>> from pyspark.ml import Pipeline >>> from pyspark.ml.feature import Tokenizer >>> t = Tokenizer() >>> p = Pipeline().setStages([t]) >>> d = spark.createDataFrame([["Apache spark logistic regression "]]) >>> pm = p.fit(d) >>> np = Pipeline().setStages([pm]) >>> npm = np.fit(d) >>> npm.write().save('./npm_test') scala side: scala> import org.apache.spark.ml.PipelineModel import org.apache.spark.ml.PipelineModel scala> val pp = PipelineModel.load("./npm_test") pp: org.apache.spark.ml.PipelineModel = PipelineModel_4d879f6b2b02c8d3d467 > Spark ML Pipeline with nested Pipelines fails to load when saved from Python > > > Key: SPARK-28902 > URL: https://issues.apache.org/jira/browse/SPARK-28902 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 >Reporter: Saif Addin >Priority: Minor > > Hi, this error is affecting a bunch of our nested use cases. > Saving a *PipelineModel* with one of its stages being another > *PipelineModel* fails when loading it from Scala if it was saved from Python. 
> *Python side:* > > {code:java} > from pyspark.ml import Pipeline > from pyspark.ml.feature import Tokenizer > t = Tokenizer() > p = Pipeline().setStages([t]) > d = spark.createDataFrame([["Hello Peter Parker"]]) > pm = p.fit(d) > np = Pipeline().setStages([pm]) > npm = np.fit(d) > npm.write().save('./npm_test') > {code} > > > *Scala side:* > > {code:java} > scala> import org.apache.spark.ml.PipelineModel > scala> val pp = PipelineModel.load("./npm_test") > java.lang.IllegalArgumentException: requirement failed: Error loading > metadata: Expected class name org.apache.spark.ml.PipelineModel but found > class name pyspark.ml.pipeline.PipelineModel > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638) > at > org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616) > at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342) > at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380) > at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332) > ... 50 elided > {code} > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920476#comment-16920476 ] Felix Cheung commented on SPARK-28594: -- Reviewed. looks reasonable to me. I can help shepherd this work. ping [~srowen] [~vanzin] [~irashid] for feedback. > Allow event logs for running streaming apps to be rolled over. > -- > > Key: SPARK-28594 > URL: https://issues.apache.org/jira/browse/SPARK-28594 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: This has been reported on 2.0.2.22 but affects all > currently available versions. >Reporter: Stephen Levett >Priority: Major > > At all current Spark releases when event logging on spark streaming is > enabled the event logs grow massively. The files continue to grow until the > application is stopped or killed. > The Spark history server then has difficulty processing the files. > https://issues.apache.org/jira/browse/SPARK-8617 > Addresses .inprogress files but not event log files that are still running. > Identify a mechanism to set a "max file" size so that the file is rolled over > when it reaches this size? > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-28594: - Shepherd: Felix Cheung > Allow event logs for running streaming apps to be rolled over. > -- > > Key: SPARK-28594 > URL: https://issues.apache.org/jira/browse/SPARK-28594 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 > Environment: This has been reported on 2.0.2.22 but affects all > currently available versions. >Reporter: Stephen Levett >Priority: Major > > At all current Spark releases when event logging on spark streaming is > enabled the event logs grow massively. The files continue to grow until the > application is stopped or killed. > The Spark history server then has difficulty processing the files. > https://issues.apache.org/jira/browse/SPARK-8617 > Addresses .inprogress files but not event log files that are still running. > Identify a mechanism to set a "max file" size so that the file is rolled over > when it reaches this size? > > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28942) Spark in local mode hostname display localhost in the Host Column of Task Summary Page
ABHISHEK KUMAR GUPTA created SPARK-28942: Summary: Spark in local mode hostname display localhost in the Host Column of Task Summary Page Key: SPARK-28942 URL: https://issues.apache.org/jira/browse/SPARK-28942 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.0 Reporter: ABHISHEK KUMAR GUPTA In the stage page, the Host column of the Task Summary shows 'localhost' instead of the host IP or host name given as the Driver Host Name. Steps: spark-shell --master local create table emp(id int); insert into emp values(100); select * from emp; Go to the Stage UI page and check the Task Summary page. The Host column will display 'localhost' instead of the driver host. Note that with spark-shell --master yarn, the UI displays the correct host name in that column. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28942) Spark in local mode hostname display localhost in the Host Column of Task Summary Page
[ https://issues.apache.org/jira/browse/SPARK-28942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920427#comment-16920427 ] Shivu Sondur commented on SPARK-28942: -- i will work on this issue > Spark in local mode hostname display localhost in the Host Column of Task > Summary Page > -- > > Key: SPARK-28942 > URL: https://issues.apache.org/jira/browse/SPARK-28942 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > In the stage page under Task Summary Page Host Column shows 'localhost' > instead of showing host IP or host name mentioned against the Driver Host Name > Steps: > spark-shell --master local > create table emp(id int); > insert into emp values(100); > select * from emp; > Go to Stage UI page and check the Task Summary Page. > Host column will display 'localhost' instead the driver host. > > Note in case of spark-shell --master yarn mode UI display correct host name > under the column. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28941) Spark Sql Jobs
Brahmendra created SPARK-28941: -- Summary: Spark Sql Jobs Key: SPARK-28941 URL: https://issues.apache.org/jira/browse/SPARK-28941 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Brahmendra Fix For: 2.4.3 Hi Team, I need one favor on Spark SQL jobs. I have 200+ Spark SQL queries running against 7 different Hive tables. How can we do this in one jar file that executes all 200+ Spark SQL jobs? Currently we are managing a separate jar file for each table. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
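A common way to consolidate many queries into a single artifact is to externalize the SQL into script files and have one driver loop over the statements. The sketch below is hypothetical (the helper names are our own, and the statement splitting is deliberately naive); `run_all` assumes an existing `SparkSession` and is shown in Python for brevity.

```python
def load_statements(sql_text):
    """Split a .sql script into individual statements (hypothetical helper).

    This is a naive split on ';' -- it does not handle semicolons inside
    string literals or comments, which a production driver would need to.
    """
    return [s.strip() for s in sql_text.split(";") if s.strip()]

def run_all(spark, sql_text):
    """Run every statement through one SparkSession: one artifact, many queries."""
    for stmt in load_statements(sql_text):
        spark.sql(stmt)
```

With this shape, the 200+ queries for all 7 tables become data (script files) rather than code, so only the single driver needs to be packaged and deployed.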
[jira] [Resolved] (SPARK-28855) Remove outdated Experimental, Evolving annotations
[ https://issues.apache.org/jira/browse/SPARK-28855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-28855. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25558 [https://github.com/apache/spark/pull/25558] > Remove outdated Experimental, Evolving annotations > -- > > Key: SPARK-28855 > URL: https://issues.apache.org/jira/browse/SPARK-28855 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core, SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > The Experimental and Evolving annotations are both (like Unstable) used to > express that an API may change. However there are many things in the code > that have been marked that way since even Spark 1.x. Per the dev@ thread, > anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that > it would not change without a deprecation cycle. > Therefore I'd like to remove most of these annotations, leaving them for > things that are obviously inherently experimental (ExperimentalMethods), or > recently added and still legitimately experimental (DSv2, Barrier mode). -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28925) Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and 1.14
[ https://issues.apache.org/jira/browse/SPARK-28925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved SPARK-28925. Resolution: Duplicate > Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and > 1.14 > > > Key: SPARK-28925 > URL: https://issues.apache.org/jira/browse/SPARK-28925 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.3 >Reporter: Eric >Priority: Minor > > Hello, > If you use Spark with Kubernetes 1.13 or 1.14 you will see this error: > {code:java} > {"time": "2019-08-28T09:56:11.866Z", "lvl":"INFO", "logger": > "org.apache.spark.internal.Logging", > "thread":"kubernetes-executor-snapshots-subscribers-0","msg":"Going to > request 1 executors from Kubernetes."} > {"time": "2019-08-28T09:56:12.028Z", "lvl":"WARN", "logger": > "io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2", > "thread":"OkHttp https://kubernetes.default.svc/...","msg":"Exec Failure: > HTTP 403, Status: 403 - "} > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > {code} > Apparently the bug is fixed here: > [https://github.com/fabric8io/kubernetes-client/pull/1669] > We have currently compiled Spark source code with Kubernetes-client 4.4.2 and > it's working great on our cluster. We are using Kubernetes 1.13.10. > > Could it be possible to update that dependency version? > > Thanks! -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920411#comment-16920411 ] Andy Grove commented on SPARK-28921: [~dongjoon] we are seeing it on both of the EKS clusters where we are running Spark jobs. I imagine it affects all EKS clusters? The versions we are using are 1.11.10 and 1.12.10 .. full version info: {code:java} Server Version: version.Info{Major:"1", Minor:"11+", GitVersion:"v1.11.10-eks-7f15cc", GitCommit:"7f15ccb4e58f112866f7ddcfebf563f199558488", GitTreeState:"clean", BuildDate:"2019-08-19T17:46:02Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"} {code} {code:java} Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-825e5d", GitCommit:"825e5de08cb05714f9b224cd6c47d9514df1d1a7", GitTreeState:"clean", BuildDate:"2019-08-18T03:58:32Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"} {code} > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.3 >Reporter: Paul Schweigert >Priority: Major > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. 
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated SPARK-28921: --- Summary: Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10) (was: Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.11.10)) > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, > 1.12.10, 1.11.10) > --- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.3 >Reporter: Paul Schweigert >Priority: Major > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. 
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.11.10)
[ https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated SPARK-28921: --- Summary: Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.11.10) (was: Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10)) > Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, > 1.11.10) > -- > > Key: SPARK-28921 > URL: https://issues.apache.org/jira/browse/SPARK-28921 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.3, 2.4.3 >Reporter: Paul Schweigert >Priority: Major > > Spark jobs are failing on latest versions of Kubernetes when jobs attempt to > provision executor pods (jobs like Spark-Pi that do not launch executors run > without a problem): > > Here's an example error message: > > {code:java} > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes. > 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors > from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: > HTTP 403, Status: 403 - > java.net.ProtocolException: Expected HTTP 101 response but was '403 > Forbidden' > at > okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) > at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) > at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) > at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > at java.lang.Thread.run(Thread.java:748) > {code} > > Looks like the issue is caused by fixes for a recent CVE : > CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809] > Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669] > > Looks like upgrading kubernetes-client to 
4.4.2 would solve this issue. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28940) Subquery reuse across all subquery levels
Peter Toth created SPARK-28940: -- Summary: Subquery reuse accross all subquery levels Key: SPARK-28940 URL: https://issues.apache.org/jira/browse/SPARK-28940 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Peter Toth Currently subquery reuse doesn't work across all subquery levels. Here is an example query: {noformat} SELECT (SELECT avg(key) FROM testData), (SELECT (SELECT avg(key) FROM testData)) FROM testData LIMIT 1 {noformat} where the plan now is: {noformat} CollectLimit 1 +- *(1) Project [Subquery scalar-subquery#268, [id=#231] AS scalarsubquery()#276, Subquery scalar-subquery#270, [id=#266] AS scalarsubquery()#277] : :- Subquery scalar-subquery#268, [id=#231] : : +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#272]) : : +- Exchange SinglePartition, true, [id=#227] : :+- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#282, count#283L]) : : +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] : : +- Scan[obj#12] : +- Subquery scalar-subquery#270, [id=#266] : +- *(1) Project [Subquery scalar-subquery#269, [id=#263] AS scalarsubquery()#275] :: +- Subquery scalar-subquery#269, [id=#263] :: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#274]) ::+- Exchange SinglePartition, true, [id=#259] :: +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#286, count#287L]) :: +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] :: +- Scan[obj#12] :+- *(1) Scan OneRowRelation[] +- *(1) SerializeFromObject +- Scan[obj#12] {noformat} but it could be: {noformat} CollectLimit 1 +- *(1) Project [ReusedSubquery Subquery scalar-subquery#241, [id=#148] AS scalarsubquery()#248, Subquery scalar-subquery#242, [id=#164] AS 
scalarsubquery()#249] : :- ReusedSubquery Subquery scalar-subquery#241, [id=#148] : +- Subquery scalar-subquery#242, [id=#164] : +- *(1) Project [Subquery scalar-subquery#241, [id=#148] AS scalarsubquery()#247] :: +- Subquery scalar-subquery#241, [id=#148] :: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#246]) ::+- Exchange SinglePartition, true, [id=#144] :: +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#258, count#259L]) :: +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13] :: +- Scan[obj#12] :+- *(1) Scan OneRowRelation[] +- *(1) SerializeFromObject +- Scan[obj#12] {noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
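For context, Spark's reuse rules generally work by keying already-planned subtrees on a canonicalized form and substituting a reference when the same key shows up again; the cross-level reuse proposed here amounts to sharing that lookup across nesting levels. A simplified, hypothetical sketch of that bookkeeping (plain strings stand in for canonicalized plans, and the names are our own, not Spark's):

```python
def assign_reuse(subqueries):
    """Given (id, canonical_key) pairs in planning order, map each subquery id
    to the id actually executed: itself, or the first earlier subquery with the
    same canonical key. Sharing one map across all levels is what enables the
    ReusedSubquery nodes in the desired plan above."""
    seen = {}    # canonical key -> id of first occurrence
    result = {}  # subquery id -> id it resolves to
    for sid, key in subqueries:
        if key in seen:
            result[sid] = seen[key]  # reuse the earlier identical subquery
        else:
            seen[key] = sid
            result[sid] = sid
    return result
```

In the example query, both `avg(key)` subqueries would canonicalize to the same key, so the inner one resolves to the outer one instead of being planned and executed twice.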
[jira] [Updated] (SPARK-28939) SQL configuration are not always propagated
[ https://issues.apache.org/jira/browse/SPARK-28939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-28939: Description: The SQL configurations are propagated to executors in order to be effective. Unfortunately, in some cases, we fail to propagate them, making them ineffective. The problem happens every time {{rdd}} or {{queryExecution.toRdd}} is used, and this is pretty frequent in the codebase. Note that there are two parts to this issue: - when a user directly uses those APIs - when Spark invokes them (e.g. throughout the ML lib and other usages, or the {{describe}} method on the {{Dataset}} class) was: The SQL configurations are propagated to executors in order to be effective. Unfortunately, in some cases, we fail to propagate them, making them ineffective. For an example, please see the {{describe}} method on the {{Dataset}} class. > SQL configuration are not always propagated > --- > > Key: SPARK-28939 > URL: https://issues.apache.org/jira/browse/SPARK-28939 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Marco Gaido >Priority: Major > > The SQL configurations are propagated to executors in order to be effective. > Unfortunately, in some cases, we fail to propagate them, making them > ineffective. > The problem happens every time {{rdd}} or {{queryExecution.toRdd}} is used, > and this is pretty frequent in the codebase. > Note that there are two parts to this issue: > - when a user directly uses those APIs > - when Spark invokes them (e.g. throughout the ML lib and other usages, or the > {{describe}} method on the {{Dataset}} class) -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28939) SQL configuration are not always propagated
Marco Gaido created SPARK-28939: --- Summary: SQL configuration are not always propagated Key: SPARK-28939 URL: https://issues.apache.org/jira/browse/SPARK-28939 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Reporter: Marco Gaido The SQL configurations are propagated to executors in order to be effective. Unfortunately, in some cases, we fail to propagate them, making them ineffective. For an example, please see the {{describe}} method on the {{Dataset}} class. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24352) Flaky test: StandaloneDynamicAllocationSuite
[ https://issues.apache.org/jira/browse/SPARK-24352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24352: Issue Type: Test (was: Bug) > Flaky test: StandaloneDynamicAllocationSuite > > > Key: SPARK-24352 > URL: https://issues.apache.org/jira/browse/SPARK-24352 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > From jenkins: > [https://amplab.cs.berkeley.edu/jenkins/user/vanzin/my-views/view/Spark/job/spark-branch-2.3-test-maven-hadoop-2.6/384/testReport/junit/org.apache.spark.deploy/StandaloneDynamicAllocationSuite/executor_registration_on_a_blacklisted_host_must_fail/] > > {noformat} > Error Message > There is already an RpcEndpoint called CoarseGrainedScheduler > Stacktrace > java.lang.IllegalArgumentException: There is already an RpcEndpoint > called CoarseGrainedScheduler > at > org.apache.spark.rpc.netty.Dispatcher.registerRpcEndpoint(Dispatcher.scala:71) > at > org.apache.spark.rpc.netty.NettyRpcEnv.setupEndpoint(NettyRpcEnv.scala:130) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.createDriverEndpointRef(CoarseGrainedSchedulerBackend.scala:396) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.start(CoarseGrainedSchedulerBackend.scala:391) > at > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:61) > at > org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply$mcV$sp(StandaloneDynamicAllocationSuite.scala:512) > at > org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495) > at > org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at 
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) > {noformat} > This actually looks like a previous test is leaving some stuff running and > making this one fail. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28535) Flaky test: JobCancellationSuite."interruptible iterator of shuffle reader"
[ https://issues.apache.org/jira/browse/SPARK-28535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28535: Issue Type: Test (was: Bug) > Flaky test: JobCancellationSuite."interruptible iterator of shuffle reader" > --- > > Key: SPARK-28535 > URL: https://issues.apache.org/jira/browse/SPARK-28535 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.3, 3.0.0, 2.4.3 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > This is the same flakiness as in SPARK-23881, except the fix there didn't > really take, at least on our build machines. > {noformat} > org.scalatest.exceptions.TestFailedException: 1 was not less than 1 > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > {noformat} > Since that bug is short on explanations, the issue is that there's a race > between the thread posting the "stage completed" event to the listener which > unblocks the test, and the thread killing the task in the executor. If the > even arrives first, it will unblock task execution, and there's a chance that > all elements will actually be processed before the executor has a chance to > stop the task. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28418) Flaky Test: pyspark.sql.tests.test_dataframe: test_query_execution_listener_on_collect
[ https://issues.apache.org/jira/browse/SPARK-28418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28418: Issue Type: Test (was: Bug) > Flaky Test: pyspark.sql.tests.test_dataframe: > test_query_execution_listener_on_collect > -- > > Key: SPARK-28418 > URL: https://issues.apache.org/jira/browse/SPARK-28418 > Project: Spark > Issue Type: Test > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > {code} > ERROR [0.164s]: test_query_execution_listener_on_collect > (pyspark.sql.tests.test_dataframe.QueryExecutionListenerTests) > -- > Traceback (most recent call last): > File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 758, > in test_query_execution_listener_on_collect > "The callback from the query execution listener should be called after > 'collect'") > AssertionError: The callback from the query execution listener should be > called after 'collect' > {code} > Seems it can be failed due to not waiting events to be proceeded. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28335) Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset recovery from kafka
[ https://issues.apache.org/jira/browse/SPARK-28335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28335: Issue Type: Test (was: Bug) > Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset > recovery from kafka > - > > Key: SPARK-28335 > URL: https://issues.apache.org/jira/browse/SPARK-28335 > Project: Spark > Issue Type: Test > Components: DStreams, Tests >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > Fix For: 2.3.4, 2.4.4, 3.0.0 > > Attachments: bad.log > > > {code:java} > org.scalatest.exceptions.TestFailedException: {} was empty > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply$mcV$sp(DirectKafkaStreamSuite.scala:466) > at > org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416) > at > org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at or > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28357) Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling compressed
[ https://issues.apache.org/jira/browse/SPARK-28357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28357: Issue Type: Test (was: Bug) > Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling > compressed > > > Key: SPARK-28357 > URL: https://issues.apache.org/jira/browse/SPARK-28357 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107553/testReport/org.apache.spark.util/FileAppenderSuite/rolling_file_appender___size_based_rolling__compressed_/ -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24898) Adding spark.checkpoint.compress to the docs
[ https://issues.apache.org/jira/browse/SPARK-24898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24898: Issue Type: Improvement (was: Task) > Adding spark.checkpoint.compress to the docs > > > Key: SPARK-24898 > URL: https://issues.apache.org/jira/browse/SPARK-24898 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Riccardo Corbella >Assignee: Sandeep >Priority: Trivial > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > Parameter *spark.checkpoint.compress* is not listed under configuration > properties. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28261) Flaky test: org.apache.spark.network.TransportClientFactorySuite.reuseClientsUpToConfigVariable
[ https://issues.apache.org/jira/browse/SPARK-28261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28261: Issue Type: Test (was: Bug) > Flaky test: > org.apache.spark.network.TransportClientFactorySuite.reuseClientsUpToConfigVariable > --- > > Key: SPARK-28261 > URL: https://issues.apache.org/jira/browse/SPARK-28261 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > Error message: > {noformat} > java.lang.AssertionError: expected:<3> but was:<4> > ...{noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28247) Flaky test: "query without test harness" in ContinuousSuite
[ https://issues.apache.org/jira/browse/SPARK-28247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28247: Issue Type: Test (was: Bug) > Flaky test: "query without test harness" in ContinuousSuite > --- > > Key: SPARK-28247 > URL: https://issues.apache.org/jira/browse/SPARK-28247 > Project: Spark > Issue Type: Test > Components: Structured Streaming, Tests >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > This test has failed a few times in some PRs, as well as easy to reproduce > locally. Example of a failure: > {noformat} > [info] - query without test harness *** FAILED *** (2 seconds, 931 > milliseconds) > [info] scala.Predef.Set.apply[Int](0, 1, 2, > 3).map[org.apache.spark.sql.Row, > scala.collection.immutable.Set[org.apache.spark.sql.Row]](((x$3: Int) => > org.apache.spark.sql.Row.apply(x$3)))(immutable.this.Set.canBuildFrom[org.apache.spark.sql.Row]).subsetOf(scala.Predef.refArrayOps[org.apache.spark.sql.Row](results).toSet[org.apache.spark.sql.Row]) > was false > (ContinuousSuite.scala:226){noformat} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28713) Bump checkstyle from 8.14 to 8.23
[ https://issues.apache.org/jira/browse/SPARK-28713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28713: Issue Type: Improvement (was: Task) > Bump checkstyle from 8.14 to 8.23 > - > > Key: SPARK-28713 > URL: https://issues.apache.org/jira/browse/SPARK-28713 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > From the GitHub Security Advisory Database: > Moderate severity vulnerability that affects com.puppycrawl.tools:checkstyle > Checkstyle prior to 8.18 loads external DTDs by default, which can > potentially lead to denial of service attacks or the leaking of confidential > information. > Affected versions: < 8.18 -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27596) The JDBC 'query' option doesn't work for Oracle database
[ https://issues.apache.org/jira/browse/SPARK-27596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27596:
Issue Type: Bug  (was: Improvement)

> The JDBC 'query' option doesn't work for Oracle database
> --------------------------------------------------------
>
>                 Key: SPARK-27596
>                 URL: https://issues.apache.org/jira/browse/SPARK-27596
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.2
>            Reporter: Xiao Li
>            Assignee: Dilip Biswal
>            Priority: Major
>             Fix For: 2.4.4, 3.0.0
>
> For the JDBC option `query`, we generate a subquery alias that starts with underscores: s"(${subquery}) __SPARK_GEN_JDBC_SUBQUERY_NAME_${curId.getAndIncrement()}". This is not supported by Oracle.
> Oracle does not allow an identifier name to start with a non-alphabetic character (unless it is quoted) and imposes length restrictions as well.
> https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm
> {code:java}
> Nonquoted identifiers must begin with an alphabetic character from your database character set. Quoted identifiers can begin with any character.
> Nonquoted identifiers can contain only alphanumeric characters from your database character set and the underscore (_), dollar sign ($), and pound sign (#). Database links can also contain periods (.) and "at" signs (@). Oracle strongly discourages you from using $ and # in nonquoted identifiers.
> {code}
> The alias name should be fixed to remove the "__" prefix (or be quoted; not sure if that would impact other sources) to make it work for Oracle. The length should also be limited, since removing the prefix alone still hits the error below:
> {code:java}
> java.sql.SQLSyntaxErrorException: ORA-00972: identifier is too long
> {code}
> This can be verified using the sqlfiddle link below.
> http://www.sqlfiddle.com/#!4/9bbe9a/10050
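A language-agnostic sketch of the fix suggested above — generate an alias that starts with a letter and stays within the 30-character limit Oracle imposes on nonquoted identifiers. The helper name and the use of Python are illustrative; Spark's actual implementation is Scala:

```python
import itertools

_sub_query_id = itertools.count()  # stands in for Spark's curId counter

def oracle_safe_alias():
    # Start with a letter (no "__" prefix) and cap at 30 characters,
    # the nonquoted-identifier limit in older Oracle versions.
    name = f"SPARK_GEN_SUBQ_{next(_sub_query_id)}"
    return name[:30]

alias = oracle_safe_alias()
assert alias[0].isalpha() and len(alias) <= 30
```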
[jira] [Updated] (SPARK-28642) Hide credentials in show create table
[ https://issues.apache.org/jira/browse/SPARK-28642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28642:
Issue Type: Bug  (was: Improvement)

> Hide credentials in show create table
> -------------------------------------
>
>                 Key: SPARK-28642
>                 URL: https://issues.apache.org/jira/browse/SPARK-28642
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 2.4.4, 3.0.0
>
> {code:sql}
> spark-sql> show create table mysql_federated_sample;
> CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, `DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, `SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN)
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   `url` 'jdbc:mysql://localhost/hive?user=root&password=mypasswd',
>   `driver` 'com.mysql.jdbc.Driver',
>   `dbtable` 'TBLS'
> )
> {code}
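A minimal sketch of the kind of redaction the issue calls for — masking credential values in the JDBC url before it is echoed back. The key list and helper are assumptions for illustration, not Spark's actual redaction code:

```python
import re

# Query-string keys whose values should never be echoed back (assumed list).
SENSITIVE_KEYS = ("password", "user")

def redact_jdbc_url(url: str) -> str:
    # Replace the value of each sensitive parameter with a mask.
    for key in SENSITIVE_KEYS:
        url = re.sub(rf"({key}=)[^&;]*", r"\g<1>*****", url)
    return url

masked = redact_jdbc_url("jdbc:mysql://localhost/hive?user=root&password=mypasswd")
assert "mypasswd" not in masked and "root" not in masked
```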