[jira] [Commented] (SPARK-28934) Add `spark.sql.compatibility.mode`

2019-09-01 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920611#comment-16920611
 ] 

Wenchen Fan commented on SPARK-28934:
-

Yeah, if we have a pgSQL mode, that's a good reason to have SPARK-28610.

> Add `spark.sql.compatibility.mode`
> -
>
> Key: SPARK-28934
> URL: https://issues.apache.org/jira/browse/SPARK-28934
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> This issue aims to add `spark.sql.compatibility.mode`, whose value is `spark` 
> or `pgSQL` (case-insensitive), to control PostgreSQL compatibility features.
>
> Apache Spark 3.0.0 can start with `spark.sql.parser.ansi.enabled=false` and 
> `spark.sql.compatibility.mode=spark`.
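For illustration, a minimal sketch of how the proposed settings might be supplied when building a session. Both configuration keys are proposals from this ticket (and SPARK-28610), not settings that exist in a released Spark; the names and values are assumptions taken from the description above:

{code:lang=scala}
import org.apache.spark.sql.SparkSession

// Both keys below are *proposed* names from this ticket, not existing configs;
// the values shown are the proposed Spark 3.0.0 defaults from the description.
val spark = SparkSession.builder()
  .config("spark.sql.parser.ansi.enabled", "false")
  .config("spark.sql.compatibility.mode", "spark") // or "pgSQL" (case-insensitive)
  .getOrCreate()
{code}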






[jira] [Created] (SPARK-28946) Add some more information about building SparkR on Windows

2019-09-01 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-28946:


 Summary: Add some more information about building SparkR on Windows
 Key: SPARK-28946
 URL: https://issues.apache.org/jira/browse/SPARK-28946
 Project: Spark
  Issue Type: Test
  Components: Documentation, SparkR
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


We should mention:

- it needs {{bash}} in {{PATH}} to build
- supported JDK versions
- building on Windows is not officially supported






[jira] [Updated] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite

2019-09-01 Thread koert kuipers (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koert kuipers updated SPARK-28945:
--
Summary: Allow concurrent writes to different partitions with dynamic 
partition overwrite  (was: Allow concurrent writes to unrelated partitions with 
dynamic partition overwrite)

> Allow concurrent writes to different partitions with dynamic partition 
> overwrite
> 
>
> Key: SPARK-28945
> URL: https://issues.apache.org/jira/browse/SPARK-28945
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: koert kuipers
>Priority: Minor
>
> It is desirable to run concurrent jobs that write to different partitions 
> within the same baseDir using partitionBy and dynamic partitionOverwriteMode.
> See for example here:
> https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning
> Or the discussion here:
> https://github.com/delta-io/delta/issues/9
> This doesn't seem that difficult. I suspect the only changes needed are in 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has 
> a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling 
> all committer activity (committer.setupJob, committer.commitJob, etc.) when 
> dynamicPartitionOverwrite is true.






[jira] [Commented] (SPARK-28945) Allow concurrent writes to unrelated partitions with dynamic partition overwrite

2019-09-01 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920602#comment-16920602
 ] 

koert kuipers commented on SPARK-28945:
---

See also:
https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E

> Allow concurrent writes to unrelated partitions with dynamic partition 
> overwrite
> 
>
> Key: SPARK-28945
> URL: https://issues.apache.org/jira/browse/SPARK-28945
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: koert kuipers
>Priority: Minor
>
> It is desirable to run concurrent jobs that write to different partitions 
> within the same baseDir using partitionBy and dynamic partitionOverwriteMode.
> See for example here:
> https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning
> Or the discussion here:
> https://github.com/delta-io/delta/issues/9
> This doesn't seem that difficult. I suspect the only changes needed are in 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has 
> a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling 
> all committer activity (committer.setupJob, committer.commitJob, etc.) when 
> dynamicPartitionOverwrite is true.






[jira] [Created] (SPARK-28945) Allow concurrent writes to unrelated partitions with dynamic partition overwrite

2019-09-01 Thread koert kuipers (Jira)
koert kuipers created SPARK-28945:
-

 Summary: Allow concurrent writes to unrelated partitions with 
dynamic partition overwrite
 Key: SPARK-28945
 URL: https://issues.apache.org/jira/browse/SPARK-28945
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: koert kuipers


It is desirable to run concurrent jobs that write to different partitions 
within the same baseDir using partitionBy and dynamic partitionOverwriteMode.

See for example here:
https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning

Or the discussion here:
https://github.com/delta-io/delta/issues/9

This doesn't seem that difficult. I suspect the only changes needed are in 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has a 
flag for dynamicPartitionOverwrite. I got a quick test to work by disabling all 
committer activity (committer.setupJob, committer.commitJob, etc.) when 
dynamicPartitionOverwrite is true.
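For reference, a minimal sketch of the write pattern in question (paths, column names, and the input data are illustrative, not taken from the issue). Two such jobs running concurrently against the same baseDir but touching different partition values are what this issue asks to make safe:

{code:lang=scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Only the partitions present in the written data get replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Illustrative job: rewrites only the day=2019-09-01 partition under the shared
// baseDir. A second, concurrent job doing the same for day=2019-09-02 is the
// scenario this issue would like to support.
Seq(("a", 1, "2019-09-01"), ("b", 2, "2019-09-01"))
  .toDF("key", "value", "day")
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("day")
  .parquet("/data/output")   // shared baseDir (illustrative path)
{code}

Per the linked discussions, running two such jobs at the same time can currently conflict in the commit protocol even though they touch disjoint partitions.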






[jira] [Comment Edited] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-01 Thread Qiang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920601#comment-16920601
 ] 

Qiang Wang edited comment on SPARK-28927 at 9/2/19 3:56 AM:


I checked the commit list since May 31, 2017 (all commits before that date are 
included in 2.2.1) and found no commit related to this problem.
!image-2019-09-02-11-55-33-596.png|width=457,height=325!


was (Author: jerryhouse):
I check the commit list since May 31, 2017 before which all the commits are 
included in 2.2.1  and find that there is  no relative commit about this 
problem.
!image-2019-09-02-11-55-33-596.png!

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happens sometimes. We also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, 
> which is not stable enough for production 

[jira] [Updated] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-01 Thread Qiang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Wang updated SPARK-28927:
---
Attachment: image-2019-09-02-11-55-33-596.png

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Priority: Major
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happens sometimes. We also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, 
> which is not stable enough for a production environment.
> Dataset size: ~12 billion ratings
> Here is our code:
> val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, 
> y._2.toFloat)))
>   .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class 
> ALSData(user:Int, item:Int, rating:Float) extends Serializable
> val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF()
> val als = new ALS
> val paramMap = ParamMap(als.alpha -> 25000).
>   

[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-01 Thread Qiang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920601#comment-16920601
 ] 

Qiang Wang commented on SPARK-28927:


I checked the commit list since May 31, 2017 (all commits before that date are 
included in 2.2.1) and found no commit related to this problem.
!image-2019-09-02-11-55-33-596.png!

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Priority: Major
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happens sometimes. We also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, 
> which is not stable enough for a production environment.
> Dataset size: ~12 billion ratings
> Here is our code:
> val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, 
> y._2.toFloat)))
>   .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class 
> ALSData(user:Int, item:Int, rating:Float) extends Serializable
> val ratingData = 

[jira] [Updated] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2

2019-09-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28612:
-
Fix Version/s: (was: 3.0.0)

> DataSourceV2: Add new DataFrameWriter API for v2
> 
>
> Key: SPARK-28612
> URL: https://issues.apache.org/jira/browse/SPARK-28612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>
> This tracks adding an API like the one proposed in SPARK-23521:
> {code:lang=scala}
> df.writeTo("catalog.db.table").append() // AppendData
> df.writeTo("catalog.db.table").overwriteDynamic() // 
> OverwritePartiitonsDynamic
> df.writeTo("catalog.db.table").overwrite($"date" === '2019-01-01') // 
> OverwriteByExpression
> df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS
> df.writeTo("catalog.db.table").replace() // RTAS
> {code}






[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page

2019-09-01 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920599#comment-16920599
 ] 

zhengruifeng commented on SPARK-28373:
--

[~smilegator] [~yumwang]  I am afraid I have no time to do it this week.  

[~planga82]  Could you please take it over?

> Document JDBC/ODBC Server page
> --
>
> Key: SPARK-28373
> URL: https://issues.apache.org/jira/browse/SPARK-28373
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503!
>  
> [https://github.com/apache/spark/pull/25062] added new columns CLOSE TIME 
> and EXECUTION TIME. It is hard to understand the difference between them, so we 
> need to document them; otherwise, it is hard for end users to understand them.
>  






[jira] [Reopened] (SPARK-28612) DataSourceV2: Add new DataFrameWriter API for v2

2019-09-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-28612:
--
  Assignee: (was: Ryan Blue)

Reverted at 
https://github.com/apache/spark/commit/bd3915e356b69897d580d9f655a97f781e8f1c83

> DataSourceV2: Add new DataFrameWriter API for v2
> 
>
> Key: SPARK-28612
> URL: https://issues.apache.org/jira/browse/SPARK-28612
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> This tracks adding an API like the one proposed in SPARK-23521:
> {code:lang=scala}
> df.writeTo("catalog.db.table").append() // AppendData
> df.writeTo("catalog.db.table").overwriteDynamic() // 
> OverwritePartiitonsDynamic
> df.writeTo("catalog.db.table").overwrite($"date" === '2019-01-01') // 
> OverwriteByExpression
> df.writeTo("catalog.db.table").partitionBy($"type", $"date").create() // CTAS
> df.writeTo("catalog.db.table").replace() // RTAS
> {code}






[jira] [Updated] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors

2019-09-01 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28933:
--
Priority: Minor  (was: Major)

> Reduce unnecessary shuffle in ALS when initializing factors
> ---
>
> Key: SPARK-28933
> URL: https://issues.apache.org/jira/browse/SPARK-28933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 3.0.0
>
>
> When initializing factors in ALS, we should use {{mapPartitions}} instead of 
> the current {{map}} so that we can preserve the existing partitioning of the 
> {{InBlock}} RDD. The {{InBlock}} RDD is already partitioned by src block id, 
> and we don't change the partitioning when initializing factors.
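For illustration, a sketch of the general idea behind the change (not the actual ALS code): {{map}} drops an RDD's partitioner, while {{mapPartitions}} with {{preservesPartitioning = true}} keeps it, so downstream operations keyed the same way avoid a shuffle:

{code:lang=scala}
import org.apache.spark.{HashPartitioner, SparkContext}

val sc = SparkContext.getOrCreate()

// Illustrative pair RDD, partitioned the way InBlock is keyed by src block id.
val inBlocks = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c")))
  .partitionBy(new HashPartitioner(4))

// map: the result has no partitioner, so a later cogroup/join reshuffles.
val viaMap = inBlocks.map { case (srcBlockId, block) => (srcBlockId, block.length) }

// mapPartitions with preservesPartitioning = true keeps the existing partitioner.
val viaMapPartitions = inBlocks.mapPartitions(
  iter => iter.map { case (srcBlockId, block) => (srcBlockId, block.length) },
  preservesPartitioning = true)

assert(viaMap.partitioner.isEmpty)
assert(viaMapPartitions.partitioner.isDefined)
{code}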






[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-01 Thread Qiang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920596#comment-16920596
 ] 

Qiang Wang commented on SPARK-28927:


I only tested it on version 2.2.1, which matches our Spark version. Is it OK to 
use MLlib from the master branch while keeping Spark at version 2.2.1?

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Priority: Major
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happens sometimes. We also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration. AUC varied from 0.60 to 0.67, 
> which is not stable enough for a production environment.
> Dataset size: ~12 billion ratings
> Here is our code:
> val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, 
> y._2.toFloat)))
>   .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)case class 
> ALSData(user:Int, item:Int, rating:Float) extends Serializable
> val ratingData = trainData.map(x => ALSData(x._1, x._2, 

[jira] [Commented] (SPARK-28906) `bin/spark-submit --version` shows incorrect info

2019-09-01 Thread Kazuaki Ishizaki (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920594#comment-16920594
 ] 

Kazuaki Ishizaki commented on SPARK-28906:
--

Regarding the {{git}} command: the {{.git}} directory is deleted after 
{{git clone}} is executed. When I tentatively stopped deleting the {{.git}} 
directory, {{spark-version-info.properties}} included the correct information, like:
{code}
version=2.3.4
user=ishizaki
revision=8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210
branch=HEAD
date=2019-09-02T02:31:25Z
url=https://gitbox.apache.org/repos/asf/spark.git
{code}
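As an aside, a small sketch of reading that file from an application's classpath; the resource name is taken from the build output above, while the loading code here is illustrative rather than how Spark itself consumes it:

{code:lang=scala}
import java.util.Properties

// Load spark-version-info.properties if it is on the classpath and print two fields.
val props = new Properties()
val in = Thread.currentThread().getContextClassLoader
  .getResourceAsStream("spark-version-info.properties")
if (in != null) {
  try props.load(in) finally in.close()
  println(props.getProperty("version"))
  println(props.getProperty("revision"))
}
{code}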


> `bin/spark-submit --version` shows incorrect info
> -
>
> Key: SPARK-28906
> URL: https://issues.apache.org/jira/browse/SPARK-28906
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.4, 2.4.0, 2.4.1, 2.4.2, 
> 3.0.0, 2.4.3
>Reporter: Marcelo Vanzin
>Priority: Minor
> Attachments: image-2019-08-29-05-50-13-526.png
>
>
> Since Spark 2.3.1, `spark-submit` has shown incorrect information.
> {code}
> $ bin/spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
>       /_/
> Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222
> Branch
> Compiled by user  on 2019-02-04T13:00:46Z
> Revision
> Url
> Type --help for more information.
> {code}






[jira] [Commented] (SPARK-28770) Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression failed

2019-09-01 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920587#comment-16920587
 ] 

huangtianhua commented on SPARK-28770:
--

[~wypoon], thank you for looking into this. So you suggest removing the 
comparison of SparkListenerStageExecutorMetrics events for the two failing tests?

> Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression 
> failed
> ---
>
> Key: SPARK-28770
> URL: https://issues.apache.org/jira/browse/SPARK-28770
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: Community jenkins and our arm testing instance.
>Reporter: huangtianhua
>Priority: Major
>
> Test
> org.apache.spark.scheduler.ReplayListenerSuite.End-to-end replay with 
> compression is failing; see 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/267/testReport/junit/org.apache.spark.scheduler/ReplayListenerSuite/End_to_end_replay_with_compression/]
>  
> The test also fails on our ARM instance. I sent an email to spark-dev 
> before, and we suspect it is related to the commit 
> [https://github.com/apache/spark/pull/23767]; when we tried reverting it, the 
> tests passed:
> ReplayListenerSuite:
>        - ...
>        - End-to-end replay *** FAILED ***
>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>        - End-to-end replay with compression *** FAILED ***
>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622) 
>  
> Not sure what's wrong, hope someone can help to figure it out, thanks very 
> much.






[jira] [Updated] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors

2019-09-01 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-28933:

Fix Version/s: 3.0.0

> Reduce unnecessary shuffle in ALS when initializing factors
> ---
>
> Key: SPARK-28933
> URL: https://issues.apache.org/jira/browse/SPARK-28933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> When initializing factors in ALS, we should use {{mapPartitions}} instead of 
> the current {{map}} so that we can preserve the existing partitioning of the 
> {{InBlock}} RDD. The {{InBlock}} RDD is already partitioned by src block id, 
> and we don't change the partitioning when initializing factors.






[jira] [Commented] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors

2019-09-01 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920584#comment-16920584
 ] 

Liang-Chi Hsieh commented on SPARK-28933:
-

This issue was resolved by [https://github.com/apache/spark/pull/25639].

 

> Reduce unnecessary shuffle in ALS when initializing factors
> ---
>
> Key: SPARK-28933
> URL: https://issues.apache.org/jira/browse/SPARK-28933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> When initializing factors in ALS, we should use {{mapPartitions}} instead of 
> the current {{map}} so that we can preserve the existing partitioning of the 
> {{InBlock}} RDD. The {{InBlock}} RDD is already partitioned by src block id, 
> and we don't change the partitioning when initializing factors.






[jira] [Resolved] (SPARK-28933) Reduce unnecessary shuffle in ALS when initializing factors

2019-09-01 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-28933.
-
Resolution: Resolved

> Reduce unnecessary shuffle in ALS when initializing factors
> ---
>
> Key: SPARK-28933
> URL: https://issues.apache.org/jira/browse/SPARK-28933
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> When initializing factors in ALS, we should use {{mapPartitions}} instead of 
> the current {{map}} so that we can preserve the existing partitioning of the 
> {{InBlock}} RDD. The {{InBlock}} RDD is already partitioned by src block id, 
> and we don't change the partitioning when initializing factors.






[jira] [Resolved] (SPARK-28923) Deduplicate the codes 'multipartIdentifier' and 'identifierSeq'

2019-09-01 Thread Xianyin Xin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin resolved SPARK-28923.
-
Resolution: Invalid

> Deduplicate the codes 'multipartIdentifier' and 'identifierSeq'
> ---
>
> Key: SPARK-28923
> URL: https://issues.apache.org/jira/browse/SPARK-28923
> Project: Spark
>  Issue Type: Request
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Priority: Minor
>
> In {{sqlbase.g4}}, {{multipartIdentifier}} and {{identifierSeq}} have the 
> same functionality. We'd better deduplicate them.






[jira] [Commented] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when using SparkSQL

2019-09-01 Thread MOBIN (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920577#comment-16920577
 ] 

MOBIN commented on SPARK-28916:
---

[~mgaido] Thanks, the spark.sql.subexpressionElimination.enabled parameter solved my 
problem.
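For anyone else hitting this, a rough reproduction sketch based on the steps in the issue description; the 5000-column shape comes from the report, everything else (names, values) is illustrative, and which value of spark.sql.subexpressionElimination.enabled avoids the failure is not confirmed here:

{code:lang=scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().getOrCreate()

// Build a very wide DataFrame (5000 columns); describe() then generates one
// large projection over all columns, which per this report can exceed the
// 64 KB JVM method limit.
val cols = (0 until 5000).map(i => lit(i).as(s"c$i"))
val wide = spark.range(10).select(cols: _*)
wide.describe().show()
{code}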

> Generated SpecificSafeProjection.apply method grows beyond 64 KB when using 
> SparkSQL
> ---
>
> Key: SPARK-28916
> URL: https://issues.apache.org/jira/browse/SPARK-28916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: MOBIN
>Priority: Major
>
> Can be reproduced by the following steps:
> 1. Create a table with 5000 fields
> 2. val data=spark.sql("select * from spark64kb limit 10");
> 3. data.describe()
> Then the following error occurs:
> {code:java}
> WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, 
> executor 1): org.codehaus.janino.InternalCompilerException: failed to 
> compile: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection"
>  grows beyond 64 KB
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method 
> 

[jira] [Created] (SPARK-28944) Expose peak memory of executor in metrics for parameter tuning

2019-09-01 Thread deshanxiao (Jira)
deshanxiao created SPARK-28944:
--

 Summary: Expose peak memory of executor in metrics for parameter 
tuning
 Key: SPARK-28944
 URL: https://issues.apache.org/jira/browse/SPARK-28944
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: deshanxiao


Maybe we can collect the peak executor memory via the heartbeat, to help tune 
parameters like spark.executor.memory.






[jira] [Commented] (SPARK-27336) Incorrect DataSet.summary() result

2019-09-01 Thread daile (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920557#comment-16920557
 ] 

daile commented on SPARK-27336:
---

I will check this issue

> Incorrect DataSet.summary() result
> --
>
> Key: SPARK-27336
> URL: https://issues.apache.org/jira/browse/SPARK-27336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: test.csv
>
>
> There is a single data point in the minimum_nights column that is 1.0E8, out 
> of ~8k records, but .summary() reports it as the 75th percentile and the max.
> I compared this with approxQuantile, and approxQuantile for 75% gave the 
> correct value of 30.0.
> To reproduce:
> {code:java}
> scala> val df = 
> spark.read.format("csv").load("test.csv").withColumn("minimum_nights", 
> '_c0.cast("Int"))
> df: org.apache.spark.sql.DataFrame = [_c0: string, minimum_nights: int]
> scala> df.select("minimum_nights").summary().show()
> +-------+------------------+
> |summary|    minimum_nights|
> +-------+------------------+
> |  count|              7072|
> |   mean| 14156.35407239819|
> | stddev|1189128.5444975856|
> |    min|                 1|
> |    25%|                 2|
> |    50%|                 4|
> |    75%|         100000000|
> |    max|         100000000|
> +-------+------------------+
> scala> df.stat.approxQuantile("minimum_nights", Array(0.75), 0.1)
> res1: Array[Double] = Array(30.0)
> scala> df.stat.approxQuantile("minimum_nights", Array(0.75), 0.001)
> res2: Array[Double] = Array(30.0)
> scala> df.stat.approxQuantile("minimum_nights", Array(0.75), 0.0001)
> res3: Array[Double] = Array(1.0E8)
> {code}






[jira] [Commented] (SPARK-28935) Document SQL metrics for Details for Query Plan

2019-09-01 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920550#comment-16920550
 ] 

Liang-Chi Hsieh commented on SPARK-28935:
-

Thanks! [~smilegator]

It should be helpful.

> Document SQL metrics for Details for Query Plan
> ---
>
> Key: SPARK-28935
> URL: https://issues.apache.org/jira/browse/SPARK-28935
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> [https://github.com/apache/spark/pull/25349] shows the query plans, but it 
> does not describe the meaning of each metric in the plan. Without documentation, 
> end users might not understand the metrics we output.
>  
> !https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png!






[jira] [Resolved] (SPARK-28790) Document CACHE TABLE statement in SQL Reference.

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28790.
-
Fix Version/s: 3.0.0
 Assignee: Huaxin Gao
   Resolution: Fixed

> Document CACHE TABLE statement in SQL Reference. 
> -
>
> Key: SPARK-28790
> URL: https://issues.apache.org/jira/browse/SPARK-28790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2019-09-01 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920534#comment-16920534
 ] 

Jungtaek Lim commented on SPARK-28594:
--

Thanks [~felixcheung] for reviewing and volunteering to be the shepherd for this 
work!

Could you also jump in on [https://github.com/apache/spark/pull/25577], which is 
coupled with this issue?

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Major
>
> In all current Spark releases, when event logging is enabled for Spark Streaming 
> applications, the event logs grow massively. The files continue to grow until the 
> application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617
> addresses .inprogress files, but not event logs of applications that are still running.
> Can we identify a mechanism to set a "max file" size so that the file is rolled over 
> when it reaches this size?
>  
>  






[jira] [Commented] (SPARK-28373) Document JDBC/ODBC Server page

2019-09-01 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920515#comment-16920515
 ] 

Xiao Li commented on SPARK-28373:
-

[~podongfeng] This is the last one. Could you help finish this?

> Document JDBC/ODBC Server page
> --
>
> Key: SPARK-28373
> URL: https://issues.apache.org/jira/browse/SPARK-28373
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Web UI
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> !https://user-images.githubusercontent.com/5399861/60809590-9dcf2500-a1bd-11e9-826e-33729bb97daf.png|width=1720,height=503!
>  
> [https://github.com/apache/spark/pull/25062] added new columns CLOSE TIME 
> and EXECUTION TIME. It is hard to understand the difference between them, so we 
> need to document them; otherwise, it is hard for end users to understand them.
>  






[jira] [Commented] (SPARK-28935) Document SQL metrics for Details for Query Plan

2019-09-01 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920507#comment-16920507
 ] 

Xiao Li commented on SPARK-28935:
-

[https://docs.google.com/spreadsheets/d/11PV6SfkIQ8W_i98tNsMEIQ0QFiTcuQi-2roZDHvX0EM/edit?usp=sharing]

Above is the summary draft I did today. Hopefully, it helps you. Thanks! 
[~viirya]

> Document SQL metrics for Details for Query Plan
> ---
>
> Key: SPARK-28935
> URL: https://issues.apache.org/jira/browse/SPARK-28935
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> [https://github.com/apache/spark/pull/25349] shows the query plans, but it 
> does not describe the meaning of each metric in the plan. Without documentation, 
> end users might not understand the metrics we output.
>  
> !https://user-images.githubusercontent.com/7322292/62421634-9d9c4980-b6d7-11e9-8e31-1e6ba9b402e8.png!






[jira] [Created] (SPARK-28943) NoSuchMethodError: shaded.parquet.org.apache.thrift.EncodingUtils.setBit(BIZ)B

2019-09-01 Thread Michael Heuer (Jira)
Michael Heuer created SPARK-28943:
-

 Summary: NoSuchMethodError: 
shaded.parquet.org.apache.thrift.EncodingUtils.setBit(BIZ)B 
 Key: SPARK-28943
 URL: https://issues.apache.org/jira/browse/SPARK-28943
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Michael Heuer


Since adapting our build for Spark 2.4.x, we are unable to run on the Spark 2.2.0 
provided by CDH. For more details, please see the linked issue 
https://github.com/bigdatagenomics/adam/issues/2157






[jira] [Commented] (SPARK-28927) ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-01 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920480#comment-16920480
 ] 

Liang-Chi Hsieh commented on SPARK-28927:
-

Does this happen only on 2.2.1? What about the current master branch?

> ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
> with 12 billion instances
> ---
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Priority: Major
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happens intermittently. We also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration: AUC varied from 0.60 to 0.67, 
> which is not acceptable for a production environment. 
> Dataset size: ~12 billion ratings
> Here is our code:
> val trainData = predataUser.flatMap(x => x._1._2.map(y => (x._2.toInt, y._1, y._2.toFloat)))
>   .setName(trainDataName).persist(StorageLevel.MEMORY_AND_DISK_SER)
> case class ALSData(user: Int, item: Int, rating: Float) extends Serializable
> val ratingData = trainData.map(x => ALSData(x._1, x._2, x._3)).toDF()
> val als = new ALS
> val paramMap = ParamMap(als.alpha -> 25000).
>

[jira] [Commented] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2019-09-01 Thread Felix Cheung (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920479#comment-16920479
 ] 

Felix Cheung commented on SPARK-27495:
--

+1 on this.

 

I've reviewed this. A few questions/comments:
 # In the description above there is a passage on "Spark internal use by 
catalyst". Looking at the rest of the material (google doc etc.), is this out of 
scope? If so, we should clarify.
 # "Different resources in multiple RDDs that get combined into a single stage": 
this merge can be complicated, and I'm not sure taking the max etc. is going to 
be right at all times. At the least it will be very confusing to the user about 
how much resource is actually used. Instead of a heuristic such as the max, how 
about detecting a mismatch involving multiple RDDs and failing fast, asking the 
user to do a "repartition" operation before that stage?
 # In a later comment, "resource requirement as a hint": I am actually unsure 
about that. In many ML or DL/TensorFlow use cases where MPI or allreduce is 
involved, an exact number of GPUs, processes, and machines is required or the 
job fails to start. I am in favor of a strict mode for that purpose.
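
For illustration, a per-stage requirement expressed on the RDD API might look 
roughly like the sketch below; the {{withResources}} and 
{{ResourceProfileBuilder}} names are assumptions made for this discussion, not 
an existing API.

{code:scala}
// Sketch only: withResources / ResourceProfileBuilder are assumed names for the
// proposed per-stage API, not something that exists in Spark today.
val etl = sc.textFile("hdfs://nn/events")       // stage 1: many small, 1-core tasks
  .map(_.split(",")(0))

val mlInput = etl.repartition(8)                // stage 2: few heavy tasks
  .withResources(                               // assumed API
    new ResourceProfileBuilder()                // assumed API
      .executorCores(8)
      .executorMemory("24g")
      .executorResource("gpu", 1)
      .taskResource("gpu", 1)
      .build())

// Tasks computing mlInput would then only be scheduled on executors acquired
// with the GPU profile above (relying on dynamic allocation for new executors).
{code}

Whether profiles attached to RDDs that end up in the same stage should be merged 
by a heuristic or rejected (point 2 above) is exactly the open question.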

> SPIP: Support Stage level resource configuration and scheduling
> ---
>
> Key: SPARK-27495
> URL: https://issues.apache.org/jira/browse/SPARK-27495
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> Objectives:
>  # Allow users to specify task and executor resource requirements at the 
> stage level. 
>  # Spark will use the stage level requirements to acquire the necessary 
> resources/executors and schedule tasks based on the per stage requirements.
> Many times users have different resource requirements for different stages of 
> their application so they want to be able to configure resources at the stage 
> level. For instance, you have a single job that has 2 stages. The first stage 
> does some  ETL which requires a lot of tasks, each with a small amount of 
> memory and 1 core each. Then you have a second stage where you feed that ETL 
> data into an ML algorithm. The second stage only requires a few executors but 
> each executor needs a lot of memory, GPUs, and many cores.  This feature 
> allows the user to specify the task and executor resource requirements for 
> the ETL Stage and then change them for the ML stage of the job.  
> Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and 
> extra Resources (GPU/FPGA/etc). It has the potential to allow for other 
> things like limiting the number of tasks per stage, specifying other 
> parameters for things like shuffle, etc. Initially I would propose we only 
> support resources as they are now. So Task resources would be cpu and other 
> resources (GPU, FPGA), that way we aren't adding in extra scheduling things 
> at this point.  Executor resources would be cpu, memory, and extra 
> resources(GPU,FPGA, etc). Changing the executor resources will rely on 
> dynamic allocation being enabled.
> Main use cases:
>  # ML use case where user does ETL and feeds it into an ML algorithm where 
> it’s using the RDD API. This should work with barrier scheduling as well once 
> it supports dynamic allocation.
>  # Spark internal use by catalyst. Catalyst could control the stage level 
> resources as it finds the need to change it between stages for different 
> optimizations. For instance, with the new columnar plugin to the query 
> planner we can insert stages into the plan that would change running 
> something on the CPU in row format to running it on the GPU in columnar 
> format. This API would allow the planner to make sure the stages that run on 
> the GPU get the corresponding GPU resources it needs to run. Another possible 
> use case for catalyst is that it would allow catalyst to add in more 
> optimizations to where the user doesn’t need to configure container sizes at 
> all. If the optimizer/planner can handle that for the user, everyone wins.
> This SPIP focuses on the RDD API but we don’t exclude the Dataset API. I 
> think the DataSet API will require more changes because it specifically hides 
> the RDD from the users via the plans and catalyst can optimize the plan and 
> insert things into the plan. The only way I’ve found to make this work with 
> the Dataset API would be modifying all the plans to be able to get the 
> resource requirements down into where it creates the RDDs, which I believe 
> would be a lot of change.  If other people know better options, it would be 
> great to hear them.
> *Q2.* What problem is this 

[jira] [Commented] (SPARK-28902) Spark ML Pipeline with nested Pipelines fails to load when saved from Python

2019-09-01 Thread Junichi Koizumi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920478#comment-16920478
 ] 

Junichi Koizumi commented on SPARK-28902:
---

Could you tell us a little bit more about the workaround? It turns out to work 
fine on my version.

pyspark:

>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import Tokenizer
>>> t = Tokenizer()
>>> p = Pipeline().setStages([t])
>>> d = spark.createDataFrame([["Apache spark logistic regression "]])
>>> pm = p.fit(d)
>>> np = Pipeline().setStages([pm])
>>> npm = np.fit(d)
>>> npm.write().save('./npm_test')

Scala side:

scala> import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.PipelineModel

scala> val pp = PipelineModel.load("./npm_test")
pp: org.apache.spark.ml.PipelineModel = PipelineModel_4d879f6b2b02c8d3d467

> Spark ML Pipeline with nested Pipelines fails to load when saved from Python
> 
>
> Key: SPARK-28902
> URL: https://issues.apache.org/jira/browse/SPARK-28902
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Saif Addin
>Priority: Minor
>
> Hi, this error is affecting a bunch of our nested use cases.
> Saving a *PipelineModel* that has another *PipelineModel* as one of its 
> stages fails to load from Scala when it was saved from Python.
> *Python side:*
>  
> {code:java}
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import Tokenizer
> t = Tokenizer()
> p = Pipeline().setStages([t])
> d = spark.createDataFrame([["Hello Peter Parker"]])
> pm = p.fit(d)
> np = Pipeline().setStages([pm])
> npm = np.fit(d)
> npm.write().save('./npm_test')
> {code}
>  
>  
> *Scala side:*
>  
> {code:java}
> scala> import org.apache.spark.ml.PipelineModel
> scala> val pp = PipelineModel.load("./npm_test")
> java.lang.IllegalArgumentException: requirement failed: Error loading 
> metadata: Expected class name org.apache.spark.ml.PipelineModel but found 
> class name pyspark.ml.pipeline.PipelineModel
>  at scala.Predef$.require(Predef.scala:224)
>  at 
> org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:638)
>  at 
> org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:267)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
>  at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
>  at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332)
>  ... 50 elided
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2019-09-01 Thread Felix Cheung (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920476#comment-16920476
 ] 

Felix Cheung commented on SPARK-28594:
--

Reviewed; looks reasonable to me. I can help shepherd this work.

 

ping [~srowen] [~vanzin] [~irashid] for feedback.
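
For concreteness, the kind of knob the ticket asks for might look like the 
sketch below; the rolling-related configuration names are hypothetical and 
written only to make the proposal tangible.

{code:scala}
// Hypothetical rolling configuration names, sketched to make the proposal
// concrete; only the first two settings exist today.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")                 // existing
  .set("spark.eventLog.dir", "hdfs:///spark-history")    // existing
  .set("spark.eventLog.rolling.enabled", "true")         // proposed / assumed
  .set("spark.eventLog.rolling.maxFileSize", "128m")     // proposed / assumed
{code}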

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Major
>
> In all current Spark releases, when event logging is enabled for Spark 
> Streaming applications the event logs grow without bound. The files continue 
> to grow until the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617
> addresses .inprogress files but not event logs of applications that are still 
> running.
> The ask: identify a mechanism to set a maximum file size so that the file is 
> rolled over when it reaches this size.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28594) Allow event logs for running streaming apps to be rolled over.

2019-09-01 Thread Felix Cheung (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-28594:
-
Shepherd: Felix Cheung

> Allow event logs for running streaming apps to be rolled over.
> --
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: This has been reported on 2.0.2.22 but affects all 
> currently available versions.
>Reporter: Stephen Levett
>Priority: Major
>
> In all current Spark releases, when event logging is enabled for Spark 
> Streaming applications the event logs grow without bound. The files continue 
> to grow until the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617
> addresses .inprogress files but not event logs of applications that are still 
> running.
> The ask: identify a mechanism to set a maximum file size so that the file is 
> rolled over when it reaches this size.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28942) Spark in local mode hostname display localhost in the Host Column of Task Summary Page

2019-09-01 Thread ABHISHEK KUMAR GUPTA (Jira)
ABHISHEK KUMAR GUPTA created SPARK-28942:


 Summary: Spark in local mode hostname display localhost in the 
Host Column of Task Summary Page
 Key: SPARK-28942
 URL: https://issues.apache.org/jira/browse/SPARK-28942
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.0
Reporter: ABHISHEK KUMAR GUPTA


On the stage page, the Host column of the Task Summary shows 'localhost' 
instead of the driver's host IP or host name.

Steps:

spark-shell --master local

create table emp(id int);

insert into emp values(100);

select * from emp;

Go to the Stage UI page and check the Task Summary page.

The Host column displays 'localhost' instead of the driver host.

 

Note: with spark-shell --master yarn, the UI displays the correct host name 
under this column.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28942) Spark in local mode hostname display localhost in the Host Column of Task Summary Page

2019-09-01 Thread Shivu Sondur (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920427#comment-16920427
 ] 

Shivu Sondur commented on SPARK-28942:
--

I will work on this issue.

> Spark in local mode hostname display localhost in the Host Column of Task 
> Summary Page
> --
>
> Key: SPARK-28942
> URL: https://issues.apache.org/jira/browse/SPARK-28942
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> On the stage page, the Host column of the Task Summary shows 'localhost' 
> instead of the driver's host IP or host name.
> Steps:
> spark-shell --master local
> create table emp(id int);
> insert into emp values(100);
> select * from emp;
> Go to the Stage UI page and check the Task Summary page.
> The Host column displays 'localhost' instead of the driver host.
>  
> Note: with spark-shell --master yarn, the UI displays the correct host name 
> under this column.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28941) Spark Sql Jobs

2019-09-01 Thread Brahmendra (Jira)
Brahmendra created SPARK-28941:
--

 Summary: Spark Sql Jobs
 Key: SPARK-28941
 URL: https://issues.apache.org/jira/browse/SPARK-28941
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Brahmendra
 Fix For: 2.4.3


Hi team,

I need some advice on Spark SQL jobs.

We have 200+ Spark SQL queries running against 7 different Hive tables.

How can we package this into one jar file that executes all 200+ Spark SQL jobs?

Currently we are managing 7 jar files, one per table.
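
This is a usage question rather than a Spark change, but a single driver program 
can run any number of SQL statements from one jar. A minimal sketch (the 
query-file argument and the semicolon delimiter are assumptions for the example):

{code:scala}
// Minimal sketch: read semicolon-separated queries from a file passed as the
// first argument and run them all from a single jar.
import org.apache.spark.sql.SparkSession
import scala.io.Source

object RunAllQueries {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("run-all-sql")
      .enableHiveSupport()
      .getOrCreate()

    val queries = Source.fromFile(args(0)).mkString
      .split(";")
      .map(_.trim)
      .filter(_.nonEmpty)

    // DDL/DML statements (CREATE, INSERT, ...) execute eagerly; plain SELECTs
    // would additionally need an action such as .show() or a write.
    queries.foreach { q => spark.sql(q) }

    spark.stop()
  }
}
{code}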

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28855) Remove outdated Experimental, Evolving annotations

2019-09-01 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28855.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25558
[https://github.com/apache/spark/pull/25558]

> Remove outdated Experimental, Evolving annotations
> --
>
> Key: SPARK-28855
> URL: https://issues.apache.org/jira/browse/SPARK-28855
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> The Experimental and Evolving annotations are both (like Unstable) used to 
> express that an API may change. However, there are many things in the code 
> that have been marked that way since even Spark 1.x. Per the dev@ thread, 
> anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that 
> it would not change without a deprecation cycle.
> Therefore I'd like to remove most of these annotations, leaving them for 
> things that are obviously inherently experimental (ExperimentalMethods), or 
> recently added and still legitimately experimental (DSv2, Barrier mode).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28925) Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and 1.14

2019-09-01 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved SPARK-28925.

Resolution: Duplicate

> Update Kubernetes-client to 4.4.2 to be compatible with Kubernetes 1.13 and 
> 1.14
> 
>
> Key: SPARK-28925
> URL: https://issues.apache.org/jira/browse/SPARK-28925
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Eric
>Priority: Minor
>
> Hello,
> If you use Spark with Kubernetes 1.13 or 1.14 you will see this error:
> {code:java}
> {"time": "2019-08-28T09:56:11.866Z", "lvl":"INFO", "logger": 
> "org.apache.spark.internal.Logging", 
> "thread":"kubernetes-executor-snapshots-subscribers-0","msg":"Going to 
> request 1 executors from Kubernetes."}
> {"time": "2019-08-28T09:56:12.028Z", "lvl":"WARN", "logger": 
> "io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2", 
> "thread":"OkHttp https://kubernetes.default.svc/...","msg":"Exec Failure: 
> HTTP 403, Status: 403 - "}
> java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
> {code}
> Apparently the bug is fixed here: 
> [https://github.com/fabric8io/kubernetes-client/pull/1669]
> We have compiled the Spark source code with kubernetes-client 4.4.2 and 
> it's working great on our cluster. We are using Kubernetes 1.13.10.
>  
> Could it be possible to update that dependency version?
>  
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)

2019-09-01 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920411#comment-16920411
 ] 

Andy Grove commented on SPARK-28921:


[~dongjoon] we are seeing it on both of the EKS clusters where we are running 
Spark jobs. I imagine it affects all EKS clusters?

The versions we are using are 1.11.10 and 1.12.10 .. full version info:
{code:java}
Server Version: version.Info{Major:"1", Minor:"11+", 
GitVersion:"v1.11.10-eks-7f15cc", 
GitCommit:"7f15ccb4e58f112866f7ddcfebf563f199558488", GitTreeState:"clean", 
BuildDate:"2019-08-19T17:46:02Z", GoVersion:"go1.12.9", Compiler:"gc", 
Platform:"linux/amd64"} {code}
{code:java}
Server Version: version.Info{Major:"1", Minor:"12+", 
GitVersion:"v1.12.10-eks-825e5d", 
GitCommit:"825e5de08cb05714f9b224cd6c47d9514df1d1a7", GitTreeState:"clean", 
BuildDate:"2019-08-18T03:58:32Z", GoVersion:"go1.12.9", Compiler:"gc", 
Platform:"linux/amd64"} {code}

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 
> 1.12.10, 1.11.10)
> ---
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Paul Schweigert
>Priority: Major
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE : 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.
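
As noted in SPARK-28925 above, one team worked around this by compiling Spark 
against the newer client. For reference, the fabric8 coordinates are 
io.fabric8:kubernetes-client:4.4.2, shown here in sbt syntax purely as an 
illustration (Spark's own build is Maven, where the version is pinned 
differently).

{code:scala}
// build.sbt fragment (illustrative only): force the newer fabric8 client when
// experimenting against Kubernetes versions that reject the old websocket handshake.
dependencyOverrides += "io.fabric8" % "kubernetes-client" % "4.4.2"
{code}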



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)

2019-09-01 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated SPARK-28921:
---
Summary: Spark jobs failing on latest versions of Kubernetes (1.15.3, 
1.14.6, 1,13.10, 1.12.10, 1.11.10)  (was: Spark jobs failing on latest versions 
of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.11.10))

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 
> 1.12.10, 1.11.10)
> ---
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Paul Schweigert
>Priority: Major
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE : 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.11.10)

2019-09-01 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated SPARK-28921:
---
Summary: Spark jobs failing on latest versions of Kubernetes (1.15.3, 
1.14.6, 1,13.10, 1.11.10)  (was: Spark jobs failing on latest versions of 
Kubernetes (1.15.3, 1.14.6, 1,13.10))

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 
> 1.11.10)
> --
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Paul Schweigert
>Priority: Major
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE : 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28940) Subquery reuse across all subquery levels

2019-09-01 Thread Peter Toth (Jira)
Peter Toth created SPARK-28940:
--

 Summary: Subquery reuse across all subquery levels
 Key: SPARK-28940
 URL: https://issues.apache.org/jira/browse/SPARK-28940
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Peter Toth


Currently subquery reuse doesn't work across all subquery levels.
Here is an example query:
{noformat}
SELECT (SELECT avg(key) FROM testData), (SELECT (SELECT avg(key) FROM testData))
FROM testData
LIMIT 1
{noformat}

where the plan now is:
{noformat}
CollectLimit 1
+- *(1) Project [Subquery scalar-subquery#268, [id=#231] AS 
scalarsubquery()#276, Subquery scalar-subquery#270, [id=#266] AS 
scalarsubquery()#277]
   :  :- Subquery scalar-subquery#268, [id=#231]
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as 
bigint))], output=[avg(key)#272])
   :  : +- Exchange SinglePartition, true, [id=#227]
   :  :+- *(1) HashAggregate(keys=[], 
functions=[partial_avg(cast(key#13 as bigint))], output=[sum#282, count#283L])
   :  :   +- *(1) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :  +- Scan[obj#12]
   :  +- Subquery scalar-subquery#270, [id=#266]
   : +- *(1) Project [Subquery scalar-subquery#269, [id=#263] AS 
scalarsubquery()#275]
   ::  +- Subquery scalar-subquery#269, [id=#263]
   :: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as 
bigint))], output=[avg(key)#274])
   ::+- Exchange SinglePartition, true, [id=#259]
   ::   +- *(1) HashAggregate(keys=[], 
functions=[partial_avg(cast(key#13 as bigint))], output=[sum#286, count#287L])
   ::  +- *(1) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :: +- Scan[obj#12]
   :+- *(1) Scan OneRowRelation[]
   +- *(1) SerializeFromObject
  +- Scan[obj#12]
{noformat}

but it could be:
{noformat}
CollectLimit 1
+- *(1) Project [ReusedSubquery Subquery scalar-subquery#241, [id=#148] AS 
scalarsubquery()#248, Subquery scalar-subquery#242, [id=#164] AS 
scalarsubquery()#249]
   :  :- ReusedSubquery Subquery scalar-subquery#241, [id=#148]
   :  +- Subquery scalar-subquery#242, [id=#164]
   : +- *(1) Project [Subquery scalar-subquery#241, [id=#148] AS 
scalarsubquery()#247]
   ::  +- Subquery scalar-subquery#241, [id=#148]
   :: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as 
bigint))], output=[avg(key)#246])
   ::+- Exchange SinglePartition, true, [id=#144]
   ::   +- *(1) HashAggregate(keys=[], 
functions=[partial_avg(cast(key#13 as bigint))], output=[sum#258, count#259L])
   ::  +- *(1) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :: +- Scan[obj#12]
   :+- *(1) Scan OneRowRelation[]
   +- *(1) SerializeFromObject
  +- Scan[obj#12]
{noformat}
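
Conceptually, the missing piece is "replace later occurrences of a semantically 
equal subquery with a reference to the first one", applied across nesting levels. 
A toy sketch of that dedup-by-canonical-key idea in plain Scala, deliberately 
independent of Spark's optimizer internals:

{code:scala}
// Keep the first occurrence of each canonical key; later occurrences become a
// reference (Right(indexOfFirst)) to the plan they can reuse.
// dedupByCanonicalForm(Seq("a", "b", "a"))(identity) == Seq(Left("a"), Left("b"), Right(0))
def dedupByCanonicalForm[T, K](plans: Seq[T])(canonical: T => K): Seq[Either[T, Int]] = {
  val firstSeenAt = scala.collection.mutable.Map.empty[K, Int]
  plans.zipWithIndex.map { case (plan, idx) =>
    firstSeenAt.get(canonical(plan)) match {
      case Some(first) => Right(first)   // reuse the earlier, equivalent subquery
      case None =>
        firstSeenAt(canonical(plan)) = idx
        Left(plan)                       // first occurrence: keep the plan itself
    }
  }
}
{code}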




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28939) SQL configuration are not always propagated

2019-09-01 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-28939:

Description: 
The SQL configurations must be propagated to executors in order to be effective.
Unfortunately, in some cases we fail to propagate them, making them ineffective.

The problem happens every time {{rdd}} or {{queryExecution.toRdd}} is used, and 
this is pretty frequent in the codebase.

Please notice that there are two parts to this issue:
 - when a user directly uses those APIs
 - when Spark itself invokes them (e.g. throughout MLlib and other usages, or the 
{{describe}} method on the {{Dataset}} class)



  was:
The SQL configurations are propagated to executors in order to be effective.
Unfortunately, in some cases we fail to propagate them, making them 
ineffective.

For an example, please see the {{describe}} method on the {{Dataset}} class.


> SQL configuration are not always propagated
> ---
>
> Key: SPARK-28939
> URL: https://issues.apache.org/jira/browse/SPARK-28939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Marco Gaido
>Priority: Major
>
> The SQL configurations must be propagated to executors in order to be effective.
> Unfortunately, in some cases we fail to propagate them, making them 
> ineffective.
> The problem happens every time {{rdd}} or {{queryExecution.toRdd}} is used, 
> and this is pretty frequent in the codebase.
> Please notice that there are two parts to this issue:
>  - when a user directly uses those APIs
>  - when Spark itself invokes them (e.g. throughout MLlib and other usages, or 
> the {{describe}} method on the {{Dataset}} class)
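
A minimal way to observe the difference from a spark-shell (a sketch; 
{{SQLConf.get}} is an internal API, used here only to see what the executors 
think the setting is):

{code:scala}
import org.apache.spark.sql.internal.SQLConf

spark.conf.set("spark.sql.session.timeZone", "UTC")

// Goes through the SQL execution path, where session confs are propagated.
spark.range(1).collect()

// Drops to the RDD API directly; per this ticket, jobs launched this way may
// not see the session conf on the executors.
val seenOnExecutors = spark.range(1).rdd
  .map(_ => SQLConf.get.getConf(SQLConf.SESSION_LOCAL_TIMEZONE))
  .collect()
{code}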



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28939) SQL configuration are not always propagated

2019-09-01 Thread Marco Gaido (Jira)
Marco Gaido created SPARK-28939:
---

 Summary: SQL configuration are not always propagated
 Key: SPARK-28939
 URL: https://issues.apache.org/jira/browse/SPARK-28939
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Marco Gaido


The SQL configurations must be propagated to executors in order to be effective.
Unfortunately, in some cases we fail to propagate them, making them ineffective.

For an example, please see the {{describe}} method on the {{Dataset}} class.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24352) Flaky test: StandaloneDynamicAllocationSuite

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24352:

Issue Type: Test  (was: Bug)

> Flaky test: StandaloneDynamicAllocationSuite
> 
>
> Key: SPARK-24352
> URL: https://issues.apache.org/jira/browse/SPARK-24352
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> From jenkins:
> [https://amplab.cs.berkeley.edu/jenkins/user/vanzin/my-views/view/Spark/job/spark-branch-2.3-test-maven-hadoop-2.6/384/testReport/junit/org.apache.spark.deploy/StandaloneDynamicAllocationSuite/executor_registration_on_a_blacklisted_host_must_fail/]
>  
> {noformat}
> Error Message
> There is already an RpcEndpoint called CoarseGrainedScheduler
> Stacktrace
>   java.lang.IllegalArgumentException: There is already an RpcEndpoint 
> called CoarseGrainedScheduler
>   at 
> org.apache.spark.rpc.netty.Dispatcher.registerRpcEndpoint(Dispatcher.scala:71)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.setupEndpoint(NettyRpcEnv.scala:130)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.createDriverEndpointRef(CoarseGrainedSchedulerBackend.scala:396)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.start(CoarseGrainedSchedulerBackend.scala:391)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:61)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply$mcV$sp(StandaloneDynamicAllocationSuite.scala:512)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
> {noformat}
> This actually looks like a previous test is leaving some stuff running and 
> making this one fail.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28535) Flaky test: JobCancellationSuite."interruptible iterator of shuffle reader"

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28535:

Issue Type: Test  (was: Bug)

> Flaky test: JobCancellationSuite."interruptible iterator of shuffle reader"
> ---
>
> Key: SPARK-28535
> URL: https://issues.apache.org/jira/browse/SPARK-28535
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> This is the same flakiness as in SPARK-23881, except the fix there didn't 
> really take, at least on our build machines.
> {noformat}
> org.scalatest.exceptions.TestFailedException: 1 was not less than 1
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> {noformat}
> Since that bug is short on explanations, the issue is that there's a race 
> between the thread posting the "stage completed" event to the listener, which 
> unblocks the test, and the thread killing the task in the executor. If the 
> event arrives first, it will unblock task execution, and there's a chance that 
> all elements will actually be processed before the executor has a chance to 
> stop the task.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28418) Flaky Test: pyspark.sql.tests.test_dataframe: test_query_execution_listener_on_collect

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28418:

Issue Type: Test  (was: Bug)

> Flaky Test: pyspark.sql.tests.test_dataframe: 
> test_query_execution_listener_on_collect
> --
>
> Key: SPARK-28418
> URL: https://issues.apache.org/jira/browse/SPARK-28418
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> {code}
> ERROR [0.164s]: test_query_execution_listener_on_collect 
> (pyspark.sql.tests.test_dataframe.QueryExecutionListenerTests)
> --
> Traceback (most recent call last):
>   File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 758, 
> in test_query_execution_listener_on_collect
> "The callback from the query execution listener should be called after 
> 'collect'")
> AssertionError: The callback from the query execution listener should be 
> called after 'collect'
> {code}
> It seems the test can fail because it does not wait for events to be processed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28335) Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset recovery from kafka

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28335:

Issue Type: Test  (was: Bug)

> Flaky test: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite.offset 
> recovery from kafka
> -
>
> Key: SPARK-28335
> URL: https://issues.apache.org/jira/browse/SPARK-28335
> Project: Spark
>  Issue Type: Test
>  Components: DStreams, Tests
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
> Attachments: bad.log
>
>
> {code:java}
> org.scalatest.exceptions.TestFailedException: {} was empty
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply$mcV$sp(DirectKafkaStreamSuite.scala:466)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite$$anonfun$6.apply(DirectKafkaStreamSuite.scala:416)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at or
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28357) Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling compressed

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28357:

Issue Type: Test  (was: Bug)

> Fix Flaky Test - FileAppenderSuite.rolling file appender - size-based rolling 
> compressed
> 
>
> Key: SPARK-28357
> URL: https://issues.apache.org/jira/browse/SPARK-28357
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107553/testReport/org.apache.spark.util/FileAppenderSuite/rolling_file_appender___size_based_rolling__compressed_/



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24898) Adding spark.checkpoint.compress to the docs

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24898:

Issue Type: Improvement  (was: Task)

> Adding spark.checkpoint.compress to the docs
> 
>
> Key: SPARK-24898
> URL: https://issues.apache.org/jira/browse/SPARK-24898
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Riccardo Corbella
>Assignee: Sandeep
>Priority: Trivial
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> Parameter *spark.checkpoint.compress* is not listed under configuration 
> properties.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28261) Flaky test: org.apache.spark.network.TransportClientFactorySuite.reuseClientsUpToConfigVariable

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28261:

Issue Type: Test  (was: Bug)

> Flaky test: 
> org.apache.spark.network.TransportClientFactorySuite.reuseClientsUpToConfigVariable
> ---
>
> Key: SPARK-28261
> URL: https://issues.apache.org/jira/browse/SPARK-28261
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> Error message:
> {noformat}
> java.lang.AssertionError: expected:<3> but was:<4>
> ...{noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28247) Flaky test: "query without test harness" in ContinuousSuite

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28247:

Issue Type: Test  (was: Bug)

> Flaky test: "query without test harness" in ContinuousSuite
> ---
>
> Key: SPARK-28247
> URL: https://issues.apache.org/jira/browse/SPARK-28247
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> This test has failed a few times in some PRs, as well as easy to reproduce 
> locally. Example of a failure:
> {noformat}
>  [info] - query without test harness *** FAILED *** (2 seconds, 931 
> milliseconds)
> [info]   scala.Predef.Set.apply[Int](0, 1, 2, 
> 3).map[org.apache.spark.sql.Row, 
> scala.collection.immutable.Set[org.apache.spark.sql.Row]](((x$3: Int) => 
> org.apache.spark.sql.Row.apply(x$3)))(immutable.this.Set.canBuildFrom[org.apache.spark.sql.Row]).subsetOf(scala.Predef.refArrayOps[org.apache.spark.sql.Row](results).toSet[org.apache.spark.sql.Row])
>  was false
> (ContinuousSuite.scala:226){noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28713) Bump checkstyle from 8.14 to 8.23

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28713:

Issue Type: Improvement  (was: Task)

> Bump checkstyle from 8.14 to 8.23
> -
>
> Key: SPARK-28713
> URL: https://issues.apache.org/jira/browse/SPARK-28713
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> From the GitHub Security Advisory Database:
> Moderate severity vulnerability that affects com.puppycrawl.tools:checkstyle
> Checkstyle prior to 8.18 loads external DTDs by default, which can 
> potentially lead to denial of service attacks or the leaking of confidential 
> information.
> Affected versions: < 8.18



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27596) The JDBC 'query' option doesn't work for Oracle database

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27596:

Issue Type: Bug  (was: Improvement)

> The JDBC 'query' option doesn't work for Oracle database
> 
>
> Key: SPARK-27596
> URL: https://issues.apache.org/jira/browse/SPARK-27596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> For the JDBC option `query`, we generate an alias that starts with an 
> underscore: s"(${subquery}) 
> __SPARK_GEN_JDBC_SUBQUERY_NAME_${curId.getAndIncrement()}". This is not 
> supported by Oracle. 
> Oracle doesn't seem to support identifiers that start with a non-alphabetic 
> character (unless quoted), and it has length restrictions as well.
> https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm
> {code:java}
> Nonquoted identifiers must begin with an alphabetic character from your 
> database character set. Quoted identifiers can begin with any character as 
> per below documentation - 
> Nonquoted identifiers can contain only alphanumeric characters from your 
> database character set and the underscore (_), dollar sign ($), and pound 
> sign (#). Database links can also contain periods (.) and "at" signs (@). 
> Oracle strongly discourages you from using $ and # in nonquoted identifiers.
> {code}
> The alias name '_SPARK_GEN_JDBC_SUBQUERY_NAME' should be fixed to 
> remove the "__" prefix (or be quoted; not sure whether that would impact other 
> sources) to make it work for Oracle. The length should also be limited, as the 
> error below is hit even after removing the prefix:
> {code:java}
> java.sql.SQLSyntaxErrorException: ORA-00972: identifier is too long 
> {code}
> It can be verified using the sqlfiddle link below.
> http://www.sqlfiddle.com/#!4/9bbe9a/10050
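
For reference, the option in question is used like this with the standard 
DataFrameReader API (the connection details below are placeholders); it is the 
alias Spark wraps around the supplied query that Oracle rejects.

{code:scala}
// The 'query' option makes Spark wrap the supplied statement as
//   (SELECT ...) __SPARK_GEN_JDBC_SUBQUERY_NAME_<n>
// and Oracle rejects that generated alias. Connection details are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("user", "scott")
  .option("password", "********")
  .option("query", "SELECT empno, ename FROM emp WHERE deptno = 10")
  .load()
{code}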



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28642) Hide credentials in show create table

2019-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28642:

Issue Type: Bug  (was: Improvement)

> Hide credentials in show create table
> -
>
> Key: SPARK-28642
> URL: https://issues.apache.org/jira/browse/SPARK-28642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> {code:sql}
> spark-sql> show create table mysql_federated_sample;
> CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, 
> `DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, 
> `SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` 
> STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN)
> USING org.apache.spark.sql.jdbc
> OPTIONS (
> `url` 'jdbc:mysql://localhost/hive?user=root=mypasswd',
> `driver` 'com.mysql.jdbc.Driver',
> `dbtable` 'TBLS'
> )
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org