[jira] [Created] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception

2018-08-28 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-25268:
---

 Summary: runParallelPersonalizedPageRank throws serialization 
Exception
 Key: SPARK-25268
 URL: https://issues.apache.org/jira/browse/SPARK-25268
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.4.0
Reporter: Bago Amirbekian


A recent change to PageRank introduced a bug in the 
ParallelPersonalizedPageRank implementation. The change prevents serialization 
of a Map which needs to be broadcast to all workers. The issue is in this line: 
[https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201]

Because GraphX unit tests are run in local mode, the serialization issue is 
not caught.
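
For reference, a minimal sketch of the failure mode and one possible fix, assuming the non-serializable Map comes from a `mapValues` call (the value type below is a stand-in, not the actual PageRank code):

{code:scala}
// In Scala 2.11/2.12, Map#mapValues returns a lazy view of type
// scala.collection.immutable.MapLike$$anon$2, which is not java.io.Serializable,
// so broadcasting it with the Java serializer fails as in the stack trace below.
val sources = Seq(1L, 2L, 3L)

val viewMap = sources.zipWithIndex.toMap   // Map[Long, Int]
  .mapValues(i => Array.tabulate(sources.size)(j => if (j == i) 1.0 else 0.0))

// One possible fix: materialize a concrete, serializable Map before broadcasting.
val concreteMap = viewMap.map(identity)
// sc.broadcast(concreteMap)               // sc: an existing SparkContext
{code}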

 
{code:java}
[info] - Star example parallel personalized PageRank *** FAILED *** (2 seconds, 
160 milliseconds)
[info] java.io.NotSerializableException: 
scala.collection.immutable.MapLike$$anon$2
[info] Serialization stack:
[info] - object not serializable (class: 
scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> 
SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> 
SparseVector(3)((2,1.0
[info] at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
[info] at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
[info] at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
[info] at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
[info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
[info] at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292)
[info] at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
[info] at 
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
[info] at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
[info] at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
[info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
[info] at 
org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205)
[info] at 
org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31)
[info] at 
org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115)
[info] at 
org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84)
[info] at 
org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62)
[info] at 
org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
[info] at 
org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
[info] at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40)
[info] at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info] at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info] at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info] at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info] at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
[info] at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
[info] at scala.collection.immutable.List.foreach(List.scala:383)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
[info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
[info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
[info] at ...
{code}

[jira] [Resolved] (SPARK-25235) Merge the REPL code in Scala 2.11 and 2.12 branches

2018-08-28 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-25235.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22246
[https://github.com/apache/spark/pull/22246]

> Merge the REPL code in Scala 2.11 and 2.12 branches
> ---
>
> Key: SPARK-25235
> URL: https://issues.apache.org/jira/browse/SPARK-25235
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.3.1
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Use some reflection tricks to merge the Scala 2.11 and 2.12 REPL codebases.






[jira] [Commented] (SPARK-25248) Audit barrier APIs for Spark 2.4

2018-08-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595891#comment-16595891
 ] 

Apache Spark commented on SPARK-25248:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/22261

> Audit barrier APIs for Spark 2.4
> 
>
> Key: SPARK-25248
> URL: https://issues.apache.org/jira/browse/SPARK-25248
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> Make a pass over APIs added for barrier execution mode.






[jira] [Resolved] (SPARK-25112) Spark history cleaner does not work.

2018-08-28 Thread Gu Chao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gu Chao resolved SPARK-25112.
-
Resolution: Invalid

Sorry, it's not a problem.

> Spark history cleaner does not work.
> 
>
> Key: SPARK-25112
> URL: https://issues.apache.org/jira/browse/SPARK-25112
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gu Chao
>Priority: Major
>  Labels: newbie
>
> spark-env.sh configuration:
> {code:java}
> export SPARK_DAEMON_MEMORY=1g
> export 
> SPARK_HISTORY_OPTS="-Dspark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
>  -Dspark.history.fs.logDirectory=hdfs://xxx:9000/spark-logs 
> -Dspark.history.fs.cleaner.enabled=true 
> -Dspark.history.fs.cleaner.interval=1h 
> -Dspark.history.fs.cleaner.maxAge=7d"{code}
> hdfs /spark-logs:
> {code:java}
> -rwxrwx---   2 root supergroup  0 2018-08-08 12:25 
> /spark-logs/app-20180804141138-0546.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 16:02 
> /spark-logs/app-20180804150203-0547.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 16:02 
> /spark-logs/app-20180804150204-0548.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 17:02 
> /spark-logs/app-20180804160203-0549.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 17:02 
> /spark-logs/app-20180804160204-0550.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 18:02 
> /spark-logs/app-20180804170203-0551.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 18:02 
> /spark-logs/app-20180804170204-0552.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 19:02 
> /spark-logs/app-20180804180203-0553.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 19:02 
> /spark-logs/app-20180804180204-0554.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 20:02 
> /spark-logs/app-20180804190203-0555.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 20:02 
> /spark-logs/app-20180804190204-0556.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 21:02 
> /spark-logs/app-20180804200203-0557.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 21:02 
> /spark-logs/app-20180804200204-0558.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 22:02 
> /spark-logs/app-20180804210203-0559.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 22:02 
> /spark-logs/app-20180804210204-0560.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 23:02 
> /spark-logs/app-20180804220203-0561.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-04 23:02 
> /spark-logs/app-20180804220204-0562.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 00:02 
> /spark-logs/app-20180804230203-0563.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 00:02 
> /spark-logs/app-20180804230204-0564.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 01:02 
> /spark-logs/app-20180805000203-0565.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 01:02 
> /spark-logs/app-20180805000204-0566.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 02:02 
> /spark-logs/app-20180805010203-0567.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 02:02 
> /spark-logs/app-20180805010204-0568.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 03:02 
> /spark-logs/app-20180805020203-0569.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 03:02 
> /spark-logs/app-20180805020204-0570.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 04:02 
> /spark-logs/app-20180805030203-0571.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 04:02 
> /spark-logs/app-20180805030204-0572.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 05:02 
> /spark-logs/app-20180805040203-0573.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 05:02 
> /spark-logs/app-20180805040204-0574.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 06:02 
> /spark-logs/app-20180805050203-0575.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 06:02 
> /spark-logs/app-20180805050204-0576.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 07:02 
> /spark-logs/app-20180805060203-0577.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 07:02 
> /spark-logs/app-20180805060204-0578.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 08:02 
> /spark-logs/app-20180805070203-0579.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 08:02 
> /spark-logs/app-20180805070204-0580.inprogress
> -rwxrwx---   2 root supergroup  0 2018-08-05 09:02 
> ...
> {code}

[jira] [Comment Edited] (SPARK-19490) Hive partition columns are case-sensitive

2018-08-28 Thread Harish (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595852#comment-16595852
 ] 

Harish edited comment on SPARK-19490 at 8/29/18 2:19 AM:
-

I see the same issue in 2.3.1. After changing the column to lower case (in the 
select statement), the join works fine.
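
For illustration, a hypothetical sketch of that workaround, using the reproduction table from the issue description below (not the actual query that failed):

{code:scala}
// Fails on affected versions: the upper-cased name is not recognized as a
// partition pruning predicate.
spark.sql("SELECT * FROM partition_test WHERE DATE = '20170101'")

// Works around it: reference the partition column by its real, lower-case name.
spark.sql("SELECT * FROM partition_test WHERE date = '20170101'")
{code}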


was (Author: harishk15):
I

> Hive partition columns are case-sensitive
> -
>
> Key: SPARK-19490
> URL: https://issues.apache.org/jira/browse/SPARK-19490
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Priority: Major
>
> The real partition columns are lower case (year, month, day).
> {code}
> Caused by: java.lang.RuntimeException: Expected only partition pruning 
> predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> {code}
> These SQL statements can reproduce this bug:
> CREATE TABLE partition_test (key Int) partitioned by (date string)
> SELECT * FROM partition_test where DATE = '20170101'






[jira] [Commented] (SPARK-19490) Hive partition columns are case-sensitive

2018-08-28 Thread Harish (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595852#comment-16595852
 ] 

Harish commented on SPARK-19490:


I

> Hive partition columns are case-sensitive
> -
>
> Key: SPARK-19490
> URL: https://issues.apache.org/jira/browse/SPARK-19490
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Priority: Major
>
> The real partition columns are lower case (year, month, day).
> {code}
> Caused by: java.lang.RuntimeException: Expected only partition pruning 
> predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> {code}
> These SQL statements can reproduce this bug:
> CREATE TABLE partition_test (key Int) partitioned by (date string)
> SELECT * FROM partition_test where DATE = '20170101'






[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-28 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595835#comment-16595835
 ] 

Bryan Cutler commented on SPARK-23874:
--

[~smilegator] I linked the most relevant pyarrow issues above, which deal 
mostly with binary type support. If I get the time, I'll go through the 
changelog here https://arrow.apache.org/release/0.10.0.html and look for other 
relevant ones.

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  






[jira] [Assigned] (SPARK-25253) Refactor pyspark connection & authentication

2018-08-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25253:


Assignee: Imran Rashid

> Refactor pyspark connection & authentication
> 
>
> Key: SPARK-25253
> URL: https://issues.apache.org/jira/browse/SPARK-25253
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.4.0
>
>
> We've got a few places in pyspark that connect to local sockets, with varying 
> levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. 
> It should be pretty easy to clean this up.






[jira] [Resolved] (SPARK-25253) Refactor pyspark connection & authentication

2018-08-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25253.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22247
[https://github.com/apache/spark/pull/22247]

> Refactor pyspark connection & authentication
> 
>
> Key: SPARK-25253
> URL: https://issues.apache.org/jira/browse/SPARK-25253
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.4.0
>
>
> We've got a few places in pyspark that connect to local sockets, with varying 
> levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. 
> It should be pretty easy to clean this up.






[jira] [Updated] (SPARK-25266) Fix memory leak in Barrier Execution Mode

2018-08-28 Thread Kousuke Saruta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-25266:
---
Description: 
BarrierCoordinator uses Timer and TimerTask. `TimerTask#cancel()` is invoked in 
ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.

Once a TimerTask is scheduled, the reference to it is not released until 
`Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.

 

  was:
BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.

Once a TimerTask is scheduled, the reference to it is not released until 
`Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.

 


> Fix memory leak in Barrier Execution Mode
> -
>
> Key: SPARK-25266
> URL: https://issues.apache.org/jira/browse/SPARK-25266
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Critical
>
> BarrierCoordinator uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
> in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.
> Once a TimerTask is scheduled, the reference to it is not released until 
> `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.
>  
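
A generic sketch of the leak and the proposed fix (plain `java.util.Timer` usage, not the BarrierCoordinator code; the timer name and delay below are made up):

{code:scala}
import java.util.{Timer, TimerTask}

val timer = new Timer("barrier-epoch-timer", true)

val task = new TimerTask {
  override def run(): Unit = println("timeout")
}
timer.schedule(task, 60000L)

task.cancel()   // the task will never run, but the Timer's queue still references it
timer.purge()   // removes cancelled tasks from the queue, releasing the reference
{code}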






[jira] [Updated] (SPARK-25266) Fix memory leak in Barrier Execution Mode

2018-08-28 Thread Kousuke Saruta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-25266:
---
Summary: Fix memory leak in Barrier Execution Mode  (was: Fix memory leak 
vulnerability in Barrier Execution Mode)

> Fix memory leak in Barrier Execution Mode
> -
>
> Key: SPARK-25266
> URL: https://issues.apache.org/jira/browse/SPARK-25266
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Critical
>
> BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
> in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.
> Once a TimerTask is scheduled, the reference to it is not released until 
> `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.
>  






[jira] [Assigned] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType

2018-08-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25260:


Assignee: Arun Mahadevan

> Fix namespace handling in SchemaConverters.toAvroType
> -
>
> Key: SPARK-25260
> URL: https://issues.apache.org/jira/browse/SPARK-25260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
> Fix For: 2.4.0
>
>
> `toAvroType` converts a Spark data type to an Avro schema. It always appends the 
> record name to the namespace, so it's impossible to have an Avro namespace 
> independent of the record name.
>  
> When invoked with a Spark data type like the following,
>  
> {code:java}
> val sparkSchema = StructType(Seq(
>     StructField("name", StringType, nullable = false),
>     StructField("address", StructType(Seq(
>         StructField("city", StringType, nullable = false),
>         StructField("state", StringType, nullable = false))),
>     nullable = false)))
>  
> // map it to an avro schema with top level namespace "foo.bar",
> val avroSchema = SchemaConverters.toAvroType(sparkSchema,  false, "employee", 
> "foo.bar")
> // result is
> // avroSchema.getName = employee
> // avroSchema.getNamespace = foo.bar.employee
> // avroSchema.getFullname = foo.bar.employee.employee
>  
> {code}
>  






[jira] [Resolved] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType

2018-08-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25260.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22251
[https://github.com/apache/spark/pull/22251]

> Fix namespace handling in SchemaConverters.toAvroType
> -
>
> Key: SPARK-25260
> URL: https://issues.apache.org/jira/browse/SPARK-25260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
> Fix For: 2.4.0
>
>
> `toAvroType` converts a Spark data type to an Avro schema. It always appends the 
> record name to the namespace, so it's impossible to have an Avro namespace 
> independent of the record name.
>  
> When invoked with a Spark data type like the following,
>  
> {code:java}
> val sparkSchema = StructType(Seq(
>     StructField("name", StringType, nullable = false),
>     StructField("address", StructType(Seq(
>         StructField("city", StringType, nullable = false),
>         StructField("state", StringType, nullable = false))),
>     nullable = false)))
>  
> // map it to an avro schema with top level namespace "foo.bar",
> val avroSchema = SchemaConverters.toAvroType(sparkSchema,  false, "employee", 
> "foo.bar")
> // result is
> // avroSchema.getName = employee
> // avroSchema.getNamespace = foo.bar.employee
> // avroSchema.getFullname = foo.bar.employee.employee
>  
> {code}
>  






[jira] [Assigned] (SPARK-22357) SparkContext.binaryFiles ignore minPartitions parameter

2018-08-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-22357:
-

Assignee: Bo Meng

> SparkContext.binaryFiles ignore minPartitions parameter
> ---
>
> Key: SPARK-22357
> URL: https://issues.apache.org/jira/browse/SPARK-22357
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Weichen Xu
>Assignee: Bo Meng
>Priority: Major
> Fix For: 2.4.0
>
>
> This is a bug in binaryFiles - even though we give it the minPartitions 
> argument, binaryFiles ignores it.
> This is a bug introduced in Spark 2.1 (coming from Spark 2.0): in 
> PortableDataStream.scala the argument “minPartitions” is no longer used (with 
> the push to master on 11/7/6):
> {code}
> /**
>  * Allow minPartitions set by end-user in order to keep compatibility with old
>  * Hadoop API, which is set through setMaxSplitSize.
>  */
> def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
>   val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
>   val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
>   val defaultParallelism = sc.defaultParallelism
>   val files = listStatus(context).asScala
>   val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
>   val bytesPerCore = totalBytes / defaultParallelism
>   val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
>   super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The code previously, in version 2.0, was:
> {code}
> def setMinPartitions(context: JobContext, minPartitions: Int) {
>   val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
>   val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
>   super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The new code is very smart, but it ignores what the user passes in and uses 
> the data size, which is kind of a breaking change in some sense.
> In our specific case this was a problem, because we initially read in just 
> the file names, and only after that does the dataframe become very large, when 
> reading in the images themselves - and in this case the new code does not 
> handle the partitioning very well.
> I'm not sure if it can be easily fixed because I don't understand the full 
> context of the change in Spark (but at the very least the unused parameter 
> should be removed to avoid confusion).
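
For illustration, a hypothetical call showing the reported behaviour (the path and value are made up):

{code:scala}
// minPartitions is expected to influence the number of partitions, but on the
// affected versions the split size is derived from the total data size instead.
val rdd = sc.binaryFiles("hdfs:///path/to/images", minPartitions = 128)
println(rdd.getNumPartitions)   // not driven by the requested 128
{code}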






[jira] [Resolved] (SPARK-22357) SparkContext.binaryFiles ignore minPartitions parameter

2018-08-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22357.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21638
[https://github.com/apache/spark/pull/21638]

> SparkContext.binaryFiles ignore minPartitions parameter
> ---
>
> Key: SPARK-22357
> URL: https://issues.apache.org/jira/browse/SPARK-22357
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Weichen Xu
>Assignee: Bo Meng
>Priority: Major
> Fix For: 2.4.0
>
>
> This is a bug in binaryFiles - even though we give it the minPartitions 
> argument, binaryFiles ignores it.
> This is a bug introduced in Spark 2.1 (coming from Spark 2.0): in 
> PortableDataStream.scala the argument “minPartitions” is no longer used (with 
> the push to master on 11/7/6):
> {code}
> /**
>  * Allow minPartitions set by end-user in order to keep compatibility with old
>  * Hadoop API, which is set through setMaxSplitSize.
>  */
> def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
>   val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
>   val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
>   val defaultParallelism = sc.defaultParallelism
>   val files = listStatus(context).asScala
>   val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
>   val bytesPerCore = totalBytes / defaultParallelism
>   val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
>   super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The code previously, in version 2.0, was:
> {code}
> def setMinPartitions(context: JobContext, minPartitions: Int) {
>   val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
>   val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
>   super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The new code is very smart, but it ignores what the user passes in and uses 
> the data size, which is kind of a breaking change in some sense.
> In our specific case this was a problem, because we initially read in just 
> the file names, and only after that does the dataframe become very large, when 
> reading in the images themselves - and in this case the new code does not 
> handle the partitioning very well.
> I'm not sure if it can be easily fixed because I don't understand the full 
> context of the change in Spark (but at the very least the unused parameter 
> should be removed to avoid confusion).






[jira] [Resolved] (SPARK-25212) Support Filter in ConvertToLocalRelation

2018-08-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25212.
-
   Resolution: Fixed
 Assignee: Bogdan Raducanu
Fix Version/s: 2.4.0

> Support Filter in ConvertToLocalRelation
> 
>
> Key: SPARK-25212
> URL: https://issues.apache.org/jira/browse/SPARK-25212
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Major
> Fix For: 2.4.0
>
>
> ConvertToLocalRelation can make short queries faster but currently it only 
> supports Project and Limit.
> It can be extended with other operators such as Filter.
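
A hypothetical illustration of what supporting Filter would enable (assumes a `SparkSession` named `spark` is in scope):

{code:scala}
import spark.implicits._

// A Filter over a local, in-memory relation can be evaluated during
// optimization, so the optimized plan can collapse to a LocalRelation that
// already contains only the matching rows.
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name").filter($"id" > 1)
println(df.queryExecution.optimizedPlan)
{code}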






[jira] [Commented] (SPARK-25258) Upgrade kryo package to version 4.0.2+

2018-08-28 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595650#comment-16595650
 ] 

Sean Owen commented on SPARK-25258:
---

Both of these suggest Kryo 4.

The issue is whether it causes any breaking changes. I don't know of any 
problems so far, but that's the kind of input that would be great to confirm a 
change like this, from those who may be watching Kryo closely.

> Upgrade kryo package to version 4.0.2+
> --
>
> Key: SPARK-25258
> URL: https://issues.apache.org/jira/browse/SPARK-25258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: liupengcheng
>Priority: Major
>
> Recently, we encountered a Kryo performance issue in Spark 2.1.0. The issue 
> affects all Kryo versions below 4.0.2, so it seems that all Spark versions 
> might encounter it.
> Issue description:
> In the shuffle write phase or in some spilling operations, Spark will use the 
> Kryo serializer to serialize data if `spark.serializer` is set to 
> `KryoSerializer`. However, when the data contains some extremely large records, 
> KryoSerializer's MapReferenceResolver would be expanded, and its `reset` 
> method will take a long time to reset all items in the writtenObjects table to 
> null.
> com.esotericsoftware.kryo.util.MapReferenceResolver
> {code:java}
> public void reset () {
>  readObjects.clear();
>  writtenObjects.clear();
> }
> public void clear () {
>  K[] keyTable = this.keyTable;
>  for (int i = capacity + stashSize; i-- > 0;)
>   keyTable[i] = null;
>  size = 0;
>  stashSize = 0;
> }
> {code}
> I checked the Kryo project on GitHub, and this issue seems to be fixed in 4.0.2+:
> [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28]
>  
> I was wondering if we can upgrade Spark's Kryo dependency to 4.0.2+ to fix 
> this problem.
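
For context, the issue described above only applies when Kryo is the configured serializer, e.g. (a hypothetical setup, not from the report):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
{code}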






[jira] [Commented] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12

2018-08-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595647#comment-16595647
 ] 

Apache Spark commented on SPARK-25044:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22259

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 
> Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> == Results ==
> !== Correct Answer - 3 == == Spark Answer - 3 ==
> !struct<> struct
> ![0,10,null] [0,10,0]
> ![1,12,null] [1,12,1]
> ![2,14,null] [2,14,2] (QueryTest.scala:163){code}
> You can kind of get what's going on by reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
>   // assume(!ClosureCleanerSuite2.supportsLMFs)
>   // This test won't test what it intends to in 2.12, as lambda metafactory
>   // closures have arg types that are not primitive, but Object
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
>     .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>     .withColumn("c", udf1($"a", lit(null)))
>   val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
>   comparePlans(df.logicalPlan, plan)
>   checkAnswer(
>     df,
>     Seq(
>       Row(0, 10, null),
>       Row(1, 12, null),
>       Row(2, 14, null)))
> }{code}
>  
> It seems that the closure that is fed in as a UDF changes behavior, in a way 
> that primitive-type arguments are handled differently. For example an Int 
> argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are 
> understood, but not exactly sure of the cause yet.






[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL

2018-08-28 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595633#comment-16595633
 ] 

Steve Loughran commented on SPARK-20799:


[~jzijlstra] yes, the final listing.

Note that in HADOOP-14833 I'm removing the user:secret-in-URI feature. It's 
just too dangerous; it ends up in logs and in support JIRAs. Per-bucket 
options in hadoop-2.8 and JCEKS files in 2.9+ give you much better security and 
work well with things.

> Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Hadoop 2.8.0 binaries
>Reporter: Jork Zijlstra
>Priority: Minor
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in the PartitioningAwareFileIndex def allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no 
> data is read and the schema cannot be defined.
> Spark does output the warning from S3xLoginHelper:90 ("The Filesystem URI 
> contains login details. This is insecure and may be unsupported in future."), 
> but this should not mean that it shouldn't work anymore.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}






[jira] [Resolved] (SPARK-25004) Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS

2018-08-28 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25004.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21977
[https://github.com/apache/spark/pull/21977]

> Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS
> --
>
> Key: SPARK-25004
> URL: https://issues.apache.org/jira/browse/SPARK-25004
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> Some platforms support limiting Python's addressable memory space by limiting 
> [{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS].
> We've found that adding a limit is very useful when running in YARN because 
> when Python doesn't know about memory constraints, it doesn't know when to 
> garbage collect and will continue using memory when it doesn't need to. 
> Adding a limit reduces PySpark memory consumption and avoids YARN killing 
> containers because Python hasn't cleaned up memory.
> This also improves error messages for users, allowing them to see when Python 
> is allocating too much memory instead of YARN killing the container:
> {code:lang=python}
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in 
> fe_engineer
> fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
> comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, 
> []), mat_rec_prep.get(item, []))
>   File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in 
> leven_list_compare
> permutations = sorted(permutations, reverse=True)
>   MemoryError
> {code}
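
A hypothetical example of using the new setting once available (the "2g" value is an example, not from the issue):

{code:scala}
import org.apache.spark.sql.SparkSession

// Cap the addressable memory of the Python worker per executor so Python
// garbage-collects before YARN kills the container.
val spark = SparkSession.builder()
  .config("spark.executor.pyspark.memory", "2g")
  .getOrCreate()
{code}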






[jira] [Assigned] (SPARK-25004) Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS

2018-08-28 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25004:
--

Assignee: Ryan Blue

> Add spark.executor.pyspark.memory config to set resource.RLIMIT_AS
> --
>
> Key: SPARK-25004
> URL: https://issues.apache.org/jira/browse/SPARK-25004
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> Some platforms support limiting Python's addressable memory space by limiting 
> [{{resource.RLIMIT_AS}}|https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS].
> We've found that adding a limit is very useful when running in YARN because 
> when Python doesn't know about memory constraints, it doesn't know when to 
> garbage collect and will continue using memory when it doesn't need to. 
> Adding a limit reduces PySpark memory consumption and avoids YARN killing 
> containers because Python hasn't cleaned up memory.
> This also improves error messages for users, allowing them to see when Python 
> is allocating too much memory instead of YARN killing the container:
> {code:lang=python}
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in 
> fe_engineer
> fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
>   File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
> comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, 
> []), mat_rec_prep.get(item, []))
>   File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in 
> leven_list_compare
> permutations = sorted(permutations, reverse=True)
>   MemoryError
> {code}






[jira] [Commented] (SPARK-25267) Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive

2018-08-28 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595450#comment-16595450
 ] 

Dilip Biswal commented on SPARK-25267:
--

[~smilegator] Thanks Sean.. will take a look.

> Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive
> -
>
> Key: SPARK-25267
> URL: https://issues.apache.org/jira/browse/SPARK-25267
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> In SharedSparkSession and TestHive, we need to disable the rule 
> ConvertToLocalRelation for better test case coverage. To exclude the rules, 
> we can use the SQLConf `spark.sql.optimizer.excludedRules`






[jira] [Resolved] (SPARK-25240) A deadlock in ALTER TABLE RECOVER PARTITIONS

2018-08-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25240.
-
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 2.4.0

> A deadlock in ALTER TABLE RECOVER PARTITIONS
> 
>
> Key: SPARK-25240
> URL: https://issues.apache.org/jira/browse/SPARK-25240
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Recover Partitions in ALTER TABLE is performed in a recursive way by calling 
> the scanPartitions() method. scanPartitions() lists files sequentially or in 
> parallel if the 
> [condition|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L685]
>  is true:
> {code:scala}
> partitionNames.length > 1 && statuses.length > threshold || 
> partitionNames.length > 2
> {code}
> Parallel listing is executed on [the fixed thread 
> pool|https://github.com/apache/spark/blob/131ca146ed390cd0109cd6e8c95b61e418507080/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L622]
>  which can have 8 threads in total. A deadlock occurs when all 8 threads have 
> already been occupied and a recursive call of scanPartitions() submits new 
> parallel file listing tasks.
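
A generic sketch of the deadlock pattern being described (not the ddl.scala code): tasks on a fixed pool block on futures that can only run on the same pool.

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

def scan(depth: Int): Seq[String] = {
  if (depth == 0) Seq("leaf")
  else {
    // Each level submits its children to the same pool and blocks waiting for
    // them. Once 8 outer calls occupy all threads, the inner futures can never
    // be scheduled, so every thread waits forever.
    val children = (1 to 4).map(_ => Future(scan(depth - 1)))
    children.flatMap(f => Await.result(f, Duration.Inf))
  }
}
{code}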






[jira] [Commented] (SPARK-25267) Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive

2018-08-28 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595388#comment-16595388
 ] 

Xiao Li commented on SPARK-25267:
-

cc [~dkbiswal]

> Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive
> -
>
> Key: SPARK-25267
> URL: https://issues.apache.org/jira/browse/SPARK-25267
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> In SharedSparkSession and TestHive, we need to disable the rule 
> ConvertToLocalRelation for better test case coverage. To exclude the rules, 
> we can use the SQLConf `spark.sql.optimizer.excludedRules`






[jira] [Created] (SPARK-25267) Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive

2018-08-28 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25267:
---

 Summary: Disable ConvertToLocalRelation in the test cases of 
sql/core and sql/hive
 Key: SPARK-25267
 URL: https://issues.apache.org/jira/browse/SPARK-25267
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Xiao Li


In SharedSparkSession and TestHive, we need to disable the rule 
ConvertToLocalRelation for better test case coverage. To exclude the rules, we 
can use the SQLConf `spark.sql.optimizer.excludedRules`
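
A minimal sketch of that approach, assuming the fully-qualified rule name below is the one to exclude:

{code:scala}
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
{code}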






[jira] [Commented] (SPARK-25266) Fix memory leak vulnerability in Barrier Execution Mode

2018-08-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595367#comment-16595367
 ] 

Apache Spark commented on SPARK-25266:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/22258

> Fix memory leak vulnerability in Barrier Execution Mode
> ---
>
> Key: SPARK-25266
> URL: https://issues.apache.org/jira/browse/SPARK-25266
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Critical
>
> BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
> in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.
> Once a TimerTask is scheduled, the reference to it is not released until 
> `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.
>  






[jira] [Assigned] (SPARK-25266) Fix memory leak vulnerability in Barrier Execution Mode

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25266:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Fix memory leak vulnerability in Barrier Execution Mode
> ---
>
> Key: SPARK-25266
> URL: https://issues.apache.org/jira/browse/SPARK-25266
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Critical
>
> BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
> in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.
> Once a TimerTask is scheduled, the reference to it is not released until 
> `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.
>  






[jira] [Assigned] (SPARK-25266) Fix memory leak vulnerability in Barrier Execution Mode

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25266:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Fix memory leak vulnerability in Barrier Execution Mode
> ---
>
> Key: SPARK-25266
> URL: https://issues.apache.org/jira/browse/SPARK-25266
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Critical
>
> BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
> in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.
> Once a TimerTask is scheduled, the reference to it is not released until 
> `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.
>  






[jira] [Commented] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect

2018-08-28 Thread Vasanthkumar Velayudham (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595363#comment-16595363
 ] 

Vasanthkumar Velayudham commented on SPARK-25219:
-

Hi [~mgaido] - Thanks for your response!

Please find attached the Excel file that contains the details of the text data 
used for clustering in Spark and SKLearn. The workbook contains two sheets - one 
with the Spark results and the other with the SKLearn results.

I have also attached the code that I used for the clustering. As you will 
notice, the code is a straightforward implementation of the text 
transformations.

In the results, you can observe that the SKLearn clustering is more uniform, 
whereas the Spark results contain one cluster with almost 44 records.

Please let me know if you require any additional details.
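
For reference, a minimal sketch of the transformation pipeline described in this 
report (standard Spark ML stages; the input DataFrame textDF, the feature sizes, 
and k are illustrative assumptions, not the attached code):
{code:java}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, HashingTF, IDF}

// textDF is assumed to be a DataFrame with a string column "text".
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
val remover   = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
val hashingTF = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val kmeans    = new KMeans().setK(10).setSeed(1L).setFeaturesCol("features")

val model     = new Pipeline().setStages(Array(tokenizer, remover, hashingTF, idf, kmeans)).fit(textDF)
val clustered = model.transform(textDF)   // adds a "prediction" column with the cluster id
{code}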

> KMeans Clustering - Text Data - Results are incorrect
> -
>
> Key: SPARK-25219
> URL: https://issues.apache.org/jira/browse/SPARK-25219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Vasanthkumar Velayudham
>Priority: Major
> Attachments: Apache_Logs_Results.xlsx, SKLearn_Kmeans.txt, 
> Spark_Kmeans.txt
>
>
> Hello Everyone,
> I am facing issues with the usage of KMeans clustering on my text data. When 
> I apply clustering to my text data, after performing various transformations 
> such as RegexTokenizer, stopword processing, HashingTF, and IDF, the generated 
> clusters are not proper and one cluster is found to have a lot of data points 
> assigned to it.
> I am able to perform clustering with a similar kind of processing and the 
> same attributes using the SKLearn KMeans algorithm. 
> Upon searching the internet, I observe that many have reported the same issue 
> with Spark's KMeans clustering library.
> Request your help in fixing this issue.
> Please let me know if you require any additional details.






[jira] [Updated] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect

2018-08-28 Thread Vasanthkumar Velayudham (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vasanthkumar Velayudham updated SPARK-25219:

Attachment: SKLearn_Kmeans.txt
Apache_Logs_Results.xlsx
Spark_Kmeans.txt

> KMeans Clustering - Text Data - Results are incorrect
> -
>
> Key: SPARK-25219
> URL: https://issues.apache.org/jira/browse/SPARK-25219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Vasanthkumar Velayudham
>Priority: Major
> Attachments: Apache_Logs_Results.xlsx, SKLearn_Kmeans.txt, 
> Spark_Kmeans.txt
>
>
> Hello Everyone,
> I am facing issues with the usage of KMeans clustering on my text data. When 
> I apply clustering to my text data, after performing various transformations 
> such as RegexTokenizer, stopword processing, HashingTF, and IDF, the generated 
> clusters are not proper and one cluster is found to have a lot of data points 
> assigned to it.
> I am able to perform clustering with a similar kind of processing and the 
> same attributes using the SKLearn KMeans algorithm. 
> Upon searching the internet, I observe that many have reported the same issue 
> with Spark's KMeans clustering library.
> Request your help in fixing this issue.
> Please let me know if you require any additional details.






[jira] [Updated] (SPARK-24704) The order of stages in the DAG graph is incorrect

2018-08-28 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24704:
---
Fix Version/s: 2.3.2

> The order of stages in the DAG graph is incorrect
> -
>
> Key: SPARK-24704
> URL: https://issues.apache.org/jira/browse/SPARK-24704
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: StanZhai
>Assignee: StanZhai
>Priority: Minor
>  Labels: regression
> Fix For: 2.3.2, 2.4.0
>
> Attachments: WX20180630-161907.png
>
>
> The regression was introduced in Spark 2.3.0.






[jira] [Created] (SPARK-25266) Fix memory leak vulnerability in Barrier Execution Mode

2018-08-28 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-25266:
--

 Summary: Fix memory leak vulnerability in Barrier Execution Mode
 Key: SPARK-25266
 URL: https://issues.apache.org/jira/browse/SPARK-25266
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 2.4.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.

Once a TimerTask is scheduled, the reference to it is not released until 
`Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.
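
A minimal sketch of the leak pattern and the purge-based remedy (illustrative 
only; not the actual BarrierCoordinator code):
{code:java}
import java.util.{Timer, TimerTask}

val timer = new Timer("barrier-epoch-timeout-timer")

// Schedule a timeout task, as ContextBarrierState does for each barrier epoch.
val task = new TimerTask {
  override def run(): Unit = { /* time out the barrier sync */ }
}
timer.schedule(task, 60000L)

// cancel() only marks the task as cancelled; the Timer's internal queue still
// holds a reference to it until the queue is purged (or the task's time comes).
task.cancel()

// purge() removes cancelled tasks from the queue, releasing the references.
timer.purge()
{code}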

 






[jira] [Created] (SPARK-25265) Fix memory leak vulnerability in Barrier Execution Mode

2018-08-28 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-25265:
--

 Summary: Fix memory leak vulnerability in Barrier Execution Mode
 Key: SPARK-25265
 URL: https://issues.apache.org/jira/browse/SPARK-25265
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 2.4.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.

Once a TimerTask is scheduled, the reference to it is not released until 
`Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.

 






[jira] [Resolved] (SPARK-25119) stages in wrong order within job page DAG chart

2018-08-28 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25119.

Resolution: Duplicate

> stages in wrong order within job page DAG chart
> ---
>
> Key: SPARK-25119
> URL: https://issues.apache.org/jira/browse/SPARK-25119
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Yunjian Zhang
>Priority: Minor
> Attachments: Screen Shot 2018-08-14 at 3.35.34 PM.png
>
>
> Multiple stages for the same job are shown in the wrong order in the DAG 
> visualization of the job page.
> e.g.
> stage27   stage19 stage20 stage24 stage21






[jira] [Assigned] (SPARK-23679) uiWebUrl show inproper URL when running on YARN

2018-08-28 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23679:
--

Assignee: Saisai Shao

> uiWebUrl show inproper URL when running on YARN
> ---
>
> Key: SPARK-23679
> URL: https://issues.apache.org/jira/browse/SPARK-23679
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 2.4.0
>
>
> uiWebUrl returns a local URL.
> Using it will cause an HTTP ERROR 500:
> {code}
> Problem accessing /. Reason:
> Server Error
> Caused by:
> javax.servlet.ServletException: Could not determine the proxy server for 
> redirection
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205)
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:524)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> We should return the YARN proxy address instead.
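
A small sketch of the distinction described above (host names, ports, and the 
proxy URL pattern are placeholders, not values taken from this report):
{code:java}
// uiWebUrl gives the driver-local address, which the YARN proxy filter rejects with HTTP 500.
val localUi = spark.sparkContext.uiWebUrl            // e.g. Some("http://driver-host:4040")

// The address users can actually open is the ResourceManager proxy URL for the application.
val appId   = spark.sparkContext.applicationId
val proxied = s"http://<rm-host>:8088/proxy/$appId/" // assumes the default RM webapp port
{code}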






[jira] [Resolved] (SPARK-23679) uiWebUrl show inproper URL when running on YARN

2018-08-28 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23679.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22164
[https://github.com/apache/spark/pull/22164]

> uiWebUrl show inproper URL when running on YARN
> ---
>
> Key: SPARK-23679
> URL: https://issues.apache.org/jira/browse/SPARK-23679
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 2.4.0
>
>
> uiWebUrl returns a local URL.
> Using it will cause an HTTP ERROR 500:
> {code}
> Problem accessing /. Reason:
> Server Error
> Caused by:
> javax.servlet.ServletException: Could not determine the proxy server for 
> redirection
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205)
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:524)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> We should return the YARN proxy address instead.






[jira] [Resolved] (SPARK-23997) Configurable max number of buckets

2018-08-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23997.
-
   Resolution: Fixed
 Assignee: Fernando Pereira
Fix Version/s: 2.4.0

> Configurable max number of buckets
> --
>
> Key: SPARK-23997
> URL: https://issues.apache.org/jira/browse/SPARK-23997
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Fernando Pereira
>Assignee: Fernando Pereira
>Priority: Major
> Fix For: 2.4.0
>
>
> When exporting data as a table the user can choose to split data in buckets 
> by choosing the columns and the number of buckets. Currently there is a 
> hard-coded limit of 99'999 buckets.
> However, for heavy workloads this limit might be too restrictive, a situation 
> that will eventually become more common as workloads grow.
> As per the comments in SPARK-19618 this limit could be made configurable.
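
A short sketch of the intended usage from the SQL side (the config key and all 
values below are illustrative assumptions, and df stands for an existing 
DataFrame):
{code:java}
// Raise the bucket-count ceiling before writing a heavily bucketed table.
spark.conf.set("spark.sql.sources.bucketing.maxBuckets", 150000)

df.write
  .bucketBy(120000, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")
{code}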






[jira] [Assigned] (SPARK-25264) Fix comma-delineated arguments passed into PythonRunner and RRunner

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25264:


Assignee: Apache Spark

> Fix comma-delineated arguments passed into PythonRunner and RRunner
> ---
>
> Key: SPARK-25264
> URL: https://issues.apache.org/jira/browse/SPARK-25264
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Assignee: Apache Spark
>Priority: Major
>
> The arguments passed into the PythonRunner and RRunner are comma-delineated. 
> Because the Runners do an arg.slice(2,...), the delineation in 
> the entrypoint needs to be a space, as expected by the Runner 
> arguments. 
> This issue was logged here: 
> [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273]
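
A short sketch of the slicing behaviour described above (all values are made up 
for illustration):
{code:java}
// args(0) and args(1) are consumed by the Runner itself; everything from index 2
// onward is treated as the application's arguments.
val commaJoined = Array("main.py", "py-files", "arg1,arg2,arg3")
commaJoined.slice(2, commaJoined.length)        // Array("arg1,arg2,arg3") -- one argument, not three

// Space-separated arguments in the entrypoint arrive as separate elements,
// which is what the slice-based handling expects.
val spaceSeparated = Array("main.py", "py-files", "arg1", "arg2", "arg3")
spaceSeparated.slice(2, spaceSeparated.length)  // Array("arg1", "arg2", "arg3")
{code}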






[jira] [Assigned] (SPARK-25264) Fix comma-delineated arguments passed into PythonRunner and RRunner

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25264:


Assignee: (was: Apache Spark)

> Fix comma-delineated arguments passed into PythonRunner and RRunner
> ---
>
> Key: SPARK-25264
> URL: https://issues.apache.org/jira/browse/SPARK-25264
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> The arguments passed into the PythonRunner and RRunner are comma-delineated. 
> Because the Runners do an arg.slice(2,...), the delineation in 
> the entrypoint needs to be a space, as expected by the Runner 
> arguments. 
> This issue was logged here: 
> [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273]






[jira] [Commented] (SPARK-25264) Fix comma-delineated arguments passed into PythonRunner and RRunner

2018-08-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595306#comment-16595306
 ] 

Apache Spark commented on SPARK-25264:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22257

> Fix comma-delineated arguments passed into PythonRunner and RRunner
> ---
>
> Key: SPARK-25264
> URL: https://issues.apache.org/jira/browse/SPARK-25264
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> The arguments passed into the PythonRunner and RRunner are comma-delineated. 
> Because the Runners do an arg.slice(2,...), the delineation in 
> the entrypoint needs to be a space, as expected by the Runner 
> arguments. 
> This issue was logged here: 
> [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273]






[jira] [Comment Edited] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-28 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595298#comment-16595298
 ] 

yucai edited comment on SPARK-25206 at 8/28/18 5:06 PM:


 Do you want to simulate an Exception in Spark? 

Backporting SPARK-25132 and SPARK-24716 would fix a bug for our data source 
table, so it could be more meaningful.


was (Author: yucai):
 Do you want to simulate an Exception in Spark? 

Backporting SPARK-25132 and SPARK-24716 would fix a bug for our data source 
table, so it looks more meaningful.

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes FilterApi.gt(intColumn("ID"), 0: Integer) down into Parquet, but 
> ID does not exist in /tmp/data (Parquet is case sensitive; it actually has 
> id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which addresses this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even with spark.sql.caseSensitive 
> set to false.
> SPARK-25132 already addressed this issue.
>  
> The biggest difference is that in Spark 2.1 the user will get an Exception for 
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user silently gets wrong results.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
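
For context, a minimal user-side workaround sketch for the repro quoted above 
(declaring the table column with the same letter case as the Parquet file; this 
is not a substitute for the backport under discussion):
{code:java}
spark.range(10).write.parquet("/tmp/data")
sql("CREATE TABLE t2 (id LONG) USING parquet LOCATION '/tmp/data'")  // column case matches the file
sql("SELECT * FROM t2 WHERE id > 0").show()                          // ids 1..9 are returned as expected
{code}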






[jira] [Comment Edited] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-28 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595298#comment-16595298
 ] 

yucai edited comment on SPARK-25206 at 8/28/18 5:05 PM:


 Do you want to simulate an Exception in Spark? 

Backporting SPARK-25132 and SPARK-24716 would fix a bug for our data source 
table, so it looks more meaningful.


was (Author: yucai):
 

Do you want to simulate an Exception in Spark? 

Backporting SPARK-25132 and SPARK-24716 would fix a bug for our data source 
table, so it looks more meaningful.

 

 

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes FilterApi.gt(intColumn("ID"), 0: Integer) down into Parquet, but 
> ID does not exist in /tmp/data (Parquet is case sensitive; it actually has 
> id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which addresses this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even with spark.sql.caseSensitive 
> set to false.
> SPARK-25132 already addressed this issue.
>  
> The biggest difference is that in Spark 2.1 the user will get an Exception for 
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user silently gets wrong results.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-28 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595298#comment-16595298
 ] 

yucai commented on SPARK-25206:
---

 

Do you want to simulate an Exception in Spark? 

Backporting SPARK-25132 and SPARK-24716 would fix a bug for our data source 
table, so it looks more meaningful.

 

 

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes FilterApi.gt(intColumn("ID"), 0: Integer) down into Parquet, but 
> ID does not exist in /tmp/data (Parquet is case sensitive; it actually has 
> id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which addresses this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even with spark.sql.caseSensitive 
> set to false.
> SPARK-25132 already addressed this issue.
>  
> The biggest difference is that in Spark 2.1 the user will get an Exception for 
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user silently gets wrong results.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?






[jira] [Updated] (SPARK-25264) Fix comma-delineated arguments passed into PythonRunner and RRunner

2018-08-28 Thread Ilan Filonenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilan Filonenko updated SPARK-25264:
---
Description: 
The arguments passed into the PythonRunner and RRunner are comma-delineated. 

Because the Runners do an arg.slice(2,...), the delineation in 
the entrypoint needs to be a space, as expected by the Runner 
arguments. 

This issue was logged here: 
[https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273]

  was:
The arguments passed into the PythonRunner and RRunner are comma-delineated. 

This issue was logged here: 
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273


> Fix comma-delineated arguments passed into PythonRunner and RRunner
> ---
>
> Key: SPARK-25264
> URL: https://issues.apache.org/jira/browse/SPARK-25264
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> The arguments passed into the PythonRunner and RRunner are comma-delineated. 
> Because the Runners do an arg.slice(2,...), the delineation in 
> the entrypoint needs to be a space, as expected by the Runner 
> arguments. 
> This issue was logged here: 
> [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273]






[jira] [Created] (SPARK-25264) Fix comma-delineated arguments passed into PythonRunner and RRunner

2018-08-28 Thread Ilan Filonenko (JIRA)
Ilan Filonenko created SPARK-25264:
--

 Summary: Fix comma-delineated arguments passed into PythonRunner 
and RRunner
 Key: SPARK-25264
 URL: https://issues.apache.org/jira/browse/SPARK-25264
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, PySpark
Affects Versions: 2.4.0
Reporter: Ilan Filonenko


The arguments passed into the PythonRunner and RRunner are comma-delineated. 

This issue was logged here: 
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/273






[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes

2018-08-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595254#comment-16595254
 ] 

Apache Spark commented on SPARK-25262:
--

User 'rvesse' has created a pull request for this issue:
https://github.com/apache/spark/pull/22256

> Make Spark local dir volumes configurable with Spark on Kubernetes
> --
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> As discussed during review of the design document for SPARK-24434, while 
> providing pod templates will provide more in-depth customisation for Spark on 
> Kubernetes, there are some things that cannot be modified because Spark code 
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}}, 
> which is done by {{LocalDirsFeatureStep.scala}}.  For each directory 
> specified, or a single default if no explicit specification, it creates a 
> Kubernetes {{emptyDir}} volume.  As noted in the Kubernetes documentation 
> this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
> compute environments this may be extremely undesirable.  For example with 
> diskless compute resources the node storage will likely be a non-performant 
> remote mounted disk, often with limited capacity.  For such environments it 
> would likely be better to set {{medium: Memory}} on the volume per the K8S 
> documentation to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories and there is no possibility to do 
> that.
> Pod templates will not really solve either of these issues because Spark is 
> always going to attempt to generate a new volume for each local directory and 
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
> volumes
> * Modify the logic to check if there is a volume already defined with the 
> name and if so skip generating a volume definition for it
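
A sketch of what the first proposed change could look like from the user side 
(the config key below is an assumption for illustration, not a released option):
{code:java}
// Back the generated emptyDir volumes for spark.local.dirs with tmpfs (medium: Memory).
val conf = new org.apache.spark.SparkConf()
  .set("spark.local.dir", "/tmp/spark-local")
  .set("spark.kubernetes.local.dirs.tmpfs", "true")   // assumed key for the new setting
{code}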






[jira] [Assigned] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25262:


Assignee: Apache Spark

> Make Spark local dir volumes configurable with Spark on Kubernetes
> --
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Assignee: Apache Spark
>Priority: Major
>
> As discussed during review of the design document for SPARK-24434, while 
> providing pod templates will provide more in-depth customisation for Spark on 
> Kubernetes, there are some things that cannot be modified because Spark code 
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}}, 
> which is done by {{LocalDirsFeatureStep.scala}}.  For each directory 
> specified, or a single default if no explicit specification, it creates a 
> Kubernetes {{emptyDir}} volume.  As noted in the Kubernetes documentation 
> this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
> compute environments this may be extremely undesirable.  For example with 
> diskless compute resources the node storage will likely be a non-performant 
> remote mounted disk, often with limited capacity.  For such environments it 
> would likely be better to set {{medium: Memory}} on the volume per the K8S 
> documentation to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories and there is no possibility to do 
> that.
> Pod templates will not really solve either of these issues because Spark is 
> always going to attempt to generate a new volume for each local directory and 
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
> volumes
> * Modify the logic to check if there is a volume already defined with the 
> name and if so skip generating a volume definition for it






[jira] [Assigned] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25262:


Assignee: (was: Apache Spark)

> Make Spark local dir volumes configurable with Spark on Kubernetes
> --
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> As discussed during review of the design document for SPARK-24434, while 
> providing pod templates will provide more in-depth customisation for Spark on 
> Kubernetes, there are some things that cannot be modified because Spark code 
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}}, 
> which is done by {{LocalDirsFeatureStep.scala}}.  For each directory 
> specified, or a single default if no explicit specification, it creates a 
> Kubernetes {{emptyDir}} volume.  As noted in the Kubernetes documentation 
> this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
> compute environments this may be extremely undesirable.  For example with 
> diskless compute resources the node storage will likely be a non-performant 
> remote mounted disk, often with limited capacity.  For such environments it 
> would likely be better to set {{medium: Memory}} on the volume per the K8S 
> documentation to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories and there is no possibility to do 
> that.
> Pod templates will not really solve either of these issues because Spark is 
> always going to attempt to generate a new volume for each local directory and 
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
> volumes
> * Modify the logic to check if there is a volume already defined with the 
> name and if so skip generating a volume definition for it






[jira] [Commented] (SPARK-25175) Field resolution should fail if there is ambiguity for ORC data source native implementation

2018-08-28 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595240#comment-16595240
 ] 

Chenxiao Mao commented on SPARK-25175:
--

[~cloud_fan] Does it make sense?

> Field resolution should fail if there is ambiguity for ORC data source native 
> implementation
> 
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormats.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.






[jira] [Updated] (SPARK-25175) Field resolution should fail if there is ambiguity for ORC data source native implementation

2018-08-28 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Summary: Field resolution should fail if there is ambiguity for ORC data 
source native implementation  (was: Field resolution should fail if there is 
ambiguity in case-insensitive mode when reading from ORC)

> Field resolution should fail if there is ambiguity for ORC data source native 
> implementation
> 
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormats.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.






[jira] [Commented] (SPARK-25175) Field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595225#comment-16595225
 ] 

Chenxiao Mao commented on SPARK-25175:
--

After a deep dive into ORC file read paths (data source native, data source 
hive, hive serde), I realized that this is a little complicated. I'm not sure 
whether it's technically possible to make all three read paths consistent with 
respect to case sensitivity, because we rely on hive InputFormat/SerDe which we 
might not be able to change.

Please also see [~cloud_fan]'s comment on Parquet: 
[https://github.com/apache/spark/pull/22184/files#r212849852]

So I changed the title of this Jira to reduce the scope. This ticket aims to 
make ORC data source native impl consistent with Parquet data source. The gap 
is that field resolution should fail if there is ambiguity in case-insensitive 
mode when reading from ORC. Does it make sense?

As for duplicate fields with different letter cases, we don't have real use 
cases. It's just for testing purpose.
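
A minimal repro sketch of the ambiguity this ticket targets (paths and values 
are placeholders; the duplicate-case file is created purely for testing, as 
noted above):
{code:java}
// Write an ORC file whose schema has two fields differing only by case.
spark.conf.set("spark.sql.caseSensitive", "true")
spark.range(5).selectExpr("id AS c", "id + 1 AS C").write.orc("/tmp/orc_dup")

// Read it back with a lower-case schema, as a metastore table would declare it.
// Resolving "c" against the file fields (c, C) is ambiguous in case-insensitive
// mode; the proposal is that the native ORC reader fail here, as the Parquet
// reader does, instead of silently picking one of the two fields.
spark.conf.set("spark.sql.caseSensitive", "false")
spark.read.schema("c LONG").orc("/tmp/orc_dup").show()
{code}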

 

> Field resolution should fail if there is ambiguity in case-insensitive mode 
> when reading from ORC
> -
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormats.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.






[jira] [Updated] (SPARK-25175) Field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Summary: Field resolution should fail if there is ambiguity in 
case-insensitive mode when reading from ORC  (was: Case-insensitive field 
resolution when reading from ORC)

> Field resolution should fail if there is ambiguity in case-insensitive mode 
> when reading from ORC
> -
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormats.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.






[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593194#comment-16593194
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/28/18 4:00 PM:
---

[~dongjoon] [~yucai] Here is a brief summary. We can see that
 * The data source tables with the hive impl always return a,B,c, no matter whether 
spark.sql.caseSensitive is set to true or false and no matter whether the metastore 
table schema is in lower case or upper case. They always do case-insensitive field 
resolution, and if there is ambiguity they return the first matched field. Given 
that the ORC file schema is (a,B,c,C):
 ** Is it better to return null in scenarios 2 and 10? 
 ** Is it better to return C in scenario 12?
 ** Is it better to fail due to ambiguity in scenarios 15, 18, 21, 24, rather 
than always return the lower case one?

 * The data source tables with the native impl, compared to the hive impl, handle 
scenarios 2, 10, 12 in a more reasonable way. However, they handle ambiguity in 
the same way as the hive impl, which is not consistent with the Parquet data source.
 * The hive serde tables always throw IndexOutOfBoundsException at runtime when 
the ORC file schema has more fields than the table schema. If the ORC schema does 
NOT have more fields, hive serde tables do field resolution by ordinal rather than 
by name.
 * Since in case-sensitive mode analysis should fail if a column name in the query 
and the metastore schema are in different cases, all AnalysisException(s) are 
reasonable.

Stacktrace of IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: toIndex = 4
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}
 


was (Author: seancxmao):
[~dongjoon] [~yucai] Here is a brief summary. We can 

[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Description: 
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues, but not identical 
to Parquet. Spark has two OrcFileFormats.
 * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
dependency. This hive OrcFileFormat always does case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the first matched field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. If ORC data 
file has more fields than table schema, we just can't read hive serde tables. 
If ORC data file does not have more fields, hive serde tables always do field 
resolution by ordinal, rather than by name.

Both ORC data source hive impl and hive serde table rely on the hive orc 
InputFormat/SerDe to read table. I'm not sure whether we can change underlying 
hive classes to make all orc read behaviors consistent.

This ticket aims to make read behavior of ORC data source native impl 
consistent with Parquet data source.

  was:
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has 2 
OrcFileFormats, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
dependency. This hive OrcFileFormat always does case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there are 
duplicate fields (e.g. c, C), we just can't read hive serde tables. If there 
are no duplicate fields, hive serde tables always do case-insensitive field 
resolution regardless of case sensitivity mode.


> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormats.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.
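
To make the intended rule concrete, here is a minimal, self-contained sketch 
(illustrative names only, not Spark internals) of case-insensitive field 
resolution with an ambiguity check, matching the Parquet-side behaviour 
described above:
{code:java}
// Standalone sketch: resolve a requested column against the physical file
// schema, ignoring case unless caseSensitive is set, and failing on ambiguity.
object FieldResolutionSketch {
  def resolve(fileFields: Seq[String],
              requested: String,
              caseSensitive: Boolean): Option[String] = {
    if (caseSensitive) {
      fileFields.find(_ == requested)
    } else {
      fileFields.filter(_.equalsIgnoreCase(requested)) match {
        case Seq()       => None             // column missing from the file
        case Seq(single) => Some(single)     // unambiguous match
        case duplicates  => throw new RuntimeException(
          s"Found duplicate field(s) $requested: " +
            s"${duplicates.mkString(", ")} in case-insensitive mode")
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val orcFields = Seq("a", "B", "c", "C")
    println(resolve(orcFields, "b", caseSensitive = false))  // Some(B)
    // resolve(orcFields, "c", caseSensitive = false)        // would throw: c vs C is ambiguous
  }
}
{code}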



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-28 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595182#comment-16595182
 ] 

Imran Rashid commented on SPARK-24918:
--

Yes, that's right, it's to avoid putting a reference to the static initializer 
in every single task.  That's a nuisance when your task definitions are 

For my intended use, the task itself doesn't depend on the plugin at all.  The 
plugin gives you added debug info, that's all.  That's why it's OK for the 
original code to not know anything about the plugin.

There were other requests on the earlier jira to do lifecycle management, e.g. 
eager initialization of resources.  I agree that's not as clear (if you need 
the resource, your task will have to reference it, so you have a spot for your 
static initializer).  In any case, you could use this same mechanism for that 
as well, if you wanted.
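
For illustration only, here is a toy sketch of the kind of hook being 
discussed: a base with no-op lifecycle methods that the executor would 
instantiate at startup, so task code never has to reference it. This is not 
the committed API and the names are made up.
{code:java}
// Toy plugin shape (hypothetical names, not the real Spark interface): the
// executor would instantiate configured plugin classes reflectively at startup
// and call init(), then shutdown() when it exits.
trait ExecutorPluginSketch {
  def init(): Unit = ()        // no-op default, called once per executor
  def shutdown(): Unit = ()    // no-op default, called on executor exit
}

// Example use: attach a memory/debug probe without touching any task code.
class MemoryProbePlugin extends ExecutorPluginSketch {
  override def init(): Unit =
    println(s"probe attached, maxHeap=${Runtime.getRuntime.maxMemory()} bytes")
}
{code}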

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the api, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes

2018-08-28 Thread Rob Vesse (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595178#comment-16595178
 ] 

Rob Vesse commented on SPARK-25262:
---

I have changes for this almost ready and plan to open a PR tomorrow

> Make Spark local dir volumes configurable with Spark on Kubernetes
> --
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> As discussed during review of the design document for SPARK-24434, while 
> providing pod templates will provide more in-depth customisation for Spark on 
> Kubernetes, there are some things that cannot be modified because Spark code 
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}}, 
> which is done by {{LocalDirsFeatureStep.scala}}.  For each directory 
> specified, or a single default if no explicit specification, it creates a 
> Kubernetes {{emptyDir}} volume.  As noted in the Kubernetes documentation 
> this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
> compute environments this may be extremely undesirable.  For example with 
> diskless compute resources the node storage will likely be a non-performant 
> remote mounted disk, often with limited capacity.  For such environments it 
> would likely be better to set {{medium: Memory}} on the volume per the K8S 
> documentation to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories and there is no possibility to do 
> that.
> Pod templates will not really solve either of these issues because Spark is 
> always going to attempt to generate a new volume for each local directory and 
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
> volumes
> * Modify the logic to check if there is a volume already defined with the 
> name and if so skip generating a volume definition for it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25005) Structured streaming doesn't support kafka transaction (creating empty offset with abort & markers)

2018-08-28 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-25005.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.4.0

> Structured streaming doesn't support kafka transaction (creating empty offset 
> with abort & markers)
> ---
>
> Key: SPARK-25005
> URL: https://issues.apache.org/jira/browse/SPARK-25005
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Quentin Ambard
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.0
>
>
> Structured streaming can't consume kafka transactions. 
> We could try to apply SPARK-24720 (DStream) logic to Structured Streaming 
> source



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25218) Potential resource leaks in TransportServer and SocketAuthHelper

2018-08-28 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-25218.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

> Potential resource leaks in TransportServer and SocketAuthHelper
> 
>
> Key: SPARK-25218
> URL: https://issues.apache.org/jira/browse/SPARK-25218
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.4.0
>
>
> They don't release the resources for all types of errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593185#comment-16593185
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/28/18 3:27 PM:
---

Investigation of ORC tables with duplicate fields (c and C), where the data 
file therefore also has more fields than the table schema.
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
 
||no.||caseSensitive||table columns||select column||orc column (select via data source table, hive impl)||orc column (select via data source table, native impl)||orc column (select via hive serde table)||
|1|true|a, b, c|a|a |a|IndexOutOfBoundsException |
|2| | |b|B |null|IndexOutOfBoundsException |
|3| | |c|c |c|IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|AnalysisException|
|8| | |b|AnalysisException |AnalysisException|AnalysisException |
|9| | |c|AnalysisException |AnalysisException|AnalysisException |
|10| | |A|a |null|IndexOutOfBoundsException |
|11| | |B|B |B|IndexOutOfBoundsException |
|12| | |C|c |C|IndexOutOfBoundsException |
|13|false|a, b, c|a|a |a|IndexOutOfBoundsException |
|14| | |b|B |B|IndexOutOfBoundsException |
|15| | |c|c |c|IndexOutOfBoundsException |
|16| | |A|a |a|IndexOutOfBoundsException |
|17| | |B|B |B|IndexOutOfBoundsException |
|18| | |C|c |c|IndexOutOfBoundsException |
|19| |A, B, C|a|a |a|IndexOutOfBoundsException |
|20| | |b|B |B|IndexOutOfBoundsException |
|21| | |c|c |c|IndexOutOfBoundsException |
|22| | |A|a |a|IndexOutOfBoundsException |
|23| | |B|B |B|IndexOutOfBoundsException |
|24| | |C|c |c|IndexOutOfBoundsException |

Follow-up tests that use ORC files with no duplicate fields (only a, B).
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data_nodup")

$> hive --orcfiledump /user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Structure for /user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Type: struct

CREATE TABLE orc_nodup_hive_serde_lower (a LONG, b LONG) STORED AS orc LOCATION 
'/user/hive/warehouse/orc_data_nodup'
CREATE TABLE orc_nodup_hive_serde_upper (A LONG, B LONG) STORED AS orc LOCATION 
'/user/hive/warehouse/orc_data_nodup'

DESC EXTENDED orc_nodup_hive_serde_lower;
DESC EXTENDED orc_nodup_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
||no.||caseSensitive||table columns||select column||orc column (select via hive serde table)||
|1|true|a, b|a|a|
|2| | |b|B|
|4| | |A|AnalysisException|
|5| | |B|AnalysisException|
|7| |A, B|a|AnalysisException|
|8| | |b|AnalysisException |
|10| | |A|a|
|11| | |B|B|
|13|false|a, b|a|a|
|14| | |b|B|
|16| | |A|a|
|17| | |B|B|
|19| |A, B|a|a|
|20| | |b|B|
|22| | |A|a|
|23| | |B|B|

These tests show that for hive serde tables field resolution is by ordinal, not 
by name.

{code}
spark.conf.set("spark.sql.caseSensitive", true)
spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
val data = spark.range(1).selectExpr("id + 1 as x", "id + 2 as y", "id + 3 as z")
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data_xyz")
sql("CREATE TABLE orc_table_ABC (A LONG, B LONG, C LONG) STORED AS orc LOCATION 
'/user/hive/warehouse/orc_data_xyz'")
sql("select B from orc_table_ABC").show
+---+
|  B|
+---+
|  2|
+---+
{code}



was (Author: seancxmao):
Thorough investigation about ORC tables with duplicate fields (c and C).
{code:java}
val data = 

[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593194#comment-16593194
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/28/18 3:22 PM:
---

[~dongjoon] [~yucai] Here is a brief summary. We can see that:
 * The data source tables with the hive impl always return a, B, c, no matter 
whether spark.sql.caseSensitive is set to true or false and no matter whether 
the metastore table schema is in lower case or upper case. They always do 
case-insensitive field resolution, and if there is ambiguity they return the 
first matched field. Given that the ORC file schema is (a, B, c, C):
 ** Is it better to return null in scenarios 2 and 10?
 ** Is it better to return C in scenario 12?
 ** Is it better to fail due to ambiguity in scenarios 15, 18, 21, 24, rather 
than always returning the lower-case one?

 * The data source tables with the native impl, compared to the hive impl, 
handle scenarios 2, 10 and 12 in a more reasonable way. However, they handle 
ambiguity in the same way as the hive impl.
 * The hive serde tables always throw IndexOutOfBoundsException at runtime when 
the ORC schema has more fields than the table schema. If the ORC schema does 
NOT have more fields, hive serde tables do field resolution by ordinal rather 
than by name.
 * Since in case-sensitive mode analysis should fail if a column name in the 
query and the metastore schema are in different cases, all AnalysisExceptions 
are reasonable.

Stacktrace of IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: toIndex = 4
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}
 


was (Author: seancxmao):
[~dongjoon] [~yucai] Here is a brief summary. We can see that
 * The data source tables with hive impl always 

[jira] [Commented] (SPARK-23622) Flaky Test: HiveClientSuites

2018-08-28 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595068#comment-16595068
 ] 

Marco Gaido commented on SPARK-23622:
-

This failure became permanent in the last build (at least it seems so).

> Flaky Test: HiveClientSuites
> 
>
> Key: SPARK-23622
> URL: https://issues.apache.org/jira/browse/SPARK-23622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test 
> (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325
> {code}
> Error Message
> java.lang.reflect.InvocationTargetException: null
> Stacktrace
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
>   at 
> org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58)
>   at 
> org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24)
>   at org.scalatest.Suite$class.run(Suite.scala:1144)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:117)
>   ... 29 more
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:63)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:73)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2664)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2683)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:425)
>   ... 31 more
> Caused by: sbt.ForkMain$ForkError: 
> 

[jira] [Commented] (SPARK-23663) Spark Streaming Kafka 010 , fails with "java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access"

2018-08-28 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595044#comment-16595044
 ] 

Gabor Somogyi commented on SPARK-23663:
---

[~sipeti] from the stacktrace the root cause looks the same. There are 
additional JIRAs that try to address this issue but are not yet closed.

> Spark Streaming Kafka 010 , fails with 
> "java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access"
> ---
>
> Key: SPARK-23663
> URL: https://issues.apache.org/jira/browse/SPARK-23663
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 
> Spark streaming kafka 010
>  
>Reporter: kaushik srinivas
>Priority: Major
>
> Test being tried:
> 10 kafka topics created. Streamed with avro data from kafka producers.
> org.apache.spark.streaming.kafka010 used for creating a directStream to kafka.
> A single direct stream is created for all ten topics.
> On each RDD (batch of 50 seconds), the key of each kafka consumer record is 
> checked and separate RDDs are created for separate topics.
> Each topic has records with the topic name as key and avro messages as value.
> Finally the ten RDDs are converted to data frames and registered as separate 
> temp tables.
> Once all the temp tables are registered, a few SQL queries are run on these 
> temp tables.
>  
> Exception seen:
> 18/03/12 11:58:34 WARN TaskSetManager: Lost task 23.0 in stage 4.0 (TID 269, 
> 192.168.24.145, executor 7): java.util.ConcurrentModificationException: 
> KafkaConsumer is not safe for multi-threaded access
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>  at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>  at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:80)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.foreach(KafkaRDD.scala:189)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:108)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> 18/03/12 11:58:34 INFO TaskSetManager: Starting task 23.1 in stage 4.0 (TID 
> 828, 192.168.24.145, executor 7, partition 23, PROCESS_LOCAL, 4758 bytes)
> 18/03/12 11:58:34 INFO TaskSetManager: Lost task 23.1 in stage 4.0 (TID 828) 
> on 192.168.24.145, executor 7: java.util.ConcurrentModificationException 
> (KafkaConsumer is not safe for multi-threaded access) [duplicate 1]
> 18/03/12 11:58:34 INFO TaskSetManager: Starting task 23.2 in stage 4.0 (TID 
> 829, 192.168.24.145, executor 7, partition 23, PROCESS_LOCAL, 4758 bytes)
> 18/03/12 11:58:40 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 30, 
> 192.168.24.147, executor 6): java.lang.IllegalStateException: This consumer 
> has already been closed.
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.ensureNotClosed(KafkaConsumer.java:1417)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1428)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929)
>  at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
>  at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:73)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.foreach(KafkaRDD.scala:189)
>  at 
> 

[jira] [Commented] (SPARK-23663) Spark Streaming Kafka 010 , fails with "java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access"

2018-08-28 Thread Peter Simon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595035#comment-16595035
 ] 

Peter Simon commented on SPARK-23663:
-

[~gsomogyi] isn't this the same as SPARK-19185?

> Spark Streaming Kafka 010 , fails with 
> "java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access"
> ---
>
> Key: SPARK-23663
> URL: https://issues.apache.org/jira/browse/SPARK-23663
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 
> Spark streaming kafka 010
>  
>Reporter: kaushik srinivas
>Priority: Major
>
> Test being tried:
> 10 kafka topics created. Streamed with avro data from kafka producers.
> org.apache.spark.streaming.kafka010 used for creating a directStream to kafka.
> A single direct stream is created for all ten topics.
> On each RDD (batch of 50 seconds), the key of each kafka consumer record is 
> checked and separate RDDs are created for separate topics.
> Each topic has records with the topic name as key and avro messages as value.
> Finally the ten RDDs are converted to data frames and registered as separate 
> temp tables.
> Once all the temp tables are registered, a few SQL queries are run on these 
> temp tables.
>  
> Exception seen:
> 18/03/12 11:58:34 WARN TaskSetManager: Lost task 23.0 in stage 4.0 (TID 269, 
> 192.168.24.145, executor 7): java.util.ConcurrentModificationException: 
> KafkaConsumer is not safe for multi-threaded access
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>  at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>  at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:80)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.foreach(KafkaRDD.scala:189)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:108)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> 18/03/12 11:58:34 INFO TaskSetManager: Starting task 23.1 in stage 4.0 (TID 
> 828, 192.168.24.145, executor 7, partition 23, PROCESS_LOCAL, 4758 bytes)
> 18/03/12 11:58:34 INFO TaskSetManager: Lost task 23.1 in stage 4.0 (TID 828) 
> on 192.168.24.145, executor 7: java.util.ConcurrentModificationException 
> (KafkaConsumer is not safe for multi-threaded access) [duplicate 1]
> 18/03/12 11:58:34 INFO TaskSetManager: Starting task 23.2 in stage 4.0 (TID 
> 829, 192.168.24.145, executor 7, partition 23, PROCESS_LOCAL, 4758 bytes)
> 18/03/12 11:58:40 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 30, 
> 192.168.24.147, executor 6): java.lang.IllegalStateException: This consumer 
> has already been closed.
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.ensureNotClosed(KafkaConsumer.java:1417)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1428)
>  at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929)
>  at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
>  at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:73)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.foreach(KafkaRDD.scala:189)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
>  at 
> 

[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-28 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595020#comment-16595020
 ] 

Sean Owen commented on SPARK-24918:
---

Wait, why doesn't a static init "run everywhere"? Is "everywhere" the same as 
"on every executor that will run a task"? And why would an executor be able to 
init without touching a static initializer that the user code touches? Yes, you 
just call it somewhere in any task that needs the init. There's no overhead in 
checking whether it's initialized and initializing if not. Therein is the 
issue, I suppose: you have to put some reference to a class or method in every 
task. The upshot is that it requires no additional mechanism or reasoning about 
what thread runs what.
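
As a concrete illustration of the static-initializer idiom described above 
(names are hypothetical), a Scala object's constructor runs at most once per 
executor JVM, the first time any task references it:
{code:java}
// The workaround a plugin would replace: one-time per-JVM setup triggered by
// the first task that touches this object. Every task closure that needs the
// setup must mention DebugInit somewhere, which is the nuisance discussed here.
object DebugInit {
  println("one-time executor-side setup")   // runs once per JVM, lazily
  def ensure(): Unit = ()                   // no-op; referencing it forces init
}

// Illustrative call site inside a task:
// rdd.foreachPartition { _ => DebugInit.ensure(); /* real work */ }
{code}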

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the api, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25263) Add scheduler integration test for SPARK-24909

2018-08-28 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-25263:
-

 Summary: Add scheduler integration test for SPARK-24909
 Key: SPARK-25263
 URL: https://issues.apache.org/jira/browse/SPARK-25263
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.3.1
Reporter: Thomas Graves


For JIRA SPARK-24909, the unit test isn't sufficient to test interactions 
between the DAGScheduler and the TaskSetManager.  Add a test to the 
SchedulerIntegrationSuite to cover this. 

The SchedulerIntegrationSuite needs to be updated to handle multiple executors 
and also be able to control the flow of tasks starting/ending.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory

2018-08-28 Thread Chao Fang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594981#comment-16594981
 ] 

Chao Fang edited comment on SPARK-25091 at 8/28/18 1:39 PM:


Yes, I think it's a UI issue.

Today I ran CACHE/UNCACHE TABLE three times and finally REFRESH TABLE, and I 
have attached the resulting files. As you can see, the Storage tab is OK, while 
Storage Memory in the Executors tab always increases. As you can also see, the 
old gen space releases memory as expected and Disk Memory in the Executors tab 
is 0.0B 


was (Author: chao fang):
Yes, I think it's a UI issue.

Today I ran CACHE/UNCACHE TABLE three times and finally REFRESH TABLE, and I 
have attached the resulting files. As you can see, the Storage tab is OK, while 
Storage Memory in the Executors tab always increases. As you can also see, the 
old gen space releases memory as expected and Disk Memory in the Executors tab 
is 0.0B !4.png!

> UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
> -
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
> Attachments: 0.png, 1.png, 2.png, 3.png, 4.png
>
>
> UNCACHE TABLE and CLEAR CACHE do not clean up executor memory.
> In the Spark UI, although the Storage tab shows the cached table removed, the 
> Executors tab shows the executors continuing to hold the RDD, and the memory 
> is not cleared. This results in a huge waste of executor memory. As we call 
> CACHE TABLE, we run into issues where the cached tables are spilled to disk 
> instead of reclaiming the memory storage. 
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory

2018-08-28 Thread Chao Fang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594981#comment-16594981
 ] 

Chao Fang commented on SPARK-25091:
---

Yes, I think it's a UI issue.

Today I ran CACHE/UNCACHE TABLE three times and finally REFRESH TABLE, and I 
have attached the resulting files. As you can see, the Storage tab is OK, while 
Storage Memory in the Executors tab always increases. As you can also see, the 
old gen space releases memory as expected and Disk Memory in the Executors tab 
is 0.0B !4.png!

> UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
> -
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
> Attachments: 0.png, 1.png, 2.png, 3.png, 4.png
>
>
> UNCACHE TABLE and CLEAR CACHE do not clean up executor memory.
> In the Spark UI, although the Storage tab shows the cached table removed, the 
> Executors tab shows the executors continuing to hold the RDD, and the memory 
> is not cleared. This results in a huge waste of executor memory. As we call 
> CACHE TABLE, we run into issues where the cached tables are spilled to disk 
> instead of reclaiming the memory storage. 
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory

2018-08-28 Thread Chao Fang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Fang updated SPARK-25091:
--
Attachment: 4.png

> UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
> -
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
> Attachments: 0.png, 1.png, 2.png, 3.png, 4.png
>
>
> UNCACHE TABLE and CLEAR CACHE do not clean up executor memory.
> In the Spark UI, although the Storage tab shows the cached table removed, the 
> Executors tab shows the executors continuing to hold the RDD, and the memory 
> is not cleared. This results in a huge waste of executor memory. As we call 
> CACHE TABLE, we run into issues where the cached tables are spilled to disk 
> instead of reclaiming the memory storage. 
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory

2018-08-28 Thread Chao Fang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Fang updated SPARK-25091:
--
Attachment: 3.png
2.png
1.png
0.png

> UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
> -
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
> Attachments: 0.png, 1.png, 2.png, 3.png
>
>
> UNCACHE TABLE and CLEAR CACHE do not clean up executor memory.
> In the Spark UI, although the Storage tab shows the cached table removed, the 
> Executors tab shows the executors continuing to hold the RDD, and the memory 
> is not cleared. This results in a huge waste of executor memory. As we call 
> CACHE TABLE, we run into issues where the cached tables are spilled to disk 
> instead of reclaiming the memory storage. 
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25213) DataSourceV2 doesn't seem to produce unsafe rows

2018-08-28 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594973#comment-16594973
 ] 

Li Jin commented on SPARK-25213:


This is resolved by https://github.com/apache/spark/pull/22104

> DataSourceV2 doesn't seem to produce unsafe rows 
> -
>
> Key: SPARK-25213
> URL: https://issues.apache.org/jira/browse/SPARK-25213
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Reproduce (Need to compile test-classes):
> bin/pyspark --driver-class-path sql/core/target/scala-2.11/test-classes
> {code:java}
> datasource_v2_df = spark.read \
> .format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") 
> \
> .load()
> result = datasource_v2_df.withColumn('x', udf(lambda x: x, 
> 'int')(datasource_v2_df['i']))
> result.show()
> {code}
> The above code fails with:
> {code:java}
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
> {code}
> Seems like Data Source V2 doesn't produce unsafeRows here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-28 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594958#comment-16594958
 ] 

yucai commented on SPARK-25206:
---

[~smilegator] , 2.1's exception is from parquet.
{code:java}
java.lang.IllegalArgumentException: Column [ID] was not found in schema!
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
at 
org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121)
{code}
2.1 uses parquet 1.8.1, while 2.3 uses parquet 1.8.3; it is a behavior change in 
parquet.

See:

https://issues.apache.org/jira/browse/PARQUET-389

[https://github.com/apache/parquet-mr/commit/2282c22c5b252859b459cc2474350fbaf2a588e9]
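
For reference, a minimal sketch of the predicate being pushed down, mirroring 
the FilterApi.gt(intColumn("ID"), 0: Integer) call quoted in the description 
(assumes parquet-mr's filter2 API on the classpath; the object name is made up):
{code:java}
import org.apache.parquet.filter2.predicate.FilterApi

object PushdownSketch {
  // Greater-than filter on a column named "ID". Against a file whose physical
  // column is lower-case "id", parquet 1.8.1 rejects it ("Column [ID] was not
  // found in schema!") while 1.8.3, after PARQUET-389, treats the column as
  // missing and the scan silently returns no rows.
  val pushedDown = FilterApi.gt(FilterApi.intColumn("ID"), java.lang.Integer.valueOf(0))
}
{code}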

 

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, below query returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter 
> cases between the Hive metastore schema and the parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into parquet, but 
> ID does not exist in /tmp/data (parquet is case sensitive; it actually has 
> id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore 
> schema to do the pushdown, which is perfect for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even with spark.sql.caseSensitive 
> set to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is that, in Spark 2.1, the user will get an exception 
> for the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user will get the wrong results silently.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes

2018-08-28 Thread Rob Vesse (JIRA)
Rob Vesse created SPARK-25262:
-

 Summary: Make Spark local dir volumes configurable with Spark on 
Kubernetes
 Key: SPARK-25262
 URL: https://issues.apache.org/jira/browse/SPARK-25262
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.1, 2.3.0
Reporter: Rob Vesse


As discussed during review of the design document for SPARK-24434, while 
providing pod templates will provide more in-depth customisation for Spark on 
Kubernetes, there are some things that cannot be modified because Spark code 
generates pod specs in very specific ways.

The particular issue identified relates to the handling of {{spark.local.dirs}}, 
which is done by {{LocalDirsFeatureStep.scala}}.  For each directory specified, 
or a single default if no explicit specification, it creates a Kubernetes 
{{emptyDir}} volume.  As noted in the Kubernetes documentation this will be 
backed by the node storage 
(https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
compute environments this may be extremely undesirable.  For example with 
diskless compute resources the node storage will likely be a non-performant 
remote mounted disk, often with limited capacity.  For such environments it 
would likely be better to set {{medium: Memory}} on the volume per the K8S 
documentation to use a {{tmpfs}} volume instead.

Another closely related issue is that users might want to use a different 
volume type to back the local directories and there is no possibility to do 
that.

Pod templates will not really solve either of these issues because Spark is 
always going to attempt to generate a new volume for each local directory and 
always going to set these as {{emptyDir}}.

Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:

* Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
volumes
* Modify the logic to check if there is a volume already defined with the name 
and if so skip generating a volume definition for it
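
A rough sketch of the first proposed change, assuming the fabric8 
kubernetes-client builders that the Kubernetes backend uses; the boolean flag 
stands in for the new (not yet named) config setting:
{code:java}
import io.fabric8.kubernetes.api.model.{Volume, VolumeBuilder}

object LocalDirVolumeSketch {
  // Build the emptyDir volume for one local directory. When the new setting is
  // enabled, request medium "Memory" so the emptyDir is tmpfs backed instead of
  // using node storage; an empty medium keeps the current default behaviour.
  def localDirVolume(name: String, useTmpfs: Boolean): Volume = {
    val medium = if (useTmpfs) "Memory" else ""
    new VolumeBuilder()
      .withName(name)
      .withNewEmptyDir()
        .withMedium(medium)
        .endEmptyDir()
      .build()
  }
}
{code}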



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25024) Update mesos documentation to be clear about security supported

2018-08-28 Thread Rob Vesse (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594933#comment-16594933
 ] 

Rob Vesse commented on SPARK-25024:
---

[~tgraves] Attempting to answer your questions:

* We never used cluster mode so can't comment
* Yes and no
** Similar to YARN it does the login locally in the client and then uses HDFS 
delegation tokens so it doesn't ship the keytabs AFAIK but it does ship the 
delegation tokens
* We never used Spark Shuffle Service either so can't comment
* Yes
** Mesos does authentication at the framework level rather than the user level 
so it depends on your setup.  You might have setups where there is a single 
principal and secret used by all Spark users or you might have setups where you 
create a principal and secret for each user.  You can optionally do ACLs within 
Mesos for each framework principal including configuring things like which 
users a framework is allowed to launch jobs as.
* Again, we have not used this feature; I think these are similar to K8S secrets 
in that they are created separately and you are just passing identifiers for 
them to Spark, and Mesos takes care of providing them securely to your jobs.

Generally we have dropped use of Spark on Mesos in favour of Spark on K8S 
because the security story for Mesos was poor and we had to do a lot of extra 
stuff to provide multi-tenancy whereas with K8S a lot more was available out of 
the box (even if secure HDFS support has yet to land in mainline Spark)

> Update mesos documentation to be clear about security supported
> ---
>
> Key: SPARK-25024
> URL: https://issues.apache.org/jira/browse/SPARK-25024
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was reading through our mesos deployment docs and security docs and it's not 
> clear at all what type of security is supported and how to set it up for 
> mesos.  I think we should clarify this and have something about exactly what 
> is supported and what is not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25102) Write Spark version information to Parquet file footers

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25102:


Assignee: Apache Spark

> Write Spark version information to Parquet file footers
> ---
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Zoltan Ivanfi
>Assignee: Apache Spark
>Priority: Major
>
> -PARQUET-352- added support for the "writer.model.name" property in the 
> Parquet metadata to identify the object model (application) that wrote the 
> file.
> The easiest way to write this property is by overriding getName() of 
> org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
> getName() to the 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.
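
A minimal sketch of where such an override would sit (stubbed methods and an 
illustrative class name; assumes a parquet-mr release that already contains 
PARQUET-352):
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.WriteSupport
import org.apache.parquet.hadoop.api.WriteSupport.WriteContext
import org.apache.parquet.io.api.RecordConsumer

// Sketch only: getName() is the hook that surfaces as "writer.model.name" in
// the footer metadata; the other methods are stubbed here, whereas Spark's
// real ParquetWriteSupport would keep its existing init/write logic.
class VersionTaggingWriteSupport extends WriteSupport[AnyRef] {
  override def getName: String = "example-writer"
  override def init(configuration: Configuration): WriteContext =
    throw new UnsupportedOperationException("sketch only")
  override def prepareForWrite(recordConsumer: RecordConsumer): Unit = ()
  override def write(record: AnyRef): Unit = ()
}
{code}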



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25102) Write Spark version information to Parquet file footers

2018-08-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594914#comment-16594914
 ] 

Apache Spark commented on SPARK-25102:
--

User 'npoberezkin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22255

> Write Spark version information to Parquet file footers
> ---
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> -PARQUET-352- added support for the "writer.model.name" property in the 
> Parquet metadata to identify the object model (application) that wrote the 
> file.
> The easiest way to write this property is by overriding getName() of 
> org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
> getName() to the 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25102) Write Spark version information to Parquet file footers

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25102:


Assignee: (was: Apache Spark)

> Write Spark version information to Parquet file footers
> ---
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> -PARQUET-352- added support for the "writer.model.name" property in the 
> Parquet metadata to identify the object model (application) that wrote the 
> file.
> The easiest way to write this property is by overriding getName() of 
> org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
> getName() to the 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24411) Adding native Java tests for `isInCollection`

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24411:


Assignee: (was: Apache Spark)

> Adding native Java tests for `isInCollection`
> -
>
> Key: SPARK-24411
> URL: https://issues.apache.org/jira/browse/SPARK-24411
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Minor
>  Labels: starter
>
> In the past, some of our Java APIs have been difficult to call from Java. We 
> should add tests in Java directly to make sure they work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24411) Adding native Java tests for `isInCollection`

2018-08-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594855#comment-16594855
 ] 

Apache Spark commented on SPARK-24411:
--

User 'aai95' has created a pull request for this issue:
https://github.com/apache/spark/pull/22253

> Adding native Java tests for `isInCollection`
> -
>
> Key: SPARK-24411
> URL: https://issues.apache.org/jira/browse/SPARK-24411
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Minor
>  Labels: starter
>
> In the past, some of our Java APIs have been difficult to call from Java. We 
> should add tests in Java directly to make sure they work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24411) Adding native Java tests for `isInCollection`

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24411:


Assignee: Apache Spark

> Adding native Java tests for `isInCollection`
> -
>
> Key: SPARK-24411
> URL: https://issues.apache.org/jira/browse/SPARK-24411
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> In the past, some of our Java APIs have been difficult to call from Java. We 
> should add tests in Java directly to make sure they work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25102) Write Spark version information to Parquet file footers

2018-08-28 Thread Nikita Poberezkin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594809#comment-16594809
 ] 

Nikita Poberezkin edited comment on SPARK-25102 at 8/28/18 10:43 AM:
-

Hi, [~zi]. I've tried to override the getName method in 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport. The 
problem is that the only way (that I know about) to find the Spark version 
programmatically on an executor node is 
SparkSession.builder().getOrCreate().version. But when I ran the tests I received 
the following error: 
 Caused by: java.lang.IllegalStateException: SparkSession should only be 
created and accessed on the driver.
 Do you know any other way to find the Spark version on an executor?


was (Author: npoberezkin):
Hi, Zoltan. I've tried to override the getName method in 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport. The 
problem is that the only way (that I know about) to find the Spark version 
programmatically on an executor node is 
SparkSession.builder().getOrCreate().version. But when I ran the tests I received 
the following error: 
Caused by: java.lang.IllegalStateException: SparkSession should only be created 
and accessed on the driver.
Do you know any other way to find the Spark version on an executor?

> Write Spark version information to Parquet file footers
> ---
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> -PARQUET-352- added support for the "writer.model.name" property in the 
> Parquet metadata to identify the object model (application) that wrote the 
> file.
> The easiest way to write this property is by overriding getName() of 
> org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
> getName() to the 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25102) Write Spark version information to Parquet file footers

2018-08-28 Thread Nikita Poberezkin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594809#comment-16594809
 ] 

Nikita Poberezkin commented on SPARK-25102:
---

Hi, Zoltan. I've tried to override the getName method in 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport. The 
problem is that the only way (that I know about) to find the Spark version 
programmatically on an executor node is 
SparkSession.builder().getOrCreate().version. But when I ran the tests I received 
the following error: 
Caused by: java.lang.IllegalStateException: SparkSession should only be created 
and accessed on the driver.
Do you know any other way to find the Spark version on an executor?
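
One possible alternative worth noting (a sketch, not necessarily the eventual fix): 
the version string is also available as the constant org.apache.spark.SPARK_VERSION, 
which does not need a SparkSession and should therefore be readable on executors. 
The metadata key below is made up for the example.

{code:java}
import org.apache.spark.SPARK_VERSION

// SPARK_VERSION is a plain constant in the org.apache.spark package object,
// so reading it never touches SparkSession. The key name here is an assumption.
val writerMetadata: Map[String, String] = Map("org.apache.spark.version" -> SPARK_VERSION)
{code}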

> Write Spark version information to Parquet file footers
> ---
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> -PARQUET-352- added support for the "writer.model.name" property in the 
> Parquet metadata to identify the object model (application) that wrote the 
> file.
> The easiest way to write this property is by overriding getName() of 
> org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
> getName() to the 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25261) Update configuration.md, correct the default units of spark.driver|executor.memory

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25261:


Assignee: (was: Apache Spark)

> Update configuration.md, correct the default units of 
> spark.driver|executor.memory
> --
>
> Key: SPARK-25261
> URL: https://issues.apache.org/jira/browse/SPARK-25261
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Priority: Minor
>
> From  
> [SparkContext|https://github.com/ivoson/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L464]
>  and 
> [SparkSubmitCommandBuilder|https://github.com/ivoson/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L265], we
>  can see that spark.driver.memory and spark.executor.memory are parsed as 
> bytes if no units are specified. But in the doc they are described as 
> defaulting to MB, which may lead to some misunderstanding.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25261) Update configuration.md, correct the default units of spark.driver|executor.memory

2018-08-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594716#comment-16594716
 ] 

Apache Spark commented on SPARK-25261:
--

User 'ivoson' has created a pull request for this issue:
https://github.com/apache/spark/pull/22252

> Update configuration.md, correct the default units of 
> spark.driver|executor.memory
> --
>
> Key: SPARK-25261
> URL: https://issues.apache.org/jira/browse/SPARK-25261
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Priority: Minor
>
> From  
> [SparkContext|https://github.com/ivoson/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L464]
>  and 
> [SparkSubmitCommandBuilder|https://github.com/ivoson/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L265], we
>  can see that spark.driver.memory and spark.executor.memory are parsed as 
> bytes if no units are specified. But in the doc they are described as 
> defaulting to MB, which may lead to some misunderstanding.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25261) Update configuration.md, correct the default units of spark.driver|executor.memory

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25261:


Assignee: Apache Spark

> Update configuration.md, correct the default units of 
> spark.driver|executor.memory
> --
>
> Key: SPARK-25261
> URL: https://issues.apache.org/jira/browse/SPARK-25261
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Assignee: Apache Spark
>Priority: Minor
>
> From  
> [SparkContext|https://github.com/ivoson/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L464]
>  and 
> [SparkSubmitCommandBuilder|https://github.com/ivoson/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L265], we
>  can see that spark.driver.memory and spark.executor.memory are parsed as 
> bytes if no units are specified. But in the doc they are described as 
> defaulting to MB, which may lead to some misunderstanding.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25261) Update configuration.md, correct the default units of spark.driver|executor.memory

2018-08-28 Thread huangtengfei (JIRA)
huangtengfei created SPARK-25261:


 Summary: Update configuration.md, correct the default units of 
spark.driver|executor.memory
 Key: SPARK-25261
 URL: https://issues.apache.org/jira/browse/SPARK-25261
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.3.0
Reporter: huangtengfei


From  
[SparkContext|https://github.com/ivoson/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L464]
 and 
[SparkSubmitCommandBuilder|https://github.com/ivoson/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L265], we
 can see that spark.driver.memory and spark.executor.memory are parsed as bytes 
if no units are specified. But in the doc they are described as defaulting to 
MB, which may lead to some misunderstanding.
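
As a small aside to the report above, spelling out a unit sidesteps the 
bytes-vs-MB question entirely; the values below are arbitrary, and in practice 
these two settings are normally supplied via spark-submit or spark-defaults.conf 
before the driver JVM starts.

{code:java}
import org.apache.spark.SparkConf

// Explicit units ("2g", "4g") avoid relying on what a bare number is parsed as,
// which is exactly the inconsistency this issue is about.
val conf = new SparkConf()
  .set("spark.driver.memory", "2g")
  .set("spark.executor.memory", "4g")
{code}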



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25260:


Assignee: Apache Spark

> Fix namespace handling in SchemaConverters.toAvroType
> -
>
> Key: SPARK-25260
> URL: https://issues.apache.org/jira/browse/SPARK-25260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Assignee: Apache Spark
>Priority: Major
>
> `toAvroType` converts a Spark data type to an Avro schema. It always appends 
> the record name to the namespace, so it's impossible to have an Avro namespace 
> independent of the record name.
>  
> When invoked with a Spark data type like this,
>  
> {code:java}
> val sparkSchema = StructType(Seq(
>     StructField("name", StringType, nullable = false),
>     StructField("address", StructType(Seq(
>         StructField("city", StringType, nullable = false),
>         StructField("state", StringType, nullable = false))),
>     nullable = false)))
>  
> // map it to an avro schema with top level namespace "foo.bar",
> val avroSchema = SchemaConverters.toAvroType(sparkSchema,  false, "employee", 
> "foo.bar")
> // result is
> // avroSchema.getName = employee
> // avroSchema.getNamespace = foo.bar.employee
> // avroSchema.getFullname = foo.bar.employee.employee
>  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType

2018-08-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594682#comment-16594682
 ] 

Apache Spark commented on SPARK-25260:
--

User 'arunmahadevan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22251

> Fix namespace handling in SchemaConverters.toAvroType
> -
>
> Key: SPARK-25260
> URL: https://issues.apache.org/jira/browse/SPARK-25260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Major
>
> `toAvroType` converts a Spark data type to an Avro schema. It always appends 
> the record name to the namespace, so it's impossible to have an Avro namespace 
> independent of the record name.
>  
> When invoked with a Spark data type like this,
>  
> {code:java}
> val sparkSchema = StructType(Seq(
>     StructField("name", StringType, nullable = false),
>     StructField("address", StructType(Seq(
>         StructField("city", StringType, nullable = false),
>         StructField("state", StringType, nullable = false))),
>     nullable = false)))
>  
> // map it to an avro schema with top level namespace "foo.bar",
> val avroSchema = SchemaConverters.toAvroType(sparkSchema,  false, "employee", 
> "foo.bar")
> // result is
> // avroSchema.getName = employee
> // avroSchema.getNamespace = foo.bar.employee
> // avroSchema.getFullname = foo.bar.employee.employee
>  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25260:


Assignee: (was: Apache Spark)

> Fix namespace handling in SchemaConverters.toAvroType
> -
>
> Key: SPARK-25260
> URL: https://issues.apache.org/jira/browse/SPARK-25260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Priority: Major
>
> `toAvroType` converts a Spark data type to an Avro schema. It always appends 
> the record name to the namespace, so it's impossible to have an Avro namespace 
> independent of the record name.
>  
> When invoked with a Spark data type like this,
>  
> {code:java}
> val sparkSchema = StructType(Seq(
>     StructField("name", StringType, nullable = false),
>     StructField("address", StructType(Seq(
>         StructField("city", StringType, nullable = false),
>         StructField("state", StringType, nullable = false))),
>     nullable = false)))
>  
> // map it to an avro schema with top level namespace "foo.bar",
> val avroSchema = SchemaConverters.toAvroType(sparkSchema,  false, "employee", 
> "foo.bar")
> // result is
> // avroSchema.getName = employee
> // avroSchema.getNamespace = foo.bar.employee
> // avroSchema.getFullname = foo.bar.employee.employee
>  
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25260) Fix namespace handling in SchemaConverters.toAvroType

2018-08-28 Thread Arun Mahadevan (JIRA)
Arun Mahadevan created SPARK-25260:
--

 Summary: Fix namespace handling in SchemaConverters.toAvroType
 Key: SPARK-25260
 URL: https://issues.apache.org/jira/browse/SPARK-25260
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Arun Mahadevan


`toAvroType` converts a Spark data type to an Avro schema. It always appends 
the record name to the namespace, so it's impossible to have an Avro namespace 
independent of the record name.

When invoked with a Spark data type like this,

 
{code:java}
val sparkSchema = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("address", StructType(Seq(
        StructField("city", StringType, nullable = false),
        StructField("state", StringType, nullable = false))),
    nullable = false)))
 
// map it to an avro schema with top level namespace "foo.bar",
val avroSchema = SchemaConverters.toAvroType(sparkSchema,  false, "employee", 
"foo.bar")

// result is
// avroSchema.getName = employee
// avroSchema.getNamespace = foo.bar.employee
// avroSchema.getFullname = foo.bar.employee.employee
 
{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20977) NPE in CollectionAccumulator

2018-08-28 Thread howie yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594659#comment-16594659
 ] 

howie yu commented on SPARK-20977:
--

I use PySpark 2.3.1 and also have this problem, but at a different line:

 

[2018-08-28 15:29:06,146][ERROR] Utils    : Uncaught exception 
in thread heartbeat-receiver-event-loop-thread
java.lang.NullPointerException
    at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:477)
    at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:449)
    at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$8$$anonfun$9.apply(TaskSchedulerImpl.scala:449)
    at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$8$$anonfun$9.apply(TaskSchedulerImpl.scala:449)
    at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$8.apply(TaskSchedulerImpl.scala:449)
    at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$8.apply(TaskSchedulerImpl.scala:448)
    at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
    at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:448)
    at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
    at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

> NPE in CollectionAccumulator
> 
>
> Key: SPARK-20977
> URL: https://issues.apache.org/jira/browse/SPARK-20977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sharkd tu
>Priority: Major
>
> {code:java}
> 17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread 
> heartbeat-receiver-event-loop-thread
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
>   at 
> 

[jira] [Resolved] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements

2018-08-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25226.
--
Resolution: Won't Fix

Please don't reopen this until we are clear that it is going to be fixed. As 
things stand, I wouldn't fix it, or at least I'm not sure what there is to fix 
for now.

> Extend functionality of from_json to support arrays of differently-typed 
> elements
> -
>
> Key: SPARK-25226
> URL: https://issues.apache.org/jira/browse/SPARK-25226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of the same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for differently-typed element 
> arrays in the 'from_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements

2018-08-28 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594648#comment-16594648
 ] 

Hyukjin Kwon commented on SPARK-25226:
--

{quote}
1. What exactly is a catalog string?
{quote}

A Hive DDL-formatted schema string. 

{quote}
2. We need JSON, because the client will expect JSON. I am not sure why you 
are saying that JSON should be deprecated.
{quote}

JSON format support only exists for backward compatibility and can be 
replaced by the catalog string in most cases. This can be retrieved via 
{{dataType.catalogString}}.

{quote}
3. Yes, I can use 'array' in the current master. But I cannot use array 
 or more complex types (nested arrays, etc.), which is 
what I really need (see ticket description).
{quote}

Spark SQL requires the schema to be known ahead of time so that we can do other 
optimizations. What type do you expect for that? A string can contain all of the 
data inside. Do you propose an object-like type to contain all of them? If so, 
please raise it on the mailing list. FWIW, I don't think it's a good idea, since 
Spark SQL's advantage comes largely from the fact that we know the schema ahead 
of time.
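
To make the terminology concrete, a small sketch (the DataFrame and the "payload" 
column are assumed): any DataType renders itself as the DDL-style catalog string, 
and from_json accepts such a string in place of a JSON-formatted schema.

{code:java}
import java.util.Collections

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

// The "catalog string" is just the Hive-DDL rendering of a DataType.
val schema = ArrayType(StringType)
val ddl = schema.catalogString  // "array<string>"

// Usage sketch, assuming a DataFrame `df` with a JSON string column "payload":
// df.select(from_json(col("payload"), ddl, Collections.emptyMap[String, String]()))
{code}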

> Extend functionality of from_json to support arrays of differently-typed 
> elements
> -
>
> Key: SPARK-25226
> URL: https://issues.apache.org/jira/browse/SPARK-25226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of the same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for differently-typed element 
> arrays in the 'from_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements

2018-08-28 Thread Yuriy Davygora (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy Davygora reopened SPARK-25226:


> Extend functionality of from_json to support arrays of differently-typed 
> elements
> -
>
> Key: SPARK-25226
> URL: https://issues.apache.org/jira/browse/SPARK-25226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of the same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for differently-typed element 
> arrays in the 'from_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25226) Extend functionality of from_json to support arrays of differently-typed elements

2018-08-28 Thread Yuriy Davygora (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594643#comment-16594643
 ] 

Yuriy Davygora commented on SPARK-25226:


[~hyukjin.kwon]

1. What exactly is a catalog string?
2. We need JSON, because the client will expect JSON. I am not sure why you 
are saying that JSON should be deprecated.
3. Yes, I can use 'array' in the current master. But I cannot use array 
 or more complex types (nested arrays, etc.), which is 
what I really need (see ticket description).

Reopening

> Extend functionality of from_json to support arrays of differently-typed 
> elements
> -
>
> Key: SPARK-25226
> URL: https://issues.apache.org/jira/browse/SPARK-25226
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuriy Davygora
>Priority: Minor
>
> At the moment, the 'from_json' function only supports a STRUCT or an ARRAY of 
> STRUCTS as input. Support for ARRAY of primitives is, apparently, coming with 
> Spark 2.4, but it will only support arrays of elements of the same data type. It 
> will not, for example, support JSON-arrays like
> {noformat}
> ["string_value", 0, true, null]
> {noformat}
> which is JSON-valid with schema
> {noformat}
> {"containsNull":true,"elementType":["string","integer","boolean"],"type":"array"}
> {noformat}
> We would like to kindly ask you to add support for differently-typed element 
> arrays in the 'from_json' function. This will necessitate extending the 
> functionality of ArrayType or maybe adding a new type (refer to 
> [[SPARK-25225]])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-28 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25259:

Comment: was deleted

(was: I'm working on it.)

> Left/Right join support push down during-join predicates
> 
>
> Key: SPARK-25259
> URL: https://issues.apache.org/jira/browse/SPARK-25259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:sql}
> create temporary view EMPLOYEE as select * from values
>   ("10", "HAAS", "A00"),
>   ("10", "THOMPSON", "B01"),
>   ("30", "KWAN", "C01"),
>   ("000110", "LUCCHESSI", "A00"),
>   ("000120", "O'CONNELL", "A))"),
>   ("000130", "QUINTANA", "C01")
>   as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);
> create temporary view DEPARTMENT as select * from values
>   ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
>   ("B01", "PLANNING", "20"),
>   ("C01", "INFORMATION CENTER", "30"),
>   ("D01", "DEVELOPMENT CENTER", null)
>   as EMPLOYEE(DEPTNO, DEPTNAME, MGRNO);
> create temporary view PROJECT as select * from values
>   ("AD3100", "ADMIN SERVICES", "D01"),
>   ("IF1000", "QUERY SERVICES", "C01"),
>   ("IF2000", "USER EDUCATION", "E01"),
>   ("MA2100", "WELD LINE AUDOMATION", "D01"),
>   ("PL2100", "WELD LINE PLANNING", "01")
>   as EMPLOYEE(PROJNO, PROJNAME, DEPTNO);
> {code}
> the SQL below:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}
> can be optimized to:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25259:


Assignee: (was: Apache Spark)

> Left/Right join support push down during-join predicates
> 
>
> Key: SPARK-25259
> URL: https://issues.apache.org/jira/browse/SPARK-25259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:sql}
> create temporary view EMPLOYEE as select * from values
>   ("10", "HAAS", "A00"),
>   ("10", "THOMPSON", "B01"),
>   ("30", "KWAN", "C01"),
>   ("000110", "LUCCHESSI", "A00"),
>   ("000120", "O'CONNELL", "A))"),
>   ("000130", "QUINTANA", "C01")
>   as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);
> create temporary view DEPARTMENT as select * from values
>   ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
>   ("B01", "PLANNING", "20"),
>   ("C01", "INFORMATION CENTER", "30"),
>   ("D01", "DEVELOPMENT CENTER", null)
>   as EMPLOYEE(DEPTNO, DEPTNAME, MGRNO);
> create temporary view PROJECT as select * from values
>   ("AD3100", "ADMIN SERVICES", "D01"),
>   ("IF1000", "QUERY SERVICES", "C01"),
>   ("IF2000", "USER EDUCATION", "E01"),
>   ("MA2100", "WELD LINE AUDOMATION", "D01"),
>   ("PL2100", "WELD LINE PLANNING", "01")
>   as EMPLOYEE(PROJNO, PROJNAME, DEPTNO);
> {code}
> the SQL below:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}
> can be optimized to:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25259:


Assignee: Apache Spark

> Left/Right join support push down during-join predicates
> 
>
> Key: SPARK-25259
> URL: https://issues.apache.org/jira/browse/SPARK-25259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> For example:
> {code:sql}
> create temporary view EMPLOYEE as select * from values
>   ("10", "HAAS", "A00"),
>   ("10", "THOMPSON", "B01"),
>   ("30", "KWAN", "C01"),
>   ("000110", "LUCCHESSI", "A00"),
>   ("000120", "O'CONNELL", "A))"),
>   ("000130", "QUINTANA", "C01")
>   as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);
> create temporary view DEPARTMENT as select * from values
>   ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
>   ("B01", "PLANNING", "20"),
>   ("C01", "INFORMATION CENTER", "30"),
>   ("D01", "DEVELOPMENT CENTER", null)
>   as EMPLOYEE(DEPTNO, DEPTNAME, MGRNO);
> create temporary view PROJECT as select * from values
>   ("AD3100", "ADMIN SERVICES", "D01"),
>   ("IF1000", "QUERY SERVICES", "C01"),
>   ("IF2000", "USER EDUCATION", "E01"),
>   ("MA2100", "WELD LINE AUDOMATION", "D01"),
>   ("PL2100", "WELD LINE PLANNING", "01")
>   as EMPLOYEE(PROJNO, PROJNAME, DEPTNO);
> {code}
> the SQL below:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}
> can be optimized to:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594606#comment-16594606
 ] 

Apache Spark commented on SPARK-25259:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22250

> Left/Right join support push down during-join predicates
> 
>
> Key: SPARK-25259
> URL: https://issues.apache.org/jira/browse/SPARK-25259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:sql}
> create temporary view EMPLOYEE as select * from values
>   ("10", "HAAS", "A00"),
>   ("10", "THOMPSON", "B01"),
>   ("30", "KWAN", "C01"),
>   ("000110", "LUCCHESSI", "A00"),
>   ("000120", "O'CONNELL", "A))"),
>   ("000130", "QUINTANA", "C01")
>   as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);
> create temporary view DEPARTMENT as select * from values
>   ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
>   ("B01", "PLANNING", "20"),
>   ("C01", "INFORMATION CENTER", "30"),
>   ("D01", "DEVELOPMENT CENTER", null)
>   as EMPLOYEE(DEPTNO, DEPTNAME, MGRNO);
> create temporary view PROJECT as select * from values
>   ("AD3100", "ADMIN SERVICES", "D01"),
>   ("IF1000", "QUERY SERVICES", "C01"),
>   ("IF2000", "USER EDUCATION", "E01"),
>   ("MA2100", "WELD LINE AUDOMATION", "D01"),
>   ("PL2100", "WELD LINE PLANNING", "01")
>   as EMPLOYEE(PROJNO, PROJNAME, DEPTNO);
> {code}
> the SQL below:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}
> can be optimized to:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-28 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594563#comment-16594563
 ] 

Yuming Wang commented on SPARK-25259:
-

I'm working on it.

> Left/Right join support push down during-join predicates
> 
>
> Key: SPARK-25259
> URL: https://issues.apache.org/jira/browse/SPARK-25259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:sql}
> create temporary view EMPLOYEE as select * from values
>   ("10", "HAAS", "A00"),
>   ("10", "THOMPSON", "B01"),
>   ("30", "KWAN", "C01"),
>   ("000110", "LUCCHESSI", "A00"),
>   ("000120", "O'CONNELL", "A))"),
>   ("000130", "QUINTANA", "C01")
>   as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);
> create temporary view DEPARTMENT as select * from values
>   ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
>   ("B01", "PLANNING", "20"),
>   ("C01", "INFORMATION CENTER", "30"),
>   ("D01", "DEVELOPMENT CENTER", null)
>   as EMPLOYEE(DEPTNO, DEPTNAME, MGRNO);
> create temporary view PROJECT as select * from values
>   ("AD3100", "ADMIN SERVICES", "D01"),
>   ("IF1000", "QUERY SERVICES", "C01"),
>   ("IF2000", "USER EDUCATION", "E01"),
>   ("MA2100", "WELD LINE AUDOMATION", "D01"),
>   ("PL2100", "WELD LINE PLANNING", "01")
>   as EMPLOYEE(PROJNO, PROJNAME, DEPTNO);
> {code}
> the SQL below:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}
> can be optimized to:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-28 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25259:
---

 Summary: Left/Right join support push down during-join predicates
 Key: SPARK-25259
 URL: https://issues.apache.org/jira/browse/SPARK-25259
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Yuming Wang


For example:
{code:sql}
create temporary view EMPLOYEE as select * from values
  ("10", "HAAS", "A00"),
  ("10", "THOMPSON", "B01"),
  ("30", "KWAN", "C01"),
  ("000110", "LUCCHESSI", "A00"),
  ("000120", "O'CONNELL", "A))"),
  ("000130", "QUINTANA", "C01")
  as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);

create temporary view DEPARTMENT as select * from values
  ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
  ("B01", "PLANNING", "20"),
  ("C01", "INFORMATION CENTER", "30"),
  ("D01", "DEVELOPMENT CENTER", null)
  as EMPLOYEE(DEPTNO, DEPTNAME, MGRNO);

create temporary view PROJECT as select * from values
  ("AD3100", "ADMIN SERVICES", "D01"),
  ("IF1000", "QUERY SERVICES", "C01"),
  ("IF2000", "USER EDUCATION", "E01"),
  ("MA2100", "WELD LINE AUDOMATION", "D01"),
  ("PL2100", "WELD LINE PLANNING", "01")
  as EMPLOYEE(PROJNO, PROJNAME, DEPTNO);
{code}

the SQL below:
{code:sql}
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
ON P.DEPTNO = D.DEPTNO
AND P.DEPTNO='E01';
{code}

can be optimized to:
{code:sql}
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
ON P.DEPTNO = D.DEPTNO
AND P.DEPTNO='E01';
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


