[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance

2014-12-18 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252059#comment-14252059
 ] 

Xuefu Zhang commented on HIVE-9127:
---

The Spark patch is also committed to the Spark branch.

 Improve CombineHiveInputFormat.getSplit performance
 ---

 Key: HIVE-9127
 URL: https://issues.apache.org/jira/browse/HIVE-9127
 Project: Hive
  Issue Type: Sub-task
Affects Versions: 0.14.0
Reporter: Brock Noland
Assignee: Brock Noland
 Fix For: 0.15.0

 Attachments: HIVE-9127.1-spark.patch.txt, 
 HIVE-9127.2-spark.patch.txt, HIVE-9127.3.patch.txt


 In HIVE-7431 we disabled caching of Map/Reduce work objects because some tasks
 would fail. However, we should be able to cache these objects in RSC for split
 generation. See
 https://issues.apache.org/jira/browse/HIVE-9124?focusedCommentId=14248622&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14248622
 for how this impacts performance.
 Caller ST:
 {noformat}
 
 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:328)
 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:421)
 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:510)
 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 2014-12-16 14:36:22,202 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:192)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.dependencies(RDD.scala:190)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:301)
 2014-12-16 14:36:22,203 INFO  [stdout-redir-1]:
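
The description above proposes caching the deserialized Map/Reduce work objects in the remote Spark context (RSC) so that split generation does not redo that work on every call. A minimal sketch of that idea, assuming a hypothetical WorkCache helper keyed by plan path; none of these names are Hive APIs:

{noformat}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: memoize deserialized work objects per plan path so
// repeated split-generation calls in the long-lived remote context pay the
// deserialization cost at most once. Not Hive's actual implementation.
final class WorkCache<W> {
  private final Map<String, W> cache = new ConcurrentHashMap<>();

  interface Loader<W> {
    W load(String planPath);
  }

  W get(String planPath, Loader<W> loader) {
    // computeIfAbsent runs the loader at most once per distinct plan path
    return cache.computeIfAbsent(planPath, loader::load);
  }
}
{noformat}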

[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance

2014-12-17 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249631#comment-14249631
 ] 

Hive QA commented on HIVE-9127:
---



{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12687603/HIVE-9127.3.patch.txt

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 6713 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2103/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2103/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2103/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12687603 - PreCommit-HIVE-TRUNK-Build


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance

2014-12-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249916#comment-14249916
 ] 

Xuefu Zhang commented on HIVE-9127:
---

+1. Please modify the query if the patch is going to apply to trunk.


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance

2014-12-17 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250104#comment-14250104
 ] 

Brock Noland commented on HIVE-9127:


Thank you Xuefu!

bq. Please modify the query if the patch is going to apply to trunk.

I don't follow. The latest patch applies to trunk and was tested on trunk.


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance

2014-12-17 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250325#comment-14250325
 ] 

Jimmy Xiang commented on HIVE-9127:
---

In looking into HIVE-9135, I was wondering if it would be better to fix the root 
cause of HIVE-7431 instead of disabling the cache for Spark. If so, we probably 
don't need this workaround?


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance

2014-12-17 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250373#comment-14250373
 ] 

Brock Noland commented on HIVE-9127:


bq. In looking into HIVE-9135, I was wondering if it would be better to fix the 
root cause of HIVE-7431 instead of disabling the cache for Spark.

I think that would be awesome. I think we disabled it early on when we were 
just trying to get HOS working.

bq. If so, we probably don't need this workaround?

I think this workaround results in better code generally. In 
CombineHiveInputFormat we were looking up the partition information on each 
loop iteration; with this fix we do it once before the loop.
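
For illustration only, a minimal sketch of the shape of that change, with hypothetical names (PartitionLookupSketch, slowLookup, and a plain map standing in for the path-to-partition metadata); this is not the actual CombineHiveInputFormat code:

{noformat}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PartitionLookupSketch {

  // Stand-in for the expensive lookup: a linear scan with prefix matching.
  static String slowLookup(Map<String, String> pathToPartitionInfo, String splitPath) {
    for (Map.Entry<String, String> e : pathToPartitionInfo.entrySet()) {
      if (splitPath.startsWith(e.getKey())) {
        return e.getValue();
      }
    }
    return null;
  }

  // Before: one expensive lookup per split, inside the loop.
  static void perIteration(List<String> splitPaths, Map<String, String> info) {
    for (String path : splitPaths) {
      String part = slowLookup(info, path); // O(partitions) on every iteration
      // ... build the combined split using part ...
    }
  }

  // After: resolve each distinct path once, before the split loop.
  static void hoisted(List<String> splitPaths, Map<String, String> info) {
    Map<String, String> resolved = new HashMap<>();
    for (String path : splitPaths) {
      resolved.computeIfAbsent(path, p -> slowLookup(info, p));
    }
    for (String path : splitPaths) {
      String part = resolved.get(path); // constant-time inside the loop
      // ... build the combined split using part ...
    }
  }
}
{noformat}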


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance

2014-12-17 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250383#comment-14250383
 ] 

Jimmy Xiang commented on HIVE-9127:
---

bq. I think this workaround results in better code generally.
Agreed. Thanks.


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance

2014-12-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250389#comment-14250389
 ] 

Xuefu Zhang commented on HIVE-9127:
---

{quote}
Please modify the query if the patch is going to apply to trunk.
{quote}
My bad. I meant to say modify the JIRA, but looking at it again it seems 
alright except for the Spark component, which probably doesn't matter.


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance in RSC [Spark Branch]

2014-12-16 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248823#comment-14248823
 ] 

Brock Noland commented on HIVE-9127:


The attached patch should fix {{CombineHiveInputFormat}} but does not address 
{{HiveInputFormat}}.


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance in RSC [Spark Branch]

2014-12-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248835#comment-14248835
 ] 

Xuefu Zhang commented on HIVE-9127:
---

[~brocknoland], thanks for working on this. A quick look suggests that your 
changes are all in trunk code and have nothing to do with RSC, unless I missed 
something. If that's the case, we should apply the patch to trunk instead.


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance in RSC [Spark Branch]

2014-12-16 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248843#comment-14248843
 ] 

Brock Noland commented on HIVE-9127:


Yep, that makes sense. We'll run tests on the Spark branch first since it's 
faster, and then if it works I will attach a patch for trunk.


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance in RSC [Spark Branch]

2014-12-16 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249062#comment-14249062
 ] 

Hive QA commented on HIVE-9127:
---



{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12687559/HIVE-9127.2-spark.patch.txt

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 7235 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample_islocalmode_hook
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_cast_constant
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_optimize_nullscan
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/556/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/556/console
Test logs: 
http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-SPARK-Build-556/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12687559 - PreCommit-HIVE-SPARK-Build


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance in RSC

2014-12-16 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249069#comment-14249069
 ] 

Brock Noland commented on HIVE-9127:


Attached patch for trunk.


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance

2014-12-16 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249287#comment-14249287
 ] 

Rui Li commented on HIVE-9127:
--

Will this cache Map/Reduce works for Spark? It seems the changes to Utilities 
don't change how work is retrieved or cached.
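
For context, the distinction being asked about might look like this; a sketch 
only, with hypothetical names rather than the real 
org.apache.hadoop.hive.ql.exec.Utilities internals:
{code}
// The point at issue: a work object is only reused if the retrieval path
// itself consults a cache. All names here are illustrative.
public final class RetrievalSketch {
  public interface Cache { Object get(String key); void put(String key, Object value); }
  public interface Loader { Object load(String planPath); }

  public static Object retrieveWork(String planPath, Cache cache, Loader loader) {
    Object work = cache.get(planPath);   // without this lookup, every call
    if (work == null) {                  // re-deserializes the plan
      work = loader.load(planPath);
      cache.put(planPath, work);
    }
    return work;
  }
}
{code}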


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance

2014-12-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249427#comment-14249427
 ] 

Xuefu Zhang commented on HIVE-9127:
---

Hi [~lirui], I think this JIRA was re-purposed to enhance getSplit performance. 
Brock created HIVE-9135 for cloning Map/Reduce-Work.
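
One common way to deep-clone a plan object is a serialization round-trip; the 
sketch below assumes Kryo and a Kryo-serializable work graph, and is not the 
HIVE-9135 patch itself:
{code}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Hypothetical deep clone via a Kryo round-trip: write the object graph out
// and read it back to obtain an independent copy that is safe to mutate.
public final class PlanCloneSketch {
  public static <T> T deepClone(Kryo kryo, T work, Class<T> clazz) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    Output out = new Output(bytes);
    kryo.writeObject(out, work);            // serialize the full graph
    out.close();
    Input in = new Input(new ByteArrayInputStream(bytes.toByteArray()));
    try {
      return kryo.readObject(in, clazz);    // deserialize the copy
    } finally {
      in.close();
    }
  }
}
{code}
Cloning trades CPU for isolation: each caller gets its own copy, so a cached 
master work object is never mutated concurrently.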


[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance

2014-12-16 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249481#comment-14249481
 ] 

Rui Li commented on HIVE-9127:
--

[~xuefuz] - Oh I see. Thanks for the explanation!
