[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252059#comment-14252059 ]

Xuefu Zhang commented on HIVE-9127:
---

The Spark patch is also committed to the Spark branch.

Improve CombineHiveInputFormat.getSplit performance
---

Key: HIVE-9127
URL: https://issues.apache.org/jira/browse/HIVE-9127
Project: Hive
Issue Type: Sub-task
Affects Versions: 0.14.0
Reporter: Brock Noland
Assignee: Brock Noland
Fix For: 0.15.0
Attachments: HIVE-9127.1-spark.patch.txt, HIVE-9127.2-spark.patch.txt, HIVE-9127.3.patch.txt

In HIVE-7431 we disabled caching of Map/Reduce works because some tasks would fail. However, we should be able to cache these objects in RSC for split generation. See https://issues.apache.org/jira/browse/HIVE-9124?focusedCommentId=14248622&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14248622 for how this impacts performance.

Caller ST:
{noformat}
2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:328)
2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:421)
2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:510)
2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
2014-12-16 14:36:22,202 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:79)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:192)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at scala.Option.getOrElse(Option.scala:120)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.rdd.RDD.dependencies(RDD.scala:190)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(435)) - at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:301)
2014-12-16 14:36:22,203 INFO [stdout-redir-1]:
{noformat}
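To illustrate the caching idea in the description: split generation in the remote Spark context re-deserializes the same work object for each getSplits() call, so memoizing the deserialized object per plan path removes that repeated cost. A minimal sketch of the pattern, assuming hypothetical names (WorkCacheSketch, deserializePlan, and the String key are illustrative stand-ins, not Hive's actual API):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch: memoize deserialized work objects per plan path so that
// repeated getSplits() calls in one JVM reuse the same object. Hypothetical
// names throughout; this is not Hive's actual implementation.
public final class WorkCacheSketch {
  private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

  public static Object getWork(String planPath) {
    // Deserialize at most once per plan path; later callers get the cached copy.
    return CACHE.computeIfAbsent(planPath, WorkCacheSketch::deserializePlan);
  }

  private static Object deserializePlan(String planPath) {
    // Stand-in for reading and deserializing the plan file at planPath.
    return new Object();
  }

  // HIVE-7431 disabled caching because stale entries broke some tasks;
  // clearing between queries is the obvious guard against that failure mode.
  public static void clear() {
    CACHE.clear();
  }
}
{code}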
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249631#comment-14249631 ]

Hive QA commented on HIVE-9127:
---

{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12687603/HIVE-9127.3.patch.txt

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 6713 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2103/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2103/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2103/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12687603 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249916#comment-14249916 ]

Xuefu Zhang commented on HIVE-9127:
---

+1. Please modify the query if the patch is going to apply to trunk.
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250104#comment-14250104 ]

Brock Noland commented on HIVE-9127:
---

Thank you Xuefu!

bq. Please modify the query if the patch is going to apply to trunk.

I don't follow? The latest patch applies to trunk and was tested on trunk.
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250325#comment-14250325 ]

Jimmy Xiang commented on HIVE-9127:
---

While looking into HIVE-9135, I was wondering if it would be better to fix the root cause of HIVE-7431 instead of disabling the cache for Spark. If so, we probably don't need this workaround?
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250373#comment-14250373 ]

Brock Noland commented on HIVE-9127:
---

bq. While looking into HIVE-9135, I was wondering if it would be better to fix the root cause of HIVE-7431 instead of disabling the cache for Spark.

I think that would be awesome. I think we disabled it early on when we were just trying to get HOS working.

bq. If so, we probably don't need this workaround?

I think this workaround results in better code generally. In CombineHiveInputFormat we were looking up the partition information on each loop iteration, but with this fix we do it once before the loop, which is generally better.
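To make that before/after concrete, here is a hedged sketch of the hoisting pattern being described; all names (PartInfo, expensiveLookup, makeSplit) are illustrative stand-ins, not the actual CombineHiveInputFormat code:

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of hoisting an expensive per-iteration lookup out of
// the split loop. Names are hypothetical, not Hive's actual code.
class SplitLoopSketch {
  static class PartInfo {}

  // Before: the lookup (linear in the number of partition entries) runs once
  // per split, even though many splits share the same path.
  static void before(List<String> splitPaths, Map<String, PartInfo> partitions) {
    for (String path : splitPaths) {
      PartInfo info = expensiveLookup(partitions, path); // repeated cost
      makeSplit(path, info);
    }
  }

  // After: resolve each distinct path once up front, so the hot loop does
  // only O(1) map lookups.
  static void after(List<String> splitPaths, Map<String, PartInfo> partitions) {
    Map<String, PartInfo> resolved = new HashMap<>();
    for (String path : splitPaths) {
      resolved.computeIfAbsent(path, p -> expensiveLookup(partitions, p));
    }
    for (String path : splitPaths) {
      makeSplit(path, resolved.get(path));
    }
  }

  static PartInfo expensiveLookup(Map<String, PartInfo> partitions, String path) {
    // Stand-in for suffix/prefix matching across all partition entries.
    return partitions.get(path);
  }

  static void makeSplit(String path, PartInfo info) {
    // Stand-in for constructing the combined split.
  }
}
{code}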
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250383#comment-14250383 ]

Jimmy Xiang commented on HIVE-9127:
---

bq. I think this workaround results in better code generally.

Agreed. Thanks.
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250389#comment-14250389 ]

Xuefu Zhang commented on HIVE-9127:
---

{quote}
Please modify the query if the patch is going to apply to trunk.
{quote}

My bad. I meant to say modify the JIRA, but looking at it again it seems alright, except for the Spark component, which probably doesn't matter.
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance in RSC [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248823#comment-14248823 ]

Brock Noland commented on HIVE-9127:
---

The attached patch should fix {{CombineHiveInputFormat}} but does not address {{HiveInputFormat}}.
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance in RSC [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248835#comment-14248835 ]

Xuefu Zhang commented on HIVE-9127:
---

[~brocknoland], thanks for working on this. A quick look suggests that your changes are all in trunk code and have nothing to do with RSC, unless I missed something. If that's the case, we should apply the patch to trunk instead.
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance in RSC [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248843#comment-14248843 ]

Brock Noland commented on HIVE-9127:
---

Yep, that makes sense. We'll run tests on the Spark branch first since it's faster, and if it works I will attach a patch for trunk.
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance in RSC [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249062#comment-14249062 ]

Hive QA commented on HIVE-9127:
---

{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12687559/HIVE-9127.2-spark.patch.txt

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 7235 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample_islocalmode_hook
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_cast_constant
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_optimize_nullscan
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/556/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/556/console
Test logs: http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-SPARK-Build-556/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12687559 - PreCommit-HIVE-SPARK-Build
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance in RSC
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249069#comment-14249069 ]

Brock Noland commented on HIVE-9127:
---

Attached patch for trunk.
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249287#comment-14249287 ] Rui Li commented on HIVE-9127: -- Will this cache Map/Reduce works for Spark? It seems the changes to Utilities don't change how work is retrieved or cached.
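For context on what caching the works during split generation could look like, here is a minimal, hypothetical sketch: keep the deserialized work object in a map keyed by its plan path, so repeated getSplits() calls in the same remote Spark context process deserialize each plan only once. The names WorkCache, getWork, deserializePlan, and clearWork are illustrative, not Hive's actual Utilities API.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative cache of deserialized work objects, keyed by plan path.
public final class WorkCache {
  private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

  // Deserialize the plan at most once per path; later callers hit the cache.
  public static Object getWork(String planPath) {
    return CACHE.computeIfAbsent(planPath, WorkCache::deserializePlan);
  }

  // Placeholder: in practice this would read the serialized plan from the
  // filesystem and deserialize it (Hive serializes plans with Kryo).
  private static Object deserializePlan(String planPath) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }

  // Evict when the query finishes so a long-lived process does not
  // accumulate plans across queries.
  public static void clearWork(String planPath) {
    CACHE.remove(planPath);
  }
}
{code}

One plausible reason caching had to be disabled in HIVE-7431 ("some tasks would fail") is visible in this shape: the cached object is shared, so a caller that mutates it affects every other user of the same plan.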
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249427#comment-14249427 ] Xuefu Zhang commented on HIVE-9127: --- Hi [~lirui], I think this JIRA was re-purposed to enhance getSplit performance. Brock created HIVE-9135 for cloning Map/Reduce-Work.
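To illustrate why cloning matters once works are cached (the subject of HIVE-9135 mentioned above): a cached plan is shared, so a caller that needs to mutate it should take a private deep copy first. Below is a minimal sketch of clone-by-serialization; Hive serializes plans with Kryo, but plain Java serialization keeps the example self-contained, and PlanCloner/deepCopy are illustrative names.

{code:java}
import java.io.*;

// Illustrative deep copy via a serialization round trip.
public final class PlanCloner {
  @SuppressWarnings("unchecked")
  public static <T extends Serializable> T deepCopy(T work)
      throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(work); // write the shared, cached object
    }
    try (ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(bytes.toByteArray()))) {
      return (T) in.readObject(); // a private copy, safe to mutate
    }
  }
}
{code}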
[jira] [Commented] (HIVE-9127) Improve CombineHiveInputFormat.getSplit performance
[ https://issues.apache.org/jira/browse/HIVE-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249481#comment-14249481 ] Rui Li commented on HIVE-9127: -- [~xuefuz] - Oh, I see. Thanks for the explanation!