[jira] [Commented] (HIVE-9370) Enable Hive on Spark for BigBench and run Query 8, the test failed [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279832#comment-14279832 ] Xuefu Zhang commented on HIVE-9370: --- [~sandyr] Could you take a look at the above issue regarding sortByKey and share your thoughts? Thanks.

Enable Hive on Spark for BigBench and run Query 8, the test failed [Spark Branch]
-
Key: HIVE-9370
URL: https://issues.apache.org/jira/browse/HIVE-9370
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: yuyun.chen

Enabled Hive on Spark and ran BigBench Query 8, then got the following exception:

2015-01-14 11:43:46,057 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it.
2015-01-14 11:43:46,061 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it.
2015-01-14 11:43:46,061 ERROR [main]: status.SparkJobMonitor (SessionState.java:printError(839)) - Status: Failed
2015-01-14 11:43:46,062 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - /PERFLOG method=SparkRunJob start=1421206996052 end=1421207026062 duration=30010 from=org.apache.hadoop.hive.ql.exec.spark.status.SparkJobMonitor
2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - 15/01/14 11:43:46 INFO RemoteDriver: Failed to run job 0a9a7782-0e0b-4561-8468-959a6d8df0a3
2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - java.lang.InterruptedException
2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at java.lang.Object.wait(Native Method)
2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at java.lang.Object.wait(Object.java:503)
2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514)
2015-01-14 11:43:46,071 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.SparkContext.runJob(SparkContext.scala:1282)
2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.SparkContext.runJob(SparkContext.scala:1300)
2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.SparkContext.runJob(SparkContext.scala:1314)
2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:262)
2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:124)
2015-01-14 11:43:46,072 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:63)
2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:894)
2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:864)
2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.hadoop.hive.ql.exec.spark.SortByShuffler.shuffle(SortByShuffler.java:48)
2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.hadoop.hive.ql.exec.spark.ShuffleTran.transform(ShuffleTran.java:45)
2015-01-14 11:43:46,073 INFO [stderr-redir-1]: client.SparkClientImpl (SparkClientImpl.java:run(436)) - at org.apache.hadoop.hive.ql.exec.spark.SparkPlan.generateGraph
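The stack trace above ends in an eager collect() inside RangePartitioner: sortByKey has to sample the keys to compute partition boundaries before the sorted RDD is even built, which is why a Spark job is launched at plan-construction time. A loose, illustrative sketch of that boundary computation follows; the class and method names are hypothetical, not Spark's actual API:

```java
import java.util.List;

/**
 * Illustrative only: mimics, very loosely, what Spark's RangePartitioner
 * does conceptually. Computing range boundaries requires scanning a sample
 * of the keys first, so sortByKey triggers a job before any action is
 * called on the sorted RDD.
 */
public class RangeBoundsSketch {
    // Pick (numPartitions - 1) boundary keys from a sample of the data.
    static int[] rangeBounds(List<Integer> sampledKeys, int numPartitions) {
        int[] sorted = sampledKeys.stream().mapToInt(Integer::intValue).sorted().toArray();
        int[] bounds = new int[numPartitions - 1];
        for (int i = 0; i < bounds.length; i++) {
            // Evenly spaced quantiles over the sorted sample.
            bounds[i] = sorted[(i + 1) * sorted.length / numPartitions];
        }
        return bounds;
    }
}
```

The point of the sketch is only that the sample must exist before the boundaries do; in Spark, producing that sample is itself a job, which is what collides with the remote-context submission path discussed below.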
[jira] [Commented] (HIVE-9370) Enable Hive on Spark for BigBench and run Query 8, the test failed [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279856#comment-14279856 ] Xuefu Zhang commented on HIVE-9370: --- Yeah. This seems to interfere with Hive's way of launching Spark jobs with the remote SparkContext.
[jira] [Comment Edited] (HIVE-9370) Enable Hive on Spark for BigBench and run Query 8, the test failed [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279856#comment-14279856 ] Xuefu Zhang edited comment on HIVE-9370 at 1/16/15 6:15 AM: Yeah. This seems to interfere with Hive's way of launching Spark jobs with the remote SparkContext. cc: [~vanzin] was (Author: xuefuz): Yeah. This seems interfering with Hive's way of launching Spark jobs with the remote SparkContext.
[jira] [Commented] (HIVE-9378) Spark qfile tests should reuse RSC
[ https://issues.apache.org/jira/browse/HIVE-9378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277628#comment-14277628 ] Xuefu Zhang commented on HIVE-9378: --- [~jxiang], each qfile test is supposed to run independently, though it may seem inefficient. I'm not sure there is a strong need to change this. However, we don't want them to share a session because each test may have different configurations. Spark qfile tests should reuse RSC -- Key: HIVE-9378 URL: https://issues.apache.org/jira/browse/HIVE-9378 Project: Hive Issue Type: Improvement Reporter: Jimmy Xiang Run several qfile tests and use jps to monitor the Java processes. You will find several SparkSubmitDriverBootstrapper processes are created (not at the same time, of course). It seems to me that we create a RSC for each qfile, then terminate it when that qfile test is done. The RSC does not seem to be shared among qfiles. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277642#comment-14277642 ] Xuefu Zhang commented on HIVE-9367: --- Thanks for the explanation. This is a shim class, so we are okay. Patch looks good to me. One note, though: the prune() method seems no longer needed. Could you remove it? CombineFileInputFormatShim#getDirIndices is expensive - Key: HIVE-9367 URL: https://issues.apache.org/jira/browse/HIVE-9367 Project: Hive Issue Type: Improvement Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: HIVE-9367.1.patch [~lirui] found out that we spend quite some time in CombineFileInputFormatShim#getDirIndices. Having looked into it, it seems to me we should be able to get rid of this method completely if we can enhance CombineFileInputFormatShim a little. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9379) Fix tests with some versions of Spark + Snappy [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277875#comment-14277875 ] Xuefu Zhang commented on HIVE-9379: --- +1. Looks good to me, and good for Mac users. Fix tests with some versions of Spark + Snappy [Spark Branch] - Key: HIVE-9379 URL: https://issues.apache.org/jira/browse/HIVE-9379 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Brock Noland Assignee: Brock Noland Attachments: HIVE-9379.1-spark.patch Some versions of Spark use a Snappy version that requires the following properties on OSX: {noformat} -Dorg.xerial.snappy.tempdir=/tmp -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9342) add num-executors / executor-cores / executor-memory option support for hive on spark in Yarn mode
[ https://issues.apache.org/jira/browse/HIVE-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277900#comment-14277900 ] Xuefu Zhang commented on HIVE-9342: --- +1 add num-executors / executor-cores / executor-memory option support for hive on spark in Yarn mode -- Key: HIVE-9342 URL: https://issues.apache.org/jira/browse/HIVE-9342 Project: Hive Issue Type: Improvement Components: spark-branch Affects Versions: spark-branch Reporter: Pierre Yin Priority: Minor Labels: spark Fix For: spark-branch Attachments: HIVE-9342.1-spark.patch, HIVE-9342.2-spark.patch When I run hive on spark with Yarn mode, I want to control some yarn option, such as --num-executors, --executor-cores, --executor-memory. We can append these options into argv in SparkClientImpl. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9342) add num-executors / executor-cores / executor-memory option support for hive on spark in Yarn mode [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9342: -- Summary: add num-executors / executor-cores / executor-memory option support for hive on spark in Yarn mode [Spark Branch] (was: add num-executors / executor-cores / executor-memory option support for hive on spark in Yarn mode) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9372) Parallel checking non-combinable paths in CombineHiveInputFormat
[ https://issues.apache.org/jira/browse/HIVE-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278053#comment-14278053 ] Xuefu Zhang commented on HIVE-9372: --- Your patch is for trunk, which has a longer test queue: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/ Parallel checking non-combinable paths in CombineHiveInputFormat Key: HIVE-9372 URL: https://issues.apache.org/jira/browse/HIVE-9372 Project: Hive Issue Type: Improvement Reporter: Rui Li Assignee: Rui Li Attachments: HIVE-9372.1.patch Checking if an input path is combinable is expensive. So we should make it parallel. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9342) add num-executors / executor-cores / executor-memory option support for hive on spark in Yarn mode [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9342: -- Issue Type: Sub-task (was: Improvement) Parent: HIVE-7292 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277931#comment-14277931 ] Xuefu Zhang commented on HIVE-9367: --- +1 pending on test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9372) Parallel checking non-combinable paths in CombineHiveInputFormat
[ https://issues.apache.org/jira/browse/HIVE-9372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277982#comment-14277982 ] Xuefu Zhang commented on HIVE-9372: --- Patch looks good to me. One minor comment: can we give an initial size to the following list, since we know how many elements it will have? {code} List<Future<Set<Integer>>> futureList = new ArrayList<Future<Set<Integer>>>(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
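The pre-sizing suggestion above can be sketched as follows. This is a stand-in, not Hive's actual CombineHiveInputFormat code: the thread-pool setup and the "non-combinable" check are hypothetical, and only the shape (submit one check per path, collect the futures in a pre-sized list) mirrors the patch under review:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch of parallel path checking with a pre-sized futures list. */
public class ParallelPathCheck {
    static Set<Integer> checkPaths(List<String> paths) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Initial capacity = number of paths, per the review comment,
        // to avoid ArrayList re-allocations.
        List<Future<Set<Integer>>> futureList = new ArrayList<>(paths.size());
        for (int i = 0; i < paths.size(); i++) {
            final int idx = i;
            final String path = paths.get(i);
            futureList.add(pool.submit(() -> {
                Set<Integer> nonCombinable = new HashSet<>();
                // Stand-in check: treat paths ending in ".nc" as non-combinable.
                if (path.endsWith(".nc")) {
                    nonCombinable.add(idx);
                }
                return nonCombinable;
            }));
        }
        Set<Integer> result = new HashSet<>();
        try {
            for (Future<Set<Integer>> f : futureList) {
                result.addAll(f.get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return result;
    }
}
```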
[jira] [Commented] (HIVE-9178) Create a separate API for remote Spark Context RPC other than job submission [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278179#comment-14278179 ] Xuefu Zhang commented on HIVE-9178: --- +1 Create a separate API for remote Spark Context RPC other than job submission [Spark Branch] --- Key: HIVE-9178 URL: https://issues.apache.org/jira/browse/HIVE-9178 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Marcelo Vanzin Attachments: HIVE-9178.1-spark.patch, HIVE-9178.1-spark.patch, HIVE-9178.2-spark.patch, HIVE-9178.2-spark.patch Based on discussions in HIVE-8972, it seems to make sense to create a separate API for RPCs, such as addJar and getExecutorCounter. These jobs differ from a query submission in that they don't need to be queued in the backend and can be executed right away. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277070#comment-14277070 ] Xuefu Zhang commented on HIVE-9367: --- Nice improvement. However, I'm a little concerned about overriding the listStatus() method, as a caller (including subclasses) would suddenly get a list with folders excluded. I'm wondering if it's possible to achieve the same optimization without overriding that method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
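One way to keep listStatus() intact, in the spirit of the concern above, is to record the directory indices in a single pass over the listing instead of filtering directories out of the returned list. A simplified sketch; the Status type is a hypothetical stand-in for Hadoop's FileStatus, not Hive's actual shim code:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the trade-off discussed: rather than overriding listStatus()
 * to silently drop directories (surprising other callers and subclasses),
 * collect the directory indices in one pass without mutating the listing.
 */
public class DirIndexScan {
    // Simplified stand-in for Hadoop's FileStatus.
    static class Status {
        final String path;
        final boolean isDir;
        Status(String path, boolean isDir) { this.path = path; this.isDir = isDir; }
    }

    // One pass: indices of directory entries; the listing itself is untouched.
    static List<Integer> dirIndices(List<Status> listing) {
        List<Integer> dirs = new ArrayList<>(listing.size());
        for (int i = 0; i < listing.size(); i++) {
            if (listing.get(i).isDir) {
                dirs.add(i);
            }
        }
        return dirs;
    }
}
```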
[jira] [Comment Edited] (HIVE-9178) Create a separate API for remote Spark Context RPC other than job submission [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276485#comment-14276485 ] Xuefu Zhang edited comment on HIVE-9178 at 1/14/15 5:10 AM: The dummy patch produces 3 failures and the run takes 1 hour 48 minutes. It seems likely that the patch has some defects. Chengxiang's question might be a hint. was (Author: xuefuz): The dummy patch produces 3 failures and the run takes 1 hour 48 minutes. It seems likely that the patch has some defects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9178) Create a separate API for remote Spark Context RPC other than job submission [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276485#comment-14276485 ] Xuefu Zhang commented on HIVE-9178: --- The dummy patch produces 3 failures and the run takes 1 hour 48 minutes. It seems likely that the patch has some defects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9178) Create a separate API for remote Spark Context RPC other than job submission [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9178: -- Attachment: HIVE-9178.1-spark.patch Reattach the same patch to have another test run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9178) Create a separate API for remote Spark Context RPC other than job submission [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9178: -- Attachment: HIVE-9178.2-spark.patch Attached a dummy patch to test the test env. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9178) Create a separate API for remote Spark Context RPC other than job submission [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276223#comment-14276223 ] Xuefu Zhang commented on HIVE-9178: --- It looks like the patch has somehow increased the test run time quite dramatically. Normally it takes about an hour to finish, but now it has been running for over 4 hours and is still going. http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/638/ -- last one http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/639/ -- currently running -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9342) add num-executors / executor-cores / executor-memory option support for hive on spark in Yarn mode
[ https://issues.apache.org/jira/browse/HIVE-9342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274074#comment-14274074 ] Xuefu Zhang commented on HIVE-9342: --- [~fangxi.yin], thanks for working on this. [~chengxiang li], could you please take a look at the proposed change, especially in light of Spark dynamic executor scaling? Also note that Spark standalone mode is also supported by Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9340) Address review of HIVE-9257 (ii) [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9340: -- Summary: Address review of HIVE-9257 (ii) [Spark Branch] (was: Address review of HIVE-9257 (ii)) Address review of HIVE-9257 (ii) [Spark Branch] --- Key: HIVE-9340 URL: https://issues.apache.org/jira/browse/HIVE-9340 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Szehon Ho Assignee: Szehon Ho Attachments: HIVE-9257-spark.patch Some minor fixes: 1. Get rid of spark_test.q, which was used to test the sparkCliDriver test fw. 2. Get rid of spark-snapshot repository dep in pom (found by Xuefu) 3. Cleanup ExplainTask to get rid of * in imports. (found by Xuefu) 4. Reorder the scala/spark dependencies in pom to fit the alphabetical order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9340) Address review of HIVE-9257 (ii)
[ https://issues.apache.org/jira/browse/HIVE-9340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274303#comment-14274303 ] Xuefu Zhang commented on HIVE-9340: --- +1 pending on test Address review of HIVE-9257 (ii) Key: HIVE-9340 URL: https://issues.apache.org/jira/browse/HIVE-9340 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Szehon Ho Assignee: Szehon Ho Attachments: HIVE-9257-spark.patch Some minor fixes: 1. Get rid of spark_test.q, which was used to test the sparkCliDriver test fw. 2. Get rid of spark-snapshot repository dep in pom (found by Xuefu) 3. Cleanup ExplainTask to get rid of * in imports. (found by Xuefu) 4. Reorder the scala/spark dependencies in pom to fit the alphabetical order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: new hive udfs
Hi Alex, This should be a good starting point: https://cwiki.apache.org/confluence/display/Hive/HowToContribute. Thanks, Xuefu On Mon, Jan 12, 2015 at 2:37 PM, Alexander Pivovarov apivova...@gmail.com wrote: Hi Everyone I have several custom udfs I want to contribute to hive month_add last_day greatest least What is the process for adding new UDFs? Alex
Re: new hive udfs
No. You can just create JIRA describing your reasoning and attach your patch for review. On Mon, Jan 12, 2015 at 2:53 PM, Alexander Pivovarov apivova...@gmail.com wrote: I mean should I get any approval before creating JIRA? Just want to make sure that these UDFs are needed. On Mon, Jan 12, 2015 at 2:48 PM, Xuefu Zhang xzh...@cloudera.com wrote: Hi Alex, This should be a good starting point: https://cwiki.apache.org/confluence/display/Hive/HowToContribute. Thanks, Xuefu On Mon, Jan 12, 2015 at 2:37 PM, Alexander Pivovarov apivova...@gmail.com wrote: Hi Everyone I have several custom udfs I want to contribute to hive month_add last_day greatest least What is the process for adding new UDFs? Alex
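For readers curious what the proposed UDFs (month_add, last_day, greatest, least) would compute, here is a plain-Java sketch of the core logic behind two of them; the actual contributions would wrap such logic in Hive's UDF/GenericUDF API, which this sketch deliberately omits.

```java
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

// Core logic only; a real Hive UDF would implement GenericUDF and handle
// Hive ObjectInspectors, which is omitted here for brevity.
public class UdfLogic {

    // greatest(a, b, ...): the largest non-null argument, or null if none.
    public static Integer greatest(Integer... values) {
        Integer max = null;
        for (Integer v : values) {
            if (v != null && (max == null || v > max)) {
                max = v;
            }
        }
        return max;
    }

    // last_day(date): the last day of the month the given date falls in.
    public static LocalDate lastDay(LocalDate d) {
        return d.with(TemporalAdjusters.lastDayOfMonth());
    }
}
```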
[jira] [Commented] (HIVE-9178) Create a separate API for remote Spark Context RPC other than job submission [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274678#comment-14274678 ] Xuefu Zhang commented on HIVE-9178: --- The patch looks good to me. [~chengxiang li], could you also take a look? [~brocknoland], I'm wondering why the test hasn't kicked in for this. Create a separate API for remote Spark Context RPC other than job submission [Spark Branch] --- Key: HIVE-9178 URL: https://issues.apache.org/jira/browse/HIVE-9178 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Marcelo Vanzin Attachments: HIVE-9178.1-spark.patch Based on discussions in HIVE-8972, it seems to make sense to create a separate API for RPCs, such as addJar and getExecutorCounter. These jobs are different from a query submission in that they don't need to be queued in the backend and can be executed right away. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9258) Explain query should share the same Spark application with regular queries [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9258: -- Description: Currently for Hive on Spark, query plan includes the number of reducers, which is determined partly by the Spark cluster. Thus, explain query will need to launch a Spark application (Spark remote context), which should be shared with regular queries so that we don't launch additional Spark remote context. (was: Currently for Hive on Spark, query plan includes the number of reducers, which is determined partly by the Spark cluster. Thus, explain query will need to launch a Spark application (Spark remote context), which is costly. To make things worse, the application is discarded right away. Ideally, we shouldn't launch a Spark application even for an explain query.) Explain query should share the same Spark application with regular queries [Spark Branch] - Key: HIVE-9258 URL: https://issues.apache.org/jira/browse/HIVE-9258 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Jimmy Xiang Currently for Hive on Spark, query plan includes the number of reducers, which is determined partly by the Spark cluster. Thus, explain query will need to launch a Spark application (Spark remote context), which should be shared with regular queries so that we don't launch additional Spark remote context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9258) Explain query should share the same Spark application with regular queries [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9258: -- Summary: Explain query should share the same Spark application with regular queries [Spark Branch] (was: Explain query shouldn't launch a Spark application [Spark Branch]) Explain query should share the same Spark application with regular queries [Spark Branch] - Key: HIVE-9258 URL: https://issues.apache.org/jira/browse/HIVE-9258 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Jimmy Xiang Currently for Hive on Spark, query plan includes the number of reducers, which is determined partly by the Spark cluster. Thus, explain query will need to launch a Spark application (Spark remote context), which is costly. To make things worse, the application is discarded right away. Ideally, we shouldn't launch a Spark application even for an explain query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9258) Explain query shouldn't launch a Spark application [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274528#comment-14274528 ] Xuefu Zhang commented on HIVE-9258: --- [~jxiang], thanks for looking into this. Looking at the code, I see that it uses SparkSession instance which is indeed shared with regular queries. Since this is confirmed, please close this as not a problem. BTW, I noticed that we have a local cache for sparkMemoryAndCores as in: {code} if (sparkMemoryAndCores == null) { {code} This would mean that we wouldn't update the value for the entire user session. However, this value can change dynamically. Do you think we should not cache the value? Explain query shouldn't launch a Spark application [Spark Branch] - Key: HIVE-9258 URL: https://issues.apache.org/jira/browse/HIVE-9258 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Jimmy Xiang Currently for Hive on Spark, query plan includes the number of reducers, which is determined partly by the Spark cluster. Thus, explain query will need to launch a Spark application (Spark remote context), which is costly. To make things worse, the application is discarded right away. Ideally, we shouldn't launch a Spark application even for an explain query. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
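The caching question raised in the comment above can be illustrated with a small time-bounded cache: instead of loading the value once per session (the `== null` check), a TTL lets the cached value track a dynamically changing cluster. The class and the TTL are illustrative assumptions, not Hive's actual implementation.

```java
import java.util.function.Supplier;

// Illustrative sketch, not Hive's code: a TTL cache re-fetches the value
// after an expiry window, so a changing cluster is eventually observed
// while most lookups still avoid an expensive remote call.
public class TtlCache<T> {
    private final Supplier<T> loader;
    private final long ttlMillis;
    private T value;
    private long loadedAt = Long.MIN_VALUE;

    public TtlCache(Supplier<T> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    public synchronized T get() {
        long now = System.currentTimeMillis();
        if (value == null || now - loadedAt > ttlMillis) {
            value = loader.get();   // e.g. query the cluster for memory/cores
            loadedAt = now;
        }
        return value;
    }
}
```

With a TTL of zero the cache degenerates to no caching at all, so the two positions in the discussion above are the two ends of one tuning knob.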
[jira] [Commented] (HIVE-9258) Explain query should share the same Spark application with regular queries [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274632#comment-14274632 ] Xuefu Zhang commented on HIVE-9258: --- Makes sense. Thanks for the explanation. Explain query should share the same Spark application with regular queries [Spark Branch] - Key: HIVE-9258 URL: https://issues.apache.org/jira/browse/HIVE-9258 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Jimmy Xiang Currently for Hive on Spark, query plan includes the number of reducers, which is determined partly by the Spark cluster. Thus, explain query will need to launch a Spark application (Spark remote context), which should be shared with regular queries so that we don't launch additional Spark remote context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9135) Cache Map and Reduce works in RSC [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274537#comment-14274537 ] Xuefu Zhang commented on HIVE-9135: --- +1 Cache Map and Reduce works in RSC [Spark Branch] Key: HIVE-9135 URL: https://issues.apache.org/jira/browse/HIVE-9135 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Brock Noland Assignee: Jimmy Xiang Fix For: spark-branch Attachments: HIVE-9135.1-spark.patch, HIVE-9135.1-spark.patch, HIVE-9135.3-spark.patch, HIVE-9135.3.patch, HIVE-9135.4-spark.patch HIVE-9127 works around the fact that we don't cache Map/Reduce works in Spark. However, other input formats such as HiveInputFormat will not benefit from that fix. We should investigate how to allow caching on the RSC while not on tasks (see HIVE-7431). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9135) Cache Map and Reduce works in RSC [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9135: -- Resolution: Fixed Status: Resolved (was: Patch Available) Committed to Spark branch. Thanks, Jimmy! Cache Map and Reduce works in RSC [Spark Branch] Key: HIVE-9135 URL: https://issues.apache.org/jira/browse/HIVE-9135 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Brock Noland Assignee: Jimmy Xiang Fix For: spark-branch Attachments: HIVE-9135.1-spark.patch, HIVE-9135.1-spark.patch, HIVE-9135.3-spark.patch, HIVE-9135.3.patch, HIVE-9135.4-spark.patch HIVE-9127 works around the fact that we don't cache Map/Reduce works in Spark. However, other input formats such as HiveInputFormat will not benefit from that fix. We should investigate how to allow caching on the RSC while not on tasks (see HIVE-7431). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9257) Merge from spark to trunk January 2015
[ https://issues.apache.org/jira/browse/HIVE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273177#comment-14273177 ] Xuefu Zhang commented on HIVE-9257: --- Actually my comments on RB were not covered by HIVE-9335, which already has +1 pending. We may need a separate JIRA to cover them. Merge from spark to trunk January 2015 -- Key: HIVE-9257 URL: https://issues.apache.org/jira/browse/HIVE-9257 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 0.15.0 Reporter: Szehon Ho Assignee: Szehon Ho Labels: TODOC15 Fix For: 0.15.0 Attachments: trunk-mr2-spark-merge.properties The hive on spark work has reached a point where we can merge it into the trunk branch. Note that spark execution engine is optional and no current users should be impacted. This JIRA will be used to track the merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9339) Optimize split grouping for CombineHiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273185#comment-14273185 ] Xuefu Zhang commented on HIVE-9339: --- cc: [~lirui] Optimize split grouping for CombineHiveInputFormat [Spark Branch] - Key: HIVE-9339 URL: https://issues.apache.org/jira/browse/HIVE-9339 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang It seems that split generation, especially in terms of grouping inputs, needs to be improved. For this, we may need cluster information. Because of this, we will first try to solve the problem for Spark. As to cluster information, Spark doesn't provide an API (SPARK-5080). However, Spark does have a listener API, with which the Spark driver can get notifications about executors going up/down, tasks starting/finishing, etc. With this information, the Spark client should be able to maintain a view of the current cluster state. Spark developers mentioned that the listener can only be created after SparkContext is started, at which time some executions may have already started and so the listener will miss some information. This can be fixed. File a JIRA with the Spark project if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9339) Optimize split grouping for CombineHiveInputFormat [Spark Branch]
Xuefu Zhang created HIVE-9339: - Summary: Optimize split grouping for CombineHiveInputFormat [Spark Branch] Key: HIVE-9339 URL: https://issues.apache.org/jira/browse/HIVE-9339 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang It seems that split generation, especially in terms of grouping inputs, needs to be improved. For this, we may need cluster information. Because of this, we will first try to solve the problem for Spark. As to cluster information, Spark doesn't provide an API (SPARK-5080). However, Spark does have a listener API, with which the Spark driver can get notifications about executors going up/down, tasks starting/finishing, etc. With this information, the Spark client should be able to maintain a view of the current cluster state. Spark developers mentioned that the listener can only be created after SparkContext is started, at which time some executions may have already started and so the listener will miss some information. This can be fixed. File a JIRA with the Spark project if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
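The listener-driven cluster view described in this issue can be sketched as follows. The trimmed-down callbacks mirror the shape of Spark's executor up/down notifications, but this class is an illustrative stand-in, not Spark's actual SparkListener API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: maintain a live-executor view from up/down
// notifications. In real code these callbacks would be driven by Spark's
// listener events; here they are plain methods for demonstration.
public class ClusterView {
    private final Set<String> liveExecutors = ConcurrentHashMap.newKeySet();

    public void onExecutorAdded(String executorId) {
        liveExecutors.add(executorId);
    }

    public void onExecutorRemoved(String executorId) {
        liveExecutors.remove(executorId);
    }

    // Split grouping could consult this count to size input groups
    // proportionally to the available cluster capacity.
    public int liveExecutorCount() {
        return liveExecutors.size();
    }
}
```

The caveat noted in the issue also shows up here: any executor that came up before the view was registered is invisible until some later event mentions it.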
[jira] [Commented] (HIVE-9335) Address review items on HIVE-9257 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273104#comment-14273104 ] Xuefu Zhang commented on HIVE-9335: --- +1 Address review items on HIVE-9257 [Spark Branch] Key: HIVE-9335 URL: https://issues.apache.org/jira/browse/HIVE-9335 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Brock Noland Assignee: Brock Noland Attachments: HIVE-9335.1-spark.patch, HIVE-9335.2-spark.patch I made a pass through HIVE-9257 and found the following issues: {{HashTableSinkOperator.java}} The fields EMPTY_OBJECT_ARRAY and EMPTY_ROW_CONTAINER are no longer constants and should not be in upper case. {{HivePairFlatMapFunction.java}} We share NumberFormat across threads and it's not thread safe. {{KryoSerializer.java}} we eat the stack trace in deserializeJobConf {{SparkMapRecordHandler}} in processRow we should not be using {{StringUtils.stringifyException}} since LOG can handle stack traces. in close: {noformat} // signal new failure to map-reduce LOG.error("Hit error while closing operators - failing tree"); throw new IllegalStateException("Error while closing operators", e); {noformat} Should be: {noformat} String msg = "Error while closing operators: " + e; throw new IllegalStateException(msg, e); {noformat} {{SparkSessionManagerImpl}} - the method {{canReuseSession}} is useless {{GenSparkSkewJoinProcessor}} {noformat} + // keep it as reference in case we need fetch work +//localPlan.getAliasToFetchWork().put(small_alias.toString(), +//new FetchWork(tblDir, tableDescList.get(small_alias))); {noformat} {{GenSparkWorkWalker}} trim ws {{SparkCompiler}} remote init {{SparkEdgeProperty}} trim ws {{CounterStatsPublisher}} eat exception {{Hadoop23Shims}} unused import of {{ResourceBundles}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9119) ZooKeeperHiveLockManager does not use zookeeper in the proper way
[ https://issues.apache.org/jira/browse/HIVE-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9119: -- Resolution: Fixed Fix Version/s: 0.15.0 Status: Resolved (was: Patch Available) Committed to trunk. Thanks, Na. ZooKeeperHiveLockManager does not use zookeeper in the proper way - Key: HIVE-9119 URL: https://issues.apache.org/jira/browse/HIVE-9119 Project: Hive Issue Type: Improvement Components: Locking Affects Versions: 0.13.0, 0.14.0, 0.13.1 Reporter: Na Yang Assignee: Na Yang Fix For: 0.15.0 Attachments: HIVE-9119.1.patch, HIVE-9119.2.patch, HIVE-9119.3.patch, HIVE-9119.4.patch ZooKeeperHiveLockManager does not use ZooKeeper in the proper way. Currently a new ZooKeeper client instance is created for each getlock/releaselock query, which sometimes causes the number of open connections between HiveServer2 and ZooKeeper to exceed the maximum number of connections that the ZooKeeper server allows. To use ZooKeeper as a distributed lock, there is no need to create a new ZooKeeper instance for every getlock attempt. A single ZooKeeper instance could be reused and shared by ZooKeeperHiveLockManagers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
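The essence of the fix, one shared client per process instead of a fresh connection per lock request, can be sketched with a lazily initialized singleton holder. `ExpensiveClient` below is a hypothetical stand-in for the shared CuratorFramework/ZooKeeper client the actual patch introduces.

```java
// Illustrative sketch, not Hive's CuratorFrameworkSingleton: every lock
// manager gets the same client instance, so the connection count stays
// constant no matter how many locks are acquired or released.
public class SharedClientHolder {

    static class ExpensiveClient {          // hypothetical stand-in for a ZK client
        static int instancesCreated = 0;    // tracked for illustration only
        ExpensiveClient() { instancesCreated++; }
    }

    private static volatile ExpensiveClient client;

    // Double-checked locking: cheap read on the hot path, one-time
    // synchronized construction on first use.
    public static ExpensiveClient getClient() {
        if (client == null) {
            synchronized (SharedClientHolder.class) {
                if (client == null) {
                    client = new ExpensiveClient();
                }
            }
        }
        return client;
    }
}
```

The `volatile` is what makes double-checked locking safe on the JVM; without it a thread could observe a partially constructed client.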
[jira] [Updated] (HIVE-9119) ZooKeeperHiveLockManager does not use zookeeper in the proper way
[ https://issues.apache.org/jira/browse/HIVE-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9119: -- Labels: TODOC15 (was: ) ZooKeeperHiveLockManager does not use zookeeper in the proper way - Key: HIVE-9119 URL: https://issues.apache.org/jira/browse/HIVE-9119 Project: Hive Issue Type: Improvement Components: Locking Affects Versions: 0.13.0, 0.14.0, 0.13.1 Reporter: Na Yang Assignee: Na Yang Labels: TODOC15 Fix For: 0.15.0 Attachments: HIVE-9119.1.patch, HIVE-9119.2.patch, HIVE-9119.3.patch, HIVE-9119.4.patch ZooKeeperHiveLockManager does not use ZooKeeper in the proper way. Currently a new ZooKeeper client instance is created for each getlock/releaselock query, which sometimes causes the number of open connections between HiveServer2 and ZooKeeper to exceed the maximum number of connections that the ZooKeeper server allows. To use ZooKeeper as a distributed lock, there is no need to create a new ZooKeeper instance for every getlock attempt. A single ZooKeeper instance could be reused and shared by ZooKeeperHiveLockManagers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9112) Query may generate different results depending on the number of reducers
[ https://issues.apache.org/jira/browse/HIVE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9112: -- Status: Patch Available (was: Open) Query may generate different results depending on the number of reducers Key: HIVE-9112 URL: https://issues.apache.org/jira/browse/HIVE-9112 Project: Hive Issue Type: Bug Reporter: Chao Assignee: Ted Xu Attachments: HIVE-9112.patch Some queries may generate different results depending on the number of reducers, for example, tests like ppd_multi_insert.q, join_nullsafe.q, subquery_in.q, etc. Take subquery_in.q as example, if we add {noformat} set mapred.reduce.tasks=3; {noformat} to this test file, the result will be different (and wrong): {noformat} @@ -903,5 +903,3 @@ where li.l_linenumber = 1 and POSTHOOK: type: QUERY POSTHOOK: Input: default@lineitem A masked pattern was here -108570 8571 -4297 1798 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 29787: HIVE-9257 : Merge spark to trunk January 2015 (Modified files)
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29787/#review67603 --- pom.xml https://reviews.apache.org/r/29787/#comment111661 A followup to get rid of this? ql/src/java/org/apache/hadoop/hive/ql/exec/ExplainTask.java https://reviews.apache.org/r/29787/#comment111662 We should refrain from using * in imports. - Xuefu Zhang On Jan. 9, 2015, 11:55 p.m., Szehon Ho wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29787/ --- (Updated Jan. 9, 2015, 11:55 p.m.) Review request for hive. Bugs: HIVE-9257 https://issues.apache.org/jira/browse/HIVE-9257 Repository: hive-git Description --- As the entire patch is too big, this shows the modified files. These have been cleaned up as part of HIVE-9319, HIVE-9306, HIVE-9305. The new files can be found here: http://svn.apache.org/repos/asf/hive/branches/spark/ or https://github.com/apache/hive/tree/spark under: # data/conf/spark/ # itests/hive-unit/src/test/java/org/apache/hive/jdbc/TestJdbcWithLocalClusterSpark.java # itests/hive-unit/src/test/java/org/apache/hive/jdbc/TestMultiSessionsHS2WithLocalClusterSpark.java # itests/qtest-spark/ # ql/src/java/org/apache/hadoop/hive/ql/exec/SparkHashTableSinkOperator.java # ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ # ql/src/java/org/apache/hadoop/hive/ql/lib/TypeRule.java # ql/src/java/org/apache/hadoop/hive/ql/optimizer/SparkMapJoinProcessor.java # ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/GenSparkSkewJoinProcessor.java # ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkCrossProductCheck.java # ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java # ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/ # ql/src/java/org/apache/hadoop/hive/ql/parse/spark/ # ql/src/java/org/apache/hadoop/hive/ql/plan/SparkBucketMapJoinContext.java # ql/src/java/org/apache/hadoop/hive/ql/plan/SparkEdgeProperty.java # 
ql/src/java/org/apache/hadoop/hive/ql/plan/SparkHashTableSinkDesc.java # ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java # ql/src/java/org/apache/hadoop/hive/ql/stats/CounterStatsAggregatorSpark.java # ql/src/test/org/apache/hadoop/hive/ql/exec/spark/ # ql/src/test/queries/clientpositive/auto_join_stats.q # ql/src/test/queries/clientpositive/auto_join_stats2.q # ql/src/test/queries/clientpositive/bucket_map_join_spark1.q # ql/src/test/queries/clientpositive/bucket_map_join_spark2.q # ql/src/test/queries/clientpositive/bucket_map_join_spark3.q # ql/src/test/queries/clientpositive/bucket_map_join_spark4.q # ql/src/test/queries/clientpositive/multi_insert_mixed.q # ql/src/test/queries/clientpositive/multi_insert_union_src.q # ql/src/test/queries/clientpositive/parallel_join0.q # ql/src/test/queries/clientpositive/parallel_join1.q # ql/src/test/queries/clientpositive/spark_test.q # ql/src/test/queries/clientpositive/udf_example_add.q # ql/src/test/results/clientpositive/auto_join_stats.q.out # ql/src/test/results/clientpositive/auto_join_stats2.q.out # ql/src/test/results/clientpositive/bucket_map_join_spark1.q.out # ql/src/test/results/clientpositive/bucket_map_join_spark2.q.out # ql/src/test/results/clientpositive/bucket_map_join_spark3.q.out # ql/src/test/results/clientpositive/bucket_map_join_spark4.q.out # ql/src/test/results/clientpositive/multi_insert_mixed.q.out # ql/src/test/results/clientpositive/multi_insert_union_src.q.out # ql/src/test/results/clientpositive/parallel_join0.q.out # ql/src/test/results/clientpositive/parallel_join1.q.out # ql/src/test/results/clientpositive/spark/ # ql/src/test/results/clientpositive/spark_test.q.out # ql/src/test/results/clientpositive/udf_example_add.q.out # spark-client/ Cleanup and review of those have been done as part of HIVE-9281 and HIVE-9288. 
Diffs - common/src/java/org/apache/hadoop/hive/common/StatsSetupConst.java cd4beeb common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 8264b16 data/conf/hive-log4j.properties a5b9c9a itests/hive-unit/pom.xml f9f59c9 itests/pom.xml 0a154d6 itests/util/src/main/java/org/apache/hadoop/hive/ql/QTestUtil.java 878202a pom.xml efe5e3a ql/pom.xml 84e912e ql/src/java/org/apache/hadoop/hive/ql/Context.java 0373273 ql/src/java/org/apache/hadoop/hive/ql/Driver.java 8bb6d0f ql/src/java/org/apache/hadoop/hive/ql/HashTableLoaderFactory.java 10ad933 ql/src/java/org/apache/hadoop/hive/ql/exec
[jira] [Updated] (HIVE-9104) windowing.q failed when mapred.reduce.tasks is set to larger than one
[ https://issues.apache.org/jira/browse/HIVE-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9104: -- Resolution: Fixed Fix Version/s: 0.15.0 Status: Resolved (was: Patch Available) Committed to trunk. Thank Chao for the contribution and Harish for the review. windowing.q failed when mapred.reduce.tasks is set to larger than one - Key: HIVE-9104 URL: https://issues.apache.org/jira/browse/HIVE-9104 Project: Hive Issue Type: Sub-task Reporter: Chao Assignee: Chao Fix For: 0.15.0 Attachments: HIVE-9104.2.patch, HIVE-9104.patch Test {{windowing.q}} is actually not enabled in Spark branch - in test configurations it is {{windowing.q.q}}. I just run this test, and query {code} -- 12. testFirstLastWithWhere select p_mfgr,p_name, p_size, rank() over(distribute by p_mfgr sort by p_name) as r, sum(p_size) over (distribute by p_mfgr sort by p_name rows between current row and current row) as s2, first_value(p_size) over w1 as f, last_value(p_size, false) over w1 as l from part where p_mfgr = 'Manufacturer#3' window w1 as (distribute by p_mfgr sort by p_name rows between 2 preceding and 2 following); {code} failed with the following exception: {noformat} java.lang.RuntimeException: Hive Runtime Error while closing operators: null at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.close(SparkReduceRecordHandler.java:446) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.closeRecordProcessor(HiveReduceFunctionResultList.java:58) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:108) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115) at 
org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.NoSuchElementException at java.util.ArrayDeque.getFirst(ArrayDeque.java:318) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFFirstValue$FirstValStreamingFixedWindow.terminate(GenericUDAFFirstValue.java:290) at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.finishPartition(WindowingTableFunction.java:413) at org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:337) at org.apache.hadoop.hive.ql.exec.PTFOperator.closeOp(PTFOperator.java:95) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.close(SparkReduceRecordHandler.java:431) ... 15 more {noformat} We need to find out: - Since which commit this test started failing, and - Why it fails -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9257) Merge from spark to trunk January 2015
[ https://issues.apache.org/jira/browse/HIVE-9257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272801#comment-14272801 ] Xuefu Zhang commented on HIVE-9257: --- I reviewed most of the patches in the Spark branch over the past months, and also produced some. I reviewed the mega patch here, and left a couple of comments on RB. However, these can be addressed as followups. +1 pending on test. Merge from spark to trunk January 2015 -- Key: HIVE-9257 URL: https://issues.apache.org/jira/browse/HIVE-9257 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: 0.15.0 Reporter: Szehon Ho Assignee: Szehon Ho The hive on spark work has reached a point where we can merge it into the trunk branch. Note that spark execution engine is optional and no current users should be impacted. This JIRA will be used to track the merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9335) Address review items on HIVE-9257 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272780#comment-14272780 ] Xuefu Zhang commented on HIVE-9335: --- Patch looks good, except it seems to contain code changes from HIVE-9289, which hasn't been finalized. It's okay to address HIVE-9289 after the merge, but I think we shouldn't include that patch here. Address review items on HIVE-9257 [Spark Branch] Key: HIVE-9335 URL: https://issues.apache.org/jira/browse/HIVE-9335 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Brock Noland Assignee: Brock Noland Attachments: HIVE-9335.1-spark.patch I made a pass through HIVE-9257 and found the following issues: {{HashTableSinkOperator.java}} The fields EMPTY_OBJECT_ARRAY and EMPTY_ROW_CONTAINER are no longer constants and should not be in upper case. {{HivePairFlatMapFunction.java}} We share NumberFormat across threads and it's not thread safe. {{KryoSerializer.java}} we eat the stack trace in deserializeJobConf {{SparkMapRecordHandler}} in processRow we should not be using {{StringUtils.stringifyException}} since LOG can handle stack traces. 
in close: {noformat} // signal new failure to map-reduce LOG.error("Hit error while closing operators - failing tree"); throw new IllegalStateException("Error while closing operators", e); {noformat} Should be: {noformat} String msg = "Error while closing operators: " + e; throw new IllegalStateException(msg, e); {noformat} {{SparkSessionManagerImpl}} - the method {{canReuseSession}} is useless {{GenSparkSkewJoinProcessor}} {noformat} + // keep it as reference in case we need fetch work +//localPlan.getAliasToFetchWork().put(small_alias.toString(), +//new FetchWork(tblDir, tableDescList.get(small_alias))); {noformat} {{GenSparkWorkWalker}} trim ws {{SparkCompiler}} remote init {{SparkEdgeProperty}} trim ws {{CounterStatsPublisher}} eat exception {{Hadoop23Shims}} unused import of {{ResourceBundles}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9104) windowing.q failed when mapred.reduce.tasks is set to larger than one
[ https://issues.apache.org/jira/browse/HIVE-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9104:
--
Component/s: (was: Spark)

windowing.q failed when mapred.reduce.tasks is set to larger than one
-
Key: HIVE-9104
URL: https://issues.apache.org/jira/browse/HIVE-9104
Project: Hive
Issue Type: Sub-task
Reporter: Chao
Assignee: Chao
Fix For: 0.15.0
Attachments: HIVE-9104.2.patch, HIVE-9104.patch

Test {{windowing.q}} is actually not enabled in Spark branch - in test configurations it is {{windowing.q.q}}. I just ran this test, and query
{code}
-- 12. testFirstLastWithWhere
select p_mfgr, p_name, p_size,
rank() over (distribute by p_mfgr sort by p_name) as r,
sum(p_size) over (distribute by p_mfgr sort by p_name rows between current row and current row) as s2,
first_value(p_size) over w1 as f,
last_value(p_size, false) over w1 as l
from part where p_mfgr = 'Manufacturer#3'
window w1 as (distribute by p_mfgr sort by p_name rows between 2 preceding and 2 following);
{code}
failed with the following exception:
{noformat}
java.lang.RuntimeException: Hive Runtime Error while closing operators: null
	at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.close(SparkReduceRecordHandler.java:446)
	at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.closeRecordProcessor(HiveReduceFunctionResultList.java:58)
	at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:108)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
	at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
	at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
	at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException
	at java.util.ArrayDeque.getFirst(ArrayDeque.java:318)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFFirstValue$FirstValStreamingFixedWindow.terminate(GenericUDAFFirstValue.java:290)
	at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.finishPartition(WindowingTableFunction.java:413)
	at org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:337)
	at org.apache.hadoop.hive.ql.exec.PTFOperator.closeOp(PTFOperator.java:95)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
	at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.close(SparkReduceRecordHandler.java:431)
	... 15 more
{noformat}
We need to find out:
- Since which commit this test started failing, and
- Why it fails

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
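For context, the `Caused by` frame points at `ArrayDeque.getFirst(ArrayDeque.java:318)` inside `GenericUDAFFirstValue`. A minimal stdlib sketch of that failure mode (illustrative only; it does not imply what the eventual fix was): `getFirst()` throws `NoSuchElementException` on an empty deque, while `peekFirst()` returns null instead.

```java
import java.util.ArrayDeque;
import java.util.NoSuchElementException;

// Minimal stdlib illustration (not Hive code) of the root cause in the
// trace above: ArrayDeque.getFirst() throws NoSuchElementException on an
// empty deque, whereas peekFirst() returns null instead of throwing.
public class DequeDemo {
    static boolean getFirstThrowsWhenEmpty() {
        ArrayDeque<Integer> deque = new ArrayDeque<>();
        try {
            deque.getFirst(); // throws: the deque has no elements
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ArrayDeque<Integer> deque = new ArrayDeque<>();
        System.out.println(deque.peekFirst());         // null, no exception
        System.out.println(getFirstThrowsWhenEmpty()); // true
    }
}
```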
Re: Review Request 29494: HIVE-9119: ZooKeeperHiveLockManager does not use zookeeper in the proper way
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29494/#review67606 --- Ship it! Ship It! - Xuefu Zhang On Jan. 5, 2015, 6:43 p.m., Na Yang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29494/ --- (Updated Jan. 5, 2015, 6:43 p.m.) Review request for hive, Brock Noland, Szehon Ho, and Xuefu Zhang. Bugs: HIVE-9119 https://issues.apache.org/jira/browse/HIVE-9119 Repository: hive-git Description --- 1. Use singleton ZooKeeper client for ZooKeeperHiveLockManager 2. Use CuratorFramework to manage ZooKeeper client Diffs - common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 2e51518 itests/util/src/main/java/org/apache/hadoop/hive/ql/QTestUtil.java 878202a ql/pom.xml 84e912e ql/src/java/org/apache/hadoop/hive/ql/lockmgr/zookeeper/CuratorFrameworkSingleton.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/lockmgr/zookeeper/ZooKeeperHiveLockManager.java 1334a91 ql/src/test/org/apache/hadoop/hive/ql/lockmgr/zookeeper/TestZookeeperLockManager.java aacb73f Diff: https://reviews.apache.org/r/29494/diff/ Testing --- Thanks, Na Yang
[jira] [Commented] (HIVE-9119) ZooKeeperHiveLockManager does not use zookeeper in the proper way
[ https://issues.apache.org/jira/browse/HIVE-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272806#comment-14272806 ] Xuefu Zhang commented on HIVE-9119: --- +1 patch looks good to me. Thanks for fixing this long haunting issue, and now the zookeeper related code is much cleaner also. ZooKeeperHiveLockManager does not use zookeeper in the proper way - Key: HIVE-9119 URL: https://issues.apache.org/jira/browse/HIVE-9119 Project: Hive Issue Type: Improvement Components: Locking Affects Versions: 0.13.0, 0.14.0, 0.13.1 Reporter: Na Yang Assignee: Na Yang Attachments: HIVE-9119.1.patch, HIVE-9119.2.patch, HIVE-9119.3.patch, HIVE-9119.4.patch ZooKeeperHiveLockManager does not use zookeeper in the proper way. Currently a new zookeeper client instance is created for each getlock/releaselock query which sometimes causes the number of open connections between HiveServer2 and ZooKeeper exceed the max connection number that zookeeper server allows. To use zookeeper as a distributed lock, there is no need to create a new zookeeper instance for every getlock try. A single zookeeper instance could be reused and shared by ZooKeeperHiveLockManagers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
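The fix described above replaces per-request client creation with one shared instance. A hypothetical, stdlib-only sketch of that sharing pattern (the actual HIVE-9119 patch uses a CuratorFramework singleton; the `Client` class here is an illustrative stand-in, not Hive or ZooKeeper code):

```java
// Hypothetical sketch of sharing one client across lock managers instead
// of opening a new connection per getlock/releaselock call. Names are
// illustrative; the real patch manages a CuratorFramework instance.
public class SharedClientSingleton {
    // Stand-in for an expensive ZooKeeper connection.
    static final class Client { }

    private static Client instance;

    // Lazily create one client and reuse it; synchronized so concurrent
    // lock managers cannot race and open two connections.
    static synchronized Client get() {
        if (instance == null) {
            instance = new Client();
        }
        return instance;
    }

    public static void main(String[] args) {
        // Every getlock/releaselock-style caller sees the same instance,
        // so the open-connection count stays constant.
        System.out.println(SharedClientSingleton.get() == SharedClientSingleton.get()); // true
    }
}
```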
[jira] [Commented] (HIVE-9104) windowing.q failed when mapred.reduce.tasks is set to larger than one
[ https://issues.apache.org/jira/browse/HIVE-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272187#comment-14272187 ] Xuefu Zhang commented on HIVE-9104:
---
+1. Code looks reasonable to me. However, it would be great if [~rhbutani] or someone else familiar with this part of the code could take a look.

windowing.q failed when mapred.reduce.tasks is set to larger than one
-
Key: HIVE-9104
URL: https://issues.apache.org/jira/browse/HIVE-9104
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: Chao
Assignee: Chao
Attachments: HIVE-9104.patch

Test {{windowing.q}} is actually not enabled in Spark branch - in test configurations it is {{windowing.q.q}}. I just ran this test, and query
{code}
-- 12. testFirstLastWithWhere
select p_mfgr, p_name, p_size,
rank() over (distribute by p_mfgr sort by p_name) as r,
sum(p_size) over (distribute by p_mfgr sort by p_name rows between current row and current row) as s2,
first_value(p_size) over w1 as f,
last_value(p_size, false) over w1 as l
from part where p_mfgr = 'Manufacturer#3'
window w1 as (distribute by p_mfgr sort by p_name rows between 2 preceding and 2 following);
{code}
failed with the following exception:
{noformat}
java.lang.RuntimeException: Hive Runtime Error while closing operators: null
	at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.close(SparkReduceRecordHandler.java:446)
	at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.closeRecordProcessor(HiveReduceFunctionResultList.java:58)
	at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:108)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
	at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
	at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
	at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException
	at java.util.ArrayDeque.getFirst(ArrayDeque.java:318)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFFirstValue$FirstValStreamingFixedWindow.terminate(GenericUDAFFirstValue.java:290)
	at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.finishPartition(WindowingTableFunction.java:413)
	at org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:337)
	at org.apache.hadoop.hive.ql.exec.PTFOperator.closeOp(PTFOperator.java:95)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
	at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.close(SparkReduceRecordHandler.java:431)
	... 15 more
{noformat}
We need to find out:
- Since which commit this test started failing, and
- Why it fails

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9104) windowing.q failed when mapred.reduce.tasks is set to larger than one
[ https://issues.apache.org/jira/browse/HIVE-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272299#comment-14272299 ] Xuefu Zhang commented on HIVE-9104:
---
[~csun] Could you add a test case in which the same query runs with multiple reducers? It can be in the same .q file.

windowing.q failed when mapred.reduce.tasks is set to larger than one
-
Key: HIVE-9104
URL: https://issues.apache.org/jira/browse/HIVE-9104
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: Chao
Assignee: Chao
Attachments: HIVE-9104.patch

Test {{windowing.q}} is actually not enabled in Spark branch - in test configurations it is {{windowing.q.q}}. I just ran this test, and query
{code}
-- 12. testFirstLastWithWhere
select p_mfgr, p_name, p_size,
rank() over (distribute by p_mfgr sort by p_name) as r,
sum(p_size) over (distribute by p_mfgr sort by p_name rows between current row and current row) as s2,
first_value(p_size) over w1 as f,
last_value(p_size, false) over w1 as l
from part where p_mfgr = 'Manufacturer#3'
window w1 as (distribute by p_mfgr sort by p_name rows between 2 preceding and 2 following);
{code}
failed with the following exception:
{noformat}
java.lang.RuntimeException: Hive Runtime Error while closing operators: null
	at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.close(SparkReduceRecordHandler.java:446)
	at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.closeRecordProcessor(HiveReduceFunctionResultList.java:58)
	at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:108)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
	at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115)
	at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
	at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException
	at java.util.ArrayDeque.getFirst(ArrayDeque.java:318)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFFirstValue$FirstValStreamingFixedWindow.terminate(GenericUDAFFirstValue.java:290)
	at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.finishPartition(WindowingTableFunction.java:413)
	at org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:337)
	at org.apache.hadoop.hive.ql.exec.PTFOperator.closeOp(PTFOperator.java:95)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610)
	at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.close(SparkReduceRecordHandler.java:431)
	... 15 more
{noformat}
We need to find out:
- Since which commit this test started failing, and
- Why it fails

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9251) SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9251: -- Resolution: Fixed Fix Version/s: spark-branch Status: Resolved (was: Patch Available) Committed to spark branch. Thanks, Rui. SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch] --- Key: HIVE-9251 URL: https://issues.apache.org/jira/browse/HIVE-9251 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Rui Li Assignee: Rui Li Fix For: spark-branch Attachments: HIVE-9251.1-spark.patch, HIVE-9251.2-spark.patch, HIVE-9251.3-spark.patch, HIVE-9251.4-spark.patch, HIVE-9251.5-spark.patch, HIVE-9251.6-spark.patch This may hurt performance or even lead to task failures. For example, spark's netty-based shuffle limits the max frame size to be 2G. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
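To make the frame-size concern above concrete: if parallelism is underestimated, each reducer's share of the shuffled bytes can exceed the 2GB netty frame cap. A hypothetical back-of-the-envelope sketch (illustrative arithmetic only, not the actual SetSparkReducerParallelism logic; `minReducers` is an invented helper):

```java
// Hypothetical arithmetic showing why too few reducers is risky: with
// Spark's netty-based shuffle limited to 2GB per frame, each reducer's
// share of the shuffled data must stay below that limit.
public class ReducerEstimate {
    static final long MAX_FRAME = 2L * 1024 * 1024 * 1024; // 2GB netty shuffle limit

    // Minimum reducer count so that totalShuffleBytes / reducers <= maxPerReducer
    // (ceiling division).
    static long minReducers(long totalShuffleBytes, long maxPerReducer) {
        return (totalShuffleBytes + maxPerReducer - 1) / maxPerReducer;
    }

    public static void main(String[] args) {
        long shuffled = 10L * 1024 * 1024 * 1024; // 10GB of shuffle data
        // Fewer than this many reducers would push some partition past 2GB.
        System.out.println(minReducers(shuffled, MAX_FRAME)); // 5
    }
}
```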
[jira] [Commented] (HIVE-9290) Make some test results deterministic
[ https://issues.apache.org/jira/browse/HIVE-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271116#comment-14271116 ] Xuefu Zhang commented on HIVE-9290: --- I was aware of this but knew Rui was immediately working on HIVE-9251 which depends on this issue. Yes, there will be a little time period where the tests would fail, but I think it's okay as long as we are aware. Make some test results deterministic Key: HIVE-9290 URL: https://issues.apache.org/jira/browse/HIVE-9290 Project: Hive Issue Type: Test Reporter: Rui Li Assignee: Rui Li Fix For: spark-branch, 0.15.0 Attachments: HIVE-9290-spark.patch, HIVE-9290.1.patch, HIVE-9290.1.patch {noformat} limit_pushdown.q optimize_nullscan.q ppd_gby_join.q vector_string_concat.q {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HIVE-9290) Make some test results deterministic
[ https://issues.apache.org/jira/browse/HIVE-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271116#comment-14271116 ] Xuefu Zhang edited comment on HIVE-9290 at 1/9/15 3:42 PM: --- I was aware of this but knew Rui was immediately working on HIVE-9251 which depends on this issue. Yes, there would be a little time period where the tests would fail, but I thought it's okay as long as we are aware. Sorry for the inconvenience. was (Author: xuefuz): I was aware of this but knew Rui was immediately working on HIVE-9251 which depends on this issue. Yes, there will be a little time period where the tests would fail, but I think it's okay as long as we are aware. Make some test results deterministic Key: HIVE-9290 URL: https://issues.apache.org/jira/browse/HIVE-9290 Project: Hive Issue Type: Test Reporter: Rui Li Assignee: Rui Li Fix For: spark-branch, 0.15.0 Attachments: HIVE-9290-spark.patch, HIVE-9290.1.patch, HIVE-9290.1.patch {noformat} limit_pushdown.q optimize_nullscan.q ppd_gby_join.q vector_string_concat.q {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9326) BaseProtocol.Error failed to deserialization due to NPE.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9326: -- Resolution: Fixed Fix Version/s: spark-branch Status: Resolved (was: Patch Available) The test failures are known and unrelated. Committed to Spark branch. Thanks, Chengxiang. BaseProtocol.Error failed to deserialization due to NPE.[Spark Branch] -- Key: HIVE-9326 URL: https://issues.apache.org/jira/browse/HIVE-9326 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Fix For: spark-branch Attachments: HIVE-9326.1-spark.patch Throwables.getStackTraceAsString(cause) throw NPE if cause is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9326) BaseProtocol.Error failed to deserialization due to NPE.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271772#comment-14271772 ] Xuefu Zhang commented on HIVE-9326: --- +1 BaseProtocol.Error failed to deserialization due to NPE.[Spark Branch] -- Key: HIVE-9326 URL: https://issues.apache.org/jira/browse/HIVE-9326 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Labels: Spark-M5 Attachments: HIVE-9326.1-spark.patch Throwables.getStackTraceAsString(cause) throw NPE if cause is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9306) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270202#comment-14270202 ] Xuefu Zhang commented on HIVE-9306: --- Test failure above, org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_percentile_approx_23.q, doesn't seem related to the patch. It didn't happen in previous run, and neither in my local run. Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch] --- Key: HIVE-9306 URL: https://issues.apache.org/jira/browse/HIVE-9306 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: HIVE-9306.1-spark.patch, HIVE-9306.2-spark.patch, HIVE-9306.3-spark.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9306) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9306: -- Resolution: Fixed Fix Version/s: spark-branch Status: Resolved (was: Patch Available) Committed to Spark branch. Thanks to Szehon for the review. Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch] --- Key: HIVE-9306 URL: https://issues.apache.org/jira/browse/HIVE-9306 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: spark-branch Attachments: HIVE-9306.1-spark.patch, HIVE-9306.2-spark.patch, HIVE-9306.3-spark.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9290) Make some test results deterministic
[ https://issues.apache.org/jira/browse/HIVE-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9290: -- Resolution: Fixed Fix Version/s: 0.15.0 spark-branch Status: Resolved (was: Patch Available) Committed to trunk and merged to Spark branch. Thanks, Rui. Make some test results deterministic Key: HIVE-9290 URL: https://issues.apache.org/jira/browse/HIVE-9290 Project: Hive Issue Type: Test Reporter: Rui Li Assignee: Rui Li Fix For: spark-branch, 0.15.0 Attachments: HIVE-9290.1.patch, HIVE-9290.1.patch {noformat} limit_pushdown.q optimize_nullscan.q ppd_gby_join.q vector_string_concat.q {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 29733: HIVE-9319 : Cleanup Modified Files [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29733/#review67348 --- Ship it! Ship It! - Xuefu Zhang On Jan. 9, 2015, 12:01 a.m., Szehon Ho wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29733/ --- (Updated Jan. 9, 2015, 12:01 a.m.) Review request for hive and Xuefu Zhang. Repository: hive-git Description --- Note that this limits cleanup to lines of code changed in spark-branch in the merge to trunk, not cleanup of all of the modified files, in order to reduce merge conflicts. Diffs - ql/src/java/org/apache/hadoop/hive/ql/Driver.java fa40082 ql/src/java/org/apache/hadoop/hive/ql/exec/ExplainTask.java b25a639 ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java ee42f4c ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainerSerDe.java abdb6af ql/src/java/org/apache/hadoop/hive/ql/io/HiveKey.java 33aeda4 ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 6f216c9 ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java a6d5c62 ql/src/java/org/apache/hadoop/hive/ql/optimizer/unionproc/UnionProcessor.java fec6822 ql/src/java/org/apache/hadoop/hive/ql/parse/MapReduceCompiler.java 1b6de64 ql/src/java/org/apache/hadoop/hive/ql/plan/BaseWork.java 1efbb12 ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 4582678 ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java 076d2fa shims/common/src/main/java/org/apache/hadoop/hive/shims/HadoopShims.java f1743ae Diff: https://reviews.apache.org/r/29733/diff/ Testing --- Thanks, Szehon Ho
[jira] [Commented] (HIVE-9319) Cleanup Modified Files [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270303#comment-14270303 ] Xuefu Zhang commented on HIVE-9319: --- +1 pending on test Cleanup Modified Files [Spark Branch] - Key: HIVE-9319 URL: https://issues.apache.org/jira/browse/HIVE-9319 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Szehon Ho Assignee: Szehon Ho Priority: Minor Attachments: HIVE-9319-spark.patch Cleanup the code that is modified based on checkstyle/TODO/warnings. It is a follow-up of HIVE-9281 which is for new files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9267) Ensure custom UDF works with Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9267: -- Resolution: Fixed Fix Version/s: spark-branch Status: Resolved (was: Patch Available) Committed to Spark branch. Thanks to Szehon for the review. Ensure custom UDF works with Spark [Spark Branch] - Key: HIVE-9267 URL: https://issues.apache.org/jira/browse/HIVE-9267 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: spark-branch Attachments: HIVE-9267.1-spark.patch Create or add auto qtest if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9293) Cleanup SparkTask getMapWork to skip UnionWork check [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9293: -- Resolution: Fixed Fix Version/s: spark-branch Status: Resolved (was: Patch Available) Committed to Spark branch. Thanks, Chao. Cleanup SparkTask getMapWork to skip UnionWork check [Spark Branch] --- Key: HIVE-9293 URL: https://issues.apache.org/jira/browse/HIVE-9293 Project: Hive Issue Type: Task Components: Spark Affects Versions: spark-branch Reporter: Szehon Ho Assignee: Chao Priority: Minor Fix For: spark-branch Attachments: HIVE-9293.1-spark.patch As we don't have UnionWork anymore, we can simplify the logic to get root mapworks from the SparkWork. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9306) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269409#comment-14269409 ] Xuefu Zhang commented on HIVE-9306:
---
fs_default_name2.q output needs to be updated, which will be consistent with trunk. skewjoinopt5.q failed due to the error below; it shouldn't be related to the changes here. In hive.log:
{code}
2015-01-07 22:05:46,510 INFO [main]: impl.RemoteSparkJobStatus (RemoteSparkJobStatus.java:getSparkJobInfo(143)) - Job hasn't been submitted after 30s. Aborting it.
2015-01-07 22:05:46,511 ERROR [main]: exec.Task (SessionState.java:printError(839)) - Failed to execute spark task, with exception 'java.lang.IllegalStateException(RPC channel is closed.)'
java.lang.IllegalStateException: RPC channel is closed.
	at com.google.common.base.Preconditions.checkState(Preconditions.java:149)
	at org.apache.hive.spark.client.rpc.Rpc.call(Rpc.java:264)
	at org.apache.hive.spark.client.rpc.Rpc.call(Rpc.java:251)
	at org.apache.hive.spark.client.SparkClientImpl$ClientProtocol.cancel(SparkClientImpl.java:375)
	at org.apache.hive.spark.client.SparkClientImpl.cancel(SparkClientImpl.java:159)
	at org.apache.hive.spark.client.JobHandleImpl.cancel(JobHandleImpl.java:59)
	at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobStatus.getSparkJobInfo(RemoteSparkJobStatus.java:144)
	at org.apache.hadoop.hive.ql.exec.spark.status.impl.RemoteSparkJobStatus.getState(RemoteSparkJobStatus.java:75)
	at org.apache.hadoop.hive.ql.exec.spark.status.SparkJobMonitor.startMonitor(SparkJobMonitor.java:72)
	at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:108)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1634)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1393)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1179)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1045)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1035)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:206)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:158)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:369)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:304)
	at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:880)
	at org.apache.hadoop.hive.cli.TestSparkCliDriver.runTest(TestSparkCliDriver.java:234)
	at org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_skewjoinopt5(TestSparkCliDriver.java:206)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at junit.framework.TestCase.runTest(TestCase.java:176)
	at junit.framework.TestCase.runBare(TestCase.java:141)
	at junit.framework.TestResult$1.protect(TestResult.java:122)
	at junit.framework.TestResult.runProtected(TestResult.java:142)
	at junit.framework.TestResult.run(TestResult.java:125)
	at junit.framework.TestCase.run(TestCase.java:129)
	at junit.framework.TestSuite.runTest(TestSuite.java:255)
	at junit.framework.TestSuite.run(TestSuite.java:250)
	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
{code}
This might be transient, but we need to address it if the problem persists.

Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch]
---
Key: HIVE-9306
URL: https://issues.apache.org/jira/browse/HIVE-9306
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
Attachments: HIVE
[jira] [Updated] (HIVE-9301) Potential null dereference in MoveTask#createTargetPath()
[ https://issues.apache.org/jira/browse/HIVE-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9301: -- Resolution: Fixed Fix Version/s: 0.15.0 Status: Resolved (was: Patch Available) Committed to trunk. Thanks, Ted. Potential null dereference in MoveTask#createTargetPath() - Key: HIVE-9301 URL: https://issues.apache.org/jira/browse/HIVE-9301 Project: Hive Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Fix For: 0.15.0 Attachments: HIVE-9301.patch {code} if (mkDirPath != null & !fs.exists(mkDirPath)) { {code} '&&' should be used instead of the single ampersand '&'. If mkDirPath is null, fs.exists() would still be called - resulting in NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
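A small stdlib demonstration of why the fix matters: `&&` short-circuits and never evaluates the right-hand side when the left is false, whereas a single `&` always evaluates both operands. `exists` below is an illustrative stand-in for `fs.exists`, not the Hadoop API:

```java
// Demonstrates the HIVE-9301 failure mode: with a single '&', both operands
// are always evaluated, so the exists-style call runs even on a null path;
// '&&' short-circuits and skips it.
public class ShortCircuitDemo {
    // Stand-in for fs.exists(path); throws NPE on null like the real call would.
    static boolean exists(String path) {
        if (path == null) throw new NullPointerException("path is null");
        return false;
    }

    static boolean needsMkdir(String path) {
        // Correct: '&&' never calls exists(null).
        return path != null && !exists(path);
    }

    public static void main(String[] args) {
        System.out.println(needsMkdir(null));     // false, no NPE
        System.out.println(needsMkdir("/tmp/x")); // true
        try {
            // The buggy single-'&' form evaluates exists(null) and throws.
            String path = null;
            boolean b = (path != null) & !exists(path);
            System.out.println(b);
        } catch (NullPointerException e) {
            System.out.println("NPE from single '&'");
        }
    }
}
```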
[jira] [Commented] (HIVE-9293) Cleanup SparkTask getMapWork to skip UnionWork check [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269352#comment-14269352 ] Xuefu Zhang commented on HIVE-9293: --- +1 Cleanup SparkTask getMapWork to skip UnionWork check [Spark Branch] --- Key: HIVE-9293 URL: https://issues.apache.org/jira/browse/HIVE-9293 Project: Hive Issue Type: Task Components: Spark Affects Versions: spark-branch Reporter: Szehon Ho Assignee: Chao Priority: Minor Attachments: HIVE-9293.1-spark.patch As we don't have UnionWork anymore, we can simplify the logic to get root mapworks from the SparkWork. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9290) Make some test results deterministic
[ https://issues.apache.org/jira/browse/HIVE-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269371#comment-14269371 ] Xuefu Zhang commented on HIVE-9290: --- +1 Make some test results deterministic Key: HIVE-9290 URL: https://issues.apache.org/jira/browse/HIVE-9290 Project: Hive Issue Type: Test Reporter: Rui Li Assignee: Rui Li Attachments: HIVE-9290.1.patch {noformat} limit_pushdown.q optimize_nullscan.q ppd_gby_join.q vector_string_concat.q {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HIVE-9290) Make some test results deterministic
[ https://issues.apache.org/jira/browse/HIVE-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269371#comment-14269371 ] Xuefu Zhang edited comment on HIVE-9290 at 1/8/15 2:37 PM: --- +1 pending on test was (Author: xuefuz): +1 Make some test results deterministic Key: HIVE-9290 URL: https://issues.apache.org/jira/browse/HIVE-9290 Project: Hive Issue Type: Test Reporter: Rui Li Assignee: Rui Li Attachments: HIVE-9290.1.patch {noformat} limit_pushdown.q optimize_nullscan.q ppd_gby_join.q vector_string_concat.q {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HIVE-9112) Query may generate different results depending on the number of reducers
[ https://issues.apache.org/jira/browse/HIVE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang reassigned HIVE-9112: - Assignee: Ted Xu (was: Chao) Hi [~tedxu], I'm assigning this to you for investigation/fix. Thanks. Query may generate different results depending on the number of reducers Key: HIVE-9112 URL: https://issues.apache.org/jira/browse/HIVE-9112 Project: Hive Issue Type: Bug Reporter: Chao Assignee: Ted Xu Some queries may generate different results depending on the number of reducers, for example, tests like ppd_multi_insert.q, join_nullsafe.q, subquery_in.q, etc. Take subquery_in.q as example, if we add {noformat} set mapred.reduce.tasks=3; {noformat} to this test file, the result will be different (and wrong): {noformat} @@ -903,5 +903,3 @@ where li.l_linenumber = 1 and POSTHOOK: type: QUERY POSTHOOK: Input: default@lineitem A masked pattern was here -108570 8571 -4297 1798 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9306) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9306: -- Attachment: HIVE-9306.2-spark.patch Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch] --- Key: HIVE-9306 URL: https://issues.apache.org/jira/browse/HIVE-9306 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: HIVE-9306.1-spark.patch, HIVE-9306.2-spark.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9306) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9306: -- Attachment: HIVE-9306.3-spark.patch Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch] --- Key: HIVE-9306 URL: https://issues.apache.org/jira/browse/HIVE-9306 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: HIVE-9306.1-spark.patch, HIVE-9306.2-spark.patch, HIVE-9306.3-spark.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9306) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9306: -- Attachment: (was: HIVE-9306.3-spark.patch) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch] --- Key: HIVE-9306 URL: https://issues.apache.org/jira/browse/HIVE-9306 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: HIVE-9306.1-spark.patch, HIVE-9306.2-spark.patch, HIVE-9306.3-spark.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9306) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9306: -- Attachment: HIVE-9306.3-spark.patch Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch] --- Key: HIVE-9306 URL: https://issues.apache.org/jira/browse/HIVE-9306 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: HIVE-9306.1-spark.patch, HIVE-9306.2-spark.patch, HIVE-9306.3-spark.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-9219) Investigate differences for auto join tests in explain after merge from trunk [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang resolved HIVE-9219. --- Resolution: Not a Problem Investigate differences for auto join tests in explain after merge from trunk [Spark Branch] Key: HIVE-9219 URL: https://issues.apache.org/jira/browse/HIVE-9219 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Chao {noformat} diff --git a/ql/src/test/results/clientpositive/spark/auto_join14.q.out b/ql/src/test/results/clientpositive/spark/auto_join14.q.out index cbca649..830314e 100644 --- a/ql/src/test/results/clientpositive/spark/auto_join14.q.out +++ b/ql/src/test/results/clientpositive/spark/auto_join14.q.out @@ -38,9 +38,6 @@ STAGE PLANS: predicate: (key 100) (type: boolean) Statistics: Num rows: 166 Data size: 1763 Basic stats: COMPLETE Column stats: NONE Spark HashTable Sink Operator - condition expressions: -0 -1 {value} keys: 0 key (type: string) 1 key (type: string) @@ -62,9 +59,6 @@ STAGE PLANS: Map Join Operator condition map: Inner Join 0 to 1 - condition expressions: -0 {key} -1 {value} keys: 0 key (type: string) 1 key (type: string) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9306) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch]
Xuefu Zhang created HIVE-9306: - Summary: Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch] Key: HIVE-9306 URL: https://issues.apache.org/jira/browse/HIVE-9306 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9306) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9306: -- Status: Patch Available (was: Open) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch] --- Key: HIVE-9306 URL: https://issues.apache.org/jira/browse/HIVE-9306 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: HIVE-9306.1-spark.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9251) SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268841#comment-14268841 ] Xuefu Zhang commented on HIVE-9251: --- It should be okay. Limit is still pushed down in the extra stage introduced by order by. SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch] --- Key: HIVE-9251 URL: https://issues.apache.org/jira/browse/HIVE-9251 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Rui Li Assignee: Rui Li Attachments: HIVE-9251.1-spark.patch, HIVE-9251.2-spark.patch, HIVE-9251.3-spark.patch This may hurt performance or even lead to task failures. For example, spark's netty-based shuffle limits the max frame size to be 2G. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
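The concern above — a reducer count set too small funnels huge shuffles through few reducers, and Spark's netty shuffle caps a single frame at 2 GB — suggests a clamped, data-driven estimate. The sketch below is hypothetical and not the actual `SetSparkReducerParallelism` code; all parameter names are assumptions for illustration.

```java
public class ReducerEstimate {
    // Hypothetical sketch: derive the reducer count from estimated input
    // bytes (ceiling division), clamped by a configured maximum, with a
    // safeguard for unavailable cluster info (reported as -1 per the
    // discussion below in this thread).
    static int estimateReducers(long inputBytes, long bytesPerReducer,
                                int clusterCores, int maxReducers) {
        // Ceiling of inputBytes / bytesPerReducer, capped at maxReducers.
        int byData = (int) Math.min(maxReducers,
                (inputBytes + bytesPerReducer - 1) / bytesPerReducer);
        // If cluster capacity is unknown (-1), fall back to the data-driven
        // estimate instead of using a negative core count.
        int floor = clusterCores > 0 ? clusterCores : 1;
        return Math.max(Math.max(byData, 1), Math.min(floor, maxReducers));
    }

    public static void main(String[] args) {
        // 10 GB at 256 MB per reducer -> 40 reducers, even on an 8-core cluster.
        System.out.println(estimateReducers(10L << 30, 256L << 20, 8, 999)); // 40
    }
}
```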
[jira] [Updated] (HIVE-9306) Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9306: -- Attachment: HIVE-9306.1-spark.patch Let Context.isLocalOnlyExecutionMode() return false if execution engine is Spark [Spark Branch] --- Key: HIVE-9306 URL: https://issues.apache.org/jira/browse/HIVE-9306 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: HIVE-9306.1-spark.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9290) Make some test results deterministic
[ https://issues.apache.org/jira/browse/HIVE-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268835#comment-14268835 ] Xuefu Zhang commented on HIVE-9290: --- It should be okay. Even though an extra stage is introduced, I see the limit is still pushed down in the second stage according to the plan. Make some test results deterministic Key: HIVE-9290 URL: https://issues.apache.org/jira/browse/HIVE-9290 Project: Hive Issue Type: Test Reporter: Rui Li Assignee: Rui Li Attachments: HIVE-9290.1.patch {noformat} limit_pushdown.q optimize_nullscan.q ppd_gby_join.q vector_string_concat.q {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9281) Code cleanup [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268466#comment-14268466 ] Xuefu Zhang commented on HIVE-9281: --- Could you load both versions to RB so that I can just look at the diff between the versions? Code cleanup [Spark Branch] --- Key: HIVE-9281 URL: https://issues.apache.org/jira/browse/HIVE-9281 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Szehon Ho Assignee: Szehon Ho Attachments: HIVE-9281-spark.patch, HIVE-9281.2-spark.patch In preparation for merge, we need to cleanup the codes. This includes removing TODO's, fixing checkstyles, removing commented or unused code, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9104) windowing.q failed when mapred.reduce.tasks is set to larger than one
[ https://issues.apache.org/jira/browse/HIVE-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9104: -- Affects Version/s: (was: spark-branch) windowing.q failed when mapred.reduce.tasks is set to larger than one - Key: HIVE-9104 URL: https://issues.apache.org/jira/browse/HIVE-9104 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chao Assignee: Chao Test {{windowing.q}} is actually not enabled in Spark branch - in test configurations it is {{windowing.q.q}}. I just run this test, and query {code} -- 12. testFirstLastWithWhere select p_mfgr,p_name, p_size, rank() over(distribute by p_mfgr sort by p_name) as r, sum(p_size) over (distribute by p_mfgr sort by p_name rows between current row and current row) as s2, first_value(p_size) over w1 as f, last_value(p_size, false) over w1 as l from part where p_mfgr = 'Manufacturer#3' window w1 as (distribute by p_mfgr sort by p_name rows between 2 preceding and 2 following); {code} failed with the following exception: {noformat} java.lang.RuntimeException: Hive Runtime Error while closing operators: null at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.close(SparkReduceRecordHandler.java:446) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.closeRecordProcessor(HiveReduceFunctionResultList.java:58) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:108) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115) at org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$2.apply(AsyncRDDActions.scala:115) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390) at 
org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1390) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.NoSuchElementException at java.util.ArrayDeque.getFirst(ArrayDeque.java:318) at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFFirstValue$FirstValStreamingFixedWindow.terminate(GenericUDAFFirstValue.java:290) at org.apache.hadoop.hive.ql.udf.ptf.WindowingTableFunction.finishPartition(WindowingTableFunction.java:413) at org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.finishPartition(PTFOperator.java:337) at org.apache.hadoop.hive.ql.exec.PTFOperator.closeOp(PTFOperator.java:95) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:598) at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:610) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.close(SparkReduceRecordHandler.java:431) ... 15 more {noformat} We need to find out: - Since which commit this test started failing, and - Why it fails -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9281) Code cleanup [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268443#comment-14268443 ] Xuefu Zhang commented on HIVE-9281: --- My eyes got sored after going thru the patch, but +1. It's a big, nice cleanup. Code cleanup [Spark Branch] --- Key: HIVE-9281 URL: https://issues.apache.org/jira/browse/HIVE-9281 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Szehon Ho Assignee: Szehon Ho Attachments: HIVE-9281-spark.patch In preparation for merge, we need to cleanup the codes. This includes removing TODO's, fixing checkstyles, removing commented or unused code, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9281) Code cleanup [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268523#comment-14268523 ] Xuefu Zhang commented on HIVE-9281: --- +1 Code cleanup [Spark Branch] --- Key: HIVE-9281 URL: https://issues.apache.org/jira/browse/HIVE-9281 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Szehon Ho Assignee: Szehon Ho Attachments: HIVE-9281-spark.patch, HIVE-9281.2-spark.patch In preparation for merge, we need to cleanup the codes. This includes removing TODO's, fixing checkstyles, removing commented or unused code, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9301) Potential null dereference in MoveTask#createTargetPath()
[ https://issues.apache.org/jira/browse/HIVE-9301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268564#comment-14268564 ] Xuefu Zhang commented on HIVE-9301: --- +1 Potential null dereference in MoveTask#createTargetPath() - Key: HIVE-9301 URL: https://issues.apache.org/jira/browse/HIVE-9301 Project: Hive Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Attachments: HIVE-9301.patch {code} if (mkDirPath != null && !fs.exists(mkDirPath)) { {code} '&&' should be used instead of a single ampersand. If mkDirPath is null, fs.exists() would still be called, resulting in an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
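The fix in HIVE-9301 hinges on Java's short-circuit semantics: `&&` skips its right operand when the left is false, while a single `&` always evaluates both sides. A minimal sketch (the `exists` helper is a hypothetical stand-in for `fs.exists`, not the Hive code):

```java
public class ShortCircuit {
    // Stand-in for fs.exists(path): dereferences its argument, so it would
    // throw a NullPointerException if called with null (hypothetical helper).
    static boolean exists(String path) {
        return path.length() > 0;
    }

    public static void main(String[] args) {
        String mkDirPath = null;
        // Short-circuit '&&': exists() is never called when mkDirPath is null.
        boolean safe = mkDirPath != null && !exists(mkDirPath);
        System.out.println(safe); // false, no NPE
        // A non-short-circuit '&' evaluates both sides and throws an NPE here:
        try {
            boolean unsafe = mkDirPath != null & !exists(mkDirPath);
        } catch (NullPointerException e) {
            System.out.println("NPE with non-short-circuit '&'");
        }
    }
}
```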
[jira] [Commented] (HIVE-9289) TODO : Store user name in session [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268105#comment-14268105 ] Xuefu Zhang commented on HIVE-9289: --- Maybe I don't fully understand the code, but it seems a little concerning that a session is reused purely based on user name. Hive supports multiple sessions for the same user, and we don't want such a session to be reused. I also have a feeling that we don't have any session reuse at all (though we have code here and there for it). If that's the case, we'd rather just get rid of the code. [~chengxiang li], could you comment on this? TODO : Store user name in session [Spark Branch] Key: HIVE-9289 URL: https://issues.apache.org/jira/browse/HIVE-9289 Project: Hive Issue Type: Bug Components: Spark Reporter: Chinna Rao Lalam Assignee: Chinna Rao Lalam Attachments: HIVE-9289.1-spark.patch TODO : this we need to store the session username somewhere else as getUGIForConf never used the conf SparkSessionManagerImpl.java /hive-exec/src/java/org/apache/hadoop/hive/ql/exec/spark/session line 145 Java Task -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Review Request 29655: HIVE-9288 TODO cleanup task1[Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29655/#review67024 --- Ship it! Ship It! - Xuefu Zhang On Jan. 7, 2015, 9:03 a.m., chengxiang li wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29655/ --- (Updated Jan. 7, 2015, 9:03 a.m.) Review request for hive, Szehon Ho and Xuefu Zhang. Bugs: HIVE-9288 https://issues.apache.org/jira/browse/HIVE-9288 Repository: hive-git Description --- clean job status related TODO. Diffs - spark-client/src/main/java/org/apache/hive/spark/client/JobHandle.java fd5daf4 spark-client/src/main/java/org/apache/hive/spark/client/JobHandleImpl.java 6aeb6b7 spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java c873e8a Diff: https://reviews.apache.org/r/29655/diff/ Testing --- Thanks, chengxiang li
[jira] [Commented] (HIVE-9288) TODO cleanup task1.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267729#comment-14267729 ] Xuefu Zhang commented on HIVE-9288: --- +1 TODO cleanup task1.[Spark Branch] - Key: HIVE-9288 URL: https://issues.apache.org/jira/browse/HIVE-9288 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Labels: Spark-M5 Attachments: HIVE-9288.1-spark.patch cleanup TODO for job status related class if available before merge back to trunk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9110) Performance of SELECT COUNT(*) FROM store_sales WHERE ss_item_sk IS NOT NULL [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9110: -- Assignee: Rui Li (was: Chao) Performance of SELECT COUNT(*) FROM store_sales WHERE ss_item_sk IS NOT NULL [Spark Branch] --- Key: HIVE-9110 URL: https://issues.apache.org/jira/browse/HIVE-9110 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Brock Noland Assignee: Rui Li The query {noformat} SELECT COUNT(*) FROM store_sales WHERE ss_item_sk IS NOT NULL {noformat} could benefit from performance enhancements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9251) SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268017#comment-14268017 ] Xuefu Zhang commented on HIVE-9251: --- Besides HIVE-9290, it seems that golden files for limit_pushdown.q and outer_join_ppr.q also need to be updated. +1 for the code change. SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch] --- Key: HIVE-9251 URL: https://issues.apache.org/jira/browse/HIVE-9251 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Rui Li Assignee: Rui Li Attachments: HIVE-9251.1-spark.patch, HIVE-9251.2-spark.patch, HIVE-9251.3-spark.patch This may hurt performance or even lead to task failures. For example, spark's netty-based shuffle limits the max frame size to be 2G. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9288) TODO cleanup task1.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9288: -- Resolution: Fixed Fix Version/s: spark-branch Status: Resolved (was: Patch Available) Committed to Spark branch. Thanks, Chengxiang. TODO cleanup task1.[Spark Branch] - Key: HIVE-9288 URL: https://issues.apache.org/jira/browse/HIVE-9288 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Labels: Spark-M5 Fix For: spark-branch Attachments: HIVE-9288.1-spark.patch cleanup TODO for job status related class if available before merge back to trunk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9288) TODO cleanup task1.[Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9288: -- Status: Patch Available (was: Open) TODO cleanup task1.[Spark Branch] - Key: HIVE-9288 URL: https://issues.apache.org/jira/browse/HIVE-9288 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chengxiang Li Assignee: Chengxiang Li Priority: Minor Labels: Spark-M5 Attachments: HIVE-9288.1-spark.patch cleanup TODO for job status related class if available before merge back to trunk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-6173) Beeline doesn't accept --hiveconf option as Hive CLI does
[ https://issues.apache.org/jira/browse/HIVE-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266711#comment-14266711 ] Xuefu Zhang commented on HIVE-6173: --- Aha, I see. These undocumented properties, such as maxHeight and trimScripts, are not internal, but are unknown to the majority of users (probably due to lack of documentation). While it's nice to have them documented, doing so requires work from the community: not just writing a few words about them, but also ensuring they do what they are supposed to do. Here are some descriptions, without any guarantee: 1. showElapsedTime -- whether to log elapsed time at the command prompt. Default true. 2. maxHeight, maxWidth -- maximum height/width of the output. Default: the height/width of the terminal. 3. timeout -- unused. 4. trimScripts -- whether to trim leading/trailing spaces/tabs in the script. Default true. 5. allowMultiLineCommand -- whether to allow multi-line commands. Default true. Beeline doesn't accept --hiveconf option as Hive CLI does - Key: HIVE-6173 URL: https://issues.apache.org/jira/browse/HIVE-6173 Project: Hive Issue Type: Improvement Components: CLI Affects Versions: 0.10.0, 0.11.0, 0.12.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Labels: TODOC13 Fix For: 0.13.0 Attachments: HIVE-6173.1.patch, HIVE-6173.2.patch, HIVE-6173.patch {code} beeline -u jdbc:hive2:// --hiveconf a=b Usage: java org.apache.hive.cli.beeline.BeeLine {code} Since Beeline is replacing Hive CLI, it should support this command line option as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9267) Ensure custom UDF works with Spark [Spark Branch]
Xuefu Zhang created HIVE-9267: - Summary: Ensure custom UDF works with Spark [Spark Branch] Key: HIVE-9267 URL: https://issues.apache.org/jira/browse/HIVE-9267 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Create or add auto qtest if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9267) Ensure custom UDF works with Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9267: -- Status: Patch Available (was: Open) Ensure custom UDF works with Spark [Spark Branch] - Key: HIVE-9267 URL: https://issues.apache.org/jira/browse/HIVE-9267 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: HIVE-9267.1-spark.patch Create or add auto qtest if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9267) Ensure custom UDF works with Spark [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9267: -- Attachment: HIVE-9267.1-spark.patch Ensure custom UDF works with Spark [Spark Branch] - Key: HIVE-9267 URL: https://issues.apache.org/jira/browse/HIVE-9267 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: HIVE-9267.1-spark.patch Create or add auto qtest if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9251) SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267223#comment-14267223 ] Xuefu Zhang commented on HIVE-9251: --- Hi Rui, for our unit tests, the input size and cluster are both fixed, so it shouldn't matter whether the reducer count is exposed in the plan. As to the question of whether or not to expose it, we briefly discussed this today, and we will try to use the same RSC for explain queries as for query execution. If it can be nicely shared, it seems okay to have the count in the plan. Let me know if I missed anything. SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch] --- Key: HIVE-9251 URL: https://issues.apache.org/jira/browse/HIVE-9251 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Rui Li Assignee: Rui Li Attachments: HIVE-9251.1-spark.patch, HIVE-9251.2-spark.patch This may hurt performance or even lead to task failures. For example, spark's netty-based shuffle limits the max frame size to be 2G. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9251) SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267243#comment-14267243 ] Xuefu Zhang commented on HIVE-9251: --- The patch looks good. One question though: (-1, -1) is returned by the call that gets memory and cores, which makes me wonder what the behavior on the Hive side is in that case. Should we somehow safeguard against this? SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch] --- Key: HIVE-9251 URL: https://issues.apache.org/jira/browse/HIVE-9251 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Rui Li Assignee: Rui Li Attachments: HIVE-9251.1-spark.patch, HIVE-9251.2-spark.patch This may hurt performance or even lead to task failures. For example, spark's netty-based shuffle limits the max frame size to be 2G. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9154) Cache pathToPartitionInfo in context aware record reader
[ https://issues.apache.org/jira/browse/HIVE-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9154: -- Resolution: Fixed Fix Version/s: (was: spark-branch) Status: Resolved (was: Patch Available) Committed to trunk. Thanks, Jimmy. Cache pathToPartitionInfo in context aware record reader Key: HIVE-9154 URL: https://issues.apache.org/jira/browse/HIVE-9154 Project: Hive Issue Type: Bug Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor Fix For: 0.15.0 Attachments: HIVE-9154.1-spark.patch, HIVE-9154.1-spark.patch, HIVE-9154.2.patch, HIVE-9154.3.patch This is similar to HIVE-9127. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9219) Investigate differences for auto join tests in explain after merge from trunk [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267299#comment-14267299 ] Xuefu Zhang commented on HIVE-9219: --- [~csun], anything to be done here? If not, we just close this as not a problem then. Investigate differences for auto join tests in explain after merge from trunk [Spark Branch] Key: HIVE-9219 URL: https://issues.apache.org/jira/browse/HIVE-9219 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Brock Noland Assignee: Chao {noformat} diff --git a/ql/src/test/results/clientpositive/spark/auto_join14.q.out b/ql/src/test/results/clientpositive/spark/auto_join14.q.out index cbca649..830314e 100644 --- a/ql/src/test/results/clientpositive/spark/auto_join14.q.out +++ b/ql/src/test/results/clientpositive/spark/auto_join14.q.out @@ -38,9 +38,6 @@ STAGE PLANS: predicate: (key 100) (type: boolean) Statistics: Num rows: 166 Data size: 1763 Basic stats: COMPLETE Column stats: NONE Spark HashTable Sink Operator - condition expressions: -0 -1 {value} keys: 0 key (type: string) 1 key (type: string) @@ -62,9 +59,6 @@ STAGE PLANS: Map Join Operator condition map: Inner Join 0 to 1 - condition expressions: -0 {key} -1 {value} keys: 0 key (type: string) 1 key (type: string) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9251) SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267293#comment-14267293 ] Xuefu Zhang commented on HIVE-9251: --- I see it in the code now. Patch looks good. I just had one minor comment/question on RB. SetSparkReducerParallelism is likely to set too small number of reducers [Spark Branch] --- Key: HIVE-9251 URL: https://issues.apache.org/jira/browse/HIVE-9251 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Rui Li Assignee: Rui Li Attachments: HIVE-9251.1-spark.patch, HIVE-9251.2-spark.patch This may hurt performance or even lead to task failures. For example, spark's netty-based shuffle limits the max frame size to be 2G. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9243) Static Map in IOContext is not thread safe
[ https://issues.apache.org/jira/browse/HIVE-9243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-9243: -- Resolution: Fixed Fix Version/s: 0.15.0 Status: Resolved (was: Patch Available) Committed to trunk. Thanks, Brock. Static Map in IOContext is not thread safe -- Key: HIVE-9243 URL: https://issues.apache.org/jira/browse/HIVE-9243 Project: Hive Issue Type: Bug Affects Versions: 0.15.0 Reporter: Brock Noland Assignee: Brock Noland Fix For: 0.15.0 Attachments: HIVE-9243.patch, HIVE-9243.patch, HIVE-9243.patch This map can be accessed by multiple threads. We can either map it a {{ConcurrentHashMap}} or synchronize the calls to this class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
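Of the two options the report mentions — switching the static map to a `ConcurrentHashMap`, or synchronizing every access — the first can be sketched as follows. This is an illustrative sketch with hypothetical names, not the actual IOContext code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IOContextSketch {
    // Option 1 from the report: a ConcurrentHashMap makes individual
    // get/put/computeIfAbsent calls thread safe without external locking.
    private static final Map<String, IOContextSketch> CONTEXTS =
            new ConcurrentHashMap<>();

    // One context per input path; computeIfAbsent is atomic, so two threads
    // asking for the same key always observe the same instance.
    static IOContextSketch get(String inputPath) {
        return CONTEXTS.computeIfAbsent(inputPath, k -> new IOContextSketch());
    }

    public static void main(String[] args) throws InterruptedException {
        IOContextSketch[] seen = new IOContextSketch[2];
        Thread t1 = new Thread(() -> seen[0] = get("hdfs://table/part-0"));
        Thread t2 = new Thread(() -> seen[1] = get("hdfs://table/part-0"));
        t1.start(); t2.start(); t1.join(); t2.join();
        System.out.println(seen[0] == seen[1]); // true: both threads share one context
    }
}
```

Note that `ConcurrentHashMap` only makes individual operations atomic; compound check-then-act sequences would still need `computeIfAbsent` or explicit synchronization.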
[jira] [Commented] (HIVE-8578) Investigate test failures related to HIVE-8545 [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-8578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265287#comment-14265287 ] Xuefu Zhang commented on HIVE-8578: --- Hi [~jxiang], what's the latest status of this issue? Investigate test failures related to HIVE-8545 [Spark Branch] - Key: HIVE-8578 URL: https://issues.apache.org/jira/browse/HIVE-8578 Project: Hive Issue Type: Sub-task Components: Spark Reporter: Chao Assignee: Jimmy Xiang In HIVE-8545, there are a few test failures, for instance, {{multi_insert_lateral_view.q}} and {{ppd_multi_insert.q}}. They appear to be happening at random, and not reproducible locally. We need to track down the root cause, and fix in this JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HIVE-9135) Cache Map and Reduce works in RSC [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang resolved HIVE-9135. --- Resolution: Won't Fix I'm closing this because of its limited benefit. We have cached input path info via another JIRA, which further reduced the importance of this one. Cache Map and Reduce works in RSC [Spark Branch] Key: HIVE-9135 URL: https://issues.apache.org/jira/browse/HIVE-9135 Project: Hive Issue Type: Sub-task Components: Spark Affects Versions: spark-branch Reporter: Brock Noland Assignee: Jimmy Xiang Attachments: HIVE-9135.1-spark.patch, HIVE-9135.1-spark.patch HIVE-9127 works around the fact that we don't cache Map/Reduce works in Spark. However, other input formats such as HiveInputFormat will not benefit from that fix. We should investigate how to allow caching on the RSC while not on tasks (see HIVE-7431). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-6173) Beeline doesn't accept --hiveconf option as Hive CLI does
[ https://issues.apache.org/jira/browse/HIVE-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265377#comment-14265377 ] Xuefu Zhang commented on HIVE-6173: --- Re #2: autocommit is still not applicable, even with Hive's update/delete support. Re #4: default means no format, which means values are read as strings. Re #7: Reply A seems better. Beeline doesn't accept --hiveconf option as Hive CLI does - Key: HIVE-6173 URL: https://issues.apache.org/jira/browse/HIVE-6173 Project: Hive Issue Type: Improvement Components: CLI Affects Versions: 0.10.0, 0.11.0, 0.12.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Labels: TODOC13 Fix For: 0.13.0 Attachments: HIVE-6173.1.patch, HIVE-6173.2.patch, HIVE-6173.patch {code} beeline -u jdbc:hive2:// --hiveconf a=b Usage: java org.apache.hive.cli.beeline.BeeLine {code} Since Beeline is replacing Hive CLI, it should support this command line option as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)