[jira] [Commented] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089400#comment-14089400 ]

Brock Noland commented on HIVE-7492:
---

+1

Enhance SparkCollector
--

Key: HIVE-7492
URL: https://issues.apache.org/jira/browse/HIVE-7492
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch

SparkCollector is used to collect the rows generated by HiveMapFunction or HiveReduceFunction. It is currently backed by an ArrayList, and thus has unbounded memory usage. Ideally, the collector should have bounded memory usage and be able to spill to disk when its quota is reached.

--
This message was sent by Atlassian JIRA (v6.2#6252)
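The issue description calls for a collector with bounded memory that spills to disk once its quota is reached. A minimal sketch of that idea, assuming a simplified String row type and illustrative class and method names (this is not Hive's actual SparkCollector or RowContainer API):

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Hypothetical bounded collector: buffers rows in memory up to a fixed
// quota, then spills the buffer to a temp file on the local FS.
public class BoundedCollector {
    private final int quota;                            // max rows held in memory
    private final List<String> buffer = new ArrayList<>();
    private final List<File> spillFiles = new ArrayList<>();

    public BoundedCollector(int quota) { this.quota = quota; }

    public void collect(String row) {
        buffer.add(row);
        if (buffer.size() >= quota) {
            spill();
        }
    }

    private void spill() {
        try {
            File f = File.createTempFile("collector-spill", ".tmp");
            f.deleteOnExit();                           // best-effort cleanup on JVM exit
            try (PrintWriter w = new PrintWriter(new FileWriter(f))) {
                for (String row : buffer) {
                    w.println(row);
                }
            }
            spillFiles.add(f);
            buffer.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back all rows: spilled files first, then the in-memory tail,
    // preserving collection order.
    public List<String> rows() {
        List<String> out = new ArrayList<>();
        try {
            for (File f : spillFiles) {
                try (BufferedReader r = new BufferedReader(new FileReader(f))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        out.add(line);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.addAll(buffer);
        return out;
    }

    public int spillCount() { return spillFiles.size(); }
}
```

With a quota of 2, collecting three rows triggers one spill while preserving row order on read-back; memory use stays bounded by the quota regardless of input size.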
[jira] [Commented] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089544#comment-14089544 ]

Venki Korukanti commented on HIVE-7492:
---

Hi [~brocknoland], I was about to create a JIRA for the same, but I have the following questions:
* How does cleanup work if the task exits abnormally?
* Where should these tmp files be created on DFS?

Currently RowContainer is used in the join operator (mainline Hive, not just the Spark branch), so it can create temp files as part of the reduce task if the output exceeds the in-memory block size. In the case of MapReduce tasks, the MR framework overrides the default tmp dir location with a location under the JVM working directory (see [here|https://github.com/apache/hadoop-common/blob/0d1861b3eaf134e085055ae0b51c3ba7921b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/MapReduceChildJVM.java#L207]) via the JVM arg java.io.tmpdir, and the working directory of the JVM is deleted by the framework whenever the JVM exits or the job is killed. As RowContainer temp files are also created under this temp dir using [java.io.File.createTempFile|http://docs.oracle.com/javase/7/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String,%20java.io.File)], they also get cleaned up.

I was looking at the Spark code. Spark provides an API, org.apache.spark.util.Utils.createTempDir(), which also adds a shutdown hook to delete the tmp dir when the JVM exits. Should we use the same API and provide it to RowContainer? It will still be on the local FS.
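The cleanup scheme discussed above can be sketched as follows: create a temp directory on the local FS and register a JVM shutdown hook that deletes it recursively. This is an illustrative stand-in written against the JDK only, similar in spirit to org.apache.spark.util.Utils.createTempDir() but not that API; the class and method names are assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Illustrative helper: a temp dir whose contents are deleted when the JVM
// exits, via a shutdown hook (covers normal exit and job kill via SIGTERM,
// but not a hard kill of the process).
public class TempDirs {
    public static File createTempDirWithCleanup(String prefix) {
        try {
            final File dir = Files.createTempDirectory(prefix).toFile();
            Runtime.getRuntime().addShutdownHook(new Thread(() -> deleteRecursively(dir)));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes children before the directory itself; File.delete() fails on
    // non-empty directories, so recursion order matters.
    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        f.delete();
    }
}
```

A RowContainer-style spiller could then be pointed at createTempDirWithCleanup("rowcontainer-") instead of the default java.io.tmpdir, so abnormal JVM exit still triggers the hook-based cleanup.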
[jira] [Commented] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089839#comment-14089839 ]

Chao commented on HIVE-7492:
---

Hi [~vkorukanti] and [~brocknoland], after applying the patch, running the following simple query:

{code}
select key, sum(value) from src group by key
{code}

produces no result. However, it works before this patch. Can you take a look? Thanks!
[jira] [Commented] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089871#comment-14089871 ]

Venki Korukanti commented on HIVE-7492:
---

The GroupBy operator adds output records to the output collector after the ExecMapper/ExecReducer has been closed. We need to check whether {{lastRecordOutput}} still has records after closing the ExecMapper/ExecReducer. I will log a JIRA and attach a patch.
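The failure mode described above (rows emitted during operator close() being dropped) can be illustrated with a small sketch. This is not the actual patch; the class and method names are hypothetical, modeled only on the {{lastRecordOutput}} buffer named in the comment:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: a GroupBy-style processor flushes its final
// aggregates during close(), so rows can land in the output buffer
// *after* the last input row was processed. The task must drain the
// buffer after close() rather than assume it is empty.
public class SparkTaskOutput {
    private final Deque<String> lastRecordOutput = new ArrayDeque<>();

    // Inputs are buffered inside operators; nothing is emitted per row here.
    void processRow(String row) { /* aggregate internally */ }

    // close() emits the final group-by aggregates into the buffer.
    void close() {
        lastRecordOutput.add("aggregated-row");
    }

    // The fix: after closing, drain whatever close() emitted.
    List<String> drainAfterClose() {
        List<String> out = new ArrayList<>();
        while (!lastRecordOutput.isEmpty()) {
            out.add(lastRecordOutput.poll());
        }
        return out;
    }
}
```

Skipping the drain step reproduces the symptom from the earlier comment: the group-by query runs but "produces no result" because its only output rows appear after close().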
[jira] [Commented] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086783#comment-14086783 ]

Venki Korukanti commented on HIVE-7492:
---

RB link: https://reviews.apache.org/r/24342/
[jira] [Commented] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087246#comment-14087246 ]

Hive QA commented on HIVE-7492:
---

{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12660024/HIVE-7492.2-spark.patch

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 5828 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample_islocalmode_hook
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_context_ngrams
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_fs_default_name2
org.apache.hive.jdbc.miniHS2.TestHiveServer2.testConnection
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/16/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/16/console
Test logs: http://ec2-54-176-176-199.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-SPARK-Build-16/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12660024