[jira] [Commented] (HIVE-7492) Enhance SparkCollector

2014-08-07 Thread Brock Noland (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089400#comment-14089400 ]

Brock Noland commented on HIVE-7492:


+1

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch


 SparkCollector is used to collect the rows generated by HiveMapFunction or 
 HiveReduceFunction. It is currently backed by an ArrayList and thus has 
 unbounded memory usage. Ideally, the collector should have bounded memory 
 usage and be able to spill to disk when its quota is reached.
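 A minimal sketch of the kind of bounded, spilling collector described above; the class and field names here are hypothetical illustrations of the idea, not the SparkCollector/RowContainer code in the attached patches:

{code}
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Keeps up to 'quota' rows in memory; overflow rows go to a local spill file. */
public class BoundedSpillingCollector {
  private final int quota;
  private final List<String> buffer = new ArrayList<String>();
  private final File spillFile;
  private BufferedWriter spillWriter;

  public BoundedSpillingCollector(int quota) throws IOException {
    this.quota = quota;
    this.spillFile = File.createTempFile("collector-spill", ".tmp");
  }

  public void collect(String row) throws IOException {
    if (buffer.size() < quota) {
      buffer.add(row);            // stay in memory while under the quota
      return;
    }
    if (spillWriter == null) {    // quota reached: start spilling to disk
      spillWriter = new BufferedWriter(new FileWriter(spillFile));
    }
    spillWriter.write(row);
    spillWriter.newLine();
  }

  public void close() throws IOException {
    if (spillWriter != null) {
      spillWriter.close();
    }
  }
}
{code}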





[jira] [Commented] (HIVE-7492) Enhance SparkCollector

2014-08-07 Thread Venki Korukanti (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089544#comment-14089544 ]

Venki Korukanti commented on HIVE-7492:
---

Hi [~brocknoland], 

I was about to create a JIRA for the same, but have the following questions:
* How does cleanup work if the task exits abnormally?
* Where should these tmp files be created on DFS?

Currently, RowContainer is used in the join operator (in mainline Hive, not just the Spark 
branch), and it can create temp files as part of the Reduce task if the output 
exceeds the in-memory block size. In the case of MapReduce tasks, the MR framework overrides 
the default tmp dir location with a location under the JVM working directory (see 
[here|https://github.com/apache/hadoop-common/blob/0d1861b3eaf134e085055ae0b51c3ba7921b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/MapReduceChildJVM.java#L207])
 using the JVM arg java.io.tmpdir, and the framework deletes the JVM's working directory 
whenever the JVM exits or the job is killed. As RowContainer temp 
files are also created under this temp dir using 
[java.io.File.createTempFile|http://docs.oracle.com/javase/7/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String,%20java.io.File)],
 they will also get cleaned up.
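
To illustrate why that cleanup falls out for free, here is a minimal standalone sketch (not the actual RowContainer code) of creating a spill file under java.io.tmpdir with File.createTempFile, which is exactly the directory the MR framework removes together with the task JVM's working directory:

{code}
import java.io.File;
import java.io.IOException;

public class TmpSpillFileExample {
  public static void main(String[] args) throws IOException {
    // The MR child JVM is launched with -Djava.io.tmpdir pointing under the
    // task's working directory (see the MapReduceChildJVM link above).
    File tmpDir = new File(System.getProperty("java.io.tmpdir"));

    // A RowContainer-style spill file created here is deleted along with the
    // working directory when the task JVM exits or the job is killed.
    File spill = File.createTempFile("hive-rowcontainer", ".tmp", tmpDir);
    System.out.println("Spill file: " + spill.getAbsolutePath());
  }
}
{code}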

I was looking at the Spark code. Spark provides an API, 
org.apache.spark.util.Utils.createTempDir(), which also adds a shutdown hook to 
delete the tmp dir when the JVM exits. Should we use the same API and hand that 
directory to RowContainer? It would still be on the local FS.
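
For reference, the behaviour that proposal relies on can be sketched in plain Java roughly as follows; this is an approximation of what a shutdown-hook-managed temp dir gives us, not Spark's actual Utils implementation:

{code}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class ShutdownHookTempDir {
  /** Local temp dir registered for deletion when the JVM shuts down. */
  public static File createTempDirWithCleanup() throws IOException {
    final File dir = Files.createTempDirectory("hive-spark-rowcontainer").toFile();
    Runtime.getRuntime().addShutdownHook(new Thread() {
      @Override
      public void run() {
        deleteRecursively(dir);
      }
    });
    return dir;
  }

  private static void deleteRecursively(File f) {
    File[] children = f.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteRecursively(child);
      }
    }
    f.delete();
  }
}
{code}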

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Fix For: spark-branch

 Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch


 SparkCollector is used to collect the rows generated by HiveMapFunction or 
 HiveReduceFunction. It is currently backed by an ArrayList and thus has 
 unbounded memory usage. Ideally, the collector should have bounded memory 
 usage and be able to spill to disk when its quota is reached.





[jira] [Commented] (HIVE-7492) Enhance SparkCollector

2014-08-07 Thread Chao (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089839#comment-14089839 ]

Chao commented on HIVE-7492:


Hi [~vkorukanti] and [~brocknoland], after applying the patch, running the 
following simple query:

{code}
select key, sum(value) from src group by key
{code}

produces no result, although it worked before this patch.
Could you take a look? Thanks!

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Fix For: spark-branch

 Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch


 SparkCollector is used to collect the rows generated by HiveMapFunction or 
 HiveReduceFunction. It is currently backed by an ArrayList and thus has 
 unbounded memory usage. Ideally, the collector should have bounded memory 
 usage and be able to spill to disk when its quota is reached.





[jira] [Commented] (HIVE-7492) Enhance SparkCollector

2014-08-07 Thread Venki Korukanti (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089871#comment-14089871 ]

Venki Korukanti commented on HIVE-7492:
---

The GroupBy operator adds output records to the output collector after the 
ExecMapper/ExecReducer is closed, so we need to check whether {{lastRecordOutput}} 
still has any records after closing the ExecMapper/ExecReducer. I will log a JIRA 
and attach a patch.
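
A small, self-contained sketch of that drain step; everything here except the {{lastRecordOutput}} name is a hypothetical illustration, not the actual HiveMapFunction/HiveReduceFunction code:

{code}
import java.util.ArrayDeque;
import java.util.Deque;

public class DrainAfterCloseSketch {
  // Buffer that receives rows the operator tree emits while it is being closed.
  static final Deque<String> lastRecordOutput = new ArrayDeque<String>();

  static void closeOperatorTree() {
    // On close, the GroupBy operator flushes its final aggregation rows.
    lastRecordOutput.add("final-groupby-row-1");
    lastRecordOutput.add("final-groupby-row-2");
  }

  public static void main(String[] args) {
    closeOperatorTree();

    // Without this drain, rows emitted during close are silently dropped,
    // which matches the empty result seen for the group-by query above.
    while (!lastRecordOutput.isEmpty()) {
      System.out.println("collect: " + lastRecordOutput.poll());
    }
  }
}
{code}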

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Fix For: spark-branch

 Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch


 SparkCollector is used to collect the rows generated by HiveMapFunction or 
 HiveReduceFunction. It is currently backed by an ArrayList and thus has 
 unbounded memory usage. Ideally, the collector should have bounded memory 
 usage and be able to spill to disk when its quota is reached.





[jira] [Commented] (HIVE-7492) Enhance SparkCollector

2014-08-05 Thread Venki Korukanti (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086783#comment-14086783 ]

Venki Korukanti commented on HIVE-7492:
---

RB link: https://reviews.apache.org/r/24342/

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Attachments: HIVE-7492-1-spark.patch


 SparkCollector is used to collect the rows generated by HiveMapFunction or 
 HiveReduceFunction. It is currently backed by an ArrayList and thus has 
 unbounded memory usage. Ideally, the collector should have bounded memory 
 usage and be able to spill to disk when its quota is reached.





[jira] [Commented] (HIVE-7492) Enhance SparkCollector

2014-08-05 Thread Hive QA (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087246#comment-14087246 ]

Hive QA commented on HIVE-7492:
---



{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12660024/HIVE-7492.2-spark.patch

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 5828 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample_islocalmode_hook
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_context_ngrams
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_fs_default_name2
org.apache.hive.jdbc.miniHS2.TestHiveServer2.testConnection
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/16/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/16/console
Test logs: 
http://ec2-54-176-176-199.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-SPARK-Build-16/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12660024

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch


 SparkCollector is used to collect the rows generated by HiveMapFunction or 
 HiveReduceFunction. It is currently backed by an ArrayList and thus has 
 unbounded memory usage. Ideally, the collector should have bounded memory 
 usage and be able to spill to disk when its quota is reached.


