[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089544#comment-14089544 ]
Venki Korukanti commented on HIVE-7492:
---------------------------------------

Hi [~brocknoland], I was about to create a JIRA for the same, but have the following questions:
* how cleanup works in case the task exits abnormally
* where to create these tmp files on DFS

Currently RowContainer is used in the join operator (in mainline Hive, not just the Spark branch), so it can create temp files as part of the Reduce task if the output exceeds the in-memory block size. In the case of MapReduce tasks, the MR framework overrides the default tmp dir location with a location under the JVM working directory (see [here|https://github.com/apache/hadoop-common/blob/0d1861b3eaf134e085055a8888e0b51c3ba7921b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/MapReduceChildJVM.java#L207]) via the JVM argument java.io.tmpdir, and the JVM working directory is deleted by the framework whenever the JVM exits or the job is killed. As RowContainer temp files are also created under this temp dir using [java.io.File.createTempFile|http://docs.oracle.com/javase/7/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String,%20java.io.File)], they will also get cleaned up.

I was looking at the Spark code. Spark provides an API, org.apache.spark.util.Utils.createTempDir(), which also adds a shutdown hook to delete the tmp dir when the JVM exits. Should we use the same API and provide it to RowContainer? (A rough sketch of the idea follows below.) It will still be on the local FS.

> Enhance SparkCollector
> ----------------------
>
> Key: HIVE-7492
> URL: https://issues.apache.org/jira/browse/HIVE-7492
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Xuefu Zhang
> Assignee: Venki Korukanti
> Fix For: spark-branch
>
> Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch
>
> SparkCollector is used to collect the rows generated by HiveMapFunction or
> HiveReduceFunction. It is currently backed by an ArrayList, and thus has
> unbounded memory usage. Ideally, the collector should have bounded memory
> usage and be able to spill to disk when its quota is reached.
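A minimal sketch of the shutdown-hook approach mentioned in the comment, written directly in Java as an illustration of what such a helper would do (rather than calling Spark's Utils.createTempDir() itself). The class name TempDirUtil is hypothetical, not existing Hive or Spark code:

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Illustrative only: mirrors what a createTempDir()-style helper with a JVM
// shutdown hook would do. TempDirUtil is a hypothetical name, not Hive/Spark code.
public final class TempDirUtil {

  private TempDirUtil() {
  }

  // Create a scratch directory under java.io.tmpdir and register a shutdown
  // hook so it is removed when the JVM exits (normal exit or SIGTERM).
  // Note: a hard kill (SIGKILL) skips shutdown hooks, so a truly abnormal
  // exit can still leak files.
  public static File createTempDir(String prefix) throws IOException {
    final File dir = Files.createTempDirectory(prefix).toFile();
    Runtime.getRuntime().addShutdownHook(new Thread() {
      @Override
      public void run() {
        deleteRecursively(dir);
      }
    });
    return dir;
  }

  // Best-effort recursive delete; failures during shutdown are ignored.
  private static void deleteRecursively(File f) {
    File[] children = f.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteRecursively(child);
      }
    }
    f.delete();
  }
}
{code}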
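And a rough sketch of the bounded collector described in the issue above: buffer rows in memory up to a quota, then push overflow to a disk-backed store such as RowContainer. BoundedCollector and the SpillTarget interface are hypothetical placeholders for the idea, not the attached patch:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a collector with a bounded in-memory buffer that hands
// overflow rows to a disk-backed store (e.g. RowContainer). BoundedCollector
// and SpillTarget are hypothetical placeholders, not the attached patch.
public class BoundedCollector<ROW> {

  /** Hypothetical abstraction over a disk-backed row store. */
  public interface SpillTarget<T> {
    void add(T row);
  }

  private final int memoryQuota;          // max rows kept in memory
  private final List<ROW> buffer;         // in-memory portion
  private final SpillTarget<ROW> spill;   // disk-backed overflow

  public BoundedCollector(int memoryQuota, SpillTarget<ROW> spill) {
    this.memoryQuota = memoryQuota;
    this.buffer = new ArrayList<ROW>(memoryQuota);
    this.spill = spill;
  }

  public void collect(ROW row) {
    if (buffer.size() < memoryQuota) {
      buffer.add(row);   // under quota: keep the row in memory
    } else {
      spill.add(row);    // over quota: push the row to disk
    }
  }
}
{code}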