[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089544#comment-14089544 ]
Venki Korukanti commented on HIVE-7492:
---------------------------------------

Hi [~brocknoland], I was about to create a JIRA for the same, but have the following questions:
* how cleanup works in case the task exits abnormally
* where to create these tmp files on DFS

Currently RowContainer is used in the join operator (in mainline Hive, not just the Spark branch), so it can create temp files as part of the Reduce task if the output exceeds the in-memory block size. In the case of MapReduce tasks, the MR framework overrides the default tmp dir location with a location under the JVM working directory (see [here|https://github.com/apache/hadoop-common/blob/0d1861b3eaf134e085055a8888e0b51c3ba7921b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/MapReduceChildJVM.java#L207]) via the JVM argument java.io.tmpdir, and the JVM working directory is deleted by the framework whenever the JVM exits or the job is killed. As RowContainer temp files are also created under this temp dir using [java.io.File.createTempFile|http://docs.oracle.com/javase/7/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String,%20java.io.File)], they will also get cleaned up.

I was looking at the Spark code. Spark provides an API, org.apache.spark.util.Utils.createTempDir(), which also adds a shutdown hook to delete the tmp dir when the JVM exits. Should we use the same API and provide it to RowContainer? (A rough sketch of the idea follows below.) It will still be on the local FS.

> Enhance SparkCollector
> ----------------------
>
> Key: HIVE-7492
> URL: https://issues.apache.org/jira/browse/HIVE-7492
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Xuefu Zhang
> Assignee: Venki Korukanti
> Fix For: spark-branch
>
> Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch
>
> SparkCollector is used to collect the rows generated by HiveMapFunction or
> HiveReduceFunction. It is currently backed by an ArrayList, and thus has
> unbounded memory usage. Ideally, the collector should have bounded memory
> usage and be able to spill to disk when its quota is reached.
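A minimal sketch of the shutdown-hook approach mentioned in the comment, written directly in Java as an illustration of what such a helper would do (rather than calling Spark's Utils.createTempDir() itself). The class name TempDirUtil is hypothetical, not existing Hive or Spark code:

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Illustrative only: mirrors what a createTempDir()-style helper with a JVM
// shutdown hook would do. TempDirUtil is a hypothetical name, not Hive/Spark code.
public final class TempDirUtil {

  private TempDirUtil() {
  }

  // Create a scratch directory under java.io.tmpdir and register a shutdown
  // hook so it is removed when the JVM exits (normal exit or SIGTERM).
  // Note: a hard kill (SIGKILL) skips shutdown hooks, so a truly abnormal
  // exit can still leak files.
  public static File createTempDir(String prefix) throws IOException {
    final File dir = Files.createTempDirectory(prefix).toFile();
    Runtime.getRuntime().addShutdownHook(new Thread() {
      @Override
      public void run() {
        deleteRecursively(dir);
      }
    });
    return dir;
  }

  // Best-effort recursive delete; failures during shutdown are ignored.
  private static void deleteRecursively(File f) {
    File[] children = f.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteRecursively(child);
      }
    }
    f.delete();
  }
}
{code}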
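And a rough sketch of the bounded collector described in the issue above: buffer rows in memory up to a quota, then push overflow to a disk-backed store such as RowContainer. BoundedCollector and the SpillTarget interface are hypothetical placeholders for the idea, not the attached patch:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a collector with a bounded in-memory buffer that hands
// overflow rows to a disk-backed store (e.g. RowContainer). BoundedCollector
// and SpillTarget are hypothetical placeholders, not the attached patch.
public class BoundedCollector<ROW> {

  /** Hypothetical abstraction over a disk-backed row store. */
  public interface SpillTarget<T> {
    void add(T row);
  }

  private final int memoryQuota;          // max rows kept in memory
  private final List<ROW> buffer;         // in-memory portion
  private final SpillTarget<ROW> spill;   // disk-backed overflow

  public BoundedCollector(int memoryQuota, SpillTarget<ROW> spill) {
    this.memoryQuota = memoryQuota;
    this.buffer = new ArrayList<ROW>(memoryQuota);
    this.spill = spill;
  }

  public void collect(ROW row) {
    if (buffer.size() < memoryQuota) {
      buffer.add(row);   // under quota: keep the row in memory
    } else {
      spill.add(row);    // over quota: push the row to disk
    }
  }
}
{code}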