[jira] [Updated] (HIVE-7492) Enhance SparkCollector [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brock Noland updated HIVE-7492:
-------------------------------
    Summary: Enhance SparkCollector [Spark Branch]  (was: Enhance SparkCollector)

Enhance SparkCollector [Spark Branch]
-------------------------------------

            Key: HIVE-7492
            URL: https://issues.apache.org/jira/browse/HIVE-7492
        Project: Hive
     Issue Type: Sub-task
     Components: Spark
       Reporter: Xuefu Zhang
       Assignee: Venki Korukanti
        Fix For: spark-branch
    Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch

SparkCollector is used to collect the rows generated by HiveMapFunction or HiveReduceFunction. It is currently backed by an ArrayList, and thus has unbounded memory usage. Ideally, the collector should have bounded memory usage and be able to spill to disk when its quota is reached.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
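The bounded, spill-to-disk collector that the issue description asks for could be sketched roughly as follows. This is only an illustration under stated assumptions, not the SparkCollector code or the attached patch: the class name `SpillingCollector`, the row-count quota, the use of `String` rows, and the temp-file spill format are all hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical bounded collector; an illustration only, not Hive's implementation. */
final class SpillingCollector implements AutoCloseable {
    private final int maxInMemory;                  // assumed quota, counted in rows
    private final List<String> memory = new ArrayList<>();
    private File spillFile;                         // created lazily on the first spill
    private DataOutputStream spillOut;
    private int spilledRows;

    SpillingCollector(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    /** Buffers a row in memory and spills the whole buffer once the quota is reached. */
    void collect(String row) throws IOException {
        memory.add(row);
        if (memory.size() >= maxInMemory) {
            spill();
        }
    }

    private void spill() throws IOException {
        if (spillOut == null) {
            spillFile = File.createTempFile("collector-spill", ".bin");
            spillFile.deleteOnExit();
            spillOut = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(spillFile)));
        }
        for (String row : memory) {
            spillOut.writeUTF(row);                 // writeUTF caps each row at 65535 encoded bytes
        }
        spilledRows += memory.size();
        memory.clear();
    }

    int rowsInMemory() { return memory.size(); }

    int rowsOnDisk() { return spilledRows; }

    /** Streams rows in insertion order: spilled rows first, then the in-memory tail. */
    void forEachRow(Consumer<String> sink) throws IOException {
        if (spillOut != null) {
            spillOut.flush();
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(spillFile)))) {
                for (int i = 0; i < spilledRows; i++) {
                    sink.accept(in.readUTF());
                }
            }
        }
        memory.forEach(sink);
    }

    @Override
    public void close() throws IOException {
        if (spillOut != null) {
            spillOut.close();
            spillFile.delete();
        }
    }
}
```

With a quota of 3, collecting eight rows leaves six rows on disk and two in memory, and `forEachRow` still yields them in insertion order. A real collector would more likely bound memory by bytes rather than row count.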
[jira] [Updated] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brock Noland updated HIVE-7492:
-------------------------------
       Resolution: Fixed
    Fix Version/s: spark-branch
           Status: Resolved  (was: Patch Available)

Thank you very much [~vkorukanti] for your contribution!

Would you mind opening another jira to allow RowContainer to write to the DFS as opposed to /tmp? I don't think this work should be done on the Spark branch and I don't think it's urgent. However, since many users have extremely small /tmp I don't think we should be writing unbounded amounts of data there.

Committed to spark!
[jira] [Updated] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venki Korukanti updated HIVE-7492:
----------------------------------
    Attachment: HIVE-7492-1-spark.patch

Attaching a patch.

Instead of processing all records at once and returning an Iterable, the patch lazily evaluates input records when the downstream consumer of the returned Iterable requests output.

The PairFlat(Map/Reduce)Function implementation returns a custom Iterable, which in turn returns a custom Iterator. The custom Iterator takes an initialized ExecMapper/ExecReducer and the input record Iterator. When hasNext() is called on the custom Iterator, it reads record(s) from the input Iterator and applies the ExecMapper/ExecReducer function. The output records generated by processing one record are stored in a HiveKVResultCache, which supports spilling when the number of output records exceeds a threshold (currently 512). The next() method of the custom Iterator returns the results from the HiveKVResultCache.
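The lazy evaluation described in the patch comment above can be sketched in plain Java. This is a minimal illustration, not the code in the attached patch: the class name `LazyResultIterator` is hypothetical, a `Function` returning a `List` stands in for ExecMapper/ExecReducer, and an in-memory `ArrayDeque` stands in for HiveKVResultCache (so there is no spilling here).

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.function.Function;

/** Hypothetical stand-in for the custom Iterator described above; not Hive's class. */
final class LazyResultIterator<I, O> implements Iterator<O> {
    private final Iterator<I> input;
    // Stand-in for ExecMapper/ExecReducer: one input record yields zero or more outputs.
    private final Function<I, List<O>> processor;
    // Stand-in for HiveKVResultCache; in-memory only, and it rejects null outputs.
    private final Queue<O> pending = new ArrayDeque<>();

    LazyResultIterator(Iterator<I> input, Function<I, List<O>> processor) {
        this.input = input;
        this.processor = processor;
    }

    @Override
    public boolean hasNext() {
        // Read input records only until at least one output record is available.
        while (pending.isEmpty() && input.hasNext()) {
            pending.addAll(processor.apply(input.next()));
        }
        return !pending.isEmpty();
    }

    @Override
    public O next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return pending.remove();
    }
}
```

Because input is consumed only inside hasNext(), at most the outputs of one productive input record are buffered at a time, which is what keeps the memory footprint independent of the total input size.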
[jira] [Updated] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venki Korukanti updated HIVE-7492:
----------------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venki Korukanti updated HIVE-7492:
----------------------------------
    Attachment: HIVE-7492-2-spark.patch

Attaching patch-2. Addressed review comments.
[jira] [Updated] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venki Korukanti updated HIVE-7492:
----------------------------------
    Attachment:     (was: HIVE-7492-2-spark.patch)
[jira] [Updated] (HIVE-7492) Enhance SparkCollector
[ https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venki Korukanti updated HIVE-7492:
----------------------------------
    Attachment: HIVE-7492.2-spark.patch