[jira] [Updated] (HIVE-7492) Enhance SparkCollector [Spark Branch]

2014-08-18 Thread Brock Noland (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brock Noland updated HIVE-7492:
---

Summary: Enhance SparkCollector [Spark Branch]  (was: Enhance 
SparkCollector)

 Enhance SparkCollector [Spark Branch]
 -

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Fix For: spark-branch

 Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch


 SparkCollector is used to collect the rows generated by HiveMapFunction or 
 HiveReduceFunction. It is currently backed by an ArrayList, and thus has 
 unbounded memory usage. Ideally, the collector should have bounded memory 
 usage and be able to spill to disk when its quota is reached.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-7492) Enhance SparkCollector

2014-08-07 Thread Brock Noland (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brock Noland updated HIVE-7492:
---

   Resolution: Fixed
Fix Version/s: spark-branch
   Status: Resolved  (was: Patch Available)

Thank you very much [~vkorukanti] for your contribution! Would you mind opening 
another jira to allow RowContainer to write to the DFS as opposed to /tmp?

I don't think this work should be done on the Spark branch, and I don't think 
it's urgent. However, since many users have an extremely small /tmp, I don't 
think we should be writing unbounded amounts of data there.

Committed to spark!

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Fix For: spark-branch

 Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch




[jira] [Updated] (HIVE-7492) Enhance SparkCollector

2014-08-05 Thread Venki Korukanti (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venki Korukanti updated HIVE-7492:
--

Attachment: HIVE-7492-1-spark.patch

Attaching a patch.

Instead of processing all records at once and returning an Iterable, the patch 
lazily evaluates input records when the downstream consumer of the returned 
Iterable requests output. The PairFlat(Map/Reduce)Function implementation 
returns a custom Iterable, which in turn returns a custom Iterator. This custom 
Iterator takes an initialized ExecMapper/ExecReducer and the input record 
Iterator. When hasNext() is called on the custom Iterator, it reads record(s) 
from the input Iterator and applies the ExecMapper/ExecReducer function. 
Output records generated by processing one record are stored in a 
HiveKVResultCache, which supports spilling if the number of output records 
exceeds a certain number (currently 512). The next() method of the custom 
Iterator returns the results from the HiveKVResultCache.
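
For readers who want to see the shape of the lazy evaluation described above, 
here is a minimal sketch. It is not the code in the patch: the class name 
LazyResultIterator and the processor parameter are hypothetical, the Function 
stands in for the ExecMapper/ExecReducer call, and a plain in-memory queue 
stands in for HiveKVResultCache, so spilling is not shown.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Queue;
import java.util.function.Function;

// Hypothetical illustration only; none of these names come from the patch.
final class LazyResultIterator<I, O> implements Iterator<O> {
    private final Iterator<I> input;
    // Assumption: stands in for applying ExecMapper/ExecReducer to one record.
    private final Function<I, List<O>> processor;
    // Assumption: stands in for HiveKVResultCache; this queue never spills.
    private final Queue<O> cache = new ArrayDeque<>();

    LazyResultIterator(Iterator<I> input, Function<I, List<O>> processor) {
        this.input = input;
        this.processor = processor;
    }

    @Override
    public boolean hasNext() {
        // Pull input records only when no buffered output is left. One input
        // record may yield zero, one, or many output records.
        while (cache.isEmpty() && input.hasNext()) {
            cache.addAll(processor.apply(input.next()));
        }
        return !cache.isEmpty();
    }

    @Override
    public O next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return cache.remove();
    }
}
```

With this shape, the buffer only ever holds the outputs of the input records 
consumed since the last time it was drained, rather than the outputs of the 
whole partition.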

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Attachments: HIVE-7492-1-spark.patch




[jira] [Updated] (HIVE-7492) Enhance SparkCollector

2014-08-05 Thread Venki Korukanti (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venki Korukanti updated HIVE-7492:
--

Status: Patch Available  (was: Open)

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Attachments: HIVE-7492-1-spark.patch




[jira] [Updated] (HIVE-7492) Enhance SparkCollector

2014-08-05 Thread Venki Korukanti (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venki Korukanti updated HIVE-7492:
--

Attachment: HIVE-7492-2-spark.patch

Attaching patch-2. Addressed review comments.

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Attachments: HIVE-7492-1-spark.patch, HIVE-7492-2-spark.patch




[jira] [Updated] (HIVE-7492) Enhance SparkCollector

2014-08-05 Thread Venki Korukanti (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venki Korukanti updated HIVE-7492:
--

Attachment: (was: HIVE-7492-2-spark.patch)

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Attachments: HIVE-7492-1-spark.patch




[jira] [Updated] (HIVE-7492) Enhance SparkCollector

2014-08-05 Thread Venki Korukanti (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venki Korukanti updated HIVE-7492:
--

Attachment: HIVE-7492.2-spark.patch

 Enhance SparkCollector
 --

 Key: HIVE-7492
 URL: https://issues.apache.org/jira/browse/HIVE-7492
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Reporter: Xuefu Zhang
Assignee: Venki Korukanti
 Attachments: HIVE-7492-1-spark.patch, HIVE-7492.2-spark.patch

