[jira] [Updated] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

2017-01-20 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-15580:
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   Status: Resolved  (was: Patch Available)

Committed to master. Thanks, Chao!

> Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
> -
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 2.2.0
>
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, 
> HIVE-15580.4.patch, HIVE-15580.5.patch, HIVE-15580.patch
>
>
> Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded
> memory. For orderBy, Hive accumulates key groups using ArrayList (described
> in HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator,
> which cannot spill to disk within a key group. Thus, for a large key group,
> memory usage is also unbounded.
>
> This change is likely to impact performance. We will profile and optimize
> afterwards. We could also make the change configurable.
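
The memory concern in the description can be illustrated with a minimal Java sketch. It is not taken from the HIVE-15580 patch and uses no Hive or Spark API: the class KeyGroupSketch and its methods are hypothetical names, and the input is assumed to be an iterator already sorted by key, with non-null keys. The first method buffers every value of a key group in an ArrayList before aggregating, so its memory grows with the size of the largest group. The second keeps only a running total per group.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class KeyGroupSketch {

    // Hypothetical helper: builds one (key, value) pair.
    static Map.Entry<String, Long> entry(String key, long value) {
        return new SimpleEntry<>(key, value);
    }

    // Buffers each key group in a list before summing it.
    // Memory use grows with the size of the largest key group.
    static List<Map.Entry<String, Long>> sumBuffered(Iterator<Map.Entry<String, Long>> sorted) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String currentKey = null;
        List<Long> group = new ArrayList<>();
        while (sorted.hasNext()) {
            Map.Entry<String, Long> e = sorted.next();
            if (currentKey != null && !currentKey.equals(e.getKey())) {
                out.add(entry(currentKey, sum(group)));
                group.clear();
            }
            currentKey = e.getKey();
            group.add(e.getValue());
        }
        if (currentKey != null) {
            out.add(entry(currentKey, sum(group)));
        }
        return out;
    }

    // Consumes the same sorted input but keeps only a running total,
    // so per-group memory stays constant regardless of group size.
    static List<Map.Entry<String, Long>> sumStreaming(Iterator<Map.Entry<String, Long>> sorted) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String currentKey = null;
        long running = 0L;
        while (sorted.hasNext()) {
            Map.Entry<String, Long> e = sorted.next();
            if (currentKey != null && !currentKey.equals(e.getKey())) {
                out.add(entry(currentKey, running));
                running = 0L;
            }
            currentKey = e.getKey();
            running += e.getValue();
        }
        if (currentKey != null) {
            out.add(entry(currentKey, running));
        }
        return out;
    }

    static long sum(List<Long> values) {
        long total = 0L;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> rows =
            Arrays.asList(entry("a", 1), entry("a", 2), entry("b", 5));
        System.out.println(sumStreaming(rows.iterator())); // prints [a=3, b=5]
    }
}
```

Both methods return the same result for sorted input. The streaming form only works when the aggregation can be computed incrementally and the rows of a key arrive contiguously, which is why it depends on a sort (which can spill to disk) happening first.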



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

2017-01-19 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-15580:
---
Description: 
Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded memory.
For orderBy, Hive accumulates key groups using ArrayList (described in
HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator,
which cannot spill to disk within a key group. Thus, for a large key group,
memory usage is also unbounded.

This change is likely to impact performance. We will profile and optimize
afterwards. We could also make the change configurable.

  was: Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded
memory. For orderBy, Hive accumulates key groups using ArrayList (described in
HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator,
which cannot spill to disk within a key group. Thus, for a large key group,
memory usage is also unbounded.


> Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
> -
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, 
> HIVE-15580.4.patch, HIVE-15580.5.patch, HIVE-15580.patch
>
>
> Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded
> memory. For orderBy, Hive accumulates key groups using ArrayList (described
> in HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator,
> which cannot spill to disk within a key group. Thus, for a large key group,
> memory usage is also unbounded.
>
> This change is likely to impact performance. We will profile and optimize
> afterwards. We could also make the change configurable.





[jira] [Updated] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

2017-01-19 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-15580:
---
Description: Currently, orderBy (sortBy) and groupBy in Hive on Spark use
unbounded memory. For orderBy, Hive accumulates key groups using ArrayList
(described in HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey
operator, which cannot spill to disk within a key group. Thus, for a large
key group, memory usage is also unbounded.

> Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
> -
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, 
> HIVE-15580.4.patch, HIVE-15580.5.patch, HIVE-15580.patch
>
>
> Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded
> memory. For orderBy, Hive accumulates key groups using ArrayList (described
> in HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator,
> which cannot spill to disk within a key group. Thus, for a large key group,
> memory usage is also unbounded.





[jira] [Updated] (HIVE-15580) Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark

2017-01-19 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated HIVE-15580:
---
Summary: Eliminate unbounded memory usage for orderBy and groupBy in Hive 
on Spark  (was: Replace Spark's groupByKey operator with something with bounded 
memory)

> Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark
> -
>
> Key: HIVE-15580
> URL: https://issues.apache.org/jira/browse/HIVE-15580
> Project: Hive
>  Issue Type: Improvement
>  Components: Spark
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Attachments: HIVE-15580.1.patch, HIVE-15580.1.patch, 
> HIVE-15580.2.patch, HIVE-15580.2.patch, HIVE-15580.3.patch, 
> HIVE-15580.4.patch, HIVE-15580.5.patch, HIVE-15580.patch
>
>



