HIVE-224<https://issues.apache.org/jira/browse/HIVE-224>
________________________________
From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Monday, March 02, 2009 1:44 PM
To: hive-user@hadoop.apache.org
Subject: Re: Combine() optimization

Changing to LRU would be simple. Extend LinkedHashMap, construct with the access-order flag true, and implement removeEldestEntry() to check your 'isTooBig' condition. It's likely fewer lines of code than randomly removing 10% of the entries.

A more optimal cache algorithm akin to 2Q or something like that (O(1), multi-queue) takes more time and testing and will have a higher hit rate, but is also more expensive. It would add overhead in the mapper that is only worth it if a lot of cache spillage is expected. More advanced algorithms aren't O(1) and are probably not good fits for this use case.

Is there already a JIRA for this improvement?

On 2/27/09 2:22 PM, "Joydeep Sen Sarma" <jssa...@facebook.com> wrote:

Yeah - we definitely want to convert it to an MFU-type flush algorithm. If someone wants to take a crack at it before we can get to it - that would be awesome.

________________________________
From: Namit Jain [mailto:nj...@facebook.com]
Sent: Friday, February 27, 2009 1:59 PM
To: hive-user@hadoop.apache.org
Subject: RE: Combine() optimization

It dumps 10% of the hash table randomly today.

From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Friday, February 27, 2009 1:41 PM
To: hive-user@hadoop.apache.org
Subject: Re: Combine() optimization

Does it dump all contents and start over, or use an LRU or MFU algorithm? LinkedHashMap makes LRUs and similar constructs fairly easy to make. My guess is that most data types have biased value distributions that will take advantage of map-side partial aggregation fairly well.
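[Editor's note: the LinkedHashMap recipe Scott describes can be sketched as below. This is a minimal illustration, not Hive's code; the entry-count cap stands in for Hive's actual memory-based 'isTooBig' check.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache via LinkedHashMap: construct with accessOrder = true so that
// iteration order is least-recently-accessed first, then override
// removeEldestEntry() to evict when the cache is "too big".
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries; // illustrative cap; Hive checks memory, not entry count

    public LruCache(int maxEntries) {
        // initialCapacity = 16, loadFactor = 0.75, accessOrder = true (LRU ordering)
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Returning true evicts the least-recently-accessed entry automatically.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");     // touch "a" so "b" becomes the eldest entry
        cache.put("c", 3);  // exceeds the cap, so "b" is evicted
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

As Scott notes, this is only a few lines beyond a plain LinkedHashMap, since the eviction happens inside put() with no extra bookkeeping.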
On 2/26/09 6:02 PM, "Namit Jain" <nj...@facebook.com> wrote:

Yes, it flushes the data when the hash table is occupying too much memory.

From: Qing Yan [mailto:qing...@gmail.com]
Sent: Thursday, February 26, 2009 5:58 PM
To: hive-user@hadoop.apache.org
Subject: Re: Combine() optimization

Got it. Does map-side aggregation have any special requirements about the dataset? E.g., the number of unique GROUP BY keys could be too big to hold in memory. Will it still work?

On Fri, Feb 27, 2009 at 5:50 AM, Zheng Shao <zsh...@gmail.com> wrote:

Hi Qing,

We did think about the Combiner when we started Hive. However, earlier discussions led us to believe that hash-based aggregation inside the mapper would be as competitive as using the Combiner in most cases.

To enable map-side aggregation, we just need to do the following before running the Hive query:

set hive.map.aggr=true;

Zheng

On Thu, Feb 26, 2009 at 6:03 AM, Raghu Murthy <ra...@facebook.com> wrote:

Right now Hive does not exploit the combiner. But hash-based map-side aggregation in Hive (controlled by hints) provides a similar optimization. Using the combiner in addition to map-side aggregation should improve performance even more if the combiner can further aggregate the partial aggregates generated by the mapper.

On 2/26/09 5:57 AM, "Qing Yan" <qing...@gmail.com> wrote:

> Is there any way/plan for Hive to take advantage of M/R's combine()
> phase? There could be either rules embedded in the query optimizer or hints
> passed by the user...
> GROUP BY should benefit from this a lot.
>
> Any comment?
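[Editor's note: the behavior discussed in this thread - aggregate in an in-mapper hash table and, when it grows too big, dump a random 10% of entries as partial aggregates - can be sketched as below. Class and method names and the entry-count threshold are illustrative assumptions; this is a toy, not Hive's implementation.]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Toy map-side partial aggregator: counts per key in a hash table and,
// when the table crosses a size threshold, flushes a random 10% of the
// entries downstream (here, to the `emitted` list) as partial aggregates.
public class PartialAggregator {
    private final Map<String, Long> counts = new HashMap<>();
    private final List<Map.Entry<String, Long>> emitted = new ArrayList<>();
    private final int maxEntries; // illustrative cap; Hive's check is memory-based
    private final Random rand = new Random();

    public PartialAggregator(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    // Aggregate one row; flush if the table has grown past the threshold.
    public void add(String key) {
        counts.merge(key, 1L, Long::sum);
        if (counts.size() > maxEntries) {
            flushRandomTenPercent();
        }
    }

    // Remove a random 10% of keys (at least one) and emit their partial counts.
    private void flushRandomTenPercent() {
        List<String> keys = new ArrayList<>(counts.keySet());
        Collections.shuffle(keys, rand);
        int toDrop = Math.max(1, keys.size() / 10);
        for (String k : keys.subList(0, toDrop)) {
            emitted.add(Map.entry(k, counts.remove(k)));
        }
    }

    public int tableSize() { return counts.size(); }

    // Every counted row is either still in the table or in an emitted partial.
    public long totalAggregated() {
        long inTable = counts.values().stream().mapToLong(Long::longValue).sum();
        long flushed = emitted.stream().mapToLong(Map.Entry::getValue).sum();
        return inTable + flushed;
    }
}
```

This also shows why the thread's question matters: a random flush can evict a hot key mid-stream, whereas the LRU policy Scott proposes would preferentially evict keys that have not been seen recently. Either way the reducer still sees correct totals, since the flushed entries are partial aggregates, not dropped data.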