[ 
https://issues.apache.org/jira/browse/HIVE-135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Hammerbacher updated HIVE-135:
-----------------------------------

    Component/s: Query Processor

Adding to "Query Processor" component.

> need more accurate way of tracking memory consumption on map side aggregates
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-135
>                 URL: https://issues.apache.org/jira/browse/HIVE-135
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>
> from email thread:
> Just trying it out - I am confused by one thing:
>  
> hive> set hive.map.aggr=true;
> set hive.map.aggr=true;
> hive> explain from mytable u insert overwrite directory 
> '/user/jssarma/tmp_agg' select u.a, avg(size(u.b)) group by u.a;
> Everything looks good. Now I submit this query, and this is what I see on 
> the tracker:
> Map input records 87,912,961 0 87,912,961 
> Map output records 87,912,960 0 87,912,960
> This doesn't make sense. With map-side aggregates, we should be getting a 
> vastly reduced number of rows emitted from the mapper.
> I am wondering whether we should rethink our flushing logic. The freeMemory() 
> call is not reliable, since it doesn't account for memory that the GC has not 
> yet reclaimed. Perhaps we should switch to an explicit setting for the amount 
> of memory for hash tables (we know the size of each hash table entry and the 
> overall size, so we should be able to guess reasonably). From what Dhruba 
> reported, there's no way to call the garbage collector and wait for it to 
> complete (to get a more accurate report of free memory), so the whole route 
> of obtaining free memory seems a little hosed.
> By way of comparison, Hadoop also estimates memory usage in sorting. There, 
> the sort run is just stored in a sequential stream, and Hadoop takes the size 
> of the stream and compares it to the max allowed sort memory usage (which is 
> a configuration option).
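
The explicit-budget idea in the description above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not Hive's
actual code: the class name BudgetedHashAggregator, the maxBytes budget, and
the fixed bytesPerEntry estimate are all hypothetical. The table keeps a
partial (sum, count) per group key, which is what avg() needs, and flushes
whenever a new entry would push the estimated size past the budget, without
consulting freeMemory() at all.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of map-side partial aggregation with an explicit memory budget.
 * All names and sizes are hypothetical; this is not Hive's implementation.
 */
class BudgetedHashAggregator {
    private final long maxBytes;      // explicit budget (assumed config setting)
    private final long bytesPerEntry; // caller's estimate of one table entry
    private final Map<String, long[]> table = new HashMap<>(); // key -> {sum, count}
    private final List<String> emitted = new ArrayList<>();    // stands in for mapper output

    BudgetedHashAggregator(long maxBytes, long bytesPerEntry) {
        this.maxBytes = maxBytes;
        this.bytesPerEntry = bytesPerEntry;
    }

    /** Folds one input row into the partial aggregate for its key. */
    void add(String key, long value) {
        long[] acc = table.get(key);
        if (acc == null) {
            // Flush first if one more entry would exceed the estimated budget.
            if ((table.size() + 1) * bytesPerEntry > maxBytes) {
                flush();
            }
            acc = new long[2];
            table.put(key, acc);
        }
        acc[0] += value; // running sum
        acc[1] += 1;     // running count
    }

    /** Emits every partial aggregate as "key<TAB>sum<TAB>count" and empties the table. */
    void flush() {
        for (Map.Entry<String, long[]> e : table.entrySet()) {
            emitted.add(e.getKey() + "\t" + e.getValue()[0] + "\t" + e.getValue()[1]);
        }
        table.clear();
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        // Budget of two entries at an assumed 64 bytes each.
        BudgetedHashAggregator agg = new BudgetedHashAggregator(128L, 64L);
        String[] keys = {"a", "a", "b", "a", "c"};
        long[] values = {1, 3, 5, 2, 7};
        for (int i = 0; i < keys.length; i++) {
            agg.add(keys[i], values[i]);
        }
        agg.flush(); // end of input: emit whatever is still buffered
        for (String row : agg.emitted()) {
            System.out.println(row);
        }
    }
}
```

A real operator would need a per-entry estimate that accounts for variable
key sizes; the fixed bytesPerEntry here is the simplest stand-in. The flush
decision is deterministic and independent of GC timing, which is the same
property the Hadoop sort buffer gets from comparing stream size to a
configured limit.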

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
