[ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836840#action_12836840 ]
John Sichi commented on HIVE-535:
---------------------------------

Apache Wicket has one: http://wicket.apache.org/docs/1.4/org/apache/wicket/util/collections/IntHashMap.html

Maybe we can clone the code from there, since efforts to put something similar into Apache Commons haven't gone anywhere so far. A minimal sketch of such a primitive-keyed map, plus sketches of ideas A1 and A3 from the description, follow below the quote.

> Memory-efficient hash-based Aggregation
> ---------------------------------------
>
>                 Key: HIVE-535
>                 URL: https://issues.apache.org/jira/browse/HIVE-535
>             Project: Hadoop Hive
>          Issue Type: Improvement
>    Affects Versions: 0.4.0
>            Reporter: Zheng Shao
>
> Currently there is a lot of memory overhead in the hash-based aggregation in GroupByOperator.
> The net result is that GroupByOperator cannot store many entries in its HashTable, flushes frequently, and therefore achieves poor partial aggregation.
> Here are some initial thoughts (some of them are from Joydeep a long time ago):
> A1. Serialize the key of the HashTable. This eliminates the 16-byte per-object overhead of Java for the keys (depending on how many objects the key contains, the saving can be substantial).
> A2. Use a more memory-efficient hash table: java.util.HashMap has about 64 bytes of overhead per entry.
> A3. Use primitive arrays to store aggregation results. Basically, the UDAF should manage the array of aggregation results, so UDAFCount should manage a long[], and UDAFAvg should manage a double[] and a long[]. External code then passes an index to iterate/merge/terminate an aggregation result. This also eliminates the 16-byte per-object overhead of Java.
> More ideas are welcome.
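For concreteness, here is a minimal sketch (not Wicket's actual code, and not proposed Hive code) of the kind of primitive-keyed, open-addressing map that IntHashMap represents: keys and values live in flat arrays, so there is no Entry object and no boxed key, which is where java.util.HashMap's roughly 64 bytes per entry go. The class name, the long-key/int-value choice, and the EMPTY sentinel are all assumptions for illustration.

{code:java}
// A minimal sketch of an open-addressing hash map with primitive long keys
// and int values. Keys and values live in flat arrays, so there is no Entry
// object and no boxed Long/Integer per entry.
public final class LongIntOpenHashMap {
  // Assumption: Long.MIN_VALUE never occurs as a real key.
  private static final long EMPTY = Long.MIN_VALUE;
  private long[] keys;
  private int[] values;
  private int size;

  public LongIntOpenHashMap(int expectedSize) {
    int cap = Integer.highestOneBit(Math.max(expectedSize, 16) * 2);
    keys = new long[cap];
    values = new int[cap];
    java.util.Arrays.fill(keys, EMPTY);
  }

  public void put(long key, int value) {
    if (size * 2 >= keys.length) {  // keep load factor under 0.5
      resize();
    }
    int mask = keys.length - 1;
    int i = hash(key) & mask;
    while (keys[i] != EMPTY && keys[i] != key) {
      i = (i + 1) & mask;           // linear probing
    }
    if (keys[i] == EMPTY) {
      size++;
    }
    keys[i] = key;
    values[i] = value;
  }

  /** Returns the value for key, or defaultValue if the key is absent. */
  public int get(long key, int defaultValue) {
    int mask = keys.length - 1;
    int i = hash(key) & mask;
    while (keys[i] != EMPTY) {
      if (keys[i] == key) {
        return values[i];
      }
      i = (i + 1) & mask;
    }
    return defaultValue;
  }

  private void resize() {
    long[] oldKeys = keys;
    int[] oldValues = values;
    keys = new long[oldKeys.length * 2];
    values = new int[keys.length];
    java.util.Arrays.fill(keys, EMPTY);
    size = 0;
    for (int i = 0; i < oldKeys.length; i++) {
      if (oldKeys[i] != EMPTY) {
        put(oldKeys[i], oldValues[i]);
      }
    }
  }

  private static int hash(long key) {
    long h = key * 0x9E3779B97F4A7C15L;  // multiplicative hash for spread
    return (int) (h ^ (h >>> 32));
  }
}
{code}

Linear probing keeps lookups cache-friendly and avoids per-entry allocation entirely; the trade-off is that one key value must be reserved as the empty sentinel.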
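Idea A1 could look roughly like this: serialize the key columns into a single byte[] once and cache the hash code, so the table holds one small object per key instead of one object per key column. SerializedKey is a hypothetical name for illustration, not an existing Hive class.

{code:java}
import java.util.Arrays;

// Hypothetical wrapper for a grouping key serialized into one byte[].
// One array plus one cached int replaces a list of per-column objects,
// each of which would carry its own ~16-byte header.
public final class SerializedKey {
  private final byte[] bytes;  // key columns serialized back-to-back
  private final int hash;      // computed once at construction

  public SerializedKey(byte[] bytes) {
    this.bytes = bytes;
    this.hash = Arrays.hashCode(bytes);
  }

  @Override
  public int hashCode() {
    return hash;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof SerializedKey
        && Arrays.equals(bytes, ((SerializedKey) o).bytes);
  }
}
{code}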
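And a rough sketch of idea A3, using COUNT as the example: the evaluator owns one long[] covering all groups, and callers address a particular aggregation by integer index. The interface below is invented to illustrate the indexing scheme; it is not the real UDAF API.

{code:java}
// Sketch of a count aggregator that stores every group's state in one
// primitive array. Callers hold an int slot index instead of one
// aggregation object per group.
public final class ArrayBackedCount {
  private long[] counts = new long[1024];
  private int next;

  /** Allocates a slot for a new group and returns its index. */
  public int newAggregation() {
    if (next == counts.length) {
      counts = java.util.Arrays.copyOf(counts, counts.length * 2);
    }
    return next++;
  }

  /** iterate: one more row seen for the group at this index. */
  public void iterate(int index) {
    counts[index]++;
  }

  /** merge: fold a partial count into the group at this index. */
  public void merge(int index, long partial) {
    counts[index] += partial;
  }

  /** terminate: the final count for the group at this index. */
  public long terminate(int index) {
    return counts[index];
  }
}
{code}

With this layout a million groups cost a single 8 MB array rather than a million separate aggregation objects, each with its own header.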