[
https://issues.apache.org/jira/browse/TEZ-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054238#comment-14054238
]
Rohini Palaniswamy commented on TEZ-1260:
-----------------------------------------
https://wiki.apache.org/pig/PigHashBasedAggInMap - In simpler terms: for a
group-by with a combine plan, instead of writing each K,V to the output
collector, we keep adding them to a hashmap. When the map size hits a limit we
aggregate, and if the size still does not reduce, we write the contents of the
map to the output collector, which does the merge, spill to disk, etc. If we
had the option to write out K,List<V>, we could collect entries in the hashmap
as K,List<V> and write them out when we reach memory limits, both for group-by
(even without a combiner plan) and for join. Since one level of grouping is
already done in the hashmap, OnFileSortedOutput would have less sorting to do.
If the hashmap could be integrated into OnFileSortedOutput itself, or a wrapper
output could do that, it would make things easy for Pig and Hive. But
generalizing it might need more thought, as we do a lot of memory calculations
based on tuple size (APIs on Tuple) to decide when to spill.
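
As a rough illustration of the K,List<V> idea above, here is a minimal Java
sketch of a wrapper that groups values per key in a hashmap and flushes them
through a list-accepting write method once a limit is reached. Everything named
here is hypothetical: ListValueWriter is only a stand-in carrying the two write
signatures from this issue (it is not the actual Tez interface), and
GroupingBuffer and RecordingWriter exist purely for the example. The limit is a
plain count of buffered values, which is an assumption made to keep the sketch
short; Pig's real spill decision is based on tuple memory estimates.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HashGroupingSketch {

    /** Hypothetical stand-in with the two signatures discussed in this issue. */
    interface ListValueWriter {
        void write(Object key, Object value) throws IOException;
        void write(Object key, Iterable<Object> values) throws IOException;
    }

    /** Collects K,List<V> in a map and writes the groups out when a limit is hit. */
    static final class GroupingBuffer {
        private final Map<Object, List<Object>> groups = new LinkedHashMap<>();
        private final ListValueWriter out;
        // Assumption: a simple value count replaces tuple-size-based accounting.
        private final int maxBufferedValues;
        private int bufferedValues = 0;

        GroupingBuffer(ListValueWriter out, int maxBufferedValues) {
            this.out = out;
            this.maxBufferedValues = maxBufferedValues;
        }

        void add(Object key, Object value) throws IOException {
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            bufferedValues++;
            if (bufferedValues >= maxBufferedValues) {
                flush();
            }
        }

        /** Writes each key once with all of its buffered values, then resets. */
        void flush() throws IOException {
            for (Map.Entry<Object, List<Object>> e : groups.entrySet()) {
                out.write(e.getKey(), e.getValue()); // resolves to the Iterable overload
            }
            groups.clear();
            bufferedValues = 0;
        }
    }

    /** In-memory writer used only to make the example observable. */
    static final class RecordingWriter implements ListValueWriter {
        final List<String> records = new ArrayList<>();

        @Override
        public void write(Object key, Object value) {
            records.add(key + "=" + value);
        }

        @Override
        public void write(Object key, Iterable<Object> values) {
            List<Object> copy = new ArrayList<>();
            for (Object v : values) {
                copy.add(v);
            }
            records.add(key + "=" + copy);
        }
    }

    public static void main(String[] args) throws IOException {
        RecordingWriter writer = new RecordingWriter();
        GroupingBuffer buffer = new GroupingBuffer(writer, 4);
        buffer.add("a", 1);
        buffer.add("b", 2);
        buffer.add("a", 3);
        buffer.add("b", 4); // limit reached: a and b are written as groups
        buffer.add("a", 5);
        buffer.flush();     // writes the remaining group for a
        System.out.println(writer.records); // prints [a=[1, 3], b=[2, 4], a=[5]]
    }
}
```

Note that the same key can still be written more than once (here "a" appears in
two flushes), so the downstream sorted output would still need to merge groups;
the buffer only reduces the number of records it has to sort.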
> Allow KeyValueWriter to support writing list of values also
> -----------------------------------------------------------
>
> Key: TEZ-1260
> URL: https://issues.apache.org/jira/browse/TEZ-1260
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
>
> TEZ-1228 adds support to IFile for storing K,List<V>. Currently KeyValueWriter
> only allows writing K,V:
> public void write(Object key, Object value) throws IOException;
> We should add support for
> public void write(Object key, Iterable<Object> values) throws IOException;
> taking advantage of TEZ-1228. In a few cases, Pig unwraps key, list<values> and
> writes them as separate K,V pairs. This can avoid that overhead. It may
> even enable us to add something similar to hash-based partial aggregation for
> join, like what we do for group-by.
--
This message was sent by Atlassian JIRA
(v6.2#6252)