[ 
https://issues.apache.org/jira/browse/FLINK-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140871#comment-15140871
 ] 

ASF GitHub Bot commented on FLINK-2237:
---------------------------------------

Github user fhueske commented on the pull request:

    https://github.com/apache/flink/pull/1517#issuecomment-182398031
  
    Hi @ggevay, sorry it took me so long to review your PR.
    
    As I said before, this is a very desirable feature and a solid 
implementation.
    I think a few things can be improved. In particular, the full stop to 
resize / rebuild / emit records might take some time, depending on the size of 
the table.
    
    I have the following suggestions / ideas:
    - Split the single record area into multiple partitions (i.e., use multiple 
`RecordArea`s). Each partition holds the data of multiple buckets. This makes it 
possible to restrict rebuilding, compaction, etc. to only a part of the whole 
table. In fact, this is what I meant in my initial comment about tracking the 
updates of buckets (which I confused with partitions...). Partitioning is also 
used in the other hash tables in Flink. It can also help to make your 
implementation more suitable for the final reduce case, because it allows 
spilling individual partitions to disk. In `CompactingHashTable` the max number 
of partitions is set to `32`.
    - Should we think about [linear 
hashing](https://en.wikipedia.org/wiki/Linear_hashing) for resizing the table? 
This technique grows the table by splitting individual buckets without the need 
to reorganize the whole table.
    - Do you think it is possible to extract the ReduceFunction from the table? 
IMO this would be a cleaner design if we want to use the table instead of the 
`CompactingHashTable`.
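    
    To make the linear hashing suggestion above concrete, here is a minimal, 
self-contained sketch in plain Java. It is not Flink code and not taken from the 
PR: the class name `LinearHashSketch`, the initial bucket count, the load 
threshold, and the use of on-heap lists instead of memory segments are all 
assumptions chosen for illustration. It only shows how the table grows by 
splitting one bucket at a time.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative linear-hashing table (hypothetical; not part of Flink). */
final class LinearHashSketch {
    private static final int INITIAL_BUCKETS = 4; // assumed starting size
    private static final double MAX_LOAD = 2.0;   // assumed avg. entries per bucket

    // Each bucket holds {key, value} pairs.
    private final List<List<long[]>> buckets = new ArrayList<>();
    private int roundSize = INITIAL_BUCKETS; // bucket count at start of this round
    private int splitPointer = 0;            // next bucket to be split
    private int size = 0;

    LinearHashSketch() {
        for (int i = 0; i < INITIAL_BUCKETS; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    private static int hash(long key) {
        return Long.hashCode(key) & 0x7fffffff; // force non-negative
    }

    // Buckets before the split pointer were already split in this round,
    // so they are addressed with the doubled modulus.
    private int bucketIndex(long key) {
        int h = hash(key);
        int idx = h % roundSize;
        return idx < splitPointer ? h % (2 * roundSize) : idx;
    }

    void put(long key, long value) {
        List<long[]> bucket = buckets.get(bucketIndex(key));
        for (long[] entry : bucket) {
            if (entry[0] == key) {
                entry[1] = value;
                return;
            }
        }
        bucket.add(new long[] {key, value});
        size++;
        if ((double) size / buckets.size() > MAX_LOAD) {
            splitOneBucket(); // incremental growth: touch a single bucket
        }
    }

    Long get(long key) {
        for (long[] entry : buckets.get(bucketIndex(key))) {
            if (entry[0] == key) {
                return entry[1];
            }
        }
        return null;
    }

    // Redistribute only the bucket at splitPointer between itself and one
    // new bucket appended at index splitPointer + roundSize.
    private void splitOneBucket() {
        List<long[]> keep = new ArrayList<>();
        List<long[]> moved = new ArrayList<>();
        for (long[] entry : buckets.get(splitPointer)) {
            if (hash(entry[0]) % (2 * roundSize) == splitPointer) {
                keep.add(entry);
            } else {
                moved.add(entry);
            }
        }
        buckets.set(splitPointer, keep);
        buckets.add(moved);
        splitPointer++;
        if (splitPointer == roundSize) { // every bucket of this round was split
            roundSize *= 2;
            splitPointer = 0;
        }
    }

    int size() { return size; }

    int bucketCount() { return buckets.size(); }

    public static void main(String[] args) {
        LinearHashSketch table = new LinearHashSketch();
        for (long k = 0; k < 1000; k++) {
            table.put(k, k * k);
        }
        System.out.println(table.get(12L) + ", buckets: " + table.bucketCount());
    }
}
```

    In this sketch each insert that pushes the load over the threshold 
redistributes the entries of a single bucket only, so there is no point at which 
the whole table has to be rebuilt. How this would map onto memory segments and 
record areas is left open here.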
    
    What do you think?
    
    In any case, we need to update the documentation for the added 
`CombineHint`. It would also be good to extend `ReduceITCase` with a few tests 
that use `CombineHint.HASH`.


> Add hash-based Aggregation
> --------------------------
>
>                 Key: FLINK-2237
>                 URL: https://issues.apache.org/jira/browse/FLINK-2237
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Rafiullah Momand
>            Assignee: Gabor Gevay
>            Priority: Minor
>
> Aggregation functions at the moment are implemented in a sort-based way.
> How can we implement hash based Aggregation for Flink?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
