Ben Rahamim created DATAFU-176:
----------------------------------
Summary: Add a way to do dedupTopN with combiner
Key: DATAFU-176
URL: https://issues.apache.org/jira/browse/DATAFU-176
Project: DataFu
Issue Type: New Feature
Affects Versions: 2.1.0
Reporter: Ben Rahamim
Many of our solutions select only a fixed, usually small, number of rows per
group, ordered by a column. DataFu offers two options: dedupTopN, which uses a
window function, and dedupWithCombiner, which is limited to one record per
grouping. The window function in dedupTopN is inefficient because it orders all
of the rows in each group, and it is very susceptible to skew.
dedupWithCombiner won't let us take more than one row. A better solution would
be a class, like dedupWithCombiner, that allows selecting many rows. One
possible approach is a class that implements DeclarativeAggregate, which avoids
having to declare the schemas explicitly, uses the combiner to avoid skew, and
also benefits from Codegen.
I have prepared code that does this and will submit it as a PR.
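To illustrate the combiner idea described above, here is a minimal sketch in
plain Python. It is not the DataFu or Spark API and not the code from the PR;
the function names (partial_top_n, merge_top_n, dedup_top_n) are hypothetical,
and it assumes "top" means the highest values of the ordering column. Each
partition keeps at most n rows per group, so no group is ever fully sorted and
a skewed group only ever contributes n rows per partition to the merge step.

```python
import heapq
from collections import defaultdict


def partial_top_n(rows, n, group_key, order_key):
    """Combiner step: within one partition, keep at most n rows per group."""
    buffers = defaultdict(list)
    for row in rows:
        buf = buffers[group_key(row)]
        buf.append(row)
        if len(buf) > n:
            # Evict the lowest-ranked row so the buffer never exceeds n rows.
            buf.remove(min(buf, key=order_key))
    return dict(buffers)


def merge_top_n(partials, n, order_key):
    """Merge step: combine per-partition buffers, again keeping n rows per group."""
    merged = defaultdict(list)
    for partial in partials:
        for key, rows in partial.items():
            # nlargest returns the rows sorted from highest to lowest.
            merged[key] = heapq.nlargest(n, merged[key] + rows, key=order_key)
    return dict(merged)


def dedup_top_n(partitions, n, group_key, order_key):
    """Run the combiner on every partition, then merge the partial results."""
    partials = [partial_top_n(p, n, group_key, order_key) for p in partitions]
    return merge_top_n(partials, n, order_key)


if __name__ == "__main__":
    # Two hypothetical partitions of (group, value) rows.
    partitions = [
        [("a", 1), ("a", 5), ("a", 3), ("b", 2)],
        [("a", 9), ("b", 7), ("b", 4)],
    ]
    print(dedup_top_n(partitions, 2, lambda r: r[0], lambda r: r[1]))
```

The buffers are bounded by n, which is what makes the approach suitable for
partial aggregation. A window function has to see and order every row of a
group before it can pick the top ones.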
--
This message was sent by Atlassian Jira
(v8.20.10#820010)