[ https://issues.apache.org/jira/browse/DATAFU-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835311#comment-17835311 ]
Eyal Allweil commented on DATAFU-176:
-------------------------------------

This sounds good - we'll be happy to see the PR.

> Add a way to do dedupTopN with combiner
> ---------------------------------------
>
>                 Key: DATAFU-176
>                 URL: https://issues.apache.org/jira/browse/DATAFU-176
>             Project: DataFu
>          Issue Type: New Feature
>    Affects Versions: 2.1.0
>            Reporter: Ben Rahamim
>            Priority: Major
>              Labels: pull-request-available
>
> In a lot of our solutions, we select only a fixed number of rows per group,
> usually a small number, based on ordering by a column. DataFu has dedupTopN,
> which uses a window function, and dedupWithCombiner, which is limited to
> taking only one record per grouping. The window function in dedupTopN is
> inefficient because it orders all of the rows in each group, and it is very
> susceptible to skew. dedupWithCombiner won't let us take more than 1 row.
>
> A better solution would be to write a class, like dedupWithCombiner, that
> allows selecting many rows. One possible solution is a class that implements
> DeclarativeAggregate, which avoids having to declare the schemas explicitly,
> uses the combiner to avoid skew, and also supports Codegen.
>
> I have prepared code that does this and will submit it as a PR.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
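
For readers unfamiliar with the idea in the issue description, here is a minimal plain-Python sketch of top-N per group with a combiner-style partial aggregation. It is not DataFu's API and not the code from the PR: the function names partial_top_n and merge_top_n, the tuple row layout, and the ascending sort order are all assumptions made for illustration. Each partition keeps at most n rows per group, so no group is ever fully sorted in one place, and the final merge only sees the small per-partition buffers.

```python
import heapq
from collections import defaultdict


def partial_top_n(rows, n, group_key, order_key):
    """Combiner step (hypothetical name): within one partition, keep at most
    n rows per group, preferring rows with the smallest order_key."""
    buffers = defaultdict(list)
    for row in rows:
        buf = buffers[group_key(row)]
        buf.append(row)
        if len(buf) > n:
            # Drop the current worst row so the buffer never exceeds n.
            buf.remove(max(buf, key=order_key))
    return dict(buffers)


def merge_top_n(partials, n, order_key):
    """Final step (hypothetical name): merge the per-partition buffers and
    keep the n smallest rows per group, sorted ascending by order_key."""
    merged = defaultdict(list)
    for partial in partials:
        for group, rows in partial.items():
            merged[group].extend(rows)
    return {
        group: heapq.nsmallest(n, rows, key=order_key)
        for group, rows in merged.items()
    }


# Two "partitions" of (group, value) rows; illustrative data only.
partitions = [
    [("a", 5), ("a", 1), ("b", 7)],
    [("a", 3), ("a", 2), ("b", 4), ("b", 9)],
]
partials = [
    partial_top_n(p, 2, group_key=lambda r: r[0], order_key=lambda r: r[1])
    for p in partitions
]
result = merge_top_n(partials, 2, order_key=lambda r: r[1])
```

To take the largest values instead, negate the value inside order_key. The amount of data passed to the merge step is bounded by n rows per group per partition, which is the property that makes this approach less sensitive to skewed groups than a full per-group sort.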