[ https://issues.apache.org/jira/browse/DATAFU-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863082#comment-17863082 ]
Ben Rahamim commented on DATAFU-176:
------------------------------------

I attached the code, since getting to the PR took longer than expected. I will get to it soon (unless someone else wants to do it).

> Add a way to do dedupTopN with combiner
> ---------------------------------------
>
>                 Key: DATAFU-176
>                 URL: https://issues.apache.org/jira/browse/DATAFU-176
>             Project: DataFu
>          Issue Type: New Feature
>    Affects Versions: 2.1.0
>            Reporter: Ben Rahamim
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: scratch_340.sc
>
>
> In a lot of our solutions, we select only a fixed, usually small, number of
> rows per group, based on ordering by a column. DataFu has dedupTopN, which
> uses a window function, and dedupWithCombiner, which is limited to taking
> only one record per grouping. Because dedupTopN uses a window function, it
> is not efficient: it orders all of the rows in each group and is very
> susceptible to skew. dedupWithCombiner won't let us take more than 1 row.
> A better solution would be to write a class, like dedupWithCombiner, that
> allows selecting many rows. One possible solution would be a class that
> implements DeclarativeAggregate, which avoids having to declare the schemas
> explicitly, uses the combiner to avoid skew, and also supports Codegen.
>
> I have prepared code that does this and will submit it as a PR.
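
The idea in the issue description (keep a bounded top-N buffer per group and merge partial buffers, instead of sorting every row of every group) can be sketched in plain Python. This is only an illustration of the combiner pattern, not the attached code and not DataFu's API: it does not use Spark or DeclarativeAggregate, and all names (`update`, `merge_buffers`, `dedup_top_n`) as well as the sample data are hypothetical.

```python
import heapq

def update(buffer, row, n, order_key):
    """Add one row to a group's buffer, keeping at most n rows (smallest order_key wins)."""
    buffer.append(row)
    if len(buffer) > n:
        # Drop the row that sorts last; cheap because n is assumed to be small.
        buffer.remove(max(buffer, key=order_key))

def merge_buffers(left, right, n, order_key):
    """Combine two partial buffers of the same group into one buffer of at most n rows."""
    return heapq.nsmallest(n, left + right, key=order_key)

def dedup_top_n(partitions, n, group_key, order_key):
    """Top-n rows per group, computed as per-partition partial results that are then merged."""
    partials = []
    for rows in partitions:
        # "Combiner" step: each partition keeps at most n rows per group.
        buffers = {}
        for row in rows:
            update(buffers.setdefault(group_key(row), []), row, n, order_key)
        partials.append(buffers)

    # Merge step: only the small partial buffers are combined, never the full groups.
    final = {}
    for buffers in partials:
        for group, buf in buffers.items():
            final[group] = merge_buffers(final.get(group, []), buf, n, order_key)
    return final

# Hypothetical sample data: (group, value) rows spread over two partitions.
partitions = [
    [("a", 5), ("a", 1), ("b", 7), ("a", 9)],
    [("a", 3), ("b", 2), ("b", 8)],
]
result = dedup_top_n(partitions, n=2,
                     group_key=lambda r: r[0],
                     order_key=lambda r: r[1])
```

Because each partition hands over at most n rows per group, a heavily skewed group contributes only a small buffer per partition to the merge step, which is the property the issue is after.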