[jira] [Commented] (DATAFU-176) Add a way to do dedupTopN with combiner

Ben Rahamim (Jira) Wed, 22 Jan 2025 06:26:08 -0800


    [ 
https://issues.apache.org/jira/browse/DATAFU-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17916067#comment-17916067
 ]


Ben Rahamim commented on DATAFU-176:
------------------------------------

Hi!

I originally wrote this code against Spark 3.3, so I had to make some changes 
to get it working with 3.0.x. Sorry again for taking so long.

The PR is now ready and passing the tests.

> Add a way to do dedupTopN with combiner
> ---------------------------------------
>
>                 Key: DATAFU-176
>                 URL: https://issues.apache.org/jira/browse/DATAFU-176
>             Project: DataFu
>          Issue Type: New Feature
>    Affects Versions: 2.1.0
>            Reporter: Ben Rahamim
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: scratch_340-1.sc
>
>
> In many of our solutions we select only a small, fixed number of rows per 
> group, ordered by a column. DataFu has dedupTopN, which uses a window 
> function, and dedupWithCombiner, which is limited to taking one record per 
> group. The window function in dedupTopN is inefficient because it orders all 
> of the rows in each group, and it is very susceptible to skew. 
> dedupWithCombiner won't let us take more than one row. A better solution 
> would be a class, like dedupWithCombiner, that allows selecting many rows. 
> One possible approach is a class that implements DeclarativeAggregate, which 
> avoids having to declare the schemas explicitly, uses the combiner to avoid 
> skew, and also supports Codegen.
>  
> I have prepared code that does this and will submit it as a PR.
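
To illustrate the idea described above (not the code from the PR, and not DataFu's API), here is a minimal sketch in plain Python of a top-N aggregation with a combiner step. Each partition first keeps at most n rows per key, and only those small partial results are merged, so no single group ever has to be fully sorted in one place. The function names, the (key, score, payload) row layout, and the sample data are all assumptions made up for this example.

```python
import heapq
from collections import defaultdict


def partial_top_n(rows, n):
    """Combiner step (hypothetical helper): keep at most n highest-scoring
    rows per key within a single partition, using a bounded min-heap."""
    heaps = {}
    for seq, (key, score, payload) in enumerate(rows):
        heap = heaps.setdefault(key, [])
        # seq is unique within the partition, so payloads are never compared
        item = (score, seq, payload)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif score > heap[0][0]:
            # New row beats the smallest row kept so far: replace it
            heapq.heapreplace(heap, item)
    return {key: [(s, p) for s, _, p in heap] for key, heap in heaps.items()}


def merge_top_n(partials, n):
    """Final step (hypothetical helper): merge the per-partition results and
    keep the n highest-scoring rows per key, sorted by score descending."""
    merged = defaultdict(list)
    for partial in partials:
        for key, rows in partial.items():
            merged[key].extend(rows)
    return {
        key: heapq.nlargest(n, rows, key=lambda r: r[0])
        for key, rows in merged.items()
    }


# Made-up sample data: two partitions of (key, score, payload) rows
partitions = [
    [("u1", 5, "a"), ("u1", 9, "b"), ("u1", 1, "c"), ("u2", 3, "d")],
    [("u1", 7, "e"), ("u2", 8, "f"), ("u2", 2, "g")],
]

result = merge_top_n([partial_top_n(p, n=2) for p in partitions], n=2)
print(result)
```

Because each partition emits at most n rows per key, the amount of data that reaches the merge step for a skewed key is bounded by n times the number of partitions, rather than by the size of the group.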



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
