[ https://issues.apache.org/jira/browse/FLINK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895177#comment-15895177 ]
ASF GitHub Bot commented on FLINK-5722:
---------------------------------------

GitHub user fhueske opened a pull request:

    https://github.com/apache/flink/pull/3471

    [FLINK-5722] [table] Add dedicated DataSetDistinct operator.

    A dedicated DISTINCT operator is more efficient because we can use a `ReduceFunction`, which (optionally) supports hash-based combining.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fhueske/flink tableDistinct

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3471.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3471

----

> Implement DISTINCT as dedicated operator
> ----------------------------------------
>
>                 Key: FLINK-5722
>                 URL: https://issues.apache.org/jira/browse/FLINK-5722
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table API & SQL
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Fabian Hueske
>            Assignee: Fabian Hueske
>
> DISTINCT is currently implemented for batch Table API / SQL as an aggregate
> which groups on all fields. Grouped aggregates are implemented as a GroupReduce
> with a sort-based combiner.
> This operator can be implemented more efficiently by using a ReduceFunction and
> hinting a HashCombine strategy. The same ReduceFunction can be used for all
> DISTINCT operations and can be given appropriate forward field
> annotations.
> We would need a custom conversion rule which translates distinct aggregations
> (grouping on all fields and returning all fields) into a custom
> DataSetRelNode.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
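
The idea in the issue description can be illustrated with a small, language-neutral sketch. This is plain Python, not Flink code and not the implementation in the pull request; the names `distinct_reduce`, `hash_combine`, and `distinct` are hypothetical and exist only for this example. It assumes rows are hashable tuples and that the grouping key is the entire row, as DISTINCT implies. Because the reduce function returns one of its inputs unchanged, every field is forwarded as-is, which is presumably why forward field annotations apply.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Row = Tuple[Hashable, ...]


def distinct_reduce(first: Row, second: Row) -> Row:
    """Reduce two rows with the same key. For DISTINCT the key is the whole
    row, so both inputs are equal and either one can be returned unchanged."""
    return first


def hash_combine(rows: Iterable[Row],
                 reduce_fn: Callable[[Row, Row], Row]) -> List[Row]:
    """Hash-based combine: keep one partial result per key in a hash table
    instead of sorting the input first (hypothetical helper)."""
    table: Dict[Row, Row] = {}
    for row in rows:
        key = row  # DISTINCT groups on all fields
        if key in table:
            table[key] = reduce_fn(table[key], row)
        else:
            table[key] = row
    return list(table.values())


def distinct(partitions: List[List[Row]]) -> List[Row]:
    """Pre-combine each partition locally, then reduce the merged partials."""
    combined = [hash_combine(p, distinct_reduce) for p in partitions]
    merged = [row for part in combined for row in part]
    return hash_combine(merged, distinct_reduce)


partitions = [
    [("a", 1), ("b", 2), ("a", 1)],
    [("b", 2), ("c", 3)],
]
result = distinct(partitions)
```

The local combine step removes duplicates within each partition before any data would be exchanged, without a sort, which is the efficiency gain the issue attributes to a hash-based combiner over a sort-based one.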