[
https://issues.apache.org/jira/browse/FLINK-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117387#comment-14117387
]
ASF GitHub Bot commented on FLINK-1060:
---------------------------------------
GitHub user fhueske opened a pull request:
https://github.com/apache/incubator-flink/pull/108
[FLINK-1060] Added methods to DataSet to explicitly hash-partition or
rebalance the input of Map-based operators.
Adds three methods to DataSet:
- `DataSet.partitionByHash(int...)`
- `DataSet.partitionByHash(KeySelector)`
- `DataSet.rebalance()`
The methods create a PartitionedDataSet on which Map-based operators can be
applied (`filter()`, `map()`, `flatMap()`, and `mapPartition()`).
Feedback welcome.
I will add documentation once code changes are considered to be good to be
merged.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/fhueske/incubator-flink userPartition
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-flink/pull/108.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #108
----
commit 75d73f8ee801575691d414ea9104fb63cf3ef494
Author: Fabian Hueske <[email protected]>
Date: 2014-08-19T23:03:56Z
[FLINK-1060] Added methods to DataSet to explicitly hash-partition or
rebalance the input of Map-based operators.
----
> Support explicit shuffling of DataSets
> --------------------------------------
>
> Key: FLINK-1060
> URL: https://issues.apache.org/jira/browse/FLINK-1060
> Project: Flink
> Issue Type: Improvement
> Components: Java API, Optimizer
> Reporter: Fabian Hueske
> Assignee: Fabian Hueske
> Priority: Minor
>
> Right now, Flink only shuffles data if it is required by some operation such
> as Reduce, Join, or CoGroup. There is no way to explicitly shuffle a data set.
> However, in some situations explicit shuffling would be very helpful
> including:
> - rebalancing before compute-intensive Map operations
> - balancing, random or hash partitioning before PartitionMap operations (see
> FLINK-1053)
> - better integration of support for HadoopJobs (see FLINK-838)
> With this issue, I propose to add the following methods to {{DataSet}}
> - {{DataSet.partitionHashBy(int...)}} and
> {{DataSet.partitionHashBy(KeySelector)}} to perform an explicit hash
> partitioning
> - {{DataSet.partitionRandomly()}} to shuffle data completely random
> - {{DataSet.partitionRoundRobin()}} to shuffle data in a round-robin fashion
> that generates very even distribution with possible bias due to prior
> distributions
> The {{DataSet.partitionRoundRobin()}} might not be necessary if we think that
> random shuffling balances good enough.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)