Right now, sharding can be specified only via target `shardCount`, be it
user or runner. Next to configurable shardCount, I am proposing to be able
to pass also a function which will allow to the user (or runner) control
how is shard determined and what key will be used to represent it

interface ShardingFunction[UserT, DestinationT, ShardKeyT]  extends
Serializable {
   ShardKeyT assign(DestinationT destination, UserT element, shardCount:
Integer);
}

Default implementation can be what is right now =>  random shard
encapsulated as ShardedKey<Integer>.


On Thu, Apr 25, 2019 at 4:07 PM Reuven Lax <re...@google.com> wrote:

> If sharding is not specified, then the semantics are "runner-determined
> sharding." The DataflowRunner already takes advantage of this to impose its
> own sharding if the user hasn't specified an explicit one. Could the Flink
> runner do the same instead of pushing this to the users?
>
> On Thu, Apr 25, 2019 at 6:23 AM Maximilian Michels <m...@apache.org> wrote:
>
>> Hi Jozef,
>>
>> For sharding in FileIO there are basically two options:
>>
>> (1) num_shards ~= num_workers => bad spread of the load across workers
>> (2) num_shards >> num_workers => good spread of the load across workers,
>> but huge number of files
>>
>> Your approach would give users control over the sharding keys such that
>> they could be adjusted to spread load more evenly.
>>
>> I'd like to hear from Beam IO experts if that would make sense.
>>
>> Thanks,
>> Max
>>
>> On 25.04.19 08:52, Jozef Vilcek wrote:
>> > Hello,
>> >
>> > Right now, if someone needs sharded files via FileIO, there is only one
>> > option which is random (round robin) shard assignment per element and
>> it
>> > always use ShardedKey<Integer> as a key for the GBK which follows.
>> >
>> > I would like to generalize this and have a possibility to provide some
>> > ShardingFn[UserT, DestinationT, ShardKeyT] via FileIO.
>> > What I am mainly after is, to have a possibility to provide
>> optimisation
>> > for Flink runtime and pass in a special function which generates shard
>> > keys in a way that they are evenly spread among workers (BEAM-5865).
>> >
>> > Would such extension for FileIO make sense? If yes, I would create a
>> > ticket for it and try to draft a PR.
>> >
>> > Best,
>> > Jozef
>>
>

Reply via email to