If sharding is not specified, then the semantics are "runner-determined
sharding." The DataflowRunner already takes advantage of this to impose its
own sharding if the user hasn't specified an explicit one. Could the Flink
runner do the same instead of pushing this to the users?

On Thu, Apr 25, 2019 at 6:23 AM Maximilian Michels <m...@apache.org> wrote:

> Hi Jozef,
>
> For sharding in FileIO there are basically two options:
>
> (1) num_shards ~= num_workers => bad spread of the load across workers
> (2) num_shards >> num_workers => good spread of the load across workers,
> but huge number of files
>
> Your approach would give users control over the sharding keys such that
> they could be adjusted to spread load more evenly.
>
> I'd like to hear from Beam IO experts if that would make sense.
>
> Thanks,
> Max
>
> On 25.04.19 08:52, Jozef Vilcek wrote:
> > Hello,
> >
> > Right now, if someone needs sharded files via FileIO, there is only one
> > option which is random (round robin) shard assignment per element and it
> > always use ShardedKey<Integer> as a key for the GBK which follows.
> >
> > I would like to generalize this and have a possibility to provide some
> > ShardingFn[UserT, DestinationT, ShardKeyT] via FileIO.
> > What I am mainly after is, to have a possibility to provide optimisation
> > for Flink runtime and pass in a special function which generates shard
> > keys in a way that they are evenly spread among workers (BEAM-5865).
> >
> > Would such extension for FileIO make sense? If yes, I would create a
> > ticket for it and try to draft a PR.
> >
> > Best,
> > Jozef
>

Reply via email to