[
https://issues.apache.org/jira/browse/BEAM-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jozef Vilcek updated BEAM-12493:
--------------------------------
Resolution: Won't Do
Status: Resolved (was: Open)
> FileIO should allow to opt-in for custom sharding function
> ----------------------------------------------------------
>
> Key: BEAM-12493
> URL: https://issues.apache.org/jira/browse/BEAM-12493
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Affects Versions: 2.29.0
> Reporter: Jozef Vilcek
> Assignee: Jozef Vilcek
> Priority: P2
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> When the number of shards is explicitly specified, the default sharding
> function is `RandomShardingFunction`. `WriteFiles` does have an option to pass
> in a custom sharding function, but that option is not surfaced on the
> user-facing API at `FileIO`.
> This is limiting in these two use cases:
> # I need to generate shards that are compatible with Hive bucketing, and
> therefore need to decide shard assignment based on data fields of the element
> being sharded.
> # When a job runs on, e.g., Spark and encounters a failure that causes loss of
> some data from previous stages, Spark recomputes the necessary tasks in the
> necessary stages. Because shard assignment is random, some data will end up
> in different shards and cause duplicates in the final dataset.
> I propose to surface `.withShardingFunction()` at the `FileIO` level so that
> users can choose a custom sharding strategy when desired.
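
To illustrate the kind of sharding strategy both use cases call for, here is a minimal sketch of a deterministic shard assignment in plain Java. It does not use any Beam API: the class `DeterministicSharding` and the method `shardFor` are hypothetical names invented for this example, and CRC32 is only a stand-in hash (it is not Hive's bucketing hash). How such logic would be wired into `WriteFiles` or the proposed `.withShardingFunction()` is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper for illustration only; not part of Beam.
final class DeterministicSharding {
    private DeterministicSharding() {}

    /**
     * Maps a key derived from the element's data fields to a shard in
     * [0, shardCount). The result depends only on the key bytes and the
     * shard count, so a recomputed task assigns the same element to the
     * same shard.
     */
    static int shardFor(String key, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // getValue() holds an unsigned 32-bit checksum in a long,
        // so the remainder is never negative.
        return (int) (crc.getValue() % shardCount);
    }
}
```

Calling `DeterministicSharding.shardFor("user-42", 8)` twice returns the same shard index both times, which is the property a random assignment lacks. Matching Hive bucketing would additionally require using Hive's own hash for the bucketing column, which this sketch does not attempt.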
--
This message was sent by Atlassian Jira
(v8.3.4#803005)