[jira] [Updated] (BEAM-12493) FileIO should allow to opt-in for custom sharding function

Jozef Vilcek (Jira) Thu, 22 Jul 2021 23:52:06 -0700


     [ https://issues.apache.org/jira/browse/BEAM-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jozef Vilcek updated BEAM-12493:
--------------------------------
    Resolution: Won't Do
        Status: Resolved  (was: Open)

> FileIO should allow to opt-in for custom sharding function
> ----------------------------------------------------------
>
>                 Key: BEAM-12493
>                 URL: https://issues.apache.org/jira/browse/BEAM-12493
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>    Affects Versions: 2.29.0
>            Reporter: Jozef Vilcek
>            Assignee: Jozef Vilcek
>            Priority: P2
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When the number of shards is explicitly specified, the default sharding
> function is `RandomShardingFunction`. `WriteFiles` does have an option to
> pass in a custom sharding function, but that option is not surfaced in the
> user-facing API of `FileIO`.
> This is limiting in these two use cases:
>  # I need to generate shards that are compatible with Hive bucketing, and
> therefore need to decide shard assignment based on data fields of the
> element being sharded.
>  # When a job runs on e.g. Spark and encounters a failure that causes the
> loss of some data from previous stages, Spark recomputes the necessary tasks
> in the necessary stages. Because shard assignment is random, some data will
> end up in different shards and cause duplicates in the final dataset.
> I propose surfacing `.withShardingFunction()` at the `FileIO` level so that
> users can choose a custom sharding strategy when desired.
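
Both use cases in the description come down to making shard assignment a pure function of the element rather than a random draw. A minimal sketch of that idea in plain Java follows. It deliberately does not use Beam's `ShardingFunction` interface, and the class name `DeterministicSharding` and method `shardFor` are hypothetical names chosen for illustration. The mask-and-modulo step follows the general shape of Hive-style bucketing, but `String.hashCode()` is only a stand-in: the hash Hive actually applies depends on the column type and table configuration, so this should not be read as Hive-compatible as written.

```java
// Illustrative sketch only: this is NOT Beam's ShardingFunction API.
// Class and method names are hypothetical.
final class DeterministicSharding {

    // Maps a key to a shard in [0, numShards). Masking with Integer.MAX_VALUE
    // clears the sign bit, so the modulo result is never negative.
    // Assumption: String.hashCode() stands in for whatever hash the
    // downstream system (e.g. Hive) really uses.
    static int shardFor(String key, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numShards;
    }

    public static void main(String[] args) {
        int numShards = 4;
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            // Recomputing the assignment for the same key yields the same shard,
            // which is the property a retried stage needs to avoid duplicates.
            System.out.println(key + " -> shard " + shardFor(key, numShards));
        }
    }
}
```

Because the result depends only on the key and the shard count, a recomputed task would route each element to the same shard it was assigned before the failure, which random assignment cannot guarantee.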



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
