I would like to extend FileIO with the possibility to specify a custom
sharding function:
https://issues.apache.org/jira/browse/BEAM-12493

I have two use cases for this:

   1. I need to generate shards which are compatible with Hive bucketing,
   and therefore need to decide shard assignment based on data fields of
   the input element (see the sketch after this list).
   2. When running e.g. on Spark, if the job encounters a failure that
   causes a loss of some data from previous stages, Spark recomputes the
   necessary tasks in the affected stages to recover that data. Because
   the default shard assignment function is random, some recomputed data
   will end up in different shards than before and cause duplicates in
   the final output.
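For illustration, here is a rough sketch of what such a user-supplied
sharding function could look like. The interface name, its signature, and
the UserRecord type are hypothetical, chosen only to show the idea; they
are not part of the current FileIO API:

    import java.io.Serializable;

    // Minimal record type used only for this example.
    class UserRecord implements Serializable {
      final String userId;
      UserRecord(String userId) { this.userId = userId; }
    }

    // Hypothetical hook for the proposed extension: maps an element to a
    // shard index in [0, shardCount).
    interface ShardingFunction<UserT> extends Serializable {
      int assignShard(UserT element, int shardCount);
    }

    // Deterministic, Hive-style bucketing on a data field: the same
    // element always lands in the same shard, even when Spark recomputes
    // a lost task.
    class UserIdSharding implements ShardingFunction<UserRecord> {
      @Override
      public int assignShard(UserRecord element, int shardCount) {
        return Math.floorMod(element.userId.hashCode(), shardCount);
      }
    }

A deterministic function like this would address both use cases: shard
assignment follows the bucketing column rather than a random number, so a
recomputed element maps to the same shard it was assigned originally.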

Please let me know your thoughts in case you see a reason not to add
such an improvement.

Thanks,
Jozef
