Hi,

while implementing new transform for Apache Beam, we hit some questions about what guarantees a DataSet#rebalance() method has in terms of chaining. According to [1] there was some suboptimality if execution of chained rebalance with intermediate mapping operation. As a result the implementation was changed to use optimizer hints [2]. Unfortunately, it seems that this hint (HINT_SHIP_STRATEGY_REPARTITION) can get ignored (i.e. the data is not redstributed among workers, but is processed sequentially by single worker). Beam has some primitives where it relies heavily on the guarantees that Reshuffle really does redistribute the data.

My question therefore is - what would be the best strategy to ensure the required semantics, that is that we enforce redistribution data, even in the case where there is a chain of Reshuffle -> Map -> Reshuffle? One option seems to use groupBy -> groupReduce(identity), which has the correct semantics, but is inefficient. Alternative way might be to use hint HINT_SHIP_STRATEGY_REPARTITION_HASH, but it remains unclear if this enforces repartition in all required cases. Would anyone have any suggestions?

Thanks,

 Jan

[1] https://lists.apache.org/thread/tsbyzlk8p7m59qbgtgfm4rl0cx1rnz1j

[2] https://github.com/apache/beam/pull/11530

Reply via email to