Hi,
while implementing new transform for Apache Beam, we hit some questions
about what guarantees a DataSet#rebalance() method has in terms of
chaining. According to [1] there was some suboptimality if execution of
chained rebalance with intermediate mapping operation. As a result the
implementation was changed to use optimizer hints [2]. Unfortunately, it
seems that this hint (HINT_SHIP_STRATEGY_REPARTITION) can get ignored
(i.e. the data is not redstributed among workers, but is processed
sequentially by single worker). Beam has some primitives where it relies
heavily on the guarantees that Reshuffle really does redistribute the data.
My question therefore is - what would be the best strategy to ensure the
required semantics, that is that we enforce redistribution data, even in
the case where there is a chain of Reshuffle -> Map -> Reshuffle? One
option seems to use groupBy -> groupReduce(identity), which has the
correct semantics, but is inefficient. Alternative way might be to use
hint HINT_SHIP_STRATEGY_REPARTITION_HASH, but it remains unclear if this
enforces repartition in all required cases. Would anyone have any
suggestions?
Thanks,
Jan
[1] https://lists.apache.org/thread/tsbyzlk8p7m59qbgtgfm4rl0cx1rnz1j
[2] https://github.com/apache/beam/pull/11530