In case of rebalance(), all sources start the round-robin partitioning at
index 0. Since each source emits only very few elements, only the first 15
mappers receive any input.
It would be better to let each source start the round-robin partitioning at
a different index, something like startIdx =
The purpose of rebalance() should be to rebalance the partitions of a data
stream as evenly as possible, right?
If all senders start sending data to the same receiver and each partition
contains fewer elements than there are receivers, the partitions are not
evenly rebalanced.
That is exactly the problem Arnaud
Btw, it works with a source of parallelism 1, because only a single
source partitions (round-robin or random) the data, so it cannot happen
that several sources assign work to the same few mappers.
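
To make the effect discussed in this thread concrete, here is a minimal stand-alone simulation in plain Java. It does not use Flink and is not Flink's actual partitioner code. The class RoundRobinDemo, its methods, and the staggered start index (source i starts at receiver i % numReceivers) are assumptions made purely for illustration; the start-index formula proposed in the thread is not reproduced here.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration only; not part of Flink.
final class RoundRobinDemo {

    /**
     * Simulates several sources that each distribute a few elements
     * round-robin over the receivers. Returns how many elements each
     * receiver ends up with.
     *
     * staggered == false: every source starts at receiver 0.
     * staggered == true:  source i starts at receiver i % numReceivers
     *                     (an assumed offset, chosen for this sketch).
     */
    static int[] distribute(int numSources, int elementsPerSource,
                            int numReceivers, boolean staggered) {
        int[] received = new int[numReceivers];
        for (int source = 0; source < numSources; source++) {
            int next = staggered ? source % numReceivers : 0;
            for (int e = 0; e < elementsPerSource; e++) {
                received[next]++;
                next = (next + 1) % numReceivers;
            }
        }
        return received;
    }

    /** Counts the receivers that got at least one element. */
    static long busyReceivers(int[] received) {
        return Arrays.stream(received).filter(n -> n > 0).count();
    }

    public static void main(String[] args) {
        // 100 sources, 14 elements each, 100 receivers (numbers from the thread).
        int[] sameStart = distribute(100, 14, 100, false);
        int[] staggered = distribute(100, 14, 100, true);
        System.out.println("same start index: " + busyReceivers(sameStart) + " busy receivers");
        System.out.println("staggered start:  " + busyReceivers(staggered) + " busy receivers");
    }
}
```

With the same start index, only receivers 0 to 13 get any input (100 elements each); with the assumed staggered start, all 100 receivers get 14 elements each.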
2015-09-03 15:22 GMT+02:00 Matthias J. Sax :
> If it would be only 14 elements, you are
Hi,
If I understand you correctly, you want to have 100 mappers. Thus, you
need to call .setParallelism() after .map():
> addSource(myFileSource).rebalance().map(myFileMapper).setParallelism(100)
The order of commands you used sets the dop for the source to 100 (which
might be ignored, if
Hi,
You are right, but in fact it does not solve my problem, since I use a
parallelism of 100 everywhere. Each of my 100 sources emits only a few lines
(say 14 max), and only the first 14 downstream nodes receive data.
I get the same problem when replacing rebalance() with shuffle().
But I found a workaround: