Re: How to force the parallelism on small streams?

2015-09-03 Thread Fabian Hueske
In case of rebalance(), all sources start the round-robin partitioning at index 0. Since each source emits only very few elements, only the first 15 mappers receive any input. It would be better to let each source start the round-robin partitioning at a different index, something like startIdx =

Re: How to force the parallelism on small streams?

2015-09-03 Thread Fabian Hueske
The purpose of rebalance() should be to rebalance the partitions of a data stream as evenly as possible, right? If all senders start sending data to the same receiver and each partition holds fewer elements than there are receivers, the partitions are not evenly rebalanced. That is exactly the problem Arnaud
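
The skew described in this thread can be reproduced with a small simulation. What follows is a minimal sketch in plain Java, not Flink code: it only counts how many elements each receiver would get under round-robin distribution. The class name RoundRobinDemo, its methods, and the choice of each sender's own index as its starting offset are assumptions made for illustration; the thread does not confirm that exact formula.

```java
import java.util.Arrays;

public class RoundRobinDemo {

    /**
     * Simulates round-robin distribution and returns the number of
     * elements each receiver ends up with.
     *
     * staggerStart = false: every sender starts at receiver 0
     *                       (the behavior described in the thread).
     * staggerStart = true:  sender s starts at receiver s % receivers
     *                       (hypothetical offset, chosen for illustration).
     */
    public static int[] distribute(int senders, int elementsPerSender,
                                   int receivers, boolean staggerStart) {
        int[] received = new int[receivers];
        for (int s = 0; s < senders; s++) {
            int start = staggerStart ? s % receivers : 0;
            for (int e = 0; e < elementsPerSender; e++) {
                received[(start + e) % receivers]++;
            }
        }
        return received;
    }

    /** Counts the receivers that got at least one element. */
    public static long busyReceivers(int[] received) {
        return Arrays.stream(received).filter(count -> count > 0).count();
    }

    public static void main(String[] args) {
        // 100 sources, 14 elements each, 100 mappers, as in the thread.
        int[] sameStart = distribute(100, 14, 100, false);
        int[] staggered = distribute(100, 14, 100, true);
        System.out.println("same start index: " + busyReceivers(sameStart) + " busy receivers");
        System.out.println("staggered start:  " + busyReceivers(staggered) + " busy receivers");
    }
}
```

With 100 senders of 14 elements each and 100 receivers, the shared start index leaves only 14 receivers with any input, while the staggered variant gives every receiver 14 elements.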

Re: How to force the parallelism on small streams?

2015-09-03 Thread Fabian Hueske
Btw, it works with a source of parallelism 1 because only a single source partitions the data (round-robin or random), so there are no multiple sources assigning work to the same few mappers.
2015-09-03 15:22 GMT+02:00 Matthias J. Sax:
> If it would be only 14 elements, you are

Re: How to force the parallelism on small streams?

2015-09-02 Thread Matthias J. Sax
Hi, if I understand you correctly, you want to have 100 mappers. Thus, you need to apply .setParallelism() after .map():
> addSource(myFileSource).rebalance().map(myFileMapper).setParallelism(100)
The order of commands you used sets the dop (degree of parallelism) for the source to 100 (which might be ignored, if

RE: How to force the parallelism on small streams?

2015-09-02 Thread LINZ, Arnaud
Hi, you are right, but in fact it does not solve my problem, since I have a parallelism of 100 everywhere. Each of my 100 sources emits only a few lines (say 14 max), and only the first 14 downstream nodes receive data. The same problem occurs when replacing rebalance() with shuffle(). But I found a workaround: