Re: [PROPOSAL] Change to KafkaIO splits

2016-11-14 Thread Raghu Angadi
On Sun, Nov 13, 2016 at 10:14 PM, Davor Bonaci wrote:
> Luke is bringing up great questions, I think.

Yes, better handling of 'desiredNumSplits' by a runner would be very useful. I wanted to limit my proposal to what a source like KafkaIO could do on its own.

> My first impression is that th

Re: [PROPOSAL] Change to KafkaIO splits

2016-11-14 Thread Raghu Angadi
> I agree with all of this, except I think this also avoids the need to "remember" the original number of
> parallelism.

KafkaIO still needs to decide how many splits it needs to return in generateInitialSplits(). 'Update' could be a Dataflow-specific concern. We could drop it for this thread, thoug

Re: [PROPOSAL] Change to KafkaIO splits

2016-11-14 Thread Amit Sela
For Kafka, I don't think you're over-splitting if you split according to Kafka partitions. If your backend provides enough parallelism, you'll get a 1-1 (Source splits-to-Kafka partitions) parallelism from the KafkaIO today. The problem is with the backend not providing enough parallelism: - C

Re: [PROPOSAL] Change to KafkaIO splits

2016-11-13 Thread Davor Bonaci
Luke is bringing up great questions, I think. My first impression is that the current state is "possibly under-split", and the proposal is to move us to "possibly over-split" state. Neither is the ideal solution, as I'm sure we can find scenarios when either is not performing well. That said, if w

Re: [PROPOSAL] Change to KafkaIO splits

2016-11-11 Thread Lukasz Cwik
Why is it that we don't generate initial splits after the pipeline has been created and the runner is processing it? This would allow a runner to look at the old state of the pipeline and see how many splits there were. This would allow the runner to provide a hint as to how many splits it wants.

Re: [PROPOSAL] Change to KafkaIO splits

2016-11-11 Thread Amit Sela
+1 I think this makes more sense than the existing form of a split that is made of several Kafka partitions since, as mentioned, Kafka partitions are in fact its parallelism. As for supporting a change in the number of partitions (mainly, added partitions), I'll suggest something I brought up bef

[PROPOSAL] Change to KafkaIO splits

2016-11-10 Thread Raghu Angadi
I would like to propose a change to how many splits (sources) KafkaIO creates. The code changes are relatively simple, but the change has a couple of drawbacks I would like to discuss here. KafkaIO currently takes '*desiredNumWorkers
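
The two splitting strategies discussed in this thread (a split made of several Kafka partitions versus a 1-1 mapping of splits to partitions) can be sketched roughly as below. This is only an illustration in Python, not KafkaIO's actual code: the function names and the plain integer partition ids are hypothetical, and the round-robin assignment used for grouping is an assumption made for the example.

```python
from typing import List


def grouped_splits(partitions: List[int], desired_num_splits: int) -> List[List[int]]:
    """Group partitions into at most `desired_num_splits` splits.

    Assumption for illustration: partitions are assigned round-robin.
    Each inner list stands for one split reading several partitions.
    """
    if desired_num_splits < 1:
        raise ValueError("desired_num_splits must be at least 1")
    # Never create more splits than there are partitions.
    num_splits = min(desired_num_splits, len(partitions))
    splits: List[List[int]] = [[] for _ in range(num_splits)]
    for i, partition in enumerate(sorted(partitions)):
        splits[i % num_splits].append(partition)
    return splits


def per_partition_splits(partitions: List[int]) -> List[List[int]]:
    """Create exactly one split per partition (the 1-1 mapping)."""
    return [[partition] for partition in sorted(partitions)]


if __name__ == "__main__":
    partitions = list(range(6))
    print(grouped_splits(partitions, 4))     # 4 splits, some with 2 partitions
    print(per_partition_splits(partitions))  # 6 splits, one partition each
```

With six partitions and a desired split count of four, the grouped form yields four splits, two of which read two partitions each, while the per-partition form always yields six splits regardless of the hint.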