Re: [DISCUSS] Partitioning in Kafka

2015-08-06 Thread Gianmarco De Francisci Morales
OK, the general consensus seems to be that more elaborate partitioning functions fall within the scope of Kafka. Could somebody have a look at KAFKA-2092 then? -- Gianmarco On 30 July 2015 at 05:57, Jiangjie Qin wrote: > Just my two cents. I thin

Re: [DISCUSS] Partitioning in Kafka

2015-07-29 Thread Jiangjie Qin
Just my two cents. I think it might be OK to put this into Kafka if we agree that this might be a good use case for people who want to use Kafka as a temporary store for stream processing. At the very least, I don't see a downside to this. Thanks, Jiangjie (Becket) Qin On Tue, Jul 28, 2015 at 3:41 AM,

Re: [DISCUSS] Partitioning in Kafka

2015-07-28 Thread Gianmarco De Francisci Morales
Jason, Thanks for starting the discussion and for your very concise (and correct) summary. Ewen, while what you say is true, those kinds of datasets (large number of keys with skew) are very typical on the Web (think Twitter users, or Web pages, or even just plain text). If you want to compute an

Re: [DISCUSS] Partitioning in Kafka

2015-07-27 Thread Gwen Shapira
I guess it depends on whether the original producer did any "map" tasks or simply wrote raw data. We usually advocate writing raw data, and since we need to write it anyway, the partitioner doesn't introduce any extra "hops". It's definitely useful to look at use-cases and I need to think a bit mor

Re: [DISCUSS] Partitioning in Kafka

2015-07-27 Thread Ewen Cheslack-Postava
Gwen - this is really like two steps of map reduce though, right? The first step does the partial shuffle to two partitions per key, second step does partial reduce + final full shuffle, final step does the final reduce. This strikes me as similar to partition assignment strategies in the consumer
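
The two-stage flow described in this message (partial reduce per partition, then a final reduce over the partial results) can be illustrated with a small sketch. This is only an illustration under assumptions: the function names `partial_counts` and `merge_counts` are hypothetical, plain Python lists stand in for Kafka partitions, and a per-key count stands in for whatever aggregate is actually being computed.

```python
from collections import Counter

def partial_counts(partition_messages):
    """Stage 1 (partial reduce): count the keys seen in ONE partition.

    Because a key may be spread over two partitions, this count is
    only a partial result for that key.
    """
    return Counter(partition_messages)

def merge_counts(partials):
    """Stage 2 (final reduce): sum the partial counts for each key."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total

# Hypothetical data: key "a" was split across two partitions.
partitions = [["a", "a", "b"], ["a", "c"]]
partials = [partial_counts(p) for p in partitions]
totals = merge_counts(partials)
```

In this sketch each key contributes at most two partial results to the second stage, which is what keeps the final shuffle small compared with shuffling the raw messages.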

Re: [DISCUSS] Partitioning in Kafka

2015-07-27 Thread Gwen Shapira
If you are used to map-reduce patterns, this sounds like a perfectly natural way to process streams of data. Call the first consumer "map-combine-log", the topic "shuffle-log" and the second consumer "reduce-log" :) I like that a lot. It works well for either "embarrassingly parallel" cases, or "s

Re: [DISCUSS] Partitioning in Kafka

2015-07-27 Thread Jason Gustafson
For a little background, the difference between this partitioner and the default one is that it breaks the deterministic mapping from key to partition. Instead, messages for a given key can end up in either of two partitions. This means that the consumer generally won't see all messages for a given
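
A minimal sketch of the idea described in this message, where a key maps to two candidate partitions instead of one. It is not the code from KAFKA-2092: the class name `TwoChoicePartitioner`, the MD5-based hashing, and the rule of choosing the candidate with the lower locally observed load are all assumptions made for illustration.

```python
import hashlib

def _seeded_hash(key: bytes, seed: int) -> int:
    """Hash the key together with a one-byte seed (illustrative choice of MD5)."""
    digest = hashlib.md5(bytes([seed]) + key).digest()
    return int.from_bytes(digest[:8], "big")

class TwoChoicePartitioner:
    """Hypothetical partitioner: every key has two candidate partitions."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        # Messages sent to each partition by this producer instance only.
        self.load = [0] * num_partitions

    def candidates(self, key: bytes):
        """Return the two candidate partitions for a key (they may coincide)."""
        n = self.num_partitions
        return (_seeded_hash(key, 0) % n, _seeded_hash(key, 1) % n)

    def partition(self, key: bytes) -> int:
        """Send the message to whichever candidate has received less so far."""
        first, second = self.candidates(key)
        chosen = first if self.load[first] <= self.load[second] else second
        self.load[chosen] += 1
        return chosen

# Usage: a single hot key never leaves its two candidate partitions.
partitioner = TwoChoicePartitioner(num_partitions=8)
used = {partitioner.partition(b"hot-key") for _ in range(100)}
```

The trade-off is the one noted above: a consumer reading a single partition sees only part of the messages for a key, so per-key results have to be combined downstream.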

[DISCUSS] Partitioning in Kafka

2015-07-22 Thread Gianmarco De Francisci Morales
Hello folks, I'd like to ask the community about its opinion on the partitioning functions in Kafka. With KAFKA-2091 integrated we are now able to have custom partitioners in the producer. The question now becomes *which* partitioners should ship