Re: Optimizing Kafka Stream

2014-06-02 Thread Raphael Hsieh
Thanks for the tips Chi, I'm a little confused about the partitioning. I had thought that the number of partitions was determined by the amount of parallelism in the topology. For example, if I said .parallelismHint(4), then I would have 4 different partitions. Is this not the case? Is there a set …

Re: Optimizing Kafka Stream

2014-06-02 Thread Chi Hoang
Raphael, The number of partitions is defined in your Kafka configuration - http://kafka.apache.org/documentation.html#brokerconfigs (num.partitions) - or when you create the topic. The behavior is different for each version of Kafka, so you should read more documentation. Your topology needs to …
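To make Chi's point concrete: the partition count lives on the Kafka side, not in the topology. A minimal sketch of the two places it can be set is below; the topic name and ZooKeeper address are placeholders, and the exact tooling varies by Kafka version (0.7-era tooling differed from the 0.8.x syntax shown here).

```shell
# Broker-wide default for auto-created topics (server.properties):
#   num.partitions=4

# Or set the partition count explicitly when creating the topic
# (Kafka 0.8.x syntax; topic name and ZooKeeper address are placeholders):
bin/kafka-topics.sh --create \
  --zookeeper localhost:2181 \
  --replication-factor 1 \
  --partitions 4 \
  --topic my-topic
```

Calling .parallelismHint(4) in the topology only changes how many spout executors Storm runs; it does not change how many partitions the topic has.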

Re: Optimizing Kafka Stream

2014-06-02 Thread Raphael Hsieh
Oh ok. Thanks Chi! Do you have any ideas about why my batch size never seems to get any bigger than 83K tuples? Currently I'm just using a barebones topology that looks like this: Stream spout = topology.newStream(..., ...).parallelismHint(…).groupBy(new Fields("time")).aggregate(new …

Re: Optimizing Kafka Stream

2014-05-31 Thread Chi Hoang
Raphael, You can try tuning your parallelism (and num workers). For Kafka 0.7, your spout parallelism could max out at: # brokers x # partitions (for the topic). If you have 4 Kafka brokers, and your topic has 5 partitions, then you could set the spout parallelism to 20 to maximize the …
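Chi's rule of thumb above can be sketched as a one-line calculation. This is just the arithmetic from the message, wrapped in a hypothetical helper for illustration:

```python
# Rule of thumb from this thread (Kafka 0.7): spout parallelism maxes
# out at (# brokers) x (# partitions per topic), since each
# broker/partition pair can feed at most one spout task.
def max_spout_parallelism(num_brokers: int, partitions_per_topic: int) -> int:
    return num_brokers * partitions_per_topic

print(max_spout_parallelism(4, 5))  # 4 brokers x 5 partitions -> 20
```

Setting the spout's parallelism hint above this ceiling just leaves the extra executors idle, since there are no additional partitions for them to consume.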

Optimizing Kafka Stream

2014-05-30 Thread Raphael Hsieh
I am in the process of optimizing my stream. Currently I expect 5,000,000 tuples to come out of my spout per minute. I am trying to beef up my topology in order to process this in real time without falling behind. For some reason my batch size is capping out at 83 thousand tuples. I can't seem to …
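A quick back-of-the-envelope check on the numbers in this message: 5,000,000 tuples per minute is roughly 83K tuples per second, which happens to match the observed batch cap. One possible reading (an assumption on my part, not something confirmed in the thread) is that each Trident batch is simply collecting about one second's worth of input:

```python
# 5,000,000 tuples/minute works out to ~83,333 tuples/second.
# The observed ~83K batch cap matches roughly one second of input,
# suggesting (speculatively) that the batch interval, not throughput
# capacity, may be what bounds the batch size here.
tuples_per_minute = 5_000_000
tuples_per_second = tuples_per_minute / 60
print(round(tuples_per_second))  # -> 83333
```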