Thanks for the tips Chi, I'm a little confused about the partitioning. I had thought that the number of partitions was determined by the amount of parallelism in the topology. For example if I said .parallelismHint(4), then I would have 4 different partitions. Is this not the case ? Is there a set number of partitions my topology has that I need to increase in order to have higher parallelism ?
Thanks On Sat, May 31, 2014 at 11:50 AM, Chi Hoang <c...@groupon.com> wrote: > Raphael, > You can try tuning your parallelism (and num workers). > > For Kafka 0.7, your spout parallelism could max out at: # brokers x # > partitions (for the topic). If you have 4 Kafka brokers, and your topic > has 5 partitions, then you could set the spout parallelism to 20 to > maximize the throughput. > > For Kafka 0.8+, your spout parallelism could max out at # partitions for > the topic, so if your topic has 5 partitions, then you would set the spout > parallelism to 5. To increase parallelism, you would need to increase the > number of partitions for your topic (by using the add partitions utility). > > As for the number of workers, setting it to 1 means that your spout will > only run on a single Storm node, and would likely share resources with > other Storm processes (spouts and bolts). I recommend to increase the > number of workers so Storm has a chance to spread out the work, and keep a > good balance. > > Hope this helps. > > Chi > > > On Fri, May 30, 2014 at 4:24 PM, Raphael Hsieh <raffihs...@gmail.com> > wrote: > >> I am in the process of optimizing my stream. Currently I expect 5 000 000 >> tuples to come out of my spout per minute. I am trying to beef up my >> topology in order to process this in real time without falling behind. >> >> For some reason my batch size is capping out at 83 thousand tuples. I >> can't seem to make it any bigger. the processing time doesn't seem to get >> any smaller than 2-3 seconds either. >> I'm not sure how to configure the topology to get any faster / more >> efficient. >> >> Currently all the topology does is a groupby on time and an aggregation >> (Count) to aggregate everything. >> >> Here are some data points i've figured out. >> >> Batch Size:5mb >> num-workers: 1 >> parallelismHint: 2 >> (I'll write this a 5mb, 1, 2) >> >> 5mb, 1, 2 = 83K tuples / 6s >> 10mb, 1, 2 = 83k / 7s >> 5mb, 1, 4 = 83k / 6s >> 5mb, 2, 4 = 83k / 3s >> 5mb, 3, 6 = 83k / 3s >> 10mb, 3, 6 = 83k / 3s >> >> Can anybody help me figure out how to get it to process things faster ? >> >> My maxSpoutPending is at 1, but when I increased it to 2 it was the same. >> MessageTimeoutSec = 100 >> >> I've been following this blog: https://gist.github.com/mrflip/5958028 >> to an extent, not everything word for word though. >> >> I need to be able to process around 66,000 tuples per second and I'm >> starting to run out of ideas. >> >> Thanks >> >> -- >> Raphael Hsieh >> >> >> > > > -- Raphael Hsieh