A couple of things I'd suggest checking:

1.- Perhaps your data is skewed, i.e. the hash function sends the bulk of
the messages to a single bolt executor? Check the number of executed tuples
per bolt in the Storm UI. If that is the case, then all the parallelism you
can set won't give you gains; you'd perhaps need a better grouping field (or
grouping strategy) than the one you're using right now.
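To make that concrete, here is a rough sketch of what I mean (Storm 0.9.x
API; the component names and the "userId" field are made up, since I don't
know your actual topology):

    import backtype.storm.topology.IRichBolt;
    import backtype.storm.topology.IRichSpout;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class GroupingSketch {
        public static TopologyBuilder build(IRichSpout kafkaSpout, IRichBolt processBolt) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", kafkaSpout, 1);
            // Fields grouping on a low-cardinality field can funnel most tuples
            // into one executor. Group on a higher-cardinality key, or use
            // shuffleGrouping if you don't actually need per-key ordering.
            builder.setBolt("process-bolt", processBolt, 20)
                   .fieldsGrouping("kafka-spout", new Fields("userId"));
                // .shuffleGrouping("kafka-spout");  // alternative when per-key ordering isn't required
            return builder;
        }
    }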

2.- Is it possible to batch at the spout? We saw huge performance
improvements by batching at the spout and receiving lists of values in the
next bolt instead of single objects. We were using AMPS (similar to Kafka)
and were able to squeeze up to 80k msg/sec out of it from a single spout.
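Very roughly, such a spout has the shape of the sketch below (hypothetical
code, not our actual AMPS spout; the batch size, the "batch" field name and
the pollSource() placeholder are invented):

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Drains up to BATCH_SIZE messages from the source per nextTuple() call and
    // emits them as a single tuple, so downstream pays the tuple overhead once
    // per batch instead of once per message.
    public class BatchingSpout extends BaseRichSpout {
        private static final int BATCH_SIZE = 500;  // arbitrary example value
        private transient SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            List<byte[]> batch = new ArrayList<byte[]>(BATCH_SIZE);
            byte[] msg;
            while (batch.size() < BATCH_SIZE && (msg = pollSource()) != null) {
                batch.add(msg);
            }
            if (!batch.isEmpty()) {
                collector.emit(new Values(batch));  // one tuple carrying many messages
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("batch"));
        }

        // Placeholder for a non-blocking poll of whatever source you consume from.
        private byte[] pollSource() {
            return null;
        }
    }

The receiving bolt then pulls the whole list out of the tuple (e.g. via
tuple.getValue(0)) and processes the messages in a loop.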

3.- Considering the size of your messages, perhaps my second suggestion
isn't that good :) Check the parameters that control the internal Storm
queue sizes. This article is from 2013, but it's a good starting point for
experimenting with them:
http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/

Also, you can play with the max spout pending (in-flight messages) setting;
perhaps your big messages are clogging Storm's internals more than you think.
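For reference, these are the knobs I mean; the numbers below are just
example values to start experimenting with, not recommendations for your
workload:

    import backtype.storm.Config;

    public class QueueTuningSketch {
        public static Config queueTuning() {
            Config conf = new Config();
            // Internal queue (buffer) sizes; the article above explains each one.
            conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
            conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
            conf.put(Config.TOPOLOGY_RECEIVER_BUFFER_SIZE, 8);
            conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);
            // Cap on unacked tuples in flight per spout task.
            conf.setMaxSpoutPending(2000);
            return conf;
        }
    }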

4.- Are you using a single worker? I've found that we achieved better
performance using one worker per machine and allocating a lot of memory to
it; inter-JVM communication is costly.
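For example (the worker count and heap size are made-up values for a
hypothetical 4-node cluster, added to the same Config object as in the
sketch under point 3):

    conf.setNumWorkers(4);                                 // one worker per machine on a 4-node cluster
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx8g");  // give that single worker a big heap (example value)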

Regards,
Javier

On Fri, Jun 5, 2015 at 4:29 PM, Fang Chen <fc2...@gmail.com> wrote:

> Thanks for your suggestion.  I actually started with multiple partitions and
> found that I could not control the severity of out-of-order tuple issues
> across partitions (I can only tolerate a small degree of out-of-order
> delivery, but throughput from different partitions can't be coordinated, so
> it's a no-no for us).  That said, multiple partitions do work, and the
> throughput was way better.
>
> With the Kafka console consumer, I did not do anything fancy, just timing
> how fast it could consume from the same Kafka partition. I did some
> calculations, and it seems the console consumer almost saturated the 1G
> link, taking the compression used into account. This makes me believe the
> bottleneck is inside Storm.
>
> My real topology is also simple: kafka topic (1 partition) -> spout
> (1 task) -> boltType1 (20 tasks, fields grouping).  Fields grouping is used
> to ensure that each tuple only goes to one consumer bolt.  I even replaced
> my real bolt with an empty one as mentioned before and also tried disabling
> acking, still without significant improvement.
>
> Thanks,
> Fang
>
> On Fri, Jun 5, 2015 at 8:03 AM, nitin sharma <kumarsharma.ni...@gmail.com>
> wrote:
>
>> Hi Fang,
>>
>> Can you elaborate more on what kind of testing you performed with
>> the Kafka console consumer?
>>
>> Moreover, can you try creating more partitions in your Kafka topic and
>> then increasing the Kafka spout parallelism to see if that works?
>>
>> It would be great if you could paste how you are building your topology.
>>
>> Regards,
>> Nitin Kumar Sharma.
>>
>>
>> On Thu, Jun 4, 2015 at 3:03 PM, Fang Chen <fc2...@gmail.com> wrote:
>>
>>> One correction: my message size is about 3 KB each.
>>>
>>> I did another round for comparison. I disabled acking altogether, and
>>> still the throughput was only slightly better, at 12k tuples/s. So I used
>>> Kafka's console consumer from one of the cluster nodes (different from the
>>> one where the partition is located) to find out whether the network was
>>> the bottleneck. This time I easily achieved 70k+ tuples/s.
>>>
>>> Any thoughts on this?
>>>
>>> Thanks
>>>
>>> On Wed, Jun 3, 2015 at 10:55 PM, Fang Chen <fc2...@gmail.com> wrote:
>>>
>>>> My use case requires total order within the Kafka queue, so I tested with
>>>> a topic with only 1 partition. My spout parallelism was set to 1, and bolt
>>>> parallelism to 20. The messages are less than 1 KB each.
>>>>
>>>> No matter how I tune the Kafka spout configs, including the fetch-related
>>>> params and max spout pending, I could only get about 10k tuples/s, with
>>>> very low complete latency (<10 ms).
>>>>
>>>> I even tried with an empty bolt that acks tuples immediately without any
>>>> extra processing. The throughput is still similar, though the complete
>>>> latency was even lower. This makes me wonder if I've hit some sort of
>>>> performance wall.
>>>>
>>>> My boxes are quite powerful bare-metal machines (40 cores, lots of disk
>>>> space, 96 GB memory, 1G network), and the worker JVM was tuned, so GC
>>>> pauses there are negligible.
>>>>
>>>>  Any advice on what I can tune or look into?
>>>>
>>>> Thanks a lot!
>>>>
>>>> Fang
>>>>
>>>
>>>
>>
>


-- 
Javier González Nicolini
