Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2241
  
    @roshannaik I appreciate your last comment and trying to summarize the 
concerns that have been raised.
    
    > 1. Better handling of low throughput Topos.
    
    Yes, lower CPU usage and lower latency by default.  If all this takes is changing some default configs, then let's do that.  I am very concerned with having something that requires a lot of manual tuning.  Most users will not know how to do it; they will end up copying and pasting something off of the internet and getting it wrong.  That is why I was running my tests with out-of-the-box performance.
    
    I also want to be sure that we pay attention to mixed use case topologies, like one with DRPC queries.  You may have one part of your topology with high throughput, aka the data path, and another part of the topology (the DRPC control/query path) with very low throughput.  Waiting seconds for a DRPC query to fill a batch that will never fill is painful.
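    To make the defaults point concrete, the kind of per-topology overrides I have in mind would look something like this (a sketch only; the exact config key names and defaults in this patch may differ, so treat these as illustrative):

    ```yaml
    # Sketch of low-latency-friendly overrides for a topology with a
    # low-throughput path (e.g. DRPC).  Key names are illustrative.
    topology.producer.batch.size: 1           # do not hold tuples waiting for a batch to fill
    topology.batch.flush.interval.millis: 1   # flush partial batches quickly on low-rate paths
    ```

    If defaults along these lines give acceptable CPU usage and latency out of the box, that addresses most of my concern here.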
    
    > 2. TVL topo: Able to run this ...
    
    OK, but to me it was just an indication that something had changed drastically, and not necessarily in a good way.  My big concern is not TVL.  I don't really care much about that (and we can discuss benchmark/testing methodologies on a separate JIRA).  It is that with STORM-2306 there are some seriously counterintuitive situations (the low throughput you already called out) and some really scary pitfalls (the CPU contention [mentioned above](https://github.com/apache/storm/pull/2241#issuecomment-318494665)), and I want to be sure they are addressed in some way.  As an end user I see that one of my bolts is backed up, so I increase the parallelism, and the performance gets much worse with no indication at all in any logs or metrics of why it got worse.  At a minimum we need a good way to know when this is happening, and ideally the performance should degrade gracefully instead.
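    To be concrete about that scenario, the natural end-user reaction would be something like (topology and component names here are just placeholders):

    ```shell
    # A bolt is backed up, so the user does the obvious thing:
    # raise its parallelism with a rebalance.
    storm rebalance my-topology -e my-bolt=8
    ```

    With the CPU contention pitfall above, that obvious reaction can make things worse, and today nothing in the logs or metrics would tell the user why.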
    
    > 3. Bug in Multi worker mode prevents inter-worker communication.
    
    I was wrong; this works.  I was just seeing messages time out because of the original problem with the host being overloaded, and I interpreted it incorrectly.
    
    > 5. Some "real-world topology" runs in addition to benchmark-style topos.
    
    Yes, and preferably ones that are running on more than one machine.  Ideally some that have multiple different topologies running at the same time on the same cluster too, so we can see what happens when there is CPU contention.  It would also be good to observe someone who has a currently working topology try to run it under the new system.  It might help us see where we need better documentation or to adjust default settings.
    
    ...
    
    > 6. Get some more runs of TVL.
    
    I am happy to provide some of that.  I spent some time on it the past few days trying to understand better how this patch compares to what is on master, but I'll put that in a separate post, as this is getting long already and I may have to talk about benchmark methodology some.

