Github user revans2 commented on the issue: https://github.com/apache/storm/pull/2241

@roshannaik I have some new performance numbers. These are not final, as I have not done any tuning yet, but let me explain the test. I really wanted to get a handle on how this would impact me; yes, I am very selfish that way. I have been wanting for a long time to be able to write a tool that could simulate this, but it was never high enough on my priority list to make it happen until now. Thanks for the nudge to do this, by the way.

So I wrote a [tool](https://github.com/revans2/incubator-storm/tree/STORM-2306-with-loadgen/examples/storm-starter/src/jvm/org/apache/storm/starter/loadgen) and will do a formal pull request after some more features, cleanup, and moving it into its own stand-alone package. The tool captures a snapshot of topologies running on a cluster. It grabs the bolts, spouts, and all of the streams connecting them, along with metrics for those streams. The metrics include the output rate and the process/execute latency for each incoming stream. I can then simulate the throughput for each stream and the latency for each bolt; I use a Gaussian distribution that matches the measured distribution (a rough sketch of the idea is below).

For this particular test I captured 104 topologies from a single production cluster: 600+ nodes with 19k+ CPU cores and about 70 TB of memory. The nodes are heterogeneous, and the cluster plus all of the topologies are tuned for the Resource Aware Scheduler. (I really want to release these captured metrics, but I need to fight with legal to get approval to do it first.) It is a mix of Trident and regular Storm topologies, some with very high throughput (850k/sec in the highest case) and some with very low throughput.

I then set up another, much smaller cluster of 9 nodes (1 for nimbus + ZK, 8 for supervisors):

```
Processors: 2 x Xeon E5-2430 2.20GHz, 7.2GT QPI (HT enabled, 12 cores, 24 threads)
            - Sandy Bridge-EP C2, 64-bit, 6-core, 32nm, L3: 15MB
Memory: 46.7GB / 48GB 1600MHz DDR3 == 6 x 8GB - 8GB PC3-12800 Samsung DDR3-1600 ECC Registered CL11 4Rx4
RHEL 6.7
java 1.8.0_131-b11
```

This was a non-RAS cluster. The only settings I changed for the cluster were the locations of nimbus and zookeeper. I then replayed each of the topologies one at a time with default settings plus `-c topology.worker.max.heap.size.mb=5500 -c nimbus.thrift.max_buffer_size=10485760`. (This takes almost 9 hours, because each test runs for about 5 minutes to be sure that the cluster has stabilized.) The first changed setting is because nimbus was rejecting a few topologies: they had RAS settings sized for a very large cluster, and the default settings in storm did not allow them to run. The second one is because on a few very large topologies the client was rejecting results from nimbus, as the thrift metrics were too large (1 MB is the default cutoff).

Of these 104 topologies, 95 ran on 2.x-SNAPSHOT without any real issues and were able to keep up with the throughput. On STORM-2306, 94 ran without any issues and were able to keep up with the throughput. I have not yet debugged the one topology that would just stop processing on STORM-2306 but ran fine on 2.x-SNAPSHOT.

From those runs I measured the latency and CPU/memory used (just like with ThroughputVsLatency). My goal was to see how these changed from one version of storm to another and to get an idea of how much pain it would be, out of the box, for me and my customers to start moving to it; again, my own selfishness here in action.
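To make the latency-simulation part concrete, here is a rough sketch of the idea. The class and constructor names are made up for illustration; this is not the actual loadgen code, which also handles output streams and may implement the delay differently (e.g. sleeping instead of spinning).

```java
import java.util.Map;
import java.util.Random;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

// Illustrative only: a bolt that replays a measured execute latency by
// sampling from a Gaussian fitted to the captured mean/stddev for a stream.
public class SimulatedLatencyBolt extends BaseRichBolt {
    private final double meanMs;   // measured mean execute latency in ms
    private final double stddevMs; // measured standard deviation in ms
    private transient OutputCollector collector;
    private transient Random rand;

    public SimulatedLatencyBolt(double meanMs, double stddevMs) {
        this.meanMs = meanMs;
        this.stddevMs = stddevMs;
    }

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.rand = new Random();
    }

    @Override
    public void execute(Tuple input) {
        // Sample a latency from the fitted distribution (clamped at 0).
        double latencyMs = Math.max(0.0, meanMs + rand.nextGaussian() * stddevMs);
        long deadlineNs = System.nanoTime() + (long) (latencyMs * 1_000_000L);
        // Spin so the simulated work burns CPU the way real processing would.
        while (System.nanoTime() < deadlineNs) { /* busy wait */ }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // No output streams in this minimal sketch.
    }
}
```

The spout side is analogous: emit tuples at the captured per-stream rate.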
As I said, I have not tuned any of the topologies yet. I plan on doing that next to see if I can improve some of the measurements. Here are my results. As a note, for COST I used 1 core equal to 2 GB of memory, because that is the ratio we bought hardware at originally (but not so much any more); see the quick arithmetic check at the end of this comment.

| | CPU Summary | Memory Summary | Cost Summary | Avg Latency Diff ms (weighted by throughput) |
| -- | -- | -- | -- | -- |
| Measured Diff | about 250 more needed | about 27 GB less needed | 235 more needed | 5.22 ms more per tuple |
| In Cluster Total | about 19k cores | about 70 TB | about 55k | |
| Percent diff of cluster total | 1.28% | -0.04% | 0.43% | |
| In Cluster Assigned | about 13k cores | about 50 TB | about 40k | |
| Percent of Assigned | 1.92% | -0.05% | 0.61% | |
| Amount Per Node | 47 | about 200 GB | | |
| Change (Nodes) | 6 | -0 | | |

As I said, I have not tuned anything yet. I need to, because the assumption here is that users will need to tune their topologies for higher throughput jobs. I don't like that, because it adds to the pain for my users, but I will try to quantify it in some way. As it stands now things don't look good for STORM-2306 (and I know running a test that no one else can reproduce is really crappy, but hopefully others can take my tool and play around with their own topologies).

On average each tuple is 5 ms slower with STORM-2306, although most of that came from the one really big topology at 850k/sec, where STORM-2306 was 30 ms slower. For most use cases, though, 5 ms or even 30 ms is not a big deal, and I would consider the two versions more or less equal on latency.

For resource usage, though, we start to run into something more significant. Memory usage went down by about 27 GB, but CPU usage went up by about 250 cores. This means my currently running topologies would need approximately 6 more servers to handle the same load they have now, or just under a 2% increase in cluster usage. It is small enough that most topologies might just be able to absorb it without any changes, and hopefully with some tuning we can make it smaller.

My next steps are to take a sample of the topologies and tune them on both systems to see how much I can reduce the cost while still maintaining the throughput and not increasing the latency by too much.
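One last note on the cost model, since the table is dense. This is just my arithmetic written out (a hypothetical class, not part of the tool), showing how the cost and node numbers above fall out of the 1 core == 2 GB equivalence:

```java
// Back-of-the-envelope check of the cost table above, assuming the stated
// equivalence of 1 core == 2 GB of memory (so cost is in "core equivalents").
public class CostCheck {
    public static void main(String[] args) {
        double cpuDiffCores = 250.0;   // about 250 more cores needed on STORM-2306
        double memDiffGb = -27.0;      // about 27 GB less memory needed
        double coresPerNode = 47.0;    // "Amount Per Node" from the table

        // cost = cores + GB / 2: 250 - 13.5 = 236.5, i.e. the "235 more needed" row
        double costDiff = cpuDiffCores + memDiffGb / 2.0;
        System.out.printf("cost diff: %.1f core equivalents%n", costDiff);

        // Same model on the whole cluster: 19k cores + 70,000 GB / 2 = 54k, "about 55k"
        double clusterCost = 19_000 + 70_000 / 2.0;
        System.out.printf("cluster cost: %.0f core equivalents%n", clusterCost);

        // The extra CPU translates into nodes: ceil(250 / 47) = 6 more servers
        double extraNodes = Math.ceil(cpuDiffCores / coresPerNode);
        System.out.printf("extra nodes: %.0f%n", extraNodes);
    }
}
```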