GitHub user srdo opened a pull request: https://github.com/apache/storm/pull/2704
STORM-1038: Upgrade to Netty 4

This is a continuation of the work done at https://github.com/apache/storm/pull/728.

### Important changes:

* Split MessageEncoder into multiple encoders depending on the message type, and made the stateless ones singletons.
* Moved worker context termination to after the transfer thread has shut down. Shutting the context down first looks racy to me, since the transfer thread uses it.
* ChannelGroups automatically remove closed Channels, so Server/PacemakerServer no longer removes closed channels manually.
* Made some changes to TransferDrainer for readability. I didn't see any performance impact from them, and they aren't necessary for the upgrade to Netty 4. Let me know if they should go in another PR.

# Benchmarking

I benchmarked on a single-node install. Results are available at https://drive.google.com/open?id=1OuYHusyshQbzx38q5jchj8W98WG6An0L.

## Speed of light

Ran `./storm jar .\storm-perf-2.0.0-SNAPSHOT.jar org.apache.storm.perf.ConstSpoutIdBoltNullBoltTopo 600 .\ConstSpoutIdBoltNullBoltTopo.yaml` with the following config:

```
spout.count : 1
bolt1.count : 1 # IdBolt instances
bolt2.count : 1 # DevNullBolt instances
topology.workers : 3
```

This should force all messages to cross to another worker, which hopefully does a good job of exposing any decrease in throughput in the changed code.

## TVL

I didn't put much thought into the number of workers or the component counts here; let me know if I should try rerunning with some other numbers.

Ran `./storm jar .\storm-loadgen-2.0.0-SNAPSHOT.jar org.apache.storm.loadgen.ThroughputVsLatency --rate 90000 --spouts 4 --splitters 4 --counters 4 --reporter 'tsv:test.csv' -c topology.workers=4 --test-time 10`. 90k tuples is more than my machine can handle, so this run mostly demonstrates that Storm doesn't choke any worse than it did before.
Also ran `./storm jar .\storm-loadgen-2.0.0-SNAPSHOT.jar org.apache.storm.loadgen.ThroughputVsLatency --rate 75000 --spouts 4 --splitters 4 --counters 4 --reporter 'tsv:test.csv' -c topology.workers=4 --test-time 10`. 75k is right on the edge of what my machine can put through, so it shows what latencies look like compared to master.

Performance looks pretty much the same to me before and after these changes. Let me know if there are other tests I should run to validate them.

# To do at some point

* ByteBufs are reference counted, so we should validate that we don't leak references when testing. Netty can tell us when a leak occurs, but I haven't set anything up to fail the build if these logs occur during testing yet: https://netty.io/wiki/reference-counted-objects.html#leak-detection-levels. I did run some tests with the detection level set to paranoid, and didn't see any leaks.
* Didn't really test Pacemaker.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srdo/storm STORM-1038-clean

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/2704.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2704

----

commit 20db8c339cbabefab1fe4cccb74d68c87cc36264
Author: Hang Sun <hsun@...>
Date:   2018-05-11T13:57:47Z

    STORM-1038: Upgrade to Netty 4.x. See https://github.com/apache/storm/pull/728
    for the original contribution.

commit 2d0f96102e20ca1c89fb65d05a860456c9d10a96
Author: Stig Rohde Døssing <srdo@...>
Date:   2018-05-11T14:55:35Z

    STORM-1038: Upgrade to latest Netty

commit 5fbe93339061c539ae20e06c7158f892ba4abb3c
Author: Stig Rohde Døssing <srdo@...>
Date:   2018-06-05T14:32:06Z

    Refactor of transfer drainer

----
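As an aside on the leak-detection to-do above: Netty's leak detector level is controlled by the `io.netty.leakDetection.level` JVM system property, so one way to run a whole topology at the paranoid level is through the workers' JVM options. A sketch of what that could look like in `storm.yaml` — `topology.worker.childopts` is a standard Storm config, but treat this exact wiring as my assumption, not something this PR adds:

```
# Illustrative config only, not part of this PR.
# Runs Netty's leak detector at its most aggressive setting, which
# samples every allocation (see the leak-detection-levels wiki page
# linked above). Useful for tests, too slow for production.
topology.worker.childopts: "-Dio.netty.leakDetection.level=paranoid"
```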
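For readers less familiar with the reference-counting contract mentioned in the to-do list, here is a minimal, stdlib-only sketch of the retain/release semantics that Netty's ByteBuf follows. The `RefCounted` class below is hypothetical and purely illustrative — it is not Netty's API, just the shape of the contract: an object starts at a reference count of 1, `retain()` increments, `release()` decrements and frees at zero, and a "leak" is an object that reaches garbage collection without ever hitting zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of Netty-style reference counting.
public class RefCounted {
    private final AtomicInteger refCnt = new AtomicInteger(1);
    private volatile boolean deallocated = false;

    public int refCnt() {
        return refCnt.get();
    }

    // Adding an owner: bump the count. Retaining an already-released
    // object is a usage error.
    public RefCounted retain() {
        if (refCnt.getAndIncrement() <= 0) {
            throw new IllegalStateException("retain on released object");
        }
        return this;
    }

    /** Returns true if this call deallocated the object. */
    public boolean release() {
        int remaining = refCnt.decrementAndGet();
        if (remaining == 0) {
            deallocated = true; // a real buffer would free its memory here
            return true;
        }
        if (remaining < 0) {
            throw new IllegalStateException("too many releases");
        }
        return false;
    }

    public boolean isDeallocated() {
        return deallocated;
    }

    public static void main(String[] args) {
        RefCounted buf = new RefCounted();   // refCnt == 1
        buf.retain();                        // second owner, refCnt == 2
        System.out.println(buf.release());   // false: refCnt drops to 1
        System.out.println(buf.release());   // true: refCnt hits 0, freed
        System.out.println(buf.isDeallocated());
    }
}
```

When a counted object is collected with a nonzero count, Netty's detector (at a sufficient detection level) reports it with an error-level log line starting with `LEAK:`, which is what a CI check would have to grep for to fail the build.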