[ https://issues.apache.org/jira/browse/DISPATCH-1256 ]
Ken Giusti resolved DISPATCH-1256.
----------------------------------
    Fix Version/s: 1.9.0
       Resolution: Fixed

We've identified that the bulk of the inter-router bottleneck is heavy processing of Message Annotations (MA) on the inter-router links. Release 1.9.0 contains a number of optimizations that improve inter-router forwarding; however, further improvements to the MA processing code path can still be made. That work is tracked in DISPATCH-1389.

> Throughput gets worse as more routers are added
> -----------------------------------------------
>
>                 Key: DISPATCH-1256
>                 URL: https://issues.apache.org/jira/browse/DISPATCH-1256
>             Project: Qpid Dispatch
>          Issue Type: Bug
>            Reporter: Alan Conway
>            Assignee: Ted Ross
>            Priority: Major
>             Fix For: 1.9.0
>
>         Attachments: reproducer.tgz
>
>
> Creating a chain of routers gives worse throughput as the chain gets longer.
> Reproducer:
> Install quiver from here: https://github.com/alanconway/quiver (extended with --send-url)
> Run the attached script like this:
>     qdr-chain.sh quiver --arrow c --count 1000 --send-url //:10001/testme //:10004/testme
> This sets up a chain of 4 routers, each connected to the previous one, and waits for qdstat -nv at router 1 to show 4 routers in the topology. It then receives messages from address 'testme' at router 4 while sending to address 'testme' at router 1.
> This runs the routers with enough tracing to run the dispatch scraper tool, which may affect the results. However, 1-million-message runs without tracing show a drop-off in throughput going from 1m0 (no router) up to 1m4 (4-router chain) - the more routers, the worse it gets. This seems consistent with the credit delays seen in the traced case.
> 1m0/out.txt:End-to-end rate ..................................... 296,824 messages/s
> 1m0/out.txt: 50% ........ 2 ms   99.90% ........ 6 ms
> 1m1/out.txt:End-to-end rate ..................................... 133,851 messages/s
> 1m1/out.txt: 50% ........ 1 ms   99.90% ........ 3 ms
> 1m2/out.txt:End-to-end rate ...................................... 59,011 messages/s
> 1m2/out.txt: 50% ........ 2 ms   99.90% ........ 8 ms
> 1m3/out.txt:End-to-end rate ...................................... 42,146 messages/s
> 1m3/out.txt: 50% ........ 3 ms   99.90% ....... 11 ms
> 1m4/out.txt:End-to-end rate ...................................... 32,415 messages/s
> 1m4/out.txt: 50% ........ 4 ms   99.90% ....... 13 ms
> The output (config, logs and scrapes) from a 1000-message, 4-router tracing run is included, as is the script to generate new data. Runs of 10k messages are feasible, but the HTML gets large and slow to view. In my experiments the results are similar.
> From file:scrape.html#c_connections, the sender connection is starved of credit far more than any other connection: 20x more starvation instances, 40x longer total starvation, and a 100x longer wait for initial credit.
> Overall, the flow of credit is slower than the flow of messages, and each additional router reduces the throughput of credit, which in turn reduces message throughput.
> Observations:
> - The initial credit delay is excessive (100ms in this case, when worst-case message latency is 33ms); something is wrong there.
> - Inter-router link credit is small (250) compared to application credit (1000). Giving full credit to a sender creates a credit-sharing problem, but we could give full credit to the inter-router links (see the receiver-side sketch after this list).
> - Re-evaluate our credit-sharing strategy to perform better in the common unshared case.
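> To make the credit numbers concrete, here is a minimal receiver-side sketch using the Qpid Proton Python API, showing how an application grants and replenishes the kind of 1000-credit window seen in the traces. The URL and address are assumptions taken from this reproducer; this is an illustration, not quiver's actual implementation:
>
>     from proton.handlers import MessagingHandler
>     from proton.reactor import Container
>
>     class CreditReceiver(MessagingHandler):
>         def __init__(self, url, window):
>             # prefetch=0 disables the handler's automatic credit so the
>             # window can be granted explicitly
>             super().__init__(prefetch=0)
>             self.url = url
>             self.window = window
>
>         def on_start(self, event):
>             event.container.create_receiver(self.url)
>
>         def on_link_opened(self, event):
>             # grant the full application window up front; this emits a
>             # 'flow' frame like the "(0,1000)" in the extracts below
>             event.receiver.flow(self.window)
>
>         def on_message(self, event):
>             # replenish one credit per delivery; each replenishment is
>             # another 'flow' frame travelling back toward the sender
>             event.receiver.flow(1)
>
>     Container(CreditReceiver("localhost:10004/testme", 1000)).run()
>
> Whatever window the application grants here, the sender behind the router chain still only sees the 250-credit inter-router window noted in the observations above.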
> Extracts from scrape:
> ◊ 2019-01-23 13:55:16.524477 D0#141 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (0,1000)
> ◊ 2019-01-23 13:55:16.633700 A0#287 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (0,250)
> Initial delay 109ms, much worse than the worst-case message latency of 33ms.
> ◊ 2019-01-23 13:55:16.639758 D0#153 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (3,1000)
> ◊ 2019-01-23 13:55:16.642682 A0#543 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (168,85)
> ◊ 2019-01-23 13:55:16.645205 D0#163 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (5,1000)
> ◊ 2019-01-23 13:55:16.645448 D0#167 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (6,1000)
> ◊ 2019-01-23 13:55:16.646510 A0#713 1 [A0_6] <- [] quiver-sender-3db6ad3f transfer [0,0] (252) transfer_281 - 158 bytes no more credit (accepted settled 0.028323 S)
> ◊ 2019-01-23 13:55:16.647142 D0#193 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (7,1000)
> ◊ 2019-01-23 13:55:16.647198 D0#195 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (10,1000)
> ◊ 2019-01-23 13:55:16.647915 D0#214 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (16,1000)
> ◊ 2019-01-23 13:55:16.648060 D0#217 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (25,1000)
> ◊ 2019-01-23 13:55:16.651294 D0#239 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (28,1000)
> ◊ 2019-01-23 13:55:16.652235 D0#257 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (41,1000)
> ◊ 2019-01-23 13:55:16.654294 A0#802 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (253,9) credit restored
> Here we see that the receiver has given 9 credit responses in the same time that the front router gave only 2. Either the router is slow to forward messages, or it is not giving back credit promptly when it does (or both). See flow.txt for the complete history of flow events at the sender and receiver, and scrape.html for the complete history of everything.
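> As a sanity check on the 109ms figure above, the timestamps of the first receiver flow and the first sender flow can be differenced directly (a throwaway snippet, not part of the reproducer):
>
>     from datetime import datetime
>
>     FMT = "%H:%M:%S.%f"
>     # receiver grants 1000 credits vs. sender finally receives its 250
>     t_receiver_flow = datetime.strptime("13:55:16.524477", FMT)
>     t_sender_flow = datetime.strptime("13:55:16.633700", FMT)
>
>     delay_ms = (t_sender_flow - t_receiver_flow).total_seconds() * 1000
>     print("initial credit delay: %.0f ms" % delay_ms)  # prints 109 ms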