Alan Conway created DISPATCH-1256:
-------------------------------------

             Summary: Throughput gets worse as more routers are added
                 Key: DISPATCH-1256
                 URL: https://issues.apache.org/jira/browse/DISPATCH-1256
             Project: Qpid Dispatch
          Issue Type: Bug
            Reporter: Alan Conway


Creating a chain of routers gives worse throughput as the chain gets longer.

Reproducer:

Install quiver from https://github.com/alanconway/quiver (extended with a
--send-url option).

Run the attached script like this:

qdr-chain.sh quiver --arrow c --count 1000 --send-url //:10001/testme //:10004/testme

The script sets up a chain of 4 routers, each connected to the previous one,
and waits until "qdstat -nv" at router 1 shows all 4 routers in the topology.
It then receives messages from address 'testme' at router 4 and sends to
address 'testme' at router 1.
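
For orientation, here is a minimal sketch of the kind of per-router
configuration such a chain implies (shown for router 2 of the chain). The
ports and id are illustrative guesses, not copied from the attached script:

    # qdrouterd.conf for router 2 of the chain (hypothetical layout)
    router {
        mode: interior
        id: R2
    }
    listener {          # client traffic (quiver attaches here)
        host: 0.0.0.0
        port: 10002
        role: normal
    }
    listener {          # accepts the connection from router 3
        host: 0.0.0.0
        port: 20002
        role: inter-router
    }
    connector {         # connects back to router 1
        host: localhost
        port: 20001
        role: inter-router
    }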

This runs the routers with enough tracing to run the dispatch scraper tool,
which may affect the results. However, 1-million-message runs without tracing
show the same drop-off in throughput going from 1m0 (no routers) up to 1m4 (a
4-router chain): the more routers, the worse it gets. This is consistent with
the credit delays seen in the traced case.

Run (out.txt)   End-to-end rate       50% latency   99.90% latency
1m0             296,824 messages/s    2 ms           6 ms
1m1             133,851 messages/s    1 ms           3 ms
1m2              59,011 messages/s    2 ms           8 ms
1m3              42,146 messages/s    3 ms          11 ms
1m4              32,415 messages/s    4 ms          13 ms

The output (config, logs and scrapes) from a 1000-message, 4-router tracing
run is included, as well as the script to generate new data. Runs of 10k
messages are feasible, but the HTML gets large and slow to view; in my
experiments the results are similar.

From file:scrape.html#c_connections, the sender connection is starved of
credit far more than any other connection: 20x more starvation instances, 40x
longer total starvation, and 100x longer wait for initial credit.

Overall, credit flows more slowly than messages, and each additional router
further reduces credit throughput, which in turn caps message throughput.
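
As a back-of-envelope illustration (a toy model with assumed numbers, not a
measurement): a window-limited sender cannot exceed window/round-trip
messages per second, and every hop the credit must cross back over lengthens
that round trip:

    # Toy model of window-limited throughput. WINDOW matches the
    # inter-router credit seen in the scrape; the delays are made-up
    # numbers chosen only to show the shape of the curve.
    WINDOW = 250        # inter-router link credit
    BASE_RTT = 0.8e-3   # seconds, sender<->receiver with no routers (assumed)
    HOP_DELAY = 0.5e-3  # seconds added per router hop (assumed)

    for hops in range(5):
        rtt = BASE_RTT + 2 * hops * HOP_DELAY  # credit crosses every hop twice
        print(f"{hops} routers: <= {WINDOW / rtt:,.0f} messages/s")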

Observations:

- The initial credit delay is excessive (100ms in this case, when worst-case
  message latency is 33ms); something is wrong there (see the sketch after
  this list for one way to measure it).
- Inter-router link credit is small (250) compared to application credit
  (1000). Giving full credit to a sender creates a credit-sharing problem,
  but we could give full credit to the inter-router links.
- Re-evaluate our credit-sharing strategy to perform better in the common
  unshared case.
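
One way to check the initial-credit delay without the scraper is a minimal
qpid-proton Python sender that timestamps the gap between attaching a link
and the first on_sendable (the first grant of credit). The URL and address
below are placeholders matching the reproducer above:

    #!/usr/bin/env python3
    # Sketch: measure how long a newly attached sender waits for credit.
    import time
    from proton.handlers import MessagingHandler
    from proton.reactor import Container

    class CreditTimer(MessagingHandler):
        def __init__(self, url, address):
            super().__init__()
            self.url, self.address = url, address
            self.t_attach = None

        def on_start(self, event):
            conn = event.container.connect(self.url)
            event.container.create_sender(conn, self.address)
            self.t_attach = time.monotonic()

        def on_sendable(self, event):
            # Fires the first time the router grants this link credit.
            delay = time.monotonic() - self.t_attach
            print(f"initial credit after {delay * 1000:.1f} ms "
                  f"(credit={event.sender.credit})")
            event.connection.close()

    Container(CreditTimer("localhost:10001", "testme")).run()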

Extracts from the scrape (the numbers in parentheses on each flow line are
the frame's delivery-count and link-credit):

◊ 2019-01-23 13:55:16.524477 D0#141 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (0,1000)
◊ 2019-01-23 13:55:16.633700 A0#287 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (0,250)

Initial delay 109ms, much worse than the worst-case message latency of 33ms.

◊ 2019-01-23 13:55:16.639758 D0#153 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (3,1000)
◊ 2019-01-23 13:55:16.642682 A0#543 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (168,85)
◊ 2019-01-23 13:55:16.645205 D0#163 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (5,1000)
◊ 2019-01-23 13:55:16.645448 D0#167 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (6,1000)
◊ 2019-01-23 13:55:16.646510 A0#713 1 [A0_6] <- [] quiver-sender-3db6ad3f transfer [0,0] (252) transfer_281 - 158 bytes no more credit (accepted settled 0.028323 S)
◊ 2019-01-23 13:55:16.647142 D0#193 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (7,1000)
◊ 2019-01-23 13:55:16.647198 D0#195 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (10,1000)
◊ 2019-01-23 13:55:16.647915 D0#214 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (16,1000)
◊ 2019-01-23 13:55:16.648060 D0#217 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (25,1000)
◊ 2019-01-23 13:55:16.651294 D0#239 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (28,1000)
◊ 2019-01-23 13:55:16.652235 D0#257 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (41,1000)
◊ 2019-01-23 13:55:16.654294 A0#802 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (253,9) credit restored

Here the receiver issued 9 credit updates in the time the front router gave
the sender only 2. Either the router is slow to forward messages, or it is
slow to return credit when it does (or both). See flow.txt for the complete
history of flow events at the sender and receiver, and scrape.html for the
complete history of everything.


