[ https://issues.apache.org/jira/browse/DISPATCH-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796428#comment-16796428 ]

Jiri Daněk commented on DISPATCH-1256:
--------------------------------------

This throughput degradation has apparently been present since at least the QDR 
1.0.0 release. See the following quote from the thesis PERFORMANCE TESTING AND 
ANALYSIS OF QPID DISPATCH ROUTER (page 54), which used that version of 
qdrouterd: 
https://www.vutbr.cz/www_base/zav_prace_soubor_verejne.php?file_id=181688

{quote}
[...] the topology consisting only of the routers shows significant performance 
degradation. The throughput falls from the 90,000 messages per second to the 
approximately 23,000 messages per second. This degradation is caused by the 
interior flow-control mechanism, which should prevent the overload of the 
network. However, in this case study we can see that the performance 
degradation is too high and the mechanism used in the Qpid Dispatch should be 
reimplemented.
{quote}

> Throughput gets worse as more routers are added
> -----------------------------------------------
>
>                 Key: DISPATCH-1256
>                 URL: https://issues.apache.org/jira/browse/DISPATCH-1256
>             Project: Qpid Dispatch
>          Issue Type: Bug
>            Reporter: Alan Conway
>            Assignee: Ted Ross
>            Priority: Major
>         Attachments: reproducer.tgz
>
>
> Creating a chain of routers gives worse throughput as the chain gets longer.
> Reproducer:
> Install quiver from here: https://github.com/alanconway/quiver (extended 
> with --send-url).
> Run the attached script like this:
> qdr-chain.sh quiver --arrow c --count 1000 --send-url //:10001/testme //:10004/testme
> This sets up a chain of 4 routers, each connected to the previous one, and 
> waits for qdstat -nv at router 1 to show 4 routers in the topology. Messages 
> are received from address 'testme' at router 4 and sent to address 'testme' 
> at router 1.
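> Quiver already drives this, but for reference, here is a rough 
> python-qpid-proton equivalent of the two quiver arrows (a sketch only; the 
> ports and the 'testme' address come from the setup above, everything else 
> is assumed):
> {code:python}
> # Sketch: send COUNT messages into router 1, receive them at router 4,
> # letting router-granted credit pace the sender.
> from proton import Message
> from proton.handlers import MessagingHandler
> from proton.reactor import Container
> 
> COUNT = 1000
> 
> class ChainTest(MessagingHandler):
>     def __init__(self):
>         super(ChainTest, self).__init__()
>         self.sent = self.received = 0
> 
>     def on_start(self, event):
>         # Receiver attaches at the far end of the chain, sender at the near end.
>         self.receiver = event.container.create_receiver("localhost:10004/testme")
>         self.sender = event.container.create_sender("localhost:10001/testme")
> 
>     def on_sendable(self, event):
>         # Send only while the router has granted us credit.
>         while event.sender.credit and self.sent < COUNT:
>             event.sender.send(Message(body="message-%d" % self.sent))
>             self.sent += 1
> 
>     def on_message(self, event):
>         self.received += 1
>         if self.received == COUNT:
>             self.receiver.connection.close()
>             self.sender.connection.close()
> 
> Container(ChainTest()).run()
> {code}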
> This runs the routers with enough tracing to run the dispatch scraper tool, 
> which may affect the results. However, 1-million-message runs without 
> tracing show a drop-off in throughput going from 1m0 (no routers) up to 1m4 
> (the 4-router chain): the more routers, the worse it gets. This seems 
> consistent with the credit delays seen in the traced case.
> 1m0/out.txt:End-to-end rate ..................................... 296,824 messages/s
> 1m0/out.txt: 50% ........ 2 ms 99.90% ........ 6 ms
> 1m1/out.txt:End-to-end rate ..................................... 133,851 messages/s
> 1m1/out.txt: 50% ........ 1 ms 99.90% ........ 3 ms
> 1m2/out.txt:End-to-end rate ...................................... 59,011 messages/s
> 1m2/out.txt: 50% ........ 2 ms 99.90% ........ 8 ms
> 1m3/out.txt:End-to-end rate ...................................... 42,146 messages/s
> 1m3/out.txt: 50% ........ 3 ms 99.90% ....... 11 ms
> 1m4/out.txt:End-to-end rate ...................................... 32,415 messages/s
> 1m4/out.txt: 50% ........ 4 ms 99.90% ....... 13 ms
> The output (config, logs and scrapes) from a 1000-message, 4-router tracing 
> run is included, as well as the script to generate new data. Runs of 10k 
> messages are feasible, but the HTML gets large and slow to view; in my 
> experiments the results are similar.
> From file:scrape.html#c_connections, the sender connection is starved of 
> credit far more than any other: 20x more starvation instances, 40x longer 
> total starvation, and a 100x longer wait for initial credit.
> Overall, the flow of credit is slower than the flow of messages, and each 
> additional router reduces the throughput of credit, which in turn reduces 
> message throughput.
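> As a back-of-envelope model of why this compounds per hop (toy numbers, not 
> Dispatch internals): if the sender holds a credit window W and each 
> replenishment has to cross n hops and back, the sustained rate is bounded 
> by W divided by that round trip.
> {code:python}
> # Toy model: rate ceiling = credit window / credit round-trip time.
> # The delays below are illustrative guesses, not measurements.
> 
> def max_rate(window, hops, per_hop_delay=0.0002, base_delay=0.0005):
>     round_trip = base_delay + 2 * hops * per_hop_delay
>     return window / round_trip
> 
> for hops in range(5):  # 0 = direct connection, 4 = the 4-router chain
>     print("%d hops: <= %8.0f msg/s" % (hops, max_rate(250, hops)))
> {code}
> With the inter-router window of 250, the ceiling falls monotonically with 
> each added hop, the same shape as the measured rates above, even though the 
> real constants differ.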
> Observations:
> - The initial credit delay is excessive (100ms in this case, when worst-case 
> message latency is 33ms); something is wrong there.
> - Inter-router link credit is small (250) compared to application credit 
> (1000). Giving full credit to a sender creates a credit-sharing problem, but 
> we could give full credit to the inter-router links (see the receiver-side 
> sketch after this list).
> - Re-evaluate our credit-sharing strategy to perform better in the common 
> unshared case.
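> For illustration, this is roughly how the 1000-credit application window 
> seen in the flow frames below is issued on the receiver side with 
> python-qpid-proton (a sketch; the endpoint URL is assumed):
> {code:python}
> # Explicit receiver-side credit, as quiver's receiver does with a window
> # of 1000. prefetch=0 disables the handler's automatic credit.
> from proton.handlers import MessagingHandler
> from proton.reactor import Container
> 
> class Receiver(MessagingHandler):
>     def __init__(self, window=1000):
>         super(Receiver, self).__init__(prefetch=0)
>         self.window = window
> 
>     def on_start(self, event):
>         # Hypothetical endpoint; router 4 in the chain above.
>         event.container.create_receiver("localhost:10004/testme")
> 
>     def on_link_opened(self, event):
>         if event.receiver:
>             event.receiver.flow(self.window)  # the initial (0,1000) flow frame
> 
>     def on_message(self, event):
>         # Return one credit per delivery; each grant must trickle back
>         # through every router on the path before the sender can use it.
>         event.receiver.flow(1)
> 
> Container(Receiver()).run()  # runs until interrupted
> {code}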
> Extracts from scrape:
> ◊ 2019-01-23 13:55:16.524477 D0#141 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (0,1000)
> ◊ 2019-01-23 13:55:16.633700 A0#287 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (0,250)
> Initial delay 109ms, much worse than worst-case message latency of 33ms
> ◊ 2019-01-23 13:55:16.639758 D0#153 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (3,1000)
> ◊ 2019-01-23 13:55:16.642682 A0#543 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (168,85)
> ◊ 2019-01-23 13:55:16.645205 D0#163 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (5,1000)
> ◊ 2019-01-23 13:55:16.645448 D0#167 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (6,1000)
> ◊ 2019-01-23 13:55:16.646510 A0#713 1 [A0_6] <- [] quiver-sender-3db6ad3f transfer [0,0] (252) transfer_281 - 158 bytes no more credit (accepted settled 0.028323 S)
> ◊ 2019-01-23 13:55:16.647142 D0#193 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (7,1000)
> ◊ 2019-01-23 13:55:16.647198 D0#195 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (10,1000)
> ◊ 2019-01-23 13:55:16.647915 D0#214 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (16,1000)
> ◊ 2019-01-23 13:55:16.648060 D0#217 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (25,1000)
> ◊ 2019-01-23 13:55:16.651294 D0#239 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (28,1000)
> ◊ 2019-01-23 13:55:16.652235 D0#257 4 [D0_3] <- [] quiver-receiver-2ebb30e5 flow [0,0] (41,1000)
> ◊ 2019-01-23 13:55:16.654294 A0#802 1 [A0_6] -> [] quiver-sender-3db6ad3f flow [0,0] (253,9) credit restored
> Here we see that the receiver gave 9 credit responses in the time it took 
> the front router to give only 2. Either the router is slow to forward 
> messages, or it is not returning credit promptly when it does (or both). 
> See flow.txt for the complete history of flow events at the sender and 
> receiver, and scrape.html for the complete history of everything.
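> The 109ms initial delay is just the gap between the first two frames quoted 
> above:
> {code:python}
> # Initial-credit delay from the two scrape timestamps: the receiver granted
> # 1000 credits at 13:55:16.524477, but the sender saw its first (smaller)
> # grant only at 13:55:16.633700.
> from datetime import datetime
> 
> FMT = "%Y-%m-%d %H:%M:%S.%f"
> receiver_grant = datetime.strptime("2019-01-23 13:55:16.524477", FMT)
> sender_grant = datetime.strptime("2019-01-23 13:55:16.633700", FMT)
> print((sender_grant - receiver_grant).total_seconds() * 1000)  # ~109.2 ms
> {code}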


