[jira] [Comment Edited] (DISPATCH-957) Unbalanced memory consumption in a 2 routers configuration and specific workload
[ https://issues.apache.org/jira/browse/DISPATCH-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16443122#comment-16443122 ] Ken Giusti edited comment on DISPATCH-957 at 4/18/18 8:08 PM:
--
Thanks for the logs, they are quite revealing. I don't think this is a router bug; I think it's a client issue.

Taking a look at router1.log I see the following: periodically a client connects (every 30 seconds) but seems to clean up its links - I'm confident that's your status collector. Ignoring that, I see:

1) 19:59:23ish - Servers begin to connect. 4 links are created for each connection - this is expected for RPC servers.
2) 19:58:08ish - Servers done connecting.
3) 19:59:23ish - a "link storm" hits - a couple thousand links are established in less than 2 seconds.
4) 20:01:07 - link storm over. All links remain up for the rest of the log.

So here's what I think is happening: after the servers finish connecting, the test starts and the RPC clients begin making RPC calls. Each client has created a receive link for the RPC reply, and each client has its own unique reply-to address. Once a server receives and processes a request, it sends the reply to that reply-to address. This means the server creates a unique reply link for every client it has received a request from. That's where all the links are coming from.

IIRC, the oslo.messaging driver has a periodic task that expires these reply-to links after they've been idle for > 600 seconds. Once the test is over, would it be possible to leave the servers undisturbed for > 600 seconds, then get a qdstat -a from router1? I suspect the number of in-use qdr_link_t's will drop after these links are cleaned up.
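[editorial note] As an illustration of where these per-client reply links come from, here is a minimal python-qpid-proton sketch - not the oslo.messaging driver itself, and the host and "rpc/server" address are hypothetical. Each client opens a dynamic receiver, the router assigns it a unique source address to use as reply-to, and every server that answers that client ends up opening a sender link back to that address - hence the link fan-out described above.

{code:python}
from proton import Message
from proton.handlers import MessagingHandler
from proton.reactor import Container

class RpcClient(MessagingHandler):
    """One RPC client: a sender for requests plus a dynamic receiver
    whose router-assigned address becomes this client's unique reply-to."""

    def on_start(self, event):
        conn = event.container.connect("amqp://router0:5672")            # hypothetical host
        self.sender = event.container.create_sender(conn, "rpc/server")  # hypothetical address
        # dynamic=True asks the router to allocate a unique source address;
        # this is the per-client reply link discussed in the comment above.
        self.receiver = event.container.create_receiver(conn, None, dynamic=True)

    def on_link_opened(self, event):
        if event.receiver == self.receiver:
            reply_to = self.receiver.remote_source.address
            self.sender.send(Message(body="ping", reply_to=reply_to))

    def on_message(self, event):
        # The server's reply arrives over a link it created back to our
        # unique reply-to address - one such link per client it has served.
        print("reply:", event.message.body)
        event.connection.close()

Container(RpcClient()).run()
{code}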
[jira] [Comment Edited] (DISPATCH-957) Unbalanced memory consumption in a 2 routers configuration and specific workload
[ https://issues.apache.org/jira/browse/DISPATCH-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437863#comment-16437863 ] Matthieu Simonin edited comment on DISPATCH-957 at 4/13/18 8:53 PM:

In attachment, the log files of the two qdrouterd:
* router0 is where clients are connected
* router1 is where servers are connected

I ran (conf in attachment):
* {{oo deploy --driver=router g5k}}
* {{oo test_case_1 --nbr_clients=100 --nbr_servers=100 --nbr_calls=1000 --pause=0.1 --timeout 200}}

edit: I forgot to mention that I observed the same behaviour (increased memory consumption on router1)

> Unbalanced memory consumption in a 2 routers configuration and specific
> workload
>
> Key: DISPATCH-957
> URL: https://issues.apache.org/jira/browse/DISPATCH-957
> Project: Qpid Dispatch
> Issue Type: Bug
> Components: Router Node
> Affects Versions: 1.0.1
> Environment:
> * At the time I was experimenting, I built the router from source and used 22400df dockerized. It's available on Docker Hub: msimonin/qdrouterd:22400df or msimonin/qdrouterd-collectd:22400f
> * The ombt version used embeds the following libraries:
> oslo.messaging==5.35.0
> pyngus==2.2.2
> python-qpid-proton==0.19.0
> * I used ombt-orchestrator to deploy the whole stack using the g5k provider (see https://github.com/msimonin/ombt-orchestrator/). In a local machine setup, the vagrant provider can be used, but I'm not sure it is reasonable to scale to the above number of agents. I've nevertheless attached the configuration used.
> * Host Linux distribution is debian9
>
> Reporter: Matthieu Simonin
> Assignee: Ken Giusti
> Priority: Major
> Attachments: call.png, cast.png, conf.yaml, conf.yaml, inc-calls.png, mem_usage.tar.gz, router0.log, router1.log
>
> After discussion with Ken Giusti we deemed it appropriate to file a bug to track the following behavior.
> Note also that the exact version used in the following description isn't exactly 1.1.0 but one built from source 22400df (master back in February).
> I started two interconnected routers (router0 and router1):
> router0 is where all my consumers connect; router1 is where all my producers connect.
> The workload is an RPC test using the oslo.messaging library, using calls (resp. casts): clients keep sending messages and block for the response (resp. do not block).
> I've attached some observations:
> 1) With 100 consumers and 100 producers and calls, I observe a higher memory consumption on router0 compared to router1 (see call.png). Casts seem to affect the router memory less. Calls usually require more resources because of the return values flowing back to the producer, but I wouldn't expect this big a difference.
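[editorial note] For context, here is a minimal oslo.messaging sketch of the call vs. cast pattern this workload exercises. It is not the ombt test case itself; the transport URL, topic, and method names are hypothetical.

{code:python}
# Minimal sketch of the RPC call vs. cast pattern exercised by the test.
# Transport URL, topic, and method names are hypothetical placeholders.
from oslo_config import cfg
import oslo_messaging

transport = oslo_messaging.get_transport(
    cfg.CONF, url="amqp://router0.example:5672//")   # clients attach to router0
target = oslo_messaging.Target(topic="ombt_topic")
client = oslo_messaging.RPCClient(transport, target)

# call(): blocks until the server's reply comes back over the client's
# unique reply-to address - this is the traffic that fans out reply links.
result = client.call({}, "echo", payload="ping")

# cast(): fire-and-forget, no reply flows back to the client.
client.cast({}, "notify", payload="ping")
{code}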
> I've attached a tgz in which you'll find the results of qdstat -a, -m, -l:
> * before the benchmark (start)
> * early during the benchmark (during)
> * late during the benchmark (during-1)
> * after the benchmark completed (after)
>
> 2) I've run a second test, incrementally increasing (#clients, #servers): [50, 100, 200, 500] (calls only), see inc-calls.png.
> In this case the difference in memory consumption between router0 and router1 is [50MB, 100MB, 300MB, 1.5GB].
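[editorial note] Following up on the suggestion to re-run qdstat after the idle reply links expire, here is a small hypothetical helper that samples the qdr_link_t allocator row from {{qdstat -m}} over time. It assumes qdstat is on PATH and the router management endpoint is reachable at localhost:5672; it is one way to watch whether in-use link records drop once the > 600 second idle expiry kicks in.

{code:python}
# Hypothetical helper: sample the qdr_link_t allocator row from `qdstat -m`
# every few minutes to see whether in-use link records drop once idle
# reply-to links are reaped (> 600 s of inactivity).
import subprocess
import time

def qdr_link_t_row(host="localhost", port="5672"):
    out = subprocess.run(
        ["qdstat", "-m", "-b", "%s:%s" % (host, port)],
        capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.strip().startswith("qdr_link_t"):
            return line.strip()
    return "qdr_link_t row not found"

if __name__ == "__main__":
    for _ in range(5):
        print(time.strftime("%H:%M:%S"), qdr_link_t_row())
        time.sleep(300)  # sample well past the 600 s idle-link expiry
{code}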