I've prototyped a fix for this. It cut the VmDataCommand from ~7 seconds to ~300ms when restarting a single VM. For rebooting a router with multiple VMs connected, the savings should be significant. The prototype just dumps the data sent to vmdata into a file as JSON, copies that up to the router, and processes it there.
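
For illustration, here is a minimal Python sketch of that flow. It is not the actual prototype. The entry layout (folder, file name, contents), the function names `dump_vm_data` and `apply_vm_data`, and the directory layout on the router are all assumptions made up for the example. The single copy to the router is left out, since it is just one scp of the JSON file.

```python
import json
import os
import tempfile


def dump_vm_data(vm_ip, entries, path):
    """Host side: write all vm data for one VM into a single JSON file.

    `entries` is assumed to be a list of (folder, file name, contents)
    tuples; the real VmDataCommand payload may be shaped differently.
    """
    payload = {"ip": vm_ip, "entries": [list(e) for e in entries]}
    with open(path, "w") as f:
        json.dump(payload, f)


def apply_vm_data(path, base_dir):
    """Router side: read the JSON file and write every entry locally.

    This replaces the per-file ssh/scp round trips with local file
    writes. The <base_dir>/<folder>/<ip>/<file> layout is hypothetical.
    """
    with open(path) as f:
        payload = json.load(f)
    written = []
    for folder, name, contents in payload["entries"]:
        target_dir = os.path.join(base_dir, folder, payload["ip"])
        os.makedirs(target_dir, exist_ok=True)
        target = os.path.join(target_dir, name)
        with open(target, "w") as out:
            out.write(contents or "")  # treat missing contents as empty
        written.append(target)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        json_path = os.path.join(tmp, "vmdata.json")
        sample = [
            ("userdata", "user-data", "hello"),
            ("metadata", "instance-id", "i-2-3-VM"),
        ]
        dump_vm_data("10.1.1.10", sample, json_path)
        # In a real setup the JSON file would be copied to the router here.
        for written_path in apply_vm_data(json_path, os.path.join(tmp, "www")):
            print(written_path)
```

With this shape, the number of connections to the router no longer depends on how many data files a VM has: one copy for the JSON file and one call to process it.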
On Thu, Jul 18, 2013 at 3:04 PM, Marcus Sorensen (JIRA) <j...@apache.org> wrote:
>
> [ https://issues.apache.org/jira/browse/CLOUDSTACK-3163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712845#comment-13712845 ]
>
> Marcus Sorensen commented on CLOUDSTACK-3163:
> ---------------------------------------------
>
> ... and each vmdata.sh calls ssh and/or scp several times. Off the top
> of my head, it seems like we could serialize that cmd.getVmData()
> output to maybe JSON or something, get it up on the router in one
> call, and then process it there in a python script.
>
> On Thu, Jul 18, 2013 at 7:08 AM, Wido den Hollander (JIRA)
>
>> KVM Virtual Router startup time is painfully long
>> -------------------------------------------------
>>
>> Key: CLOUDSTACK-3163
>> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3163
>> Project: CloudStack
>> Issue Type: Bug
>> Security Level: Public (Anyone can view this level - this is the default.)
>> Components: KVM
>> Affects Versions: pre-4.0.0
>> Environment: CloudPlatform 3.0.3, but I don't see any changes to the
>> relevant code (I think) on master
>> Reporter: Andrew Bayer
>> Priority: Critical
>>
>> When you've got a couple thousand instances, spread across 10 or so pods,
>> virtual router startup time is near crippling - actually, if you don't
>> enable the option to have virtual routers only populated with instances in
>> their pod, it *is* crippling, in that the virtual routers don't finish
>> starting before the management server decides they've timed out and tries to
>> start a new one.
>> This seems to be the result of a few painful inefficiencies:
>> - The same codepath is followed whether you're adding a new instance to an
>> already running VR, or adding two hundred already running instances to a new
>> VR. So each ssh/scp/sed/cp/chmod/etc command is replicated for each
>> instance, rather than finding efficiencies by doing things across the whole
>> set of instances.
>> - But what really eats up the time is the population of vm data - for each
>> piece of vm data (which, from a rough look at the code, seems to be
>> something like 10 or 11 data files), there are something like 7 ssh calls
>> and an scp call. So that means that per instance, we have somewhere around
>> 80 to 90 ssh/scp calls, plus the single ssh call for dhcp_entry.sh. So with
>> 200 instances, that's 1600 to 1800 ssh/scp calls on a single VR, with all
>> the overhead entailed in opening that many ssh connections, starting bash,
>> etc, etc... Given that in my experience, a VR with ~200 instances takes ~90
>> minutes to start up (I may be misremembering slightly - it could be ~200
>> instances takes closer to 60 minutes, and ~300 takes closer to 90), that
>> works out to 3 seconds or so per ssh/scp, which doesn't seem implausible to
>> me.
>> So, this shouldn't be this way. At a minimum, there's no reason not to
>> offload the whole process from a script run on the host making repeated ssh
>> calls to the VR to a script on the VR that gets called from the host, albeit
>> possibly a temporary one that's generated on the fly and copied over to the
>> VR. That alone would probably save most of the VR startup time, just by
>> dropping the number of ssh/scp connections per instance from 80-90 to 3
>> (dhcp_entry.sh call, scp of temporary script, execution of temporary script).