I've prototyped a fix for this, and it cut VmDataCommand from ~7
seconds to ~300 ms when restarting one VM. For rebooting a router
with multiple VMs connected, the savings should be significant. I'm just
dumping the data sent to vmdata into a file as JSON, copying that up
to the router, and processing it there.
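
To make the idea concrete, here is a rough Python sketch of the two halves. It is not the actual prototype: the function names, the JSON layout (a VM IP plus a list of folder/file/contents triples), and the output directory layout are all assumptions for illustration. The copy step between the two halves (a single scp) is left out.

```python
import json
import os


def dump_vm_data(vm_ip, entries, path):
    """Host side (hypothetical): write every vmdata entry for one VM
    into a single JSON file, so only one copy to the router is needed."""
    payload = {"vm_ip": vm_ip, "entries": [list(e) for e in entries]}
    with open(path, "w") as f:
        json.dump(payload, f)


def apply_vm_data(path, base_dir):
    """Router side (hypothetical): read the JSON file and write each
    entry locally, instead of one ssh/scp round trip per entry."""
    with open(path) as f:
        payload = json.load(f)
    written = []
    for folder, name, contents in payload["entries"]:
        # Assumed layout: <base_dir>/<folder>/<vm_ip>/<file>
        target_dir = os.path.join(base_dir, folder, payload["vm_ip"])
        os.makedirs(target_dir, exist_ok=True)
        target = os.path.join(target_dir, name)
        with open(target, "w") as out:
            # A missing value is serialized as null; write an empty file.
            out.write(contents or "")
        written.append(target)
    return written
```

The point of the design is that the per-entry work happens in one local process on the router, so the number of ssh/scp connections no longer grows with the number of data files.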

On Thu, Jul 18, 2013 at 3:04 PM, Marcus Sorensen (JIRA) <j...@apache.org> wrote:
>
>     [ 
> https://issues.apache.org/jira/browse/CLOUDSTACK-3163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712845#comment-13712845
>  ]
>
> Marcus Sorensen commented on CLOUDSTACK-3163:
> ---------------------------------------------
>
> ... and each vmdata.sh calls ssh and/or scp several times. Off the top
> of my head, it seems like we could serialize that cmd.getVmData()
> output to maybe JSON or something, get it up on the router in one
> call, and then process it there in a python script.
>
> On Thu, Jul 18, 2013 at 7:08 AM, Wido den Hollander (JIRA)
>
>
>> KVM Virtual Router startup time is painfully long
>> -------------------------------------------------
>>
>>                 Key: CLOUDSTACK-3163
>>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3163
>>             Project: CloudStack
>>          Issue Type: Bug
>>      Security Level: Public(Anyone can view this level - this is the 
>> default.)
>>          Components: KVM
>>    Affects Versions: pre-4.0.0
>>         Environment: CloudPlatform 3.0.3, but I don't see any changes to the 
>> relevant code (I think) on master
>>            Reporter: Andrew Bayer
>>            Priority: Critical
>>
>> When you've got a couple thousand instances, spread across 10 or so pods, 
>> virtual router startup time is near crippling - actually, if you don't 
>> enable the option to have virtual routers only populated with instances in 
>> their pod, it *is* crippling, in that the virtual routers don't finish 
>> starting before the management server decides they've timed out and tries to 
>> start a new one.
>> This seems to be the result of a few painful inefficiencies:
>> - The same codepath is followed whether you're adding a new instance to an 
>> already running VR, or adding two hundred already running instances to a new 
>> VR. So each ssh/scp/sed/cp/chmod/etc command is replicated for each 
>> instance, rather than finding efficiencies by doing things across the whole 
>> set of instances.
>> - But what really eats up the time is the population of vm data - for each 
>> piece of vm data (which, from a rough look at the code, seems to be 
>> something like 10 or 11 data files), there are something like 7 ssh calls 
>> and an scp call. So that means that per instance, we have somewhere around 
>> 80 to 90 ssh/scp calls, plus the single ssh call for dhcp_entry.sh. So with 
>> 200 instances, that's 16,000 to 18,000 ssh/scp calls on a single VR, with all 
>> the overhead entailed in opening that many ssh connections, starting bash, 
>> etc, etc... Given that in my experience, a VR with ~200 instances takes ~90 
>> minutes to start up (I may be misremembering slightly - it could be ~200 
>> instances takes closer to 60 minutes, and ~300 takes closer to 90), that 
>> works out to 0.3 seconds or so per ssh/scp, which doesn't seem implausible to 
>> me.
>> So, this shouldn't be this way. At a minimum, there's no reason not to 
>> offload the whole process from a script run on the host making repeated ssh 
>> calls to the VR to a script on the VR that gets called from the host, albeit 
>> possibly a temporary one that's generated on the fly and copied over to the 
>> VR. That alone would probably save most of the VR startup time, just by 
>> dropping the number of ssh/scp connections per instance from 80-90 to 3 
>> (dhcp_entry.sh call, scp of temporary script, execution of temporary script).
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
