Aha!  I restarted cloudstack-agent, which caused the virtual router to
change to a "stopped" status in the management console.  However, the
console viewer icon was still visible, so I clicked it.  The router had run
out of memory and caused a kernel panic.  I created a new system service
offering with 500 MB of memory, changed the router's service offering, and
started it.  It booted with no problem.  The default memory size of 128 MB
is not enough for the virtual router.  This is the system VM template I was
using:

http://cloudstack.apt-get.eu/systemvm/4.4/systemvm64template-4.4.0-6-kvm.qcow2.bz2
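
For anyone who would rather script the fix than click through the UI, here is
a rough sketch of the two API requests involved.  I did this through the web
console, so treat everything below as an assumption: the command and parameter
names (createServiceOffering, changeServiceForRouter, issystem, systemvmtype)
are as I understand the CloudStack API and should be checked against the API
reference for your version.  The helper names, the offering name, and the CPU
values are made up for illustration.  The code only builds the request
parameters; it does not sign or send anything.

```python
from urllib.parse import urlencode


def router_offering_params(name, memory_mb, cpu_number=1, cpu_speed_mhz=500):
    """Build parameters for a system service offering aimed at virtual routers.

    The CPU defaults are placeholders, not values taken from this thread.
    """
    return {
        "command": "createServiceOffering",
        "name": name,
        "displaytext": name,
        "cpunumber": cpu_number,
        "cpuspeed": cpu_speed_mhz,
        "memory": memory_mb,             # in MB; 500 is what worked for me
        "issystem": "true",              # system offering, not a user offering
        "systemvmtype": "domainrouter",  # applies to virtual routers
    }


def change_router_offering_params(router_id, offering_id):
    """Build parameters for moving an existing router to another offering."""
    return {
        "command": "changeServiceForRouter",
        "id": router_id,
        "serviceofferingid": offering_id,
    }


def to_query_string(params):
    """Encode parameters in sorted key order (unsigned, for inspection only)."""
    return urlencode(sorted(params.items()))
```

Calling to_query_string(router_offering_params("router-500mb", 500)) gives the
unsigned query string for the first request.  The router has to be stopped
before its offering is changed, then started again, which is the order I
followed in the UI.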

On Fri, Oct 10, 2014 at 7:28 PM, Ian Young <iyo...@ratespecial.com> wrote:

> I dropped all the cloud* databases, deleted everything in primary and
> secondary storage, and reinstalled the management server, following the
> guide I wrote for myself the last time I built a stable CloudStack system.
> Then I imported one of my backed up instances as a template and tried to
> create a new VM.  Same problem as before.  How is this possible?
>
> 2014-10-10 19:17:44,075 WARN  [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-3:null) Timed out:
> /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/patchviasocket.pl
> -n r-4-VM -p
> %template=domP%name=r-4-VM%eth0ip=192.168.102.222%eth0mask=255.255.255.0%gateway=192.168.102.1%domain=
> lax.ratespecial.com%cidrsize=24%dhcprange=192.168.102.1%eth1ip=169.254.0.33%eth1mask=255.255.0.0%type=dhcpsrvr%disable_rp_filter=true%dns1=192.168.100.2%dns2=192.168.100.3
> .  Output is:
> 2014-10-10 19:18:05,078 WARN  [kvm.resource.LibvirtComputingResource]
> (Script-3:null) Interrupting script.
>
> On Fri, Oct 10, 2014 at 4:33 PM, Ian Young <iyo...@ratespecial.com> wrote:
>
>> I've restarted all the services and restarted the servers too.  The SSVM
>> and CP start with no trouble.  Every time I try to start or create an
>> instance, I see repeated messages like these:
>>
>> /var/log/cloudstack/agent/cloudstack-agent.out:
>> 2014-10-10 16:27:21,841{GMT} WARN
>>  [kvm.resource.LibvirtComputingResource] (Script-8:) Interrupting script.
>> 2014-10-10 16:27:21,841{GMT} WARN
>>  [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:) Timed
>> out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/
>> patchviasocket.pl -n r-19-VM -p
>> %template=domP%name=r-19-VM%eth0ip=192.168.102.89%eth0mask=255.255.255.0%gateway=192.168.102.1%domain=
>> lax.ratespecial.com%cidrsize=24%dhcprange=192.168.102.1%eth1ip=169.254.2.193%eth1mask=255.255.0.0%type=dhcpsrvr%disable_rp_filter=true%dns1=192.168.100.2%dns2=192.168.100.3
>> .  Output is:
>>
>> /var/log/cloudstack/agent/security_group.log:
>> 2014-10-10 16:27:33,259 - Failed to get rule logs, better luck next time!
>>
>> On Fri, Oct 10, 2014 at 3:04 PM, Ian Young <iyo...@ratespecial.com>
>> wrote:
>>
>>> I tried to restart the network with the "clean up" option, via the web
>>> console.  After several minutes, it failed to restart the network.  The
>>> SSVM and CP are still running but the VR no longer exists.  Why would these
>>> be able to start but not the virtual router?
>>>
>>> On Fri, Oct 10, 2014 at 2:48 PM, Ian Young <iyo...@ratespecial.com>
>>> wrote:
>>>
>>>> I restarted the libvirtd service and the management service is now
>>>> fully started (there are services listening on ports 8250 and 9090).  The
>>>> SSVM health check script now reports no problems.
>>>>
>>>> However, I tried starting an instance, and both the instance and the
>>>> virtual router have now been stuck in a "starting" state for almost 10
>>>> minutes.  In the catalina.out log I see:
>>>> INFO  [c.c.v.VirtualMachineManagerImpl] (AgentManager-Handler-10:null)
>>>> There is pending job or HA tasks working on the VM. vm id: 4, postpone
>>>> power-change report by resetting power-change counters
>>>> INFO  [c.c.v.VirtualMachineManagerImpl] (AgentManager-Handler-10:null)
>>>> There is pending job or HA tasks working on the VM. vm id: 13, postpone
>>>> power-change report by resetting power-change counters
>>>>
>>>> I'm also seeing this in the agent.log:
>>>> 2014-10-10 14:43:26,833 WARN  [kvm.resource.LibvirtComputingResource]
>>>> (Script-6:null) Interrupting script.
>>>> 2014-10-10 14:43:26,833 WARN  [kvm.resource.LibvirtComputingResource]
>>>> (agentRequest-Handler-2:null) Timed out:
>>>> /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/
>>>> patchviasocket.pl -n r-4-VM -p
>>>> %template=domP%name=r-4-VM%eth0ip=192.168.102.110%eth0mask=255.255.255.0%gateway=192.168.102.1%domain=
>>>> lax.ratespecial.com%cidrsize=24%dhcprange=192.168.102.1%eth1ip=169.254.2.181%eth1mask=255.255.0.0%type=dhcpsrvr%disable_rp_filter=true%dns1=192.168.100.2%dns2=192.168.100.3
>>>> .  Output is:
>>>>
>>>> And in the security_group.log:
>>>> 2014-10-10 14:42:41,926 - Failed to get rule logs, better luck next
>>>> time!
>>>> 2014-10-10 14:43:41,926 - Failed to get rule logs, better luck next
>>>> time!
>>>>
>>>> What does this mean?
>>>>
>>>> On Fri, Oct 10, 2014 at 2:11 PM, Ian Young <iyo...@ratespecial.com>
>>>> wrote:
>>>>
>>>>> This morning I was unable to start new instances.  I discovered that I
>>>>> could SSH into the SSVM and the console proxy but not the virtual router.
>>>>> Something strange was happening so I thought it might be a good time to
>>>>> gracefully stop all the instances and reboot the hypervisor to see if the
>>>>> VR would start working again.  I also rebooted the management server (a
>>>>> separate machine) to have a clean slate.  Now that they've both been
>>>>> rebooted, the following symptoms exist:
>>>>>
>>>>> * On the management server, there are no services listening on 9090 or
>>>>> 8250.
>>>>> * When I run the SSVM health check script, it says NFS is not
>>>>> currently mounted.
>>>>> * The management server log is reporting that Zone 1 is not ready to
>>>>> launch SSVM/CP yet, even though both of those are running.
>>>>>
>>>>> The NFS server is running just fine.  I can mount it in the management
>>>>> server with no problems.  I've restarted cloudstack-management and
>>>>> cloudstack-agent but the problems persist.  The "not ready to launch
>>>>> SSVM/CP yet" message sounds like the management server and the hypervisor
>>>>> are not communicating or some information about the system state is out of
>>>>> sync.  How can I confirm this?
>>>>>
>>>>
>>>>
>>>
>>
>
