Hello,
I just want to share my recent experience with KVM backed by RBD. Ceph
appears not to be at fault, but I'm posting it here because my
configuration is one that other Ceph users may well be running.
Over the last three weeks I was battling an elusive issue: a KVM guest
backed by RBD intermittently lost network connectivity under consistent
(but relatively low) load. It was driving me nuts, as nothing appeared
in the logs and everything was seemingly OK, with one exception: pings
to a nearby host sometimes came back with a "No buffer space available"
error, and a number of pings could be delayed by 20-30 seconds
(obviously such a delay causes a lot of timeouts). The VM has two virtio
network interfaces, with the vhost_net module loaded on the host; one
interface is publicly reachable, the second is connected to a private
network. I tried switching virtio to e1000 and increasing the network
buffers - all in vain.
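(For reference, by "increasing the network buffers" I mean tuning along
these lines - purely illustrative, the interface name and values are
just placeholders:)

    lsmod | grep vhost_net                  # confirm vhost_net is loaded on the host
    ip link set eth0 txqueuelen 10000       # lengthen the interface transmit queue
    sysctl -w net.core.wmem_max=16777216    # raise the maximum socket send buffer
    sysctl -w net.core.wmem_default=262144  # raise the default socket send buffer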
I also noticed that when a hold-up happened on an interface, the delayed
pings piled up on the interface, were then suddenly sent through all at
once, and the remote host returned all of them pretty much
simultaneously. Another observation: the behaviour was clearly evident
whenever there was reasonable network/disk activity, e.g. during
backups. I stopped the backup for one day, but it did not help (although
the loss of connectivity did not happen as often).
Being unable to identify the cause, I started pulling things apart: I
moved the image from RBD to qcow and, magically, everything became
normal. Back on RBD, the issue manifested itself again.
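(The RBD-to-qcow move can be done with qemu-img, assuming it was built
with rbd support; pool, image and path below are just placeholders:)

    # convert the RBD image to a local qcow2 file
    qemu-img convert -O qcow2 rbd:rbd/vm-disk /var/lib/libvirt/images/vm-disk.qcow2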
On the other hand, I had a number of freshly installed VMs that are also
backed by RBD and do not have this issue. The VMs that showed the fault
were different: they had been migrated from hardware hosts into the VM
environment. Both the fresh and the migrated VMs are distro-synced FC17,
so I did not expect any difference. The only difference left was that
the migrated VMs were 32 bit while the freshly installed ones were 64
bit. So in the end I upgraded the kernel in one faulty VM to 64 bit
(while leaving the rest of the system 32 bit) and the problem
disappeared! The next day I upgraded another VM the same way and it also
became problem-free. So I am now sure the problem lies in a 32 bit
kernel running on a 64 bit host.
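(A quick way to confirm the resulting mixed setup - 64 bit kernel over a
32 bit userland - is something like:)

    uname -m                         # reports x86_64 once the 64 bit kernel is running
    rpm -q --qf '%{ARCH}\n' glibc    # still shows a 32 bit userland (i686 or similar)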
So my guess is that there is a race condition, likely in the virtio_net
driver or in the TCP stack, which is apparently triggered by the
combination of a 32 bit guest on a 64 bit host and the I/O delays
introduced by the QEMU RBD driver. The issue appears only when a 32 bit
VM runs on a 64 bit host and is backed by an RBD image. Being unable to
pinpoint the exact spot in the kernel where the problem lies, I am not
even sure where I should report it, so I decided to post it here as the
place where people running VMs backed by RBD are most likely to look for
a solution.
Regards,
Vladimir