Hello,
I just want to share my recent experience with KVM backed by RBD. Ceph
appears not to be at fault, but I'm posting it here because my
configuration is one that other Ceph users may well be running.
Over the last three weeks I was battling an elusive issue: a KVM guest
backed by RBD intermittently lost network connectivity under consistent
(but relatively low) load. It was driving me nuts, as nothing appeared
in the logs and everything was seemingly OK, with one exception: pings
to a nearby host sometimes came back with a "No buffer space available"
error, and a number of pings could be delayed by 20-30 seconds
(obviously such a delay causes a lot of timeouts). The VM has two virtio
network interfaces, with the vhost_net module loaded on the host; one
interface is publicly reachable, the second is connected to a private
network. I tried switching virtio to e1000 and increasing the network
buffers - all in vain.
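(For reference, by "increasing the network buffers" I mean tuning along
these lines - purely illustrative, the interface name and values are
just placeholders:)

    lsmod | grep vhost_net                  # confirm vhost_net is loaded on the host
    ip link set eth0 txqueuelen 10000       # lengthen the interface transmit queue
    sysctl -w net.core.wmem_max=16777216    # raise the maximum socket send buffer
    sysctl -w net.core.wmem_default=262144  # raise the default socket send buffer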
I also noticed that when a hold-up happened on an interface, the delayed
pings piled up on the interface, were then suddenly sent through all at
once, and the remote host returned all of them pretty much
simultaneously. Another observation: the behaviour was clearly evident
whenever there was reasonable network/disk activity, e.g. during
backups. I stopped the backup for one day, but it did not help (although
the loss of connectivity did not happen as often).
Being unable to identify the cause, I started pulling things apart: I
moved the image from RBD to qcow and, magically, everything became
normal. Back on RBD, the issue manifested itself again.
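(The RBD-to-qcow move can be done with qemu-img, assuming it was built
with rbd support; pool, image and path below are just placeholders:)

    # convert the RBD image to a local qcow2 file
    qemu-img convert -O qcow2 rbd:rbd/vm-disk /var/lib/libvirt/images/vm-disk.qcow2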
On the other hand, I had a number of freshly installed VMs that are also
backed by RBD and do not have this issue. The VMs that showed the fault
were different: they had been migrated from hardware hosts into the VM
environment. Both the fresh and the migrated VMs are distro-synced FC17,
so I did not expect any difference. The only difference left was that
the migrated VMs were 32 bit while the freshly installed ones were 64
bit. So in the end I upgraded the kernel in one faulty VM to 64 bit
(while leaving the rest of the system 32 bit) and the problem
disappeared! The next day I upgraded another VM the same way and it also
became problem-free. So I am now sure the problem lies in a 32 bit
kernel running on a 64 bit host.
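(A quick way to confirm the resulting mixed setup - 64 bit kernel over a
32 bit userland - is something like:)

    uname -m                         # reports x86_64 once the 64 bit kernel is running
    rpm -q --qf '%{ARCH}\n' glibc    # still shows a 32 bit userland (i686 or similar)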
So my guess is that there is a race condition, likely in the virtio_net
driver or in the TCP stack, which is apparently triggered by the
combination of a 32 bit guest on a 64 bit host and the I/O delays
introduced by the QEMU RBD driver. The issue appears only when a 32 bit
VM runs on a 64 bit host and is backed by an RBD image. Being unable to
pinpoint the exact spot in the kernel where the problem lies, I am not
even sure where I should report it, so I decided to post it here as the
place where people running VMs backed by RBD are most likely to look for
a solution.
Regards,
Vladimir