On Tue, May 21, 2019 at 11:28 AM Marc Schöchlin <m...@256bit.org> wrote:
>
> Hello Jason,
>
> On 20.05.19 at 23:49, Jason Dillaman wrote:
> > On Mon, May 20, 2019 at 2:17 PM Marc Schöchlin <m...@256bit.org> wrote:
> > > Hello cephers,
> > >
> > > we have a few systems which utilize an rbd-nbd map/mount to get access
> > > to an RBD volume.
> > > (This problem seems to be related to "[ceph-users] Slow requests from
> > > bluestore osds" (the original thread).)
> > >
> > > Unfortunately, the rbd-nbd device of one system crashed three Mondays in
> > > a row at ~00:00, when the systemd fstrim timer executes "fstrim -av"
> > > (which runs in parallel to deep scrub operations).
> >
> > That's probably not a good practice if you have lots of VMs doing this
> > at the same time *and* you are not using object-map. The reason is
> > that "fstrim" could discard huge extents that result in around a thousand
> > concurrent remove/truncate/zero ops per image being thrown at your
> > cluster.
>
> Sure, currently we do not have lots of VMs which are capable of running
> fstrim on RBD volumes.
> But the RBD images already involved are multi-TB images with a high
> write/deletion rate.
> Therefore I am already in the process of distributing the fstrims by adding
> random delays.
>
> > > After that, the device constantly reports I/O errors every time the
> > > filesystem is accessed.
> > > Unmounting, remapping and mounting helped to get the filesystem/device
> > > back into business :-)
> >
> > If the cluster was being DDoSed by the fstrims, the VM OSes might have
> > timed out, thinking there was a controller failure.
>
> Yes and no :-) Probably my problem is related to the kernel release, a
> kernel setting or the operating system release.
> Why?
>
> - We run ~800 RBD images on that Ceph cluster with rbd-nbd 12.2.5 in our
>   Xen cluster as dom0 storage repository devices without any timeout
>   problems (kernel 4.4.0+10, CentOS 7).
> - We run some 35 TB kRBD images with multiples of the write/read/deletion
>   load of the crashed rbd-nbd without any timeout problems.
> - The timeout problem appears on two VMs (Ubuntu 18.04, Ubuntu 16.04)
>   which utilize the described settings.
>
> From my point of view, the error behavior is currently reproducible with a
> good probability.
> Do you have suggestions on how to find the root cause of this problem?
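[Editorial note: Marc's plan to spread the fstrims out with random delays could be sketched as a small wrapper around "fstrim -av"; the function name and the one-hour default bound are assumptions, not from the thread. On systemd hosts, an override setting `RandomizedDelaySec=` on fstrim.timer achieves the same effect without a wrapper.]

```shell
#!/bin/bash
# Hypothetical helper to jitter fstrim start times across guests so that
# discards from many VMs do not all hit the cluster at the same minute.
# The default bound of 3600 s (one hour) is an assumed tunable.
jittered_fstrim() {
    local max=${1:-3600}
    # Draw two random bytes from /dev/urandom and reduce them modulo the bound.
    local delay=$(( $(od -An -N2 -tu2 /dev/urandom) % max ))
    echo "sleeping ${delay}s before fstrim"
    sleep "$delay"
    fstrim -av
}
```

A cron entry or timer would then call `jittered_fstrim` instead of `fstrim -av` directly.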
Can you provide any logs/backtraces/core dumps from the rbd-nbd process?

> Regards
> Marc

--
Jason
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
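[Editorial note: to capture the logs Jason asks for, client-side logging can be turned up on the affected VM before reproducing the crash. A sketch using standard Ceph client options; the log path, debug levels, and socket path shown are illustrative choices, not from the thread:]

```ini
# /etc/ceph/ceph.conf on the rbd-nbd client; values are illustrative
[client]
log file = /var/log/ceph/ceph-client.$name.$pid.log
debug rbd = 20
debug rados = 20
admin socket = /var/run/ceph/$cluster-$name.$pid.asok
```

Raising `ulimit -c unlimited` in the environment that maps the device would additionally allow a core dump and backtrace to be collected if rbd-nbd crashes again.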