On Tue, May 21, 2019 at 11:28 AM Marc Schöchlin <m...@256bit.org> wrote:
>
> Hello Jason,
>
> Am 20.05.19 um 23:49 schrieb Jason Dillaman:
>
> On Mon, May 20, 2019 at 2:17 PM Marc Schöchlin <m...@256bit.org> wrote:
>
> Hello cephers,
>
> we have a few systems which use an rbd-nbd map/mount to get access to an RBD
> volume.
> (This problem seems to be related to "[ceph-users] Slow requests from
> bluestore osds" (the original thread).)
>
> Unfortunately, the rbd-nbd device on one system has crashed three Mondays in a
> row at ~00:00, when the systemd fstrim timer executes "fstrim -av"
> (which runs in parallel to deep scrub operations).
>
> That's probably not a good practice if you have lots of VMs doing this
> at the same time *and* you are not using object-map. The reason is
> that "fstrim" could discard huge extents, resulting in around a thousand
> concurrent remove/truncate/zero ops per image being thrown at your
> cluster.
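>
> As a purely illustrative sketch (pool/image names are placeholders), the
> object-map feature can be checked and enabled per image with the rbd CLI:
>
>     # show which features are enabled on the image
>     rbd info mypool/myimage
>     # object-map requires exclusive-lock; fast-diff is normally enabled with it
>     rbd feature enable mypool/myimage object-map fast-diff
>     # rebuild the object map for data written before the feature was enabled
>     rbd object-map rebuild mypool/myimage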
>
> Sure, currently we do not have many VMs capable of running fstrim on
> RBD volumes.
> But the RBD images already involved are multi-TB images with a high
> write/deletion rate.
> Therefore I am already working on spreading out the fstrims by adding random
> delays, e.g. as sketched below.
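>
> As a minimal sketch (assuming the stock systemd fstrim.timer and an
> arbitrarily chosen 6h window), a drop-in like the following spreads the trim
> start times across hosts:
>
>     # created e.g. via "systemctl edit fstrim.timer"
>     # /etc/systemd/system/fstrim.timer.d/override.conf
>     [Timer]
>     RandomizedDelaySec=6h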
>
> After that, the device constantly reports I/O errors every time the
> filesystem is accessed.
> Unmounting, remapping and remounting helped to get the filesystem/device back
> into business :-)
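>
> (For reference, that recovery amounts to roughly the following, with
> placeholder device/pool/image/mountpoint names:)
>
>     umount /mnt/rbdvol
>     rbd-nbd unmap /dev/nbd0
>     rbd-nbd map mypool/myimage    # prints the newly assigned /dev/nbdX
>     mount /dev/nbd0 /mnt/rbdvol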
>
> If the cluster was being DDoSed by the fstrims, the VM OSes might have
> timed out, thinking there was a controller failure.
>
>
> Yes and no :-) My problem is probably related to the kernel release, a
> kernel setting or the operating system release.
> Why?
>
> - We run ~800 RBD images on that Ceph cluster with rbd-nbd 12.2.5 in our Xen
>   cluster as dom0 storage repository devices, without any timeout problems
>   (kernel 4.4.0+10, CentOS 7).
> - We run some 35 TB kRBD images carrying multiples of the write/read/deletion
>   load of the crashed rbd-nbd, without any timeout problems.
> - The timeout problem appears on two VMs (Ubuntu 18.04, Ubuntu 16.04) which
>   use the settings described above.
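>
> One knob that might also be worth ruling out is the NBD request timeout set
> at map time; depending on the rbd-nbd version there is a timeout option
> (check "rbd-nbd --help" for the exact spelling), roughly:
>
>     # illustrative only; option name/availability depends on the rbd-nbd build
>     rbd-nbd map --timeout 120 mypool/myimage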
>
> From my point of view, the error behavior is currently reproducible with
> good probability.
> Do you have suggestions on how to find the root cause of this problem?

Can you provide any logs/backtraces/core dumps from the rbd-nbd process?
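
For example, something along the following lines on the client side usually
captures enough detail (paths and debug levels here are just placeholders):

    # ceph.conf on the client running rbd-nbd
    [client]
    log file = /var/log/ceph/$cluster-$name.$pid.log
    debug rbd = 20
    debug objecter = 20

    # and allow core dumps in the shell that maps the image
    ulimit -c unlimited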

> Regards
> Marc


-- 
Jason
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
