Hello Jason,

Thanks for your response; see my inline comments.

On 31.07.19 at 14:43, Jason Dillaman wrote:
> On Wed, Jul 31, 2019 at 6:20 AM Marc Schöchlin <m...@256bit.org> wrote:
>
>
> The problem does not seem to be related to kernel releases, filesystem types, or 
> the Ceph and network setup.
> Release 12.2.5 seems to work properly, and at least releases >= 12.2.10 seem 
> to have the described problem.
>  ...
>
> It's basically just a log message tweak and some changes to how the
> process is daemonized. If you could re-test w/ each release after
> 12.2.5 and pin-point where the issue starts occurring, we would have
> something more to investigate.

Are there changes related to https://tracker.ceph.com/issues/23891?


You showed me the very small number of changes in rbd-nbd itself.
What about librbd, librados, ...?

What else can we do to pin down the exact cause of the crash?
Do you think it would be useful to enable core dump creation for that process?
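If it helps, core dump creation for rbd-nbd could be enabled roughly like this (a sketch; the /var/crash path and the shell-vs-systemd detail are assumptions about the setup):

```shell
# Remove the core file size limit for the current shell (assumption:
# rbd-nbd is mapped from this shell; for a systemd-managed service,
# set LimitCORE=infinity in the unit instead).
ulimit -c unlimited

# Write core files to a known location; %e = executable name, %p = PID.
# The /var/crash directory is an assumption -- any writable path works.
sudo mkdir -p /var/crash
sudo sysctl -w kernel.core_pattern='/var/crash/core.%e.%p'

# After the next crash, a backtrace could then be pulled from the core
# (needs the ceph debug symbol packages installed):
#   gdb /usr/bin/rbd-nbd /var/crash/core.rbd-nbd.<PID>
#   (gdb) thread apply all bt
```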

>> What's next? Is it a good idea to do a binary search between 12.2.12 and 
>> 12.2.5?
>>
Due to the absence of a coworker, I had almost no capacity to run deeper 
tests on this problem.
But I can say that I also reproduced the problem with release 12.2.12.

The new (updated) list:

- SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s 
device timeout
  -> 18-hour test run was successful, no dmesg output
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device 
timeout
  -> failed after < 1 hour; rbd-nbd map/device is gone, mount throws I/O errors, 
map/mount can be re-created without a reboot
  -> parallel krbd device usage at 99% I/O utilization worked without a problem 
while the test was running
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour; rbd-nbd map/device is gone, mount throws I/O errors, 
map/mount can be re-created
  -> parallel krbd device usage at 99% I/O utilization worked without a problem 
while the test was running
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system load climbs very high, the system becomes almost unusable, a clean 
shutdown is impossible, a hard reset of the VM is necessary, and the exclusive 
lock must be removed manually before the device can be remapped
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s 
device timeout
  -> failed after < 1 hour; rbd-nbd map/device is gone, mount throws I/O errors, 
map/mount can be re-created
- FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s 
device timeout
  -> failed after < 1 hour; rbd-nbd map/device is gone, mount throws I/O errors, 
map/mount can be re-created
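For reference, each run in the matrix above maps and mounts the image roughly like this (a sketch; the pool and image names are placeholders, and --timeout 120 corresponds to the "120s device timeout" entries):

```shell
# Map the image with a 120s NBD device timeout (drop --timeout for the
# "no timeout" run). "rbd_ec" and "testimage" are placeholder names.
DEV=$(sudo rbd-nbd map --timeout 120 rbd_ec/testimage)

# Create a file system and mount it (xfs or ext4, depending on the run).
sudo mkfs.xfs "$DEV"
sudo mkdir -p /mnt/rbd-nbd-test
sudo mount "$DEV" /mnt/rbd-nbd-test

# ... run the I/O load until the mapping disappears or the time budget
# is used up, watching dmesg for nbd/timeout messages ...

sudo umount /mnt/rbd-nbd-test
sudo rbd-nbd unmap "$DEV"
```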

Regards
Marc

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
