Hello Mike,

See my inline comments below.

On 14.08.19 at 02:09, Mike Christie wrote:
>>> -----
>>> Previous tests crashed in a reproducible manner with "-P 1" (single-I/O gzip/gunzip) after a few minutes up to 45 minutes.
>>>
>>> Overview of my tests:
>>>
>>> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s device timeout
>>>   -> 18-hour test run was successful, no dmesg output
>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws I/O errors, map/mount can be re-created without reboot
>>>   -> parallel krbd device usage with 99% I/O usage worked without a problem while running the test
>>> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws I/O errors, map/mount can be re-created
>>>   -> parallel krbd device usage with 99% I/O usage worked without a problem while running the test
>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
>>>   -> failed after < 10 minutes
>>>   -> system ends up under very high load and is almost unusable; it cannot be shut down, a hard reset of the VM is necessary, and the exclusive lock has to be removed manually before the device can be remapped

There is something new compared to yesterday: three days ago I downgraded a
production system to client version 12.2.5. Last night this machine crashed as
well. So it seems that rbd-nbd is broken in general, also with release 12.2.5
and potentially earlier releases.

The new (updated) list:

- FAILED: kernel 4.15, ceph 12.2.5, 2TB ec-volume, ext4 file system, 120s device timeout
  -> crashed in production while snapshot trimming was running on that pool
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws I/O errors, map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% I/O usage worked without a problem while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws I/O errors, map/mount can be re-created
  -> parallel krbd device usage with 99% I/O usage worked without a problem while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> system ends up under very high load and is almost unusable; it cannot be shut down, a hard reset of the VM is necessary, and the exclusive lock has to be removed manually before the device can be remapped
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws I/O errors, map/mount can be re-created
- FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws I/O errors, map/mount can be re-created

>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws I/O errors, map/mount can be re-created
>> How many CPUs and how much memory does the VM have?

Characteristics of the crashed VM:

  * Ubuntu 18.04, kernel 4.15, Ceph client 12.2.5
  * Services: NFS kernel server, nothing else
  * Crash behavior:
      o a daily task for snapshot creation/deletion started at 19:00
      o a daily database backup started at 19:00, which created
          + 120 IOPS write and 1 IOPS read
          + 22K sectors per second written, 0 sectors per second read
          + 97 Mbit/s inbound and 97 Mbit/s outbound network usage (NFS server)
      o we had slow requests at the time of the crash
      o the rbd-nbd process terminated 25 minutes later without a segfault
      o the NFS usage produced a 5-minute load average of 10 from the start, with 5K context switches/sec
      o memory usage (kernel + userspace) was 10% of the system
      o no swap usage
  * ceph.conf (the cache byte values are converted to MiB in the short sketch after this list)
    [client]
    rbd cache = true
    rbd cache size = 67108864
    rbd cache max dirty = 33554432
    rbd cache target dirty = 25165824
    rbd cache max dirty age = 3
    rbd readahead max bytes = 4194304
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
  * 4 CPUs
  * 6 GB RAM
  * Non-default sysctl settings
    vm.swappiness = 1
    fs.aio-max-nr = 262144
    fs.file-max = 1000000
    kernel.pid_max = 4194303
    vm.zone_reclaim_mode = 0
    kernel.randomize_va_space = 0
    kernel.panic = 0
    kernel.panic_on_oops = 0
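
For readability, the rbd cache byte values from the ceph.conf above converted
to MiB (just a quick conversion on my side, nothing more):

# Convert the rbd cache byte values from the ceph.conf above into MiB,
# only to make the numbers easier to read.
for name, value in (
    ("rbd cache size", 67108864),
    ("rbd cache max dirty", 33554432),
    ("rbd cache target dirty", 25165824),
    ("rbd readahead max bytes", 4194304),
):
    print("%-24s = %9d bytes = %g MiB" % (name, value, value / 2**20))
# -> 64 MiB cache, 32 MiB max dirty, 24 MiB target dirty, 4 MiB readahead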


>> I'm not sure which test it covers above, but for
>> test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like
>> the command that probably triggered the timeout got stuck in safe_write
>> or write_fd, because we see:
>>
>> // Command completed and right after this log message we try to write
>> the reply and data to the nbd.ko module.
>>
>> 2019-07-29 21:55:21.148118 7fffbf7fe700 20 rbd-nbd: writer_entry: got:
>> [4500000000000000 READ 24043755000~20000 0]
>>
>> // We got stuck and 2 minutes go by and so the timeout fires. That kills
>> the socket, so we get an error here and after that rbd-nbd is going to exit.
>>
>> 2019-07-29 21:57:21.785111 7fffbf7fe700 -1 rbd-nbd: [4500000000000000
>> READ 24043755000~20000 0]: failed to write replay data: (32) Broken pipe
>>
>> We could hit this in a couple ways:
>>
>> 1. The block layer sends a command that is larger than the socket's send
>> buffer limits. These are those values you sometimes set in sysctl.conf like:
>>
>> net.core.rmem_max
>> net.core.wmem_max
>> net.core.rmem_default
>> net.core.wmem_default
>> net.core.optmem_max
See the attached file (sysctl_settings.txt.gz).
>> There does not seem to be any checks/code to make sure there is some
>> alignment with limits. I will send a patch but that will not help you
>> right now. The max io size for nbd is 128k so make sure your net values
>> are large enough. Increase the values in sysctl.conf and retry if they
>> were too small.
> Not sure what I was thinking. Just checked the logs and we have done IO
> of the same size that got stuck and it was fine, so the socket sizes
> should be ok.
>
> We still need to add code to make sure IO sizes and the af_unix sockets
> size limits match up.
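
Until such a check exists in rbd-nbd, this is roughly how I verify the limits
by hand on the client. It is only a throwaway sketch of mine (nothing from the
Ceph tree), and the 128 KiB figure is simply taken from your comment above:

#!/usr/bin/env python3
# Throwaway sketch: read the net.core buffer sysctls from /proc and flag any
# that are smaller than the 128 KiB max nbd I/O size mentioned above.
# optmem_max is only printed, since it limits ancillary data, not the payload.

NBD_MAX_IO = 128 * 1024  # 128 KiB, taken from the comment above

def read_sysctl(name):
    with open("/proc/sys/" + name.replace(".", "/")) as f:
        return int(f.read().split()[0])

for name in ("net.core.rmem_max", "net.core.wmem_max",
             "net.core.rmem_default", "net.core.wmem_default"):
    value = read_sysctl(name)
    note = "" if value >= NBD_MAX_IO else "  <-- smaller than 128 KiB"
    print("%s = %d%s" % (name, value, note))

print("net.core.optmem_max = %d" % read_sysctl("net.core.optmem_max"))

The current values on the affected VM are in the attached sysctl_settings.txt.gz.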
>
>
>> 2. If memory is low on the system, we could be stuck trying to allocate
>> memory in the kernel in that code path too.
Memory was definitely not low; we only had 10% memory usage at the time of the
crash.
>> rbd-nbd just uses more memory per device, so it could be why we do not
>> see a problem with krbd.
>>
>> 3. I wonder if we are hitting a bug with PF_MEMALLOC Ilya hit with krbd.
>> He removed that code from the krbd. I will ping him on that.

Interesting. I have enabled core dumps for those processes; perhaps we will
find something interesting there...
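
For reference, this is the small check I use to confirm that the running
rbd-nbd processes can actually write a core dump. Again just a local helper of
mine that reads /proc, nothing from Ceph:

#!/usr/bin/env python3
# Local helper: check whether the running rbd-nbd processes can produce a
# core dump, by looking at RLIMIT_CORE in /proc/<pid>/limits and at the
# global kernel.core_pattern.
import os

with open("/proc/sys/kernel/core_pattern") as f:
    print("kernel.core_pattern =", f.read().strip())

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open("/proc/%s/comm" % pid) as f:
            if f.read().strip() != "rbd-nbd":
                continue
        with open("/proc/%s/limits" % pid) as f:
            for line in f:
                if line.startswith("Max core file size"):
                    print("pid %s: %s" % (pid, line.strip()))
    except OSError:
        continue  # process exited or is not readable

If "Max core file size" shows 0 for a PID, the limit has to be raised first
(e.g. ulimit -c unlimited in the shell that maps the device, or LimitCORE= when
it is started from a systemd unit).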

Regards
Marc


Attachment: sysctl_settings.txt.gz
Description: application/gzip
