Hello cephers,

we have a few systems which use an rbd-nbd map/mount to get access to an rbd
volume.
(This problem seems to be related to the original thread "[ceph-users] Slow
requests from bluestore osds".)

Unfortunately the rbd-nbd device of one system has crashed on three consecutive
Mondays at ~00:00, when the systemd fstrim timer executes "fstrim -av"
(which runs in parallel to the deep scrub operations).
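
For reference, the overlap of the two schedules can be checked roughly like
this (a sketch; osd.51 is just the osd implicated in the logs below, and "ceph
daemon" has to be run on that osd's host):

systemctl list-timers fstrim.timer
systemctl cat fstrim.timer
ceph daemon osd.51 config get osd_scrub_begin_hour
ceph daemon osd.51 config get osd_scrub_end_hour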

After that the device constantly reports I/O errors every time the filesystem
is accessed.
Unmounting, remapping and remounting brought the filesystem/device back into
business :-)
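
The recovery steps were roughly the following (a sketch - device, mountpoint
and pool/image names are placeholders, not our real ones):

umount /mnt/rbd
rbd-nbd unmap /dev/nbd0
rbd-nbd map <pool>/<image>
mount /dev/nbd0 /mnt/rbd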

Manual 30-minute stress tests using the following fio command did not produce
any problems on the client side
(the Ceph storage reported some slow requests while testing):

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test 
--filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw 
--rwmixread=50 --numjobs=50 --loops=10

It seems that others have also experienced this problem:
https://ceph-users.ceph.narkive.com/2FIfyx1U/rbd-nbd-timeout-and-crash
The change for setting device timeouts does not seem to have been merged into
luminous.
Experiments setting the timeout manually after mapping, using
https://github.com/OnApp/nbd-kernel_mod/blob/master/nbd_set_timeout.c, did not
change the situation.
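
The timeout experiment looked roughly like this (a sketch; the 120 second value
is only an example, and we assume the helper takes "<device> <timeout in
seconds>" as arguments):

gcc -o nbd_set_timeout nbd_set_timeout.c
sudo ./nbd_set_timeout /dev/nbd0 120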

Do you have suggestions on how to analyze/solve this situation?

Regards
Marc
----------------------------------------------------------------------



The client kernel throws messages like this:

May 19 23:59:01 int-nfs-001 CRON[836295]: (root) CMD (command -v debian-sa1 > 
/dev/null && debian-sa1 60 2)
May 20 00:00:30 int-nfs-001 systemd[1]: Starting Discard unused blocks...
May 20 00:01:02 int-nfs-001 kernel: [1077851.623582] block nbd0: Connection 
timed out
May 20 00:01:02 int-nfs-001 kernel: [1077851.623613] block nbd0: shutting down 
sockets
May 20 00:01:02 int-nfs-001 kernel: [1077851.623617] print_req_error: I/O 
error, dev nbd0, sector 84082280
May 20 00:01:02 int-nfs-001 kernel: [1077851.623632] block nbd0: Connection 
timed out
May 20 00:01:02 int-nfs-001 kernel: [1077851.623636] print_req_error: I/O 
error, dev nbd0, sector 92470887
May 20 00:01:02 int-nfs-001 kernel: [1077851.623642] block nbd0: Connection 
timed out

Ceph throws messages like this:

2019-05-20 00:00:00.000124 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173572 : 
cluster [INF] overall HEALTH_OK
2019-05-20 00:00:54.249998 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173586 : 
cluster [WRN] Health check failed: 644 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:00.330566 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173587 : 
cluster [WRN] Health check update: 594 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:09.768476 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173591 : 
cluster [WRN] Health check update: 505 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:14.768769 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173592 : 
cluster [WRN] Health check update: 497 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:20.610398 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173593 : 
cluster [WRN] Health check update: 509 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:28.721891 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173594 : 
cluster [WRN] Health check update: 501 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:34.909842 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173596 : 
cluster [WRN] Health check update: 494 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:44.770330 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173597 : 
cluster [WRN] Health check update: 500 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:49.770625 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173599 : 
cluster [WRN] Health check update: 608 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:55.073734 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173600 : 
cluster [WRN] Health check update: 593 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:02:04.771432 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173607 : 
cluster [WRN] Health check update: 552 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:02:09.771730 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173609 : 
cluster [WRN] Health check update: 720 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:02:19.393803 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173610 : 
cluster [WRN] Health check update: 539 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:02:25.474605 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173611 : 
cluster [WRN] Health check update: 527 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:02:34.773039 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173612 : 
cluster [WRN] Health check update: 496 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:02:39.773312 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173613 : 
cluster [WRN] Health check update: 493 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:02:44.773604 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173614 : 
cluster [WRN] Health check update: 528 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:02:49.801997 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173616 : 
cluster [WRN] Health check update: 537 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:02:59.779779 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173617 : 
cluster [WRN] Health check update: 520 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:03:04.780074 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173622 : 
cluster [WRN] Health check update: 493 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:03:10.073854 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173624 : 
cluster [WRN] Health check update: 452 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:03:19.780877 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173625 : 
cluster [WRN] Health check update: 515 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:03:24.781177 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173626 : 
cluster [WRN] Health check update: 540 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:03:30.321540 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173627 : 
cluster [WRN] Health check update: 545 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:03:39.781968 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173628 : 
cluster [WRN] Health check update: 508 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:03:44.782261 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173629 : 
cluster [WRN] Health check update: 469 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:03:50.610639 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173630 : 
cluster [WRN] Health check update: 513 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:03:58.724045 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173631 : 
cluster [WRN] Health check update: 350 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:04:04.801989 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173638 : 
cluster [WRN] Health check update: 356 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:04:14.783787 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173640 : 
cluster [WRN] Health check update: 395 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:04:19.234877 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173641 : 
cluster [INF] Health check cleared: REQUEST_SLOW (was: 238 slow requests are 
blocked > 32 sec. Implicated osds 51)
2019-05-20 00:04:19.234921 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173642 : 
cluster [INF] Cluster is now healthy
2019-05-20 01:00:00.000124 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 174035 : 
cluster [INF] overall HEALTH_OK

The parameters of our environment:

  * Storage System (OSDs and MONs)
      o Ceph 12.2.11
      o Ubuntu 16.04/18.04
      o 30 * 8 TB spinners, distributed over 3 hosts
  * Client
      o Ceph 12.2.11
      o Ubuntu 18.04 / 64 Bit
      o ceph.conf:
        [global]
        fsid = <redacted>
        mon host = <redacted>
        public network = <redacted>

        [client]
        rbd cache = true
        rbd cache size = 536870912
        rbd cache max dirty = 268435456
        rbd cache target dirty = 134217728
        rbd cache max dirty age = 30
        rbd readahead max bytes = 4194304
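        (For readability: 536870912 = 512 MiB cache, 268435456 = 256 MiB max
        dirty, 134217728 = 128 MiB target dirty, 4194304 = 4 MiB readahead.)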


Regards
Marc

On 13.05.19 at 07:40, EDH - Manuel Rios Fernandez wrote:
> Hi Marc,
>
> Try to compact the OSD with slow requests:
>
> ceph tell osd.[ID] compact
>
> This will take the OSD offline for some seconds (SSD) to minutes (HDD) and 
> perform a compaction of the OMAP database.
>
> Regards,
>
>
>
>
> -----Original Message-----
> From: ceph-users <ceph-users-boun...@lists.ceph.com> On behalf of Marc Schöchlin
> Sent: Monday, 13 May 2019 6:59
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Slow requests from bluestore osds
>
> Hello cephers,
>
> one week ago we replaced the bluestore cache size settings with "osd memory
> target" and removed the detailed memory settings.
> This storage class now runs 42 * 8 TB spinners with a permanent write workload
> of 2000-3000 write IOPS and 1200-8000 read IOPS.
>
> Our new setup is now:
> (12.2.10 on Ubuntu 16.04)
>
> [osd]
> osd deep scrub interval = 2592000
> osd scrub begin hour = 19
> osd scrub end hour = 6
> osd scrub load threshold = 6
> osd scrub sleep = 0.3
> osd snap trim sleep = 0.4
> pg max concurrent snap trims = 1
>
> [osd.51]
> osd memory target = 8589934592
> ...
>
> After that (restarting the entire cluster with these settings) we were very
> happy to not see any slow requests for 7 days.
>
> Unfortunately, this night the slow requests returned on one osd without any
> known change in the workload over the last 14 days (according to our detailed
> monitoring):
>
> 2019-05-12 22:00:00.000117 mon.ceph-mon-s43 [INF] overall HEALTH_OK
> 2019-05-12 23:00:00.000130 mon.ceph-mon-s43 [INF] overall HEALTH_OK
> 2019-05-13 00:00:00.000129 mon.ceph-mon-s43 [INF] overall HEALTH_OK
> 2019-05-13 00:00:44.069793 mon.ceph-mon-s43 [WRN] Health check failed: 416 
> slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
> 2019-05-13 00:00:50.151190 mon.ceph-mon-s43 [WRN] Health check update: 439 
> slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
> 2019-05-13 00:00:59.750398 mon.ceph-mon-s43 [WRN] Health check update: 452 
> slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
> 2019-05-13 00:01:04.750697 mon.ceph-mon-s43 [WRN] Health check update: 283 
> slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
> 2019-05-13 00:01:10.419801 mon.ceph-mon-s43 [WRN] Health check update: 230 
> slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
> 2019-05-13 00:01:19.751516 mon.ceph-mon-s43 [WRN] Health check update: 362 
> slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
> 2019-05-13 00:01:24.751822 mon.ceph-mon-s43 [WRN] Health check update: 324 
> slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
> 2019-05-13 00:01:30.675160 mon.ceph-mon-s43 [WRN] Health check update: 341 
> slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
> 2019-05-13 00:01:38.759012 mon.ceph-mon-s43 [WRN] Health check update: 390 
> slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
> 2019-05-13 00:01:44.858392 mon.ceph-mon-s43 [WRN] Health check update: 366 
> slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
> 2019-05-13 00:01:54.753388 mon.ceph-mon-s43 [WRN] Health check update: 352 
> slow requests are blocked > 32 sec. Implicated osds 51 (REQUEST_SLOW)
> 2019-05-13 00:01:59.045220 mon.ceph-mon-s43 [INF] Health check cleared: 
> REQUEST_SLOW (was: 168 slow requests are blocked > 32 sec. Implicated osds 51)
> 2019-05-13 00:01:59.045257 mon.ceph-mon-s43 [INF] Cluster is now healthy
> 2019-05-13 01:00:00.000114 mon.ceph-mon-s43 [INF] overall HEALTH_OK
> 2019-05-13 02:00:00.000130 mon.ceph-mon-s43 [INF] overall HEALTH_OK
>
>
> The output of a "ceph health detail" loop at the time the problem occurred:
>
> Mon May 13 00:01:27 CEST 2019
> HEALTH_WARN 324 slow requests are blocked > 32 sec. Implicated osds 51 
> REQUEST_SLOW 324 slow requests are blocked > 32 sec. Implicated osds 51
>     324 ops are blocked > 32.768 sec
>     osd.51 has blocked requests > 32.768 sec
>
> The logfile of the OSD:
>
> 2019-05-12 23:57:28.767463 7f38da4e2700  4 rocksdb: (Original Log Time 
> 2019/05/12-23:57:28.767419) 
> [/build/ceph-12.2.10/src/rocksdb/db/db_impl_compaction_flush.cc:132] 
> [default] Level summary: base level 1 max bytes base 268435456 files[2 4 21 
> 122 0 0 0] max score 0.94
>
> 2019-05-12 23:57:28.767511 7f38da4e2700  4 rocksdb: 
> [/build/ceph-12.2.10/src/rocksdb/db/db_impl_files.cc:388] [JOB 2991] Try to 
> delete WAL files size 256700142, prev total WAL file size 257271487, number 
> of live
>  WAL files 2.
>
> 2019-05-12 23:58:07.816376 7f38ddce9700  0 log_channel(cluster) log [DBG] : 
> 34.ac scrub ok
> 2019-05-12 23:59:54.070025 7f38de4ea700  0 log_channel(cluster) log [DBG] : 
> 34.236 scrub starts
> 2019-05-13 00:02:21.818689 7f38de4ea700  0 log_channel(cluster) log [DBG] : 
> 34.236 scrub ok
> 2019-05-13 00:04:37.613094 7f38ead03700  4 rocksdb: 
> [/build/ceph-12.2.10/src/rocksdb/db/db_impl_write.cc:684] reusing log 422507 
> from recycle list
>
> 2019-05-13 00:04:37.613186 7f38ead03700  4 rocksdb: 
> [/build/ceph-12.2.10/src/rocksdb/db/db_impl_write.cc:725] [default] New 
> memtable created with log file: #422511. Immutable memtables: 0.
>
> Any hints on how to find more details about the origin of this problem?
> How can we solve that?
>
> Regards
> Marc
>
> On 28.01.19 at 22:27, Marc Schöchlin wrote:
>> Hello cephers,
>>
>> as described - we also have the slow requests in our setup.
>>
>> We recently updated from ceph 12.2.4 to 12.2.10, updated Ubuntu 16.04 to the 
>> latest patchlevel (with kernel 4.15.0-43) and applied dell firmware 2.8.0.
>>
>> On 12.2.5 (before updating the cluster) we had slow requests at a frequency
>> of 10 to 30 minutes throughout the entire deep-scrub window between 8:00 PM
>> and 6:00 AM, especially between 04:00 AM and 06:00 AM, when we sequentially
>> create an rbd snapshot for every rbd image and delete an outdated snapshot
>> (we keep 3 snapshots per rbd device).
>>
>> After the upgrade to 12.2.10 (and the other patches) the slow requests seem
>> to be reduced, but they still occur after the snapshot creation/deletion
>> procedure.
>> Today we changed the time of the creation/deletion procedure from 4:00 AM to
>> 7:30 PM, and we experienced slow requests right in the snapshot process at
>> 8:00 PM.
>>
>> The slow requests only happen on the osds of a certain storage class (30 *
>> 8 TB spinners) - i.e. ssd osds on the same cluster do not have this problem.
>> The pools which use this storage class are loaded by 80% write requests.
>>
>> Our configuration looks like this:
>> ---
>> bluestore cache kv max = 2147483648
>> bluestore cache kv ratio = 0.9
>> bluestore cache meta ratio = 0.1
>> bluestore cache size hdd = 10737418240
>> osd deep scrub interval = 2592000
>> osd scrub begin hour = 19
>> osd scrub end hour = 6
>> osd scrub load threshold = 4
>> osd scrub sleep = 0.3
>> osd max trimming pgs = 2
>> ---
>> We do not have that many devices in this storage class (an enhancement to
>> get more iops is in progress).
>>
>> What can I do to decrease the impact of snap trims to prevent slow requests?
>> (e.g. reduce "osd max trimming pgs" to "1")
>>
>> Regards
>> Marc Schöchlin
>>
>> On 03.09.18 at 10:13, Marc Schöchlin wrote:
>>> Hi,
>>>
>>> we have also been experiencing this type of behavior for some weeks on our
>>> not so performance-critical hdd pools.
>>> We haven't spent much time on this problem because there are currently
>>> more important tasks - but here are a few details:
>>>
>>> Running the following loop results in the following output:
>>>
>>> while true; do ceph health|grep -q HEALTH_OK || (date;  ceph health 
>>> detail); sleep 2; done
>>>
>>> Sun Sep  2 20:59:47 CEST 2018
>>> HEALTH_WARN 4 slow requests are blocked > 32 sec
>>> REQUEST_SLOW 4 slow requests are blocked > 32 sec
>>>     4 ops are blocked > 32.768 sec
>>>     osd.43 has blocked requests > 32.768 sec
>>> Sun Sep  2 20:59:50 CEST 2018
>>> HEALTH_WARN 4 slow requests are blocked > 32 sec
>>> REQUEST_SLOW 4 slow requests are blocked > 32 sec
>>>     4 ops are blocked > 32.768 sec
>>>     osd.43 has blocked requests > 32.768 sec
>>> Sun Sep  2 20:59:52 CEST 2018
>>> HEALTH_OK
>>> Sun Sep  2 21:00:28 CEST 2018
>>> HEALTH_WARN 1 slow requests are blocked > 32 sec
>>> REQUEST_SLOW 1 slow requests are blocked > 32 sec
>>>     1 ops are blocked > 32.768 sec
>>>     osd.41 has blocked requests > 32.768 sec
>>> Sun Sep  2 21:00:31 CEST 2018
>>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>>>     7 ops are blocked > 32.768 sec
>>>     osds 35,41 have blocked requests > 32.768 sec
>>> Sun Sep  2 21:00:33 CEST 2018
>>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>>>     7 ops are blocked > 32.768 sec
>>>     osds 35,51 have blocked requests > 32.768 sec
>>> Sun Sep  2 21:00:35 CEST 2018
>>> HEALTH_WARN 7 slow requests are blocked > 32 sec
>>> REQUEST_SLOW 7 slow requests are blocked > 32 sec
>>>     7 ops are blocked > 32.768 sec
>>>     osds 35,51 have blocked requests > 32.768 sec
>>>
>>> Our details:
>>>
>>>   * system details:
>>>     * Ubuntu 16.04
>>>      * Kernel 4.13.0-39
>>>      * 30 * 8 TB Disk (SEAGATE/ST8000NM0075)
>>>      * 3* Dell Power Edge R730xd (Firmware 2.50.50.50)
>>>        * Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
>>>        * 2*10GBITS SFP+ Network Adapters
>>>        * 192GB RAM
>>>      * Pools are using replication factor 3, 2MB object size,
>>>        85% write load, 1700 write IOPS/sec
>>>        (ops mainly between 4k and 16k size), 300 read IOPS/sec
>>>   * we have the impression that this appears on deepscrub/scrub activity.
>>>   * Ceph 12.2.5; we already played with the following OSD settings
>>>     (our assumption was that the problem is related to rocksdb compaction):
>>>     bluestore cache kv max = 2147483648
>>>     bluestore cache kv ratio = 0.9
>>>     bluestore cache meta ratio = 0.1
>>>     bluestore cache size hdd = 10737418240
>>>   * this type of problem only appears on hdd/bluestore osds; ssd/bluestore
>>>     osds never experienced that problem
>>>   * the system is healthy: no swapping, no high load, no errors in dmesg
>>>
>>> I attached a log excerpt of osd.35 - this is probably useful for
>>> investigating the problem if someone has deeper bluestore knowledge.
>>> (slow requests appeared on Sun Sep  2 21:00:35)
>>>
>>> Regards
>>> Marc
>>>
>>>
>>> On 02.09.2018 at 15:50, Brett Chancellor wrote:
>>>> The warnings look like this.
>>>>
>>>> 6 ops are blocked > 32.768 sec on osd.219
>>>> 1 osds have slow requests
>>>>
>>>> On Sun, Sep 2, 2018, 8:45 AM Alfredo Deza <ad...@redhat.com 
>>>> <mailto:ad...@redhat.com>> wrote:
>>>>
>>>>     On Sat, Sep 1, 2018 at 12:45 PM, Brett Chancellor
>>>>     <bchancel...@salesforce.com <mailto:bchancel...@salesforce.com>>
>>>>     wrote:
>>>>     > Hi Cephers,
>>>>     >   I am in the process of upgrading a cluster from Filestore to
>>>>     bluestore,
>>>>     > but I'm concerned about frequent warnings popping up against the new
>>>>     > bluestore devices. I'm frequently seeing messages like this,
>>>>     although the
>>>>     > specific osd changes, it's always one of the few hosts I've
>>>>     converted to
>>>>     > bluestore.
>>>>     >
>>>>     > 6 ops are blocked > 32.768 sec on osd.219
>>>>     > 1 osds have slow requests
>>>>     >
>>>>     > I'm running 12.2.4, have any of you seen similar issues? It
>>>>     seems as though
>>>>     > these messages pop up more frequently when one of the bluestore
>>>>     pgs is
>>>>     > involved in a scrub.  I'll include my bluestore creation process
>>>>     below, in
>>>>     > case that might cause an issue. (sdb, sdc, sdd are SATA, sde and
>>>>     sdf are
>>>>     > SSD)
>>>>
>>>>     Would be useful to include what those warnings say. The ceph-volume
>>>>     commands look OK to me
>>>>
>>>>     >
>>>>     >
>>>>     > ## Process used to create osds
>>>>     > sudo ceph-disk zap /dev/sdb /dev/sdc /dev/sdd /dev/sdd /dev/sde
>>>>     /dev/sdf
>>>>     > sudo ceph-volume lvm zap /dev/sdb
>>>>     > sudo ceph-volume lvm zap /dev/sdc
>>>>     > sudo ceph-volume lvm zap /dev/sdd
>>>>     > sudo ceph-volume lvm zap /dev/sde
>>>>     > sudo ceph-volume lvm zap /dev/sdf
>>>>     > sudo sgdisk -n 0:2048:+133GiB -t 0:FFFF -c 1:"ceph block.db sdb"
>>>>     /dev/sdf
>>>>     > sudo sgdisk -n 0:0:+133GiB -t 0:FFFF -c 2:"ceph block.db sdc"
>>>>     /dev/sdf
>>>>     > sudo sgdisk -n 0:0:+133GiB -t 0:FFFF -c 3:"ceph block.db sdd"
>>>>     /dev/sdf
>>>>     > sudo sgdisk -n 0:0:+133GiB -t 0:FFFF -c 4:"ceph block.db sde"
>>>>     /dev/sdf
>>>>     > sudo ceph-volume lvm create --bluestore --crush-device-class hdd
>>>>     --data
>>>>     > /dev/sdb --block.db /dev/sdf1
>>>>     > sudo ceph-volume lvm create --bluestore --crush-device-class hdd
>>>>     --data
>>>>     > /dev/sdc --block.db /dev/sdf2
>>>>     > sudo ceph-volume lvm create --bluestore --crush-device-class hdd
>>>>     --data
>>>>     > /dev/sdd --block.db /dev/sdf3
>>>>     >
>>>>     >
>>>>     >
>>>>
>>>>
>>>>
>
