Hi Frank,

bfq was definitely broken, deadlocking I/O for a few CentOS Stream 8 kernels between EL 8.5 and 8.6 -- we also hit that in production and switched over to `none`.
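
For completeness: a udev rule can make `none` the default across reboots -- a minimal sketch, assuming your flash devices show up as non-rotational sd* disks (the file name is hypothetical):

# /etc/udev/rules.d/60-io-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"

Apply it without a reboot: udevadm control --reload-rules && udevadm trigger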

I don't recall exactly when the upstream kernel was also broken, but apparently this was the fix: https://marc.info/?l=linux-block&m=164366111512992&w=2

In any case you might want to just use `none` with flash devices -- I'm not sure fair scheduling and mq are very meaningful anymore for high-IOPS devices and Ceph.

Cheers, Dan

On Thu, Nov 17, 2022 at 1:23 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Igor,
>
> I might have a smoking gun. Could it be that ceph (the kernel??) has issues with certain disk schedulers? There was a recommendation on this list to use bfq with bluestore. This was actually the one change other than the ceph version during the upgrade: making bfq the default. Now, this might be a problem with certain drives whose preferred scheduler is different from bfq. Here is my observation:
>
> I managed to get one of the OSDs to hang today. It was not the usual abort; I don't know why the op_thread_timeout and suicide_timeout didn't trigger. The OSD's worker thread was unresponsive for a bit more than 10 minutes before I took action. Hence, nothing in the log (I should maybe have used kill -SIGABRT). Now, this time I tried to check if I can access the disk with dd. And I could not. A
>
> dd if=/dev/sdn of=disk-dump bs=4096 count=100
>
> got stuck right away in D-state:
>
> 1652472 D+ dd if=/dev/sdn of=disk-dump bs=4096 count=100
>
> This time, since I was curious about the disk scheduler, I went to another terminal on the same machine and did:
>
> # cat /sys/block/sdn/queue/scheduler
> mq-deadline kyber [bfq] none
> # echo none >> /sys/block/sdn/queue/scheduler
> # cat /sys/block/sdn/queue/scheduler
> [none] mq-deadline kyber bfq
>
> Going back to the stuck session, I now see (you can see my attempts to interrupt the dd):
>
> # dd if=/dev/sdn of=disk-dump bs=4096 count=100
> ^C^C3+0 records in
> 2+0 records out
> 8192 bytes (8.2 kB) copied, 336.712 s, 0.0 kB/s
>
> Suddenly, the disk responds again! Also, the ceph container stopped (a docker stop container returned without the container stopping - as before in this situation).
>
> Could it be that the recommendations for disk scheduler choice should be reconsidered, or is this pointing towards a bug in either how ceph or the kernel schedules disk IO? To confirm this hypothesis, I will retry the stress test with the scheduler set to the default kernel choice.
>
> I did day-long fio benchmarks with all schedulers and all sorts of workloads on our drives and could not find anything like this. It looks like it is very difficult, if not impossible, to reproduce a realistic ceph-osd IO pattern for testing. Is there any tool available for this?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <fr...@dtu.dk>
> Sent: 14 November 2022 13:03:58
> To: Igor Fedotov; ceph-users@ceph.io
> Subject: [ceph-users] Re: LVM osds loose connection to disk
>
> I can't reproduce the problem with artificial workloads; I need to get one of these OSDs running in the meta-data pool until it crashes. My plan is to reduce time-outs and increase the log level for these specific OSDs to capture what happened before an abort in the memory log. I can spare about 100G of RAM for log entries. I found the following relevant options with settings I think will work for my case (a sketch for applying them at runtime follows the list):
>
> osd_op_thread_suicide_timeout 30 # default 150
> osd_op_thread_timeout 10 # default 15
> debug_bluefs 1/20 # default 1/5
> debug_bluestore 1/20 # default 1/5
> bluestore_kv_sync_util_logging_s 3 # default 10
> log_max_recent 100000 # default 10000
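>
> A sketch for applying these at runtime to just the affected OSDs, assuming osd.959 from further down the thread (any OSD id works):
>
> # persisted in the MON config database; remove later with 'ceph config rm'
> ceph config set osd.959 osd_op_thread_suicide_timeout 30
> ceph config set osd.959 osd_op_thread_timeout 10
> ceph config set osd.959 debug_bluefs 1/20
> ceph config set osd.959 debug_bluestore 1/20
> ceph config set osd.959 bluestore_kv_sync_util_logging_s 3
> ceph config set osd.959 log_max_recent 100000
> # or inject into the running daemon only:
> ceph tell osd.959 injectargs '--debug_bluefs 1/20 --debug_bluestore 1/20 --log_max_recent 100000'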
>
> It would be great if someone could confirm that these settings will achieve what I want (or tell me what is missing). I would like to capture at least 1 minute's worth of log entries in RAM with high debug settings. Does anyone have a good estimate of how many log entries are created per second with these settings, for tuning log_max_recent?
>
> Thanks for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <fr...@dtu.dk>
> Sent: 11 November 2022 10:25:17
> To: Igor Fedotov; ceph-users@ceph.io
> Subject: [ceph-users] Re: LVM osds loose connection to disk
>
> Hi Igor,
>
> thanks for your reply. We only exchanged the mimic containers with the octopus ones. We didn't even reboot the servers during the upgrade, only later for troubleshooting. The only change since the upgrade is the ceph container.
>
> I'm trying to go down the stack and run a benchmark on the OSD directly. Unfortunately, it seems that osd bench is too simplistic. I don't think we have a problem with the disk; I rather think there is a race condition that gets the bstore_kv_sync thread stuck. I would need to reproduce a realistic workload on the OSD itself. Standard benchmarks on the disks directly did not show any problems (even when running for days). Is there a way to address an OSD directly with IO, for example with low-level rados commands?
>
> I'm almost down to thinking that the only way to create a realistic workload is to create a size-1 pool on a single OSD, create an rbd image on this pool and run fio with the rbd engine.
>
> Another question. The osd bench command accepts 4 arguments:
>
> bench [<count:int>] [<size:int>] [<object_size:int>] [<object_num:int>] : OSD benchmark: write <count> <size>-byte objects (with <obj_size> <obj_num>), (default count=1G default size=4MB). Results in log.
>
> Only the first 2 are explained in the docs. Any idea what the other 2 are doing?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Igor Fedotov <igor.fedo...@croit.io>
> Sent: 10 November 2022 15:48:23
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: LVM osds loose connection to disk
>
> Hi Frank,
>
> unfortunately IMO it's not an easy task to identify the relevant differences between mimic and octopus in this respect. At the least, the question would be which minor Ceph releases are/were in use.
>
> I recall there were some tricks with setting/clearing bluefs_buffered_io somewhere in that period. But I can hardly recall anything else... Curious whether the OS/container or other third-party software was upgraded along with Ceph as well? Just in case: I presume you were using containers in mimic too, right?
>
> Instead I'd rather approach the issue from a different side:
>
> 1) Learn how to [easily] reproduce the issue, preferably in a test lab rather than in the field. Better use exactly the same disk(s), H/W, OS and S/W versions as in production.
>
> 2) (can be done in prod as well) Once an OSD is stuck, set debug-bluefs and debug-bdev to 20 and collect a verbose log - check what's happening there. Meanwhile, monitor disk activity: is there any load on the disk while in this state at all? Do disk reads (e.g. via dd, see the sketch below) out of the OSD container succeed at this point?
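>
> A minimal sketch of such a read probe, assuming the disk is /dev/sdn as in the example above; iflag=direct bypasses the page cache, so the read really hits the device:
>
> # one 4 KiB direct read from the raw device; returns instantly on a healthy disk
> dd if=/dev/sdn of=/dev/null bs=4096 count=1 iflag=direct
> # if this dd sits in D-state, the hang is below bluestore/LVM, in the block layer or the device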
>
> Thanks,
>
> Igor
>
> On 11/10/2022 5:23 PM, Frank Schilder wrote:
> >
> > Hi all,
> >
> > I have some kind of update on the matter of stuck OSDs. It seems not to be an LVM issue, and it also seems not to be connected to the OSD size.
> >
> > After moving all data from the tiny 100G OSDs to spare SSDs, I redeployed the 400G disks with 1 OSD per disk and started to move data from the slow spare SSDs back to the fast ones. After moving about 100 out of 1024 PGs of the pool, fast OSDs started failing again. It is kind of the same observation as before: I can't stop a container with a failed OSD. However, when I restart docker, everything comes up clean.
> >
> > When I look at the logs, I see that the OSD aborts with a suicide timeout after many osd_op_tp thread timeouts. However, the OSD process, and with it the container, does not terminate because of a hanging thread. The syslog has the message (all messages below in chronological order):
> >
> > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 122 seconds
> >
> > about 30 seconds before the OSD aborts with suicide timeout with
> >
> > ceph-osd: 2022-11-09T17:36:53.691+0100 7f663a23c700 -1 *** Caught signal (Aborted) **#012 in thread 7f663a23c700 thread_name:msgr-worker-2#012#012 ceph version 15.2.17 ...
> >
> > What I see in the syslog is that the thread bstore_kv_sync does not seem to be terminated by the abort. These messages continue to show up:
> >
> > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 368 seconds.
> > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 491 seconds.
> > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 614 seconds.
> >
> > On docker stop container the launch script receives the term signal, but the OSD cannot be deactivated due to this thread:
> >
> > journal: osd_lvm_start: deactivating OSD 959
> > journal: osd_lvm_start: unmounting /var/lib/ceph/osd/ceph-959
> > journal: umount: /var/lib/ceph/osd/ceph-959: target is busy.
> >
> > It's probably busy because of the hanging bstore_kv_sync thread. As a consequence, the container is still running, has a ceph-osd process shown with docker top, and these messages continue to show up:
> >
> > INFO: task bstore_kv_sync:1283156 blocked for more than 737 seconds.
> > INFO: task bstore_kv_sync:1283156 blocked for more than 860 seconds.
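> >
> > For reference, one way to see where such a D-state thread is blocked in the kernel -- just a sketch, assuming root on the host and the thread id 1283156 from the hung-task messages above:
> >
> > # kernel stack of the blocked thread: shows the lock/IO path it is stuck in
> > cat /proc/1283156/stack
> > # or dump all blocked (D-state) tasks to the kernel log in one go:
> > echo w > /proc/sysrq-trigger; dmesg | tail -n 50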
> >
> > Although the bstore_kv_sync thread is unkillable, a restart of docker clears everything out and the OSD restarts fine. I'm somewhat hesitant to accept the simple "must be the firmware" statement, because these disks worked fine for 4 years with mimic. The only thing that changed was the ceph version from mimic to octopus; everything else stayed the same: OS version, kernel version, docker version, firmware version.
> >
> > Since it happens only on this type of disk, it could very well have to do with firmware, but not without ceph having had a serious change in low-level disk access between mimic and octopus. So, I'm wondering what features of the firmware octopus is using that mimic was not. It would be great if somebody had some pointers for what part of the software stack I should look at; I would like to avoid hunting ghosts.
> >
> > Many thanks and best regards!
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Frank Schilder <fr...@dtu.dk>
> > Sent: 10 October 2022 23:33:32
> > To: Igor Fedotov; ceph-users@ceph.io
> > Subject: [ceph-users] Re: LVM osds loose connection to disk
> >
> > Hi Igor.
> >
> > The problem of OSD crashes was resolved after migrating just a little bit of the meta-data pool to other disks (we decided to evacuate the small OSDs onto larger disks to make space). Therefore, I don't think it's an LVM or disk issue. The cluster is working perfectly now after migrating some data away from the small OSDs. I rather believe that it's tightly related to "OSD crashes during upgrade mimic->octopus"; it happens only on OSDs where the repair command errs out with an abort on enospc.
> >
> > My hypothesis is now more along the lines of a dead-lock occurring as a consequence of an aborted daemon thread. Is there any part of the bluestore code that acquires an exclusive device lock that gets passed through to the pv and can lead to a device freeze if not released? I'm wondering if something like this happened here as a consequence of the allocator fail. I saw a lot of lock-up warnings related to OSD threads in the syslog.
> >
> > Regarding the 2 minutes time difference and heartbeats: the OSD seems to have been responding to heartbeats all the time, even after the suicide time-out; see the description below. I executed docker stop container at 16:15:39. Until this moment, the OSD was considered up+in by the MONs.
> >
> > Here is a recall of events from memory, together with a description of how the 4 OSDs on 1 disk are executed in a single container. I will send detailed logs and scripts via our private communication. If anyone else is interested as well, I'm happy to make them available.
> >
> > Being in this situation over a weekend and at night, we didn't take precise minutes. Our priority was to get everything working. I'm afraid this here is as accurate as it gets.
> >
> > Let's start with how the processes are started inside the container. We have a main script M executed as the entry-point to the container. For each OSD found on a drive, M forks off a copy Mn of itself, which in turn forks off the OSD process:
> >
> > M -> M1 -> OSD1
> > M -> M2 -> OSD2
> > M -> M3 -> OSD3
> > M -> M4 -> OSD4
> >
> > At the end, we have 5 instances of the main script and 4 instances of OSDs running. This somewhat cumbersome-looking startup is required to be able to forward signals sent by the docker daemon, most notably SIGINT on docker stop container. In addition, all instances of M trap a number of signals, including SIGCHLD. If just one OSD dies, the entire container should stop and restart. On a disk failure, all OSDs on that disk go down and will be rebuilt in the background simultaneously. A simplified sketch of this pattern follows.
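> >
> > Something along these lines -- a minimal, hypothetical sketch of the supervision pattern, not our actual script (assumes ceph-osd's -f foreground mode and the wait -n builtin from bash 4.3+):
> >
> > #!/bin/bash
> > osd_wrapper() {                      # Mn: supervises a single OSD
> >     ceph-osd -f --id "$1" &          # OSDn runs in the foreground inside the container
> >     local pid=$!
> >     trap "kill -TERM $pid" TERM INT  # forward docker stop to the OSD
> >     wait "$pid"                      # returns when OSDn exits or a signal arrives
> > }
> > pids=()
> > for id in "$@"; do                   # M: fork one wrapper per OSD id
> >     osd_wrapper "$id" &              # Mn
> >     pids+=($!)
> > done
> > trap 'kill -TERM "${pids[@]}"' TERM INT
> > wait -n                              # the first wrapper/OSD exit wakes M up...
> > kill -TERM "${pids[@]}" 2>/dev/null  # ...and tears down the remaining OSDs,
> > wait                                 # so docker restarts the whole container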
> >
> > Executing docker top container on the above situation gives:
> >
> > M
> > M1
> > M2
> > M3
> > M4
> > OSD1
> > OSD2
> > OSD3
> > OSD4
> >
> > After the crash of, say, OSD1, I saw something like this (docker top container):
> >
> > M
> > M1
> > M2
> > M3
> > M4
> > OSD2
> > OSD3
> > OSD4
> >
> > The OSD processes were reported to be in Sl-state by ps.
> >
> > At this point, OSD1 was gone from the list but M1 was still running. There was no SIGCHLD! At the same time, OSDs 2-4 were marked down by the MONs, but not OSD1! Due to this, any IO targeting OSD1 got stuck, and corresponding slow-ops warnings started piling up.
> >
> > My best bet is that not all threads of OSD1 were terminated and, therefore, no SIGCHLD was sent to M1. For some reason OSD1 was not marked down, and I wonder if its left-overs might have responded to heartbeats.
> >
> > At the same time, the disk was not accessible to LVM commands any more. A "ceph-volume inventory /dev/sda" got stuck in "lvs" (in D-state). I did not try to access the raw device with dd. I was thinking about it, but attended to more pressing issues. I actually don't think the raw device was locked up, but that's just a guess.
> >
> > In an attempt to clean up the OSD-down situation, I executed "docker stop container" (to be followed by docker start). The stop took a long time (I use an increased SIGKILL time-out) and resulted in this state (docker top container):
> >
> > OSD2
> > OSD3
> > OSD4
> >
> > The OSD processes were now reported in D-state by ps, and the container was still reported to be running by docker. However, at this point all 4 OSDs were marked down, PGs peered and IO started again.
> >
> > I'm wondering if a failed allocation attempt led to a device/LVM lock being acquired but not released, leading to an LVM device freeze. There were thread lock-up messages in the syslog. It smells a lot like a dead-lock situation created by not releasing a critical resource on SIGABRT. Unfortunately, there seem to be no log messages from the thread that got locked up.
> >
> > Hope this makes some sense when interpreting the logs.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Igor Fedotov <igor.fedo...@croit.io>
> > Sent: 09 October 2022 22:07:16
> > To: Frank Schilder; ceph-users@ceph.io
> > Subject: Re: [ceph-users] LVM osds loose connection to disk
> >
> > Hi Frank,
> >
> > can't advise much on the disk issue - just an obvious thought about upgrading the firmware and/or contacting the vendor. IIUC the disk is totally inaccessible at this point, e.g. you're unable to read from it bypassing LVM as well, right? If so, this definitely looks like a low-level problem.
> >
> > As for the OSD-down issue - may I have some clarification, please: did this osd.975 never go down, or did it go down just a few minutes later? In the log snippet you shared I can see a 2 min gap between the operation-timeout indications and the final OSD suicide. I presume it had been able to respond to heartbeats prior to that suicide and hence stayed online... But I am mostly speculating so far...
> >
> > Thanks,
> >
> > Igor
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io