[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD
Thanks for the suggestions, I will try this.

/Z

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD
Zakhar, look at the top slow ops in the daemon socket for this OSD; you may find 'snapc' operations, for example. From the rbd head you can find the rbd image, and then check how many snapshots are in the chain for that image. More than 10 snaps for one image can increase client op latency to tens of milliseconds, even for NVMe drives that usually operate at microseconds or 1-2 ms.

k

Sent from my iPhone
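One way this advice could be followed, sketched below. The OSD id (osd.12), pool/image names, and the sample op line are placeholders, not taken from the thread:

```shell
# On the host running the OSD (osd.12 is a placeholder id), dump the slowest
# recent operations recorded by the daemon:
#   ceph daemon osd.12 dump_historic_slow_ops
#   ceph daemon osd.12 dump_ops_in_flight
#
# Write ops carry their snapshot context as "snapc <seq>=[...]" in the op
# description. Filtering a captured dump for the rbd_data prefix shows which
# RBD images the slow ops touch (the sample lines below are illustrative):
cat <<'EOF' > /tmp/slow_ops.txt
osd_op(client.4123.0:1832 3.1f 3:abcd1234:::rbd_data.5f3a2b.0000000000000042:head [write 0~4096] snapc 10=[10,f,e] ondisk+write)
osd_op(client.4123.0:1833 3.2a 3:deadbeef:::rbd_data.5f3a2b.0000000000000043:head [read 0~4096] ondisk+read)
EOF
grep -o 'rbd_data\.[a-f0-9]*' /tmp/slow_ops.txt | sort -u

# The rbd_data prefix maps to an image via its block_name_prefix
# (shown by "rbd info <pool>/<image>"); then count the snapshot chain:
#   rbd snap ls mypool/myimage | wc -l
```

The grep/sort step isolates the distinct image prefixes so you only have to match a handful of prefixes against `rbd info` output.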
[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD
Hi,

I'd look for deep-scrubs on that OSD; those are logged, and maybe those timestamps match your observations.
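A sketch of how the deep-scrub timestamps might be pulled out for comparison. The OSD id, log path, and JSON field names are assumptions based on typical Pacific output; the sample capture is made up:

```shell
# Deep scrubs are logged by the OSD; on its host (osd.12 and the log path
# are placeholders):
#   grep 'deep-scrub' /var/log/ceph/ceph-osd.12.log
#
# Cluster-wide, the last deep-scrub stamp of every PG on that OSD:
#   ceph pg ls-by-osd 12 -f json > /tmp/pgs.json
# Extracting pgid and stamp from captured output (sample below is illustrative):
cat <<'EOF' > /tmp/pgs.json
{"pg_stats":[{"pgid":"3.1f","last_deep_scrub_stamp":"2022-10-07T02:14:05.000000+0000"},
             {"pgid":"3.2a","last_deep_scrub_stamp":"2022-10-06T22:41:33.000000+0000"}]}
EOF
python3 - <<'PY'
import json
for pg in json.load(open("/tmp/pgs.json"))["pg_stats"]:
    print(pg["pgid"], pg["last_deep_scrub_stamp"])
PY
```

If the latency spikes line up with those stamps, scrubbing rather than a sick drive is the likely cause.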
[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD
Thanks for this! The drive doesn't show increased utilization on average, but it does sporadically get more I/O than other drives, usually in short bursts. I am now trying to find a way to trace this to a specific PG, pool and object(s) – not sure if that is possible.

/Z
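Tracing down to a PG and pool could start along these lines; the OSD id and pgid are placeholders, and the commands are a sketch of what I believe recent releases support rather than a verified recipe:

```shell
# PGs mapped to the OSD, with per-PG object/byte stats:
#   ceph pg ls-by-osd 12
# The pool id is the part of the pgid before the dot:
pgid="3.1f"
echo "${pgid%%.*}"            # prints 3, the pool id of pg 3.1f
# Map pool ids to names:
#   ceph osd pool ls detail
# List objects in a suspect PG:
#   rados --pgid 3.1f ls
```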
[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD
Hi Zakhar,

I can back up what Konstantin has reported -- we occasionally have HDDs performing very slowly even though all SMART tests come back clean. Besides ceph osd perf showing a high latency, you could see high %util with iostat.

We normally replace those HDDs -- usually by draining and zeroing them, then putting them back in prod (e.g. in a different cluster or some other service). I don't have statistics on how often those sick drives come back to full performance or not -- that could indicate it was a poor physical connection, vibrations, ..., for example. But I do recall some drives came back repeatedly as "sick" but not dead, with clean SMART tests.

If you have time you can dig deeper with increased bluestore debug levels. In our environment this happens often enough that we simply drain, replace, move on.

Cheers, Dan
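The spotting and draining steps above could look like the following; osd.12 is a placeholder, the iostat capture is a simplified sample, and purging permanently removes the OSD, so this is only a sketch of one possible procedure:

```shell
# Watch extended device stats; a sick drive shows high %util and await while
# doing otherwise ordinary I/O:
#   iostat -x 1 10
# Picking the busy device out of a captured, simplified sample
# (column 4 here stands in for %util):
cat <<'EOF' > /tmp/iostat_sample.txt
Device r/s w/s %util
sda 10.0 5.0 12.0
sdb 11.0 6.0 96.5
EOF
awk 'NR > 1 && $4 + 0 > 90 { print $1 }' /tmp/iostat_sample.txt   # prints sdb

# Draining before replacement (osd id 12 is a placeholder):
#   ceph osd out 12                           # let PGs migrate; watch "ceph -s"
#   ceph osd purge 12 --yes-i-really-mean-it  # once all PGs are active+clean
```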
[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD
Unfortunately, that isn't the case: the drive is perfectly healthy and, according to all measurements I did on the host itself, it isn't any different from any other drive on that host size-, health- or performance-wise.

The only difference I noticed is that this drive sporadically does more I/O than other drives for a split second, probably due to specific PGs placed on its OSD, but the average I/O pattern is very similar to other drives and OSDs, so it's somewhat unclear why this specific OSD consistently shows much higher latency. It would be good to figure out what exactly is causing these I/O spikes, but I'm not yet sure how to do that.

/Z
[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD
Hi,

When one of 100 drives shows unusually different perf, it may mean 'this drive is not like the others' and it should be replaced.

k

Sent from my iPhone
[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD
Anyone, please?

On Thu, 6 Oct 2022 at 14:57, Zakhar Kirpichenko wrote:

> Hi,
>
> I'm having a peculiar "issue" in my cluster, which I'm not sure whether it's real: a particular OSD always shows significant latency in `ceph osd perf` reports, an order of magnitude higher than any other OSD.
>
> I traced this OSD to a particular drive in a particular host. OSD logs don't look any different from those of any other OSD on the same node, iostat shows that the drive is utilized similarly to all other drives, and the OSD process uses a similar amount of CPU and RAM to other OSD processes. I.e. the OSD and its underlying drive look the same as other OSDs and their drives and do not appear to have higher latency at the host's system level.
>
> I would like to find out what makes this OSD report much higher latency and what steps I could take to diagnose and troubleshoot this situation. I would appreciate any input and advice.
>
> Best regards,
> Zakhar
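For reference, the outlier can be made obvious by sorting the JSON form of the report. The field names below are from my recollection of Pacific output, so treat them as assumptions; the sample data is made up:

```shell
# Per-OSD commit/apply latency:
#   ceph osd perf            # or: ceph osd perf -f json > /tmp/osd_perf.json
# Sorting a captured JSON report by commit latency (sample is illustrative):
cat <<'EOF' > /tmp/osd_perf.json
{"osdstats": {"osd_perf_infos": [
  {"id": 3, "perf_stats": {"commit_latency_ms": 5, "apply_latency_ms": 5}},
  {"id": 17, "perf_stats": {"commit_latency_ms": 80, "apply_latency_ms": 79}}]}}
EOF
python3 - <<'PY'
import json
infos = json.load(open("/tmp/osd_perf.json"))["osdstats"]["osd_perf_infos"]
for o in sorted(infos, key=lambda o: -o["perf_stats"]["commit_latency_ms"]):
    print("osd.%d commit=%dms" % (o["id"], o["perf_stats"]["commit_latency_ms"]))
PY
```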