[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Zakhar Kirpichenko
Thanks for the suggestions, I will try this.

/Z

On Fri, 7 Oct 2022 at 18:13, Konstantin Shalygin  wrote:

> Zakhar, try looking at the top slow ops in the daemon socket for this OSD;
> you may find 'snapc' operations, for example. From the rbd object head you
> can identify the rbd image, and then check how many snapshots are in the
> chain for that image. More than 10 snapshots on a single image can push
> client op latency to tens of milliseconds, even on NVMe drives that usually
> operate at microseconds or 1-2 ms.
>
>
> k
> Sent from my iPhone
>
> > On 7 Oct 2022, at 14:35, Zakhar Kirpichenko  wrote:
> >
> > The drive doesn't show increased utilization on average, but it does
> > sporadically get more I/O than other drives, usually in short bursts. I am
> > now trying to find a way to trace this to a specific PG, pool and
> > object(s) – not sure if that is possible.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Konstantin Shalygin
Zakhar, try looking at the top slow ops in the daemon socket for this OSD; you
may find 'snapc' operations, for example. From the rbd object head you can
identify the rbd image, and then check how many snapshots are in the chain for
that image. More than 10 snapshots on a single image can push client op latency
to tens of milliseconds, even on NVMe drives that usually operate at
microseconds or 1-2 ms.
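
For example, roughly like this (just a sketch; osd.12, mypool and myimage
are placeholders, and the daemon command has to run on the OSD's host):

  # recent slow/long ops from the OSD admin socket; the op descriptions
  # contain the object name, e.g. rbd_data.<image_id>.<block>
  ceph daemon osd.12 dump_historic_ops

  # match that rbd_data prefix to an image, then look at its snapshot chain
  rbd info mypool/myimage | grep block_name_prefix
  rbd snap ls mypool/myimage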


k
Sent from my iPhone

> On 7 Oct 2022, at 14:35, Zakhar Kirpichenko  wrote:
> 
> The drive doesn't show increased utilization on average, but it does
> sporadically get more I/O than other drives, usually in short bursts. I am
> now trying to find a way to trace this to a specific PG, pool and
> object(s) – not sure if that is possible.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Eugen Block

Hi,

I’d look for deep-scrubs on that OSD; those are logged, so maybe the
timestamps match your observations.
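
Something along these lines should show them (a rough sketch; the cluster
log path depends on your deployment, and the exact column names can vary
between releases):

  # deep-scrub start/finish messages end up in the cluster log
  grep deep-scrub /var/log/ceph/ceph.log

  # last deep-scrub timestamp per PG, to compare against the latency spikes
  ceph pg dump pgs | less   # see the DEEP_SCRUB_STAMP column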


Quoting Zakhar Kirpichenko:


Thanks for this!

The drive doesn't show increased utilization on average, but it does
sporadically get more I/O than other drives, usually in short bursts. I am
now trying to find a way to trace this to a specific PG, pool and
object(s) – not sure if that is possible.

/Z

On Fri, 7 Oct 2022, 12:17 Dan van der Ster,  wrote:


Hi Zakhar,

I can back up what Konstantin has reported -- we occasionally have
HDDs performing very slowly even though all smart tests come back
clean. Besides ceph osd perf showing a high latency, you could see
high ioutil% with iostat.

We normally replace those HDDs -- usually by draining and zeroing
them, then putting them back in prod (e.g. in a different cluster or
some other service). I don't have statistics on how often those sick
drives come back to full performance or not -- that could indicate it
was a poor physical connection, vibrations, ... , for example. But I
do recall some drives came back repeatedly as "sick" but not dead w/
clean SMART tests.

If you have time you can dig deeper with increased bluestore debug
levels. In our environment this happens often enough that we simply
drain, replace, move on.

Cheers, dan




On Fri, Oct 7, 2022 at 9:41 AM Zakhar Kirpichenko  wrote:
>
> Unfortunately, that isn't the case: the drive is perfectly healthy and,
> according to all measurements I did on the host itself, it isn't any
> different from any other drive on that host size-, health- or
> performance-wise.
>
> The only difference I noticed is that this drive sporadically does more I/O
> than other drives for a split second, probably due to specific PGs placed
> on its OSD, but the average I/O pattern is very similar to other drives and
> OSDs, so it's somewhat unclear why the specific OSD is consistently showing
> much higher latency. It would be good to figure out what exactly is causing
> these I/O spikes, but I'm not yet sure how to do that.
>
> /Z
>
> On Fri, 7 Oct 2022 at 09:24, Konstantin Shalygin  wrote:
>
> > Hi,
> >
> > When one of 100 drives shows unusually different perf, it may mean that
> > 'this drive is not like the others' and it should be replaced
> >
> >
> > k
> >
> > Sent from my iPhone
> >
> > > On 7 Oct 2022, at 07:33, Zakhar Kirpichenko  wrote:
> > >
> > > Anyone, please?
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Zakhar Kirpichenko
Thanks for this!

The drive doesn't show increased utilization on average, but it does
sporadically get more I/O than other drives, usually in short bursts. I am
now trying to find a way to trace this to a specific PG, pool and
object(s) – not sure if that is possible.
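
Something along these lines is what I plan to try (a rough sketch; osd.12 is
a placeholder, and the daemon command has to run on the OSD's host):

  # which PGs (and therefore pools, the number before the dot) sit on this
  # OSD; map the pool id to a name with 'ceph osd lspools'
  ceph pg ls-by-osd 12

  # during one of the bursts, the in-flight ops name the pool/object hit
  ceph daemon osd.12 dump_ops_in_flight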

/Z

On Fri, 7 Oct 2022, 12:17 Dan van der Ster,  wrote:

> Hi Zakhar,
>
> I can back up what Konstantin has reported -- we occasionally have
> HDDs performing very slowly even though all smart tests come back
> clean. Besides ceph osd perf showing a high latency, you could see
> high ioutil% with iostat.
>
> We normally replace those HDDs -- usually by draining and zeroing
> them, then putting them back in prod (e.g. in a different cluster or
> some other service). I don't have statistics on how often those sick
> drives come back to full performance or not -- that could indicate it
> was a poor physical connection, vibrations, ... , for example. But I
> do recall some drives came back repeatedly as "sick" but not dead w/
> clean SMART tests.
>
> If you have time you can dig deeper with increased bluestore debug
> levels. In our environment this happens often enough that we simply
> drain, replace, move on.
>
> Cheers, dan
>
>
>
>
> On Fri, Oct 7, 2022 at 9:41 AM Zakhar Kirpichenko  wrote:
> >
> > Unfortunately, that isn't the case: the drive is perfectly healthy and,
> > according to all measurements I did on the host itself, it isn't any
> > different from any other drive on that host size-, health- or
> > performance-wise.
> >
> > The only difference I noticed is that this drive sporadically does more I/O
> > than other drives for a split second, probably due to specific PGs placed
> > on its OSD, but the average I/O pattern is very similar to other drives and
> > OSDs, so it's somewhat unclear why the specific OSD is consistently showing
> > much higher latency. It would be good to figure out what exactly is causing
> > these I/O spikes, but I'm not yet sure how to do that.
> >
> > /Z
> >
> > On Fri, 7 Oct 2022 at 09:24, Konstantin Shalygin  wrote:
> >
> > > Hi,
> > >
> > > When one of 100 drives shows unusually different perf, it may mean that
> > > 'this drive is not like the others' and it should be replaced
> > >
> > >
> > > k
> > >
> > > Sent from my iPhone
> > >
> > > > On 7 Oct 2022, at 07:33, Zakhar Kirpichenko  wrote:
> > > >
> > > > Anyone, please?
> > >
> > >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Dan van der Ster
Hi Zakhar,

I can back up what Konstantin has reported -- we occasionally have
HDDs performing very slowly even though all smart tests come back
clean. Besides ceph osd perf showing a high latency, you could see
high ioutil% with iostat.

We normally replace those HDDs -- usually by draining and zeroing
them, then putting them back in prod (e.g. in a different cluster or
some other service). I don't have statistics on how often those sick
drives come back to full performance or not -- that could indicate it
was a poor physical connection, vibrations, ... , for example. But I
do recall some drives came back repeatedly as "sick" but not dead w/
clean SMART tests.

If you have time you can dig deeper with increased bluestore debug
levels. In our environment this happens often enough that we simply
drain, replace, move on.
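
If you go the debug route, something like this (just a sketch; osd.12 is a
placeholder, and these levels are very chatty, so turn them back down
afterwards):

  # raise bluestore/bdev logging at runtime and watch the OSD log
  ceph tell osd.12 config set debug_bluestore 20
  ceph tell osd.12 config set debug_bdev 20

  # ...wait for a latency spike, then lower the levels again
  ceph tell osd.12 config set debug_bluestore 1/5
  ceph tell osd.12 config set debug_bdev 1/3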

Cheers, dan




On Fri, Oct 7, 2022 at 9:41 AM Zakhar Kirpichenko  wrote:
>
> Unfortunately, that isn't the case: the drive is perfectly healthy and,
> according to all measurements I did on the host itself, it isn't any
> different from any other drive on that host size-, health- or
> performance-wise.
>
> The only difference I noticed is that this drive sporadically does more I/O
> than other drives for a split second, probably due to specific PGs placed
> on its OSD, but the average I/O pattern is very similar to other drives and
> OSDs, so it's somewhat unclear why the specific OSD is consistently showing
> much higher latency. It would be good to figure out what exactly is causing
> these I/O spikes, but I'm not yet sure how to do that.
>
> /Z
>
> On Fri, 7 Oct 2022 at 09:24, Konstantin Shalygin  wrote:
>
> > Hi,
> >
> > When one of 100 drives shows unusually different perf, it may mean that
> > 'this drive is not like the others' and it should be replaced
> >
> >
> > k
> >
> > Sent from my iPhone
> >
> > > On 7 Oct 2022, at 07:33, Zakhar Kirpichenko  wrote:
> > >
> > > Anyone, please?
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Zakhar Kirpichenko
Unfortunately, that isn't the case: the drive is perfectly healthy and,
according to all measurements I did on the host itself, it isn't any
different from any other drive on that host size-, health- or
performance-wise.

The only difference I noticed is that this drive sporadically does more I/O
than other drives for a split second, probably due to specific PGs placed
on its OSD, but the average I/O pattern is very similar to other drives and
OSDs, so it's somewhat unclear why the specific OSD is consistently showing
much higher latency. It would be good to figure out what exactly is causing
these I/O spikes, but I'm not yet sure how to do that.
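
Perhaps sampling the suspect device more aggressively would at least show
when the bursts happen (a rough sketch; sdX stands for the OSD's backing
device on that host):

  # 1-second, timestamped samples of just that device
  iostat -xt 1 /dev/sdX

  # or trace individual requests while a burst is in progress
  blktrace -d /dev/sdX -o - | blkparse -i -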

/Z

On Fri, 7 Oct 2022 at 09:24, Konstantin Shalygin  wrote:

> Hi,
>
> When one of 100 drives shows unusually different perf, it may mean that
> 'this drive is not like the others' and it should be replaced
>
>
> k
>
> Sent from my iPhone
>
> > On 7 Oct 2022, at 07:33, Zakhar Kirpichenko  wrote:
> >
> > Anyone, please?
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-06 Thread Konstantin Shalygin
Hi,

When one of 100 drives shows unusually different perf, it may mean that 'this
drive is not like the others' and it should be replaced


k

Sent from my iPhone

> On 7 Oct 2022, at 07:33, Zakhar Kirpichenko  wrote:
> 
> Anyone, please?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-06 Thread Zakhar Kirpichenko
Anyone, please?

On Thu, 6 Oct 2022 at 14:57, Zakhar Kirpichenko  wrote:

> Hi,
>
> I'm having a peculiar "issue" in my cluster, and I'm not sure whether it's
> real: a particular OSD always shows significant latency in the `ceph osd
> perf` report, an order of magnitude higher than any other OSD.
>
> I traced this OSD to a particular drive in a particular host. OSD logs
> don't look any different from any other OSD on the same node, iostat shows
> that the drive is utilized similarly to all other drives, the OSD process
> uses a similar amount of CPU and RAM to other OSD processes. I.e. the OSD
> and its underlying drive look the same as other OSDs and their drives and
> do not appear to have a higher latency on the host's system level.
>
> I am interested to find out what makes this OSD report much higher latency
> and what steps I could take to diagnose and troubleshoot this situation. I
> would appreciate any input and advice.
>
> Best regards,
> Zakhar
>
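
(For context, this is roughly how I traced the OSD to its host and drive;
osd.12 is a placeholder, not the actual OSD id:)

  ceph osd perf                     # per-OSD commit/apply latency
  ceph osd find 12                  # host and CRUSH location
  ceph osd metadata 12 | grep -E '"hostname"|"devices"|dev_node'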
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io