[ceph-users] Laggy OSDs

2022-03-29 Thread Alex Closs
Hey folks,

We have a 16.2.7 cephadm cluster that's had slow ops and several (constantly 
changing) laggy PGs. The set of OSDs with slow ops seems to change at random, 
among all 6 OSD hosts in the cluster. All drives are enterprise SATA SSDs, by 
either Intel or Micron. We're still not ruling out a network issue, but wanted 
to troubleshoot from the Ceph side in case something broke there.
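
For anyone who wants specifics, this is roughly the per-OSD detail that can be
pulled for the OSDs named in the warning (a sketch; the daemon commands have to
run on the host that carries that OSD):

  ceph health detail                           # which OSDs currently report slow ops
  ceph daemon osd.124 dump_ops_in_flight       # ops currently stuck on that OSD
  ceph daemon osd.124 dump_historic_slow_ops   # recent ops that crossed the slow threshold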

ceph -s:

 health: HEALTH_WARN
 3 slow ops, oldest one blocked for 246 sec, daemons 
[osd.124,osd.130,osd.141,osd.152,osd.27] have slow ops.

 services:
 mon: 5 daemons, quorum ceph-osd10,ceph-mon0,ceph-mon1,ceph-osd9,ceph-osd11 
(age 28h)
 mgr: ceph-mon0.sckxhj(active, since 25m), standbys: ceph-osd10.xmdwfh, 
ceph-mon1.iogajr
 osd: 143 osds: 143 up (since 92m), 143 in (since 2w)
 rgw: 3 daemons active (3 hosts, 1 zones)

 data:
 pools: 26 pools, 3936 pgs
 objects: 33.14M objects, 144 TiB
 usage: 338 TiB used, 162 TiB / 500 TiB avail
 pgs: 3916 active+clean
 19 active+clean+laggy
 1 active+clean+scrubbing+deep

 io:
 client: 59 MiB/s rd, 98 MiB/s wr, 1.66k op/s rd, 1.68k op/s wr

This is actually much faster than it's been for much of the past hour; it's
been as low as 50 KB/s and dozens of IOPS in both directions (where the cluster
typically does 300 MB/s to a few GB/s, and ~4k IOPS).
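
For a rough per-OSD latency comparison while this is happening, ceph osd perf
shows commit/apply latency per OSD (a sketch, not a definitive measurement):

  ceph osd perf | sort -n -k2 | tail -20   # 20 highest commit latencies last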

The cluster has been on 16.2.7 since a few days after release without issue. 
The only recent change was an apt upgrade and reboot on the hosts (which was 
last Friday and didn't show signs of problems).

Happy to provide logs, let me know what would be useful. Thanks for reading 
this wall :)

-Alex

MIT CSAIL
he/they
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Laggy OSDs

2022-03-29 Thread Alex Closs
Hi - I've been bitten by that before, so I checked: that *did* happen, but I
turned swap back off a while ago.
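
For the record, the check was roughly this on each OSD host (a quick sketch):

  swapon --show   # no output means no swap device is active
  free -h         # the Swap: line should read 0B total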

Thanks for your quick reply :)
-Alex
On Mar 29, 2022, 6:26 PM -0400, Arnaud M wrote:
> Hello
>
> Is swap enabled on your hosts? Is swap being used?
>
> For our cluster we tend to allocate enough RAM and disable swap.
>
> Maybe the reboot of your hosts re-activated swap?
>
> Try disabling swap and see if it helps.
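>
> Roughly, on each host (a sketch; also comment out any swap entry in /etc/fstab
> so it stays disabled across reboots):
>
>   sudo swapoff -a   # turn off all active swap immediately
>   swapon --show     # empty output = no active swap devices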
>
> All the best
>
> Arnaud
>
> > On Tue, Mar 29, 2022 at 11:41 PM, David Orman wrote:
> > > We're definitely dealing with something that sounds similar, but it's hard
> > > to state definitively without more detail. Do you have object lock /
> > > versioned buckets in use (especially if one started being used around the
> > > time of the slowdown)? Was this cluster always on 16.2.7?
> > >
> > > What is your pool configuration (EC k+m or replicated X), and do you use
> > > the same pool for indexes and data? I'm assuming this is RGW usage via the
> > > S3 API; let us know if that's not correct.
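> > >
> > > For reference, output of something like the following would answer the pool
> > > questions (a sketch; <profile> is a placeholder):
> > >
> > >   ceph osd pool ls detail                      # replica size or EC profile per pool
> > >   ceph osd erasure-code-profile get <profile>  # k/m values for an EC profile
> > >
> > > and, for versioning, "aws s3api get-bucket-versioning" against the RGW endpoint.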
> > >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Newer linux kernel cephfs clients is more trouble?

2022-05-11 Thread Alex Closs
Hey y'all -

As a datapoint, I *don't* see this issue on 5.17.4-200.fc35.x86_64. Hosts are
Fedora 35 Server, running Ceph 17.2.0. Happy to test or provide more data from
this cluster if it would be helpful.
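
If it helps, this is roughly what I can grab from here on request (a sketch):

  uname -r              # client kernel version
  ceph versions         # Ceph daemon versions across the cluster
  ceph health detail    # any "failing to respond to capability release" warnings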

-Alex
On May 11, 2022, 2:02 PM -0400, David Rivera wrote:
> Hi,
>
> My experience is similar; I was also using elrepo kernels, on CentOS 8.
> Kernels 5.14+ were causing problems, so I had to go back to 5.11. I did not
> test 5.12-5.13, and I did not have enough time to narrow the system
> instability down to Ceph. Currently I'm using the stock Rocky Linux 8
> kernels (4.18); I very rarely get caps release warnings, but besides that
> everything has been working great.
>
> David
>
>
> On Wed, May 11, 2022, 09:07 Stefan Kooman  wrote:
>
> > Hi List,
> >
> > We have quite a few Linux kernel clients for CephFS. One of our customers
> > has been running mainline kernels (CentOS 7 elrepo) for the past two
> > years. They started out with 3.x kernels (the CentOS 7 default), but
> > upgraded to mainline when those kernels would frequently generate MDS
> > warnings like "failing to respond to capability release". That worked
> > fine until the 5.14 kernel: 5.14 and up would use a lot of CPU and *way*
> > more bandwidth on CephFS than older kernels (an order of magnitude).
> > After the MDS was upgraded from Nautilus to Octopus that behavior was
> > gone (CPU / bandwidth usage comparable to the older kernels). However,
> > the newer kernels are now the ones that give "failing to respond to
> > capability release", and, worse, clients get evicted (unresponsive as
> > far as the MDS is concerned). Even the latest 5.17 kernels show this. No
> > difference is observed between messenger v1 and v2. The MDS version is
> > 15.2.16.
> >
> > Surprisingly, the latest stable CentOS 7 kernels now work flawlessly.
> > That is good news, but newer operating systems come with newer kernels.
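> >
> > For anyone trying to observe the same thing, a sketch of what can be checked
> > (the debugfs paths assume debugfs is mounted on the client):
> >
> >   ceph health detail                  # lists clients failing to respond to capability release
> >   cat /sys/kernel/debug/ceph/*/mdsc   # in-flight MDS requests on a kernel client
> >   cat /sys/kernel/debug/ceph/*/caps   # caps currently held by that client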
> >
> > Does anyone else observe the same behavior with newish kernel clients?
> >
> > Gr. Stefan
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io