Hi Dan,

He he, I built a large-omap-object cluster; we are up to 5 now :)

It is possible that our metadata pool became a bottleneck. I'm re-deploying the
OSDs on these disks at the moment, increasing the OSD count per disk from 1 to
4. The disks I use require highly concurrent access to get close to their spec
performance, and a single OSD per disk doesn't come close to saturating them
(they are Intel enterprise NVMe-SSD SAS drives with really good performance
specs). That's why I don't see the disks themselves as a bottleneck in iostat
or atop, but it is very well possible that the OSD daemon is at its limit. It
will take a couple of days to complete this, and I will report back.
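
For reference, a sketch of how such a split can be done with ceph-volume (the
device path is just an example, adjust to your hosts):

    # deploy 4 OSDs on a single fast device to improve concurrency
    ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1

ceph-volume creates the LVs and bootstraps the OSDs in one go.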

> This covers the topic and relevant config:
> https://docs.ceph.com/en/latest/cephfs/dirfrags/

This is a classic Ceph documentation page: just numbers without units (a size
of 10000 what??) and no explanation of how this relates to object sizes and/or
key counts :) After reading it, I don't think we are looking at dirfrags. The
key count is simply too large, and probably the size as well. Could it be MDS
journals? What other objects might become large? Or how could I check what it
is, for example, by looking at a hexdump?
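
To partially answer my own question: the objects can be poked at directly with
rados. A sketch, using one of the object names from the log excerpt quoted
further down (stat/listomapkeys/listomapvals are standard rados subcommands):

    rados -p con-fs2-meta1 stat 1000d7fd167.02800000
    rados -p con-fs2-meta1 listomapkeys 1000d7fd167.02800000 | head
    rados -p con-fs2-meta1 listomapvals 1000d7fd167.02800000 | head -40

If the keys look like file names, these are dirfrags after all. If I remember
correctly, MDS journal objects would instead have names starting with 200.
(0x200 + rank) and carry no omap keys at all.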

I should mention that we have a bunch of super-aggressive clients on the FS.
Currently, I'm running 4 active MDS daemons, and by now they seem to have
distributed the client load very well among each other. The aggressive clients
are probably OpenFOAM or similar jobs that create millions and millions of
small files in a very short time. I have seen peaks of 4-8K requests per
second to the MDSes. On our old Lustre system, such jobs managed to run out of
inodes long before the storage capacity was reached; it's probably the worst
data-to-inode ratio one can think of. One of the advantages of Ceph is its
unlimited inode capacity, and it seems to cope with this usage pattern
reasonably well - modulo the problems I seem to be observing here.
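
(In case someone wants to reproduce such numbers, the per-MDS request rates
can be watched with standard tools, e.g.

    ceph fs status            # ACTIVITY column shows Reqs/s per active MDS
    ceph daemonperf mds.<id>  # live perf counters, run on the MDS host

where <id> is the local MDS daemon name.)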

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <d...@vanderster.com>
Sent: 25 August 2021 15:46:27
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] LARGE_OMAP_OBJECTS: any proper action possible?

Hi,

On Wed, Aug 25, 2021 at 2:37 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Dan,
>
> > [...] Do you have some custom mds config in this area?
>
> none that I'm aware of. What MDS config parameters should I look for?

This covers the topic and relevant config:
https://docs.ceph.com/en/latest/cephfs/dirfrags/

Here in our clusters we've never had to tune any of these options --
it works well with the defaults on our hw/workloads.
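
For completeness, these are the options that page refers to; current values
can be read back with ceph config get (defaults in recent releases shown as
comments, counted in directory entries per fragment):

    ceph config get mds mds_bal_split_size          # 10000
    ceph config get mds mds_bal_merge_size          # 50
    ceph config get mds mds_bal_fragment_size_max   # 100000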

> I recently seem to have had problems with very slow dirfrag operations that 
> made an MDS unresponsive long enough for a MON to kick it out. I had to 
> increase the MDS beacon timeout to get out of an MDS restart loop (it also 
> had oversized cache by the time I discovered the problem). The dirfrag was 
> reported as a slow op warning.

That sounds related. In our env I've never noticed slow dirfrag ops.
Do you have any underlying slowness or overload on your metadata osds?
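A quick first check could be something like

    ceph osd perf    # per-OSD commit/apply latency in ms

with an eye on the OSDs backing the metadata pool.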

-- dan



>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <d...@vanderster.com>
> Sent: 25 August 2021 14:05:00
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] LARGE_OMAP_OBJECTS: any proper action possible?
>
> Those are probably large directories; each omap key is a file/subdir
> in the directory.
>
> Normally the mds fragments dirs across several objects, so you
> shouldn't have a huge number of omap entries in any one single object.
> Do you have some custom mds config in this area?
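>
> As a sketch (pool and inode taken from your log excerpt below): dirfrag
> objects are named <dir inode in hex>.<frag id>, and the first fragment
> carries a "parent" backtrace xattr that can be decoded to a path, e.g.:
>
>     rados -p con-fs2-meta1 getxattr 1000eec35f5.00000000 parent > /tmp/p
>     ceph-dencoder type inode_backtrace_t import /tmp/p decode dump_json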
>
> -- dan
>
> On Wed, Aug 25, 2021 at 2:01 PM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > Hi Dan,
> >
> > thanks for looking at this. Here are the lines from health detail and 
> > ceph.log:
> >
> > [root@gnosis ~]# ceph health detail
> > HEALTH_WARN 4 large omap objects
> > LARGE_OMAP_OBJECTS 4 large omap objects
> >     4 large objects found in pool 'con-fs2-meta1'
> >     Search the cluster log for 'Large omap object found' for more details.
> >
> > The search gives:
> >
> > 2021-08-25 11:17:00.675474 osd.21 osd.21 192.168.32.77:6846/12302 651 : 
> > cluster [WRN] Large omap object found. Object: 
> > 12:373fb013:::1000eec35f5.01000000:head PG: 12.c80dfcec (12.6c) Key count: 
> > 216000 Size (bytes): 101520000
> > 2021-08-25 11:17:06.866726 osd.37 osd.37 192.168.32.77:6850/12306 644 : 
> > cluster [WRN] Large omap object found. Object: 
> > 12:05982a7e:::1000d7fd167.02800000:head PG: 12.7e5419a0 (12.20) Key count: 
> > 2293816 Size (bytes): 1078093520
> > 2021-08-25 11:17:11.152671 osd.37 osd.37 192.168.32.77:6850/12306 645 : 
> > cluster [WRN] Large omap object found. Object: 
> > 12:05da1450:::1000e118c0a.00000000:head PG: 12.a285ba0 (12.20) Key count: 
> > 220612 Size (bytes): 103687640
> > 2021-08-25 11:17:36.603664 osd.36 osd.36 192.168.32.75:6848/11882 1243 : 
> > cluster [WRN] Large omap object found. Object: 
> > 12:0b298d19:::1000eec35f7.04e00000:head PG: 12.98b194d0 (12.50) Key count: 
> > 657212 Size (bytes): 308889640
> >
> > They are all in the fs meta-data pool.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Dan van der Ster <d...@vanderster.com>
> > Sent: 25 August 2021 13:57:44
> > To: Frank Schilder
> > Cc: ceph-users
> > Subject: Re: [ceph-users] LARGE_OMAP_OBJECTS: any proper action possible?
> >
> > Hi Frank,
> >
> > Which objects are large? (You should see this in ceph.log when the
> > large obj was detected).
> >
> > -- dan
> >
> > On Wed, Aug 25, 2021 at 12:27 PM Frank Schilder <fr...@dtu.dk> wrote:
> > >
> > > Hi all,
> > >
> > > I have the notorious "LARGE_OMAP_OBJECTS: 4 large omap objects" warning 
> > > and am again wondering if there is any proper action one can take except 
> > > "wait it out and deep-scrub (numerous ceph-users threads)" or "ignore 
> > > (https://docs.ceph.com/en/latest/rados/operations/health-checks/#large-omap-objects)".
> > >  Only for RGWs is a proper action described, but mine come from MDSes. Is 
> > > there any way to ask an MDS to clean up or split the objects?
> > >
> > > The disks with the meta-data pool can easily deal with objects of this
> > > size. My question is more along the lines of: if I can't do anything
> > > anyway, why the warning? If there is a warning, I would assume that one
> > > can do something proper to prevent large omap objects from being created
> > > by an MDS. What is it?
> > >
> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
