The balancer is unfortunately not very good when you have large k+m in your erasure coding profiles and relatively few servers; some manual balancing will be required.
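Something along these lines (untested, just to illustrate the idea) could be a starting point for that: parse `ceph osd df -f json` and print `ceph osd reweight` commands for the fullest OSDs. The 75% threshold and the 0.95 factor are arbitrary illustrative values, and the JSON field names ("name", "utilization", "reweight") are what Nautilus appears to emit, so verify them on your own cluster before running anything:

    #!/usr/bin/env python3
    # Sketch only: prints reweight commands for review, does not execute them.
    import json
    import subprocess

    THRESHOLD = 75.0   # illustrative: consider anything above 75% used
    FACTOR = 0.95      # illustrative: shave 5% off the current reweight

    raw = subprocess.check_output(["ceph", "osd", "df", "-f", "json"])
    for osd in json.loads(raw).get("nodes", []):
        if osd.get("utilization", 0.0) > THRESHOLD:
            new_weight = round(osd.get("reweight", 1.0) * FACTOR, 4)
            # Same command Dan suggests further down in the thread,
            # e.g. "ceph osd reweight osd.10 0.95".
            print("ceph osd reweight {} {}".format(osd["name"], new_weight))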
Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Aug 26, 2019 at 12:33 PM Simon Oosthoek <s.oosth...@science.ru.nl> wrote:
>
> On 26-08-19 12:00, EDH - Manuel Rios Fernandez wrote:
> > The balancer only balances when the cluster is in a healthy state.
> >
> > The problem is that data is not balanced at its first write; that causes
> > data to be spread improperly across the OSDs.
>
> I suppose the crush algorithm doesn't take the fullness of the osds into
> account when placing objects...
>
> > This problem only happens in Ceph. We see the same on 14.2.2 and have to
> > change the weights manually, because the balancer is a passive element of
> > the cluster.
> >
> > I hope the next version gets a more aggressive balancer, like enterprise
> > storage systems that allow filling up to 95% of raw capacity.
>
> I'm thinking a cronjob with a script to parse the output of `ceph osd df
> tree` and reweight according to the percentage used would be relatively
> easy to write. But I'll concentrate on monitoring before I start
> tweaking there ;-)
>
> Cheers
>
> /Simon
>
> > Regards
> >
> > -----Original message-----
> > From: ceph-users <ceph-users-boun...@lists.ceph.com> On behalf of Simon Oosthoek
> > Sent: Monday, 26 August 2019 11:52
> > To: Dan van der Ster <d...@vanderster.com>
> > CC: ceph-users <ceph-users@lists.ceph.com>
> > Subject: Re: [ceph-users] cephfs full, 2/3 Raw capacity used
> >
> > On 26-08-19 11:37, Dan van der Ster wrote:
> >> Thanks. The version and balancer config look good.
> >>
> >> So you can try `ceph osd reweight osd.10 0.8` to see if it helps to
> >> get you out of this.
> >
> > I've done this and the next three fullest osds. This will take some time
> > to recover; I'll let you know when it's done.
> >
> > Thanks,
> >
> > /simon
> >
> >> -- dan
> >>
> >> On Mon, Aug 26, 2019 at 11:35 AM Simon Oosthoek
> >> <s.oosth...@science.ru.nl> wrote:
> >>>
> >>> On 26-08-19 11:16, Dan van der Ster wrote:
> >>>> Hi,
> >>>>
> >>>> Which version of ceph are you using? Which balancer mode?
> >>>
> >>> Nautilus (14.2.2), balancer is in upmap mode.
> >>>
> >>>> The balancer score isn't a percent-error or anything humanly usable.
> >>>> `ceph osd df tree` can better show you exactly which osds are
> >>>> over/under utilized and by how much.
> >>>
> >>> Aha, I ran this and sorted on the %full column:
> >>>
> >>>  81  hdd 10.81149 1.00000 11 TiB 5.2 TiB 5.1 TiB   4 KiB 14 GiB 5.6 TiB 48.40 0.73  96 up osd.81
> >>>  48  hdd 10.81149 1.00000 11 TiB 5.3 TiB 5.2 TiB  15 KiB 14 GiB 5.5 TiB 49.08 0.74  95 up osd.48
> >>> 154  hdd 10.81149 1.00000 11 TiB 5.5 TiB 5.4 TiB 2.6 GiB 15 GiB 5.3 TiB 50.95 0.76  96 up osd.154
> >>> 129  hdd 10.81149 1.00000 11 TiB 5.5 TiB 5.4 TiB 5.1 GiB 16 GiB 5.3 TiB 51.33 0.77  96 up osd.129
> >>>  42  hdd 10.81149 1.00000 11 TiB 5.6 TiB 5.5 TiB 2.6 GiB 14 GiB 5.2 TiB 51.81 0.78  96 up osd.42
> >>> 122  hdd 10.81149 1.00000 11 TiB 5.7 TiB 5.6 TiB  16 KiB 14 GiB 5.1 TiB 52.47 0.79  96 up osd.122
> >>> 120  hdd 10.81149 1.00000 11 TiB 5.7 TiB 5.6 TiB 2.6 GiB 15 GiB 5.1 TiB 52.92 0.79  95 up osd.120
> >>>  96  hdd 10.81149 1.00000 11 TiB 5.8 TiB 5.7 TiB 2.6 GiB 15 GiB 5.0 TiB 53.58 0.80  96 up osd.96
> >>>  26  hdd 10.81149 1.00000 11 TiB 5.8 TiB 5.7 TiB  20 KiB 15 GiB 5.0 TiB 53.68 0.80  97 up osd.26
> >>> ...
> >>>   6  hdd 10.81149 1.00000 11 TiB 8.3 TiB 8.2 TiB  88 KiB 18 GiB 2.5 TiB 77.14 1.16  96 up osd.6
> >>>  16  hdd 10.81149 1.00000 11 TiB 8.4 TiB 8.3 TiB  28 KiB 18 GiB 2.4 TiB 77.56 1.16  95 up osd.16
> >>>   0  hdd 10.81149 1.00000 11 TiB 8.6 TiB 8.4 TiB  48 KiB 17 GiB 2.2 TiB 79.24 1.19  96 up osd.0
> >>> 144  hdd 10.81149 1.00000 11 TiB 8.6 TiB 8.5 TiB 2.6 GiB 18 GiB 2.2 TiB 79.57 1.19  95 up osd.144
> >>> 136  hdd 10.81149 1.00000 11 TiB 8.6 TiB 8.5 TiB  48 KiB 17 GiB 2.2 TiB 79.60 1.19  95 up osd.136
> >>>  63  hdd 10.81149 1.00000 11 TiB 8.6 TiB 8.5 TiB 2.6 GiB 17 GiB 2.2 TiB 79.60 1.19  95 up osd.63
> >>> 155  hdd 10.81149 1.00000 11 TiB 8.6 TiB 8.5 TiB   8 KiB 19 GiB 2.2 TiB 79.85 1.20  95 up osd.155
> >>>  89  hdd 10.81149 1.00000 11 TiB 8.7 TiB 8.5 TiB  12 KiB 20 GiB 2.2 TiB 80.04 1.20  96 up osd.89
> >>> 106  hdd 10.81149 1.00000 11 TiB 8.8 TiB 8.7 TiB  64 KiB 19 GiB 2.0 TiB 81.38 1.22  96 up osd.106
> >>>  94  hdd 10.81149 1.00000 11 TiB 9.0 TiB 8.9 TiB     0 B 19 GiB 1.8 TiB 83.53 1.25  96 up osd.94
> >>>  33  hdd 10.81149 1.00000 11 TiB 9.1 TiB 9.0 TiB  44 KiB 19 GiB 1.7 TiB 84.40 1.27  96 up osd.33
> >>>  15  hdd 10.81149 1.00000 11 TiB  10 TiB 9.8 TiB  16 KiB 20 GiB 877 GiB 92.08 1.38  96 up osd.15
> >>>  53  hdd 10.81149 1.00000 11 TiB  10 TiB  10 TiB 2.6 GiB 20 GiB 676 GiB 93.90 1.41  96 up osd.53
> >>>  51  hdd 10.81149 1.00000 11 TiB  10 TiB  10 TiB 2.6 GiB 20 GiB 666 GiB 93.98 1.41  96 up osd.51
> >>>  10  hdd 10.81149 1.00000 11 TiB  10 TiB  10 TiB  40 KiB 22 GiB 552 GiB 95.01 1.42  97 up osd.10
> >>>
> >>> So the fullest one is at 95.01% and the emptiest one at 48.4%, so
> >>> there's some balancing to be done.
> >>>
> >>>> You might be able to manually fix things by using `ceph osd reweight
> >>>> ...` on the most full osds to move data elsewhere.
> >>>
> >>> I'll look into this, but I was hoping that the balancer module would
> >>> take care of this...
> >>>
> >>>> Otherwise, in general, it's good to set up monitoring so you notice
> >>>> and take action well before the osds fill up.
> >>>
> >>> Yes, I'm still working on this. I want to add some checks to our
> >>> check_mk+icinga setup using native plugins, but my python skills are
> >>> not quite up to the task, at least not yet ;-)
> >>>
> >>> Cheers
> >>>
> >>> /Simon
> >>>
> >>>> Cheers, Dan
> >>>>
> >>>> On Mon, Aug 26, 2019 at 11:09 AM Simon Oosthoek
> >>>> <s.oosth...@science.ru.nl> wrote:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> we're building up our experience with our ceph cluster before we
> >>>>> take it into production. I've now tried to fill up the cluster with
> >>>>> cephfs, which we plan to use for about 95% of all data on the cluster.
> >>>>>
> >>>>> The cephfs pools are full when the cluster reports 67% raw capacity
> >>>>> used. There are 4 pools we use for cephfs data: 3-copy, 4-copy, EC
> >>>>> 8+3 and EC 5+7. The balancer module is turned on and `ceph balancer
> >>>>> eval` gives `current cluster score 0.013255 (lower is better)`, so
> >>>>> well within the default 5% margin. Is there a setting we can tweak
> >>>>> to increase the usable RAW capacity to say 85% or 90%, or is this
> >>>>> the most we can expect to store on the cluster?
> >>>>>
> >>>>> [root@cephmon1 ~]# ceph df
> >>>>> RAW STORAGE:
> >>>>>     CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
> >>>>>     hdd       1.8 PiB     605 TiB     1.2 PiB     1.2 PiB          66.71
> >>>>>     TOTAL     1.8 PiB     605 TiB     1.2 PiB     1.2 PiB          66.71
> >>>>>
> >>>>> POOLS:
> >>>>>     POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
> >>>>>     cephfs_data              1     111 MiB      79.26M     1.2 GiB    100.00           0 B
> >>>>>     cephfs_metadata          2      52 GiB       4.91M      52 GiB    100.00           0 B
> >>>>>     cephfs_data_4copy        3     106 TiB      46.36M     428 TiB    100.00           0 B
> >>>>>     cephfs_data_3copy        8      93 TiB      42.08M     282 TiB    100.00           0 B
> >>>>>     cephfs_data_ec83        13     106 TiB      50.11M     161 TiB    100.00           0 B
> >>>>>     rbd                     14      21 GiB       5.62k      63 GiB    100.00           0 B
> >>>>>     .rgw.root               15     1.2 KiB           4       1 MiB    100.00           0 B
> >>>>>     default.rgw.control     16         0 B           8         0 B         0           0 B
> >>>>>     default.rgw.meta        17       765 B           4       1 MiB    100.00           0 B
> >>>>>     default.rgw.log         18         0 B         207         0 B         0           0 B
> >>>>>     scbench                 19     133 GiB      34.14k     400 GiB    100.00           0 B
> >>>>>     cephfs_data_ec57        20     126 TiB      51.84M     320 TiB    100.00           0 B
> >>>>>
> >>>>> [root@cephmon1 ~]# ceph balancer eval
> >>>>> current cluster score 0.013255 (lower is better)
> >>>>>
> >>>>> Being full at 2/3 raw used is a bit too "pretty" to be accidental. It
> >>>>> seems like this could be a parameter for cephfs; however, I couldn't
> >>>>> find anything like this in the documentation for Nautilus.
> >>>>>
> >>>>> The logs in the dashboard show this:
> >>>>>
> >>>>> 2019-08-26 11:00:00.000630 [ERR] overall HEALTH_ERR 3 backfillfull osd(s); 1 full osd(s); 12 pool(s) full
> >>>>> 2019-08-26 10:57:44.539964 [INF] Health check cleared: POOL_BACKFILLFULL (was: 12 pool(s) backfillfull)
> >>>>> 2019-08-26 10:57:44.539944 [WRN] Health check failed: 12 pool(s) full (POOL_FULL)
> >>>>> 2019-08-26 10:57:44.539926 [ERR] Health check failed: 1 full osd(s) (OSD_FULL)
> >>>>> 2019-08-26 10:57:44.539899 [WRN] Health check update: 3 backfillfull osd(s) (OSD_BACKFILLFULL)
> >>>>> 2019-08-26 10:00:00.000088 [WRN] overall HEALTH_WARN 4 backfillfull osd(s); 12 pool(s) backfillfull
> >>>>>
> >>>>> So it seems that ceph is completely stuck at 2/3 full, while we
> >>>>> anticipated being able to fill up the cluster to at least 85-90% of
> >>>>> the raw capacity, or at least enough that we would keep a functioning
> >>>>> cluster when a single osd node fails.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> /Simon

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
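Relating to the check_mk/icinga checks mentioned earlier in the thread: a minimal Nagios/Icinga-style fullness check could look like the sketch below. This is an illustration only, not a finished check_mk plugin; it assumes the monitoring user can run `ceph osd df -f json`, that the "nodes" entries carry "name" and "utilization" fields (as Nautilus appears to emit), and that 75%/85% are sensible warning/critical thresholds for your cluster.

    #!/usr/bin/env python3
    # Sketch of a Nagios/Icinga-style check: exit 0=OK, 1=WARNING, 2=CRITICAL.
    import json
    import subprocess
    import sys

    WARN, CRIT = 75.0, 85.0  # illustrative thresholds, pick your own

    nodes = json.loads(subprocess.check_output(
        ["ceph", "osd", "df", "-f", "json"]))["nodes"]
    # Report on the fullest OSD only; a real plugin might list all offenders.
    worst = max(nodes, key=lambda n: n.get("utilization", 0.0))
    util = worst.get("utilization", 0.0)

    if util >= CRIT:
        print("CRITICAL: {} is {:.1f}% full".format(worst["name"], util))
        sys.exit(2)
    if util >= WARN:
        print("WARNING: {} is {:.1f}% full".format(worst["name"], util))
        sys.exit(1)
    print("OK: fullest OSD {} at {:.1f}%".format(worst["name"], util))

With the `ceph osd df tree` output shown above, a check like this would have reported CRITICAL for osd.10 at 95.01% used, well before OSD_FULL stopped the cluster.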