Re: [ceph-users] Use telegraf/influx to detect problems is very difficult

2019-12-12 Thread Miroslav Kalina
I just briefly peeked into the source of the module and I suppose it's
because the main design idea is simply to forward existing metrics from
the Ceph core and not calculate anything.

To me it seems most users probably use Prometheus, which doesn't have
this kind of issue.

Detecting a monitor down is also easy as pie, because it's just
"num_mon - mon_quorum". But there is also the metric mon_outside_quorum,
which is always zero for me and I don't really know how it works.
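
If you want to sanity-check that arithmetic from the CLI, something like
this should give the same number (a rough sketch, assuming the usual
quorum_status JSON layout with monmap.mons and quorum):

# monitors known minus monitors currently in quorum
$ ceph quorum_status -f json | jq '(.monmap.mons | length) - (.quorum | length)'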

OSD near full will probably be trickier: you have to use
"osd.stat_bytes_used / osd.stat_bytes" and compare it with your own
configured threshold (which is not a metric, so not exported) for each OSD.
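
As a rough illustration of the same ratio from the CLI (a sketch, assuming
the "utilization" field of "ceph osd df -f json" and an illustrative 85%
threshold):

# print any OSD above 85% used
$ ceph osd df -f json | jq -r '.nodes[] | select(.utilization > 85) | "\(.name) \(.utilization)%"'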

Or you can just watch the general cluster health metric (which you
should do anyway) and raise a general alarm in that case.
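
A minimal sketch of that fallback, assuming the JSON health output exposes
an overall "status" field:

# prints HEALTH_OK / HEALTH_WARN / HEALTH_ERR
$ ceph health -f json | jq -r .status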

M.



On 11. 12. 19 21:18, Mario Giammarco wrote:
> Miroslav explained better than I could why it "is not so simple" to use math.
> And osd down was the easiest one. How can I calculate:
> - monitor down
> - osd near full
> ?
>
> I do not understand why the ceph plugin cannot send to influx all the
> metrics it has, especially the ones most useful for creating alarms.
>
> On Wed, Dec 11, 2019 at 04:58, Konstantin Shalygin <k0...@k0ste.ru> wrote:
>
>> But it is very difficult/complicated to make simple queries because, for
>> example, I have osd up and osd total but no osd down metric.
>>
> To determine how many OSDs are down you don't need a special metric,
> because you already have the osd_up and osd_in metrics. Just use math.
>
>
>
>
> k

-- 
Miroslav Kalina
Systems development specialist

miroslav.kal...@livesport.eu
+420 773 071 848

Livesport s.r.o.
Aspira Business Centre
Bucharova 2928/14a, 158 00 Praha 5
www.livesport.eu



Re: [ceph-users] Use telegraf/influx to detect problems is very difficult

2019-12-12 Thread Stefan Kooman
Quoting Miroslav Kalina (miroslav.kal...@livesport.eu):

> Detecting a monitor down is also easy as pie, because it's just
> "num_mon - mon_quorum". But there is also the metric mon_outside_quorum,
> which is always zero for me and I don't really know how it works.

See this issue if you want to know what it is used for:
https://tracker.ceph.com/issues/35947

TL;DR: it's not what you think it is.

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/    Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] RBD Object-Map Usage incorrect

2019-12-12 Thread Ilya Dryomov
On Thu, Dec 12, 2019 at 9:12 AM Ashley Merrick  wrote:
>
> Due to the recent 5.3.x kernels having support for object-map and other
> features required by KRBD, I have now enabled object-map,fast-diff on some
> RBD images with Ceph (14.2.5) and rebuilt the object maps using "rbd
> object-map rebuild".
>
> However, for some RBD images the Provisioned/Total Provisioned listed in
> the Ceph MGR is the full RBD size and not the true usage reflected in the
> VM by df -h. I have discard enabled and have run fstrim, but I know that,
> for example, a 20TB RBD has never gone above the 9TB currently shown in
> df -h, yet the Ceph MGR shows 20TB under Provisioned/Total Provisioned.
>
> Not sure if I am hitting a bug? Or if this is expected behavior?

Unless you know *exactly* what the filesystem is doing in your case and
see an inconsistency, this is expected.

If you are interested, here is an example:

$ rbd create --size 1G img
$ sudo rbd map img
/dev/rbd0
$ sudo mkfs.ext4 /dev/rbd0
$ sudo mount /dev/rbd0 /mnt
$ df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/rbd0   976M  2.6M  907M   1% /mnt
$ rbd du img
NAME  PROVISIONED  USED
img   1 GiB        60 MiB
$ ceph df | grep -B1 rbd
POOL ID STORED OBJECTS USED   %USED MAX AVAIL
rbd   1 33 MiB  20 33 MiB 0  1013 GiB

After I create a big file, almost the entire image is shown as used:

$ dd if=/dev/zero of=/mnt/file bs=1M count=900
$ df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/rbd0   976M  903M  6.2M 100% /mnt
$ rbd du img
NAME  PROVISIONED  USED
img   1 GiB        956 MiB
$ ceph df | grep -B1 rbd
POOL ID STORED  OBJECTS USED%USED MAX AVAIL
rbd   1 933 MiB 248 933 MiB  0.09  1012 GiB

Now if I carefully punch out most of that file, leaving one page in
each megabyte, and run fstrim:

$ for ((i = 0; i < 900; i++)); do fallocate -p -n -o $((i * 2**20)) -l $((2**20 - 4096)) /mnt/file; done
$ sudo fstrim /mnt
$ df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/rbd0   976M  6.1M  903M   1% /mnt
$ rbd du img
NAME  PROVISIONED  USED
img   1 GiB        956 MiB
$ ceph df | grep -B1 rbd
POOL ID STORED OBJECTS USED   %USED MAX AVAIL
rbd   1 36 MiB 248 36 MiB 0  1013 GiB

You can see that df -h is back to ~6M, but "rbd du" USED remained
the same.  This is because "rbd du" is very coarse-grained: it works
at the object level and doesn't go any deeper.  If the number of
objects and their sizes remain the same, "rbd du" USED remains the
same.  It doesn't account for the sparseness I produced above.
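
For illustration, you can also count the image's backing data objects
directly and see that they are all still there after the fstrim (a sketch;
it assumes the pool is called "rbd" and relies on the block_name_prefix
reported by "rbd info"):

# count data objects belonging to the image (the header object is not matched)
$ rados -p rbd ls | grep -c "$(rbd info img --format json | jq -r .block_name_prefix)"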

"ceph df" goes down to the individual bluestore blobs, but only per
pool.  Looking at STORED, you can see that the space is back, even
though the number of objects remained the same.  Unfortunately, there
is no (fast) way to get the same information per image.

So what you see in the dashboard is basically "rbd du".  It is fast
to compute (especially when object map is enabled), but it shows you
the picture at the object level, not at the blob level.

Thanks,

Ilya


[ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-12 Thread Jelle de Jong

Hello everybody,

I have a three-node Ceph cluster made of E3-1220v3 CPUs, 24GB RAM, 6 HDD
OSDs per node with a 32GB Intel Optane NVMe journal, and 10Gb networking.


I wanted to move to BlueStore due to the dropping of FileStore support.
Our cluster was working fine with FileStore and we could take complete
nodes out for maintenance without issues.


root@ceph04:~# ceph osd pool get libvirt-pool size
size: 3
root@ceph04:~# ceph osd pool get libvirt-pool min_size
min_size: 2

I removed all OSDs from one node, zapping the OSD and journal devices; we
recreated the OSDs as BlueStore and used a small 5GB partition as the
RocksDB device instead of a journal for all OSDs.
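
For context, the per-OSD recreation was done with ceph-volume roughly along
these lines (a sketch with hypothetical device names, not the exact commands
used):

# /dev/sdb = data HDD, /dev/nvme0n1p1 = 5GB RocksDB partition (hypothetical)
$ ceph-volume lvm zap /dev/sdb
$ ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1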


I saw the cluster suffer with inactive PGs and slow requests.

I tried setting the following on all nodes, but it made no difference:
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_recovery_sleep 0.3'
systemctl restart ceph-osd.target

It took three days to recover and during this time clients were not 
responsive.


How can I migrate to BlueStore without inactive PGs or slow requests? I
have several more FileStore clusters and I would like to know how to
migrate them without inactive PGs and slow requests.


As a side question: I optimized our cluster for FileStore, and the Intel
Optane NVMe journals showed good results in fio dsync write tests. Does
BlueStore also use dsync writes for the RocksDB device, or can we select
NVMe devices on other specifications? My tests with FileStore showed that
the Optane NVMe SSD was faster than the Samsung 970 Pro NVMe SSD, and I
only need a few GB for FileStore journals, but with the BlueStore RocksDB
device the situation is different and I can't find documentation on how to
speed-test NVMe devices for BlueStore.
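
For reference, the kind of fio dsync write test I mean looks roughly like
this (a sketch; the device path and parameters are illustrative, and it
writes to the raw device, so only run it against an empty disk):

$ fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
      --name=journal-test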


Kind regards,

Jelle

root@ceph04:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
-1   60.04524 root default
-2   20.01263 host ceph04
 0   hdd  2.72899 osd.0   up  1.0 1.0
 1   hdd  2.72899 osd.1   up  1.0 1.0
 2   hdd  5.45799 osd.2   up  1.0 1.0
 3   hdd  2.72899 osd.3   up  1.0 1.0
14   hdd  3.63869 osd.14  up  1.0 1.0
15   hdd  2.72899 osd.15  up  1.0 1.0
-3   20.01263 host ceph05
 4   hdd  5.45799 osd.4   up  1.0 1.0
 5   hdd  2.72899 osd.5   up  1.0 1.0
 6   hdd  2.72899 osd.6   up  1.0 1.0
13   hdd  3.63869 osd.13  up  1.0 1.0
16   hdd  2.72899 osd.16  up  1.0 1.0
18   hdd  2.72899 osd.18  up  1.0 1.0
-4   20.01997 host ceph06
 8   hdd  5.45999 osd.8   up  1.0 1.0
 9   hdd  2.73000 osd.9   up  1.0 1.0
10   hdd  2.73000 osd.10  up  1.0 1.0
11   hdd  2.73000 osd.11  up  1.0 1.0
12   hdd  3.64000 osd.12  up  1.0 1.0
17   hdd  2.73000 osd.17  up  1.0 1.0


root@ceph04:~# ceph status
  cluster:
id: 85873cda-4865-4147-819d-8deda5345db5
health: HEALTH_WARN
18962/11801097 objects misplaced (0.161%)
1/3933699 objects unfound (0.000%)
Reduced data availability: 42 pgs inactive
Degraded data redundancy: 3645135/11801097 objects degraded 
(30.888%), 959 pgs degraded, 960 pgs undersized

110 slow requests are blocked > 32 sec. Implicated osds 3,10,11

  services:
mon: 3 daemons, quorum ceph04,ceph05,ceph06
mgr: ceph04(active), standbys: ceph06, ceph05
osd: 18 osds: 18 up, 18 in; 964 remapped pgs

  data:
pools:   1 pools, 1024 pgs
objects: 3.93M objects, 15.0TiB
usage:   31.2TiB used, 28.8TiB / 60.0TiB avail
pgs: 4.102% pgs not active
 3645135/11801097 objects degraded (30.888%)
 18962/11801097 objects misplaced (0.161%)
 1/3933699 objects unfound (0.000%)
 913 active+undersized+degraded+remapped+backfill_wait
 60  active+clean
 41  activating+undersized+degraded+remapped
 4   active+remapped+backfill_wait
 4   active+undersized+degraded+remapped+backfilling
 1   undersized+degraded+remapped+backfilling+peered
 1   active+recovery_wait+undersized+remapped

  io:
recovery: 197MiB/s, 49objects/s


root@ceph04:~# ceph health detail
HEALTH_WARN 18962/11801097 objects misplaced (0.161%); 1/3933699 objects 
unfound (0.000%); Reduced data availability: 42 pgs inactive; Degraded 
data redundancy: 3643636/11801097 objects degraded (30.875%), 959 pgs 
degraded, 960 pgs undersized; 110 slow requests are blocked > 32 sec. 
Implicated osds 3,10,11

OBJECT_MISPLACED 18962/11801097 objects misplaced (0.161%)
OBJECT_UNFOUND 1/3933699 objects unfound (0.000%)
pg 3.361 has 1 unfound objects
PG_AVAILABILITY Reduced data availability: 42 pgs i

Re: [ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-12 Thread Bryan Stillwell
Jelle,

Try putting just the WAL on the Optane NVMe.  I'm guessing your DB is too big 
to fit within 5GB.  We used a 5GB journal on our nodes as well, but when we 
switched to BlueStore (using ceph-volume lvm batch) it created 37GiB logical 
volumes (200GB SSD / 5 OSDs, or 400GB SSD / 10 OSDs) for our DBs.
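
In other words, per OSD something along these lines, leaving the DB on the
HDD and putting only the WAL on the Optane (a sketch with hypothetical device
names):

$ ceph-volume lvm create --bluestore --data /dev/sdb --block.wal /dev/nvme0n1p1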

Also, injecting those settings into the cluster will only work until the OSD is 
restarted.  You'll need to add them to ceph.conf to be persistent.
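
A sketch of the equivalent ceph.conf entries (mirroring the injectargs values
above; the OSDs pick them up on restart):

[osd]
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
osd_recovery_sleep = 0.3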

Bryan

> On Dec 12, 2019, at 3:40 PM, Jelle de Jong  wrote:
> 
> I removed all OSDs from one node, zapping the OSD and journal devices; we
> recreated the OSDs as BlueStore and used a small 5GB partition as the
> RocksDB device instead of a journal for all OSDs.
> [...]

[ceph-users] PG Balancer Upmap mode not working

2019-12-12 Thread Philippe D'Anjou
@Wido Den Hollander 
Regarding the amount of PGs, and I quote from the docs:
"If you have more than 50 OSDs, we recommend approximately 50-100 placement
groups per OSD to balance out resource usage, data durability and
distribution."
(https://docs.ceph.com/docs/master/rados/operations/placement-groups/)





