Hi Muhammad,

Yes, that tool helps! Thank you for pointing it out!

With a combination of openSeaChest_Info and smartctl I was able to extract the following stats from our cluster, and the numbers are very surprising to me. I hope someone here can explain what we see below:
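
For reference, this is roughly how the read/written totals were collected. It is a minimal sketch, not the exact script: it assumes drives that expose the ATA device-statistics log via "smartctl -l devstat" and that use 512-byte logical sectors, and the device list is just an example (openSeaChest_Info reports the annualized workload rate directly on our drives):

#!/usr/bin/env python3
# Sketch: pull lifetime read/written totals and power-on hours via smartctl.
# Assumes the drive supports the ATA device-statistics log ("smartctl -l devstat")
# and 512-byte logical sectors; adjust the device list for your own hosts.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # example only

def devstat_counter(output: str, description: str) -> int:
    """Return the raw counter whose description matches, e.g. 'Logical Sectors Written'."""
    for line in output.splitlines():
        if line.strip().endswith(description):
            return int(line.split()[3])  # the value is the 4th column of the devstat output
    raise KeyError(description)

for dev in DEVICES:
    out = subprocess.run(["smartctl", "-l", "devstat", dev],
                         capture_output=True, text=True).stdout
    read_tb = devstat_counter(out, "Logical Sectors Read") * 512 / 1e12
    written_tb = devstat_counter(out, "Logical Sectors Written") * 512 / 1e12
    hours = devstat_counter(out, "Power-on Hours")
    print(f"{dev}: read {read_tb:.2f} TB, written {written_tb:.2f} TB, {hours} power-on hours")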

node1   AnnualWrkload   Read    Written         Power On Hours  
osd0     93.14          318.79    19.48         31815.65        
osd1     94.38          322.67    20.11         31815.42        
osd2     41.08           38.95    11.33         10722.47        new disk
osd3     94.56          323.98    19.45         31815.35        
osd12   124.20          340.11    20.09         25406.73        
osd13   112.43          308.18    17.88         25405.72        
osd14   120.67          330.96    19.01         25405.65        
osd15   105.59          287.78    18.45         25405.90        
ssd journal               0.46  1643.58         31813.00        
                                        
node2                                   
osd4    697.75          2390     151.23         31864.88        (2.39PB)
osd5    677.74          2320     144.94         31864.68        (2.32PB)
osd6    687.13          2340     157.11         31865.05        (2.34PB)
osd7    619.19          2100     151.08         31864.67        (2.10PB)
osd16   827.57          2260     142.81         25405.93        (2.26PB)
osd17   996.03          2720     167.97         25405.87        (2.72PB)
osd18   809.36          2210     137.96         25405.82        (2.21PB)
osd19   844.06          2300     146.84         25405.90        (2.30PB)
ssd journal             0.46    1637.60         31862.00        
                                        
node3                                   
osd8     75.30          258.79    14.67         31813.67        
osd9     77.30          264.87    15.85         31813.68        
osd10    82.32          282.43    16.53         31813.60        
osd11    82.26          282.72    16.01         31813.73        
osd20    96.86          265.25    15.65         25404.37        
osd21    93.18          256.11    14.12         25404.22        
osd22   108.43          298.29    16.15         25404.23        
osd23    30.80           33.61    10.78         12625.07        new disk
ssd journal               0.46  1644.83         31811.00        
AnnualWrkload = Annualized Workload Rate (TB/year)
Read = Total Bytes Read (TB)
Written = Total Bytes Written (TB)
Power On Hours = total hours the drive has been powered on
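
As a sanity check on the legend: the AnnualWrkload column appears to simply be (Read + Written) / Power On Hours, scaled to a full year of 8760 hours. That is my inference from the numbers, not something documented by the tools, but it reproduces the table values, e.g.:

# AnnualWrkload looks like (Read + Written) / Power On Hours * 8760 (hours/year).
def annual_workload_tb(read_tb, written_tb, power_on_hours):
    return (read_tb + written_tb) / power_on_hours * 8760

print(annual_workload_tb(318.79, 19.48, 31815.65))   # osd0: ~93.1, table says 93.14
print(annual_workload_tb(2390.0, 151.23, 31864.88))  # osd4: ~699, table says 697.75 (Read rounded from 2.39 PB)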

From the numbers above, it seems the OSDs on node2 are used FAR more heavily than those on the other two nodes. The drives on node2 even report their totals in PB, while the other nodes report in TB (we converted the PB values to TB using https://www.gbmb.org/pb-to-tb, to make sure there are no conversion errors).

However, SSD journal usage across the three nodes looks similar.

All OSDs have the same weight:
root@node2:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME     STATUS REWEIGHT PRI-AFF
-1       87.35376 root default
-2       29.11688     host pm1
 0   hdd  3.64000         osd.0      up  1.00000 1.00000
 1   hdd  3.64000         osd.1      up  1.00000 1.00000
 2   hdd  3.63689         osd.2      up  1.00000 1.00000
 3   hdd  3.64000         osd.3      up  1.00000 1.00000
12   hdd  3.64000         osd.12     up  1.00000 1.00000
13   hdd  3.64000         osd.13     up  1.00000 1.00000
14   hdd  3.64000         osd.14     up  1.00000 1.00000
15   hdd  3.64000         osd.15     up  1.00000 1.00000
-3       29.12000     host pm2
 4   hdd  3.64000         osd.4      up  1.00000 1.00000
 5   hdd  3.64000         osd.5      up  1.00000 1.00000
 6   hdd  3.64000         osd.6      up  1.00000 1.00000
 7   hdd  3.64000         osd.7      up  1.00000 1.00000
16   hdd  3.64000         osd.16     up  1.00000 1.00000
17   hdd  3.64000         osd.17     up  1.00000 1.00000
18   hdd  3.64000         osd.18     up  1.00000 1.00000
19   hdd  3.64000         osd.19     up  1.00000 1.00000
-4       29.11688     host pm3
 8   hdd  3.64000         osd.8      up  1.00000 1.00000
 9   hdd  3.64000         osd.9      up  1.00000 1.00000
10   hdd  3.64000         osd.10     up  1.00000 1.00000
11   hdd  3.64000         osd.11     up  1.00000 1.00000
20   hdd  3.64000         osd.20     up  1.00000 1.00000
21   hdd  3.64000         osd.21     up  1.00000 1.00000
22   hdd  3.64000         osd.22     up  1.00000 1.00000
23   hdd  3.63689         osd.23     up  1.00000 1.00000

Disk usage also looks ok:
root@pm2:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
 0   hdd 3.64000  1.00000 3.64TiB 2.01TiB 1.62TiB 55.34 0.98 137
 1   hdd 3.64000  1.00000 3.64TiB 2.09TiB 1.54TiB 57.56 1.02 141
 2   hdd 3.63689  1.00000 3.64TiB 1.92TiB 1.72TiB 52.79 0.94 128
 3   hdd 3.64000  1.00000 3.64TiB 2.07TiB 1.57TiB 56.90 1.01 143
12   hdd 3.64000  1.00000 3.64TiB 2.15TiB 1.48TiB 59.18 1.05 138
13   hdd 3.64000  1.00000 3.64TiB 1.99TiB 1.64TiB 54.80 0.97 131
14   hdd 3.64000  1.00000 3.64TiB 1.93TiB 1.70TiB 53.13 0.94 127
15   hdd 3.64000  1.00000 3.64TiB 2.19TiB 1.45TiB 60.10 1.07 143
 4   hdd 3.64000  1.00000 3.64TiB 2.11TiB 1.53TiB 57.97 1.03 142
 5   hdd 3.64000  1.00000 3.64TiB 1.97TiB 1.67TiB 54.11 0.96 134
 6   hdd 3.64000  1.00000 3.64TiB 2.12TiB 1.51TiB 58.40 1.04 142
 7   hdd 3.64000  1.00000 3.64TiB 1.97TiB 1.66TiB 54.28 0.97 128
16   hdd 3.64000  1.00000 3.64TiB 2.00TiB 1.64TiB 54.90 0.98 133
17   hdd 3.64000  1.00000 3.64TiB 2.33TiB 1.30TiB 64.14 1.14 153
18   hdd 3.64000  1.00000 3.64TiB 1.97TiB 1.67TiB 54.07 0.96 132
19   hdd 3.64000  1.00000 3.64TiB 1.89TiB 1.75TiB 51.94 0.92 124
 8   hdd 3.64000  1.00000 3.64TiB 1.79TiB 1.85TiB 49.24 0.88 123
 9   hdd 3.64000  1.00000 3.64TiB 2.17TiB 1.46TiB 59.72 1.06 144
10   hdd 3.64000  1.00000 3.64TiB 2.40TiB 1.24TiB 65.88 1.17 157
11   hdd 3.64000  1.00000 3.64TiB 2.06TiB 1.58TiB 56.64 1.01 133
20   hdd 3.64000  1.00000 3.64TiB 2.19TiB 1.45TiB 60.23 1.07 148
21   hdd 3.64000  1.00000 3.64TiB 1.74TiB 1.90TiB 47.80 0.85 115
22   hdd 3.64000  1.00000 3.64TiB 2.05TiB 1.59TiB 56.27 1.00 138
23   hdd 3.63689  1.00000 3.64TiB 1.96TiB 1.67TiB 54.01 0.96 130
                    TOTAL 87.3TiB 49.1TiB 38.2TiB 56.23
MIN/MAX VAR: 0.85/1.17  STDDEV: 4.08

The cluster is HEALTH_OK and seems to be working fine.

When comparing "iostat -x 1" output between node2 and the other two nodes, we see similar %util for the OSD disks on all nodes.
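
For completeness, the %util comparison was done roughly as follows: a small sketch that averages %util per disk over a handful of one-second samples, assuming a sysstat iostat where %util is the last column and the OSD disks are the sdX devices.

#!/usr/bin/env python3
# Sketch: average %util per disk over a few 1-second iostat samples.
# The first report (averages since boot) is included here for simplicity.
import subprocess
from collections import defaultdict

out = subprocess.run(["iostat", "-dx", "1", "10"], capture_output=True, text=True).stdout

samples = defaultdict(list)
for line in out.splitlines():
    fields = line.split()
    if fields and fields[0].startswith("sd"):         # only the sdX OSD disks
        samples[fields[0]].append(float(fields[-1]))  # %util is the last column

for dev, utils in sorted(samples.items()):
    print(f"{dev}: avg %util {sum(utils) / len(utils):.1f} over {len(utils)} samples")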

How can the reported disk stats for node2 be SO different from those of the other two nodes, while in every other respect the cluster seems to be running as it should?

Or are we missing something?

Thanks!

MJ