[ceph-users] Re: 3 node CEPH PVE hyper-converged cluster serious fragmentation and performance loss in matter of days.

2022-03-18 Thread Igor Fedotov



On 3/10/2022 6:10 PM, Sasa Glumac wrote:



> In this respect could you please try to switch bluestore and bluefs
> allocators to bitmap and run some smoke benchmarking again.
Can I change this on a live server (is there any possibility of losing data,
etc.)? Can you please share the correct procedure.



To change the allocator for a given osd.N, one should run:

ceph config set osd.N bluestore_allocator bitmap

and restart the OSD.

I'm unaware of any issues with such a switch...

Alternatively/additionally you might want to try the stupid allocator as well.
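For example, to cover both the bluestore and bluefs allocators on one OSD and
then restart it, something along these lines should do (a minimal sketch; the
systemd unit name assumes a standard ceph-osd deployment such as the one PVE
ships, and N is a placeholder for the OSD id):

ceph config set osd.N bluestore_allocator bitmap
ceph config set osd.N bluefs_allocator bitmap
systemctl restart ceph-osd@N      # run on the host that carries osd.N

To revert, set both options back to hybrid (the Octopus default) or drop them
with "ceph config rm" and restart the OSD again.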



> Additionally you might want to upgrade to 15.2.16 which includes a bunch
> of improvements for Avl/Hybrid allocators tail latency numbers as per
> the ticket above.
At the moment we use the PVE repository, where 15.2.15 is the latest. I will
need to either wait for .16 from them or create a second cluster without
Proxmox, but I would like to test on the existing one.
Is there any difference between PVE Ceph and the regular packages, so that I
can change the repo and install over the existing installation?

Sorry, I don't know.

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 3 node CEPH PVE hyper-converged cluster serious fragmentation and performance loss in matter of days.

2022-03-10 Thread Sasa Glumac
> First of all I'd like to clarify what exact command you are using to
> assess the fragmentation. There are two options: "bluestore allocator
> score" and "bluestore allocator fragmentation".
I am using this one: "ceph daemon osd.$i bluestore allocator score block"

> Both are not very accurate, but it would be interesting to have
> both numbers for the case with presumably high fragmentation.
Here are the numbers from a single server to keep the email shorter, but the
scores on the other 2 nodes are almost identical. I recreated OSDs 0 and 1
46h ago and they were perfect; now they are already extremely slow and fragmented:
for i in 5 9 0 1 ; do echo $i ; ceph daemon osd.$i bluestore allocator
score block ; done
>
> 5
> {
> "fragmentation_rating": 0.29451514185657074
> }
> 9
> {
> "fragmentation_rating": 0.29940778224909959
> }
> 0
> {
> "fragmentation_rating": 0.84247390671066713
> }
> 1
> {
> "fragmentation_rating": 0.78098161172652247
> }


for i in 5 9 0 1 ; do echo $i ; ceph daemon osd.$i bluestore allocator
fragmentation block ; done

> 5
> {
> "fragmentation_rating": 0.0055253213950322861
> }
> 9
> {
> "fragmentation_rating": 0.0053455960516075665
> }
> 0
> {
> "fragmentation_rating": 0.014439265895713198
> }
> 1
> {
> "fragmentation_rating": 0.013245320572893494
> }
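For completeness, both metrics can be pulled for every local OSD in one pass
via the admin sockets, e.g. (a small helper sketch; it assumes the default
socket location /var/run/ceph/ on each node):

for sock in /var/run/ceph/ceph-osd.*.asok ; do
    osd=$(basename "$sock" .asok)          # e.g. ceph-osd.0
    echo "== $osd =="
    ceph daemon "$sock" bluestore allocator score block
    ceph daemon "$sock" bluestore allocator fragmentation block
done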


> In this respect could you please try to switch bluestore and bluefs
> allocators to bitmap and run some smoke benchmarking again.
Can I change this on a live server (is there any possibility of losing data,
etc.)? Can you please share the correct procedure.


> Additionally you might want to upgrade to 15.2.16 which includes a bunch
> of improvements for Avl/Hybrid allocators tail latency numbers as per
> the ticket above.
At the moment we use the PVE repository, where 15.2.15 is the latest. I will
need to either wait for .16 from them or create a second cluster without
Proxmox, but I would like to test on the existing one.
Is there any difference between PVE Ceph and the regular packages, so that I
can change the repo and install over the existing installation?

> And finally it would be great to get bluestore performance counters for
> both good and bad benchmarks. This can be obtained via: ceph tell osd.N
> perf dump bluestore
>
> but please reset the counters before each benchmarking with: ceph tell
> osd.N perf reset all
DATEBENCH=$(date +"%Y-%m-%d-%H-%M-%S") && \
ceph tell osd.0 perf reset all && \
ceph tell osd.0 bench \
  >> /root/ceph_osd_bench_results/$DATEBENCH-perf-dump-bluestore-osd-0-and-bench-fragmented.log && \
ceph tell osd.0 perf dump bluestore \
  >> /root/ceph_osd_bench_results/$DATEBENCH-perf-dump-bluestore-osd-0-and-bench-fragmented.log

{
> "bytes_written": 1073741824,
> "blocksize": 4194304,
> "elapsed_sec": 1.140284746999,
> "bytes_per_sec": 941643591.06348729,
> "iops": 224.50532700144942
> }
> {
> "bluestore": {
> "kv_flush_lat": {
> "avgcount": 142,
> "sum": 0.000820554,
> "avgtime": 0.05778
> },
> "kv_commit_lat": {
> "avgcount": 142,
> "sum": 1.208369108,
> "avgtime": 0.008509641
> },
> "kv_sync_lat": {
> "avgcount": 142,
> "sum": 1.209189662,
> "avgtime": 0.008515420
> },
> "kv_final_lat": {
> "avgcount": 141,
> "sum": 0.044558120,
> "avgtime": 0.000316015
> },
> "state_prepare_lat": {
> "avgcount": 407,
> "sum": 1.443276139,
> "avgtime": 0.003546133
> },
> "state_aio_wait_lat": {
> "avgcount": 407,
> "sum": 12.148961431,
> "avgtime": 0.029850028
> },
> "state_io_done_lat": {
> "avgcount": 407,
> "sum": 0.009644771,
> "avgtime": 0.23697
> },
> "state_kv_queued_lat": {
> "avgcount": 407,
> "sum": 5.441919173,
> "avgtime": 0.013370808
> },
> "state_kv_commiting_lat": {
> "avgcount": 407,
> "sum": 8.541078753,
> "avgtime": 0.020985451
> },
> "state_kv_done_lat": {
> "avgcount": 407,
> "sum": 0.000117127,
> "avgtime": 0.00287
> },
> "state_deferred_queued_lat": {
> "avgcount": 0,
> "sum": 0.0,
> "avgtime": 0.0
> },
> "state_deferred_aio_wait_lat": {
> "avgcount": 0,
> "sum": 0.0,
> "avgtime": 0.0
> },
> "state_deferred_cleanup_lat": {
> "avgcount": 0,
> "sum": 0.0,
> "avgtime": 0.0
> },
> "state_finishing_lat": {
> "avgcount": 407,
> "sum": 0.41350,
> "avgtime": 0.00101
> },
> "state_done_lat": {
> "avgcount": 407,
> "sum": 0.033037493,
> "avgtim

[ceph-users] Re: 3 node CEPH PVE hyper-converged cluster serious fragmentation and performance loss in matter of days.

2022-03-10 Thread Igor Fedotov

Hi Sasa,

Just a few thoughts/questions on your issue in an attempt to understand
what's happening.


First of all I'd like to clarify what exact command you are using to
assess the fragmentation. There are two options: "bluestore allocator
score" and "bluestore allocator fragmentation".


Both are not very accurate, but it would be interesting to have
both numbers for the case with presumably high fragmentation.
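Concretely, for osd.N these are the two admin-socket calls (the "block"
argument selects the main device allocator, as in the commands Sasa already
posted above):

ceph daemon osd.N bluestore allocator score block          # "score" rating
ceph daemon osd.N bluestore allocator fragmentation block  # "fragmentation" rating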



Secondly, I can imagine two performance issues when writing to an
all-flash OSD under heavy fragmentation:


1) Bluestore Allocator takes too long to allocate a new block.

2) Bluestore has to issue a large number of disk write requests to process a
single 4M user write, which might be less efficient.


I've never seen the latter be a significant issue when SSDs are in
use (it definitely is for spinners).


But I recall we've seen some issues with 1), e.g. 
https://tracker.ceph.com/issues/52804


In this respect could you please try to switch bluestore and bluefs 
allocators to bitmap and run some smoke benchmarking again.


Additionally you might want to upgrade to 15.2.16 which includes a bunch 
of improvements for Avl/Hybrid allocators tail latency numbers as per 
the ticket above.



And finally it would be great to get bluestore performance counters for 
both good and bad benchmarks. This can be obtained via: ceph tell osd.N 
perf dump bluestore


but please reset the counters before each benchmarking with: ceph tell 
osd.N perf reset all
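A convenient way to compare runs is to keep only the latency sections of the
dump, e.g. (a sketch; it assumes jq is installed and osd.0 is the OSD under
test, and it simply picks the counters visible in the perf dump output quoted
in this thread):

ceph tell osd.0 perf reset all
ceph tell osd.0 bench
ceph tell osd.0 perf dump bluestore | jq '.bluestore
    | {kv_flush_lat, kv_commit_lat, kv_sync_lat, state_prepare_lat,
       state_aio_wait_lat, state_kv_queued_lat, state_kv_commiting_lat}'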



Thanks,

Igor

On 3/8/2022 12:50 PM, Sasa Glumac wrote:

Proxmox = 6.4-8

CEPH =  15.2.15

Nodes = 3

Network = 2x100G / node

Disk = nvme Samsung PM-1733 MZWLJ3T8HBLS 4TB

 nvme Samsung PM-1733 MZWLJ1T9HBJR  2TB

CPU = EPYC 7252

CEPH pools = 2 separate pools, one for each disk type, and each disk split into 2
OSDs

Replica = 3


The VMs don't do many writes, and I migrated the main testing VMs to the 2TB
pool, which in turn fragments faster.
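For reference, the nvme/ssd2n split described above is presumably built with
per-device-class CRUSH rules along these lines (a sketch, not the exact
commands used here; the rule and pool names, PG counts and the ssd2n class
assignment are assumptions):

# tag the 2TB OSDs with the custom device class (repeat per OSD)
ceph osd crush rm-device-class osd.0
ceph osd crush set-device-class ssd2n osd.0

# one replicated rule per device class, host as the failure domain
ceph osd crush rule create-replicated rule-nvme  default host nvme
ceph osd crush rule create-replicated rule-ssd2n default host ssd2n

# size-3 pools pinned to those rules
ceph osd pool create pool-nvme  128 128 replicated rule-nvme
ceph osd pool create pool-ssd2n 128 128 replicated rule-ssd2n
ceph osd pool set pool-nvme  size 3
ceph osd pool set pool-ssd2n size 3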


[SPOILER="ceph osd df"]

[CODE]ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 3   nvme   1.74660   1.0      1.7 TiB  432 GiB  431 GiB  4.3 MiB  1.3 GiB  1.3 TiB  24.18  0.90  186  up
10   nvme   1.74660   1.0      1.7 TiB  382 GiB  381 GiB  599 KiB  1.4 GiB  1.4 TiB  21.38  0.79  151  up
 7   ssd2n  0.87329   1.0      894 GiB  279 GiB  278 GiB  2.0 MiB  1.2 GiB  615 GiB  31.19  1.16  113  up
 8   ssd2n  0.87329   1.0      894 GiB  351 GiB  349 GiB  5.8 MiB  1.2 GiB  544 GiB  39.22  1.46  143  up
 4   nvme   1.74660   1.0      1.7 TiB  427 GiB  425 GiB  9.6 MiB  1.4 GiB  1.3 TiB  23.85  0.89  180  up
11   nvme   1.74660   1.0      1.7 TiB  388 GiB  387 GiB  3.5 MiB  1.5 GiB  1.4 TiB  21.72  0.81  157  up
 2   ssd2n  0.87329   1.0      894 GiB  297 GiB  296 GiB  4.1 MiB  1.1 GiB  598 GiB  33.18  1.23  121  up
 6   ssd2n  0.87329   1.0      894 GiB  333 GiB  332 GiB  8.6 MiB  1.2 GiB  561 GiB  37.23  1.38  135  up
 5   nvme   1.74660   1.0      1.7 TiB  415 GiB  413 GiB  5.9 MiB  1.3 GiB  1.3 TiB  23.18  0.86  176  up
 9   nvme   1.74660   1.0      1.7 TiB  400 GiB  399 GiB  4.3 MiB  1.7 GiB  1.4 TiB  22.38  0.83  161  up
 0   ssd2n  0.87329   1.0      894 GiB  332 GiB  330 GiB  4.3 MiB  1.3 GiB  563 GiB  37.07  1.38  135  up
 1   ssd2n  0.87329   1.0      894 GiB  298 GiB  297 GiB  1.7 MiB  1.3 GiB  596 GiB  33.35  1.24  121  up
                      TOTAL    16 TiB   4.2 TiB  4.2 TiB   55 MiB   16 GiB   11 TiB   26.92

MIN/MAX VAR: 0.79/1.46  STDDEV: 6.88[/CODE]

[/SPOILER]


[SPOILER="ceph osd crush tree"]

[CODE]ID   CLASS  WEIGHTTYPE NAME

-12  ssd2n   5.23975  root default~ssd2n

  -9  ssd2n   1.74658  host pmx-s01~ssd2n

   7  ssd2n   0.87329  osd.7

   8  ssd2n   0.87329  osd.8

-10  ssd2n   1.74658  host pmx-s02~ssd2n

   2  ssd2n   0.87329  osd.2

   6  ssd2n   0.87329  osd.6

-11  ssd2n   1.74658  host pmx-s03~ssd2n

   0  ssd2n   0.87329  osd.0

   1  ssd2n   0.87329  osd.1

  -2   nvme  10.47958  root default~nvme

  -4   nvme   3.49319  host pmx-s01~nvme

   3   nvme   1.74660  osd.3

  10   nvme   1.74660  osd.10

  -6   nvme   3.49319  host pmx-s02~nvme

   4   nvme   1.74660  osd.4

  11   nvme   1.74660  osd.11

  -8   nvme   3.49319  host pmx-s03~nvme

   5   nvme   1.74660  osd.5

   9   nvme   1.74660  osd.9

  -1 15.71933  root default

  -3  5.23978  host pmx-s01

   3   nvme   1.74660  osd.3

  10   nvme   1.74660  osd.10

   7  ssd2n   0.87329  osd.7

   8  ssd2n   0.87329  osd.8

  -5  5.23978  host pmx-s02

   4   nvme   1.74660  osd.4

  11   nvme   1.74660  osd.11

   2  ssd2n   0.87329  osd.2

   6  ssd2n   0.87329  osd.6

  -7  5.23978  host pmx-s03

   5   nvme   1.74660  osd.5

   9   nvme   1.74660  osd.9

   0  ssd2n   0.87329   

[ceph-users] Re: 3 node CEPH PVE hyper-converged cluster serious fragmentation and performance loss in matter of days.

2022-03-08 Thread Sasa Glumac
Rados bench before deleting the OSDs, recreating them and resyncing, i.e.
with a fragmentation rating of 0.89.
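The three tests below presumably correspond to rados bench invocations along
these lines (a sketch; the pool name and the durations are assumptions
inferred from the T1/T2/T3 labels and the "Total time run" values):

rados bench -p <pool> 60 write --no-cleanup   # T1: 4M writes (4M is the default block size)
rados bench -p <pool> 60 seq                  # T2: sequential reads of the objects left behind
rados bench -p <pool> 600 rand                # T3: random reads
rados -p <pool> cleanup                       # remove the benchmark objects afterwards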

>  T1 - wr,4M
> Total time run 60.0405
> Total writes made  9997
> Write size 4194304
> Object size4194304
> Bandwidth (MB/sec) 666.017
> Stddev Bandwidth   24.1108
> Max bandwidth (MB/sec) 744
> Min bandwidth (MB/sec) 604
> Average IOPS   166
> Stddev IOPS6.02769
> Max IOPS   186
> Min IOPS   151
> Average Latency(s) 0.0960791
> Stddev Latency(s)  0.0182781
> Max latency(s) 0.190993
> Min latency(s) 0.0284014


 T2 = ro,seq,4M
> Total time run 250.486
> Total reads made 9997
> Read size 4194304
> Object size 4194304
> Bandwidth (MB/sec) 1596.41
> Average IOPS 399
> Stddev IOPS 252.166
> Max IOPS 446
> Min IOPS 350
> Average Latency(s)   0.0395391
> Max latency(s)   0.187176
> Min latency(s)   0.0056981


 T3 = ro,rand,4M
> Total time run 600.463
> Total reads made 23.947
> Read size 4194304
> Object size 4194304
> Bandwidth (MB/sec) 1595.24
> Average IOPS 398
> Stddev IOPS 261.614
> Max IOPS 446
> Min IOPS 341
> Average Latency(s)   0.0395782
> Max latency(s)   0.17207
> Min latency(s)   0.00326339



Rados bench after recreating and resyncing, with a fragmentation rating of 0.1:

 T1
> Total time run 60.0143
> Total writes made  30868
> Write size 4194304
> Object size4194304
> Bandwidth (MB/sec) 2057.38
> Stddev Bandwidth 121.141
> Max bandwidth (MB/sec) 2208
> Min bandwidth (MB/sec) 1472
> Average IOPS   514
> Stddev IOPS30.2852
> Max IOPS   552
> Min IOPS   368
> Average Latency(s) 0.0310978
> Stddev Latency(s)  0.0120903
> Max latency(s) 0.127144
> Min latency(s) 0.00719787


 T2
> Total time run   51.9554
> Total reads made 30868
> Read size4194304
> Object size  4194304
> Bandwidth (MB/sec) 2376.5
> Average IOPS 594
> Stddev IOPS  27.1142
> Max IOPS 641
> Min IOPS 543
> Average Latency(s)   0.026446
> Max latency(s)   0.120386
> Min latency(s)   0.00436071


T3
> Total time run   60.0455
> Total reads made 33853
> Read size4194304
> Object size  4194304
> Bandwidth (MB/sec) 2255.16
> Average IOPS 563
> Stddev IOPS  23.7633
> Max IOPS 616
> Min IOPS 500
> Average Latency(s)   0.0278983
> Max latency(s)   0.13513
> Min latency(s)   0.00267677


> To me this looks like normal sequential write performance to an ssd.

This is the normal write performance for these OSDs when they are not
fragmented and already have data. osd bench results:

> Server  disk  osd     MB/s  IOPS
> s01     2TB   osd.7   2587   616
> s01     2TB   osd.8   2566   611
> s02     2TB   osd.2   2611   622
> s02     2TB   osd.6   2555   609
> s03     2TB   osd.0   2519   600
> s03     2TB   osd.1   2552   608
> s01     4TB   osd.3   3319   791
> s01     4TB   osd.10  4054   966
> s02     4TB   osd.4   3884   926
> s02     4TB   osd.11  3931   937
> s03     4TB   osd.5   3797   905
> s03     4TB   osd.9   3701   882
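These per-OSD numbers can be reproduced with the built-in OSD bench, e.g.
(a sketch; it simply runs the same "ceph tell osd.N bench" used above for
every OSD in the cluster):

for i in $(ceph osd ls) ; do
    echo "osd.$i"
    ceph tell osd.$i bench        # 1 GiB of 4M writes by default
done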




and this is when they are fragmented; the first 3 rows are OSDs on the 2TB
drives and the next 3 are on the 4TB drives:

MB/s IOPS
> 455 108
> 449 107
> 533 127
> 846 201
> 825 196
> 795 189


> I am curious what makes you think this is related to the
'fragmentation_rating'

I did hundreds of fio tests, rados and osd benches, etc., and recreated the
OSDs many times with different numbers of OSDs, PGs, etc. The only thing that
is constant is this fragmentation, which sets in after a few days of light
use, and performance in all the tests follows it.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io