[ceph-users] Re: Unbalanced data distribution

2019-10-22 Thread Konstantin Shalygin

On 10/22/19 7:52 PM, Thomas wrote:

Node 1
48x 1.6TB
Node 2
48x 1.6TB
Node 3
48x 1.6TB
Node 4
48x 1.6TB
Node 5
48x 7.2TB
Node 6
48x 7.2TB
Node 7
48x 7.2TB


I suggest balancing the disks across hosts, e.g. ~28x 1.6TB + 20x 7.2TB per host.


Why is the data distribution on the 1.6TB disks unequal?
How can I correct this?
The balancer in upmap mode works per pool. I guess some of your 1.6TB
OSDs don't serve some of the pools.
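A quick way to check which pools actually map onto a given OSD (osd.76 here is
only a placeholder id, taken from later in this thread):

ceph balancer status            # confirm the balancer mode is upmap and it is active
ceph osd pool ls detail         # pool id, name and crush_rule for every pool
ceph pg ls-by-osd osd.76        # PGs on that OSD; the number before the dot is the pool id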




k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced data distribution

2019-10-22 Thread Anthony D'Atri
I agree wrt making the node weights uniform.

When mixing drive sizes, be careful that the larger ones don't run afoul of the
PG max: they will receive more PGs than the smaller ones, and if you lose a
node, that might be enough to send some over the max. Run `ceph osd df` and look
at the PG counts.
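For example (the exact columns of `ceph osd df` vary a bit between releases, but
there is a per-OSD PGS column):

ceph osd df          # per-OSD size, utilisation and PG count (PGS column)
ceph osd df tree     # the same, grouped by CRUSH bucket / host

Compare the highest counts against the mon_max_pg_per_osd limit of your release.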

This can also degrade performance, since IO is not spread uniformly. Adjusting
primary affinity can mitigate this somewhat.
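For reference, primary affinity is a per-OSD setting; a minimal sketch (osd.76
is only a placeholder id from this thread):

ceph osd primary-affinity osd.76 0.5   # 1.0 = default, 0 = never act as primary for its PGs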

> On Oct 22, 2019, at 8:26 PM, Konstantin Shalygin  wrote:
> 
> On 10/22/19 7:52 PM, Thomas wrote:
>> Node 1
>> 48x 1.6TB
>> Node 2
>> 48x 1.6TB
>> Node 3
>> 48x 1.6TB
>> Node 4
>> 48x 1.6TB
>> Node 5
>> 48x 7.2TB
>> Node 6
>> 48x 7.2TB
>> Node 7
>> 48x 7.2TB
> 
> I suggest balancing the disks across hosts, e.g. ~28x 1.6TB + 20x 7.2TB per host.
> 
>> Why is the data distribution on the 1.6TB disks unequal?
>> How can I correct this?
> The balancer in upmap mode works per pool. I guess some of your 1.6TB OSDs
> don't serve some of the pools.
> 
> 
> 
> k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced data distribution

2019-10-22 Thread Thomas Schneider
The number of PGs on the 7.2TB disks is 120 on average, and on the 1.6TB disks
it is 35 on average.
That is a difference of roughly a factor of 3-4.
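(That roughly matches the CRUSH weights: PGs are placed approximately in
proportion to weight, and 7.2 / 1.6 = 4.5, so I would expect the larger disks
to carry about 4-5 times as many PGs.)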

However, I don't understand why this should explain the unbalanced data
distribution on the 1.6TB disks only (the 7.2TB disks are balanced).
Also, all of these disks are defined to serve one and the same pool by a
suitable CRUSH map configuration; every other pool is served by
different disks.
Here's an example for one node; the other 6 nodes are similar:
host ld5505-hdd_strgbox {
        id -16              # do not change unnecessarily
        id -18 class hdd    # do not change unnecessarily
        id -20 class nvme   # do not change unnecessarily
        id -49 class ssd    # do not change unnecessarily
        # weight 78.720
        alg straw2
        hash 0              # rjenkins1
        item osd.76 weight 1.640
        item osd.77 weight 1.640
        item osd.78 weight 1.640
        [...]
        item osd.97 weight 1.640
        item osd.102 weight 1.640
        item osd.110 weight 1.640
}
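(For reference, this text form was obtained by dumping and decompiling the
CRUSH map, e.g.:)

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt   # decompile to the text form shown above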

In addition, I don't understand why distributing the disks equally over
all nodes should solve the issue.
My understanding is that Ceph's algorithm should be smart enough to
determine where each object should be placed and to ensure balanced
utilisation.
I agree that losing a node with 7.2TB disks would have a major impact,
though.



Am 23.10.2019 um 06:59 schrieb Anthony D'Atri:
> I agree wrt making the node weights uniform.
>
> When mixing drive sizes, be careful that the larger ones don't run afoul of
> the PG max: they will receive more PGs than the smaller ones, and if you
> lose a node, that might be enough to send some over the max. Run `ceph osd df`
> and look at the PG counts.
>
> This can also degrade performance, since IO is not spread uniformly. Adjusting
> primary affinity can mitigate this somewhat.
>
>> On Oct 22, 2019, at 8:26 PM, Konstantin Shalygin  wrote:
>>
>> On 10/22/19 7:52 PM, Thomas wrote:
>>> Node 1
>>> 48x 1.6TB
>>> Node 2
>>> 48x 1.6TB
>>> Node 3
>>> 48x 1.6TB
>>> Node 4
>>> 48x 1.6TB
>>> Node 5
>>> 48x 7.2TB
>>> Node 6
>>> 48x 7.2TB
>>> Node 7
>>> 48x 7.2TB
>> I suggest balancing the disks across hosts, e.g. ~28x 1.6TB + 20x 7.2TB per host.
>>
>>> Why is the data distribution on the 1.6TB disks unequal?
>>> How can I correct this?
>> The balancer in upmap mode works per pool. I guess some of your 1.6TB OSDs
>> don't serve some of the pools.
>>
>>
>>
>> k

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced data distribution

2019-10-22 Thread Konstantin Shalygin

On 10/23/19 1:14 PM, Thomas Schneider wrote:

My understanding is that Ceph's algorithm should be smart enough to
determine where each object should be placed and to ensure balanced
utilisation.
I agree that losing a node with 7.2TB disks would have a major impact,
though.


Ceph doesn't care about disk utilization, and Ceph (mostly) doesn't care
what your OSDs are.


That is fundamental to running on general-purpose hardware. Please pastebin your

`ceph osd tree`, `ceph osd df tree` & `ceph osd pool ls detail`.



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced data distribution

2019-10-23 Thread Thomas Schneider
Sure, here's the pastebin.

Am 23.10.2019 um 08:31 schrieb Konstantin Shalygin:
> On 10/23/19 1:14 PM, Thomas Schneider wrote:
>> My understanding is that Ceph's algorithm should be smart enough to
>> determine where each object should be placed and to ensure balanced
>> utilisation.
>> I agree that losing a node with 7.2TB disks would have a major impact,
>> though.
>
> Ceph doesn't care about disk utilization, and Ceph (mostly) doesn't care
> what your OSDs are.
>
> That is fundamental to running on general-purpose hardware. Please pastebin your
>
> `ceph osd tree`, `ceph osd df tree` & `ceph osd pool ls detail`.
>
>
>
> k
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced data distribution

2019-10-23 Thread Konstantin Shalygin

On 10/23/19 2:46 PM, Thomas Schneider wrote:

Sure, here's the pastebin.


Since you have several rules, please also provide `ceph osd crush rule dump`.




k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced data distribution

2019-10-23 Thread Thomas Schneider
OK.

Here's my new pastebin.


Am 23.10.2019 um 09:50 schrieb Konstantin Shalygin:
> ceph osd crush rule dump
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced data distribution

2019-10-23 Thread Konstantin Shalygin

On 10/23/19 2:46 PM, Thomas Schneider wrote:

Sure, here's the pastebin.


Some of your 1.6TB OSDs are reweighted, e.g. osd.89 to 0.8, osd.100
to 0.7, etc.


For this reason those OSDs get fewer PGs than the others.
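If you want to list such overrides quickly, a hedged one-liner (assuming jq is
available and the JSON field names produced by recent releases):

ceph osd df -f json | jq -r '.nodes[] | select(.reweight < 1) | "\(.name) reweight=\(.reweight) pgs=\(.pgs)"'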



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced data distribution

2019-10-24 Thread Thomas Schneider
Hello,
this is understood.

I needed to start reweighting specific OSDs because rebalancing was not
working and I got a warning from Ceph that some OSDs were running out of space.
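(Overrides like these are typically applied with the reweight commands; osd.89
and 0.8 below are just the example values mentioned earlier in the thread:)

ceph osd reweight osd.89 0.8            # manual per-OSD override, range 0.0-1.0
ceph osd reweight-by-utilization 110    # or let Ceph pick overrides for OSDs above 110% of mean utilisation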

KR


Am 24.10.2019 um 05:58 schrieb Konstantin Shalygin:
> On 10/23/19 2:46 PM, Thomas Schneider wrote:
>> Sure, here's the pastebin.
>
> Some of your 1.6TB OSDs are reweighted, e.g. osd.89 to 0.8,
> osd.100 to 0.7, etc.
>
> For this reason those OSDs get fewer PGs than the others.
>
>
>
> k
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced data distribution

2019-10-24 Thread Konstantin Shalygin

On 10/24/19 6:54 PM, Thomas Schneider wrote:

this is understood.

I needed to start reweighting specific OSDs because rebalancing was not
working and I got a warning from Ceph that some OSDs were running out of space.


Still, your main issue is that your buckets are uneven: 350TB vs
79TB, more than 4 times.


I suggest you disable the multiroot setup (use only the default root) and use
your 1.6TB drives from the current default root (I count ~48 1.6TB OSDs).


And mix your OSDs across hosts so that they are more evenly distributed in the
cluster; this is one of the basic Ceph best practices.
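A minimal sketch of collapsing an extra root, assuming the bucket name from
earlier in the thread (hypothetical example; note that this triggers a large
backfill):

ceph osd crush move ld5505-hdd_strgbox root=default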



Also, you can try the offline upmap method; some folks get better
results with this (don't forget to disable the balancer first):


`ceph osd getmap -o om; osdmaptool om --upmap upmap.sh --upmap-deviation 0; bash upmap.sh; rm -f upmap.sh om`
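The same sequence spelled out step by step (the file names are arbitrary):

ceph osd getmap -o om                                # dump the current osdmap to a file
osdmaptool om --upmap upmap.sh --upmap-deviation 0   # compute `ceph osd pg-upmap-items` commands, targeting zero PG deviation
bash upmap.sh                                        # apply the generated upmap entries
rm -f upmap.sh om                                    # clean up

In recent releases `osdmaptool` also accepts `--upmap-pool <poolname>` to
restrict the optimisation to a single pool.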




k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io