Hi Félix,

Changing the failure domain to OSD is probably the easiest option if this is a test cluster. I think the commands would go like:

- ceph osd getcrushmap -o map.bin
- crushtool -d map.bin -o map.txt
- sed -i 's/step chooseleaf firstn 0 type host/step chooseleaf firstn 0 type osd/' map.txt
- crushtool -c map.txt -o map.bin
- ceph osd setcrushmap -i map.bin
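Before injecting the edited map, it may be worth sanity-checking it: crushtool can simulate placements against the compiled map. This is only a rough sketch; the rule id 0 and replica count 3 are assumptions here, so check them first with "ceph osd crush rule dump" and "ceph osd pool get <pool> size":

- crushtool -i map.bin --test --rule 0 --num-rep 3 --show-statistics
- crushtool -i map.bin --test --rule 0 --num-rep 3 --show-bad-mappings

If --show-bad-mappings prints nothing, the simulation found 3 OSDs for every input with the new failure domain.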
Moving HDDs around so that each server ends up with ~8TB would be a good option if this is a capacity-focused use case. It will allow you to reboot one server at a time without radosgw downtime. You would target 26/3 ≈ 8.66TB per node, so:

- node1: 1x8TB
- node2: 1x8TB + 1x2TB
- node3: 2x3TB + 1x2TB

If you are more concerned about performance then set the weights to 1 on all HDDs and forget about the wasted capacity (a rough command sketch follows at the end of this message).

Cheers,
Maxime

On Tue, 6 Jun 2017 at 00:44 Christian Wuerdig <christian.wuer...@gmail.com> wrote:

> Yet another option is to change the failure domain to OSD instead of host
> (this avoids having to move disks around and will probably meet your
> initial expectations).
> It means your cluster will become unavailable when you lose a host until
> you fix it, though. OTOH you probably don't have too much leeway anyway
> with just 3 hosts, so it might be an acceptable trade-off. It also means
> you can just add new OSDs to the servers wherever they fit.
>
> On Tue, Jun 6, 2017 at 1:51 AM, David Turner <drakonst...@gmail.com> wrote:
>
>> If you want to resolve your issue without purchasing another node, you
>> should move one disk of each size into each server. This process will be
>> quite painful as you'll need to actually move the disks in the crush map
>> to be under a different host, and then all of your data will move around,
>> but afterwards the cluster will be able to use the weights and distribute
>> the data between the 2TB, 3TB, and 8TB drives much more evenly.
>>
>> On Mon, Jun 5, 2017 at 9:21 AM Loic Dachary <l...@dachary.org> wrote:
>>
>>>
>>> On 06/05/2017 02:48 PM, Christian Balzer wrote:
>>> >
>>> > Hello,
>>> >
>>> > On Mon, 5 Jun 2017 13:54:02 +0200 Félix Barbeira wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> We have a small cluster for radosgw use only. It has three nodes, with 3
>>> >                                                        ^^^^^        ^^^^^^
>>> >> osds each. Each node has different disk sizes:
>>> >>
>>> >
>>> > There's your answer, staring you right in the face.
>>> >
>>> > Your default replication size is 3, your default failure domain is host.
>>> >
>>> > Ceph cannot distribute data according to the weight, since it needs to
>>> > be on a different node (one replica per node) to comply with the
>>> > replica size.
>>>
>>> Another way to look at it is to imagine a situation where 10TB worth of
>>> data is stored on node01, which has 3x8TB = 24TB. Since you asked for 3
>>> replicas, this data must be replicated to node02 but ... there is only
>>> 3x2TB = 6TB available. So the maximum you can store is 6TB, and the
>>> remaining disk space on node01 and node03 will never be used.
>>>
>>> python-crush analyze will display a message about that situation and
>>> show which buckets are overweighted.
>>>
>>> Cheers
>>>
>>> >
>>> > If your cluster had 4 or more nodes, you'd see what you expected.
>>> > And most likely wouldn't be happy about the performance with your 8TB
>>> > HDDs seeing 4 times more I/Os than the 2TB ones and thus becoming the
>>> > bottleneck of your cluster.
>>> >
>>> > Christian
>>> >
>>> >> node01 : 3x8TB
>>> >> node02 : 3x2TB
>>> >> node03 : 3x3TB
>>> >>
>>> >> I thought that the weight handles the amount of data that every osd
>>> >> receives. In this case, for example, the node with the 8TB disks should
>>> >> receive more than the rest, right? Instead, all of them receive the same
>>> >> amount of data and the smaller disks (2TB) reach 100% before the bigger
>>> >> ones. Am I doing something wrong?
>>> >>
>>> >> The cluster is jewel LTS 10.2.7.
>>> >>
>>> >> # ceph osd df
>>> >> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
>>> >>  0 7.27060  1.00000  7445G  1012G  6432G 13.60 0.57 133
>>> >>  3 7.27060  1.00000  7445G  1081G  6363G 14.52 0.61 163
>>> >>  4 7.27060  1.00000  7445G   787G  6657G 10.58 0.44 120
>>> >>  1 1.81310  1.00000  1856G  1047G   809G 56.41 2.37 143
>>> >>  5 1.81310  1.00000  1856G   956G   899G 51.53 2.16 143
>>> >>  6 1.81310  1.00000  1856G   877G   979G 47.24 1.98 130
>>> >>  2 2.72229  1.00000  2787G  1010G  1776G 36.25 1.52 140
>>> >>  7 2.72229  1.00000  2787G   831G  1955G 29.83 1.25 130
>>> >>  8 2.72229  1.00000  2787G  1038G  1748G 37.27 1.56 146
>>> >>               TOTAL 36267G  8643G 27624G 23.83
>>> >> MIN/MAX VAR: 0.44/2.37  STDDEV: 18.60
>>> >> #
>>> >>
>>> >> # ceph osd tree
>>> >> ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>> >> -1 35.41795 root default
>>> >> -2 21.81180     host node01
>>> >>  0  7.27060         osd.0        up  1.00000          1.00000
>>> >>  3  7.27060         osd.3        up  1.00000          1.00000
>>> >>  4  7.27060         osd.4        up  1.00000          1.00000
>>> >> -3  5.43929     host node02
>>> >>  1  1.81310         osd.1        up  1.00000          1.00000
>>> >>  5  1.81310         osd.5        up  1.00000          1.00000
>>> >>  6  1.81310         osd.6        up  1.00000          1.00000
>>> >> -4  8.16687     host node03
>>> >>  2  2.72229         osd.2        up  1.00000          1.00000
>>> >>  7  2.72229         osd.7        up  1.00000          1.00000
>>> >>  8  2.72229         osd.8        up  1.00000          1.00000
>>> >> #
>>> >>
>>> >> # ceph -s
>>> >>     cluster 49ba9695-7199-4c21-9199-ac321e60065e
>>> >>      health HEALTH_OK
>>> >>      monmap e1: 3 mons at {ceph-mon01=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon02=[x:x:x:x:x:x:x:x]:6789/0,ceph-mon03=[x:x:x:x:x:x:x:x]:6789/0}
>>> >>             election epoch 48, quorum 0,1,2 ceph-mon01,ceph-mon03,ceph-mon02
>>> >>      osdmap e265: 9 osds: 9 up, 9 in
>>> >>             flags sortbitwise,require_jewel_osds
>>> >>       pgmap v95701: 416 pgs, 11 pools, 2879 GB data, 729 kobjects
>>> >>             8643 GB used, 27624 GB / 36267 GB avail
>>> >>                  416 active+clean
>>> >> #
>>> >>
>>> >> # ceph osd pool ls
>>> >> .rgw.root
>>> >> default.rgw.control
>>> >> default.rgw.data.root
>>> >> default.rgw.gc
>>> >> default.rgw.log
>>> >> default.rgw.users.uid
>>> >> default.rgw.users.keys
>>> >> default.rgw.buckets.index
>>> >> default.rgw.buckets.non-ec
>>> >> default.rgw.buckets.data
>>> >> default.rgw.users.email
>>> >> #
>>> >>
>>> >> # ceph df
>>> >> GLOBAL:
>>> >>     SIZE       AVAIL      RAW USED     %RAW USED
>>> >>     36267G     27624G     8643G        23.83
>>> >> POOLS:
>>> >>     NAME                           ID     USED      %USED     MAX AVAIL     OBJECTS
>>> >>     .rgw.root                      1      1588      0         5269G         4
>>> >>     default.rgw.control            2      0         0         5269G         8
>>> >>     default.rgw.data.root          3      8761      0         5269G         28
>>> >>     default.rgw.gc                 4      0         0         5269G         32
>>> >>     default.rgw.log                5      0         0         5269G         127
>>> >>     default.rgw.users.uid          6      4887      0         5269G         28
>>> >>     default.rgw.users.keys         7      144       0         5269G         16
>>> >>     default.rgw.buckets.index      9      0         0         5269G         14
>>> >>     default.rgw.buckets.non-ec     10     0         0         5269G         3
>>> >>     default.rgw.buckets.data       11     2879G     35.34     5269G         746848
>>> >>     default.rgw.users.email        12     13        0         5269G         1
>>> >> #
>>> >>
>>> >
>>> >
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
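For reference, rough command sketches for two of the suggestions above: the crush reweighting Maxime mentions and the disk move David Turner describes in the quoted thread. The osd ids, weights and host names are taken from the "ceph osd tree" output; treat these as examples to adapt, not exact recipes, and double-check the syntax against the Jewel docs before running anything.

To set the crush weight to 1 on an HDD (repeat per OSD):

- ceph osd crush reweight osd.0 1.0

To move a disk under a different host in the crush map, for example moving osd.3 (an 8TB disk) under node02 after physically re-installing it there:

- ceph osd crush create-or-move osd.3 7.27060 root=default host=node02

Both operations trigger data movement, so expect rebalancing traffic while the cluster converges.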
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com