Re: [ceph-users] Help with setting device-class rule on pool without causing data to move

2018-12-31 Thread Eric Goirand
Hi David,

CERN has provided a Python script to swap the relevant bucket IDs
(default <-> hdd); you can find it here:
https://github.com/cernceph/ceph-scripts/blob/master/tools/device-class-id-swap.py

The principle is the following:
- extract the CRUSH map
- run the script on it => it creates a new CRUSH file
- edit the new CRUSH map and, in the rule associated with the pool(s) you
want to restrict to HDD OSDs only, replace:
  step take default
with:
  step take default class hdd

Then recompile and reinject the new CRUSH map, and voilà!

Your cluster should then be using only the HDD OSDs, without rebalancing (or
with only a very small amount of data movement).
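
For reference, the whole round trip usually looks something like this (just a
sketch; file names are placeholders):

  # extract and decompile the current CRUSH map (keep crushmap.bin for rollback)
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # run the CERN script and/or edit crushmap.txt, replacing in the rule:
  #   step take default
  # with:
  #   step take default class hdd
  # then recompile and inject the new map
  crushtool -c crushmap.txt -o crushmap-new.bin
  ceph osd setcrushmap -i crushmap-new.bin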

If anything goes wrong or you have forgotten something, just reapply the former
CRUSH map and start again.

Cheers and Happy new year 2019.

Eric



On Sun, Dec 30, 2018, 21:16 David C wrote:

> Hi All
>
> I'm trying to set the existing pools in a Luminous cluster to use the hdd
> device-class but without moving data around. If I just create a new rule
> using the hdd class and set my pools to use that new rule it will cause a
> huge amount of data movement even though the pgs are all already on HDDs.
>
> There is a thread on ceph-large [1] which appears to have the solution but
> I can't get my head around what I need to do. I'm not too clear on which
> IDs I need to swap. Could someone give me some pointers on this please?
>
> [1]
> http://lists.ceph.com/pipermail/ceph-large-ceph.com/2018-April/000109.html
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs with primary affinity 0 still used for primary PG

2018-02-15 Thread Eric Goirand

Hello Teun,

See below.

On 02/15/2018 11:52 AM, Teun Docter wrote:

Hi David,

Thanks for explaining that, it makes sense. (Though I guess the docs aren't very 
clear on that, but ok.) I have a follow-up question on your suggestion to 
modify the crush map though.

I've seen a few examples on how to use crush rules to place primary copies on 
SSDs, and secondary copies on HDDs. In fact, one such example is in the main 
Ceph docs. However, they all seem to be based on the premise of having two 
types of OSD servers. One type would have *only* SSDs, and the other *only* 
HDDs.

However, that's not the scenario I'm investigating. I would like each of my OSD 
servers to be the same. Each would contain a number of SSDs, and a number of 
HDDs.

After reading up on crush rules, I think I understand how to set up a basic rule 
that would place the primary copy on an SSD and the other copies on HDDs. But 
what I haven't figured out yet is whether it is possible to avoid placing one of 
the secondary copies on the same host that stores the primary copy.

The only way (as of now, before the work tracked in the Bugzilla link [2] lands) 
to avoid having 2 copies on the same server (1 copy on an SSD and 1 copy on an 
HDD) is to physically separate the servers that contain the SSD drives from the 
servers that contain the HDD drives.


You would then create your ruleset as you did previously, but this time the two 
roots you start from (step take) are separate, so no two copies will end up on 
the same server.
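
As a rough sketch (the root and rule names here are only illustrative, not
taken from your map), such a rule would look something like:

  rule ssd_primary {
      id 5
      type replicated
      min_size 1
      max_size 10
      # primary copy from the SSD-only servers
      step take ssd-root
      step chooseleaf firstn 1 type host
      step emit
      # remaining replicas from the HDD-only servers
      step take hdd-root
      step chooseleaf firstn -1 type host
      step emit
  }

The first take/emit picks the primary from the SSD servers, the second picks
the remaining replicas from the HDD servers.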


If you stick with the collocated drive setup, and if you still keep 
min_size equal to 2 in your pools, I would suggest using replica 4 
instead of 3, so that you keep access to all your data even if one server 
is down for maintenance.
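
For example (the pool name is just a placeholder):

  ceph osd pool set mypool size 4
  ceph osd pool set mypool min_size 2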



I found an earlier thread [1] where you've hinted at using racks for this, but in that 
thread I think there is also some confusion about SSD/HDD-only servers versus 
"hybrid" servers. In addition, I found an issue in Red Hat's tracker [2], which 
also outlines this problem.

With my current understanding of crush rules, I'm not sure whether the setup I 
had in mind is feasible.

Thanks,
Teun

Thanks,
Eric.


[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017589.html
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1517128

On 02/12/2018 09:17 PM, David Turner wrote:

If you look at the PGs that are primary on an OSD that has primary
affinity 0, you'll find that they are only on OSDs with primary affinity
of 0, so 1 of them has to take the reins or nobody would be responsible
for the PG.  To prevent this from happening, you would need to configure
your crush map in a way where all PGs are guaranteed to land on at least
1 OSD that doesn't have a primary affinity of 0.

On Mon, Feb 12, 2018 at 2:45 PM Teun Docter wrote:

 Hi,

 I'm looking into storing the primary copy on SSDs, and replicas on
 spinners.
 One way to achieve this should be the primary affinity setting, as
 outlined in this post:

 
https://www.sebastien-han.fr/blog/2015/08/06/ceph-get-the-best-of-your-ssd-with-primary-affinity
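
For reference, primary affinity is set per OSD with commands along these lines
(the osd ids here simply mirror the tree below):

  ceph osd primary-affinity osd.4 0
  ceph osd primary-affinity osd.1 1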

 So I've deployed a small test cluster and set the affinity to 0 for
 half the OSDs and to 1 for the rest:

 # ceph osd tree
 ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       0.07751 root default
 -3       0.01938     host osd001
   1   hdd 0.00969         osd.1       up  1.0 1.0
   4   hdd 0.00969         osd.4       up  1.0       0
 -7       0.01938     host osd002
   2   hdd 0.00969         osd.2       up  1.0 1.0
   6   hdd 0.00969         osd.6       up  1.0       0
 -9       0.01938     host osd003
   3   hdd 0.00969         osd.3       up  1.0 1.0
   7   hdd 0.00969         osd.7       up  1.0       0
 -5       0.01938     host osd004
   0   hdd 0.00969         osd.0       up  1.0 1.0
   5   hdd 0.00969         osd.5       up  1.0       0

 Then I've created a pool. The summary at the end of "ceph pg dump"
 looks like this:

 sum 0 0 0 0 0 0 0 0
 OSD_STAT USED  AVAIL  TOTAL  HB_PEERS        PG_SUM PRIMARY_PG_SUM
 7        1071M  9067M 10138M [0,1,2,3,4,5,6]    192             26
 6        1072M  9066M 10138M [0,1,2,3,4,5,7]    198             18
 5        1071M  9067M 10138M [0,1,2,3,4,6,7]    192             21
 4        1076M  9062M 10138M [0,1,2,3,5,6,7]    202             15
 3        1072M  9066M 10138M [0,1,2,4,5,6,7]    202            121
 2        1072M  9066M 10138M [0,1,3,4,5,6,7]    195            114
 1        1076M  9062M 10138M [0,2,3,4,5,6,7]    161             95
 0        1071M  9067M 10138M [1,2,3,4,5,6,7]    194            102
 sum      8587M 72524M 8M

 Now, the OSDs for which the primary affinity is set to zero are
 acting as primary a lot less than the others.

Re: [ceph-users] erasure code profile

2017-09-23 Thread Eric Goirand

Hello Luis,

To find which EC profile would be best in your environment, you need 
to know:


- how many disk or host failures you would accept: I understood from 
your email that you want to be able to lose one room, but won't you 
need a bit more, such as losing 1 disk (or 1 host) in another room 
while the first one is down?


- how many OSD nodes you can (or will) have per room, or whether you will 
adapt this number to the EC profile you set up?


Once these two questions are answered, you will be able to set the m parameter 
of the EC profile, and you will then need to compute the k parameter so that 
each room requires the same number of OSD nodes, i.e. at least (k+m)/3 
per room.
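
For reference, once k and m are chosen, the profile itself would be created with
something like the following (profile name is a placeholder; on recent releases
the option is called crush-failure-domain, on older ones ruleset-failure-domain):

  ceph osd erasure-code-profile set ec_rooms k=4 m=2 crush-failure-domain=host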


In each situation, you will then most likely need to adapt the CRUSH 
ruleset associated with the EC profile so that exactly (k+m)/3 EC chunks 
are placed in each room, to keep access to all your data when one room 
is down.
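
As a sketch for the 5+4 case discussed below (rule and root names are only
illustrative), the adapted rule could look like:

  rule ec_by_room {
      id 3
      type erasure
      min_size 3
      max_size 9
      step set_chooseleaf_tries 5
      step set_choose_tries 100
      step take default
      step choose indep 3 type room
      step chooseleaf indep 3 type host
      step emit
  }

i.e. first pick the 3 rooms, then 3 hosts (one chunk each) inside each room,
which gives the 9 chunks of a 5+4 profile with exactly 3 chunks per room.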


Suppose that we only accept one room down and nothing more:

   - if m=1, k must be equal to 2, as you arrived at yourself, and
   you would only need 1 OSD node per room.

   - if m=2, k will be equal to 4; you would then need 2 OSD nodes per
   room and you would need to change the EC 4+2 ruleset to place 2
   chunks per room.

Suppose now that you want to tolerate more downtime, for example you 
want to be able to perform maintenance on one OSD node while one 
room is down; then you would need at least m = (number of OSD 
nodes in 1 room) + 1.


   - if you have 2 OSD nodes per room, m will need to be equal to 3;
   by deduction k would be equal to 3, and you would need the ruleset
   to place exactly 2 ((3+3) / 3) chunks per room.

   - if you have 3 OSD nodes per room, then m=4 and k=5, and you would
   need 3 chunks per room.

Now, this is a minimum, and for a given EC profile (let's say 5+4) I 
would recommend having one spare OSD node per room so that you can 
perform backfilling inside a room in case another OSD node goes down.


Thus, if you can have 12 OSD nodes in total (4 OSD nodes per room), I 
would still use the EC 5+4 profile and change the ruleset to place 
exactly 3 chunks per room; the efficiency of your cluster will be about 
55% (k/(k+m) = 5/9, i.e. roughly 55 TiB usable per 100 TiB of raw capacity).


Also remember that you will still need a good network between rooms 
(both in bandwidth and latency) and sufficiently powerful CPUs on the 
OSD nodes to compute the EC chunks.


Best Regards,

Eric.

On 09/22/2017 10:39 AM, Luis Periquito wrote:

Hi all,

I've been trying to think about what the best erasure code profile would be,
but I don't really like the one I came up with...

I have 3 rooms that are part of the same cluster, and I need to design
so we can lose any one of the 3.

As this is a backup cluster I was thinking of doing a k=2 m=1 code,
with ruleset-failure-domain=room, as the OSD tree is correctly built.

Can anyone think of a better profile?

thanks,


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com