Re: [ceph-users] Degraded data redundancy: NUM pgs undersized

2018-09-04 Thread Jörg Kastning

Hello Lothar,

Thanks for your reply.

On 04.09.2018 at 11:20, Lothar Gesslein wrote:

> By pure chance 15 pgs are now actually replicated to all 3 osds, so they
> have enough copies (clean). But the placement is "wrong", it would like
> to move the data to different osds (remapped) if possible.


That seems to be correct. I've added a third bucket of type datacenter
and moved one host bucket so that each datacenter has one host with one
osd. The PGs were rebalanced (if that is the correct term) and the
status changed to HEALTH_OK with all PGs active+clean.
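For reference, the commands I used looked roughly like this (a sketch
from memory; bucket names as in my CRUSH map):

~~~
# create the new datacenter bucket and hang it under the default root
sudo ceph osd crush add-bucket dc2 datacenter
sudo ceph osd crush move dc2 root=default
# move one host bucket out of dc1 into the new datacenter
sudo ceph osd crush move ccp-tcnm03 datacenter=dc2
~~~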


Then I moved the host in dc2 to another datacenter and removed dc2 from
the CRUSH map.
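Again roughly, as a sketch from memory (assuming dc1 as the target):

~~~
# move the host back and delete the now-empty bucket
sudo ceph osd crush move ccp-tcnm03 datacenter=dc1
sudo ceph osd crush remove dc2
~~~

Now all PGs are active+clean+remapped, so your next statement applies: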



> It replicated to 2 osds in the initial placement but wasn't able to find
> a suitable third osd. Then by increasing pgp_num it recalculated the
> placement, again selected two osds and moved the data there. It won't
> remove the data from the "wrong" osd until it has a new place for it, so
> you end up with three copies, but remapped pgs.


Ok, I think I got this.



>>  3. What's wrong here and what do I have to do to get the cluster back
>> to active+clean again?


> I guess you want to have "two copies in dc1, one copy in dc2"?
> 
> If you stay with only 3 osds that is the only way to distribute 3
> copies anyway, so you don't need a special crush rule.
> 
> What your crush rule is currently expressing is
> 
> "in the default root, select n buckets (where n is the pool size, 3 in
> this case) of type datacenter, select one leaf (meaning osd) in each
> datacenter". You only have 2 datacenter buckets, so that will only ever
> select 2 osds.


> If your cluster is going to grow to at least 2 osds in each dc, you can
> go with
> 
> http://cephnotes.ksperis.com/blog/2017/01/23/crushmap-for-2-dc/
> 
> I would translate this crush rule as
> 
> "in the default root, select 2 buckets of type datacenter, select n-1
> (where n is the pool size, so here 3-1 = 2) leaves in each datacenter"
> 
> You will need at least two osds in each dc for this, because it is
> random (with respect to the weights) in which dc the 2 copies will be
> placed and which gets the remaining copy.


I don't get why I need at least two osds in each dc, because I thought
that with only three osds it is implicitly clear where to write the two
copies.


If I have two osds in each dc, I would never know on which side the two
copies of my three replicas end up.


Let's try an example to check whether my understanding is correct:


I have two datacenters, dcA and dcB, with two osds in each. Due to the
random placement, two copies of object A are written to dcA and one to
dcB. For the next object, B, two copies are written to dcB and one to dcA.


If I have two osds in dcA and only one in dcB, the two copies of an
object are written to dcA every time and only one copy to dcB.
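(I guess I could verify where a given object ends up with something
like this; the object name is made up:)

~~~
# show which pg and osds a given object maps to
sudo ceph osd map joergsfirstpool someobject
~~~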


Did I get it right?

Best regards,
Joerg






Re: [ceph-users] Degraded data redundancy: NUM pgs undersized

2018-09-04 Thread Lothar Gesslein
On 09/04/2018 09:47 AM, Jörg Kastning wrote:
> My questions are:
> 
>  1. What does active+undersized actually mean? I did not find anything
> about it in the documentation on docs.ceph.com.

http://docs.ceph.com/docs/master/rados/operations/pg-states/

active
Ceph will process requests to the placement group.

undersized
The placement group has fewer copies than the configured pool
replication level.


Your crush map/rules and osds do not allow all pgs to be placed on three
"independent" osds, so pgs have fewer copies than configured.

>  2. Why were only 15 PGs remapped after I corrected the
> mistake with the wrong pgp_num value?

By pure chance 15 pgs are now actually replicated to all 3 osds, so they
have enough copies (clean). But the placement is "wrong", it would like
to move the data to different osds (remapped) if possible.

It replicated to 2 osds in the initial placement but wasn't able to find
a suitable third osd. Then by increasing pgp_num it recalculated the
placement, again selected two osds and moved the data there. It won't
remove the data from the "wrong" osd until it has a new place for it, so
you end up with three copies, but remapped pgs.
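One way to see this, as a sketch: undersized pgs have fewer osds in
their acting set than the pool size, and for remapped pgs the "up" set
(where crush wants the data) differs from the "acting" set (where the
copies currently are).

~~~
# list pgs stuck in the undersized state, with their up/acting sets
ceph pg dump_stuck undersized
# list remapped pgs, i.e. pgs whose up and acting sets differ
ceph pg ls remapped
~~~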

>  3. What's wrong here and what do I have to do to get the cluster back
> to active+clean again?

I guess you want to have "two copies in dc1, one copy in dc2"?

If you stay with only 3 osds that is the only way to distribute 3
copies anyway, so you don't need a special crush rule.

What your crush rule is currently expressing is

"in the default root, select n buckets (where n is the pool size, 3 in
this case) of type datacenter, select one leaf (meaning osd) in each
datacenter". You only have 2 datacenter buckets, so that will only ever
select 2 osds.
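That is this rule from your map, annotated (the comments are mine;
"firstn 0" means "as many as the pool size"):

~~~
rule replicate_datacenter {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default                        # start at the default root
    step chooseleaf firstn 0 type datacenter # choose pool-size (3) buckets
                                             # of type datacenter, one osd
                                             # in each
    step emit
}
~~~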


If your cluster is going to grow to at least 2 osds in each dc, you can
go with

http://cephnotes.ksperis.com/blog/2017/01/23/crushmap-for-2-dc/

I would translate this crush rule as

"in the default root, select 2 buckets of type datacenter, select n-1
(where n is the pool size, so here 3-1 = 2) leaves in each datacenter"
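A sketch of such a rule, adapted from that post (untested here, so
treat it as an illustration):

~~~
rule replicated_2dc {
    id 2
    type replicated
    min_size 2
    max_size 3
    step take default
    step choose firstn 2 type datacenter  # pick both datacenter buckets
    step chooseleaf firstn -1 type host   # pool size - 1 = 2 osds in each
                                          # dc, capped at pool size overall
    step emit
}
~~~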

You will need at least two osds in each dc for this, because it is
random (with respect to the weights) in which dc the 2 copies will be
placed and which gets the remaining copy.


Best regards,
Lothar


-- 
Lothar Gesslein
Linux Consultant
Mail: gessl...@b1-systems.de

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537





[ceph-users] Degraded data redundancy: NUM pgs undersized

2018-09-04 Thread Jörg Kastning

Good morning folks,

As a newbie to Ceph, yesterday was the first time I configured my
CRUSH map, added a CRUSH rule, and created my first pool using this rule.


Since then I get the status HEALTH_WARN with the following output:

~~~
$ sudo ceph status
  cluster:
id: 47c108bd-db66-4197-96df-cadde9e9eb45
health: HEALTH_WARN
Degraded data redundancy: 128 pgs undersized
1 pools have pg_num > pgp_num

  services:
mon: 3 daemons, quorum ccp-tcnm01,ccp-tcnm02,ccp-tcnm03
mgr: ccp-tcnm01(active), standbys: ccp-tcnm03, ccp-tcnm02
osd: 3 osds: 3 up, 3 in

  data:
pools:   1 pools, 128 pgs
objects: 0 objects, 0 bytes
usage:   3088 MB used, 3068 GB / 3071 GB avail
pgs: 128 active+undersized
~~~

The pool was created by running `sudo ceph osd pool create
joergsfirstpool 128 replicated replicate_datacenter`.
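(To double-check both values afterwards, something like this should
work:)

~~~
sudo ceph osd pool get joergsfirstpool pg_num
sudo ceph osd pool get joergsfirstpool pgp_num
~~~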


I've figured out that I had forgotten to set the value for the key
pgp_num accordingly, so I fixed that by running `sudo ceph osd pool set
joergsfirstpool pgp_num 128`. As you can see in the following output,
15 PGs were remapped but 113 still remain active+undersized.


~~~
$ sudo ceph status
  cluster:
id: 47c108bd-db66-4197-96df-cadde9e9eb45
health: HEALTH_WARN
Degraded data redundancy: 113 pgs undersized

  services:
mon: 3 daemons, quorum ccp-tcnm01,ccp-tcnm02,ccp-tcnm03
mgr: ccp-tcnm01(active), standbys: ccp-tcnm03, ccp-tcnm02
osd: 3 osds: 3 up, 3 in; 15 remapped pgs

  data:
pools:   1 pools, 128 pgs
objects: 0 objects, 0 bytes
usage:   3089 MB used, 3068 GB / 3071 GB avail
pgs: 113 active+undersized
 15  active+clean+remapped
~~~

My questions are:

 1. What does active+undersized actually mean? I did not find anything 
about it in the documentation on docs.ceph.com.


 2. Why were only 15 PGs remapped after I corrected the mistake with
the wrong pgp_num value?


 3. What's wrong here and what do I have to do to get the cluster back
to active+clean again?


For further information, you can find my current CRUSH map below:

~~~
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ccp-tcnm01 {
id -5   # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0  # rjenkins1
item osd.1 weight 1.000
}
host ccp-tcnm03 {
id -7   # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0  # rjenkins1
item osd.2 weight 1.000
}
datacenter dc1 {
id -9   # do not change unnecessarily
id -12 class hdd# do not change unnecessarily
# weight 2.000
alg straw2
hash 0  # rjenkins1
item ccp-tcnm01 weight 1.000
item ccp-tcnm03 weight 1.000
}
host ccp-tcnm02 {
id -3   # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0  # rjenkins1
item osd.0 weight 1.000
}
datacenter dc3 {
id -10  # do not change unnecessarily
id -11 class hdd# do not change unnecessarily
# weight 1.000
alg straw2
hash 0  # rjenkins1
item ccp-tcnm02 weight 1.000
}
root default {
id -1   # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 3.000
alg straw2
hash 0  # rjenkins1
item dc1 weight 2.000
item dc3 weight 1.000
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule replicate_datacenter {
id 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type datacenter
step emit
}

# end crush map
~~~
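(For what it's worth, I understand the placement a rule produces can be
simulated offline with crushtool; a sketch, untested on my side:)

~~~
# export the compiled crush map and simulate rule 1 with 3 replicas;
# --show-bad-mappings instead would list pgs that get fewer than 3 osds
sudo ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings
~~~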

Best regards,
Joerg


