Re: [ceph-users] Degraded data redundancy: NUM pgs undersized
Hello Lothar,

Thanks for your reply.

Am 04.09.2018 um 11:20 schrieb Lothar Gesslein:
> By pure chance 15 pgs are now actually replicated to all 3 osds, so
> they have enough copies (clean). But the placement is "wrong", it would
> like to move the data to different osds (remapped) if possible.

That seems to be correct. I've added a third bucket of type datacenter and moved one host bucket so that each datacenter has one host with one OSD. The PGs were rebalanced (if that is the correct term) and the status changed to HEALTH_OK with all PGs active+clean.

Then I moved the host in dc2 to another datacenter and removed dc2 from the CRUSH map. Now I have all PGs active+clean+remapped, so your next statement applies:

> It replicated to 2 osds in the initial placement, but wasn't able to
> find a suitable third osd. Then by increasing pgp_num it recalculated
> the placement, again selected two osds, and moved the data there. It
> won't remove the data from the "wrong" osd until it has a new place
> for it, so you end up with three copies, but remapped pgs.

Ok, I think I got this.

>> 3. What's wrong here and what do I have to do to get the cluster back
>> to active+clean, again?
>
> I guess you want to have "two copies in dc1, one copy in dc2"? If you
> stay with only 3 osds that is the only way to distribute 3 copies
> anyway, so you don't need any crush rule.
>
> What your crush rule is currently expressing is "in the default root,
> select n buckets (where n is the pool size, 3 in this case) of type
> datacenter, then select one leaf (meaning osd) in each datacenter".
> You only have 2 datacenter buckets, so that will only ever select 2
> osds.
>
> If your cluster is going to grow to at least 2 osds in each dc, you
> can go with
>
> http://cephnotes.ksperis.com/blog/2017/01/23/crushmap-for-2-dc/
>
> I would translate this crush rule as "in the default root, select 2
> buckets of type datacenter, then select n-1 (where n is the pool size,
> so here 3-1 = 2) leafs in each datacenter".
>
> You will need at least two osds in each dc for this, because it is
> random (with respect to the weights) in which dc the 2 copies will be
> placed and which dc gets the remaining copy.

I don't get why I need at least two OSDs in each dc, because I thought that with only three OSDs it is implicitly clear where the two copies have to be written. With two OSDs in each dc I would never know on which side the two copies of my three replicas end up.

Let's try an example to check whether my understanding is correct: I have two datacenters, dcA and dcB, with two OSDs in each. Due to the random placement, two copies of object A are written to dcA and one to dcB, while for the next object B, two copies are written to dcB and one to dcA. If instead I had two OSDs in dcA and only one in dcB, the two copies of every object would be written to dcA and only one copy to dcB.

Did I get it right?

Best regards,
Joerg
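For reference, the bucket shuffling Joerg describes can be done on a live cluster with the standard CRUSH CLI. A minimal sketch, assuming ccp-tcnm03 is the host being moved (the message does not name the host; the bucket names come from the CRUSH map later in the thread):

~~~
# Create a third datacenter bucket and place it under the default root
sudo ceph osd crush add-bucket dc2 datacenter
sudo ceph osd crush move dc2 root=default

# Move one host so each datacenter holds exactly one host/OSD;
# Ceph rebalances the affected PGs automatically
sudo ceph osd crush move ccp-tcnm03 datacenter=dc2

# Later: relocate the host again and remove the now-empty bucket
sudo ceph osd crush move ccp-tcnm03 datacenter=dc1
sudo ceph osd crush remove dc2
~~~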
Re: [ceph-users] Degraded data redundancy: NUM pgs undersized
On 09/04/2018 09:47 AM, Jörg Kastning wrote:
> My questions are:
>
> 1. What does active+undersized actually mean? I did not find anything
> about it in the documentation on docs.ceph.com.

http://docs.ceph.com/docs/master/rados/operations/pg-states/

active
    Ceph will process requests to the placement group.

undersized
    The placement group has fewer copies than the configured pool
    replication level.

Your crush map/rules and osds do not allow placing all pgs on three "independent" osds, so the pgs have fewer copies than configured.

> 2. Why were only 15 PGs remapped after I corrected the mistake with
> the wrong pgp_num value?

By pure chance 15 pgs are now actually replicated to all 3 osds, so they have enough copies (clean). But the placement is "wrong": Ceph would like to move the data to different osds (remapped) if possible.

It replicated to 2 osds in the initial placement, but wasn't able to find a suitable third osd. Then by increasing pgp_num it recalculated the placement, again selected two osds, and moved the data there. It won't remove the data from the "wrong" osd until it has a new place for it, so you end up with three copies, but remapped pgs.

> 3. What's wrong here and what do I have to do to get the cluster back
> to active+clean, again?

I guess you want to have "two copies in dc1, one copy in dc2"? If you stay with only 3 osds that is the only way to distribute 3 copies anyway, so you don't need any crush rule.

What your crush rule is currently expressing is "in the default root, select n buckets (where n is the pool size, 3 in this case) of type datacenter, then select one leaf (meaning osd) in each datacenter". You only have 2 datacenter buckets, so that will only ever select 2 osds.

If your cluster is going to grow to at least 2 osds in each dc, you can go with

http://cephnotes.ksperis.com/blog/2017/01/23/crushmap-for-2-dc/

I would translate this crush rule as "in the default root, select 2 buckets of type datacenter, then select n-1 (where n is the pool size, so here 3-1 = 2) leafs in each datacenter".

You will need at least two osds in each dc for this, because it is random (with respect to the weights) in which dc the 2 copies will be placed and which dc gets the remaining copy.

Best regards,
Lothar

-- 
Lothar Gesslein
Linux Consultant
Mail: gessl...@b1-systems.de

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt, HRB 3537
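A sketch of what such a rule could look like, following the pattern described in the linked post (the rule name and id here are illustrative, not taken from the post; `firstn -1` means "pool size minus 1", matching the translation above):

~~~
rule replicated_2dc {
    id 2
    type replicated
    min_size 1
    max_size 10
    step take default
    # select exactly 2 datacenter buckets under the default root
    step choose firstn 2 type datacenter
    # in each datacenter, select up to (pool size - 1) hosts,
    # taking one osd per host
    step chooseleaf firstn -1 type host
    step emit
}
~~~

With a pool size of 3 this emits up to 4 candidate OSDs, of which the first 3 are used, so one datacenter ends up with two copies and the other with one.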
[ceph-users] Degraded data redundancy: NUM pgs undersized
Good morning folks,

As a newbie to Ceph, yesterday was the first time I configured my CRUSH map, added a CRUSH rule, and created my first pool using this rule. Since then I get the status HEALTH_WARN with the following output:

~~~
$ sudo ceph status
  cluster:
    id:     47c108bd-db66-4197-96df-cadde9e9eb45
    health: HEALTH_WARN
            Degraded data redundancy: 128 pgs undersized
            1 pools have pg_num > pgp_num

  services:
    mon: 3 daemons, quorum ccp-tcnm01,ccp-tcnm02,ccp-tcnm03
    mgr: ccp-tcnm01(active), standbys: ccp-tcnm03, ccp-tcnm02
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   1 pools, 128 pgs
    objects: 0 objects, 0 bytes
    usage:   3088 MB used, 3068 GB / 3071 GB avail
    pgs:     128 active+undersized
~~~

The pool was created by running `sudo ceph osd pool create joergsfirstpool 128 replicated replicate_datacenter`. I figured out that I forgot to set the value for the key pgp_num accordingly, so I did that by running `sudo ceph osd pool set joergsfirstpool pgp_num 128`. As you can see in the following output, 15 PGs were remapped, but 113 still remain active+undersized:

~~~
$ sudo ceph status
  cluster:
    id:     47c108bd-db66-4197-96df-cadde9e9eb45
    health: HEALTH_WARN
            Degraded data redundancy: 113 pgs undersized

  services:
    mon: 3 daemons, quorum ccp-tcnm01,ccp-tcnm02,ccp-tcnm03
    mgr: ccp-tcnm01(active), standbys: ccp-tcnm03, ccp-tcnm02
    osd: 3 osds: 3 up, 3 in; 15 remapped pgs

  data:
    pools:   1 pools, 128 pgs
    objects: 0 objects, 0 bytes
    usage:   3089 MB used, 3068 GB / 3071 GB avail
    pgs:     113 active+undersized
             15  active+clean+remapped
~~~

My questions are:

1. What does active+undersized actually mean? I did not find anything about it in the documentation on docs.ceph.com.
2. Why were only 15 PGs remapped after I corrected the mistake with the wrong pgp_num value?
3. What's wrong here and what do I have to do to get the cluster back to active+clean, again?
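The pgp_num fix described above boils down to two standard ceph CLI calls; a minimal sketch, with the pool name taken from the message:

~~~
# Check both values; a mismatch causes the
# "1 pools have pg_num > pgp_num" health warning
sudo ceph osd pool get joergsfirstpool pg_num
sudo ceph osd pool get joergsfirstpool pgp_num

# Raise pgp_num to match pg_num so CRUSH recalculates
# placement for all 128 PGs
sudo ceph osd pool set joergsfirstpool pgp_num 128
~~~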
For further information you can find my current CRUSH map below:

~~~
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ccp-tcnm01 {
    id -5               # do not change unnecessarily
    id -6 class hdd     # do not change unnecessarily
    # weight 1.000
    alg straw2
    hash 0  # rjenkins1
    item osd.1 weight 1.000
}
host ccp-tcnm03 {
    id -7               # do not change unnecessarily
    id -8 class hdd     # do not change unnecessarily
    # weight 1.000
    alg straw2
    hash 0  # rjenkins1
    item osd.2 weight 1.000
}
datacenter dc1 {
    id -9               # do not change unnecessarily
    id -12 class hdd    # do not change unnecessarily
    # weight 2.000
    alg straw2
    hash 0  # rjenkins1
    item ccp-tcnm01 weight 1.000
    item ccp-tcnm03 weight 1.000
}
host ccp-tcnm02 {
    id -3               # do not change unnecessarily
    id -4 class hdd     # do not change unnecessarily
    # weight 1.000
    alg straw2
    hash 0  # rjenkins1
    item osd.0 weight 1.000
}
datacenter dc3 {
    id -10              # do not change unnecessarily
    id -11 class hdd    # do not change unnecessarily
    # weight 1.000
    alg straw2
    hash 0  # rjenkins1
    item ccp-tcnm02 weight 1.000
}
root default {
    id -1               # do not change unnecessarily
    id -2 class hdd     # do not change unnecessarily
    # weight 3.000
    alg straw2
    hash 0  # rjenkins1
    item dc1 weight 2.000
    item dc3 weight 1.000
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule replicate_datacenter {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}
# end crush map
~~~

Best regards,
Joerg
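As an aside, a rule like replicate_datacenter can be checked offline, before touching the cluster, by compiling the decompiled map above and running crushtool's test mode. A sketch (the file names are illustrative):

~~~
# Compile the text CRUSH map back into binary form
crushtool -c crushmap.txt -o crushmap.bin

# Simulate placement for rule 1 (replicate_datacenter) with 3 replicas;
# --show-bad-mappings lists inputs that yield fewer than 3 OSDs,
# which here demonstrates the rule can only ever select 2
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-bad-mappings
~~~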