[ceph-users] CRUSH rule seems to work fine not for all PGs in erasure coded pools

2017-11-28 Thread Jakub Jaszewski
Hi, I'm trying to understand erasure coded pools and why CRUSH rules seem
to work for only part of the PGs in EC pools.

Basically, what I'm trying to do is check erasure coded pool recovery
behaviour after a single OSD or a single HOST failure.
I noticed that in the case of a HOST failure only part of the PGs get
recovered to active+remapped, while other PGs remain in the
active+undersized+degraded state. Why??
The EC pool profile I use is k=3, m=2.

Also, I'm not really sure what each step of the below CRUSH rule means
(perhaps that is the root cause).
rule ecpool_3_2 {
    ruleset 1
    type erasure
    min_size 3
    max_size 5
    # should I maybe increase this number of retries? Can I apply changes to
    # an existing EC crush rule and pool, or do I need to create a new one?
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    # does this allow choosing more than one OSD from a single HOST, while
    # first trying to get only one OSD per HOST if there are enough HOSTs
    # in the cluster?
    step chooseleaf indep 0 type host
    step emit
}
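
One way to sanity-check what this rule will do, without touching the cluster, is to run it through crushtool (a sketch; the file path is just an example, only a few of host01's OSDs are overridden, and --weight <osd-id> 0 simulates that OSD being out):

root@host01:~# ceph osd getcrushmap -o /tmp/crushmap.bin
root@host01:~# crushtool -i /tmp/crushmap.bin --test --rule 1 --num-rep 5 --show-mappings
# simulate a whole host failing by zero-weighting its OSDs (first few of host01 shown)
root@host01:~# crushtool -i /tmp/crushmap.bin --test --rule 1 --num-rep 5 \
    --weight 0 0 --weight 3 0 --weight 4 0 --weight 6 0 --show-bad-mappings
# --show-bad-mappings prints the inputs for which CRUSH could not find all 5 OSDs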

ceph version 10.2.9 (jewel)

# INITIAL CLUSTER STATE
root@host01:~# ceph osd tree
ID  WEIGHT    TYPE NAME                  UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 218.18401 root default
 -6 218.18401   region MyRegion
 -5 218.18401     datacenter MyDC
 -4 218.18401       room MyRoom
 -3  43.63699         rack Rack01
 -2  43.63699           host host01
  0   3.63599             osd.0             up  1.0      1.0
  3   3.63599             osd.3             up  1.0      1.0
  4   3.63599             osd.4             up  1.0      1.0
  6   3.63599             osd.6             up  1.0      1.0
  8   3.63599             osd.8             up  1.0      1.0
 10   3.63599             osd.10            up  1.0      1.0
 12   3.63599             osd.12            up  1.0      1.0
 14   3.63599             osd.14            up  1.0      1.0
 16   3.63599             osd.16            up  1.0      1.0
 19   3.63599             osd.19            up  1.0      1.0
 22   3.63599             osd.22            up  1.0      1.0
 25   3.63599             osd.25            up  1.0      1.0
 -8  43.63699         rack Rack02
 -7  43.63699           host host02
  1   3.63599             osd.1             up  1.0      1.0
  2   3.63599             osd.2             up  1.0      1.0
  5   3.63599             osd.5             up  1.0      1.0
  7   3.63599             osd.7             up  1.0      1.0
  9   3.63599             osd.9             up  1.0      1.0
 11   3.63599             osd.11            up  1.0      1.0
 13   3.63599             osd.13            up  1.0      1.0
 15   3.63599             osd.15            up  1.0      1.0
 17   3.63599             osd.17            up  1.0      1.0
 20   3.63599             osd.20            up  1.0      1.0
 23   3.63599             osd.23            up  1.0      1.0
 26   3.63599             osd.26            up  1.0      1.0
-10 130.91000         rack Rack03
 -9  43.63699           host host03
 18   3.63599             osd.18            up  1.0      1.0
 21   3.63599             osd.21            up  1.0      1.0
 24   3.63599             osd.24            up  1.0      1.0
 27   3.63599             osd.27            up  1.0      1.0
 28   3.63599             osd.28            up  1.0      1.0
 29   3.63599             osd.29            up  1.0      1.0
 30   3.63599             osd.30            up  1.0      1.0
 31   3.63599             osd.31            up  1.0      1.0
 32   3.63599             osd.32            up  1.0      1.0
 33   3.63599             osd.33            up  1.0      1.0
 34   3.63599             osd.34            up  1.0      1.0
 35   3.63599             osd.35            up  1.0      1.0
-11  43.63699           host host04
 36   3.63599             osd.36            up  1.0      1.0
 37   3.63599             osd.37            up  1.0      1.0
 38   3.63599             osd.38            up  1.0      1.0
 39   3.63599             osd.39            up  1.0      1.0
 40   3.63599             osd.40            up  1.0      1.0
 41   3.63599             osd.41            up  1.0      1.0
 42   3.63599             osd.42            up  1.0      1.0
 43   3.63599             osd.43            up  1.0      1.0
 44   3.63599             osd.44            up  1.0      1.0
 45   3.63599             osd.45            up  1.0      1.0

Re: [ceph-users] CRUSH rule seems to work fine not for all PGs in erasure coded pools

2017-11-28 Thread David Turner
Your EC profile requires 5 servers to be healthy.  When you remove 1 OSD
from the cluster, it recovers by moving all of the chunks on that OSD to
other OSDs in the same host.  However, when you remove an entire host, it
cannot place all 5 chunks (k=3 + m=2) of the data on the 4 remaining servers
with your crush rule, because the failure domain is host.  The EC profile
you're using does not work with this type of testing given your hardware
configuration.
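
For reference, a quick way to confirm what the pool actually demands (a sketch; the pool and profile names in angle brackets are placeholders):

root@host01:~# ceph osd pool get <ec-pool-name> erasure_code_profile
root@host01:~# ceph osd erasure-code-profile get <profile-name>   # shows k, m and the failure domain
root@host01:~# ceph osd crush rule dump ecpool_3_2                # min_size/max_size and the chooseleaf step
# with k=3, m=2 and failure domain = host, each PG needs 5 distinct healthy hosts to be active+clean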


Re: [ceph-users] CRUSH rule seems to work fine not for all PGs in erasure coded pools

2017-11-28 Thread Jakub Jaszewski
Hi David, thanks for the quick feedback.

Then why were some PGs remapped and some not?

# LOOKS LIKE 338 PGs IN ERASURE CODED POOLS HAVE BEEN REMAPPED
# I DON'T GET WHY 540 PGs ARE STILL IN THE active+undersized+degraded STATE

root@host01:~# ceph pg dump pgs_brief | grep 'active+remapped'
dumped pgs_brief in format plain
16.6f active+remapped [43,2147483647,2,31,12] 43 [43,33,2,31,12] 43
16.6e active+remapped [10,5,35,44,2147483647] 10 [10,5,35,44,41] 10
root@host01:~# egrep '16.6f|16.6e' PGs_on_HOST_host05
16.6f active+clean [43,33,2,59,12] 43 [43,33,2,59,12] 43
16.6e active+clean [10,5,49,35,41] 10 [10,5,49,35,41] 10
root@host01:~#

Take PG 16.6f: prior to the ceph services stop it was on [43,33,2,59,12],
then it was remapped to [43,33,2,31,12], so osd.31 and osd.33 ended up on the
same HOST.

But PG 16.ee, for example, got to the active+undersized+degraded state;
prior to the services stop it was on

pg_stat state up up_primary acting acting_primary
16.ee active+clean [5,22,33,55,45] 5 [5,22,33,55,45] 5

and after stopping the services on the host it was not remapped:

16.ee active+undersized+degraded [5,22,33,2147483647,45] 5 [5,22,33,2147483647,45] 5
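
Side note: as far as I understand, 2147483647 in an up/acting set is CRUSH's ITEM_NONE placeholder, i.e. CRUSH gave up looking for an OSD for that shard, which is exactly what "undersized" means here. A rough way to count/list such PGs (a sketch, assuming the default pgs_brief column order of pgid, state, up, ...):

root@host01:~# ceph pg dump pgs_brief 2>/dev/null | grep -c 2147483647
root@host01:~# ceph pg dump pgs_brief 2>/dev/null | awk '$2 ~ /undersized/ {print $1, $3}' | head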


Re: [ceph-users] CRUSH rule seems to work fine not for all PGs in erasure coded pools

2017-11-30 Thread Jakub Jaszewski
I've just done a ceph upgrade jewel -> luminous and am facing the same case...

# EC profile
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=2
plugin=jerasure
technique=reed_sol_van
w=8

There are 5 hosts in the cluster and I ran systemctl stop ceph.target on one
of them. Some PGs from the EC pool were remapped (active+clean+remapped state)
even though there were not enough hosts in the cluster, but some are still in
the active+undersized+degraded state.


root@host01:~# ceph status
  cluster:
id: a6f73750-1972-47f6-bcf5-a99753be65ad
health: HEALTH_WARN
Degraded data redundancy: 876/9115 objects degraded (9.611%),
540 pgs unclean, 540 pgs degraded, 540 pgs undersized

  services:
mon: 3 daemons, quorum host01,host02,host03
mgr: host01(active), standbys: host02, host03
osd: 60 osds: 48 up, 48 in; 484 remapped pgs
rgw: 3 daemons active

  data:
pools:   19 pools, 3736 pgs
objects: 1965 objects, 306 MB
usage:   5153 MB used, 174 TB / 174 TB avail
pgs: 876/9115 objects degraded (9.611%)
 2712 active+clean
 540  active+undersized+degraded
 484  active+clean+remapped

  io:
client:   17331 B/s rd, 20 op/s rd, 0 op/s wr

root@host01:~#
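
To see which pools those 540 degraded and 484 remapped PGs belong to (the pool id is the part of the pgid before the dot), something like this sketch should work:

root@host01:~# ceph pg dump pgs_brief 2>/dev/null | \
    awk '$2 ~ /undersized|remapped/ {split($1,p,"."); print p[1], $2}' | sort | uniq -c
root@host01:~# ceph osd lspools   # map the pool ids back to pool names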



Anyone here able to explain this behavior to me?

Jakub


Re: [ceph-users] CRUSH rule seems to work fine not for all PGs in erasure coded pools

2017-11-30 Thread David Turner
active+clean+remapped is not a healthy state for a PG. If it were actually
going to a new OSD it would say backfill_wait or backfilling and would
eventually get back to active+clean.

I'm not certain what the active+clean+remapped state means. Perhaps a PG
query, PG dump, etc. can give more insight. In any case, this is not a
healthy state, and you're still testing the removal of a node that leaves you
with fewer hosts than you need to be healthy.
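
A sketch of the kind of query I mean, using PG ids from earlier in the thread as examples:

root@host01:~# ceph pg 16.ee query | less
# look at "up", "acting" and the "recovery_state" section; comparing an undersized PG (16.ee)
# with a remapped one (16.6f) should show where CRUSH stopped finding a 5th host
root@host01:~# ceph pg 16.6f query | less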



Re: [ceph-users] CRUSH rule seems to work fine not for all PGs in erasure coded pools

2017-11-30 Thread Denes Dolhay
As per your ceph status, it seems that you have 19 pools; are all of them
erasure coded as 3+2?


It seems that when you took the node offline, Ceph could move some of
the PGs to other nodes (it seems that one or more pools do not require
all 5 OSDs to be healthy. Maybe they are replicated, or not 3+2 erasure
coded?)


These PGs are the active+clean+remapped ones. (Ceph could successfully put
these on other OSDs to maintain the replica count / erasure coding profile,
and this remapping process completed.)


Some other PGs do seem to require all 5 OSDs to be present; these are
the "undersized" ones.



One other thing: if your failure domain is osd and not host or a larger
unit, then Ceph will not try to place all replicas on different servers,
just different OSDs, hence it can satisfy the criteria even if one of
the hosts is down. This setting would be highly inadvisable on a
production system!
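
A quick way to check which pools are replicated vs. erasure coded (a sketch; on luminous the pool property is called crush_rule, older releases used crush_ruleset):

root@host01:~# ceph osd pool ls detail
# shows per pool whether it is replicated or erasure, its size/min_size, crush rule and EC profile
root@host01:~# ceph osd pool get <pool-name> crush_rule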



Denes.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com