On 08/18/2014 12:13 PM, John Morris wrote:

On 08/14/2014 02:35 AM, Christian Balzer wrote:

The default (firefly, but previous ones are functionally identical) crush
map has:
---
# rules
rule replicated_ruleset {
         ruleset 0
         type replicated
         min_size 1
         max_size 10
         step take default
         step chooseleaf firstn 0 type host
         step emit
}
---

The type host states that there will be no more than one replica per host
(node), so with size=3 you will need at least 3 hosts to choose from.
If you were to change this to type osd, all 3 replicas could wind up on
the same host, which is not really a good idea.

Ah, this is a great clue.  (On my cluster, the default rule contains
'step choose firstn 0 type osd', and thus has the problem you hint at
here.)
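
Side note: to see what rules a running cluster actually has, and which
ruleset a given pool uses, something like this should work ('rbd' is
just an example pool name, and 'crush_ruleset' is the firefly-era name
of the pool key):

ceph osd crush rule dump
ceph osd pool get rbd crush_ruleset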

So I played with a new rule set using the buckets 'root', 'rack', 'host',
'bank' and 'osd', of which 'rack' and 'host' are unused.  About the 'bank'
bucket:  the OSD nodes each contain two 'banks' of disks, each with a
separate disk controller channel, a separate power supply cable, and a
separate SSD.  Thus, 'bank' actually does represent a real failure domain.
More importantly, this adds a bucket level above 'osd' (and below 'host')
with enough buckets to spread 3-4 replicas across.  Here's the rule:

rule by_bank {
         ruleset 3
         type replicated
         min_size 3
         max_size 4
         step take default
         step choose firstn 0 type bank
         step choose firstn 1 type osd
         step emit
}

Ah, with the 'legacy' tunables, a single 'chooseleaf' step generates bad
mappings for this rule, hence the two 'choose' steps above.  But by
injecting tunables into the map (as recommended at the link below), the
rule can be shortened to the following:

rule by_bank {
        ruleset 3
        type replicated
        min_size 3
        max_size 4
        step take default
        step chooseleaf firstn 0 type bank
        step emit
}

See this link:

http://ceph.com/docs/master/rados/operations/crush-map/#tuning-crush-the-hard-way

In the walkthrough below, after compiling the new CRUSH map but before
running the tests, inject the tunables into the binary map, and then run
the tests against /tmp/crush-new-tuned.bin instead:

crushtool --enable-unsafe-tunables \
  --set-choose-local-tries 0 \
  --set-choose-local-fallback-tries 0 \
  --set-choose-total-tries 50 \
  -i /tmp/crush-new.bin -o /tmp/crush-new-tuned.bin
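
Alternatively, the tunables can be set cluster-wide with a profile
instead of hand-editing the binary map; I haven't tested that path here,
the 'firefly' profile assumes a firefly or newer cluster, and changing
tunables on a live cluster can move a lot of data around:

ceph osd crush tunables firefly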


If the OP (sorry, Craig, you do have a name ;) wants to play with CRUSH
map rules, here's the quick and dirty of what I did:

# get the current 'orig' CRUSH map, decompile and edit; see:
#
http://ceph.com/docs/master/rados/operations/crush-map/#editing-a-crush-map

ceph osd getcrushmap -o /tmp/crush-orig.bin
crushtool -d /tmp/crush-orig.bin -o /tmp/crush.txt
$EDITOR /tmp/crush.txt

# edit the crush map with your fave editor; see:
# http://ceph.com/docs/master/rados/operations/crush-map
#
# in my case, I added the bank type:

type 0 osd
type 1 bank
type 2 host
type 3 rack
type 4 root

# the banks (repeat as applicable):

bank bank0 {
         id -6
         alg straw
         hash 0
         item osd.0 weight 1.000
         item osd.1 weight 1.000
}

bank bank1 {
         id -7
         alg straw
         hash 0
         item osd.2 weight 1.000
         item osd.3 weight 1.000
}

# updated the hosts (repeat as applicable):

host host0 {
         id -4           # do not change unnecessarily
         # weight 4.000
         alg straw
         hash 0  # rjenkins1
         item bank0 weight 2.000
         item bank1 weight 2.000
}

# and added the rule:

rule by_bank {
         ruleset 3
         type replicated
         min_size 3
         max_size 4
         step take default
         step choose firstn 0 type bank
         step choose firstn 1 type osd
         step emit
}

# compile the crush map:

crushtool -c /tmp/crush.txt -o /tmp/crush-new.bin
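
# (optional) sanity check: decompile the freshly compiled map and
# compare it with the edited text; harmless whitespace or comment
# differences may show up:

crushtool -d /tmp/crush-new.bin -o /tmp/crush-check.txt
diff -u /tmp/crush.txt /tmp/crush-check.txt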

# and run some tests; the replica sizes tested come from
# 'min_size' and 'max_size' in the above rule; see:
# http://ceph.com/docs/master/man/8/crushtool/#running-tests-with-test
#
# show mapping statistics (add --show-mappings to also print
# sample PG->OSD maps):

crushtool -i /tmp/crush-new.bin --test --show-statistics
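
# for example, to exercise only the new rule at a fixed replica count
# and print each mapping (ruleset 3 and num_rep 4 match the rule above):

crushtool -i /tmp/crush-new.bin --test --rule 3 --num-rep 4 \
  --show-mappings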

# show bad mappings; if the CRUSH map is correct,
# this should be empty:

crushtool -i /tmp/crush-new.bin --test --show-bad-mappings

# show per-OSD pg utilization:

crushtool -i /tmp/crush-new.bin --test --show-utilization
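
# once the tests look sane (again, use /tmp/crush-new-tuned.bin if the
# tunables were injected), load the map into the cluster and point a
# pool at the new ruleset; 'mypool' is just a placeholder name:

ceph osd setcrushmap -i /tmp/crush-new.bin
ceph osd pool set mypool crush_ruleset 3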


You might finagle something like that (again the rule splits on hosts) by
having multiple "hosts" on one physical machine, but therein lies madness.

Well, the bucket names can be changed, as above, and Sage hints at doing
something like this here:

http://wiki.ceph.com/Planning/Blueprints/Dumpling/extend_crush_rule_language


(And IIUC he also proposes something to implement my original
intentions:  distribute four replicas, two on each of two racks, and
don't put two replicas on the same host within a rack; this is more
easily generalized than the above funky configuration.)

     John