I take it that OSD 8, 13, and 20 are some of the stopped OSDs.

I wasn't able to get ceph to act on ceph pg force_create_pg until the OSDs
listed in [recovery_state][probing_osds] from ceph pg query were online.  I
ended up reformatting most of them and re-adding them to the cluster.
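
For reference, a quick way to see which OSDs a PG is still waiting on is to
pull that field out of the pg query JSON. This is only a sketch, and it
assumes the field names in the Firefly-era recovery_state output shown in
this thread:

ceph pg 2.587 query | python -c '
import json, sys
q = json.load(sys.stdin)
for s in q.get("recovery_state", []):
    for k in ("probing_osds", "down_osds_we_would_probe"):
        if k in s:
            sys.stdout.write("%s: %s\n" % (k, s[k]))
'

Any OSD ids listed there have to come back up (or be declared lost and
actually cleared) before the PG can make progress.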

What's wrong with those OSDs?  How slow are they?  If the problem is just
that they're really slow, try starting them up, and manually marking them
UP and OUT.  That way Ceph will read from them, but not write to them.  If
they won't stay up, I'd replace them, and get the replacements back in the
cluster.  I'd leave the replacements UP and OUT.  You can rebalance later,
after the cluster is healthy again.
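
For example, for osd.8 the "up but out" state would look roughly like this.
The service invocation is an assumption for Ubuntu 12.04 nodes; use whatever
init wrapper actually manages your OSDs:

start ceph-osd id=8             # upstart; or: service ceph start osd.8
ceph osd out 8                  # readable, but no new writes land on it
ceph osd tree | grep -w osd.8   # should show it up, with reweight 0

The daemon marks itself up on its own once it reaches the monitors; "out" is
the part you set by hand.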



I've never seen the replay state, so I'm not sure what to do with that.



On Mon, Aug 18, 2014 at 5:05 AM, Riederer, Michael <michael.riede...@br.de>
wrote:

>  One thing has changed in the cluster compared to my first mail: it was
> able to repair one pg, but now a different pg is in the status
> "active+clean+replay":
>
> root@ceph-admin-storage:~# ceph pg dump | grep "^2.92"
> dumped all in format plain
> 2.92    0    0    0    0    0    0    0    active+clean    2014-08-18
> 10:37:20.962858    0'0    36830:577    [8,13]    8    [8,13]    8    0'0
> 2014-08-18 10:37:20.962728    13503'1390419    2014-08-14 10:37:12.497492
> root@ceph-admin-storage:~# ceph pg dump | grep replay
> dumped all in format plain
> 0.49a    0    0    0    0    0    0    0    active+clean+replay
> 2014-08-18 13:09:15.317221    0'0    36830:1704    [12,10]    12
> [12,10]    12    0'0    2014-08-18 13:09:15.317131    0'0    2014-08-18
> 13:09:15.317131
>
> Mike
>
>  ------------------------------
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Riederer, Michael [michael.riede...@br.de]
> *Sent:* Monday, 18 August 2014 13:40
> *To:* Craig Lewis
> *Cc:* ceph-users@lists.ceph.com; Karan Singh
>
> *Subject:* Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck
> inactive; 4 pgs stuck unclean
>
>   Hi Craig,
>
> I have brought the cluster into a stable condition. All slow osds are no
> longer in the cluster. All remaining 36 osds can write at more than 100 MB/sec
> (dd if=/dev/zero of=testfile-2.txt bs=1024 count=4096000). No ceph client
> is connected to the cluster, and the ceph nodes are idle. The state now
> looks as follows:
>
> root@ceph-admin-storage:~# ceph -s
>     cluster 6b481875-8be5-4508-b075-e1f660fd7b33
>      health HEALTH_WARN 3 pgs down; 3 pgs incomplete; 3 pgs stuck
> inactive; 3 pgs stuck unclean
>      monmap e2: 3 mons at {ceph-1-storage=
> 10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
> election epoch 5018, quorum 0,1,2
> ceph-1-storage,ceph-2-storage,ceph-3-storage
>      osdmap e36830: 36 osds: 36 up, 36 in
>       pgmap v10907190: 6144 pgs, 3 pools, 10997 GB data, 2760 kobjects
>             22051 GB used, 68206 GB / 90258 GB avail
>                 6140 active+clean
>                    3 down+incomplete
>                    1 active+clean+replay
>
> root@ceph-admin-storage:~# ceph health detail
> HEALTH_WARN 3 pgs down; 3 pgs incomplete; 3 pgs stuck inactive; 3 pgs
> stuck unclean
> pg 2.c1 is stuck inactive since forever, current state down+incomplete,
> last acting [13,8]
> pg 2.e3 is stuck inactive since forever, current state down+incomplete,
> last acting [20,8]
> pg 2.587 is stuck inactive since forever, current state down+incomplete,
> last acting [13,8]
> pg 2.c1 is stuck unclean since forever, current state down+incomplete,
> last acting [13,8]
> pg 2.e3 is stuck unclean since forever, current state down+incomplete,
> last acting [20,8]
> pg 2.587 is stuck unclean since forever, current state down+incomplete,
> last acting [13,8]
> pg 2.587 is down+incomplete, acting [13,8]
> pg 2.e3 is down+incomplete, acting [20,8]
> pg 2.c1 is down+incomplete, acting [13,8]
>
> I have tried the following:
>
> root@ceph-admin-storage:~# ceph pg scrub 2.587
> instructing pg 2.587 on osd.13 to scrub
> root@ceph-admin-storage:~# ceph pg scrub 2.e3
> instructing pg 2.e3 on osd.20 to scrub
> root@ceph-admin-storage:~# ceph pg scrub 2.c1
> instructing pg 2.c1 on osd.13 to scrub
>
> root@ceph-admin-storage:~# ceph pg deep-scrub 2.587
> instructing pg 2.587 on osd.13 to deep-scrub
> root@ceph-admin-storage:~# ceph pg deep-scrub 2.e3
> instructing pg 2.e3 on osd.20 to deep-scrub
> root@ceph-admin-storage:~# ceph pg deep-scrub 2.c1
> instructing pg 2.c1 on osd.13 to deep-scrub
>
> root@ceph-admin-storage:~# ceph pg repair 2.587
> instructing pg 2.587 on osd.13 to repair
> root@ceph-admin-storage:~# ceph pg repair 2.e3
> instructing pg 2.e3 on osd.20 to repair
> root@ceph-admin-storage:~# ceph pg repair 2.c1
> instructing pg 2.c1 on osd.13 to repair
>
> In the monitor logfiles (ceph-mon.ceph-1/2/3-storage.log) I see the pg
> scrub, pg deep-scrub and pg repair commands, but I do not see anything in
> ceph.log and nothing in the ceph-osd.13/20/8.log.
> (2014-08-18 13:24:49.337954 7f24ac111700  0 mon.ceph-1-storage@0(leader)
> e2 handle_command mon_command({"prefix": "pg repair", "pgid": "2.587"} v 0)
> v1)
>
> Is it possible to repair the ceph-cluster?
>
> root@ceph-admin-storage:~# ceph pg force_create_pg 2.587
> pg 2.587 now creating, ok
>
> But nothing happens; the pg is not created.
>
> root@ceph-admin-storage:~# ceph -s
>     cluster 6b481875-8be5-4508-b075-e1f660fd7b33
>      health HEALTH_WARN 2 pgs down; 2 pgs incomplete; 3 pgs stuck
> inactive; 3 pgs stuck unclean
>      monmap e2: 3 mons at {ceph-1-storage=
> 10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
> election epoch 5018, quorum 0,1,2
> ceph-1-storage,ceph-2-storage,ceph-3-storage
>      osdmap e36830: 36 osds: 36 up, 36 in
>       pgmap v10907191: 6144 pgs, 3 pools, 10997 GB data, 2760 kobjects
>             22051 GB used, 68206 GB / 90258 GB avail
>                    1 creating
>                 6140 active+clean
>                    2 down+incomplete
>                    1 active+clean+replay
> root@ceph-admin-storage:~# ceph health detail
> HEALTH_WARN 2 pgs down; 2 pgs incomplete; 3 pgs stuck inactive; 3 pgs
> stuck unclean
> pg 2.c1 is stuck inactive since forever, current state down+incomplete,
> last acting [13,8]
> pg 2.e3 is stuck inactive since forever, current state down+incomplete,
> last acting [20,8]
> pg 2.587 is stuck inactive since forever, current state creating, last
> acting []
> pg 2.c1 is stuck unclean since forever, current state down+incomplete,
> last acting [13,8]
> pg 2.e3 is stuck unclean since forever, current state down+incomplete,
> last acting [20,8]
> pg 2.587 is stuck unclean since forever, current state creating, last
> acting []
> pg 2.e3 is down+incomplete, acting [20,8]
> pg 2.c1 is down+incomplete, acting [13,8]
>
> What can I do to get rid of the "incomplete" or "creating" pg?
>
> Regards,
> Mike
>
>
>  ------------------------------
> *From:* Craig Lewis [cle...@centraldesktop.com]
> *Sent:* Thursday, 14 August 2014 19:56
> *To:* Riederer, Michael
> *Cc:* Karan Singh; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck
> inactive; 4 pgs stuck unclean
>
>  It sounds like you need to throttle recovery.  I have this in my
> ceph.conf:
> [osd]
>   osd max backfills = 1
>   osd recovery max active = 1
>   osd recovery op priority = 1
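>
>  If you want to try those without restarting the OSDs, injectargs should
> apply them at runtime (just a sketch; the ceph.conf entries above are what
> make them stick across restarts):
>
> ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'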
>
>
>  Those configs, plus SSD journals, really helped the stability of my
> cluster during recovery.  Before I made those changes, I would see OSDs get
> voted down by other OSDs for not responding to heartbeats quickly.
>  Messages in ceph.log like:
> osd.# IP:PORT 420 : [WRN] map e41738 wrongly marked me down
>
>  are an indication that OSDs are so overloaded that they're getting
> kicked out.
>
>
>
>  I also ran into problems when OSDs were getting kicked repeatedly.  It
> caused those really large sections in pg query's 
> [recovery_state][past_intervals]
> that you also have. I would restart an OSD, it would peer, and then
> suicide timeout 300 seconds after starting the peering process.  When I
> first saw it, it was only affecting a few OSDs.  If you're seeing
> repeated suicide timeouts in the OSD's logs, there's a manual process to
> catch them up.
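>
>  One knob that sometimes buys peering enough time to chew through that
> history is the OSD suicide timeout. This is just a sketch of one option,
> not the full manual process, so treat the option name and value as
> assumptions to verify on your version:
>
> ceph tell osd.* injectargs '--osd-op-thread-suicide-timeout 600'
>
> Put it back to the default once the PGs have peered.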
>
>
>
> On Thu, Aug 14, 2014 at 12:25 AM, Riederer, Michael <
> michael.riede...@br.de> wrote:
>
>>  Hi Craig,
>>
>> Yes, we have stability problems. The cluster is definitely not suitable
>> for a production environment. I will not describe the details here; I
>> want to get to know ceph, and that is possible with the test cluster. Some
>> osds are very slow, less than 15 MB/sec writable. The load on the ceph
>> nodes also climbs to over 30 when an osd is removed and a reorganization
>> of the data is necessary. When the load is that high (over 30), I have
>> seen exactly what you describe: osds go down and out and come back up and
>> in.
>>
>> OK. I'll try removing the slow osds and then scrubbing and deep-scrubbing
>> the pgs.
>>
>> Many thanks for your help.
>>
>> Regards,
>> Mike
>>
>>  ------------------------------
>> *From:* Craig Lewis [cle...@centraldesktop.com]
>> *Sent:* Wednesday, 13 August 2014 19:48
>>
>> *To:* Riederer, Michael
>> *Cc:* Karan Singh; ceph-users@lists.ceph.com
>> *Subject:* Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck
>> inactive; 4 pgs stuck unclean
>>
>>    Yes, ceph pg <PGID> query, not dump.  Sorry about that.
>>
>>  Are you having problems with OSD stability?  There's a lot of history
>> in the [recovery_state][past_intervals]. That's normal when OSDs go
>> down, and out, and come back up and in. You have a lot of history there.
>> You might even be getting to the point where you have so much failover
>> history that the OSDs can't process it all before they hit the suicide
>> timeout.
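>>
>>  A rough way to see how much history a PG is carrying, assuming the same
>> recovery_state layout as in your pg query dumps:
>>
>> ceph pg 2.587 query | python -c '
>> import json, sys
>> q = json.load(sys.stdin)
>> for s in q.get("recovery_state", []):
>>     if "past_intervals" in s:
>>         sys.stdout.write("past_intervals entries: %d\n" % len(s["past_intervals"]))
>> '
>>
>> Hundreds or thousands of entries there usually mean the OSDs have been
>> flapping for a long time.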
>>
>>  [recovery_state][probing_osds] lists a lot of OSDs that have recently
>> owned these PGs. If the OSDs are crashing frequently, you need to get that
>> under control before proceeding.
>>
>>  Once the OSDs are stable, I think Ceph just needs to scrub and
>> deep-scrub those PGs.
>>
>>
>>  Until Ceph clears out the [recovery_state][probing_osds] section in the
>> pg query, it's not going to do anything.  ceph osd lost hears you, but
>> doesn't trust you.  Ceph won't do anything until it's actually checked
>> those OSDs itself.  Scrubbing and Deep scrubbing should convince it.
>>
>>  Once that [recovery_state][probing_osds] section is gone, you should
>> see the [recovery_state][past_intervals] section shrink or disappear. I
>> don't have either section in my pg query. Once that happens, your ceph
>> pg repair or ceph pg force_create_pg should finally have some effect.
>>  You may or may not need to re-issue those commands.
>>
>>
>>
>>
>> On Tue, Aug 12, 2014 at 9:32 PM, Riederer, Michael <
>> michael.riede...@br.de> wrote:
>>
>>>  Hi Craig,
>>>
>>> # ceph pg 2.587 query
>>> # ceph pg 2.c1 query
>>> # ceph pg 2.92 query
>>> # ceph pg 2.e3 query
>>>
>>> Please download the output form here:
>>> http://server.riederer.org/ceph-user/
>>>
>>> #####################################
>>>
>>>
>>> It is not possible to map an rbd:
>>>
>>> # rbd map testshareone --pool rbd --name client.admin
>>> rbd: add failed: (5) Input/output error
>>>
>>> I found that:
>>> http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/11405
>>>  # ceph osd getcrushmap -o crushmap.bin
>>>  got crush map from osdmap epoch 3741
>>> # crushtool -i crushmap.bin --set-chooseleaf_vary_r 0 -o crushmap-new.bin
>>> # ceph osd setcrushmap -i crushmap-new.bin
>>> set crush map
>>>
>>> The cluster had to do some work. Now it looks a bit different.
>>>
>>> It is still not possible to map an rbd.
>>>
>>>  root@ceph-admin-storage:~# ceph -s
>>>     cluster 6b481875-8be5-4508-b075-e1f660fd7b33
>>>      health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs
>>> stuck unclean
>>>      monmap e2: 3 mons at {ceph-1-storage=
>>> 10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
>>> election epoch 5010, quorum 0,1,2
>>> ceph-1-storage,ceph-2-storage,ceph-3-storage
>>>       osdmap e34206: 55 osds: 55 up, 55 in
>>>       pgmap v10838368: 6144 pgs, 3 pools, 11002 GB data, 2762 kobjects
>>>             22078 GB used, 79932 GB / 102010 GB avail
>>>                 6140 active+clean
>>>                    4 incomplete
>>>
>>> root@ceph-admin-storage:~# ceph health detail
>>> HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
>>> pg 2.92 is stuck inactive since forever, current state incomplete, last
>>> acting [8,13]
>>>  pg 2.c1 is stuck inactive since forever, current state incomplete, last
>>> acting [13,8]
>>> pg 2.e3 is stuck inactive since forever, current state incomplete, last
>>> acting [20,8]
>>> pg 2.587 is stuck inactive since forever, current state incomplete, last
>>> acting [13,8]
>>>
>>> pg 2.92 is stuck unclean since forever, current state incomplete, last
>>> acting [8,13]
>>>  pg 2.c1 is stuck unclean since forever, current state incomplete, last
>>> acting [13,8]
>>> pg 2.e3 is stuck unclean since forever, current state incomplete, last
>>> acting [20,8]
>>> pg 2.587 is stuck unclean since forever, current state incomplete, last
>>> acting [13,8]
>>> pg 2.587 is incomplete, acting [13,8]
>>> pg 2.e3 is incomplete, acting [20,8]
>>> pg 2.c1 is incomplete, acting [13,8]
>>>
>>> pg 2.92 is incomplete, acting [8,13]
>>>
>>>  #######################################################################
>>>
>>> After updating to firefly, I did the following:
>>>
>>> # ceph health detail
>>> HEALTH_WARN crush map has legacy tunables crush map has legacy tunables;
>>> see http://ceph.com/docs/master/rados/operations/crush-map/#tunables
>>>
>>> # ceph osd crush tunables optimal
>>> adjusted tunables profile to optimal
>>>
>>> Mike
>>>  ------------------------------
>>> *From:* Craig Lewis [cle...@centraldesktop.com]
>>> *Sent:* Tuesday, 12 August 2014 20:02
>>> *To:* Riederer, Michael
>>> *Cc:* Karan Singh; ceph-users@lists.ceph.com
>>>
>>> *Subject:* Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck
>>> inactive; 4 pgs stuck unclean
>>>
>>>    For the incomplete PGs, can you give me the output of
>>> ceph pg <PGID> dump
>>>
>>>  I'm interested in the recovery_state key of that JSON data.
>>>
>>>
>>>
>>> On Tue, Aug 12, 2014 at 5:29 AM, Riederer, Michael <
>>> michael.riede...@br.de> wrote:
>>>
>>>>  Sorry, but I think that does not help me. I forgot to mention
>>>> something about the operating system:
>>>>
>>>> root@ceph-1-storage:~# dpkg -l | grep libleveldb1
>>>> ii  libleveldb1                       1.12.0-1precise.ceph
>>>> fast key-value storage library
>>>> root@ceph-1-storage:~# lsb_release -a
>>>> No LSB modules are available.
>>>> Distributor ID: Ubuntu
>>>> Description:    Ubuntu 12.04.5 LTS
>>>> Release:        12.04
>>>> Codename:       precise
>>>> root@ceph-1-storage:~# uname -a
>>>> Linux ceph-1-storage 3.5.0-52-generic #79~precise1-Ubuntu SMP Fri Jul 4
>>>> 21:03:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>>
>>>> libleveldb1 is newer than the mentioned version 1.9.0-1~bpo70+1.
>>>>
>>>> All ceph nodes are IBM x3650 servers with 2.00 GHz Intel Xeon CPUs and
>>>> 8 GB RAM. OK, they are all very old, about eight years, but they are
>>>> still running.
>>>>
>>>> Mike
>>>>
>>>>
>>>>
>>>>  ------------------------------
>>>> *From:* Karan Singh [karan.si...@csc.fi]
>>>> *Sent:* Tuesday, 12 August 2014 13:00
>>>>
>>>> *To:* Riederer, Michael
>>>> *Cc:* ceph-users@lists.ceph.com
>>>> *Subject:* Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck
>>>> inactive; 4 pgs stuck unclean
>>>>
>>>>   I am not sure if this helps, but have a look:
>>>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg10078.html
>>>>
>>>> - Karan -
>>>>
>>>>  On 12 Aug 2014, at 12:04, Riederer, Michael <michael.riede...@br.de>
>>>> wrote:
>>>>
>>>>  Hi Karan,
>>>>
>>>> root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list#
>>>> ceph osd getcrushmap -o crushmap.bin
>>>> got crush map from osdmap epoch 30748
>>>> root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list#
>>>> crushtool -d crushmap.bin -o crushmap.txt
>>>> root@ceph-admin-storage:~/ceph-cluster/crush-map-4-ceph-user-list# cat
>>>> crushmap.txt
>>>> # begin crush map
>>>> tunable choose_local_tries 0
>>>> tunable choose_local_fallback_tries 0
>>>> tunable choose_total_tries 50
>>>> tunable chooseleaf_descend_once 1
>>>> tunable chooseleaf_vary_r 1
>>>>
>>>> # devices
>>>> device 0 osd.0
>>>> device 1 osd.1
>>>> device 2 osd.2
>>>> device 3 osd.3
>>>> device 4 osd.4
>>>> device 5 osd.5
>>>> device 6 osd.6
>>>> device 7 osd.7
>>>> device 8 osd.8
>>>> device 9 osd.9
>>>> device 10 osd.10
>>>> device 11 osd.11
>>>> device 12 osd.12
>>>> device 13 osd.13
>>>> device 14 osd.14
>>>> device 15 osd.15
>>>> device 16 osd.16
>>>> device 17 osd.17
>>>> device 18 osd.18
>>>> device 19 osd.19
>>>> device 20 osd.20
>>>> device 21 device21
>>>> device 22 osd.22
>>>> device 23 osd.23
>>>> device 24 osd.24
>>>> device 25 osd.25
>>>> device 26 osd.26
>>>> device 27 device27
>>>> device 28 osd.28
>>>> device 29 osd.29
>>>> device 30 osd.30
>>>> device 31 osd.31
>>>> device 32 osd.32
>>>> device 33 osd.33
>>>> device 34 osd.34
>>>> device 35 osd.35
>>>> device 36 osd.36
>>>> device 37 osd.37
>>>> device 38 osd.38
>>>> device 39 osd.39
>>>> device 40 device40
>>>> device 41 device41
>>>> device 42 osd.42
>>>> device 43 osd.43
>>>> device 44 osd.44
>>>> device 45 osd.45
>>>> device 46 osd.46
>>>> device 47 osd.47
>>>> device 48 osd.48
>>>> device 49 osd.49
>>>> device 50 osd.50
>>>> device 51 osd.51
>>>> device 52 osd.52
>>>> device 53 osd.53
>>>> device 54 osd.54
>>>> device 55 osd.55
>>>> device 56 osd.56
>>>> device 57 osd.57
>>>> device 58 osd.58
>>>>
>>>> # types
>>>> type 0 osd
>>>> type 1 host
>>>> type 2 rack
>>>> type 3 row
>>>> type 4 room
>>>> type 5 datacenter
>>>> type 6 root
>>>>
>>>> # buckets
>>>> host ceph-1-storage {
>>>>     id -2        # do not change unnecessarily
>>>>     # weight 19.330
>>>>     alg straw
>>>>     hash 0    # rjenkins1
>>>>     item osd.0 weight 0.910
>>>>     item osd.2 weight 0.910
>>>>     item osd.3 weight 0.910
>>>>     item osd.4 weight 1.820
>>>>     item osd.9 weight 1.360
>>>>     item osd.11 weight 0.680
>>>>     item osd.6 weight 3.640
>>>>     item osd.5 weight 1.820
>>>>     item osd.7 weight 3.640
>>>>     item osd.8 weight 3.640
>>>> }
>>>> host ceph-2-storage {
>>>>     id -3        # do not change unnecessarily
>>>>     # weight 20.000
>>>>     alg straw
>>>>     hash 0    # rjenkins1
>>>>     item osd.14 weight 3.640
>>>>     item osd.18 weight 1.360
>>>>     item osd.19 weight 1.360
>>>>     item osd.15 weight 3.640
>>>>     item osd.1 weight 3.640
>>>>     item osd.12 weight 3.640
>>>>     item osd.22 weight 0.680
>>>>     item osd.23 weight 0.680
>>>>     item osd.26 weight 0.680
>>>>     item osd.36 weight 0.680
>>>> }
>>>> host ceph-5-storage {
>>>>     id -4        # do not change unnecessarily
>>>>     # weight 11.730
>>>>     alg straw
>>>>     hash 0    # rjenkins1
>>>>     item osd.32 weight 0.270
>>>>     item osd.37 weight 0.270
>>>>     item osd.42 weight 0.270
>>>>     item osd.43 weight 1.820
>>>>     item osd.44 weight 1.820
>>>>     item osd.45 weight 1.820
>>>>     item osd.46 weight 1.820
>>>>     item osd.47 weight 1.820
>>>>     item osd.48 weight 1.820
>>>> }
>>>> room room0 {
>>>>     id -8        # do not change unnecessarily
>>>>     # weight 51.060
>>>>     alg straw
>>>>     hash 0    # rjenkins1
>>>>     item ceph-1-storage weight 19.330
>>>>     item ceph-2-storage weight 20.000
>>>>     item ceph-5-storage weight 11.730
>>>> }
>>>> host ceph-3-storage {
>>>>     id -5        # do not change unnecessarily
>>>>     # weight 15.920
>>>>     alg straw
>>>>     hash 0    # rjenkins1
>>>>     item osd.24 weight 1.820
>>>>     item osd.25 weight 1.820
>>>>     item osd.29 weight 1.360
>>>>     item osd.10 weight 3.640
>>>>     item osd.13 weight 3.640
>>>>     item osd.20 weight 3.640
>>>> }
>>>> host ceph-4-storage {
>>>>     id -6        # do not change unnecessarily
>>>>     # weight 20.000
>>>>     alg straw
>>>>     hash 0    # rjenkins1
>>>>     item osd.34 weight 3.640
>>>>     item osd.38 weight 1.360
>>>>     item osd.39 weight 1.360
>>>>     item osd.16 weight 3.640
>>>>     item osd.30 weight 0.680
>>>>     item osd.35 weight 3.640
>>>>     item osd.17 weight 3.640
>>>>     item osd.28 weight 0.680
>>>>     item osd.31 weight 0.680
>>>>     item osd.33 weight 0.680
>>>> }
>>>> host ceph-6-storage {
>>>>     id -7        # do not change unnecessarily
>>>>     # weight 12.720
>>>>     alg straw
>>>>     hash 0    # rjenkins1
>>>>     item osd.49 weight 0.450
>>>>     item osd.50 weight 0.450
>>>>     item osd.51 weight 0.450
>>>>     item osd.52 weight 0.450
>>>>     item osd.53 weight 1.820
>>>>     item osd.54 weight 1.820
>>>>     item osd.55 weight 1.820
>>>>     item osd.56 weight 1.820
>>>>     item osd.57 weight 1.820
>>>>     item osd.58 weight 1.820
>>>> }
>>>> room room1 {
>>>>     id -9        # do not change unnecessarily
>>>>     # weight 48.640
>>>>     alg straw
>>>>     hash 0    # rjenkins1
>>>>     item ceph-3-storage weight 15.920
>>>>     item ceph-4-storage weight 20.000
>>>>     item ceph-6-storage weight 12.720
>>>> }
>>>> root default {
>>>>     id -1        # do not change unnecessarily
>>>>     # weight 99.700
>>>>     alg straw
>>>>     hash 0    # rjenkins1
>>>>     item room0 weight 51.060
>>>>     item room1 weight 48.640
>>>> }
>>>>
>>>> # rules
>>>> rule data {
>>>>     ruleset 0
>>>>     type replicated
>>>>     min_size 1
>>>>     max_size 10
>>>>     step take default
>>>>     step chooseleaf firstn 0 type host
>>>>     step emit
>>>> }
>>>> rule metadata {
>>>>     ruleset 1
>>>>     type replicated
>>>>     min_size 1
>>>>     max_size 10
>>>>     step take default
>>>>     step chooseleaf firstn 0 type host
>>>>     step emit
>>>> }
>>>> rule rbd {
>>>>     ruleset 2
>>>>     type replicated
>>>>     min_size 1
>>>>     max_size 10
>>>>     step take default
>>>>     step chooseleaf firstn 0 type host
>>>>     step emit
>>>> }
>>>>
>>>> # end crush map
>>>>
>>>> root@ceph-admin-storage:~# ceph osd dump | grep -i pool
>>>> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>>>> rjenkins pg_num 2048 pgp_num 2048 last_change 4623 crash_replay_interval 45
>>>> stripe_width 0
>>>> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1
>>>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 4627 stripe_width
>>>> 0
>>>> pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash
>>>> rjenkins pg_num 2048 pgp_num 2048 last_change 4632 stripe_width 0
>>>>
>>>>
>>>> Mike
>>>>  ------------------------------
>>>> *From:* Karan Singh [karan.si...@csc.fi]
>>>> *Sent:* Tuesday, 12 August 2014 10:35
>>>> *To:* Riederer, Michael
>>>> *Cc:* ceph-users@lists.ceph.com
>>>> *Subject:* Re: [ceph-users] HEALTH_WARN 4 pgs incomplete; 4 pgs stuck
>>>> inactive; 4 pgs stuck unclean
>>>>
>>>>   Can you provide the output of your cluster's "ceph osd dump | grep -i
>>>> pool" and your crush map?
>>>>
>>>>
>>>> - Karan -
>>>>
>>>>  On 12 Aug 2014, at 10:40, Riederer, Michael <michael.riede...@br.de>
>>>> wrote:
>>>>
>>>>  Hi all,
>>>>
>>>> How do I get my Ceph Cluster back to a healthy state?
>>>>
>>>> root@ceph-admin-storage:~# ceph -v
>>>> ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
>>>> root@ceph-admin-storage:~# ceph -s
>>>>     cluster 6b481875-8be5-4508-b075-e1f660fd7b33
>>>>      health HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs
>>>> stuck unclean
>>>>      monmap e2: 3 mons at {ceph-1-storage=
>>>> 10.65.150.101:6789/0,ceph-2-storage=10.65.150.102:6789/0,ceph-3-storage=10.65.150.103:6789/0},
>>>> election epoch 5010, quorum 0,1,2
>>>> ceph-1-storage,ceph-2-storage,ceph-3-storage
>>>>      osdmap e30748: 55 osds: 55 up, 55 in
>>>>       pgmap v10800465: 6144 pgs, 3 pools, 11002 GB data, 2762 kobjects
>>>>             22077 GB used, 79933 GB / 102010 GB avail
>>>>                 6138 active+clean
>>>>                    4 incomplete
>>>>                    2 active+clean+replay
>>>> root@ceph-admin-storage:~# ceph health detail
>>>> HEALTH_WARN 4 pgs incomplete; 4 pgs stuck inactive; 4 pgs stuck unclean
>>>> pg 2.92 is stuck inactive since forever, current state incomplete, last
>>>> acting [8,13]
>>>> pg 2.c1 is stuck inactive since forever, current state incomplete, last
>>>> acting [13,7]
>>>> pg 2.e3 is stuck inactive since forever, current state incomplete, last
>>>> acting [20,7]
>>>> pg 2.587 is stuck inactive since forever, current state incomplete,
>>>> last acting [13,5]
>>>> pg 2.92 is stuck unclean since forever, current state incomplete, last
>>>> acting [8,13]
>>>> pg 2.c1 is stuck unclean since forever, current state incomplete, last
>>>> acting [13,7]
>>>> pg 2.e3 is stuck unclean since forever, current state incomplete, last
>>>> acting [20,7]
>>>> pg 2.587 is stuck unclean since forever, current state incomplete, last
>>>> acting [13,5]
>>>> pg 2.587 is incomplete, acting [13,5]
>>>> pg 2.e3 is incomplete, acting [20,7]
>>>> pg 2.c1 is incomplete, acting [13,7]
>>>> pg 2.92 is incomplete, acting [8,13]
>>>> root@ceph-admin-storage:~# ceph pg dump_stuck inactive
>>>> ok
>>>> pg_stat    objects    mip    degr    unf    bytes    log    disklog
>>>> state    state_stamp    v    reported    up    up_primary    acting
>>>> acting_primary    last_scrub    scrub_stamp    last_deep_scrub
>>>> deep_scrub_stamp
>>>> 2.92    0    0    0    0    0    0    0    incomplete    2014-08-08
>>>> 12:39:20.204592    0'0    30748:7729    [8,13]    8    [8,13]    8
>>>> 13503'1390419    2014-06-26 01:57:48.727625    13503'1390419    2014-06-22
>>>> 01:57:30.114186
>>>> 2.c1    0    0    0    0    0    0    0    incomplete    2014-08-08
>>>> 12:39:18.846542    0'0    30748:7117    [13,7]    13    [13,7]    13
>>>> 13503'1687017    2014-06-26 20:52:51.249864    13503'1687017    2014-06-22
>>>> 14:24:22.633554
>>>> 2.e3    0    0    0    0    0    0    0    incomplete    2014-08-08
>>>> 12:39:29.311552    0'0    30748:8027    [20,7]    20    [20,7]    20
>>>> 13503'1398727    2014-06-26 07:03:25.899254    13503'1398727    2014-06-21
>>>> 07:02:31.393053
>>>> 2.587    0    0    0    0    0    0    0    incomplete    2014-08-08
>>>> 12:39:19.715724    0'0    30748:7060    [13,5]    13    [13,5]    13
>>>> 13646'1542934    2014-06-26 07:48:42.089935    13646'1542934    2014-06-22
>>>> 07:46:20.363695
>>>> root@ceph-admin-storage:~# ceph osd tree
>>>> # id    weight    type name    up/down    reweight
>>>> -1    99.7    root default
>>>> -8    51.06        room room0
>>>> -2    19.33            host ceph-1-storage
>>>> 0    0.91                osd.0    up    1
>>>> 2    0.91                osd.2    up    1
>>>> 3    0.91                osd.3    up    1
>>>> 4    1.82                osd.4    up    1
>>>> 9    1.36                osd.9    up    1
>>>> 11    0.68                osd.11    up    1
>>>> 6    3.64                osd.6    up    1
>>>> 5    1.82                osd.5    up    1
>>>> 7    3.64                osd.7    up    1
>>>> 8    3.64                osd.8    up    1
>>>> -3    20            host ceph-2-storage
>>>> 14    3.64                osd.14    up    1
>>>> 18    1.36                osd.18    up    1
>>>> 19    1.36                osd.19    up    1
>>>> 15    3.64                osd.15    up    1
>>>> 1    3.64                osd.1    up    1
>>>> 12    3.64                osd.12    up    1
>>>> 22    0.68                osd.22    up    1
>>>> 23    0.68                osd.23    up    1
>>>> 26    0.68                osd.26    up    1
>>>> 36    0.68                osd.36    up    1
>>>> -4    11.73            host ceph-5-storage
>>>> 32    0.27                osd.32    up    1
>>>> 37    0.27                osd.37    up    1
>>>> 42    0.27                osd.42    up    1
>>>> 43    1.82                osd.43    up    1
>>>> 44    1.82                osd.44    up    1
>>>> 45    1.82                osd.45    up    1
>>>> 46    1.82                osd.46    up    1
>>>> 47    1.82                osd.47    up    1
>>>> 48    1.82                osd.48    up    1
>>>> -9    48.64        room room1
>>>> -5    15.92            host ceph-3-storage
>>>> 24    1.82                osd.24    up    1
>>>> 25    1.82                osd.25    up    1
>>>> 29    1.36                osd.29    up    1
>>>> 10    3.64                osd.10    up    1
>>>> 13    3.64                osd.13    up    1
>>>> 20    3.64                osd.20    up    1
>>>> -6    20            host ceph-4-storage
>>>> 34    3.64                osd.34    up    1
>>>> 38    1.36                osd.38    up    1
>>>> 39    1.36                osd.39    up    1
>>>> 16    3.64                osd.16    up    1
>>>> 30    0.68                osd.30    up    1
>>>> 35    3.64                osd.35    up    1
>>>> 17    3.64                osd.17    up    1
>>>> 28    0.68                osd.28    up    1
>>>> 31    0.68                osd.31    up    1
>>>> 33    0.68                osd.33    up    1
>>>> -7    12.72            host ceph-6-storage
>>>> 49    0.45                osd.49    up    1
>>>> 50    0.45                osd.50    up    1
>>>> 51    0.45                osd.51    up    1
>>>> 52    0.45                osd.52    up    1
>>>> 53    1.82                osd.53    up    1
>>>> 54    1.82                osd.54    up    1
>>>> 55    1.82                osd.55    up    1
>>>> 56    1.82                osd.56    up    1
>>>> 57    1.82                osd.57    up    1
>>>> 58    1.82                osd.58    up    1
>>>>
>>>> What I have tried so far:
>>>> ceph pg repair 2.587 [2.e3 2.c1 2.92]
>>>> ceph pg force_create_pg 2.587 [2.e3 2.c1 2.92]
>>>> ceph osd lost 5 --yes-i-really-mean-it [7 8 13 20]
>>>>
>>>> The history in brief:
>>>> I installed Cuttlefish and updated to Dumpling and then to Emperor. The
>>>> cluster was healthy. Maybe I made a mistake during the repair of 8 broken
>>>> osds, but from then on I had incomplete pgs. Finally I updated from
>>>> Emperor to Firefly.
>>>>
>>>> Regards,
>>>> Mike
>>>>  
>>>> --------------------------------------------------------------------------------------------------
>>>>  Bayerischer Rundfunk; Rundfunkplatz 1; 80335 München  Telefon: +49 89
>>>> 590001; E-Mail: i...@br.de; Website: http://www.BR.de
>>>> <http://www.br.de/>
>>>>
>>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
