Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state
Replies inline:

Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, September 24, 2014 6:10 PM
To: Sahana Lokeshappa
Cc: Varada Kari; ceph-us...@ceph.com
Subject: RE: [Ceph-community] Pgs are in stale+down+peering state

On Wed, 24 Sep 2014, Sahana Lokeshappa wrote:

2.a9  518  0  0  0  0  2172649472  3001  3001  active+clean  2014-09-22 17:49:35.357586  6826'35762  17842:72706  [12,7,28]  12  [12,7,28]  12  6826'35762  2014-09-22 11:33:55.985449  0'0  2014-09-16 20:11:32.693864

Can you verify that 2.a9 exists in the data directory for 12, 7, and/or 28? If so, the next step would be to enable logging (debug osd = 20, debug ms = 1) and see why peering is stuck...

Yes, the 2.a9 directories are present on osd.12, 7 and 28, while the 0.49, 0.4d and 0.1c directories are not present on their respective acting osds.

Here are the logs I can see when the debug levels were raised to 20:

2014-09-24 18:38:41.706566 7f92e2dc8700 7 osd.12 pg_epoch: 17850 pg[2.738( v 6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 luod=0'0 crt=0'0 lcod 0'0 active] replica_scrub
2014-09-24 18:38:41.706586 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 luod=0'0 crt=0'0 lcod 0'0 active] build_scrub_map
2014-09-24 18:38:41.706592 7f92e2dc8700 20 osd.12 pg_epoch: 17850 pg[2.738( v 6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 luod=0'0 crt=0'0 lcod 0'0 active] scrub_map_chunk [476de738//0//-1,f38//0//-1)
2014-09-24 18:38:41.711778 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 luod=0'0 crt=0'0 lcod 0'0 active] _scan_list scanning 23 objects deeply
2014-09-24 18:38:41.730881 7f92ed5dd700 20 osd.12 17850 share_map_peer 0x89cda20 already has epoch 17850
2014-09-24 18:38:41.73 7f92eede0700 20 osd.12 17850 share_map_peer 0x89cda20 already has epoch 17850
2014-09-24 18:38:41.822444 7f92ed5dd700 20 osd.12 17850 share_map_peer 0xd2eb080 already has epoch 17850
2014-09-24 18:38:41.822519 7f92eede0700 20 osd.12 17850 share_map_peer 0xd2eb080 already has epoch 17850
2014-09-24 18:38:41.878894 7f92eede0700 20 osd.12 17850 share_map_peer 0xd5cd5a0 already has epoch 17850
2014-09-24 18:38:41.878921 7f92ed5dd700 20 osd.12 17850 share_map_peer 0xd5cd5a0 already has epoch 17850
2014-09-24 18:38:41.918307 7f92ed5dd700 20 osd.12 17850 share_map_peer 0x1161bde0 already has epoch 17850
2014-09-24 18:38:41.918426 7f92eede0700 20 osd.12 17850 share_map_peer 0x1161bde0 already has epoch 17850
2014-09-24 18:38:41.951678 7f92ed5dd700 20 osd.12 17850 share_map_peer 0x7fc5700 already has epoch 17850
2014-09-24 18:38:41.951709 7f92eede0700 20 osd.12 17850 share_map_peer 0x7fc5700 already has epoch 17850
2014-09-24 18:38:42.064759 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 luod=0'0 crt=0'0 lcod 0'0 active] build_scrub_map_chunk done.
2014-09-24 18:38:42.107016 7f92ed5dd700 20 osd.12 17850 share_map_peer 0x10377b80 already has epoch 17850
2014-09-24 18:38:42.107032 7f92eede0700 20 osd.12 17850 share_map_peer 0x10377b80 already has epoch 17850
2014-09-24 18:38:42.109356 7f92f15e5700 10 osd.12 17850 do_waiters -- start
2014-09-24 18:38:42.109372 7f92f15e5700 10 osd.12 17850 do_waiters -- finish
2014-09-24 18:38:42.109373 7f92f15e5700 20 osd.12 17850 _dispatch 0xeb0d900 replica scrub(pg: 2.738,from:0'0,to:6489'28646,epoch:17850,start:f38//0//-1,end:92371f38//0//-1,chunky:1,deep:1,version:5) v5
2014-09-24 18:38:42.109378 7f92f15e5700 10 osd.12 17850 queueing MOSDRepScrub replica scrub(pg: 2.738,from:0'0,to:6489'28646,epoch:17850,start:f38//0//-1,end:92371f38//0//-1,chunky:1,deep:1,version:5) v5
2014-09-24 18:38:42.109395 7f92f15e5700 10 osd.12 17850 do_waiters -- start
2014-09-24 18:38:42.109396 7f92f15e5700 10 osd.12 17850 do_waiters -- finish
2014-09-24 18:38:42.109456 7f92e2dc8700 7 osd.12 pg_epoch: 17850 pg[2.738( v 6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 luod=0'0 crt=0'0 lcod 0'0 active] replica_scrub
2014-09-24 18:38:42.109522 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c
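For reference, the directory check Sage asked for amounts to roughly the following on each acting OSD node. This is a sketch only, assuming the default FileStore data path under /var/lib/ceph/osd; the osd ids and pgids are the ones from this thread:

$ ls -d /var/lib/ceph/osd/ceph-12/current/2.a9_head      # reported present on osd.12, 7 and 28
$ ls -d /var/lib/ceph/osd/ceph-12/current/0.49_head      # reported missing on the acting osds for 0.49, 0.4d and 0.1c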
Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state
Hi Craig,

Sorry for the late response; somehow I missed this mail. All osds are up and running. There were no specific logs related to this activity, and there is no IO running right now. A few osds were marked in and out, removed fully and recreated before these pgs reached this state. I had tried restarting the osds; it didn't work.

Thanks
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

From: Craig Lewis [mailto:cle...@centraldesktop.com]
Sent: Wednesday, September 24, 2014 5:44 AM
To: Sahana Lokeshappa
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state
Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state
Hi All,

Here are the steps I followed to get all pgs back to the active+clean state (rough command equivalents are sketched below). I still don't know the root cause of this pg state.

1. Force-create the pgs which are in stale+down+peering
2. Stop osd.12
3. Mark osd.12 as lost
4. Start osd.12
5. All pgs came back to the active+clean state

Thanks
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com
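Roughly the commands those five steps correspond to. This is a sketch, not necessarily the exact invocations used; force_create_pg and 'osd lost' discard any remaining PG state, so they are only appropriate when the data is already known to be gone, and the upstart-style stop/start assumes an Ubuntu node:

$ ceph pg force_create_pg 0.4d
$ ceph pg force_create_pg 0.49
$ ceph pg force_create_pg 0.1c
$ sudo stop ceph-osd id=12                      # on the node hosting osd.12
$ ceph osd lost 12 --yes-i-really-mean-it       # mark osd.12 as lost
$ sudo start ceph-osd id=12
$ ceph -s                                       # pgs should return to active+clean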
Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state
On Wed, 24 Sep 2014, Sahana Lokeshappa wrote:

2.a9  518  0  0  0  0  2172649472  3001  3001  active+clean  2014-09-22 17:49:35.357586  6826'35762  17842:72706  [12,7,28]  12  [12,7,28]  12  6826'35762  2014-09-22 11:33:55.985449  0'0  2014-09-16 20:11:32.693864

Can you verify that 2.a9 exists in the data directory for 12, 7, and/or 28? If so, the next step would be to enable logging (debug osd = 20, debug ms = 1) and see why peering is stuck...

sage
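A minimal way to raise the logging Sage mentions is either persistently in ceph.conf on the node hosting osd.12 (restart the OSD afterwards) or injected at runtime; the section name and values below follow his suggestion:

[osd.12]
    debug osd = 20
    debug ms = 1

$ ceph tell osd.12 injectargs '--debug-osd 20 --debug-ms 1'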
Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state
Is osd.12 doing anything strange? Is it consuming lots of CPU or IO? Is it flapping? Writing any interesting logs? Have you tried restarting it?

If that doesn't help, try the other involved osds: 56, 27, 6, 25, 23. I doubt that it will help, but it won't hurt.
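The checks Craig suggests can be done roughly like this on the node hosting osd.12. A sketch only; the log path and the upstart-style restart are assumptions, so adjust for your distribution:

$ top -c                                                   # is the ceph-osd -i 12 process burning CPU?
$ iostat -x 5                                              # is its disk saturated?
$ grep -iE 'wrongly marked me down|boot' /var/log/ceph/ceph-osd.12.log    # signs of flapping
$ sudo restart ceph-osd id=12                              # restart just osd.12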
Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state
Hi All,

Can anyone help me out here?

Sahana Lokeshappa
Test Development Engineer I

From: Varada Kari
Sent: Monday, September 22, 2014 11:52 PM
To: Sage Weil; Sahana Lokeshappa; ceph-us...@ceph.com; ceph-commun...@lists.ceph.com
Subject: RE: [Ceph-community] Pgs are in stale+down+peering state
Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state
Stale means that the primary OSD for the PG went down and the status is stale. They all seem to be from OSD.12... Seems like something is preventing that OSD from reporting to the mon?

sage

On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa sahana.lokesha...@sandisk.com wrote:

Hi all,

I used the 'ceph osd thrash' command, and after all osds are up and in, 3 pgs are in the stale+down+peering state.

sudo ceph -s
    cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
     health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean
     monmap e1: 3 mons at {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0}, election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
     osdmap e17031: 64 osds: 64 up, 64 in
      pgmap v76728: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
            12501 GB used, 10975 GB / 23476 GB avail
                2145 active+clean
                   3 stale+down+peering

sudo ceph health detail
HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean
pg 0.4d is stuck inactive for 341048.948643, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck inactive for 341048.948667, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck inactive for 341048.949362, current state stale+down+peering, last acting [12,25,23]
pg 0.4d is stuck unclean for 341048.948665, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck unclean for 341048.948687, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck unclean for 341048.949382, current state stale+down+peering, last acting [12,25,23]
pg 0.4d is stuck stale for 339823.956929, current state stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck stale for 339823.956930, current state stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck stale for 339823.956925, current state stale+down+peering, last acting [12,25,23]

Please, can anyone explain why the pgs are in this state?

Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com
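A few commands that can help confirm what Sage describes, i.e. which PGs are stale and whether osd.12 is current with the monitors. A sketch; 'ceph daemon' must be run on the node hosting osd.12:

$ ceph pg dump_stuck stale            # list the stale PGs and their last acting sets
$ ceph pg map 0.4d                    # where the PG maps now
$ ceph daemon osd.12 status           # osd.12's own view: state and newest map epoch
$ ceph osd tree                       # confirm osd.12 is shown as up and in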
Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state
Hi Sage,

To give more context on this problem: this cluster has two pools, rbd and a user-created one. Osd.12 is the primary for some other PGs as well, but the problem happens only for these three PGs.

$ sudo ceph osd lspools
0 rbd,2 pool1,

$ sudo ceph -s
    cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
     health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean; 1 requests are blocked 32 sec
     monmap e1: 3 mons at {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0}, election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
     osdmap e17842: 64 osds: 64 up, 64 in
      pgmap v79729: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
            12504 GB used, 10971 GB / 23476 GB avail
                2145 active+clean
                   3 stale+down+peering

Snippet from pg dump:

2.a9   518  0  0  0  0  2172649472  3001  3001  active+clean        2014-09-22 17:49:35.357586  6826'35762  17842:72706  [12,7,28]   12  [12,7,28]   12  6826'35762  2014-09-22 11:33:55.985449  0'0     2014-09-16 20:11:32.693864
0.59   0    0  0  0  0  0           0     0     active+clean        2014-09-22 17:50:00.751218  0'0         17842:4472   [12,41,2]   12  [12,41,2]   12  0'0         2014-09-22 16:47:09.315499  0'0     2014-09-16 12:20:48.618726
0.4d   0    0  0  0  0  0           4     4     stale+down+peering  2014-09-18 17:51:10.038247  186'4       11134:498    [12,56,27]  12  [12,56,27]  12  186'4       2014-09-18 17:30:32.393188  0'0     2014-09-16 12:20:48.615322
0.49   0    0  0  0  0  0           0     0     stale+down+peering  2014-09-18 17:44:52.681513  0'0         11134:498    [12,6,25]   12  [12,6,25]   12  0'0         2014-09-18 17:16:12.986658  0'0     2014-09-16 12:20:48.614192
0.1c   0    0  0  0  0  0           12    12    stale+down+peering  2014-09-18 17:51:16.735549  186'12      11134:522    [12,25,23]  12  [12,25,23]  12  186'12      2014-09-18 17:16:04.457863  186'10  2014-09-16 14:23:58.731465
2.17   510  0  0  0  0  2139095040  3001  3001  active+clean        2014-09-22 17:52:20.364754  6784'30742  17842:72033  [12,27,23]  12  [12,27,23]  12  6784'30742  2014-09-22 00:19:39.905291  0'0     2014-09-16 20:11:17.016299
2.7e8  508  0  0  0  0  2130706432  3433  3433  active+clean        2014-09-22 17:52:20.365083  6702'21132  17842:64769  [12,25,23]  12  [12,25,23]  12  6702'21132  2014-09-22 17:01:20.546126  0'0     2014-09-16 14:42:32.079187
2.6a5  528  0  0  0  0  2214592512  2840  2840  active+clean        2014-09-22 22:50:38.092084  6775'34416  17842:83221  [12,58,0]   12  [12,58,0]   12  6775'34416  2014-09-22 22:50:38.091989  0'0     2014-09-16 20:11:32.703368

And we couldn't observe any peering events happening on the primary osd.

$ sudo ceph pg 0.49 query
Error ENOENT: i don't have pgid 0.49
$ sudo ceph pg 0.4d query
Error ENOENT: i don't have pgid 0.4d
$ sudo ceph pg 0.1c query
Error ENOENT: i don't have pgid 0.1c

Not able to explain why the peering was stuck. BTW, the rbd pool doesn't contain any data.

Varada

From: Ceph-community [mailto:ceph-community-boun...@lists.ceph.com] On Behalf Of Sage Weil
Sent: Monday, September 22, 2014 10:44 PM
To: Sahana Lokeshappa; ceph-users@lists.ceph.com; ceph-us...@ceph.com; ceph-commun...@lists.ceph.com
Subject: Re: [Ceph-community] Pgs are in stale+down+peering state

Stale means that the primary OSD for the PG went down and the status is stale. They all seem to be from OSD.12... Seems like something is preventing that OSD from reporting to the mon?
sage