Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-25 Thread Sahana Lokeshappa
Replies inline:

Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, September 24, 2014 6:10 PM
To: Sahana Lokeshappa
Cc: Varada Kari; ceph-us...@ceph.com
Subject: RE: [Ceph-community] Pgs are in stale+down+peering state

On Wed, 24 Sep 2014, Sahana Lokeshappa wrote:
 2.a9  518  0  0  0  0  2172649472  3001  3001  active+clean  2014-09-22 17:49:35.357586  6826'35762  17842:72706  [12,7,28]  12  [12,7,28]  12  6826'35762  2014-09-22 11:33:55.985449  0'0  2014-09-16 20:11:32.693864

Can you verify that 2.a9 exists in the data directory for 12, 7, and/or 28?  If 
so, the next step would be to enable logging (debug osd = 20, debug ms = 1) and 
see why peering is stuck...

Yes, the 2.a9 directory is present on osd.12, osd.7 and osd.28,

and the 0.49, 0.4d and 0.1c directories are not present on their respective acting osds.
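
For reference, the checks and debug settings above can be applied roughly as follows (a sketch assuming the default FileStore data path /var/lib/ceph/osd/ceph-<id>; adjust ids and paths for your deployment):

# On each replica's host, check that the PG directory exists
ls -d /var/lib/ceph/osd/ceph-12/current/2.a9_head

# Raise debug levels on the running OSD without a restart
ceph tell osd.12 injectargs '--debug-osd 20 --debug-ms 1'

# Or, on the OSD host, via the admin socket
ceph daemon osd.12 config set debug_osd 20
ceph daemon osd.12 config set debug_ms 1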


Here are the logs I can see after the debug levels were raised to 20:


2014-09-24 18:38:41.706566 7f92e2dc8700  7 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] replica_scrub
2014-09-24 18:38:41.706586 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] build_scrub_map
2014-09-24 18:38:41.706592 7f92e2dc8700 20 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] scrub_map_chunk [476de738//0//-1,f38//0//-1)
2014-09-24 18:38:41.711778 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] _scan_list scanning 23 objects deeply
2014-09-24 18:38:41.730881 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x89cda20 already has epoch 17850
2014-09-24 18:38:41.73 7f92eede0700 20 osd.12 17850 share_map_peer 
0x89cda20 already has epoch 17850
2014-09-24 18:38:41.822444 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0xd2eb080 already has epoch 17850
2014-09-24 18:38:41.822519 7f92eede0700 20 osd.12 17850 share_map_peer 
0xd2eb080 already has epoch 17850
2014-09-24 18:38:41.878894 7f92eede0700 20 osd.12 17850 share_map_peer 
0xd5cd5a0 already has epoch 17850
2014-09-24 18:38:41.878921 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0xd5cd5a0 already has epoch 17850
2014-09-24 18:38:41.918307 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x1161bde0 already has epoch 17850
2014-09-24 18:38:41.918426 7f92eede0700 20 osd.12 17850 share_map_peer 
0x1161bde0 already has epoch 17850
2014-09-24 18:38:41.951678 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x7fc5700 already has epoch 17850
2014-09-24 18:38:41.951709 7f92eede0700 20 osd.12 17850 share_map_peer 
0x7fc5700 already has epoch 17850
2014-09-24 18:38:42.064759 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] build_scrub_map_chunk done.
2014-09-24 18:38:42.107016 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x10377b80 already has epoch 17850
2014-09-24 18:38:42.107032 7f92eede0700 20 osd.12 17850 share_map_peer 
0x10377b80 already has epoch 17850
2014-09-24 18:38:42.109356 7f92f15e5700 10 osd.12 17850 do_waiters -- start
2014-09-24 18:38:42.109372 7f92f15e5700 10 osd.12 17850 do_waiters -- finish
2014-09-24 18:38:42.109373 7f92f15e5700 20 osd.12 17850 _dispatch 0xeb0d900 
replica scrub(pg: 
2.738,from:0'0,to:6489'28646,epoch:17850,start:f38//0//-1,end:92371f38//0//-1,chunky:1,deep:1,version:5)
 v5
2014-09-24 18:38:42.109378 7f92f15e5700 10 osd.12 17850 queueing MOSDRepScrub 
replica scrub(pg: 
2.738,from:0'0,to:6489'28646,epoch:17850,start:f38//0//-1,end:92371f38//0//-1,chunky:1,deep:1,version:5)
 v5
2014-09-24 18:38:42.109395 7f92f15e5700 10 osd.12 17850 do_waiters -- start
2014-09-24 18:38:42.109396 7f92f15e5700 10 osd.12 17850 do_waiters -- finish
2014-09-24 18:38:42.109456 7f92e2dc8700  7 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] replica_scrub
2014-09-24 18:38:42.109522 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 

Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-25 Thread Sahana Lokeshappa
Hi Craig,

Sorry for the late response; I somehow missed this mail.
All osds are up and running. There were no specific logs related to this 
activity, and there are no IOs running right now. A few osds were marked in and 
out, fully removed, and recreated before these pgs reached this state.
I had tried restarting the osds; it didn’t help.

Thanks
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

From: Craig Lewis [mailto:cle...@centraldesktop.com]
Sent: Wednesday, September 24, 2014 5:44 AM
To: Sahana Lokeshappa
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

Is osd.12  doing anything strange?  Is it consuming lots of CPU or IO?  Is it 
flapping?   Writing any interesting logs?  Have you tried restarting it?

If that doesn't help, try the other involved osds: 56, 27, 6, 25, 23.  I doubt 
that it will help, but it won't hurt.
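
(Illustrative only — a few ways to make those checks, assuming the default log location /var/log/ceph and a sysvinit-style deployment; adjust for your setup:)

# CPU / IO load on the OSD host
top
iostat -x 2

# Commit/apply latency per OSD as seen by the cluster
ceph osd perf

# Recent log output from osd.12
tail -n 200 /var/log/ceph/ceph-osd.12.log

# Restart just that OSD (init syntax varies by distro)
sudo /etc/init.d/ceph restart osd.12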



On Mon, Sep 22, 2014 at 11:21 AM, Varada Kari 
varada.k...@sandisk.com wrote:
Hi Sage,

To give more context on this problem,

This cluster has two pools rbd and user-created.

Osd.12 is the primary for some other PGs as well, but the problem happens only for these 
three PGs.

$ sudo ceph osd lspools
0 rbd,2 pool1,

$ sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean; 1 requests are blocked > 32 sec
monmap e1: 3 mons at {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0}, election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
 osdmap e17842: 64 osds: 64 up, 64 in
  pgmap v79729: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
12504 GB used, 10971 GB / 23476 GB avail
2145 active+clean
   3 stale+down+peering

Snippet from pg dump:

2.a9   518  0  0  0  0  2172649472  3001  3001  active+clean  2014-09-22 17:49:35.357586  6826'35762  17842:72706  [12,7,28]   12  [12,7,28]   12  6826'35762  2014-09-22 11:33:55.985449  0'0     2014-09-16 20:11:32.693864
0.59   0    0  0  0  0  0           0     0     active+clean  2014-09-22 17:50:00.751218  0'0         17842:4472   [12,41,2]   12  [12,41,2]   12  0'0         2014-09-22 16:47:09.315499  0'0     2014-09-16 12:20:48.618726
0.4d   0    0  0  0  0  0           4     4     stale+down+peering  2014-09-18 17:51:10.038247  186'4   11134:498   [12,56,27]  12  [12,56,27]  12  186'4       2014-09-18 17:30:32.393188  0'0     2014-09-16 12:20:48.615322
0.49   0    0  0  0  0  0           0     0     stale+down+peering  2014-09-18 17:44:52.681513  0'0     11134:498   [12,6,25]   12  [12,6,25]   12  0'0         2014-09-18 17:16:12.986658  0'0     2014-09-16 12:20:48.614192
0.1c   0    0  0  0  0  0           12    12    stale+down+peering  2014-09-18 17:51:16.735549  186'12  11134:522   [12,25,23]  12  [12,25,23]  12  186'12      2014-09-18 17:16:04.457863  186'10  2014-09-16 14:23:58.731465
2.17   510  0  0  0  0  2139095040  3001  3001  active+clean  2014-09-22 17:52:20.364754  6784'30742  17842:72033  [12,27,23]  12  [12,27,23]  12  6784'30742  2014-09-22 00:19:39.905291  0'0     2014-09-16 20:11:17.016299
2.7e8  508  0  0  0  0  2130706432  3433  3433  active+clean  2014-09-22 17:52:20.365083  6702'21132  17842:64769  [12,25,23]  12  [12,25,23]  12  6702'21132  2014-09-22 17:01:20.546126  0'0     2014-09-16 14:42:32.079187
2.6a5  528  0  0  0  0  2214592512  2840  2840  active+clean  2014-09-22 22:50:38.092084  6775'34416  17842:83221  [12,58,0]   12  [12,58,0]   12  6775'34416  2014-09-22 22:50:38.091989  0'0     2014-09-16 20:11:32.703368

And we couldn’t observe any peering events happening on the primary osd.

$ sudo ceph pg 0.49 query
Error ENOENT: i don't have pgid 0.49
$ sudo ceph pg 0.4d query
Error ENOENT: i don't have pgid 0.4d
$ sudo ceph pg 0.1c query
Error ENOENT: i don't have pgid 0.1c

Not able to explain why the peering was stuck. BTW, Rbd pool doesn’t contain 
any data.
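
(For reference, a small sketch: even when 'ceph pg <pgid> query' returns ENOENT on the OSD, the monitors can still be asked where those PGs are supposed to map:)

$ sudo ceph pg map 0.49
$ sudo ceph pg map 0.4d
$ sudo ceph pg map 0.1c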

Varada

From: Ceph-community [mailto:ceph-community-boun...@lists.ceph.com] On Behalf Of Sage Weil
Sent: Monday, September 22, 2014 10:44 PM
To: Sahana Lokeshappa; 
ceph-users

Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-25 Thread Sahana Lokeshappa
Hi All,

Here are the steps I followed to get all pgs back to the active+clean state 
(roughly the commands sketched below). I still don't know the root cause of this pg state.

1. Force create pgs which are in stale+down+peering
2. Stop osd.12
3. Mark osd.12 as lost
4. Start osd.12
5. All pgs were back to active+clean state
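
In command form, those steps look roughly like this (a sketch only; force_create_pg discards the PG's previous history, and the init syntax depends on the distro):

# 1. Force-create the stuck pgs
sudo ceph pg force_create_pg 0.4d
sudo ceph pg force_create_pg 0.49
sudo ceph pg force_create_pg 0.1c

# 2. Stop osd.12 (sysvinit syntax shown)
sudo /etc/init.d/ceph stop osd.12

# 3. Mark it lost
sudo ceph osd lost 12 --yes-i-really-mean-it

# 4. Start it again
sudo /etc/init.d/ceph start osd.12

# 5. Verify
sudo ceph -s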

Thanks
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283 
sahana.lokesha...@sandisk.com


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sahana 
Lokeshappa
Sent: Thursday, September 25, 2014 1:26 PM
To: Sage Weil
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

Replies inline:

Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, September 24, 2014 6:10 PM
To: Sahana Lokeshappa
Cc: Varada Kari; ceph-us...@ceph.com
Subject: RE: [Ceph-community] Pgs are in stale+down+peering state

On Wed, 24 Sep 2014, Sahana Lokeshappa wrote:
 2.a9  518  0  0  0  0  2172649472  3001  3001  active+clean  2014-09-22 17:49:35.357586  6826'35762  17842:72706  [12,7,28]  12  [12,7,28]  12  6826'35762  2014-09-22 11:33:55.985449  0'0  2014-09-16 20:11:32.693864

Can you verify that 2.a9 exists in the data directory for 12, 7, and/or 28?  If 
so, the next step would be to enable logging (debug osd = 20, debug ms = 1) and 
see why peering is stuck...

Yes, the 2.a9 directory is present on osd.12, osd.7 and osd.28,

and the 0.49, 0.4d and 0.1c directories are not present on their respective acting osds.


Here are the logs I can see after the debug levels were raised to 20:


2014-09-24 18:38:41.706566 7f92e2dc8700  7 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] replica_scrub
2014-09-24 18:38:41.706586 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] build_scrub_map
2014-09-24 18:38:41.706592 7f92e2dc8700 20 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] scrub_map_chunk [476de738//0//-1,f38//0//-1)
2014-09-24 18:38:41.711778 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] _scan_list scanning 23 objects deeply
2014-09-24 18:38:41.730881 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x89cda20 already has epoch 17850
2014-09-24 18:38:41.73 7f92eede0700 20 osd.12 17850 share_map_peer 
0x89cda20 already has epoch 17850
2014-09-24 18:38:41.822444 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0xd2eb080 already has epoch 17850
2014-09-24 18:38:41.822519 7f92eede0700 20 osd.12 17850 share_map_peer 
0xd2eb080 already has epoch 17850
2014-09-24 18:38:41.878894 7f92eede0700 20 osd.12 17850 share_map_peer 
0xd5cd5a0 already has epoch 17850
2014-09-24 18:38:41.878921 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0xd5cd5a0 already has epoch 17850
2014-09-24 18:38:41.918307 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x1161bde0 already has epoch 17850
2014-09-24 18:38:41.918426 7f92eede0700 20 osd.12 17850 share_map_peer 
0x1161bde0 already has epoch 17850
2014-09-24 18:38:41.951678 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x7fc5700 already has epoch 17850
2014-09-24 18:38:41.951709 7f92eede0700 20 osd.12 17850 share_map_peer 
0x7fc5700 already has epoch 17850
2014-09-24 18:38:42.064759 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] build_scrub_map_chunk done.
2014-09-24 18:38:42.107016 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x10377b80 already has epoch 17850
2014-09-24 18:38:42.107032 7f92eede0700 20 osd.12 17850 share_map_peer 
0x10377b80 already has epoch 17850
2014-09-24 18:38:42.109356 7f92f15e5700 10 osd.12 17850 do_waiters -- start
2014-09-24 18:38:42.109372 7f92f15e5700 10 osd.12 17850 do_waiters -- finish
2014-09-24 18:38:42.109373 7f92f15e5700 20 osd.12 17850 _dispatch 0xeb0d900 
replica scrub(pg: 
2.738,from:0'0,to:6489'28646,epoch:17850,start:f38//0//-1,end:92371f38//0//-1,chunky:1,deep:1,version:5)
 v5
2014

Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-24 Thread Sage Weil
On Wed, 24 Sep 2014, Sahana Lokeshappa wrote:
 2.a9  518  0  0  0  0  2172649472  3001  3001  active+clean  2014-09-22 17:49:35.357586  6826'35762  17842:72706  [12,7,28]  12  [12,7,28]  12  6826'35762  2014-09-22 11:33:55.985449  0'0  2014-09-16 20:11:32.693864

Can you verify that 2.a9 exists in the data directory for 12, 7, and/or 
28?  If so, the next step would be to enable logging (debug osd = 20, debug 
ms = 1) and see why peering is stuck...

sage

 
 0.59   0    0  0  0  0  0           0     0     active+clean  2014-09-22 17:50:00.751218  0'0         17842:4472   [12,41,2]   12  [12,41,2]   12  0'0         2014-09-22 16:47:09.315499  0'0     2014-09-16 12:20:48.618726
 0.4d   0    0  0  0  0  0           4     4     stale+down+peering  2014-09-18 17:51:10.038247  186'4   11134:498   [12,56,27]  12  [12,56,27]  12  186'4       2014-09-18 17:30:32.393188  0'0     2014-09-16 12:20:48.615322
 0.49   0    0  0  0  0  0           0     0     stale+down+peering  2014-09-18 17:44:52.681513  0'0     11134:498   [12,6,25]   12  [12,6,25]   12  0'0         2014-09-18 17:16:12.986658  0'0     2014-09-16 12:20:48.614192
 0.1c   0    0  0  0  0  0           12    12    stale+down+peering  2014-09-18 17:51:16.735549  186'12  11134:522   [12,25,23]  12  [12,25,23]  12  186'12      2014-09-18 17:16:04.457863  186'10  2014-09-16 14:23:58.731465
 2.17   510  0  0  0  0  2139095040  3001  3001  active+clean  2014-09-22 17:52:20.364754  6784'30742  17842:72033  [12,27,23]  12  [12,27,23]  12  6784'30742  2014-09-22 00:19:39.905291  0'0     2014-09-16 20:11:17.016299
 2.7e8  508  0  0  0  0  2130706432  3433  3433  active+clean  2014-09-22 17:52:20.365083  6702'21132  17842:64769  [12,25,23]  12  [12,25,23]  12  6702'21132  2014-09-22 17:01:20.546126  0'0     2014-09-16 14:42:32.079187
 2.6a5  528  0  0  0  0  2214592512  2840  2840  active+clean  2014-09-22 22:50:38.092084  6775'34416  17842:83221  [12,58,0]   12  [12,58,0]   12  6775'34416  2014-09-22 22:50:38.091989  0'0     2014-09-16 20:11:32.703368
 
  
 
 And we couldn't observe any peering events happening on the primary osd.
 
  
 
 $ sudo ceph pg 0.49 query
 
 Error ENOENT: i don't have pgid 0.49
 
 $ sudo ceph pg 0.4d query
 
 Error ENOENT: i don't have pgid 0.4d
 
 $ sudo ceph pg 0.1c query
 
 Error ENOENT: i don't have pgid 0.1c
 
  
 
 Not able to explain why the peering was stuck. BTW, Rbd pool doesn't contain
 any data.
 
  
 
 Varada
 
  
 
 From: Ceph-community [mailto:ceph-community-boun...@lists.ceph.com] On
 Behalf Of Sage Weil
 Sent: Monday, September 22, 2014 10:44 PM
 To: Sahana Lokeshappa; ceph-users@lists.ceph.com; ceph-us...@ceph.com;
 ceph-commun...@lists.ceph.com
 Subject: Re: [Ceph-community] Pgs are in stale+down+peering state
 
  
 
 Stale means that the primary OSD for the PG went down and the status is
 stale.  They all seem to be from OSD.12... Seems like something is
 preventing that OSD from reporting to the mon?
 
 sage
 
  
 
 On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa
 sahana.lokesha...@sandisk.com wrote:
 
   Hi all,
 
    
 
    I used the 'ceph osd thrash' command and after all osds are up
    and in, 3 pgs are in the stale+down+peering state
 
    
 
   sudo ceph -s
 
   cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 
    health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale;
   3 pgs stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean
 
     monmap e1: 3 mons at
    {rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
    election epoch 2008, quorum 0,1,2
   rack2-ram-1,rack2-ram-2,rack2-ram-3
 
    osdmap e17031: 64 osds: 64 up, 64 in
 
     pgmap v76728: 2148 pgs, 2 pools, 4135 GB data, 1033
   kobjects
 
       12501 GB used, 10975 GB / 23476 GB avail
 
       2145 active+clean
 
      3 stale+down+peering
 
    
 
   sudo ceph health detail
 
   HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck
   inactive; 3 pgs stuck stale; 3 pgs stuck unclean
 
   pg 0.4d is stuck inactive for 341048.948643, current state
   stale+down+peering, last acting [12,56,27]
 
   pg 0.49 is stuck inactive for 341048.948667, current state
   stale+down+peering, last acting [12,6,25]
 
   pg 0.1c is stuck inactive for 341048.949362, current state
   stale+down+peering, last acting [12,25,23]
 
   pg 0.4d is stuck 

Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-23 Thread Craig Lewis
Is osd.12  doing anything strange?  Is it consuming lots of CPU or IO?  Is
it flapping?   Writing any interesting logs?  Have you tried restarting it?

If that doesn't help, try the other involved osds: 56, 27, 6, 25, 23.  I
doubt that it will help, but it won't hurt.



On Mon, Sep 22, 2014 at 11:21 AM, Varada Kari varada.k...@sandisk.com
wrote:

  Hi Sage,



 To give more context on this problem,



 This cluster has two pools rbd and user-created.



 Osd.12 is the primary for some other PGs as well, but the problem happens only for
 these three PGs.



 $ sudo ceph osd lspools

 0 rbd,2 pool1,



 $ sudo ceph -s

 cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758

  health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs
 stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean; 1 requests are
 blocked > 32 sec

 monmap e1: 3 mons at {rack2-ram-1=
 10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
 election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3

  osdmap e17842: 64 osds: 64 up, 64 in

   pgmap v79729: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects

 12504 GB used, 10971 GB / 23476 GB avail

 2145 active+clean

3 stale+down+peering



 Snippet from pg dump:



 2.a9   518  0  0  0  0  2172649472  3001  3001  active+clean  2014-09-22 17:49:35.357586  6826'35762  17842:72706  [12,7,28]   12  [12,7,28]   12  6826'35762  2014-09-22 11:33:55.985449  0'0     2014-09-16 20:11:32.693864
 0.59   0    0  0  0  0  0           0     0     active+clean  2014-09-22 17:50:00.751218  0'0         17842:4472   [12,41,2]   12  [12,41,2]   12  0'0         2014-09-22 16:47:09.315499  0'0     2014-09-16 12:20:48.618726
 0.4d   0    0  0  0  0  0           4     4     stale+down+peering  2014-09-18 17:51:10.038247  186'4   11134:498   [12,56,27]  12  [12,56,27]  12  186'4       2014-09-18 17:30:32.393188  0'0     2014-09-16 12:20:48.615322
 0.49   0    0  0  0  0  0           0     0     stale+down+peering  2014-09-18 17:44:52.681513  0'0     11134:498   [12,6,25]   12  [12,6,25]   12  0'0         2014-09-18 17:16:12.986658  0'0     2014-09-16 12:20:48.614192
 0.1c   0    0  0  0  0  0           12    12    stale+down+peering  2014-09-18 17:51:16.735549  186'12  11134:522   [12,25,23]  12  [12,25,23]  12  186'12      2014-09-18 17:16:04.457863  186'10  2014-09-16 14:23:58.731465
 2.17   510  0  0  0  0  2139095040  3001  3001  active+clean  2014-09-22 17:52:20.364754  6784'30742  17842:72033  [12,27,23]  12  [12,27,23]  12  6784'30742  2014-09-22 00:19:39.905291  0'0     2014-09-16 20:11:17.016299
 2.7e8  508  0  0  0  0  2130706432  3433  3433  active+clean  2014-09-22 17:52:20.365083  6702'21132  17842:64769  [12,25,23]  12  [12,25,23]  12  6702'21132  2014-09-22 17:01:20.546126  0'0     2014-09-16 14:42:32.079187
 2.6a5  528  0  0  0  0  2214592512  2840  2840  active+clean  2014-09-22 22:50:38.092084  6775'34416  17842:83221  [12,58,0]   12  [12,58,0]   12  6775'34416  2014-09-22 22:50:38.091989  0'0     2014-09-16 20:11:32.703368



 And we couldn’t observe any peering events happening on the primary osd.



 $ sudo ceph pg 0.49 query

 Error ENOENT: i don't have pgid 0.49

 $ sudo ceph pg 0.4d query

 Error ENOENT: i don't have pgid 0.4d

 $ sudo ceph pg 0.1c query

 Error ENOENT: i don't have pgid 0.1c



 Not able to explain why the peering was stuck. BTW, Rbd pool doesn’t
 contain any data.



 Varada



 From: Ceph-community [mailto:ceph-community-boun...@lists.ceph.com] On
 Behalf Of Sage Weil
 Sent: Monday, September 22, 2014 10:44 PM
 To: Sahana Lokeshappa; ceph-users@lists.ceph.com; ceph-us...@ceph.com;
 ceph-commun...@lists.ceph.com
 Subject: Re: [Ceph-community] Pgs are in stale+down+peering state



 Stale means that the primary OSD for the PG went down and the status is
 stale.  They all seem to be from OSD.12... Seems like something is
 preventing that OSD from reporting to the mon?

 sage



 On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa 
 sahana.lokesha...@sandisk.com wrote:

 Hi all,



 I used the ‘ceph osd thrash’ command and after all osds are up and
 in, 3 pgs are in the stale+down+peering state



 sudo ceph -s

 cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758

  health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs
 stuck inactive; 3 pgs stuck stale; 3 pgs stuck unclean

  monmap e1: 3 mons at {rack2-ram-1=
 10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
 election epoch 2008, quorum 0,1,2 

Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-23 Thread Sahana Lokeshappa
Hi All,

Can anyone help me out here?

Sahana Lokeshappa
Test Development Engineer I


From: Varada Kari
Sent: Monday, September 22, 2014 11:52 PM
To: Sage Weil; Sahana Lokeshappa; ceph-us...@ceph.com; 
ceph-commun...@lists.ceph.com
Subject: RE: [Ceph-community] Pgs are in stale+down+peering state

Hi Sage,

To give more context on this problem,

This cluster has two pools rbd and user-created.

Osd.12 is the primary for some other PGs as well, but the problem happens only for these 
three PGs.

$ sudo ceph osd lspools
0 rbd,2 pool1,

$ sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck 
inactive; 3 pgs stuck stale; 3 pgs stuck unclean; 1 requests are blocked > 32 
sec
monmap e1: 3 mons at 
{rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
 election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
 osdmap e17842: 64 osds: 64 up, 64 in
  pgmap v79729: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
12504 GB used, 10971 GB / 23476 GB avail
2145 active+clean
   3 stale+down+peering

Snippet from pg dump:

2.a9   518  0  0  0  0  2172649472  3001  3001  active+clean  2014-09-22 17:49:35.357586  6826'35762  17842:72706  [12,7,28]   12  [12,7,28]   12  6826'35762  2014-09-22 11:33:55.985449  0'0     2014-09-16 20:11:32.693864
0.59   0    0  0  0  0  0           0     0     active+clean  2014-09-22 17:50:00.751218  0'0         17842:4472   [12,41,2]   12  [12,41,2]   12  0'0         2014-09-22 16:47:09.315499  0'0     2014-09-16 12:20:48.618726
0.4d   0    0  0  0  0  0           4     4     stale+down+peering  2014-09-18 17:51:10.038247  186'4   11134:498   [12,56,27]  12  [12,56,27]  12  186'4       2014-09-18 17:30:32.393188  0'0     2014-09-16 12:20:48.615322
0.49   0    0  0  0  0  0           0     0     stale+down+peering  2014-09-18 17:44:52.681513  0'0     11134:498   [12,6,25]   12  [12,6,25]   12  0'0         2014-09-18 17:16:12.986658  0'0     2014-09-16 12:20:48.614192
0.1c   0    0  0  0  0  0           12    12    stale+down+peering  2014-09-18 17:51:16.735549  186'12  11134:522   [12,25,23]  12  [12,25,23]  12  186'12      2014-09-18 17:16:04.457863  186'10  2014-09-16 14:23:58.731465
2.17   510  0  0  0  0  2139095040  3001  3001  active+clean  2014-09-22 17:52:20.364754  6784'30742  17842:72033  [12,27,23]  12  [12,27,23]  12  6784'30742  2014-09-22 00:19:39.905291  0'0     2014-09-16 20:11:17.016299
2.7e8  508  0  0  0  0  2130706432  3433  3433  active+clean  2014-09-22 17:52:20.365083  6702'21132  17842:64769  [12,25,23]  12  [12,25,23]  12  6702'21132  2014-09-22 17:01:20.546126  0'0     2014-09-16 14:42:32.079187
2.6a5  528  0  0  0  0  2214592512  2840  2840  active+clean  2014-09-22 22:50:38.092084  6775'34416  17842:83221  [12,58,0]   12  [12,58,0]   12  6775'34416  2014-09-22 22:50:38.091989  0'0     2014-09-16 20:11:32.703368

And we couldn’t observe any peering events happening on the primary osd.

$ sudo ceph pg 0.49 query
Error ENOENT: i don't have pgid 0.49
$ sudo ceph pg 0.4d query
Error ENOENT: i don't have pgid 0.4d
$ sudo ceph pg 0.1c query
Error ENOENT: i don't have pgid 0.1c

Not able to explain why the peering was stuck. BTW, Rbd pool doesn’t contain 
any data.

Varada

From: Ceph-community [mailto:ceph-community-boun...@lists.ceph.com] On Behalf 
Of Sage Weil
Sent: Monday, September 22, 2014 10:44 PM
To: Sahana Lokeshappa; ceph-users@lists.ceph.com; ceph-us...@ceph.com; 
ceph-commun...@lists.ceph.com
Subject: Re: [Ceph-community] Pgs are in stale+down+peering state


Stale means that the primary OSD for the PG went down and the status is stale.  
They all seem to be from OSD.12... Seems like something is preventing that OSD 
from reporting to the mon?

sage

On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa 
sahana.lokesha...@sandisk.com wrote:
Hi all,


I used the ‘ceph osd thrash’ command and after all osds are up and in, 3 
pgs are in the stale+down+peering state


sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck 
inactive; 3 pgs stuck stale; 3 pgs stuck unclean
 monmap e1: 3 mons at 

Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-22 Thread Sage Weil
Stale means that the primary OSD for the PG went down and the status is stale.  
They all seem to be from OSD.12... Seems like something is preventing that OSD 
from reporting to the mon?

sage
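
(A couple of quick checks along those lines, as a sketch — exact output varies by version:)

# Does the cluster still consider osd.12 up, and where does it live?
ceph osd tree | grep -w osd.12
ceph osd find 12

# On the OSD host, what map epoch does the daemon itself report?
ceph daemon osd.12 status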

On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa 
sahana.lokesha...@sandisk.com wrote:
Hi all,

I used the 'ceph osd thrash' command and after all osds are up
and in, 3 pgs are in the stale+down+peering state
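
(For context — 'ceph osd thrash' takes a number of osdmap epochs and randomly marks OSDs down/out over that many epochs; it is meant for test clusters only. For example:)

ceph osd thrash 50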

sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck
inactive; 3 pgs stuck stale; 3 pgs stuck unclean
monmap e1: 3 mons at
{rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
 osdmap e17031: 64 osds: 64 up, 64 in
  pgmap v76728: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
12501 GB used, 10975 GB / 23476 GB avail
2145 active+clean
   3 stale+down+peering

sudo ceph health detail
HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck
inactive; 3 pgs stuck stale; 3 pgs stuck unclean
pg 0.4d is stuck inactive for 341048.948643, current state
stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck inactive for 341048.948667, current state
stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck inactive for 341048.949362, current state
stale+down+peering, last acting [12,25,23]
pg 0.4d is stuck unclean for 341048.948665, current state
stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck unclean for 341048.948687, current state
stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck unclean for 341048.949382, current state
stale+down+peering, last acting [12,25,23]
pg 0.4d is stuck stale for 339823.956929, current state
stale+down+peering, last acting [12,56,27]
pg 0.49 is stuck stale for 339823.956930, current state
stale+down+peering, last acting [12,6,25]
pg 0.1c is stuck stale for 339823.956925, current state
stale+down+peering, last acting [12,25,23]


Please, can anyone explain why the pgs are in this state?
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com




PLEASE NOTE: The information contained in this electronic mail message
is intended only for the use of the designated recipient(s) named
above. If the reader of this message is not the intended recipient, you
are hereby notified that you have received this message in error and
that any review, dissemination, distribution, or copying of this
message is strictly prohibited. If you have received this communication
in error, please notify the sender by telephone or e-mail (as shown
above) immediately and destroy any and all copies of this message in
your possession (whether hard copies or electronically stored copies).





___
Ceph-community mailing list
ceph-commun...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-community-ceph.com

-- 
Sent from Kaiten Mail. Please excuse my brevity.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-22 Thread Varada Kari
Hi Sage,

To give more context on this problem,

This cluster has two pools rbd and user-created.

Osd.12 is the primary for some other PGs as well, but the problem happens only for these 
three PGs.

$ sudo ceph osd lspools
0 rbd,2 pool1,

$ sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck 
inactive; 3 pgs stuck stale; 3 pgs stuck unclean; 1 requests are blocked > 32 
sec
monmap e1: 3 mons at 
{rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
 election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
 osdmap e17842: 64 osds: 64 up, 64 in
  pgmap v79729: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
12504 GB used, 10971 GB / 23476 GB avail
2145 active+clean
   3 stale+down+peering

Snippet from pg dump:

2.a9   518  0  0  0  0  2172649472  3001  3001  active+clean  2014-09-22 17:49:35.357586  6826'35762  17842:72706  [12,7,28]   12  [12,7,28]   12  6826'35762  2014-09-22 11:33:55.985449  0'0     2014-09-16 20:11:32.693864
0.59   0    0  0  0  0  0           0     0     active+clean  2014-09-22 17:50:00.751218  0'0         17842:4472   [12,41,2]   12  [12,41,2]   12  0'0         2014-09-22 16:47:09.315499  0'0     2014-09-16 12:20:48.618726
0.4d   0    0  0  0  0  0           4     4     stale+down+peering  2014-09-18 17:51:10.038247  186'4   11134:498   [12,56,27]  12  [12,56,27]  12  186'4       2014-09-18 17:30:32.393188  0'0     2014-09-16 12:20:48.615322
0.49   0    0  0  0  0  0           0     0     stale+down+peering  2014-09-18 17:44:52.681513  0'0     11134:498   [12,6,25]   12  [12,6,25]   12  0'0         2014-09-18 17:16:12.986658  0'0     2014-09-16 12:20:48.614192
0.1c   0    0  0  0  0  0           12    12    stale+down+peering  2014-09-18 17:51:16.735549  186'12  11134:522   [12,25,23]  12  [12,25,23]  12  186'12      2014-09-18 17:16:04.457863  186'10  2014-09-16 14:23:58.731465
2.17   510  0  0  0  0  2139095040  3001  3001  active+clean  2014-09-22 17:52:20.364754  6784'30742  17842:72033  [12,27,23]  12  [12,27,23]  12  6784'30742  2014-09-22 00:19:39.905291  0'0     2014-09-16 20:11:17.016299
2.7e8  508  0  0  0  0  2130706432  3433  3433  active+clean  2014-09-22 17:52:20.365083  6702'21132  17842:64769  [12,25,23]  12  [12,25,23]  12  6702'21132  2014-09-22 17:01:20.546126  0'0     2014-09-16 14:42:32.079187
2.6a5  528  0  0  0  0  2214592512  2840  2840  active+clean  2014-09-22 22:50:38.092084  6775'34416  17842:83221  [12,58,0]   12  [12,58,0]   12  6775'34416  2014-09-22 22:50:38.091989  0'0     2014-09-16 20:11:32.703368

And we couldn’t observe any peering events happening on the primary osd.

$ sudo ceph pg 0.49 query
Error ENOENT: i don't have pgid 0.49
$ sudo ceph pg 0.4d query
Error ENOENT: i don't have pgid 0.4d
$ sudo ceph pg 0.1c query
Error ENOENT: i don't have pgid 0.1c

Not able to explain why the peering was stuck. BTW, Rbd pool doesn’t contain 
any data.

Varada

From: Ceph-community [mailto:ceph-community-boun...@lists.ceph.com] On Behalf 
Of Sage Weil
Sent: Monday, September 22, 2014 10:44 PM
To: Sahana Lokeshappa; ceph-users@lists.ceph.com; ceph-us...@ceph.com; 
ceph-commun...@lists.ceph.com
Subject: Re: [Ceph-community] Pgs are in stale+down+peering state


Stale means that the primary OSD for the PG went down and the status is stale.  
They all seem to be from OSD.12... Seems like something is preventing that OSD 
from reporting to the mon?

sage

On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa 
sahana.lokesha...@sandisk.com wrote:
Hi all,


I used the ‘ceph osd thrash’ command and after all osds are up and in, 3 
pgs are in the stale+down+peering state


sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck 
inactive; 3 pgs stuck stale; 3 pgs stuck unclean
 monmap e1: 3 mons at 
{rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
 election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
 osdmap e17031: 64 osds: 64 up, 64 in
  pgmap v76728: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
12501 GB used, 10975 GB / 23476 GB avail
2145 active+clean
   3 stale+down+peering


sudo ceph health detail
HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs