Re: [ceph-users] Monitor Restart triggers half of our OSDs marked down

2015-02-05 Thread Sahana Lokeshappa
Hi,

I have seen the same log messages on my OSDs, but the steps I followed were different:

Setup: 4 OSD nodes (node1, node2, node3, node4), each containing 8 OSDs.

Node1 was rebooted, but an OSD on node2 was marked down.

Logs from monitor:

2015-02-02 20:28:28.766087 7fbaabea4700  1 mon.rack1-ram-6@0(leader).osd e248 
prepare_failure osd.2 10.242.42.114:6871/25606 from osd.13 
10.242.42.110:6835/25881 is reporting failure:1
2015-02-02 20:28:28.766099 7fbaabea4700  0 log_channel(cluster) log [DBG] : 
osd.2 10.242.42.114:6871/25606 reported failed by osd.13 
10.242.42.110:6835/25881
2015-02-02 20:28:32.977235 7fbaabea4700  1 mon.rack1-ram-6@0(leader).osd e248 
prepare_failure osd.2 10.242.42.114:6871/25606 from osd.23 
10.242.42.112:6810/23707 is reporting failure:1
2015-02-02 20:28:32.977244 7fbaabea4700  0 log_channel(cluster) log [DBG] : 
osd.2 10.242.42.114:6871/25606 reported failed by osd.23 
10.242.42.112:6810/23707
2015-02-02 20:28:33.012054 7fbaabea4700  1 mon.rack1-ram-6@0(leader).osd e248 
prepare_failure osd.2 10.242.42.114:6871/25606 from osd.22 
10.242.42.112:6822/23764 is reporting failure:1
2015-02-02 20:28:33.012075 7fbaabea4700  0 log_channel(cluster) log [DBG] : 
osd.2 10.242.42.114:6871/25606 reported failed by osd.22 
10.242.42.112:6822/23764
2015-02-02 20:28:33.012087 7fbaabea4700  1 mon.rack1-ram-6@0(leader).osd e248  
we have enough reports/reporters to mark osd.2 down
2015-02-02 20:28:33.012098 7fbaabea4700  0 log_channel(cluster) log [INF] : 
osd.2 10.242.42.114:6871/25606 failed (3 reports from 3 peers after 21.000102 
>= grace 20.00)
2015-02-02 20:28:33.071370 7fbaae0ea700  1 mon.rack1-ram-6@0(leader).osd e249 
e249: 32 osds: 23 up, 32 in
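For context, whether an OSD gets marked down like this is governed by how many distinct peers must report it and by the heartbeat grace period. A minimal sketch of checking those settings on the monitor host (option names from the 0.80.x era; the admin-socket path and monitor name are assumptions based on the defaults and on the log above):

ceph --admin-daemon /var/run/ceph/ceph-mon.rack1-ram-6.asok config show \
  | egrep 'mon_osd_min_down_report|osd_heartbeat_grace'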

Thanks
Sahana


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dan 
van der Ster
Sent: Thursday, February 05, 2015 2:41 PM
To: Sage Weil
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Monitor Restart triggers half of our OSDs marked down

On Thu, Feb 5, 2015 at 9:54 AM, Sage Weil  wrote:
> On Thu, 5 Feb 2015, Dan van der Ster wrote:
>> Hi,
>> We also have seen this once after upgrading to 0.80.8 (from dumpling).
>> Last week we had a network outage which marked out around 1/3rd of
>> our OSDs. The outage lasted less than a minute -- all the OSDs were
>> brought up once the network was restored.
>>
>> Then 30 minutes later I restarted one monitor to roll out a small
>> config change (changing leveldb log path). Surprisingly that resulted
>> in many OSDs (but seemingly fewer than before) being marked out again
>> then quickly marked in again.
>
> Did the 'wrongly marked down' messages appear in ceph.log?
>

Yes. Here's the start of the real outage:

2015-01-29 10:59:52.452367 mon.0 128.142.35.220:6789/0 417 : [INF] pgmap 
v35845354: 24608 pgs: 10 active+clean+scrubbing+deep, 24596
active+clean, 2 active+clean+scrubbing; 125 TB
 data, 377 TB used, 2021 TB / 2399 TB avail; 5137 kB/s rd, 31239 kB/s wr, 641 
op/s
2015-01-29 10:59:52.618591 mon.0 128.142.35.220:6789/0 431 : [INF]
osd.1129 128.142.23.100:6912/90010 failed (3 reports from 3 peers after 
20.000159 >= grace 20.00)
2015-01-29 10:59:52.939018 mon.0 128.142.35.220:6789/0 479 : [INF]
osd.1080 128.142.23.111:6869/124575 failed (3 reports from 3 peers after 
20.000181 >= grace 20.00)
2015-01-29 10:59:53.147616 mon.0 128.142.35.220:6789/0 510 : [INF]
osd.1113 128.142.23.107:6810/25247 failed (3 reports from 3 peers after 
20.525957 >= grace 20.00)
2015-01-29 10:59:53.342428 mon.0 128.142.35.220:6789/0 538 : [INF]
osd.1136 128.142.23.100:6864/86061 failed (3 reports from 3 peers after 
20.403032 >= grace 20.00)
2015-01-29 10:59:53.342557 mon.0 128.142.35.220:6789/0 540 : [INF]
osd.1144 128.142.23.100:6883/91968 failed (3 reports from 3 peers after 
20.403104 >= grace 20.00)

Then many "wrongly marked me down" messages around a minute later when the 
network came back.
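A quick way to confirm those entries is to grep the cluster log on a monitor host; a minimal sketch, assuming the default log location:

grep -E 'wrongly marked me down|reports from .* peers' /var/log/ceph/ceph.log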

But then when I restarted the (peon) monitor:

2015-01-29 11:29:18.250750 mon.0 128.142.35.220:6789/0 10570 : [INF] pgmap 
v35847068: 24608 pgs: 1 active+clean+scrubbing+deep, 24602
active+clean, 5 active+clean+scrubbing; 125 T
B data, 377 TB used, 2021 TB / 2399 TB avail; 193 MB/s rd, 238 MB/s wr, 7410 
op/s
2015-01-29 11:29:28.844678 mon.3 128.142.39.77:6789/0 1 : [INF] mon.2 calling 
new monitor election
2015-01-29 11:29:33.846946 mon.2 128.142.36.229:6789/0 9 : [INF] mon.4 calling 
new monitor election
2015-01-29 11:29:33.847022 mon.4 128.142.39.144:6789/0 7 : [INF] mon.3 calling 
new monitor election
2015-01-29 11:29:33.847085 mon.1 128.142.36.227:6789/0 24 : [INF]
mon.1 calling new monitor election
2015-01-29 11:29:33.853498 mon.3 128.142.39.77:6789/0 2 : [INF] mon.2 calling 
new monitor election
2015-01-29 11:29:33.895660 mon.0 128.142.35.220:6789/0 10860 : [INF]
mon.0 calling new monitor election
2015-01-29 11:29:33.901335 mon.0 128.142.35.220:6789/0 10861 : [INF]
mon.0@0 won leader election with quorum 0,1,2,3,4
2015-01-29 11:29:34.004028 mon.0 128.142.35.220:67

Re: [ceph-users] Question about CRUSH rule set parameter "min_size" "max_size"

2015-02-02 Thread Sahana Lokeshappa
Hi Mika,

The command below will assign a CRUSH ruleset to a pool:

ceph osd pool set <pool-name> crush_ruleset 1

For more info : http://ceph.com/docs/master/rados/operations/crush-map/
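To double-check which ruleset a pool is actually using, and what min_size/max_size each rule carries, a minimal sketch (the pool name 'test' is taken from Mika's mail):

ceph osd pool get test crush_ruleset
ceph osd dump | grep pool          # pool lines include size and crush_ruleset
ceph osd crush rule dump           # shows min_size/max_size for every rule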

Thanks
Sahana

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vickie 
ch
Sent: Tuesday, February 03, 2015 11:30 AM
To: ceph-users
Subject: [ceph-users] Question about CRUSH rule set parameter "min_size" 
"max_size"

Hi,
  A CRUSH rule has two parameters, "min_size" and "max_size".
The explanation of min_size is "If a pool makes fewer replicas than this number, CRUSH will NOT select this rule".
The explanation of max_size is "If a pool makes more replicas than this number, CRUSH will NOT select this rule".
The default pool replica size here is 1.
I created two rules, ruleset0 with min_size = 3 and ruleset1 with min_size = 1, and applied them.
Then I created a new pool named "test" and assumed pool "test" would use ruleset1.
But I found that pool "test" uses ruleset0.
Which part am I missing?

Thanks a lot for any advice!
Best wishes,
​Mika





Re: [ceph-users] How to detect degraded objects

2014-11-07 Thread Sahana Lokeshappa
Hi Tuan,

As far as I know, there is no CLI command for this. An indirect way: when you do a pg dump, you get the primary OSD assigned to every PG (check the 'primary' column). Then, on that OSD's host, look through the directory
/var/lib/ceph/osd/ceph-<osd-id>/current/<pgid>_head

All of the objects residing in that PG are there.
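Putting that together, a minimal sketch (PG 2.a9 and osd.12 are placeholders, not taken from Tuan's cluster):

ceph pg dump | grep degraded                        # list degraded PGs
ceph pg map 2.a9                                    # acting set and primary for one PG
ls /var/lib/ceph/osd/ceph-12/current/2.a9_head/     # on the primary's host: object files of that PG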

Thanks
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

From: Ta Ba Tuan [mailto:tua...@vccloud.vn]
Sent: Friday, November 07, 2014 5:18 PM
To: Sahana Lokeshappa; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to detect degraded objects

Hi Sahana,

Thanks for your reply. But how do I list the objects of the PGs? :D

Thanks!
Tuan
--
HaNoi-VietNam

On 11/07/2014 04:22 PM, Sahana Lokeshappa wrote:
Hi Tuan,


14918 active+clean

   1 active+clean+scrubbing+deep

  52 active+recovery_wait+degraded

   2 active+recovering+degraded

This says that 2 + 52 PGs are degraded. You can run the command:

ceph pg dump | grep degraded

You will get the list of PGs which are in the degraded state; the objects included in those PGs are the degraded ones.

Thanks
Sahana Lokeshappa



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ta Ba 
Tuan
Sent: Friday, November 07, 2014 2:49 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] How to detect degraded objects

Hi everyone,



  111/57706299 objects degraded (0.001%)


14918 active+clean

   1 active+clean+scrubbing+deep

  52 active+recovery_wait+degraded

   2 active+recovering+degraded

Ceph's state: 111/57706299 objects degraded.
Some missing object(s) caused Ceph to crash one OSD daemon.

How can I list the degraded objects? Please guide me. Thanks!

 -2196> 2014-11-07 16:04:23.063584 7fe1aed83700 10 osd.21 pg_epoch: 107789 
pg[6.9f0( v 107789'7058293 lc 107786'7058229 (107617'7055096,107789'7058293] 
local-les=107788 n=4506 ec=164 les/c 107788/107785 107787/107787/105273) 
[101,21,78] r=1 lpr=107787 pi=106418-107786/36 luod=0'0 crt=107786'7058241 lcod 
107786'7058222 active m=1] got missing 
1f7c69f0/rbd_data.885435b2bbeeb.59c2/head//6 v 107786'7058230

 0> 2014-11-07 16:14:57.024605 7f8602e3d700 -1 *** Caught signal (Aborted) 
**
 in thread 7f8602e3d700

 ceph version 0.87-6-gdba7def (dba7defc623474ad17263c9fccfec60fe7a439f0)
 1: /usr/bin/ceph-osd() [0x9b6725]
 2: (()+0xfcb0) [0x7f8626439cb0]
 3: (gsignal()+0x35) [0x7f8624d3e0d5]
 4: (abort()+0x17b) [0x7f8624d4183b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f862569069d]
 6: (()+0xb5846) [0x7f862568e846]
 7: (()+0xb5873) [0x7f862568e873]
 8: (()+0xb596e) [0x7f862568e96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x259) [0xaa0089]
 10: (ReplicatedPG::trim_object(hobject_t const&)+0x222d) [0x8139ed]
 11: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim 
const&)+0x43e) [0x82b9be]
 12: (boost::statechart::simple_state, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0xc0) [0x870ce0]
 13: (boost::statechart::state_machine, boost::statechart::null_except
ion_translator>::process_queued_events()+0xfb) [0x85618b]
 14: (boost::statechart::state_machine, boost::statechart::null_except
ion_translator>::process_event(boost::statechart::event_base const&)+0x1e) 
[0x85633e]
 15: (ReplicatedPG::snap_trimmer()+0x4f8) [0x7d5ef8]
 16: (OSD::SnapTrimWQ::_process(PG*)+0x14) [0x673ab4]
 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xa8fade]
 18: (ThreadPool::WorkThread::entry()+0x10) [0xa92870]
 19: (()+0x7e9a) [0x7f8626431e9a]
 20: (clone()+0x6d) [0x7f8624dfc31d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--
Tuan
HaNoi-VietNam






PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to detect degraded objects

2014-11-07 Thread Sahana Lokeshappa
Hi Tuan,


14918 active+clean

   1 active+clean+scrubbing+deep

  52 active+recovery_wait+degraded

   2 active+recovering+degraded

This says that 2 + 52 PGs are degraded. You can run the command:

ceph pg dump | grep degraded

You will get the list of PGs which are in the degraded state; the objects included in those PGs are the degraded ones.
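To narrow it down further, a minimal sketch (<pgid> is a placeholder for one of the degraded PG ids):

ceph health detail | grep degraded     # lists the degraded PGs
ceph pg <pgid> query                   # peering/recovery detail for that PG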

Thanks
Sahana Lokeshappa


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ta Ba 
Tuan
Sent: Friday, November 07, 2014 2:49 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] How to detect degraded objects

Hi everyone,


  111/57706299 objects degraded (0.001%)


14918 active+clean

   1 active+clean+scrubbing+deep

  52 active+recovery_wait+degraded

   2 active+recovering+degraded

Ceph's state: 111/57706299 objects degraded.
Some missing object(s) caused Ceph to crash one OSD daemon.

How can I list the degraded objects? Please guide me. Thanks!

 -2196> 2014-11-07 16:04:23.063584 7fe1aed83700 10 osd.21 pg_epoch: 107789 
pg[6.9f0( v 107789'7058293 lc 107786'7058229 (107617'7055096,107789'7058293] 
local-les=107788 n=4506 ec=164 les/c 107788/107785 107787/107787/105273) 
[101,21,78] r=1 lpr=107787 pi=106418-107786/36 luod=0'0 crt=107786'7058241 lcod 
107786'7058222 active m=1] got missing 
1f7c69f0/rbd_data.885435b2bbeeb.59c2/head//6 v 107786'7058230

 0> 2014-11-07 16:14:57.024605 7f8602e3d700 -1 *** Caught signal (Aborted) 
**
 in thread 7f8602e3d700

 ceph version 0.87-6-gdba7def (dba7defc623474ad17263c9fccfec60fe7a439f0)
 1: /usr/bin/ceph-osd() [0x9b6725]
 2: (()+0xfcb0) [0x7f8626439cb0]
 3: (gsignal()+0x35) [0x7f8624d3e0d5]
 4: (abort()+0x17b) [0x7f8624d4183b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f862569069d]
 6: (()+0xb5846) [0x7f862568e846]
 7: (()+0xb5873) [0x7f862568e873]
 8: (()+0xb596e) [0x7f862568e96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x259) [0xaa0089]
 10: (ReplicatedPG::trim_object(hobject_t const&)+0x222d) [0x8139ed]
 11: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim 
const&)+0x43e) [0x82b9be]
 12: (boost::statechart::simple_state, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0xc0) [0x870ce0]
 13: (boost::statechart::state_machine, boost::statechart::null_except
ion_translator>::process_queued_events()+0xfb) [0x85618b]
 14: (boost::statechart::state_machine, boost::statechart::null_except
ion_translator>::process_event(boost::statechart::event_base const&)+0x1e) 
[0x85633e]
 15: (ReplicatedPG::snap_trimmer()+0x4f8) [0x7d5ef8]
 16: (OSD::SnapTrimWQ::_process(PG*)+0x14) [0x673ab4]
 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x48e) [0xa8fade]
 18: (ThreadPool::WorkThread::entry()+0x10) [0xa92870]
 19: (()+0x7e9a) [0x7f8626431e9a]
 20: (clone()+0x6d) [0x7f8624dfc31d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--
Tuan
HaNoi-VietNam







Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-25 Thread Sahana Lokeshappa
Hi All,

Here are the steps I followed to get all PGs back to the active+clean state. I still don't know the root cause of this PG state.

1. Force create pgs which are in stale+down+peering
2. Stop osd.12
3. Mark osd.12 as lost
4. Start osd.12
5. All pgs were back to active+clean state
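For reference, a minimal sketch of those steps as commands (pg 0.49 and osd.12 are taken from the earlier mails; the service invocation depends on the init system, so treat it as an assumption):

ceph pg force_create_pg 0.49              # repeat for each stale+down+peering pg
stop ceph-osd id=12                       # Upstart; 'service ceph stop osd.12' under sysvinit
ceph osd lost 12 --yes-i-really-mean-it
start ceph-osd id=12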

Thanks
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283 
sahana.lokesha...@sandisk.com


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sahana 
Lokeshappa
Sent: Thursday, September 25, 2014 1:26 PM
To: Sage Weil
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

Replies Inline :

Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, September 24, 2014 6:10 PM
To: Sahana Lokeshappa
Cc: Varada Kari; ceph-us...@ceph.com
Subject: RE: [Ceph-community] Pgs are in stale+down+peering state

On Wed, 24 Sep 2014, Sahana Lokeshappa wrote:
> 2.a9518 0   0   0   0   2172649472  3001
> 3001active+clean2014-09-22 17:49:35.357586  6826'35762
> 17842:72706 [12,7,28]   12  [12,7,28]   12
> 6826'35762
> 2014-09-22 11:33:55.985449  0'0 2014-09-16 20:11:32.693864

Can you verify that 2.a9 exists in the data directory for 12, 7, and/or 28?  If so, the next step would be to enable logging (debug osd = 20, debug ms = 1) and see why peering is stuck...

Yes, the 2.a9 directories are present on osd.12, 7, and 28,

and the 0.49, 0.4d, and 0.1c directories are not present on their respective acting OSDs.
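For reference, a minimal sketch of raising those debug levels at runtime instead of editing ceph.conf and restarting (osd.12 is the OSD from this thread):

ceph tell osd.12 injectargs '--debug_osd 20 --debug_ms 1'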


Here are the logs I can see when debugs were raised to 20


2014-09-24 18:38:41.706566 7f92e2dc8700  7 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] replica_scrub
2014-09-24 18:38:41.706586 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] build_scrub_map
2014-09-24 18:38:41.706592 7f92e2dc8700 20 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] scrub_map_chunk [476de738//0//-1,f38//0//-1)
2014-09-24 18:38:41.711778 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] _scan_list scanning 23 objects deeply
2014-09-24 18:38:41.730881 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x89cda20 already has epoch 17850
2014-09-24 18:38:41.73 7f92eede0700 20 osd.12 17850 share_map_peer 
0x89cda20 already has epoch 17850
2014-09-24 18:38:41.822444 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0xd2eb080 already has epoch 17850
2014-09-24 18:38:41.822519 7f92eede0700 20 osd.12 17850 share_map_peer 
0xd2eb080 already has epoch 17850
2014-09-24 18:38:41.878894 7f92eede0700 20 osd.12 17850 share_map_peer 
0xd5cd5a0 already has epoch 17850
2014-09-24 18:38:41.878921 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0xd5cd5a0 already has epoch 17850
2014-09-24 18:38:41.918307 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x1161bde0 already has epoch 17850
2014-09-24 18:38:41.918426 7f92eede0700 20 osd.12 17850 share_map_peer 
0x1161bde0 already has epoch 17850
2014-09-24 18:38:41.951678 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x7fc5700 already has epoch 17850
2014-09-24 18:38:41.951709 7f92eede0700 20 osd.12 17850 share_map_peer 
0x7fc5700 already has epoch 17850
2014-09-24 18:38:42.064759 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] build_scrub_map_chunk done.
2014-09-24 18:38:42.107016 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x10377b80 already has epoch 17850
2014-09-24 18:38:42.107032 7f92eede0700 20 osd.12 17850 share_map_peer 
0x10377b80 already has epoch 17850
2014-09-24 18:38:42.109356 7f92f15e5700 10 osd.12 17850 do_waiters -- start
2014-09-24 18:38:42.109372 7f92f15e5700 10 osd.12 17850 do_waiters -- finish
2014-09-24 18:38:42

Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-25 Thread Sahana Lokeshappa
Hi Craig,

Sorry for the late response; I somehow missed this mail.
All OSDs are up and running, and there were no specific logs related to this activity. There is no I/O running right now. A few OSDs were marked in and out, removed completely, and recreated before these PGs reached this state.
I tried restarting the OSDs. It didn't work.

Thanks
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

From: Craig Lewis [mailto:cle...@centraldesktop.com]
Sent: Wednesday, September 24, 2014 5:44 AM
To: Sahana Lokeshappa
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

Is osd.12 doing anything strange? Is it consuming lots of CPU or I/O? Is it flapping? Writing any interesting logs? Have you tried restarting it?

If that doesn't help, try the other involved osds: 56, 27, 6, 25, 23.  I doubt 
that it will help, but it won't hurt.
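A minimal sketch of the kind of quick checks Craig suggests (log paths are the defaults and may differ):

ceph osd tree | grep -w osd.12                 # up/down and weight as the cluster sees it
grep 'osd.12 ' /var/log/ceph/ceph.log | tail   # recent cluster-log entries mentioning it
tail -f /var/log/ceph/ceph-osd.12.log          # on the OSD host: its own log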



On Mon, Sep 22, 2014 at 11:21 AM, Varada Kari 
mailto:varada.k...@sandisk.com>> wrote:
Hi Sage,

To give more context on this problem,

This cluster has two pools rbd and user-created.

osd.12 is the primary for some other PGs as well, but the problem happens only for these three PGs.

$ sudo ceph osd lspools
0 rbd,2 pool1,

$ sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck 
inactive; 3 pgs stuck stale; 3 pgs stuck unclean; 1 requests are blocked > 32 
sec
monmap e1: 3 mons at 
{rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0<http://10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0>},
 election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
 osdmap e17842: 64 osds: 64 up, 64 in
  pgmap v79729: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
12504 GB used, 10971 GB / 23476 GB avail
2145 active+clean
   3 stale+down+peering

Snippet from pg dump:

2.a9518 0   0   0   0   2172649472  30013001
active+clean2014-09-22 17:49:35.357586  6826'35762  17842:72706 
[12,7,28]   12  [12,7,28]   12   6826'35762  2014-09-22 
11:33:55.985449  0'0 2014-09-16 20:11:32.693864
0.590   0   0   0   0   0   0   0   
active+clean2014-09-22 17:50:00.751218  0'0 17842:4472  
[12,41,2]   12  [12,41,2]   12  0'0 2014-09-22 16:47:09.315499  
 0'0 2014-09-16 12:20:48.618726
0.4d0   0   0   0   0   0   4   4   
stale+down+peering  2014-09-18 17:51:10.038247  186'4   11134:498   
[12,56,27]  12  [12,56,27]  12  186'42014-09-18 17:30:32.393188 
 0'0 2014-09-16 12:20:48.615322
0.490   0   0   0   0   0   0   0   
stale+down+peering  2014-09-18 17:44:52.681513  0'0 11134:498   
[12,6,25]   12  [12,6,25]   12  0'0  2014-09-18 17:16:12.986658 
 0'0 2014-09-16 12:20:48.614192
0.1c0   0   0   0   0   0   12  12  
stale+down+peering  2014-09-18 17:51:16.735549  186'12  11134:522   
[12,25,23]  12  [12,25,23]  12  186'12   2014-09-18 17:16:04.457863 
 186'10  2014-09-16 14:23:58.731465
2.17510 0   0   0   0   2139095040  30013001
active+clean2014-09-22 17:52:20.364754  6784'30742  17842:72033 
[12,27,23]  12  [12,27,23]  12   6784'30742  2014-09-22 
00:19:39.905291  0'0 2014-09-16 20:11:17.016299
2.7e8   508 0   0   0   0   2130706432  34333433
active+clean2014-09-22 17:52:20.365083  6702'21132  17842:64769 
[12,25,23]  12  [12,25,23]  12   6702'21132  2014-09-22 
17:01:20.546126  0'0 2014-09-16 14:42:32.079187
2.6a5   528 0   0   0   0   2214592512  28402840
active+clean2014-09-22 22:50:38.092084  6775'34416  17842:83221 
[12,58,0]   12  [12,58,0]   12   6775'34416  2014-09-22 
22:50:38.091989  0'0 2014-09-16 20:11:32.703368

And we couldn’t observe any peering events happening on the primary OSD.

$ sudo ceph pg 0.49 query
Error ENOENT: i don't have pgid 0.49
$ sudo ceph pg 0.4d query
Error ENOENT: i don't have pgid 0.4d
$ sudo ceph pg 0.1c query
Error ENOENT: i don't have pgid 0.1c

We are not able to explain why peering is stuck. BTW, the rbd pool doesn’t contain any data.

Varada

From: Ceph-community 
[mailto:ceph-community-boun...@lists.ceph.com<mailto:ceph-commu

Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-25 Thread Sahana Lokeshappa
Replies Inline :

Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, September 24, 2014 6:10 PM
To: Sahana Lokeshappa
Cc: Varada Kari; ceph-us...@ceph.com
Subject: RE: [Ceph-community] Pgs are in stale+down+peering state

On Wed, 24 Sep 2014, Sahana Lokeshappa wrote:
> 2.a9518 0   0   0   0   2172649472  3001
> 3001active+clean2014-09-22 17:49:35.357586  6826'35762
> 17842:72706 [12,7,28]   12  [12,7,28]   12
> 6826'35762
> 2014-09-22 11:33:55.985449  0'0 2014-09-16 20:11:32.693864

Can you verify that 2.a9 exists in the data directory for 12, 7, and/or 28?  If so, the next step would be to enable logging (debug osd = 20, debug ms = 1) and see why peering is stuck...

Yes, the 2.a9 directories are present on osd.12, 7, and 28,

and the 0.49, 0.4d, and 0.1c directories are not present on their respective acting OSDs.


Here are the logs I can see when debugs were raised to 20


2014-09-24 18:38:41.706566 7f92e2dc8700  7 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] replica_scrub
2014-09-24 18:38:41.706586 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] build_scrub_map
2014-09-24 18:38:41.706592 7f92e2dc8700 20 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] scrub_map_chunk [476de738//0//-1,f38//0//-1)
2014-09-24 18:38:41.711778 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] _scan_list scanning 23 objects deeply
2014-09-24 18:38:41.730881 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x89cda20 already has epoch 17850
2014-09-24 18:38:41.73 7f92eede0700 20 osd.12 17850 share_map_peer 
0x89cda20 already has epoch 17850
2014-09-24 18:38:41.822444 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0xd2eb080 already has epoch 17850
2014-09-24 18:38:41.822519 7f92eede0700 20 osd.12 17850 share_map_peer 
0xd2eb080 already has epoch 17850
2014-09-24 18:38:41.878894 7f92eede0700 20 osd.12 17850 share_map_peer 
0xd5cd5a0 already has epoch 17850
2014-09-24 18:38:41.878921 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0xd5cd5a0 already has epoch 17850
2014-09-24 18:38:41.918307 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x1161bde0 already has epoch 17850
2014-09-24 18:38:41.918426 7f92eede0700 20 osd.12 17850 share_map_peer 
0x1161bde0 already has epoch 17850
2014-09-24 18:38:41.951678 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x7fc5700 already has epoch 17850
2014-09-24 18:38:41.951709 7f92eede0700 20 osd.12 17850 share_map_peer 
0x7fc5700 already has epoch 17850
2014-09-24 18:38:42.064759 7f92e2dc8700 10 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr=17722 pi=17199-17721/6 
luod=0'0 crt=0'0 lcod 0'0 active] build_scrub_map_chunk done.
2014-09-24 18:38:42.107016 7f92ed5dd700 20 osd.12 17850 share_map_peer 
0x10377b80 already has epoch 17850
2014-09-24 18:38:42.107032 7f92eede0700 20 osd.12 17850 share_map_peer 
0x10377b80 already has epoch 17850
2014-09-24 18:38:42.109356 7f92f15e5700 10 osd.12 17850 do_waiters -- start
2014-09-24 18:38:42.109372 7f92f15e5700 10 osd.12 17850 do_waiters -- finish
2014-09-24 18:38:42.109373 7f92f15e5700 20 osd.12 17850 _dispatch 0xeb0d900 
replica scrub(pg: 
2.738,from:0'0,to:6489'28646,epoch:17850,start:f38//0//-1,end:92371f38//0//-1,chunky:1,deep:1,version:5)
 v5
2014-09-24 18:38:42.109378 7f92f15e5700 10 osd.12 17850 queueing MOSDRepScrub 
replica scrub(pg: 
2.738,from:0'0,to:6489'28646,epoch:17850,start:f38//0//-1,end:92371f38//0//-1,chunky:1,deep:1,version:5)
 v5
2014-09-24 18:38:42.109395 7f92f15e5700 10 osd.12 17850 do_waiters -- start
2014-09-24 18:38:42.109396 7f92f15e5700 10 osd.12 17850 do_waiters -- finish
2014-09-24 18:38:42.109456 7f92e2dc8700  7 osd.12 pg_epoch: 17850 pg[2.738( v 
6870'28894 (4076'25093,6870'28894] local-les=17723 n=537 ec=188 les/c 
17723/17725 17722/17722/17709) [57,12,48] r=1 lpr

Re: [ceph-users] [Ceph-community] Pgs are in stale+down+peering state

2014-09-23 Thread Sahana Lokeshappa
Hi All,

Can anyone help me out here?

Sahana Lokeshappa
Test Development Engineer I


From: Varada Kari
Sent: Monday, September 22, 2014 11:52 PM
To: Sage Weil; Sahana Lokeshappa; ceph-us...@ceph.com; 
ceph-commun...@lists.ceph.com
Subject: RE: [Ceph-community] Pgs are in stale+down+peering state

Hi Sage,

To give more context on this problem,

This cluster has two pools rbd and user-created.

osd.12 is the primary for some other PGs as well, but the problem happens only for these three PGs.

$ sudo ceph osd lspools
0 rbd,2 pool1,

$ sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck 
inactive; 3 pgs stuck stale; 3 pgs stuck unclean; 1 requests are blocked > 32 
sec
monmap e1: 3 mons at 
{rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
 election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
 osdmap e17842: 64 osds: 64 up, 64 in
  pgmap v79729: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
12504 GB used, 10971 GB / 23476 GB avail
2145 active+clean
   3 stale+down+peering

Snippet from pg dump:

2.a9518 0   0   0   0   2172649472  30013001
active+clean2014-09-22 17:49:35.357586  6826'35762  17842:72706 
[12,7,28]   12  [12,7,28]   12   6826'35762  2014-09-22 
11:33:55.985449  0'0 2014-09-16 20:11:32.693864
0.590   0   0   0   0   0   0   0   
active+clean2014-09-22 17:50:00.751218  0'0 17842:4472  
[12,41,2]   12  [12,41,2]   12  0'0 2014-09-22 16:47:09.315499  
 0'0 2014-09-16 12:20:48.618726
0.4d0   0   0   0   0   0   4   4   
stale+down+peering  2014-09-18 17:51:10.038247  186'4   11134:498   
[12,56,27]  12  [12,56,27]  12  186'42014-09-18 17:30:32.393188 
 0'0 2014-09-16 12:20:48.615322
0.490   0   0   0   0   0   0   0   
stale+down+peering  2014-09-18 17:44:52.681513  0'0 11134:498   
[12,6,25]   12  [12,6,25]   12  0'0  2014-09-18 17:16:12.986658 
 0'0 2014-09-16 12:20:48.614192
0.1c0   0   0   0   0   0   12  12  
stale+down+peering  2014-09-18 17:51:16.735549  186'12  11134:522   
[12,25,23]  12  [12,25,23]  12  186'12   2014-09-18 17:16:04.457863 
 186'10  2014-09-16 14:23:58.731465
2.17510 0   0   0   0   2139095040  30013001
active+clean2014-09-22 17:52:20.364754  6784'30742  17842:72033 
[12,27,23]  12  [12,27,23]  12   6784'30742  2014-09-22 
00:19:39.905291  0'0 2014-09-16 20:11:17.016299
2.7e8   508 0   0   0   0   2130706432  34333433
active+clean2014-09-22 17:52:20.365083  6702'21132  17842:64769 
[12,25,23]  12  [12,25,23]  12   6702'21132  2014-09-22 
17:01:20.546126  0'0 2014-09-16 14:42:32.079187
2.6a5   528 0   0   0   0   2214592512  28402840
active+clean2014-09-22 22:50:38.092084  6775'34416  17842:83221 
[12,58,0]   12  [12,58,0]   12   6775'34416  2014-09-22 
22:50:38.091989  0'0 2014-09-16 20:11:32.703368

And we couldn’t observe any peering events happening on the primary OSD.

$ sudo ceph pg 0.49 query
Error ENOENT: i don't have pgid 0.49
$ sudo ceph pg 0.4d query
Error ENOENT: i don't have pgid 0.4d
$ sudo ceph pg 0.1c query
Error ENOENT: i don't have pgid 0.1c

We are not able to explain why peering is stuck. BTW, the rbd pool doesn’t contain any data.

Varada

From: Ceph-community [mailto:ceph-community-boun...@lists.ceph.com] On Behalf 
Of Sage Weil
Sent: Monday, September 22, 2014 10:44 PM
To: Sahana Lokeshappa; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>; 
ceph-us...@ceph.com<mailto:ceph-us...@ceph.com>; 
ceph-commun...@lists.ceph.com<mailto:ceph-commun...@lists.ceph.com>
Subject: Re: [Ceph-community] Pgs are in stale+down+peering state


Stale means that the primary OSD for the PG went down and the status is stale.  
They all seem to be from OSD.12... Seems like something is preventing that OSD 
from reporting to the mon?

sage
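A minimal sketch of checking whether osd.12 is actually up and reporting (the admin-socket path is the default on the OSD host, and is an assumption here):

ceph osd tree | grep -w osd.12                              # the monitors' view of the OSD
ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok status   # on the OSD host: its own state and map epochs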

On September 22, 2014 7:51:48 AM EDT, Sahana Lokeshappa 
mailto:sahana.lokesha...@sandisk.com>> wrote:
Hi all,


I used the 'ceph osd thrash' command, and after all OSDs came back up and in, 3 PGs are in the stale+down+peering state.


sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck 
inactive; 3 pgs stuck stale; 3 

[ceph-users] Pgs are in stale+down+peering state

2014-09-22 Thread Sahana Lokeshappa
Hi all,

I used the 'ceph osd thrash' command, and after all OSDs came back up and in, 3 PGs are in the stale+down+peering state.

sudo ceph -s
cluster 99ffc4a5-2811-4547-bd65-34c7d4c58758
 health HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck 
inactive; 3 pgs stuck stale; 3 pgs stuck unclean
 monmap e1: 3 mons at 
{rack2-ram-1=10.242.42.180:6789/0,rack2-ram-2=10.242.42.184:6789/0,rack2-ram-3=10.242.42.188:6789/0},
 election epoch 2008, quorum 0,1,2 rack2-ram-1,rack2-ram-2,rack2-ram-3
 osdmap e17031: 64 osds: 64 up, 64 in
  pgmap v76728: 2148 pgs, 2 pools, 4135 GB data, 1033 kobjects
12501 GB used, 10975 GB / 23476 GB avail
2145 active+clean
   3 stale+down+peering

sudo ceph health detail
HEALTH_WARN 3 pgs down; 3 pgs peering; 3 pgs stale; 3 pgs stuck inactive; 3 pgs 
stuck stale; 3 pgs stuck unclean
pg 0.4d is stuck inactive for 341048.948643, current state stale+down+peering, 
last acting [12,56,27]
pg 0.49 is stuck inactive for 341048.948667, current state stale+down+peering, 
last acting [12,6,25]
pg 0.1c is stuck inactive for 341048.949362, current state stale+down+peering, 
last acting [12,25,23]
pg 0.4d is stuck unclean for 341048.948665, current state stale+down+peering, 
last acting [12,56,27]
pg 0.49 is stuck unclean for 341048.948687, current state stale+down+peering, 
last acting [12,6,25]
pg 0.1c is stuck unclean for 341048.949382, current state stale+down+peering, 
last acting [12,25,23]
pg 0.4d is stuck stale for 339823.956929, current state stale+down+peering, 
last acting [12,56,27]
pg 0.49 is stuck stale for 339823.956930, current state stale+down+peering, 
last acting [12,6,25]
pg 0.1c is stuck stale for 339823.956925, current state stale+down+peering, 
last acting [12,25,23]


Please, can anyone explain why the PGs are in this state?
Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com






Re: [ceph-users] ceph -s error

2014-09-04 Thread Sahana Lokeshappa
Hi Santhosh,

Copy the updated ceph.conf and keyrings from the admin node to all cluster nodes (they live in /etc/ceph/). If you are using ceph-deploy, run this command from the admin node:

ceph-deploy --overwrite-conf admin cluster-node1 cluster-node2
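If ceph-deploy is not in use, a minimal sketch of doing the same thing by hand (hostnames are placeholders):

scp /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring cluster-node1:/etc/ceph/
scp /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring cluster-node2:/etc/ceph/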

Sahana Lokeshappa
Test Development Engineer I
SanDisk Corporation
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Santhosh Fernandes
Sent: Friday, September 05, 2014 10:53 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] ceph -s error

Hi All,

I am trying to configure Ceph with two OSD nodes, one MON node, one admin node, and one object gateway node.

My admin node gives the proper output for 'ceph -s', but the other Ceph nodes give me output similar to the one below.

2014-09-05 10:45:01.946215 7f45d8852700 -1 monclient(hunting): ERROR: missing 
keyring, cannot use cephx for authentication
2014-09-05 10:45:01.946239 7f45d8852700  0 librados: client.admin 
initialization error (2) No such file or directory
Error connecting to cluster: ObjectNotFound
Can anyone help me to resolve this issue?
Regards,
Santhosh






[ceph-users] osd crashed with assert at add_log_entry

2014-07-21 Thread Sahana Lokeshappa
Hi All,

I have a Ceph cluster with 3 monitors and 3 OSD nodes (3 OSDs on each node).

While I/O was going on, I rebooted an OSD node that hosts osd.6, osd.7, and osd.8.

osd.0 and osd.2 crashed with assert(e.version > info.last_update) in PG::add_log_entry:

2014-07-17 17:54:14.893962 7f91f3660700 -1 osd/PG.cc: In function 'void 
PG::add_log_entry(pg_log_entry_t&, ceph::bufferlist&)' thread 7f91f3660700 time 
2014-07-17 17:54:13.252064
osd/PG.cc: 2619: FAILED assert(e.version > info.last_update)
ceph version andisk-sprint-2-drop-3-390-g2dbd85c 
(2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (PG::add_log_entry(pg_log_entry_t&, ceph::buffer::list&)+0x481) [0x733a61]
2: (PG::append_log(std::vector<pg_log_entry_t, 
std::allocator<pg_log_entry_t> >&, eversion_t, ObjectStore::Transaction&, 
bool)+0xdf) [0x74483f]
3: 
(ReplicatedBackend::sub_op_modify(std::tr1::shared_ptr<OpRequest>)+0xcfe) 
[0x8193be]
4: 
(ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x4a6)
 [0x904586]
5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, 
ThreadPool::TPHandle&)+0x2db) [0x7aedcb]
6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, 
std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x459) [0x635719]
7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x346) 
[0x635ce6]
8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ce) [0xa4a1ce]
9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa4c420]
10: (()+0x8182) [0x7f920f579182]
11: (clone()+0x6d) [0x7f920d91a30d]


Raised tracker : http://tracker.ceph.com/issues/8887

Logs are attached to tracker.



Thanks
Sahana Lokeshappa
Test Development Engineer I
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com






Re: [ceph-users] logrotate

2014-07-16 Thread Sahana Lokeshappa
Hi Sage,

I am facing the same problem:
ls -l /var/log/ceph/
total 54280
-rw-r--r-- 1 root root0 Jul 17 06:39 ceph-osd.0.log
-rw-r--r-- 1 root root 19603037 Jul 16 19:01 ceph-osd.0.log.1.gz
-rw-r--r-- 1 root root0 Jul 17 06:39 ceph-osd.1.log
-rw-r--r-- 1 root root 18008247 Jul 16 19:01 ceph-osd.1.log.1.gz
-rw-r--r-- 1 root root0 Jul 17 06:39 ceph-osd.2.log
-rw-r--r-- 1 root root 17969054 Jul 16 19:01 ceph-osd.2.log.1.gz

Because of this, I lost logs until I restarted the OSDs.

thanks
Sahana Lokeshappa
Test Development Engineer I

3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Uwe 
Grohnwaldt
Sent: Sunday, July 13, 2014 7:10 AM
To: ceph-us...@ceph.com
Subject: Re: [ceph-users] logrotate

Hi,

we are observing the same problem. After logrotate the new logfile is empty.
The old logfiles are marked as deleted in lsof. At the moment we are restarting 
osds on a regular basis.

Uwe

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of James Eckersall
> Sent: Freitag, 11. Juli 2014 17:06
> To: Sage Weil
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] logrotate
>
> Hi Sage,
>
> Many thanks for the info.
> I have inherited this cluster, but I believe it may have been created
> with mkcephfs rather than ceph-deploy.
>
> I'll touch the done files and see what happens.  Looking at the logic
> in the logrotate script I'm sure this will resolve the problem.
>
> Thanks
>
> J
>
>
> On 11 July 2014 15:04, Sage Weil  <mailto:sw...@redhat.com> > wrote:
>
>
>   On Fri, 11 Jul 2014, James Eckersall wrote:
>   > Upon further investigation, it looks like this part of the ceph
> logrotate
>   > script is causing me the problem:
>   >
>   > if [ -e "/var/lib/ceph/$daemon/$f/done" ] && [ -e
>   > "/var/lib/ceph/$daemon/$f/upstart" ] && [ ! -e
>   > "/var/lib/ceph/$daemon/$f/sysvinit" ]; then
>   >
>   > I don't have a "done" file in the mounted directory for any of my
> osd's.  My
>   > mon's all have the done file and logrotate is working fine for those.
>
>
>   Was this cluster created a while ago with mkcephfs?
>
>
>   > So my question is, what is the purpose of the "done" file and
> should I just
>   > create one for each of my osd's ?
>
>
>   It's used by the newer ceph-disk stuff to indicate whether the OSD
>   directory is propertly 'prepared' and whether the startup stuff
> should pay
>   attention.
>
>   If these are active OSDs, yeah, just touch 'done'.  (Don't touch
> sysvinit,
>   though, if you are enumerating the daemons in ceph.conf with host =
> foo
>   lines.)
>
>   sage
>
>
>
>   >
>   >
>   >
>   > On 10 July 2014 11:10, James Eckersall  <mailto:james.eckers...@gmail.com> > wrote:
>   >   Hi,
>   > I've just upgraded a ceph cluster from Ubuntu 12.04 with 0.73.1 to
>   > Ubuntu 14.04 with 0.80.1.
>   >
>   > I've noticed that the log rotation doesn't appear to work correctly.
>   > The OSD's are just not logging to the current ceph-osd-X.log file.
>   > If I restart the OSD's, they start logging, but then overnight, they
>   > stop logging when the logs are rotated.
>   >
>   > Has anyone else noticed a problem with this?
>   >
>   >
>   >
>   >
>
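Following Sage's suggestion, a minimal sketch of touching 'done' for every OSD data directory on a host (paths are the defaults; verify them before running):

for d in /var/lib/ceph/osd/ceph-*; do
    [ -e "$d/done" ] || touch "$d/done"
done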






Re: [ceph-users] OSD aserts in OSD shutdown

2014-06-20 Thread Sahana Lokeshappa


Sahana Lokeshappa
Test Development Engineer I
3rd Floor, Bagmane Laurel, Bagmane Tech Park
C V Raman nagar, Bangalore 560093
T: +918042422283
sahana.lokesha...@sandisk.com

From: Ceph-community [mailto:ceph-community-boun...@lists.ceph.com] On Behalf 
Of Sahana Lokeshappa
Sent: Friday, June 20, 2014 7:11 PM
To: ceph-users@lists.ceph.com; ceph-commun...@lists.ceph.com
Subject: [Ceph-community] OSD aserts in OSDService


Hi all,

I have a Ceph cluster with 8 nodes, 3 OSDs on each node, and 3 monitors.

While client I/O and recovery I/O were going on, one of the OSDs was removed for some reason.

The OSD crashed with this assert:
1: (OSD::shutdown()+0x176f) [0x61e94f]
2: (OSD::handle_signal(int)+0x60) [0x61f210]
3: (SignalHandler::entry()+0x19f) [0x97584f]
4: (()+0x8182) [0x7fa9836b2182]
5: (clone()+0x6d) [0x7fa981a5330d]

pg 4.53 still had an extra reference (the shutdown path expects a refcount of exactly 1), which resulted in this assertion failure.

int OSD::shutdown()
{
  if (!service.prepare_to_stop())
    return 0; // already shutting down
  osd_lock.Lock();
  if (is_stopping()) {
    osd_lock.Unlock();
    return 0;
  }
  derr << "shutdown" << dendl;

  set_state(STATE_STOPPING);

  // Shutdown PGs
  {
    RWLock::RLocker l(pg_map_lock);
    for (ceph::unordered_map<spg_t, PG*>::iterator p = pg_map.begin();
         p != pg_map.end();
         ++p) {
      dout(20) << " kicking pg " << p->first << dendl;
      p->second->lock();
      p->second->on_shutdown();
      p->second->unlock();
      p->second->osr->flush();
    }
  }

  // finish ops
  op_shardedwq.drain(); // should already be empty except for lagard PGs
  {
    Mutex::Locker l(finished_lock);
    finished.clear(); // zap waiters (bleh, this is messy)
  }

  // Remove PGs
#ifdef PG_DEBUG_REFS
  service.dump_live_pgids();
#endif
  {
    RWLock::RLocker l(pg_map_lock);
    for (ceph::unordered_map<spg_t, PG*>::iterator p = pg_map.begin();
         p != pg_map.end();
         ++p) {
      dout(20) << " kicking pg " << p->first << dendl;
      p->second->lock();
      if (p->second->ref.read() != 1) {
        derr << "pgid " << p->first << " has ref count of "
             << p->second->ref.read() << dendl;
        assert(0);
      }
      p->second->unlock();
      p->second->put("PGMap");
    }
    pg_map.clear();
  }
#ifdef PG_DEBUG_REFS
  service.dump_live_pgids();
#endif

  ...
}

I could see this PG being kicked by the OSD earlier, but this could be because some write ops are still stuck on it or something else is still holding a reference to this PG. To dump the PG references, Ceph would have to be built with PG reference debugging (the PG_DEBUG_REFS define seen in the code above) enabled.

-3031> 2014-06-19 08:28:27.037414 7fa95b007700 20 osd.23 1551 kicking pg 4.53
-3030> 2014-06-19 08:28:27.037417 7fa95b007700 30 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=1551'2201 crt=1498'2116 
lcod 1551'2200 mlcod 1551'2200 active+clean] lock
-3029> 2014-06-19 08:28:27.037425 7fa95b007700 10 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=1551'2201 crt=1498'2116 
lcod 1551'2200 mlcod 1551'2200 active+clean] on_shutdown
-3028> 2014-06-19 08:28:27.037434 7fa95b007700 10 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=1551'2201 crt=1498'2116 
lcod 1551'2200 mlcod 1551'2200 active+clean] cancel_copy_ops
-3027> 2014-06-19 08:28:27.037442 7fa95b007700 10 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=1551'2201 crt=1498'2116 
lcod 1551'2200 mlcod 1551'2200 active+clean] cancel_flush_ops
-3026> 2014-06-19 08:28:27.037486 7fa95b007700 10 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=1551'2201 crt=1498'2116 
lcod 1551'2200 mlcod 1551'2200 active+clean] applying repop tid 3755
-3025> 2014-06-19 08:28:27.037494 7fa95b007700 20 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=1551'2201 crt=1498'2116 
lcod 1551'2200 mlcod 1551'2200 active+clean] remove_repop repgather(0x15548c00 
1551'2202 rep_tid=3755 committed?=0 applied?=0 lock=0 
op=osd_op(client.5117.1:3911087 rbd_data.16862ae8944a.00038629 [write 
2293760~524288] 4.3d302453 ondisk+write e1551) v4)

[ceph-users] OSD aserts in OSDService

2014-06-20 Thread Sahana Lokeshappa
0 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=1551'2201 crt=1498'2116 
lcod 1551'2200 mlcod 1551'2200 active+clean] applying repop tid 3756
-3022> 2014-06-19 08:28:27.037526 7fa95b007700 20 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=1551'2201 crt=1498'2116 
lcod 1551'2200 mlcod 1551'2200 active+clean] remove_repop repgather(0x161d1780 
1551'2203 rep_tid=3756 committed?=0 applied?=0 lock=0 
op=osd_op(client.5117.1:3911088 rbd_data.16862ae8944a.00038629 [write 
2818048~200704] 4.3d302453 ondisk+write e1551) v4)
-3021> 2014-06-19 08:28:27.037538 7fa95b007700 15 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=1551'2201 crt=1498'2116 
lcod 1551'2200 mlcod 1551'2200 active+clean] requeue_ops
-3020> 2014-06-19 08:28:27.037549 7fa95b007700 10 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=1551'2201 crt=1498'2116 
lcod 1551'2200 mlcod 1551'2200 active+clean] clear_primary_state
-3019> 2014-06-19 08:28:27.037562 7fa95b007700 20 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=0'0 crt=1498'2116 lcod 
1551'2200 mlcod 0'0 active+clean] agent_stop
-3018> 2014-06-19 08:28:27.037570 7fa95b007700 10 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=0'0 crt=1498'2116 lcod 
1551'2200 mlcod 0'0 active+clean] cancel_recovery
-3017> 2014-06-19 08:28:27.037578 7fa95b007700 10 osd.23 pg_epoch: 1551 
pg[4.53( v 1551'2203 (0'0,1551'2203] local-les=1500 n=461 ec=108 les/c 
1500/1510 1499/1499/1499) [23,18,8] r=0 lpr=1499 luod=0'0 crt=1498'2116 lcod 
1551'2200 mlcod 0'0 active+clean] clear_recovery_state


Seems to be a recent issue in the shutdown path when writes are in progress.

Sahana Lokeshappa




