Re: [ceph-users] Help with inconsistent pg on EC pool, v9.0.2
On 8/28/15 4:18 PM, Aaron Ten Clay wrote:
> How would I go about removing the bad PG with ceph-objectstore-tool? I'm
> having trouble finding any documentation for said tool.

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal --pgid 2.36s1 --op remove

> Is it safe to just move /var/lib/ceph/osd/ceph-21/current/2.36s1_head to
> another place and start the OSD process again?

Yes.

> Can I safely tar the directory with
> tar -cvp --xattrs -f /opt/osd-21-pg-2.36s1-removed_2015-08-28.tar /var/lib/ceph/osd/ceph-21/current/2.36s1_*,
> then rm -rf /var/lib/ceph/osd/ceph-21/current/2.36s1_*?

Better to use the --op export feature if you want to save the pg state:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-X --journal-path /var/lib/ceph/osd/ceph-X/journal --pgid 2.36s1 --op export --file save2.36.s1.export

> Just want to make sure I don't do something silly and shoot myself in the
> foot. Thanks!
>
> -Aaron
>
> On Fri, Aug 28, 2015 at 12:16 PM, David Zafman wrote:
>> I don't know about removing the OSD from the CRUSH map. That seems like
>> overkill to me.
>>
>> I just realized a possibly better way: it would have been to take the
>> OSD down, not out; remove the EC pg shard with the bad chunk; then bring
>> the OSD up again and let recovery repair just that single missing shard
>> on the single OSD, with no other disruption.
>>
>> David
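David's export-then-remove sequence can be sketched end to end as follows. This is a sketch only: the OSD id, paths, and the systemctl unit name are assumptions for a typical Filestore deployment of this era (your init system may differ), and ceph-objectstore-tool must only be run while the OSD daemon is stopped.

```shell
# Assumed OSD id and pg shard -- adjust for your cluster.
OSD=21
PGID=2.36s1
DATA=/var/lib/ceph/osd/ceph-$OSD

# Stop the OSD daemon first; ceph-objectstore-tool needs
# exclusive access to the object store.
systemctl stop ceph-osd@$OSD

# Save the pg shard to a file so it can be re-imported if
# anything goes wrong.
ceph-objectstore-tool --data-path $DATA --journal-path $DATA/journal \
    --pgid $PGID --op export --file save-$PGID.export

# Remove the shard; recovery will rebuild it from the
# surviving chunks on the other OSDs.
ceph-objectstore-tool --data-path $DATA --journal-path $DATA/journal \
    --pgid $PGID --op remove

# Bring the OSD back and let recovery regenerate the shard.
systemctl start ceph-osd@$OSD
```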
Re: [ceph-users] Help with inconsistent pg on EC pool, v9.0.2
I don't know about removing the OSD from the CRUSH map. That seems like
overkill to me.

I just realized a possibly better way: it would have been to take the OSD
down, not out; remove the EC pg shard with the bad chunk; then bring the
OSD up again and let recovery repair just that single missing shard on the
single OSD, with no other disruption.

David

On 8/28/15 11:28 AM, Aaron Ten Clay wrote:
> Thanks for the tip, David. I've marked osd.21 down and out and will wait
> for recovery.
>
> I've never had success manually manipulating the OSD contents - I assume
> I can achieve the same result by removing osd.21 from the CRUSH map,
> "ceph osd rm 21", then recreating it from scratch as though I'd lost a
> disk?
>
> -Aaron

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
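The "down, not out" approach David outlines might look like the sequence below. The noout flag (already set on this cluster, per ceph health) keeps the monitors from marking the stopped OSD out and triggering a wider rebalance. The unit name and exact command forms are assumptions, not verified against 9.0.2.

```shell
# Keep the stopped OSD from being marked out (already set here).
ceph osd set noout

# Stop osd.21, remove only the bad pg shard, then bring it back up.
systemctl stop ceph-osd@21
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
    --journal-path /var/lib/ceph/osd/ceph-21/journal \
    --pgid 2.36s1 --op remove
systemctl start ceph-osd@21

# Recovery now rebuilds just that one shard; watch it complete.
ceph -w

# Once the cluster is healthy again, clear the flag.
ceph osd unset noout
```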
Re: [ceph-users] Help with inconsistent pg on EC pool, v9.0.2
Thanks for the tip, David. I've marked osd.21 down and out and will wait
for recovery.

I've never had success manually manipulating the OSD contents - I assume I
can achieve the same result by removing osd.21 from the CRUSH map, "ceph
osd rm 21", then recreating it from scratch as though I'd lost a disk?

-Aaron

On Fri, Aug 28, 2015 at 11:17 AM, David Zafman wrote:
> Without my latest branch, which hasn't merged yet, you can't repair an EC
> pg in the situation that the shard with a bad checksum is in the first k
> chunks.
>
> A way to fix it would be to take that osd down/out and let recovery
> regenerate the chunk. Remove the pg from the osd (ceph-objectstore-tool)
> and then you can bring the osd back up/in.
>
> David
Re: [ceph-users] Help with inconsistent pg on EC pool, v9.0.2
Without my latest branch, which hasn't merged yet, you can't repair an EC
pg in the situation that the shard with a bad checksum is in the first k
chunks.

A way to fix it would be to take that osd down/out and let recovery
regenerate the chunk. Remove the pg from the osd (ceph-objectstore-tool)
and then you can bring the osd back up/in.

David

On 8/28/15 11:06 AM, Samuel Just wrote:
> David, does this look familiar?
> -Sam
Re: [ceph-users] Help with inconsistent pg on EC pool, v9.0.2
David, does this look familiar?
-Sam

On Fri, Aug 28, 2015 at 10:43 AM, Aaron Ten Clay wrote:
> Hi Cephers,
>
> I'm trying to resolve an inconsistent pg on an erasure-coded pool,
> running Ceph 9.0.2. I can't seem to get Ceph to run a repair or even
> deep-scrub the pg again. Here's the background, with my attempted
> resolution steps below. Hopefully someone can steer me in the right
> direction. Thanks in advance!
[ceph-users] Help with inconsistent pg on EC pool, v9.0.2
Hi Cephers,

I'm trying to resolve an inconsistent pg on an erasure-coded pool, running
Ceph 9.0.2. I can't seem to get Ceph to run a repair or even deep-scrub the
pg again. Here's the background, with my attempted resolution steps below.
Hopefully someone can steer me in the right direction. Thanks in advance!

Current state:

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; noout flag(s) set
pg 2.36 is active+clean+inconsistent, acting [1,21,12,9,0,10,14,7,18,20,5,4,22,16]
1 scrub errors
noout flag(s) set

I started by looking at the log file for osd.1, where I found the cause of
the inconsistent report:

2015-08-24 00:43:10.391621 7f09fcff9700  0 log_channel(cluster) log [INF] : 2.36 deep-scrub starts
2015-08-24 01:54:59.933532 7f09fcff9700 -1 log_channel(cluster) log [ERR] : 2.36s0 shard 21(1): soid 576340b6/1005990.0199/head//2 candidate had a read error
2015-08-24 02:34:41.380740 7f09fcff9700 -1 log_channel(cluster) log [ERR] : 2.36s0 deep-scrub 0 missing, 1 inconsistent objects
2015-08-24 02:34:41.380757 7f09fcff9700 -1 log_channel(cluster) log [ERR] : 2.36 deep-scrub 1 errors

I checked osd.21, where this report appears:

2015-08-24 01:54:56.477020 7f707cbd4700  0 osd.21 pg_epoch: 31958 pg[2.36s1( v 31957'43013 (7132'39997,31957'43013] local-les=31951 n=34556 ec=136 les/c 31951/31954 31945/31945/31924) [1,21,12,9,0,10,14,7,18,20,5,4,22,16] r=1 lpr=31945 pi=1131-31944/7827 luod=0'0 crt=31957'43011 active] _scan_list 576340b6/1005990.0199/head//2 got incorrect hash on read

So, based upon the ceph documentation, I thought I could repair the pg by
executing "ceph pg repair 2.36". When I run this, while watching the mon
log, I see the command dispatch:

2015-08-28 10:14:17.964017 mon.0 [INF] from='client.? 10.42.5.61:0/1002181' entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "2.36"}]: dispatch

But I never see a "finish" in the mon log, like most ceph commands return.
(Not sure if I should expect to see a finish, just noting it doesn't
occur.)

Also, tailing the logs for any OSD in the acting set for pg 2.36, I never
see anything about a repair. The same case holds when I try "ceph pg 2.36
deep-scrub" - command dispatched, but none of the OSDs care. In the past,
on other clusters, I've seen "[INF] : pg.id repair starts" messages in the
OSD log after executing "ceph pg nn.yy repair".

Further confusing me, I do see osd.1 start and finish other pg
deep-scrubs, before and after executing "ceph pg 2.36 deep-scrub".

I know EC pools are special in several ways, but nothing in the Ceph
manual seems to indicate I can't deep-scrub or repair pgs in an EC pool...

Thanks for reading and any suggestions. I'm happy to provide complete log
files or more details if I've left out any information that could be
helpful.

ceph -s: http://hastebin.com/xetohugibi
ceph pg dump: http://hastebin.com/bijehoheve
ceph -v: ceph version 9.0.2 (be422c8f5b494c77ebcf0f7b95e5d728ecacb7f0)
ceph osd dump: http://hastebin.com/fitajuzeca

-Aaron
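The scrub and repair attempts described above, together with the log checks, can be collected into one sequence. This is a sketch: the log path assumes the default /var/log/ceph location, and whether these commands take effect on an EC pg is exactly the open question of this thread.

```shell
# Ask the pg's primary to deep-scrub, then repair.
ceph pg deep-scrub 2.36
ceph pg repair 2.36

# Watch the cluster log for "deep-scrub starts" / "repair starts".
ceph -w | grep 2.36

# Or tail the primary OSD's log directly (osd.1 here).
tail -f /var/log/ceph/ceph-osd.1.log | grep -E 'scrub|repair'
```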