Re: [ceph-users] Simulating Disk Failure

2013-06-17 Thread Craig Lewis

Thanks.  I'll have to get more creative.  :-)


On 6/14/13 18:19, Gregory Farnum wrote:
Yeah. You've picked up on some warty bits of Ceph's error handling 
here for sure, but it's exacerbated by the fact that you're not 
simulating what you think. In a real disk error situation the 
filesystem would be returning EIO or something, but here it's 
returning ENOENT. Since the OSD is authoritative for that key space 
and the filesystem says there is no such object, presto! It doesn't 
exist.
If you restart the OSD it does a scan of the PGs on-disk as well as 
what it should have, and can pick up on the data not being there and 
recover. But "correctly" handling data that has been (from the local 
FS' perspective) properly deleted under a running process would 
require huge and expensive contortions on the part of the daemon (in 
any distributed system that I can think of).

-Greg
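
One way to reproduce the failure mode Greg describes (the filesystem returning real EIO instead of ENOENT) is to build a throwaway test OSD on a device-mapper device and later swap its table for the dm "error" target. This is only a sketch, not something from the thread; the mapping name "osd0data" and the choice of osd.0 are illustrative.

# Assumes the test OSD's data directory sits on an existing dm device named "osd0data".
DEV=/dev/mapper/osd0data
SECTORS=$(sudo blockdev --getsz "$DEV")        # device size in 512-byte sectors

# Swap the live table for the dm "error" target: every subsequent read/write fails with EIO.
sudo dmsetup suspend osd0data
echo "0 $SECTORS error" | sudo dmsetup load osd0data
sudo dmsetup resume osd0data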


Re: [ceph-users] Simulating Disk Failure

2013-06-14 Thread Gregory Farnum
Yeah. You've picked up on some warty bits of Ceph's error handling here for
sure, but it's exacerbated by the fact that you're not simulating what you
think. In a real disk error situation the filesystem would be returning EIO
or something, but here it's returning ENOENT. Since the OSD is
authoritative for that key space and the filesystem says there is no such
object, presto! It doesn't exist.
If you restart the OSD it does a scan of the PGs on-disk as well as what it
should have, and can pick up on the data not being there and recover. But
"correctly" handling data that has been (from the local FS' perspective)
properly deleted under a running process would require huge and expensive
contortions on the part of the daemon (in any distributed system that I can
think of).
-Greg
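
A rough sketch of the recovery path described above: restart the damaged OSD so it rescans its PGs, then scrub and watch it backfill the missing objects from the surviving replica. This is an illustration rather than a tested recipe; init-script syntax varies by distro and Ceph release (the sysvinit form is shown), and the PG ids assume the pool-9 layout from the original post below.

# Restart osd.0 so it rescans its PGs (on upstart-based installs: sudo restart ceph-osd id=0).
sudo service ceph restart osd.0

# Scrub the affected PGs so the missing objects are detected, then watch recovery progress.
for i in $(seq 0 7); do ceph pg scrub 9.$i; done
ceph -w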


[ceph-users] Simulating Disk Failure

2013-06-14 Thread Craig Lewis
So I'm trying to break my test cluster, and figure out how to put it 
back together again.  I'm able to fix this, but the behavior seems 
strange to me, so I wanted to run it past more experienced people.


I'm doing these tests using RadosGW.  I currently have 2 nodes, with 
replication=2.  (I haven't gotten to the cluster expansion testing yet).
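
(For reference: replication is a per-pool setting, so the replica count can be checked, or changed if needed, pool by pool. The commands below use the .rgw.buckets pool that appears later in this post.)

ceph osd dump | grep 'rep size'
ceph osd pool set .rgw.buckets size 2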


I'm going to upload a file, then simulate a disk failure by deleting 
some PGs on one of the OSDs.  I have seen this mentioned as the way to 
fix OSDs that filled up during recovery/backfill.  I expected the 
cluster to detect the error, change the cluster health to warn, then 
return the data from another copy.  Instead, I got a 404 error.




me@client ~ $ s3cmd ls
2013-06-12 00:02  s3://bucket1

me@client ~ $ s3cmd ls s3://bucket1
2013-06-12 00:02        13   8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt


me@client ~ $ s3cmd put Object1 s3://bucket1
Object1 -> s3://bucket1/Object1  [1 of 1]
 4 of 4   100% in   62s 6.13 MB/s  done

me@client ~ $ s3cmd ls s3://bucket1
2013-06-13 01:10      381M   15bdad3e014ca5f5c9e5c706e17d65f3  s3://bucket1/Object1
2013-06-12 00:02        13   8ddd8be4b179a529afa5f2ffae4b9858  s3://bucket1/hello.txt






So at this point, the cluster is healthy, and we can download objects 
from RGW.



me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
    pgmap v4055: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 17B/s rd, 0op/s
   mdsmap e1: 0/0/1 up

me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download1
s3://bucket1/Object1 -> ./Object.Download1 [1 of 1]
 4 of 4   100% in   13s    27.63 MB/s  done






Time to simulate a failure.  Let's delete all the PGs used by 
.rgw.buckets on OSD.0.


me@dev-ceph0:~$ ceph osd tree

# id    weight   type name          up/down  reweight
-1      0.09998  root default
-2      0.04999      host dev-ceph0
0       0.04999          osd.0      up       1
-3      0.04999      host dev-ceph1
1       0.04999          osd.1      up       1


me@dev-ceph0:~$ ceph osd dump | grep .rgw.buckets
pool 9 '.rgw.buckets' rep size 2 min_size 1 crush_ruleset 0 object_hash 
rjenkins pg_num 8 pgp_num 8 last_change 21 owner 18446744073709551615
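
(A useful sanity check before deleting anything: "ceph osd map <pool> <rados-object>" reports which PG and which OSDs serve a given RADOS object. RGW stores S3 objects under internal RADOS names, so list the pool first; the object name below is just a placeholder.)

rados -p .rgw.buckets ls | head
ceph osd map .rgw.buckets <object-name-from-the-listing>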


me@dev-ceph0:~$ cd /var/lib/ceph/osd/ceph-0/current
me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ du -sh 9.*
321M    9.0_head
289M    9.1_head
425M    9.2_head
357M    9.3_head
358M    9.4_head
309M    9.5_head
401M    9.6_head
397M    9.7_head

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ sudo rm -rf 9.*




The cluster is still healthy

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
    pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
   mdsmap e1: 0/0/1 up




It probably hasn't noticed the damage yet; there's no I/O on this test cluster unless I generate it.  Let's retrieve some data; that'll make the cluster notice.


me@client ~ $ s3cmd get s3://bucket1/Object1 ./Object.Download2
s3://bucket1/Object1 -> ./Object.Download2 [1 of 1]
ERROR: S3 error: 404 (Not Found):

me@client ~ $ s3cmd ls s3://bucket1
ERROR: S3 error: 404 (NoSuchKey):



I wasn't expecting that.  I expected my object to still be accessible.  Worst case, it should be accessible 50% of the time.  Instead, it's 0% accessible.  And the cluster thinks it's still healthy:


me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_OK
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
    pgmap v4059: 248 pgs: 248 active+clean; 2852 MB data, 7941 MB used, 94406 MB / 102347 MB avail; 16071KB/s rd, 3op/s
   mdsmap e1: 0/0/1 up



Scrubbing the PGs corrects the cluster's status, but still doesn't let me download the object:


me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ for i in `seq 0 7`
>  do
>   ceph pg scrub 9.$i
> done
instructing pg 9.0 on osd.0 to scrub
instructing pg 9.1 on osd.0 to scrub
instructing pg 9.2 on osd.1 to scrub
instructing pg 9.3 on osd.0 to scrub
instructing pg 9.4 on osd.0 to scrub
instructing pg 9.5 on osd.1 to scrub
instructing pg 9.6 on osd.1 to scrub
instructing pg 9.7 on osd.0 to scrub

me@dev-ceph0:/var/lib/ceph/osd/ceph-0/current$ ceph status
   health HEALTH_ERR 3 pgs inconsistent; 284 scrub errors
   monmap e2: 2 mons at {dev-ceph0=192.168.18.24:6789/0,dev-ceph1=192.168.18.25:6789/0}, election epoch 12, quorum 0,1 dev-ceph0,dev-ceph1
   osdmap e44: 2 osds: 2 up, 2 in
pgmap v4105: 248 pgs: 2
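
For completeness: once scrub flags PGs as inconsistent, one way to let the surviving replica repopulate osd.0 is to ask each affected PG to repair, then confirm the scrub-error count clears. This is a hedged follow-up rather than anything shown in the truncated message above; the PG ids follow the pool-9 numbering used throughout.

for i in $(seq 0 7); do ceph pg repair 9.$i; done
ceph health detail
ceph -w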