Re: Is Ceph recovery able to handle massive crash

2013-01-09 Thread Denis Fondras

Hello,

On 09/01/2013 00:36, Gregory Farnum wrote:


It looks like it's taking approximately forever for writes to complete
to disk; it's shutting down because threads are going off to write and
not coming back. If you set "osd op thread timeout = 60" (or 120) it
might manage to churn through, but I'd look into why the writes are
taking so long — bad disk, fragmented btrfs filesystem, or something
else.



I believe it is a btrfs issue: when I re-run mkfs.btrfs on the volume and
rejoin it to the cluster, it works (the OSD stays up).
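
For reference, a rough sketch of that rebuild, assuming the default Debian
paths, the sysvinit init script and an example data device /dev/sdb1 (all of
these are assumptions; details vary by deployment, and the OSD's cephx
keyring may also need to be restored afterwards):

  service ceph stop osd.0
  umount /var/lib/ceph/osd/ceph-0
  mkfs.btrfs /dev/sdb1                      # wipes this OSD's local data
  mount /dev/sdb1 /var/lib/ceph/osd/ceph-0
  ceph-osd -i 0 --mkfs --mkjournal          # recreate an empty store and journal
  service ceph start osd.0                  # the OSD rejoins and backfills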


Denis


Re: Is Ceph recovery able to handle massive crash

2013-01-08 Thread Gregory Farnum
On Tue, Jan 8, 2013 at 11:44 AM, Denis Fondras  wrote:
> Hello,
>
>
>> What error message do you get when you try and turn it on? If the
>> daemon is crashing, what is the backtrace?
>
>
> The daemon is crashing. Here is the full log if you want to take a look:
> http://vps.ledeuns.net/ceph-osd.0.log.gz
>
> The RBD rebuild script helped to get the data back. I will now try to
> rebuild a Ceph cluster and do some more tests.
>
> Denis

It looks like it's taking approximately forever for writes to complete
to disk; it's shutting down because threads are going off to write and
not coming back. If you set "osd op thread timeout = 60" (or 120) it
might manage to churn through, but I'd look into why the writes are
taking so long — bad disk, fragmented btrfs filesystem, or something
else.
-Greg
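
For reference, a minimal sketch of applying that setting, assuming the stock
/etc/ceph/ceph.conf and the Debian sysvinit init script (adjust the path and
restart command for your deployment):

  # /etc/ceph/ceph.conf -- add under the [osd] section:
  [osd]
      osd op thread timeout = 60    # or 120, as suggested above

  # then restart the affected OSD, e.g.:
  service ceph restart osd.0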


Re: Is Ceph recovery able to handle massive crash

2013-01-08 Thread Denis Fondras

Hello,


What error message do you get when you try and turn it on? If the
daemon is crashing, what is the backtrace?


The daemon is crashing. Here is the full log if you want to take a look:
http://vps.ledeuns.net/ceph-osd.0.log.gz


The RBD rebuild script helped to get the data back. I will now try to 
rebuild a Ceph cluster and do some more tests.


Denis


Re: Is Ceph recovery able to handle massive crash

2013-01-08 Thread Gregory Farnum
On Tue, Jan 8, 2013 at 12:44 AM, Denis Fondras  wrote:
>> What's wrong with your primary OSD?
>
>
> I don't know what's really wrong. The disk seems fine.

What error message do you get when you try and turn it on? If the
daemon is crashing, what is the backtrace?
-Greg


Re: Is Ceph recovery able to handle massive crash

2013-01-08 Thread Denis Fondras

On 08/01/2013 14:51, Moore, Shawn M wrote:

If you know the prefix (which it seems you do) and the original size of the rbd,
you should be able to use my utility.

https://github.com/smmoore/ceph/blob/master/rbd_restore.sh

You will need all the rados files in the current working directory you execute
the script from.  We have used it many times so far and it works for us.  I have
not had any outside feedback on its usage.  But if you are truly missing any
files, it will seek over them and your rbd might be corrupt.  Likewise, if a
file itself is damaged, it will write what is in that file to the rebuild.



Thank you very much Shawn, it's your script that gave me the idea of
rebuilding the RBD from its files ;-)


I coded my own script, which finds the RBD files itself.

Denis



RE: Is Ceph recovery able to handle massive crash

2013-01-08 Thread Moore, Shawn M
If you know the prefix (which it seems you do) and the original size of the rbd,
you should be able to use my utility.

https://github.com/smmoore/ceph/blob/master/rbd_restore.sh

You will need all the rados files in the current working directory you execute
the script from.  We have used it many times so far and it works for us.  I have
not had any outside feedback on its usage.  But if you are truly missing any
files, it will seek over them and your rbd might be corrupt.  Likewise, if a
file itself is damaged, it will write what is in that file to the rebuild.

HTH,
Shawn
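
For reference, a minimal sketch of gathering those rados files first, assuming
the default filestore layout under /var/lib/ceph/osd/ceph-*/current and the
image prefix mentioned elsewhere in this thread (rb.0.8e10.3e2219d7); both are
examples and will differ on other clusters:

  # Collect every object file belonging to the image into the current
  # directory (filestore file names begin with the rbd block prefix):
  find /var/lib/ceph/osd/ceph-0/current -type f \
      -name 'rb.0.8e10.3e2219d7.*' -exec cp -n {} . \;
  # Repeat for the other OSDs holding replicas, then run the restore
  # script from this directory (see the script itself for its arguments).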


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Denis Fondras
Sent: Tuesday, January 08, 2013 7:57 AM
To: ceph-devel@vger.kernel.org
Subject: Re: Is Ceph recovery able to handle massive crash

Hello,

I'm wondering if I can get every "rb.0.8e10.3e2219d7.*" from the OSD 
drive and cat them together and get back a usable raw volume from which 
I could get back my data?

Everything seems to be there but I don't know the order of the rbd 
objects. Are the last bytes of the file name the offset of the block?

Regards,
Denis



Re: Is Ceph recovery able to handle massive crash

2013-01-08 Thread Wido den Hollander

On 01/08/2013 02:10 PM, Wido den Hollander wrote:

On 01/08/2013 01:57 PM, Denis Fondras wrote:

Hello,

I'm wondering if I can get every "rb.0.8e10.3e2219d7.*" from the OSD
drive and cat them together and get back a usable raw volume from which
I could get back my data?



Yes, that is doable. The only problem is that RBD is sparse, so you'd
have to fill up the empty spaces with 4MB of zeroes.

But yes, it's doable if you gather all the objects and fill the rest up
with zeroes.


Everything seems to be there but I don't know the order of the rbd
objects. Are the last bytes of the file name the offset of the block?



There was a quick perl command for this to generate all the suffixes,
but I can't seem to find it right now.



You could do something like this to generate all the block names you should
need; the non-existing ones you should fill with nothing, i.e. 4MB of zeroes.


perl -e 'while ($s < (SIZE_IN_MB / 4)) { printf "BLOCK_PREFIX.%012x\n", $s; $s++ }'


SIZE_IN_MB is the size of the block device in MB and BLOCK_PREFIX can be
something like "rb.0.1016.238e1f29".


Wido


Wido


Regards,
Denis






Re: Is Ceph recovery able to handle massive crash

2013-01-08 Thread Wido den Hollander

On 01/08/2013 01:57 PM, Denis Fondras wrote:

Hello,

I'm wondering if I can get every "rb.0.8e10.3e2219d7.*" from the OSD
drive and cat them together and get back a usable raw volume from which
I could get back my data?



Yes, that is doable. The only problem is that RBD is sparse, so you'd
have to fill up the empty spaces with 4MB of zeroes.


But yes, it's doable if you gather all the objects and fill the rest up
with zeroes.



Everything seems to be there but I don't know the order of the rbd
objects. Are the last bytes of the file name the offset of the block?



There was a quick perl command for this to generate all the suffixes, 
but I can't seem to find it right now.


Wido


Regards,
Denis




Re: Is Ceph recovery able to handle massive crash

2013-01-08 Thread Denis Fondras

Hello,

I'm wondering if I can get every "rb.0.8e10.3e2219d7.*" from the OSD 
drive and cat them together and get back a usable raw volume from which 
I could get back my data?


Everything seems to be there but I don't know the order of the rbd 
objects. Are the last bytes of the file name the offset of the block?


Regards,
Denis


Re: Is Ceph recovery able to handle massive crash

2013-01-08 Thread Denis Fondras

Hello,

I tried upgrading to 0.56.1 this morning, as it might help with
recovery. No luck so far...



What's wrong with your primary OSD?


I don't know what's really wrong. The disk seems fine.


In general they shouldn't really be crashing that frequently and if you've got 
a new bug we'd like to diagnose and fix it.


I don't know if it is hardware related (it seems not, as I tested each
part). So it might be an issue with btrfs (Linux 3.5), Ceph or another
software component.
However, I'm willing to help resolve this issue. Just tell me what you
need and what I can do.



If that can't be done (or it's a hardware failure or something), you can mark 
the OSD lost, but that might lose data and then you will be sad.


Well, if I must take a loss, I'd really like to try everything else first :)

Denis


Re: Is Ceph recovery able to handle massive crash

2013-01-07 Thread Gregory Farnum
On Monday, January 7, 2013 at 9:25 AM, Denis Fondras wrote:
> Hello all,
> 
> > I'm using Ceph 0.55.1 on Debian Wheezy (1 mon, 1 mds and 3 OSDs over
> > btrfs) and every once in a while an OSD process crashes (almost never
> > the same OSD).
> > This time I had 2 OSDs crash in a row, so I only had one replica left. I
> > could bring the 2 crashed OSDs back up and they started to recover.
> > Unfortunately, the "source" OSD crashed while recovering and now I have
> > some lost PGs.
> > 
> > If I happen to bring the primary OSD up again, can I expect the lost PGs
> > to be recovered too?
> 
> 
> 
> Ok, so it seems I can't bring my primary OSD back to life :-(
> 
> ---8<---
> health HEALTH_WARN 72 pgs incomplete; 72 pgs stuck inactive; 72 pgs 
> stuck unclean
> monmap e1: 1 mons at {a=192.168.0.132:6789/0}, election epoch 1, quorum 0 a
> osdmap e1130: 3 osds: 2 up, 2 in
> pgmap v1567492: 624 pgs: 552 active+clean, 72 incomplete; 1633 GB 
> data, 4766 GB used, 3297 GB / 8383 GB avail
> mdsmap e127: 1/1/1 up {0=a=up:active}
> 
> 2013-01-07 18:11:10.852673 mon.0 [INF] pgmap v1567492: 624 pgs: 552 
> active+clean, 72 incomplete; 1633 GB data, 4766 GB used, 3297 GB / 8383 
> GB avail
> ---8<---
> 
> When I "rbd list", I can see all my images.
> When I do "rbd map", I can map only a few of them and when I mount the 
> devices, none can mount (the mount process hangs and I cannot even ^C 
> the process).
> 
> Is there something I can try?

What's wrong with your primary OSD? In general they shouldn't really be 
crashing that frequently and if you've got a new bug we'd like to diagnose and 
fix it.

If that can't be done (or it's a hardware failure or something), you can mark 
the OSD lost, but that might lose data and then you will be sad.
-Greg
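
For reference, marking an OSD lost is a single (destructive) monitor command;
a sketch, with osd.0 as a placeholder id:

  # Last resort only: this tells the cluster that osd.0's data is gone for
  # good, letting incomplete PGs proceed without it (and may lose data).
  ceph osd lost 0 --yes-i-really-mean-it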



Re: Is Ceph recovery able to handle massive crash

2013-01-07 Thread Denis Fondras

Hello all,


I'm using Ceph 0.55.1 on Debian Wheezy (1 mon, 1 mds and 3 OSDs over
btrfs) and every once in a while an OSD process crashes (almost never
the same OSD).
This time I had 2 OSDs crash in a row, so I only had one replica left. I
could bring the 2 crashed OSDs back up and they started to recover.
Unfortunately, the "source" OSD crashed while recovering and now I have
some lost PGs.

If I happen to bring the primary OSD up again, can I expect the lost PGs
to be recovered too?



Ok, so it seems I can't bring my primary OSD back to life :-(

---8<---
health HEALTH_WARN 72 pgs incomplete; 72 pgs stuck inactive; 72 pgs 
stuck unclean

monmap e1: 1 mons at {a=192.168.0.132:6789/0}, election epoch 1, quorum 0 a
osdmap e1130: 3 osds: 2 up, 2 in
 pgmap v1567492: 624 pgs: 552 active+clean, 72 incomplete; 1633 GB 
data, 4766 GB used, 3297 GB / 8383 GB avail

 mdsmap e127: 1/1/1 up {0=a=up:active}

2013-01-07 18:11:10.852673 mon.0 [INF] pgmap v1567492: 624 pgs: 552 
active+clean, 72 incomplete; 1633 GB data, 4766 GB used, 3297 GB / 8383 
GB avail

---8<---

When I "rbd list", I can see all my images.
When I do "rbd map", I can map only a few of them and when I mount the 
devices, none can mount (the mount process hangs and I cannot even ^C 
the process).


Is there something I can try?

Thank you in advance,
Denis


Re: Is Ceph recovery able to handle massive crash

2013-01-05 Thread Gregory Farnum
On Saturday, January 5, 2013 at 4:19 AM, Denis Fondras wrote:
> Hello all,
> 
> I'm using Ceph 0.55.1 on Debian Wheezy (1 mon, 1 mds and 3 OSDs over
> btrfs) and every once in a while an OSD process crashes (almost never
> the same OSD).
> This time I had 2 OSDs crash in a row, so I only had one replica left. I
> could bring the 2 crashed OSDs back up and they started to recover.
> Unfortunately, the "source" OSD crashed while recovering and now I have
> some lost PGs.
> 
> If I happen to bring the primary OSD up again, can I expect the lost PGs
> to be recovered too?


Yes, it will recover just fine. Ceph is strictly consistent and so you won't 
lose any data unless you lose the disks.
-Greg



Is Ceph recovery able to handle massive crash

2013-01-05 Thread Denis Fondras

Hello all,

I'm using Ceph 0.55.1 on Debian Wheezy (1 mon, 1 mds and 3 OSDs over
btrfs) and every once in a while an OSD process crashes (almost never
the same OSD). This time I had 2 OSDs crash in a row, so I only had one
replica left. I could bring the 2 crashed OSDs back up and they started
to recover. Unfortunately, the "source" OSD crashed while recovering and
now I have some lost PGs.


If I happen to bring the primary OSD up again, can I expect the lost PGs
to be recovered too?


Regards,
Denis