domino-style OSD crash

2012-06-04 Thread Yann Dupont
Hello, besides the performance inconsistency (see other thread titled poor OSD performance using kernel 3.4) where I promised some tests (will run this afternoon), we tried this weekend to stress test ceph, making backups with bacula on an rbd volume of 15T (8 osd nodes, using 8 physical machin

Re: domino-style OSD crash

2012-06-04 Thread Tommi Virtanen
On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont wrote: > Results: Worked like a charm for two days, apart from btrfs warning messages > then the OSDs began to crash one after another, 'domino style'. Sorry to hear that. Reading through your message, there seem to be several problems; whether they are because of the

Re: domino-style OSD crash

2012-06-04 Thread Sam Just
Can you send the osd logs? The merge_log crashes are probably fixable if I can see the logs. The leveldb crash is almost certainly a result of memory corruption. Thanks -Sam On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen wrote: > On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont > wrote: >> Result

Re: domino-style OSD crash

2012-06-04 Thread Greg Farnum
This is probably the same/similar to http://tracker.newdream.net/issues/2462, no? There's a log there, though I've no idea how helpful it is. On Monday, June 4, 2012 at 10:40 AM, Sam Just wrote: > Can you send the osd logs? The merge_log crashes are probably fixable > if I can see the logs. >

Re: domino-style OSD crash

2012-07-03 Thread Yann Dupont
On 04/06/2012 19:40, Sam Just wrote: Can you send the osd logs? The merge_log crashes are probably fixable if I can see the logs. Well, I'm sorry. As I said in private mail, I was away from my computer for a long time. I can't send those logs anymore; they have been rotated now... Anyway. Now tha

Re: domino-style OSD crash

2012-07-03 Thread Tommi Virtanen
On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont wrote: > Upgraded the kernel to 3.5.0-rc4 + some patches, seems btrfs is OK right > now. > > Tried to restart osd with 0.47.3, then next branch, and today with 0.48. > > 4 of 8 nodes fail with the same message: > > ceph version 0.48argonaut (commit:c2b

Re: domino-style OSD crash

2012-07-03 Thread Yann Dupont
On 03/07/2012 21:42, Tommi Virtanen wrote: On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont wrote: Upgraded the kernel to 3.5.0-rc4 + some patches, seems btrfs is OK right now. Tried to restart osd with 0.47.3, then next branch, and today with 0.48. 4 of 8 nodes fail with the same message: c

Re: domino-style OSD crash

2012-07-03 Thread Tommi Virtanen
On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont wrote: > In case I can repair it, do you think a crashed FS as it is right now is > valuable for you, for future reference, as I saw you can't reproduce the > problem? I can make an archive (or a btrfs dump?), but it will be quite > big. At this p

Re: domino-style OSD crash

2012-07-04 Thread Yann Dupont
On 03/07/2012 23:38, Tommi Virtanen wrote: On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont wrote: In case I can repair it, do you think a crashed FS as it is right now is valuable for you, for future reference, as I saw you can't reproduce the problem? I can make an archive (or a btrfs dump

Re: domino-style OSD crash

2012-07-04 Thread Gregory Farnum
On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote: > On 03/07/2012 23:38, Tommi Virtanen wrote: > > On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont wrote: > > > In case I can repair it, do you think a crashed FS as it is right now > > > is > > > val

Re: domino-style OSD crash

2012-07-04 Thread Yann Dupont
On 04/07/2012 18:21, Gregory Farnum wrote: On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote: On 03/07/2012 23:38, Tommi Virtanen wrote: On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont wrote: In case I can repair it, do you think a crashed FS as

Re: domino-style OSD crash

2012-07-05 Thread Gregory Farnum
On Wed, Jul 4, 2012 at 10:53 AM, Yann Dupont wrote: > On 04/07/2012 18:21, Gregory Farnum wrote: > >> On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote: >>> >>> On 03/07/2012 23:38, Tommi Virtanen wrote: On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont >>> (mailto:yann.dup...@uni

Re: domino-style OSD crash

2012-07-06 Thread Yann Dupont
On 05/07/2012 23:32, Gregory Farnum wrote: [...] ok, so as all nodes were identical, I probably have hit a btrfs bug (like an erroneous out of space) at more or less the same time. And when 1 osd was out, Oh, I didn't finish the sentence... When 1 osd was out, missing data was copied on a

Re: domino-style OSD crash

2012-07-06 Thread Gregory Farnum
On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont wrote: > On 05/07/2012 23:32, Gregory Farnum wrote: > > [...] > >>> ok, so as all nodes were identical, I probably have hit a btrfs bug (like >>> an >>> erroneous out of space) at more or less the same time. And when 1 osd >>> was >>> out, > > > Oh,

Re: domino-style OSD crash

2012-07-07 Thread Yann Dupont
On 06/07/2012 19:01, Gregory Farnum wrote: On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont wrote: On 05/07/2012 23:32, Gregory Farnum wrote: [...] ok, so as all nodes were identical, I probably have hit a btrfs bug (like an erroneous out of space) at more or less the same time. And when 1

Re: domino-style OSD crash

2012-07-09 Thread Samuel Just
Can you restart the node that failed to complete the upgrade with debug filestore = 20 and debug osd = 20, and post the log after an hour or so of running? The upgrade process might legitimately take a while. -Sam On Sat, Jul 7, 2012 at 1:19 AM, Yann Dupont wrote: > On 06/07/2012 19:01, Gregory Far

Re: domino-style OSD crash

2012-07-09 Thread Tommi Virtanen
On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont wrote: > Well, I probably wasn't clear enough. I talked about crashed FS, but I was > talking about ceph. The underlying FS (btrfs in that case) of 1 node (and > only one) has PROBABLY crashed in the past, causing corruption in ceph data > on this node,

Re: domino-style OSD crash

2012-07-09 Thread Yann Dupont
On 09/07/2012 19:43, Tommi Virtanen wrote: On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont wrote: Well, I probably wasn't clear enough. I talked about crashed FS, but I was talking about ceph. The underlying FS (btrfs in that case) of 1 node (and only one) has PROBABLY crashed in the past, causi

Re: domino-style OSD crash

2012-07-09 Thread Tommi Virtanen
On Mon, Jul 9, 2012 at 12:05 PM, Yann Dupont wrote: >> The information here isn't enough to say whether the cause of the >> corruption is btrfs or LevelDB, but the recovery needs to be handled by >> LevelDB -- and upstream is working on making it more robust: >> http://code.google.com/p/leveldb/issue

Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont
On 09/07/2012 19:14, Samuel Just wrote: Can you restart the node that failed to complete the upgrade with Well, it's a little bit complicated; I now run those nodes with XFS, and I have long-running jobs on them right now, so I can't stop the ceph cluster at the moment. As I've kept the o

Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont wrote: > As I've kept the original broken btrfs volumes, I tried this morning to > run the old osd in parallel, using the $cluster variable. I only had > partial success. The cluster mechanism was never intended for moving existing osds to other cl

Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont
On 10/07/2012 17:56, Tommi Virtanen wrote: On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont wrote: As I've kept the original broken btrfs volumes, I tried this morning to run the old osd in parallel, using the $cluster variable. I only had partial success. The cluster mechanism was never in

Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont wrote: >> The cluster mechanism was never intended for moving existing osds to >> other clusters. Trying that might not be a good idea. > Ok, good to know. I saw that the remaining maps could lead to problems, but > in 2 words, what are the other associa

Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont
On 10/07/2012 19:11, Tommi Virtanen wrote: On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont wrote: The cluster mechanism was never intended for moving existing osds to other clusters. Trying that might not be a good idea. Ok, good to know. I saw that the remaining maps could lead to problems, bu

Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 10:36 AM, Yann Dupont wrote: >> Fundamentally, it comes down to this: the two clusters will still have >> the same fsid, and you won't be isolated from configuration errors or > (CEPH-PROD is the old btrfs volume). /CEPH is the new xfs volume, completely > redone & reformatted