Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont

On 09/07/2012 19:14, Samuel Just wrote:

Can you restart the node that failed to complete the upgrade with


Well, it's a little bit complicated; I now run those nodes with XFS, 
and I have long-running jobs on them right now, so I can't stop the ceph 
cluster at the moment.


As I've kept the original broken btrfs volumes, I tried this morning 
to run the old osds in parallel, using the $cluster variable. I only 
have partial success.
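
For clarity, what the $cluster mechanism amounts to here is a second config 
file named after the cluster, with the daemons started against it. A rough 
sketch, where the cluster name and paths are only examples from my setup, and 
assuming the daemons accept --cluster (otherwise -c pointing at the same file 
amounts to the same thing):

  # /etc/ceph/old.conf describes the broken btrfs cluster, ceph.conf the live xfs one
  cp /etc/ceph/ceph.conf /etc/ceph/old.conf   # then edit data paths and ports in old.conf
  ceph-osd -i 1 --cluster old                 # or: ceph-osd -i 1 -c /etc/ceph/old.conf
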
I tried using different ports for the mons, but ceph wants to use the old 
mon map. I can edit it (epoch 1), but it seems to use 'latest' instead, 
whose format isn't compatible with monmaptool, and I don't know how to 
inject the modified map into a non-running cluster.
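
The part I was missing is presumably something like ceph-mon's 
--inject-monmap option. A hedged sketch, untested here, with the mon id and 
file names just as examples:

  monmaptool --print /tmp/monmap.edited       # sanity-check the edited map first
  # with the old cluster's mon stopped, push the map into its store:
  ceph-mon -i chichibu --inject-monmap /tmp/monmap.edited -c /etc/ceph/old.conf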


Anyway, osd seems to start fine, and I can reproduce the bug :

debug filestore = 20
debug osd = 20



I've put it in [global]; is that sufficient?
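
For reference, the whole change I made looks like this; as far as I 
understand, [global] applies it to every daemon, and an [osd] section would 
scope it to the OSDs only (a sketch of my conf, not a recommendation):

  [global]
          debug filestore = 20
          debug osd = 20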



and post the log after an hour or so of running?  The upgrade process
might legitimately take a while.
-Sam
Only 15 minutes running, but ceph-osd is consuming lots of CPU, and an 
strace shows lots of pread calls.
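
Roughly what I looked at (the pid is just whatever ps reports for the 
ceph-osd in question):

  strace -c -f -p <pid-of-ceph-osd>    # per-syscall summary; dominated by pread here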


Here is the log :

[..]
2012-07-10 11:33:29.560052 7f3e615ac780  0 filestore(/CEPH-PROD/data/osd.1) mount syncfs(2) syscall not support by glibc
2012-07-10 11:33:29.560062 7f3e615ac780  0 filestore(/CEPH-PROD/data/osd.1) mount no syncfs(2), but the btrfs SYNC ioctl will suffice
2012-07-10 11:33:29.560172 7f3e615ac780 -1 filestore(/CEPH-PROD/data/osd.1) FileStore::mount : stale version stamp detected: 2. Proceeding, do_update is set, performing disk format upgrade.
2012-07-10 11:33:29.560233 7f3e615ac780  0 filestore(/CEPH-PROD/data/osd.1) mount found snaps 3744666,3746725
2012-07-10 11:33:29.560263 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1)  current/ seq was 3746725
2012-07-10 11:33:29.560267 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1)  most recent snap from 3744666,3746725 is 3746725
2012-07-10 11:33:29.560280 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1) mount rolling back to consistent snap 3746725
2012-07-10 11:33:29.839281 7f3e615ac780  5 filestore(/CEPH-PROD/data/osd.1) mount op_seq is 3746725



... and nothing more.

I'll let it run for 3 hours. If I get another message, I'll let 
you know.


Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 As I've kept the original broken btrfs volumes, I tried this morning to
 run the old osds in parallel, using the $cluster variable. I only have
 partial success.

The cluster mechanism was never intended for moving existing osds to
other clusters. Trying that might not be a good idea.


Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont

On 10/07/2012 17:56, Tommi Virtanen wrote:

On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

As I've kept the original broken btrfs volumes, I tried this morning to
run the old osds in parallel, using the $cluster variable. I only have
partial success.

The cluster mechanism was never intended for moving existing osds to
other clusters. Trying that might not be a good idea.
Ok, good to know. I saw that the remaining maps could lead to problems, 
but in a few words, what are the other associated risks? Basically, if I use 
2 distinct config files, with different, non-overlapping paths, and different 
ports for OSD, MDS & MON, do we basically have 2 distinct and independent 
instances?


By the way, is using 2 mon instances with different ports supported?

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 The cluster mechanism was never intended for moving existing osds to
 other clusters. Trying that might not be a good idea.
 Ok, good to know. I saw that the remaining maps could lead to problems, but
 in a few words, what are the other associated risks? Basically, if I use 2
 distinct config files, with different, non-overlapping paths, and different
 ports for OSD, MDS & MON, do we basically have 2 distinct and independent
 instances?

Fundamentally, it comes down to this: the two clusters will still have
the same fsid, and you won't be isolated from configuration errors or
leftover state (such as the monmap) in any way. There's a high chance
that your 'let's poke around and debug' cluster wrecks your healthy
cluster.

 By the way, is using 2 mon instances with different ports supported?

Monitors are identified by ip:port. You can have multiple monitors bind to
the same IP address, as long as they get separate ports.
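
In config terms that would look something like the following in the second 
cluster's conf (a sketch only; the host name and address are taken from the 
ceph -s output earlier in the thread, and the port just has to differ from 
the live cluster's 6789):

  [mon.chichibu]
          host = chichibu
          mon addr = 172.20.14.130:6790    ; live cluster's mon stays on :6789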

Naturally, this practically means giving up on high availability.


Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont

On 10/07/2012 19:11, Tommi Virtanen wrote:

On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

The cluster mechanism was never intended for moving existing osds to
other clusters. Trying that might not be a good idea.

Ok, good to know. I saw that the remaining maps could lead to problems, but
in a few words, what are the other associated risks? Basically, if I use 2
distinct config files, with different, non-overlapping paths, and different
ports for OSD, MDS & MON, do we basically have 2 distinct and independent
instances?

Fundamentally, it comes down to this: the two clusters will still have
the same fsid, and you won't be isolated from configuration errors or


Ah, I understand. This is not the case; see:

root@chichibu:~# cat /CEPH/data/osd.0/fsid
f00139fe-478e-4c50-80e2-f7cb359100d4
root@chichibu:~# cat /CEPH-PROD/data/osd.0/fsid
43afd025-330e-4aa8-9324-3e9b0afce794

(CEPH-PROD is the old btrfs volume). /CEPH is the new xfs volume, 
completely redone & reformatted with mkcephfs. The volumes are totally 
independent:


If you want the gory details:

root@chichibu:~# lvs
  LV         VG             Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  ceph-osd   LocalDisk      -wi-a- 225,00g
  mon-btrfs  LocalDisk      -wi-ao  10,00g
  mon-xfs    LocalDisk      -wi-ao  10,00g
  data       ceph-chichibu  -wi-ao   5,00t          - OLD btrfs, mounted on /CEPH-PROD
  datax      ceph-chichibu  -wi-ao   4,50t          - NEW xfs, mounted on /CEPH



leftover state (such as the monmap) in any way. There's a high chance
that your 'let's poke around and debug' cluster wrecks your healthy
cluster.


Yes I understand the risk.


By the way, is using 2 mon instances with different ports supported?

Monitors are identified by ip:port. You can have multiple monitors bind to
the same IP address, as long as they get separate ports.

Naturally, this practically means giving up on high availability.


The idea is not just having 2 mons. I'll still use 3 different machines 
for the mons, but with 2 mon instances on each: one for the current ceph, the 
other for the old ceph.

2x3 mons.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 10:36 AM, Yann Dupont
yann.dup...@univ-nantes.fr wrote:
 Fundamentally, it comes down to this: the two clusters will still have
 the same fsid, and you won't be isolated from configuration errors or
 (CEPH-PROD is the old btrfs volume). /CEPH is the new xfs volume, completely
 redone & reformatted with mkcephfs. The volumes are totally independent:

Ahh, you re-created the monitors too. That changes things: then you
have a new random fsid. I understood you had only re-mkfsed the osds.

Doing it like that, your real worry is just the remembered state of
monmaps, osdmaps etc. If the daemons accidentally talk to the wrong
cluster, the fsid *should* protect you from damage; they should get
rejected. Similarly, if you use cephx authentication, the keys won't
match either.
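
A small sketch of what that separation looks like in each cluster's conf 
(assuming cephx is enabled on both sides; the keyring path is just an example 
using the $cluster variable):

  [global]
          auth supported = cephx
          keyring = /etc/ceph/$cluster.keyring   ; each cluster only reads its own keys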

 Naturally, this practically means giving up on high availability.
 The idea is not just having 2 mons. I'll still use 3 different machines for
 the mons, but with 2 mon instances on each: one for the current ceph, the other
 for the old ceph.
 2x3 mons.

That should be perfectly doable.


Re: domino-style OSD crash

2012-07-09 Thread Samuel Just
Can you restart the node that failed to complete the upgrade with

debug filestore = 20
debug osd = 20

and post the log after an hour or so of running?  The upgrade process
might legitimately take a while.
-Sam

On Sat, Jul 7, 2012 at 1:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 On 06/07/2012 19:01, Gregory Farnum wrote:

 On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont yann.dup...@univ-nantes.fr
 wrote:

 On 05/07/2012 23:32, Gregory Farnum wrote:

 [...]

 ok, so as all nodes were identical, I probably hit a btrfs bug (like an
 erroneous out-of-space) at more or less the same time. And when 1 osd was
 out,


 OH, I didn't finish the sentence... When 1 osd was out, missing data was
 copied onto other nodes, probably accelerating the btrfs problem on those
 nodes (I suspect erroneous out-of-space conditions).

 Ah. How full are/were the disks?


 The OSD nodes were below 50% (all are 5 TB volumes):

 osd.0 : 31%
 osd.1 : 31%
 osd.2 : 39%
 osd.3 : 65%
 no osd.4 :)
 osd.5 : 35%
 osd.6 : 60%
 osd.7 : 42%
 osd.8 : 34%

 all the volumes were using btrfs with lzo compress.

 [...]


 Oh, interesting. Are the broken nodes all on the same set of arrays?


 No. There are 4 completely independent raid arrays, in 4 different
 locations. They are similar (same brand & model, but slightly different
 disks, and 1 different firmware), and all arrays are multipathed. I don't
 think the raid arrays are the problem. We have used those particular models
 for 2-3 years, and in the logs I don't see any problem that could be caused
 by the storage itself (like scsi or multipath errors).

 I must have misunderstood then. What did you mean by 1 Array for 2 OSD
 nodes?


 I have 8 osd nodes, in 4 different locations (several km away). In each
 location I have 2 nodes and 1 raid array.
 In each location, the raid array has 16 2 TB disks and 2 controllers with 4x 8
 Gb FC channels each. The 16 disks are organized in RAID 5 (8 disks for one
 set, 7 for the other). Each raid set is primarily attached to 1 controller,
 and each osd node in that location has access to the controller over 2
 distinct paths.

 There was no correlation between failed nodes & raid arrays.


 Cheers,

 --
 Yann Dupont - Service IRTS, DSI Université de Nantes
 Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-09 Thread Tommi Virtanen
On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 Well, I probably wasn't clear enough. I talked about a crashed FS, but I was
 talking about ceph. The underlying FS (btrfs in that case) of 1 node (and
 only one) has PROBABLY crashed in the past, causing corruption in the ceph
 data on this node, and then the subsequent crash of other nodes.

 RIGHT now btrfs on this node is OK. I can access the filesystem without
 errors.

But the LevelDB isn't. Its contents got corrupted, somehow somewhere,
and it really is up to the LevelDB library to tolerate those errors;
we have a simple get/put interface we use, and LevelDB is triggering
an internal error.

 One node had a problem with btrfs, leading first to kernel problems, probably
 corruption (on disk / in memory maybe?), and ultimately to a kernel oops.
 Before that final kernel oops, bad data was transmitted to other
 (sane) nodes, leading to ceph-osd crashes on those nodes.

The LevelDB binary contents are not transferred over to other nodes;
this kind of corruption would not spread over the Ceph clustering
mechanisms. It's more likely that you have 4 independently corrupted
LevelDBs. Something in the workload Ceph runs makes that corruption
quite likely.

The information here isn't enough to say whether the cause of the
corruption is btrfs or LevelDB, but the recovery needs to be handled by
LevelDB -- and upstream is working on making it more robust:
http://code.google.com/p/leveldb/issues/detail?id=97


Re: domino-style OSD crash

2012-07-09 Thread Yann Dupont

On 09/07/2012 19:43, Tommi Virtanen wrote:

On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

Well, I probably wasn't clear enough. I talked about a crashed FS, but I was
talking about ceph. The underlying FS (btrfs in that case) of 1 node (and
only one) has PROBABLY crashed in the past, causing corruption in the ceph
data on this node, and then the subsequent crash of other nodes.

RIGHT now btrfs on this node is OK. I can access the filesystem without
errors.

But the LevelDB isn't. Its contents got corrupted, somehow somewhere,
and it really is up to the LevelDB library to tolerate those errors;
we have a simple get/put interface we use, and LevelDB is triggering
an internal error.

Yes, understood.


One node had a problem with btrfs, leading first to kernel problems, probably
corruption (on disk / in memory maybe?), and ultimately to a kernel oops.
Before that final kernel oops, bad data was transmitted to other
(sane) nodes, leading to ceph-osd crashes on those nodes.

The LevelDB binary contents are not transferred over to other nodes;

Ok, thanks for the clarification.

this kind of corruption would not spread over the Ceph clustering
mechanisms. It's more likely that you have 4 independently corrupted
LevelDBs. Something in the workload Ceph runs makes that corruption
quite likely.
Very likely: since I reformatted my nodes with XFS I haven't had 
problems so far.


The information here isn't enough to say whether the cause of the
corruption is btrfs or LevelDB, but the recovery needs to be handled by
LevelDB -- and upstream is working on making it more robust:
http://code.google.com/p/leveldb/issues/detail?id=97
Yes, I saw this. It's very important. Sometimes, s... happens. Given 
the size ceph volumes can reach, having a tool to restart damaged 
nodes (for whatever reason) is a must.


Thanks for the time you took to answer. It's much clearer for me now.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-09 Thread Tommi Virtanen
On Mon, Jul 9, 2012 at 12:05 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 The information here isn't enough to say whether the cause of the
 corruption is btrfs or LevelDB, but the recovery needs to be handled by
 LevelDB -- and upstream is working on making it more robust:
 http://code.google.com/p/leveldb/issues/detail?id=97

 Yes, I saw this. It's very important. Sometimes, s... happens. Given the size
 ceph volumes can reach, having a tool to restart damaged nodes (for
 whatever reason) is a must.

 Thanks for the time you took to answer. It's much clearer for me now.

If it doesn't recover, you re-format the disk and thereby throw away
the contents. Not really all that different from handling hardware
failure. That's why we have replication.
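
For what it's worth, the throw-away-and-rebuild path for a single osd is 
roughly the following (a hedged sketch; the id, device and mount point are 
examples, and the exact commands may differ by version):

  ceph osd out 1                          # let the cluster re-replicate away from it
  /etc/init.d/ceph stop osd.1
  mkfs.xfs -f /dev/vg/osd1 && mount /dev/vg/osd1 /CEPH/data/osd.1
  ceph-osd -i 1 --mkfs --mkkey            # fresh, empty object store
  ceph auth add osd.1 osd 'allow *' mon 'allow rwx' -i /CEPH/data/osd.1/keyring
  /etc/init.d/ceph start osd.1 && ceph osd in 1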


Re: domino-style OSD crash

2012-07-07 Thread Yann Dupont

On 06/07/2012 19:01, Gregory Farnum wrote:

On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

On 05/07/2012 23:32, Gregory Farnum wrote:

[...]


ok, so as all nodes were identical, I probably hit a btrfs bug (like an
erroneous out-of-space) at more or less the same time. And when 1 osd was
out,


OH, I didn't finish the sentence... When 1 osd was out, missing data was
copied onto other nodes, probably accelerating the btrfs problem on those
nodes (I suspect erroneous out-of-space conditions).

Ah. How full are/were the disks?


The OSD nodes were below 50% (all are 5 TB volumes):

osd.0 : 31%
osd.1 : 31%
osd.2 : 39%
osd.3 : 65%
no osd.4 :)
osd.5 : 35%
osd.6 : 60%
osd.7 : 42%
osd.8 : 34%

all the volumes were using btrfs with lzo compress.

[...]


Oh, interesting. Are the broken nodes all on the same set of arrays?


No. There are 4 completely independent raid arrays, in 4 different
locations. They are similar (same brand & model, but slightly different
disks, and 1 different firmware), and all arrays are multipathed. I don't
think the raid arrays are the problem. We have used those particular models
for 2-3 years, and in the logs I don't see any problem that could be caused
by the storage itself (like scsi or multipath errors).

I must have misunderstood then. What did you mean by 1 Array for 2 OSD nodes?


I have 8 osd nodes, in 4 different locations (several km away). In each 
location I have 2 nodes and 1 raid array.
In each location, the raid array has 16 2 TB disks and 2 controllers with 
4x 8 Gb FC channels each. The 16 disks are organized in RAID 5 (8 disks 
for one set, 7 for the other). Each raid set is primarily attached to 1 
controller, and each osd node in that location has access to the 
controller over 2 distinct paths.


There was no correlation between failed nodes & raid arrays.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-06 Thread Yann Dupont

On 05/07/2012 23:32, Gregory Farnum wrote:

[...]

ok, so as all nodes were identical, I probably hit a btrfs bug (like an
erroneous out-of-space) at more or less the same time. And when 1 osd was
out,


OH, I didn't finish the sentence... When 1 osd was out, missing data 
was copied onto other nodes, probably accelerating the btrfs problem on 
those nodes (I suspect erroneous out-of-space conditions).


I've reformatted the OSDs with xfs. Performance is slightly worse for the 
moment (well, it depends on the workload, and maybe the lack of syncfs is 
to blame), but at least I hope to have the storage layer rock-solid. BTW, 
I've managed to keep the faulty btrfs volumes.


[...]


I wonder if maybe there's a confounding factor here — are all your nodes
similar to each other,

Yes. I designed the cluster that way. All nodes are identical hardware
(PowerEdge M610, 10G Intel ethernet + Emulex fibre channel attached to
storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).

Oh, interesting. Are the broken nodes all on the same set of arrays?


No. There are 4 completely independent raid arrays, in 4 different 
locations. They are similar (same brand & model, but slightly different 
disks, and 1 different firmware), and all arrays are multipathed. I don't 
think the raid arrays are the problem. We have used those particular models 
for 2-3 years, and in the logs I don't see any problem that could be 
caused by the storage itself (like scsi or multipath errors).


Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-06 Thread Gregory Farnum
On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 On 05/07/2012 23:32, Gregory Farnum wrote:

 [...]

 ok, so as all nodes were identical, I probably hit a btrfs bug (like an
 erroneous out-of-space) at more or less the same time. And when 1 osd was
 out,


 OH, I didn't finish the sentence... When 1 osd was out, missing data was
 copied onto other nodes, probably accelerating the btrfs problem on those
 nodes (I suspect erroneous out-of-space conditions).

Ah. How full are/were the disks?


 I've reformatted the OSDs with xfs. Performance is slightly worse for the moment
 (well, it depends on the workload, and maybe the lack of syncfs is to blame), but at
 least I hope to have the storage layer rock-solid. BTW, I've managed to keep
 the faulty btrfs volumes.

 [...]


 I wonder if maybe there's a confounding factor here — are all your nodes
 similar to each other,

 Yes. I designed the cluster that way. All nodes are identical hardware
 (PowerEdge M610, 10G Intel ethernet + Emulex fibre channel attached to
 storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).

 Oh, interesting. Are the broken nodes all on the same set of arrays?


 No. There are 4 completely independent raid arrays, in 4 different
 locations. They are similar (same brand & model, but slightly different
 disks, and 1 different firmware), and all arrays are multipathed. I don't
 think the raid arrays are the problem. We have used those particular models
 for 2-3 years, and in the logs I don't see any problem that could be caused
 by the storage itself (like scsi or multipath errors).

I must have misunderstood then. What did you mean by 1 Array for 2 OSD nodes?


Re: domino-style OSD crash

2012-07-05 Thread Gregory Farnum
On Wed, Jul 4, 2012 at 10:53 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 On 04/07/2012 18:21, Gregory Farnum wrote:

 On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:

 On 03/07/2012 23:38, Tommi Virtanen wrote:

 On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

  In the case I can repair it, do you think the crashed FS as it is right
  now is valuable for you, for future reference, as I saw you can't
  reproduce the problem? I can make an archive (or a btrfs dump?), but it
  will be quite big.

     At this point, it's more about the upstream developers (of btrfs
 etc)
 than us; we're on good terms with them but not experts on the on-disk
 format(s). You might want to send an email to the relevant mailing
 lists before wiping the disks.

      Well, I probably wasn't clear enough. I talked about a crashed FS, but I
  was talking about ceph. The underlying FS (btrfs in that case) of 1 node
  (and only one) has PROBABLY crashed in the past, causing corruption in the
  ceph data on this node, and then the subsequent crash of other nodes.
    RIGHT now btrfs on this node is OK. I can access the filesystem without
  errors.
    For the moment, of the 8 nodes, 4 refuse to restart.
  1 of the 4 nodes was the crashed node; the 3 others didn't have problems
  with the underlying fs as far as I can tell.
    So I think the scenario is:
    One node had a problem with btrfs, leading first to kernel problems,
  probably corruption (on disk / in memory maybe?), and ultimately to a
  kernel oops. Before that final kernel oops, bad data was transmitted to
  other (sane) nodes, leading to ceph-osd crashes on those nodes.

 I don't think that's actually possible — the OSDs all do quite a lot of
 interpretation between what they get off the wire and what goes on disk.
 What you've got here are 4 corrupted LevelDB databases, and we pretty much
 can't do that through the interfaces we have. :/


 ok, so as all nodes were identical, I probably hit a btrfs bug (like an
 erroneous out-of-space) at more or less the same time. And when 1 osd was
 out,



   If you think this scenario is highly improbable in real life (that is,
 btrfs will probably be fixed for good, and then corruption can't
 happen), it's ok.
   But I wonder if this scenario can be triggered by other problems, with
 bad data transmitted to other sane nodes (power outage, out-of-memory
 condition, disk full... for example).
   That's why I offered you a crashed ceph volume image (I shouldn't have
 talked about a crashed fs, sorry for the confusion).

 I appreciate the offer, but I don't think this will help much — it's a
 disk state managed by somebody else, not our logical state, which has
 broken. If we could figure out how that state got broken that'd be good, but
 a ceph image won't really help in doing so.

 ok, no problem. I'll restart from scratch, freshly formatted.


 I wonder if maybe there's a confounding factor here — are all your nodes
 similar to each other,


 Yes. I designed the cluster that way. All nodes are identical hardware
 (PowerEdge M610, 10G Intel ethernet + Emulex fibre channel attached to
 storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).

Oh, interesting. Are the broken nodes all on the same set of arrays?




   or are they running on different kinds of hardware? How did you do your
 Ceph upgrades? What's ceph -s display when the cluster is running as best it
 can?


 Ceph was running 0.47.2 at that time (debian package for ceph). After the
 crash I couldn't restart all the nodes. I tried 0.47.3 and now 0.48 without
 success.

 Nothing particular for the upgrades; because ceph is broken at the moment,
 it was just apt-get upgrade with the new version.


 ceph -s shows this:

 root@label5:~# ceph -s
    health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 32 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck stale; 1092 pgs stuck unclean; recovery 267286/2491140 degraded (10.729%); 1814/1245570 unfound (0.146%)
    monmap e1: 3 mons at {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
    osdmap e2404: 8 osds: 3 up, 3 in
    pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5 active+recovering+remapped, 32 active+clean+replay, 11 active+recovering+degraded, 25 active+remapped, 710 down+peering, 222 active+degraded, 7 stale+active+recovering+degraded, 61 stale+down+peering, 20 stale+active+degraded, 6 down+remapped+peering, 8 stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%);
 

Re: domino-style OSD crash

2012-07-04 Thread Yann Dupont

On 03/07/2012 23:38, Tommi Virtanen wrote:

On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

In the case I can repair it, do you think the crashed FS as it is right now is
valuable for you, for future reference, as I saw you can't reproduce the
problem? I can make an archive (or a btrfs dump?), but it will be quite
big.

At this point, it's more about the upstream developers (of btrfs etc)
than us; we're on good terms with them but not experts on the on-disk
format(s). You might want to send an email to the relevant mailing
lists before wiping the disks.


Well, I probably wasn't clear enough. I talked about a crashed FS, but I 
was talking about ceph. The underlying FS (btrfs in that case) of 1 node 
(and only one) has PROBABLY crashed in the past, causing corruption in the 
ceph data on this node, and then the subsequent crash of other nodes.


RIGHT now btrfs on this node is OK. I can access the filesystem without 
errors.


For the moment, of the 8 nodes, 4 refuse to restart.
1 of the 4 nodes was the crashed node; the 3 others didn't have problems 
with the underlying fs as far as I can tell.


So I think the scenario is:

One node had a problem with btrfs, leading first to kernel problems, 
probably corruption (on disk / in memory maybe?), and ultimately to a 
kernel oops. Before that final kernel oops, bad data was transmitted 
to other (sane) nodes, leading to ceph-osd crashes on those nodes.


If you think this scenario is highly improbable in real life (that is, 
btrfs will probably be fixed for good, and then, corruption can't 
happen), it's ok.


But I wonder if this scenario can be triggered by other problems, with 
bad data transmitted to other sane nodes (power outage, out-of-memory 
condition, disk full... for example).


That's why I offered you a crashed ceph volume image (I shouldn't have 
talked about a crashed fs, sorry for the confusion).


Talking about btrfs, there are a lot of fixes in btrfs between 3.4 and 
3.5rc. After the crash, I couldn't mount the btrfs volume. With 3.5rc I 
can, and there is no sign of problems on it. It doesn't mean the data is 
safe there, but I think it's a sign that at least some bugs have been 
corrected in the btrfs code.


Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-04 Thread Gregory Farnum
On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
 On 03/07/2012 23:38, Tommi Virtanen wrote:
  On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
   In the case I can repair it, do you think the crashed FS as it is right now 
   is valuable for you, for future reference, as I saw you can't reproduce the 
   problem? I can make an archive (or a btrfs dump?), but it will be quite 
   big.
   
   
  At this point, it's more about the upstream developers (of btrfs etc)
  than us; we're on good terms with them but not experts on the on-disk
  format(s). You might want to send an email to the relevant mailing
  lists before wiping the disks.
  
  
 Well, I probably wasn't clear enough. I talked about a crashed FS, but I  
 was talking about ceph. The underlying FS (btrfs in that case) of 1 node  
 (and only one) has PROBABLY crashed in the past, causing corruption in the  
 ceph data on this node, and then the subsequent crash of other nodes.
  
 RIGHT now btrfs on this node is OK. I can access the filesystem without  
 errors.
  
 For the moment, of the 8 nodes, 4 refuse to restart.
 1 of the 4 nodes was the crashed node; the 3 others didn't have problems  
 with the underlying fs as far as I can tell.
  
 So I think the scenario is:
  
 One node had a problem with btrfs, leading first to kernel problems,  
 probably corruption (on disk / in memory maybe?), and ultimately to a  
 kernel oops. Before that final kernel oops, bad data was transmitted  
 to other (sane) nodes, leading to ceph-osd crashes on those nodes.

I don't think that's actually possible — the OSDs all do quite a lot of 
interpretation between what they get off the wire and what goes on disk. What 
you've got here are 4 corrupted LevelDB databases, and we pretty much can't do 
that through the interfaces we have. :/
  
  
 If you think this scenario is highly improbable in real life (that is,  
 btrfs will probably be fixed for good, and then, corruption can't  
 happen), it's ok.
  
 But I wonder if this scenario can be triggered by other problems, with  
 bad data transmitted to other sane nodes (power outage, out-of-memory  
 condition, disk full... for example).
  
 That's why I offered you a crashed ceph volume image (I shouldn't have  
 talked about a crashed fs, sorry for the confusion).

I appreciate the offer, but I don't think this will help much — it's a disk 
state managed by somebody else, not our logical state, which has broken. If we 
could figure out how that state got broken that'd be good, but a ceph image 
won't really help in doing so.

I wonder if maybe there's a confounding factor here — are all your nodes 
similar to each other, or are they running on different kinds of hardware? How 
did you do your Ceph upgrades? What's ceph -s display when the cluster is 
running as best it can?
-Greg



Re: domino-style OSD crash

2012-07-04 Thread Yann Dupont

On 04/07/2012 18:21, Gregory Farnum wrote:

On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:

On 03/07/2012 23:38, Tommi Virtanen wrote:

On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

In the case I can repair it, do you think the crashed FS as it is right now is
valuable for you, for future reference, as I saw you can't reproduce the
problem? I can make an archive (or a btrfs dump?), but it will be quite
big.
  
  
At this point, it's more about the upstream developers (of btrfs etc)

than us; we're on good terms with them but not experts on the on-disk
format(s). You might want to send an email to the relevant mailing
lists before wiping the disks.
  
  
Well, I probably wasn't clear enough. I talked about a crashed FS, but I
was talking about ceph. The underlying FS (btrfs in that case) of 1 node
(and only one) has PROBABLY crashed in the past, causing corruption in the
ceph data on this node, and then the subsequent crash of other nodes.
  
RIGHT now btrfs on this node is OK. I can access the filesystem without

errors.
  
For the moment, of the 8 nodes, 4 refuse to restart.
1 of the 4 nodes was the crashed node; the 3 others didn't have problems
with the underlying fs as far as I can tell.
  
So I think the scenario is:
  
One node had a problem with btrfs, leading first to kernel problems,
probably corruption (on disk / in memory maybe?), and ultimately to a
kernel oops. Before that final kernel oops, bad data was transmitted
to other (sane) nodes, leading to ceph-osd crashes on those nodes.

I don't think that's actually possible — the OSDs all do quite a lot of 
interpretation between what they get off the wire and what goes on disk. What 
you've got here are 4 corrupted LevelDB databases, and we pretty much can't do 
that through the interfaces we have. :/


ok, so as all nodes were identical, I probably hit a btrfs bug 
(like an erroneous out-of-space) at more or less the same time. And when 
1 osd was out,
   
  
If you think this scenario is highly improbable in real life (that is,
btrfs will probably be fixed for good, and then corruption can't
happen), it's ok.
  
But I wonder if this scenario can be triggered by other problems, with
bad data transmitted to other sane nodes (power outage, out-of-memory
condition, disk full... for example).
  
That's why I offered you a crashed ceph volume image (I shouldn't have
talked about a crashed fs, sorry for the confusion).

I appreciate the offer, but I don't think this will help much — it's a disk state managed 
by somebody else, not our logical state, which has broken. If we could figure out how 
that state got broken that'd be good, but a ceph image won't really help in 
doing so.

ok, no problem. I'll restart from scratch, freshly formatted.


I wonder if maybe there's a confounding factor here — are all your nodes 
similar to each other,


Yes. I designed the cluster that way. All nodes are identical hardware 
(PowerEdge M610, 10G Intel ethernet + Emulex fibre channel attached to 
storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).



  or are they running on different kinds of hardware? How did you do your Ceph 
upgrades? What's ceph -s display when the cluster is running as best it can?


Ceph was running 0.47.2 at that time (debian package for ceph). After 
the crash I couldn't restart all the nodes. I tried 0.47.3 and now 0.48 
without success.


Nothing particular for the upgrades; because ceph is broken at the moment, 
it was just apt-get upgrade with the new version.



ceph -s shows this:

root@label5:~# ceph -s
   health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 32 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck stale; 1092 pgs stuck unclean; recovery 267286/2491140 degraded (10.729%); 1814/1245570 unfound (0.146%)
   monmap e1: 3 mons at {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
   osdmap e2404: 8 osds: 3 up, 3 in
   pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5 active+recovering+remapped, 32 active+clean+replay, 11 active+recovering+degraded, 25 active+remapped, 710 down+peering, 222 active+degraded, 7 stale+active+recovering+degraded, 61 stale+down+peering, 20 stale+active+degraded, 6 down+remapped+peering, 8 stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%); 1814/1245570 unfound (0.146%)
   mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby



BTW, after the 0.48 upgrade, there was a disk format conversion. 1 of 
the 4 surviving OSDs didn't 

Re: domino-style OSD crash

2012-07-03 Thread Tommi Virtanen
On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 Upgraded the kernel to 3.5.0-rc4 + some patches; btrfs seems OK right
 now.

 I tried to restart the osds with 0.47.3, then the next branch, and today
 with 0.48.

 4 of 8 nodes fail with the same message:

 ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
  1: /usr/bin/ceph-osd() [0x701929]
...
  13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
 leveldb::Slice const&) const+0x4d) [0x6e811d]

That looks like http://tracker.newdream.net/issues/2563 and the best
we have for that ticket is "looks like you have a corrupted leveldb
file". Is this reproducible with a freshly mkfs'ed data partition?


Re: domino-style OSD crash

2012-07-03 Thread Yann Dupont

On 03/07/2012 21:42, Tommi Virtanen wrote:

On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

Upgraded the kernel to 3.5.0-rc4 + some patches; btrfs seems OK right
now.

I tried to restart the osds with 0.47.3, then the next branch, and today with 0.48.

4 of 8 nodes fail with the same message:

ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
  1: /usr/bin/ceph-osd() [0x701929]

...

  13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
leveldb::Slice const&) const+0x4d) [0x6e811d]

That looks like http://tracker.newdream.net/issues/2563 and the best
we have for that ticket is "looks like you have a corrupted leveldb
file". Is this reproducible with a freshly mkfs'ed data partition?
Probably not. I have multiple data volumes on each node (I was planning 
xfs vs ext4 vs btrfs benchmarks before being ill), and those nodes start 
OK with another data partition.


It's very probable that there is corruption somewhere, due to a kernel bug, 
probably triggered by btrfs.


Issue 2563 is probably the same.

I'd like to restart those nodes without formatting them, not because the 
data is valuable, but because if the same thing happens in production, an 
fsck-like method to repair a node could be of great value.


I saw the method to check the leveldb. I will try it tomorrow, with no guarantees.

In the case I can repair it, do you think the crashed FS as it is right now 
is valuable for you, for future reference, as I saw you can't reproduce 
the problem? I can make an archive (or a btrfs dump?), but it will be 
quite big.
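
If it's of any use, the archive would just be something along these lines, 
taken while the osd is stopped (the paths are placeholders for wherever the 
btrfs store is mounted):

  tar czf /tmp/osd.1-btrfs-store.tar.gz -C /path/to/btrfs/data osd.1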


Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-03 Thread Tommi Virtanen
On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 In the case I can repair it, do you think the crashed FS as it is right now is
 valuable for you, for future reference, as I saw you can't reproduce the
 problem? I can make an archive (or a btrfs dump?), but it will be quite
 big.

At this point, it's more about the upstream developers (of btrfs etc)
than us; we're on good terms with them but not experts on the on-disk
format(s). You might want to send an email to the relevant mailing
lists before wiping the disks.


Re: domino-style OSD crash

2012-06-04 Thread Tommi Virtanen
On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 Results: worked like a charm for two days, apart from btrfs warn messages;
 then the OSDs began to crash one after another, 'domino style'.

Sorry to hear that. Reading through your message, there seem to be
several problems; whether they are because of the same root cause, I
can't tell.

Quick triage to benefit the other devs:

#1: kernel crash, no details available
 1 of the physical machine was in kernel oops state - Nothing was remote

#2: leveldb corruption? may be memory corruption that started
elsewhere.. Sam, does this look like the leveldb issue you saw?
  [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
     0 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
 (Aborted) **
...
  13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
 leveldb::Slice const&) const+0x4d) [0x6ef69d]
  14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
 const&)+0x9f) [0x6fdd9f]

#3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
     0 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function
 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)'
 thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
 osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >=
 log.tail)

#4: unknown btrfs warnings, there should be an actual message above this
traceback; believed fixed in latest kernel
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
 [a026fca5] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
 [a026965a] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
 [a02bc9a0] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
 [a026a6f1] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
 [8105a9f0] ? add_wait_queue+0x60/0x60
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
 [a026aba0] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
 [a026abb1] ? do_async_commit+0x11/0x20 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]


Re: domino-style OSD crash

2012-06-04 Thread Sam Just
Can you send the osd logs?  The merge_log crashes are probably fixable
if I can see the logs.

The leveldb crash is almost certainly a result of memory corruption.

Thanks
-Sam

On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen t...@inktank.com wrote:
 On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont yann.dup...@univ-nantes.fr 
 wrote:
 Results: worked like a charm for two days, apart from btrfs warn messages;
 then the OSDs began to crash one after another, 'domino style'.

 Sorry to hear that. Reading through your message, there seem to be
 several problems; whether they are because of the same root cause, I
 can't tell.

 Quick triage to benefit the other devs:

 #1: kernel crash, no details available
 1 of the physical machine was in kernel oops state - Nothing was remote

 #2: leveldb corruption? may be memory corruption that started
 elsewhere.. Sam, does this look like the leveldb issue you saw?
  [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
     0 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
 (Aborted) **
 ...
  13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
 leveldb::Slice const&) const+0x4d) [0x6ef69d]
  14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
 const&)+0x9f) [0x6fdd9f]

 #3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
     0 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function
 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)'
 thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
 osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >=
 log.tail)

 #4: unknown btrfs warnings, there should be an actual message above this
 traceback; believed fixed in latest kernel
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
 [a026fca5] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
 [a026965a] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
 [a02bc9a0] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
 [a026a6f1] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
 [8105a9f0] ? add_wait_queue+0x60/0x60
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
 [a026aba0] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
 [a026abb1] ? do_async_commit+0x11/0x20 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]


Re: domino-style OSD crash

2012-06-04 Thread Greg Farnum
This is probably the same/similar to http://tracker.newdream.net/issues/2462, 
no? There's a log there, though I've no idea how helpful it is.


On Monday, June 4, 2012 at 10:40 AM, Sam Just wrote:

 Can you send the osd logs? The merge_log crashes are probably fixable
 if I can see the logs.
 
 The leveldb crash is almost certainly a result of memory corruption.
 
 Thanks
 -Sam
 
  On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen t...@inktank.com wrote:
   On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
    Results: worked like a charm for two days, apart from btrfs warn messages;
    then the OSDs began to crash one after another, 'domino style'.
  
  
  
  Sorry to hear that. Reading through your message, there seem to be
  several problems; whether they are because of the same root cause, I
  can't tell.
  
  Quick triage to benefit the other devs:
  
  #1: kernel crash, no details available
   1 of the physical machine was in kernel oops state - Nothing was remote
  
  
  
  #2: leveldb corruption? may be memory corruption that started
  elsewhere.. Sam, does this look like the leveldb issue you saw?
   [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
   0 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
   (Aborted) **
  
  
  ...
    13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
    leveldb::Slice const&) const+0x4d) [0x6ef69d]
    14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
    const&)+0x9f) [0x6fdd9f]
  
  
  
  #3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
    0 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function
    'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)'
    thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
    osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >=
    log.tail)
  
  
  
   #4: unknown btrfs warnings, there should be an actual message above this
  traceback; believed fixed in latest kernel
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
   [a026fca5] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
   [a026965a] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
   [a02bc9a0] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
   [a026a6f1] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
   [8105a9f0] ? add_wait_queue+0x60/0x60
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
   [a026aba0] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
   [a026abb1] ? do_async_commit+0x11/0x20 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]
  
  
 
 


