Re: [ceph-users] Some OSD and MDS crash

2014-07-17 Thread John Spray
Hi Pierre,

Unfortunately it looks like we had a bug in 0.82 that could lead to
journal corruption of the sort you're seeing here.  A new journal
format was added, and on the first start after an update the MDS would
re-write the journal to the new format.  This should only have been
happening on the single active MDS for a given rank, but it was
actually being done by standby-replay MDS daemons too.  As a result,
if there were standby-replay daemons configured, they could try to
rewrite the journal at the same time, resulting in a corrupt journal.
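For reference, standby-replay is the mode you get from a ceph.conf entry
along these lines (shown purely as an illustration of the kind of
configuration that exposes the bug; the MDS name and rank here are
placeholders):

  [mds.b]
      mds standby replay = true
      mds standby for rank = 0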

In your case, I think the probability of the condition occurring was
increased by the OSD issues you were having, because at some earlier
stage the rewrite process had been stopped partway through.  Without
standby MDSs this would be recovered from cleanly, but with the
standbys in play the danger of corruption is high while the journal is
in the partly-rewritten state.

The ticket is here: http://tracker.ceph.com/issues/8811
The candidate fix is here: https://github.com/ceph/ceph/pull/2115

If you have recent backups then I would suggest recreating the
filesystem and restoring from backups.  You can also try using the
cephfs-journal-tool journal reset command, which will wipe out the
journal entirely, losing the most recent writes to the filesystem and
potentially leaving some stray objects in the data pool.
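If you go down that route, the rough sequence would be something like the
following, run while the MDS daemons are stopped (a sketch, not an exact
recipe):

  # keep a copy of whatever is still readable first
  cephfs-journal-tool journal export backup.bin
  # check the damage
  cephfs-journal-tool journal inspect
  # wipe the journal -- destructive, recent writes are lost
  cephfs-journal-tool journal reset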

Sorry that this has bitten you. Even though 0.82 was not a named
release, this was a pretty nasty bug to let out there, and I'm going to
improve our automated tests in this area.

Regards,
John


On Wed, Jul 16, 2014 at 11:57 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:
 On 16/07/2014 at 22:40, Gregory Farnum wrote:

 On Wed, Jul 16, 2014 at 6:21 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Hi,

 After the repair process, I have:
 1926 active+clean
 2 active+clean+inconsistent

 These two PGs seem to be on the same OSD (#34):
 # ceph pg dump | grep inconsistent
 dumped all in format plain
 0.2e4   0   0   0   8388660 4   4
 active+clean+inconsistent   2014-07-16 11:39:43.819631  9463'4
 438411:133968   [34,4]  34  [34,4]  34  9463'4  2014-07-16
 04:52:54.417333  9463'4  2014-07-11 09:29:22.041717
 0.1ed   5   0   0   0   8388623 10  10
 active+clean+inconsistent   2014-07-16 11:39:45.820142  9712'10
 438411:144792   [34,2]  34  [34,2]  34  9712'10 2014-07-16
 09:12:44.742488  9712'10 2014-07-10 21:57:11.345241

 Can this explain why my MDS won't start? If I remove (or shut down)
 this OSD, could that solve my problem?


 You want to figure out why they're inconsistent (if they're still
 going inconsistent, or maybe just need to be repaired), but this
 shouldn't be causing your MDS troubles.
 Can you dump the MDS journal and put it somewhere accessible? (You can
 use ceph-post-file to upload it.) John has been trying to reproduce
 this crash but hasn't succeeded yet.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 Hi,

 I tried to run:
 cephfs-journal-tool journal export ceph-journal.bin 2> cephfs-journal-tool.log

 But the program crashes. I uploaded the log file:
 e069c6ac-3cb4-4a52-8950-da7c600e2b01

 There is a mistake in
 http://ceph.com/docs/master/cephfs/cephfs-journal-tool/ in the Example:
 journal inspect section. The correct syntax seems to be:
 # cephfs-journal-tool journal inspect
 2014-07-17 00:54:14.155382 7ff89d239780 -1 Header is invalid (inconsistent
 offsets)
 Overall journal integrity: DAMAGED
 Header could not be decoded

 Regards


 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes & réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some OSD and MDS crash

2014-07-16 Thread Pierre BLONDEAU

Hi,

After the repair process, I have:
1926 active+clean
   2 active+clean+inconsistent

These two PGs seem to be on the same OSD (#34):
# ceph pg dump | grep inconsistent
dumped all in format plain
0.2e4   0   0   0   8388660 4   4 
active+clean+inconsistent   2014-07-16 11:39:43.819631  9463'4 
438411:133968   [34,4]  34  [34,4]  34  9463'4  2014-07-16 
04:52:54.417333  9463'4  2014-07-11 09:29:22.041717
0.1ed   5   0   0   0   8388623 10  10 
active+clean+inconsistent   2014-07-16 11:39:45.820142  9712'10 
438411:144792   [34,2]  34  [34,2]  34  9712'10 2014-07-16 
09:12:44.742488  9712'10 2014-07-10 21:57:11.345241


Can this explain why my MDS won't start? If I remove (or shut down)
this OSD, could that solve my problem?


Regards.

On 10/07/2014 at 11:51, Pierre BLONDEAU wrote:

Hi,

Great.

All my OSDs restarted:
osdmap e438044: 36 osds: 36 up, 36 in

All PGs are active and some are in recovery:
1604040/49575206 objects degraded (3.236%)
  1780 active+clean
  17 active+degraded+remapped+backfilling
  61 active+degraded+remapped+wait_backfill
  11 active+clean+scrubbing+deep
  34 active+remapped+backfilling
  21 active+remapped+wait_backfill
  4 active+clean+replay

But all MDS daemons crash. Logs are here:
https://blondeau.users.greyc.fr/cephlog/legacy/

In any case, thank you very much for your help.

Pierre

On 09/07/2014 at 19:34, Joao Eduardo Luis wrote:

On 07/09/2014 02:22 PM, Pierre BLONDEAU wrote:

Hi,

Is there any chance to restore my data?


Okay, I talked to Sam and here's what you could try before anything else:

- Make sure you have everything running on the same version.
- unset the chooseleaf_vary_r flag -- this can be accomplished by
setting tunables to legacy.
- have the osds join in the cluster
- you should then either upgrade to firefly (if you haven't done so by
now) or wait for the point-release before you move on to setting
tunables to optimal again.

Let us know how it goes.

   -Joao




Regards
Pierre

On 07/07/2014 at 15:42, Pierre BLONDEAU wrote:

There is no chance of getting those logs, and even less in debug mode:
I made this change 3 weeks ago.

I put all my logs here in case it helps:
https://blondeau.users.greyc.fr/cephlog/all/

Is there a chance to recover my +/- 20TB of data?

Regards

Le 03/07/2014 21:48, Joao Luis a écrit :

Do those logs have a higher debugging level than the default? If not
nevermind as they will not have enough information. If they do
however,
we'd be interested in the portion around the moment you set the
tunables. Say, before the upgrade and a bit after you set the tunable.
If you want to be finer grained, then ideally it would be the moment
where those maps were created, but you'd have to grep the logs for
that.

Or drop the logs somewhere and I'll take a look.

   -Joao

On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr
mailto:pierre.blond...@unicaen.fr wrote:

Le 03/07/2014 13:49, Joao Eduardo Luis a écrit :

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

Le 03/07/2014 00:55, Samuel Just a écrit :

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool
--export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d
/tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d
/tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to
/tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to
/tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

  Looks like the chooseleaf_vary_r tunable somehow
ended
up divergent?


The only thing that comes to mind that could cause this is
if we
changed
the leader's in-memory map, proposed it, it failed, and only
the
leader
got to write the map to disk somehow.  This happened once on a
totally
different issue (although I can't pinpoint right now which).

In such a scenario, the leader would serve the incorrect
osdmap to
whoever asked osdmaps from it, the remaining quorum would
serve the
correct osdmaps to all the others.  This could cause this
divergence. Or
it could be something else.

Are there logs for the monitors for the timeframe this may
have
happened
in?


Which exact timeframe do you want? I have 7 days of logs; I should
have information about the upgrade from firefly to 0.82.
Which monitor's logs do you want? All three?

Regards

-Joao


Pierre: do you recall how and when that got set?


I am not sure I understand, but 

Re: [ceph-users] Some OSD and MDS crash

2014-07-16 Thread Gregory Farnum
On Wed, Jul 16, 2014 at 6:21 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:
 Hi,

 After the repair process, I have:
 1926 active+clean
2 active+clean+inconsistent

 These two PGs seem to be on the same OSD (#34):
 # ceph pg dump | grep inconsistent
 dumped all in format plain
 0.2e4   0   0   0   8388660 4   4
 active+clean+inconsistent   2014-07-16 11:39:43.819631  9463'4
 438411:133968   [34,4]  34  [34,4]  34  9463'4  2014-07-16
 04:52:54.417333  9463'4  2014-07-11 09:29:22.041717
 0.1ed   5   0   0   0   8388623 10  10
 active+clean+inconsistent   2014-07-16 11:39:45.820142  9712'10
 438411:144792   [34,2]  34  [34,2]  34  9712'10 2014-07-16
 09:12:44.742488  9712'10 2014-07-10 21:57:11.345241

 Can this explain why my MDS won't start? If I remove (or shut down)
 this OSD, could that solve my problem?

You want to figure out why they're inconsistent (if they're still
going inconsistent, or maybe just need to be repaired), but this
shouldn't be causing your MDS troubles.
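If you just want to re-check and repair them, something along these
lines should do it (PG IDs taken from your dump; a sketch only, it won't
tell you why they went inconsistent in the first place):

  ceph pg deep-scrub 0.2e4
  ceph pg deep-scrub 0.1ed
  # if they still show up inconsistent afterwards:
  ceph pg repair 0.2e4
  ceph pg repair 0.1ed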
Can you dump the MDS journal and put it somewhere accessible? (You can
use ceph-post-file to upload it.) John has been trying to reproduce
this crash but hasn't succeeded yet.
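For example, something like this (the exact ceph-post-file flags are
from memory, so double-check them against ceph-post-file --help):

  cephfs-journal-tool journal export ceph-journal.bin
  ceph-post-file -d 'MDS journal, OSD/MDS crash thread' ceph-journal.bin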
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] Some OSD and MDS crash

2014-07-10 Thread Pierre BLONDEAU

Hi,

Great.

All my OSDs restarted:
osdmap e438044: 36 osds: 36 up, 36 in

All PGs are active and some are in recovery:
1604040/49575206 objects degraded (3.236%)
 1780 active+clean
 17 active+degraded+remapped+backfilling
 61 active+degraded+remapped+wait_backfill
 11 active+clean+scrubbing+deep
 34 active+remapped+backfilling
 21 active+remapped+wait_backfill
 4 active+clean+replay

But all MDS daemons crash. Logs are here:
https://blondeau.users.greyc.fr/cephlog/legacy/


In any case, thank you very much for your help.

Pierre

On 09/07/2014 at 19:34, Joao Eduardo Luis wrote:

On 07/09/2014 02:22 PM, Pierre BLONDEAU wrote:

Hi,

Is there any chance to restore my data?


Okay, I talked to Sam and here's what you could try before anything else:

- Make sure you have everything running on the same version.
- unset the chooseleaf_vary_r flag -- this can be accomplished by
setting tunables to legacy.
- have the osds join in the cluster
- you should then either upgrade to firefly (if you haven't done so by
now) or wait for the point-release before you move on to setting
tunables to optimal again.

Let us know how it goes.

   -Joao




Regards
Pierre

On 07/07/2014 at 15:42, Pierre BLONDEAU wrote:

There is no chance of getting those logs, and even less in debug mode:
I made this change 3 weeks ago.

I put all my logs here in case it helps:
https://blondeau.users.greyc.fr/cephlog/all/

Is there a chance to recover my +/- 20TB of data?

Regards

Le 03/07/2014 21:48, Joao Luis a écrit :

Do those logs have a higher debugging level than the default? If not
nevermind as they will not have enough information. If they do however,
we'd be interested in the portion around the moment you set the
tunables. Say, before the upgrade and a bit after you set the tunable.
If you want to be finer grained, then ideally it would be the moment
where those maps were created, but you'd have to grep the logs for
that.

Or drop the logs somewhere and I'll take a look.

   -Joao

On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr
mailto:pierre.blond...@unicaen.fr wrote:

Le 03/07/2014 13:49, Joao Eduardo Luis a écrit :

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

Le 03/07/2014 00:55, Samuel Just a écrit :

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool
--export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d
/tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d
/tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to
/tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to
/tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

  Looks like the chooseleaf_vary_r tunable somehow
ended
up divergent?


The only thing that comes to mind that could cause this is
if we
changed
the leader's in-memory map, proposed it, it failed, and only
the
leader
got to write the map to disk somehow.  This happened once on a
totally
different issue (although I can't pinpoint right now which).

In such a scenario, the leader would serve the incorrect
osdmap to
whoever asked osdmaps from it, the remaining quorum would
serve the
correct osdmaps to all the others.  This could cause this
divergence. Or
it could be something else.

Are there logs for the monitors for the timeframe this may have
happened
in?


Which exact timeframe do you want? I have 7 days of logs; I should
have information about the upgrade from firefly to 0.82.
Which monitor's logs do you want? All three?

Regards

-Joao


Pierre: do you recall how and when that got set?


I am not sure I understand, but if I remember correctly, after the
update to firefly I was in the state HEALTH_WARN crush map has legacy
tunables, and I saw feature set mismatch in the logs.

So, if I remember correctly, I ran ceph osd crush tunables optimal for
the crush map problem, and I updated my client and server kernels to
3.16rc.

Could that be the cause?

Pierre

-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just
sam.j...@inktank.com mailto:sam.j...@inktank.com
wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c
  

Re: [ceph-users] Some OSD and MDS crash

2014-07-09 Thread Pierre BLONDEAU

Hi,

Is there any chance to restore my data?

Regards
Pierre

On 07/07/2014 at 15:42, Pierre BLONDEAU wrote:

There is no chance of getting those logs, and even less in debug mode:
I made this change 3 weeks ago.

I put all my logs here in case it helps:
https://blondeau.users.greyc.fr/cephlog/all/

Is there a chance to recover my +/- 20TB of data?

Regards

Le 03/07/2014 21:48, Joao Luis a écrit :

Do those logs have a higher debugging level than the default? If not
nevermind as they will not have enough information. If they do however,
we'd be interested in the portion around the moment you set the
tunables. Say, before the upgrade and a bit after you set the tunable.
If you want to be finer grained, then ideally it would be the moment
where those maps were created, but you'd have to grep the logs for that.

Or drop the logs somewhere and I'll take a look.

   -Joao

On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr
mailto:pierre.blond...@unicaen.fr wrote:

Le 03/07/2014 13:49, Joao Eduardo Luis a écrit :

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

Le 03/07/2014 00:55, Samuel Just a écrit :

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool
--export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d
/tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to
/tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to
/tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

  Looks like the chooseleaf_vary_r tunable somehow ended
up divergent?


The only thing that comes to mind that could cause this is if we
changed
the leader's in-memory map, proposed it, it failed, and only the
leader
got to write the map to disk somehow.  This happened once on a
totally
different issue (although I can't pinpoint right now which).

In such a scenario, the leader would serve the incorrect
osdmap to
whoever asked osdmaps from it, the remaining quorum would
serve the
correct osdmaps to all the others.  This could cause this
divergence. Or
it could be something else.

Are there logs for the monitors for the timeframe this may have
happened
in?


Which exactly timeframe you want ? I have 7 days of logs, I should
have informations about the upgrade from firefly to 0.82.
Which mon's log do you want ? Three ?

Regards

-Joao


Pierre: do you recall how and when that got set?


I am not sure to understand, but if I good remember after
the update in
firefly, I was in state : HEALTH_WARN crush map has legacy
tunables and
I see feature set mismatch in log.

So if I good remeber, i do : ceph osd crush tunables optimal
for the
problem of crush map and I update my client and server
kernel to
3.16rc.

It's could be that ?

Pierre

-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just
sam.j...@inktank.com mailto:sam.j...@inktank.com
wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr
mailto:pierre.blond...@unicaen.fr wrote:

The files

When I upgrade :
   ceph-deploy install --stable firefly
servers...
   on each servers service ceph restart mon
   on each servers service ceph restart osd
   on each servers service ceph restart mds

I upgraded from emperor to firefly. After
repair, remap, replace,
etc ... I
have some PG which pass in peering state.

I thought why not try the version 0.82, it could
solve my problem. (
It's my mistake ). So, I upgrade from firefly to
0.83 with :
   ceph-deploy install 

Re: [ceph-users] Some OSD and MDS crash

2014-07-09 Thread Joao Eduardo Luis

On 07/09/2014 02:22 PM, Pierre BLONDEAU wrote:

Hi,

Is there any chance to restore my data?


Hello Pierre,

I've been giving this some thought and my guess is that yes, it should 
be possible.  However, it may not be a simple fix.


So, first of all, you got bit by http://tracker.ceph.com/issues/8738, 
which has been resolved and should be available on the next firefly 
point-release.


However, I doubt just upgrading will solve all your problems.  You'll 
have some OSDs with maps containing the chooseleaf_vary_r flag, while 
other OSDs won't.  You'll also have monitors serving such maps, while 
other monitors won't.


This may very well mean having to enable the flag throughout the cluster 
in all those maps that haven't got said flag enabled.  In which case 
this will mean having to put together a tool to do this, while a daemon 
is offline.


There may, however, be another way; although it's simpler, it's more intrusive:

First of all we'd have to know which monitor is the one with the 
appropriate maps (this would certainly be the firefly monitor), which 
I'm assuming is still online.


Then we'd have to remove all remaining monitors and add new, firefly 
monitors.  This way they'd sync up with the monitor with the correct maps.
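As a very rough sketch of that monitor-replacement step (monitor names
and hostnames here are placeholders, and I'd want Sam to confirm before
anyone actually runs it):

  ceph mon dump                      # list the monitors currently in the monmap
  ceph mon remove b                  # remove one of the divergent monitors
  ceph-deploy mon create mon-host-b  # re-create it on firefly so it syncs from the good monitor

and repeat for each divergent monitor, one at a time, so quorum is kept.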


Then we'd have to make sure in which map version this whole thing 
happened, and copy all maps from that point forward from the up OSDs to 
the OSDs that have divergent maps.


It would be nice if Sam could chime in and validate either approach.

  -Joao




Regards
Pierre

Le 07/07/2014 15:42, Pierre BLONDEAU a écrit :

No chance to have those logs and even less in debug mode. I do this
change 3 weeks ago.

I put all my log here if it's can help :
https://blondeau.users.greyc.fr/cephlog/all/

I have a chance to recover my +/- 20TB of data ?

Regards

Le 03/07/2014 21:48, Joao Luis a écrit :

Do those logs have a higher debugging level than the default? If not
nevermind as they will not have enough information. If they do however,
we'd be interested in the portion around the moment you set the
tunables. Say, before the upgrade and a bit after you set the tunable.
If you want to be finer grained, then ideally it would be the moment
where those maps were created, but you'd have to grep the logs for that.

Or drop the logs somewhere and I'll take a look.

   -Joao

On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr
mailto:pierre.blond...@unicaen.fr wrote:

Le 03/07/2014 13:49, Joao Eduardo Luis a écrit :

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

Le 03/07/2014 00:55, Samuel Just a écrit :

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool
--export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d
/tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to
/tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to
/tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

  Looks like the chooseleaf_vary_r tunable somehow ended
up divergent?


The only thing that comes to mind that could cause this is if we
changed
the leader's in-memory map, proposed it, it failed, and only the
leader
got to write the map to disk somehow.  This happened once on a
totally
different issue (although I can't pinpoint right now which).

In such a scenario, the leader would serve the incorrect
osdmap to
whoever asked osdmaps from it, the remaining quorum would
serve the
correct osdmaps to all the others.  This could cause this
divergence. Or
it could be something else.

Are there logs for the monitors for the timeframe this may have
happened
in?


Which exactly timeframe you want ? I have 7 days of logs, I should
have informations about the upgrade from firefly to 0.82.
Which mon's log do you want ? Three ?

Regards

-Joao


Pierre: do you recall how and when that got set?


I am not sure to understand, but if I good remember after
the update in
firefly, I was in state : HEALTH_WARN crush map has legacy
tunables and
I see feature set mismatch in log.

So if I good remeber, i do : ceph osd crush tunables optimal
for the
problem of crush map and I update my client and server
kernel to
3.16rc.

It's could be that ?

Pierre

-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel 

Re: [ceph-users] Some OSD and MDS crash

2014-07-09 Thread Joao Eduardo Luis

On 07/09/2014 02:22 PM, Pierre BLONDEAU wrote:

Hi,

Is there any chance to restore my data?


Okay, I talked to Sam and here's what you could try before anything else:

- Make sure you have everything running on the same version.
- unset the chooseleaf_vary_r flag -- this can be accomplished by
setting tunables to legacy (see the example just after this list).

- have the osds join in the cluster
- you should then either upgrade to firefly (if you haven't done so by
now) or wait for the point-release before you move on to setting
tunables to optimal again.
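Concretely, the tunables steps would look something like this (sketch):

  ceph osd crush tunables legacy    # drops chooseleaf_vary_r along with the other non-legacy tunables
  # ... later, once everything is on firefly or the point-release:
  ceph osd crush tunables optimal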


Let us know how it goes.

  -Joao




Regards
Pierre

Le 07/07/2014 15:42, Pierre BLONDEAU a écrit :

No chance to have those logs and even less in debug mode. I do this
change 3 weeks ago.

I put all my log here if it's can help :
https://blondeau.users.greyc.fr/cephlog/all/

I have a chance to recover my +/- 20TB of data ?

Regards

Le 03/07/2014 21:48, Joao Luis a écrit :

Do those logs have a higher debugging level than the default? If not
nevermind as they will not have enough information. If they do however,
we'd be interested in the portion around the moment you set the
tunables. Say, before the upgrade and a bit after you set the tunable.
If you want to be finer grained, then ideally it would be the moment
where those maps were created, but you'd have to grep the logs for that.

Or drop the logs somewhere and I'll take a look.

   -Joao

On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr
mailto:pierre.blond...@unicaen.fr wrote:

Le 03/07/2014 13:49, Joao Eduardo Luis a écrit :

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

Le 03/07/2014 00:55, Samuel Just a écrit :

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool
--export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d
/tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to
/tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to
/tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

  Looks like the chooseleaf_vary_r tunable somehow ended
up divergent?


The only thing that comes to mind that could cause this is if we
changed
the leader's in-memory map, proposed it, it failed, and only the
leader
got to write the map to disk somehow.  This happened once on a
totally
different issue (although I can't pinpoint right now which).

In such a scenario, the leader would serve the incorrect
osdmap to
whoever asked osdmaps from it, the remaining quorum would
serve the
correct osdmaps to all the others.  This could cause this
divergence. Or
it could be something else.

Are there logs for the monitors for the timeframe this may have
happened
in?


Which exactly timeframe you want ? I have 7 days of logs, I should
have informations about the upgrade from firefly to 0.82.
Which mon's log do you want ? Three ?

Regards

-Joao


Pierre: do you recall how and when that got set?


I am not sure to understand, but if I good remember after
the update in
firefly, I was in state : HEALTH_WARN crush map has legacy
tunables and
I see feature set mismatch in log.

So if I good remeber, i do : ceph osd crush tunables optimal
for the
problem of crush map and I update my client and server
kernel to
3.16rc.

It's could be that ?

Pierre

-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just
sam.j...@inktank.com mailto:sam.j...@inktank.com
wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr
mailto:pierre.blond...@unicaen.fr wrote:

The files

When I upgrade :
   ceph-deploy install --stable firefly
servers...
   on each servers service ceph restart mon
   on each servers service ceph restart osd
   on each 

Re: [ceph-users] Some OSD and MDS crash

2014-07-07 Thread Pierre BLONDEAU
There is no chance of getting those logs, and even less in debug mode:
I made this change 3 weeks ago.

I put all my logs here in case it helps:
https://blondeau.users.greyc.fr/cephlog/all/

Is there a chance to recover my +/- 20TB of data?

Regards

On 03/07/2014 at 21:48, Joao Luis wrote:

Do those logs have a higher debugging level than the default? If not
nevermind as they will not have enough information. If they do however,
we'd be interested in the portion around the moment you set the
tunables. Say, before the upgrade and a bit after you set the tunable.
If you want to be finer grained, then ideally it would be the moment
where those maps were created, but you'd have to grep the logs for that.

Or drop the logs somewhere and I'll take a look.

   -Joao

On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr
mailto:pierre.blond...@unicaen.fr wrote:

Le 03/07/2014 13:49, Joao Eduardo Luis a écrit :

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

Le 03/07/2014 00:55, Samuel Just a écrit :

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool
--export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d
/tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0___4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

  Looks like the chooseleaf_vary_r tunable somehow ended
up divergent?


The only thing that comes to mind that could cause this is if we
changed
the leader's in-memory map, proposed it, it failed, and only the
leader
got to write the map to disk somehow.  This happened once on a
totally
different issue (although I can't pinpoint right now which).

In such a scenario, the leader would serve the incorrect osdmap to
whoever asked osdmaps from it, the remaining quorum would serve the
correct osdmaps to all the others.  This could cause this
divergence. Or
it could be something else.

Are there logs for the monitors for the timeframe this may have
happened
in?


Which exactly timeframe you want ? I have 7 days of logs, I should
have informations about the upgrade from firefly to 0.82.
Which mon's log do you want ? Three ?

Regards

-Joao


Pierre: do you recall how and when that got set?


I am not sure to understand, but if I good remember after
the update in
firefly, I was in state : HEALTH_WARN crush map has legacy
tunables and
I see feature set mismatch in log.

So if I good remeber, i do : ceph osd crush tunables optimal
for the
problem of crush map and I update my client and server
kernel to
3.16rc.

It's could be that ?

Pierre

-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just
sam.j...@inktank.com mailto:sam.j...@inktank.com
wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr
mailto:pierre.blond...@unicaen.fr wrote:

The files

When I upgrade :
   ceph-deploy install --stable firefly servers...
   on each servers service ceph restart mon
   on each servers service ceph restart osd
   on each servers service ceph restart mds

I upgraded from emperor to firefly. After
repair, remap, replace,
etc ... I
have some PG which pass in peering state.

I thought why not try the version 0.82, it could
solve my problem. (
It's my mistake ). So, I upgrade from firefly to
0.83 with :
   ceph-deploy install --testing servers...
   ..

Now, all programs are in version 0.82.
  

Re: [ceph-users] Some OSD and MDS crash

2014-07-03 Thread Joao Eduardo Luis

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

Le 03/07/2014 00:55, Samuel Just a écrit :

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

 Looks like the chooseleaf_vary_r tunable somehow ended up divergent?


The only thing that comes to mind that could cause this is if we changed 
the leader's in-memory map, proposed it, it failed, and only the leader 
got to write the map to disk somehow.  This happened once on a totally 
different issue (although I can't pinpoint right now which).


In such a scenario, the leader would serve the incorrect osdmap to 
whoever asked osdmaps from it, the remaining quorum would serve the 
correct osdmaps to all the others.  This could cause this divergence. 
Or it could be something else.


Are there logs for the monitors for the timeframe this may have happened in?

  -Joao



Pierre: do you recall how and when that got set?


I am not sure I understand, but if I remember correctly, after the
update to firefly I was in the state HEALTH_WARN crush map has legacy
tunables, and I saw feature set mismatch in the logs.

So, if I remember correctly, I ran ceph osd crush tunables optimal for
the crush map problem, and I updated my client and server kernels to 3.16rc.

Could that be the cause?

Pierre


-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:

The files

When I upgraded:
  ceph-deploy install --stable firefly servers...
  on each server: service ceph restart mon
  on each server: service ceph restart osd
  on each server: service ceph restart mds

I upgraded from emperor to firefly. After repair, remap, replace,
etc., I have some PGs which pass into the peering state.

I thought why not try version 0.82, it might solve my problem (that
was my mistake). So I upgraded from firefly to 0.83 with:
  ceph-deploy install --testing servers...
  ..

Now, all programs are at version 0.82.
I have 3 mons, 36 OSDs and 3 MDSs.

Pierre

PS: I also find inc\uosdmap.13258__0_469271DE__none in each meta
directory.

Le 03/07/2014 00:10, Samuel Just a écrit :


Also, what version did you upgrade from, and how did you upgrade?
-Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com
wrote:


Ok, in current/meta on osd 20 and osd 23, please attach all files
matching

^osdmap.13258.*

There should be one such file on each osd. (should look something
like
osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
you'll want to use find).

What version of ceph is running on your mons?  How many mons do
you have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:


Hi,

I do it, the log files are available here :
https://blondeau.users.greyc.fr/cephlog/debug20/

The OSD's files are really big +/- 80M .

After starting the osd.20 some other osd crash. I pass from 31
osd up to
16.
I remark that after this the number of down+peering PG decrease
from 367
to
248. It's normal ? May be it's temporary, the time that the
cluster
verifies all the PG ?

Regards
Pierre

Le 02/07/2014 19:16, Samuel Just a écrit :


You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds.  I'd
like
all three logs if possible.

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:



Yes, but how i do that ?

With a command like that ?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
--debug-ms
1'

By modify the /etc/ceph/ceph.conf ? This file is really poor
because I
use
udev detection.

When I have made these changes, you want the three log files or
only
osd.20's ?

Thank you so much for the help

Regards
Pierre

Le 01/07/2014 23:51, Samuel Just a écrit :


Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:




Hi,

I join :
 - osd.20 is one of osd that I detect which makes crash
other
OSD.
 - osd.23 is one of osd which crash when i start osd.20
 - mds, is one of my MDS

I cut log file because they are to big but. All is here :
https://blondeau.users.greyc.fr/cephlog/

Regards

Le 30/06/2014 17:35, Gregory Farnum a écrit :


What's the backtrace from the crashing OSDs?

Keep in mind that as a dev release, it's 

Re: [ceph-users] Some OSD and MDS crash

2014-07-03 Thread Pierre BLONDEAU

Le 03/07/2014 13:49, Joao Eduardo Luis a écrit :

On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

Le 03/07/2014 00:55, Samuel Just a écrit :

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

 Looks like the chooseleaf_vary_r tunable somehow ended up divergent?


The only thing that comes to mind that could cause this is if we changed
the leader's in-memory map, proposed it, it failed, and only the leader
got to write the map to disk somehow.  This happened once on a totally
different issue (although I can't pinpoint right now which).

In such a scenario, the leader would serve the incorrect osdmap to
whoever asked osdmaps from it, the remaining quorum would serve the
correct osdmaps to all the others.  This could cause this divergence. Or
it could be something else.

Are there logs for the monitors for the timeframe this may have happened
in?


Which exact timeframe do you want? I have 7 days of logs; I should have
information about the upgrade from firefly to 0.82.

Which monitor's logs do you want? All three?

Regards


   -Joao



Pierre: do you recall how and when that got set?


I am not sure I understand, but if I remember correctly, after the
update to firefly I was in the state HEALTH_WARN crush map has legacy
tunables, and I saw feature set mismatch in the logs.

So, if I remember correctly, I ran ceph osd crush tunables optimal for
the crush map problem, and I updated my client and server kernels to
3.16rc.

Could that be the cause?

Pierre


-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com
wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:

The files

When I upgraded:
  ceph-deploy install --stable firefly servers...
  on each server: service ceph restart mon
  on each server: service ceph restart osd
  on each server: service ceph restart mds

I upgraded from emperor to firefly. After repair, remap, replace,
etc., I have some PGs which pass into the peering state.

I thought why not try version 0.82, it might solve my problem (that
was my mistake). So I upgraded from firefly to 0.83 with:
  ceph-deploy install --testing servers...
  ..

Now, all programs are at version 0.82.
I have 3 mons, 36 OSDs and 3 MDSs.

Pierre

PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta
directory.

Le 03/07/2014 00:10, Samuel Just a écrit :


Also, what version did you upgrade from, and how did you upgrade?
-Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com
wrote:


Ok, in current/meta on osd 20 and osd 23, please attach all files
matching

^osdmap.13258.*

There should be one such file on each osd. (should look something
like
osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
you'll want to use find).

What version of ceph is running on your mons?  How many mons do
you have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:


Hi,

I do it, the log files are available here :
https://blondeau.users.greyc.fr/cephlog/debug20/

The OSD's files are really big +/- 80M .

After starting the osd.20 some other osd crash. I pass from 31
osd up to
16.
I remark that after this the number of down+peering PG decrease
from 367
to
248. It's normal ? May be it's temporary, the time that the
cluster
verifies all the PG ?

Regards
Pierre

Le 02/07/2014 19:16, Samuel Just a écrit :


You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds.  I'd
like
all three logs if possible.

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:



Yes, but how i do that ?

With a command like that ?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
--debug-ms
1'

By modify the /etc/ceph/ceph.conf ? This file is really poor
because I
use
udev detection.

When I have made these changes, you want the three log files or
only
osd.20's ?

Thank you so much for the help

Regards
Pierre

Le 01/07/2014 23:51, Samuel Just a écrit :


Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:




Hi,

I join :
 - osd.20 is one of osd that I detect which makes crash
other
OSD.
 - osd.23 is one of osd which crash when i start osd.20
 - mds, is one of my MDS

I cut log file because 

Re: [ceph-users] Some OSD and MDS crash

2014-07-03 Thread Joao Luis
Do those logs have a higher debugging level than the default? If not,
never mind, as they will not have enough information. If they do, however,
we'd be interested in the portion around the moment you set the tunables.
Say, before the upgrade and a bit after you set the tunable. If you want to
be finer grained, then ideally it would be the moment where those maps were
created, but you'd have to grep the logs for that.

Or drop the logs somewhere and I'll take a look.

  -Joao
On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr
wrote:

 Le 03/07/2014 13:49, Joao Eduardo Luis a écrit :

 On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:

 Le 03/07/2014 00:55, Samuel Just a écrit :

 Ah,

 ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
 /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
 /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
 ../ceph/src/osdmaptool: osdmap file
 'osd-20_osdmap.13258__0_4E62BB79__none'
 ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
 ../ceph/src/osdmaptool: osdmap file
 'osd-23_osdmap.13258__0_4E62BB79__none'
 ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
 6d5
 < tunable chooseleaf_vary_r 1

  Looks like the chooseleaf_vary_r tunable somehow ended up divergent?


 The only thing that comes to mind that could cause this is if we changed
 the leader's in-memory map, proposed it, it failed, and only the leader
 got to write the map to disk somehow.  This happened once on a totally
 different issue (although I can't pinpoint right now which).

 In such a scenario, the leader would serve the incorrect osdmap to
 whoever asked osdmaps from it, the remaining quorum would serve the
 correct osdmaps to all the others.  This could cause this divergence. Or
 it could be something else.

 Are there logs for the monitors for the timeframe this may have happened
 in?


 Which exact timeframe do you want? I have 7 days of logs; I should have
 information about the upgrade from firefly to 0.82.
 Which monitor's logs do you want? All three?

 Regards

 -Joao


 Pierre: do you recall how and when that got set?


 I am not sure I understand, but if I remember correctly, after the
 update to firefly I was in the state HEALTH_WARN crush map has legacy
 tunables, and I saw feature set mismatch in the logs.

 So, if I remember correctly, I ran ceph osd crush tunables optimal for
 the crush map problem, and I updated my client and server kernels to
 3.16rc.

 Could that be the cause?

 Pierre

  -Sam

 On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com
 wrote:

 Yeah, divergent osdmaps:
 555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
 6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

 Joao: thoughts?
 -Sam

 On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 The files

 When I upgrade :
   ceph-deploy install --stable firefly servers...
   on each servers service ceph restart mon
   on each servers service ceph restart osd
   on each servers service ceph restart mds

 I upgraded from emperor to firefly. After repair, remap, replace,
 etc ... I
 have some PG which pass in peering state.

 I thought why not try the version 0.82, it could solve my problem. (
 It's my mistake ). So, I upgrade from firefly to 0.83 with :
   ceph-deploy install --testing servers...
   ..

 Now, all programs are in version 0.82.
 I have 3 mons, 36 OSD and 3 mds.

 Pierre

 PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta
 directory.

 Le 03/07/2014 00:10, Samuel Just a écrit :

  Also, what version did you upgrade from, and how did you upgrade?
 -Sam

 On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com
 wrote:


 Ok, in current/meta on osd 20 and osd 23, please attach all files
 matching

 ^osdmap.13258.*

 There should be one such file on each osd. (should look something
 like
 osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
 you'll want to use find).

 What version of ceph is running on your mons?  How many mons do
 you have?
 -Sam

 On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 Hi,

 I do it, the log files are available here :
 https://blondeau.users.greyc.fr/cephlog/debug20/

 The OSD's files are really big +/- 80M .

 After starting the osd.20 some other osd crash. I pass from 31
 osd up to
 16.
 I remark that after this the number of down+peering PG decrease
 from 367
 to
 248. It's normal ? May be it's temporary, the time that the
 cluster
 verifies all the PG ?

 Regards
 Pierre

 Le 02/07/2014 19:16, Samuel Just a écrit :

  You should add

 debug osd = 20
 debug filestore = 20
 debug ms = 1

 to the [osd] section of the ceph.conf and restart the osds.  I'd
 like
 all three logs if possible.

 Thanks
 -Sam

 On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:



 Yes, but how i do that ?

 With a command like that ?

 ceph tell osd.20 injectargs '--debug-osd 20 

Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Pierre BLONDEAU

Yes, but how do I do that?

With a command like this?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
--debug-ms 1'


Or by modifying /etc/ceph/ceph.conf? This file is really sparse because
I use udev detection.


Once I have made these changes, do you want all three log files or only
osd.20's?


Thank you so much for the help

Regards
Pierre

Le 01/07/2014 23:51, Samuel Just a écrit :

Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:

Hi,

I attach:
  - osd.20, one of the OSDs that I detected makes other OSDs crash.
  - osd.23, one of the OSDs which crashes when I start osd.20
  - mds, one of my MDS daemons

I cut the log files because they are too big. Everything is here:
https://blondeau.users.greyc.fr/cephlog/

Regards

Le 30/06/2014 17:35, Gregory Farnum a écrit :


What's the backtrace from the crashing OSDs?

Keep in mind that as a dev release, it's generally best not to upgrade
to unnamed versions like 0.82 (but it's probably too late to go back
now).



I will remember it the next time ;)



-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:


Hi,

After the upgrade to firefly, I have some PGs in the peering state.
I saw that 0.82 was out, so I tried upgrading to solve my problem.

My three MDS daemons crash, and some OSDs trigger a chain reaction that
kills other OSDs.
I think my MDS daemons will not start because their metadata are on the
OSDs.

I have 36 OSDs on three servers and I identified 5 OSDs which make the
others crash. If I do not start them, the cluster goes into a recovery
state with 31 OSDs, but I have 378 PGs in the down+peering state.

What can I do? Would you like more information (OS, crash logs, etc.)?

Regards

--
--
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel : 02 31 56 75 42
bureau  : Campus 2, Science 3, 406
--


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
--
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel : 02 31 56 75 42
bureau  : Campus 2, Science 3, 406
--

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
--
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel : 02 31 56 75 42
bureau  : Campus 2, Science 3, 406
--





Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Samuel Just
You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds.  I'd like
all three logs if possible.
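In ceph.conf that would look like:

  [osd]
      debug osd = 20
      debug filestore = 20
      debug ms = 1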

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:
 Yes, but how i do that ?

 With a command like that ?

 ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms
 1'

 By modify the /etc/ceph/ceph.conf ? This file is really poor because I use
 udev detection.

 When I have made these changes, you want the three log files or only
 osd.20's ?

 Thank you so much for the help

 Regards
 Pierre

 Le 01/07/2014 23:51, Samuel Just a écrit :

 Can you reproduce with
 debug osd = 20
 debug filestore = 20
 debug ms = 1
 ?
 -Sam

 On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Hi,

 I join :
   - osd.20 is one of osd that I detect which makes crash other OSD.
   - osd.23 is one of osd which crash when i start osd.20
   - mds, is one of my MDS

 I cut log file because they are to big but. All is here :
 https://blondeau.users.greyc.fr/cephlog/

 Regards

 Le 30/06/2014 17:35, Gregory Farnum a écrit :

 What's the backtrace from the crashing OSDs?

 Keep in mind that as a dev release, it's generally best not to upgrade
 to unnamed versions like 0.82 (but it's probably too late to go back
 now).



 I will remember it the next time ;)


 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 Hi,

 After the upgrade to firefly, I have some PG in peering state.
 I seen the output of 0.82 so I try to upgrade for solved my problem.

 My three MDS crash and some OSD triggers a chain reaction that kills
 other
 OSD.
 I think my MDS will not start because of the metadata are on the OSD.

 I have 36 OSD on three servers and I identified 5 OSD which makes crash
 others. If i not start their, the cluster passe in reconstructive state
 with
 31 OSD but i have 378 in down+peering state.

 How can I do ? Would you more information ( os, crash log, etc ... ) ?

 Regards

 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes & réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes & réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes  réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --



Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Pierre BLONDEAU

Hi,

I did it; the log files are available here:
https://blondeau.users.greyc.fr/cephlog/debug20/


The OSDs' log files are really big, +/- 80M.

After starting osd.20, some other OSDs crash; I drop from 31 OSDs to
16. I notice that after this the number of down+peering PGs decreases
from 367 to 248. Is that normal? Maybe it's temporary, the time it takes
the cluster to verify all the PGs?


Regards
Pierre

Le 02/07/2014 19:16, Samuel Just a écrit :

You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds.  I'd like
all three logs if possible.

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:

Yes, but how i do that ?

With a command like that ?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms
1'

By modify the /etc/ceph/ceph.conf ? This file is really poor because I use
udev detection.

When I have made these changes, you want the three log files or only
osd.20's ?

Thank you so much for the help

Regards
Pierre

Le 01/07/2014 23:51, Samuel Just a écrit :


Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:


Hi,

I join :
   - osd.20 is one of osd that I detect which makes crash other OSD.
   - osd.23 is one of osd which crash when i start osd.20
   - mds, is one of my MDS

I cut log file because they are to big but. All is here :
https://blondeau.users.greyc.fr/cephlog/

Regards

Le 30/06/2014 17:35, Gregory Farnum a écrit :


What's the backtrace from the crashing OSDs?

Keep in mind that as a dev release, it's generally best not to upgrade
to unnamed versions like 0.82 (but it's probably too late to go back
now).


I will remember it the next time ;)


-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:

Hi,

After the upgrade to firefly, I have some PG in peering state.
I seen the output of 0.82 so I try to upgrade for solved my problem.

My three MDS crash and some OSD triggers a chain reaction that kills
other
OSD.
I think my MDS will not start because of the metadata are on the OSD.

I have 36 OSD on three servers and I identified 5 OSD which makes crash
others. If i not start their, the cluster passe in reconstructive state
with
31 OSD but i have 378 in down+peering state.

How can I do ? Would you more information ( os, crash log, etc ... ) ?

Regards


--
--
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel : 02 31 56 75 42
bureau  : Campus 2, Science 3, 406
--





Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Samuel Just
Ok, in current/meta on osd 20 and osd 23, please attach all files matching

^osdmap.13258.*

There should be one such file on each osd. (should look something like
osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
you'll want to use find).
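For example, assuming the default data path (adjust if your OSDs live
elsewhere):

  find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*'
  find /var/lib/ceph/osd/ceph-23/current/meta -name 'osdmap.13258*'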

What version of ceph is running on your mons?  How many mons do you have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:
 Hi,

 I do it, the log files are available here :
 https://blondeau.users.greyc.fr/cephlog/debug20/

 The OSD's files are really big +/- 80M .

 After starting the osd.20 some other osd crash. I pass from 31 osd up to 16.
 I remark that after this the number of down+peering PG decrease from 367 to
 248. It's normal ? May be it's temporary, the time that the cluster
 verifies all the PG ?

 Regards
 Pierre

 Le 02/07/2014 19:16, Samuel Just a écrit :

 You should add

 debug osd = 20
 debug filestore = 20
 debug ms = 1

 to the [osd] section of the ceph.conf and restart the osds.  I'd like
 all three logs if possible.

 Thanks
 -Sam

 On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Yes, but how i do that ?

 With a command like that ?

 ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
 --debug-ms
 1'

 By modify the /etc/ceph/ceph.conf ? This file is really poor because I
 use
 udev detection.

 When I have made these changes, you want the three log files or only
 osd.20's ?

 Thank you so much for the help

 Regards
 Pierre

 Le 01/07/2014 23:51, Samuel Just a écrit :

 Can you reproduce with
 debug osd = 20
 debug filestore = 20
 debug ms = 1
 ?
 -Sam

 On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 Hi,

 I join :
- osd.20 is one of osd that I detect which makes crash other OSD.
- osd.23 is one of osd which crash when i start osd.20
- mds, is one of my MDS

 I cut log file because they are to big but. All is here :
 https://blondeau.users.greyc.fr/cephlog/

 Regards

 Le 30/06/2014 17:35, Gregory Farnum a écrit :

 What's the backtrace from the crashing OSDs?

 Keep in mind that as a dev release, it's generally best not to upgrade
 to unnamed versions like 0.82 (but it's probably too late to go back
 now).


 I will remember it the next time ;)

 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com

 On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Hi,

 After the upgrade to firefly, I have some PG in peering state.
 I seen the output of 0.82 so I try to upgrade for solved my problem.

 My three MDS crash and some OSD triggers a chain reaction that kills
 other
 OSD.
 I think my MDS will not start because of the metadata are on the OSD.

 I have 36 OSD on three servers and I identified 5 OSD which makes
 crash
 others. If i not start their, the cluster passe in reconstructive
 state
 with
 31 OSD but i have 378 in down+peering state.

 How can I do ? Would you more information ( os, crash log, etc ... )
 ?

 Regards


 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes  réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --



Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Samuel Just
Also, what version did you upgrade from, and how did you upgrade?
-Sam
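
Since running daemons can lag behind the installed packages after an upgrade, it may be worth comparing both, for example (the mon socket name here is taken from later messages in this thread):

# ceph -v
# ceph --admin-daemon /var/run/ceph/ceph-mon.william.asok version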

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com wrote:
 Ok, in current/meta on osd 20 and osd 23, please attach all files matching

 ^osdmap.13258.*

 There should be one such file on each osd. (should look something like
 osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
 you'll want to use find).

 What version of ceph is running on your mons?  How many mons do you have?
 -Sam

 On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:
 Hi,

 I do it, the log files are available here :
 https://blondeau.users.greyc.fr/cephlog/debug20/

 The OSD's files are really big +/- 80M .

 After starting the osd.20 some other osd crash. I pass from 31 osd up to 16.
 I remark that after this the number of down+peering PG decrease from 367 to
 248. It's normal ? May be it's temporary, the time that the cluster
 verifies all the PG ?

 Regards
 Pierre

 Le 02/07/2014 19:16, Samuel Just a écrit :

 You should add

 debug osd = 20
 debug filestore = 20
 debug ms = 1

 to the [osd] section of the ceph.conf and restart the osds.  I'd like
 all three logs if possible.

 Thanks
 -Sam

 On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Yes, but how i do that ?

 With a command like that ?

 ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
 --debug-ms
 1'

 By modify the /etc/ceph/ceph.conf ? This file is really poor because I
 use
 udev detection.

 When I have made these changes, you want the three log files or only
 osd.20's ?

 Thank you so much for the help

 Regards
 Pierre

 Le 01/07/2014 23:51, Samuel Just a écrit :

 Can you reproduce with
 debug osd = 20
 debug filestore = 20
 debug ms = 1
 ?
 -Sam

 On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 Hi,

 I join :
- osd.20 is one of osd that I detect which makes crash other OSD.
- osd.23 is one of osd which crash when i start osd.20
- mds, is one of my MDS

 I cut log file because they are to big but. All is here :
 https://blondeau.users.greyc.fr/cephlog/

 Regards

 Le 30/06/2014 17:35, Gregory Farnum a écrit :

 What's the backtrace from the crashing OSDs?

 Keep in mind that as a dev release, it's generally best not to upgrade
 to unnamed versions like 0.82 (but it's probably too late to go back
 now).


 I will remember it the next time ;)

 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com

 On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Hi,

 After the upgrade to firefly, I have some PG in peering state.
 I seen the output of 0.82 so I try to upgrade for solved my problem.

 My three MDS crash and some OSD triggers a chain reaction that kills
 other
 OSD.
 I think my MDS will not start because of the metadata are on the OSD.

 I have 36 OSD on three servers and I identified 5 OSD which makes
 crash
 others. If i not start their, the cluster passe in reconstructive
 state
 with
 31 OSD but i have 378 in down+peering state.

 How can I do ? Would you more information ( os, crash log, etc ... )
 ?

 Regards


 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes  réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --



Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Samuel Just
Joao: this looks like divergent osdmaps, osd 20 and osd 23 have
differing ideas of the acting set for pg 2.11.  Did we add hashes to
the incremental maps?  What would you want to know from the mons?
-Sam
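
One way to see the divergence directly, once the two osdmap files have been extracted (named here as in the later md5sum comparison), is to map the pg with each copy, for example:

# osdmaptool osd-20_osdmap.13258__0_4E62BB79__none --test-map-pg 2.11
# osdmaptool osd-23_osdmap.13258__0_4E62BB79__none --test-map-pg 2.11

If the two copies really disagree, the reported up/acting sets should differ.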

On Wed, Jul 2, 2014 at 3:10 PM, Samuel Just sam.j...@inktank.com wrote:
 Also, what version did you upgrade from, and how did you upgrade?
 -Sam

 On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com wrote:
 Ok, in current/meta on osd 20 and osd 23, please attach all files matching

 ^osdmap.13258.*

 There should be one such file on each osd. (should look something like
 osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
 you'll want to use find).

 What version of ceph is running on your mons?  How many mons do you have?
 -Sam

 On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:
 Hi,

 I do it, the log files are available here :
 https://blondeau.users.greyc.fr/cephlog/debug20/

 The OSD's files are really big +/- 80M .

 After starting the osd.20 some other osd crash. I pass from 31 osd up to 16.
 I remark that after this the number of down+peering PG decrease from 367 to
 248. It's normal ? May be it's temporary, the time that the cluster
 verifies all the PG ?

 Regards
 Pierre

 Le 02/07/2014 19:16, Samuel Just a écrit :

 You should add

 debug osd = 20
 debug filestore = 20
 debug ms = 1

 to the [osd] section of the ceph.conf and restart the osds.  I'd like
 all three logs if possible.

 Thanks
 -Sam

 On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Yes, but how i do that ?

 With a command like that ?

 ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
 --debug-ms
 1'

 By modify the /etc/ceph/ceph.conf ? This file is really poor because I
 use
 udev detection.

 When I have made these changes, you want the three log files or only
 osd.20's ?

 Thank you so much for the help

 Regards
 Pierre

 Le 01/07/2014 23:51, Samuel Just a écrit :

 Can you reproduce with
 debug osd = 20
 debug filestore = 20
 debug ms = 1
 ?
 -Sam

 On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 Hi,

 I join :
- osd.20 is one of osd that I detect which makes crash other OSD.
- osd.23 is one of osd which crash when i start osd.20
- mds, is one of my MDS

 I cut log file because they are to big but. All is here :
 https://blondeau.users.greyc.fr/cephlog/

 Regards

 Le 30/06/2014 17:35, Gregory Farnum a écrit :

 What's the backtrace from the crashing OSDs?

 Keep in mind that as a dev release, it's generally best not to upgrade
 to unnamed versions like 0.82 (but it's probably too late to go back
 now).


 I will remember it the next time ;)

 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com

 On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Hi,

 After the upgrade to firefly, I have some PG in peering state.
 I seen the output of 0.82 so I try to upgrade for solved my problem.

 My three MDS crash and some OSD triggers a chain reaction that kills
 other
 OSD.
 I think my MDS will not start because of the metadata are on the OSD.

 I have 36 OSD on three servers and I identified 5 OSD which makes
 crash
 others. If i not start their, the cluster passe in reconstructive
 state
 with
 31 OSD but i have 378 in down+peering state.

 How can I do ? Would you more information ( os, crash log, etc ... )
 ?

 Regards


 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes  réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --



Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Samuel Just
Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none
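
For reference, checksums like these can be reproduced with a plain md5sum over the two extracted files:

# md5sum osd-20_osdmap.13258__0_4E62BB79__none osd-23_osdmap.13258__0_4E62BB79__none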

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:
 The files

 When I upgrade :
  ceph-deploy install --stable firefly servers...
  on each servers service ceph restart mon
  on each servers service ceph restart osd
  on each servers service ceph restart mds

 I upgraded from emperor to firefly. After repair, remap, replace, etc ... I
 have some PG which pass in peering state.

 I thought why not try the version 0.82, it could solve my problem. (
 It's my mistake ). So, I upgrade from firefly to 0.83 with :
  ceph-deploy install --testing servers...
  ..

 Now, all programs are in version 0.82.
 I have 3 mons, 36 OSD and 3 mds.

 Pierre

 PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta
 directory.

 Le 03/07/2014 00:10, Samuel Just a écrit :

 Also, what version did you upgrade from, and how did you upgrade?
 -Sam

 On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com wrote:

 Ok, in current/meta on osd 20 and osd 23, please attach all files
 matching

 ^osdmap.13258.*

 There should be one such file on each osd. (should look something like
 osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
 you'll want to use find).

 What version of ceph is running on your mons?  How many mons do you have?
 -Sam

 On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Hi,

 I do it, the log files are available here :
 https://blondeau.users.greyc.fr/cephlog/debug20/

 The OSD's files are really big +/- 80M .

 After starting the osd.20 some other osd crash. I pass from 31 osd up to
 16.
 I remark that after this the number of down+peering PG decrease from 367
 to
 248. It's normal ? May be it's temporary, the time that the cluster
 verifies all the PG ?

 Regards
 Pierre

 Le 02/07/2014 19:16, Samuel Just a écrit :

 You should add

 debug osd = 20
 debug filestore = 20
 debug ms = 1

 to the [osd] section of the ceph.conf and restart the osds.  I'd like
 all three logs if possible.

 Thanks
 -Sam

 On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 Yes, but how i do that ?

 With a command like that ?

 ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
 --debug-ms
 1'

 By modify the /etc/ceph/ceph.conf ? This file is really poor because I
 use
 udev detection.

 When I have made these changes, you want the three log files or only
 osd.20's ?

 Thank you so much for the help

 Regards
 Pierre

 Le 01/07/2014 23:51, Samuel Just a écrit :

 Can you reproduce with
 debug osd = 20
 debug filestore = 20
 debug ms = 1
 ?
 -Sam

 On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:



 Hi,

 I join :
 - osd.20 is one of osd that I detect which makes crash other
 OSD.
 - osd.23 is one of osd which crash when i start osd.20
 - mds, is one of my MDS

 I cut log file because they are to big but. All is here :
 https://blondeau.users.greyc.fr/cephlog/

 Regards

 Le 30/06/2014 17:35, Gregory Farnum a écrit :

 What's the backtrace from the crashing OSDs?

 Keep in mind that as a dev release, it's generally best not to
 upgrade
 to unnamed versions like 0.82 (but it's probably too late to go
 back
 now).



 I will remember it the next time ;)

 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com

 On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 Hi,

 After the upgrade to firefly, I have some PG in peering state.
 I seen the output of 0.82 so I try to upgrade for solved my
 problem.

 My three MDS crash and some OSD triggers a chain reaction that
 kills
 other
 OSD.
 I think my MDS will not start because of the metadata are on the
 OSD.

 I have 36 OSD on three servers and I identified 5 OSD which makes
 crash
 others. If i not start their, the cluster passe in reconstructive
 state
 with
 31 OSD but i have 378 in down+peering state.

 How can I do ? Would you more information ( os, crash log, etc ...
 )
 ?

 Regards



 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes  réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --



 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes  réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --


Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Samuel Just
Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

Looks like the chooseleaf_vary_r tunable somehow ended up divergent?
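
The same difference should also show up by grepping the decompiled maps from the loop above:

# grep tunable /tmp/crush20.d
# grep tunable /tmp/crush23.d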

Pierre: do you recall how and when that got set?
-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote:
 Yeah, divergent osdmaps:
 555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
 6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

 Joao: thoughts?
 -Sam

 On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:
 The files

 When I upgrade :
  ceph-deploy install --stable firefly servers...
  on each servers service ceph restart mon
  on each servers service ceph restart osd
  on each servers service ceph restart mds

 I upgraded from emperor to firefly. After repair, remap, replace, etc ... I
 have some PG which pass in peering state.

 I thought why not try the version 0.82, it could solve my problem. (
 It's my mistake ). So, I upgrade from firefly to 0.83 with :
  ceph-deploy install --testing servers...
  ..

 Now, all programs are in version 0.82.
 I have 3 mons, 36 OSD and 3 mds.

 Pierre

 PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta
 directory.

 Le 03/07/2014 00:10, Samuel Just a écrit :

 Also, what version did you upgrade from, and how did you upgrade?
 -Sam

 On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com wrote:

 Ok, in current/meta on osd 20 and osd 23, please attach all files
 matching

 ^osdmap.13258.*

 There should be one such file on each osd. (should look something like
 osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
 you'll want to use find).

 What version of ceph is running on your mons?  How many mons do you have?
 -Sam

 On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Hi,

 I do it, the log files are available here :
 https://blondeau.users.greyc.fr/cephlog/debug20/

 The OSD's files are really big +/- 80M .

 After starting the osd.20 some other osd crash. I pass from 31 osd up to
 16.
 I remark that after this the number of down+peering PG decrease from 367
 to
 248. It's normal ? May be it's temporary, the time that the cluster
 verifies all the PG ?

 Regards
 Pierre

 Le 02/07/2014 19:16, Samuel Just a écrit :

 You should add

 debug osd = 20
 debug filestore = 20
 debug ms = 1

 to the [osd] section of the ceph.conf and restart the osds.  I'd like
 all three logs if possible.

 Thanks
 -Sam

 On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 Yes, but how i do that ?

 With a command like that ?

 ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
 --debug-ms
 1'

 By modify the /etc/ceph/ceph.conf ? This file is really poor because I
 use
 udev detection.

 When I have made these changes, you want the three log files or only
 osd.20's ?

 Thank you so much for the help

 Regards
 Pierre

 Le 01/07/2014 23:51, Samuel Just a écrit :

 Can you reproduce with
 debug osd = 20
 debug filestore = 20
 debug ms = 1
 ?
 -Sam

 On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:



 Hi,

 I join :
 - osd.20 is one of osd that I detect which makes crash other
 OSD.
 - osd.23 is one of osd which crash when i start osd.20
 - mds, is one of my MDS

 I cut log file because they are to big but. All is here :
 https://blondeau.users.greyc.fr/cephlog/

 Regards

 Le 30/06/2014 17:35, Gregory Farnum a écrit :

 What's the backtrace from the crashing OSDs?

 Keep in mind that as a dev release, it's generally best not to
 upgrade
 to unnamed versions like 0.82 (but it's probably too late to go
 back
 now).



 I will remember it the next time ;)

 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com

 On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 Hi,

 After the upgrade to firefly, I have some PG in peering state.
 I seen the output of 0.82 so I try to upgrade for solved my
 problem.

 My three MDS crash and some OSD triggers a chain reaction that
 kills
 other
 OSD.
 I think my MDS will not start because of the metadata are on the
 OSD.

 I have 36 OSD on three servers and I identified 5 OSD which makes
 crash
 others. If i not start their, the cluster passe in reconstructive
 state
 with
 31 OSD but i have 378 in down+peering state.

 How can I do ? Would you more information ( os, crash log, etc ...
 )
 ?

 Regards



 --
 --
 Pierre BLONDEAU
 Administrateur 

Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Pierre BLONDEAU

Le 03/07/2014 00:55, Samuel Just a écrit :

Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

Looks like the chooseleaf_vary_r tunable somehow ended up divergent?

Pierre: do you recall how and when that got set?


I am not sure I understand, but if I remember correctly, after the update to 
firefly the cluster was in the state HEALTH_WARN crush map has legacy tunables, 
and I saw feature set mismatch messages in the logs.

So, if I remember correctly, I ran ceph osd crush tunables optimal to address 
the crush map warning, and I updated my client and server kernels to 3.16rc.

Could that be the cause?

Pierre
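
If it helps, the tunables the cluster currently advertises can be dumped with (assuming a firefly-or-later client):

# ceph osd crush show-tunables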


-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:

The files

When I upgrade :
  ceph-deploy install --stable firefly servers...
  on each servers service ceph restart mon
  on each servers service ceph restart osd
  on each servers service ceph restart mds

I upgraded from emperor to firefly. After repair, remap, replace, etc ... I
have some PG which pass in peering state.

I thought why not try the version 0.82, it could solve my problem. (
It's my mistake ). So, I upgrade from firefly to 0.83 with :
  ceph-deploy install --testing servers...
  ..

Now, all programs are in version 0.82.
I have 3 mons, 36 OSD and 3 mds.

Pierre

PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta
directory.

Le 03/07/2014 00:10, Samuel Just a écrit :


Also, what version did you upgrade from, and how did you upgrade?
-Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com wrote:


Ok, in current/meta on osd 20 and osd 23, please attach all files
matching

^osdmap.13258.*

There should be one such file on each osd. (should look something like
osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
you'll want to use find).

What version of ceph is running on your mons?  How many mons do you have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:


Hi,

I do it, the log files are available here :
https://blondeau.users.greyc.fr/cephlog/debug20/

The OSD's files are really big +/- 80M .

After starting the osd.20 some other osd crash. I pass from 31 osd up to
16.
I remark that after this the number of down+peering PG decrease from 367
to
248. It's normal ? May be it's temporary, the time that the cluster
verifies all the PG ?

Regards
Pierre

Le 02/07/2014 19:16, Samuel Just a écrit :


You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds.  I'd like
all three logs if possible.

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:



Yes, but how i do that ?

With a command like that ?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
--debug-ms
1'

By modify the /etc/ceph/ceph.conf ? This file is really poor because I
use
udev detection.

When I have made these changes, you want the three log files or only
osd.20's ?

Thank you so much for the help

Regards
Pierre

Le 01/07/2014 23:51, Samuel Just a écrit :


Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:




Hi,

I join :
 - osd.20 is one of osd that I detect which makes crash other
OSD.
 - osd.23 is one of osd which crash when i start osd.20
 - mds, is one of my MDS

I cut log file because they are to big but. All is here :
https://blondeau.users.greyc.fr/cephlog/

Regards

Le 30/06/2014 17:35, Gregory Farnum a écrit :


What's the backtrace from the crashing OSDs?

Keep in mind that as a dev release, it's generally best not to
upgrade
to unnamed versions like 0.82 (but it's probably too late to go
back
now).




I will remember it the next time ;)


-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:



Hi,

After the upgrade to firefly, I have some PG in peering state.
I seen the output of 0.82 so I try to upgrade for solved my
problem.

My three MDS crash and some OSD triggers a chain reaction that
kills
other
OSD.
I think my MDS will not start because of the metadata are on the
OSD.

I have 36 OSD on three servers and I identified 5 OSD which makes

Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Samuel Just
Can you confirm from the admin socket that all monitors are running
the same version?
-Sam

On Wed, Jul 2, 2014 at 4:15 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:
 Le 03/07/2014 00:55, Samuel Just a écrit :

 Ah,

 ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
 /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
 /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
 ../ceph/src/osdmaptool: osdmap file
 'osd-20_osdmap.13258__0_4E62BB79__none'
 ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
 ../ceph/src/osdmaptool: osdmap file
 'osd-23_osdmap.13258__0_4E62BB79__none'
 ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
 6d5
 < tunable chooseleaf_vary_r 1

 Looks like the chooseleaf_vary_r tunable somehow ended up divergent?

 Pierre: do you recall how and when that got set?


 I am not sure to understand, but if I good remember after the update in
 firefly, I was in state : HEALTH_WARN crush map has legacy tunables and I
 see feature set mismatch in log.

 So if I good remeber, i do : ceph osd crush tunables optimal for the problem
 of crush map and I update my client and server kernel to 3.16rc.

 It's could be that ?

 Pierre


 -Sam

 On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote:

 Yeah, divergent osdmaps:
 555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
 6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

 Joao: thoughts?
 -Sam

 On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 The files

 When I upgrade :
   ceph-deploy install --stable firefly servers...
   on each servers service ceph restart mon
   on each servers service ceph restart osd
   on each servers service ceph restart mds

 I upgraded from emperor to firefly. After repair, remap, replace, etc
 ... I
 have some PG which pass in peering state.

 I thought why not try the version 0.82, it could solve my problem. (
 It's my mistake ). So, I upgrade from firefly to 0.83 with :
   ceph-deploy install --testing servers...
   ..

 Now, all programs are in version 0.82.
 I have 3 mons, 36 OSD and 3 mds.

 Pierre

 PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta
 directory.

 Le 03/07/2014 00:10, Samuel Just a écrit :

 Also, what version did you upgrade from, and how did you upgrade?
 -Sam

 On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com
 wrote:


 Ok, in current/meta on osd 20 and osd 23, please attach all files
 matching

 ^osdmap.13258.*

 There should be one such file on each osd. (should look something like
 osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
 you'll want to use find).

 What version of ceph is running on your mons?  How many mons do you
 have?
 -Sam

 On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 Hi,

 I do it, the log files are available here :
 https://blondeau.users.greyc.fr/cephlog/debug20/

 The OSD's files are really big +/- 80M .

 After starting the osd.20 some other osd crash. I pass from 31 osd up
 to
 16.
 I remark that after this the number of down+peering PG decrease from
 367
 to
 248. It's normal ? May be it's temporary, the time that the cluster
 verifies all the PG ?

 Regards
 Pierre

 Le 02/07/2014 19:16, Samuel Just a écrit :

 You should add

 debug osd = 20
 debug filestore = 20
 debug ms = 1

 to the [osd] section of the ceph.conf and restart the osds.  I'd
 like
 all three logs if possible.

 Thanks
 -Sam

 On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:



 Yes, but how i do that ?

 With a command like that ?

 ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
 --debug-ms
 1'

 By modify the /etc/ceph/ceph.conf ? This file is really poor
 because I
 use
 udev detection.

 When I have made these changes, you want the three log files or
 only
 osd.20's ?

 Thank you so much for the help

 Regards
 Pierre

 Le 01/07/2014 23:51, Samuel Just a écrit :

 Can you reproduce with
 debug osd = 20
 debug filestore = 20
 debug ms = 1
 ?
 -Sam

 On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:




 Hi,

 I join :
  - osd.20 is one of osd that I detect which makes crash other
 OSD.
  - osd.23 is one of osd which crash when i start osd.20
  - mds, is one of my MDS

 I cut log file because they are to big but. All is here :
 https://blondeau.users.greyc.fr/cephlog/

 Regards

 Le 30/06/2014 17:35, Gregory Farnum a écrit :

 What's the backtrace from the crashing OSDs?

 Keep in mind that as a dev release, it's generally best not to
 upgrade
 to unnamed versions like 0.82 (but it's probably too late to go
 back
 now).




 I will remember it the next time ;)

 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com

 On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:



 Hi,

 After the upgrade to firefly, I have some PG in 

Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Pierre BLONDEAU

Like this?

# ceph --admin-daemon /var/run/ceph/ceph-mon.william.asok version
{"version":"0.82"}
# ceph --admin-daemon /var/run/ceph/ceph-mon.jack.asok version
{"version":"0.82"}
# ceph --admin-daemon /var/run/ceph/ceph-mon.joe.asok version
{"version":"0.82"}

Pierre
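
The same admin-socket check can be run against the OSD and MDS daemons on their hosts, assuming the default socket paths, for example:

# ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok version
# ceph --admin-daemon /var/run/ceph/ceph-mds.<id>.asok version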

Le 03/07/2014 01:17, Samuel Just a écrit :

Can you confirm from the admin socket that all monitors are running
the same version?
-Sam

On Wed, Jul 2, 2014 at 4:15 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:

Le 03/07/2014 00:55, Samuel Just a écrit :


Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

Looks like the chooseleaf_vary_r tunable somehow ended up divergent?

Pierre: do you recall how and when that got set?



I am not sure to understand, but if I good remember after the update in
firefly, I was in state : HEALTH_WARN crush map has legacy tunables and I
see feature set mismatch in log.

So if I good remeber, i do : ceph osd crush tunables optimal for the problem
of crush map and I update my client and server kernel to 3.16rc.

It's could be that ?

Pierre



-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote:


Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:


The files

When I upgrade :
   ceph-deploy install --stable firefly servers...
   on each servers service ceph restart mon
   on each servers service ceph restart osd
   on each servers service ceph restart mds

I upgraded from emperor to firefly. After repair, remap, replace, etc
... I
have some PG which pass in peering state.

I thought why not try the version 0.82, it could solve my problem. (
It's my mistake ). So, I upgrade from firefly to 0.83 with :
   ceph-deploy install --testing servers...
   ..

Now, all programs are in version 0.82.
I have 3 mons, 36 OSD and 3 mds.

Pierre

PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta
directory.

Le 03/07/2014 00:10, Samuel Just a écrit :


Also, what version did you upgrade from, and how did you upgrade?
-Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com
wrote:



Ok, in current/meta on osd 20 and osd 23, please attach all files
matching

^osdmap.13258.*

There should be one such file on each osd. (should look something like
osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
you'll want to use find).

What version of ceph is running on your mons?  How many mons do you
have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:



Hi,

I do it, the log files are available here :
https://blondeau.users.greyc.fr/cephlog/debug20/

The OSD's files are really big +/- 80M .

After starting the osd.20 some other osd crash. I pass from 31 osd up
to
16.
I remark that after this the number of down+peering PG decrease from
367
to
248. It's normal ? May be it's temporary, the time that the cluster
verifies all the PG ?

Regards
Pierre

Le 02/07/2014 19:16, Samuel Just a écrit :


You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds.  I'd
like
all three logs if possible.

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:




Yes, but how i do that ?

With a command like that ?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
--debug-ms
1'

By modify the /etc/ceph/ceph.conf ? This file is really poor
because I
use
udev detection.

When I have made these changes, you want the three log files or
only
osd.20's ?

Thank you so much for the help

Regards
Pierre

Le 01/07/2014 23:51, Samuel Just a écrit :


Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:





Hi,

I join :
  - osd.20 is one of osd that I detect which makes crash other
OSD.
  - osd.23 is one of osd which crash when i start osd.20
  - mds, is one of my MDS

I cut log file because they are to big but. All is here :
https://blondeau.users.greyc.fr/cephlog/

Regards

Le 30/06/2014 17:35, Gregory Farnum a écrit :


What's the backtrace from the crashing OSDs?

Keep in mind that as a dev release, it's generally best not to
upgrade
to unnamed versions like 0.82 (but it's probably too late to go
back
now).





I will remember it the next time ;)


-Greg
Software 

Re: [ceph-users] Some OSD and MDS crash

2014-07-02 Thread Samuel Just
Yes, thanks.
-Sam

On Wed, Jul 2, 2014 at 4:21 PM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:
 Like that ?

 # ceph --admin-daemon /var/run/ceph/ceph-mon.william.asok version
 {"version":"0.82"}
 # ceph --admin-daemon /var/run/ceph/ceph-mon.jack.asok version
 {"version":"0.82"}
 # ceph --admin-daemon /var/run/ceph/ceph-mon.joe.asok version
 {"version":"0.82"}

 Pierre

 Le 03/07/2014 01:17, Samuel Just a écrit :

 Can you confirm from the admin socket that all monitors are running
 the same version?
 -Sam

 On Wed, Jul 2, 2014 at 4:15 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Le 03/07/2014 00:55, Samuel Just a écrit :

 Ah,

 ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
 /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
 /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
 ../ceph/src/osdmaptool: osdmap file
 'osd-20_osdmap.13258__0_4E62BB79__none'
 ../ceph/src/osdmaptool: exported crush map to /tmp/crush20
 ../ceph/src/osdmaptool: osdmap file
 'osd-23_osdmap.13258__0_4E62BB79__none'
 ../ceph/src/osdmaptool: exported crush map to /tmp/crush23
 6d5
 < tunable chooseleaf_vary_r 1

 Looks like the chooseleaf_vary_r tunable somehow ended up divergent?

 Pierre: do you recall how and when that got set?



 I am not sure to understand, but if I good remember after the update in
 firefly, I was in state : HEALTH_WARN crush map has legacy tunables and I
 see feature set mismatch in log.

 So if I good remeber, i do : ceph osd crush tunables optimal for the
 problem
 of crush map and I update my client and server kernel to 3.16rc.

 It's could be that ?

 Pierre


 -Sam

 On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com
 wrote:


 Yeah, divergent osdmaps:
 555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
 6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none

 Joao: thoughts?
 -Sam

 On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:


 The files

 When I upgrade :
ceph-deploy install --stable firefly servers...
on each servers service ceph restart mon
on each servers service ceph restart osd
on each servers service ceph restart mds

 I upgraded from emperor to firefly. After repair, remap, replace, etc
 ... I
 have some PG which pass in peering state.

 I thought why not try the version 0.82, it could solve my problem. (
 It's my mistake ). So, I upgrade from firefly to 0.83 with :
ceph-deploy install --testing servers...
..

 Now, all programs are in version 0.82.
 I have 3 mons, 36 OSD and 3 mds.

 Pierre

 PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta
 directory.

 Le 03/07/2014 00:10, Samuel Just a écrit :

 Also, what version did you upgrade from, and how did you upgrade?
 -Sam

 On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com
 wrote:



 Ok, in current/meta on osd 20 and osd 23, please attach all files
 matching

 ^osdmap.13258.*

 There should be one such file on each osd. (should look something
 like
 osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory,
 you'll want to use find).

 What version of ceph is running on your mons?  How many mons do you
 have?
 -Sam

 On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:



 Hi,

 I do it, the log files are available here :
 https://blondeau.users.greyc.fr/cephlog/debug20/

 The OSD's files are really big +/- 80M .

 After starting the osd.20 some other osd crash. I pass from 31 osd
 up
 to
 16.
 I remark that after this the number of down+peering PG decrease
 from
 367
 to
 248. It's normal ? May be it's temporary, the time that the
 cluster
 verifies all the PG ?

 Regards
 Pierre

 Le 02/07/2014 19:16, Samuel Just a écrit :

 You should add

 debug osd = 20
 debug filestore = 20
 debug ms = 1

 to the [osd] section of the ceph.conf and restart the osds.  I'd
 like
 all three logs if possible.

 Thanks
 -Sam

 On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:




 Yes, but how i do that ?

 With a command like that ?

 ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
 --debug-ms
 1'

 By modify the /etc/ceph/ceph.conf ? This file is really poor
 because I
 use
 udev detection.

 When I have made these changes, you want the three log files or
 only
 osd.20's ?

 Thank you so much for the help

 Regards
 Pierre

 Le 01/07/2014 23:51, Samuel Just a écrit :

 Can you reproduce with
 debug osd = 20
 debug filestore = 20
 debug ms = 1
 ?
 -Sam

 On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:





 Hi,

 I join :
   - osd.20 is one of osd that I detect which makes crash
 other
 OSD.
   - osd.23 is one of osd which crash when i start osd.20
   - mds, is one of my MDS

 I cut log file because they are to big but. All is here :
 https://blondeau.users.greyc.fr/cephlog/

 Regards

 Le 30/06/2014 17:35, Gregory Farnum a écrit :

 What's the 

Re: [ceph-users] Some OSD and MDS crash

2014-07-01 Thread Pierre BLONDEAU

Hi,

I attach:
 - osd.20, one of the OSDs that I identified as making other OSDs crash.
 - osd.23, one of the OSDs that crashes when I start osd.20.
 - mds, one of my MDS daemons.

I truncated the log files because they are too big. Everything is here:
https://blondeau.users.greyc.fr/cephlog/


Regards

Le 30/06/2014 17:35, Gregory Farnum a écrit :

What's the backtrace from the crashing OSDs?

Keep in mind that as a dev release, it's generally best not to upgrade
to unnamed versions like 0.82 (but it's probably too late to go back
now).


I will remember it the next time ;)


-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:

Hi,

After the upgrade to firefly, I have some PG in peering state.
I seen the output of 0.82 so I try to upgrade for solved my problem.

My three MDS crash and some OSD triggers a chain reaction that kills other
OSD.
I think my MDS will not start because of the metadata are on the OSD.

I have 36 OSD on three servers and I identified 5 OSD which makes crash
others. If i not start their, the cluster passe in reconstructive state with
31 OSD but i have 378 in down+peering state.

How can I do ? Would you more information ( os, crash log, etc ... ) ?

Regards

--
--
Pierre BLONDEAU
Administrateur Systèmes  réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel : 02 31 56 75 42
bureau  : Campus 2, Science 3, 406
--






--
--
Pierre BLONDEAU
Administrateur Systèmes  réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel : 02 31 56 75 42
bureau  : Campus 2, Science 3, 406
--
 ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: (MDLog::_reformat_journal(JournalPointer const&, Journaler*, Context*)+0x1356) [0x855826]
 2: (MDLog::_recovery_thread(Context*)+0x7dc) [0x85606c]
 3: (MDLog::RecoveryThread::entry()+0x11) [0x664651]
 4: (()+0x6b50) [0x7f3bb5bc5b50]
 5: (clone()+0x6d) [0x7f3bb49ee0ed]

 ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: /usr/bin/ceph-mds() [0x8d81f2]
 2: (()+0xf030) [0x7f3bb5bce030]
 3: (gsignal()+0x35) [0x7f3bb4944475]
 4: (abort()+0x180) [0x7f3bb49476f0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f3bb519a89d]
 6: (()+0x63996) [0x7f3bb5198996]
 7: (()+0x639c3) [0x7f3bb51989c3]
 8: (()+0x63bee) [0x7f3bb5198bee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
 10: (MDLog::_reformat_journal(JournalPointer const&, Journaler*, Context*)+0x1356) [0x855826]
 11: (MDLog::_recovery_thread(Context*)+0x7dc) [0x85606c]
 12: (MDLog::RecoveryThread::entry()+0x11) [0x664651]
 13: (()+0x6b50) [0x7f3bb5bc5b50]
 14: (clone()+0x6d) [0x7f3bb49ee0ed]
 ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: (PG::fulfill_info(pg_shard_t, pg_query_t const&, std::pair<pg_shard_t, pg_info_t>&)+0x5a) [0x879efa]
 2: (PG::RecoveryState::Stray::react(PG::MQuery const&)+0xef) [0x88be5f]
 3: (boost::statechart::detail::reaction_result boost::statechart::simple_statePG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, (boost::statechart::history_mode)0::local_react_impl_non_empty::local_react_implboost::mpl::listboost::statechart::custom_reactionPG::MQuery, boost::statechart::custom_reactionPG::MLogRec, boost::statechart::custom_reactionPG::MInfoRec, boost::statechart::custom_reactionPG::ActMap, boost::statechart::custom_reactionPG::RecoveryDone, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, boost::statechart::simple_statePG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, (boost::statechart::history_mode)0 (boost::statechart::simple_statePG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, (boost::statechart::history_mode)0, boost::statechart::event_base const, void const*)+0x86) [0x8c8f06]
 4: (boost::statechart::simple_statePG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::listmpl_::na, mpl_::na, 

Re: [ceph-users] Some OSD and MDS crash

2014-07-01 Thread Samuel Just
Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam
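
For reference, a minimal sketch of how these settings might be applied persistently (merged into the existing /etc/ceph/ceph.conf on each OSD host, followed by an OSD restart):

[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1

Alternatively they can be injected into running daemons with ceph tell osd.N injectargs, as also discussed elsewhere in this thread.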

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:
 Hi,

 I join :
  - osd.20 is one of osd that I detect which makes crash other OSD.
  - osd.23 is one of osd which crash when i start osd.20
  - mds, is one of my MDS

 I cut log file because they are to big but. All is here :
 https://blondeau.users.greyc.fr/cephlog/

 Regards

 Le 30/06/2014 17:35, Gregory Farnum a écrit :

 What's the backtrace from the crashing OSDs?

 Keep in mind that as a dev release, it's generally best not to upgrade
 to unnamed versions like 0.82 (but it's probably too late to go back
 now).


 I will remember it the next time ;)


 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
 pierre.blond...@unicaen.fr wrote:

 Hi,

 After the upgrade to firefly, I have some PG in peering state.
 I seen the output of 0.82 so I try to upgrade for solved my problem.

 My three MDS crash and some OSD triggers a chain reaction that kills
 other
 OSD.
 I think my MDS will not start because of the metadata are on the OSD.

 I have 36 OSD on three servers and I identified 5 OSD which makes crash
 others. If i not start their, the cluster passe in reconstructive state
 with
 31 OSD but i have 378 in down+peering state.

 How can I do ? Would you more information ( os, crash log, etc ... ) ?

 Regards

 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes  réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --





 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes  réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --




[ceph-users] Some OSD and MDS crash

2014-06-30 Thread Pierre BLONDEAU

Hi,

After the upgrade to firefly, I have some PGs stuck in the peering state.
I saw that 0.82 had been released, so I tried upgrading to it to solve my problem.

My three MDS daemons crash, and some OSDs trigger a chain reaction that kills 
other OSDs.

I think my MDS daemons will not start because their metadata are on the OSDs.

I have 36 OSDs on three servers, and I identified 5 OSDs that make the others 
crash. If I do not start those 5, the cluster goes into recovery with 31 OSDs 
up, but I still have 378 PGs in the down+peering state.

What can I do? Do you need more information (OS, crash logs, etc.)?
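
For example, the overall cluster state can be captured with:

# ceph -s
# ceph health detail
# ceph osd tree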

Regards

--
--
Pierre BLONDEAU
Administrateur Systèmes  réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel : 02 31 56 75 42
bureau  : Campus 2, Science 3, 406
--





Re: [ceph-users] Some OSD and MDS crash

2014-06-30 Thread Gregory Farnum
What's the backtrace from the crashing OSDs?
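
Assuming the default log location, the assert and backtrace of a crashed OSD such as osd.20 can usually be pulled out with something like:

# grep -A 30 'FAILED assert' /var/log/ceph/ceph-osd.20.log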

Keep in mind that as a dev release, it's generally best not to upgrade
to unnamed versions like 0.82 (but it's probably too late to go back
now).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
pierre.blond...@unicaen.fr wrote:
 Hi,

 After the upgrade to firefly, I have some PG in peering state.
 I seen the output of 0.82 so I try to upgrade for solved my problem.

 My three MDS crash and some OSD triggers a chain reaction that kills other
 OSD.
 I think my MDS will not start because of the metadata are on the OSD.

 I have 36 OSD on three servers and I identified 5 OSD which makes crash
 others. If i not start their, the cluster passe in reconstructive state with
 31 OSD but i have 378 in down+peering state.

 How can I do ? Would you more information ( os, crash log, etc ... ) ?

 Regards

 --
 --
 Pierre BLONDEAU
 Administrateur Systèmes  réseaux
 Université de Caen
 Laboratoire GREYC, Département d'informatique

 tel : 02 31 56 75 42
 bureau  : Campus 2, Science 3, 406
 --


