On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote:
On 03/07/2014 00:55, Samuel Just wrote:
Ah,

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush
/tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i >
/tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file
'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file
'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

 Looks like the chooseleaf_vary_r tunable somehow ended up divergent?

The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, the proposal failed, and somehow only the leader got to write the map to disk. This happened once before on a totally different issue (although I can't pinpoint which one right now).

In such a scenario, the leader would serve the incorrect osdmap to whoever requested osdmaps from it, while the rest of the quorum would serve the correct osdmaps to everyone else. That could explain this divergence. Or it could be something else.

Are there logs for the monitors for the timeframe this may have happened in?
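One rough way to check whether the divergent epoch even shows up in them (a sketch only, assuming the default log location; 13258 is the epoch from the filenames above):

  grep -n '13258' /var/log/ceph/ceph-mon.*.log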

  -Joao


Pierre: do you recall how and when that got set?

I am not sure I understand, but if I remember correctly, after the update to firefly the cluster was in the state HEALTH_WARN "crush map has legacy tunables" and I saw "feature set mismatch" in the logs.

So, if I remember correctly, I ran "ceph osd crush tunables optimal" to address the crush map warning, and I updated my client and server kernels to 3.16rc.

Could that be the cause?
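For reference, one way to check which tunables the monitors' current crush map actually contains is a sketch like the following (the /tmp path is just an example):

  # dump the current crush map and look at its tunable lines
  ceph osd getcrushmap -o /tmp/current-crush
  crushtool -d /tmp/current-crush | grep tunable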

Pierre

-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just <sam.j...@inktank.com> wrote:
Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f  osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c  osd-23_osdmap.13258__0_4E62BB79__none
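(For reference, checksums like these can be reproduced with something along these lines, once the two files have been copied side by side:)

  md5sum osd-20_osdmap.13258__0_4E62BB79__none osd-23_osdmap.13258__0_4E62BB79__none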

Joao: thoughts?
-Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU
<pierre.blond...@unicaen.fr> wrote:
The files are attached.

When I upgraded:
  ceph-deploy install --stable firefly servers...
  on each server: service ceph restart mon
  on each server: service ceph restart osd
  on each server: service ceph restart mds

I upgraded from emperor to firefly. After repair, remap, replace, etc., I had some PGs which remained stuck in the peering state.

I thought: why not try version 0.82, it might solve my problem (that was my mistake). So I upgraded from firefly to 0.82 with:
  ceph-deploy install --testing servers...
  ..

Now, all programs are at version 0.82.
I have 3 mons, 36 OSDs and 3 MDSs.

Pierre

PS: I also find "inc\uosdmap.13258__0_469271DE__none" in each meta directory.

On 03/07/2014 00:10, Samuel Just wrote:

Also, what version did you upgrade from, and how did you upgrade?
-Sam

On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just <sam.j...@inktank.com>
wrote:

Ok, in current/meta on osd 20 and osd 23, please attach all files
matching

^osdmap.13258.*

There should be one such file on each osd (it should look something like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory; you'll want to use find).
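A sketch of such a find, assuming the default OSD data path (adjust the osd id and path to your layout):

  find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258__*'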

What version of ceph is running on your mons? How many mons do you have?
-Sam

On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU
<pierre.blond...@unicaen.fr> wrote:

Hi,

I did it; the log files are available here:
https://blondeau.users.greyc.fr/cephlog/debug20/

The OSD log files are really big, around 80 MB each.

After starting osd.20, some other OSDs crashed. The number of OSDs up dropped from 31 to 16.
I noticed that after this the number of down+peering PGs decreased from 367 to 248. Is that "normal"? Maybe it's temporary, while the cluster verifies all the PGs?

Regards
Pierre

On 02/07/2014 19:16, Samuel Just wrote:

You should add

debug osd = 20
debug filestore = 20
debug ms = 1

to the [osd] section of the ceph.conf and restart the osds. I'd like all three logs if possible.
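For reference, a minimal sketch of how that fragment of ceph.conf would look:

  [osd]
      debug osd = 20
      debug filestore = 20
      debug ms = 1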

Thanks
-Sam

On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU
<pierre.blond...@unicaen.fr> wrote:


Yes, but how do I do that?

With a command like this?

ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20
--debug-ms
1'

By modifying /etc/ceph/ceph.conf? That file is very minimal because I use udev detection.

Once I have made these changes, do you want all three log files or only osd.20's?

Thank you so much for the help

Regards
Pierre

On 01/07/2014 23:51, Samuel Just wrote:

Can you reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
?
-Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU
<pierre.blond...@unicaen.fr> wrote:



Hi,

I attach:
     - osd.20 is one of the OSDs that I identified as making other OSDs crash.
     - osd.23 is one of the OSDs which crash when I start osd.20.
     - mds is one of my MDSs.

I truncated the log files because they are too big, but everything is here:
https://blondeau.users.greyc.fr/cephlog/

Regards

On 30/06/2014 17:35, Gregory Farnum wrote:

What's the backtrace from the crashing OSDs?
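One rough way to pull the crash section out of an OSD log, assuming the default log path and the usual assert marker (both are assumptions here), would be something like:

  # show an assertion failure and the surrounding backtrace in the osd.23 log
  grep -B 5 -A 30 'FAILED assert' /var/log/ceph/ceph-osd.23.log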

Keep in mind that as a dev release, it's generally best not to upgrade to unnamed versions like 0.82 (but it's probably too late to go back now).



I will remember that next time ;)

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU
<pierre.blond...@unicaen.fr> wrote:


Hi,

After the upgrade to firefly, I had some PGs stuck in the peering state.
I saw the release of 0.82, so I tried upgrading to solve my problem.

My three MDSs crash, and some OSDs trigger a chain reaction that kills other OSDs.
I think my MDSs will not start because their metadata is on the OSDs.

I have 36 OSDs on three servers, and I identified 5 OSDs which make the others crash. If I do not start those, the cluster goes into a recovering state with 31 OSDs, but I have 378 PGs in down+peering state.
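(For reference, the stuck PGs can be listed with the standard commands, e.g.:)

  ceph health detail | grep peering
  ceph pg dump_stuck inactive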

What can I do? Do you need more information (OS, crash logs, etc.)?

Regards



--
----------------------------------------------
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel     : 02 31 56 75 42
bureau  : Campus 2, Science 3, 406
----------------------------------------------



--
----------------------------------------------
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique

tel     : 02 31 56 75 42
bureau  : Campus 2, Science 3, 406
----------------------------------------------




--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
