Re: [ceph-users] Some OSD and MDS crash
Hi Pierre,

Unfortunately it looks like we had a bug in 0.82 that could lead to journal corruption of the sort you're seeing here. A new journal format was added, and on the first start after an update the MDS would re-write the journal to the new format. This should only have been happening on the single active MDS for a given rank, but it was actually being done by standby-replay MDS daemons too. As a result, if there were standby-replay daemons configured, they could try to rewrite the journal at the same time, resulting in a corrupt journal.

In your case, I think the probability of the condition occurring was increased by the OSD issues you were having, because at some earlier stage the rewrite process had been stopped partway through. Without standby MDSs this would be recovered from cleanly, but with the standbys in play the danger of corruption is high while the journal is in the partly-rewritten state.

The ticket is here: http://tracker.ceph.com/issues/8811
The candidate fix is here: https://github.com/ceph/ceph/pull/2115

If you have recent backups then I would suggest recreating the filesystem and restoring from backups. You can also try using the "cephfs-journal-tool journal reset" command, which will wipe out the journal entirely, losing the most recent writes to the filesystem and potentially leaving some stray objects in the data pool.

Sorry that this has bitten you; even though 0.82 was not a named release, this was a pretty nasty bug to let out there, and I'm going to improve our automated tests in this area.

Regards,
John

On Wed, Jul 16, 2014 at 11:57 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote:

On 16/07/2014 22:40, Gregory Farnum wrote:

On Wed, Jul 16, 2014 at 6:21 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote:

Hi,

After the repair process, I have:
1926 active+clean
2 active+clean+inconsistent

These two PGs seem to be on the same OSD (#34):

# ceph pg dump | grep inconsistent
dumped all in format plain
0.2e4 0 0 0 8388660 4 4 active+clean+inconsistent 2014-07-16 11:39:43.819631 9463'4 438411:133968 [34,4] 34 [34,4] 34 9463'4 2014-07-16 04:52:54.417333 9463'4 2014-07-11 09:29:22.041717
0.1ed 5 0 0 0 8388623 10 10 active+clean+inconsistent 2014-07-16 11:39:45.820142 9712'10 438411:144792 [34,2] 34 [34,2] 34 9712'10 2014-07-16 09:12:44.742488 9712'10 2014-07-10 21:57:11.345241

Could that explain why my MDS won't start? If I remove (or shut down) this OSD, could that solve my problem?

You want to figure out why they're inconsistent (if they're still going inconsistent, or maybe just need to be repaired), but this shouldn't be causing your MDS troubles. Can you dump the MDS journal and put it somewhere accessible? (You can use ceph-post-file to upload it.) John has been trying to reproduce this crash but hasn't succeeded yet.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

Hi,

I tried to run:
cephfs-journal-tool journal export ceph-journal.bin 2> cephfs-journal-tool.log
but the program crashed. I uploaded the log file: e069c6ac-3cb4-4a52-8950-da7c600e2b01

There is a mistake in http://ceph.com/docs/master/cephfs/cephfs-journal-tool/ in the "Example: journal inspect" section. The correct syntax seems to be:

# cephfs-journal-tool journal inspect
2014-07-17 00:54:14.155382 7ff89d239780 -1 Header is invalid (inconsistent offsets)
Overall journal integrity: DAMAGED
Header could not be decoded

Regards
--
Pierre BLONDEAU
Administrateur Systèmes réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique
tel : 02 31 56 75 42
bureau : Campus 2, Science 3, 406
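For anyone hitting the same corruption, the steps John describes map onto a handful of cephfs-journal-tool invocations. This is only a sketch (run it with the affected MDS stopped, against the damaged rank); the final command is destructive:

# check the state of the journal header and events
cephfs-journal-tool journal inspect
# save whatever is still readable before doing anything destructive
cephfs-journal-tool journal export backup.bin
# last resort: wipe the journal, losing the most recent metadata writes
cephfs-journal-tool journal reset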
Re: [ceph-users] Some OSD and MDS crash
Hi, After the repair process, i have : 1926 active+clean 2 active+clean+inconsistent This two PGs seem to be on the same osd ( #34 ): # ceph pg dump | grep inconsistent dumped all in format plain 0.2e4 0 0 0 8388660 4 4 active+clean+inconsistent 2014-07-16 11:39:43.819631 9463'4 438411:133968 [34,4] 34 [34,4] 34 9463'4 2014-07-16 04:52:54.417333 9463'4 2014-07-11 09:29:22.041717 0.1ed 5 0 0 0 8388623 10 10 active+clean+inconsistent 2014-07-16 11:39:45.820142 9712'10 438411:144792 [34,2] 34 [34,2] 34 9712'10 2014-07-16 09:12:44.742488 9712'10 2014-07-10 21:57:11.345241 It's can explain why my MDS won't to start ? If i remove ( or shutdown ) this OSD, it's can solved my problem ? Regards. Le 10/07/2014 11:51, Pierre BLONDEAU a écrit : Hi, Great. All my OSD restart : osdmap e438044: 36 osds: 36 up, 36 in All PG page are active and some in recovery : 1604040/49575206 objects degraded (3.236%) 1780 active+clean 17 active+degraded+remapped+backfilling 61 active+degraded+remapped+wait_backfill 11 active+clean+scrubbing+deep 34 active+remapped+backfilling 21 active+remapped+wait_backfill 4 active+clean+replay But all mds crash. Logs are here : https://blondeau.users.greyc.fr/cephlog/legacy/ In any case, thank you very much for your help. Pierre Le 09/07/2014 19:34, Joao Eduardo Luis a écrit : On 07/09/2014 02:22 PM, Pierre BLONDEAU wrote: Hi, There is any chance to restore my data ? Okay, I talked to Sam and here's what you could try before anything else: - Make sure you have everything running on the same version. - unset the the chooseleaf_vary_r flag -- this can be accomplished by setting tunables to legacy. - have the osds join in the cluster - you should then either upgrade to firefly (if you haven't done so by now) or wait for the point-release before you move on to setting tunables to optimal again. Let us know how it goes. -Joao Regards Pierre Le 07/07/2014 15:42, Pierre BLONDEAU a écrit : No chance to have those logs and even less in debug mode. I do this change 3 weeks ago. I put all my log here if it's can help : https://blondeau.users.greyc.fr/cephlog/all/ I have a chance to recover my +/- 20TB of data ? Regards Le 03/07/2014 21:48, Joao Luis a écrit : Do those logs have a higher debugging level than the default? If not nevermind as they will not have enough information. If they do however, we'd be interested in the portion around the moment you set the tunables. Say, before the upgrade and a bit after you set the tunable. If you want to be finer grained, then ideally it would be the moment where those maps were created, but you'd have to grep the logs for that. Or drop the logs somewhere and I'll take a look. -Joao On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr mailto:pierre.blond...@unicaen.fr wrote: Le 03/07/2014 13:49, Joao Eduardo Luis a écrit : On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote: Le 03/07/2014 00:55, Samuel Just a écrit : Ah, ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush20 ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush23 6d5 tunable chooseleaf_vary_r 1 Looks like the chooseleaf_vary_r tunable somehow ended up divergent? 
The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked it for osdmaps, while the remaining quorum would serve the correct osdmaps to all the others. This could cause this divergence. Or it could be something else. Are there logs for the monitors for the timeframe this may have happened in?

Exactly which timeframe do you want? I have 7 days of logs, so I should have information about the upgrade from firefly to 0.82. Which mon's log do you want? All three?

Regards

-Joao

Pierre: do you recall how and when that got set?

I am not sure I understand, but
Re: [ceph-users] Some OSD and MDS crash
On Wed, Jul 16, 2014 at 6:21 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, After the repair process, i have : 1926 active+clean 2 active+clean+inconsistent This two PGs seem to be on the same osd ( #34 ): # ceph pg dump | grep inconsistent dumped all in format plain 0.2e4 0 0 0 8388660 4 4 active+clean+inconsistent 2014-07-16 11:39:43.819631 9463'4 438411:133968 [34,4] 34 [34,4] 34 9463'4 2014-07-16 04:52:54.417333 9463'4 2014-07-11 09:29:22.041717 0.1ed 5 0 0 0 8388623 10 10 active+clean+inconsistent 2014-07-16 11:39:45.820142 9712'10 438411:144792 [34,2] 34 [34,2] 34 9712'10 2014-07-16 09:12:44.742488 9712'10 2014-07-10 21:57:11.345241 It's can explain why my MDS won't to start ? If i remove ( or shutdown ) this OSD, it's can solved my problem ? You want to figure out why they're inconsistent (if they're still going inconsistent, or maybe just need to be repaired), but this shouldn't be causing your MDS troubles. Can you dump the MDS journal and put it somewhere accessible? (You can use ceph-post-file to upload it.) John has been trying to reproduce this crash but hasn't succeeded yet. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
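A rough sketch of the two actions discussed above, using the PG IDs from the dump; the behaviour of scrub and repair depends on the exact Ceph version, so treat this as illustrative:

ceph pg deep-scrub 0.2e4
ceph pg deep-scrub 0.1ed
# if the inconsistency persists after the scrubs complete
ceph pg repair 0.2e4
ceph pg repair 0.1ed
# upload the exported MDS journal for the developers to examine
ceph-post-file ceph-journal.bin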
Re: [ceph-users] Some OSD and MDS crash
Hi, Great. All my OSD restart : osdmap e438044: 36 osds: 36 up, 36 in All PG page are active and some in recovery : 1604040/49575206 objects degraded (3.236%) 1780 active+clean 17 active+degraded+remapped+backfilling 61 active+degraded+remapped+wait_backfill 11 active+clean+scrubbing+deep 34 active+remapped+backfilling 21 active+remapped+wait_backfill 4 active+clean+replay But all mds crash. Logs are here : https://blondeau.users.greyc.fr/cephlog/legacy/ In any case, thank you very much for your help. Pierre Le 09/07/2014 19:34, Joao Eduardo Luis a écrit : On 07/09/2014 02:22 PM, Pierre BLONDEAU wrote: Hi, There is any chance to restore my data ? Okay, I talked to Sam and here's what you could try before anything else: - Make sure you have everything running on the same version. - unset the the chooseleaf_vary_r flag -- this can be accomplished by setting tunables to legacy. - have the osds join in the cluster - you should then either upgrade to firefly (if you haven't done so by now) or wait for the point-release before you move on to setting tunables to optimal again. Let us know how it goes. -Joao Regards Pierre Le 07/07/2014 15:42, Pierre BLONDEAU a écrit : No chance to have those logs and even less in debug mode. I do this change 3 weeks ago. I put all my log here if it's can help : https://blondeau.users.greyc.fr/cephlog/all/ I have a chance to recover my +/- 20TB of data ? Regards Le 03/07/2014 21:48, Joao Luis a écrit : Do those logs have a higher debugging level than the default? If not nevermind as they will not have enough information. If they do however, we'd be interested in the portion around the moment you set the tunables. Say, before the upgrade and a bit after you set the tunable. If you want to be finer grained, then ideally it would be the moment where those maps were created, but you'd have to grep the logs for that. Or drop the logs somewhere and I'll take a look. -Joao On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr mailto:pierre.blond...@unicaen.fr wrote: Le 03/07/2014 13:49, Joao Eduardo Luis a écrit : On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote: Le 03/07/2014 00:55, Samuel Just a écrit : Ah, ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush20 ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush23 6d5 tunable chooseleaf_vary_r 1 Looks like the chooseleaf_vary_r tunable somehow ended up divergent? The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked osdmaps from it, the remaining quorum would serve the correct osdmaps to all the others. This could cause this divergence. Or it could be something else. Are there logs for the monitors for the timeframe this may have happened in? Which exactly timeframe you want ? I have 7 days of logs, I should have informations about the upgrade from firefly to 0.82. Which mon's log do you want ? Three ? Regards -Joao Pierre: do you recall how and when that got set? 
I am not sure I understand, but if I remember correctly, after the update to firefly I was in the state: HEALTH_WARN crush map has legacy tunables, and I was seeing "feature set mismatch" in the logs. So, if I remember correctly, I ran: ceph osd crush tunables optimal to address the crush map warning, and I updated my client and server kernels to 3.16rc. Could it be that?

Pierre

-Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote:

Yeah, divergent osdmaps:
555ed048e73024687fc8b106a570db4f osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c
Re: [ceph-users] Some OSD and MDS crash
Hi, There is any chance to restore my data ? Regards Pierre Le 07/07/2014 15:42, Pierre BLONDEAU a écrit : No chance to have those logs and even less in debug mode. I do this change 3 weeks ago. I put all my log here if it's can help : https://blondeau.users.greyc.fr/cephlog/all/ I have a chance to recover my +/- 20TB of data ? Regards Le 03/07/2014 21:48, Joao Luis a écrit : Do those logs have a higher debugging level than the default? If not nevermind as they will not have enough information. If they do however, we'd be interested in the portion around the moment you set the tunables. Say, before the upgrade and a bit after you set the tunable. If you want to be finer grained, then ideally it would be the moment where those maps were created, but you'd have to grep the logs for that. Or drop the logs somewhere and I'll take a look. -Joao On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr mailto:pierre.blond...@unicaen.fr wrote: Le 03/07/2014 13:49, Joao Eduardo Luis a écrit : On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote: Le 03/07/2014 00:55, Samuel Just a écrit : Ah, ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush20 ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush23 6d5 tunable chooseleaf_vary_r 1 Looks like the chooseleaf_vary_r tunable somehow ended up divergent? The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked osdmaps from it, the remaining quorum would serve the correct osdmaps to all the others. This could cause this divergence. Or it could be something else. Are there logs for the monitors for the timeframe this may have happened in? Which exactly timeframe you want ? I have 7 days of logs, I should have informations about the upgrade from firefly to 0.82. Which mon's log do you want ? Three ? Regards -Joao Pierre: do you recall how and when that got set? I am not sure to understand, but if I good remember after the update in firefly, I was in state : HEALTH_WARN crush map has legacy tunables and I see feature set mismatch in log. So if I good remeber, i do : ceph osd crush tunables optimal for the problem of crush map and I update my client and server kernel to 3.16rc. It's could be that ? Pierre -Sam On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com mailto:sam.j...@inktank.com wrote: Yeah, divergent osdmaps: 555ed048e73024687fc8b106a570db__4f osd-20_osdmap.13258__0___4E62BB79__none 6037911f31dc3c18b05499d24dcdbe__5c osd-23_osdmap.13258__0___4E62BB79__none Joao: thoughts? -Sam On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr mailto:pierre.blond...@unicaen.fr wrote: The files When I upgrade : ceph-deploy install --stable firefly servers... on each servers service ceph restart mon on each servers service ceph restart osd on each servers service ceph restart mds I upgraded from emperor to firefly. After repair, remap, replace, etc ... 
I had some PGs that ended up in the peering state. I thought: why not try version 0.82, it could solve my problem (that was my mistake). So I upgraded from firefly to 0.83 with: ceph-deploy install
Re: [ceph-users] Some OSD and MDS crash
On 07/09/2014 02:22 PM, Pierre BLONDEAU wrote: Hi, There is any chance to restore my data ? Hello Pierre, I've been giving this some thought and my guess is that yes, it should be possible. However, it may not be a simple fix. So, first of all, you got bit by http://tracker.ceph.com/issues/8738, which has been resolved and should be available on the next firefly point-release. However, I doubt just upgrading will solve all your problems. You'll have some OSDs with maps containing the chooseleaf_vary_r flag, while other OSDs won't. You'll also have monitors serving such maps, while other monitors won't. This may very well mean having to enable the flag throughout the cluster in all those maps that haven't got said flag enabled. In which case this will mean having to put together a tool to do this, while a daemon is offline. There may be however another way, but although simpler it's more intrusive: First of all we'd have to know which monitor is the one with the appropriate maps (this would certainly be the firefly monitor), which I'm assuming is still online. Then we'd have to remove all remaining monitors and add new, firefly monitors. This way they'd sync up with the monitor with the correct maps. Then we'd have to make sure in which map version this whole thing happened, and copy all maps from that point forward from the up OSDs to the OSDs that have divergent maps. It would be nice if Sam could chime in and validate either approach. -Joao Regards Pierre Le 07/07/2014 15:42, Pierre BLONDEAU a écrit : No chance to have those logs and even less in debug mode. I do this change 3 weeks ago. I put all my log here if it's can help : https://blondeau.users.greyc.fr/cephlog/all/ I have a chance to recover my +/- 20TB of data ? Regards Le 03/07/2014 21:48, Joao Luis a écrit : Do those logs have a higher debugging level than the default? If not nevermind as they will not have enough information. If they do however, we'd be interested in the portion around the moment you set the tunables. Say, before the upgrade and a bit after you set the tunable. If you want to be finer grained, then ideally it would be the moment where those maps were created, but you'd have to grep the logs for that. Or drop the logs somewhere and I'll take a look. -Joao On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr mailto:pierre.blond...@unicaen.fr wrote: Le 03/07/2014 13:49, Joao Eduardo Luis a écrit : On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote: Le 03/07/2014 00:55, Samuel Just a écrit : Ah, ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush20 ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush23 6d5 tunable chooseleaf_vary_r 1 Looks like the chooseleaf_vary_r tunable somehow ended up divergent? The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked osdmaps from it, the remaining quorum would serve the correct osdmaps to all the others. 
This could cause this divergence. Or it could be something else. Are there logs for the monitors for the timeframe this may have happened in? Which exactly timeframe you want ? I have 7 days of logs, I should have informations about the upgrade from firefly to 0.82. Which mon's log do you want ? Three ? Regards -Joao Pierre: do you recall how and when that got set? I am not sure to understand, but if I good remember after the update in firefly, I was in state : HEALTH_WARN crush map has legacy tunables and I see feature set mismatch in log. So if I good remeber, i do : ceph osd crush tunables optimal for the problem of crush map and I update my client and server kernel to 3.16rc. It's could be that ? Pierre -Sam On Wed, Jul 2, 2014 at 3:43 PM, Samuel
Re: [ceph-users] Some OSD and MDS crash
On 07/09/2014 02:22 PM, Pierre BLONDEAU wrote: Hi, There is any chance to restore my data ? Okay, I talked to Sam and here's what you could try before anything else: - Make sure you have everything running on the same version. - unset the the chooseleaf_vary_r flag -- this can be accomplished by setting tunables to legacy. - have the osds join in the cluster - you should then either upgrade to firefly (if you haven't done so by now) or wait for the point-release before you move on to setting tunables to optimal again. Let us know how it goes. -Joao Regards Pierre Le 07/07/2014 15:42, Pierre BLONDEAU a écrit : No chance to have those logs and even less in debug mode. I do this change 3 weeks ago. I put all my log here if it's can help : https://blondeau.users.greyc.fr/cephlog/all/ I have a chance to recover my +/- 20TB of data ? Regards Le 03/07/2014 21:48, Joao Luis a écrit : Do those logs have a higher debugging level than the default? If not nevermind as they will not have enough information. If they do however, we'd be interested in the portion around the moment you set the tunables. Say, before the upgrade and a bit after you set the tunable. If you want to be finer grained, then ideally it would be the moment where those maps were created, but you'd have to grep the logs for that. Or drop the logs somewhere and I'll take a look. -Joao On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr mailto:pierre.blond...@unicaen.fr wrote: Le 03/07/2014 13:49, Joao Eduardo Luis a écrit : On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote: Le 03/07/2014 00:55, Samuel Just a écrit : Ah, ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush20 ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush23 6d5 tunable chooseleaf_vary_r 1 Looks like the chooseleaf_vary_r tunable somehow ended up divergent? The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked osdmaps from it, the remaining quorum would serve the correct osdmaps to all the others. This could cause this divergence. Or it could be something else. Are there logs for the monitors for the timeframe this may have happened in? Which exactly timeframe you want ? I have 7 days of logs, I should have informations about the upgrade from firefly to 0.82. Which mon's log do you want ? Three ? Regards -Joao Pierre: do you recall how and when that got set? I am not sure to understand, but if I good remember after the update in firefly, I was in state : HEALTH_WARN crush map has legacy tunables and I see feature set mismatch in log. So if I good remeber, i do : ceph osd crush tunables optimal for the problem of crush map and I update my client and server kernel to 3.16rc. It's could be that ? 
Pierre -Sam On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com mailto:sam.j...@inktank.com wrote: Yeah, divergent osdmaps: 555ed048e73024687fc8b106a570db__4f osd-20_osdmap.13258__0___4E62BB79__none 6037911f31dc3c18b05499d24dcdbe__5c osd-23_osdmap.13258__0___4E62BB79__none Joao: thoughts? -Sam On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr mailto:pierre.blond...@unicaen.fr wrote: The files When I upgrade : ceph-deploy install --stable firefly servers... on each servers service ceph restart mon on each servers service ceph restart osd on each
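Joao's first two steps above boil down to roughly the following; note that changing tunables can trigger a lot of data movement, so this is a sketch rather than a recipe:

# revert to legacy tunables, which unsets chooseleaf_vary_r
ceph osd crush tunables legacy
# confirm what the monitors now advertise
ceph osd crush show-tunables
# then bring the OSDs back in, as in the earlier restart sequence
service ceph restart osd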
Re: [ceph-users] Some OSD and MDS crash
No chance to have those logs and even less in debug mode. I do this change 3 weeks ago. I put all my log here if it's can help : https://blondeau.users.greyc.fr/cephlog/all/ I have a chance to recover my +/- 20TB of data ? Regards Le 03/07/2014 21:48, Joao Luis a écrit : Do those logs have a higher debugging level than the default? If not nevermind as they will not have enough information. If they do however, we'd be interested in the portion around the moment you set the tunables. Say, before the upgrade and a bit after you set the tunable. If you want to be finer grained, then ideally it would be the moment where those maps were created, but you'd have to grep the logs for that. Or drop the logs somewhere and I'll take a look. -Joao On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr mailto:pierre.blond...@unicaen.fr wrote: Le 03/07/2014 13:49, Joao Eduardo Luis a écrit : On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote: Le 03/07/2014 00:55, Samuel Just a écrit : Ah, ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush20 ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0___4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush23 6d5 tunable chooseleaf_vary_r 1 Looks like the chooseleaf_vary_r tunable somehow ended up divergent? The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked osdmaps from it, the remaining quorum would serve the correct osdmaps to all the others. This could cause this divergence. Or it could be something else. Are there logs for the monitors for the timeframe this may have happened in? Which exactly timeframe you want ? I have 7 days of logs, I should have informations about the upgrade from firefly to 0.82. Which mon's log do you want ? Three ? Regards -Joao Pierre: do you recall how and when that got set? I am not sure to understand, but if I good remember after the update in firefly, I was in state : HEALTH_WARN crush map has legacy tunables and I see feature set mismatch in log. So if I good remeber, i do : ceph osd crush tunables optimal for the problem of crush map and I update my client and server kernel to 3.16rc. It's could be that ? Pierre -Sam On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com mailto:sam.j...@inktank.com wrote: Yeah, divergent osdmaps: 555ed048e73024687fc8b106a570db__4f osd-20_osdmap.13258__0___4E62BB79__none 6037911f31dc3c18b05499d24dcdbe__5c osd-23_osdmap.13258__0___4E62BB79__none Joao: thoughts? -Sam On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr mailto:pierre.blond...@unicaen.fr wrote: The files When I upgrade : ceph-deploy install --stable firefly servers... on each servers service ceph restart mon on each servers service ceph restart osd on each servers service ceph restart mds I upgraded from emperor to firefly. After repair, remap, replace, etc ... I have some PG which pass in peering state. I thought why not try the version 0.82, it could solve my problem. ( It's my mistake ). 
So I upgraded from firefly to 0.83 with: ceph-deploy install --testing servers... Now all the programs are at version 0.82.
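One way to double-check that every daemon really is running the same version (a sketch; the tell variant only reaches daemons that are currently up, and the admin-socket name assumes a default install):

ceph --version                             # installed packages, run on each host
ceph tell osd.* version                    # every running OSD
ceph daemon mon.$(hostname -s) version     # each monitor via its local admin socket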
Re: [ceph-users] Some OSD and MDS crash
On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote: Le 03/07/2014 00:55, Samuel Just a écrit : Ah, ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush20 ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush23 6d5 tunable chooseleaf_vary_r 1 Looks like the chooseleaf_vary_r tunable somehow ended up divergent? The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked osdmaps from it, the remaining quorum would serve the correct osdmaps to all the others. This could cause this divergence. Or it could be something else. Are there logs for the monitors for the timeframe this may have happened in? -Joao Pierre: do you recall how and when that got set? I am not sure to understand, but if I good remember after the update in firefly, I was in state : HEALTH_WARN crush map has legacy tunables and I see feature set mismatch in log. So if I good remeber, i do : ceph osd crush tunables optimal for the problem of crush map and I update my client and server kernel to 3.16rc. It's could be that ? Pierre -Sam On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote: Yeah, divergent osdmaps: 555ed048e73024687fc8b106a570db4f osd-20_osdmap.13258__0_4E62BB79__none 6037911f31dc3c18b05499d24dcdbe5c osd-23_osdmap.13258__0_4E62BB79__none Joao: thoughts? -Sam On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: The files When I upgrade : ceph-deploy install --stable firefly servers... on each servers service ceph restart mon on each servers service ceph restart osd on each servers service ceph restart mds I upgraded from emperor to firefly. After repair, remap, replace, etc ... I have some PG which pass in peering state. I thought why not try the version 0.82, it could solve my problem. ( It's my mistake ). So, I upgrade from firefly to 0.83 with : ceph-deploy install --testing servers... .. Now, all programs are in version 0.82. I have 3 mons, 36 OSD and 3 mds. Pierre PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta directory. Le 03/07/2014 00:10, Samuel Just a écrit : Also, what version did you upgrade from, and how did you upgrade? -Sam On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com wrote: Ok, in current/meta on osd 20 and osd 23, please attach all files matching ^osdmap.13258.* There should be one such file on each osd. (should look something like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory, you'll want to use find). What version of ceph is running on your mons? How many mons do you have? -Sam On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I do it, the log files are available here : https://blondeau.users.greyc.fr/cephlog/debug20/ The OSD's files are really big +/- 80M . After starting the osd.20 some other osd crash. I pass from 31 osd up to 16. I remark that after this the number of down+peering PG decrease from 367 to 248. It's normal ? 
May be it's temporary, the time that the cluster verifies all the PG ? Regards Pierre Le 02/07/2014 19:16, Samuel Just a écrit : You should add debug osd = 20 debug filestore = 20 debug ms = 1 to the [osd] section of the ceph.conf and restart the osds. I'd like all three logs if possible. Thanks -Sam On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Yes, but how i do that ? With a command like that ? ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1' By modify the /etc/ceph/ceph.conf ? This file is really poor because I use udev detection. When I have made these changes, you want the three log files or only osd.20's ? Thank you so much for the help Regards Pierre Le 01/07/2014 23:51, Samuel Just a écrit : Can you reproduce with debug osd = 20 debug filestore = 20 debug ms = 1 ? -Sam On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I join : - osd.20 is one of osd that I detect which makes crash other OSD. - osd.23 is one of osd which crash when i start osd.20 - mds, is one of my MDS I cut log file because they are to big but. All is here : https://blondeau.users.greyc.fr/cephlog/ Regards Le 30/06/2014 17:35, Gregory Farnum a écrit : What's the backtrace from the crashing OSDs? Keep in mind that as a dev release, it's
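For reference, reproducing that osdmap comparison with the stock tools looks like this, assuming the two osdmap.13258 files have been copied off osd.20 and osd.23 (file names as in the listing above):

osdmaptool --export-crush /tmp/crush20 osd-20_osdmap.13258__0_4E62BB79__none
osdmaptool --export-crush /tmp/crush23 osd-23_osdmap.13258__0_4E62BB79__none
crushtool -d /tmp/crush20 -o /tmp/crush20.txt
crushtool -d /tmp/crush23 -o /tmp/crush23.txt
diff /tmp/crush20.txt /tmp/crush23.txt   # here the only difference was "tunable chooseleaf_vary_r 1"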
Re: [ceph-users] Some OSD and MDS crash
Le 03/07/2014 13:49, Joao Eduardo Luis a écrit : On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote: Le 03/07/2014 00:55, Samuel Just a écrit : Ah, ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush20 ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush23 6d5 tunable chooseleaf_vary_r 1 Looks like the chooseleaf_vary_r tunable somehow ended up divergent? The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked osdmaps from it, the remaining quorum would serve the correct osdmaps to all the others. This could cause this divergence. Or it could be something else. Are there logs for the monitors for the timeframe this may have happened in? Which exactly timeframe you want ? I have 7 days of logs, I should have informations about the upgrade from firefly to 0.82. Which mon's log do you want ? Three ? Regards -Joao Pierre: do you recall how and when that got set? I am not sure to understand, but if I good remember after the update in firefly, I was in state : HEALTH_WARN crush map has legacy tunables and I see feature set mismatch in log. So if I good remeber, i do : ceph osd crush tunables optimal for the problem of crush map and I update my client and server kernel to 3.16rc. It's could be that ? Pierre -Sam On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote: Yeah, divergent osdmaps: 555ed048e73024687fc8b106a570db4f osd-20_osdmap.13258__0_4E62BB79__none 6037911f31dc3c18b05499d24dcdbe5c osd-23_osdmap.13258__0_4E62BB79__none Joao: thoughts? -Sam On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: The files When I upgrade : ceph-deploy install --stable firefly servers... on each servers service ceph restart mon on each servers service ceph restart osd on each servers service ceph restart mds I upgraded from emperor to firefly. After repair, remap, replace, etc ... I have some PG which pass in peering state. I thought why not try the version 0.82, it could solve my problem. ( It's my mistake ). So, I upgrade from firefly to 0.83 with : ceph-deploy install --testing servers... .. Now, all programs are in version 0.82. I have 3 mons, 36 OSD and 3 mds. Pierre PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta directory. Le 03/07/2014 00:10, Samuel Just a écrit : Also, what version did you upgrade from, and how did you upgrade? -Sam On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com wrote: Ok, in current/meta on osd 20 and osd 23, please attach all files matching ^osdmap.13258.* There should be one such file on each osd. (should look something like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory, you'll want to use find). What version of ceph is running on your mons? How many mons do you have? 
-Sam On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I do it, the log files are available here : https://blondeau.users.greyc.fr/cephlog/debug20/ The OSD's files are really big +/- 80M . After starting the osd.20 some other osd crash. I pass from 31 osd up to 16. I remark that after this the number of down+peering PG decrease from 367 to 248. It's normal ? May be it's temporary, the time that the cluster verifies all the PG ? Regards Pierre Le 02/07/2014 19:16, Samuel Just a écrit : You should add debug osd = 20 debug filestore = 20 debug ms = 1 to the [osd] section of the ceph.conf and restart the osds. I'd like all three logs if possible. Thanks -Sam On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Yes, but how i do that ? With a command like that ? ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1' By modify the /etc/ceph/ceph.conf ? This file is really poor because I use udev detection. When I have made these changes, you want the three log files or only osd.20's ? Thank you so much for the help Regards Pierre Le 01/07/2014 23:51, Samuel Just a écrit : Can you reproduce with debug osd = 20 debug filestore = 20 debug ms = 1 ? -Sam On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I join : - osd.20 is one of osd that I detect which makes crash other OSD. - osd.23 is one of osd which crash when i start osd.20 - mds, is one of my MDS I cut log file because
Re: [ceph-users] Some OSD and MDS crash
Do those logs have a higher debugging level than the default? If not nevermind as they will not have enough information. If they do however, we'd be interested in the portion around the moment you set the tunables. Say, before the upgrade and a bit after you set the tunable. If you want to be finer grained, then ideally it would be the moment where those maps were created, but you'd have to grep the logs for that. Or drop the logs somewhere and I'll take a look. -Joao On Jul 3, 2014 5:48 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Le 03/07/2014 13:49, Joao Eduardo Luis a écrit : On 07/03/2014 12:15 AM, Pierre BLONDEAU wrote: Le 03/07/2014 00:55, Samuel Just a écrit : Ah, ~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d ../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush20 ../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none' ../ceph/src/osdmaptool: exported crush map to /tmp/crush23 6d5 tunable chooseleaf_vary_r 1 Looks like the chooseleaf_vary_r tunable somehow ended up divergent? The only thing that comes to mind that could cause this is if we changed the leader's in-memory map, proposed it, it failed, and only the leader got to write the map to disk somehow. This happened once on a totally different issue (although I can't pinpoint right now which). In such a scenario, the leader would serve the incorrect osdmap to whoever asked osdmaps from it, the remaining quorum would serve the correct osdmaps to all the others. This could cause this divergence. Or it could be something else. Are there logs for the monitors for the timeframe this may have happened in? Which exactly timeframe you want ? I have 7 days of logs, I should have informations about the upgrade from firefly to 0.82. Which mon's log do you want ? Three ? Regards -Joao Pierre: do you recall how and when that got set? I am not sure to understand, but if I good remember after the update in firefly, I was in state : HEALTH_WARN crush map has legacy tunables and I see feature set mismatch in log. So if I good remeber, i do : ceph osd crush tunables optimal for the problem of crush map and I update my client and server kernel to 3.16rc. It's could be that ? Pierre -Sam On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote: Yeah, divergent osdmaps: 555ed048e73024687fc8b106a570db4f osd-20_osdmap.13258__0_ 4E62BB79__none 6037911f31dc3c18b05499d24dcdbe5c osd-23_osdmap.13258__0_ 4E62BB79__none Joao: thoughts? -Sam On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: The files When I upgrade : ceph-deploy install --stable firefly servers... on each servers service ceph restart mon on each servers service ceph restart osd on each servers service ceph restart mds I upgraded from emperor to firefly. After repair, remap, replace, etc ... I have some PG which pass in peering state. I thought why not try the version 0.82, it could solve my problem. ( It's my mistake ). So, I upgrade from firefly to 0.83 with : ceph-deploy install --testing servers... .. Now, all programs are in version 0.82. I have 3 mons, 36 OSD and 3 mds. Pierre PS : I find also inc\uosdmap.13258__0_469271DE__none on each meta directory. Le 03/07/2014 00:10, Samuel Just a écrit : Also, what version did you upgrade from, and how did you upgrade? 
-Sam On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com wrote: Ok, in current/meta on osd 20 and osd 23, please attach all files matching ^osdmap.13258.* There should be one such file on each osd. (should look something like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory, you'll want to use find). What version of ceph is running on your mons? How many mons do you have? -Sam On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I do it, the log files are available here : https://blondeau.users.greyc.fr/cephlog/debug20/ The OSD's files are really big +/- 80M . After starting the osd.20 some other osd crash. I pass from 31 osd up to 16. I remark that after this the number of down+peering PG decrease from 367 to 248. It's normal ? May be it's temporary, the time that the cluster verifies all the PG ? Regards Pierre Le 02/07/2014 19:16, Samuel Just a écrit : You should add debug osd = 20 debug filestore = 20 debug ms = 1 to the [osd] section of the ceph.conf and restart the osds. I'd like all three logs if possible. Thanks -Sam On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Yes, but how i do that ? With a command like that ? ceph tell osd.20 injectargs '--debug-osd 20
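Finding the moment those maps were created comes down to a plain grep over the monitor logs for the epoch in question (13258 here) or for the tunables change; whether anything useful shows up depends on the debug level, as Joao notes, and the paths assume the default log location:

grep -n '13258' /var/log/ceph/ceph-mon.*.log
grep -n 'chooseleaf_vary_r' /var/log/ceph/ceph-mon.*.log
# rotated logs are usually compressed, so use zgrep there
zgrep -n '13258' /var/log/ceph/ceph-mon.*.log.*.gz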
Re: [ceph-users] Some OSD and MDS crash
Yes, but how i do that ? With a command like that ? ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1' By modify the /etc/ceph/ceph.conf ? This file is really poor because I use udev detection. When I have made these changes, you want the three log files or only osd.20's ? Thank you so much for the help Regards Pierre Le 01/07/2014 23:51, Samuel Just a écrit : Can you reproduce with debug osd = 20 debug filestore = 20 debug ms = 1 ? -Sam On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I join : - osd.20 is one of osd that I detect which makes crash other OSD. - osd.23 is one of osd which crash when i start osd.20 - mds, is one of my MDS I cut log file because they are to big but. All is here : https://blondeau.users.greyc.fr/cephlog/ Regards Le 30/06/2014 17:35, Gregory Farnum a écrit : What's the backtrace from the crashing OSDs? Keep in mind that as a dev release, it's generally best not to upgrade to unnamed versions like 0.82 (but it's probably too late to go back now). I will remember it the next time ;) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, After the upgrade to firefly, I have some PG in peering state. I seen the output of 0.82 so I try to upgrade for solved my problem. My three MDS crash and some OSD triggers a chain reaction that kills other OSD. I think my MDS will not start because of the metadata are on the OSD. I have 36 OSD on three servers and I identified 5 OSD which makes crash others. If i not start their, the cluster passe in reconstructive state with 31 OSD but i have 378 in down+peering state. How can I do ? Would you more information ( os, crash log, etc ... ) ? Regards -- -- Pierre BLONDEAU Administrateur Systèmes réseaux Université de Caen Laboratoire GREYC, Département d'informatique tel : 02 31 56 75 42 bureau : Campus 2, Science 3, 406 -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- -- Pierre BLONDEAU Administrateur Systèmes réseaux Université de Caen Laboratoire GREYC, Département d'informatique tel : 02 31 56 75 42 bureau : Campus 2, Science 3, 406 -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- -- Pierre BLONDEAU Administrateur Systèmes réseaux Université de Caen Laboratoire GREYC, Département d'informatique tel : 02 31 56 75 42 bureau : Campus 2, Science 3, 406 -- smime.p7s Description: Signature cryptographique S/MIME ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Some OSD and MDS crash
You should add debug osd = 20 debug filestore = 20 debug ms = 1 to the [osd] section of the ceph.conf and restart the osds. I'd like all three logs if possible. Thanks -Sam On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Yes, but how i do that ? With a command like that ? ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1' By modify the /etc/ceph/ceph.conf ? This file is really poor because I use udev detection. When I have made these changes, you want the three log files or only osd.20's ? Thank you so much for the help Regards Pierre Le 01/07/2014 23:51, Samuel Just a écrit : Can you reproduce with debug osd = 20 debug filestore = 20 debug ms = 1 ? -Sam On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I join : - osd.20 is one of osd that I detect which makes crash other OSD. - osd.23 is one of osd which crash when i start osd.20 - mds, is one of my MDS I cut log file because they are to big but. All is here : https://blondeau.users.greyc.fr/cephlog/ Regards Le 30/06/2014 17:35, Gregory Farnum a écrit : What's the backtrace from the crashing OSDs? Keep in mind that as a dev release, it's generally best not to upgrade to unnamed versions like 0.82 (but it's probably too late to go back now). I will remember it the next time ;) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, After the upgrade to firefly, I have some PG in peering state. I seen the output of 0.82 so I try to upgrade for solved my problem. My three MDS crash and some OSD triggers a chain reaction that kills other OSD. I think my MDS will not start because of the metadata are on the OSD. I have 36 OSD on three servers and I identified 5 OSD which makes crash others. If i not start their, the cluster passe in reconstructive state with 31 OSD but i have 378 in down+peering state. How can I do ? Would you more information ( os, crash log, etc ... ) ? Regards -- -- Pierre BLONDEAU Administrateur Systèmes réseaux Université de Caen Laboratoire GREYC, Département d'informatique tel : 02 31 56 75 42 bureau : Campus 2, Science 3, 406 -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- -- Pierre BLONDEAU Administrateur Systèmes réseaux Université de Caen Laboratoire GREYC, Département d'informatique tel : 02 31 56 75 42 bureau : Campus 2, Science 3, 406 -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- -- Pierre BLONDEAU Administrateur Systèmes réseaux Université de Caen Laboratoire GREYC, Département d'informatique tel : 02 31 56 75 42 bureau : Campus 2, Science 3, 406 -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
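In ceph.conf form, the settings Sam lists are just three lines in the [osd] section, applied after a restart of the OSDs:

[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1

The injectargs command Pierre quotes elsewhere in the thread changes the same values on already-running daemons without a restart, but the change is lost the next time the daemon restarts.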
Re: [ceph-users] Some OSD and MDS crash
Hi, I do it, the log files are available here : https://blondeau.users.greyc.fr/cephlog/debug20/ The OSD's files are really big +/- 80M . After starting the osd.20 some other osd crash. I pass from 31 osd up to 16. I remark that after this the number of down+peering PG decrease from 367 to 248. It's normal ? May be it's temporary, the time that the cluster verifies all the PG ? Regards Pierre Le 02/07/2014 19:16, Samuel Just a écrit : You should add debug osd = 20 debug filestore = 20 debug ms = 1 to the [osd] section of the ceph.conf and restart the osds. I'd like all three logs if possible. Thanks -Sam On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Yes, but how i do that ? With a command like that ? ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1' By modify the /etc/ceph/ceph.conf ? This file is really poor because I use udev detection. When I have made these changes, you want the three log files or only osd.20's ? Thank you so much for the help Regards Pierre Le 01/07/2014 23:51, Samuel Just a écrit : Can you reproduce with debug osd = 20 debug filestore = 20 debug ms = 1 ? -Sam On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I join : - osd.20 is one of osd that I detect which makes crash other OSD. - osd.23 is one of osd which crash when i start osd.20 - mds, is one of my MDS I cut log file because they are to big but. All is here : https://blondeau.users.greyc.fr/cephlog/ Regards Le 30/06/2014 17:35, Gregory Farnum a écrit : What's the backtrace from the crashing OSDs? Keep in mind that as a dev release, it's generally best not to upgrade to unnamed versions like 0.82 (but it's probably too late to go back now). I will remember it the next time ;) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, After the upgrade to firefly, I have some PG in peering state. I seen the output of 0.82 so I try to upgrade for solved my problem. My three MDS crash and some OSD triggers a chain reaction that kills other OSD. I think my MDS will not start because of the metadata are on the OSD. I have 36 OSD on three servers and I identified 5 OSD which makes crash others. If i not start their, the cluster passe in reconstructive state with 31 OSD but i have 378 in down+peering state. How can I do ? Would you more information ( os, crash log, etc ... ) ? Regards -- -- Pierre BLONDEAU Administrateur Systèmes réseaux Université de Caen Laboratoire GREYC, Département d'informatique tel : 02 31 56 75 42 bureau : Campus 2, Science 3, 406 -- smime.p7s Description: Signature cryptographique S/MIME ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
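Once logs of that size have been captured, the verbosity can be dropped back without restarting anything; the values below are the usual defaults, but treat the exact numbers as an assumption for this version:

ceph tell osd.* injectargs '--debug-osd 0/5 --debug-filestore 1/3 --debug-ms 0/5'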
Re: [ceph-users] Some OSD and MDS crash
Ok, in current/meta on osd 20 and osd 23, please attach all files matching ^osdmap.13258.* There should be one such file on each osd. (should look something like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory, you'll want to use find). What version of ceph is running on your mons? How many mons do you have? -Sam On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I do it, the log files are available here : https://blondeau.users.greyc.fr/cephlog/debug20/ The OSD's files are really big +/- 80M . After starting the osd.20 some other osd crash. I pass from 31 osd up to 16. I remark that after this the number of down+peering PG decrease from 367 to 248. It's normal ? May be it's temporary, the time that the cluster verifies all the PG ? Regards Pierre Le 02/07/2014 19:16, Samuel Just a écrit : You should add debug osd = 20 debug filestore = 20 debug ms = 1 to the [osd] section of the ceph.conf and restart the osds. I'd like all three logs if possible. Thanks -Sam On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Yes, but how i do that ? With a command like that ? ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1' By modify the /etc/ceph/ceph.conf ? This file is really poor because I use udev detection. When I have made these changes, you want the three log files or only osd.20's ? Thank you so much for the help Regards Pierre Le 01/07/2014 23:51, Samuel Just a écrit : Can you reproduce with debug osd = 20 debug filestore = 20 debug ms = 1 ? -Sam On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I join : - osd.20 is one of osd that I detect which makes crash other OSD. - osd.23 is one of osd which crash when i start osd.20 - mds, is one of my MDS I cut log file because they are to big but. All is here : https://blondeau.users.greyc.fr/cephlog/ Regards Le 30/06/2014 17:35, Gregory Farnum a écrit : What's the backtrace from the crashing OSDs? Keep in mind that as a dev release, it's generally best not to upgrade to unnamed versions like 0.82 (but it's probably too late to go back now). I will remember it the next time ;) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, After the upgrade to firefly, I have some PG in peering state. I seen the output of 0.82 so I try to upgrade for solved my problem. My three MDS crash and some OSD triggers a chain reaction that kills other OSD. I think my MDS will not start because of the metadata are on the OSD. I have 36 OSD on three servers and I identified 5 OSD which makes crash others. If i not start their, the cluster passe in reconstructive state with 31 OSD but i have 378 in down+peering state. How can I do ? Would you more information ( os, crash log, etc ... ) ? Regards -- -- Pierre BLONDEAU Administrateur Systèmes réseaux Université de Caen Laboratoire GREYC, Département d'informatique tel : 02 31 56 75 42 bureau : Campus 2, Science 3, 406 -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
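Concretely, that means something like the following on each of the two nodes, assuming the default /var/lib/ceph layout (adjust the OSD id for osd.23):

find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*'
# checksumming them makes it easy to compare the copies across OSDs
find /var/lib/ceph/osd/ceph-20/current/meta -name 'osdmap.13258*' -exec md5sum {} +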
Re: [ceph-users] Some OSD and MDS crash
Also, what version did you upgrade from, and how did you upgrade? -Sam On Wed, Jul 2, 2014 at 3:09 PM, Samuel Just sam.j...@inktank.com wrote: Ok, in current/meta on osd 20 and osd 23, please attach all files matching ^osdmap.13258.* There should be one such file on each osd. (should look something like osdmap.6__0_FD6E4C01__none, probably hashed into a subdirectory, you'll want to use find). What version of ceph is running on your mons? How many mons do you have? -Sam On Wed, Jul 2, 2014 at 2:21 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I do it, the log files are available here : https://blondeau.users.greyc.fr/cephlog/debug20/ The OSD's files are really big +/- 80M . After starting the osd.20 some other osd crash. I pass from 31 osd up to 16. I remark that after this the number of down+peering PG decrease from 367 to 248. It's normal ? May be it's temporary, the time that the cluster verifies all the PG ? Regards Pierre Le 02/07/2014 19:16, Samuel Just a écrit : You should add debug osd = 20 debug filestore = 20 debug ms = 1 to the [osd] section of the ceph.conf and restart the osds. I'd like all three logs if possible. Thanks -Sam On Wed, Jul 2, 2014 at 5:03 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Yes, but how i do that ? With a command like that ? ceph tell osd.20 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1' By modify the /etc/ceph/ceph.conf ? This file is really poor because I use udev detection. When I have made these changes, you want the three log files or only osd.20's ? Thank you so much for the help Regards Pierre Le 01/07/2014 23:51, Samuel Just a écrit : Can you reproduce with debug osd = 20 debug filestore = 20 debug ms = 1 ? -Sam On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I join : - osd.20 is one of osd that I detect which makes crash other OSD. - osd.23 is one of osd which crash when i start osd.20 - mds, is one of my MDS I cut log file because they are to big but. All is here : https://blondeau.users.greyc.fr/cephlog/ Regards Le 30/06/2014 17:35, Gregory Farnum a écrit : What's the backtrace from the crashing OSDs? Keep in mind that as a dev release, it's generally best not to upgrade to unnamed versions like 0.82 (but it's probably too late to go back now). I will remember it the next time ;) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, After the upgrade to firefly, I have some PG in peering state. I seen the output of 0.82 so I try to upgrade for solved my problem. My three MDS crash and some OSD triggers a chain reaction that kills other OSD. I think my MDS will not start because of the metadata are on the OSD. I have 36 OSD on three servers and I identified 5 OSD which makes crash others. If i not start their, the cluster passe in reconstructive state with 31 OSD but i have 378 in down+peering state. How can I do ? Would you more information ( os, crash log, etc ... ) ? Regards -- -- Pierre BLONDEAU Administrateur Systèmes réseaux Université de Caen Laboratoire GREYC, Département d'informatique tel : 02 31 56 75 42 bureau : Campus 2, Science 3, 406 -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Some OSD and MDS crash
Joao: this looks like divergent osdmaps; osd 20 and osd 23 have differing ideas of the acting set for pg 2.11. Did we add hashes to the incremental maps? What would you want to know from the mons? -Sam

On Wed, Jul 2, 2014 at 3:10 PM, Samuel Just sam.j...@inktank.com wrote: Also, what version did you upgrade from, and how did you upgrade? [...]
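As an aside for readers: one rough way to see this kind of disagreement is to compare what the monitor reports for the PG's mapping with the osdmap epochs each OSD is actually at. This is an editorial sketch using pg 2.11 from Sam's message and the default asok paths; it assumes the daemons are up enough to answer and that your release supports the OSD status asok command:

# ceph pg map 2.11
# ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok status
# ceph --admin-daemon /var/run/ceph/ceph-osd.23.asok status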
Re: [ceph-users] Some OSD and MDS crash
Yeah, divergent osdmaps:

555ed048e73024687fc8b106a570db4f osd-20_osdmap.13258__0_4E62BB79__none
6037911f31dc3c18b05499d24dcdbe5c osd-23_osdmap.13258__0_4E62BB79__none

Joao: thoughts? -Sam

On Wed, Jul 2, 2014 at 3:39 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote:
The files are attached. When I upgraded:
ceph-deploy install --stable firefly servers...
on each server: service ceph restart mon
on each server: service ceph restart osd
on each server: service ceph restart mds
I upgraded from emperor to firefly. After repairs, remaps, replacements, etc., I still had some PGs stuck in the peering state. I thought "why not try version 0.82, it could solve my problem" (that was my mistake). So I upgraded from firefly to 0.83 with: ceph-deploy install --testing servers... Now all the daemons report version 0.82. I have 3 mons, 36 OSDs and 3 MDSs. Pierre
PS: I also find inc\uosdmap.13258__0_469271DE__none in each meta directory.
Re: [ceph-users] Some OSD and MDS crash
Ah:

~/logs » for i in 20 23; do ../ceph/src/osdmaptool --export-crush /tmp/crush$i osd-$i*; ../ceph/src/crushtool -d /tmp/crush$i > /tmp/crush$i.d; done; diff /tmp/crush20.d /tmp/crush23.d
../ceph/src/osdmaptool: osdmap file 'osd-20_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush20
../ceph/src/osdmaptool: osdmap file 'osd-23_osdmap.13258__0_4E62BB79__none'
../ceph/src/osdmaptool: exported crush map to /tmp/crush23
6d5
< tunable chooseleaf_vary_r 1

Looks like the chooseleaf_vary_r tunable somehow ended up divergent? Pierre: do you recall how and when that got set? -Sam

On Wed, Jul 2, 2014 at 3:43 PM, Samuel Just sam.j...@inktank.com wrote: Yeah, divergent osdmaps: [...]
Re: [ceph-users] Some OSD and MDS crash
On 03/07/2014 00:55, Samuel Just wrote: Looks like the chooseleaf_vary_r tunable somehow ended up divergent? Pierre: do you recall how and when that got set?

I am not sure I understand, but if I remember correctly, after the update to firefly the cluster was in the state HEALTH_WARN crush map has legacy tunables, and I was seeing "feature set mismatch" in the logs. So, if I remember correctly, I ran ceph osd crush tunables optimal to deal with the crush map problem, and I updated my client and server kernels to 3.16rc. Could that be it? Pierre
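For readers who want to check this on their own cluster, a hedged sketch of how to see which tunables the live crush map actually carries; these are generic commands, not ones run in the thread (ceph osd crush show-tunables may not exist on very old releases, in which case the decompile route below still works):

# ceph osd crush show-tunables
# ceph osd getcrushmap -o /tmp/crush.live
# crushtool -d /tmp/crush.live -o /tmp/crush.txt
# grep '^tunable' /tmp/crush.txt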
Re: [ceph-users] Some OSD and MDS crash
Can you confirm from the admin socket that all monitors are running the same version? -Sam

On Wed, Jul 2, 2014 at 4:15 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: I am not sure I understand, but if I remember correctly, after the update to firefly the cluster was in the state HEALTH_WARN crush map has legacy tunables [...]
Re: [ceph-users] Some OSD and MDS crash
Like that?

# ceph --admin-daemon /var/run/ceph/ceph-mon.william.asok version
{"version":"0.82"}
# ceph --admin-daemon /var/run/ceph/ceph-mon.jack.asok version
{"version":"0.82"}
# ceph --admin-daemon /var/run/ceph/ceph-mon.joe.asok version
{"version":"0.82"}

Pierre

On 03/07/2014 01:17, Samuel Just wrote: Can you confirm from the admin socket that all monitors are running the same version? [...]
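The same check scales to every Ceph daemon on a host with a small loop over the default asok directory; this is an editorial convenience based on the command above, not part of the original exchange:

# for s in /var/run/ceph/*.asok; do echo "$s:"; ceph --admin-daemon "$s" version; done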
Re: [ceph-users] Some OSD and MDS crash
Yes, thanks. -Sam

On Wed, Jul 2, 2014 at 4:21 PM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Like that? # ceph --admin-daemon /var/run/ceph/ceph-mon.william.asok version [...]
Re: [ceph-users] Some OSD and MDS crash
Hi, I am attaching:
- osd.20, one of the OSDs that I identified as making other OSDs crash;
- osd.23, one of the OSDs that crashes when I start osd.20;
- mds, one of my MDSs.
I truncated the log files because they are too big, but everything is here: https://blondeau.users.greyc.fr/cephlog/ Regards

On 30/06/2014 17:35, Gregory Farnum wrote: Keep in mind that as a dev release, it's generally best not to upgrade to unnamed versions like 0.82 (but it's probably too late to go back now).

I will remember that next time ;)

Backtraces from the attached logs:

ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: (MDLog::_reformat_journal(JournalPointer const&, Journaler*, Context*)+0x1356) [0x855826]
 2: (MDLog::_recovery_thread(Context*)+0x7dc) [0x85606c]
 3: (MDLog::RecoveryThread::entry()+0x11) [0x664651]
 4: (()+0x6b50) [0x7f3bb5bc5b50]
 5: (clone()+0x6d) [0x7f3bb49ee0ed]

ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: /usr/bin/ceph-mds() [0x8d81f2]
 2: (()+0xf030) [0x7f3bb5bce030]
 3: (gsignal()+0x35) [0x7f3bb4944475]
 4: (abort()+0x180) [0x7f3bb49476f0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f3bb519a89d]
 6: (()+0x63996) [0x7f3bb5198996]
 7: (()+0x639c3) [0x7f3bb51989c3]
 8: (()+0x63bee) [0x7f3bb5198bee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0x9ab5da]
 10: (MDLog::_reformat_journal(JournalPointer const&, Journaler*, Context*)+0x1356) [0x855826]
 11: (MDLog::_recovery_thread(Context*)+0x7dc) [0x85606c]
 12: (MDLog::RecoveryThread::entry()+0x11) [0x664651]
 13: (()+0x6b50) [0x7f3bb5bc5b50]
 14: (clone()+0x6d) [0x7f3bb49ee0ed]

ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
 1: (PG::fulfill_info(pg_shard_t, pg_query_t const&, std::pair<pg_shard_t, pg_info_t>&)+0x5a) [0x879efa]
 2: (PG::RecoveryState::Stray::react(PG::MQuery const&)+0xef) [0x88be5f]
 3: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, ...>::react_impl(boost::statechart::event_base const&, void const*)+0x86) [0x8c8f06]
 4: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, ...
Re: [ceph-users] Some OSD and MDS crash
Can you reproduce with debug osd = 20, debug filestore = 20, debug ms = 1? -Sam

On Tue, Jul 1, 2014 at 1:21 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, I am attaching: osd.20, one of the OSDs that I identified as making other OSDs crash [...]
[ceph-users] Some OSD and MDS crash
Hi, After the upgrade to firefly, I had some PGs stuck in the peering state. I saw that 0.82 had been released, so I tried upgrading to it to solve my problem. Now my three MDSs crash, and some OSDs trigger a chain reaction that kills other OSDs. I think my MDSs will not start because their metadata are on the OSDs. I have 36 OSDs on three servers, and I identified 5 OSDs that make the others crash. If I do not start those, the cluster goes into recovery with 31 OSDs, but I still have 378 PGs in the down+peering state. What can I do? Would you like more information (OS, crash logs, etc.)? Regards

-- Pierre BLONDEAU, Administrateur Systèmes réseaux, Université de Caen, Laboratoire GREYC, Département d'informatique, tel : 02 31 56 75 42, bureau : Campus 2, Science 3, 406
Re: [ceph-users] Some OSD and MDS crash
What's the backtrace from the crashing OSDs? Keep in mind that as a dev release, it's generally best not to upgrade to unnamed versions like 0.82 (but it's probably too late to go back now). -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Jun 30, 2014 at 8:06 AM, Pierre BLONDEAU pierre.blond...@unicaen.fr wrote: Hi, After the upgrade to firefly, I had some PGs stuck in the peering state. [...]
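For anyone chasing a similar crash: the backtrace Greg asks for is printed into the OSD log when the daemon hits an assertion or a fatal signal. A rough way to pull it out, assuming the default Debian/Ubuntu log location; "FAILED assert" is the usual marker for assertion failures, so adjust the pattern (or just read the tail of the log) if your crash is a plain signal:

# grep -n -A 40 'FAILED assert' /var/log/ceph/ceph-osd.20.log | tail -80
# tail -n 300 /var/log/ceph/ceph-osd.20.log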