Re: [ceph-users] Fwd: OSD crashes after upgrade to 0.80.10
An update: it seems that I am running into memory shortage. Even with 32 GB for 20 OSDs and 2 GB of swap, ceph-osd uses all available memory. I created another 10 GB swap device, and I managed to get the failed OSD running without a crash, but it consumed an extra 5 GB. Are there known issues regarding memory usage in ceph-osd? I still have the problem of the incomplete+inactive PG, though.

Regards,
Gerd

On 12-08-2015 10:11, Gerd Jakobovitsch wrote:

I tried it; the error propagates to whichever OSD receives the errored PG. For the moment, this is my worst problem. I have one incomplete+inactive PG, and the OSD with the highest priority for it accumulates 100 blocked requests (I guess that is the maximum) and, although running, does not accept other requests - for example, ceph tell osd.21 injectargs '--osd-max-backfills 1'. After some time it crashes, and the blocked requests move to the second OSD for the errored PG. I can't get rid of these slow requests.

I suspected a problem with leveldb. I checked, and I had the default version for Debian wheezy (0+20120530.gitdd0d562-1). I updated it to the wheezy-backports version (1.17-1~bpo70+1), but the error was the same. I use the regular wheezy kernel (3.2+46).

On 11-08-2015 23:52, Haomai Wang wrote:

It seems like a leveldb problem. Could you just kick it out and add a new OSD to make the cluster healthy first?

On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch g...@mandic.net.br wrote:

Dear all,

I run a Ceph system with 4 nodes and ~80 OSDs using xfs, currently at 75% usage, running firefly. On Friday I upgraded it from 0.80.8 to 0.80.10, and since then several OSDs have been crashing and never recovering: trying to start one ends in a crash, as follows. Is this problem known? Is there any configuration that should be checked? Is there any way to recover these OSDs without losing all data? After marking the OSD lost, I got one incomplete, inactive PG. Is there any way to recover it? The data still exists on the crashed OSDs.

Regards.
[(12:58:13) root@spcsnp3 ~]# service ceph start osd.7
=== osd.7 ===
2015-08-11 12:58:21.003876 7f17ed52b700  1 monclient(hunting): found mon.spcsmp2
2015-08-11 12:58:21.003915 7f17ef493700  5 monclient: authenticate success, global_id 206010466
create-or-move updated item name 'osd.7' weight 3.64 at location {host=spcsnp3,root=default} to crush map
Starting Ceph osd.7 on spcsnp3...
2015-08-11 12:58:21.279878 7f200fa8f780  0 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
starting osd.7 at :/0 osd_data /var/lib/ceph/osd/ceph-7 /var/lib/ceph/osd/ceph-7/journal
[(12:58:21) root@spcsnp3 ~]# 2015-08-11 12:58:21.348094 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) dump_stop
2015-08-11 12:58:21.348291 7f200fa8f780  5 filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal /var/lib/ceph/osd/ceph-7/journal
2015-08-11 12:58:21.348326 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) mount fsid is 54c136da-c51c-4799-b2dc-b7988982ee00
2015-08-11 12:58:21.349010 7f200fa8f780  0 filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
2015-08-11 12:58:21.349026 7f200fa8f780  1 filestore(/var/lib/ceph/osd/ceph-7) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2015-08-11 12:58:21.353277 7f200fa8f780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is supported and appears to work
2015-08-11 12:58:21.353302 7f200fa8f780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-08-11 12:58:21.362106 7f200fa8f780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-08-11 12:58:21.362195 7f200fa8f780  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is disabled by conf
2015-08-11 12:58:21.362701 7f200fa8f780  5 filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) **
 in thread 7f200fa8f780
 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
 1: /usr/bin/ceph-osd() [0xab7562]
 2: (()+0xf0a0) [0x7f200efcd0a0]
 3: (gsignal()+0x35) [0x7f200db3f165]
 4: (abort()+0x180) [0x7f200db423e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
 6: (()+0x63996) [0x7f200e393996]
 7: (()+0x639c3) [0x7f200e3939c3]
 8: (()+0x63bee) [0x7f200e393bee]
 9: (tc_new()+0x48e) [0x7f200f213aee]
 10: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7f200e3ef999]
 11: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x28) [0x7f200e3f0708]
 12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
 13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
 14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2) [0x7f200f46ffa2]
 15:
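For reference, the temporary swap device described in the update above can be set up along these lines. This is a sketch, to be run as root; the /var/swap-osd path and the 10 GB size are illustrative, not from the thread:

```shell
# Sketch: add a temporary 10 GB swap file to give ceph-osd extra headroom
# while the stuck PG recovers. Path and size are illustrative.
fallocate -l 10G /var/swap-osd   # or: dd if=/dev/zero of=/var/swap-osd bs=1M count=10240
chmod 600 /var/swap-osd          # swap files must not be readable by other users
mkswap /var/swap-osd             # write the swap signature
swapon /var/swap-osd             # enable it immediately
swapon -s                        # confirm the new swap device is active
```

Remember to `swapoff` and remove the file once the cluster is healthy again; extra swap only buys headroom against the memory growth, it does not fix it.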
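The blocked requests and the incomplete+inactive PG described above can be inspected without touching data. A sketch of the usual read-only commands, assuming the same firefly cluster; "4.1ab" is a placeholder PG id, not one taken from this thread:

```shell
# Sketch: read-only diagnostics for blocked requests and a stuck PG.
ceph health detail            # lists slow/blocked requests and the OSDs holding them
ceph pg dump_stuck inactive   # shows which PGs are stuck inactive and their acting OSDs
ceph pg 4.1ab query           # peering state; explains why the PG is incomplete
# Per-daemon view of the blocked ops on osd.21 via its admin socket:
ceph --admin-daemon /var/run/ceph/ceph-osd.21.asok dump_ops_in_flight
```

The `pg query` output in particular shows which OSDs the PG is waiting for, which helps decide whether data can still be recovered from the crashed OSDs.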
Re: [ceph-users] Fwd: OSD crashes after upgrade to 0.80.10
It seems like a leveldb problem. Could you just kick it out and add a new OSD to make the cluster healthy first?

On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch g...@mandic.net.br wrote:

Dear all,

I run a Ceph system with 4 nodes and ~80 OSDs using xfs, currently at 75% usage, running firefly. On Friday I upgraded it from 0.80.8 to 0.80.10, and since then several OSDs have been crashing and never recovering: trying to start one ends in a crash, as follows. Is this problem known? Is there any configuration that should be checked? Is there any way to recover these OSDs without losing all data? After marking the OSD lost, I got one incomplete, inactive PG. Is there any way to recover it? The data still exists on the crashed OSDs.

Regards.

[(12:58:13) root@spcsnp3 ~]# service ceph start osd.7
=== osd.7 ===
2015-08-11 12:58:21.003876 7f17ed52b700  1 monclient(hunting): found mon.spcsmp2
2015-08-11 12:58:21.003915 7f17ef493700  5 monclient: authenticate success, global_id 206010466
create-or-move updated item name 'osd.7' weight 3.64 at location {host=spcsnp3,root=default} to crush map
Starting Ceph osd.7 on spcsnp3...
2015-08-11 12:58:21.279878 7f200fa8f780  0 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 31918
starting osd.7 at :/0 osd_data /var/lib/ceph/osd/ceph-7 /var/lib/ceph/osd/ceph-7/journal
[(12:58:21) root@spcsnp3 ~]# 2015-08-11 12:58:21.348094 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) dump_stop
2015-08-11 12:58:21.348291 7f200fa8f780  5 filestore(/var/lib/ceph/osd/ceph-7) basedir /var/lib/ceph/osd/ceph-7 journal /var/lib/ceph/osd/ceph-7/journal
2015-08-11 12:58:21.348326 7f200fa8f780 10 filestore(/var/lib/ceph/osd/ceph-7) mount fsid is 54c136da-c51c-4799-b2dc-b7988982ee00
2015-08-11 12:58:21.349010 7f200fa8f780  0 filestore(/var/lib/ceph/osd/ceph-7) mount detected xfs (libxfs)
2015-08-11 12:58:21.349026 7f200fa8f780  1 filestore(/var/lib/ceph/osd/ceph-7) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2015-08-11 12:58:21.353277 7f200fa8f780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is supported and appears to work
2015-08-11 12:58:21.353302 7f200fa8f780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-08-11 12:58:21.362106 7f200fa8f780  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-08-11 12:58:21.362195 7f200fa8f780  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-7) detect_feature: extsize is disabled by conf
2015-08-11 12:58:21.362701 7f200fa8f780  5 filestore(/var/lib/ceph/osd/ceph-7) mount op_seq is 35490995
2015-08-11 12:58:59.383179 7f200fa8f780 -1 *** Caught signal (Aborted) **
 in thread 7f200fa8f780
 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
 1: /usr/bin/ceph-osd() [0xab7562]
 2: (()+0xf0a0) [0x7f200efcd0a0]
 3: (gsignal()+0x35) [0x7f200db3f165]
 4: (abort()+0x180) [0x7f200db423e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f200e39589d]
 6: (()+0x63996) [0x7f200e393996]
 7: (()+0x639c3) [0x7f200e3939c3]
 8: (()+0x63bee) [0x7f200e393bee]
 9: (tc_new()+0x48e) [0x7f200f213aee]
 10: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7f200e3ef999]
 11: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x28) [0x7f200e3f0708]
 12: (std::string::reserve(unsigned long)+0x30) [0x7f200e3f07f0]
 13: (std::string::append(char const*, unsigned long)+0xb5) [0x7f200e3f0ab5]
 14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, std::string*)+0x2a2) [0x7f200f46ffa2]
 15: (leveldb::DBImpl::RecoverLogFile(unsigned long, leveldb::VersionEdit*, unsigned long*)+0x180) [0x7f200f468360]
 16: (leveldb::DBImpl::Recover(leveldb::VersionEdit*)+0x5c2) [0x7f200f46adf2]
 17: (leveldb::DB::Open(leveldb::Options const&, std::string const&, leveldb::DB**)+0xff) [0x7f200f46b11f]
 18: (LevelDBStore::do_open(std::ostream&, bool)+0xd8) [0xa123a8]
 19: (FileStore::mount()+0x18e0) [0x9b7080]
 20: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78f52a]
 21: (main()+0x2234) [0x7331c4]
 22: (__libc_start_main()+0xfd) [0x7f200db2bead]
 23: /usr/bin/ceph-osd() [0x736e99]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
   -66 2015-08-11 12:58:21.277524 7f200fa8f780  5 asok(0x2800230) register_command perfcounters_dump hook 0x27f0010
   -65 2015-08-11 12:58:21.277552 7f200fa8f780  5 asok(0x2800230) register_command 1 hook 0x27f0010
   -64 2015-08-11 12:58:21.277556 7f200fa8f780  5 asok(0x2800230) register_command perf dump hook 0x27f0010
   -63 2015-08-11 12:58:21.277561 7f200fa8f780  5 asok(0x2800230) register_command perfcounters_schema hook 0x27f0010
   -62 2015-08-11 12:58:21.277564
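The "kick it out and add a new OSD" path suggested above is the standard OSD removal sequence. A sketch, using osd.7 from the log as the example and the sysvinit service style used in this thread; only do this once the data on the remaining replicas is confirmed intact:

```shell
# Sketch: remove the crashing OSD from the cluster so recovery can proceed
# onto a replacement. osd.7 is taken from the log above as the example.
ceph osd out 7               # stop mapping data to it; triggers rebalancing
service ceph stop osd.7      # stop the daemon on its host
ceph osd crush remove osd.7  # drop it from the CRUSH map
ceph auth del osd.7          # remove its authentication key
ceph osd rm 7                # remove it from the OSD map
# Then prepare a fresh disk as the replacement, e.g. with ceph-disk or ceph-deploy.
```

Since the failed OSD's filestore still holds the data, keeping its disk untouched until the incomplete PG is resolved leaves open the option of exporting objects from it later.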