Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)
Dear list,

Quoting Stefan Kooman (ste...@bit.nl):

> I wonder if this situation is more likely to be hit on Mimic 13.2.6 than
> on any other system.
>
> Any hints / help to prevent this from happening?

We have had this happen another two times now. In both cases the MDS
recovers, becomes active (for a few seconds), and crashes again. It won't
come out of this loop by itself. When put in debug mode (debug_mds = 10/10)
we don't hit the bug and the MDS stays active. After a few minutes we
disable debug again live (ceph tell mds.* config set debug_mds 0/0) and it
keeps running (Heisenbug) ... until hours later, when it crashes again and
the story repeats itself. So unfortunately no more debug information is
available, but at least we have a workaround to get the MDS active again.

Gr. Stefan

--
| BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6  +31 318 648 688 / i...@bit.nl

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
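For reference, the workaround described above boils down to two live config
changes. The commands are as given in the thread; adjust the `mds.*` target
to your own deployment:

```shell
# Enable verbose MDS logging; with this set, the assert was not hit
# and the MDS stayed active.
ceph tell mds.* config set debug_mds 10/10

# Once the MDS is stably active again, turn debug logging back off.
ceph tell mds.* config set debug_mds 0/0
```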
Re: [ceph-users] krbd / kcephfs - jewel client features question
bit 21 indicates whether upmap is supported, and it is not set in
0x7fddff8ee8cbffb, so no.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Oct 19, 2019 at 2:00 PM Lei Liu wrote:
>
> Hello Ilya,
>
> After updating the client kernel version to 3.10.0-862, ceph features shows:
>
> "client": {
>     "group": {
>         "features": "0x7010fb86aa42ada",
>         "release": "jewel",
>         "num": 5
>     },
>     "group": {
>         "features": "0x7fddff8ee8cbffb",
>         "release": "jewel",
>         "num": 1
>     },
>     "group": {
>         "features": "0x3ffddff8eea4fffb",
>         "release": "luminous",
>         "num": 6
>     },
>     "group": {
>         "features": "0x3ffddff8eeacfffb",
>         "release": "luminous",
>         "num": 1
>     }
> }
>
> Both 0x7fddff8ee8cbffb and 0x7010fb86aa42ada are reported by the new
> kernel client.
>
> Is it now possible to force set-require-min-compat-client to luminous,
> and if not, how can we fix it?
>
> Thanks
>
> On Thu, Oct 17, 2019 at 9:45 PM Ilya Dryomov wrote:
>>
>> On Thu, Oct 17, 2019 at 3:38 PM Lei Liu wrote:
>> >
>> > Hi Cephers,
>> >
>> > We have some ceph clusters on version 12.2.x. Now we want to use the
>> > upmap balancer, but when I set set-require-min-compat-client to
>> > luminous, it failed:
>> >
>> > # ceph osd set-require-min-compat-client luminous
>> > Error EPERM: cannot set require_min_compat_client to luminous: 6 connected
>> > client(s) look like jewel (missing 0xa20); 1 connected
>> > client(s) look like jewel (missing 0x800); 1 connected
>> > client(s) look like jewel (missing 0x820); add
>> > --yes-i-really-mean-it to do it anyway
>> >
>> > ceph features
>> >
>> > "client": {
>> >     "group": {
>> >         "features": "0x40106b84a842a52",
>> >         "release": "jewel",
>> >         "num": 6
>> >     },
>> >     "group": {
>> >         "features": "0x7010fb86aa42ada",
>> >         "release": "jewel",
>> >         "num": 1
>> >     },
>> >     "group": {
>> >         "features": "0x7fddff8ee84bffb",
>> >         "release": "jewel",
>> >         "num": 1
>> >     },
>> >     "group": {
>> >         "features": "0x3ffddff8eea4fffb",
>> >         "release": "luminous",
>> >         "num": 7
>> >     }
>> > }
>> >
>> > and sessions
>> >
>> > "MonSession(unknown.0 10.10.100.6:0/1603916368 is open allow *, features 0x40106b84a842a52 (jewel))",
>> > "MonSession(unknown.0 10.10.100.2:0/2484488531 is open allow *, features 0x40106b84a842a52 (jewel))",
>> > "MonSession(client.? 10.10.100.6:0/657483412 is open allow *, features 0x7fddff8ee84bffb (jewel))",
>> > "MonSession(unknown.0 10.10.14.67:0/500706582 is open allow *, features 0x7010fb86aa42ada (jewel))"
>> >
>> > Can I use --yes-i-really-mean-it to force enable it?
>>
>> No. 0x40106b84a842a52 and 0x7fddff8ee84bffb are too old.
>>
>> Thanks,
>>
>>                 Ilya
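Paul's bit-21 check can be reproduced in a few lines of Python. This is a
sketch based only on his statement above that bit 21 of the reported feature
mask signals upmap support; the helper name `has_upmap_bit` is ours for
illustration, not a Ceph API.

```python
# Check whether bit 21 (upmap support, per Paul's reply) is set in a
# client feature mask as reported by `ceph features`.

UPMAP_BIT = 21  # bit position quoted in the thread


def has_upmap_bit(feature_mask: int) -> bool:
    """Return True if the upmap feature bit is set in the mask."""
    return bool((feature_mask >> UPMAP_BIT) & 1)


# The jewel kernel client from the thread: bit 21 is not set.
print(has_upmap_bit(0x7fddff8ee8cbffb))   # False
# One of the luminous client groups: bit 21 is set.
print(has_upmap_bit(0x3ffddff8eea4fffb))  # True
```

Running this against the masks from the `ceph features` output shows why the
monitor refuses the luminous requirement while the jewel-era clients are
still connected.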
Re: [ceph-users] krbd / kcephfs - jewel client features question
Hello Ilya,

After updating the client kernel version to 3.10.0-862, ceph features shows:

"client": {
    "group": {
        "features": "0x7010fb86aa42ada",
        "release": "jewel",
        "num": 5
    },
    "group": {
        "features": "0x7fddff8ee8cbffb",
        "release": "jewel",
        "num": 1
    },
    "group": {
        "features": "0x3ffddff8eea4fffb",
        "release": "luminous",
        "num": 6
    },
    "group": {
        "features": "0x3ffddff8eeacfffb",
        "release": "luminous",
        "num": 1
    }
}

Both 0x7fddff8ee8cbffb and 0x7010fb86aa42ada are reported by the new kernel
client.

Is it now possible to force set-require-min-compat-client to luminous, and
if not, how can we fix it?

Thanks

On Thu, Oct 17, 2019 at 9:45 PM Ilya Dryomov wrote:
> On Thu, Oct 17, 2019 at 3:38 PM Lei Liu wrote:
> >
> > Hi Cephers,
> >
> > We have some ceph clusters on version 12.2.x. Now we want to use the
> > upmap balancer, but when I set set-require-min-compat-client to
> > luminous, it failed:
> >
> > # ceph osd set-require-min-compat-client luminous
> > Error EPERM: cannot set require_min_compat_client to luminous: 6
> > connected client(s) look like jewel (missing 0xa20); 1
> > connected client(s) look like jewel (missing 0x800); 1
> > connected client(s) look like jewel (missing 0x820); add
> > --yes-i-really-mean-it to do it anyway
> >
> > ceph features
> >
> > "client": {
> >     "group": {
> >         "features": "0x40106b84a842a52",
> >         "release": "jewel",
> >         "num": 6
> >     },
> >     "group": {
> >         "features": "0x7010fb86aa42ada",
> >         "release": "jewel",
> >         "num": 1
> >     },
> >     "group": {
> >         "features": "0x7fddff8ee84bffb",
> >         "release": "jewel",
> >         "num": 1
> >     },
> >     "group": {
> >         "features": "0x3ffddff8eea4fffb",
> >         "release": "luminous",
> >         "num": 7
> >     }
> > }
> >
> > and sessions
> >
> > "MonSession(unknown.0 10.10.100.6:0/1603916368 is open allow *, features 0x40106b84a842a52 (jewel))",
> > "MonSession(unknown.0 10.10.100.2:0/2484488531 is open allow *, features 0x40106b84a842a52 (jewel))",
> > "MonSession(client.? 10.10.100.6:0/657483412 is open allow *, features 0x7fddff8ee84bffb (jewel))",
> > "MonSession(unknown.0 10.10.14.67:0/500706582 is open allow *, features 0x7010fb86aa42ada (jewel))"
> >
> > Can I use --yes-i-really-mean-it to force enable it?
>
> No. 0x40106b84a842a52 and 0x7fddff8ee84bffb are too old.
>
> Thanks,
>
>                 Ilya
[ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)
Dear list,

Today our active MDS crashed with an assert:

2019-10-19 08:14:50.645 7f7906cb7700 -1 /build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7f7906cb7700 time 2019-10-19 08:14:50.648559
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 473: FAILED assert(omap_num_objs <= MAX_OBJECTS)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f7911b2897e]
 2: (()+0x2fab07) [0x7f7911b28b07]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) [0x7703f7]
 4: (MDLog::trim(int)+0x5a6) [0x75dcd6]
 5: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
 6: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
 7: (Context::complete(int)+0x9) [0x4d31d9]
 8: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
 9: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
 10: (()+0x76ba) [0x7f79113a76ba]
 11: (clone()+0x6d) [0x7f7910bd041d]

2019-10-19 08:14:50.649 7f7906cb7700 -1 *** Caught signal (Aborted) **
 in thread 7f7906cb7700 thread_name:safe_timer

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0x11390) [0x7f79113b1390]
 2: (gsignal()+0x38) [0x7f7910afe428]
 3: (abort()+0x16a) [0x7f7910b0002a]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f7911b28a86]
 5: (()+0x2fab07) [0x7f7911b28b07]
 6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) [0x7703f7]
 7: (MDLog::trim(int)+0x5a6) [0x75dcd6]
 8: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
 9: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
 10: (Context::complete(int)+0x9) [0x4d31d9]
 11: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
 12: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
 13: (()+0x76ba) [0x7f79113a76ba]
 14: (clone()+0x6d) [0x7f7910bd041d]

 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apparently this is bug 36094 (https://tracker.ceph.com/issues/36094).

Our active MDS had mds_cache_memory_limit=150G and ~27 M caps handed out to
78 clients, a few of them holding many millions of caps. This resulted in a
laggy MDS ... another failover ... until the MDS was finally able to cope
with the load. We adjusted mds_cache_memory_limit to 32G right after that
and activated the new limit:

ceph tell mds.* config set mds_cache_memory_limit 34359738368

We double-checked it was set correctly and monitored memory usage. That all
went fine, with around ~6 M caps in use (2 clients used 5/6 of those).
After ~5 hours the same assert was hit. Fortunately the failover was way
faster now ... but then the now-active MDS hit the same assert again,
triggering another failover ... the other MDS took over and failed again
... the first took over and cephfs was healthy again ...

The bug report does not hint at how to prevent this situation. Recently
Zoë O'Connell hit the same issue on a Mimic 13.2.6 system:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036702.html

I wonder if this situation is more likely to be hit on Mimic 13.2.6 than
on any other system.

Any hints / help to prevent this from happening?

Thanks,

Stefan

--
| BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6  +31 318 648 688 / i...@bit.nl
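As a side note, the value passed to `ceph tell` above is just 32 GiB
expressed in bytes (mds_cache_memory_limit takes bytes). A quick sanity
check, plain arithmetic rather than any Ceph API:

```python
# mds_cache_memory_limit is given in bytes; the thread uses 32 GiB.
GIB = 1024 ** 3

limit_bytes = 32 * GIB
print(limit_bytes)  # 34359738368, the value used in the thread
```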