Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-19 Thread Stefan Kooman
Dear list,

Quoting Stefan Kooman (ste...@bit.nl):

> I wonder if this situation is more likely to be hit on Mimic 13.2.6 than
> on any other system.
> 
> Any hints / help to prevent this from happening?

We have had this happen another two times now. In both cases the MDS
recovers, becomes active (for a few seconds), and crashes again. It won't
come out of this loop by itself. When put in debug mode (debug_mds =
10/10) we don't hit the bug and it stays active. After a few minutes we
disable debug again (live: ceph tell mds.* config set debug_mds 0/0) and
it keeps running (a Heisenbug)... until hours later, when it crashes again
and the story repeats itself.

So unfortunately there is no more debug information available, but at
least we have a workaround to get it active again.
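
For the record, the exact sequence of commands involved (both are already
mentioned above; this is just the workaround spelled out end to end):

  ceph tell mds.* config set debug_mds 10/10   # with debug on, the assert is not hit and the MDS stays active
  # ... wait until the MDS has been active and stable for a few minutes ...
  ceph tell mds.* config set debug_mds 0/0     # then disable debug again, live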

Gr. Stefan

-- 
| BIT BV   https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] krbd / kcephfs - jewel client features question

2019-10-19 Thread Paul Emmerich
Bit 21 indicates whether upmap is supported; it is not set in
0x7fddff8ee8cbffb, so no.
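
(A quick way to check this yourself with plain shell arithmetic; the two
masks are taken from the ceph features output quoted below, nothing else
is assumed:)

  echo $(( (0x7fddff8ee8cbffb >> 21) & 1 ))    # prints 0: upmap bit not set on this jewel client
  echo $(( (0x3ffddff8eea4fffb >> 21) & 1 ))   # prints 1: set for the luminous client group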


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Oct 19, 2019 at 2:00 PM Lei Liu  wrote:
>
> Hello Ilya,
>
> After updating the client kernel version to 3.10.0-862, ceph features shows:
>
> "client": {
> "group": {
> "features": "0x7010fb86aa42ada",
> "release": "jewel",
> "num": 5
> },
> "group": {
> "features": "0x7fddff8ee8cbffb",
> "release": "jewel",
> "num": 1
> },
> "group": {
> "features": "0x3ffddff8eea4fffb",
> "release": "luminous",
> "num": 6
> },
> "group": {
> "features": "0x3ffddff8eeacfffb",
> "release": "luminous",
> "num": 1
> }
> }
>
> Both 0x7fddff8ee8cbffb and 0x7010fb86aa42ada are reported by the new kernel
> client.
>
> Is it now possible to force set-require-min-compat-client to luminous, and
> if not, how can we fix it?
>
> Thanks
>
> Ilya Dryomov  wrote on Thu, Oct 17, 2019 at 9:45 PM:
>>
>> On Thu, Oct 17, 2019 at 3:38 PM Lei Liu  wrote:
>> >
>> > Hi Cephers,
>> >
> >> > We have some Ceph clusters running 12.2.x, and now we want to use the
> >> > upmap balancer, but when I set set-require-min-compat-client to
> >> > luminous, it fails:
>> >
>> > # ceph osd set-require-min-compat-client luminous
>> > Error EPERM: cannot set require_min_compat_client to luminous: 6 connected 
>> > client(s) look like jewel (missing 0xa20); 1 connected 
>> > client(s) look like jewel (missing 0x800); 1 connected 
>> > client(s) look like jewel (missing 0x820); add 
>> > --yes-i-really-mean-it to do it anyway
>> >
>> > ceph features
>> >
>> > "client": {
>> > "group": {
>> > "features": "0x40106b84a842a52",
>> > "release": "jewel",
>> > "num": 6
>> > },
>> > "group": {
>> > "features": "0x7010fb86aa42ada",
>> > "release": "jewel",
>> > "num": 1
>> > },
>> > "group": {
>> > "features": "0x7fddff8ee84bffb",
>> > "release": "jewel",
>> > "num": 1
>> > },
>> > "group": {
>> > "features": "0x3ffddff8eea4fffb",
>> > "release": "luminous",
>> > "num": 7
>> > }
>> > }
>> >
>> > and sessions
>> >
>> > "MonSession(unknown.0 10.10.100.6:0/1603916368 is open allow *, features 
>> > 0x40106b84a842a52 (jewel))",
>> > "MonSession(unknown.0 10.10.100.2:0/2484488531 is open allow *, features 
>> > 0x40106b84a842a52 (jewel))",
>> > "MonSession(client.? 10.10.100.6:0/657483412 is open allow *, features 
>> > 0x7fddff8ee84bffb (jewel))",
>> > "MonSession(unknown.0 10.10.14.67:0/500706582 is open allow *, features 
>> > 0x7010fb86aa42ada (jewel))"
>> >
> >> > Can I use --yes-i-really-mean-it to force-enable it?
>>
>> No.  0x40106b84a842a52 and 0x7fddff8ee84bffb are too old.
>>
>> Thanks,
>>
>> Ilya
>


Re: [ceph-users] krbd / kcephfs - jewel client features question

2019-10-19 Thread Lei Liu
Hello Ilya,

After updating the client kernel version to 3.10.0-862, ceph features shows:

"client": {
"group": {
"features": "0x7010fb86aa42ada",
"release": "jewel",
"num": 5
},
"group": {
"features": "0x7fddff8ee8cbffb",
"release": "jewel",
"num": 1
},
"group": {
"features": "0x3ffddff8eea4fffb",
"release": "luminous",
"num": 6
},
"group": {
"features": "0x3ffddff8eeacfffb",
"release": "luminous",
"num": 1
}
}

Both 0x7fddff8ee8cbffb and 0x7010fb86aa42ada are reported by the new kernel
client.

Is it now possible to force set-require-min-compat-client to luminous,
and if not, how can we fix it?
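
Presumably the way forward, once the remaining jewel-feature clients are
upgraded or disconnected, is simply to re-check and retry (a sketch, not
something confirmed in this thread):

  ceph features                                     # confirm no pre-luminous client groups remain
  ceph osd set-require-min-compat-client luminous   # should then succeed without --yes-i-really-mean-it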

Thanks

Ilya Dryomov  wrote on Thu, Oct 17, 2019 at 9:45 PM:

> On Thu, Oct 17, 2019 at 3:38 PM Lei Liu  wrote:
> >
> > Hi Cephers,
> >
> > We have some Ceph clusters running 12.2.x, and now we want to use the
> > upmap balancer, but when I set set-require-min-compat-client to luminous,
> > it fails:
> >
> > # ceph osd set-require-min-compat-client luminous
> > Error EPERM: cannot set require_min_compat_client to luminous: 6
> connected client(s) look like jewel (missing 0xa20); 1
> connected client(s) look like jewel (missing 0x800); 1
> connected client(s) look like jewel (missing 0x820); add
> --yes-i-really-mean-it to do it anyway
> >
> > ceph features
> >
> > "client": {
> > "group": {
> > "features": "0x40106b84a842a52",
> > "release": "jewel",
> > "num": 6
> > },
> > "group": {
> > "features": "0x7010fb86aa42ada",
> > "release": "jewel",
> > "num": 1
> > },
> > "group": {
> > "features": "0x7fddff8ee84bffb",
> > "release": "jewel",
> > "num": 1
> > },
> > "group": {
> > "features": "0x3ffddff8eea4fffb",
> > "release": "luminous",
> > "num": 7
> > }
> > }
> >
> > and sessions
> >
> > "MonSession(unknown.0 10.10.100.6:0/1603916368 is open allow *,
> features 0x40106b84a842a52 (jewel))",
> > "MonSession(unknown.0 10.10.100.2:0/2484488531 is open allow *,
> features 0x40106b84a842a52 (jewel))",
> > "MonSession(client.? 10.10.100.6:0/657483412 is open allow *, features
> 0x7fddff8ee84bffb (jewel))",
> > "MonSession(unknown.0 10.10.14.67:0/500706582 is open allow *, features
> 0x7010fb86aa42ada (jewel))"
> >
> > Can I use --yes-i-really-mean-it to force-enable it?
>
> No.  0x40106b84a842a52 and 0x7fddff8ee84bffb are too old.
>
> Thanks,
>
> Ilya
>
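
A side note, not from the thread: which feature bits a client is missing
can be read off directly by comparing its mask against a luminous one, for
example using the two masks reported above:

  printf '0x%x\n' $(( 0x3ffddff8eea4fffb & ~0x40106b84a842a52 ))   # bits present in the luminous group but missing from the oldest jewel client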


[ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-19 Thread Stefan Kooman
Dear list,

Today our active MDS crashed with an assert:

2019-10-19 08:14:50.645 7f7906cb7700 -1 /build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7f7906cb7700 time 2019-10-19 08:14:50.648559
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 473: FAILED assert(omap_num_objs <= MAX_OBJECTS)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f7911b2897e]
 2: (()+0x2fab07) [0x7f7911b28b07]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) [0x7703f7]
 4: (MDLog::trim(int)+0x5a6) [0x75dcd6]
 5: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
 6: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
 7: (Context::complete(int)+0x9) [0x4d31d9]
 8: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
 9: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
 10: (()+0x76ba) [0x7f79113a76ba]
 11: (clone()+0x6d) [0x7f7910bd041d]

2019-10-19 08:14:50.649 7f7906cb7700 -1 *** Caught signal (Aborted) **
 in thread 7f7906cb7700 thread_name:safe_timer

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0x11390) [0x7f79113b1390]
 2: (gsignal()+0x38) [0x7f7910afe428]
 3: (abort()+0x16a) [0x7f7910b0002a]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f7911b28a86]
 5: (()+0x2fab07) [0x7f7911b28b07]
 6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) [0x7703f7]
 7: (MDLog::trim(int)+0x5a6) [0x75dcd6]
 8: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
 9: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
 10: (Context::complete(int)+0x9) [0x4d31d9]
 11: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
 12: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
 13: (()+0x76ba) [0x7f79113a76ba]
 14: (clone()+0x6d) [0x7f7910bd041d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.

Apparently this is bug 36094 (https://tracker.ceph.com/issues/36094).

Our active MDS had mds_cache_memory_limit=150G and ~ 27 M caps handed
out to 78 clients, a few of them holding many millions of caps each. This
resulted in a laggy MDS ... another failover ... until the MDS was finally
able to cope with the load.

We adjusted mds_cache_memory_limit to 32G right after that and activated
the new limit: ceph tell mds.* config set mds_cache_memory_limit
34359738368
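
To double-check that the new value actually took effect on the running
daemon, the admin socket can be queried; a sketch (the daemon name depends
on your deployment):

  ceph daemon mds.$(hostname -s) config get mds_cache_memory_limit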

It was set correctly, and we monitored memory usage. That all went fine,
with around 6 M caps in use (2 clients used 5/6 of those). After ~ 5 hours
the same assert was hit. Fortunately the failover was much faster now ...
but then the now-active MDS hit the same assert again, triggering another
failover ... the other MDS took over and failed again ... the next one
took over and cephfs was healthy again ...
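
To keep an eye on which clients hold the most caps (as noted, 2 of ours
held 5/6 of them), the per-session cap counts can be pulled from the MDS
admin socket; a sketch, assuming jq is available:

  ceph daemon mds.$(hostname -s) session ls | jq -r '.[] | "\(.num_caps)\t\(.client_metadata.hostname // .id)"' | sort -rn | head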

The bug report does not hint at how to prevent this situation. Recently
Zoë O'Connell  hit the same issue on a Mimic 13.2.6 system:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036702.html

I wonder if this situation is more likely to be hit on Mimic 13.2.6 than
on any other system.

Any hints / help to prevent this from happening?

Thanks,

Stefan



-- 
| BIT BV   https://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl