[ceph-users] pool nearfull, 300GB rbd image occupies 11TB!

2020-12-12 Thread mk
Hi folks, my cluster shows strange behavior: the only SSD pool on the cluster, with repsize 3 and pg/pgp size 512, contains a 300GB rbd image and only one snapshot, yet it occupies 11TB of space! I have tried an object-map check / rebuild, fstrim, etc., which couldn't solve the problem; any help would be
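
For reference, a minimal sketch of the commands typically used to compare provisioned vs. actual usage for an image and its snapshots; the pool and image names are placeholders, not taken from the thread:

  # Provisioned vs. actually used space per image and snapshot
  rbd du <pool>/<image>
  rbd snap ls <pool>/<image>
  # Pool-level accounting as RADOS sees it
  rados df
  ceph df detail
  # Rebuild the object map if it is flagged invalid
  rbd object-map rebuild <pool>/<image>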

[ceph-users] Re: PGs down

2020-12-12 Thread Igor Fedotov
Hi Jeremy, wondering what the OSDs' logs showed when they crashed for the first time? And does OSD.12 report a similar problem now: 3> 2020-12-12 20:23:45.756 7f2d21404700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 3113305400, got 1242690251 in
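
A sketch of how those rocksdb lines could be pulled from an OSD's log, assuming a non-containerized deployment with default log paths; the OSD ID and date are taken from the post:

  # File-based log of the affected OSD
  grep -i "block checksum mismatch" /var/log/ceph/ceph-osd.12.log
  # Or via the journal for a systemd-managed OSD
  journalctl -u ceph-osd@12 --since "2020-12-12" | grep -i rocksdb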

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-12 Thread Igor Fedotov
Hi Stefan, could you please share the OSD startup log from /var/log/ceph? Thanks, Igor On 12/13/2020 5:44 AM, Stefan Wild wrote: Just had another look at the logs and this is what I noticed after the affected OSD starts up. Loads of entries of this sort: Dec 12 21:38:40 ceph-tpa-server1
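
A sketch of how a single startup attempt could be captured for the affected OSD; the unit name and OSD ID are placeholders and differ on cephadm/containerized deployments:

  # Capture the journal output of the most recent startup attempt
  journalctl -u ceph-osd@<id> --since "-10min" --no-pager > osd-<id>-startup.log
  # Or copy the file-based log if file logging is enabled
  cp /var/log/ceph/ceph-osd.<id>.log /tmp/osd-<id>-startup.log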

[ceph-users] PGs down

2020-12-12 Thread Jeremy Austin
I could use some input from more experienced folks… First time seeing this behavior. I've been running Ceph in production (replicated) since 2016 or earlier. This, however, is a small 3-node cluster for testing EC. The CRUSH map rules should sustain the loss of an entire node. Here's the EC rule:
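
The rule itself is truncated in this preview; a sketch of the commands commonly used to inspect the rule and the down PGs (the profile name and PG ID are placeholders):

  # Dump the CRUSH rules and the EC profile backing the pool
  ceph osd crush rule dump
  ceph osd erasure-code-profile get <profile>
  # List the stuck/inactive PGs and query one of them for the reason
  ceph pg dump_stuck inactive
  ceph pg <pgid> query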

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-12 Thread Stefan Wild
Got a trace of the osd process, shortly after ceph status -w announced boot for the osd: strace: Process 784735 attached futex(0x5587c3e22fc8, FUTEX_WAIT_PRIVATE, 0, NULL) = ? +++ exited with 1 +++ It was stuck at that one call for several minutes before exiting. From: Stefan Wild Date:
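
For completeness, a sketch of the kind of strace invocation that produces output like the above; the exact flags are an assumption, the PID is the one from the post:

  # Attach to the running OSD process, follow threads, timestamp each call
  strace -f -tt -p 784735 -o osd-784735.strace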

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-12 Thread Stefan Wild
Just had another look at the logs and this is what I noticed after the affected OSD starts up. Loads of entries of this sort: Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug 2020-12-13T02:38:40.851+ 7fafd32c7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fafb721f700'
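
A sketch of how those heartbeat warnings could be filtered out of the journal; on a cephadm/containerized deployment the unit name includes the cluster fsid, so both the fsid and OSD ID below are placeholders:

  # Isolate heartbeat timeouts for the affected OSD
  journalctl -u ceph-<fsid>@osd.<id> --since "2020-12-12" | grep heartbeat_map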

[ceph-users] OSD reboot loop after running out of memory

2020-12-12 Thread Stefan Wild
Hi, We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one of the servers ran out of memory for unknown reasons (normally the machine uses about 60 out of 128 GB). Since then, some OSDs on that machine get caught in an endless restart loop. Logs will just mention system
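
One thing commonly checked in this situation is the per-OSD memory target; a sketch, with the OSD ID and the 4 GiB value as illustrative placeholders:

  # Inspect and temporarily lower the memory target of the looping OSD
  ceph config get osd.<id> osd_memory_target
  ceph config set osd.<id> osd_memory_target 4294967296   # 4 GiB
  # See which processes the kernel OOM killer actually hit
  dmesg -T | grep -i oom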

[ceph-users] Re: Third nautilus OSD dead in 11 days - FAILED is_valid_io(off, len)

2020-12-12 Thread Igor Fedotov
Jonas, could you please run "ceph-bluestore-tool --path --allocator block --command free-dump" and share the output... Thanks, Igor On 12/12/2020 10:27 PM, Igor Fedotov wrote: Hi Jonas, have you tried switching your OSDs back to the bitmap allocator, as per my comment #6 in the tracker?
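
The --path value is missing in the quoted command; a sketch of the full invocation with the OSD data directory as a placeholder (the OSD has to be stopped while the tool runs):

  # Run against the stopped OSD; the path is the OSD's data directory
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<id> --allocator block --command free-dump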

[ceph-users] Re: Third nautilus OSD dead in 11 days - FAILED is_valid_io(off, len)

2020-12-12 Thread Igor Fedotov
Hi Jonas, have you tried switching your OSDs back to the bitmap allocator, as per my comment #6 in the tracker? Also, please set debug-bluestore to 20 and collect the startup log for the failing OSD - since it's repeatedly failing on exactly the same assertion, this would be very helpful. That's
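
A sketch of how that debug level could be raised before restarting the failing OSD, assuming the mon-based config store is used; a ceph.conf override on the OSD host works as well:

  # Raise BlueStore logging for the failing OSD, restart it, then reset
  ceph config set osd.<id> debug_bluestore 20/20
  ceph config rm osd.<id> debug_bluestore   # after the log has been collected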

[ceph-users] Third nautilus OSD dead in 11 days - FAILED is_valid_io(off, len)

2020-12-12 Thread Jonas Jelten
Hi! Yesterday a third OSD died with a failed assertion, and it can no longer boot. It's the third OSD within 11 days. There's already a tracker issue: https://tracker.ceph.com/issues/48276 2020-12-11 20:06:51.839 7fe2b5ffd700 -1 /build/ceph-14.2.13/src/os/bluestore/KernelDevice.cc: In
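
The workaround referenced later in the thread (comment #6 in the tracker) is switching the allocator back to bitmap; a sketch, assuming the setting is applied cluster-wide via the mon config store:

  # Takes effect the next time an OSD starts
  ceph config set osd bluestore_allocator bitmap
  ceph config get osd bluestore_allocator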

[ceph-users] Anonymous access to grafana

2020-12-12 Thread Alessandro Piazza
Dear all, It seems that by default the Grafana web page embedded inside the Ceph dashboard is publicly available in read-only mode. More specifically, the Grafana configuration inside the Docker container running the Grafana instance has the following configuration file
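
The configuration file itself is cut off in this preview; for reference, a sketch of the grafana.ini stanza that governs anonymous access, shown here with anonymous access disabled (the path inside the container is an assumption):

  # /etc/grafana/grafana.ini
  [auth.anonymous]
  enabled = false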