Fixed. My cluster is HEALTH_OK again. Things went quickly in the right
direction after I set the cache-mode to forward (from the original writeback)
and disabled the norecover and nobackfill flags.
I'm still waiting for 15 million objects to get flushed from the cache
tier.
It seems that the issue was someh
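As a rough sketch, the commands involved would look something like this,
assuming the cache tier pool is named "hot-pool" (a placeholder name):
# switch the cache tier to forward mode so new IO bypasses the cache
ceph osd tier cache-mode hot-pool forward
# re-enable recovery and backfill
ceph osd unset norecover
ceph osd unset nobackfill
# flush and evict everything from the cache pool; this can take a long time
rados -p hot-pool cache-flush-evict-all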
Never mind, you've helped me a lot by showing me this OSD startup procedure,
Michael. Big thanks!
I seem to have made some progress now by setting the cache-mode to forward.
The OSD processes on the SATA hosts immediately stopped failing. I'm now
waiting for the cache tier to flush. Then I'll try to enable re
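A rough way to watch the flush progress, again assuming the cache pool is
named "hot-pool":
# the object count for the cache pool should keep dropping as it flushes
watch -n 30 'ceph df | grep hot-pool'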
Hello Lukas,
Unfortunately, I'm all out of ideas at the moment. There are some memory
profiling techniques which can help identify what is causing the memory
utilization, but it's a bit beyond what I typically work on. Others on the
list may have experience with this (or otherwise have ideas) a
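For the record, one such technique, sketched here on the assumption that the
OSDs are built with tcmalloc, is the built-in heap profiler:
# profile the suspect OSD's heap, then dump and inspect the result
ceph tell osd.10 heap start_profiler
ceph tell osd.10 heap dump
ceph tell osd.10 heap stats
ceph tell osd.10 heap stop_profiler
# the dumps end up in the OSD's log directory and can be read with google-pprof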
Thanks Michael, still no luck.
Taking the problematic OSD.10 down has no effect. Within minutes, more OSDs
fail with the same issue after consuming ~50GB of memory. Also, I can see
two of those cache-tier OSDs on separate hosts which remain at almost 200%
CPU utilization all the time.
I've performed upgrad
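A quick sketch for watching per-OSD memory and CPU on each host:
# list ceph-osd processes sorted by resident memory, with CPU usage
ps -eo pid,rss,%cpu,args --sort=-rss | grep '[c]eph-osd' | head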
Hello Lukas,
The 'slow request' logs are expected while the cluster is in such a
state: the OSD processes simply aren't able to respond quickly to client
IO requests.
I would recommend trying to recover without the most problematic disk
(seems to be OSD.10?). Simply shut it down and see if t
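On a sysvinit-managed cluster (which this appears to be), shutting down a
single OSD would look roughly like this, run on the host that carries OSD.10:
# stop just that one OSD daemon
service ceph stop osd.10
# confirm it is marked down
ceph osd tree | grep 'osd.10'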
Hi,
I've noticed that the following messages always accumulate in the OSD log
before it exhausts all memory:
2014-10-30 08:48:42.994190 7f80a2019700 0 log [WRN] : slow request
38.901192 seconds old, received at 2014-10-30 08:48:04.092889:
osd_op(osd.29.3076:207644827 rbd_data.2e4ee3ba663be.363
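For what it's worth, the requests stuck behind those slow ops can be examined
through the OSD admin socket (a sketch, assuming the default socket path and
substituting the affected OSD id):
ceph --admin-daemon /var/run/ceph/ceph-osd.10.asok dump_ops_in_flight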
I should have figured that out myself since I did that recently. Thanks.
Unfortunately, I'm still at the step "ceph osd unset noin". After setting
all the OSDs in, the original issue reappears, preventing me from proceeding
with recovery. It now appears mostly on a single OSD - osd.10, which consumes ~200%
Ah, sorry... since they were set out manually, they'll need to be set in
manually.
for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd in $i; done
Michael J. Kidd
Sr. Storage Consultant
Inktank Professional Services
- by Red Hat
On Wed, Oct 29, 2014 at 12:33 PM, Lukáš Kubín wrote:
I've ended up at the step "ceph osd unset noin". My OSDs are up, but not in,
even after an hour:
[root@q04 ceph-recovery]# ceph osd stat
osdmap e2602: 34 osds: 34 up, 0 in
flags nobackfill,norecover,noscrub,nodeep-scrub
There seems to be no activity generated by OSD processes, occas
Hello Lukas,
Please try the following process for getting all your OSDs up and
operational...
* Set the following flags: noup, noin, noscrub, nodeep-scrub, norecover,
nobackfill
for i in noup noin noscrub nodeep-scrub norecover nobackfill; do ceph osd set $i; done
* Stop all OSDs (I know, this
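The "stop all OSDs" step, sketched for a sysvinit deployment and run on every
OSD host:
# stops every OSD daemon on the local host
service ceph stop osd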
Hello,
I've found my Ceph v0.80.3 cluster in a state with 5 of 34 OSDs being down
overnight, after months of running without change. From the Linux logs I
found out that the OSD processes were killed because they consumed all
available memory.
Those 5 failed OSDs were from different hosts of my 4-node
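The OOM kills can usually be confirmed from the kernel log, e.g. (assuming
syslog goes to /var/log/messages):
grep -i 'out of memory' /var/log/messages | grep ceph-osd
dmesg | grep -i 'killed process'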