[ceph-users] OSD process exhausting server memory

2014-10-29 Thread Lukáš Kubín
Hello, I've found my ceph v0.80.3 cluster in a state with 5 of 34 OSDs down overnight, after months of running without change. From the Linux logs I found out the OSD processes were killed because they had consumed all available memory. Those 5 failed OSDs were from different hosts of my 4-node

[ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
Never mind, you helped me a lot by showing this OSD startup procedure, Michael. Big thanks! I seem to have made some progress now by setting the cache-mode to forward. The OSD processes of the SATA hosts stopped failing immediately. I'm now waiting for the cache tier to flush. Then I'll try to enable re
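
For reference, the cache-tier change described above can be driven from an admin node roughly like this (a sketch; hot-pool is a placeholder name for the cache pool, not taken from the thread):

    ceph osd tier cache-mode hot-pool forward    # stop caching new writes; IO is forwarded to the backing pool
    rados -p hot-pool cache-flush-evict-all      # explicitly flush and evict dirty objects from the cache tier
    ceph df                                      # watch the cache pool's object count shrink as the flush proceeds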

Re: [ceph-users] OSD process exhausting server memory

2014-10-29 Thread Michael J. Kidd
Hello Lukas, Please try the following process for getting all your OSDs up and operational...
* Set the following flags: noup, noin, noscrub, nodeep-scrub, norecover, nobackfill
  for i in noup noin noscrub nodeep-scrub norecover nobackfill; do ceph osd set $i; done
* Stop all OSDs (I know, this
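
A runnable version of that flag-setting step, with the matching unset loop that the rest of the thread implies (a sketch):

    # keep OSDs from being marked up/in and suppress scrub, recovery and backfill work
    for i in noup noin noscrub nodeep-scrub norecover nobackfill; do ceph osd set $i; done

    # once the OSDs are stable again, the same flags are removed one by one
    for i in noup noin noscrub nodeep-scrub norecover nobackfill; do ceph osd unset $i; done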

Re: [ceph-users] OSD process exhausting server memory

2014-10-29 Thread Lukáš Kubín
I've ended up at the step "ceph osd unset noin". My OSDs are up, but not in, even after an hour:
[root@q04 ceph-recovery]# ceph osd stat
osdmap e2602: 34 osds: 34 up, 0 in
flags nobackfill,norecover,noscrub,nodeep-scrub
There seems to be no activity generated by the OSD processes, occas
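
A quick way to confirm what the cluster still has set while waiting (a sketch):

    ceph osd stat                 # up/in counts plus any cluster-wide flags
    ceph osd dump | grep ^flags   # just the flags line
    ceph osd tree                 # per-OSD up/down state and weights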

Re: [ceph-users] OSD process exhausting server memory

2014-10-29 Thread Michael J. Kidd
Ah, sorry... since they were set out manually, they'll need to be set in manually:
for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd in $i; done
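
The same one-liner broken out for readability (it assumes the OSD name, e.g. osd.10, is the third column of the ceph osd tree output on this release):

    # mark every OSD back "in"; ceph osd in accepts either the numeric id or the osd.N name
    for i in $(ceph osd tree | grep osd | awk '{print $3}'); do
        ceph osd in $i
    done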

Re: [ceph-users] OSD process exhausting server memory

2014-10-29 Thread Lukáš Kubín
I should have figured that out myself since I did that recently. Thanks. Unfortunately, I'm still stuck at the step "ceph osd unset noin". After setting all the OSDs in, the original issue reappears, preventing me from proceeding with the recovery. It now appears mostly on a single OSD - osd.10 - which consumes ~200%

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
Hi, I've noticed the following messages always accumulate in the OSD log before it exhausts all memory:
2014-10-30 08:48:42.994190 7f80a2019700 0 log [WRN] : slow request 38.901192 seconds old, received at 2014-10-30 08:48:04.092889: osd_op(osd.29.3076:207644827 rbd_data.2e4ee3ba663be.363
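
Blocked requests like these can also be inspected from the cluster and from the affected OSD's admin socket (a sketch, assuming the default socket path; replace 29 with the id of the OSD whose log shows the warnings):

    ceph health detail | grep -i slow                                        # cluster-wide summary of slow/blocked requests
    ceph --admin-daemon /var/run/ceph/ceph-osd.29.asok dump_ops_in_flight    # ops currently stuck on that OSD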

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Michael J. Kidd
Hello Lukas, The 'slow request' logs are expected while the cluster is in such a state; the OSD processes simply aren't able to respond quickly to client IO requests. I would recommend trying to recover without the most problematic disk (seems to be OSD.10?). Simply shut it down and see if t
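
Shutting that one OSD down would look roughly like this on this release (a sketch; it assumes the sysvinit scripts, and that the norecover/nobackfill flags from the earlier step are still set so no data movement starts):

    # on the host that carries osd.10
    service ceph stop osd.10

    # from an admin node, confirm it is now reported down
    ceph osd stat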

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
Thanks Michael, still no luck. Leaving the problematic OSD.10 down has no effect. Within minutes, more OSDs fail with the same issue after consuming ~50GB of memory. Also, I can see two of those cache-tier OSDs on separate hosts which remain at almost 200% CPU utilization all the time. I've performed upgrad

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Michael J. Kidd
Hello Lukas, Unfortunately, I'm all out of ideas at the moment. There are some memory profiling techniques which can help identify what is causing the memory utilization, but it's a bit beyond what I typically work on. Others on the list may have experience with this (or otherwise have ideas) a
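
One concrete option for the memory profiling mentioned here is Ceph's built-in tcmalloc heap profiler (a sketch; osd.10 is used as an example id):

    ceph tell osd.10 heap start_profiler    # begin collecting heap samples
    ceph tell osd.10 heap stats             # print current heap usage to the OSD's log
    ceph tell osd.10 heap dump              # write a heap profile that google-pprof can analyze
    ceph tell osd.10 heap stop_profiler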

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
Fixed. My cluster is HEALTH_OK again now. It went quickly in the right direction after I set the cache-mode to forward (from the original writeback) and unset the norecover and nobackfill flags. I'm still waiting for 15 million objects to get flushed from the cache tier. It seems that the issue was someh
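
The final sequence described here, as it would be typed (a sketch; hot-pool is a placeholder for the cache pool name, and the flush can also be left to run on its own):

    ceph osd tier cache-mode hot-pool forward    # switch the tier out of writeback
    ceph osd unset norecover
    ceph osd unset nobackfill
    rados -p hot-pool cache-flush-evict-all      # drive the remaining objects out of the cache tier
    ceph -s                                      # watch recovery until HEALTH_OK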