[ceph-users] OSD process exhausting server memory

2014-10-29 Thread Lukáš Kubín
Hello, I've found my ceph v0.80.3 cluster in a state with 5 of 34 OSDs down through the night, after months of running without change. From the Linux logs I found out the OSD processes were killed because they consumed all available memory. Those 5 failed OSDs were from different hosts of my 4-node
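
A quick way to confirm such OOM kills (a sketch; log locations vary by distribution) is to search the kernel log on each affected host:

    dmesg -T | grep -i 'killed process'         # kernel OOM-killer messages
    grep -i 'out of memory' /var/log/messages   # /var/log/syslog on Debian/Ubuntu

On a running OSD built with tcmalloc, heap usage can also be inspected live (osd.10 is just an example id):

    ceph tell osd.10 heap stats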

[ceph-users] RBD client newer than cluster

2017-02-14 Thread Lukáš Kubín
Hi, I'm most probably hitting bug http://tracker.ceph.com/issues/13755 - when libvirt-mounted RBD disks suspend I/O during snapshot creation until a hard reboot. My Ceph cluster (monitors and OSDs) is running v0.94.3, while the clients (OpenStack/KVM computes) run v0.94.5. Can I still update the client
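
A sketch for comparing client and cluster versions before and after an update (the daemon ids are examples):

    ceph --version               # version of the local client binaries
    ceph tell osd.0 version      # version a given daemon reports
    ceph daemon mon.a version    # via admin socket, on the monitor host itself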

Re: [ceph-users] RBD client newer than cluster

2017-02-14 Thread Lukáš Kubín
AM, Lukáš Kubín wrote: > Hi, I'm most probably hitting bug http://tracker.ceph.com/issues/13755 - when libvirt-mounted RBD disks suspend I/O during snapshot creation until a hard reboot. > My Ceph cluster (monitors and OSDs) is

Re: [ceph-users] OSD process exhausting server memory

2014-10-29 Thread Lukáš Kubín
- ceph osd unset noscrub > - ceph osd unset nodeep-scrub > > ## For help identifying why memory usage was so high, please provide: > * ceph osd dump | grep pool > * ceph osd crush rule dump > > Let us know if this helps... I know it looks extreme, but it's worked for > me in
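
For reference, the flags and diagnostics mentioned in this advice are plain ceph CLI calls (the two dump commands are read-only and safe to run anytime):

    ceph osd set noscrub          # pause scrubbing while debugging memory use
    ceph osd set nodeep-scrub
    ceph osd unset noscrub        # re-enable once the cluster is stable again
    ceph osd unset nodeep-scrub
    ceph osd dump | grep pool     # pool definitions: size, pg_num, cache tiering
    ceph osd crush rule dump      # CRUSH rules as JSON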

Re: [ceph-users] OSD process exhausting server memory

2014-10-29 Thread Lukáš Kubín
…crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1519 flags hashpspool stripe_width 0
pool 12 'backups' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 862 flags hashpspool stripe_width 0
pool 14 'volumes-cache'

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
e the problematic OSD? I'll welcome any ideas. Currently I'm keeping osd.10 in an automatic restart loop with a 60-second pause before starting it again. Thanks and greetings, Lukas On Wed, Oct 29, 2014 at 8:04 PM, Lukáš Kubín wrote: > I should have figured that out myself since I
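
A minimal sketch of such a restart loop (assuming firefly-era sysvinit service names; adjust the service invocation to your init system):

    # keep bringing osd.10 back up, pausing 60 seconds between attempts
    while true; do
        service ceph start osd.10
        sleep 60
    done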

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
ering and how it could relate to the > load being seen. > > Hope this helps... > > Michael J. Kidd > Sr. Storage Consultant > Inktank Professional Services > - by Red Hat > > On Thu, Oct 30, 2014 at 4:00 AM, Lukáš Kubín > wrote: > >> Hi, >> I've notice

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
…Sr. Storage Consultant > Inktank Professional Services > - by Red Hat > > On Thu, Oct 30, 2014 at 11:00 AM, Lukáš Kubín wrote: > >> Thanks Michael, still no luck. >> >> Leaving the problematic osd.10 down has no effect. Within minutes more >> OSDs fail on the same issue aft

Re: [ceph-users] OSD process exhausting server memory

2014-10-30 Thread Lukáš Kubín
…the issue was somehow related to the caching tier. Does anybody have an idea how to prevent this? Has anybody experienced a similar issue with a writeback cache tier? Big thanks to Michael J. Kidd for all his support! Best greetings, Lukas On Thu, Oct 30, 2014 at 8:18 PM, Lukáš Kubín wrote: > Nevermind, yo
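
For anyone hitting the same thing: a writeback cache tier will not flush or evict on its own unless limits are set on the cache pool. A sketch with illustrative values ('volumes-cache' is the cache pool from the osd dump above):

    ceph osd pool set volumes-cache target_max_bytes 500000000000   # hard cap, ~500 GB
    ceph osd pool set volumes-cache cache_target_dirty_ratio 0.4    # start flushing at 40%
    ceph osd pool set volumes-cache cache_target_full_ratio 0.8     # start evicting at 80%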

[ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Lukáš Kubín
Hi, I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2 pools, each of size=2. Today one of our OSDs got full, another 2 near full, and the cluster turned into ERR state. I have noticed uneven space distribution among the OSD drives, between 70 and 100 percent. I have realized there's a low a
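
A few read-only commands help quantify the imbalance before changing anything (the per-OSD 'osd df' view is available from hammer onward):

    ceph df                              # per-pool usage
    ceph osd df                          # per-OSD utilization and variance
    ceph health detail | grep -i full    # which OSDs are full / near-full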

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Lukáš Kubín
size in TB. > > Beware that reweighting will (afaik) only shuffle the data to other local > drives, so you should reweight both the full drives at the same time and > only by a little bit at a time (0.95 is a good starting point). > > Jan > > > > On 17 Feb 2016, at 21:43, L
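
As a concrete sketch of that advice (the OSD ids and the 0.95 value are illustrative only):

    # nudge both full OSDs down together, a little at a time
    ceph osd reweight 3 0.95
    ceph osd reweight 7 0.95
    # or let ceph pick the targets automatically:
    ceph osd reweight-by-utilization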

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Lukáš Kubín
ly really full. > OSDs don't usually go down when "full" (95%) .. or do they? I don't think > so... so the reason they stopped is likely a completely full filesystem. > You have to move something out of the way, restart those OSDs with a lower > reweight, and hopefully

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-18 Thread Lukáš Kubín
… 1.0
 9   0.53999   osd.9    up   1.0   1.0
10   0.53999   osd.10   up   1.0   1.0
11   0.26999   osd.11   up   1.0   1.0
On Wed, Feb 17, 2016 at 9:43 PM Lukáš Kubín wrote: > Hi, > I'm running a very small setup of 2 node

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-19 Thread Lukáš Kubín
…don't have to risk data loss. > > It usually doesn't take much before you can restart the OSDs and let ceph > take care of the rest. > > Bryan > > From: ceph-users on behalf of Lukáš > Kubín > Date: Thursday, February 18, 2016 at 2:39 PM > To: "ceph-users

[ceph-users] Tunables client support

2019-08-22 Thread Lukáš Kubín
Hello, I am considering enabling optimal crush tunables in our Jewel cluster (4 nodes, 52 OSDs, used as an OpenStack Cinder+Nova backend = RBD images). I've got two questions: 1. Do I understand correctly that having the optimal tunables on can be considered best practice and should be applied in most sce
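
For context, the current profile can be inspected read-only before deciding, and the switch itself is a single command that triggers a large rebalance (and raises the minimum client/kernel feature requirements):

    ceph osd crush show-tunables      # show the profile currently in effect
    ceph osd crush tunables optimal   # switch; expect substantial data movement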

[ceph-users] Increase pg_num while backfilling

2019-08-22 Thread Lukáš Kubín
Hello, yesterday I added a 4th OSD node (an increase from 39 to 52 OSDs) to our Jewel cluster. Backfilling of the remapped pgs is still running and it seems it will run for another day until complete. I know the pg_num of the largest pool is undersized and I should increase it from 512 to 2048. The question is -
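
The change itself is two pool settings; a cautious sketch (the pool name 'volumes' is assumed here) raises pg_num in steps rather than jumping straight to 2048:

    ceph osd pool set volumes pg_num 1024    # intermediate step toward 2048
    ceph osd pool set volumes pgp_num 1024   # pgp_num must follow before data rebalances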