Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-25 Thread Adrien Gillard
The issue is finally resolved. Upgrading to Luminous was the way to go. Unfortunately, we did not set 'ceph osd require-osd-release luminous' immediately, so we did not activate the Luminous functionalities that saved us. I think the new mechanisms to manage and prune past intervals[1] allowed
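(For reference, a minimal sketch of checking and then setting that flag once every daemon runs Luminous, from any node with a Luminous ceph CLI and an admin keyring:)

  $ ceph osd dump | grep require_osd_release   # check the current value
  $ ceph osd require-osd-release luminous      # enable Luminous-only OSD features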

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-23 Thread Adrien Gillard
Sending back, forgot the plain text for ceph-devel. Sorry about that. On Thu, Aug 23, 2018 at 9:57 PM Adrien Gillard wrote: > > We are running CentOS 7.5 with upstream Ceph packages, no remote syslog, just > default local logging. > > After looking a bit deeper into pprof, --alloc_space seems

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-23 Thread Adrien Gillard
We are running CentOS 7.5 with upstream Ceph packages, no remote syslog, just default local logging. After looking a bit deeper into pprof, --alloc_space seems to represent allocations that happened since the program started which goes along with the quick deallocation of the memory.
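(A sketch of how the two pprof views compare, assuming tcmalloc heap dumps written under /var/log/ceph; the osd.12 id and the dump file name are illustrative:)

  # cumulative allocations since the process started
  $ pprof --text --alloc_space /usr/bin/ceph-osd /var/log/ceph/osd.12.profile.0001.heap
  # memory still held at the time of the dump
  $ pprof --text --inuse_space /usr/bin/ceph-osd /var/log/ceph/osd.12.profile.0001.heap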

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-23 Thread Gregory Farnum
On Thu, Aug 23, 2018 at 8:42 AM Adrien Gillard wrote: > With a bit of profiling, it seems all the memory is allocated to > ceph::logging::Log::create_entry (see below) > > Should this be normal? Is it because some OSDs are down and it logs the > results of its osd_ping? > Hmm, is that where

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-23 Thread Adrien Gillard
With a bit of profiling, it seems all the memory is allocated to ceph::logging::Log::create_entry (see below). Should this be normal? Is it because some OSDs are down and it logs the results of its osd_ping? The OSD debug levels are also below. Thanks, Adrien $ pprof
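(If log entries are indeed what fills the heap, one mitigation sketch is to lower both the file and in-memory debug levels at runtime; the number after the slash controls the in-memory log. These values are an example for an emergency, not a recommendation for normal operation:)

  $ ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0 --debug_filestore 0/0'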

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-23 Thread Adrien Gillard
After upgrading to Luminous, we see the exact same behaviour, with OSDs eating as much as 80-90 GB of memory. We'll try some memory profiling, but at this point we're a bit lost. Are there any specific logs that could help us? On Thu, Aug 23, 2018 at 2:34 PM Adrien Gillard wrote: > Well after a
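(One way to collect such data is the built-in tcmalloc heap profiler; a sketch, where osd.12 is an arbitrary example id and the dumps land next to the OSD's log:)

  $ ceph tell osd.12 heap start_profiler
  $ ceph tell osd.12 heap dump            # writes a .heap file in the OSD log directory
  $ ceph tell osd.12 heap stats           # quick tcmalloc summary
  $ ceph tell osd.12 heap stop_profiler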

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-23 Thread Adrien Gillard
Well after a few hours, still nothing new in the behaviour. With half of the OSDs (so 6 per host) up and peering and the nodown flag set to limit the creation of new maps, all the memory is consumed and OSDs get killed by the OOM killer. We observe a lot of threads being created for each OSD
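(A quick way to watch thread counts and resident memory per OSD process on a host, with plain Linux tooling, nothing Ceph-specific:)

  $ ps -eo pid,nlwp,rss,cmd | grep '[c]eph-osd'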

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-22 Thread Gregory Farnum
On Wed, Aug 22, 2018 at 6:02 PM Adrien Gillard wrote: > We 'paused' the cluster early in our investigation to avoid unnecessary > IO. > We also set the nodown flag but the OOM rate was really sustained and we > got servers that stopped responding from time to time, so we decided to > lower the

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-22 Thread Adrien Gillard
We 'paused' the cluster early in our investigation to avoid unnecessary IO. We also set the nodown flag but the OOM rate was really sustained and we got servers that stopped responding from time to time, so we decided to lower the number of OSDs up and let them peer. I don't know if it is the best
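(For context, a sketch of the cluster flags mentioned here, using the standard Ceph CLI:)

  $ ceph osd set pause     # block client reads and writes
  $ ceph osd set nodown    # stop OSDs from being marked down
  $ ceph osd set noup      # stop down OSDs from being marked up again
  $ ceph osd unset pause   # revert once things are stable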

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-22 Thread Gregory Farnum
On Wed, Aug 22, 2018 at 6:35 AM Adrien Gillard wrote: > Hi everyone, > > We have a hard time figuring out a behaviour encountered after upgrading > the monitors of one of our clusters from Jewel to Luminous yesterday. > > The cluster is composed of 14 OSD hosts (2xE5-2640 v3 and 64 GB of RAM), >

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-22 Thread David Turner
I swear I remember a thread on the ML that talked about someone having increased memory usage on their OSDs after upgrading their MONs to Luminous as well, but I can't seem to find it. Iirc the problem for them was resolved when they finished the upgrade to Luminous. It might not be a bad idea

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-22 Thread Adrien Gillard
Thank you so much for your feedback. It always helps with this kind of situation. We fixed the network issue and added as much RAM and as much swap as possible, but were still nowhere near stable, with the OOM killer decimating the OSDs, which at times used more than 35 GB of memory. We decided to shut half
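(Adding emergency swap can be done with a plain swap file; a rough sketch, where the 64G size is only an example:)

  $ dd if=/dev/zero of=/swapfile bs=1M count=65536
  $ chmod 600 /swapfile
  $ mkswap /swapfile
  $ swapon /swapfile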

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-22 Thread Lothar Gesslein
Hi Adrien, I don't expect I can fully explain what happened to your cluster, but since you got no other feedback so far I'll try my best. So you have 517 million RADOS objects. Assuming at least 3 copies each for normal replication or 5 "shards" for EC pools, there are somewhere between 1.5 to
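(Back-of-the-envelope, the copy counts work out roughly as follows; my own arithmetic, not from the original mail:)

  517,000,000 objects x 3 replicas  ~= 1.55 billion object copies
  517,000,000 objects x 5 EC shards ~= 2.59 billion shards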

Re: [ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-22 Thread Adrien Gillard
Some follow-up. After doubling the RAM (so 128 GB for 12x4TB OSDs), all the RAM is still consumed shortly after the OSDs restart. We are considering going through with the upgrade of the OSDs to Luminous (or maybe going back to Jewel on the mons...) but the cluster is in bad shape... health:
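(To see how far along the upgrade is, the Luminous mons can report the versions of the running daemons; a sketch, where osd.12 is an example id:)

  $ ceph versions                             # daemon counts per release
  $ ceph osd metadata 12 | grep ceph_version  # version of a single OSD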

[ceph-users] Unexpected behaviour after monitors upgrade from Jewel to Luminous

2018-08-22 Thread Adrien Gillard
Hi everyone, We have a hard time figuring out a behaviour encountered after upgrading the monitors of one of our clusters from Jewel to Luminous yesterday. The cluster is composed of 14 OSD hosts (2xE5-2640 v3 and 64 GB of RAM), each containing 12x4TB OSDs with journals on DC-grade SSDs. The