The issue is finally resolved.
Upgrading to Luminous was the way to go. Unfortunately, we did not set
'ceph osd require-osd-release luminous' immediately, so we did not
activate the Luminous functionality that saved us.
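For anyone hitting this later, the step we missed looks roughly like
this (a sketch; the first command just checks the current flag before
you set it):

$ ceph osd dump | grep require_osd_release
$ ceph osd require-osd-release luminous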
I think the new mechanisms to manage and prune past intervals[1]
allowed
Sending this again; I forgot the plain-text version for ceph-devel.
Sorry about that.
On Thu, Aug 23, 2018 at 9:57 PM Adrien Gillard wrote:
We are running CentOS 7.5 with upstream Ceph packages, no remote syslog,
just default local logging.
After looking a bit deeper into pprof, --alloc_space seems to represent
allocations that have happened since the program started, which is
consistent with the memory being deallocated quickly.
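For reference, a sketch of the two views we compared (assuming
gperftools' pprof and a heap dump taken from the OSD; the binary and
dump paths here are only examples):

$ pprof --text --alloc_space /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap
$ pprof --text --inuse_space /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap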
Hmm, is that where
On Thu, Aug 23, 2018 at 8:42 AM Adrien Gillard wrote:
With a bit of profiling, it seems all the memory is allocated to
ceph::logging::Log::create_entry (see below).
Should this be normal? Is it because some OSDs are down and it logs the
results of its osd_ping?
The debug levels of the OSD are below as well.
Thanks,
Adrien
$ pprof
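If logging really is the big allocator here, the runtime knobs to quiet
the noisiest subsystems look roughly like this (a sketch; the subsystem
list is only an example):

$ ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0'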
After upgrading to Luminous, we see the exact same behaviour, with OSDs
eating as much as 80-90 GB of memory.
We'll try some memory profiling but at this point we're a bit lost. Are
there any specific logs that could help us?
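(For the profiling, the tcmalloc heap profiler can be driven through
ceph tell; a sketch, using osd.0 as an example:)

$ ceph tell osd.0 heap start_profiler
$ ceph tell osd.0 heap dump
$ ceph tell osd.0 heap stats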
On Thu, Aug 23, 2018 at 2:34 PM Adrien Gillard wrote:
Well after a few hours, still nothing new in the behaviour. With half of
the OSDs (so 6 per host) up and peering and the nodown flag set to limit
the creation of new maps, all the memory is consumed and OSDs get killed
by the OOM killer.
We observe a lot of threads being created for each OSD.
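(A quick way to watch that, for reference; plain procfs, nothing
Ceph-specific:)

$ for pid in $(pidof ceph-osd); do echo "$pid $(grep Threads /proc/$pid/status)"; done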
On Wed, Aug 22, 2018 at 6:02 PM Adrien Gillard wrote:
We 'paused' the cluster early in our investigation to avoid unnecessary IO.
We also set the nodown flag, but the OOM rate was really sustained and
some servers stopped responding from time to time, so we decided to
lower the number of OSDs up and let them peer.
I don't know if it is the best
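(The flags in question, for anyone following along; set and unset as
needed:)

$ ceph osd set pause
$ ceph osd set nodown
$ ceph osd unset pause
$ ceph osd unset nodown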
I swear I remember a thread on the ML that talked about someone having
increased memory usage on their OSDs after upgrading their MONs to Luminous
as well, but I can't seem to find it. IIRC the problem for them was
resolved when they finished the upgrade to Luminous. It might not be a bad
idea
Thank you so much for your feedback. It always helps in this kind of
situation.
We fixed the network issue and added as much RAM and as much swap as
possible, but we were still nowhere near stable, with the OOM killer
decimating the OSDs, which at times used more than 35 GB of memory.
We decided to shut half
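(For reference, we were watching per-OSD memory with something like the
following; plain ps, nothing fancy:)

$ ps -o pid,rss,comm -C ceph-osd --sort=-rss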
Hi Adrien,
I don't expect I can fully explain what happened to your cluster, but
since you got no other feedback so far I'll try my best.
So you have 517 million RADOS objects. Assuming at least 3 copies each
for normal replication or 5 "shards" for EC pools, there are somewhere
between 1.5 to
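(To spell out that arithmetic under the same assumptions: 517M objects
x 3 replicas is roughly 1.55 billion object copies, and 517M x 5 EC
shards is roughly 2.6 billion.)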
Some follow-up.
After doubling the RAM (so 128 GB for 12x4TB OSDs), all the RAM is still
consumed shortly after the OSDs restart.
We are considering going through with the update of the OSDs to Luminous
(or maybe going back to Jewel on the mons...) but the
cluster is in bad shape...
health:
Hi everyone,
We have a hard time figuring out a behaviour encountered after upgrading
the monitors of one of our clusters from Jewel to Luminous yesterday.
The cluster is composed of 14 OSD hosts (2x E5-2640 v3 and 64 GB of RAM),
each containing 12x4TB OSDs with journals on DC-grade SSDs. The