Just shooting in the dark here, but you may be affected by a similar issue I
had a while back; it was discussed here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ZOPBOY6XQOYOV6CQMY27XM37OC6DKWZ7/

In short: the default of bluefs_buffered_io was changed to false in a
recent Nautilus release, and I assume the same change was applied to newer
releases. That led to severe performance issues and similar symptoms, i.e.
lower memory usage on the OSD nodes. Worth checking out.
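If that turns out to be the cause, the setting is easy to inspect and flip back. A minimal sketch with the `ceph config` CLI (the OSD id is just an example, and whether the change applies at runtime or needs an OSD restart depends on your release):

```shell
# Show the effective value on one OSD (osd.0 is an example id)
ceph config show osd.0 bluefs_buffered_io

# Re-enable buffered I/O for all OSDs; depending on the release this may
# take effect at runtime or only after restarting the OSD daemons
ceph config set osd bluefs_buffered_io true
```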

Of course, it may be something completely different. You should look into
monitoring all your OSDs separately, checking their utilization, await, and
other parameters, and comparing them to pre-upgrade values to find the
root cause.
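If you don't have pre-upgrade iostat history, the raw counters in /proc/diskstats are enough to compute await and utilization yourself. A rough sketch of the arithmetic iostat performs (the sample numbers below are illustrative; the field layout follows the kernel's iostats documentation):

```python
def parse_diskstats(text):
    """Map device name -> list of integer counters from /proc/diskstats lines."""
    stats = {}
    for line in text.strip().splitlines():
        parts = line.split()
        # parts[0:3] are major, minor, device name; the counters follow
        stats[parts[2]] = [int(x) for x in parts[3:]]
    return stats

def await_and_util(prev, curr, interval_ms):
    """Return (await_ms, util_pct) for one device over a sample interval.

    Counter indices (0-based, counting from the first counter after the name):
      0 reads completed, 3 ms spent reading,
      4 writes completed, 7 ms spent writing,
      9 ms spent doing I/O.
    """
    d = [c - p for p, c in zip(prev, curr)]
    ios = d[0] + d[4]          # completed reads + writes in the interval
    io_ms = d[3] + d[7]        # total time those I/Os spent in flight
    awaited = io_ms / ios if ios else 0.0
    util = 100.0 * d[9] / interval_ms
    return awaited, util
```

Sample /proc/diskstats twice, a few seconds apart, and feed the two counter lists per device into `await_and_util`; a sharp jump in await on every HDD OSD at the same time would point at the buffered-I/O change rather than a single bad disk.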

Mon, 2 Nov 2020 at 11:55, Marc Roos <m.r...@f1-outsourcing.eu>:

>
> I have been advocating for a long time for publishing test data from a
> basic test cluster against different Ceph releases: just a basic cluster
> that covers the most common configurations, running the same tests, so
> you can compare Ceph performance directly. That would mean a lot for
> smaller companies that do not have access to a good test environment. I
> have also asked about this at a Ceph seminar.
>
>
>
> -----Original Message-----
> From: Martin Rasmus Lundquist Hansen [mailto:han...@imada.sdu.dk]
> Sent: Monday, November 02, 2020 7:53 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Seriously degraded performance after update to
> Octopus
>
> Two weeks ago we updated our Ceph cluster from Nautilus (14.2.0) to
> Octopus (15.2.5), an update that was long overdue. We used the Ansible
> playbooks to perform a rolling update and, except for a few minor
> problems with the Ansible code, the update went well. The Ansible
> playbooks were also used for setting up the cluster in the first place.
> Before updating the Ceph software we also performed a full update of
> CentOS and the Linux kernel (this part of the update had already been
> tested on one of the OSD nodes the week before and we didn't notice any
> problems).
>
> However, after the update we are seeing a serious decrease in
> performance, more than a factor of 10 in some cases. I spent a week
> trying to come up with an explanation or solution, but I am completely
> blank. Independently of Ceph I tested the network performance and the
> performance of the OSD disks, and I am not really seeing any problems
> here.
>
> The specifications of the cluster are:
> - 3x Monitor nodes running mgr+mon+mds (Intel(R) Xeon(R) Silver 4108 CPU
> @ 1.80GHz, 16 cores, 196 GB RAM)
> - 14x OSD nodes, each with 18 HDDs and 1 NVME (Intel(R) Xeon(R) Gold
> 6126 CPU @ 2.60GHz, 24 cores, 384 GB RAM)
> - CentOS 7.8 and Kernel 5.4.51
> - 100 Gbps Infiniband
>
> We are collecting various metrics using Prometheus, and on the OSD nodes
> we are seeing some clear differences when it comes to CPU and Memory
> usage. I collected some graphs here: http://mitsted.dk/ceph . After the
> update, the system load is greatly reduced, there is almost no longer any
> iowait on the CPUs, and free memory is no longer being used for buffers (I
> can confirm that the changes in these metrics are not due to the update
> of CentOS or the Linux kernel). All in all, now the OSD nodes are almost
> completely idle all the time (and so are the monitors). On the linked
> page I also attached two RADOS benchmarks. The first benchmark was
> performed when the cluster was initially configured, and the second is
> the same benchmark after the update to Octopus. When comparing these
> two, it is clear that the performance has changed dramatically. For
> example, in the write test the bandwidth is reduced from 320 MB/s to 21
> MB/s and the number of IOPS has also dropped significantly.
>
> I temporarily tried to disable the firewall and SELinux on all nodes to
> see if it made any difference, but it didn't look like it (I did not
> restart any services during this test; I am not sure whether that would
> be necessary).
>
> Any suggestions for finding the root cause of this performance decrease
> would be greatly appreciated.
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
