Re: [ceph-users] High Load and High Apply Latency

2018-02-18 Thread Steven Vacaroaia
Hi John,

I am trying to squeeze extra performance from my test cluster too:
Dell R620 with PERC H710, RAID0, 10 Gb network

Would you be willing to share your controller and kernel configuration ?

For example, I am using the BIOS profile "Performance" with the following
added to /etc/default/kernel:

intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0
idle=poll

and the tuned profile throughput-performance.

All disks are configured with nr_requests=1024 and read_ahead_kb=4096.
The SSD uses the noop scheduler while the HDDs use deadline.
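
For reference, here is a sketch of the commands used to apply those queue
settings (sda and sdb below are only placeholders for the SSD and HDD
devices; substitute your own):

# SSD: noop scheduler, deeper request queue, larger read-ahead
echo noop > /sys/block/sda/queue/scheduler
echo 1024 > /sys/block/sda/queue/nr_requests
echo 4096 > /sys/block/sda/queue/read_ahead_kb

# HDD: same queue depth and read-ahead, but deadline scheduler
echo deadline > /sys/block/sdb/queue/scheduler
echo 1024 > /sys/block/sdb/queue/nr_requests
echo 4096 > /sys/block/sdb/queue/read_ahead_kb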

Cache policy for the SSD:

megacli -LDSetProp -WT -Immediate -L0 -a0
megacli -LDSetProp -NORA -Immediate -L0 -a0
megacli -LDSetProp -Direct -Immediate -L0 -a0

The HDD cache policy has all caches enabled (WB and ADRA).

Many thanks

Steven



On 16 February 2018 at 19:06, John Petrini  wrote:

> I thought I'd follow up on this just in case anyone else experiences
> similar issues. We ended up increasing the tcmalloc thread cache size and
> saw a huge improvement in latency. This got us out of the woods because we
> were finally in a state where performance was good enough that it was no
> longer impacting services.
>
> The tcmalloc issues are pretty well documented on this mailing list and I
> don't believe they impact newer versions of Ceph but I thought I'd at least
> give a data point. After making this change our average apply latency
> dropped to 3.46ms during peak business hours. To give you an idea of how
> significant that is here's a graph of the apply latency prior to the
> change: https://imgur.com/KYUETvD
>
> This however did not resolve all of our issues. We were still seeing high
> iowait (repeated spikes up to 400ms) on three of our OSD nodes on all
> disks. We tried replacing the RAID controller (PERC H730) on these nodes
> and while this resolved the issue on one server the two others remained
> problematic. These two nodes were configured differently than the rest.
> They'd been configured in non-raid mode while the others were configured as
> individual raid-0. This turned out to be the problem. We ended up removing
> the two nodes one at a time and rebuilding them with their disks configured
> in independent raid-0 instead of non-raid. After this change iowait rarely
> spikes above 15ms and averages <1ms.
>
> I was really surprised at the performance impact when using non-raid mode.
> While I realize non-raid bypasses the controller cache I still would have
> never expected such high latency. Dell has a whitepaper that recommends
> using individual raid-0 but their own tests show only a small performance
> advantage over non-raid. Note that we are running SAS disks, they actually
> recommend non-raid mode for SATA but I have not tested this. You can view
> the whitepaper here:
> http://en.community.dell.com/techcenter/cloud/m/dell_cloud_resources/20442913/download
>
> I hope this helps someone.
>
> John Petrini
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High Load and High Apply Latency

2018-02-17 Thread Marc Roos

But isn't that already the default? (on the CentOS 7 rpms)

[@c03 ~]# cat /etc/sysconfig/ceph
# /etc/sysconfig/ceph
#
# Environment file for ceph daemon systemd unit files.
#

# Increase tcmalloc cache size
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
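
A quick way to confirm a running OSD actually picked that value up is to
look at its environment, e.g. (any ceph-osd pid will do):

cat /proc/$(pidof -s ceph-osd)/environ | tr '\0' '\n' | grep TCMALLOC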
 



-Original Message-
From: John Petrini [mailto:jpetr...@coredial.com] 
Sent: Saturday, 17 February 2018 1:06
To: David Turner
Cc: ceph-users
Subject: Re: [ceph-users] High Load and High Apply Latency

I thought I'd follow up on this just in case anyone else experiences 
similar issues. We ended up increasing the tcmalloc thread cache size 
and saw a huge improvement in latency. This got us out of the woods 
because we were finally in a state where performance was good enough 
that it was no longer impacting services. 

The tcmalloc issues are pretty well documented on this mailing list and 
I don't believe they impact newer versions of Ceph but I thought I'd at 
least give a data point. After making this change our average apply 
latency dropped to 3.46ms during peak business hours. To give you an 
idea of how significant that is here's a graph of the apply latency 
prior to the change: https://imgur.com/KYUETvD


This however did not resolve all of our issues. We were still seeing 
high iowait (repeated spikes up to 400ms) on three of our OSD nodes on 
all disks. We tried replacing the RAID controller (PERC H730) on these 
nodes and while this resolved the issue on one server the two others 
remained problematic. These two nodes were configured differently than 
the rest. They'd been configured in non-raid mode while the others were 
configured as individual raid-0. This turned out to be the problem. We 
ended up removing the two nodes one at a time and rebuilding them with 
their disks configured in independent raid-0 instead of non-raid. After 
this change iowait rarely spikes above 15ms and averages <1ms.


I was really surprised at the performance impact when using non-raid 
mode. While I realize non-raid bypasses the controller cache I still 
would have never expected such high latency. Dell has a whitepaper that 
recommends using individual raid-0 but their own tests show only a small 
performance advantage over non-raid. Note that we are running SAS disks, 
they actually recommend non-raid mode for SATA but I have not tested 
this. You can view the whitepaper here: 
http://en.community.dell.com/techcenter/cloud/m/dell_cloud_resources/20442913/download


I hope this helps someone.


John Petrini



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High Load and High Apply Latency

2018-02-16 Thread John Petrini
I thought I'd follow up on this just in case anyone else experiences
similar issues. We ended up increasing the tcmalloc thread cache size and
saw a huge improvement in latency. This got us out of the woods because we
were finally in a state where performance was good enough that it was no
longer impacting services.

The tcmalloc issues are pretty well documented on this mailing list and I
don't believe they impact newer versions of Ceph but I thought I'd at least
give a data point. After making this change our average apply latency
dropped to 3.46ms during peak business hours. To give you an idea of how
significant that is here's a graph of the apply latency prior to the
change: https://imgur.com/KYUETvD

This however did not resolve all of our issues. We were still seeing high
iowait (repeated spikes up to 400ms) on three of our OSD nodes on all
disks. We tried replacing the RAID controller (PERC H730) on these nodes
and while this resolved the issue on one server the two others remained
problematic. These two nodes were configured differently than the rest.
They'd been configured in non-raid mode while the others were configured as
individual raid-0. This turned out to be the problem. We ended up removing
the two nodes one at a time and rebuilding them with their disks configured
in independent raid-0 instead of non-raid. After this change iowait rarely
spikes above 15ms and averages <1ms.
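
For anyone making the same change, a rough sketch of the megacli side looks
like this (the enclosure:slot and adapter ids are only examples, and the
disks first need to be taken out of non-raid mode, e.g. via the controller
setup utility):

# find the enclosure:slot ids of the physical disks
megacli -PDList -aALL
# create a single-drive RAID-0 virtual disk with write-back, adaptive
# read-ahead and direct IO; repeat per disk
megacli -CfgLdAdd -r0 [32:0] WB ADRA Direct -a0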

I was really surprised at the performance impact when using non-raid mode.
While I realize non-raid bypasses the controller cache I still would have
never expected such high latency. Dell has a whitepaper that recommends
using individual raid-0 but their own tests show only a small performance
advantage over non-raid. Note that we are running SAS disks, they actually
recommend non-raid mode for SATA but I have not tested this. You can view
the whitepaper here:
http://en.community.dell.com/techcenter/cloud/m/dell_cloud_resources/20442913/download

I hope this helps someone.

John Petrini
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High Load and High Apply Latency

2017-12-20 Thread John Petrini
Hello,

Looking at perf top, it looks as though Ceph is spending most of its CPU
cycles in tcmalloc. Looking around online I found that this is a known
issue, and in fact I found this guide on how to increase the tcmalloc
thread cache size:
https://swamireddy.wordpress.com/2017/01/27/increase-tcmalloc-thread-cache-bytes/.
Is this the right step to take toward fixing this issue?

Here's the output of perf report that shows this behavior.
http://paste.openstack.org/show/629490/
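
For reference, a capture like that can be produced with something along
these lines (the pid selection is only an example; pick one busy ceph-osd
process):

# sample one OSD process with call graphs for 30 seconds
perf record -g -p $(pidof -s ceph-osd) -- sleep 30
# then render the report
perf report --stdio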

Thanks,

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High Load and High Apply Latency

2017-12-18 Thread John Petrini
Another strange thing I'm seeing is that two of the nodes in the cluster
have some OSDs with almost no activity. If I watch top long enough I'll
eventually see CPU utilization on these OSDs, but for the most part they sit
at 0% CPU utilization. I'm not sure if this is expected behavior or not
though. I have another cluster running the same version of Ceph that has
the same symptom, but the OSDs in our Jewel cluster always show activity.


John Petrini
Platforms Engineer

CoreDial | 215.297.4400 x 232 | www.coredial.com

On Mon, Dec 18, 2017 at 11:51 AM, John Petrini 
wrote:

> Hi David,
>
> Thanks for the info. The controller in the server (perc h730) was just
> replaced and the battery is at full health. Prior to replacing the
> controller I was seeing very high iowait when running iostat but I no
> longer see that behavior - just apply latency when running ceph osd perf.
> Since there's no iowait it makes me believe that the latency is not being
> introduced by the hardware; though I'm not ruling it out completely. I'd
> like to know what I can do to get a better understanding of what the OSD
> processes are so busy doing because they are working much harder on this
> server than the others.
>
>
>
>
>
> On Thu, Dec 14, 2017 at 11:33 AM, David Turner 
> wrote:
>
>> We show high disk latencies on a node when the controller's cache battery
>> dies.  This is assuming that you're using a controller with cache enabled
>> for your disks.  In any case, I would look at the hardware on the server.
>>
>> On Thu, Dec 14, 2017 at 10:15 AM John Petrini 
>> wrote:
>>
>>> Anyone have any ideas on this?
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High Load and High Apply Latency

2017-12-18 Thread John Petrini
Hi David,

Thanks for the info. The controller in the server (perc h730) was just
replaced and the battery is at full health. Prior to replacing the
controller I was seeing very high iowait when running iostat but I no
longer see that behavior - just apply latency when running ceph osd perf.
Since there's no iowait it makes me believe that the latency is not being
introduced by the hardware; though I'm not ruling it out completely. I'd
like to know what I can do to get a better understanding of what the OSD
processes are so busy doing because they are working much harder on this
server than the others.
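
Would dumping the admin socket counters and per-thread CPU usage be a
reasonable next step? For example (osd.12 and the pid selection are only
examples):

# internal performance counters for one OSD, via its admin socket
ceph daemon osd.12 perf dump
# per-thread CPU usage for one ceph-osd process, 5 second samples
pidstat -t -p $(pidof -s ceph-osd) 5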




On Thu, Dec 14, 2017 at 11:33 AM, David Turner 
wrote:

> We show high disk latencies on a node when the controller's cache battery
> dies.  This is assuming that you're using a controller with cache enabled
> for your disks.  In any case, I would look at the hardware on the server.
>
> On Thu, Dec 14, 2017 at 10:15 AM John Petrini 
> wrote:
>
>> Anyone have any ideas on this?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High Load and High Apply Latency

2017-12-14 Thread David Turner
We see high disk latencies on a node when the controller's cache battery
dies. This assumes you're using a controller with cache enabled for your
disks. In any case, I would look at the hardware on the server.
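
On a MegaRAID-based controller the battery and cache state are easy to rule
out, e.g. (commands are only a sketch, adjust for your tooling):

# battery/BBU health and charge state
megacli -AdpBbuCmd -GetBbuStatus -aALL
# current cache policy per logical disk (look for a fallback to WriteThrough)
megacli -LDInfo -Lall -aALL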

On Thu, Dec 14, 2017 at 10:15 AM John Petrini  wrote:

> Anyone have any ideas on this?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High Load and High Apply Latency

2017-12-14 Thread John Petrini
Anyone have any ideas on this?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] High Load and High Apply Latency

2017-12-11 Thread John Petrini
Hi List,

I've got a 5-node OSD cluster running Hammer. All of the OSD servers are
identical, but one has about 3-4x higher load than the others and the OSDs
in this node are reporting high apply latency.

The cause of the load appears to be the OSD processes. About half of the
OSD processes are using between 100-185% CPU, keeping the processor pegged
around 85% utilization overall. In comparison, other servers in the
cluster are sitting around 30% CPU utilization and report ~1.5ms of
apply latency.

A few days ago I restarted the OSD processes and the problem went away,
but now, three days later, it has returned. I don't see anything in the
logs and there's no iowait on the disks.
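
For context, iowait and per-OSD latency were checked with roughly the
following (flags and intervals are just examples):

# extended per-disk statistics, 5 second intervals
iostat -x 5
# commit/apply latency as reported by the OSDs themselves
ceph osd perf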

Anyone have any ideas on how I can troubleshoot this further?

Thank You,

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com