Re: cause of IO wait

2011-06-27 Thread Mike Ballon
You've mentioned iostat and vmstat, so let's skip those.

I would start sar and save the list of running processes at the same
interval, normally 5 minutes.

I would also take a look at lsof, tracing the PIDs.

Then there is iotop if you have it.

-Mike

On Monday, June 27, 2011, der.hans  wrote:
> Hello,
>
> I've got a machine experiencing a lot of IO wait.
>
> We had power at a datacenter go down last week. Since then IO wait has
> been over 35%. At first we thought it was due to 3ware RAID verify taking
> place due to the crash. That took a few days, then the weekly verify
> started. We stopped that and IO wait stayed high. 8 disks in a RAID 10.
>
> Load avg is also very high, presumably due to the IO wait.
>
> smartctl short tests didn't turn up any issues.
>
> We're not swapping at all.
>
> Disk read and write are fairly low.
>
> Network traffic is down, as are the total number of processes and the
> number of running processes. No evidence of network errors on the box or
> at the switch.
>
> Not much going on in the logs. We've stopped several reporting processes
> in order to reduce disk access.
>
> On the positive side, entropy has been staying high :).
>
> IO wait is not explicitly disk? It could be network, serial, USB, etc.?
>
> How do I determine what resource is causing the IO wait? Is there a way to
> track to a specific process?
>
> vmstat, iostat, top and lots of other tools have been great at showing
> that there's overall IO wait ( I've been able to show that almost all
> processors have high wait, one was only at 5% ), but I haven't yet
> determined what and how.
>
> The server is running CentOS in case that matters.
>
> ciao,
>
> der.hans
> --
> #  http://www.LuftHans.com/        http://www.LuftHans.com/Classes/
> #  Hope has two beautiful daughters: Anger and Courage. Anger at the way
> #  things are, and Courage to struggle to create things as they should be.
> #  -- St. Augustine
> ---
> PLUG-discuss mailing list - PLUG-discuss@lists.plug.phoenix.az.us
> To subscribe, unsubscribe, or to change your mail settings:
> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
>


Re: cause of IO wait

2011-06-27 Thread James Mcphee
If it's consistently consuming a lot of CPU, doing a "ps auxwww" and
checking for blocked state should do it.

On Mon, Jun 27, 2011 at 5:42 PM, Mike Ballon  wrote:

> You've mentioned iostat and vmstat, so let's skip those.
>
> I would start sar and save the list of running processes at the same
> interval, normally 5 minutes.
>
> I would also take a look at lsof, tracing the PIDs.
>
> Then there is iotop if you have it.
>
> -Mike



-- 
James McPhee
jmc...@gmail.com

Re: cause of IO wait

2011-06-27 Thread Lisa Kachold
Hi Hans:

On Mon, Jun 27, 2011 at 5:07 PM, der.hans  wrote:

What version is your 3ware firmware?  That's fairly important, you realize?

> The server is running CentOS in case that matters.

Please see this link about a known bug in the RHEL kernel affecting
3ware products:
https://bugzilla.redhat.com/show_bug.cgi?id=121434
It also discusses troubleshooting commands to verify the problem, some
kernel proc tuning, and resolutions that worked for some people.

I don't see your kernel or distro version listed. Is this CentOS on
a 2.4 kernel? CentOS 5.6?

There are many suggestions there that will give you a place to start:

For instance, try reducing the queue depth of the 3Ware driver:

can_queue from 254 to 30
command_per_lun from 254 to 4
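
Before changing either value, it may help to see what the driver currently
reports. The following is a minimal Python sketch, not a definitive tool. It
assumes a kernel that exposes the SCSI host attributes can_queue and
cmd_per_lun under /sys/class/scsi_host (the sysfs attribute is spelled
cmd_per_lun, not command_per_lun). The function name is made up for
illustration, and the code only reads the values; it does not change them.

```python
import os

def read_scsi_host_limits(base="/sys/class/scsi_host"):
    """Return {host: {attr: int}} for the queue limits each SCSI host reports."""
    limits = {}
    if not os.path.isdir(base):
        return limits  # no sysfs SCSI host directory on this system
    for host in sorted(os.listdir(base)):
        values = {}
        for attr in ("can_queue", "cmd_per_lun"):
            path = os.path.join(base, host, attr)
            try:
                with open(path) as fh:
                    values[attr] = int(fh.read().strip())
            except (OSError, ValueError):
                continue  # attribute missing or unreadable on this host
        if values:
            limits[host] = values
    return limits

if __name__ == "__main__":
    for host, values in read_scsi_host_limits().items():
        print(host, values)
```

How the limits are actually lowered depends on the driver and kernel version,
so that part is left to the bug report linked above.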

There is a good deal of material in this post that will give you some
ideas on how to do high performance kernel tuning and troubleshooting.

But first, I would search using your firmware version and kernel
version/distro to get all the known issues in preparation for
UPGRADING.  You certainly can't expect CURRENT performance without
kernel sources?



-- 
(602) 791-8002  Android
(623) 239-3392 Skype
(623) 688-3392 Google Voice

HomeSmartInternational.com

Re: cause of IO wait

2011-06-27 Thread Stephen
Not sure if this applies to your hardware, but in the past I have had
Supermicro boards auto-detect some 3ware cards at 133 MHz PCI-X instead of
their actual 100 MHz. This can lead to ever-increasing latency and
performance issues before the card just fails.
On Jun 27, 2011 9:27 PM, "Lisa Kachold"  wrote:

Re: cause of IO wait

2011-06-28 Thread der.hans

On 27 Jun 2011, James Mcphee wrote:

> If it's consistently consuming a lot of CPU, doing a "ps auxwww" and
> checking for blocked state should do it.

I hadn't done a good enough job of reviewing procs in blocked state.

I was able to restart sendmail, and noted after the fact that sendmail had
been blocked.

Load avg is down to 3, but iowait is still 25%.
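
For reference, the per-CPU iowait share that tools such as vmstat and top
display can be recomputed from two samples of /proc/stat. This is a minimal
Python sketch under the assumption that the fifth counter on each cpu line is
iowait, as on 2.6 and later kernels. The function names and the 5-second
sample interval are arbitrary choices for illustration.

```python
import time

def parse_cpu_times(stat_text):
    """Map each 'cpu*' line of /proc/stat to its list of tick counters."""
    cpus = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            cpus[fields[0]] = [int(v) for v in fields[1:]]
    return cpus

def iowait_percent(before, after):
    """Per-CPU share of elapsed ticks spent in iowait (the 5th counter)."""
    result = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None or len(new) < 5 or len(old) < 5:
            continue
        # Only the first eight counters are summed; the guest counters that
        # newer kernels append are already included in user time.
        total = sum(new[:8]) - sum(old[:8])
        if total > 0:
            result[name] = 100.0 * (new[4] - old[4]) / total
    return result

def read_stat():
    with open("/proc/stat") as fh:
        return parse_cpu_times(fh.read())

if __name__ == "__main__":
    first = read_stat()
    time.sleep(5)
    for name, pct in sorted(iowait_percent(first, read_stat()).items()):
        print("%s %.1f%% iowait" % (name, pct))
```

This only shows how much wait there is per CPU; it does not say which device
or process is responsible.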

ciao,

der.hans
--
#  http://www.LuftHans.com/        http://www.LuftHans.com/Classes/
#  "I have seen the enemy, and it is shiny." -- Benjy Feen, 22Jun2001

Re: cause of IO wait

2011-06-28 Thread Bryan O'Neal
I too would like some answers on how to track down the source of IO
wait, but I can ask some other questions. Did you check the RAID
controller's health? BBU in good shape? Still have all your cache? Did
you end up in write-through? Did you tweak things before and lose your
tweaks because they were not in the appropriate confs? By tweaks I mean
things like your fs levelers or disabling atime, etc.
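
One quick way to check whether an atime tweak survived a reboot is to look at
the live mount options. The following is a minimal Python sketch that assumes
the standard /proc/mounts layout (device, mount point, type, options). The
function name is hypothetical, and relatime is deliberately not counted as
noatime here.

```python
def mounts_updating_atime(mounts_text):
    """Return mount points from /proc/mounts text that lack noatime."""
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mount_point, _fs_type, options = fields[:4]
        if not device.startswith("/dev/"):
            continue  # skip proc, sysfs, tmpfs and other virtual filesystems
        if "noatime" not in options.split(","):
            flagged.append(mount_point)
    return flagged

if __name__ == "__main__":
    with open("/proc/mounts") as fh:
        for mount_point in mounts_updating_atime(fh.read()):
            print("atime updates still enabled on", mount_point)
```

If a filesystem shows up here but was meant to be noatime, the option was
probably set by hand with mount -o remount and never written to /etc/fstab.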

On 6/27/11, der.hans  wrote:



Re: cause of IO wait

2011-06-30 Thread der.hans

On 27 Jun 2011, Mike Ballon wrote:

> You've mentioned iostat and vmstat, so let's skip those.
>
> I would start sar and save the list of running processes at the same
> interval, normally 5 minutes.

Also looked at sar output :).

> I would also take a look at lsof, tracing the PIDs.
>
> Then there is iotop if you have it.

Hmm, I thought I'd tried that, but since it's still not installed on that
machine I must not have. Going to investigate that with an eye on what I
was trying to do.

ciao,

der.hans
--
#  http://www.LuftHans.com/        http://www.LuftHans.com/Classes/
#  As we enjoy great Advantages from the
#  Inventions of others we should be glad of an
#  Opportunity to serve others by any Invention of ours,
#  and this we should do freely and generously.
#  -- Benjamin Franklin (1706-1790), on his refusal to patent his inventions.

Re: cause of IO wait

2011-06-30 Thread der.hans

On 27 Jun 2011, Lisa Kachold wrote:

Hello,

> What version is your 3ware firmware?  That's fairly important, you realize?

There is an update we'll apply when I can schedule some downtime for the
server.

> For instance, try reducing the queue depth of the 3Ware driver:
>
> can_queue from 254 to 30
> command_per_lun from 254 to 4

I'll look into these. We only have a couple of 3ware cards in place, but
making them more efficient is a good idea.

ciao,

der.hans
--
#  http://www.LuftHans.com/        http://www.LuftHans.com/Classes/
#  Molotov Bible - religion thrown at other people in order to cause an
#  explosive situation - der.hans

Re: cause of IO wait

2011-06-30 Thread der.hans

On 27 Jun 2011, Stephen wrote:

Hello,

> Not sure if this applies to your hardware, but in the past I have had
> Supermicro boards auto-detect some 3ware cards at 133 MHz PCI-X instead of
> their actual 100 MHz. This can lead to ever-increasing latency and
> performance issues before the card just fails.

I'll check on that. I'll even see if I can add a monitor for it.

ciao,

der.hans
--
#  http://www.LuftHans.com/        http://www.LuftHans.com/Classes/
#  I've got a photographic memory,
#  but I'm lousy photographer. - der.hans

Re: cause of IO wait

2011-06-30 Thread der.hans

On 28 Jun 2011, Bryan O'Neal wrote:

> I too would like some answers on how to track down the source of IO
> wait, but I can ask some other questions. Did you check the RAID

First off, don't believe sendmail when it claims you have an empty mail
queue. Apparently it's lazy and if the mail queue gets too large sendmail
stops trying to count and just says the queue is empty :(. The machine
only gets a few emails a minute, so an empty queue made sense.
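
An independent way to check the queue size is to count the control files
directly instead of asking sendmail. This is a minimal Python sketch that
assumes the default queue directory /var/spool/mqueue and the usual naming in
which each queued message has one control file whose name starts with "qf".
Both are assumptions about the local setup, the function name is made up, and
the directory is normally readable only by root.

```python
import os

def count_queued_messages(queue_dir="/var/spool/mqueue"):
    """Count sendmail control files (names starting with 'qf') under queue_dir."""
    count = 0
    # os.walk also descends into subdirectories, which covers setups that
    # split the queue into several queue directories.
    for _root, _dirs, files in os.walk(queue_dir):
        count += sum(1 for name in files if name.startswith("qf"))
    return count

if __name__ == "__main__":
    print(count_queued_messages(), "messages queued")
```

A count of zero from an unprivileged account may only mean the directory
could not be read, so run it as root before trusting the number.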

James' suggestion of looking for processes in a blocked state was what
finally got me. I had done that, but had apparently been too bleary-eyed
to notice the capital D that accompanied each sendmail process.
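
The same check can be scripted so the capital D is harder to miss. The
following is a minimal Python sketch that reads the state field from
/proc/<pid>/stat, where D means uninterruptible sleep (usually waiting on
IO). The function names are made up for illustration; this is one way to do
it, not the method ps itself uses.

```python
import os

def parse_stat_line(stat_line):
    """Split a /proc/<pid>/stat line into (pid, command, state)."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split around the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    command = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, command, state

def blocked_processes(proc_dir="/proc"):
    """Return (pid, command) for every process in uninterruptible sleep (D)."""
    blocked = []
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_dir, entry, "stat")) as fh:
                pid, command, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were looking
        if state == "D":
            blocked.append((pid, command))
    return sorted(blocked)

if __name__ == "__main__":
    for pid, command in blocked_processes():
        print(pid, command)
```

Running it every few seconds and noting which commands keep reappearing gives
a rough picture of what is stuck waiting.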

Mike's suggestion of iotop looks good, but it turns out I can't use it on
that machine right now anyway.
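
Where iotop is not an option but the kernel exposes per-process IO
accounting in /proc/<pid>/io, the raw counters can be sampled by hand. This
is a minimal Python sketch under that assumption; older kernels do not
provide the file at all. The function names are hypothetical, and reading
another user's counters generally requires root.

```python
def parse_proc_io(io_text):
    """Turn the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in io_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters

def bytes_moved(before, after):
    """Bytes read from and written to storage between two samples."""
    return (after["read_bytes"] - before["read_bytes"],
            after["write_bytes"] - before["write_bytes"])

def read_proc_io(pid):
    with open("/proc/%d/io" % pid) as fh:
        return parse_proc_io(fh.read())
```

Sampling read_proc_io() twice for each PID of interest and passing both
samples to bytes_moved() gives a rough per-process figure for the interval.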

I was starting to use oprofile when I finally figured out the problem.

Lisa's suggestion of updating ( or better yet avoiding ) proprietary
firmware is also good. A few more things in our datacenters to fix before
I can add firmware updates to the rotation, but that's definitely now
something on my radar.


> controller's health? BBU in good shape? Still have all your cache? Did


3ware tool claimed the hardware is in good shape.


> you end up in write-through? Did you tweak things before and lose your
> tweaks because they were not in the appropriate confs? By tweaks I mean
> things like your fs levelers or disabling atime, etc.


Not that I know of, and if we did they're gone, as that machine had been on
the air long since the guy who set it up left the company...

I'm documenting and/or moving to puppet all such things as I find them.

ciao,

der.hans
--
#  http://www.LuftHans.com/        http://www.LuftHans.com/Classes/
#  Dissent is patriotic.