Re: any advice to find root cause of Falling back to HPET ?

2010-05-23 Thread Bond Masuda
On Sun, 2010-05-23 at 21:11 -0300, Andrew Reid wrote:
> A few random thoughts:
>
> Any chance the RAID is still rebuilding?
>
> Is the slow unit truly slower, that is, does it take longer as YOU measure
> it to do a significant task? I ask because a bad clock would make
> it report longer times due to time measurement errors.
>
> A comparison of the actual number of interrupts (cat /proc/interrupts)
> on both machines after they have been up for similar periods, doing
> similar things, may give a hint as to who has 'stolen' your CPU.
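
The interrupt comparison suggested above could be scripted roughly as follows. This is only a sketch: it assumes the output of `cat /proc/interrupts` has been saved from each machine as text, and the function names and the 5x threshold are illustrative choices, not anything from the thread.

```python
def parse_interrupts(text):
    """Parse saved /proc/interrupts text into {irq: (total_count, description)}."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:       # per-CPU counters come first
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def compare_interrupts(good, bad, min_ratio=5.0):
    """List IRQs whose totals differ by at least min_ratio between two hosts."""
    suspects = []
    for irq in sorted(set(good) & set(bad)):
        count_good, count_bad = good[irq][0], bad[irq][0]
        ratio = (count_bad + 1) / (count_good + 1)   # +1 avoids division by zero
        if ratio >= min_ratio or ratio <= 1.0 / min_ratio:
            suspects.append((irq, count_good, count_bad, bad[irq][1]))
    return suspects
```

Feeding it the two saved snapshots (for example, one from s7 and one from s8 after similar uptime) gives a short list of interrupt lines worth a closer look. The comparison only means something if both machines have been up and loaded for similar periods.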

Yes, tasks do take longer in real time (a lot longer).

The RAID was not rebuilding.

The 2nd chassis started behaving strangely, and upon reboot would not
finish booting. The guys at the hosting company said the server was
still booting up after 15 minutes and going VERY VERY slowly. They
finally gave up and decided to swap chassis again (on 3rd chassis now).

This time around, on the 3rd chassis, the lost ticks are gone, the
server is fast/normal again, and all is well. I just can't believe we
ran into 2 servers with the same issue back to back. (The guys at the
hosting company are usually great, but it almost makes me doubt that
they did the 1st chassis swap.)

Anyway, all is well again. Bizarre weekend indeed.
-Bond

___
Linux-PowerEdge mailing list
Linux-PowerEdge@dell.com
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq


Re: any advice to find root cause of Falling back to HPET ?

2010-05-23 Thread Jefferson Ogata
On 2010-05-24 01:01, Bond Masuda wrote:
> This time around, on 3rd chassis, the lost ticks are no longer, the
> server is fast/normal again, and all is well. I just can't believe we
> ran into 2 servers with the same issue back to back??? (the guys at the
> hosting company are usually great, but it almost makes me doubt that
> they did the 1st chassis swap??)

For future reference, you might want to save a dump of dmidecode output
before requesting chassis swaps so you can check if the service tag
actually changed.
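
A rough sketch of that check, assuming the dumps were saved as plain text from `dmidecode` before and after the swap. It assumes the service tag shows up as the "Serial Number" field of the "System Information" section, which is worth confirming on your own hardware. The function names are made up for illustration.

```python
def system_serial(dump):
    """Return the Serial Number from the System Information section, or None."""
    section = None
    for line in dump.splitlines():
        if not line.strip():
            section = None                # a blank line ends the current record
            continue
        if not line[0].isspace():
            # Unindented lines are either "Handle ..." or a section title.
            if not line.startswith("Handle"):
                section = line.strip()
            continue
        key, sep, value = line.strip().partition(":")
        if sep and section == "System Information" and key == "Serial Number":
            return value.strip()
    return None


def chassis_changed(before_dump, after_dump):
    """True if the system serial number differs between the two dumps."""
    before = system_serial(before_dump)
    after = system_serial(after_dump)
    if before is None or after is None:
        raise ValueError("no System Information serial number found")
    return before != after
```

If only the one value is needed, `dmidecode -s system-serial-number` prints it directly, so saving that single line before each swap would also do.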



RE: any advice to find root cause of Falling back to HPET ?

2010-05-23 Thread Bond Masuda
Thanks. Good point.

As an aside, this has been a bizarre weekend. After finally
resolving this lost ticks issue, another server kernel panicked and crashed
mysteriously, then panicked again on boot with some messages about bad
memory in DIMM8 and DIMM7. Then my friend's Drobo went all red and failed
this afternoon. We must be getting showered by intense cosmic rays this
weekend.

Thanks for the responses, everyone.
-Bond



Re: any advice to find root cause of Falling back to HPET ?

2010-05-23 Thread Jefferson Ogata
On 2010-05-24 04:45, Bond Masuda wrote:
> As an aside, this weekend has been a bizarre weekend. After finally
> resolving this lost ticks issue, another server kernel panic and crashed
> mysteriously, and again kernel panic upon boot up.. some messages about bad
> memory in DIMM8 and DIMM7. then my friend's Drobo went all red and failed
> this afternoon. We must be getting showered by intense cosmic rays this
> weekend

Maybe the lost ticks were happening because of the Lost series
finale. <rim-shot />



any advice to find root cause of Falling back to HPET ?

2010-05-22 Thread Bond Masuda
Hello,

I'd appreciate any help/advice anyone can provide regarding our issue. I've
run out of ideas on this one...

We have two identical PowerEdge 2950s: one is called s7 and the other is s8.
Both are web servers running Apache and PHP. We first noticed the problem
because our benchmarking showed drastically different results between the
two servers. With s7, we were able to get 180 requests/sec while on s8 we
only get 35 requests/sec (and now only 15 requests/sec - more on that below).
After this, we became aware that almost all tasks on s8 were slower than on
s7. Whether CPU bound or I/O bound, everything we tried was slower on s8
than on s7 (untar'ing archives, running md5 hashes, etc).

I started digging around. Both servers are identical in terms of software
and configuration (other than things like hostname and IP addresses). Both
servers are RHEL4U8, kernel-2.6.9-89.0.25.ELsmp, x86_64, exact same packages
and exact same versions. I even ran 'rpm --verify' on all packages and
didn't find anything unusual on either s7 or s8.

The ONLY error messages I'm seeing that are unique to s8 are the following
in dmesg:

Losing some ticks... checking if CPU frequency changed.
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x4d/0xd0
Falling back to HPET
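
To compare machines quickly, a saved copy of the dmesg output can be scanned for these time-keeping warnings. The following is a minimal sketch. The strings it searches for are taken from the messages above, while the function and constant names are illustrative only.

```python
from collections import Counter

# Substrings of the kernel messages quoted above.
TIMER_WARNINGS = (
    "Losing some ticks",
    "many lost ticks",
    "Falling back to HPET",
)


def count_timer_warnings(dmesg_text):
    """Count how often each time-keeping warning appears in saved dmesg text."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        for needle in TIMER_WARNINGS:
            if needle in line:
                counts[needle] += 1
    return counts
```

An empty result on one host and a non-empty one on the other confirms which machine is affected, but it says nothing about the cause.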

Some google searching found:

https://bugzilla.redhat.com/show_bug.cgi?id=429010

which refers to:

https://bugzilla.redhat.com/show_bug.cgi?id=248488

But that seems to refer to problems with virtualization. This is on real
hardware.

What we don't understand is that s7 exhibits neither the slowness nor the
messages above, only s8. Again, both are identical.

So, thinking this might be a hardware issue, we asked our hosting company to
pull the drives out of s8 and replace the entire chassis. After replacing
the entire chassis of s8, we are still getting the above messages in dmesg.
Not only that, things have gotten worse... our benchmarking (using 'ab') now
shows the server can only do 15 requests/sec (all these tests were run
locally on loopback to avoid any network-related issues).
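
For keeping track of these numbers across runs, the throughput figure can be pulled out of saved 'ab' reports with a few lines of script. This is a sketch that assumes the report contains ab's usual "Requests per second:" summary line. The function names are hypothetical.

```python
import re


def requests_per_second(ab_output):
    """Extract the mean requests/sec figure from saved 'ab' output."""
    match = re.search(r"^Requests per second:\s+([0-9.]+)", ab_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Requests per second' line found")
    return float(match.group(1))


def slowdown(fast_output, slow_output):
    """How many times slower the second report is than the first."""
    return requests_per_second(fast_output) / requests_per_second(slow_output)
```

With the figures quoted in this message (180 requests/sec on s7, 15 on s8), the slowdown works out to a factor of 12.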

Since the chassis was swapped, we feel that it probably isn't a hardware
issue. But we have s7 which is configured identically to s8 that doesn't
have this issue, so it is hard to say that it is a software issue.

Any advice? What can I do to find the root cause?

TIA,
-Bond







RE: any advice to find root cause of Falling back to HPET ?

2010-05-22 Thread Bond Masuda
Thanks for the reply. I'm pretty sure it is not #2; we have verified that the
software on s7 and s8 is identical.

It is true, the only hardware components we kept using when the server
chassis was swapped out are the hard drives. I just don't understand how
a faulty drive can cause many lost ticks and those other messages.
Nonetheless, I'm not going to exclude the possibility.

I wish RHEL4 had a smartctl that worked with megasas; I'll have to compile the
latest smartctl to see if SMART data will tell me anything about the drives.
One thing to note is that during the build of these servers 2 weeks ago,
one of the drives on s8 did fail and had to be replaced.

-Bond

> -----Original Message-----
> From: guy [mailto:guy.choo.ke...@gmail.com]
> Sent: Saturday, May 22, 2010 12:42 PM
> To: Bond Masuda
> Cc: linux-powere...@lists.us.dell.com
> Subject: Re: any advice to find root cause of Falling back to HPET ?
>
> you can try putting s8's drives inside s7 and see what you get.
>
> if you get errors, this can be one of:
>
> 1. the hard disks themselves are faulty.
>
> or
>
> 2. there is some different driver on s8 that is not found on s7 (or the
> other way around) and it was installed not via the RPM system (so in RPM
> you won't see a difference) - which is causing the problems.
>
> --guy
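
One way to look for a driver that differs between the two hosts without going through RPM is to compare the loaded module lists. The sketch below assumes the output of `lsmod` has been saved from each machine. The function names are illustrative.

```python
def loaded_modules(lsmod_output):
    """Return the set of module names listed in saved lsmod output."""
    names = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Module":   # skip blanks and the header row
            continue
        names.add(fields[0])
    return names


def module_diff(lsmod_a, lsmod_b):
    """Return (only_on_a, only_on_b) as sorted lists of module names."""
    mods_a = loaded_modules(lsmod_a)
    mods_b = loaded_modules(lsmod_b)
    return sorted(mods_a - mods_b), sorted(mods_b - mods_a)
```

This compares names only. A module with the same name but a different build on each host would not show up, so checking `modinfo` for any suspect driver would still be needed.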
 


Re: any advice to find root cause of Falling back to HPET ?

2010-05-22 Thread Jefferson Ogata
On 2010-05-22 19:52, Bond Masuda wrote:
> I wish RHEL4 had smartctl that worked with megasas; i'll have to compile the
> latest smartctl to see if SMART data will tell me anything about the drives.
> One thing to note is that during the build of these servers 2 weeks ago,
> one of the drives on s8 did fail and had to be replaced.

Try megactl/megasasctl.
