Re: Interpreting MCA error output

2011-10-02 Thread Thomas Zander
Hello Jeremy,

first, thank you for the extensive explanation. It cleared some things
up for me. I do have some rambling to add, though :-)

On Sat, Oct 1, 2011 at 12:23, Jeremy Chadwick free...@jdc.parodius.com wrote:

 So what should you do?  Replace the RAM.  Which DIMM?  Sadly I don't
 know how to determine that.  Some system BIOSes (particularly on AMD
 systems I've used) let you do memory tests (similar to memtest86) within
 the BIOS which can then tell you which DIMM slot experienced a problem.
 If yours doesn't have that, I would have to say purchase all new RAM
 (yes, all of it) and test the individual DIMMs later so you can
 determine which is bad.

Well, I wasn't too surprised by the panic. I have read somewhere that
in these situations the kernel might simply panic since the system
might be in a compromised state. So far so ... well ... acceptable.

My question here is how can I be certain right now if any of the DIMMs
has gone bad.
You mentioned problems you have all the time with DIMMs due to bad
cooling in data centers. My machine in question is not located in a
data center, that was my home server that tends to have very little
load. But being located in my apartment, there are lots of _potential_
problems, including stability of power. In fact this was the first MCA
event with these DIMMs ever, in more than a year.
But of course you could be right. A DIMM could be rotten. Absolutely.
Regarding your suggestion to do memory tests: My BIOS does not support
testing, so I booted up memtest86+ after reading your e-mail and let
it run for almost a whole day now. It did not encounter a single
problem.
So, even if I bought new DIMMs at once, it might take weeks to figure
out which DIMM is rotten, if at all. Assuming that MCA events stay
this infrequent, that is.
Of course I'll observe the machine closely, but if the rate stays at
one MCA event per year, it'll take some time to figure out the broken
DIMM :-)

 I should really work with John to make mcelog a FreeBSD port and just
 regularly update it with patches, etc. to work on FreeBSD.  DMI support
 and so on I don't think can be added (at least not by me), but simple
 ASCII decoding?  Very possible.

That would be absolutely helpful! After all, FreeBSD is primarily a
server OS, and where would one have ECC if not on servers? Being able
to determine what's wrong with memory would certainly be very valuable
for many admins.

 An alternative would be for me to make a CGI version where you could
 just go to my site and paste in the FreeBSD MCE and it would siphon it
 through mcelog and give you the output.

I could live with that, too :-)

Thanks again for your extensive explanation, I appreciate it very much!
Now I am going to watch that machine closely...

Best regards,
Riggs
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: 7.3 + kqueue + apache/php + DNS lookup problem

2011-10-02 Thread Damien Fleuriot
On 1 October 2011 03:18, Doug Barton do...@freebsd.org wrote:
 It's a php module doing a lookup for the hostname of the back-end mysql
 server.

 Are the delays always 3 seconds?

 Pretty much.

 If so, that almost sounds like a timeout of some kind.

 That was my first thought, but the answer always comes eventually.

 To answer Chuck's questions, no threading is involved, and it's not
 apache doing the lookups.


 Doug


Check your bind/unbound logs to ensure the queries are actually
successful on their first try.

Is your DNS using forwarders ? views ?


Re: 7.3 + kqueue + apache/php + DNS lookup problem

2011-10-02 Thread Jeremy Chadwick
On Sun, Oct 02, 2011 at 10:18:30AM +0200, Damien Fleuriot wrote:
 On 1 October 2011 03:18, Doug Barton do...@freebsd.org wrote:
  It's a php module doing a lookup for the hostname of the back-end mysql
  server.
 
  Are the delays always 3 seconds?
 
  Pretty much.
 
  If so, that almost sounds like a timeout of some kind.
 
  That was my first thought, but the answer always comes eventually.
 
  To answer Chuck's questions, no threading is involved, and it's not
  apache doing the lookups.
 
 
  Doug
 
 
 Check your bind/unbound logs to ensure the queries are actually
 successful on their first try.
 
 Is your DNS using forwarders ? views ?

How would this explain 100% quick/reliable lookups when done from tools
like nslookup and host?  Same box and same resolver (127.0.0.1:53), yet
different behaviour (nslookup/host vs. PHP).
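
One difference worth ruling out: nslookup and host send their queries to the name server themselves, whereas an application such as PHP normally resolves names through the libc resolver (getaddrinfo/gethostbyname). The following is only a sketch for timing that second path from the same box; the function name time_lookup and the default port 3306 are assumptions made for the illustration, not anything taken from Doug's setup.

```python
import socket
import time


def time_lookup(hostname, port=3306):
    """Return (seconds, addresses) for one getaddrinfo() call.

    getaddrinfo() goes through the system (libc) resolver, which is the
    path an application takes, as opposed to nslookup/host.
    The port value is an arbitrary placeholder for this sketch.
    """
    start = time.monotonic()
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    elapsed = time.monotonic() - start
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual address.
    addresses = sorted({info[4][0] for info in infos})
    return elapsed, addresses
```

Calling time_lookup() a few times with the real back-end hostname and comparing the elapsed values against the nslookup/host timings would show whether the delay is specific to the libc resolver path.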

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |



Re: 7.3 + kqueue + apache/php + DNS lookup problem

2011-10-02 Thread Adrian Chadd
Something stupid and ridiculous, like the socket watermark points are
set incorrectly?
It'd also be helpful to see exactly what the knotes were.


Adrian


Re: 7.3 + kqueue + apache/php + DNS lookup problem

2011-10-02 Thread Matthew Seaman
On 01/10/2011 02:18, Doug Barton wrote:
  Does this happen when httpd tries to do DNS resolution for, say, an
  incoming connection to the web server (e.g. trying to resolve the
  incoming IP address of the client to an FQDN), or is it happening within
  some PHP code (assuming PHP is installed/used as an Apache module)
  that's trying to do DNS resolution of some kind?

 It's a php module doing a lookup for the hostname of the back-end mysql
 server.

Hmmm... Is this a function of DNS traffic being via UDP?  Presumably
you're not seeing the same sort of delays when eg. apache connects to
mysql via TCP.

Hard to think of another UDP protocol you could use to test -- SNMP
perhaps?  Or somehow forcing the DNS traffic to go via TCP?  Tricky to
make that happen when the resolver is on localhost.  Of course, since
DNS will only fall back to TCP after trying UDP, that's going to be even
slower overall than your current situation, but the point here is to
examine the truss output for timing details specifically around where
the TCP query is issued.

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.                   7 Priory Courtyard
                                                  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey     Ramsgate
JID: matt...@infracaninophile.co.uk               Kent, CT11 9PW





Re: 7.3 + kqueue + apache/php + DNS lookup problem

2011-10-02 Thread Michiel Boland

On 10/02/2011 12:10, Matthew Seaman wrote:
 On 01/10/2011 02:18, Doug Barton wrote:
   Does this happen when httpd tries to do DNS resolution for, say, an
   incoming connection to the web server (e.g. trying to resolve the
   incoming IP address of the client to an FQDN), or is it happening within
   some PHP code (assuming PHP is installed/used as an Apache module)
   that's trying to do DNS resolution of some kind?

  It's a php module doing a lookup for the hostname of the back-end mysql
  server.

 Hmmm... Is this a function of DNS traffic being via UDP?  Presumably
 you're not seeing the same sort of delays when eg. apache connects to
 mysql via TCP.

 Hard to think of another UDP protocol you could use to test -- SNMP
 perhaps?  Or somehow forcing the DNS traffic to go via TCP?  Tricky to
 make that happen when the resolver is on localhost.  Of course, since
 DNS will only fall back to TCP after trying UDP, that's going to be even
 slower overall than your current situation, but the point here is to
 examine the truss output for timing details specifically around where
 the TCP query is issued.

 Cheers,

 Matthew


What is the exact query issued and what was the response?

I see recvfrom returned 30 bytes in Doug's original mail which seems awfully 
short for a meaningful DNS response.
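
To put a number on "awfully short", here is a rough size calculation based on the standard DNS message layout (12-byte header, then the question, then the answers). It is a sketch under stated assumptions: the helper names are made up, the hostname db.example.com is a placeholder rather than the name from Doug's trace, each answer is assumed to use a 2-byte compression pointer for its name, and the authority and additional sections are assumed to be empty.

```python
import struct


def encode_qname(hostname):
    """Encode a hostname as DNS labels: length byte + label, then a zero byte."""
    out = b""
    for label in hostname.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += struct.pack("B", len(raw)) + raw
    return out + b"\x00"


def response_size(hostname, a_records):
    """Bytes in a response holding only the question and A records.

    Assumes compressed answer names and empty authority/additional sections.
    """
    header = 12                                  # fixed DNS header
    question = len(encode_qname(hostname)) + 4   # QNAME + QTYPE + QCLASS
    answer = 2 + 2 + 2 + 4 + 2 + 4               # name ptr, type, class, TTL, RDLENGTH, IPv4
    return header + question + a_records * answer


# Placeholder hostname: 32 bytes with no answers, 48 bytes with one A record.
print(response_size("db.example.com", 0), response_size("db.example.com", 1))
```

Under these assumptions, 30 bytes leaves 18 bytes after the header. One A record alone needs 16 of them, which leaves too little room for any question section, so a 30-byte reply that echoes the question would have to contain no answers at all.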


Cheers
Michiel


Re: Interpreting MCA error output

2011-10-02 Thread Jeremy Chadwick
On Sun, Oct 02, 2011 at 09:37:43AM +0200, Thomas Zander wrote:
 On Sat, Oct 1, 2011 at 12:23, Jeremy Chadwick free...@jdc.parodius.com wrote:
 
  So what should you do?  Replace the RAM.  Which DIMM?  Sadly I don't
  know how to determine that.  Some system BIOSes (particularly on AMD
  systems I've used) let you do memory tests (similar to memtest86) within
  the BIOS which can then tell you which DIMM slot experienced a problem.
  If yours doesn't have that, I would have to say purchase all new RAM
  (yes, all of it) and test the individual DIMMs later so you can
  determine which is bad.
 
 Well, I wasn't too surprised by the panic. I have read somewhere that
 in these situations the kernel might simply panic since the system
 might be in a compromised state. So far so ... well ... acceptable.

IMO, you absolutely want the system to panic when an MCE arrives that
the kernel does not know how to handle gracefully.

There are some MCEs which can be treated as informational.  A common
example is an MCE that indicates the CPU itself (not system RAM!)
experienced a single-bit ECC L1/L2/L3 parity error (and thus was
correctable).  In the case of Solaris 10, such is reported as
informational.  The kernel in this case also keeps count of how many
times it encounters this MCE for the particular CPU (either core or
physical CPU, depending on if L1/L2/L3 is shared across cores or
dedicated), and if a threshold is reached, it actually takes the CPU
offline.

In the case of FreeBSD (which I do not think has this type of
framework), the administrator has to keep an eye on this type of MCE
over time.  L1/L2/L3 ECC errors are actually normal (think about how
often these caches get used!), but excessive amounts in short periods of
time means it's time to replace the CPU.
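
The "keep count and act on a threshold" idea can be sketched in a few lines. This is not how Solaris or FreeBSD implements it; the class name, the threshold of 5, and the one-hour window are arbitrary values chosen purely for illustration.

```python
from collections import deque


class CorrectableErrorTracker:
    """Count correctable-error events per CPU inside a sliding time window."""

    def __init__(self, threshold=5, window_seconds=3600.0):
        # Both defaults are illustration values, not vendor guidance.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = {}  # cpu id -> deque of event timestamps (seconds)

    def record(self, cpu, timestamp):
        """Record one event; return True once the CPU has reached the threshold."""
        events = self._events.setdefault(cpu, deque())
        events.append(timestamp)
        # Drop events that have fallen out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold
```

An administrator's log-watching script could feed it one call per correctable MCE and raise an alert when record() returns True, so that occasional isolated events stay quiet while bursts stand out.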

Of course, this means that for certain MCEs which are informational
(e.g. recoverable), the kernel might panic until code in the kernel
gets written to handle said MCE gracefully.  This applies to all OSes,
naturally, and gets into a cat-and-mouse game when CPU manufacturers
release a new CPU on the market.

Again, the above example is not your situation, but I wanted to provide
an example of something that can be auto-corrected (meaning harmless)
but requires the SA to keep an eye on the system.  Solaris is quite nice
in this regard; fmd (Fault Manager Daemon) and its related framework are
really great for this stuff (look up fmd, fmadm, or fmdump online).
More on a confusing MCE momentarily (with an example on Solaris 10).

 My question here is how can I be certain right now if any of the DIMMs
 has gone bad.

You can be absolutely 100% certain.  The MCE is not a guess at what's
going on -- literally the hardware reported to the system (either via
NMI or SMI (probably the latter)) the situation.  The MCE really did
happen; it's not fake.

What you *can't* be certain of is that, if you were to run, say,
memtest86 or memtest86+, you'd see some errors after an hour or two.

So what I'm trying to say is: you definitely have a DIMM that is either
downright bad, or at bare minimum, flaky to the point where it's
suffering from uncorrectable multi-bit errors.  When you will see that
happen is unknown to me, but it's more likely you'll see the situation
happen if you let memtest86/memtest86+ run for a while.

Be aware that in memtest86 (not sure about memtest86+ but probably the
same) you may have to adjust the Error Report Mode to show you things
like ECC corrections when they happen.  I *think* by default they're
disabled, I'm not sure.  Search for ECC here:

http://www.memtest86.com/tech.html

 You mentioned problems you have all the time with DIMMs due to bad
 cooling in data centers. My machine in question is not located in a
 data center, that was my home server that tends to have very little
 load. But being located in my apartment, there are lots of _potential_
 problems, including stability of power. In fact this was the first MCA
 event with these DIMMs ever, in more than a year.

Understood.  Let me try to explain what I was getting at:

In actual production datacenters at my workplace we see MCEs which are
indicative of thermal problems with our DCs, and I'd say ~90% of the
time engineers decode these MCEs incorrectly (meaning their reaction is
incorrect for the situation).  Here's an example of one (again, taken
from Solaris 10, with some data XXX'd out given its sensitive nature):


# fmadm faulty
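--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------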
Sep 21 21:02:33 e1975284-e77c-6c00-d1be-a2e640b12f4a  INTEL-8001-3S  Major

Host: 
Platform: S5000PSL  Chassis_id  : 
Product_sn  :

Fault class : fault.memory.intel.fbd.otf
Problem in  : MB