Re: Interpreting MCA error output
Hello Jeremy,

first, thank you for the extensive explanation. It cleared some things up for me. I do have some rambling to add, though :-)

On Sat, Oct 1, 2011 at 12:23, Jeremy Chadwick free...@jdc.parodius.com wrote:

> So what should you do? Replace the RAM. Which DIMM? Sadly I don't know how to determine that. Some system BIOSes (particularly on AMD systems I've used) let you do memory tests (similar to memtest86) within the BIOS which can then tell you which DIMM slot experienced a problem. If yours doesn't have that, I would have to say purchase all new RAM (yes, all of it) and test the individual DIMMs later so you can determine which is bad.

Well, I wasn't too surprised by the panic. I have read somewhere that in these situations the kernel might simply panic since the system might be in a compromised state. So far so ... well ... acceptable.

My question here is how can I be certain right now if any of the DIMMs has gone bad. You mentioned problems you have all the time with DIMMs due to bad cooling in data centers. My machine in question is not located in a data center; it is my home server, which tends to have very little load. But being located in my apartment, there are lots of _potential_ problems, including stability of power. In fact this was the first MCA event with these DIMMs ever, in more than a year.

But of course you could be right. A DIMM could be rotten. Absolutely.

Regarding your suggestion to do memory tests: My BIOS does not support testing, so I booted up memtest86+ after reading your e-mail and let it run for almost a whole day now. It did not encounter a single problem. So, even if I bought new DIMMs at once, it might take weeks to figure out which DIMM is rotten, if at all. Assuming that MCA events stay this infrequent, that is. Of course I'll observe the machine closely, but if the rate stays at one MCA event per year, it'll take some time to figure out the broken DIMM :-)

> I should really work with John to make mcelog a FreeBSD port and just regularly update it with patches, etc. to work on FreeBSD. DMI support and so on I don't think can be added (at least not by me), but simple ASCII decoding? Very possible.

That would be absolutely helpful! After all, FreeBSD is primarily a server OS, and where would one have ECC if not on servers? Being able to determine what's wrong with memory would certainly be very valuable for many admins.

> An alternative would be for me to make a CGI version where you could just go to my site and paste in the FreeBSD MCE and it would siphon it through mcelog and give you the output.

I could live with that, too :-)

Thanks again for your extensive explanation, I appreciate it very much! Now I am going to watch that machine closely...

Best regards,
Riggs

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
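
As a rough illustration of the "simple ASCII decoding" idea discussed above, the architecturally defined bits of an x86 machine-check status register (IA32_MCi_STATUS) can be split out in a few lines. This is only a sketch under assumptions: the function name and the sample value are made up for illustration, the code covers just the generic flag bits and the two 16-bit error code fields, and it does not attempt the model-specific decoding that mcelog performs.

```python
# Hypothetical helper for illustration; not part of FreeBSD or mcelog.
# Bit positions follow the architectural layout of IA32_MCi_STATUS.
STATUS_FLAGS = [
    (63, "VAL"),    # register contains valid information
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MISC register holds additional information
    (58, "ADDRV"),  # ADDR register holds the faulting address
    (57, "PCC"),    # processor context may be corrupt
]


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flag names and error codes."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    # Made-up sample value, not taken from the MCE discussed in this thread.
    print(decode_mci_status(0xBE00000000800400))
```

The UC and PCC flags are the ones that separate an informational, corrected event from one where a panic is the safe reaction.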
Re: 7.3 + kqueue + apache/php + DNS lookup problem
On 1 October 2011 03:18, Doug Barton do...@freebsd.org wrote:

> It's a php module doing a lookup for the hostname of the back-end mysql server.
>
>> Are the delays always 3 seconds?
>
> Pretty much.
>
>> If so, that almost sounds like a timeout of some kind.
>
> That was my first thought, but the answer always comes eventually.
>
> To answer Chuck's questions, no threading is involved, and it's not apache doing the lookups.
>
> Doug

Check your bind/unbound logs to ensure the queries are actually successful on their first try.

Is your DNS using forwarders? views?
Re: 7.3 + kqueue + apache/php + DNS lookup problem
On Sun, Oct 02, 2011 at 10:18:30AM +0200, Damien Fleuriot wrote:

> On 1 October 2011 03:18, Doug Barton do...@freebsd.org wrote:
>
>> It's a php module doing a lookup for the hostname of the back-end mysql server.
>>
>>> Are the delays always 3 seconds?
>>
>> Pretty much.
>>
>>> If so, that almost sounds like a timeout of some kind.
>>
>> That was my first thought, but the answer always comes eventually.
>>
>> To answer Chuck's questions, no threading is involved, and it's not apache doing the lookups.
>>
>> Doug
>
> Check your bind/unbound logs to ensure the queries are actually successful on their first try.
>
> Is your DNS using forwarders? views?

How would this explain 100% quick/reliable lookups when done from tools like nslookup and host? Same box and same resolver (127.0.0.1:53), yet different behaviour (nslookup/host vs. PHP).

-- 
| Jeremy Chadwick                jdc at parodius.com |
| Parodius Networking       http://www.parodius.com/ |
| UNIX Systems Administrator   Mountain View, CA, US |
| Making life hard for others since 1977.   PGP 4BD6C0CB |
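
One way to chase the difference described above is to time lookups that go through the C library resolver rather than through nslookup or host, which send their queries to the name server themselves. The sketch below does that with Python's socket.getaddrinfo, which calls the libc getaddrinfo(). The function name and the default port 3306 are assumptions for illustration; whether the PHP module in question takes the same code path is not established in this thread.

```python
import socket
import time


def timed_lookup(host, port=3306):
    """Resolve host through the libc resolver and report how long it took."""
    start = time.monotonic()
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    elapsed = time.monotonic() - start
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual address.
    addresses = sorted({info[4][0] for info in infos})
    return elapsed, addresses


if __name__ == "__main__":
    # "localhost" is a placeholder; substitute the back-end database hostname.
    for attempt in range(5):
        elapsed, addresses = timed_lookup("localhost")
        print(f"attempt {attempt}: {elapsed:.3f}s {addresses}")
```

If this shows the same delay while nslookup and host stay fast, the name server itself is less likely to be the culprit than the resolver path or its configuration.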
Re: 7.3 + kqueue + apache/php + DNS lookup problem
Something stupid and ridiculous, like the socket watermark points are set incorrectly?

It'd also be helpful to see exactly what the knotes were.

Adrian
Re: 7.3 + kqueue + apache/php + DNS lookup problem
On 01/10/2011 02:18, Doug Barton wrote:

>> Does this happen when httpd tries to do DNS resolution for, say, an incoming connection to the web server (e.g. trying to resolve the incoming IP address of the client to an FQDN), or is it happening within some PHP code (assuming PHP is installed/used as an Apache module) that's trying to do DNS resolution of some kind?
>
> It's a php module doing a lookup for the hostname of the back-end mysql server.

Hmmm... Is this a function of DNS traffic being via UDP? Presumably you're not seeing the same sort of delays when e.g. apache connects to mysql via TCP. Hard to think of another UDP protocol you could use to test -- SNMP perhaps?

Or somehow forcing the DNS traffic to go via TCP? Tricky to make that happen when the resolver is on localhost. Of course, since DNS will only fall back to TCP after trying UDP, that's going to be even slower overall than your current situation, but the point here is to examine the truss output for timing details specifically around where the TCP query is issued.

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.                   7 Priory Courtyard
                                                  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey     Ramsgate
JID: matt...@infracaninophile.co.uk               Kent, CT11 9PW
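
For a test that bypasses the UDP step entirely, a query can be sent straight over TCP by hand. The sketch below builds a minimal A-record query and adds the two-byte length prefix that DNS over TCP requires (RFC 1035, section 4.2.2). All function names, the fixed query ID and the example hostname are assumptions for illustration; this exercises the name server directly and does not change how the PHP module resolves names.

```python
import socket
import struct


def build_query(name, query_id=0x1234, qtype=1):
    """Build a minimal DNS query (default: type A, class IN, recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels)
    return header + qname + b"\x00" + struct.pack("!HH", qtype, 1)


def frame_for_tcp(message):
    """Prefix the message with its length as a 16-bit big-endian integer."""
    return struct.pack("!H", len(message)) + message


def read_exact(sock, count):
    """Read exactly count bytes or raise if the peer closes early."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("server closed the connection early")
        data += chunk
    return data


def query_over_tcp(name, server="127.0.0.1", port=53, timeout=5.0):
    """Send one query over TCP and return the raw response message."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(frame_for_tcp(build_query(name)))
        (length,) = struct.unpack("!H", read_exact(sock, 2))
        return read_exact(sock, length)


if __name__ == "__main__":
    # "db.example.com" is a placeholder for the real back-end hostname.
    print(len(query_over_tcp("db.example.com")), "bytes received")
```

Timing query_over_tcp against an equivalent UDP lookup would show whether the delay follows the transport.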
Re: 7.3 + kqueue + apache/php + DNS lookup problem
On 10/02/2011 12:10, Matthew Seaman wrote:

> On 01/10/2011 02:18, Doug Barton wrote:
>
>>> Does this happen when httpd tries to do DNS resolution for, say, an incoming connection to the web server (e.g. trying to resolve the incoming IP address of the client to an FQDN), or is it happening within some PHP code (assuming PHP is installed/used as an Apache module) that's trying to do DNS resolution of some kind?
>>
>> It's a php module doing a lookup for the hostname of the back-end mysql server.
>
> Hmmm... Is this a function of DNS traffic being via UDP? Presumably you're not seeing the same sort of delays when e.g. apache connects to mysql via TCP. Hard to think of another UDP protocol you could use to test -- SNMP perhaps?
>
> Or somehow forcing the DNS traffic to go via TCP? Tricky to make that happen when the resolver is on localhost. Of course, since DNS will only fall back to TCP after trying UDP, that's going to be even slower overall than your current situation, but the point here is to examine the truss output for timing details specifically around where the TCP query is issued.
>
> Cheers,
>
> Matthew

What is the exact query issued and what was the response? I see recvfrom returned 30 bytes in Doug's original mail, which seems awfully short for a meaningful DNS response.

Cheers

Michiel
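
To check what a short reply actually contains, the fixed 12-byte DNS header already says a lot: the response code, the truncation flag and the number of answer records. The sketch below unpacks those fields from a raw packet. The function name and the sample bytes are made up for illustration and are not taken from Doug's truss output.

```python
import struct


def parse_dns_header(packet):
    """Decode the fixed 12-byte header of a DNS message (RFC 1035, section 4.1.1)."""
    if len(packet) < 12:
        raise ValueError("a DNS message is at least 12 bytes long")
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!HHHHHH", packet[:12]
    )
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),  # QR bit
        "truncated": bool(flags & 0x0200),    # TC bit: client should retry over TCP
        "rcode": flags & 0x000F,              # 0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN
        "questions": qdcount,
        "answers": ancount,
        "authority": nscount,
        "additional": arcount,
    }


if __name__ == "__main__":
    # Made-up header: a response with RD and RA set, rcode 3, one question, no answers.
    sample = struct.pack("!HHHHHH", 0x1234, 0x8183, 1, 0, 0, 0)
    print(parse_dns_header(sample))
```

A reply with zero answers, a non-zero rcode or the TC bit set would each point to a different cause for the retry.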
Re: Interpreting MCA error output
On Sun, Oct 02, 2011 at 09:37:43AM +0200, Thomas Zander wrote:

> On Sat, Oct 1, 2011 at 12:23, Jeremy Chadwick free...@jdc.parodius.com wrote:
>
>> So what should you do? Replace the RAM. Which DIMM? Sadly I don't know how to determine that. Some system BIOSes (particularly on AMD systems I've used) let you do memory tests (similar to memtest86) within the BIOS which can then tell you which DIMM slot experienced a problem. If yours doesn't have that, I would have to say purchase all new RAM (yes, all of it) and test the individual DIMMs later so you can determine which is bad.
>
> Well, I wasn't too surprised by the panic. I have read somewhere that in these situations the kernel might simply panic since the system might be in a compromised state. So far so ... well ... acceptable.

IMO, you absolutely want the system to panic when an MCE arrives which the kernel does not know how to handle gracefully.

There are some MCEs which can be treated as informational. A common example is an MCE that indicates the CPU itself (not system RAM!) experienced a single-bit ECC L1/L2/L3 parity error (and thus was correctable). In the case of Solaris 10, such is reported as informational. The kernel in this case also keeps count of how many times it encounters this MCE for the particular CPU (either core or physical CPU, depending on if L1/L2/L3 is shared across cores or dedicated), and if a threshold is reached, it actually takes the CPU offline.

In the case of FreeBSD (which I do not think has this type of framework), the administrator has to keep an eye on this type of MCE over time. L1/L2/L3 ECC errors are actually normal (think about how often these caches get used!), but excessive amounts in short periods of time mean it's time to replace the CPU.

Of course, this means that for certain MCEs which are informational (e.g. recoverable), the kernel might panic until code in the kernel gets written to handle said MCE gracefully.
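
The count-and-threshold approach described above can be sketched as a small sliding-window counter per CPU. This is purely an illustration: the class name, the threshold and the window length are invented, and it does not reproduce how Solaris actually tracks these events.

```python
from collections import defaultdict, deque


class CorrectedErrorTracker:
    """Count corrected errors per CPU inside a sliding time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # cpu id -> timestamps of recent events

    def record(self, cpu, timestamp):
        """Record one corrected error; return True once the CPU looks suspect."""
        events = self._events[cpu]
        events.append(timestamp)
        # Drop events that have fallen out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


if __name__ == "__main__":
    # Hypothetical policy: 3 corrected errors within 60 seconds on one CPU.
    tracker = CorrectedErrorTracker(threshold=3, window_seconds=60)
    for second in (0, 10, 20):
        print(second, tracker.record(cpu=0, timestamp=second))
```

An administrator without such a framework would be doing the same bookkeeping by hand from the logs.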
This applies to all OSes, naturally, and gets into a cat-and-mouse game when CPU manufacturers release a new CPU on the market.

Again, the above example is not your situation, but I wanted to provide an example of something that can be auto-corrected (meaning harmless) but requires the SA to keep an eye on the system. Solaris is quite nice in this regard; fmd (Fault Manager Daemon) and its related framework are really great for this stuff (look up fmd, fmadm, or fmdump online).

More on a confusing MCE momentarily (with an example on Solaris 10).

> My question here is how can I be certain right now if any of the DIMMs has gone bad.

You can be absolutely 100% certain. The MCE is not a guess at what's going on -- literally the hardware reported to the system (either via NMI or SMI (probably the latter)) the situation. The MCE really did happen; it's not fake.

What you *can't* be certain of is that if you were to run, say, memtest86 or memtest86+, that after an hour or two you'd see some errors.

So what I'm trying to say is: you definitely have a DIMM that is either downright bad, or at bare minimum, flaky to the point where it's suffering from uncorrectable multi-bit errors. When you will see that happen is unknown to me, but it's more likely you'll see the situation happen if you let memtest86/memtest86+ run for a while.

Be aware that in memtest86 (not sure about memtest86+ but probably the same) you may have to adjust the Error Report Mode to show you things like ECC corrections when they happen. I *think* by default they're disabled, I'm not sure. Search for ECC here:

http://www.memtest86.com/tech.html

> You mentioned problems you have all the time with DIMMs due to bad cooling in data centers. My machine in question is not located in a data center, that was my home server that tends to have very little load. But being located in my apartment, there are lots of _potential_ problems, including stability of power.
> In fact this was the first MCA event with these DIMMs ever, in more than a year.

Understood. Let me try to explain what I was getting at:

In actual production datacenters at my workplace we see MCEs which are indicative of thermal problems with our DCs, and I'd say ~90% of the time engineers decode these MCEs incorrectly (meaning their reaction is incorrect for the situation).

Here's an example of one (again, taken from Solaris 10, with some data XXX'd out given its sensitive nature):

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 21 21:02:33 e1975284-e77c-6c00-d1be-a2e640b12f4a  INTEL-8001-3S  Major

Host        :
Platform    : S5000PSL      Chassis_id  :
Product_sn  :

Fault class : fault.memory.intel.fbd.otf
Problem in  : MB