how to get more logging from GEOM?
About 10 days ago one of my personal machines started hanging at random. This is the first bit of instability I've ever experienced on this machine (2+ years running).

FreeBSD triceratops.netconsonance.com 6.2-RELEASE-p11 FreeBSD 6.2-RELEASE-p11 #0: Wed Feb 13 06:44:57 UTC 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC i386

After about 2 weeks of watching it carefully I've learned almost nothing. It's not a disk failure (AFAIK), it's not CPU overheating (now running healthd without complaints), and it's not tied to any particular network traffic. However, it does appear to accompany heavy CPU/disk activity. It usually dies when indexing my websites at night (but not always) and it sometimes dies when compiling programs. Heavy disk activity alone isn't enough to do the job, as backups proceed without problems. Heavy CPU by itself isn't enough either. But if I start compiling things and keep going a while, it will eventually hang.

My best guess is that GEOM is having a problem and locking up. There's no log entry before the failure to back this idea up, but I think this because during boot I see the following:

ad0: 286168MB at ata0-master UDMA100
GEOM_MIRROR: Device gm0 created (id=575427344).
GEOM_MIRROR: Device gm0: provider ad0 detected.
ad1: 286168MB at ata0-slave UDMA100
GEOM_MIRROR: Device gm0: provider ad1 detected.
GEOM_MIRROR: Device gm0: provider ad1 activated.
GEOM_MIRROR: Device gm0: provider mirror/gm0 launched.
GEOM_MIRROR: Device gm0: rebuilding provider ad0.

Every time it is rebuilding ad0. Every single boot in the last two weeks.

Is there any way to get more logging from GEOM, to confirm or deny this theory?

Is there anything else I should be looking at?

FWIW, this never happened before the p11 patch to 6.2. I don't know if that is related or not.

Obviously, I can't upgrade to 6.3 if heavy CPU/disk activity kills the system.

No, I don't have any other insights. I'm not prone to posting "duh help me please!" posts, so I'm quite a bit frustrated by this one.

-- Jo Rhett Net Consonance : consonant endings by net philanthropy, open source and other randomness ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: how to get more logging from GEOM?
Jo Rhett wrote:
> [...]
> Every time it is rebuilding ad0. Every single boot in the last two
> weeks.
>
> Is there any way to get more logging from GEOM, to confirm or deny
> this theory?

Just a guess, but try setting kern.geom.debugflags to a value greater than 0. This certainly spews out far more GEOM info; how helpful it will be is another matter...

Vince
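
A minimal sketch of how Vince's suggestion might be tried. It assumes the trace bits described in geom(4) (0x01 topology events, 0x02 I/O requests, 0x04 access checks); the helper name geom_debugflags is made up for illustration, and kern.geom.mirror.debug is, as far as I know, gmirror's own verbosity sysctl rather than anything mentioned in this thread. Check the man pages on your release before relying on either.

```shell
#!/bin/sh
# Compose a kern.geom.debugflags value from symbolic trace-class names.
# Bit values assumed from geom(4): topology=0x01, bio=0x02, access=0x04.
geom_debugflags() {
    flags=0
    for name in "$@"; do
        case "$name" in
            topology) flags=$((flags | 1)) ;;
            bio)      flags=$((flags | 2)) ;;
            access)   flags=$((flags | 4)) ;;
            *) echo "unknown trace class: $name" >&2; return 1 ;;
        esac
    done
    echo "$flags"
}

# Usage on the affected host, as root (not executed here):
#   sysctl kern.geom.debugflags="$(geom_debugflags topology access)"
#   sysctl kern.geom.mirror.debug=2   # assumed gmirror verbosity knob
#   sysctl kern.geom.debugflags=0     # switch tracing off again
```

The bio class is left out of the usage line on purpose: tracing every I/O request is likely to flood the console on a busy mirror.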
Re: how to get more logging from GEOM?
On Fri, 11 Jul 2008 09:59:33 +0200, Jo Rhett <[EMAIL PROTECTED]> wrote:
> About 10 days ago one of my personal machines started hanging at
> random.
> [...]
> Is there anything else I should be looking at?

You can try going into the kernel debugger to see where it is hanging. Debugging via a serial cable is also very easy. I don't know the details, but there is a lot of info in the FreeBSD Handbook. Put this into Google: 'freebsd handbook kernel debug'.

Ronald.
Re: how to get more logging from GEOM?
On Fri, Jul 11, 2008 at 12:59:33AM -0700, Jo Rhett wrote:
> About 10 days ago one of my personal machines started hanging at
> random. This is the first bit of instability I've ever experienced on
> this machine (2+ years running)
>
> FreeBSD triceratops.netconsonance.com 6.2-RELEASE-p11 FreeBSD
> 6.2-RELEASE-p11 #0: Wed Feb 13 06:44:57 UTC 2008
> [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC i386
>
> After about 2 weeks of watching it carefully I've learned almost
> nothing. It's not a disk failure (AFAIK) it's not cpu overheat (now
> running healthd without complaints) it's not based on any given
> network traffic... however it does appear to accompany heavy cpu/disk
> activity. It usually dies when indexing my websites at night (but not
> always) and it sometimes dies when compiling programs. Just heavy
> disk isn't enough to do the job, as backups proceed without
> problems. Heavy cpu by itself isn't enough to do it either. But if
> I start compiling things and keep going a while, it will eventually
> hang.

> Is there anything else I should be looking at?

Power supply or motherboard would be my first guess.

Roland
--
R.F.Smith http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914 B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)
Re: how to get more logging from GEOM?
On Fri, Jul 11, 2008 at 12:59:33AM -0700, Jo Rhett wrote:
> My best guess is that geom is having a problem and locking up.
> There's no log entry before failure to back this idea up, but I think
> this because during boot I see the following:
>
> ad0: 286168MB at ata0-master UDMA100
> GEOM_MIRROR: Device gm0 created (id=575427344).
> GEOM_MIRROR: Device gm0: provider ad0 detected.
> ad1: 286168MB at ata0-slave UDMA100
> GEOM_MIRROR: Device gm0: provider ad1 detected.
> GEOM_MIRROR: Device gm0: provider ad1 activated.
> GEOM_MIRROR: Device gm0: provider mirror/gm0 launched.
> GEOM_MIRROR: Device gm0: rebuilding provider ad0.
>
> Every time it is rebuilding ad0. Every single boot in the last two
> weeks.

That just means that it halted without a proper shutdown. If the machine crashes, the mirror isn't stopped properly, so it's marked dirty and must be rebuilt on the next boot. It is exactly analogous to finding all the file systems dirty on boot and fsck'ing them after a crash.

-- Clifton

--
Clifton Royston -- [EMAIL PROTECTED] / [EMAIL PROTECTED]
President - I and I Computing * http://www.iandicomputing.com/
Custom programming, network design, systems and network consulting services
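
One way to check Clifton's explanation against the logs is to count how often each provider was rebuilt and compare that with the number of unclean reboots. This is only a sketch: the function name count_rebuilds is hypothetical, and it assumes the kernel messages look exactly like the GEOM_MIRROR lines quoted above.

```shell
#!/bin/sh
# Count gmirror rebuild events per provider in kernel log text on stdin.
# Prints one "<count> <provider>" line per provider that was rebuilt.
count_rebuilds() {
    sed -n 's/.*GEOM_MIRROR: Device [^:]*: rebuilding provider \([^ .]*\)\..*/\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}

# Usage on the affected host (not executed here):
#   count_rebuilds < /var/log/messages
#   gmirror status        # current state of the mirror
```

If the rebuild count matches the number of hangs, the rebuilds are a consequence of the crashes rather than a clue to their cause.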
Re: how to get more logging from GEOM?
Jo Rhett <[EMAIL PROTECTED]> wrote:
> About 10 days ago one of my personal machines started hanging at
> random. This is the first bit of instability I've ever experienced on
> this machine (2+ years running)
>
> FreeBSD triceratops.netconsonance.com 6.2-RELEASE-p11 FreeBSD
> 6.2-RELEASE-p11 #0: Wed Feb 13 06:44:57 UTC 2008
> [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC i386
>
> After about 2 weeks of watching it carefully I've learned almost
> nothing. It's not a disk failure (AFAIK) it's not cpu overheat (now
> running healthd without complaints) it's not based on any given
> network traffic... however it does appear to accompany heavy cpu/disk
> activity. It usually dies when indexing my websites at night (but not
> always) and it sometimes dies when compiling programs. Just heavy
> disk isn't enough to do the job, as backups proceed without
> problems. Heavy cpu by itself isn't enough to do it either. But if
> I start compiling things and keep going a while, it will eventually
> hang.

I had exactly the same problems on a machine a few months ago. It had also been running for about two years, then started freezing when there was high CPU + disk activity.

It turned out that the power supply had gone weak (either the power supply itself or the voltage regulators on the mainboard). Replacing PS + mainboard solved the problem.

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Commercial register: Munich registry court, HRA 74606; management: secnetix Verwaltungsgesellsch. mbH, commercial register: Munich registry court, HRB 125758; managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart
FreeBSD services, products and more: http://www.secnetix.de/bsd
"C++ is the only current language making COBOL look good." -- Bertrand Meyer
Re: how to get more logging from GEOM?
On Jul 11, 2008, at 4:48 AM, Ronald Klop wrote:
> You can try going into the kernel debugger to see where it is hanging.
> Debugging via a serial cable is also very easy.
> I don't know the details, but there is a lot of info in the Freebsd
> handbook. Put this in google 'freebsd handbook kernel debug'.

Thanks for the reply. I'm familiar with these options, but the system is currently running GENERIC, and trying to compile a kernel on it would be guaranteed to trigger the problem... I could probably keep hacking at it until I finally get everything compiled, but...

Ugh. I guess this option doesn't appeal very much. Are there any other options available?

--
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness
Re: how to get more logging from GEOM?
On Jul 11, 2008, at 8:58 AM, Roland Smith wrote:
>> After about 2 weeks of watching it carefully I've learned almost
>> nothing. It's not a disk failure (AFAIK) it's not cpu overheat (now
>> running healthd without complaints) it's not based on any given
>> network traffic... however it does appear to accompany heavy cpu/disk
>> activity. It usually dies when indexing my websites at night (but not
>> always) and it sometimes dies when compiling programs. Just heavy
>> disk isn't enough to do the job, as backups proceed without
>> problems. Heavy cpu by itself isn't enough to do it either. But if
>> I start compiling things and keep going a while, it will eventually
>> hang.
>
>> Is there anything else I should be looking at?
>
> Power supply or motherboard would be my first guess.

If the system went offline, I'd agree. But it's clearly a kernel deadlock, since the system remains pingable, answers TCP connections, etc. etc., but doesn't respond. No TCP negotiation, no response on the console, etc. It's higher-level activity which isn't working...

--
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness
Re: how to get more logging from GEOM?
On Fri, Jul 11, 2008 at 12:59:33AM -0700, Jo Rhett wrote:
>> Every time it is rebuilding ad0. Every single boot in the last two
>> weeks.

On Jul 11, 2008, at 9:49 AM, Clifton Royston wrote:
> That just means that it halted without a proper shutdown. If it
> crashes, the mirror isn't stopped properly, so it's marked dirty, so it
> must rebuild it. It is the precise analogy of finding all the file
> systems dirty on boot and fscking them, following a crash.

Thanks for the clarification. Dang, I hoped I was on to something.

--
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness
Re: how to get more logging from GEOM?
On Wed, Jul 16, 2008 at 02:41:28PM -0700, Jo Rhett wrote:
> On Jul 11, 2008, at 8:58 AM, Roland Smith wrote:
> >> After about 2 weeks of watching it carefully I've learned almost
> >> nothing. It's not a disk failure (AFAIK) it's not cpu overheat (now
> >> running healthd without complaints) it's not based on any given
> >> network traffic... however it does appear to accompany heavy
> >> cpu/disk activity. It usually dies when indexing my websites at
> >> night (but not always) and it sometimes dies when compiling
> >> programs. Just heavy disk isn't enough to do the job, as backups
> >> proceed without problems. Heavy cpu by itself isn't enough to do it
> >> either. But if I start compiling things and keep going a while, it
> >> will eventually hang.
> >
> >> Is there anything else I should be looking at?
> >
> > Power supply or motherboard would be my first guess.
>
> If the system went offline, I agree. But it's clearly a kernel
> deadlock, since the system remains pingable, answers TCP connections,
> etc. etc. but doesn't respond.

Ah. Well, you did say the system 'dies', not 'becomes unresponsive'.

> No TCP negotiation, no response on
> the console, etc. It's higher level activity which isn't working...

Try compiling a kernel with debugging options, e.g. WITNESS(4), MUTEX_DEBUG, LOCK_PROFILING, DIAGNOSTIC and INVARIANTS. See /usr/src/sys/conf/NOTES. This will create a lot of messages in the dmesg output.

If you can hook the system up to another machine via serial console, you might be able to debug the kernel. Read the kernel debugging chapter in the Developers' Handbook.

Another tip is to create a cron job that makes log entries every couple of minutes with logger. This might help you pinpoint the exact time of the mishap, so you can correlate it with other system activity.

Be _really_ sure that it isn't hardware though. Otherwise you'll be led on a merry goose chase looking for software errors that aren't there. If you can restore a backup of this machine's software to a similar one, do so and see if the hangs persist. If they don't, it's hardware.

Roland
--
R.F.Smith http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914 B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)
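
Roland's cron-plus-logger tip could look roughly like the sketch below. The script name heartbeat.sh, its install path, the "heartbeat" tag and the choice to record vm.loadavg are all assumptions made for illustration; the thread itself only says to log something every couple of minutes.

```shell
#!/bin/sh
# heartbeat.sh (hypothetical name): leave a regular marker in syslog so that
# the last entry before a hang brackets the time of the failure.

# Format the log message; $1 is a load-average string.
heartbeat_message() {
    printf 'heartbeat: load=%s\n' "$1"
}

# Only write to syslog when explicitly asked, e.g. from root's crontab:
#   */2 * * * * /usr/local/sbin/heartbeat.sh run
if [ "${1:-}" = "run" ]; then
    logger -t heartbeat "$(heartbeat_message "$(sysctl -n vm.loadavg)")"
fi
```

After a hang, the timestamp of the last heartbeat line in /var/log/messages narrows the failure to a two-minute window, which can then be compared with cron jobs, backups or compile runs that were active at the time.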
Re: how to get more logging from GEOM?
On Wed, Jul 16, 2008 at 5:40 PM, Jo Rhett <[EMAIL PROTECTED]> wrote:
> On Jul 11, 2008, at 4:48 AM, Ronald Klop wrote:
>>
>> You can try going into the kernel debugger to see where it is hanging.
>> Debugging via a serial cable is also very easy.
>> I don't know the details, but there is a lot of info in the Freebsd
>> handbook. Put this in google 'freebsd handbook kernel debug'.
>
> Thanks for the reply. I'm familiar with these options, but as the
> system is currently running GENERIC and trying to compile a kernel
> would guarantee to cause the problem to occur... I could probably keep
> hacking at it until I finally get everything compiled, but...
>
> Ugh. I guess this option doesn't appeal very much. Are there any other
> options available?

You don't need to compile the kernel on the same machine that you use it on -- you can copy the compiled kernel into /boot/kernel.new.

-Ben Kaduk
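
A sketch of what Ben describes, combined with the debugging options suggested earlier in the thread. Assumptions to note: the config name DEBUG, the staging directory and the host name "target" are made up; the option list is a commonly used subset, since option names vary between releases (check /usr/src/sys/conf/NOTES); and the build host must have sources matching the target's release. If config(8) on your release lacks the include directive, copy GENERIC and append the options instead.

```shell
#!/bin/sh
# Emit a kernel config that layers debugging options on top of GENERIC.
debug_kernconf() {
    cat <<'EOF'
include GENERIC
ident DEBUG
options KDB
options DDB
options INVARIANTS
options INVARIANT_SUPPORT
options WITNESS
options WITNESS_SKIPSPIN
options DIAGNOSTIC
EOF
}

# On the build host (not executed here; paths and names are assumptions):
#   debug_kernconf > /usr/src/sys/i386/conf/DEBUG
#   cd /usr/src && make buildkernel KERNCONF=DEBUG
#   make installkernel KERNCONF=DEBUG DESTDIR=/var/tmp/stage
#   scp -r /var/tmp/stage/boot/kernel target:/boot/kernel.new
# On the target, boot the new kernel once and fall back afterwards:
#   nextboot -k kernel.new
```

Using nextboot rather than replacing /boot/kernel means a bad debug kernel costs one reboot, not a rescue session; note that scp -r nests the directory if /boot/kernel.new already exists on the target.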
Re: how to get more logging from GEOM?
-- Original message --
From: "Ben Kaduk" <[EMAIL PROTECTED]>
> On Wed, Jul 16, 2008 at 5:40 PM, Jo Rhett <[EMAIL PROTECTED]> wrote:
> > [...]
> > Ugh. I guess this option doesn't appeal very much. Are there any
> > other options available?
>
> You don't need to compile the kernel on the same machine that you use
> it on -- you can copy the compiled kernel into /boot/kernel.new

But how do you handle the issue of differences in contents on the board where you don't have exactly identical hardware?

SJK
www.sulima.com
Re: how to get more logging from GEOM?
On Thu, Jul 17, 2008 at 7:11 AM, <[EMAIL PROTECTED]> wrote:
> -- Original message --
> From: "Ben Kaduk" <[EMAIL PROTECTED]>
>>
>> You don't need to compile the kernel on the same machine that you use
>> it on -- you can copy the compiled kernel into /boot/kernel.new
>
> But how do you handle the issue of differences in contents on the
> board where you don't have exactly identical hardware?

The kernel configuration file specifies which device drivers will be included in the compiled kernel; if those devices aren't present in the system, the relevant code is present but doesn't get used. For example, the GENERIC kernel includes the majority of device drivers, so that most devices will be recognized out of the box.

A more difficult problem to solve is compiling a kernel for a different architecture; say, compiling a kernel for x86 on an amd64 build machine. This can still be done, but it requires a fair amount more work.

-Ben Kaduk
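
For the cross-architecture case Ben mentions, the usual route is the TARGET/TARGET_ARCH variables described in build(7) together with the kernel-toolchain target. The sketch below only prints the commands so they can be reviewed first; the function name cross_kernel_cmds and the config name DEBUG are hypothetical, and whether both variables are needed depends on the release.

```shell
#!/bin/sh
# Print the make invocations for building a kernel for another architecture.
# $1 = target architecture (e.g. i386), $2 = kernel config name.
cross_kernel_cmds() {
    arch=$1
    conf=$2
    # Build the cross compiler and tools first, then the kernel itself.
    echo "make kernel-toolchain TARGET=$arch TARGET_ARCH=$arch"
    echo "make buildkernel TARGET=$arch TARGET_ARCH=$arch KERNCONF=$conf"
}

# Usage on an amd64 build host (not executed here):
#   cd /usr/src && cross_kernel_cmds i386 DEBUG
# then run the printed commands once they look right.
```

This is the "fair amount more work": the toolchain step has to finish before buildkernel can produce an i386 kernel on an amd64 host.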
Re: how to get more logging from GEOM?
On Wed, Jul 16, 2008 at 2:42 PM, Jo Rhett <[EMAIL PROTECTED]> wrote:
>> On Fri, Jul 11, 2008 at 12:59:33AM -0700, Jo Rhett wrote:
>>>
>>> Every time it is rebuilding ad0. Every single boot in the last two
>>> weeks.
>
> On Jul 11, 2008, at 9:49 AM, Clifton Royston wrote:
>>
>> That just means that it halted without a proper shutdown. If it
>> crashes, the mirror isn't stopped properly, so it's marked dirty, so
>> it must rebuild it. It is the precise analogy of finding all the file
>> systems dirty on boot and fscking them, following a crash.
>
> Thanks for the clarification. Dang, I hoped I was on to something.

This is really off on a tangent, but I thought I'd mention it on the off-chance that it fits your problem.

Recently there have been grumblings about heat problems with certain NVIDIA chipsets on consumer boards. Apparently, there is some process issue, if you believe trade rags like theinquirer.net etc. Apparently there is some issue with heat damage over time. Consumer motherboards with passively cooled (no fan) heat pipes etc. seem to be particularly vulnerable. I use the word "apparently" because it is far from a verified fact.

However, I've got two motherboards, one running FreeBSD, one running Windows, with NVIDIA chipsets. Both used to be fine with onboard IDE activity. Both now use RAID controllers, so the IDE interfaces have been idle for a good year or so. Something came up and I had to use the IDE interfaces for a lot of data transfer. Suddenly, both machines are flaky. The Windows machine blue-screens under load. My FreeBSD box just "turns off" (the motherboard appears to power off, but the power supply is still on). The same happens when I use a Linux boot disk, so I know it's not FreeBSD's fault.

The common factor seems to be that the motherboards are now about a year and a half old. They both have the same NVIDIA south bridge that theinquirer.net was trashing. Both used to work fine and now have problems with IDE, and now I recalled the article and started wondering...

Do you, by any wildly remote chance, have an NVIDIA-based motherboard?

I believe the fault I'm seeing is the system asserting a fatal error by doing an HT ECC flood to halt everything.

--
Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete themselves upon execution." -- Robert Sewell
Re: how to get more logging from GEOM?
On Jul 15, 2008, at 8:35 AM, Oliver Fromme wrote:
> I had exactly the same problems on a machine a few months ago.
> It had also been running for about two years, then started
> freezing when there was high CPU + disk activity. It turned
> out that the power supply went weak (either the power supply
> itself or the voltage regulators on the mainboard). Replacing
> PS + mainboard solved the problem.

I have removed these drives and installed them in another machine and had exactly the same symptoms.

I built a new machine fresh with 7.2 (in case it was due to the upgrade process) and installed these ports and experienced the exact same problem.

That's why I am here. Physical or localized issues have already been ruled out.

--
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness