Re: My problems with stability on -current
On 05/11/2011 04:33, Alexander Motin wrote:
> On 11.05.2011 08:17, Doug Barton wrote:
>> I had an interesting result doing nothing but switching from HPET to
>> LAPIC ... no crash. Still on the same version of -current (r221566),
>> the only thing I've done is to add kern.eventtimer.timer=LAPIC to
>> /boot/loader.conf, and so far I haven't been able to get it to crash
>> no matter how much I compile, or how much other stuff I do in the
>> background.
>>
>> I _can_ get the system heavily loaded enough so that the mouse can
>> drag across the screen, windows take visible time to repaint, etc.
>> That happens with a load average of 4+ on this core 2 duo. But other
>> than that (which is not altogether unreasonable) the system has been
>> very stable for a couple of days now.
>>
>> Does that suggest a next step in terms of what to test?
>
> The fact that LAPIC is working fine can mean that the problem is either
> HPET-specific or specific to non-per-CPU timers. To check that, you
> could try the i8254 timer in one-shot mode:
>
>     hint.attimer.0.timecounter=0
>     kern.eventtimer.timer=i8254
>
> or use HPET in per-CPU mode:
>
>     hint.atrtc.0.clock=0
>     hint.attimer.0.clock=0
>     hint.hpet.X.legacy_route=1
>
> But the most informative would be to see what's going on with HPET
> interrupts during the freezes. With HPET hardware it is very easy to
> lose an interrupt, and a lost interrupt means problems for many things.
> There are some workarounds made for that, but I can't be sure. For that
> case you could experiment with this patch:
>
> --- acpi_hpet.c.prev    2010-12-25 11:28:45.0 +0200
> +++ acpi_hpet.c 2011-05-11 14:30:59.0 +0300
> @@ -190,7 +190,7 @@ restart:
>                 bus_write_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num), t->next);
>         }
> -       if (fdiv < 5000) {
> +       if (1 || fdiv < 5000) {
>                 bus_read_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num));
>                 now = bus_read_4(sc->mem_res, HPET_MAIN_COUNTER);

FYI, I have been running this patch since you sent it, and haven't
crashed under high load since.

-- 
Nothin' ever doesn't change, but nothin' changes much.
        -- OK Go

Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price. :) http://SupersetSolutions.com/

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: My problems with stability on -current
On 05/11/2011 04:33, Alexander Motin wrote:
> On 11.05.2011 08:17, Doug Barton wrote:
>> I had an interesting result doing nothing but switching from HPET to
>> LAPIC ... no crash. Still on the same version of -current (r221566),
>> the only thing I've done is to add kern.eventtimer.timer=LAPIC to
>> /boot/loader.conf, and so far I haven't been able to get it to crash
>> no matter how much I compile, or how much other stuff I do in the
>> background.
>>
>> I _can_ get the system heavily loaded enough so that the mouse can
>> drag across the screen, windows take visible time to repaint, etc.
>> That happens with a load average of 4+ on this core 2 duo. But other
>> than that (which is not altogether unreasonable) the system has been
>> very stable for a couple of days now.
>>
>> Does that suggest a next step in terms of what to test?
>
> The fact that LAPIC is working fine can mean that the problem is either
> HPET-specific or specific to non-per-CPU timers. To check that, you
> could try the i8254 timer in one-shot mode:
>
>     hint.attimer.0.timecounter=0
>     kern.eventtimer.timer=i8254
>
> or use HPET in per-CPU mode:
>
>     hint.atrtc.0.clock=0
>     hint.attimer.0.clock=0
>     hint.hpet.X.legacy_route=1
>
> But the most informative would be to see what's going on with HPET
> interrupts during the freezes. With HPET hardware it is very easy to
> lose an interrupt, and a lost interrupt means problems for many things.
> There are some workarounds made for that, but I can't be sure. For that
> case you could experiment with this patch:
>
> --- acpi_hpet.c.prev    2010-12-25 11:28:45.0 +0200
> +++ acpi_hpet.c 2011-05-11 14:30:59.0 +0300
> @@ -190,7 +190,7 @@ restart:
>                 bus_write_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num), t->next);
>         }
> -       if (fdiv < 5000) {
> +       if (1 || fdiv < 5000) {
>                 bus_read_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num));
>                 now = bus_read_4(sc->mem_res, HPET_MAIN_COUNTER);

Ok, I'll try the patch sometime soon, lots going on right now.

FYI, I had something odd happen tonight. The laptop had been up for
about 36 hours, and it was idle for a while when I was afk for about an
hour. When I came back, the system was off. Nothing in the logs, no core
dump, but it definitely crashed because when I turned it back on the
file systems were all dirty. This is still r221566 running LAPIC.

Interestingly I had pidgin running while it was idle, and a friend sent
me an e-mail saying that he tried to IM me and as soon as he sent the
message my status went from away to off line. The time he sent the
e-mail corresponds roughly to the last entry in the log before I
rebooted it.

I realize that this is not a lot to go on, but I thought I'd mention it.

Doug
Re: My problems with stability on -current
On 11.05.2011 08:17, Doug Barton wrote:
> I had an interesting result doing nothing but switching from HPET to
> LAPIC ... no crash. Still on the same version of -current (r221566),
> the only thing I've done is to add kern.eventtimer.timer=LAPIC to
> /boot/loader.conf, and so far I haven't been able to get it to crash no
> matter how much I compile, or how much other stuff I do in the
> background.
>
> I _can_ get the system heavily loaded enough so that the mouse can drag
> across the screen, windows take visible time to repaint, etc. That
> happens with a load average of 4+ on this core 2 duo. But other than
> that (which is not altogether unreasonable) the system has been very
> stable for a couple of days now.
>
> Does that suggest a next step in terms of what to test?

The fact that LAPIC is working fine can mean that the problem is either
HPET-specific or specific to non-per-CPU timers. To check that, you could
try the i8254 timer in one-shot mode:

    hint.attimer.0.timecounter=0
    kern.eventtimer.timer=i8254

or use HPET in per-CPU mode:

    hint.atrtc.0.clock=0
    hint.attimer.0.clock=0
    hint.hpet.X.legacy_route=1

But the most informative would be to see what's going on with HPET
interrupts during the freezes. With HPET hardware it is very easy to lose
an interrupt, and a lost interrupt means problems for many things. There
are some workarounds made for that, but I can't be sure. For that case
you could experiment with this patch:

--- acpi_hpet.c.prev    2010-12-25 11:28:45.0 +0200
+++ acpi_hpet.c 2011-05-11 14:30:59.0 +0300
@@ -190,7 +190,7 @@ restart:
                bus_write_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num), t->next);
        }
-       if (fdiv < 5000) {
+       if (1 || fdiv < 5000) {
                bus_read_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num));
                now = bus_read_4(sc->mem_res, HPET_MAIN_COUNTER);

-- 
Alexander Motin
Re: My problems with stability on -current
Hi.

On 10.05.2011 05:05, Jason Hellenthal wrote:
> On Tue, May 10, 2011 at 04:29:25AM +0300, Alexander Motin wrote:
>> On 10.05.2011 02:48, Doug Barton wrote:
>>> Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do
>>> that, right?
>>
>> Yes. You can do it in run-time also.
>
> Not quite absolutely sure here but IIRC the last time I tried setting
> that via loader.conf in 8-STABLE it was not being set so I eventually
> added it to sysctl.conf. Just for reference I never looked into it
> further.

There are no kern.eventtimer sysctls on 8-STABLE yet, so I'm not sure
what you were setting.

-- 
Alexander Motin
Re: My problems with stability on -current
on 10/05/2011 05:05 Jason Hellenthal said the following:
> Alexander,
>
> On Tue, May 10, 2011 at 04:29:25AM +0300, Alexander Motin wrote:
>> On 10.05.2011 02:48, Doug Barton wrote:
>>> Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do
>>> that, right?
>>
>> Yes. You can do it in run-time also.
>
> Not quite absolutely sure here but IIRC the last time I tried setting
> that via loader.conf in 8-STABLE it was not being set so I eventually
> added it to sysctl.conf. Just for reference I never looked into it
> further.

Perhaps you are confusing selection of eventtimer with choice of
timecounter? For the latter indeed there is no tunable, which is a small
annoyance.

-- 
Andriy Gapon
Re: My problems with stability on -current
Alexander,

On Tue, May 10, 2011 at 11:05:04AM +0300, Alexander Motin wrote:
> Hi.
>
> On 10.05.2011 05:05, Jason Hellenthal wrote:
>> On Tue, May 10, 2011 at 04:29:25AM +0300, Alexander Motin wrote:
>>> On 10.05.2011 02:48, Doug Barton wrote:
>>>> Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do
>>>> that, right?
>>>
>>> Yes. You can do it in run-time also.
>>
>> Not quite absolutely sure here but IIRC the last time I tried setting
>> that via loader.conf in 8-STABLE it was not being set so I eventually
>> added it to sysctl.conf. Just for reference I never looked into it
>> further.
>
> There are no kern.eventtimer sysctls on 8-STABLE yet, so I'm not sure
> what you were setting.

Ugh! yeah I had that mixed up with kern.timecounter. Somehow transcribed
the two.

-- 
Regards, (jhell)
Jason Hellenthal
Re: My problems with stability on -current
I had an interesting result doing nothing but switching from HPET to
LAPIC ... no crash. Still on the same version of -current (r221566), the
only thing I've done is to add kern.eventtimer.timer=LAPIC to
/boot/loader.conf, and so far I haven't been able to get it to crash no
matter how much I compile, or how much other stuff I do in the
background.

I _can_ get the system heavily loaded enough so that the mouse can drag
across the screen, windows take visible time to repaint, etc. That
happens with a load average of 4+ on this core 2 duo. But other than that
(which is not altogether unreasonable) the system has been very stable
for a couple of days now.

Does that suggest a next step in terms of what to test?
Re: My problems with stability on -current
On 10.05.2011 02:48, Doug Barton wrote:
>>>> I would start from most obvious problems. I need to know more about
>>>> crashes. As usual: how to trigger, stack backtraces, etc.
>>>
>>> Triggering is easy, I can start a buildworld with -j2, and a build of
>>> ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the
>>> system will reboot. I posted a panic message relative to r220282,
>>> (-current archives, 4/4) but kib said it didn't make any sense.
>>> Usually I don't get a panic at all.
>>
>> Could you hint me the thread?
>
> Go to http://www.FreeBSD.org/
> Click 'mailing lists'
> Click 'listed in the FreeBSD Handbook.'
> Click freebsd-current
> Click freebsd-current Archives
> Click April 2011
> search for r220282
>
> Voila! :)

OK, but URL would be fine also. :) I agree with kib@ -- the message
doesn't match the backtrace.

>>>> As for the time problems, I would try to collect more data:
>>>> - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and
>>>> verbose dmesg outputs;
>>>
>>> http://people.freebsd.org/~dougb/dougb-current-r221566.txt
>>>
>>>> - what eventtimer is used now and does it help to switch to another
>>>> one with kern.eventtimer.timer sysctl?
>>>
>>> When I was trying to track down the problems last summer I vaguely
>>> remember trying RTC, but eventually we realized that the real problem
>>> was throttling, so I stopped specifying RTC and let it go back to the
>>> default. What do you suggest I try?
>>
>> As I see, now you are using HPET (chosen automatically). I would try
>> switching to LAPIC. Just make sure to disable C-states if you have
>> enabled them, to be sure that the LAPIC timer won't stop.
>
> Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do that,
> right?

Yes. You can do it in run-time also.

> I don't use C-states (in part as a result of previous investigation)
> but I do use powerd as such:
> powerd_flags=-a adaptive -b adaptive -n adaptive
>
>>>> - does the timer run in periodic or one-shot mode and does it help
>>>> to switch to another one?
>>>
>>> How could I tell, and how would I switch?
>>
>> `sysctl kern.eventtimer.periodic`.
>
> kern.eventtimer.periodic: 0
>
>> And read eventtimers(4) please.
>
> I did that, but I don't see anything in there as to which choice is
> one-shot, and how to change to periodic. I assume 0 is the default,
> which I also assume is one-shot. Does setting that to 1 change to
> periodic? Also, can I safely do this while the system is running, or
> should it be in /boot/loader.conf as well?

Yes, nonzero value means periodic. And yes, changing in run-time is safe.

-- 
Alexander Motin
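[Editor's sketch] The run-time switching discussed above can be wrapped in a small script. This is only an illustration under assumptions: the helper names (run, mode_name, set_timer, set_periodic) and the DRY_RUN variable are made up for this example, and by default the script only prints the sysctl commands rather than executing them. The sysctl names and the "zero is one-shot, nonzero is periodic" rule are the ones given in the thread.

```shell
#!/bin/sh
# Sketch only: prints the commands by default. Set DRY_RUN=0 on a
# FreeBSD -current machine (as root) to actually apply them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Map a kern.eventtimer.periodic value to a mode name:
# zero means one-shot, any nonzero value means periodic.
mode_name() {
    if [ "$1" -eq 0 ]; then
        echo "one-shot"
    else
        echo "periodic"
    fi
}

# Select an event timer by name (LAPIC, HPET, RTC, i8254) at run time.
set_timer() {
    run sysctl "kern.eventtimer.timer=$1"
}

# Select periodic (nonzero) or one-shot (0) mode at run time.
set_periodic() {
    run sysctl "kern.eventtimer.periodic=$1"
}

set_timer LAPIC
set_periodic 1
```

To make a choice persist across reboots, the same kern.eventtimer.timer=LAPIC line goes into /boot/loader.conf, as confirmed above.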
Re: My problems with stability on -current
Alexander,

On Tue, May 10, 2011 at 04:29:25AM +0300, Alexander Motin wrote:
> On 10.05.2011 02:48, Doug Barton wrote:
>> Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do
>> that, right?
>
> Yes. You can do it in run-time also.

Not quite absolutely sure here but IIRC the last time I tried setting
that via loader.conf in 8-STABLE it was not being set so I eventually
added it to sysctl.conf. Just for reference I never looked into it
further.

-- 
Regards, (jhell)
Jason Hellenthal
Re: My problems with stability on -current
New symptom: today (still running r221566) I compiled a small port, and
that worked without any freezes or interactivity problems. Then I tried
compiling a larger port (java/openjdk6 if anyone cares) and still no
interactivity problems, but I got the "system wedge requiring power
cycle" problem I was seeing previously that I tracked to the one-shot
timer update. More below.

On 05/07/2011 02:43, Alexander Motin wrote:
> Doug Barton wrote:
>> On 05/05/2011 13:55, Alexander Motin wrote:
>>> I see several possibly unrelated problems there:
>>> - crashes are always crashes. They should be debugged.
>>> - calcru going backwards could have the same roots as lost wall clock
>>> time.
>>
>> I think you're right about that. What usually happens when the load
>> maxes out is that the system visibly freezes for a minute or 2, and
>> when it comes back to life the log is flooded with calcru messages. If
>> it stays up long enough after that the wall clock drift becomes
>> noticeable. This is in spite of running ntpd.
>
> These system freezes are very suspicious. Most time counters need only
> a few seconds to overflow, some even less. So a freeze of a few minutes
> will easily overflow most of them. So the freezes are probably the
> cause of the time problems, but the question now is what causes the
> freezes. You should try to investigate what is going on during
> freezes. Does the system do anything, are there any interrupts working
> (`vmstat -i` just before and after), are there any interrupt storms,
> etc?

Here is the output on a mostly-idle system, shortly after reboot:

vmstat -i
interrupt                          total       rate
irq1: atkbd0                        1784          0
irq9: acpi0                            1          0
irq14: ata0                       213355         89
irq15: ata1                           58          0
irq17: wpi0                        74331         31
irq20: hpet0 uhci0+               787767        331
irq22: uhci2                       21453          9
irq256: hdac0                         11          0
Total                            1098760        462

At a more opportune time I'll try crashing it again and get another
result.

>>> If there are some problems with timer interrupts, timecounters could
>>> wrap unnoticed that will cause random time jumps.
>>> - interactivity problems. I can't prove it is unrelated, but have no
>>> real ideas now.
>>>
>>> I would start from most obvious problems. I need to know more about
>>> crashes. As usual: how to trigger, stack backtraces, etc.
>>
>> Triggering is easy, I can start a buildworld with -j2, and a build of
>> ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the
>> system will reboot. I posted a panic message relative to r220282,
>> (-current archives, 4/4) but kib said it didn't make any sense.
>> Usually I don't get a panic at all.
>
> Could you hint me the thread?

Go to http://www.FreeBSD.org/
Click 'mailing lists'
Click 'listed in the FreeBSD Handbook.'
Click freebsd-current
Click freebsd-current Archives
Click April 2011
search for r220282

Voila! :)

>>> As for the time problems, I would try to collect more data:
>>> - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and
>>> verbose dmesg outputs;
>>
>> http://people.freebsd.org/~dougb/dougb-current-r221566.txt
>>
>>> - what eventtimer is used now and does it help to switch to another
>>> one with kern.eventtimer.timer sysctl?
>>
>> When I was trying to track down the problems last summer I vaguely
>> remember trying RTC, but eventually we realized that the real problem
>> was throttling, so I stopped specifying RTC and let it go back to the
>> default. What do you suggest I try?
>
> As I see, now you are using HPET (chosen automatically). I would try
> switching to LAPIC. Just make sure to disable C-states if you have
> enabled them, to be sure that the LAPIC timer won't stop.

Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do that,
right?

I don't use C-states (in part as a result of previous investigation) but
I do use powerd as such:
powerd_flags=-a adaptive -b adaptive -n adaptive

>>> - does the timer run in periodic or one-shot mode and does it help to
>>> switch to another one?
>>
>> How could I tell, and how would I switch?
>
> `sysctl kern.eventtimer.periodic`.

kern.eventtimer.periodic: 0

> And read eventtimers(4) please.

I did that, but I don't see anything in there as to which choice is
one-shot, and how to change to periodic. I assume 0 is the default, which
I also assume is one-shot. Does setting that to 1 change to periodic?
Also, can I safely do this while the system is running, or should it be
in /boot/loader.conf as well?

>>> - if full CPU load makes time stop, try to track what is going on
>>> with timer interrupts using `vmstat -i` and `systat -vm 1`. Under
>>> full CPU load in one-shot mode you should have stable timer interrupt
>>> rate about hz+stathz.
>>
>> Ok, I'll do that tomorrow, tired now.
>>
>>> - if timer interrupts are not working well, you can build kernel with
>>> options KTR
>>> options ALQ
>>> options KTR_ALQ
Re: My problems with stability on -current
On 05/05/2011 13:55, Alexander Motin wrote:
> Doug Barton wrote:
>> Alexander suggested some knobs to twist for the timers, and I'll be
>> glad to do that once he gets back to me with more concrete suggestions
>> now that he knows more about my specific problems.
>
> OK, I am all here. While this post is indeed larger than the previous
> one, it is not much more informative. Sorry. :(

I understand.

> I see several possibly unrelated problems there:
> - crashes are always crashes. They should be debugged.
> - calcru going backwards could have the same roots as lost wall clock
> time.

I think you're right about that. What usually happens when the load
maxes out is that the system visibly freezes for a minute or 2, and when
it comes back to life the log is flooded with calcru messages. If it
stays up long enough after that the wall clock drift becomes noticeable.
This is in spite of running ntpd.

> If there are some problems with timer interrupts, timecounters could
> wrap unnoticed that will cause random time jumps.
> - interactivity problems. I can't prove it is unrelated, but have no
> real ideas now.
>
> I would start from most obvious problems. I need to know more about
> crashes. As usual: how to trigger, stack backtraces, etc.

Triggering is easy, I can start a buildworld with -j2, and a build of
ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the system
will reboot. I posted a panic message relative to r220282, (-current
archives, 4/4) but kib said it didn't make any sense. Usually I don't
get a panic at all.

> As for the time problems, I would try to collect more data:
> - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
> dmesg outputs;

http://people.freebsd.org/~dougb/dougb-current-r221566.txt

> - what eventtimer is used now and does it help to switch to another
> one with kern.eventtimer.timer sysctl?

When I was trying to track down the problems last summer I vaguely
remember trying RTC, but eventually we realized that the real problem
was throttling, so I stopped specifying RTC and let it go back to the
default. What do you suggest I try?

> - does the timer run in periodic or one-shot mode and does it help to
> switch to another one?

How could I tell, and how would I switch?

> - if full CPU load makes time stop, try to track what is going on with
> timer interrupts using `vmstat -i` and `systat -vm 1`. Under full CPU
> load in one-shot mode you should have stable timer interrupt rate
> about hz+stathz.

Ok, I'll do that tomorrow, tired now.

> - if timer interrupts are not working well, you can build kernel with
> options KTR
> options ALQ
> options KTR_ALQ
> options KTR_COMPILE=(KTR_SPARE2)
> options KTR_ENTRIES=131072
> options KTR_MASK=(KTR_SPARE2)
> to track event timers operation and use ktrdump to save the trace when
> the problem exists (preferably when it begins).
>
> And let's experiment with fresh CURRENT.

Done and done. I'm up to r221566, and I added those options to my kernel
config. I ran ktrdump -cH -o ktrdumpfile and posted the results here:
http://people.freebsd.org/~dougb/ktrdumpfile.txt

This was shortly after boot, with no load. Not sure if it helps, but
there you go.

Thanks again for your help,

Doug
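[Editor's sketch] The remark quoted above that timecounters "could wrap unnoticed" can be made concrete with a back-of-the-envelope calculation: a free-running counter of a given width wraps after 2^bits / frequency seconds. The widths and frequencies below are typical values for these devices and are assumptions for illustration, not figures taken from Doug's dmesg; the function name wrap_seconds is likewise made up.

```shell
#!/bin/sh
# wrap_seconds BITS FREQ_HZ: seconds until a free-running counter that is
# BITS wide and ticks at FREQ_HZ wraps around (2^BITS / FREQ_HZ).
wrap_seconds() {
    awk -v bits="$1" -v hz="$2" 'BEGIN { printf "%.2f\n", (2 ^ bits) / hz }'
}

# Typical counter widths and frequencies (assumed, not from the thread).
echo "i8254 (16-bit, 1193182 Hz):  $(wrap_seconds 16 1193182) s"
echo "ACPI  (24-bit, 3579545 Hz):  $(wrap_seconds 24 3579545) s"
echo "HPET  (32-bit, 14318180 Hz): $(wrap_seconds 32 14318180) s"
```

Under those assumed values the narrower counters wrap within seconds, so a freeze of a minute or two with no timer interrupts serviced would be long enough for them to wrap without being noticed.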
Re: My problems with stability on -current
Doug Barton wrote:
> On 05/05/2011 13:55, Alexander Motin wrote:
>> I see several possibly unrelated problems there:
>> - crashes are always crashes. They should be debugged.
>> - calcru going backwards could have the same roots as lost wall clock
>> time.
>
> I think you're right about that. What usually happens when the load
> maxes out is that the system visibly freezes for a minute or 2, and
> when it comes back to life the log is flooded with calcru messages. If
> it stays up long enough after that the wall clock drift becomes
> noticeable. This is in spite of running ntpd.

These system freezes are very suspicious. Most time counters need only a
few seconds to overflow, some even less. So a freeze of a few minutes
will easily overflow most of them. So the freezes are probably the cause
of the time problems, but the question now is what causes the freezes.
You should try to investigate what is going on during freezes. Does the
system do anything, are there any interrupts working (`vmstat -i` just
before and after), are there any interrupt storms, etc?

>> If there are some problems with timer interrupts, timecounters could
>> wrap unnoticed that will cause random time jumps.
>> - interactivity problems. I can't prove it is unrelated, but have no
>> real ideas now.
>>
>> I would start from most obvious problems. I need to know more about
>> crashes. As usual: how to trigger, stack backtraces, etc.
>
> Triggering is easy, I can start a buildworld with -j2, and a build of
> ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the
> system will reboot. I posted a panic message relative to r220282,
> (-current archives, 4/4) but kib said it didn't make any sense.
> Usually I don't get a panic at all.

Could you hint me the thread?

>> As for the time problems, I would try to collect more data:
>> - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and
>> verbose dmesg outputs;
>
> http://people.freebsd.org/~dougb/dougb-current-r221566.txt
>
>> - what eventtimer is used now and does it help to switch to another
>> one with kern.eventtimer.timer sysctl?
>
> When I was trying to track down the problems last summer I vaguely
> remember trying RTC, but eventually we realized that the real problem
> was throttling, so I stopped specifying RTC and let it go back to the
> default. What do you suggest I try?

As I see, now you are using HPET (chosen automatically). I would try
switching to LAPIC. Just make sure to disable C-states if you have
enabled them, to be sure that the LAPIC timer won't stop.

>> - does the timer run in periodic or one-shot mode and does it help to
>> switch to another one?
>
> How could I tell, and how would I switch?

`sysctl kern.eventtimer.periodic`. And read eventtimers(4) please.

>> - if full CPU load makes time stop, try to track what is going on
>> with timer interrupts using `vmstat -i` and `systat -vm 1`. Under
>> full CPU load in one-shot mode you should have stable timer interrupt
>> rate about hz+stathz.
>
> Ok, I'll do that tomorrow, tired now.
>
>> - if timer interrupts are not working well, you can build kernel with
>> options KTR
>> options ALQ
>> options KTR_ALQ
>> options KTR_COMPILE=(KTR_SPARE2)
>> options KTR_ENTRIES=131072
>> options KTR_MASK=(KTR_SPARE2)
>> to track event timers operation and use ktrdump to save the trace
>> when the problem exists (preferably when it begins).
>>
>> And let's experiment with fresh CURRENT.
>
> Done and done. I'm up to r221566, and I added those options to my
> kernel config. I ran ktrdump -cH -o ktrdumpfile and posted the results
> here: http://people.freebsd.org/~dougb/ktrdumpfile.txt
>
> This was shortly after boot, with no load. Not sure if it helps, but
> there you go.

The dump looks fine, but I need a dump specifically for the time of the
problem. Since time probably can't be trusted here, it would be nice to
make the dump as localized as possible: clear the buffer with `sysctl
debug.ktr.clear=1`, trigger a freeze for a few seconds, stop collecting
with `sysctl debug.ktr.mask=0` and do the dump.

-- 
Alexander Motin
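[Editor's sketch] The localized capture described above (clear the buffer, trigger the freeze, stop collecting, dump) could be scripted roughly as follows. The sysctl names and the ktrdump flags are the ones quoted in this thread; the function names, the DRY_RUN variable and the Enter-key pause are assumptions added for illustration. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: prints the commands by default. Set DRY_RUN=0 on a kernel
# built with the KTR options (as root) to actually run them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# capture_trace FILE: clear the KTR buffer, wait for the operator to
# reproduce the freeze, stop collection, then save the trace to FILE.
capture_trace() {
    out=$1
    run sysctl debug.ktr.clear=1
    if [ "$DRY_RUN" != "1" ]; then
        printf 'Trigger the freeze, then press Enter: '
        read -r _
    fi
    run sysctl debug.ktr.mask=0
    run ktrdump -cH -o "$out"
}

capture_trace ktrdumpfile
```

Setting the mask to 0 before dumping keeps the dump itself from pushing the interesting entries out of the buffer.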
My problems with stability on -current
This is long, sorry. I wish I could condense things down to just the
answer, or even just the question, but here goes.

I've used HEAD on my main workstation(s) for many years. It's common for
there to be ups and downs, and that's fine. Lately however the problems
have been debilitating.

First a timeline. Since sometime before January 2008 I've been using a
Dell Latitude D620 laptop as my primary system. It has a core 2 duo
running at 2.33 G, and 2 G RAM. I 4xboot it with windows xp, freebsd
current (amd64), another freebsd (usually 8.N-RELEASE i386) and Ubuntu.
On the first and last I don't do a lot of compiling obviously, but even
under heavy load on 8.2-RELEASE I'm not seeing problems, so the problems
I _am_ seeing are not hardware related.

I keep my system very close to stock. My kernel config is GENERIC minus
devices I don't have, and plus the following:

options EXT2FS
options IEEE80211_DEBUG   # enable debug msgs
options VESA
device atapicam
device sound
device snd_hda
device snp

I was building with clang for a while, but when the problems started I
went back to gcc. I still have INVARIANTS on but I disabled WITNESS
because with all the known+unfixed LORs it's kind of pointless. Nothing
interesting in make/src.conf either, the latter is just a list of stuff
not to build, KERNCONF, and MODULES_OVERRIDE.

Starting around December 2009 I started having problems under load with
-current. Often I reported them, sometimes problems were found, sometimes
not. In the course of trying to debug those problems I disabled
throttling, which helped. Switching to SCHED_4BSD also helped quite a bit
with interactivity under load, although it was still worse than on 8.x.

In October of 2010 I was lucky enough to receive a donation of a Dell
Optiplex desktop that I started using as my primary workstation. Around
that same time there was some work being done in the scheduler(s) and
various related systems, and my desktop (which had a slightly faster core
2 duo and 4 G RAM) was running great. I assumed that the problems were
solved.

Then 2 months ago I packed up the desktop system and pulled out the
laptop again. I updated to the latest -current on the laptop, and all
heck broke loose. I couldn't do anything on my laptop that created even a
mediocre load without it crashing. Trying to do something like a
buildworld (even without -j) would cause the system to absolutely crawl.
I'd get tons of the dreaded calcru messages about time going backwards,
and the system clock would lose literally minutes of wall clock time. At
one point when I could keep it up long enough to build the world without
crashing it had lost 40 minutes of wall clock time when it finished. I
think that specific problem happened sometime between March 15 and
r220282.

In trying to find that problem, I uncovered another, deeper problem with
the one-shot timers from r212541. In order to make my binary search
easier for the problem described above I was using a -current snapshot CD
from August 2010 that I had laying around. I could easily build world
with -j2, run X, do normal desktop stuff (firefox, thunderbird, pidgin,
etc.) all at the same time. When I got closer to the more recent
-current, it would crash as soon as I put a load on it. I eventually
bifurcated down to that exact commit. I've been running on 212540 for
over a week now without any problems, including lots of port builds with
FORCE_MAKE_JOBS, etc.

Alexander suggested some knobs to twist for the timers, and I'll be glad
to do that once he gets back to me with more concrete suggestions now
that he knows more about my specific problems.

Doug
Re: My problems with stability on -current
Doug Barton wrote:
> Alexander suggested some knobs to twist for the timers, and I'll be
> glad to do that once he gets back to me with more concrete suggestions
> now that he knows more about my specific problems.

OK, I am all here. While this post is indeed larger than the previous
one, it is not much more informative. Sorry. :(

I see several possibly unrelated problems there:
- crashes are always crashes. They should be debugged.
- calcru going backwards could have the same roots as lost wall clock
time. If there are some problems with timer interrupts, timecounters
could wrap unnoticed that will cause random time jumps.
- interactivity problems. I can't prove it is unrelated, but have no
real ideas now.

I would start from most obvious problems. I need to know more about
crashes. As usual: how to trigger, stack backtraces, etc.

As for the time problems, I would try to collect more data:
- show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
dmesg outputs;
- what eventtimer is used now and does it help to switch to another one
with kern.eventtimer.timer sysctl?
- does the timer run in periodic or one-shot mode and does it help to
switch to another one?
- if full CPU load makes time stop, try to track what is going on with
timer interrupts using `vmstat -i` and `systat -vm 1`. Under full CPU
load in one-shot mode you should have stable timer interrupt rate about
hz+stathz.
- if timer interrupts are not working well, you can build kernel with
options KTR
options ALQ
options KTR_ALQ
options KTR_COMPILE=(KTR_SPARE2)
options KTR_ENTRIES=131072
options KTR_MASK=(KTR_SPARE2)
to track event timers operation and use ktrdump to save the trace when
the problem exists (preferably when it begins).

And let's experiment with fresh CURRENT.

-- 
Alexander Motin
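[Editor's sketch] One way to follow the `vmstat -i` suggestion above is to take two snapshots a known number of seconds apart and compute the interrupt rate over just that interval, instead of relying on the since-boot average in the "rate" column. The function names below are made up for this example. In the sample data, the first total (787767 for hpet0) is the one Doug reports elsewhere in this thread; the second total and the 10-second interval are invented purely to show the arithmetic.

```shell
#!/bin/sh
# irq_total NAME: read `vmstat -i` output on stdin and print the "total"
# column (second-to-last field) of the first line that mentions NAME.
irq_total() {
    awk -v name="$1" 'index($0, name) { print $(NF - 1); exit }'
}

# irq_rate BEFORE AFTER SECONDS: interrupts per second between two totals.
irq_rate() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / s }'
}

# Sample lines: the first total is from the thread, the second is made up.
before=$(echo 'irq20: hpet0 uhci0+   787767   331' | irq_total hpet0)
after=$(echo 'irq20: hpet0 uhci0+   791077   331' | irq_total hpet0)
irq_rate "$before" "$after" 10
```

On a live system the two totals would come from `vmstat -i | irq_total hpet0` run before and after the load. The result can then be compared against hz+stathz, whose values `sysctl kern.clockrate` reports.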