Re: FreeBSD unstable on Dell 1750 using SMP?

2006-04-05 Thread Rutger Bevaart
It doesn't look like a power problem. We have it with several systems  
in different datacenters. I've tried the "giantlock" setting, let's  
hope it works! Am I safe to assume that it can (negatively) impact  
performance of the system? What can be the cause of "fine grained  
locking" causing the crashes? I'm willing to let a developer play  
around with one of the affected machines...


Thanks again for the suggestion Ulrich.

Met vriendelijke groet / Kind Regards,
Rutger Bevaart

On Apr 5, 2006, at 1:53 AM, Ulrich Keil wrote:


We solved the problem by running the network stack with Giant lock
(set "debug.mpsafenet=0" in loader.conf).
Since then the machine runs rock stable.

Ulrich



___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: FreeBSD unstable on Dell 1750 using SMP?

2006-04-04 Thread Charles Swiger

On Apr 4, 2006, at 4:37 AM, Rutger Bevaart wrote:
This because we have 2850's that experience exactly the same  
problems, just less frequently (about once every 4 months).


I'm completely at a loss, and inclined to remove FreeBSD and  
install "another OS" as it is an important management machine for  
us, that reboots about monthly.


By all means, feel free to see whether the problem reoccurs using  
another OS, but it sounds like an intermittent hardware failure or  
power drop to me.  I've got a dozen or so Dell 2800 or 2850 machines  
which have no problems reaching 6+ months of uptime.


--
-Chuck



Re: FreeBSD unstable on Dell 1750 using SMP?

2006-04-04 Thread Vivek Khera


On Apr 4, 2006, at 4:37 AM, Rutger Bevaart wrote:

I'm completely at a loss, and inclined to remove FreeBSD and  
install "another OS" as it is an important management machine for  
us, that reboots about monthly.


Any clues, tips, help, known bugs?



Either bad hardware or pilot error.  Here's some stats for you:

[morebiz]% grep DELL /var/run/dmesg.boot
ACPI APIC Table: 
acpi0:  on motherboard
[morebiz]% sysctl kern.boottime
kern.boottime: { sec = 1130521993, usec = 140021 } Fri Oct 28 13:53:13 2005

[morebiz]% date
Tue Apr  4 09:58:18 EDT 2006
[morebiz]% uptime
9:58AM  up 157 days, 20:05, 1 user, load averages: 0.00, 0.00, 0.00
[morebiz]% uname -r
5.4-RELEASE-p8
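
As a cross-check on the figures above, the uptime can be recomputed from the kern.boottime epoch value and the date output. This is only a sketch: it assumes the date shown is in EDT (UTC-4), and the helper name uptime_from_boottime is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the "date" output above is in EDT, i.e. UTC-4.
EDT = timezone(timedelta(hours=-4))


def uptime_from_boottime(boot_sec: int, now: datetime) -> timedelta:
    """Return the time elapsed between an epoch boot time and `now`."""
    boot = datetime.fromtimestamp(boot_sec, tz=timezone.utc)
    return now - boot


# Values copied from the sysctl and date output in this message.
boot_sec = 1130521993
now = datetime(2006, 4, 4, 9, 58, 18, tzinfo=EDT)

up = uptime_from_boottime(boot_sec, now)
hours, rem = divmod(up.seconds, 3600)
print(f"up {up.days} days, {hours}:{rem // 60:02d}")
```

Under that time zone assumption this agrees with the 157 days, 20:05 reported by uptime.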

This machine runs two instances of apache on two IPs, a postgres  
server and a mysql server to run a few different web sites.  It gets  
a fair number of hits, many of which hit the dbs.  I run with  
hyperthreading enabled, but when I next upgrade this box to 6.1, I  
will turn it off.


I don't have any 2850's but the one 1850 I have has been 100% stable  
since it went into production last October running FreeBSD 6.0.  I'd  
buy it again in a heartbeat.


Are you sure your electrical power is stable?




Re: FreeBSD unstable on Dell 1750 using SMP?

2006-04-04 Thread Rutger Bevaart
Argh. After all the fixes done on the 5.4-STABLE and 6.0 codebases my  
Dell PE1750 still reboots randomly. Again last night at 03.03 :-(  
messages still shows nothing, and nothing special was going on at the  
time (loadavg ~ 0.00).


It's running:
FreeBSD xyz 6.0-RELEASE-p4 FreeBSD 6.0-RELEASE-p4 #0: Sun Feb 19  
21:15:01 CET 2006 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/SMP  i386


What I've tried to fix the problems:
- kern_proc.c patch submitted to freebsd-stable by Don Lewis.
- disable HTT
- upgrade to 5-STABLE
- upgrade to 6.0-RELEASE-p1,2,3,4

What we've _not_ tried:
- Swap memory

This because we have 2850's that experience exactly the same  
problems, just less frequently (about once every 4 months).


I'm completely at a loss, and inclined to remove FreeBSD and install  
"another OS" as it is an important management machine for us, that  
reboots about monthly.


Any clues, tips, help, known bugs?

Regards
Rutger Bevaart



Re: FreeBSD unstable on Dell 1750 using SMP?

2006-01-02 Thread Don Lewis
On 30 Nov, Dan Charrois wrote:
> This is encouraging - it's the first I've heard of someone who has  
> found a way to trigger the problem "on demand".  The problems I was  
> experiencing were on a dual Xeon with HTT enabled as well.  Perhaps  
> someone out there who knows much more about the inner workings of  
> FreeBSD may have an idea of why running top in "aggressive mode" like  
> this might trigger the random rebooting.  In particular, it would be  
> nice to *know* that someone out there specifically fixed whatever is  
> wrong in 5.4 when bringing it to 6.0.  It's encouraging that you  
> haven't had any problems since upgrading to 6.0, but I have to wonder  
> if the bug's actually fixed, or the specific trigger of running top  
> doesn't trigger the problem but the problem is still lurking in the  
> background waiting to strike with the right combination of events.
> 
> In any case, I'm anxious to try it out myself on our server to see if  
> "top -s0" brings it down "on command" with HTT enabled, and not with  
> HTT disabled.  But I'm going to have to wait until some time over the  
> Christmas holidays to do that sort of experimentation at a time when  
> it isn't affecting the end users of the machine.  I may also upgrade  
> to 6.0 at that time, since by then it will have been out for a couple  
> of months, so most of the worst quirks should be worked out by then.
> 
> In the meantime, disabling HTT as I've done seems like a reasonable  
> precaution to improve the stability..
> 
> Thanks for your help!
> 
> Dan

Try this patch, which I posted to stable@ on October 15.  I had hoped to
commit it to RELENG_5 in November, but my day job intervened.

-- Forwarded message --
From: Don Lewis <[EMAIL PROTECTED]>
 Subject: testers wanted for 5.4-STABLE sysctl kern.proc patch
Date: Sat, 15 Oct 2005 14:51:37 -0700 (PDT)
  To: [EMAIL PROTECTED]
  Cc: 

The patch below is the 5.4-STABLE version of a patch that was recently
committed to HEAD and 6.0-BETA5 to fix locking problems in the kern.proc
sysctl handler that could cause panics or deadlocks.  It has already
been tested by myself and one other person in 5.4-STABLE, but I think it
deserves wider testing before I commit it.  Testing on SMP systems,
while running threaded applications, and on systems that have
experienced panics in the existing code is of the most interest.  Also
be on the lookout for any regressions, such as incorrect data being
returned.

Index: sys/kern/kern_proc.c
===
RCS file: /home/ncvs/src/sys/kern/kern_proc.c,v
retrieving revision 1.215.2.6
diff -u -r1.215.2.6 kern_proc.c
--- sys/kern/kern_proc.c	22 Mar 2005 13:40:23 -	1.215.2.6
+++ sys/kern/kern_proc.c	12 Oct 2005 19:13:14 -
@@ -72,6 +72,8 @@
 
 static void doenterpgrp(struct proc *, struct pgrp *);
 static void orphanpg(struct pgrp *pg);
+static void fill_kinfo_proc_only(struct proc *p, struct kinfo_proc *kp);
+static void fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp);
 static void pgadjustjobc(struct pgrp *pgrp, int entering);
 static void pgdelete(struct pgrp *);
 static int proc_ctor(void *mem, int size, void *arg, int flags);
@@ -601,33 +603,22 @@
 	}
 }
 #endif /* DDB */
-void
-fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp);
 
 /*
- * Fill in a kinfo_proc structure for the specified process.
+ * Clear kinfo_proc and fill in any information that is common
+ * to all threads in the process.
  * Must be called with the target process locked.
  */
-void
-fill_kinfo_proc(struct proc *p, struct kinfo_proc *kp)
-{
-	fill_kinfo_thread(FIRST_THREAD_IN_PROC(p), kp);
-}
-
-void
-fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp)
+static void
+fill_kinfo_proc_only(struct proc *p, struct kinfo_proc *kp)
 {
-	struct proc *p;
 	struct thread *td0;
-	struct ksegrp *kg;
 	struct tty *tp;
 	struct session *sp;
 	struct timeval tv;
 	struct ucred *cred;
 	struct sigacts *ps;
 
-	p = td->td_proc;
-
 	bzero(kp, sizeof(*kp));
 
 	kp->ki_structsize = sizeof(*kp);
@@ -685,7 +676,8 @@
 		kp->ki_tsize = vm->vm_tsize;
 		kp->ki_dsize = vm->vm_dsize;
 		kp->ki_ssize = vm->vm_ssize;
-	}
+	} else if (p->p_state == PRS_ZOMBIE)
+		kp->ki_stat = SZOMB;
 	if ((p->p_sflag & PS_INMEM) && p->p_stats) {
 		kp->ki_start = p->p_stats->p_start;
 		timevaladd(&kp->ki_start, &boottime);
@@ -704,71 +696,6 @@
 	kp->ki_nice = p->p_nice;
 	bintime2timeval(&p->p_runtime, &tv);
 	kp->ki_runtime = tv.tv_sec * (u_int64_t)100 + tv.tv_usec;
-	if (p->p_state != PRS_ZOMBIE) {
-#if 0
-		if (td == NULL) {
-			/* XXXKSE: This should never happen. */
-			printf("fill_kinfo_proc(): pid %d has no threads!\n",
-   

Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-30 Thread Dan Charrois
This is encouraging - it's the first I've heard of someone who has  
found a way to trigger the problem "on demand".  The problems I was  
experiencing were on a dual Xeon with HTT enabled as well.  Perhaps  
someone out there who knows much more about the inner workings of  
FreeBSD may have an idea of why running top in "aggressive mode" like  
this might trigger the random rebooting.  In particular, it would be  
nice to *know* that someone out there specifically fixed whatever is  
wrong in 5.4 when bringing it to 6.0.  It's encouraging that you  
haven't had any problems since upgrading to 6.0, but I have to wonder  
if the bug's actually fixed, or the specific trigger of running top  
doesn't trigger the problem but the problem is still lurking in the  
background waiting to strike with the right combination of events.


In any case, I'm anxious to try it out myself on our server to see if  
"top -s0" brings it down "on command" with HTT enabled, and not with  
HTT disabled.  But I'm going to have to wait until some time over the  
Christmas holidays to do that sort of experimentation at a time when  
it isn't affecting the end users of the machine.  I may also upgrade  
to 6.0 at that time, since by then it will have been out for a couple  
of months, so most of the worst quirks should be worked out by then.


In the meantime, disabling HTT as I've done seems like a reasonable  
precaution to improve the stability..


Thanks for your help!

Dan

On Nov 29, 2005, at 10:50 PM, Stephen Montgomery-Smith wrote:


Dan Charrois wrote:

It actually may be a comfort, since perhaps HTT is related to the   
culprit.  Since the last crash, about a month ago, I disabled  
HTT,  both in the kernel as well as in the BIOS.  So as far as I  
know, it's  completely been disabled (and the boot messages and  
top only show 2  CPUs).  And I haven't had the system go down for  
nearly a month now.


I don't know if it is related, but I used to have random reboots on  
a dual Xeon system with HTT enabled.  It happened when I ran a CPU  
intensive threaded program at the same time as "top" - running "top  
-s0" (which you have to do as root) could usually kill the machine  
in seconds if not minutes.


All I can tell you is that with FreeBSD 6.0 the problem disappeared.

Well not totally - I still get a bunch of harmless calcru negative  
messages, although I don't know if it is actually related to the  
boot problems I used to have with FreeBSD 5.4, because I get the  
calcru backwards messages even with HTT disabled.


Anyway, if you are in the mood to try it out, you might like to try  
re-enabling HTT, starting up whatever process you usually use (I'm  
guessing it is MySQL), and then run "top -s0".  If you get a crash  
soon after that, you have the same problem I had.


Let me also add that these crashes usually did not trigger a crash  
dump (I had dumpon set), and when it did the resulting dump looked  
rather corrupted.


Stephen



--
Syzygy Research & Technology
Box 83, Legal, AB  T0G 1L0 Canada
Phone: 780-961-2213



Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-29 Thread Stephen Montgomery-Smith

Dan Charrois wrote:

It actually may be a comfort, since perhaps HTT is related to the  
culprit.  Since the last crash, about a month ago, I disabled HTT,  both 
in the kernel as well as in the BIOS.  So as far as I know, it's  
completely been disabled (and the boot messages and top only show 2  
CPUs).  And I haven't had the system go down for nearly a month now.


I don't know if it is related, but I used to have random reboots on a 
dual Xeon system with HTT enabled.  It happened when I ran a CPU 
intensive threaded program at the same time as "top" - running "top -s0" 
(which you have to do as root) could usually kill the machine in seconds 
if not minutes.


All I can tell you is that with FreeBSD 6.0 the problem disappeared.

Well not totally - I still get a bunch of harmless calcru negative 
messages, although I don't know if it is actually related to the boot 
problems I used to have with FreeBSD 5.4, because I get the calcru 
backwards messages even with HTT disabled.


Anyway, if you are in the mood to try it out, you might like to try 
re-enabling HTT, starting up whatever process you usually use (I'm 
guessing it is MySQL), and then run "top -s0".  If you get a crash soon 
after that, you have the same problem I had.


Let me also add that these crashes usually did not trigger a crash dump 
(I had dumpon set), and when it did the resulting dump looked rather 
corrupted.


Stephen


Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-29 Thread Dan Charrois

Rutger Bevaart wrote:
Same here on several 1750's, 1850's and 2850's. Tomorrow I'll disable USB
in the BIOS on one of the 1750's and see if it makes a difference. It's
the only one of the set that I could get downtime for because it rebooted
yesterday ;-)


I've disabled USB in the BIOS on my 2850 much earlier on when I was  
getting interrupt storms, since I didn't need USB anyway.  It solved  
the problem of the interrupt storms, but it didn't seem to have any  
impact on the mysterious unsolicited rebooting problem.


Claus Guttesen wrote:

It's not any comfort to you but I have two Dell PE 1750's running very
reliably using FreeBSD 5.4 stable as of Wed. the 28th of Sep. 2005.
It has two Xeon at 3 GHz, 2 GB RAM, a LSILogic 1030 Ultra4 Adapter.
HTT is *off*.

HTT does not yield any higher performance for most purposes. I can
send you my kernel if you want.


It actually may be a comfort, since perhaps HTT is related to the  
culprit.  Since the last crash, about a month ago, I disabled HTT,  
both in the kernel as well as in the BIOS.  So as far as I know, it's  
completely been disabled (and the boot messages and top only show 2  
CPUs).  And I haven't had the system go down for nearly a month now.


Of course, I also did some other things at the same time, so it's  
unclear as to which specifically may have helped.  I had noticed that  
in the past it had rebooted itself twice right while running  
mysqlhotcopy as root during a period where the server may have been  
rather heavily loaded.  So in addition to turning off hyperthreading,  
I also changed the time when mysqlhotcopy was running to a period  
likely under a lighter load, and modified things so it isn't running  
as root any longer.


Not that I think mysqlhotcopy was the culprit itself, but it does  
cause a fairly large burst of disk activity when it is running, and  
it does seem to be related to triggering the event, at least in my  
situation.


In any case, since I've done those three things, I haven't had a  
crash yet.  Of course, the lack of a result doesn't prove anything,  
but the more time that passes, the better I feel.  That is until one  
day I wake up to find that it died again.  In any case, if that  
happens, I'll know more things that the problem isn't related to..


Vivek Khera wrote:

I'd recommend running the Dell diags.  They're pretty good at picking
out hardware trouble, which it sounds like the OP is having.


In my case anyway, I have run the Dell diagnostics, and they showed  
everything to be just fine..


Kevin Oberman wrote:
As far as I can tell, hyperthreading is not much of a win for anyone. See the
article at: http://news.zdnet.co.uk/0,39020330,39237341,00.htm

It reports that HTT slows performance even on threaded and, theoretically,
HTT-ideal apps. (And this was with Windows.)


So I've heard.  I was hoping that hyperthreading might be able to  
help a dedicated MySQL server handle a bit higher load, but I never  
had the chance to benchmark it with and without hyperthreading before I  
had to put the machine into production.  So it's disabled now - it  
can't hurt the stability of the system and can only potentially help  
it.  Time will tell.


Thanks for your replies, everyone!

Dan
--
Syzygy Research & Technology
Box 83, Legal, AB  T0G 1L0 Canada
Phone: 780-961-2213



Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-29 Thread Kevin Oberman
As far as I can tell, hyperthreading is not much of a win for anyone. See the
article at: http://news.zdnet.co.uk/0,39020330,39237341,00.htm

It reports that HTT slows performance even on threaded and, theoretically,
HTT-ideal apps. (And this was with Windows.)
-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: [EMAIL PROTECTED]   Phone: +1 510 486-8634


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-29 Thread Vivek Khera


On Nov 29, 2005, at 10:46 AM, Claus Guttesen wrote:


It's not any comfort to you but I have two Dell PE 1750's running very
reliably using FreeBSD 5.4 stable as of Wed. the 28th of Sep. 2005.


I'd recommend running the Dell diags.  They're pretty good at picking  
out hardware trouble, which it sounds like the OP is having.



Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-29 Thread Claus Guttesen
> Thanks everyone for replies made over the past few days about the
> "unsolicited" rebooting problem.  At first, I thought there was a
> memory allocation bug as judged by the output of "netstat -m", but
> apparently it's just a cosmetic statistics reporting bug and nothing
> related to the instability itself.

It's not any comfort to you but I have two Dell PE 1750's running very
reliably using FreeBSD 5.4 stable as of Wed. the 28th of Sep. 2005.
It has two Xeon at 3 GHz, 2 GB RAM, a LSILogic 1030 Ultra4 Adapter.
HTT is *off*.

HTT does not yield any higher performance for most purposes. I can
send you my kernel if you want.

regards
Claus


Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-29 Thread Dan Charrois
Thanks everyone for replies made over the past few days about the  
"unsolicited" rebooting problem.  At first, I thought there was a  
memory allocation bug as judged by the output of "netstat -m", but  
apparently it's just a cosmetic statistics reporting bug and nothing  
related to the instability itself.


Unfortunately, it means that I still haven't been able to find a  
solution to the problem (and apparently, I'm not the only one to  
experience it).  Considering that we only have the one machine, which  
happens to be a production machine, that experiences the problem  
(infrequently at that), it's difficult to test and resolve.  It's  
been suggested that FreeBSD 6.0 may fix the problem, but considering  
some of the inevitable bugs that creep into new releases, I'm  
reluctant to go there until things settle down in 6.0 (plus, I  
haven't seen any documentation that implies that a fix for the  
problem will result from using 6.0 in any case).  If it weren't a  
production machine that needs to be reliable, stable, and available,  
I'd have a better chance at being able to test it under 6.0.


Some speculation has been made about it being triggered by possibly  
buggy ethernet drivers, etc.  In my case, though possible, I doubt it  
- since my machine has rebooted itself right when mysqlhotcopy was  
about to run on the machine (and it runs locally without causing any  
network activity that I'm aware of).  The first thought I had was  
that it may be caused by faulty memory or something, but Dell's  
hardware diagnostics all tested everything to be perfectly okay.


What I find strange is that it's not that the kernel locks up or  
anything - the machine just suddenly restarts (caches aren't flushed  
to disk or anything - it's just like someone literally pulls the  
power plug midstream, and then plugs it back in).  The only indication  
that something weird is going on is that in the server logs everything  
seems to be crunching away happily and then suddenly I see the boot  
messages when it restarts all by itself..


In any case, if anyone else with a dual processor machine (I have a  
PowerEdge 2850 myself) has experienced the rebooting problem  
discussed a few days ago and resolved it, I'd very much like to hear  
from you.


Dan
--
Syzygy Research & Technology
Box 83, Legal, AB  T0G 1L0 Canada
Phone: 780-961-2213



Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-26 Thread Gino Ruopolo


Both the servers where I changed the nic to an old intel 10/100 reached a 
week of uptime!

I'm seriously thinking about a problem with the "em" ethernet card driver.

later,
gino



From: Kris Kennaway <[EMAIL PROTECTED]>
To: Rutger Bevaart <[EMAIL PROTECTED]>
CC: Kris Kennaway <[EMAIL PROTECTED]>, freebsd-stable@freebsd.org, Gino Ruopolo <[EMAIL PROTECTED]>
Subject: Re: FreeBSD unstable on Dell 1750 using SMP?
Date: Fri, 25 Nov 2005 14:09:19 -0500

On Fri, Nov 25, 2005 at 01:22:01PM +0100, Rutger Bevaart wrote:
> Hello Kris (& list),
>
> Thanks for helping the 1750 and 2850 owners on this list. Unfortunately I
> cannot find any references to the leak or the fix you are referring to in
> the Release errata (http://www.freebsd.org/releases/5.4R/errata.html).

It was in the 5.3 errata, sorry.  I don't think it was fixed until
after 5.4 though.

> We are trying really hard to resolve the stability issues with our Dell
> servers and would be very happy to know when the fix for what was
> committed.

As I said twice already, the stats leak is ***HARMLESS***.  It only
gives the wrong value to counters that are unused for anything except
reporting to the user.

> No way we'll be upgrading to 6.0 without knowing exactly what
> is going on (remembering broken 4.10 -> 5.3 systems) ...

5.4 -> 6.0 is really a very minor jump.  But if you're not willing to
even test it out on one machine to see whether it resolves your
problems, you'll likely just have to get used to the instability until
someone can identify your problem and then fix it.

Kris






Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-25 Thread Rutger Bevaart


On Nov 25, 2005, at 8:09 PM, Kris Kennaway wrote:


As I said twice already, the stats leak is ***HARMLESS***.  It only
gives the wrong value to counters that are unused for anything except
reporting to the user.



Aha, that's what I was trying to clear up when the whole counters issue came
along in the first place. Must have missed your previous remark about it.



No way we'll be upgrading to 6.0 without knowing exactly what
is going on (remembering broken 4.10 -> 5.3 systems) ...


5.4 -> 6.0 is really a very minor jump.  But if you're not willing to
even test it out on one machine to see whether it resolves your
problems, you'll likely just have to get used to the instability until
someone can identify your problem and then fix it.



Of course I'm trying it out, it's just hard to get a spare Dell 2850 and put
it through the same day-to-day use as the rest of them. That's a matter
of cost.

On other posts I've basically offered anything short of root access to
resolve this. There is actually quite a group of people with issues
with this. Some think it's an ACPI issue. An IRQ conflict with USB has been
suggested. The 'em' driver was suspect, the 'bge' driver was suspect, the
'amr' driver was suspect.

Funny thing is we have this 1750 (2x 2.4 Xeon) that takes a major beating each
day, running 5.3-BETA6. Hasn't crashed ever.

If you have any clues on where I can look further (previous posts at:
http://lists.freebsd.org/pipermail/freebsd-smp/2005-July/000930.html)
greatly appreciated.

Regards & thanks for all the help,
Rutger



Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-25 Thread Kris Kennaway
On Fri, Nov 25, 2005 at 01:22:01PM +0100, Rutger Bevaart wrote:
> Hello Kris (& list),
> 
> Thanks for helping the 1750 and 2850 owners on this list. Unfortunately I
> cannot find any references to the leak or the fix you are referring to in
> the Release errata (http://www.freebsd.org/releases/5.4R/errata.html).

It was in the 5.3 errata, sorry.  I don't think it was fixed until
after 5.4 though.
 
> We are trying really hard to resolve the stability issues with our Dell
> servers and would be very happy to know when the fix for what was
> committed.

As I said twice already, the stats leak is ***HARMLESS***.  It only
gives the wrong value to counters that are unused for anything except
reporting to the user.

> No way we'll be upgrading to 6.0 without knowing exactly what
> is going on (remembering broken 4.10 -> 5.3 systems) ...

5.4 -> 6.0 is really a very minor jump.  But if you're not willing to
even test it out on one machine to see whether it resolves your
problems, you'll likely just have to get used to the instability until
someone can identify your problem and then fix it.

Kris




Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-25 Thread Rutger Bevaart
Hello Kris (& list),

Thanks for helping the 1750 and 2850 owners on this list. Unfortunately I
cannot find any references to the leak or the fix you are referring to in
the Release errata (http://www.freebsd.org/releases/5.4R/errata.html).

We are trying really hard to resolve the stability issues with our Dell
servers and would be very happy to know when the fix for what was
committed. No way we'll be upgrading to 6.0 without knowing exactly what
is going on (remembering broken 4.10 -> 5.3 systems) ...

Regards
Rutger Bevaart

On Thu, November 24, 2005 21:22, Kris Kennaway wrote:
> On Thu, Nov 24, 2005 at 09:45:08AM +0100, Rutger Bevaart wrote:
>> Hi Kris,
>>
>> I cannot find anything about that in the /usr/src/UPDATING for the 5.4
>> branch.
>
> I didn't say anything about UPDATING, I said the release errata.
>
>> We're running "FreeBSD xyz 5.4-RELEASE-p5 FreeBSD 5.4-RELEASE-p5"
>> and p6 and later only fix some IPSEC and SSL stuff.
>>
>> Is it in 6.0 and if so, will somebody backport that fix?
>
> Yes and as I said, it already was.
>
>> > This is documented in the 5.4 errata, it's a leak in the stats
>> > counting on SMP machines.  It was fixed after 5.4.
>
> Kris
>


Rutger Bevaart :: illian.networks



Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-24 Thread Kris Kennaway
On Thu, Nov 24, 2005 at 02:49:10PM -0700, Dan Charrois wrote:
> I just thought of one other bit of info that may be relevant to the  
> auto-rebooting problem I've experienced with our PowerEdge 2850.   
> Since the problem may be related to memory allocation, I thought I  
> should mention that we have more memory in that machine than is  
> typical for some users.  We have 5 Gigs installed.  From "top":
> 
> Mem: 175M Active, 4121M Inact, 244M Wired, 244M Cache, 214M Buf, 23M Free
> Swap: 10G Total, 12K Used, 10G Free
> 
> If this turns out to be an AMD64 vs. 386 issue and we were to revert  
> to the 386 branch, would we still be able to access this memory, or  
> would the 386 be limited to 4GB (or maybe 2GB) due to 32 bit  
> addressing?  We don't need anywhere near this much memory for user  
> space programs, but the kernel does make good use of it to cache  
> commonly accessed regions of the file system in memory.

There are no issues with using 5GB of RAM on AMD64, unless of course
you have bad memory (I assume you already ruled this out by swapping
out the RAM, making sure you don't have mismatched RAM with different
characteristics, etc).

On i386 this would be limited to 4GB unless you enable PAE, which has
performance implications (how much depends on your CPU) and which may
not be supported by the drivers you need (see the PAE kernel config
file).
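
To put numbers on the limit described above: a 32-bit physical address can reach 2^32 bytes, and PAE on i386 widens physical addresses to 36 bits. A small sketch of the arithmetic follows; the function name is purely illustrative, and the 5 GiB figure is taken from the quoted message.

```python
GIB = 2**30  # bytes in one GiB


def addressable_gib(address_bits: int) -> int:
    """Physical memory reachable with the given address width, in GiB."""
    return 2**address_bits // GIB


installed_gib = 5  # the "5 Gigs" mentioned in the quoted message

for label, bits in (("i386 without PAE", 32), ("i386 with PAE", 36)):
    limit = addressable_gib(bits)
    usable = min(installed_gib, limit)
    print(f"{label}: limit {limit} GiB, usable {usable} of {installed_gib} GiB")
```

These are upper bounds only. In practice somewhat less than 4 GB is usable without PAE, because part of the 32-bit address space is taken up by device mappings.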

Kris




Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-24 Thread Kris Kennaway
On Thu, Nov 24, 2005 at 02:36:01PM -0700, Dan Charrois wrote:

> But here's about where any troubleshooting on my own reaches its  
> limit.  I noticed that Kris mentioned it was a known problem in the  
> stats counting for SMP machines and had been fixed, but haven't been  
> able to find a reference to that, or any indication of how to do so.   
> Is this fix supposed to have been an accounting bug in the report for  
> netstat, or is it something which would have taken down the machine  
> as has been happening?

It's a leak in the stats counting that has no implications other than
cosmetic ones.  If you update to 5.4-STABLE that should be fixed.

Anyway, if 5.4 is giving you stability problems then you should try
6.0 to see if the bug is already fixed.

Kris




Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-24 Thread Dan Charrois
I just thought of one other bit of info that may be relevant to the  
auto-rebooting problem I've experienced with our PowerEdge 2850.   
Since the problem may be related to memory allocation, I thought I  
should mention that we have more memory in that machine than is  
typical for some users.  We have 5 Gigs installed.  From "top":

Mem: 175M Active, 4121M Inact, 244M Wired, 244M Cache, 214M Buf, 23M Free
Swap: 10G Total, 12K Used, 10G Free

If this turns out to be an AMD64 vs. 386 issue and we were to revert  
to the 386 branch, would we still be able to access this memory, or  
would the 386 be limited to 4GB (or maybe 2GB) due to 32 bit  
addressing?  We don't need anywhere near this much memory for user  
space programs, but the kernel does make good use of it to cache  
commonly accessed regions of the file system in memory.


Dan
--
Syzygy Research & Technology
Box 83, Legal, AB  T0G 1L0 Canada
Phone: 780-961-2213

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-24 Thread Dan Charrois

Hi Kris, Rutger, and others that have commented on this thread.

I'm happy to hear that I'm not the only one experiencing problems  
like this.  I posted a similar question a month or so ago about a  
PowerEdge 2850 using SMP (dual Xeons) and never received any  
responses that helped solve the problem, or even any indication that  
others had the same problem.  As you know, troubleshooting this is  
quite difficult, since it can take weeks to go down, and then the  
"auto-reboot" doesn't result in any clues as to why in the log file -  
it's just suddenly started again as if someone had pulled the plug on  
it.  I've been pulling my hair out.


My machine crashed twice in the last month or so, within two weeks of  
each other.  Both times, it was just as a cron task was about to  
schedule the mysqlhotcopy script to back up some SQL databases that  
are being hosted on that machine, so I thought it may have something  
to do with that (I had it running as a root crontask so figured that  
maybe some bug in that caused things to go weird - it was running as  
root, after all).  I changed it to run under a less privileged user  
and the machine hasn't died for about 2 1/2 weeks.  But that's hardly  
a conclusive case of having solved the situation - it's probably  
planning on surviving just long enough to last until the point I need  
it the most to work.   It sounds as though memory buffer allocations  
are going wacky or something, in which anything could take it down  
given the wrong combination of events.


In any case, we're running the amd64 version of
FreeBSD 5.4-RELEASE-p6 FreeBSD 5.4-RELEASE-p6 #3: Fri Aug  5 18:18:10 MDT 2005


A netstat -m (which I'd never tried before) yields:

18446744073709551402 mbufs in use
49/25600 mbuf clusters in use (current/max)
0/0/0 sfbufs in use (current/peak/max)
44 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
884 calls to protocol drain routines

Obviously, the number of mbufs in use on that machine is way out to
lunch.  And interestingly, my max mbuf clusters value of 25600 is
identical to the one in the other netstat -m reports from people
having this problem.
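
A figure that close to 2^64 is what a small negative count looks like when it is printed as an unsigned 64-bit integer.  A minimal Python sketch of that reading, assuming (the thread does not confirm it) that the counter is a 64-bit value that has dropped below zero; as_signed is a helper name made up for illustration:

```python
def as_signed(value: int, bits: int) -> int:
    """Reinterpret an unsigned counter of the given width as two's complement."""
    if value >= 2 ** (bits - 1):
        return value - 2 ** bits
    return value


reported = 18446744073709551402  # "mbufs in use" from the netstat -m output above
print(as_signed(reported, 64))   # -214
```

Read that way, the counter is 214 below zero rather than astronomically large, which fits a bookkeeping error better than real mbuf usage.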


Another machine (an older single CPU Dell) on which I'm running the  
386 version of FreeBSD 5.4-RELEASE-p5 FreeBSD 5.4-RELEASE-p5 #1: Thu  
Jul 21 22:30:46 MDT 2005 has a more sane netstat -m:


130 mbufs in use
128/8896 mbuf clusters in use (current/max)
0/177/2480 sfbufs in use (current/peak/max)
288 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
208493 requests for I/O initiated by sendfile
26697 calls to protocol drain routines

But here's about where any troubleshooting on my own reaches its  
limit.  I noticed that Kris mentioned it was a known problem in the  
stats counting for SMP machines and had been fixed, but haven't been  
able to find a reference to that, or any indication of how to do so.   
Is this fix supposed to have been an accounting bug in the report for  
netstat, or is it something which would have taken down the machine  
as has been happening?


If switching to single CPU mode works, it's good to hear that I have  
an option if things continue to act up.  But I'd really rather not  
have to "dumb down" the machine to one CPU when there is the  
potential of two.  Most of the time it's not under a huge load, but  
periodically there are massive spikes, and that's where having two
CPUs really helps.


If anyone can shed further light on a fix for this problem, it would  
be greatly appreciated!


Dan
--
Syzygy Research & Technology
Box 83, Legal, AB  T0G 1L0 Canada
Phone: 780-961-2213



Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-24 Thread Kris Kennaway
On Thu, Nov 24, 2005 at 09:45:08AM +0100, Rutger Bevaart wrote:
> Hi Kris,
> 
> I cannot find anything about that in the /usr/src/UPDATING for the 5.4
> branch.

I didn't say anything about UPDATING, I said the release errata.

> We're running "FreeBSD xyz 5.4-RELEASE-p5 FreeBSD 5.4-RELEASE-p5"
> and p6 and later only fix some IPSEC and SSL stuff.
> 
> Is it in 6.0 and if so, will somebody backport that fix?

Yes and as I said, it already was.

> > This is documented in the 5.4 errata, it's a leak in the stats
> > counting on SMP machines.  It was fixed after 5.4.

Kris




Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-24 Thread Rutger Bevaart
Hi Kris,

I cannot find anything about that in the /usr/src/UPDATING for the 5.4
branch. We're running "FreeBSD xyz 5.4-RELEASE-p5 FreeBSD 5.4-RELEASE-p5"
and p6 and later only fix some IPSEC and SSL stuff.

Is it in 6.0 and if so, will somebody backport that fix?

Regards
Rutger

On Wed, November 23, 2005 22:39, Kris Kennaway wrote:
> On Sun, Nov 20, 2005 at 07:24:25PM +0100, Rutger Bevaart wrote:
>> Strange indeed.
>>
>> On a 1750 with bge's:
>> 475 mbufs in use
>> 501/25600 mbuf clusters in use (current/max)
>> 0/3/6656 sfbufs in use (current/peak/max)
>> 1120 KBytes allocated to network
>> 0 requests for sfbufs denied
>> 0 requests for sfbufs delayed
>> 0 requests for I/O initiated by sendfile
>> 100 calls to protocol drain routines
>>
>> On a 2850 (hardware identical to an 1850):
>> $ netstat -m
>> 4294966848 mbufs in use
>> 565/25600 mbuf clusters in use (current/max)
>> 0/67/6656 sfbufs in use (current/peak/max)
>> 1018 KBytes allocated to network
>> 0 requests for sfbufs denied
>> 0 requests for sfbufs delayed
>> 16449 requests for I/O initiated by sendfile
>> 589 calls to protocol drain routines
>>
>> Both experience the "auto reboot" feature. The mbufs on the 2850 look
>> like a counter (signed/unsigned) bug, maybe even just in the
>> printing. Other than that I'm having a hard time interpreting these
>> results.
>
> This is documented in the 5.4 errata, it's a leak in the stats
> counting on SMP machines.  It was fixed after 5.4.
>
> Kris
>


Rutger Bevaart :: illian.networks



Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-23 Thread Kris Kennaway
On Sun, Nov 20, 2005 at 07:24:25PM +0100, Rutger Bevaart wrote:
> Strange indeed.
> 
> On a 1750 with bge's:
> 475 mbufs in use
> 501/25600 mbuf clusters in use (current/max)
> 0/3/6656 sfbufs in use (current/peak/max)
> 1120 KBytes allocated to network
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 100 calls to protocol drain routines
> 
> On a 2850 (hardware identical to an 1850):
> $ netstat -m
> 4294966848 mbufs in use
> 565/25600 mbuf clusters in use (current/max)
> 0/67/6656 sfbufs in use (current/peak/max)
> 1018 KBytes allocated to network
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 16449 requests for I/O initiated by sendfile
> 589 calls to protocol drain routines
> 
> Both experience the "auto reboot" feature. The mbufs on the 2850 look  
> like a counter (signed/unsigned) bug, maybe even just in the  
> printing. Other than that I'm having a hard time interpreting these  
> results.

This is documented in the 5.4 errata, it's a leak in the stats
counting on SMP machines.  It was fixed after 5.4.
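
The thread does not say what the mechanism of the leak was.  One generic way a shared statistics counter can drift on SMP is a lost update, where two CPUs do an unlocked read-modify-write at the same time.  The Python sketch below is a deterministic toy model of that idea only, not the FreeBSD code; the CPU names and the run helper are made up for illustration.

```python
def run(schedule):
    """Replay (cpu, step) events against one shared, unlocked counter."""
    counter = 0
    local = {}                      # each CPU's privately loaded copy
    delta = {"A": +1, "B": -1}      # CPU A allocates an mbuf, CPU B frees one
    for cpu, step in schedule:
        if step == "load":
            local[cpu] = counter
        else:                       # "store": write back the stale copy plus delta
            counter = local[cpu] + delta[cpu]
    return counter


# Each CPU finishes its update before the other starts.
serial = [("A", "load"), ("A", "store"), ("B", "load"), ("B", "store")]
# Both CPUs load the old value first, so A's increment is overwritten.
racy = [("A", "load"), ("B", "load"), ("A", "store"), ("B", "store")]

print(run(serial))  # 0
print(run(racy))    # -1
```

In this model every lost increment pushes the count one further below zero, and an unsigned printout of a negative count shows up as a huge number.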

Kris




Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-21 Thread Vivek Khera


On Nov 20, 2005, at 1:24 PM, Rutger Bevaart wrote:

> Both experience the "auto reboot" feature. The mbufs on the 2850
> look like a counter (signed/unsigned) bug, maybe even just in the
> printing. Other than that I'm having a hard time interpreting these
> results.


FreeBSD 4.x, 5.x, and 6.x have been stable for me on all Dell hardware.

4.x (currently 4.11) has been running on 1550's, 1650's, 2650 and  
1750's for > 3 years

5.4 on 2450  for ~6 months
6.0 on 1750, 1850, and 2650 since 6.0-RC2, currently running 6.0-REL.

Never a flake-out not due to a hardware failure, and that only on two
of the 1550s over 4 years' time.  I did have the 5.4 box running
5.4-REL-p7 lock up once, but was unable to determine the cause.




Re: FreeBSD unstable on Dell 1750 using SMP?

2005-11-20 Thread Rutger Bevaart

Strange indeed.

On a 1750 with bge's:
475 mbufs in use
501/25600 mbuf clusters in use (current/max)
0/3/6656 sfbufs in use (current/peak/max)
1120 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
100 calls to protocol drain routines

On a 2850 (hardware identical to an 1850):
$ netstat -m
4294966848 mbufs in use
565/25600 mbuf clusters in use (current/max)
0/67/6656 sfbufs in use (current/peak/max)
1018 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
16449 requests for I/O initiated by sendfile
589 calls to protocol drain routines

Both experience the "auto reboot" feature. The mbufs on the 2850 look  
like a counter (signed/unsigned) bug, maybe even just in the  
printing. Other than that I'm having a hard time interpreting these  
results.


Regards
Rutger Bevaart

On Nov 20, 2005, at 5:07 PM, Gino Ruopolo wrote:



Hello Rutger,

I read your post but I'm unable to reply on the list because of some
firewall settings.


I'm having the same problems with various Dell 1850s and FreeBSD 5.4.

Last week I noticed the following:

#netstat -m
4294899289 mbufs in use!?!?!??!!?
4294940375/25600 mbuf clusters in use (current/max) !?!?!?!??!
0/9/6656 sfbufs in use (current/peak/max)
4123460 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
34 requests for I/O initiated by sendfile
2533 calls to protocol drain routines

Here is the output of the same command on a different server with  
fxp0 ethernet driver, also FBSD 5.4 and doing the same work:


#netstat -m
194 mbufs in use
171/25600 mbuf clusters in use (current/max)
0/4/6656 sfbufs in use (current/peak/max)
390 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
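
The difference between the two outputs above is easy to check mechanically.  A rough Python sketch, assuming the netstat -m line formats shown in this thread; the 2**31 threshold is an arbitrary heuristic chosen for illustration, and suspicious_counters is a made-up helper name:

```python
import re

BAD = """\
4294899289 mbufs in use
4294940375/25600 mbuf clusters in use (current/max)
0/9/6656 sfbufs in use (current/peak/max)
"""

GOOD = """\
194 mbufs in use
171/25600 mbuf clusters in use (current/max)
0/4/6656 sfbufs in use (current/peak/max)
"""


def suspicious_counters(netstat_output: str) -> list[str]:
    """Return a note for each mbuf counter that cannot be a real count."""
    findings = []
    for line in netstat_output.splitlines():
        m = re.match(r"(\d+) mbufs in use", line)
        # Heuristic: anything at or above 2**31 looks like a wrapped counter.
        if m and int(m.group(1)) >= 2 ** 31:
            findings.append(f"mbufs in use implausibly large: {m.group(1)}")
        m = re.match(r"(\d+)/(\d+) mbuf clusters in use", line)
        # The current cluster count should never exceed the configured max.
        if m and int(m.group(1)) > int(m.group(2)):
            findings.append(f"clusters in use exceed max: {m.group(1)}/{m.group(2)}")
    return findings


for note in suspicious_counters(BAD):
    print(note)
print(suspicious_counters(GOOD))  # []
```

A check like this only flags counters that look wrapped; it says nothing about whether the reboots are related to them.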

So I've tried putting an old pci ethernet 10/100 using fxp driver  
on a Dell1850 suffering the "self-reboot" problem.  I'm getting 5  
days of uptime without a single reboot ...


What about a problem with the em driver?

Regards,
gino




