re: Had to revert from 5.3 to 4.11

2005-03-27 Thread Bruce Campbell

Ted Mittelstaedt said...
Bruce,

  Please do us a favor, these kinds of reports basically go into the
bit bucket when posted to the freebsd-questions mailing list.

  If you would be so kind, please run send-pr on your 4.11 systems
and send what you're seeing in as a bug.  Granted, since it's not
specific nobody is going to be able to send you a patch or some
such - but there is still value in these reports being in there, as
if others report the same trouble a correlation can be drawn.

  Also please list the model number of your SuperMicro motherboards.

Thanks!
Ted

Supermicro motherboard X5DPR-8G2+

It has been running 4.11 solidly for almost 2 months now.

Some more info on the system is here (about a problem experienced
on the same system):

  http://www.freebsd.org/cgi/query-pr.cgi?pr=bin/75855

Kris Kennaway said...
Probably this:

ftp://ftp.FreeBSD.org/pub/FreeBSD/ERRATA/notices/FreeBSD-EN-05:03.ipi.asc

Kris

I applied that towards the end of January.  Pretty sure anyway,
memory is fading.

We were running:

  FreeBSD 5.3-RELEASE-p5 #3

when we abandoned ship.  If memory serves, I rebuilt the kernel
only (not the world), when I applied those patches.

I have left the department that owns the server now, but
I've asked them to follow up to this mailing list when they
continue with the diagnosis.

 -Original Message-
 From: owner-freebsd-questions at freebsd.org
 [mailto:owner-freebsd-questions at freebsd.org]On Behalf Of Bruce Campbell
 Sent: Tuesday, March 01, 2005 6:01 PM
 To: freebsd-questions at freebsd.org
 Subject: Had to revert from 5.3 to 4.11



 Upgraded a large e-mail server from 4.7 to 5.3 late December/2004

 The 5.3 system never stayed up for more than 3 days (kernel panics
 - often while running vacation).

 A fair bit of fiddling trying to keep it running for about a month,
 then gave up.  Kept the kernel tree updated, no difference.

 Reverted to 4.11 about 3 weeks ago, no problems since.

 Also upgraded a web server to 5.3 during that time, and
 had to retreat also, same reasons.

 We do have a heavily loaded 5.2.1 system running well.

 Main difference between the crashy and reliable system
 is nfs home dirs on the mail and web servers.  nfs server
 is 4.7

 Same hardware in all cases, dual xeon supermicro.

 At a later time we will invest further diagnostic effort.
 Sorry for the lack of specifics.




-- 
Bruce Campbell
Manager, Science Computing
C2-260
University of Waterloo
(519)888-4567 ext 6991


This mail sent through www.mywaterloo.ca
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Had to revert from 5.3 to 4.11

2005-03-01 Thread Bruce Campbell

Upgraded a large e-mail server from 4.7 to 5.3 late December/2004

The 5.3 system never stayed up for more than 3 days (kernel panics
- often while running vacation).

A fair bit of fiddling trying to keep it running for about a month,
then gave up.  Kept the kernel tree updated, no difference.

Reverted to 4.11 about 3 weeks ago, no problems since.

Also upgraded a web server to 5.3 during that time, and
had to retreat also, same reasons.

We do have a heavily loaded 5.2.1 system running well.

Main difference between the crashy and reliable system
is nfs home dirs on the mail and web servers.  nfs server
is 4.7

Same hardware in all cases, dual xeon supermicro.

At a later time we will invest further diagnostic effort.
Sorry for the lack of specifics.

-- 
Bruce Campbell
Manager, Science Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 6991




flock failure on NFS from 5.3 client to 4.7 server

2005-01-13 Thread Bruce Campbell

NFS server:  FreeBSD 4.7
Old Mail server: FreeBSD 4.7, home directories mounted to NFS server
New Mail server: FreeBSD 5.3, home directories mounted to NFS server

After the mail server upgrade to 5.3, flock gives the error "Operation not
supported" on NFS-mounted home directories.  Example:

Jan 13 00:06:32 mail vacation[92816]: vacation: .vacation: Operation not supported

output from truss

open(.vacation.db,0x2,0640)      = 3 (0x3)
fstat(3,0xbfbfd350)              = 0 (0x0)
flock(0x3,0x2)                   ERR#45 'Operation not supported'
close(3)                         = 0 (0x0)

It appears someone else has done substantially more debugging than I:

  
http://lists.freebsd.org/pipermail/freebsd-questions/2004-September/059777.html

but is seemingly no further ahead.

On our NFS server, rpc.statd is running, but rpc.lockd wasn't.  Started
it, still no worky.  Killed it, other 4.7 clients still flock fine.

Any suggestions for a fix or workaround so vacation works (which depends
on flock) ?
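
To reproduce the failing call outside of vacation, a minimal C sketch along
these lines may help.  It is not taken from vacation's source; the function
name try_flock is made up for illustration, and O_CREAT is an addition so
that it also works on a path that does not exist yet.  It mirrors the
open/flock/close sequence seen in the truss output above.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Try to take and release an exclusive flock() on `path`.
 * Returns 0 on success, or the errno value of the first failing call
 * (for example EOPNOTSUPP, "Operation not supported").
 * try_flock is a hypothetical helper, not part of any library.
 */
int
try_flock(const char *path)
{
	int fd, err = 0;

	/* O_CREAT is an assumption; the truss output shows only O_RDWR (0x2). */
	fd = open(path, O_RDWR | O_CREAT, 0640);
	if (fd == -1)
		return errno;

	if (flock(fd, LOCK_EX) == -1)		/* LOCK_EX is 0x2, as in truss */
		err = errno;
	else if (flock(fd, LOCK_UN) == -1)
		err = errno;

	close(fd);
	return err;
}
```

Calling try_flock() on a file in an NFS-mounted home directory and on a file
on local disk, and printing strerror() of the result, should show whether the
error is specific to the NFS mount.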

Thanks,

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




Re: flock failure on NFS from 5.3 client to 4.7 server

2005-01-13 Thread Bruce Campbell
Quoting Kris Kennaway [EMAIL PROTECTED]:
  ...
  After the mail server upgrade to 5.3, flock gives error operation not
  supported on nfs mounted home directories.
  ...
  On our NFS server, rpc.statd is running, but rpc.lockd wasn't.  Started
  it, still no worky.  Killed it, other 4.7 clients still flock fine.
 
 rpc.lockd needs to be running on *both* client *and* server.
 
 4.x gets away with it because the rpc.lockd implementation does not in
 fact implement locking on the client.
 
 Kris


Thanks, that has fixed it, and I've added the appropriate rc.conf
settings on the client:

rpc_lockd_enable=YES   # Run NFS rpc.lockd needed for client/serv
rpc_statd_enable=YES   # Run NFS rpc.statd needed for client/serv
rpcbind_enable=YES # Run the portmapper service
 
and on the server:

rpc_lockd_enable=YES  # Run NFS rpc.lockd (*broken!*) if nfs_server.
rpc_statd_enable=YES  # Run NFS rpc.statd if nfs_server (or NO).


-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




Re: New FreeBSD 5.3 e-mail server extremely slow - traced to getpwnam maybe ?

2005-01-05 Thread Bruce Campbell
Quoting Kris Kennaway [EMAIL PROTECTED]:

 On Tue, Jan 04, 2005 at 09:27:27PM -0500, Bruce Campbell wrote:
 
  I wrote a small program:
  
    #include <sys/types.h>
    #include <pwd.h>
  
    int main( int argc, char *argv[] )
    {
        getpwuid( 13076 );
        return 0;
    }
  
  and ran it under truss on 5.x and it generated 178,711 lines of output.
  (the bulk of which is those lseek/read calls as above)
  ...
 
 Try tuning the pwd_mkdb parameters (see hash(3)) in
 /usr/src/usr.sbin/pwd_mkdb/pwd_mkdb.c and recompile:
 
 HASHINFO openinfo = {
         4096,           /* bsize */
         32,             /* ffactor */
         256,            /* nelem */
         2048 * 1024,    /* cachesize */
         NULL,           /* hash() */
         0               /* lorder */
 };
 
 e.g. adjust nelem to 12000 to accommodate your
 significantly-larger-than-average password database.  If this helps,
 please submit a PR requesting that someone make an option to pwd_mkdb
 to tune this at runtime (or better yet, submit the patch to do this
 yourself - it's straightforward to modify the source to do this).

Thanks.  That had no effect on the large number of seeks/reads
to do a getpwuid of a specific uid.  I tried boosting that
number further, still no change.  I suspect the problem is related
to some change to the hash functions between 4.7 and 5.2.1 and I
hope to get to the bottom of it today.

I tried two getpwnam (as opposed to getpwuid) calls on 2 different userids, one
took 1000 seek/reads, the other 16,000, so it's all
pretty random, no doubt related to how stuff gets hashed.  On
4.7 it takes just one or two reads/seeks.

As each login via ipop, imap, and each sendmail, and just about everything
will be doing getpwnam's I think this is our problem.

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




Re: New FreeBSD 5.3 e-mail server extremely slow - traced to getpwnam maybe ?

2005-01-05 Thread Bruce Campbell
Quoting Bruce Campbell [EMAIL PROTECTED]:
  On Tue, Jan 04, 2005 at 09:27:27PM -0500, Bruce Campbell wrote:
  
   I wrote a small program:
   
     #include <sys/types.h>
     #include <pwd.h>
   
     int main( int argc, char *argv[] )
     {
         getpwuid( 13076 );
         return 0;
     }
   
   and ran it under truss on 5.x and it generated 178,711 lines of output.
   (the bulk of which is those lseek/read calls as above)

It looks like the overhaul of getpwent Apr/2003 to make it thread safe:

  http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/gen/getpwent.c

may be the problem.

I've tested the dbm_fetch function independently on a large
file, and it is fine.

I've opened a bug report, and plan to build a replacement 4.x
mail server, as the most deterministic path to restoring
adequate e-mail service to our users.

Can anyone suggest a workaround ?

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




Re: New FreeBSD 5.3 e-mail server extremely slow - traced to getpwnam maybe ?

2005-01-05 Thread Bruce Campbell
Quoting Bruce Campbell [EMAIL PROTECTED]:
 Quoting Bruce Campbell [EMAIL PROTECTED]:
   On Tue, Jan 04, 2005 at 09:27:27PM -0500, Bruce Campbell wrote:
   
I wrote a small program:

  #include <sys/types.h>
  #include <pwd.h>

  int main( int argc, char *argv[] )
  {
      getpwuid( 13076 );
      return 0;
  }

and ran it under truss on 5.x and it generated 178,711 lines of output.
(the bulk of which is those lseek/read calls as above)
 
 It looks like the overhaul of getpwent Apr/2003 to make it thread safe:
 
   http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/gen/getpwent.c
 
 may be the problem.
 
 I've tested the dbm_fetch function independently on a large
 file, and it is fine.
 
 I've opened a bug report, and plan to build a replacement 4.x
 mail server, as the most deterministic path to restoring
 adequate e-mail service to our users.
 
 Can anyone suggest a workaround ?

Well, somewhat unbelievably, copying a getpwent.c from 4.7
and remaking libc on 5.3 with it worked.  Load average
has gone from 70 to 2.

And, so that this qualifies as a question...

Am I crazy to pull an old getpwnam from 4.7 and blindly
build it on 5.3 ?

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




Re: New FreeBSD 5.3 e-mail server extremely slow - traced to getpwnam maybe ?

2005-01-05 Thread Bruce Campbell
Quoting Bruce Campbell [EMAIL PROTECTED]:
 ...
 Well, somewhat unbelievably, copying a getpwent.c from 4.7
 and remaking libc on 5.3 with it worked.  Load average
 has gone from 70 to 2.
 

One of my co-workers has found a less kludgey workaround
for the high load problem we were seeing on 5.3 with
large /etc/master.passwd, as follows:

--- /etc/nsswitch.conf.old  Wed Jan  5 19:23:24 2005
+++ /etc/nsswitch.conf  Wed Jan  5 19:23:43 2005
@@ -1,7 +1,7 @@
-group: compat
+group: files
 group_compat: nis
 hosts: files dns
 networks: files
-passwd: compat
+passwd: files
 passwd_compat: nis
 shells: files

System is purring with load average under 1 now,
200,000 pop/imap sessions per day and 200,000 e-mails
per day, all spamassassinated.

For more details and ongoing followup, see:

  http://www.freebsd.org/cgi/query-pr.cgi?pr=75855

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




New FreeBSD 5.3 e-mail server extremely slow...

2005-01-04 Thread Bruce Campbell

We upgraded from a dual 1.66GHz AMD running FreeBSD 4.7
to a dual 3GHz Xeon running FreeBSD 5.3 and the new server
is painfully slow, even after turning spamassassin 
and yavr (yet another virus recipe) off.  Load
appears to be imapd/ipop3d (uw-imapd) related.
New server is Adaptec SCSI RAID, old one was 3ware ATA RAID,
but disk load is relatively low anyway.

It is a fairly high volume server, maybe 150,000 messages
per day and 150,000 pop/imap sessions per day.  But the old
box was doing relatively fine.

Turning off hyperthreading helped a lot, but not enough.

load average is around 48 now, I've set the 2 sendmail
conf load av settings to 48 so at least e-mail gets in.

A quick truss of an ipop3d process shows piles
of this streaming by...

setitimer(0,{0 0, 0 0},{0 0, 599 92})= 0 (0x0)
write(1,0x805a000,21)= 21 (0x15)
gettimeofday({1104857422 906783},0x0)= 0 (0x0)
setitimer(0,{0 0, 600 0},{0 0, 0 0}) = 0 (0x0)
read(0x0,0x8063000,0x832c)   = 10 (0xa)
setitimer(0,{0 0, 0 0},{0 0, 600 0}) = 0 (0x0)
write(1,0x805a000,14)= 14 (0xe)
gettimeofday({1104857422 908916},0x0)= 0 (0x0)
setitimer(0,{0 0, 600 0},{0 0, 0 0}) = 0 (0x0)

top shows 80-90% system activity.

About to revert to our old box and maybe nfs mount
/var/mail to make it less painful.  Any suggestions ?

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




Re: New FreeBSD 5.3 e-mail server extremely slow...

2005-01-04 Thread Bruce Campbell
Quoting Kris Kennaway [EMAIL PROTECTED]:

 On Tue, Jan 04, 2005 at 12:38:48PM -0500, Bruce Campbell wrote:
  
  We upgraded from a dual 1.66GHz AMD running FreeBSD 4.7
  to a dual 3GHz Xeon running FreeBSD 5.3 and the new server
  is painfully slow, even after turning spamassassin 
  and yavr (yet another virus recipe) off.  Load
  appears to be imapd/ipop3d (uw-imapd) related.
 
 Same version as you were running before?  Same configuration files?

Well, no, not quite.

old: imap-uw-2002_1,1
new: imap-uw-2004a,1

Just about all packages have undergone some updates on our
new server.  The only processes for which we have hundreds
running would be sendmail, procmail, ipop3d and imapd.
But, when I had the sendmail conf'ed to shutdown mail
when load av went over 12, load av would still shoot
up to 40 or 50 and stay there, and only major processes were imapd, ipop3d.
And I noticed them calling setitimer a lot, and 80% system usage.

I'm about to pull the zero channel adaptec scsi raid card, for no other reason
than I'm out of bright ideas.

 
 Can you show us your kernel configuration and dmesg?
 
 Kris

old: (difference from 4.7 GENERIC)

- cpu   I386_CPU
- cpu   I486_CPU
+ options   QUOTA   #enable disk quotas
+ options   SMP # Symmetric MultiProcessor Kernel
+ options   APIC_IO # Symmetric (APIC) I/O

new: (difference from 5.3 GENERIC)

Reverted to non SMP for now, only difference from GENERIC is...

 options   QUOTA

I did have 

 options   SMP

going for a while.  Removing SMP has made no difference in load
or responsiveness.  Actually seems slightly better on one CPU.

dmesg.boot from new system is as follows:

Copyright (c) 1992-2004 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 5.3-RELEASE #0: Thu Nov 25 15:48:15 EST 2004
[EMAIL PROTECTED]:/usr/src/sys/i386/compile/MAIL_SERVER
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 3.06GHz (3065.80-MHz 686-class CPU)
  Origin = GenuineIntel  Id = 0xf27  Stepping = 7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Hyperthreading: 2 logical CPUs
real memory  = 2146959360 (2047 MB)
avail memory = 2095419392 (1998 MB)
ACPI APIC Table: PTLTD  APIC  
ioapic0 Version 2.0 irqs 0-23 on motherboard
ioapic1 Version 2.0 irqs 24-47 on motherboard
ioapic2 Version 2.0 irqs 48-71 on motherboard
npx0: [FAST]
npx0: math processor on motherboard
npx0: INT 16 interface
acpi0: PTLTD   RSDT on motherboard
acpi0: Power Button (fixed)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: 24-bit timer at 3.579545MHz port 0x1008-0x100b on acpi0
cpu0: ACPI CPU (2 Cx states) on acpi0
pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0
pci0: ACPI PCI bus on pcib0
pci0: unknown at device 0.1 (no driver attached)
pcib1: ACPI PCI-PCI bridge at device 2.0 on pci0
pcib1: could not get PCI interrupt routing table for \\_SB_.PCI0.HLB_ - AE_NOT_FOUND
pci1: ACPI PCI bus on pcib1
pci1: base peripheral, interrupt controller at device 28.0 (no driver attached)
pcib2: ACPI PCI-PCI bridge at device 29.0 on pci1
pci2: ACPI PCI bus on pcib2
em0: Intel(R) PRO/1000 Network Connection, Version - 1.7.35 port 0x3000-0x303f mem 0xf820-0xf821 irq 54 at device 3.0 on pci2
em0: Ethernet address: 00:30:48:29:c5:a8
em0:  Speed:N/A  Duplex:N/A
em1: Intel(R) PRO/1000 Network Connection, Version - 1.7.35 port 0x3040-0x307f mem 0xf822-0xf823 irq 55 at device 3.1 on pci2
em1: Ethernet address: 00:30:48:29:c5:a9
em1:  Speed:N/A  Duplex:N/A
pci1: base peripheral, interrupt controller at device 30.0 (no driver attached)
pcib3: ACPI PCI-PCI bridge at device 31.0 on pci1
pci3: ACPI PCI bus on pcib3
asr0: Adaptec Caching SCSI RAID mem 0xfc00-0xfdff,0xfb00-0xfbff,0xf830-0xf83f irq 30 at device 3.0 on pci3
asr0: [GIANT-LOCKED]
asr0: ADAPTEC 2015S FW Rev. 3B05, 2 channel, 256 CCBs, Protocol I2O
uhci0: Intel 82801CA/CAM (ICH3) USB controller USB-A port 0x2000-0x201f irq 16 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: Intel 82801CA/CAM (ICH3) USB controller USB-A on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: Intel 82801CA/CAM (ICH3) USB controller USB-B port 0x2020-0x203f irq 19 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
usb1: Intel 82801CA/CAM (ICH3) USB controller USB-B on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: Intel 82801CA/CAM (ICH3) USB controller USB-C port 0x2040-0x205f irq 18 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
usb2: Intel 82801CA/CAM (ICH3) USB controller USB-C on uhci2
usb2: USB

Re: New FreeBSD 5.3 e-mail server extremely slow - traced to getpwnam maybe ?

2005-01-04 Thread Bruce Campbell
Quoting Kris Kennaway [EMAIL PROTECTED]:
  Well, no, not quite.
  
  old: imap-uw-2002_1,1
  new: imap-uw-2004a,1
 
 OK, that's where you should start, then.  Go back to the software
 configuration that you know is working and see if it still misbehaves.
 
 Kris

Thanks.  I shut down imapd/ipop3d completely so I just had sendmail running,
and the load av. was still 20-30.

Anyways, I have just found something very odd with both 5.2.1 and 5.3
on multiple different systems here, including a brand new GENERIC install.

On 5.x, ls -l or ps waux is very slow with our
/etc/master.passwd which has 11320 entries.  I truss'ed
those commands, and gave up after watching :

  lseek(4,0x17d000,SEEK_SET)   = 1560576 (0x17d000)
  read(0x4,0x8074000,0x1000)   = 4096 (0x1000)
  lseek(4,0x17e000,SEEK_SET)   = 1564672 (0x17e000)
  read(0x4,0x8062000,0x1000)   = 4096 (0x1000)
  lseek(4,0x17f000,SEEK_SET)   = 1568768 (0x17f000)
  read(0x4,0x8066000,0x1000)   = 4096 (0x1000)
  lseek(4,0x180000,SEEK_SET)   = 1572864 (0x180000)

scroll by for 10 minutes.  (handle 4 = /etc/spwd.db)

I wrote a small program:

  #include <sys/types.h>
  #include <pwd.h>

  int main( int argc, char *argv[] )
  {
      /* single lookup of one uid; run under truss to count syscalls */
      getpwuid( 13076 );
      return 0;
  }

and ran it under truss on 5.x and it generated 178,711 lines of output.
(the bulk of which is those lseek/read calls as above)

4.7 (with same master.passwd file) gave 59 lines of output, which seems
normal.
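
As a complement to counting truss lines, the cost of the lookup can also be
timed directly.  The following is only a sketch under assumptions: the helper
name time_getpwuid and the repeat count are made up for illustration, and it
measures wall-clock time rather than the number of lseek/read calls.

```c
#include <pwd.h>
#include <stddef.h>
#include <sys/time.h>
#include <sys/types.h>

/*
 * Call getpwuid(uid) `count` times and return the elapsed wall-clock
 * time in seconds.  A uid that does not exist is not treated as an
 * error: the cost of the lookup is what is being measured.
 * time_getpwuid is a hypothetical helper, not a libc function.
 */
double
time_getpwuid(uid_t uid, int count)
{
	struct timeval start, end;
	int i;

	gettimeofday(&start, NULL);
	for (i = 0; i < count; i++)
		(void)getpwuid(uid);
	gettimeofday(&end, NULL);

	return (double)(end.tv_sec - start.tv_sec) +
	    (double)(end.tv_usec - start.tv_usec) / 1e6;
}
```

Running the same call, for example time_getpwuid(13076, 100), on the 4.7 and
5.x machines with the same master.passwd would give a rough per-lookup figure
to compare.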

I'm speculating that imap and sendmail and just about everything use
getpwuid and getpwuid is misbehaving on 5.x especially with a large
master.passwd file.

I will report this through the proper mechanism once I do
just a bit more testing.  And perhaps it is a known issue
already and I'll look into that also.  Or perhaps I have messed
something up unwittingly, which I have been known to do.

We do have an extremely busy 5.2.1 system running here fine on
the same hardware; it just has a small /etc/master.passwd, which may
explain that system's success to date.

Thank you to everyone who sent suggestions.

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




apparent change in php4 port build procedure...

2004-12-21 Thread Bruce Campbell

I'm upgrading to mod_php4-4.3.10

In the past, the make procedure presented me with a detailed
menu of options.  Now, it appears to ask me just these 3 questions:

 - apache 1 vs 2
 - debug
 - ipv6

and not all the other stuff like mysql, imap, and so forth.

I can easily add the configure args I want to /usr/ports/lang/php4/Makefile,
like this:

--with-mysql=/usr/local \
--with-layout=GNU \
--with-config-file-scan-dir=${PREFIX}/etc/php \
--with-zlib-dir=/usr \
--with-regex=php \
--enable-ftp \

But I liked the old menu system, as it saved me figuring out
the configure args.  Was there a reason to move away from that,
or is there a new mechanism I am not aware of ?

Thanks,

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




getting an ls -l from a dump type file system backup...

2003-11-22 Thread Bruce Campbell

I use dump/restore for file system backups, and I'd like
to be able to get a detailed ls -l type listing from the backup.
(ie something with dates/times/sizes, unlike what restore -t
or ls in restore -i does)

Does anyone know of any utilities to do this ?

After each backup, I'd like to be able to put the
details of all files backed up into a database, so
I can see what versions of each file I've got available
before restoring them.

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




Re: ipfw and divert and trying to do something clever (never mind)

2003-10-08 Thread Bruce Campbell

never mind.  ipfw fwd does exactly what I am after,
I misunderstood the command line.


Quoting Bruce Campbell [EMAIL PROTECTED]:
 
 I have some machines behind a freebsd firewall, and I'm using ipfw.
 
 Presently, I reset attempts to smtp past the firewall:
 
   reset tcp from [subnet] to any 25
 
 but I'd like to divert them to my own smtp server, so it doesn't
 matter what the clients try to use.
 
 I thought this would be easy.  Maybe it is.
 
 The fwd feature doesn't seem to do it, as it just forwards a
 specific ipaddr[,port] (no subnet/mask)
 
 divert looks like the way to do it, and after a few hours of
 fiddling with a program that opens a divert socket, I can watch
 all manner of traffic going back and forth, but each time
 I attempt to send it elsewhere, I get nowhere.  I am duly
 setting both the ip and tcp checksum, before re-injection.
 
 Somebody else must have done this, and/or I must be doing it
 the wrong way.
 
 Any suggestions ?  Please e-mail me directly also as I am
 not on this list.  A code snippet using divert would
 be excellent.
 
 -- 
 Bruce Campbell
 Engineering Computing
 CPH-2374B
 University of Waterloo
 (519)888-4567 ext 5889
 
 
 This mail sent through www.mywaterloo.ca
 


-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




ipfw and divert and trying to do something clever

2003-10-06 Thread Bruce Campbell

I have some machines behind a freebsd firewall, and I'm using ipfw.

Presently, I reset attempts to smtp past the firewall:

  reset tcp from [subnet] to any 25

but I'd like to divert them to my own smtp server, so it doesn't
matter what the clients try to use.

I thought this would be easy.  Maybe it is.

The fwd feature doesn't seem to do it, as it just forwards a
specific ipaddr[,port] (no subnet/mask)

divert looks like the way to do it, and after a few hours of
fiddling with a program that opens a divert socket, I can watch
all manner of traffic going back and forth, but each time
I attempt to send it elsewhere, I get nowhere.  I am duly
setting both the ip and tcp checksum, before re-injection.

Somebody else must have done this, and/or I must be doing it
the wrong way.

Any suggestions ?  Please e-mail me directly also as I am
not on this list.  A code snippet using divert would
be excellent.

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




ipfw2 loss of feature ?

2003-09-14 Thread Bruce Campbell


With ipfw1 on 4.8 I use this:

ipfw add 10 check-state
ipfw add 20 allow tcp from xxx.xxx.xxx.0/24 to any keep-state limit src-addr 10

to provide stateful firewalling, and limit the number of simultaneous
tcp sessions to 10 per client.  Seems to work great.

On 4.8 I tried ipfw2

(kernel with options IPFW2 and rebuilt ipfw and libalias with -DIPFW2
as instructed in man ipfw)

When I tried ipfw2 (as I wanted keepalives), I got an error
when I ran ipfw:

  only one of keep-state and limit is allowed

How can I do both the stateful firewalling and limit
the simultaneous sessions, with ipfw2 ?

Thanks



ps. As an aside,  I also patch /usr/src/sys/netinet/ip_fw.c to
be more verbose when it drops a session...

--- ip_fw.c Sun Sep 14 15:33:16 2003
+++ ip_fw.old   Sun Sep 14 15:31:10 2003
@@ -999,9 +999,7 @@
        if (fw_verbose && last_log != time_second) {
                last_log = time_second;
                log(LOG_SECURITY | LOG_DEBUG,
-                   "drop session 0x%08x %u -> 0x%08x %u, TOO many entries\n",
-                   (args->f_id.src_ip), (args->f_id.src_port),
-                   (args->f_id.dst_ip), (args->f_id.dst_port));
+                   "drop session, too many entries\n");
        }
        return 1;
 }


-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




Re: problem on 1TB filesystem RAID 5 3ware

2003-03-22 Thread Bruce Campbell

Some more test results:

22 Mar 2003 - test with RAID 5 with 4 * WD 200GB with Write Cache disabled 
*succeeded*. Write Cache can be disabled through the 3ware BIOS, or the 3ware 
web management tool. Raw write performance to the array dropped from 
3KBytes/Second to 4500KBytes/Second, however this did not impact the test 
significantly, as the test involved copying data via an NFS mount on a 100 
MBit/second network. The effective speed of the NFS copy dropped from around 
5000 KBytes/Second to about 4500 KBytes/Second with Write Cache disabled. 

Thread here:

http://oss.sgi.com/projects/xfs/mail_archive/200211/msg00056.html

suggests firmware/driver mismatches can cause trouble, and someone
else who had trouble found turning off write cache fixed it.


All my info on this problem being kept here:

http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BackupServerProblem

Quoting MikeM [EMAIL PROTECTED]:
 Jim King [EMAIL PROTECTED] wrote:
 
 Which tells me that all of the work in the last year has been 
 maintenance related to changes within FreeBSD itself, and not any 
 updates for 3Ware functionality, e.g no support for firmware 7.5.x 
 on the 7000 series controllers, and no support for the 8000 series 
 controllers.
 
 
 If the above is true, perhaps the hardware guide should be modified.  It
 currently says that the 3Ware 7000 series is supported.
 
 I, for one, purchased a 3Ware controller for my FreeBSD server based upon
 the misleading hardware guide.  
 
 
 To Unsubscribe: send mail to [EMAIL PROTECTED]
 with unsubscribe freebsd-hardware in the body of the message
 


-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889


This mail sent through www.mywaterloo.ca

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-questions in the body of the message


Re: problem on 1TB filesystem RAID 5 3ware

2003-03-20 Thread Bruce Campbell

I opened a case with 3ware tech support and they responded:

We do not support FreeBSD plus the
current driver for FreeBSD has not been
updated for some time to keep up with
firmware changes. 

Please try linux instead. 

So I guess I will try that and see what happens.


Quoting Simon [EMAIL PROTECTED]:
 
 I have a hard time believing that hardware implementation of
 RAID5 would corrupt files over RAID10, perhaps your 3ware
 card/its firmware is malfunctioning, but anything is possible.
 Well, I'll have to see for myself, I'm about to build RAID5 NAS
 using maxtor drives and 7500-8 3ware card.
 
 -Simon
 
 On Tue, 18 Mar 2003 08:05:36 -0500, Bruce Campbell wrote:
 
 
 
 Tested with RAID 10 instead of RAID 5, success !
 
 RAID 10 arrays tested:  6xWD200GB and 8xWD200GB both worked
 RAID 5 arrays tested: 6xWD200GB and 4xWD200GB both failed
 
 Note: 3ware lists the WD 200GB disk as Under Test. (ie they have not yet 
 given it a Compatible rating) 
 
 details of tests and the procedure to detect the failure etc at
 
 http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BackupServerProblem
 
 I still have to try an officially approved drive.
 
 
 
 
 This mail sent through www.mywaterloo.ca
 
 To Unsubscribe: send mail to [EMAIL PROTECTED]
 with unsubscribe freebsd-hardware in the body of the message
 
 
 
 


-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




Re: problem on 1TB filesystem RAID 5 3ware

2003-03-18 Thread Bruce Campbell


Tested with RAID 10 instead of RAID 5, success !

RAID 10 arrays tested:  6xWD200GB and 8xWD200GB both worked
RAID 5 arrays tested: 6xWD200GB and 4xWD200GB both failed

Note: 3ware lists the WD 200GB disk as Under Test. (ie they have not yet 
given it a Compatible rating) 

details of tests and the procedure to detect the failure etc at

http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BackupServerProblem

I still have to try an officially approved drive.






Re: problem on 1TB filesystem RAID 5 3ware

2003-03-14 Thread Bruce Campbell

I have not solved this yet, but I have ruled out a few possible causes.
Info at:

http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BackupServerProblem

I tested with soft updates both off and on.  It fails in either case, so
that isn't it.

The problem now seems to be either:

  - the 3ware card or its driver
  - something to do with the large filesystem



-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




problem on 1TB filesystem RAID 5 3ware

2003-03-12 Thread Bruce Campbell

File corruption on 2 identical systems, designed to be backup
servers to contain dumps of other systems:

FreeBSD ecserv18.uwaterloo.ca 4.7-RELEASE FreeBSD 4.7-RELEASE #0: Wed Oct  9 
15:08:34 GMT 2002 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC  
i386

with 1TB /backup partition, on a 3ware 7500-8 ATA RAID card, RAID 5:

Filesystem     1K-blocks      Used     Avail Capacity  Mounted on
/dev/twed0s1a   20644846    906552  18086708     5%    /
procfs                 4         4         0   100%    /proc
/dev/twed0s1e  938819776 279031856 584682338    32%    /backup

disks are 6 x Western Digital 2000JB  (200GB)

I ran tests on /backup for 10 days on each system (fill the disk with
50GB files of pseudo-random data, read them all back and verify the
contents, erase them, then start over).  The tests ran perfectly.
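
For readers who want to reproduce the idea, here is a minimal sketch of one
fill / read back / verify / erase pass.  It is not the burn-in program linked
elsewhere in this thread: the function names, the 64 KiB block size, and the
choice of a seeded PRNG plus SHA-256 are all assumptions made for illustration,
and the demo sizes are tiny compared to the 50GB files used in the real tests.

```python
import hashlib
import os
import random
import tempfile

CHUNK = 64 * 1024  # write/read block size for this sketch (an assumption)


def write_random_file(path, size, seed):
    """Write `size` bytes of seeded pseudo-random data; return its SHA-256 hex digest."""
    rng = random.Random(seed)
    digest = hashlib.sha256()
    remaining = size
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(CHUNK, remaining)
            block = rng.getrandbits(8 * n).to_bytes(n, "little")
            f.write(block)
            digest.update(block)
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Re-read the file and compare its SHA-256 with the digest taken at write time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest() == expected_hex


def burn_in_pass(directory, file_count, file_size, seed=0):
    """Fill, read back, verify, then erase.  Returns the paths that failed to verify."""
    written = {}
    for i in range(file_count):
        path = os.path.join(directory, f"fill{i:04d}.dat")
        written[path] = write_random_file(path, file_size, seed + i)
    bad = [path for path, digest in written.items() if not verify_file(path, digest)]
    for path in written:
        os.remove(path)
    return bad


if __name__ == "__main__":
    # Small demo in a temporary directory; a real run would target the
    # filesystem under test and use far larger files.
    with tempfile.TemporaryDirectory() as scratch:
        print("files that failed verification:", burn_in_pass(scratch, 3, 200_000))
```

Because the data is regenerated from a seed, a variant of this could also
recompute the expected bytes instead of storing digests, which would help to
show where in a file a mismatch begins.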

details on hardware config at:

http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BackupServerHardware

Then I was ready to put the systems into production, so I copied data
from my 2 older backup servers (which have 360GB vinum partitions) via
an NFS mount.  After the copy (approx 250GB in 325 files), about a dozen
files were corrupt.

All corruption started on a 64K boundary, except in one file where it
started on a 16K boundary.  I recopied the dozen corrupt files, and then
only 6 were corrupt.  Same problem on both systems, each of which copied
from a different source server.

Once the corruption starts, the file seems to be corrupt to the end.  I
have not looked for a pattern to see whether the bad data is another
file's contents or misplaced contents from the same file.
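
A minimal sketch of how the start of the corruption can be located and checked
against block boundaries, assuming a known-good source copy is still available
for comparison.  The function names and the list of candidate block sizes are
hypothetical; this is not the procedure documented on the wiki page.

```python
def first_difference(path_a, path_b, chunk=1 << 20):
    """Return the byte offset of the first difference, or None if the files match."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk)
            block_b = b.read(chunk)
            if block_a != block_b:
                for i, (x, y) in enumerate(zip(block_a, block_b)):
                    if x != y:
                        return offset + i
                # No differing byte in the overlap: one file is shorter.
                return offset + min(len(block_a), len(block_b))
            if not block_a:
                return None  # both files ended without a difference
            offset += len(block_a)


def largest_alignment(offset, candidates=(64 * 1024, 16 * 1024, 4096, 512)):
    """Return the largest candidate block size that divides `offset`, or None."""
    for size in candidates:
        if offset % size == 0:
            return size
    return None
```

Running `first_difference(source, copy)` on each corrupt file and passing the
result to `largest_alignment` gives the kind of "64K boundary" / "16K boundary"
figure quoted above.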

fsck shows no problems.

I restarted my test that fills the disk with 50GB files, and it has again
run perfectly.

I plan to try:

  - turn off soft updates
  - RAID 10 instead of 5
  - different file system parameters, for example I don't need
100 million inodes.
  - rcp'ing the files
  - staring at computer screen

By the way, 3ware has not officially approved the WD 200GB drive last
time I checked.  

Lots of good experience with the motherboard (ASUS P4S533) and
network card (Intel Pro/100).  Lots of good experience with
vinum striped partitions of smaller size (360GB)

Does anyone have any suggestions ?

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




Re: problem on 1TB filesystem RAID 5 3ware

2003-03-12 Thread Bruce Campbell
Quoting Simon [EMAIL PROTECTED]:
 
 I can only hope I don't have the same issue. I'm currently building a 1.75TB
 NAS to do daily backups using 3ware 7500-8 and maxtor drives.

Tiny bit more info:

 - NFS was starting to be implicated, but on one of my backup servers
   I had let it run 2 dumps of our Network Appliance, basically:

      rsh netapp dump ... | gzip > file

   I then ran gunzip -t to test the files, and both were corrupt.

   The backup system I've been running with vinum for a long time
   does a weekly gunzip -t on all files, and I've not seen a problem
   before.

This also removes the network card from suspicion: if it were the
problem, the .gz file would still be valid (it would just contain
compressed garbage, but it would not itself be corrupt).
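
The reasoning above can be illustrated with a small sketch that does roughly
what gunzip -t does: decompress the whole file and discard the output, so that
a damaged stream or a trailer mismatch shows up as an error.  The function name
is hypothetical and this uses Python's gzip module rather than the gunzip
binary used in the actual tests.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk=1 << 20):
    """Return True if the whole gzip file decompresses and its checksum matches."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk):
                pass  # discard the data; only the integrity check matters
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad header or CRC mismatch, EOFError a truncated
        # file, zlib.error a damaged deflate stream.
        return False
```

Garbage that arrives over the network and is then compressed locally still
produces a file for which this returns True; bytes damaged after gzip has
written them make it return False, which is what was observed here.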

Here is the program I wrote to test the partitions:

http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BurnInProcedure

(obviously not an outstanding test, since it passed my system)

 
 -Simon
 


-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889




swap_pager: indefinite wait buffer but no disk errors

2003-01-07 Thread Bruce Campbell
 /kernel: pid 68914 (file1), uid 0 on /test: file system full
Jan  6 17:50:20 ecserv15 /kernel: swap_pager: indefinite wait buffer: device: #ad/0x20001, blkno: 504, size: 4096
Jan  6 17:50:50 ecserv15 /kernel: swap_pager: indefinite wait buffer: device: #ad/0x20001, blkno: 504, size: 4096
Jan  6 22:16:43 ecserv15 /kernel: pid 69461 (file1), uid 0 on /test: file system full
Jan  7 00:00:45 ecserv15 newsyslog[69749]: logfile turned over
Jan  7 00:00:45 ecserv15 newsyslog[69749]: logfile turned over
Jan  7 03:50:20 ecserv15 /kernel: swap_pager: indefinite wait buffer: device: #ad/0x20001, blkno: 504, size: 4096
Jan  7 03:50:57 ecserv15 /kernel: swap_pager: indefinite wait buffer: device: #ad/0x20001, blkno: 504, size: 4096
Jan  7 06:56:44 ecserv15 /kernel: pid 70291 (file1), uid 0 on /test: file system full


-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889





Re: ata fallback to PIO mode on dual processor AMD systems

2003-01-05 Thread Bruce Campbell
Quoting Bruce Campbell [EMAIL PROTECTED]:

 Quoting Matthew Emmerton [EMAIL PROTECTED]:
 
  [ cc'ing Soren since he's the ATA guru ]
  
   Dec 30 23:27:00 ecserv13 /kernel: ad0: trying fallback to PIO mode
   Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
  
   The test continues to run with the ata controller in PIO mode, with
   slower performance, and higher load average.
  
   Once the master drops to PIO, attempts to access the slave then cause
   it to drop to PIO.
 
  Are you using 80-conductor cables on all your drives?  These are required
  to get consistent high throughput, and running without them may cause the
  problems you're seeing.
 
 Thanks for the information about the design of IDE etc, and the suggestion
 about the cables.  I was about to shuffle things to get the disks
 onto separate channels, but I now see that would be a mistake as my
 CD drive would share a cable with a disk.

P.S.  As an aside, I have since determined that putting a PIO device and
 a UDMA device on the same channel does not affect the performance
 of the UDMA device unless the PIO device is in use.  So sharing a cable
 between a low-use CD-ROM drive and a disk wouldn't be so bad.

 I am puzzled by the fallback to PIO concept.  If a disk gives
 some sort of timeout error or whatever, why would trying
 PIO correct the problem?  That seems equivalent to asking the
 disk to do the same thing, just more slowly.

 In my case, some sort of timeout error occurs on ad0, so
 it falls back to PIO, and works.  A later access to ad1
 also yields a timeout error, and then it drops to PIO,
 and works too.  I'm fairly confident both disks did not 
 experience media errors at the same time, which suggests 
 a problem with the onboard IDE controller, or a driver bug.

 Tests continue...

 









Followup to fallback to PIO mode on dual processor AMD systems

2003-01-02 Thread Bruce Campbell

By the way, I've determined our removable IDE disk trays are manufactured
by SNT (http://www.snt.com.tw/metal.htm) and are part number
SNT-129.  It looks like these are the same ones startech sells.
I've placed my hardware configuration here:

http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/DualAmd2000

Out of my 4 AMD systems, my test results are now:

 - 1 refuses to die
 - 1 panicked and died after not being able to drop to PIO, with many
   fsck errors upon reboot.  The console error was "ata0: resetting devices
   .. ad0: DMA limited to UDMA33, non-ATA66 cable or device"
 - 2 dropped to PIO after about 15 hours of tests, and ran fine
   (but slowly) with PIO

As for the 2 that dropped to PIO and kept working, I rebooted and manually ran

  atacontrol mode 0 UDMA33 UDMA33

and restarted the tests.  No problems in 36 hours so far.  My 4 Intel
systems (which only have a UDMA33 controller on the motherboard)
have also been running for 48 hours with no problems.

The test I run is...

  dbench 1
  sleep 300
  dbench 2
  sleep 300
  dbench 3
  ... up to about dbench 80 and then I kill and restart.

With UDMA100, dbench 10 gave 43 MB/Sec
With UDMA33, dbench 10 gives 37 MB/Sec
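
To put the two figures above in proportion, a small worked calculation (the
function name is just for illustration):

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


udma100_mb_per_sec = 43.0  # dbench 10 with UDMA100, from the figures above
udma33_mb_per_sec = 37.0   # dbench 10 with UDMA33, from the figures above

change = percent_change(udma100_mb_per_sec, udma33_mb_per_sec)
print(f"UDMA33 vs UDMA100 at dbench 10: {change:.1f}%")
```

So forcing UDMA33 costs roughly 14% of dbench throughput at this load, which
is a small price if it keeps the systems from dropping to PIO.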

I still plan to:

 - try UDMA100 with the drives directly attached (ie. no removable tray)
 - maybe try a non onboard IDE controller
 - shuffle the disks to see if the problems follow the disks or not

At present, I don't suspect bad media, because the error message is
"WRITE command timeout tag=0 serv=0", which doesn't point to a specific
sector/track etc., and running with UDMA33 instead of UDMA100 makes the
problem appear to vanish.







Re: ata fallback to PIO mode on dual processor AMD systems

2003-01-02 Thread Bruce Campbell
Quoting Francesco Casadei [EMAIL PROTECTED]:
 On Tue, Dec 31, 2002 at 03:57:16PM -0500, Bruce Campbell wrote:
  
  I am seeing a problem with ata disks on 4 new systems, which
  I believe is either a bug in the ata driver, or a problem with
  the onboard IDE controller, or something else.  Systems are as follows:
  ...
  Motherboard: ASUS A7M266-D
  CPUs   : 2 x 2000+ AMD MP
  Memory : 2 x 512MB Crucial part: CT6472Y265
  Dec 30 23:26:59 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
  Dec 30 23:26:59 ecserv13 /kernel: ata0: resetting devices .. done
  Dec 30 23:26:59 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
  Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
  Dec 30 23:27:00 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
  Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
  Dec 30 23:27:00 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
  Dec 30 23:27:00 ecserv13 /kernel: ad0: timeout waiting for cmd=ef s=d0 e=00
  Dec 30 23:27:00 ecserv13 /kernel: ad0: trying fallback to PIO mode

 Same problem here, but slightly different configuration:
 
 # atacontrol list
 ATA channel 0:
 Master:  ad0 IC35L040AVER07-0/ER4OA44A ATA/ATAPI rev 5
 Slave:   no device present
 ATA channel 1:
 Master: acd0 LG CD-ROM CRD-8521B/1.03 ATA/ATAPI rev 0
 Slave:   no device present
 ATA channel 2:
 Master:  ad4 IC35L040AVER07-0/ER4OA44A ATA/ATAPI rev 5
 Slave:   no device present
 ATA channel 3:
 Master:  ad6 IC35L040AVER07-0/ER4OA44A ATA/ATAPI rev 5
 Slave:   no device present
 
 ad4 and ad6 are attached to a Promise FastTrak 100 TX2 ATA RAID controller.
 
 # atacontrol mode 0
 Master = UDMA100 
 Slave  = ???
 
 # atacontrol mode 1
 Master = PIO4 
 Slave  = ???
 
 # atacontrol mode 2
 Master = UDMA100 
 Slave  = ???
 
 # atacontrol mode 3
 Master = PIO4 
 Slave  = ???
 
 ad6 falls back to PIO mode under heavy I/O activity, i.e. when the system
 does a level 0 file system dump from the RAID 1 array (ad4, ad6) to the
 backup disk ad0.
 Rebooting and rebuilding the array with the Promise BIOS utility temporarily
 solves the problem.  The system may be up and running for 1-4 weeks, doing a
 level 0 dump every morning at 5:30am, and then one day the drive ad6 falls
 back to PIO mode again (a little before the completion of the fs dump).

 Do the hard drives you are using support ATA tagged queuing?  And if so,
 do you have TQ enabled?

I don't have it enabled:

  hw.ata.tags: 0

I've manually set:

  atacontrol mode 0 UDMA33 UDMA33

and the problem has not recurred.

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889





Re: Followup to fallback to PIO mode on dual processor AMD systems

2003-01-02 Thread Bruce Campbell
Quoting Bruce Evans [EMAIL PROTECTED]:
 On Thu, 2 Jan 2003, Bruce Campbell wrote:
 
  At present, I don't suspect bad media because the error message is
  WRITE command timeout tag=0 serv=0 which doesn't suggest a specific
  sector/track etc, and running with UDMA33 instead of UDMA100 makes the
  problem appear to vanish.
 
 The fallback is clearly wrong because it turns isolated media errors
 into pessimized i/o for the whole disk at best, system hangs during
 resets next best, and system crashes at worst.  I keep a disk with bad
 media on line for testing some of this, and zap the fallback using the
 following patch (hope this is complete; it was edited from a larger
 patch).

Thanks for the patch.  Under moderate load, I am seeing occasional
instances of:

/kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
/kernel: ata0: resetting devices .. done

and everything keeps on working normally via DMA, i.e. it does not drop to PIO.

The more menacing case is this:

Dec 30 23:26:59 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:26:59 /kernel: ata0: resetting devices .. done
Dec 30 23:26:59 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ad0: timeout waiting for cmd=ef s=d0 e=00
Dec 30 23:27:00 /kernel: ad0: trying fallback to PIO mode
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done

So it appears the disk would no longer work with DMA, but would work with
PIO.  If it is manually set back to UDMA with the atacontrol command, it
times out again and falls back to PIO.

However, after a soft reboot all is well again.
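
To tell the two cases apart in a long log, something like the following sketch
could be used.  It only recognises the two message texts that appear in this
thread; the function name and the summary layout are assumptions, and other
kernel versions may word these messages differently.

```python
import re

# Patterns copied from the log lines shown in this thread.
TIMEOUT_RE = re.compile(r"(ad\d+): WRITE command timeout")
FALLBACK_RE = re.compile(r"(ad\d+): trying fallback to PIO mode")


def summarize(lines):
    """Count timeouts per disk and note whether each disk fell back to PIO."""
    summary = {}
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            entry = summary.setdefault(match.group(1), {"timeouts": 0, "fell_back": False})
            entry["timeouts"] += 1
            continue
        match = FALLBACK_RE.search(line)
        if match:
            entry = summary.setdefault(match.group(1), {"timeouts": 0, "fell_back": False})
            entry["fell_back"] = True
    return summary


SAMPLE = """\
Dec 30 23:26:59 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:26:59 /kernel: ata0: resetting devices .. done
Dec 30 23:26:59 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ad0: timeout waiting for cmd=ef s=d0 e=00
Dec 30 23:27:00 /kernel: ad0: trying fallback to PIO mode
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done
""".splitlines()

print(summarize(SAMPLE))
```

A disk with a few timeouts and no fallback is the harmless case; a disk with
"fell_back" set is the menacing one.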

 
 %%%
 Index: ata-disk.c
 ===
 RCS file: /home/ncvs/src/sys/dev/ata/ata-disk.c,v
 retrieving revision 1.139
 diff -u -2 -r1.139 ata-disk.c
 --- ata-disk.c17 Dec 2002 16:26:22 -  1.139
 +++ ata-disk.c18 Dec 2002 01:03:37 -
 @@ -597,5 +606,5 @@
   else {
   ata_dmainit(adp->device, ata_pmode(adp->device->param), -1, -1);
 - printf(" falling back to PIO mode\n");
 + printf(" NOT falling back to PIO mode\n");
   }
   TAILQ_INSERT_HEAD(&adp->device->channel->ata_queue, request, chain);
 @@ -603,4 +612,5 @@
   }
 
 +#if 0
   /* if using DMA, try once again in PIO mode */
   if (request->flags & ADR_F_DMA_USED) {
 @@ -613,4 +623,5 @@
   return ATA_OP_FINISHED;
   }
 +#endif
 
   request-flags |= ADR_F_ERROR;
 %%%
 
 Bruce
 


-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889





ata fallback to PIO mode on dual processor AMD systems

2002-12-31 Thread Bruce Campbell

I am seeing a problem with ata disks on 4 new systems, which
I believe is either a bug in the ata driver, or a problem with
the onboard IDE controller, or something else.  Systems are as follows:

Motherboard: ASUS A7M266-D
CPUs   : 2 x 2000+ AMD MP
Memory : 2 x 512MB Crucial part: CT6472Y265

Disks (all UDMA100):

           Master        Slave
System 1:  WDC WD400BB   WDC WD1000BB
System 2:  WDC WD400BB   WDC WD1000BB
System 3:  WDC WD400BB   WDC WD800BB
System 4:  WDC WD400BB   Maxtor 98196H8

Kernel : 4.7-RELEASE, custom kernel (compared to GENERIC):

commented out:

 cpu   I386_CPU
 cpu   I486_CPU

enabled 

 options   SMP # Symmetric MultiProcessor Kernel
 options   APIC_IO # Symmetric (APIC) I/O


I am running a test with dbench (/usr/ports/benchmarks/dbench)
with a script which runs:

  dbench 1
  sleep for 5 minutes
  dbench 2
  sleep for 5 minutes
  dbench 3
  ...

to simulate 1,2,3... clients.
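
A minimal sketch of such a ramp, written in Python rather than as the shell
script actually used.  The function name, the injectable runner and sleeper
arguments, and the decision not to sleep after the final run are all
assumptions; only the "dbench N" invocation and the 5 minute pause come from
the description above.

```python
import subprocess
import time


def run_ramp(max_clients, pause_seconds=300, runner=subprocess.run, sleeper=time.sleep):
    """Run dbench 1, dbench 2, ... dbench max_clients with a pause between runs."""
    for clients in range(1, max_clients + 1):
        # check=False: keep ramping even if one dbench run exits with an error
        runner(["dbench", str(clients)], check=False)
        if clients < max_clients:
            sleeper(pause_seconds)
```

Calling run_ramp(80) from a loop would approximate the test described here;
the runner and sleeper parameters exist only so the schedule can be checked
without actually running dbench.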

The following has happened on systems 2,3 and 4, after about 15 hours
of running the test:

Dec 30 23:26:59 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:26:59 ecserv13 /kernel: ata0: resetting devices .. done
Dec 30 23:26:59 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 ecserv13 /kernel: ad0: timeout waiting for cmd=ef s=d0 e=00
Dec 30 23:27:00 ecserv13 /kernel: ad0: trying fallback to PIO mode
Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done

The test continues to run with the ata controller in PIO mode, with
slower performance, and higher load average.

Once the master drops to PIO, attempts to access the slave then cause
it to drop to PIO.

If I run:

  atacontrol mode 0 UDMA100 UDMA100

attempts to access either drive result in a delay until the controller
drops to PIO, and then operations resume.  A soft reboot and things
work in UDMA mode again.  Also tried UDMA33 and UDMA66 with no change.
I also tried atacontrol reinit 0 with no help.

Theories I found when searching the web for "fallback to PIO mode" include:

 - bad disks
 - something to do with thermal recalibration

I don't believe the problem is bad disks, as the slave drops to PIO
after the master does, and I can't get it back to UDMA other than by a
soft reboot.  Plus I see the problem on 6 of 8 disks.

The problem is very repeatable.

Can anyone offer any ideas, or suggest investigative steps ?  I have a system
in PIO mode right now.

Thanks,

-- 
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889





Re: ata fallback to PIO mode on dual processor AMD systems

2002-12-31 Thread Bruce Campbell
Quoting Matthew Emmerton [EMAIL PROTECTED]:

 [ cc'ing Soren since he's the ATA guru ]
 
  Dec 30 23:27:00 ecserv13 /kernel: ad0: trying fallback to PIO mode
  Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
 
  The test continues to run with the ata controller in PIO mode, with
  slower performance, and higher load average.
 
  Once the master drops to PIO, attempts to access the slave then cause
  it to drop to PIO.

 Are you using 80-conductor cables on all your drives?  These are required to
 get consistent high throughput, and running without them may cause the
 problems you're seeing.

Thanks for the information about the design of IDE etc, and the suggestion
about the cables.  I was about to shuffle things to get the disks
onto separate channels, but I now see that would be a mistake as my
CD drive would share a cable with a disk.

Anyway, they all have the 80-conductor cable.  I forgot to add some
environmental and other information.

 The 4 AMD systems are in Aopen hx08 towers, with 400 watt power supplies
 and 5 auxiliary fans (in addition to the power supply fan and the fan on
 each CPU).  They are in an air-conditioned machine room.  The CPU and
 motherboard temperatures are within spec.  I mention this because I note
 that many reported AMD system problems have been traced to overheating.

 All drives are installed in removable drive bays.  I don't have the make/model
 on hand right now.  They were $19 CAD ($13 USD).  The low cost makes
 me suspicious now, but...

 I'm running the same tests on 4 single processor 2.4GHz Intel systems.
 They have not failed in this manner so far.

 Initially, I had 1GB memory modules in the AMD systems (I can't remember
 the make) and the systems froze and rebooted randomly.  I moved to
 Crucial 512MB modules to cure that problem.



