re: Had to revert from 5.3 to 4.11
Ted Mittelstaedt said...

> Bruce,
>
> Please do us a favor: these kinds of reports basically go into the bit bucket when posted to the freebsd-questions mailing list. If you would be so kind, please run send-pr on your 4.11 systems and send what you're seeing in as a bug. Granted, since it's not specific, nobody is going to be able to send you a patch or some such, but there is still value in these reports being in there: if others report the same trouble, a correlation can be drawn.
>
> Also, please list the model number of your SuperMicro motherboards.
>
> Thanks!
> Ted

Supermicro motherboard X5DPR-8G2+

It has been running 4.11 solidly for almost 2 months now.

Some more info on the system is here (about a problem experienced on the same system):
http://www.freebsd.org/cgi/query-pr.cgi?pr=bin/75855

Kris Kennaway said...

> Probably this:
> ftp://ftp.FreeBSD.org/pub/FreeBSD/ERRATA/notices/FreeBSD-EN-05:03.ipi.asc
>
> Kris

I applied that towards the end of January. Pretty sure anyway; memory is fading. We were running FreeBSD 5.3-RELEASE-p5 #3 when we abandoned ship. If memory serves, I rebuilt the kernel only (not the world) when I applied those patches.

I have left the department that owns the server now, but I've asked them to follow up on this mailing list when they continue with the diagnosis.

-Original Message-
From: owner-freebsd-questions at freebsd.org [mailto:owner-freebsd-questions at freebsd.org] On Behalf Of Bruce Campbell
Sent: Tuesday, March 01, 2005 6:01 PM
To: freebsd-questions at freebsd.org
Subject: Had to revert from 5.3 to 4.11

> Upgraded a large e-mail server from 4.7 to 5.3 in late December 2004.
>
> The 5.3 system never stayed up for more than 3 days (kernel panics, often while running vacation). A fair bit of fiddling trying to keep it running for about a month, then gave up. Kept the kernel tree updated, no difference.
>
> Reverted to 4.11 about 3 weeks ago, no problems since.
>
> Also upgraded a web server to 5.3 during that time, and had to retreat also, same reasons.
>
> We do have a heavily loaded 5.2.1 system running well. The main difference between the crashy and reliable systems is NFS home dirs on the mail and web servers. The NFS server is 4.7.
>
> Same hardware in all cases, dual Xeon Supermicro.
>
> At a later time we will invest further diagnostic effort. Sorry for the lack of specifics.
>
> --
> Bruce Campbell
> Manager, Science Computing
> C2-260
> University of Waterloo
> (519)888-4567 ext 6991

This mail sent through www.mywaterloo.ca
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]
Had to revert from 5.3 to 4.11
Upgraded a large e-mail server from 4.7 to 5.3 in late December 2004.

The 5.3 system never stayed up for more than 3 days (kernel panics, often while running vacation). A fair bit of fiddling trying to keep it running for about a month, then gave up. Kept the kernel tree updated, no difference.

Reverted to 4.11 about 3 weeks ago, no problems since.

Also upgraded a web server to 5.3 during that time, and had to retreat also, same reasons.

We do have a heavily loaded 5.2.1 system running well. The main difference between the crashy and reliable systems is NFS home dirs on the mail and web servers. The NFS server is 4.7.

Same hardware in all cases, dual Xeon Supermicro.

At a later time we will invest further diagnostic effort. Sorry for the lack of specifics.

--
Bruce Campbell
Manager, Science Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 6991
flock failure on NFS from 5.3 client to 4.7 server
NFS server: FreeBSD 4.7
Old mail server: FreeBSD 4.7, home directories mounted from the NFS server
New mail server: FreeBSD 5.3, home directories mounted from the NFS server

After the mail server upgrade to 5.3, flock gives the error "Operation not supported" on NFS-mounted home directories. Example:

    Jan 13 00:06:32 mail vacation[92816]: vacation: .vacation: Operation not supported

Output from truss:

    open(".vacation.db",0x2,0640)    = 3 (0x3)
    fstat(3,0xbfbfd350)              = 0 (0x0)
    flock(0x3,0x2)                   ERR#45 'Operation not supported'
    close(3)                         = 0 (0x0)

It appears someone else has done substantially more debugging than I:
http://lists.freebsd.org/pipermail/freebsd-questions/2004-September/059777.html
but is seemingly no further ahead.

On our NFS server, rpc.statd is running, but rpc.lockd wasn't. Started it, still no worky. Killed it, other 4.7 clients still flock fine.

Any suggestions for a fix or workaround so that vacation (which depends on flock) works?

Thanks,
--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
Re: flock failure on NFS from 5.3 client to 4.7 server
Quoting Kris Kennaway [EMAIL PROTECTED]:

> > ...
> > After the mail server upgrade to 5.3, flock gives error operation not supported on nfs mounted home directories.
> > ...
> > On our NFS server, rpc.statd is running, but rpc.lockd wasn't. Started it, still no worky. Killed it, other 4.7 clients still flock fine.
>
> rpc.lockd needs to be running on *both* client *and* server. 4.x gets away with it because the rpc.lockd implementation does not in fact implement locking on the client.
>
> Kris

Thanks, that has fixed it, and I've added the appropriate rc.conf settings on the client:

    rpc_lockd_enable=YES   # Run NFS rpc.lockd needed for client/serv
    rpc_statd_enable=YES   # Run NFS rpc.statd needed for client/serv
    rpcbind_enable=YES     # Run the portmapper service

and on the server:

    rpc_lockd_enable=YES   # Run NFS rpc.lockd (*broken!*) if nfs_server.
    rpc_statd_enable=YES   # Run NFS rpc.statd if nfs_server (or NO).

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
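
For anyone wanting to check a mount without waiting for vacation to fail, here is a minimal sketch of a flock probe. It is an illustration only, not part of the original thread: the function name `try_flock` and its return strings are made up for this example, and it assumes a Python interpreter is available on the client. It opens a file the same way the truss output shows (read/write, mode 0640) and attempts an exclusive, non-blocking flock.

```python
import errno
import fcntl
import os


def try_flock(path):
    """Try an exclusive, non-blocking flock() on path.

    Returns "ok" if the lock was taken and released, "busy" if another
    open file already holds the lock, and "unsupported" if the
    filesystem rejected the call with EOPNOTSUPP ("Operation not
    supported", the error vacation logged). Other errors are re-raised.
    """
    # try_flock is a hypothetical helper name used only in this sketch.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno == errno.EOPNOTSUPP:
                return "unsupported"
            if exc.errno in (errno.EWOULDBLOCK, errno.EAGAIN):
                return "busy"
            raise
        fcntl.flock(fd, fcntl.LOCK_UN)
        return "ok"
    finally:
        os.close(fd)
```

Calling `try_flock()` on a scratch file inside an NFS-mounted home directory, before and after starting rpc.lockd on the client, should show whether the change took effect. Note that the probe creates the file if it does not exist.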
Re: New FreeBSD 5.3 e-mail server extremely slow - traced to getpwnam maybe ?
Quoting Kris Kennaway [EMAIL PROTECTED]:

> On Tue, Jan 04, 2005 at 09:27:27PM -0500, Bruce Campbell wrote:
>
> > I wrote a small program:
> >
> >     #include <sys/types.h>
> >     #include <pwd.h>
> >
> >     int main( int argc, char *argv[] ) { getpwuid( 13076 ); return 0; }
> >
> > and ran it under truss on 5.x and it generated 178,711 lines of output (the bulk of which is those lseek/read calls as above)
> > ...
>
> Try tuning the pwd_mkdb parameters (see hash(3)) in /usr/src/usr.sbin/pwd_mkdb/pwd_mkdb.c and recompile:
>
>     HASHINFO openinfo = {
>             4096,           /* bsize */
>             32,             /* ffactor */
>             256,            /* nelem */
>             2048 * 1024,    /* cachesize */
>             NULL,           /* hash() */
>             0               /* lorder */
>     };
>
> e.g. adjust nelem to 12000 to accommodate your significantly-larger-than-average password database.
>
> If this helps, please submit a PR requesting that someone make an option to pwd_mkdb to tune this at runtime (or better yet, submit the patch to do this yourself; it's straightforward to modify the source to do this).

Thanks. That had no effect on the large number of seeks/reads needed to do a getpwuid of a specific uid. I tried boosting that number further, still no change.

I suspect the problem is related to some change to the hash functions between 4.7 and 5.2.1, and I hope to get to the bottom of it today.

I tried two getpwnam (as opposed to getpwuid) calls on 2 different userids: one took 1000 seeks/reads, the other 16,000, so it's all pretty random, no doubt related to how stuff gets hashed. On 4.7 it takes just one or two reads/seeks.

As each login via ipop and imap, each sendmail, and just about everything else will be doing getpwnam's, I think this is our problem.

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
Re: New FreeBSD 5.3 e-mail server extremely slow - traced to getpwnam maybe ?
Quoting Bruce Campbell [EMAIL PROTECTED]:

> On Tue, Jan 04, 2005 at 09:27:27PM -0500, Bruce Campbell wrote:
>
> I wrote a small program:
>
>     #include <sys/types.h>
>     #include <pwd.h>
>
>     int main( int argc, char *argv[] ) { getpwuid( 13076 ); return 0; }
>
> and ran it under truss on 5.x and it generated 178,711 lines of output (the bulk of which is those lseek/read calls as above)

It looks like the overhaul of getpwent in Apr/2003 to make it thread safe:
http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/gen/getpwent.c
may be the problem. I've tested the dbm_fetch function independently on a large file, and it is fine.

I've opened a bug report, and plan to build a replacement 4.x mail server, as the most deterministic path to restoring adequate e-mail service to our users.

Can anyone suggest a workaround?

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
Re: New FreeBSD 5.3 e-mail server extremely slow - traced to getpwnam maybe ?
Quoting Bruce Campbell [EMAIL PROTECTED]:

> Quoting Bruce Campbell [EMAIL PROTECTED]:
>
> > On Tue, Jan 04, 2005 at 09:27:27PM -0500, Bruce Campbell wrote:
> >
> > I wrote a small program:
> >
> >     #include <sys/types.h>
> >     #include <pwd.h>
> >
> >     int main( int argc, char *argv[] ) { getpwuid( 13076 ); return 0; }
> >
> > and ran it under truss on 5.x and it generated 178,711 lines of output (the bulk of which is those lseek/read calls as above)
>
> It looks like the overhaul of getpwent in Apr/2003 to make it thread safe:
> http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/gen/getpwent.c
> may be the problem. I've tested the dbm_fetch function independently on a large file, and it is fine.
>
> I've opened a bug report, and plan to build a replacement 4.x mail server, as the most deterministic path to restoring adequate e-mail service to our users.
>
> Can anyone suggest a workaround?

Well, somewhat unbelievably, copying a getpwent.c from 4.7 and remaking libc on 5.3 with it worked. Load average has gone from 70 to 2.

And, so that this qualifies as a question... Am I crazy to pull an old getpwnam from 4.7 and blindly build it on 5.3?

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
Re: New FreeBSD 5.3 e-mail server extremely slow - traced to getpwnam maybe ?
Quoting Bruce Campbell [EMAIL PROTECTED]:

> ...
> Well, somewhat unbelievably, copying a getpwent.c from 4.7 and remaking libc on 5.3 with it worked. Load average has gone from 70 to 2.

One of my co-workers has found a less kludgey workaround for the high load problem we were seeing on 5.3 with a large /etc/master.passwd, as follows:

    --- /etc/nsswitch.conf.old      Wed Jan  5 19:23:24 2005
    +++ /etc/nsswitch.conf  Wed Jan  5 19:23:43 2005
    @@ -1,7 +1,7 @@
    -group: compat
    +group: files
     group_compat: nis
     hosts: files dns
     networks: files
    -passwd: compat
    +passwd: files
     passwd_compat: nis
     shells: files

System is purring with load average under 1 now: 200,000 pop/imap sessions per day and 200,000 e-mails per day, all spamassassinated.

For more details and ongoing followup, see:
http://www.freebsd.org/cgi/query-pr.cgi?pr=75855

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
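
As a way to audit other machines for the same setting, here is a small sketch that reads nsswitch.conf text and reports which sources a database uses. It is an illustration only: the helper names `nss_sources` and `uses_compat` are invented for this example, and the two sample strings simply restate the before and after files from the diff in this thread. The parser is deliberately simple and returns any criteria tokens such as `[NOTFOUND=return]` as if they were sources.

```python
# Contents of nsswitch.conf before and after the workaround,
# restated from the diff posted in this thread.
OLD_CONF = """\
group: compat
group_compat: nis
hosts: files dns
networks: files
passwd: compat
passwd_compat: nis
shells: files
"""

NEW_CONF = OLD_CONF.replace("group: compat", "group: files").replace(
    "passwd: compat", "passwd: files")


def nss_sources(conf_text, database):
    """Return the list of sources configured for one database.

    Hypothetical helper for this sketch: comments are dropped, and
    an empty list means the database has no line in the file.
    """
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        name, _, rest = line.partition(":")
        if name.strip() == database:
            return rest.split()
    return []


def uses_compat(conf_text, database):
    """True if the database is still resolved through the compat source."""
    return "compat" in nss_sources(conf_text, database)
```

On a live system the text would come from reading /etc/nsswitch.conf; `uses_compat(text, "passwd")` then tells you whether the passwd lookups still go through compat.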
New FreeBSD 5.3 e-mail server extremely slow...
We upgraded from a dual 1.66GHz AMD running FreeBSD 4.7 to a dual 3GHz Xeon running FreeBSD 5.3, and the new server is painfully slow, even after turning spamassassin and yavr (yet another virus recipe) off. Load appears to be imapd/ipop3d (uw-imapd) related.

New server is Adaptec SCSI RAID, old one was 3ware ATA RAID, but disk load is relatively low anyway.

It is a fairly high volume server, maybe 150,000 messages per day and 150,000 pop/imap sessions per day. But the old box was doing relatively fine.

Turning off hyperthreading helped a lot, but not enough. Load average is around 48 now; I've set the 2 sendmail conf load av settings to 48 so at least e-mail gets in.

A quick truss of an ipop3d process shows piles of this streaming by...

    setitimer(0,{0 0, 0 0},{0 0, 599 92})    = 0 (0x0)
    write(1,0x805a000,21)                    = 21 (0x15)
    gettimeofday({1104857422 906783},0x0)    = 0 (0x0)
    setitimer(0,{0 0, 600 0},{0 0, 0 0})     = 0 (0x0)
    read(0x0,0x8063000,0x832c)               = 10 (0xa)
    setitimer(0,{0 0, 0 0},{0 0, 600 0})     = 0 (0x0)
    write(1,0x805a000,14)                    = 14 (0xe)
    gettimeofday({1104857422 908916},0x0)    = 0 (0x0)
    setitimer(0,{0 0, 600 0},{0 0, 0 0})     = 0 (0x0)

top shows 80-90% system activity.

About to revert to our old box and maybe nfs mount /var/mail to make it less painful.

Any suggestions?

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
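
When a truss capture runs to thousands of lines, a per-syscall tally is easier to read than the raw stream. The sketch below is an illustration only, not something from the original thread: `count_syscalls` is a made-up helper name, and it assumes each truss line begins with the syscall name followed by an opening parenthesis, as in the excerpt from this message that is reused as sample input.

```python
import re
from collections import Counter

# A truss line starts with the syscall name, then "(".
SYSCALL_RE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\(")

# The ipop3d excerpt from this message, used as sample input.
SAMPLE = """\
setitimer(0,{0 0, 0 0},{0 0, 599 92})= 0 (0x0)
write(1,0x805a000,21)= 21 (0x15)
gettimeofday({1104857422 906783},0x0)= 0 (0x0)
setitimer(0,{0 0, 600 0},{0 0, 0 0}) = 0 (0x0)
read(0x0,0x8063000,0x832c) = 10 (0xa)
setitimer(0,{0 0, 0 0},{0 0, 600 0}) = 0 (0x0)
write(1,0x805a000,14)= 14 (0xe)
gettimeofday({1104857422 908916},0x0)= 0 (0x0)
setitimer(0,{0 0, 600 0},{0 0, 0 0}) = 0 (0x0)
"""


def count_syscalls(lines):
    """Tally truss output lines by syscall name, skipping other lines."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for name, n in count_syscalls(SAMPLE.splitlines()).most_common():
        print(n, name)
```

Feeding it a saved truss log (one line per call) instead of `SAMPLE` gives a quick view of which calls dominate.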
Re: New FreeBSD 5.3 e-mail server extremely slow...
Quoting Kris Kennaway [EMAIL PROTECTED]:

> On Tue, Jan 04, 2005 at 12:38:48PM -0500, Bruce Campbell wrote:
>
> > We upgraded from a dual 1.66GHz AMD running FreeBSD 4.7 and a dual 3GHz Xeon running FreeBSD 5.3 and the new server is painfully slow, even after turning spamassassin and yavr (yet another virus recipe) off. Load appears to be imapd/ipop3d (uw-imapd) related.
>
> Same version as you were running before? Same configuration files?

Well, no, not quite.

old: imap-uw-2002_1,1
new: imap-uw-2004a,1

Just about all packages have undergone some updates on our new server.

The only processes for which we have hundreds running would be sendmail, procmail, ipop3d and imapd. But when I had sendmail conf'ed to shut down mail when load av went over 12, load av would still shoot up to 40 or 50 and stay there, and the only major processes were imapd and ipop3d. And I noticed them calling setitimer a lot, with 80% system usage.

I'm about to pull the zero channel Adaptec SCSI RAID card, for no other reason than I'm out of bright ideas.

> Can you show us your kernel configuration and dmesg?
>
> Kris

old: (difference from 4.7 GENERIC)

    - cpu I386_CPU
    - cpu I486_CPU
    + options QUOTA      #enable disk quotas
    + options SMP        # Symmetric MultiProcessor Kernel
    + options APIC_IO    # Symmetric (APIC) I/O

new: (difference from 5.3 GENERIC)

Reverted to non-SMP for now; the only difference from GENERIC is...

    options QUOTA

I did have options SMP going for a while. Removing SMP has made no difference in load or responsiveness. It actually seems slightly better on one CPU.

dmesg.boot from the new system is as follows:

    Copyright (c) 1992-2004 The FreeBSD Project.
    Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
            The Regents of the University of California. All rights reserved.
    FreeBSD 5.3-RELEASE #0: Thu Nov 25 15:48:15 EST 2004
        [EMAIL PROTECTED]:/usr/src/sys/i386/compile/MAIL_SERVER
    Timecounter i8254 frequency 1193182 Hz quality 0
    CPU: Intel(R) Xeon(TM) CPU 3.06GHz (3065.80-MHz 686-class CPU)
      Origin = GenuineIntel Id = 0xf27 Stepping = 7
      Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
      Hyperthreading: 2 logical CPUs
    real memory = 2146959360 (2047 MB)
    avail memory = 2095419392 (1998 MB)
    ACPI APIC Table: PTLTD APIC
    ioapic0 Version 2.0 irqs 0-23 on motherboard
    ioapic1 Version 2.0 irqs 24-47 on motherboard
    ioapic2 Version 2.0 irqs 48-71 on motherboard
    npx0: [FAST]
    npx0: math processor on motherboard
    npx0: INT 16 interface
    acpi0: PTLTD RSDT on motherboard
    acpi0: Power Button (fixed)
    Timecounter ACPI-fast frequency 3579545 Hz quality 1000
    acpi_timer0: 24-bit timer at 3.579545MHz port 0x1008-0x100b on acpi0
    cpu0: ACPI CPU (2 Cx states) on acpi0
    pcib0: ACPI Host-PCI bridge port 0xcf8-0xcff on acpi0
    pci0: ACPI PCI bus on pcib0
    pci0: unknown at device 0.1 (no driver attached)
    pcib1: ACPI PCI-PCI bridge at device 2.0 on pci0
    pcib1: could not get PCI interrupt routing table for \\_SB_.PCI0.HLB_ - AE_NOT_FOUND
    pci1: ACPI PCI bus on pcib1
    pci1: base peripheral, interrupt controller at device 28.0 (no driver attached)
    pcib2: ACPI PCI-PCI bridge at device 29.0 on pci1
    pci2: ACPI PCI bus on pcib2
    em0: Intel(R) PRO/1000 Network Connection, Version - 1.7.35 port 0x3000-0x303f mem 0xf820-0xf821 irq 54 at device 3.0 on pci2
    em0: Ethernet address: 00:30:48:29:c5:a8
    em0: Speed:N/A Duplex:N/A
    em1: Intel(R) PRO/1000 Network Connection, Version - 1.7.35 port 0x3040-0x307f mem 0xf822-0xf823 irq 55 at device 3.1 on pci2
    em1: Ethernet address: 00:30:48:29:c5:a9
    em1: Speed:N/A Duplex:N/A
    pci1: base peripheral, interrupt controller at device 30.0 (no driver attached)
    pcib3: ACPI PCI-PCI bridge at device 31.0 on pci1
    pci3: ACPI PCI bus on pcib3
    asr0: Adaptec Caching SCSI RAID mem 0xfc00-0xfdff,0xfb00-0xfbff,0xf830-0xf83f irq 30 at device 3.0 on pci3
    asr0: [GIANT-LOCKED]
    asr0: ADAPTEC 2015S FW Rev. 3B05, 2 channel, 256 CCBs, Protocol I2O
    uhci0: Intel 82801CA/CAM (ICH3) USB controller USB-A port 0x2000-0x201f irq 16 at device 29.0 on pci0
    uhci0: [GIANT-LOCKED]
    usb0: Intel 82801CA/CAM (ICH3) USB controller USB-A on uhci0
    usb0: USB revision 1.0
    uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
    uhub0: 2 ports with 2 removable, self powered
    uhci1: Intel 82801CA/CAM (ICH3) USB controller USB-B port 0x2020-0x203f irq 19 at device 29.1 on pci0
    uhci1: [GIANT-LOCKED]
    usb1: Intel 82801CA/CAM (ICH3) USB controller USB-B on uhci1
    usb1: USB revision 1.0
    uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
    uhub1: 2 ports with 2 removable, self powered
    uhci2: Intel 82801CA/CAM (ICH3) USB controller USB-C port 0x2040-0x205f irq 18 at device 29.2 on pci0
    uhci2: [GIANT-LOCKED]
    usb2: Intel 82801CA/CAM (ICH3) USB controller USB-C on uhci2
    usb2: USB
Re: New FreeBSD 5.3 e-mail server extremely slow - traced to getpwnam maybe ?
Quoting Kris Kennaway [EMAIL PROTECTED]:

> > Well, no, not quite.
> > old: imap-uw-2002_1,1
> > new: imap-uw-2004a,1
>
> OK, that's where you should start, then. Go back to the software configuration that you know is working and see if it still misbehaves.
>
> Kris

Thanks. I shut down imapd/ipop3d completely so I just had sendmail running, and load av. was still 20-30.

Anyways, I have just found something very odd with both 5.2.1 and 5.3 on multiple different systems here, including a brand new GENERIC install.

On 5.x, ls -l or ps waux is very slow with our /etc/master.passwd, which has 11320 entries.

I truss'ed those commands, and gave up after watching:

    lseek(4,0x17d000,SEEK_SET)    = 1560576 (0x17d000)
    read(0x4,0x8074000,0x1000)    = 4096 (0x1000)
    lseek(4,0x17e000,SEEK_SET)    = 1564672 (0x17e000)
    read(0x4,0x8062000,0x1000)    = 4096 (0x1000)
    lseek(4,0x17f000,SEEK_SET)    = 1568768 (0x17f000)
    read(0x4,0x8066000,0x1000)    = 4096 (0x1000)
    lseek(4,0x180000,SEEK_SET)    = 1572864 (0x180000)

scroll by for 10 minutes. (handle 4 = /etc/spwd.db)

I wrote a small program:

    #include <sys/types.h>
    #include <pwd.h>

    int main( int argc, char *argv[] ) { getpwuid( 13076 ); return 0; }

and ran it under truss on 5.x and it generated 178,711 lines of output (the bulk of which is those lseek/read calls as above).

4.7 (with the same master.passwd file) gave 59 lines of output, which seems normal.

I'm speculating that imap and sendmail and just about everything use getpwuid, and that getpwuid is misbehaving on 5.x, especially with a large master.passwd file.

I will report this through the proper mechanism once I do just a bit more testing. Perhaps it is a known issue already, and I'll look into that also. Or perhaps I have messed something up unwittingly, which I have been known to do.

We do have an extremely busy 5.2.1 system running here fine on the same hardware; it just has a small /etc/master.passwd, which may explain that system's success to date.

Thank you to everyone who sent suggestions.

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
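
A rough way to compare lookup cost between systems, without reading truss output, is to time repeated getpwuid calls. The sketch below is an illustration under stated assumptions, not a measurement from the thread: `time_lookups` is a hypothetical helper, it assumes Python is installed on the hosts being compared, and wall-clock timings will vary with caching and load.

```python
import os
import pwd
import time


def time_lookups(uid, repeat=100):
    """Average seconds per getpwuid(uid) call over `repeat` calls.

    Returns None if the uid has no passwd entry, since there is then
    nothing meaningful to time.
    """
    try:
        pwd.getpwuid(uid)
    except KeyError:
        return None
    start = time.perf_counter()
    for _ in range(repeat):
        pwd.getpwuid(uid)
    return (time.perf_counter() - start) / repeat


if __name__ == "__main__":
    # Times the current user by default; substitute a uid from a large
    # master.passwd (the thread used 13076) to reproduce the comparison.
    print(time_lookups(os.getuid()))
```

Running the same script on a 4.x and a 5.x host with the same master.passwd would give a simple before and after number to attach to a PR.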
apparent change in php4 port build procedure...
I'm upgrading to mod_php4-4.3.10.

In the past, the make procedure presented me with a detailed menu of options. Now it appears to ask me just these 3 questions:

- apache 1 vs 2
- debug
- ipv6

and not all the other stuff like mysql, imap, and so forth.

I can easily add the configure args I want to /usr/ports/lang/php4/Makefile, like this:

    --with-mysql=/usr/local \
    --with-layout=GNU \
    --with-config-file-scan-dir=${PREFIX}/etc/php \
    --with-zlib-dir=/usr \
    --with-regex=php \
    --enable-ftp \

But I liked the old menu system, as it saved me figuring out the configure args.

Was there a reason to move away from that, or is there a new mechanism I am not aware of?

Thanks,
--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
getting an ls -l from a dump type file system backup...
I use dump/restore for file system backups, and I'd like to be able to get a detailed ls -l type listing from the backup (i.e. something with dates/times/sizes, unlike what restore -t or ls in restore -i does).

Does anyone know of any utilities to do this?

After each backup, I'd like to be able to put the details of all files backed up into a database, so I can see what versions of each file I've got available before restoring them.

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
Re: ipfw and divert and trying to do something clever (never mind)
Never mind. ipfw fwd does exactly what I am after; I misunderstood the command line.

Quoting Bruce Campbell [EMAIL PROTECTED]:

> I have some machines behind a freebsd firewall, and I'm using ipfw.
>
> Presently, I reset attempts to smtp past the firewall:
>
>     reset tcp from [subnet] to any 25
>
> but I'd like to divert them to my own smtp server, so it doesn't matter what the clients try to use.
>
> I thought this would be easy. Maybe it is.
>
> The fwd feature doesn't seem to do it, as it just forwards a specific ipaddr[,port] (no subnet/mask)
>
> divert looks like the way to do it, and after a few hours of fiddling with a program that opens a divert socket, I can watch all manner of traffic going back and forth, but each time I attempt to send it elsewhere, I get nowhere. I am duly setting both the ip and tcp checksum, before re-injection.
>
> Somebody else must have done this, and/or I must be doing it the wrong way.
>
> Any suggestions ? Please e-mail me directly also as I am not on this list. A code snippet using divert would be excellent.

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
ipfw and divert and trying to do something clever
I have some machines behind a FreeBSD firewall, and I'm using ipfw.

Presently, I reset attempts to smtp past the firewall:

    reset tcp from [subnet] to any 25

but I'd like to divert them to my own smtp server, so it doesn't matter what the clients try to use.

I thought this would be easy. Maybe it is.

The fwd feature doesn't seem to do it, as it just forwards to a specific ipaddr[,port] (no subnet/mask).

divert looks like the way to do it, and after a few hours of fiddling with a program that opens a divert socket, I can watch all manner of traffic going back and forth, but each time I attempt to send it elsewhere, I get nowhere. I am duly setting both the IP and TCP checksums before re-injection.

Somebody else must have done this, and/or I must be doing it the wrong way.

Any suggestions? Please e-mail me directly also, as I am not on this list. A code snippet using divert would be excellent.

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
ipfw2 loss of feature ?
With ipfw1 on 4.8 I use this:

    ipfw add 10 check-state
    ipfw add 20 allow tcp from xxx.xxx.xxx.0/24 to any keep-state limit src-addr 10

to provide stateful firewalling, and to limit the number of simultaneous tcp sessions to 10 per client. Seems to work great.

On 4.8 I tried ipfw2, as I wanted keepalives (kernel with options IPFW2, and rebuilt ipfw and libalias with -DIPFW2 as instructed in man ipfw). With ipfw2, I get this error when I run ipfw:

    only one of keep-state and limit is allowed

How can I do both the stateful firewalling and limit the simultaneous sessions with ipfw2?

Thanks

ps. As an aside, I also patch /usr/src/sys/netinet/ip_fw.c to be more verbose when it drops a session...

    --- ip_fw.c     Sun Sep 14 15:33:16 2003
    +++ ip_fw.old   Sun Sep 14 15:31:10 2003
    @@ -999,9 +999,7 @@
             if (fw_verbose && last_log != time_second) {
                     last_log = time_second;
                     log(LOG_SECURITY | LOG_DEBUG,
    -                    "drop session 0x%08x %u - 0x%08x %u, TOO many entries \n",
    -                    (args->f_id.src_ip), (args->f_id.src_port),
    -                    (args->f_id.dst_ip), (args->f_id.dst_port));
    +                    "drop session, too many entries\n");
             }
             return 1;
     }

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
Re: problem on 1TB filesystem RAID 5 3ware
Some more test results:

22 Mar 2003 - test with RAID 5 with 4 * WD 200GB with Write Cache disabled *succeeded*.

Write Cache can be disabled through the 3ware BIOS, or the 3ware web management tool.

Raw write performance to the array dropped from 3KBytes/Second to 4500KBytes/Second, however this did not impact the test significantly, as the test involved copying data via an NFS mount on a 100 MBit/second network. The effective speed of the NFS copy dropped from around 5000 KBytes/Second to about 4500 KBytes/Second with Write Cache disabled.

Thread here:
http://oss.sgi.com/projects/xfs/mail_archive/200211/msg00056.html
suggests firmware/driver mismatches can cause trouble, and someone else who had trouble found turning off write cache fixed it.

All my info on this problem is being kept here:
http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BackupServerProblem

Quoting MikeM [EMAIL PROTECTED]:

> Jim King [EMAIL PROTECTED] wrote:
>
> > Which tells me that all of the work in the last year has been maintenance related to changes within FreeBSD itself, and not any updates for 3Ware functionality, e.g no support for firmware 7.5.x on the 7000 series controllers, and no support for the 8000 series controllers.
>
> If the above is true, perhaps the hardware guide should be modified. It currently says that the 3Ware 7000 series is supported. I, for one, purchased a 3Ware controller for my FreeBSD server based upon the misleading hardware guide.

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889

To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-questions in the body of the message
Re: problem on 1TB filesystem RAID 5 3ware
I opened a case with 3ware tech support and they responded:

    We do not support FreeBSD plus the current driver for FreeBSD has not been updated for some time to keep up with firmware changes. Please try linux instead.

So I guess I will try that and see what happens.

Quoting Simon [EMAIL PROTECTED]:

> I have a hard time believing that hardware implementation of RAID5 would corrupt files over RAID10, perhaps your 3ware card/its firmware is malfunctioning, but anything is possible. Well, I'll have to see for myself, I'm about to build RAID5 NAS using maxtor drives and 7500-8 3ware card.
>
> -Simon
>
> On Tue, 18 Mar 2003 08:05:36 -0500, Bruce Campbell wrote:
>
> > Tested with RAID 10 instead of RAID 5, success !
> >
> > RAID 10 arrays tested: 6xWD200GB and 8xWD200GB both worked
> > RAID 5 arrays tested: 6xWD200GB and 4xWD200GB both failed
> >
> > Note: 3ware lists the WD 200GB disk as Under Test. (ie they have not yet given it a Compatible rating)
> >
> > details of tests and the procedure to detect the failure etc at
> > http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BackupServerProblem
> >
> > I still have to try an officially approved drive.

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
Re: problem on 1TB filesystem RAID 5 3ware
Tested with RAID 10 instead of RAID 5, success!

RAID 10 arrays tested: 6xWD200GB and 8xWD200GB both worked
RAID 5 arrays tested: 6xWD200GB and 4xWD200GB both failed

Note: 3ware lists the WD 200GB disk as Under Test (i.e. they have not yet given it a Compatible rating).

Details of the tests and the procedure to detect the failure etc. at
http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BackupServerProblem

I still have to try an officially approved drive.
Re: problem on 1TB filesystem RAID 5 3ware
Not solved this yet, but I have determined a few things that the problem isn't. Info at:
http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BackupServerProblem

Tested with soft updates off and on; fails in either case, so that isn't it.

Seems like the problem is either:

- 3ware card or driver
- something to do with the large filesystem

Quoting Bruce Campbell [EMAIL PROTECTED]:

> File corruption on 2 identical systems, designed to be backup servers to contain dumps of other systems:
>
>     FreeBSD ecserv18.uwaterloo.ca 4.7-RELEASE FreeBSD 4.7-RELEASE #0: Wed Oct 9 15:08:34 GMT 2002 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC i386
>
> with 1TB /backup partition, on a 3ware 7500-8 ATA RAID card, RAID 5:
>
>     Filesystem     1K-blocks      Used     Avail Capacity  Mounted on
>     /dev/twed0s1a   20644846    906552  18086708     5%    /
>     procfs                 4         4         0   100%    /proc
>     /dev/twed0s1e  938819776 279031856 584682338    32%    /backup
>
> disks are 6 x Western Digital 2000JB (200GB)
>
> I ran tests on /backup for 10 days on each system (fill disk with 50GB files of pseudo random data, then reading them all back and verify contents, then erase, then start over). Tests ran perfectly.
>
> details on hardware config at:
> http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BackupServerHardware
>
> Then, I was ready to put the systems into production, so I copied data from my 2 older backup servers (which have 360GB vinum partitions) and after copying the data (approx 250GB in 325 files) about a dozen files were corrupt after the copy. I copied via an NFS mount.
>
> All corruption started on a 64K boundary, except one which was on a 16K boundary.
>
> Recopied the dozen corrupt files, and then only 6 were corrupt.
>
> Same problem on both systems, each which copied from a different source server.
>
> File seems corrupt to the end after first corruption starts, I have not looked for a pattern to see if it is another files contents, or misplaced contents from the same file.
>
> fsck shows no problems
>
> Restarted my test filling with 50GB files again, has run perfectly.
>
> I plan to try:
>
> - turn off soft updates
> - RAID 10 instead of 5
> - different file system parameters, for example I don't need 100 million inodes.
> - rcp'ing the files
> - staring at computer screen
>
> By the way, 3ware has not officially approved the WD 200GB drive last time I checked.
>
> Lots of good experience with the motherboard (ASUS P4S533) and network card (Intel Pro/100).
>
> Lots of good experience with vinum striped partitions of smaller size (360GB)
>
> Does anyone have any suggestions ?

--
Bruce Campbell
Engineering Computing
CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
problem on 1TB filesystem RAID 5 3ware
File corruption on 2 identical systems, designed to be backup servers to contain dumps of other systems:

FreeBSD ecserv18.uwaterloo.ca 4.7-RELEASE FreeBSD 4.7-RELEASE #0: Wed Oct 9 15:08:34 GMT 2002 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC i386

with 1TB /backup partition, on a 3ware 7500-8 ATA RAID card, RAID 5:

Filesystem     1K-blocks      Used     Avail Capacity  Mounted on
/dev/twed0s1a   20644846    906552  18086708     5%    /
procfs                 4         4         0   100%    /proc
/dev/twed0s1e  938819776 279031856 584682338    32%    /backup

disks are 6 x Western Digital 2000JB (200GB)

I ran tests on /backup for 10 days on each system (fill the disk with 50GB files of pseudo random data, then read them all back and verify the contents, then erase, then start over). Tests ran perfectly.

details on hardware config at:
http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BackupServerHardware

Then, I was ready to put the systems into production, so I copied data from my 2 older backup servers (which have 360GB vinum partitions), and after copying the data (approx 250GB in 325 files) about a dozen files were corrupt. I copied via an NFS mount.

All corruption started on a 64K boundary, except one which was on a 16K boundary.

Recopied the dozen corrupt files, and then only 6 were corrupt.

Same problem on both systems, each of which copied from a different source server.

A file seems to be corrupt to the end once the corruption starts. I have not looked for a pattern to see if it is another file's contents, or misplaced contents from the same file.

fsck shows no problems.

Restarted my test filling with 50GB files again, has run perfectly.

I plan to try:
- turn off soft updates
- RAID 10 instead of 5
- different file system parameters, for example I don't need 100 million inodes.
- rcp'ing the files
- staring at computer screen

By the way, 3ware has not officially approved the WD 200GB drive last time I checked.

Lots of good experience with the motherboard (ASUS P4S533) and network card (Intel Pro/100).

Lots of good experience with vinum striped partitions of smaller size (360GB).

Does anyone have any suggestions?

-- 
Bruce Campbell
Engineering Computing CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
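For anyone wanting to check where corruption begins in a copied file, here is a minimal Python sketch. It is not the burn-in program mentioned in this thread; the function names are made up for illustration. It compares a source file with its copy, reports the offset of the first differing byte, and says which boundary (64K, 16K, ...) that offset falls on.

```python
BLOCK = 64 * 1024  # read and compare in 64 KiB chunks


def first_mismatch(path_a, path_b, chunk_size=BLOCK):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the exact byte inside the differing chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No differing byte in the common part: one file is shorter.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files ended without a difference
            offset += len(a)


def alignment(offset):
    """Return the largest listed boundary size that the offset is a multiple of."""
    for size in (64 * 1024, 16 * 1024, 4096, 512):
        if offset % size == 0:
            return size
    return 1
```

Running first_mismatch() over each source/copy pair and passing the result to alignment() would show whether the 64K pattern holds for every corrupt file.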
Re: problem on 1TB filesystem RAID 5 3ware
Quoting Simon [EMAIL PROTECTED]:

> I can only hope I don't have the same issue. I'm currently building a 1.75TB NAS to do daily backups using 3ware 7500-8 and maxtor drives.

Tiny bit more info:

- NFS was starting to be implicated, but on one of my backup servers I had let it run 2 dumps of our Network Appliance, basically:

    rsh netapp dump ... | gzip > file

  and I tried gunzip -t to test the files, and both were corrupt. My backup system that I've been running with vinum for a long time does a weekly gunzip -t on all files, and I've not seen a problem before.

  This also removes the network card from suspicion: if it were the problem, the .gz file would still be valid (it would just be compressed garbage, but it would not be corrupt itself).

Here is the program I wrote to test the partitions:
http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/BurnInProcedure
(obviously not an outstanding test, since it passed my system)

> -Simon
>
> On Wed, 12 Mar 2003 20:38:13 -0500, Bruce Campbell wrote:
>> [snip]

-- 
Bruce Campbell
Engineering Computing CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
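The weekly gunzip -t pass described above could be approximated like this in Python. This is only a sketch under assumptions: the function names are hypothetical, and it simply decompresses every *.gz file under a directory and discards the output, which is roughly what gunzip -t does.

```python
import gzip
import zlib
from pathlib import Path


def gzip_ok(path, chunk_size=1 << 20):
    """Decompress the whole file, discarding output; False on any error."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and CRC failures,
        # EOFError a truncated stream, zlib.error damaged deflate data.
        return False


def corrupt_archives(directory):
    """Return a sorted list of *.gz files under directory that fail the check."""
    return sorted(p for p in Path(directory).rglob("*.gz") if not gzip_ok(p))
```

As the message notes, a file that passes this check can still contain garbage that was compressed cleanly; the check only shows the bytes were not damaged after gzip wrote them.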
swap_pager: indefinite wait buffer but no disk errors
/kernel: pid 68914 (file1), uid 0 on /test: file system full
Jan 6 17:50:20 ecserv15 /kernel: swap_pager: indefinite wait buffer: device: #ad/0x20001, blkno: 504, size: 4096
Jan 6 17:50:50 ecserv15 /kernel: swap_pager: indefinite wait buffer: device: #ad/0x20001, blkno: 504, size: 4096
Jan 6 22:16:43 ecserv15 /kernel: pid 69461 (file1), uid 0 on /test: file system full
Jan 7 00:00:45 ecserv15 newsyslog[69749]: logfile turned over
Jan 7 00:00:45 ecserv15 newsyslog[69749]: logfile turned over
Jan 7 03:50:20 ecserv15 /kernel: swap_pager: indefinite wait buffer: device: #ad/0x20001, blkno: 504, size: 4096
Jan 7 03:50:57 ecserv15 /kernel: swap_pager: indefinite wait buffer: device: #ad/0x20001, blkno: 504, size: 4096
Jan 7 06:56:44 ecserv15 /kernel: pid 70291 (file1), uid 0 on /test: file system full

-- 
Bruce Campbell
Engineering Computing CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
Re: ata fallback to PIO mode on dual processor AMD systems
Quoting Bruce Campbell [EMAIL PROTECTED]:

> Quoting Matthew Emmerton [EMAIL PROTECTED]:
>
>> [ cc'ing Soren since he's the ATA guru ]
>>
>>> Dec 30 23:27:00 ecserv13 /kernel: ad0: trying fallback to PIO mode
>>> Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
>>>
>>> The test continues to run with the ata controller in PIO mode, with slower performance, and higher load average. Once the master drops to PIO, attempts to access the slave then cause it to drop to PIO.
>>
>> Are you using 80-conductor cables on all your drives? These are required to get consistent high throughput, and running without them may cause the problems you're seeing.
>
> Thanks for the information about the design of IDE etc, and the suggestion about the cables. I was about to shuffle things to get the disks onto separate channels, but I now see that would be a mistake as my CD drive would share a cable with a disk.

ps. As an aside, I have since determined that putting a PIO device and a UDMA device on the same channel does not affect the performance of the UDMA device, unless the PIO device is in use. So, sharing a cable between a low-use CD-ROM drive and a disk wouldn't be so bad.

I am puzzled about the fallback to PIO concept. If a disk gives some sort of timeout error or whatever, why would trying PIO correct the problem? That seems equivalent to asking the disk to do the same thing, just more slowly.

In my case, some sort of timeout error occurs on ad0, so it falls back to PIO, and works. A later access to ad1 also yields a timeout error, and then it drops to PIO, and works too. I'm fairly confident both disks did not experience media errors at the same time, which suggests a problem with the onboard IDE controller, or a driver bug.

Tests continue...
Followup to fallback to PIO mode on dual processor AMD systems
By the way, I've determined our removable IDE disk trays are manufactured by SNT (http://www.snt.com.tw/metal.htm) and are part number SNT-129. It looks like these are the same ones startech sells.

I've placed my hardware configuration here:
http://www.freebsd.uwaterloo.ca/twiki/bin/view/Freebsd/DualAmd2000

Out of my 4 AMD systems, my test results are now:
- 1 refuses to die
- 1 panic'ed and died, after not being able to drop to PIO. Many fsck errors upon reboot. The console error was:
    ata0: resetting devices .. ad0: DMA limited to UDMA33, non-ATA66 cable or device
- 2 dropped to PIO after about 15 hours of tests, and ran fine (but slowly) with PIO

As for the 2 that dropped to PIO and worked, I rebooted and manually ran

    atacontrol mode 0 UDMA33 UDMA33

and restarted the tests. No problems in 36 hours so far.

My 4 Intel systems (which only have a UDMA33 controller on the motherboard) have also been running for 48 hours with no problems.

The test I run is...

    dbench 1
    sleep 300
    dbench 2
    sleep 300
    dbench 3
    ...

up to about dbench 80, and then I kill and restart.

With UDMA100, dbench 10 gave 43 MB/Sec
With UDMA33, dbench 10 gives 37 MB/Sec

I still plan to:
- try UDMA100 with the drives directly attached (ie. no removable tray)
- maybe try a non onboard IDE controller
- shuffle the disks to see if the problems follow the disks or not

At present, I don't suspect bad media because the error message is

    WRITE command timeout tag=0 serv=0

which doesn't suggest a specific sector/track etc, and running with UDMA33 instead of UDMA100 makes the problem appear to vanish.
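The escalating dbench run described in this message could be driven by something like the following Python sketch. It is not the script actually used; the function names are hypothetical, and it assumes only what the message shows, namely that dbench is invoked with the client count as its argument and that each run is followed by a 300 second pause.

```python
import subprocess
import time


def dbench_schedule(max_clients=80, pause_seconds=300):
    """Yield ("run", n) and ("sleep", seconds) steps for 1..max_clients clients."""
    for clients in range(1, max_clients + 1):
        yield ("run", clients)
        yield ("sleep", pause_seconds)


def run_schedule(steps, run=subprocess.call, sleep=time.sleep):
    """Execute the steps; run and sleep can be swapped out for a dry run."""
    for kind, value in steps:
        if kind == "run":
            run(["dbench", str(value)])  # e.g. "dbench 3" simulates 3 clients
        else:
            sleep(value)


if __name__ == "__main__":
    run_schedule(dbench_schedule())
```

Passing run=print and sleep=lambda s: None gives a dry run that only lists the commands.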
Re: ata fallback to PIO mode on dual processor AMD systems
Quoting Francesco Casadei [EMAIL PROTECTED]:

> On Tue, Dec 31, 2002 at 03:57:16PM -0500, Bruce Campbell wrote:
>
>> I am seeing a problem with ata disks on 4 new systems, which I believe is either a bug in the ata driver, or a problem with the onboard IDE controller, or something else. Systems are as follows:
>> ...
>> Motherboard: ASUS A7M266-D
>> CPUs       : 2 x 2000+ AMD MP
>> Memory     : 2 x 512MB Crucial part: CT6472Y265
>>
>> Dec 30 23:26:59 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
>> Dec 30 23:26:59 ecserv13 /kernel: ata0: resetting devices .. done
>> Dec 30 23:26:59 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
>> Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
>> Dec 30 23:27:00 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
>> Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
>> Dec 30 23:27:00 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
>> Dec 30 23:27:00 ecserv13 /kernel: ad0: timeout waiting for cmd=ef s=d0 e=00
>> Dec 30 23:27:00 ecserv13 /kernel: ad0: trying fallback to PIO mode
>
> Same problem here, but slightly different configuration:
>
> # atacontrol list
> ATA channel 0:
>     Master: ad0 IC35L040AVER07-0/ER4OA44A ATA/ATAPI rev 5
>     Slave:  no device present
> ATA channel 1:
>     Master: acd0 LG CD-ROM CRD-8521B/1.03 ATA/ATAPI rev 0
>     Slave:  no device present
> ATA channel 2:
>     Master: ad4 IC35L040AVER07-0/ER4OA44A ATA/ATAPI rev 5
>     Slave:  no device present
> ATA channel 3:
>     Master: ad6 IC35L040AVER07-0/ER4OA44A ATA/ATAPI rev 5
>     Slave:  no device present
>
> ad4 and ad6 are attached to a Promise FastTrak 100 TX2 ATA RAID controller.
>
> # atacontrol mode 0
> Master = UDMA100
> Slave  = ???
> # atacontrol mode 1
> Master = PIO4
> Slave  = ???
> # atacontrol mode 2
> Master = UDMA100
> Slave  = ???
> # atacontrol mode 3
> Master = PIO4
> Slave  = ???
>
> ad6 falls back to PIO mode on heavy I/O activity, i.e. when the system does a level 0 file systems dump from the RAID 1 array (ad4,ad6) to the backup disk ad0.
>
> Rebooting and rebuilding the array with the Promise BIOS utility temporarily solves the problem. The system may be up and running for 1-4 weeks doing a level 0 dump every morning at 5:30am, and then one day the drive ad6 falls back to PIO mode again (a little before the completion of the fs dump).
>
> Do the hard drives you are using support ATA tagged queuing? And if so, do you have TQ enabled?

I don't have it enabled:

    hw.ata.tags: 0

I've manually set:

    atacontrol mode 0 UDMA33 UDMA33

and the problem has not recurred.

-- 
Bruce Campbell
Engineering Computing CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
Re: Followup to fallback to PIO mode on dual processor AMD systems
Quoting Bruce Evans [EMAIL PROTECTED]:

> On Thu, 2 Jan 2003, Bruce Campbell wrote:
>
>> At present, I don't suspect bad media because the error message is
>>     WRITE command timeout tag=0 serv=0
>> which doesn't suggest a specific sector/track etc, and running with UDMA33 instead of UDMA100 makes the problem appear to vanish.
>
> The fallback is clearly wrong because it turns isolated media errors into pessimized i/o for the whole disk at best, system hangs during resets next best, and system crashes at worst. I keep a disk with bad media on line for testing some of this, and zap the fallback using the following patch (hope this is complete; it was edited from a larger patch).

Thanks for the patch.

Under moderate load, I am seeing occasional instances of:

/kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
/kernel: ata0: resetting devices .. done

and everything keeps on working normally via DMA, ie. it does not drop to PIO.

The more menacing case is this:

Dec 30 23:26:59 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:26:59 /kernel: ata0: resetting devices .. done
Dec 30 23:26:59 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 /kernel: ad0: timeout waiting for cmd=ef s=d0 e=00
Dec 30 23:27:00 /kernel: ad0: trying fallback to PIO mode
Dec 30 23:27:00 /kernel: ata0: resetting devices .. done

So it appears it would no longer work with DMA, but it would work with PIO. If it is manually set back to UDMA with the atacontrol command, it times out again, and falls back to PIO. However, after a soft reboot all is well again.

> %%%
> Index: ata-disk.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/dev/ata/ata-disk.c,v
> retrieving revision 1.139
> diff -u -2 -r1.139 ata-disk.c
> --- ata-disk.c  17 Dec 2002 16:26:22 -0000  1.139
> +++ ata-disk.c  18 Dec 2002 01:03:37 -0000
> @@ -597,5 +606,5 @@
>             else {
>                 ata_dmainit(adp->device, ata_pmode(adp->device->param), -1, -1);
> -               printf(" falling back to PIO mode\n");
> +               printf(" NOT falling back to PIO mode\n");
>             }
>             TAILQ_INSERT_HEAD(&adp->device->channel->ata_queue, request, chain);
> @@ -603,4 +612,5 @@
>      }
>
> +#if 0
>      /* if using DMA, try once again in PIO mode */
>      if (request->flags & ADR_F_DMA_USED) {
> @@ -613,4 +623,5 @@
>         return ATA_OP_FINISHED;
>      }
> +#endif
>
>      request->flags |= ADR_F_ERROR;
> %%%
>
> Bruce

-- 
Bruce Campbell
Engineering Computing CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
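To tell the harmless case (an isolated timeout that recovers under DMA) from the menacing one (a run of timeouts ending in a fallback), a log could be summarised with a small Python sketch like this one. The patterns are based only on the kernel messages quoted in this thread, and the function name is hypothetical.

```python
import re

# Patterns taken from the messages quoted in this thread.
TIMEOUT = re.compile(r"(ad\d+): \w+ command timeout")
FALLBACK = re.compile(r"(ad\d+): trying fallback to PIO mode")


def timeouts_before_fallback(lines):
    """Return (device, timeout_count, fell_back) tuples from syslog lines.

    Timeouts are counted per device since that device's last fallback.
    Counts still pending at the end of the log are reported with
    fell_back=False.
    """
    pending = {}  # device -> timeouts seen since its last fallback
    events = []
    for line in lines:
        m = TIMEOUT.search(line)
        if m:
            pending[m.group(1)] = pending.get(m.group(1), 0) + 1
            continue
        m = FALLBACK.search(line)
        if m:
            events.append((m.group(1), pending.pop(m.group(1), 0), True))
    events.extend((dev, n, False) for dev, n in sorted(pending.items()))
    return events
```

Fed the Dec 30 excerpt above, this reports four timeouts on ad0 followed by a fallback; fed the isolated two-line case, it reports one timeout and no fallback.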
ata fallback to PIO mode on dual processor AMD systems
I am seeing a problem with ata disks on 4 new systems, which I believe is either a bug in the ata driver, or a problem with the onboard IDE controller, or something else. Systems are as follows:

Motherboard: ASUS A7M266-D
CPUs       : 2 x 2000+ AMD MP
Memory     : 2 x 512MB Crucial part: CT6472Y265
Disks (all UDMA100):
             Master        Slave
  System 1:  WDC WD400BB   WDC WD1000BB
  System 2:  WDC WD400BB   WDC WD1000BB
  System 3:  WDC WD400BB   WDC WD800BB
  System 4:  WDC WD400BB   Maxtor 98196H8
Kernel     : 4.7-RELEASE, custom kernel (compared to GENERIC):
  commented out:
    cpu I386_CPU
    cpu I486_CPU
  enabled:
    options SMP      # Symmetric MultiProcessor Kernel
    options APIC_IO  # Symmetric (APIC) I/O

I am running a test with dbench (/usr/ports/benchmarks/dbench) with a script which runs:

    dbench 1
    sleep for 5 minutes
    dbench 2
    sleep for 5 minutes
    dbench 3
    ...

to simulate 1,2,3... clients.

The following has happened on systems 2, 3 and 4, after about 15 hours of running the test:

Dec 30 23:26:59 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:26:59 ecserv13 /kernel: ata0: resetting devices .. done
Dec 30 23:26:59 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
Dec 30 23:27:00 ecserv13 /kernel: ad0: WRITE command timeout tag=0 serv=0 - resetting
Dec 30 23:27:00 ecserv13 /kernel: ad0: timeout waiting for cmd=ef s=d0 e=00
Dec 30 23:27:00 ecserv13 /kernel: ad0: trying fallback to PIO mode
Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done

The test continues to run with the ata controller in PIO mode, with slower performance, and higher load average. Once the master drops to PIO, attempts to access the slave then cause it to drop to PIO.

If I run:

    atacontrol mode 0 UDMA100 UDMA100

attempts to access either drive result in a delay until the controller drops to PIO, and then operations resume. After a soft reboot, things work in UDMA mode again. Also tried UDMA33 and UDMA66 with no change.

I also tried atacontrol reinit 0, which did not help.

Theories I found when searching the web for fallback to PIO mode include:
- bad disks
- something to do with thermal recalibration

I don't believe the problem is bad disks, as the slave drops to PIO after the master does, and I can't get it back to UDMA, other than by soft reboot. Plus I see the problem on 6 of 8 disks.

The problem is very repeatable.

Can anyone offer any ideas, or suggest investigative steps? I have a system in PIO mode right now.

Thanks,

-- 
Bruce Campbell
Engineering Computing CPH-2374B
University of Waterloo
(519)888-4567 ext 5889
Re: ata fallback to PIO mode on dual processor AMD systems
Quoting Matthew Emmerton [EMAIL PROTECTED]:

> [ cc'ing Soren since he's the ATA guru ]
>
>> Dec 30 23:27:00 ecserv13 /kernel: ad0: trying fallback to PIO mode
>> Dec 30 23:27:00 ecserv13 /kernel: ata0: resetting devices .. done
>>
>> The test continues to run with the ata controller in PIO mode, with slower performance, and higher load average. Once the master drops to PIO, attempts to access the slave then cause it to drop to PIO.
>
> Are you using 80-conductor cables on all your drives? These are required to get consistent high throughput, and running without them may cause the problems you're seeing.

Thanks for the information about the design of IDE etc, and the suggestion about the cables. I was about to shuffle things to get the disks onto separate channels, but I now see that would be a mistake as my CD drive would share a cable with a disk.

Anyway, they all have the 80 conductor cable.

I forgot to add some environmental and other information.

The 4 AMD systems are in Aopen hx08 towers, with 400 watt power supplies, and 5 auxiliary fans (in addition to the power supply fan, and the fan on each cpu). They are in an air conditioned machine room. The CPU and motherboard temperatures are within spec. I mention this as I note many reported AMD system problems have been traced to overheating.

All drives are installed in removable drive bays. I don't have the make/model on hand right now. They were $19 CAD ($13 USD). The low cost makes me suspicious now, but...

I'm running the same tests on 4 single processor 2.4GHz Intel systems. They have not failed in this manner so far.

Initially, I had 1GB memory modules in the AMD systems (I can't remember the make) and the systems froze and rebooted randomly. I moved to Crucial 512MB modules to cure that problem.