Re: Analysis of disk file block with ZFS checksum error
Eric Anderson wrote: > I'm starting to think there is a timing issue or some such problem with > ZFS, since I can use the same drives in a gmirror with UFS, and never > have any data problems (md5 checksums confirm it over-and-over). I > highly doubt that everyone is seeing similar issues and it just is > because ZFS is so intense. I've had plenty of systems under severe disk > load that have never exhibited corrupt files because of something like > this. I also wondered about this, i.e. whether ZFS was triggering a certain timing behavior that revealed the problem. Still, if this is the case, it seems to me that the problem lies in the ATA subsystem, since it should prevent higher-level code like ZFS from being able to create bad timings (or am I not thinking of this correctly?). Also, I think there were some reports of problems with DMA/ATA when *not* using ZFS. > I wish we could get our hands on this issue.. Seems like some common > threads are ATA/SATA disks. Is your setup running 32bit or 64bit > FreeBSD? (if you already mentioned it, I'm sorry, I missed it) This was on 32bit FreeBSD with PATA. I am the one who had no SMART issues and no DMA errors reported under Linux. Changing the cable may have "fixed" it, since I did not see errors in some further testing, but even if so, my theory is that there is some edge case (timing?) that the FreeBSD ATA drivers were sensitive to, and perhaps my change of cables pushed the problem to the other side of the threshold. Since I never saw errors under Linux (and I've been using that cable for a couple of years), I do not necessarily think the cable was actually "defective". -Joe ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: is there any raid5 in software in FreeBSD ?
ZFS has RAIDZ - very similar to RAID5 (with added features), if you don't mind ZFS's current experimental state. -Joe Nenhum_de_Nos wrote: > i've seen RAID 0 through 3 (skip 2 ;) ) > > thanks, > > matheus
Multiple key presses are hindered when repeat turned off
I have verified this on two machines, but it would be helpful if others out there can reproduce it too. Also, I do not know if it is Xorg or the FreeBSD keyboard drivers, since I see no way to reproduce on the console (i.e. turn off repeat). In an xterm, type: "xset r off". Then try some multiple-key combinations (i.e. keep holding first key(s) when you type the next one): po (o does not appear) lk (k does not appear) grep (e does not appear) When you release the keys, the press events will show up. Keyboards in general have limited multiple-key (rollover) capabilities, but using "xset r off" reduces these to the point that you will often mistype things, and it seems unique to FreeBSD. I am using 7.0-RC2 at the moment. Thanks, Joe
Revisiting jerky/freezing mouse issue in 7.0
I spent some time looking again at a trace I posted last month showing mouse "jerkiness/freezing" under load (note that I see it all of the time under light load too, but it is harder to reproduce on demand). Here's the trace: http://www.skyrush.com/downloads/ktr_ule_4.out The large stretches of yellow in the Xorg process are what trouble me. Clearly, Xorg is yielding processor time mostly to, in this case, xtrs, which is getting a whole lot of time. If you look at the fairly regular mouse events, you'll notice that moused runs for a short time on each mouse event from psm0 and then sleeps. This makes sense, and it appears moused is acting correctly. But many of these mouse events are seemingly ignored by Xorg, which spends most of its time yielding (yellow) and not getting "woken up" by the events to simply process them. I've noticed, also, that Xorg can "get behind" easily and spend its time catching up on event processing for a while after I stop using the mouse. It just doesn't seem to be getting an appropriate amount of CPU time, or at least it yields too long between runs, to make interactivity smooth. These yields, I believe, are the freezes I see. Here's a question: does Xorg "respond" to mouse events, or does it just wake up every now and then and check? Note that even when Xorg runs, it only runs for a very short time. If the ULE scheduler is being fair, I would think this might give Xorg *more* of a share of the CPU to use to service these events, since it is running a lot less than xtrs. One interesting point is at timestamp 1478223777518. It looks like Xorg *starts* to yield when moused runs. Here's the line: 1478223777518 sched_add: 0xa7be1660(Xorg) prio 160 by 0xa5eb7aa0(moused) Does this mean that moused *caused* Xorg to yield, or am I reading this incorrectly? The yield then lasts through a series of mouse moves. A quick look through the graph shows that this happens quite a bit, which seems like the reverse of what we'd like. 
This issue (especially since it does not even require continuous heavy CPU use to see) is a constant distraction while using the system, and again I want to volunteer my time to help track it down. I am not sure how to further delve into it, so if there is some additional data I can gather, please let me know, and I'll gladly do it. Thanks, Joe
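To see how often a given thread shows up in sched_add records "by" another thread, the trace could be scanned with a small script. The following is only a sketch: the pattern is modelled on the single sched_add line quoted in the message, the function names are made up for illustration, and other KTR record types are ignored.

```python
import re
from collections import Counter

# Pattern modelled only on the one sched_add line quoted in the message;
# any other KTR record type is simply ignored.
SCHED_ADD = re.compile(
    r"(?P<ts>\d+)\s+sched_add:\s+"
    r"0x(?P<thread>[0-9a-f]+)\((?P<name>[^)]*)\)\s+"
    r"prio\s+(?P<prio>\d+)\s+by\s+"
    r"0x(?P<by_thread>[0-9a-f]+)\((?P<by_name>[^)]*)\)"
)


def parse_sched_add(line):
    """Return a dict describing a sched_add record, or None if no match."""
    m = SCHED_ADD.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = int(rec["ts"])
    rec["prio"] = int(rec["prio"])
    return rec


def count_pairs(lines):
    """Count (by_name, name) pairs over all sched_add records in lines."""
    pairs = Counter()
    for line in lines:
        rec = parse_sched_add(line)
        if rec is not None:
            pairs[(rec["by_name"], rec["name"])] += 1
    return pairs


# Hypothetical usage on a local copy of the trace file:
# with open("ktr_ule_4.out", errors="replace") as f:
#     print(count_pairs(f).most_common(10))
```

Counting the ("moused", "Xorg") pair against the total number of moused runs would show whether the pattern at timestamp 1478223777518 is the rule or the exception.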
Re: mount of ext2fs volume stuck in "D+" state (disk uninterruptible wait)
New information: it looks as though this ext2fs was already mounted when the mount was attempted. I have reproduced the issue by simply trying to mount the ext2fs volume more than once. Given this, I'd expect the mount to return an already mounted error rather than hanging, so this is perhaps a straightforward bug. -Joe
Re: mount of ext2fs volume stuck in "D+" state (disk uninterruptible wait)
Kris Kennaway wrote: > Joe Peterson wrote: >> I just tried (under FreeBSD 7.0-RC1) to mount an ext2fs volume - I've >> mounted it before with no trouble on this same FreeBSD version. This >> time, mount appeared to hang. I noticed that I can see the contents of >> the volume under the mount point, so the mount seemed to "work", but the >> process is stuck. "ps" shows: >> >> root 1307 0.0 0.0 3156 792 p6 D+ 5:21PM 0:00.00 mount >> /mnt/linux-home >> >> The "ps" man page says that "D" means: "Marks a process in disk (or >> other short term, uninterruptible) wait." >> >> Is there any way I can investigate what is going on? I cannot umount >> (device busy) or break out of the mount command... > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html But unfortunately I do not have KDB and DDB compiled into the kernel. And, obviously, if I reboot, I will lose this opportunity. I suspect this to be an intermittent thing. Is there anything I can extract while the system is running that would be useful? Thanks, Joe
mount of ext2fs volume stuck in "D+" state (disk uninterruptible wait)
I just tried (under FreeBSD 7.0-RC1) to mount an ext2fs volume - I've mounted it before with no trouble on this same FreeBSD version. This time, mount appeared to hang. I noticed that I can see the contents of the volume under the mount point, so the mount seemed to "work", but the process is stuck. "ps" shows: root 1307 0.0 0.0 3156 792 p6 D+ 5:21PM 0:00.00 mount /mnt/linux-home The "ps" man page says that "D" means: "Marks a process in disk (or other short term, uninterruptible) wait." Is there any way I can investigate what is going on? I cannot umount (device busy) or break out of the mount command... Thanks, Joe
Re: Analysis of disk file block with ZFS checksum error
Gavin Atkinson wrote: > Are the datestamps (Thu Jan 24 23:20:58 2008) found within the corrupt > block before or after the datestamp of the file it was found within? > i.e. was the corrupt block on the disk before or after the mp3 was > written there? Hi Gavin, those dates are later than the original copy (I do not have the file timestamps to prove this, but according to my email record, I am pretty sure of this). So the corrupt block is later than the original write. If this is the case, I assume that the block got written, by mistake, into the middle of the mp3 file. Someone else suggested that it could be caused by a bad transfer block number or bad drive command (corrupted on the way to the drive, since these are not checksummed in the hardware). If the block went to the wrong place, AND if it was a HW glitch, I suppose the best ZFS could then do is retry the write (if its failure was even detected - still not sure if ZFS does a re-check of the disk data checksum after the disk write), not knowing until the later scrub that the block had corrupted a file. I think that anything is possible, but I know I was getting periodic DMA timeouts, etc. around that time. I hesitate, although it is tempting, to use this evidence to focus blame purely on bad HW, given that others seem to be seeing DMA problems too, and there is reasonable doubt whether their problems are HW related or not. In my case, I have been free of DMA errors (cross your fingers) after re-installing FreeBSD completely (giving it a larger boot partition and redoing the ZFS slice too), and before this, I changed the IDE cable just to eliminate one more variable. Therefore, there are too many variables to reach a firm conclusion, since even if the cable was "bad", I never saw one DMA error or other indication of anything wrong with HW from the Linux side (and I've been using that HW with both Linux and FreeBSD 6.2 for months now - no apparent flakiness of any kind on either system). 
So either it *was* bad and FreeBSD 7.0 was being more "honest", FreeBSD's drivers and/or ZFS were stressing the HW and revealing weaknesses in the cable, or it was a SW issue that got cleared somehow when I re-installed. Is it possible that the problem lies in the ATA drivers in FreeBSD or even in ZFS and just looks like HW issues? I do not have enough info/expertise to know. If not, then it may very well be true that HW problems are pretty widespread (and that disk HW cannot, in fact, be trusted), and there really *is* a strong need for ZFS *now* to protect our data. If there is a possibility that SW could be involved, any hints on how to further debug this would be of great help to those still experiencing recent DMA errors. I just want to be more sure one way or the other, but I know this issue is not an easy one (however, it's the kind of problem that should receive the highest priority, IMHO). -Joe
Re: Analysis of disk file block with ZFS checksum error
Julian Elischer wrote: > it could be an old file.. > what kind of disks? It's a Seagate ST3500630A parallel ATA drive. > I had a scenario where 3ware controllers were just failing to write to > a drive in the array, so old data showed through. I have an Intel ICH4 controller - nothing unusual. > the filesystem and the partitions and the raids all were on different > alignments so the only part of the system that had a boundary that > aligned with the bad data was the physical stripes laid down by the > controller. It was 64k stripes and 64k data missing, exactly on > stripe boundaries. Due to the fact that FreeBSD had partitioned the > drive starting at 63 blocks in, nothing else aligned with the problem. Hmm, well this is a straight-forward disk situation - never used RAID on this drive. Given what is happening, I wonder about the chances of it being a HW, OS, or filesystem issue. -Joe
Re: Analysis of disk file block with ZFS checksum error
Chris Dillon wrote: > That is a chunk of a Mozilla Mork-format database. Perhaps the > Firefox URL history or address book from Thunderbird. Interesting (thanks to all who recognized Mork). I do use Firefox and Thunderbird, so it's feasible, but how the heck would a piece of one of those files find its way into 1/2 of a ZFS block in one of my mp3 files? I wonder if it could have been done on write when the file was copied to the ZFS pool (maybe some write-caching issue?), but I thought ZFS would have verified the block after write. It seems unlikely that it would get changed later - I never rewrote that file after the original copy... -Joe
Re: Analysis of disk file block with ZFS checksum error
Mark Day wrote: > Based on the subset of data you posted, the bad data looks like ASCII > text. > The bad data from offset a0000 to a000f is: > > ${138AFE{@ > @$$}1 > > The bad data from offset af6c1 to af6c8 is: > > 392A9}@ > > I don't recognize the content beyond that, but I'd guess that somehow > the > contents of some other file managed to overwrite that portion of the bad > file. As for how that happened, I don't know. But if someone > recognizes > where the bad content came from, that might be a clue. Gary/Mark, Good eye! Yes, it indeed does appear to be ASCII. I *thought* something in the repetition when I originally did an od -a looked interesting. I dumped the whole bad section as a string, and here's (partly) what I get: ${138AFE{@ @$$}138AFE}@ @$${138AFF{@ [A3:^80(^91^2146F)] @$$}138AFF}@ @$${138B00{@ @$$}138B00}@ @$${138B01{@ [181:^80(^91^2146F)] @$$}138B01}@ @$${138B02{@ @$$}138B02}@ @$${138B03{@ [2C:^80(^91^2146F)] @$$}138B03}@ @$${138B04{@ @$$}138B04}@ . . . @$${138B8B{@ <(21470=Thu Jan 24 23:20:58 2008)> [117:^80(^91^21470)] @$$}138B8B}@ . . . @$${138C18{@ <(21472=1201242069)>[-2:^80(^82^85)(^83^1B5)(^84=b)(^85=1)(^86=0)(^87=0) (^88=0)(^89^2146C)(^8A=)(^8B=40)(^8C=2e)(^8D^84)(^8E=0)(^90^21472) (^91^21460)] @$$}138C18}@ @$${138C19{@ <(21473=a72f78)>[2:^80(^89^21473)] @$$}138C19}@ @$${138C1A{@ @$$}138C1A}@ . . . and more of the same. Note the date string. There are several like that. Anyone recognize this text format? -Joe
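For anyone who wants to repeat the "dump the bad section as a string" step on their own files, here is a minimal Python sketch. It is not the exact method used above (that was od -a plus a string dump); the function names and the min_len threshold are assumptions made for illustration.

```python
import string

# Printable ASCII byte values, minus vertical tab and form feed.
PRINTABLE = set(string.printable.encode("ascii")) - {0x0B, 0x0C}


def read_region(path, offset, length):
    """Read `length` bytes starting at byte `offset` of the file at `path`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def ascii_runs(data, min_len=4):
    """Yield (offset, text) for each printable-ASCII run of at least min_len bytes."""
    start = None
    for i, b in enumerate(data):
        if b in PRINTABLE:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                yield start, data[start:i].decode("ascii")
            start = None
    # Flush a run that extends to the end of the data.
    if start is not None and len(data) - start >= min_len:
        yield start, data[start:].decode("ascii")


# Hypothetical usage on the 64K bad section of a damaged copy:
# for off, text in ascii_runs(read_region("bad.mp3", 0xA0000, 65536)):
#     print("%08x  %r" % (0xA0000 + off, text))
```

Runs of readable text, such as the date strings, stand out immediately against the compressed audio data that should have been in that region.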
Analysis of disk file block with ZFS checksum error
In my experimentation with the ZFS filesystem, I encountered one case of a file block with a checksum mismatch. Doing a "zpool scrub" revealed it, and trying to read the file yielded an error - only the part of the file before the bad block was read (ZFS aborts reading at this point, which makes sense), resulting in a short file. The reason the CKSUM error is not fixable is that my ZFS pool contains only one device (no mirror or RAIDZ), but I do have the original/good version of the file affected. Here's the output of zpool status (new scrub in process):

  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress, 64.36% done, 0h18m to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     2
          hda6      ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        /mnt/tank/fbsd/home/joe/music/jukebox/christmas/Esquivel/Merry_XMas_from_the_SpaceAge_Bachelor_Pad/07-Snowfall.mp3

I was curious about what actually happened: was this a ZFS bug, trouble with its metadata, or truly a bad block? In order to determine this, I modified ZFS's source code temporarily to ignore the checksum mismatch and let the file read fully. What I then got was the full-length file and no errors, showing that there were no disk read errors associated with the read (I already had assumed this from the fact that zpool status showed only a non-zero CKSUM count). However, I may have seen other error counts previously (ZFS resets them to zero on, e.g., reboot). I received no errors when originally copying this file *to* the ZFS pool - only on subsequent reads/scrubs. (Note that I have posted before about DMA errors in my log for the disk I am using, but I have had nothing but successful SeaTools tests (surface scans) of the drive. 
Jeremy Chadwick had similar issues, as did others, so I think it is worth investigating if there is some OS/software cause rather than real HW issues. This is one reason I wanted to investigate my ZFS checksum issue more deeply.) I also have a good backup of the file in question, so I now have two copies of the file: one good, and one with a bad block. The file is 3575936 bytes long, and recordsize (in ZFS) is 128K, making the file about 27 blocks long. Curiously, the bad section of the file is exactly 65536 bytes long (1/2 a block). The bad section starts exactly at the 5th 128K block boundary (byte 655360 or hex a0000). I wanted to see the characteristics of the bad data. Was just one bit flipped randomly? No. Is it just one bit or set of bits in the bytes that are affected? It doesn't seem so. Were there any other strange patterns here? Well, yes, and maybe someone out there with more knowledge/experience in disk modes of failure will recognize something (I have included some data below). For one thing (as I mentioned), only 65536 bytes are bad (and it's exactly this many, with a few "good" bytes thrown in, but not far from the number of matches random chance would produce). Also, all bad bytes have a zero in the high bit - interesting? Also, near the end of the block, the bad bytes all go to zero, strangely coincident with the first "good" zero in that bad block - not sure if that's coincidence or not. Also, I calculated the number of "Bits same" (matching bits) in the good vs. bad bytes, and it appears fairly random, so it appears that the bad bytes are very random in nature and not correlated much at all with the good bytes. So except for the fact that the 2nd half (65536 bytes) of the ZFS block is good, the bad block seems to consist of random data, except for the string of zero bytes near the end and the zero high-bit. It's not as if one bit on the disk flipped - it affects the whole (1/2) block. 
Does this seem like a disk error, a controller error/bug, or a cable problem (I recently put a new cable on, so I doubt this)? It seems to me something more systemic rather than a random bit error - opinions are more than welcome. Here is some info from a python program I wrote to look at the data (I've left out spans of essentially uninteresting portions showing similar stuff, but I can get you the whole thing if interested):

File pos  Good  Bad  Match  Good (bin)  Bad (bin)  Bits same
0009fff0  d9    d9   Yes    11011001    11011001   8
0009fff1  05    05   Yes    00000101    00000101   8
0009fff2  c1    c1   Yes    11000001    11000001   8
0009fff3  81    81   Yes    10000001    10000001   8
0009fff4  5f    5f   Yes    01011111    01011111   8
0009fff5  66    66   Yes    01100110    01100110   8
0009fff6  5e    5e   Yes    01011110    010
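A comparison of this kind is easy to reproduce. The sketch below is not the program that produced the output above; it is a minimal reconstruction of the same columns, plus a summary of the observations made in the message (number of differing bytes, whether every differing bad byte has its high bit clear, and the average number of matching bits). All function names are illustrative.

```python
def bits_same(a, b):
    """Number of bit positions (out of 8) in which bytes a and b agree."""
    return 8 - bin(a ^ b).count("1")


def compare_bytes(good, bad, start=0):
    """Yield (pos, good, bad, match, good_bin, bad_bin, bits_same) per byte."""
    for i, (g, b) in enumerate(zip(good, bad)):
        yield (start + i, g, b, g == b,
               format(g, "08b"), format(b, "08b"), bits_same(g, b))


def format_row(row):
    """Render one row in roughly the layout of the table in the message."""
    pos, g, b, match, gbin, bbin, same = row
    return "%08x  %02x  %02x  %-3s  %s  %s  %d" % (
        pos, g, b, "Yes" if match else "No", gbin, bbin, same)


def summarize(good, bad):
    """Summarize how the bad data differs from the good data."""
    diffs = [(g, b) for g, b in zip(good, bad) if g != b]
    mean = sum(bits_same(g, b) for g, b in diffs) / len(diffs) if diffs else None
    return {
        "compared": min(len(good), len(bad)),
        "differing": len(diffs),
        "bad_high_bit_always_clear": all(b < 0x80 for _, b in diffs),
        "mean_bits_same": mean,
    }


# Hypothetical usage, with both copies read fully into memory:
# good = open("good.mp3", "rb").read()
# bad = open("bad.mp3", "rb").read()
# for row in compare_bytes(good[0x9FFF0:0xA0010], bad[0x9FFF0:0xA0010], 0x9FFF0):
#     print(format_row(row))
# print(summarize(good[0xA0000:0xB0000], bad[0xA0000:0xB0000]))
```

For two unrelated random bytes the expected number of matching bits is 4, so a mean_bits_same close to 4 over the bad region would support the "uncorrelated data" reading.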
Re: Frequent USB mouse disconnections under load with RELENG_7
Wayne Sierke wrote: > On Fri, 2008-01-25 at 01:59 +1030, Wayne Sierke wrote: >> I'm getting a lot of USB mouse disconnects on RELENG_7. I wondered >> whether they might have been due to running with a KTR-enabled kernel >> but in just the last 7 hours I've been running on stock GENERIC and >> they're still happening. Hey Wayne, I'm not sure if you are associating the disconnects with the "jerky mouse" behavior, but as an added datapoint, I have a PS/2 mouse, I see *no* disconnects in the system logs (well, it's PS/2...), and I still get the jerky mouse... -Joe
Unexpected "resilver" after reboot (after scrub found CKSUM problems)
[...reposting to freebsd-stable - no response on freebsd-fs] I had a strange thing happen on ZFS the other day, and I cannot find any info about it on the web - thought you might have some ideas. I am using 7.0-RC1 at the moment. I found a checksum error in ZFS during a scrub. This is strange in itself, since I believe the disk is OK (see below):

  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          ad0s1d    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /home/joe/music/jukebox/christmas/Esquivel/Merry_XMas_from_the_SpaceAge_Bachelor_Pad/07-Snowfall.mp3

This is how it appears after a recent reboot, however. After a scrub, I see a varying number of non-zero counts under CKSUM. Not sure why it is zero after reboot (maybe that's normal). However, the strange thing is that after my first reboot after the scrub found the issue, zpool status told me that "resilver completed with 0 errors", and there were no known errors. Only trying to read the file and/or rescrubbing returned the status to the error state and made the CKSUM column non-zero. Since I do not have a mirror or raid config, I'm not sure why it would resilver at all, and I did nothing explicit to cause a resilver (as far as I know)... Any ideas? As an aside, I, along with some others on freebsd-stable@freebsd.org, have been seeing what "look" like disk errors in the system logs. I have a suspicion that there could be some other cause (lots of discussion on that list, if you are interested). Strangely, this disk checks out fine on both short and long tests in Seatools, and smartctl shows it as OK. Also, using Linux to do lots of reads from it does not show any issue or error logs. 
At this point, I am not sure if the CKSUM issue is a real HW flaw or something else... Thanks, Joe
Re: ad8: TIMEOUT - WRITE_DMA errors UFS 7.0-RC1
Remco van Bekkum wrote: > Well it looks like in my case it is hardware related after all. It failed to > read the boot > block several times now. 2nd sort of DOA of this disk... Have you tried reading the block in another OS or using SeaTools? That would at least verify that it's hardware. -Joe
Re: ad8: TIMEOUT - WRITE_DMA errors UFS 7.0-RC1
Jeremy Chadwick wrote: >> If this is widespread, I think the chances are slim that it is a >> hardware problem in every case. > > I'm in definite agreement here. I think it might be worthwhile to note > what hardware we're all using, in case there's something similar between > our systems (chipset, disk vendor, etc.). > > My system is as follows; timeouts were reported during an rsync of data > from the ZFS stripe (ad8+ad10) to a UFS2 filesystem on ad6. System > eventually panic'd after remaining deadlocked (while kernel messages > about timeouts kept printing on the console for ad6 only) for 10-15 > minutes.
>
> * MB: Supermicro PDSMI+ (Intel ICH7-based)
> * CPU: Intel Core 2 Duo E6600
> * RAM: Corsair CM2X1024-6400 DDR2, 2GB
> * ad4: WD Caviar SE WD2000JD (boot/OS)
> * ad6: Seagate Barracuda 7200.10 ST3500630AS
> * ad8: WD Caviar SE16 WD5000AAKS (ZFS stripe)
> * ad10: WD Caviar SE16 WD5000AAKS (ZFS stripe)
> * All drives are hooked up to the ICH7.
> * SMART stats showed no problems on any of the drives before or after.
> * RELENG_7, i386, ULE scheduler.

Mine is as follows:

* MB: Tyan Trinity S2099
* CPU: Pentium 4, 2.4GHz
* RAM: Crucial DDR, ECC, CL2.5, Unbuffered 2GB (1/2 PC2100, 1/2 PC2700)
* ad0: Seagate ST3500630A 3.AAE (1 UFS2 boot, 1 ZFS pool)
* ad1: Seagate ST3160812A 3.AAH (not used by FreeBSD)
* Intel ICH4 UDMA100 controller
* ATI Radeon RV280 9250
* Intel PRO/1000 NIC
* 7.0-RC1, i386, ULE scheduler

-Joe
Re: ad8: TIMEOUT - WRITE_DMA errors UFS 7.0-RC1
Remco van Bekkum wrote: > Same here. On an amd64 system with 1x sata disk (Western Digital Caviar > Green Power) on an amd690G chipset, with UFS and intensive disk activity > the system hangs and in the end it may panic. I've csupped today and > rebuild world & generic kernel but still it's very unstable, sometimes it > even hangs when activating geom volumes at boot time... > I must add that this is a new system so I'm not 100% sure the hardware is > sane. > Using ZFS it also crashed when doing intensive I/O. This is very interesting. It seems there are several of us who are experiencing something that *looks* like hardware (disk) issues when using 7.0. Could this be related to the mouse freeze issue? Could some process be locking/grabbing the CPU at inopportune times and causing not only the freezing symptoms but also read/write problems? Can anyone else using 7.0 who hasn't already (especially those using ZFS) check his/her /var/log/messages for disk TIMEOUTs or other disk error messages? If this is widespread, I think the chances are slim that it is a hardware problem in every case. -Joe
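To make the /var/log/messages check quick, something like the following Python sketch could be used. The pattern is an assumption based only on the TIMEOUT/FAILURE lines quoted in these threads (for example "ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071"); other message shapes will not be caught, and the function name is made up.

```python
import re
from collections import Counter

# Matches lines shaped like the ones quoted on the list, e.g.
#   ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
#   ad6: FAILURE - WRITE_DMA timed out LBA=13951071
DISK_MSG = re.compile(
    r"\b(?P<dev>ad\d+): (?P<kind>TIMEOUT|FAILURE) - (?P<cmd>\S+)"
    r"(?:.*LBA=(?P<lba>\d+))?"
)


def scan(lines):
    """Return (counts, events); counts is keyed by (device, kind, command)."""
    counts = Counter()
    events = []
    for line in lines:
        m = DISK_MSG.search(line)
        if m is None:
            continue
        lba = int(m.group("lba")) if m.group("lba") else None
        key = (m.group("dev"), m.group("kind"), m.group("cmd"))
        counts[key] += 1
        events.append(key + (lba,))
    return counts, events


# Hypothetical usage:
# with open("/var/log/messages", errors="replace") as f:
#     counts, events = scan(f)
# for (dev, kind, cmd), n in sorted(counts.items()):
#     print(dev, kind, cmd, n)
```

Posting the per-device counts along with the hardware list would make it easier to compare systems.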
Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Ivan Voras wrote: > Were both tests done in the same machine (actually, I mean the same PSU)? Yes - I deliberately changed nothing (not even cables) before I ran the tests. I didn't want any variables. -Joe
Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Joe Peterson wrote: > So I have started a "SeaTools" (disk scanner from Seagate) "long test" of the > drive. The short test passed already. The results should be interesting. If > it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS > bugs that just happen to look like drive problems. I already did a long read, > under linux, of disk contents, and got no messages about anything wrong. Update: both SHORT and LONG tests passed for this drive in SeaTools. Hmph... the mystery remains. -Joe
Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
I performed a ZFS scrub, which finished yesterday, and no new /var/log/messages errors were reported during that time. However, the scrub found something interesting:

crater# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 1 errors on Fri Jan 25 12:52:32 2008
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     2
          ad0s1d    ONLINE       1     3     2

errors: Permanent errors have been detected in the following files:

        /home/joe/music/jukebox/christmas/Esquivel/Merry_XMas_from_the_SpaceAge_Bachelor_Pad/07-Snowfall.mp3

Note that I have not touched this file since copying it to this drive. So, it seems one file failed a checksum check during the scrub. I now (expectedly) get errors trying to read this file - probably ZFS indicating the condition. When I just logged in tonight, I got two more /var/log/messages disk messages about WRITE_DMA48 TIMEOUT/FAILURE - might be a coincidence (just as I was typing my password). Also, smartctl still shows PASSED; however, this is interesting:

195 Hardware_ECC_Recovered 0x001a 061 046 000 Old_age Always - 9070

The number is much *smaller* now! It was "6" a few minutes before this... wrap around? Hmm, I'm really not sure, at this point, what is going on. So I have started a "SeaTools" (disk scanner from Seagate) "long test" of the drive. The short test passed already. The results should be interesting. If it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS bugs that just happen to look like drive problems. I already did a long read, under linux, of disk contents, and got no messages about anything wrong. If I can turn on any debugging info to help determine if this is software-related, let me know the magic keywords to use. 
:) -Joe
Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Glad you got it back! Yes, when I was first playing with ZFS, I noticed that booting between single and multi user mode could make the pools "invisible". Import seemed to bring them back... So, is the disk toast, or can you still read anything from it (part table, etc.)? -Joe

Jeremy Chadwick wrote:
> On Fri, Jan 25, 2008 at 05:00:54PM -0800, Jeremy Chadwick wrote:
>> icarus# zfs list
>> no datasets available
>>
>> This doesn't bode well, and doesn't make me happy. At all.
>
> Pshew! I was able to get ZFS to start seeing the pool again by doing
> the following: (Supposedly "zpool import" by itself will show you a
> list of pools which it manages to see...")
>
> icarus# zpool import -f storage
> icarus# df -k /storage
> Filesystem  1024-blocks       Used     Avail Capacity  Mounted on
> storage       957873024  106124032 851748992    11%    /storage
> icarus# zfs list
> NAME      USED  AVAIL  REFER  MOUNTPOINT
> storage   101G   812G   101G  /storage
> icarus# zpool status
>   pool: storage
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         storage     ONLINE       0     0     0
>           ad8       ONLINE       0     0     0
>           ad10      ONLINE       0     0     0
>
> errors: No known data errors
>
> Back to the drawing board.
Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Jeremy Chadwick wrote:
> Joe, I wanted to send you a note about something that I'm still in the
> process of dealing with. The timing couldn't be more ironic.
>
> I decided it would be worthwhile to migrate from my two-disk ZFS stripe
> with a non-ZFS disk for nightly backups, to a RAIDZ pool of all 3
> disks combined (since they're all the same size). I had another
> terminal with gstat -I500ms running in it, so I could see overall I/O.
>
> All was going well until about the 81GB mark of the copy. gstat started
> showing 0KB in/out on all the drives, and the rsync was stalled. ^Z did
> nothing, which is usually a bad sign. :-) I ssh'd in and did a dmesg
> (summarised):
>
> ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
> ad6: FAILURE - WRITE_DMA timed out LBA=13951071
> ad6: FAILURE - WRITE_DMA timed out LBA=13951327
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
> ad6: FAILURE - WRITE_DMA timed out LBA=13951583
> ad6: FAILURE - WRITE_DMA timed out LBA=13951839
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
> g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5
>
> It appears my /dev/ad6 (a Seagate -- more irony) must have some bad
> blocks. Actually, after letting things go for a while, I realised the
> box just locked up. Probably kernel panic'd due to the I/O problem.
> I'll have to poke at SMART stats later to see what showed up.

Wow, pretty crazy! Hmm, and yes, those LBAs do look close together. Well, let me know how the smartctl output looks. I'd be curious if your bad sector count rises. I had noticed that 1

BTW, I tried:

crater# dd if=/dev/ad1s4 of=/dev/null bs=64k
^C1408596+0 records in
1408596+0 records out
92313747456 bytes transferred in 1415.324362 secs (65224446 bytes/sec)

(I let it go for 92GB or so) - no messages about ad1. So I wonder if this points at either the cable connector on ad0 or the drive itself. I guess I'd rather have a failing drive than a failing motherboard... I originally was wondering if something peculiar about ZFS's disk access pattern was making it happen...

Thanks for the recommendations. I'll keep an eye on it, and I'll let you know what a cable change does for me. Still, I have not had any ad0 messages since this morning (I haven't been using the system much today, but maybe the cron processes are more likely to trigger it...).

-Joe
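As a quick sanity check of the figures in this message, here is a minimal Python sketch. It only redoes arithmetic on numbers copied from the dd output and the quoted dmesg excerpt; the 512-byte sector size is an assumption, not something stated in the thread.

```python
# Figures copied from the dd output above.
records = 1408596            # "1408596+0 records out"
block_size = 64 * 1024       # bs=64k
seconds = 1415.324362

total_bytes = records * block_size
rate = total_bytes / seconds  # bytes per second; dd itself printed 65224446

# LBAs from the quoted ad6 TIMEOUT lines, in the order they were logged.
lbas = [13951071, 13951327, 13951583, 13951839, 13952095, 13952351]
SECTOR = 512                 # assumption: 512-byte sectors

# Distance between consecutive failing LBAs, in sectors.
gaps = [b - a for a, b in zip(lbas, lbas[1:])]

print("bytes read:", total_bytes)
print("throughput: %.1f MB/s" % (rate / 1e6))
print("LBA gaps (sectors):", gaps)
print("gap in bytes:", gaps[0] * SECTOR)
```

Under that sector-size assumption, the constant gap between the failing LBAs equals the length=131072 reported by the g_vfs_done() lines, which would fit a run of back-to-back 128 KB writes failing in sequence rather than scattered bad spots.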
Re: New KTR trace for mouse freezing/stuttering in 7.0-RC1
John Baldwin wrote:
> Hmm, when I look at that graph using schedgraph from HEAD it just looks
> like xtrs is using up all the CPU.

Yeah, xtrs is eating a lot of CPU, but I've never seen this affect the mouse movement (making it really jerky) the same way on, e.g., Linux. And the xtrs test is just a way to *reliably* make it happen. It happens intermittently all of the time (at least every few minutes, and often in small batches) even when the system is pretty idle...

-Joe
Re: New KTR trace for mouse freezing/stuttering in 7.0-RC1
Sam Leffler wrote:
> Sigh, you are correct. I backrev'd the machine where I ran schedgraph
> to RELENG_7 and didn't notice the old version mis-parses the ktr file.
> The graph is totally different w/ schedgraph from HEAD.
>
> Sorry Joe for misleading you.

No problem, Sam, but the question I have for you now is: do you see anything with the updated schedgraph that indicates any "freezes" that look funny? The ones I saw with mouse movement mostly lasted some portion of a second, from maybe 1/8 to 1/2 sec. And there should be a lot of them in quick succession.

Thanks, Joe
Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Chuck Swiger wrote:
> On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
>> ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED  RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x000f  114   071   006    Pre-fail  Always   -            82422948
> [ ... ]
>>   7 Seek_Error_Rate         0x000f  084   060   030    Pre-fail  Always   -            286126605
> [ ... ]
>> 195 Hardware_ECC_Recovered  0x001a  063   046   000    Old_age   Always   -            166181300
>
> These numbers are quite worrisome -- they should be zero or nearly so
> in a healthy drive.

It seems to depend on the drive manufacturer. E.g. this is a Seagate. Every Seagate I've ever had (or heard about on the web via smartctl dumps) reports very large numbers for these values. I've heard it described that Seagate shows you the raw numbers (and correctable errors do happen all the time in all drives). In Western Digital drives (IIRC), the numbers shown are the ones that *should* be zero, thereby hiding the low-level errors. Hard to say if my numbers are "too high", but these "corrected" error counts are always frighteningly high in Seagates.

-Joe
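Since the raw counts are vendor-specific, the comparison smartctl itself uses to flag a failing attribute is the normalized VALUE against THRESH (an attribute counts as failed once VALUE is less than or equal to THRESH). A minimal Python sketch of that check, using only the three rows quoted above; the parse() helper and its field positions are illustrative assumptions based on the column order shown, not part of smartmontools.

```python
# Attribute rows as quoted in this thread, one per line, whitespace-separated:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
rows = """\
1 Raw_Read_Error_Rate 0x000f 114 071 006 Pre-fail Always - 82422948
7 Seek_Error_Rate 0x000f 084 060 030 Pre-fail Always - 286126605
195 Hardware_ECC_Recovered 0x001a 063 046 000 Old_age Always - 166181300
"""

def parse(line):
    """Hypothetical helper: split one attribute row into named fields."""
    f = line.split()
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),   # normalized current value
        "worst": int(f[4]),   # lowest normalized value seen so far
        "thresh": int(f[5]),  # failure threshold chosen by the vendor
        "raw": int(f[9]),     # vendor-specific raw counter
    }

attrs = [parse(line) for line in rows.splitlines()]

for a in attrs:
    # The raw number is not compared against anything here; only the
    # normalized value versus the threshold decides the verdict.
    verdict = "FAILING" if a["value"] <= a["thresh"] else "ok"
    print("%-24s value=%3d worst=%3d thresh=%3d -> %s"
          % (a["name"], a["value"], a["worst"], a["thresh"], verdict))
```

By this measure all three attributes are above their thresholds, even at their WORST readings, despite the large raw values. That does not prove the drive is healthy; it only shows why the drive still reports "PASSED".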
Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Jeremy Chadwick wrote:
> What you've shown is usually the sign of a disk-related problem. It's
> very obvious when it's just one disk reporting DMA errors. You use ZFS,
> so chances are you have more than one disk in a pool/volume -- there's
> no indication ad1, ad4, ad6, etc. are failing, so this seems to indicate
> something specific to ad0.

Jeremy, thanks for the response - I have tried to answer all of your questions below...

In my case, I am using only one disk (ad0) for FreeBSD, and I am only using one partition on this disk in my ZFS pool. So, unfortunately, the fact that only ad0 is listed does not tell us whether the problem is specific to this drive.

> Manufacturers pick very passive (non-aggressive) thresholds for error
> conditions on disks, so disks which are failing very commonly show
> "PASSED" during SMART analysis. To make matters worse, most users I
> know read SMART stats incorrectly (they're easy to misinterpret).

Yep, I am also always skeptical of SMART reports. That's one reason I am very interested in ZFS. I don't trust the drive to be completely reliable, and the fact that ZFS does end-to-end data integrity is very intriguing.

> Can you please provide output of the following:
>
> * smartctl -a /dev/ad0

OK, I've attached this to the end of this email.

> * atacontrol cap ad0

Protocol              ATA/ATAPI revision 7
device model          ST3500630A
serial number         9QG0DG03
firmware revision     3.AAE
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       976773168 sectors
dma supported
overlap not supported

Feature                        Support  Enable  Value         Vendor
write cache                    yes      yes
read ahead                     yes      yes
Tagged Command Queuing (TCQ)   no       no      0/0x00
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      65278/0xFEFE
automatic acoustic management  no       no      0/0x00        208/0xD0

> * atacontrol info

Master: ad0 ATA/ATAPI revision 7
Slave:  ad1 ATA/ATAPI revision 7

(but note that ad1 is not used by FreeBSD)

> * Relevant dmesg output that indicates what kind of ATA controller
> these disks are attached to. Start with output from 'ad0:' and
> work backwards. For example, ad0 on this machine is using an Intel
> ICH6 controller:
> atapci0: port
> 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
> ata0: on atapci0
> ad0: 238475MB at ata0-master SATA150

atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.1 on pci0
ata0: on atapci0
ata0: [ITHREAD]
ad0: 476940MB at ata0-master UDMA100

> SMART stats which are labelled "Offline" are only updated when a short
> or long offline test is performed. Have you tried using "smartctl -t
> short /dev/ad0" and "smartctl -t long /dev/ad0" to see if any of the raw
> values on the far right column increment?

I just tried one:

# 1 Short offline Completed without error 00% 5252 -
# 2 Short offline Completed without error 00% 5252 -

Also, none of the numbers that were zero incremented, esp:

198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0

Also, no more errors were reported in the system log during the self-tests.

> Have you tried using "zpool scrub" on the ZFS pool, then "zpool status"
> to see if READ/WRITE/CHKSUM counters increment or if the "scrub" line
> states there were errors?

OK, I started a scrub, and it will take some more time to complete... But I get the following with status. Could this be due to the timeouts and failures? I suspect so, so maybe this is not surprising. I'd also guess that this doesn't necessarily point to the drive, but to anything in the chain of events... I do not have a mirror or RAID-Z, so I guess the reason there was "no data loss" (yet) is because the checksum passed, and maybe it just had to retry...? Anyway, here's the output so far:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 2.50% done, 1h58m to go
config:

        NAME      STATE   READ WRITE CKSUM
        tank      ONLINE     1     3     0
          ad0s1d  ONLINE     1     3     0

errors: No known data errors

> Other things which have fixed problems in the past for others:
>
> * BIOS updates
> * Change of motherboards (sometimes replacing board with same model,
> other times going
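The READ/WRITE/CKSUM columns in the zpool status output above can be pulled apart mechanically, which makes it easier to compare them before and after a scrub. A minimal Python sketch, assuming the config table has been saved as text and that the counters are plain integers as they are here; parse_counters() is a hypothetical helper, not a ZFS tool.

```python
# The "config:" table from the zpool status output above, whitespace-normalized.
config_table = """\
NAME STATE READ WRITE CKSUM
tank ONLINE 1 3 0
ad0s1d ONLINE 1 3 0
"""

def parse_counters(text):
    """Hypothetical helper: map each pool/vdev name to its error counters."""
    lines = text.splitlines()
    counters = {}
    for line in lines[1:]:  # skip the header row
        name, state, read, write, cksum = line.split()
        counters[name] = {
            "state": state,
            "read": int(read),    # I/O errors on reads
            "write": int(write),  # I/O errors on writes
            "cksum": int(cksum),  # blocks whose checksum did not match
        }
    return counters

counters = parse_counters(config_table)

# Anything with a non-zero counter deserves a second look.
suspect = [n for n, c in counters.items()
           if c["read"] or c["write"] or c["cksum"]]

for name in suspect:
    c = counters[name]
    print("%s: read=%d write=%d cksum=%d"
          % (name, c["read"], c["write"], c["cksum"]))
```

In this snapshot the read and write counters are non-zero while CKSUM is still 0, which would be consistent with the guess above that the errors come from timed-out I/O rather than from data that was read back corrupted. The scrub was only 2.50% done at this point, so that reading is tentative.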
"ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
I've seen mention of this kind of issue before, but I never saw a solution, except that someone reported that a certain version of 6.x seemed to make it go away - accounts of this problem are a bit vague. I am running 7.0-RC1, and I am seeing the errors periodically, and I am wondering if this is a known issue.

Note that smartctl does not report errors logged and gives a "PASSED" to the drive. I am running at UDMA100 ATA. Also, if it matters, I am using ZFS. Attached is a grep of the /var/log/messages file. Let me know if anyone has suggestions.

Thanks! Joe

Jan 21 23:39:54 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54112319
Jan 22 00:06:29 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=51610951
Jan 22 00:16:40 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53031647
Jan 22 00:30:15 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54243391
Jan 22 07:05:59 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=51768047
Jan 22 09:08:16 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=55890239
Jan 22 09:17:52 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=55919423
Jan 22 09:23:42 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53470111
Jan 23 00:26:03 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53588527
Jan 23 00:26:26 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764596887
Jan 23 00:26:26 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764596887
Jan 23 00:26:26 crater kernel: ad0: FAILURE - WRITE_DMA48 status=51 error=10 LBA=764596887
Jan 23 03:01:06 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=185819705
Jan 23 03:01:37 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54837686
Jan 23 03:03:22 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53472407
Jan 23 03:03:39 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53627991
Jan 23 11:33:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5747
Jan 23 12:30:31 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=55407234
Jan 23 13:20:06 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=57779519
Jan 23 17:30:18 crater kernel: ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=453849407
Jan 23 17:30:19 crater kernel: ad0: FAILURE - READ_DMA48 status=51 error=10 LBA=453849407
Jan 23 17:30:29 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=187373078
Jan 23 18:34:50 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=1017919
Jan 23 18:35:00 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=54547647
Jan 23 18:35:12 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=56354060
Jan 23 18:35:20 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=53919167
Jan 23 23:59:18 crater kernel: ad0: TIMEOUT - FLUSHCACHE retrying (1 retry left)
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=237661119
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=237661119
Jan 24 00:00:27 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=237661119
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236239553
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=236239553
Jan 24 00:00:27 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=236239553
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764595671
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764595671
Jan 24 00:00:27 crater kernel: ad0: FAILURE - WRITE_DMA48 timed out LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: FAILURE - WRITE_DMA48 timed out LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236180175
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=236180175
Jan 24 00:01:13 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=236180175
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - FLUSHCACHE retrying (1 retry left)
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - FLUSHCACHE retrying (0 retries left)
Jan 24 02:31:53 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236191551
Jan 24 04:54:57 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=238068287
Jan 24 04:55:56 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=238068287
Jan 24 04:55:56 crater kernel: ad0
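A log like this is easier to reason about once it is tallied by command and outcome, since a retried TIMEOUT and a hard FAILURE mean different things. A minimal Python sketch over a few lines copied from the grep above; the regular expression is an assumption fitted to the message format seen in this log, not an official description of the ata(4) driver's output.

```python
import re
from collections import Counter

# A handful of lines copied verbatim from the /var/log/messages grep above.
sample = """\
Jan 23 00:26:26 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764596887
Jan 23 00:26:26 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764596887
Jan 23 00:26:26 crater kernel: ad0: FAILURE - WRITE_DMA48 status=51 error=10 LBA=764596887
Jan 23 03:01:06 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=185819705
Jan 23 23:59:18 crater kernel: ad0: TIMEOUT - FLUSHCACHE retrying (1 retry left)
Jan 24 00:00:27 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=237661119
"""

# Assumed line shape: "<dev>: TIMEOUT|FAILURE - <COMMAND> ... [LBA=<n>]".
LINE = re.compile(
    r"kernel: (?P<dev>ad\d+): (?P<kind>TIMEOUT|FAILURE) - (?P<cmd>\w+)"
    r"(?:.*?LBA=(?P<lba>\d+))?"
)

tally = Counter()
failed_lbas = set()

for line in sample.splitlines():
    m = LINE.search(line)
    if not m:
        continue  # not an ATA timeout/failure message
    tally[(m.group("kind"), m.group("cmd"))] += 1
    # FLUSHCACHE lines carry no LBA, so the group may be None.
    if m.group("kind") == "FAILURE" and m.group("lba"):
        failed_lbas.add(int(m.group("lba")))

for (kind, cmd), count in sorted(tally.items()):
    print("%-8s %-12s %d" % (kind, cmd, count))
print("LBAs with hard failures:", sorted(failed_lbas))
```

Run against the full file instead of the sample (for example by reading /var/log/messages line by line), the same loop would show whether the hard failures cluster on a few LBAs or are spread across the disk.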
Re: New KTR trace for mouse freezing/stuttering in 7.0-RC1
Sam Leffler wrote:
>> http://www.skyrush.com/downloads/ktr_ule_4.out
>>
> I don't see what it is
> from the trace data. It sort of looks like the last thing that ran is
> the swi4 which is likely a callout (need to check the log file contents
> to be certain). If the callback function does something it wouldn't
> necessarily be visible in the schedgraph plot. If you could stick a
> dmesg from booting out in the same spot it might be worthwhile.

OK, I just ran a dmesg and put it up there:

http://www.skyrush.com/downloads/dmesg_4.out

The WRITE_DMA messages are not time-correlated with this issue; I don't like the looks of those either, but that's a different issue to look into...

> Also if
> you rebuild the kernel with DIAGNOSTIC then softclock() will
> complain about callouts that take longer than 2ms to run.

OK, recompiling now... Will the new messages appear in dmesg, or in a log file?

> This might
> generate too much noise in which case you can adjust the threshold by
> editing the code in sys/kern/kern_timeout.c.

Cool - thanks for looking at this, and I will let you know what I find! Do I need to make another trace concurrently, or should I just repeat the test procedure and see if I get new messages?

-Thanks, Joe
New KTR trace for mouse freezing/stuttering in 7.0-RC1
In an attempt to track down this mouse freezing/stuttering (i.e. "jerky mouse movement") behavior in FreeBSD 7.0-RC1, I have come up with a reliable way to cause it to happen, and I have created a longer trace showing the results. Note that I am using the ULE scheduler.

In general, it becomes easier to see the effect if there is CPU activity. I have noticed it during kernel compiles, while at the same time loading web pages in firefox that contain images (and moving the mouse while this is happening). But a more controlled way to see it is to run something that uses some CPU and then generate lots of X events.

In my case, I start "xtrs" (TRS-80 emulator) in Model IV mode, which happens to poll for input, using the CPU. Then I move the mouse back and forth quickly between windows in "focus under mouse" mode (in my case, a KDE focus mode), which causes many focus events in quick succession. In about 15 or 20 seconds, the mouse reliably starts to show erratic movement, not moving smoothly.

I really hope this can shed more light on what might be going on. Here is the trace:

http://www.skyrush.com/downloads/ktr_ule_4.out

Thanks, Joe
Re: 7.0-PRERELEASE desktop system periodically freezes momentarily
J.R. Oldroyd wrote:
> On Wed, 23 Jan 2008 08:27:58 -0700, Joe Peterson <[EMAIL PROTECTED]> wrote:
>> Also, it seems that intermittent mouse freezes happen more often when
>> I've been away from the machine for a while and return to start using
>> the mouse again, but that's not always the case. A few short
>> freezes/stutters happen a second or so after mouse movement resumes.
>>
>> -Joe
>
> Joe,
>
> I don't see any postings from you showing any ktr dumps. Do you have
> any? Your symptoms (that it seems to happen after you've been away
> for a while and then return and move the mouse) sound a lot like mine.

Hi J.R., here is the post that contains links to my dumps:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-January/039599.html

> I posted some ktr dumps and have since chatted off-list with Kris and
> Sam about what may be up. My dumps show the shared irq ath/pcm and
> the ath taskq are hogging the cpu for ages without the clock swi getting
> to run at all. Sam has suggested experimenting with the ath taskq
> priority and also with disabling ath bg scans which I will do, but
> right now I am back to looking at powerd again as the possible cause.

Hmm, well I don't have an Atheros on this machine - only ethernet. Also, I have not tried playing audio, so what I am seeing is simply with "normal" use.

> I ran without powerd for a while when originally suggested by David
> Lawrence on Jan 12th. I believe I did still see freezes then, but I
> re-enabled powerd when I was ready to do LOCK_PROFILING and then ktr
> monitoring; I re-enabled it so I could be sure I had the same test
> conditions. At this point, I am no longer sure what happened when
> powerd was disabled. My recollection is that there were freezes while
> powerd was off, but the only email in which I appear to have posted
> about that says "no freezes so far". So I'm running without powerd again,
> and at this point, several hours at the computer over two days, I have
> not seen further freezes. Does anyone else who sees these freezes also
> have powerd enabled and can try without powerd for a while?

Mine is a desktop machine, so I have not enabled powerd.

> Since these freezes are proving so hard to pinpoint, it may be worth
> comparing notes to try to find things in common between the systems
> or eliminate other things. But first, it seems like we may be chasing
> three separate causes:
>
> 1. the softupdate freeze
>    after removing a very large file (e.g., >1Gb) there is a
>    noticeable freeze while the softupdate runs
>
> 2. the busy freeze
>    folk complain of short freezes and mouse jerkiness while
>    the system is busy, e.g., glxgears or compilations
>
> 3. the idle freeze
>    short and longer freezes (some going into minutes) apparently
>    when resuming work after having left the system mostly idle
>    for a while
>
> Now, I also had the "busy freeze" when I first tested 7.0. At that time
> (several weeks back now) someone suggested switching to the ULE scheduler,
> which I did, and the symptoms I had were dramatically improved. Since
> then I've had occasions to run several compilations at once and had no
> mouse jerkiness. But for folk who still have it: what scheduler do you
> have and what processes are running when it happens?

I seem to see #2 (busy freezes). They are usually very short (sub-second) freezes, and they happen randomly as I move the mouse (well, I assume that I see it manifested in a mouse freeze, but it could very well be a system or X freeze, since I see it in keyboard key-held-down too). The mouse usually moves smoothly, but every once in a while, it "sticks" for a fraction of a second as I move it - irritating to say the least. Often the small freezes come in spurts, but they often are one at a time as well. When it comes in spurts, it is often shortly after moving the mouse after lots of idle time (as if the scheduler "wakes up" and has some fits for a short time - a "non-scientific" description ;).

I am using ULE on 7.0. I'm also using ZFS (so the soft-updates issue doesn't apply, and I spoke with someone else who uses UFS2, not ZFS, and he said the mouse jerked around pretty badly in 7.0 on his machine). I started with using 4BSD under 7.0, of course, and yes, there were worse batches of freezes with it, especially when starting KDE and when compiling the kernel (it was nearly constant). With ULE, I no longer see compiles causing freezes, and generally the freezes are more subtle and shorter - in other words, ULE *is* better than 4BSD in this respect, but it is still worse than normal operation und
Re: 7.0-PRERELEASE desktop system periodically freezes momentarily
Wayne Sierke wrote:
> So it seems the only thing of interest that I've managed to capture so
> far pertains to glxgears - an instance of the "stutter" and a part of a
> short freeze when dragging its window. Unfortunately these frequent
> mouse disconnects make it difficult to recognise genuine freezes during
> 'normal' use, if indeed they are still occurring with RELENG_7. However
> the glxgears behaviour remains (apparently) the same as it was on
> RELENG_6. Whether that's a telling sign or not remains to be seen.

Wayne, thanks for continuing to investigate, since these little "freezes" definitely affect usability. If I can help in any way, let me know. I have not made any further graphs, but I continue to see intermittent mouse freezing (for short sub-second periods, usually).

As for mouse disconnects, I don't know if that is what I am seeing, but one thing I do notice is that the keyboard is also affected (easily seen by holding down a key and letting it repeat - short pauses can be seen in the echo, which could be xterm, X, or the keyboard input, of course). Also, I tried unplugging my PS/2 mouse and using a USB one instead - same issue exists.

In case this is scheduler-related, I tried running a CPU-hogging task (xtrs in "model 4" mode, which spins, polling for input). While running this and moving the mouse rapidly between two windows (I use focus-under-mouse, so this causes focus events), I eventually get repeated short mouse freezes for quite some time (maybe 10 seconds) until things can catch up. This is not reproducible on Linux CFS (2.6.23) - the CPU use certainly affects event "catching up" in X, but the mouse stays smooth.

Also, it seems that intermittent mouse freezes happen more often when I've been away from the machine for a while and return to start using the mouse again, but that's not always the case. A few short freezes/stutters happen a second or so after mouse movement resumes.

-Joe
Re: To 6.3 or to 7.0 that is the question?
One word: ZFS! It's awesome.

-Joe

Steven Hartland wrote:
> With the announcement of 6.3 and with 7.0 looking like it won't be
> far behind, I'd be interested to hear what people thought the relative
> benefits of each were.
>
> I know 7 has had a lot of work done on locking and ULE but are there
> any other reasons to go for that instead of 6.3? Conversely are there
> any reasons which would point away from 7 such as stability issues?
>
> Regards
> Steve
Re: 7.0-PRERELEASE desktop system periodically freezes momentarily
Kris Kennaway wrote:
> KTR_SCHED

Kris, BTW, I am curious if the traces I posted were informative. Let me know if I did not create them correctly. The xterm test seems to vary in usefulness depending on video card (faster cards catch up too quickly), but the freezing still happens quite often using apps like firefox, especially. Here's the post link:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-January/039599.html

Thanks, Joe
Re: RELENG_7 jerky mouse and skipping sound (still a problem -BETA3)
On 1 Jan, 14:17, Kris Kennaway <[EMAIL PROTECTED]> wrote:
> > OK, can you obtain a schedgraph trace when the problem is manifesting?
> > See /usr/src/tools/sched/ and previous discussion in this or related
> > threads.

I just recently installed 7.0-RC1, and I am seeing pretty severe "mouse jerkiness" or "mouse freezing" while, e.g., compiling (as others have reported here). It's not just the mouse, but keyboard events are also delayed in the same manner (seen by holding down a key in xterm, e.g.). I am on a UP 2.4GHz P4, using PS/2 mouse (with moused) and keyboard.

I'm glad I found this thread, since you are asking for traces. I really hope my traces help; this problem does seem like a regression from 6.2 (I had seen slight mouse non-interactivity there too at times, but not nearly as bad). Also, with Linux's new CFS making mouse movement *very* responsive, I think it's vital that FreeBSD address this to avoid such comparisons.

I have tried both SCHED_4BSD and SCHED_ULE. 4BSD is a lot worse when compiling, say, the kernel. ULE is better when compiling, but still has issues with, e.g., firefox loading a page, catching up on multiple xterm window resizing (see below), etc.

This trace is while using SCHED_4BSD and compiling the kernel / moving mouse:

http://www.skyrush.com/downloads/ktr_4bsd.out

And here are three traces using SCHED_ULE:

http://www.skyrush.com/downloads/ktr_ule.out
http://www.skyrush.com/downloads/ktr_ule_2.out
http://www.skyrush.com/downloads/ktr_ule_3.out

Please check out all three, in case I did not get a good sampling of mouse events and compiles in any one...

Strangely, ULE exhibits mouse jerkiness more than 4BSD for the following: I opened an xterm and dragged the right edge of the window back and forth quickly, making the window wider/narrower. It is obvious in FreeBSD that this queues up events for X (after some time, the window border no longer follows the mouse at all), and if I release the mouse button at that time, leaving the window narrow and immediately move the mouse in circles, it is jerky for a while, then returns to smooth action after about 5 or 10 seconds. 4BSD is not as severe in this one case, and I never see this at all in Linux with CFS (i.e. kernel 2.6.23) - the window resizing never really gets behind like this.

Here is a trace showing this for ULE (xterm still catching up, if I remember correctly, at end):

http://www.skyrush.com/downloads/ktr_ule_resize.out

Here is one for 4BSD (xterm caught up before trace stopped):

http://www.skyrush.com/downloads/ktr_4bsd_resize.out

As an aside, renicing Xorg and moused to -10 seems to help smooth the mouse when using 4BSD when compiling, whereas it is not needed (and seems to have little or no effect) when using ULE (even though, as I said, ULE still shows jerkiness).

-Thanks, Joe