Re: Analysis of disk file block with ZFS checksum error

2008-03-04 Thread Joe Peterson
Eric Anderson wrote:
> I'm starting to think there is a timing issue or some such problem with 
> ZFS, since I can use the same drives in a gmirror with UFS, and never 
> have any data problems (md5 checksums confirm it over-and-over).  I 
> highly doubt that everyone is seeing similar issues and it just is 
> because ZFS is so intense.  I've had plenty of systems under severe disk 
> load that have never exhibited corrupt files because of something like 
> this.

I also wondered this - i.e. if ZFS was triggering a certain timing
behavior that revealed the problem.  Still, if this is the case, it
seems to me that the problem lies in the ATA subsystem, since it should
prevent higher-level things like ZFS from being able to create bad
timings (or am I not thinking of this correctly?).

Also, I think there were some reports of problems with DMA/ATA when
*not* using ZFS.

> I wish we could get our hands on this issue..  Seems like some common 
> threads are ATA/SATA disks.  Is your setup running 32bit or 64bit 
> FreeBSD?  (if you already mentioned it, I'm sorry, I missed it)

This was on 32bit FreeBSD with PATA.  I am the one who had no SMART
issues and no DMA errors reported under Linux.  Changing the cable may
have "fixed" it, since I did not see errors in some further testing, but
even if so, my theory is that there is some edge case (timing?) that the
FreeBSD ATA drivers were sensitive to, and perhaps my change of cables
pushed the problem to the other side of the threshold.  Since I never
saw errors under Linux (and I've been using that cable for a couple of
years), I do not necessarily think the cable was actually "defective".

-Joe
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: is there any raid5 in software in FreeBSD ?

2008-02-19 Thread Joe Peterson
ZFS has RAIDZ - very similar to RAID5 (with added features), if you
don't mind ZFS's current experimental state.

-Joe


Nenhum_de_Nos wrote:
> i've seen RAID 0 through 3 (skip 2 ;) )
> 
> thanks,
> 
> matheus
> 


Multiple key presses are hindered when repeat turned off

2008-02-19 Thread Joe Peterson
I have verified this on two machines, but it would be helpful if others
out there can reproduce it too.  Also, I do not know if it is Xorg or
the FreeBSD keyboard drivers, since I see no way to reproduce on the
console (i.e. turn off repeat).

In an xterm, type: "xset r off".  Then try some multiple-key
combinations (i.e. keep holding first key(s) when you type the next one):

po (o does not appear)
lk (k does not appear)
grep (e does not appear)

When you release the keys, the press events will show up.

Keyboards in general have limited multiple-key (rollover) capabilities,
but using "xset r off" reduces these to the point that you will often
mistype things, and it seems unique to FreeBSD.  I am using 7.0-RC2 at
the moment.

Thanks, Joe


Revisiting jerky/freezing mouse issue in 7.0

2008-02-18 Thread Joe Peterson
I spent some time looking again at a trace I posted last month showing
mouse "jerkiness/freezing" under load (note that I see it all of the
time under light load too, but it is harder to reproduce on demand).
Here's the trace:

http://www.skyrush.com/downloads/ktr_ule_4.out

The large stretches of yellow in the Xorg process are what trouble me.
Clearly, Xorg is yielding processor time mostly to, in this case, xtrs,
which is getting a whole lot of time.  If you look at the fairly regular
mouse events, you'll notice that moused runs for a short time on each
mouse event from psm0 and then sleeps.  This makes sense, and it appears
moused is acting correctly.  But many of these mouse events are
seemingly ignored by Xorg, which spends most of its time yielding
(yellow) and not getting "woken up" by the events to simply process
them.  I've noticed, also, that Xorg can "get behind" easily and spend
its time catching up on event processing for a while after I stop using
the mouse.  It just doesn't seem to be getting an appropriate amount of
CPU time, or at least it yields too long between runs, to make
interactivity smooth.  These yields, I believe, are the freezes I see.
Here's a question: does Xorg "respond" to mouse events, or does it just
wake up every now and then and check?

Note that even when Xorg runs, it only runs for a very short time.  If
the ULE scheduler is being fair, I would think this might give Xorg
*more* of a share of the CPU to use to service these events, since it is
running a lot less than xtrs.

One interesting point is at timestamp 1478223777518.  It looks like Xorg
*starts* to yield when moused runs.  Here's the line:

1478223777518 sched_add: 0xa7be1660(Xorg) prio 160 by 0xa5eb7aa0(moused)

Does this mean that moused *caused* Xorg to yield (or am I reading this
incorrectly)?  The yield then lasts through a series of mouse moves.  A
quick look through the graph shows that this happens quite a bit, which
seems like the reverse of what we'd like.

This issue (especially since it does not even require continuous heavy
CPU use to see) is a constant distraction while using the system, and
again I want to volunteer my time to help track it down.  I am not sure
how to further delve into it, so if there is some additional data I can
gather, please let me know, and I'll gladly do it.

Thanks, Joe


Re: mount of ext2fs volume stuck in "D+" state (disk uninterruptible wait)

2008-02-11 Thread Joe Peterson
New information: it looks as though this ext2fs was already mounted when
the mount was attempted.  I have reproduced the issue by simply trying
to mount the ext2fs volume more than once.  Given this, I'd expect the
mount to return an already mounted error rather than hanging, so this is
perhaps a straightforward bug.

-Joe



Re: mount of ext2fs volume stuck in "D+" state (disk uninterruptible wait)

2008-02-11 Thread Joe Peterson
Kris Kennaway wrote:
> Joe Peterson wrote:
>> I just tried (under FreeBSD 7.0-RC1) to mount an ext2fs volume - I've
>> mounted it before with no trouble on this same FreeBSD version.  This
>> time, mount appeared to hang.  I noticed that I can see the contents of
>> the volume under the mount point, so the mount seemed to "work", but the
>> process is stuck.  "ps" shows:
>>
>> root   1307  0.0  0.0  3156   792  p6  D+    5:21PM   0:00.00 mount /mnt/linux-home
>>
>> The "ps" man page says that "D" means: "Marks a process in disk (or
>> other short term, uninterruptible) wait."
>>
>> Is there any way I can investigate what is going on?  I cannot umount
>> (device busy) or break out of the mount command...
> 
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html

But unfortunately I do not have KDB and DDB compiled into the kernel.
And, obviously, if I reboot, I will lose this opportunity.  I suspect
this to be an intermittent thing.  Is there anything I can extract while
the system is running that would be useful?

Thanks, Joe


mount of ext2fs volume stuck in "D+" state (disk uninterruptible wait)

2008-02-11 Thread Joe Peterson
I just tried (under FreeBSD 7.0-RC1) to mount an ext2fs volume - I've
mounted it before with no trouble on this same FreeBSD version.  This
time, mount appeared to hang.  I noticed that I can see the contents of
the volume under the mount point, so the mount seemed to "work", but the
process is stuck.  "ps" shows:

root   1307  0.0  0.0  3156   792  p6  D+    5:21PM   0:00.00 mount /mnt/linux-home

The "ps" man page says that "D" means: "Marks a process in disk (or
other short term, uninterruptible) wait."

Is there any way I can investigate what is going on?  I cannot umount
(device busy) or break out of the mount command...

Thanks, Joe




Re: Analysis of disk file block with ZFS checksum error

2008-02-11 Thread Joe Peterson
Gavin Atkinson wrote:
> Are the datestamps (Thu Jan 24 23:20:58 2008) found within the corrupt
> block before or after the datestamp of the file it was found within?
> i.e. was the corrupt block on the disk before or after the mp3 was
> written there?

Hi Gavin, those dates are later than the original copy (I do not have
the file timestamps to prove this, but according to my email record, I
am pretty sure of this).  So the corrupt block is later than the
original write.

If this is the case, I assume that the block got written, by mistake,
into the middle of the mp3 file.  Someone else suggested that it could
be caused by a bad transfer block number or bad drive command (corrupted
on the way to the drive, since these are not checksummed in the
hardware).  If the block went to the wrong place, AND if it was a HW
glitch, I suppose the best ZFS could then do is retry the write (if its
failure was even detected - still not sure if ZFS does a re-check of the
disk data checksum after the disk write), not knowing until the later
scrub that the block had corrupted a file.

I think that anything is possible, but I know I was getting periodic DMA
timeouts, etc. around that time.  I hesitate, although it is tempting,
to use this evidence to focus blame purely on bad HW, given that others
seem to be seeing DMA problems too, and there is reasonable doubt
whether their problems are HW related or not.  In my case, I have been
free of DMA errors (cross your fingers) after re-installing FreeBSD
completely (giving it a larger boot partition and redoing the ZFS slice
too), and before this, I changed the IDE cable just to eliminate one
more variable.  Therefore, there are too many variables to reach a firm
conclusion, since even if the cable was "bad", I never saw one DMA error
or other indication of anything wrong with HW from the Linux side (and
I've been using that HW with both Linux and FreeBSD 6.2 for months now -
no apparent flakiness of any kind on either system).  So either it *was*
bad and FreeBSD 7.0 was being more "honest", FreeBSD's drivers and/or
ZFS were stressing the HW and revealing weaknesses in the cable, or it
was a SW issue that got cleared somehow when I re-installed.

Is it possible that the problem lies in the ATA drivers in FreeBSD or
even in ZFS and just looks like HW issues?  I do not have enough
info/expertise to know.  If not, then it may very well be true that HW
problems are pretty widespread (and that disk HW cannot, in fact, be
trusted), and there really *is* a strong need for ZFS *now* to protect
our data.  If there is a possibility that SW could be involved, any
hints on how to further debug this would be of great help to those still
experiencing recent DMA errors.  I just want to be more sure one way or
the other, but I know this issue is not an easy one (however, it's the
kind of problem that should receive the highest priority, IMHO).

-Joe


Re: Analysis of disk file block with ZFS checksum error

2008-02-08 Thread Joe Peterson
Julian Elischer wrote:
> it could be an old file..
> what kind of disks?

It's a Seagate ST3500630A parallel ATA drive.

> I had a scenario where 3ware controllers were just failing to write to
> a drive in the array, so old data showed through.

I have an Intel ICH4 controller - nothing unusual.

> the filesystem and the partitions and the raids all were on different
> alignments so the only part of the system that had a boundary that
> aligned with the bad data was the physical stripes laid down by the
> controller.  It was 64k stripes and 64k data missing, exactly on
> stripe boundaries. Due to the fact that FreeBSD had partitioned the
> drive starting at 63 blocks in, nothing else aligned with the problem.

Hmm, well this is a straight-forward disk situation - never used RAID on
this drive.  Given what is happening, I wonder about the chances of it
being a HW, OS, or filesystem issue.

-Joe


Re: Analysis of disk file block with ZFS checksum error

2008-02-08 Thread Joe Peterson
Chris Dillon wrote:
> That is a chunk of a Mozilla Mork-format database.  Perhaps the  
> Firefox URL history or address book from Thunderbird.

Interesting (thanks to all who recognized Mork).  I do use Firefox and
Thunderbird, so it's feasible, but how the heck would a piece of one of
those files find its way into 1/2 of a ZFS block in one of my mp3 files?
I wonder if it could have been done on write when the file was copied
to the ZFS pool (maybe some write-caching issue?), but I thought ZFS
would have verified the block after write.  It seems unlikely that it
would get changed later - I never rewrote that file after the original
copy...

-Joe


Re: Analysis of disk file block with ZFS checksum error

2008-02-08 Thread Joe Peterson
Mark Day wrote:
> Based on the subset of data you posted, the bad data looks like ASCII
> text.
> The bad data from offset a0000 to a000f is:
>
> ${138AFE{@
> @$$}1
>
> The bad data from offset af6c1 to af6c8 is:
>
> 392A9}@
>
> I don't recognize the content beyond that, but I'd guess that somehow
> the
> contents of some other file managed to overwrite that portion of the bad
> file.  As for how that happened, I don't know.  But if someone
> recognizes
> where the bad content came from, that might be a clue.


Gary/Mark,

Good eye!  Yes, it indeed does appear to be ASCII.  I *thought*
something in the repetition when I originally did an od -a looked
interesting.

I dumped the whole bad section as a string, and here's (partly) what I get:

${138AFE{@
@$$}138AFE}@

@$${138AFF{@
[A3:^80(^91^2146F)]
@$$}138AFF}@

@$${138B00{@
@$$}138B00}@

@$${138B01{@
[181:^80(^91^2146F)]
@$$}138B01}@

@$${138B02{@
@$$}138B02}@

@$${138B03{@
[2C:^80(^91^2146F)]
@$$}138B03}@

@$${138B04{@
@$$}138B04}@

.
.
.

@$${138B8B{@
<(21470=Thu Jan 24 23:20:58 2008)>
[117:^80(^91^21470)]
@$$}138B8B}@

.
.
.

@$${138C18{@
<(21472=1201242069)>[-2:^80(^82^85)(^83^1B5)(^84=b)(^85=1)(^86=0)(^87=0)
(^88=0)(^89^2146C)(^8A=)(^8B=40)(^8C=2e)(^8D^84)(^8E=0)(^90^21472)
(^91^21460)]
@$$}138C18}@

@$${138C19{@
<(21473=a72f78)>[2:^80(^89^21473)]
@$$}138C19}@

@$${138C1A{@
@$$}138C1A}@

.
.
.


and more of the same.  Note the date string.  There are several like
that.  Anyone recognize this text format?
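For anyone who wants to repeat this kind of dump, here is a minimal Python sketch, assuming the bad section is the 65536 bytes starting at hex a0000 (5 x 128K).  It is not the method used above; the helper names and the file name in the usage comment are made up for illustration.  It pulls runs of printable ASCII out of a byte range, much as strings(1) would:

```python
def printable_runs(data, min_len=4):
    """Return runs of printable ASCII (0x20-0x7e) that are at least min_len long."""
    runs = []
    current = bytearray()
    for byte in data:
        if 0x20 <= byte <= 0x7E:
            current.append(byte)
        else:
            # A non-printable byte (newline included) ends the current run.
            if len(current) >= min_len:
                runs.append(current.decode("ascii"))
            current = bytearray()
    if len(current) >= min_len:
        runs.append(current.decode("ascii"))
    return runs


def read_region(path, offset, length):
    """Read length bytes starting at offset from the file at path."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


# Hypothetical usage (the file name is an assumption, not from the thread):
#   region = read_region("07-Snowfall.bad.mp3", 0xA0000, 65536)
#   for run in printable_runs(region):
#       print(run)
```

Treating newlines as separators means each record line comes out as its own string.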

-Joe


Analysis of disk file block with ZFS checksum error

2008-02-08 Thread Joe Peterson
In my experimentation with the ZFS filesystem, I encountered one case of
a file block with a checksum mismatch.  Doing a "zpool scrub" revealed
it, and trying to read the file yielded an error - only the part of the
file before the bad block was read (ZFS aborts reading at this point,
which makes sense), resulting in a short file.  The reason the CKSUM
error is not fixable is because my ZFS pool contains only one device (no
mirror or RAIDZ), but I do have the original/good version of the file
affected.  Here's the output of zpool status (new scrub in process):

  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress, 64.36% done, 0h18m to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     2
          hda6      ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

/mnt/tank/fbsd/home/joe/music/jukebox/christmas/Esquivel/
Merry_XMas_from_the_SpaceAge_Bachelor_Pad/07-Snowfall.mp3


I was curious about what actually happened: was this a ZFS bug, trouble
with its metadata, or truly a bad block?  In order to determine this, I
modified ZFS's source code temporarily to ignore the checksum mismatch
and let the file read fully.  What I then got was the full-length file
and no errors, showing that there were no disk read errors associated
with the read (I already had assumed this from the fact that zpool
status showed only a non-zero CKSUM count).  However, I may have seen
other error counts previously (ZFS resets them to zero on, e.g.,
reboot).  I received no errors when originally copying this file *to*
the ZFS pool - only on subsequent reads/scrubs.

(Note that I have posted before about DMA errors in my log for the disk
I am using, but I have had nothing but successful SeaTools tests
(surface scans) of the drive.  Jeremy Chadwick had similar issues, as
did others, so I think it is worth investigating if there is some
OS/software cause rather than real HW issues.  This is one reason I
wanted to investigate my ZFS checksum issue more deeply.)

I also have a good backup of the file in question, so I now have two
copies of the file: one good, and one with a bad block.  The file is
3575936 bytes long, and recordsize (in ZFS) is 128K, making the file
about 27 blocks long.  Curiously, the bad section of the file is exactly
65536 bytes long (1/2 a block).  The bad block starts at exactly the 5th
128K block (byte 655360 or hex a0000).
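The block arithmetic can be checked with a few lines of Python.  This is only a sketch: the file size, the recordsize and the 65536-byte bad length come from the description above, while the start offset is taken as 5 x 128K = 655360 (hex a0000), and the function names are made up for illustration.

```python
FILE_SIZE = 3575936        # bytes, as stated above
RECORDSIZE = 128 * 1024    # ZFS recordsize of 128K
BAD_START = 0xA0000        # assumed start of the bad data (5 x 128K)
BAD_LENGTH = 65536         # length of the bad section


def block_layout(file_size, recordsize):
    """Return (number of full blocks, bytes left over in the last block)."""
    return divmod(file_size, recordsize)


def locate(offset, recordsize):
    """Return (zero-based block index, offset within that block)."""
    return divmod(offset, recordsize)


full_blocks, leftover = block_layout(FILE_SIZE, RECORDSIZE)
bad_block, offset_in_block = locate(BAD_START, RECORDSIZE)

print("full blocks:", full_blocks, "leftover bytes:", leftover)
print("bad data starts in block", bad_block, "at block offset", offset_in_block)
print("bad length is half a block:", BAD_LENGTH * 2 == RECORDSIZE)
```

With these numbers the bad data begins exactly on a block boundary and covers the first half of that block.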

I wanted to see the characteristics of the bad data.  Was just one bit
flipped randomly?  No.  Is it just one bit or set of bits in the bytes
that are affected?  It doesn't seem so.  Were there any other strange
patterns here?  Well, yes, and maybe someone out there with more
knowledge/experience in disk modes of failure will recognize something
(I have included some data below).

For one thing (as I mentioned), only 65536 bytes are bad (and it's
exactly this many, with a few "good" bytes thrown in, but not far from
the number that random chance would produce).  Also, all bad bytes have a
zero in the high bit - interesting?  Also, near the end of the block,
the bad bytes all go to zero, strangely coincident with the first "good"
zero in that bad block - not sure if that's coincidence or not.  Also, I
calculated the number of "Bits same" (matching bits) in the good vs. bad
bytes, and it appears fairly random, so it appears that the bad bytes
are very random in nature and not correlated much at all with the good
bytes.
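The per-byte comparison described here can be sketched in Python as follows.  This is not the original program, just a minimal illustration with made-up function names: it counts matching bits per byte pair, checks whether every mismatching byte has a zero high bit, and averages the "bits same" figure over the mismatching bytes.  For two unrelated random bytes that average would be about 4 out of 8.

```python
def bits_same(good, bad):
    """Count the bit positions (0 to 8) on which two byte values agree."""
    return 8 - bin((good ^ bad) & 0xFF).count("1")


def compare(good_data, bad_data, start=0):
    """Yield (file pos, good, bad, match, bits same) for each byte pair."""
    for index, (good, bad) in enumerate(zip(good_data, bad_data)):
        yield (start + index, good, bad, good == bad, bits_same(good, bad))


def format_row(row):
    """Format one comparison row as hex, match flag, binary and bit count."""
    pos, good, bad, match, same = row
    return "%08x  %02x  %02x  %-3s  %s  %s  %d" % (
        pos, good, bad, "Yes" if match else "No",
        format(good, "08b"), format(bad, "08b"), same)


def summarize(rows):
    """Return (mismatch count, all mismatches have high bit 0, mean bits same)."""
    mismatches = [row for row in rows if not row[3]]
    if not mismatches:
        return (0, True, None)
    high_bit_clear = all(row[2] < 0x80 for row in mismatches)
    mean_same = sum(row[4] for row in mismatches) / len(mismatches)
    return (len(mismatches), high_bit_clear, mean_same)


# Tiny in-memory demonstration; real use would read both copies of the file.
good_bytes = bytes([0xD9, 0x05])
bad_bytes = bytes([0xD9, 0x24])
rows = list(compare(good_bytes, bad_bytes, start=0x9FFF0))
for row in rows:
    print(format_row(row))
print(summarize(rows))
```

Reading the good and bad copies with open(path, "rb").read() and passing both to compare() would give the full listing.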

So except for the fact that the 2nd half (65536 bytes) of the ZFS block
is good, the bad block seems to consist of random data, except for the
string of zero bytes near the end and the zero high-bit.  It's not as if
one bit on the disk flipped - it affects the whole (1/2) block.  Does
this seem like a disk error, controller error/bug, or cable problem (I
recently put a new cable on, so I doubt this)?  It seems to me something
more systemic rather than a random bit error - opinions are more than
welcome.

Here is some info from a python program I wrote to look at the data
(I've left out spans of essentially uninteresting portions showing
similar stuff, but I can get you the whole thing if interested):

File pos  Good  Bad  Match  Good (bin)  Bad (bin)  Bits same
0009fff0  d9    d9   Yes    11011001    11011001   8
0009fff1  05    05   Yes    00000101    00000101   8
0009fff2  c1    c1   Yes    11000001    11000001   8
0009fff3  81    81   Yes    10000001    10000001   8
0009fff4  5f    5f   Yes    01011111    01011111   8
0009fff5  66    66   Yes    01100110    01100110   8
0009fff65e  5e  Yes 0100010

Re: Frequent USB mouse disconnections under load with RELENG_7

2008-02-01 Thread Joe Peterson
Wayne Sierke wrote:
> On Fri, 2008-01-25 at 01:59 +1030, Wayne Sierke wrote: 
>> I'm getting a lot of USB mouse disconnects on RELENG_7. I wondered
>> whether they might have been due to running with a KTR-enabled kernel
>> but in just the last 7 hours I've been running on stock GENERIC and
>> they're still happening.

Hey Wayne,

I'm not sure if you are associating the disconnects with the "jerky mouse"
behavior, but as an added datapoint, I have a PS/2 mouse, I see *no*
disconnects in the system logs (well, it's PS/2...), and I still get the
jerky mouse...

-Joe


Unexpected "resilver" after reboot (after scrub found CKSUM problems)

2008-01-30 Thread Joe Peterson
[...reposting to freebsd-stable - no response on freebsd-fs]

I had a strange thing happen on ZFS the other day, and I cannot find any
info about it on the web - thought you might have some ideas.  I am
using 7.0-RC1 at the moment.

I found a checksum error in ZFS during a scrub.  This is strange in
itself, since I believe the disk is OK (see below):



  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          ad0s1d    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:


/home/joe/music/jukebox/christmas/Esquivel/Merry_XMas_from_the_SpaceAge_Bachelor_Pad/07-Snowfall.mp3



This is how it appears after a recent reboot, however.  After a scrub, I
see a varying number of non-zero counts under CKSUM.  Not sure why it is
zero after reboot (maybe that's normal).

However, the strange thing is that after my first reboot after the scrub
found the issue, zpool status told me that "resilver completed with 0
errors", and there were no known errors.  Only trying to read the file
and/or rescrubbing returned the status to the error state and made the
CKSUM column non-zero.  Since I do not have a mirror or raid config, I'm
not sure why it would resilver at all, and I did nothing explicit to
cause a resilver (as far as I know)...

Any ideas?

As an aside, I, along with some others on freebsd-stable@freebsd.org,
have been seeing what "look" like disk errors in the system logs.  I
have a suspicion that there could be some other cause (lots of
discussion on that list, if you are interested).

Strangely, this disk checks out fine on both short and long tests in
Seatools, and smartctl shows it as OK.  Also, using Linux to do lots of
reads from it does not show any issue or error logs.  At this point, I
am not sure if the CKSUM issue is a real HW flaw or something else...

Thanks, Joe




Re: ad8: TIMEOUT - WRITE_DMA errors UFS 7.0-RC1

2008-01-27 Thread Joe Peterson
Remco van Bekkum wrote:
> Well it looks like in my case it is hardware related after all. It failed to 
> read the boot
> block several times now. 2nd sort of DOA of this disk...

Have you tried reading the block in another OS or using SeaTools?  That would
at least verify that it's hardware.

-Joe



Re: ad8: TIMEOUT - WRITE_DMA errors UFS 7.0-RC1

2008-01-26 Thread Joe Peterson
Jeremy Chadwick wrote:
>> If this is widespread, I think the chances are slim that it is a
>> hardware problem in every case.
> 
> I'm in definite agreement here.  I think it might be worthwhile to note
> what hardware we're all using, in case there's something similar between
> our systems (chipset, disk vendor, etc.).
> 
> My system is as follows; timeouts were reported during an rsync of data
> from the ZFS stripe (ad8+ad10) to a UFS2 filesystem on ad6.  System
> eventually panic'd after remaining deadlocked (while kernel messages
> about timeouts kept printing on the console for ad6 only) for 10-15
> minutes.
>
> *   MB: Supermicro PDSMI+  (Intel ICH7-based)
> *  CPU: Intel Core 2 Duo E6600
> *  RAM: Corsair CM2X1024-6400 DDR2, 2GB
> *  ad4: WD Caviar SE WD2000JD (boot/OS)
> *  ad6: Seagate Barracuda 7200.10 ST3500630AS
> *  ad8: WD Caviar SE16 WD5000AAKS (ZFS stripe)
> * ad10: WD Caviar SE16 WD5000AAKS (ZFS stripe)
> * All drives are hooked up to the ICH7.
> * SMART stats showed no problems on any of the drives before or after.
> * RELENG_7, i386, ULE scheduler.

Mine is as follows:

*   MB: Tyan Trinity S2099
*  CPU: Pentium 4, 2.4GHz
*  RAM: Crucial DDR, ECC, CL2.5, Unbuffered 2GB (1/2 PC2100, 1/2 PC2700)
*  ad0: Seagate ST3500630A 3.AAE (1 UFS2 boot, 1 ZFS pool)
*  ad1: Seagate ST3160812A 3.AAH (not used by FreeBSD)
* Intel ICH4 UDMA100 controller
* ATI Radeon RV280 9250
* Intel PRO/1000 NIC
* 7.0-RC1, i386, ULE scheduler

-Joe


Re: ad8: TIMEOUT - WRITE_DMA errors UFS 7.0-RC1

2008-01-26 Thread Joe Peterson
Remco van Bekkum wrote:
> Same here. On an amd64 system with 1x sata disk (Western Digital Caviar
> Green Power) on an amd690G chipset, with UFS and intensive disk activity
> the system hangs and in the end it may panic. I've csupped today and
> rebuild world & generic kernel but still it's very unstable, sometimes it
> even hangs when activating geom volumes at boot time... 
> I must add that this is a new system so I'm not 100% sure the hardware is 
> sane.
> Using ZFS it also crashed when doing intensive I/O.

This is very interesting.  It seems there are several of us who are
experiencing something that *looks* like hardware (disk) issues when using 7.0.

Could this be related to the mouse freeze issue?  Could some process be
locking/grabbing the CPU at inopportune times and causing not only the
freezing symptoms but also read/write problems?

Can anyone else using 7.0 who hasn't already (especially those using ZFS)
check his/her /var/log/messages for disk TIMEOUTs or other disk error
messages?  If this is widespread, I think the chances are slim that it is a
hardware problem in every case.
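To make that check quicker, here is a minimal Python sketch that tallies ATA TIMEOUT/FAILURE lines per device and command.  The pattern is an assumption modeled only on the messages quoted in this thread (e.g. "ad6: TIMEOUT - WRITE_DMA retrying ..."), so other drivers or message formats would need a different expression, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
#   ad6: FAILURE - WRITE_DMA timed out LBA=13951071
# WARNING lines are deliberately not counted.
ATA_ERROR = re.compile(r"\b(ad\d+): (TIMEOUT|FAILURE) - (\w+)")


def count_ata_errors(lines):
    """Return a Counter keyed by (device, kind, command)."""
    counts = Counter()
    for line in lines:
        found = ATA_ERROR.search(line)
        if found:
            counts[found.groups()] += 1
    return counts


def scan_log(path="/var/log/messages"):
    """Count ATA errors in a log file; undecodable bytes are replaced."""
    with open(path, errors="replace") as log:
        return count_ata_errors(log)


# Demonstration on lines quoted in this thread (no log file needed).
sample = [
    "ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly",
    "ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071",
    "ad6: FAILURE - WRITE_DMA timed out LBA=13951071",
]
for key, number in sorted(count_ata_errors(sample).items()):
    print(key, number)
```

Calling scan_log() with no arguments reads /var/log/messages, which may require root depending on the file's permissions.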

-Joe


Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1

2008-01-26 Thread Joe Peterson
Ivan Voras wrote:
> Were both tests done in the same machine (actually, I mean the same PSU)?

Yes - I deliberately changed nothing (not even cables) before I ran the tests.
I didn't want any variables.

-Joe


Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1

2008-01-26 Thread Joe Peterson
Joe Peterson wrote:
> So I have started a "SeaTools" (disk scanner from Seagate) "long test" of the
> drive.  The short test passed already.  The results should be interesting.  If
> it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS
> bugs that just happen to look like drive problems.  I already did a long read,
> under linux, of disk contents, and got no messages about anything wrong.

Update: both SHORT and LONG tests passed for this drive in SeaTools.
Hmph...  the mystery remains.
-Joe


Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1

2008-01-26 Thread Joe Peterson
I performed a ZFS scrub, which finished yesterday, and no new
/var/log/messages errors were reported during that time.  However, the scrub
found something interesting:


crater# zpool status -v
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 1 errors on Fri Jan 25 12:52:32 2008
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     2
          ad0s1d    ONLINE       1     3     2

errors: Permanent errors have been detected in the following files:


/home/joe/music/jukebox/christmas/Esquivel/Merry_XMas_from_the_SpaceAge_Bachelor_Pad/07-Snowfall.mp3



Note that I have not touched this file since copying it to this drive.

So, it seems one file failed a checksum check during the scrub.  I now
(expectedly) get errors trying to read this file - probably ZFS indicating the
condition.  When I just logged in tonight, I got two more /var/log/messages
disk messages about WRITE_DMA48 TIMEOUT/FAILURE - might be a coincidence (just
as I was typing my password).

Also, smartctl still shows PASSED; however, this is interesting:

195 Hardware_ECC_Recovered  0x001a   061   046   000    Old_age   Always       -       9070

The number is much *smaller* now!  It was "6" a few minutes before this...
wrap around?  Hmm, I'm really not sure, at this point, what is going on.

So I have started a "SeaTools" (disk scanner from Seagate) "long test" of the
drive.  The short test passed already.  The results should be interesting.  If
it finds nothing wrong, I am going to start to wonder if I am experiencing ZFS
bugs that just happen to look like drive problems.  I already did a long read,
under linux, of disk contents, and got no messages about anything wrong.

If I can turn on any debugging info to help determine if this is
software-related, let me know the magic keywords to use.  :)

-Joe


Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1

2008-01-25 Thread Joe Peterson
Glad you got it back!  Yes, when I was first playing with ZFS, I noticed
that booting between single and multi user mode could make the pools
"invisible".  Import seemed to bring them back...

So, is the disk toast, or can you still read anything from it (part
table, etc.)?

-Joe


Jeremy Chadwick wrote:
> On Fri, Jan 25, 2008 at 05:00:54PM -0800, Jeremy Chadwick wrote:
>> icarus# zfs list
>> no datasets available
>>
>> This doesn't bode well, and doesn't make me happy.  At all.
> 
> Pshew!  I was able to get ZFS to start seeing the pool again by doing
> the following:  (Supposedly "zpool import" by itself will show you a
> list of pools which it manages to see...)
> 
> icarus# zpool import -f storage
> icarus# df -k /storage
> Filesystem  1024-blocks      Used     Avail Capacity  Mounted on
> storage       957873024 106124032 851748992    11%    /storage
> icarus# zfs list
> NAME  USED  AVAIL  REFER  MOUNTPOINT
> storage   101G   812G   101G  /storage
> icarus# zpool status
>   pool: storage
>  state: ONLINE
>  scrub: none requested
> config:
> 
>         NAME        STATE     READ WRITE CKSUM
>         storage     ONLINE       0     0     0
>           ad8       ONLINE       0     0     0
>           ad10      ONLINE       0     0     0
> 
> errors: No known data errors
> 
> Back to the drawing board.
> 


Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1

2008-01-25 Thread Joe Peterson
Jeremy Chadwick wrote:
> Joe, I wanted to send you a note about something that I'm still in the
> process of dealing with.  The timing couldn't be more ironic.
> 
> I decided it would be worthwhile to migrate from my two-disk ZFS stripe
> with a non-ZFS disk for nightly backups, to a RAIDZ pool of all 3
> disks combined (since they're all the same size).  I had another
> terminal with gstat -I500ms running in it, so I could see overall I/O.
> 
> All was going well until about the 81GB mark of the copy.  gstat started
> showing 0KB in/out on all the drives, and the rsync was stalled.  ^Z did
> nothing, which is usually a bad sign.  :-)  I ssh'd in and did a dmesg
> (summarised):
> 
> ad6: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad6: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad6: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
> ad6: FAILURE - WRITE_DMA timed out LBA=13951071
> ad6: FAILURE - WRITE_DMA timed out LBA=13951327
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
> ad6: FAILURE - WRITE_DMA timed out LBA=13951583
> ad6: FAILURE - WRITE_DMA timed out LBA=13951839
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
> ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
> g_vfs_done():ad6s1d[WRITE(offset=7142916096, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143047168, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143178240, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143309312, length=131072)]error = 5
> g_vfs_done():ad6s1d[WRITE(offset=7143440384, length=131072)]error = 5
> 
> It appears my /dev/ad6 (a Seagate -- more irony) must have some bad
> blocks.  Actually, after letting things go for a while, I realised the
> box just locked up.  Probably kernel panic'd due to the I/O problem.
> I'll have to poke at SMART stats later to see what showed up.

Wow, pretty crazy!  Hmm, and yes, those LBAs do look close together.
Well, let me know how the smartctl output looks.  I'd be curious if your
bad sector count rises.  I had noticed that 1
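
A rough sketch of the arithmetic behind "close together", using the LBAs
from the quoted dmesg.  The 512-byte sector size is an assumption (the
dmesg excerpt does not state it), and the variable names are only for
illustration:

```python
import re

SECTOR_BYTES = 512  # assumed logical sector size, not stated in the dmesg

# TIMEOUT lines copied from the quoted ad6 dmesg output
log = """\
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951071
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951327
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951583
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13951839
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952095
ad6: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=13952351
"""

# Pull out the LBAs in the order they were logged
lbas = [int(m) for m in re.findall(r"LBA=(\d+)", log)]

# Distance between consecutive timed-out writes, in sectors and in bytes
gaps = [b - a for a, b in zip(lbas, lbas[1:])]
gap_bytes = [g * SECTOR_BYTES for g in gaps]

print(gaps)
print(gap_bytes)
```

If the sector-size assumption holds, every timed-out write starts exactly
256 sectors (131072 bytes) after the previous one, which is the same as
the length=131072 shown in the g_vfs_done() lines.  That looks more like
one sequential run of 128 KB requests timing out than like scattered bad
spots, though it does not prove either.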

BTW, I tried:

crater# dd if=/dev/ad1s4 of=/dev/null bs=64k
^C1408596+0 records in
1408596+0 records out
92313747456 bytes transferred in 1415.324362 secs (65224446 bytes/sec)

(I let it go for 92GB or so) - no messages about ad1.  So I wonder if
this points at either the cable connector on ad0 or the drive itself.  I
guess I'd rather have a failing drive than motherboard...
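
For what it's worth, the dd numbers above are self-consistent.  A quick
check (a sketch; the variable names are mine):

```python
# Figures copied from the dd output above
records = 1408596
block_size = 64 * 1024          # bs=64k
bytes_reported = 92313747456
seconds = 1415.324362

# Records times block size should equal the byte count dd reported
total = records * block_size

# Average sequential read rate over the whole run
rate = bytes_reported / seconds

print(total == bytes_reported)
print(int(rate))
```

The average works out to about 65 MB/sec over roughly 92 GB of ad1s4,
with no ad1 messages logged during the read.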

I originally was wondering if somehow something peculiar about ZFS's
disk access pattern was making it happen...

Thanks for the recommendations.  I'll keep an eye on it, and I'll let you
know what a cable change does for me.  Still, I have not had any ad0
messages since this morning (I haven't been using the system much today,
but maybe the cron processes are more likely to trigger it...).

-Joe


Re: New KTR trace for mouse freezing/stuttering in 7.0-RC1

2008-01-25 Thread Joe Peterson
John Baldwin wrote:
> Hmm, when I look at that graph using schedgraph from HEAD it just looks
> like xtrs is using up all the CPU.

Yeah, xtrs is eating a lot of CPU, but I've never seen this affect the
mouse movement (making it really jerky) the same way on, e.g., Linux.
And the xtrs test is just a way to *reliably* make it happen.  It
happens intermittently all of the time (at least every few minutes, and
often in small batches) even when the system is pretty idle...

-Joe


Re: New KTR trace for mouse freezing/stuttering in 7.0-RC1

2008-01-25 Thread Joe Peterson
Sam Leffler wrote:
> Sigh, you are correct.  I backrev'd the machine where I ran schedgraph 
> to RELENG_7 and didn't notice the old version mis-parses the ktr file.  
> The graph is totally different w/ schedgraph from HEAD.
> 
> Sorry Joe for misleading you.

No problem, Sam, but the question I have for you now is: do you see
anything with the updated schedgraph that indicates any "freezes" that
look funny?  The ones I saw with mouse movement mostly lasted some
fraction of a second, from maybe 1/8 to 1/2 sec.  And there should be a
lot of them in quick succession.

Thanks, Joe



Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1

2008-01-25 Thread Joe Peterson
Chuck Swiger wrote:
> On Jan 25, 2008, at 11:24 AM, Joe Peterson wrote:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
> [ ... ]
>>   7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
> [ ... ]
>> 195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300
> 
> These numbers are quite worrisome: they should be zero or nearly so
> in a healthy drive.

It seems to depend on the drive manufacturer.  E.g. this is a Seagate.  Every
Seagate I've ever had (or heard about on the web via smartctl dumps) reports
very large numbers for these values.  I've heard it described that Seagate
shows you the raw numbers (and correctable errors do happen all the time in
all drives).

In Western Digital drives (IIRC), the numbers shown are the ones that *should*
be zero, thereby hiding the low-level errors.

Hard to say if my numbers are "too high", but these "corrected" error counts
are always frighteningly high in Seagates.
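
One way to look at these attributes without reading anything into the
raw column is to compare the normalized VALUE against THRESH, which (as
far as I know) is what the WHEN_FAILED column is based on.  A minimal
sketch, assuming the ten-column smartctl layout quoted above; the parse()
helper is hypothetical, not part of smartctl:

```python
# Attribute lines as quoted above (whitespace normalized)
lines = [
    "  1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948",
    "  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605",
    "195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300",
]


def parse(line):
    """Split one attribute line into the fields of interest."""
    f = line.split()
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),   # normalized current value
        "worst": int(f[4]),   # lowest normalized value seen so far
        "thresh": int(f[5]),  # vendor failure threshold
        "raw": int(f[9]),     # vendor-specific raw counter
    }


attrs = [parse(line) for line in lines]

for a in attrs:
    # Margin between the normalized value and the failure threshold
    margin = a["value"] - a["thresh"]
    print(f'{a["name"]:<24} value={a["value"]:>3} thresh={a["thresh"]:>3} margin={margin:>3}')
```

By that measure all three attributes sit well above their thresholds,
even though the raw counters are huge.  Whether the raw values mean
anything beyond that is vendor-specific, and I am not claiming to know
how Seagate encodes them.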

-Joe



Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1

2008-01-25 Thread Joe Peterson
Jeremy Chadwick wrote:
> What you've shown is usually the sign of a disk-related problem.  It's
> very obvious when it's just one disk reporting DMA errors.  You use ZFS,
> so chances are you have more than one disk in a pool/volume -- there's
> no indication ad1, ad4, ad6, etc. are failing, so this seems to indicate
> something specific to ad0.

Jeremy, thanks for the response - I have tried to answer all of your
questions below...

In my case, I am using only one disk (ad0) for FreeBSD, and I am only
using one partition on this disk in my ZFS pool.  So, in this case,
unfortunately, it's not possible to tell from the fact that only ad0 is
listed that it is specific to this drive.

> Manufacturers pick very passive (non-aggressive) thresholds for error
> conditions on disks, so disks which are failing very commonly show
> "PASSED" during SMART analysis.  To make matters worse, most users I
> know read SMART stats incorrectly (they're easy to misinterpret).

Yep, I am also always skeptical of SMART reports.  That's one reason I
am very interested in ZFS.  I don't trust the drive to be completely
reliable, and the fact that ZFS does end-to-end data integrity is very
intriguing.

> Can you please provide output of the following:
> 
> * smartctl -a /dev/ad0

OK, I've attached this to the end of this email.

> * atacontrol cap ad0

Protocol  ATA/ATAPI revision 7
device model  ST3500630A
serial number 9QG0DG03
firmware revision 3.AAE
cylinders 16383
heads 16
sectors/track 63
lba supported 268435455 sectors
lba48 supported   976773168 sectors
dma supported
overlap not supported

Feature                      Support  Enable    Value           Vendor
write cache                    yes      yes
read ahead                     yes      yes
Tagged Command Queuing (TCQ)   no       no      0/0x00
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      65278/0xFEFE
automatic acoustic management  no       no      0/0x00          208/0xD0

> * atacontrol info 

Master:  ad0  ATA/ATAPI revision 7
Slave:   ad1  ATA/ATAPI revision 7

(but note that ad1 is not used by FreeBSD)

> * Relevant dmesg output that indicates what kind of ATA controller
>   these disks are attached to.  Start with output from 'ad0:' and
>   work backwards.  For example, ad0 on this machine is using an Intel
>   ICH6 controller:
>   atapci0:  port 
> 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
>   ata0:  on atapci0
>   ad0: 238475MB  at ata0-master SATA150

atapci0:  port
0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.1 on pci0

ata0:  on atapci0

ata0: [ITHREAD]
ad0: 476940MB  at ata0-master UDMA100
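
As a sanity check, the sector counts from "atacontrol cap" line up with
the size in dmesg.  A small sketch, assuming 512-byte sectors (not
printed by atacontrol) and assuming that an LBA above 2^28 - 1 is what
makes a request need a 48-bit command; the exact cutoff the ata driver
uses is not something I have verified:

```python
SECTOR_BYTES = 512  # assumed; atacontrol cap does not print the sector size

lba28_sectors = 268435455   # "lba supported" from atacontrol cap
lba48_sectors = 976773168   # "lba48 supported" from atacontrol cap

# Whole-disk size in MB (2^20 bytes), for comparison with the dmesg line
size_mb = lba48_sectors * SECTOR_BYTES // (1024 * 1024)

# Capacity reachable with 28-bit addressing only, in GiB
lba28_limit_gib = lba28_sectors * SECTOR_BYTES / 2**30


def beyond_lba28(lba):
    """True if this LBA cannot be expressed in a 28-bit address."""
    return lba > 2**28 - 1


print(size_mb)
print(round(lba28_limit_gib, 1))

# A few LBAs taken from the log in my original report
for lba in (54112319, 237661119, 453849407, 764596887):
    print(lba, beyond_lba28(lba))
```

The size comes out to 476940 MB, the same figure dmesg shows for ad0.
The LBAs logged with WRITE_DMA48/READ_DMA48 in my original report
(453849407, 764596887) are above the 28-bit limit and the plain
WRITE_DMA ones are below it, so the timeouts show up with both command
sets.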

> SMART stats which are labelled "Offline" are only updated when a short
> or long offline test is performed.  Have you tried using "smartctl -t
> short /dev/ad0" and "smartctl -t long /dev/ad0" to see if any of the raw
> values on the far right column increment?

I just tried one:

# 1  Short offline       Completed without error       00%      5252         -
# 2  Short offline       Completed without error       00%      5252         -

Also, none of the numbers that were zero incremented, esp:

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

Also, no more errors were reported in the system log during the self-tests.

> Have you tried using "zpool scrub" on the ZFS pool, then "zpool status"
> to see if READ/WRITE/CKSUM counters increment or if the "scrub" line
> states there were errors?

OK, I started a scrub, and it will take some more time to complete...
But I get the following with status.  Could this be due to the timeouts
and failures?  I suspect so, so maybe this is not surprising.  I'd also
guess that this doesn't necessarily point to the drive, but to anything in
the chain of events...  I do not have a mirror or RAID-Z, so I guess the
reason there was "no data loss" (yet) is because the checksum passed,
and maybe it just had to retry...?  Anyway, here's the output so far:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 2.50% done, 1h58m to go
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       1     3     0
      ad0s1d    ONLINE       1     3     0

errors: No known data errors
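
To keep an eye on these counters while the scrub runs, something like
the following could pull them out of the config section.  This is only a
sketch: it assumes the five-column layout shown above with plain integer
counters, and error_counters() is a made-up helper, not a ZFS tool:

```python
# The config section from the zpool status output above
status_config = """\
NAME        STATE     READ WRITE CKSUM
tank        ONLINE       1     3     0
  ad0s1d    ONLINE       1     3     0
"""


def error_counters(config_text):
    """Map each pool/vdev name to its state and error counters."""
    rows = {}
    for line in config_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) != 5:
            continue                            # not a counter row
        name, state, read, write, cksum = fields
        rows[name] = {
            "state": state,
            "read": int(read),
            "write": int(write),
            "cksum": int(cksum),
        }
    return rows


counters = error_counters(status_config)

for name, c in counters.items():
    io_errors = c["read"] + c["write"]
    print(f"{name}: {io_errors} I/O error(s), {c['cksum']} checksum error(s)")
```

Here both tank and ad0s1d show 4 I/O errors (1 read, 3 write) and 0
checksum errors, which fits my guess that requests failed or timed out
but nothing that was actually read back failed its checksum.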

> Other things which have fixed problems in the past for others:
> 
> * BIOS updates
> * Change of motherboards (sometimes replacing board with same model,
>   other times going 

"ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1

2008-01-25 Thread Joe Peterson
I've seen mention of this kind of issue before, but I never saw a
solution, except that someone reported that a certain version of 6.x
seemed to make it go away - accounts of this problem are a bit vague.  I
am running 7.0-RC1, and I am seeing the errors periodically, and I am
wondering if this is a known issue.  Note that smartctl does not report
errors logged and gives a "PASSED" to the drive.  I am running at
UDMA100 ATA.  Also, if it matters, I am using ZFS.

Attached is a grep of the /var/log/messages file.  Let me know if anyone
has suggestions.

Thanks!  Joe
Jan 21 23:39:54 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54112319
Jan 22 00:06:29 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=51610951
Jan 22 00:16:40 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53031647
Jan 22 00:30:15 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54243391
Jan 22 07:05:59 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=51768047
Jan 22 09:08:16 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=55890239
Jan 22 09:17:52 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=55919423
Jan 22 09:23:42 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53470111
Jan 23 00:26:03 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53588527
Jan 23 00:26:26 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764596887
Jan 23 00:26:26 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764596887
Jan 23 00:26:26 crater kernel: ad0: FAILURE - WRITE_DMA48 status=51 error=10 LBA=764596887
Jan 23 03:01:06 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=185819705
Jan 23 03:01:37 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54837686
Jan 23 03:03:22 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53472407
Jan 23 03:03:39 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=53627991
Jan 23 11:33:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=5747
Jan 23 12:30:31 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=55407234
Jan 23 13:20:06 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=57779519
Jan 23 17:30:18 crater kernel: ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=453849407
Jan 23 17:30:19 crater kernel: ad0: FAILURE - READ_DMA48 status=51 error=10 LBA=453849407
Jan 23 17:30:29 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=187373078
Jan 23 18:34:50 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=1017919
Jan 23 18:35:00 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=54547647
Jan 23 18:35:12 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=56354060
Jan 23 18:35:20 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=53919167
Jan 23 23:59:18 crater kernel: ad0: TIMEOUT - FLUSHCACHE retrying (1 retry left)
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=237661119
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=237661119
Jan 24 00:00:27 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=237661119
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236239553
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=236239553
Jan 24 00:00:27 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=236239553
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764595671
Jan 24 00:00:27 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764595671
Jan 24 00:00:27 crater kernel: ad0: FAILURE - WRITE_DMA48 timed out LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: FAILURE - WRITE_DMA48 timed out LBA=764595671
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236180175
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=236180175
Jan 24 00:01:13 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=236180175
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - FLUSHCACHE retrying (1 retry left)
Jan 24 00:01:13 crater kernel: ad0: TIMEOUT - FLUSHCACHE retrying (0 retries left)
Jan 24 02:31:53 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=236191551
Jan 24 04:54:57 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=238068287
Jan 24 04:55:56 crater kernel: ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=238068287
Jan 24 04:55:56 crater kernel: ad0
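
For anyone who wants to summarize a log like this, here is a minimal
sketch that tallies entries by outcome and ATA command.  It assumes each
log entry sits on a single line (mail wrapping has to be undone first),
and the regular expression is my own guess at a sufficient pattern, not
anything from the ata driver:

```python
import re
from collections import Counter

# A few entries copied from the log above, one entry per line
sample = """\
Jan 23 00:26:26 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764596887
Jan 23 00:26:26 crater kernel: ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764596887
Jan 23 00:26:26 crater kernel: ad0: FAILURE - WRITE_DMA48 status=51 error=10 LBA=764596887
Jan 23 03:01:06 crater kernel: ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=185819705
Jan 23 23:59:18 crater kernel: ad0: TIMEOUT - FLUSHCACHE retrying (1 retry left)
Jan 24 00:00:27 crater kernel: ad0: FAILURE - WRITE_DMA timed out LBA=237661119
"""

# device, outcome (TIMEOUT or FAILURE), and the ATA command name
PATTERN = re.compile(r"kernel: (\w+): (TIMEOUT|FAILURE) - (\w+)")

tally = Counter()
for line in sample.splitlines():
    match = PATTERN.search(line)
    if match:
        device, kind, command = match.groups()
        tally[(kind, command)] += 1

for (kind, command), count in sorted(tally.items()):
    print(f"{kind:<8} {command:<12} {count}")
```

Run over the full log, a tally like this would show at a glance how many
timeouts recovered on retry and how many ended in FAILURE, split by
reads, writes, and cache flushes.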

Re: New KTR trace for mouse freezing/stuttering in 7.0-RC1

2008-01-24 Thread Joe Peterson
Sam Leffler wrote:
>>  http://www.skyrush.com/downloads/ktr_ule_4.out
>>
> I don't see what it is 
> from the trace data.  It sort of looks like the last thing that ran is 
> the swi4 which is likely a callout (need to check the log file contents 
> to be certain).  If the callback function does something it wouldn't 
> necessarily be visible in the schedgraph plot.  If you could stick a 
> dmesg from booting out in the same spot it might be worthwhile.

OK, I just ran a dmesg and put it up there:

http://www.skyrush.com/downloads/dmesg_4.out

The WRITE_DMA messages are not time-correlated with this issue; I don't
like the looks of those either, but that's a different issue to look into...

> Also if 
> you rebuild the kernel with DIAGNOSTIC then softclock() will 
> complain about callouts that take longer than 2ms to run.

OK, recompiling now...  Will the new messages appear in dmesg, or in a
log file?

> This might 
> generate too much noise in which case you can adjust the threshold by 
> editing the code in sys/kern/kern_timeout.c.

Cool - thanks for looking at this, and I will let you know what I find!
 Do I need to make another trace concurrently, or should I just repeat
the test procedure and see if I get new messages?

-Thanks, Joe


New KTR trace for mouse freezing/stuttering in 7.0-RC1

2008-01-23 Thread Joe Peterson
In an attempt to track down this mouse freezing/stuttering (i.e. "jerky
mouse movement") behavior in FreeBSD 7.0-RC1, I have come up with a
reliable way to cause it to happen, and I have created a longer trace
showing the results.  Note that I am using the ULE scheduler.

In general, it becomes easier to see the effect if there is CPU
activity.  I have noticed it during kernel compiles, while at the same
time loading web pages in firefox that contain images (and moving the
mouse while this is happening).  But a more controlled way to see it is
to run something that uses some CPU and then generate lots of X events.

In my case, I start "xtrs" (TRS-80 emulator) in Model IV mode, which
happens to poll for input, using the CPU.  Then I move the mouse back
and forth quickly between windows in "focus under mouse" mode (in my
case, a KDE focus mode), which causes many focus events quickly.  In
about 15 or 20 seconds, the mouse reliably starts to show erratic
movement, not moving smoothly.

I really hope this can shed more light on what might be going on.  Here
is the trace:

http://www.skyrush.com/downloads/ktr_ule_4.out

Thanks, Joe



Re: 7.0-PRERELEASE desktop system periodically freezes momentarily

2008-01-23 Thread Joe Peterson
J.R. Oldroyd wrote:
> On Wed, 23 Jan 2008 08:27:58 -0700, Joe Peterson <[EMAIL PROTECTED]> wrote:
>> Also, it seems that intermittent mouse freezes happen more often when
>> I've been away from the machine for a while and return to start using
>> the mouse again, but that's not always the case.  A few short
>> freezes/stutters happen a second or so after mouse movement resumes.
>>
>>  -Joe
> 
> Joe,
> 
> I don't see any postings from you showing any ktr dumps.  Do you have
> any?  Your symptoms (that it seems to happen after you've been away
> for a while and then return and move the mouse) sound a lot like mine.

Hi J.R., here is the post that contains links to my dumps:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-January/039599.html

> I posted some ktr dumps and have since chatted off-list with Kris and
> Sam about what may be up.  My dumps show the shared irq ath/pcm and
> the ath taskq are hogging the cpu for ages without the clock swi getting
> to run at all.  Sam has suggested experimenting with the ath taskq
> priority and also with disabling ath bg scans which I will do, but
> right now I am back to looking at powerd again as the possible cause.

Hmm, well I don't have an Atheros on this machine - only ethernet.
Also, I have not tried playing audio, so what I am seeing is simply with
"normal" use.

> I ran without powerd for a while when originally suggested by David
> Lawrence on Jan 12th.  I believe I did still see freezes then, but I
> re-enabled powerd when I was ready to do LOCK_PROFILING and then ktr
> monitoring; I re-enabled it so I could be sure I had the same test
> conditions.  At this point, I am no longer sure what happened when
> powerd was disabled.  My recollection is that there were freezes while
> powerd was off, but the only email in which I appear to have posted
> about that says "no freezes so far".  So I'm running without powerd again,
> and at this point, several hours at the computer over two days, I have
> not seen further freezes.  Does anyone else who sees these freezes also
> have powerd enabled and can try without powerd for a while?

Mine is a desktop machine, so I have not enabled powerd.

> Since these freezes are proving so hard to pinpoint, it may be worth
> comparing notes to try to find things in common between the systems
> or eliminate other things.  But first, it seems like we may be chasing
> three separate causes:
> 
> 1. the softupdate freeze
>   after removing a very large file (e.g., >1Gb) there is a
>   noticeable freeze while the softupdate runs 
> 
> 2. the busy freeze
>   folk complain of short freezes and mouse jerkiness while
>   the system is busy, e.g., glxgears or compilations
> 
> 3. the idle freeze
>   short and longer freezes (some going into minutes) apparently
>   when resuming work after having left the system mostly idle
>   for a while
> 
> Now, I also had the "busy freeze" when I first tested 7.0.  At that time
> (several weeks back now) someone suggested switching to the ULE scheduler,
> which I did, and the symptoms I had were dramatically improved.  Since
> then I've had occasions to run several compilations at once and had no
> mouse jerkiness.  But for folk who still have it: what scheduler do you
> have and what processes are running when it happens?

I seem to see #2 (busy freezes).  They are usually very short
(sub-second) freezes, and they happen randomly as I move the mouse
(well, I assume that I see it manifested in a mouse freeze, but it could
very well be a system or X freeze, since I see it in keyboard
key-held-down too).  The mouse usually moves smoothly, but every once in
a while, it "sticks" for a fraction of a second as I move it -
irritating to say the least.  Often the small freezes come in spurts,
but they often are one at a time as well.  When it comes in spurts, it
is often shortly after moving the mouse after lots of idle time (as if
the scheduler "wakes up" and has some fits for a short time - a
"non-scientific" description ;).

I am using ULE on 7.0.  I'm also using ZFS (so the soft-updates issue
doesn't apply, and I spoke with someone else who uses UFS2, not ZFS, and
he said the mouse jerked around pretty badly in 7.0 on his machine).

I started with using 4BSD under 7.0, of course, and yes, there were
worse batches of freezes with it, especially when starting KDE and when
compiling the kernel (it was nearly constant).  With ULE, I no longer
see compiles causing freezes, and generally the freezes are more subtle
and shorter - in other words, ULE *is* better than 4BSD in this respect,
but it is still worse than normal operation und

Re: 7.0-PRERELEASE desktop system periodically freezes momentarily

2008-01-23 Thread Joe Peterson
Wayne Sierke wrote:
> So it seems the only thing of interest that I've managed to capture so
> far pertains to glxgears - an instance of the "stutter" and a part of a
> short freeze when dragging its window. Unfortunately these frequent
> mouse disconnects make it difficult to recognise genuine freezes during
> 'normal' use, if indeed they are still occurring with RELENG_7. However
> the glxgears behaviour remains (apparently) the same as it was on
> RELENG_6. Whether that's a telling sign or not remains to be seen.

Wayne, thanks for continuing to investigate, since these little
"freezes" definitely affect usability.  If I can help in any way, let me
know.  I have not made any further graphs, but I continue to see
intermittent mouse freezing (for short sub-second periods, usually).  As
for mouse disconnects, I don't know if that is what I am seeing, but one
thing I do notice is that the keyboard is also affected (easily seen by
holding down a key and letting it repeat - short pauses can be seen in
the echo, which could be xterm, X, or the keyboard input, of course).
Also, I tried unplugging my ps/2 mouse and using a USB one instead -
same issue exists.
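
If it would help to put numbers on those pauses, the idea is simply to
timestamp each repeated key (or mouse) event and look for gaps much
longer than the normal repeat interval.  A sketch of that check; the
timestamps below are made up for illustration (not measured on my
machine), and find_stalls() is a hypothetical helper:

```python
def find_stalls(timestamps_ms, expected_interval_ms, factor=3.0):
    """Return (start_ms, gap_ms) for every gap longer than factor * expected."""
    stalls = []
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        gap = cur - prev
        if gap > expected_interval_ms * factor:
            stalls.append((prev, gap))
    return stalls


# Hypothetical arrival times (ms) of key-repeat events at roughly 30 per
# second, with one quarter-second pause in the middle
events_ms = [0, 33, 66, 100, 350, 383, 416]

stalls = find_stalls(events_ms, expected_interval_ms=33)
print(stalls)
```

With these made-up numbers the only gap flagged is the 250 ms one that
starts at the 100 ms mark, which is in the same range as the freezes I
have been seeing.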

In case this is scheduler-related, I tried running a CPU-hogging task
(xtrs in "model 4" mode, which spins, polling for input).  While running
this and moving the mouse rapidly between two windows (I use
focus-under-mouse, so this causes focus events), I eventually get
repeated short mouse freezes for quite some time (maybe 10 seconds)
until things can catch up.  This is not reproducible on Linux CFS
(2.6.23) - the CPU use certainly affects event "catching up" in X, but
the mouse stays smooth.

Also, it seems that intermittent mouse freezes happen more often when
I've been away from the machine for a while and return to start using
the mouse again, but that's not always the case.  A few short
freezes/stutters happen a second or so after mouse movement resumes.

-Joe


Re: To 6.3 or to 7.0 that is the question?

2008-01-18 Thread Joe Peterson
One word: ZFS!  It's awesome.

-Joe


Steven Hartland wrote:
> With the announcement of 6.3 and with 7.0 looking like it won't be 
> far behind, I'd be interested to hear what people think of the relative
> benefits of each.
> 
> I know 7 has had a lot of work done on locking and ULE but are there
> any other reasons to go for that instead of 6.3? Conversely are there
> any reasons which would point away from 7 such as stability issues?
> 
> Regards
> Steve


Re: 7.0-PRERELEASE desktop system periodically freezes momentarily

2008-01-17 Thread Joe Peterson
Kris Kennaway wrote:
> KTR_SCHED

Kris, BTW, I am curious if the traces I posted were informative.  Let me
know if I did not create them correctly.  The xterm test seems to vary
in usefulness depending on video card (faster cards catch up too
quickly), but the freezing still happens quite often, especially with
apps like firefox.  Here's the post link:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-January/039599.html

Thanks, Joe


Re: RELENG_7 jerky mouse and skipping sound (still a problem -BETA3)

2008-01-14 Thread Joe Peterson
On 1 Jan, 14:17, Kris Kennaway <[EMAIL PROTECTED]> wrote:
> > OK, can you obtain a schedgraph trace when the problem is manifesting?
> > See /usr/src/tools/sched/ and previous discussion in this or related
> > threads.

I just recently installed 7.0-RC1, and I am seeing pretty severe "mouse
jerkiness" or "mouse freezing" while, e.g., compiling (as others have reported
here).  It's not just the mouse; keyboard events are also delayed in
the same manner (seen by holding down a key in xterm, e.g.).  I am on a UP
2.4GHz P4, using PS/2 mouse (with moused) and keyboard.

I'm glad I found this thread, since you are asking for traces.  I really hope
my traces help; this problem does seem like a regression from 6.2 (I had seen
slight mouse non-interactivity there too at times, but not nearly as bad).
Also, with Linux's new CFS making mouse movement *very* responsive, I think
it's vital that FreeBSD address this to avoid such comparisons.

I have tried both SCHED_4BSD and SCHED_ULE.  4BSD is a lot worse when
compiling, say, the kernel.  ULE is better when compiling, but still has
issues with, e.g., firefox loading a page, catching up on multiple xterm
window resizing (see below), etc.

This trace is while using SCHED_4BSD and compiling the kernel / moving mouse:

http://www.skyrush.com/downloads/ktr_4bsd.out

And here are three traces using SCHED_ULE:

http://www.skyrush.com/downloads/ktr_ule.out
http://www.skyrush.com/downloads/ktr_ule_2.out
http://www.skyrush.com/downloads/ktr_ule_3.out

Please check out all three, in case I did not get a good sampling of mouse
events and compiles in any one...

Strangely, ULE exhibits mouse jerkiness more than 4BSD for the following: I
opened an xterm and dragged the right edge of the window back and forth
quickly, making the window wider/narrower.  It is obvious in FreeBSD that this
queues up events for X (after some time, the window border no longer follows
the mouse at all), and if I release the mouse button at that time, leaving the
window narrow, and immediately move the mouse in circles, it is jerky for a
while, then returns to smooth action after about 5 or 10 seconds.  4BSD is not
as severe in this one case, and I never see this at all in Linux with CFS
(i.e. kernel 2.6.23) - the window resizing never really gets behind like this.

Here is a trace showing this for ULE (xterm still catching up, if I
remember correctly, at end):

http://www.skyrush.com/downloads/ktr_ule_resize.out

Here is one for 4BSD (xterm caught up before trace stopped):

http://www.skyrush.com/downloads/ktr_4bsd_resize.out

As an aside, renicing Xorg and moused to -10 seems to help smooth the
mouse when using 4BSD when compiling, whereas it is not needed (and
seems to have little or no effect) when using ULE (even though, as I said, ULE
still shows jerkiness).
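
For anyone wanting to repeat the renice experiment, here is a minimal sketch
(not part of my original test setup; the helper names parse_pids and
renice_by_name are made up for illustration).  It assumes pgrep is available
and that it is run as root, since lowering a nice value below 0 normally
requires root:

```python
import os
import subprocess


def parse_pids(pgrep_output):
    """Turn pgrep's one-PID-per-line output into a list of ints."""
    return [int(tok) for tok in pgrep_output.split()]


def renice_by_name(name, nice_value):
    """Set the nice value of every process whose name matches exactly.

    Hypothetical helper: looks up PIDs with `pgrep -x NAME`, then calls
    os.setpriority() on each one. Returns the list of PIDs it touched.
    """
    result = subprocess.run(
        ["pgrep", "-x", name], capture_output=True, text=True
    )
    pids = parse_pids(result.stdout)
    for pid in pids:
        # PRIO_PROCESS means "pid" is interpreted as a process ID.
        os.setpriority(os.PRIO_PROCESS, pid, nice_value)
    return pids
```

Calling renice_by_name("Xorg", -10) and renice_by_name("moused", -10) as root
should be roughly equivalent to renicing both by hand.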

-Thanks, Joe
