Re: taskqueue timeout [SOLVED]

2008-07-17 Thread Steve Bertrand
> Steve Bertrand wrote:

> The only other box I have with four SATA ports on it is my actual
> workstation. The board is ASUS P5GD1, and has an Intel 82801FR SATA
> controller.

I transferred the SATA disks to the above board, loaded up the zpool, and
I cannot reproduce the problem :)

For the last 15 minutes I've been writing 80MB/s to the zpool with no
problems.

Thanks all,

Steve
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: taskqueue timeout

2008-07-17 Thread Steve Bertrand

Steve Bertrand wrote:

> I'm wondering if the problems described in the following link have been
> resolved:
>
> http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2008-02/msg00211.html
>
> I've got four 500GB SATA disks in a ZFS raidz pool, and all four of them
> are experiencing the behavior.


Thanks to all who have provided patches off list. Unfortunately, none of 
them helped.


The only other box I have with four SATA ports on it is my actual 
workstation. The board is ASUS P5GD1, and has an Intel 82801FR SATA 
controller.


I despise the thought that if this works, I'll have to rebuild my
workstation, but here's to sacrificing my Windows PC in the name of
ruling out the problem.

In the meantime, can anyone provide any feedback on the board I
mentioned with regard to FreeBSD?


Steve


Re: taskqueue timeout

2008-07-15 Thread Steve Bertrand

Matthew Dillon wrote:

> This issue is vexing a lot of people.

Heh... I can appreciate this. I would like someone to inform me that
this can't be guaranteed to be a ZFS problem... if I can get
confirmation that others have this issue aside from ZFS, I would feel
content.

> Setting the timeout to 30 will not affect performance, but it will
> cause a 30 second delay in recovery when (if) the problem occurs.
> i.e. when the disk stalls it will just sit there doing nothing for
> 30 seconds, then it will print the timeout message and try to recover.

If I have the timeout at >= 30 and the issue still occurs, the problem
must be elsewhere.

> It occurs to me that it might be beneficial to actually measure the
> disk's response time to each request, and then graph it over a period
> of time.  Maybe seeing the issue visually will give some clue as to the
> actual cause.

I am interested in following through with this, but can't do it on my
own. I'm willing to dedicate the box and bandwidth to anyone who can
legitimately test this as you state, i.e. I need either guidance or
assistance.

This box is ready for the taking. Beyond this box, I can provide
legitimate parties with other network resources to produce a consistent
flow of data, so the issue can be reproduced locally, on demand.


Steve


Re: taskqueue timeout

2008-07-15 Thread Steve Bertrand

Andrew Snow wrote:

> From Western Digital's line of "enterprise" drives:
>
> "RAID-specific time-limited error recovery (TLER) - Pioneered by WD,
> this feature prevents drive fallout caused by the extended hard drive
> error-recovery processes common to desktop drives."
>
> Therefore I think the FreeBSD timeout should also be set to 8 seconds
> instead of 5 seconds.  Desktop-targeted drives will not respond for
> over 10 seconds, up to minutes, so it's not worth setting the FreeBSD
> timeout any higher.

Interesting you say this. To reiterate, I have /boot on a USB thumb drive,
and / is mounted from a raidz pool called /storage via loader.conf.


The four drives in question (per the packaging) are:

- Western Digital Caviar SE16 500GB
- 7200, 16MB, SATA-300, OEM

Per the packaging on the rest of the hardware:

# mobo
- XFX 610i, 7050 GeForce (I *never* use graphics on my FreeBSD boxen, I 
*only* know/have CLI with no 'windows')


# memory
- 2 GB Corsair XMS2 Twin2X 6400C4 memory

# cpu
- Intel Pentium DC E2200 2.20GHz OEM
- 2.20 GHz, 1MB Cache, 800MHz FSB, Allendale, Dual Core, OEM, Socket 
775, Processor


# swap
- I don't run any, but can/will add in an IDE/ATA 7200 200GB in the 
event this problem may be related to ZFS/RAM issues.


Steve


Re: taskqueue timeout

2008-07-15 Thread Matthew Dillon

:...
:> and see if the problem reoccurs with just two drives.
:
:... I knew that was going to come up... my response is "I worked so hard 
:to get this system with ZFS all configured *exactly* how I wanted it".
:
:To test, I'm going to flip to 30 as per Matthews recommendation, and see 
:how far that takes me. At this time, I'm only testing by backing up one 
:machine on the network. If it fails, I'll clock the time, and then 
:'reformat' with two drives.
:
:Is there a technical reason this may work better with only two drives?
:
:Is there anyone interested to the point where remote login would be helpful?
:
:Steve

This issue is vexing a lot of people.

Setting the timeout to 30 will not affect performance, but it will
cause a 30 second delay in recovery when (if) the problem occurs.
i.e. when the disk stalls it will just sit there doing nothing for
30 seconds, then it will print the timeout message and try to recover.

It occurs to me that it might be beneficial to actually measure the
disk's response time to each request, and then graph it over a period
of time.  Maybe seeing the issue visually will give some clue as to the
actual cause.
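
One rough way to act on the measurement idea above is a small user-space probe that times a series of synchronous writes and records each latency, so the series can be graphed afterwards. This is only a sketch under assumptions: the function names, probe file, sample count and block size are all made up for illustration, and because it goes through the filesystem it times the whole stack (ZFS, the ATA driver and the drive), not the drive alone.

```python
import os
import time


def probe_latency(path, samples=10, block=4096, interval=0.0):
    """Time `samples` synchronous writes of `block` bytes to `path`.

    Returns a list of (seconds_since_start, latency_seconds) tuples.
    """
    payload = b"\0" * block
    results = []
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(samples):
            t0 = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to push the write down to the device
            t1 = time.monotonic()
            results.append((t0 - start, t1 - t0))
            if interval:
                time.sleep(interval)
    finally:
        os.close(fd)
    return results


def slow_samples(results, threshold):
    """Return the samples whose latency exceeded `threshold` seconds."""
    return [r for r in results if r[1] > threshold]
```

Running `probe_latency()` against a file on the affected pool while the backup is in progress, and then passing the result to `slow_samples()` with a threshold near the driver timeout, would show whether requests creep up toward the timeout or jump there abruptly.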

-Matt
Matthew Dillon 
<[EMAIL PROTECTED]>


Re: taskqueue timeout

2008-07-15 Thread Steve Bertrand

Jeremy Chadwick wrote:
> On Tue, Jul 15, 2008 at 10:29:28PM -0400, Steve Bertrand wrote:
>> Is there anyone interested to the point where remote login would be helpful?
>
> I believe my FreeBSD Wiki page documents what to do if your problem
> is easily reproducible: contact Scott Long, who has offered to help
> track down the source of these problems.


Changing to 30 second timeout made no difference whatsoever. The problem 
occurred at about the same time during the single


I'm at a standstill.

I'm willing to help provide any information necessary to fix this issue, 
or provide remote access to the box in question.


scottl@ has been Cc:'d.

Thanks all,

Steve


Re: taskqueue timeout

2008-07-15 Thread Steve Bertrand

Alex Trull wrote:

> Don't want to give conflicting advice, and would suggest you certainly
> try the 30 sec thing first. I'm already on 10 myself but haven't pushed
> further.

What were you doing, and what did you notice when the problem started?

As much as it seems silly, I'm mostly interested in what your network
was doing at the time things went sour.

> In my own case I've not had any issue with zfs in particular since I
> applied the ZFS zil/prefetch disable loader.conf tunables 10 hours ago.
> I am observing this now.

For some reason, and with no explanation or science behind it, I don't
think this is a ZFS problem, and I'm trying to defend this thought to my
peers until I prove otherwise.

I have to be a bit careful on how I adjust loader properties, given that
I'm loading from USB, and mounting root from a ZFS zpool hard disk. Like
my GELI systems, tweaking things can be a bit touchy unless I put a
little more planning into it.

> For the record ..
>
> What ata chipset/motherboard and model of disk have you got ?

I'm not a hardware person per se, but I'm advised to post that the
motherboard is:

- XFX nForce 610i with GeForce 7050

If there is more hardware info I can provide, let me know specifically
what I should be looking for.

> Have you seen any smart errors (real or otherwise) ?
> What do your 'zpool status' counters look like ?

zpool status is always clean. There are no errors otherwise, even if the
box is up for multiple hours straight. The problem occurs only if I
throw work at it.


Steve


Re: taskqueue timeout

2008-07-15 Thread Jeremy Chadwick
On Tue, Jul 15, 2008 at 10:29:28PM -0400, Steve Bertrand wrote:
> Is there anyone interested to the point where remote login would be helpful?

I believe my FreeBSD Wiki page documents what to do if your problem
is easily reproducible: contact Scott Long, who has offered to help
track down the source of these problems.

I'll reply to the other part of your mail in a bit.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Re: taskqueue timeout

2008-07-15 Thread Steve Bertrand

Matthew Dillon wrote:
> :Went from 10->15, and it took quite a bit longer into the backup before
> :the problem cropped back up.

Jumping right into it, there is another post after this one, but I'm
going to try to reply inline:

> Try 30 or longer.  See if you can make the problem go away entirely,
> then fall back to 5 and see if the problem resumes at its earlier
> pace.

I'm sure 30 will either push the issue out further, or into non-existence,
but are there any developers here who can say what this timer does? i.e.
how does changing this timer affect the performance of the disk
subsystem (aside from allowing it to work, of course)?

After I'm done responding to this message, I'll test with the sysctl set
to 30.

> It could be temperature related.  The drives are being exercised
> a lot, they could very well be overheating.  To find out add more
> airflow (a big house fan would do the trick).



Temperature is a good thought, but currently, my physical situation has 
this:


- 2U chassis
- multiple fans in the case
- in my lab (which is essentially beside my desk)
- the case has no lid
- it is 64 degrees with A/C and circulating fans in this area
- hard drives are separated relatively well inside the case


> It could be that errors are accumulating on the drives, but it seems
> unlikely that four drives would exhibit the same problem.

That's what I'm thinking. All four drives are exhibiting the same
errors... or, for all intents and purposes, the machine is coughing up
the same errors for all the drives.

> Also make sure the power supply can handle four drives.  Most power
> supplies that come with consumer boxes can't under full load if you
> also have a mid or high-end graphics card installed.  Power supplies
> that come with OEM slap-together enclosures are not usually much better.

I currently have a 550W PSU in the 2U chassis, which again, is sitting
open. I have more hardware, running in worse conditions with
lower-wattage PSUs, that doesn't exhibit this behavior. I need to
determine whether this problem is SATA, ZFS, the motherboard or code.

> Specifically, look at the +5V and +12V amperage maximums on the power
> supply, then check the disk labels to see what they draw, then
> multiply by 2.  e.g. if your power supply can do [EMAIL PROTECTED] and
> you have four drives each taking [EMAIL PROTECTED] (and typically ~half
> that at 5V), that's 4x2x2 = [EMAIL PROTECTED] and you would probably be ok.

I'm well within specs, even after V/A tests with the meter. The power
supply is providing ample wattage to each device.



> To test, remove two of the four drives, reformat the ZFS to use just 2,
> and see if the problem reoccurs with just two drives.

... I knew that was going to come up... my response is "I worked so hard
to get this system with ZFS all configured *exactly* how I wanted it".

To test, I'm going to flip to 30 as per Matthew's recommendation, and see
how far that takes me. At this time, I'm only testing by backing up one
machine on the network. If it fails, I'll clock the time, and then
'reformat' with two drives.


Is there a technical reason this may work better with only two drives?

Is there anyone interested to the point where remote login would be helpful?

Steve


Re: taskqueue timeout

2008-07-15 Thread Andrew Snow

Matthew Dillon wrote:

> Try that first.  If it helps then it is a known issue.  Basically
> a combination of the on-disk write cache and possible ECC corrections,
> remappings, or excessive remapped sectors can cause the drive to take
> much longer than normal to complete a request.  The default 5-second
> timeout is insufficient.


From Western Digital's line of "enterprise" drives:

"RAID-specific time-limited error recovery (TLER) - Pioneered by WD, 
this feature prevents drive fallout caused by the extended hard drive 
error-recovery processes common to desktop drives."



Western Digital's information sheet on TLER states that they found most 
RAID controllers will wait 8 seconds for a disk to respond before 
dropping it from the RAID set.  Consequently they changed their 
"enterprise" drives to try reading a bad sector for only 7 seconds 
before returning an error.


Therefore I think the FreeBSD timeout should also be set to 8 seconds
instead of 5 seconds.  Desktop-targeted drives will not respond for
over 10 seconds, up to minutes, so it's not worth setting the FreeBSD
timeout any higher.
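
The comparison behind that recommendation can be written out as a small check. This is an illustration only: the function name is made up, and the 7, 8 and 5 second figures are simply the ones quoted above.

```python
def host_gives_up_first(host_timeout_s, drive_recovery_limit_s):
    """True if the host declares a timeout before the drive would report its own error."""
    return host_timeout_s < drive_recovery_limit_s


# Figures from the TLER description above: the drive stops retrying at 7 s.
print(host_gives_up_first(5, 7))  # True: a 5 s host timeout fires first
print(host_gives_up_first(8, 7))  # False: an 8 s host timeout lets the drive answer
```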



More info:
http://www.wdc.com/en/library/sata/2579-001098.pdf
http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery



- Andrew


Re: taskqueue timeout

2008-07-15 Thread Alex Trull
Don't want to give conflicting advice, and would suggest you certainly
try the 30 sec thing first. I'm already on 10 myself but haven't pushed
further.

In my own case I've not had any issue with zfs in particular since I
applied the ZFS zil/prefetch disable loader.conf tunables 10 hours ago.
I am observing this now.

For the record ..

What ata chipset/motherboard and model of disk have you got ?
Have you seen any smart errors (real or otherwise) ?
What do your 'zpool status' counters look like ?

--
Alex

On Tue, 2008-07-15 at 12:55 -0700, Matthew Dillon wrote:
> :Went from 10->15, and it took quite a bit longer into the backup before 
> :the problem cropped back up.
> 
> Try 30 or longer.  See if you can make the problem go away entirely.
> then fall back to 5 and see if the problem resumes at its earlier
> pace.
> 
> --
> 
> It could be temperature related.  The drives are being exercised
> a lot, they could very well be overheating.  To find out add more
> airflow (a big house fan would do the trick).
> 
> --
> 
> It could be that errors are accumulating on the drives, but it seems
> unlikely that four drives would exhibit the same problem.
> 
> --
> 
> Also make sure the power supply can handle four drives.  Most power
> supplies that come with consumer boxes can't under full load if you
> also have a mid or high-end graphics card installed.  Power supplies
> that come with OEM slap-together enclosures are not usually much better.
> 
> Specifically, look at the +5V and +12V amperage maximums on the power
> supply, then check the disk labels to see what they draw, then
> multiply by 2.  e.g. if your power supply can do [EMAIL PROTECTED] and
> you have four drives each taking [EMAIL PROTECTED] (and typically ~half
> that at 5V), that's 4x2x2 = [EMAIL PROTECTED] and you would probably be ok.
> 
> To test, remove two of the four drives, reformat the ZFS to use just 2,
> and see if the problem reoccurs with just two drives.
> 
>   -Matt
> 




Re: taskqueue timeout

2008-07-15 Thread Matthew Dillon
:Went from 10->15, and it took quite a bit longer into the backup before 
:the problem cropped back up.

Try 30 or longer.  See if you can make the problem go away entirely,
then fall back to 5 and see if the problem resumes at its earlier
pace.

--

It could be temperature related.  The drives are being exercised
a lot, they could very well be overheating.  To find out add more
airflow (a big house fan would do the trick).

--

It could be that errors are accumulating on the drives, but it seems
unlikely that four drives would exhibit the same problem.

--

Also make sure the power supply can handle four drives.  Most power
supplies that come with consumer boxes can't under full load if you
also have a mid or high-end graphics card installed.  Power supplies
that come with OEM slap-together enclosures are not usually much better.

Specifically, look at the +5V and +12V amperage maximums on the power
supply, then check the disk labels to see what they draw, then
multiply by 2.  e.g. if your power supply can do [EMAIL PROTECTED] and
you have four drives each taking [EMAIL PROTECTED] (and typically ~half
that at 5V), that's 4x2x2 = [EMAIL PROTECTED] and you would probably be ok.
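
The rule of thumb above (drives times per-drive draw times a safety factor of 2) is easy to turn into a quick check. The sketch below uses made-up function names and hypothetical figures: 2 A per drive on the 12 V rail, which matches the 4x2x2 arithmetic, and a 20 A rail rating, which is purely a placeholder. It also ignores everything else hanging off the same rail.

```python
def required_amps(drives, amps_per_drive, safety_factor=2):
    """Rail current to budget for: per-drive label draw times a safety margin."""
    return drives * amps_per_drive * safety_factor


def rail_ok(rail_max_amps, drives, amps_per_drive, safety_factor=2):
    """True if the rail's rated maximum covers the budgeted draw."""
    return rail_max_amps >= required_amps(drives, amps_per_drive, safety_factor)


# Four drives at a hypothetical 2 A each on the 12 V rail: 4 x 2 x 2 = 16 A.
print(required_amps(4, 2))  # 16
print(rail_ok(20, 4, 2))    # True for a hypothetical 20 A rail
```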

To test, remove two of the four drives, reformat the ZFS to use just 2,
and see if the problem reoccurs with just two drives.

-Matt



Re: taskqueue timeout

2008-07-15 Thread Steve Bertrand

Steve Bertrand wrote:
> Matthew Dillon wrote:
>> If you are getting DMA timeouts, go to this URL:
>>
>> http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
>>
>> Then I would suggest going into /usr/src/sys/dev/ata (I think, on
>> FreeBSD), locate all instances where request->timeout is set to 5,
>> and change them all to 10.
>>
>>     cd /usr/src/sys/dev/ata
>>     fgrep 'request->timeout' *.c
>>     ... change all assignments of 5 to 10 ...
>
> Changing 5 to 10 in all cases and rebuilding the kernel does not fix the
> problem.


Went from 10->15, and it took quite a bit longer into the backup before 
the problem cropped back up.


Here is what I was seeing at the time it failed. Where netstat and zpool 
iostat drop off is where I start seeing the errors occur:


# top

last pid:  1069;  load averages:  0.09,  0.17,  0.10    up 0+00:08:31  19:22:39
53 processes:  1 running, 52 sleeping
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 28M Active, 3644K Inact, 301M Wired, 76K Cache, 1634M Free
Swap:


# netstat -w 1 -h

      4.8K     0        11M       3.5K     0       5.4M     0
      4.5K     0        10M       3.3K     0       5.1M     0
      4.9K     0        11M       3.6K     0       5.5M     0
      4.8K     0        11M       3.5K     0       5.4M     0
      4.3K     0       9.5M       3.1K     0       4.8M     0
      5.1K     0        11M       3.7K     0       5.7M     0
      5.0K     0        11M       3.6K     0       5.6M     0
      5.3K     0        12M       3.9K     0       6.0M     0
      4.8K     0        11M       3.5K     0       5.4M     0
      4.7K     0        10M       3.4K     0       5.2M     0
      4.8K     0        11M       3.5K     0       5.4M     0
      4.6K     0        10M       3.4K     0       5.2M     0
      4.1K     0       9.1M       3.0K     0       4.6M     0
      5.3K     0        12M       3.9K     0       6.0M     0
      5.2K     0        12M       3.8K     0       5.8M     0
      4.3K     0       9.5M       3.1K     0       4.8M     0
      4.3K     0       9.6M       3.2K     0       4.9M     0
      5.4K     0        12M       4.0K     0       6.1M     0
      4.8K     0        11M       3.5K     0       5.4M     0
      2.4K     0       5.1M       1.7K     0       2.5M     0
            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes colls
         2     0        120          2     0        316     0
         3     0        180          4     0       1.0K     0
         3     0        180          2     0        316     0
         3     0        180          3     0        658     0
         5     0       1.6K          5     0        942     0
         3     0        254          4     0        840     0
         3     0        180          2     0        316     0


# zpool iostat 1

storage  6.40G  1.81T      0    296      0  37.0M
storage  6.43G  1.81T      0    188      0  14.5M
storage  6.43G  1.81T      0      0      0      0
storage  6.43G  1.81T      0      0      0      0
storage  6.43G  1.81T      0      0      0      0
storage  6.43G  1.81T      0     47      0  5.99M
storage  6.46G  1.81T      0    218      0  18.0M
storage  6.46G  1.81T      0      0      0      0
storage  6.46G  1.81T      0      0      0      0
storage  6.46G  1.81T      9      0   192K      0
storage  6.46G  1.81T      0     59      0  7.39M
storage  6.49G  1.81T      1    250  3.42K  14.9M
storage  6.49G  1.81T      0      0      0      0
storage  6.49G  1.81T      0      0      0      0
storage  6.49G  1.81T      0      0      0      0
storage  6.49G  1.81T      0    141      0  17.5M
storage  6.52G  1.81T      0     74      0   232K
storage  6.52G  1.81T      0      0      0      0
storage  6.52G  1.81T      0      0      0      0
storage  6.52G  1.81T      0      0      0      0
storage  6.52G  1.81T      0    151      0  18.8M
storage  6.52G  1.81T      0    114      0  8.07M
storage  6.52G  1.81T      0      0      0      0
storage  6.52G  1.81T      0      0      0      0
storage  6.52G  1.81T      0      0      0      0
storage  6.52G  1.81T      0      0      0      0
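
Spotting where the writes stop by eye gets tedious on longer captures. As a rough aid, the sketch below scans `zpool iostat 1` output and reports the longest run of samples with no write activity. It is not part of the original report, the function names are invented, and it assumes the seven-column layout shown above (pool, used, avail, read ops, write ops, read bandwidth, write bandwidth). Short idle runs between bursts may be ordinary write batching, so only a run much longer than the usual gap would point at a stall.

```python
def parse_size(text):
    """Convert a zpool-style figure such as '37.0M' or '232K' to a float in bytes."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}
    if text[-1] in units:
        return float(text[:-1]) * units[text[-1]]
    return float(text)


def longest_write_stall(lines):
    """Return the longest run of consecutive samples with zero write ops and bandwidth.

    Each line is expected to hold seven whitespace-separated fields:
    pool, used, avail, read ops, write ops, read bandwidth, write bandwidth.
    Lines that do not fit (headers, separators) are skipped.
    """
    longest = current = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            continue
        try:
            write_ops = parse_size(fields[4])
            write_bw = parse_size(fields[6])
        except ValueError:
            continue  # header or separator row
        if write_ops == 0 and write_bw == 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest
```

Since the samples are one second apart, the returned count is roughly the stall length in seconds.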



Don't know if this will help anyone or not.

Steve


Re: taskqueue timeout

2008-07-15 Thread Steve Bertrand

Matthew Dillon wrote:

> If you are getting DMA timeouts, go to this URL:
>
> http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
>
> Then I would suggest going into /usr/src/sys/dev/ata (I think, on
> FreeBSD), locate all instances where request->timeout is set to 5,
> and change them all to 10.
>
>     cd /usr/src/sys/dev/ata
>     fgrep 'request->timeout' *.c
>     ... change all assignments of 5 to 10 ...


Changing 5 to 10 in all cases and rebuilding the kernel does not fix the 
problem.


I'm going to install the patch that allows the values to be changed via 
sysctl and up it to 15.


This problem happens across all four disks.

Does anyone else have any suggestions on what I can check?

Steve


Re: taskqueue timeout

2008-07-15 Thread Steve Bertrand

Matthew Dillon wrote:

> If you are getting DMA timeouts, go to this URL:

Yes, I am.

> http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

I fall under the category of "ATA/SATA DMA timeout issues".

> Then I would suggest going into /usr/src/sys/dev/ata (I think, on
> FreeBSD), locate all instances where request->timeout is set to 5,
> and change them all to 10.
>
>     cd /usr/src/sys/dev/ata
>     fgrep 'request->timeout' *.c
>     ... change all assignments of 5 to 10 ...
>
> Try that first.  If it helps then it is a known issue.  Basically
> a combination of the on-disk write cache and possible ECC corrections,
> remappings, or excessive remapped sectors can cause the drive to take
> much longer than normal to complete a request.  The default 5-second
> timeout is insufficient.
>
> If it does help, post confirmation to prod the FBsd developers to
> change the timeouts.


I've just reproduced the problem, and will try hacking the code now to 
see if the problem goes away.


Since the box won't take input, I can't tell the disk usage at the time
it dies. However, the problem seems to appear while running an Amanda
backup, when my network throughput hits about 90 Mbps @ ~5 kpps.


I'll post back with results of the increase of the timeout.

Steve


Re: taskqueue timeout

2008-07-15 Thread Matthew Dillon

:Hi everyone,
:
:I'm wondering if the problems described in the following link have been 
:resolved:
:
:http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2008-02/msg00211.html
:
:I've got four 500GB SATA disks in a ZFS raidz pool, and all four of them 
:are experiencing the behavior.
:
:The problem only happens with extreme disk activity. The box becomes 
:unresponsive (can not SSH etc). Keyboard input is displayed on the 
:console, but the commands are not accepted.
:
:Is there anything I can do to either figure this out, or work around it?
:
:Steve

If you are getting DMA timeouts, go to this URL:

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

Then I would suggest going into /usr/src/sys/dev/ata (I think, on
FreeBSD), locate all instances where request->timeout is set to 5,
and change them all to 10.

cd /usr/src/sys/dev/ata
fgrep 'request->timeout' *.c
... change all assignments of 5 to 10 ...

Try that first.  If it helps then it is a known issue.  Basically
a combination of the on-disk write cache and possible ECC corrections,
remappings, or excessive remapped sectors can cause the drive to take
much longer than normal to complete a request.  The default 5-second
timeout is insufficient.

If it does help, post confirmation to prod the FBsd developers to
change the timeouts.
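
For anyone who wants the file names and line numbers before editing, the fgrep step above can be approximated with a short script. This is a sketch, not a tested tool: the function name is invented, and it assumes the assignments are written as a literal `request->timeout = N;`, so anything set through a macro or a variable will be missed.

```python
import re
from pathlib import Path

# Matches e.g. "request->timeout = 5;" and captures the number.
TIMEOUT_RE = re.compile(r"request->timeout\s*=\s*(\d+)\s*;")


def find_timeouts(src_dir, value=None):
    """Yield (filename, line_number, line) for each timeout assignment in *.c files.

    If `value` is given, only assignments of that number are reported.
    """
    for path in sorted(Path(src_dir).glob("*.c")):
        with open(path, errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                m = TIMEOUT_RE.search(line)
                if m and (value is None or int(m.group(1)) == value):
                    yield path.name, lineno, line.strip()
```

Calling `find_timeouts("/usr/src/sys/dev/ata", value=5)` would list the candidate lines; the edit itself is still done by hand.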

--

If you are NOT getting DMA timeouts then the ZFS lockups may be due
to buffer/memory deadlocks.  ZFS has knobs for adjusting its memory
footprint size.  Lowering the footprint ought to solve (most of) those
issues.  It's actually somewhat of a hard issue to solve.  Filesystems
like UFS aren't complex enough to require the sort of dynamic memory
allocations deep in the filesystem that ZFS and HAMMER need to do.

-Matt
Matthew Dillon 
<[EMAIL PROTECTED]>