Re: taskqueue timeout [SOLVED]
> Steve Bertrand wrote:
> The only other box I have with four SATA ports on it is my actual
> workstation. The board is ASUS P5GD1, and has an Intel 82801FR SATA
> controller.

I transferred the SATA disks to the above board, loaded up the zpool, and I cannot reproduce the problem :)

Currently, for the last 15 minutes, I'm writing 80MB/s to the zpool with no problems.

Thanks all,

Steve
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: taskqueue timeout
Steve Bertrand wrote:
> I'm wondering if the problems described in the following link have been resolved:
>
> http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2008-02/msg00211.html
>
> I've got four 500GB SATA disks in a ZFS raidz pool, and all four of them are experiencing the behavior.

Thanks to all who have provided patches off list. Unfortunately, none of them helped.

The only other box I have with four SATA ports on it is my actual workstation. The board is ASUS P5GD1, and has an Intel 82801FR SATA controller.

I despise the thought that if this works, I'll have to rebuild my workstation, but here's to sacrificing my Windows PC in the name of ruling out the problem.

In the meantime, can anyone provide any feedback on the board I mentioned with regard to FreeBSD?

Steve
Re: taskqueue timeout
Matthew Dillon wrote:
> This issue is vexing a lot of people.

Heh... I can appreciate this. I would like someone to inform me that this can't be guaranteed to be a ZFS problem... if I can get confirmation that others have this issue aside from ZFS, I would feel content.

> Setting the timeout to 30 will not affect performance, but it will cause a 30 second delay in recovery when (if) the problem occurs. i.e. when the disk stalls it will just sit there doing nothing for 30 seconds, then it will print the timeout message and try to recover.

If I have the timeout at >= 30 and the issue still occurs, the problem must be elsewhere.

> It occurs to me that it might be beneficial to actually measure the disk's response time to each request, and then graph it over a period of time. Maybe seeing the issue visually will give some clue as to the actual cause.

I am interested in following through with this, but can't do it on my own. I'm willing to dedicate the box and bandwidth to anyone who can legitimately test this as you state. i.e. I need either guidance or assistance. This box is ready for the taking.

Beyond this box, I can provide legitimate parties with other network resources to produce a consistent flow of data, so the issue can easily be reproduced locally, on demand.

Steve
Re: taskqueue timeout
Andrew Snow wrote:
> From Western Digital's line of "enterprise" drives:
>
> "RAID-specific time-limited error recovery (TLER) - Pioneered by WD, this feature prevents drive fallout caused by the extended hard drive error-recovery processes common to desktop drives."
>
> Therefore I think the FreeBSD timeout should also be set to 8 seconds instead of 5 seconds. Desktop-targeted drives will not respond for over 10 seconds, up to minutes, so it's not worth setting the FreeBSD timeout any higher.

Interesting you say this. To reiterate, I have /boot on USB thumb drive, and the system is mounted from / on a raidz pool called /storage via loader.conf.

The four drives in question (per the packaging) are:

- Western Digital Caviar SE16 500GB - 7200, 16MB, SATA-300, OEM

Per the packaging on the rest of the hardware:

# mobo - XFX 610i, 7050 GeForce (I *never* use graphics on my FreeBSD boxen, I *only* know/have CLI with no 'windows')
# memory - 2 GB Corsair XMS2 Twin2X 6400C4 memory
# cpu - Intel Pentium DC E2200 2.20GHz OEM - 2.20 GHz, 1MB Cache, 800MHz FSB, Allendale, Dual Core, OEM, Socket 775, Processor
# swap - I don't run any, but can/will add in an IDE/ATA 7200 200GB in the event this problem may be related to ZFS/RAM issues.

Steve
Re: taskqueue timeout
:...
:> and see if the problem reoccurs with just two drives.
:
:... I knew that was going to come up... my response is "I worked so hard
:to get this system with ZFS all configured *exactly* how I wanted it".
:
:To test, I'm going to flip to 30 as per Matthew's recommendation, and see
:how far that takes me. At this time, I'm only testing by backing up one
:machine on the network. If it fails, I'll clock the time, and then
:'reformat' with two drives.
:
:Is there a technical reason this may work better with only two drives?
:
:Is there anyone interested to the point where remote login would be helpful?
:
:Steve

This issue is vexing a lot of people.

Setting the timeout to 30 will not affect performance, but it will cause a 30 second delay in recovery when (if) the problem occurs. i.e. when the disk stalls it will just sit there doing nothing for 30 seconds, then it will print the timeout message and try to recover.

It occurs to me that it might be beneficial to actually measure the disk's response time to each request, and then graph it over a period of time. Maybe seeing the issue visually will give some clue as to the actual cause.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
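[Editor's sketch] Matt's suggestion to measure the response time of each request could be approximated from userland along the following lines. This is only a rough sketch under stated assumptions: it times synchronous writes (write plus fsync) to a scratch file rather than individual ATA requests, and the function names, block size, and the temporary-directory target are illustrative choices, not anything from the thread. To exercise the pool, the path would have to point at a file on the pool itself.

```python
import os
import tempfile
import time


def measure_write_latencies(path, count=20, block_size=64 * 1024):
    """Write `count` blocks to `path`, fsync after each one, and return
    the per-request latency in seconds."""
    payload = b"\0" * block_size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # push the data to the device before the clock stops
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
    return latencies


def find_stalls(latencies, threshold):
    """Return (index, latency) pairs whose latency exceeds `threshold` seconds."""
    return [(i, lat) for i, lat in enumerate(latencies) if lat > threshold]


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for a file on the pool.
    with tempfile.TemporaryDirectory() as tmp:
        samples = measure_write_latencies(os.path.join(tmp, "probe.bin"))
    for i, lat in enumerate(samples):
        print(f"{i}\t{lat * 1000:.2f} ms")
    print("requests slower than 1 s:", find_stalls(samples, 1.0))
```

The tab-separated output can be fed to any plotting tool; a stall of the kind described in this thread would show up as an isolated sample several seconds tall.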
Re: taskqueue timeout
Jeremy Chadwick wrote:
> On Tue, Jul 15, 2008 at 10:29:28PM -0400, Steve Bertrand wrote:
>> Is there anyone interested to the point where remote login would be helpful?
>
> I believe my FreeBSD Wiki page documents what to do if your problem is easily reproducible: contact Scott Long, who has offered to help track down the source of these problems.

Changing to 30 second timeout made no difference whatsoever. The problem occurred at about the same time during the single

I'm at a standstill. I'm willing to help provide any information necessary to fix this issue, or provide remote access to the box in question.

scottl@ has been Cc:'d.

Thanks all,

Steve
Re: taskqueue timeout
Alex Trull wrote:
> Don't want to give conflicting advice, and would suggest you certainly try the 30 sec thing first. I'm already on 10 myself but haven't pushed further.

What were you doing, and what did you notice when the problem started? As much as it seems silly, I'm mostly interested in what your network was doing at the time things went sour.

> In my own case I've not had any issue with zfs in particular since I applied the ZFS zil/prefetch disable loader.conf tunables 10 hours ago. I am observing this now.

For some reason, and with no explanation or science behind it, I don't think this is a ZFS problem, and I'm trying to defend this thought to my peers until I prove otherwise.

I have to be a bit careful about how I adjust loader properties, given that I'm loading from USB, and mounting root from a ZFS zpool hard disk. Like my GELI systems, tweaking things can be a bit touchy unless I put a little more planning into it.

> For the record .. What ata chipset/motherboard and model of disk have you got ?

I'm not a hardware person per se, but I'm advised to post that the motherboard is:

- XFX nForce 610i with GeForce 7050

If there is more hardware info I can provide, let me know specifically what I should be looking for.

> Have you seen any smart errors (real or otherwise) ? What do your 'zpool status' counters look like ?

zpool status is always clean. There are no errors otherwise, even if the box is up for multiple hours straight.

The problem occurs only if I throw work at it.

Steve
Re: taskqueue timeout
On Tue, Jul 15, 2008 at 10:29:28PM -0400, Steve Bertrand wrote:
> Is there anyone interested to the point where remote login would be helpful?

I believe my FreeBSD Wiki page documents what to do if your problem is easily reproducible: contact Scott Long, who has offered to help track down the source of these problems.

I'll reply to the other part of your mail in a bit.

--
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: taskqueue timeout
Matthew Dillon wrote:
> :Went from 10->15, and it took quite a bit longer into the backup before
> :the problem cropped back up.

Jumping right into it, there is another post after this one, but I'm going to try to reply inline:

> Try 30 or longer. See if you can make the problem go away entirely. then fall back to 5 and see if the problem resumes at its earlier pace.

I'm sure 30 will either push the issue longer, or into non-existence, but are there any developers here who can say what this timer does? i.e. how does changing this timer affect the performance of the disk subsystem (aside from allowing it to work, of course)?

After I'm done responding to this message, I'll be testing the sysctl at 30.

> It could be temperature related. The drives are being exercised a lot, they could very well be overheating. To find out add more airflow (a big house fan would do the trick).

Temperature is a good thought, but currently, my physical situation has this:

- 2U chassis
- multiple fans in the case
- in my lab (which is essentially beside my desk)
- the case has no lid
- it is 64 degrees with A/C and circulating fans in this area
- hard drives are separated relatively well inside the case

> It could be that errors are accumulating on the drives, but it seems unlikely that four drives would exhibit the same problem.

That's what I'm thinking. All four drives are exhibiting the same errors... or, for all intents and purposes, the machine is coughing the same errors for all the drives.

> Also make sure the power supply can handle four drives. Most power supplies that come with consumer boxes can't under full load if you also have a mid or high-end graphics card installed. Power supplies that come with OEM slap-together enclosures are not usually much better.

I currently have a 550W PSU in the 2U chassis, which again, is sitting open. I have more hardware, running in worse conditions with lower-wattage PSUs, that doesn't exhibit this behavior.

I need to determine whether this problem is SATA, ZFS, the motherboard or code.

> Specifically, look at the +5V and +12V amperage maximums on the power supply, then check the disk labels to see what they draw, then multiply by 2. e.g. if your power supply can do [EMAIL PROTECTED] and you have four drives each taking [EMAIL PROTECTED] (and typically ~half that at 5V), that's 4x2x2 = [EMAIL PROTECTED] and you would probably be ok.

I'm well within specs, even after V/A tests with the meter. The power supply is providing ample wattage to each device accordingly.

> To test, remove two of the four drives, reformat the ZFS to use just 2, and see if the problem reoccurs with just two drives.

... I knew that was going to come up... my response is "I worked so hard to get this system with ZFS all configured *exactly* how I wanted it".

To test, I'm going to flip to 30 as per Matthew's recommendation, and see how far that takes me. At this time, I'm only testing by backing up one machine on the network. If it fails, I'll clock the time, and then 'reformat' with two drives.

Is there a technical reason this may work better with only two drives?

Is there anyone interested to the point where remote login would be helpful?

Steve
Re: taskqueue timeout
Matthew Dillon wrote:
> Try that first. If it helps then it is a known issue. Basically a combination of the on-disk write cache and possible ECC corrections, remappings, or excessive remapped sectors can cause the drive to take much longer than normal to complete a request. The default 5-second timeout is insufficient.

From Western Digital's line of "enterprise" drives:

"RAID-specific time-limited error recovery (TLER) - Pioneered by WD, this feature prevents drive fallout caused by the extended hard drive error-recovery processes common to desktop drives."

Western Digital's information sheet on TLER states that they found most RAID controllers will wait 8 seconds for a disk to respond before dropping it from the RAID set. Consequently they changed their "enterprise" drives to try reading a bad sector for only 7 seconds before returning an error.

Therefore I think the FreeBSD timeout should also be set to 8 seconds instead of 5 seconds. Desktop-targeted drives will not respond for over 10 seconds, up to minutes, so it's not worth setting the FreeBSD timeout any higher.

More info:
http://www.wdc.com/en/library/sata/2579-001098.pdf
http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery

- Andrew
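[Editor's sketch] Andrew's argument boils down to a race between two timers: the drive's own error-recovery limit and the host's request timeout. The toy comparison below only restates that reasoning and models nothing about the real ATA driver. The function name is made up for illustration; the 5, 7 and 8 second figures come from the message above, and the 60 second desktop-drive figure is an assumed stand-in for "over 10 seconds, up to minutes".

```python
def first_to_give_up(host_timeout_s, drive_recovery_limit_s):
    """Return "drive" if the drive ends its own error recovery (and reports
    an error) before the host timeout fires, otherwise "host"."""
    if drive_recovery_limit_s < host_timeout_s:
        return "drive"
    return "host"


if __name__ == "__main__":
    tler_limit = 7      # seconds, TLER drives per the WD sheet quoted above
    desktop_limit = 60  # seconds, hypothetical value for a desktop drive
    for host_timeout in (5, 8, 30):
        print(
            f"host timeout {host_timeout:>2}s:",
            "TLER drive ->", first_to_give_up(host_timeout, tler_limit) + ",",
            "desktop drive ->", first_to_give_up(host_timeout, desktop_limit),
        )
```

With a 5 second host timeout the host gives up before even a TLER drive has finished its 7 second retry window, which is the case Andrew argues an 8 second timeout would avoid.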
Re: taskqueue timeout
Don't want to give conflicting advice, and would suggest you certainly try the 30 sec thing first. I'm already on 10 myself but haven't pushed further.

In my own case I've not had any issue with zfs in particular since I applied the ZFS zil/prefetch disable loader.conf tunables 10 hours ago. I am observing this now.

For the record .. What ata chipset/motherboard and model of disk have you got ?

Have you seen any smart errors (real or otherwise) ? What do your 'zpool status' counters look like ?

--
Alex

On Tue, 2008-07-15 at 12:55 -0700, Matthew Dillon wrote:
> :Went from 10->15, and it took quite a bit longer into the backup before
> :the problem cropped back up.
>
> Try 30 or longer. See if you can make the problem go away entirely.
> then fall back to 5 and see if the problem resumes at its earlier
> pace.
>
> --
>
> It could be temperature related. The drives are being exercised
> a lot, they could very well be overheating. To find out add more
> airflow (a big house fan would do the trick).
>
> --
>
> It could be that errors are accumulating on the drives, but it seems
> unlikely that four drives would exhibit the same problem.
>
> --
>
> Also make sure the power supply can handle four drives. Most power
> supplies that come with consumer boxes can't under full load if you
> also have a mid or high-end graphics card installed. Power supplies
> that come with OEM slap-together enclosures are not usually much better.
>
> Specifically, look at the +5V and +12V amperage maximums on the power
> supply, then check the disk labels to see what they draw, then
> multiply by 2. e.g. if your power supply can do [EMAIL PROTECTED] and
> you have four drives each taking [EMAIL PROTECTED] (and typically ~half
> that at 5V), that's 4x2x2 = [EMAIL PROTECTED] and you would probably be ok.
>
> To test, remove two of the four drives, reformat the ZFS to use just 2,
> and see if the problem reoccurs with just two drives.
>
> -Matt
Re: taskqueue timeout
:Went from 10->15, and it took quite a bit longer into the backup before
:the problem cropped back up.

Try 30 or longer. See if you can make the problem go away entirely. then fall back to 5 and see if the problem resumes at its earlier pace.

--

It could be temperature related. The drives are being exercised a lot, they could very well be overheating. To find out add more airflow (a big house fan would do the trick).

--

It could be that errors are accumulating on the drives, but it seems unlikely that four drives would exhibit the same problem.

--

Also make sure the power supply can handle four drives. Most power supplies that come with consumer boxes can't under full load if you also have a mid or high-end graphics card installed. Power supplies that come with OEM slap-together enclosures are not usually much better.

Specifically, look at the +5V and +12V amperage maximums on the power supply, then check the disk labels to see what they draw, then multiply by 2. e.g. if your power supply can do [EMAIL PROTECTED] and you have four drives each taking [EMAIL PROTECTED] (and typically ~half that at 5V), that's 4x2x2 = [EMAIL PROTECTED] and you would probably be ok.

To test, remove two of the four drives, reformat the ZFS to use just 2, and see if the problem reoccurs with just two drives.

-Matt
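[Editor's sketch] Matt's rule of thumb (per-drive draw from the label, times the number of drives, times 2 for headroom) is simple enough to write down. The archive has replaced the actual amperage and voltage figures with "[EMAIL PROTECTED]", so the numbers here are assumptions: the 2.0 A per-drive draw is taken only from the surviving "4x2x2" arithmetic, and the rail maximums in the demo are hypothetical. The function names are likewise illustrative.

```python
def rail_load_amps(drive_count, amps_per_drive, safety_factor=2.0):
    """Budgeted load on one rail: label draw x number of drives x headroom."""
    return drive_count * amps_per_drive * safety_factor


def rail_is_sufficient(rail_max_amps, drive_count, amps_per_drive, safety_factor=2.0):
    """True if the rail's rated maximum covers the budgeted load."""
    return rail_load_amps(drive_count, amps_per_drive, safety_factor) <= rail_max_amps


if __name__ == "__main__":
    drives = 4
    per_drive = 2.0  # amps; assumed from the "4x2x2" arithmetic in the message
    print("budgeted load:", rail_load_amps(drives, per_drive), "A")
    for rail_max in (12.0, 18.0):  # hypothetical rail ratings, not from the thread
        verdict = "ok" if rail_is_sufficient(rail_max, drives, per_drive) else "too small"
        print(f"rail rated {rail_max} A: {verdict}")
```

The same check would be run once per rail, using the roughly halved 5V draw Matt mentions for the +5V rail.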
Re: taskqueue timeout
Steve Bertrand wrote:
> Matthew Dillon wrote:
>> If you are getting DMA timeouts, go to this URL:
>>
>> http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
>>
>> Then I would suggest going into /usr/src/sys/dev/ata (I think, on FreeBSD), locate all instances where request->timeout is set to 5, and change them all to 10.
>>
>> cd /usr/src/sys/dev/ata
>> fgrep 'request->timeout' *.c
>> ... change all assignments of 5 to 10 ...
>
> Changing 5 to 10 in all cases and rebuilding the kernel does not fix the problem.

Went from 10->15, and it took quite a bit longer into the backup before the problem cropped back up.

Here is what I was seeing at the time it failed. Where netstat and zpool iostat drop off is where I start seeing the errors occur:

# top
last pid:  1069;  load averages:  0.09,  0.17,  0.10    up 0+00:08:31  19:22:39
53 processes:  1 running, 52 sleeping
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 28M Active, 3644K Inact, 301M Wired, 76K Cache, 1634M Free
Swap:

# netstat -w 1 -h
      4.8K     0       11M       3.5K     0      5.4M     0
      4.5K     0       10M       3.3K     0      5.1M     0
      4.9K     0       11M       3.6K     0      5.5M     0
      4.8K     0       11M       3.5K     0      5.4M     0
      4.3K     0      9.5M       3.1K     0      4.8M     0
      5.1K     0       11M       3.7K     0      5.7M     0
      5.0K     0       11M       3.6K     0      5.6M     0
      5.3K     0       12M       3.9K     0      6.0M     0
      4.8K     0       11M       3.5K     0      5.4M     0
      4.7K     0       10M       3.4K     0      5.2M     0
      4.8K     0       11M       3.5K     0      5.4M     0
      4.6K     0       10M       3.4K     0      5.2M     0
      4.1K     0      9.1M       3.0K     0      4.6M     0
      5.3K     0       12M       3.9K     0      6.0M     0
      5.2K     0       12M       3.8K     0      5.8M     0
      4.3K     0      9.5M       3.1K     0      4.8M     0
      4.3K     0      9.6M       3.2K     0      4.9M     0
      5.4K     0       12M       4.0K     0      6.1M     0
      4.8K     0       11M       3.5K     0      5.4M     0
      2.4K     0      5.1M       1.7K     0      2.5M     0
            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes colls
         2     0        120          2     0        316     0
         3     0        180          4     0       1.0K     0
         3     0        180          2     0        316     0
         3     0        180          3     0        658     0
         5     0       1.6K          5     0        942     0
         3     0        254          4     0        840     0
         3     0        180          2     0        316     0

# zpool iostat 1
storage     6.40G  1.81T      0    296      0  37.0M
storage     6.43G  1.81T      0    188      0  14.5M
storage     6.43G  1.81T      0      0      0      0
storage     6.43G  1.81T      0      0      0      0
storage     6.43G  1.81T      0      0      0      0
storage     6.43G  1.81T      0     47      0  5.99M
storage     6.46G  1.81T      0    218      0  18.0M
storage     6.46G  1.81T      0      0      0      0
storage     6.46G  1.81T      0      0      0      0
storage     6.46G  1.81T      9      0   192K      0
storage     6.46G  1.81T      0     59      0  7.39M
storage     6.49G  1.81T      1    250  3.42K  14.9M
storage     6.49G  1.81T      0      0      0      0
storage     6.49G  1.81T      0      0      0      0
storage     6.49G  1.81T      0      0      0      0
storage     6.49G  1.81T      0    141      0  17.5M
storage     6.52G  1.81T      0     74      0   232K
storage     6.52G  1.81T      0      0      0      0
storage     6.52G  1.81T      0      0      0      0
storage     6.52G  1.81T      0      0      0      0
storage     6.52G  1.81T      0    151      0  18.8M
storage     6.52G  1.81T      0    114      0  8.07M
storage     6.52G  1.81T      0      0      0      0
storage     6.52G  1.81T      0      0      0      0
storage     6.52G  1.81T      0      0      0      0
storage     6.52G  1.81T      0      0      0      0

Don't know if this will help anyone or not.

Steve
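[Editor's sketch] The zpool iostat samples above alternate between bursts of writes and runs of all-zero intervals. One way to pull those quiet runs out of a capture is sketched below. It assumes the usual seven-column per-interval row (pool, used, available, read ops, write ops, read bandwidth, write bandwidth) and treats the last column as write bandwidth; the sample rows are the first six from Steve's output, re-spaced under that assumption. The function names and the two-sample minimum are illustrative choices. Note that ZFS batches writes, so short zero runs may be normal; the sketch only locates them and does not judge them.

```python
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Convert a zpool-style size such as '37.0M' or '0' to a number of bytes."""
    suffix = text[-1].upper()
    if suffix in _UNITS:
        return float(text[:-1]) * _UNITS[suffix]
    return float(text)


def write_bandwidths(lines):
    """Return the write-bandwidth column (last of seven fields) of each data row.
    Header, separator and malformed rows are skipped."""
    values = []
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            continue
        try:
            values.append(parse_size(fields[6]))
        except ValueError:
            continue  # e.g. the column-header row
    return values


def stall_runs(values, min_length=2):
    """Return (start_index, length) for each run of zero samples that is at
    least `min_length` samples long."""
    runs = []
    start = None
    for i, value in enumerate(values):
        if value == 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i - start))
            start = None
    if start is not None and len(values) - start >= min_length:
        runs.append((start, len(values) - start))
    return runs


SAMPLE = """\
storage 6.40G 1.81T 0 296 0 37.0M
storage 6.43G 1.81T 0 188 0 14.5M
storage 6.43G 1.81T 0 0 0 0
storage 6.43G 1.81T 0 0 0 0
storage 6.43G 1.81T 0 0 0 0
storage 6.43G 1.81T 0 47 0 5.99M
"""

if __name__ == "__main__":
    for start, length in stall_runs(write_bandwidths(SAMPLE.splitlines())):
        print(f"no writes for {length} samples starting at sample {start}")
```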
Re: taskqueue timeout
Matthew Dillon wrote:
> If you are getting DMA timeouts, go to this URL:
>
> http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
>
> Then I would suggest going into /usr/src/sys/dev/ata (I think, on FreeBSD), locate all instances where request->timeout is set to 5, and change them all to 10.
>
> cd /usr/src/sys/dev/ata
> fgrep 'request->timeout' *.c
> ... change all assignments of 5 to 10 ...

Changing 5 to 10 in all cases and rebuilding the kernel does not fix the problem.

I'm going to install the patch that allows the values to be changed via sysctl and up it to 15.

This problem happens across all four disks.

Does anyone else have any suggestions on what I can check?

Steve
Re: taskqueue timeout
Matthew Dillon wrote:
> If you are getting DMA timeouts, go to this URL:

Yes, I am.

> http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

I fall under the category of "ATA/SATA DMA timeout issues".

> Then I would suggest going into /usr/src/sys/dev/ata (I think, on FreeBSD), locate all instances where request->timeout is set to 5, and change them all to 10.
>
> cd /usr/src/sys/dev/ata
> fgrep 'request->timeout' *.c
> ... change all assignments of 5 to 10 ...
>
> Try that first. If it helps then it is a known issue. Basically a combination of the on-disk write cache and possible ECC corrections, remappings, or excessive remapped sectors can cause the drive to take much longer than normal to complete a request. The default 5-second timeout is insufficient.
>
> If it does help, post confirmation to prod the FBsd developers to change the timeouts.

I've just reproduced the problem, and will try hacking the code now to see if the problem goes away.

Since the box won't take input, I can't tell the disk usage at the time it dies. However, it seems to appear while running an Amanda backup, when my network throughput hits about 90 Mbps at about 5 kpps.

I'll post back with the results of increasing the timeout.

Steve
Re: taskqueue timeout
:Hi everyone,
:
:I'm wondering if the problems described in the following link have been
:resolved:
:
:http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2008-02/msg00211.html
:
:I've got four 500GB SATA disks in a ZFS raidz pool, and all four of them
:are experiencing the behavior.
:
:The problem only happens with extreme disk activity. The box becomes
:unresponsive (can not SSH etc). Keyboard input is displayed on the
:console, but the commands are not accepted.
:
:Is there anything I can do to either figure this out, or work around it?
:
:Steve

If you are getting DMA timeouts, go to this URL:

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

Then I would suggest going into /usr/src/sys/dev/ata (I think, on FreeBSD), locate all instances where request->timeout is set to 5, and change them all to 10.

cd /usr/src/sys/dev/ata
fgrep 'request->timeout' *.c
... change all assignments of 5 to 10 ...

Try that first. If it helps then it is a known issue. Basically a combination of the on-disk write cache and possible ECC corrections, remappings, or excessive remapped sectors can cause the drive to take much longer than normal to complete a request. The default 5-second timeout is insufficient.

If it does help, post confirmation to prod the FBsd developers to change the timeouts.

--

If you are NOT getting DMA timeouts then the ZFS lockups may be due to buffer/memory deadlocks. ZFS has knobs for adjusting its memory footprint size. Lowering the footprint ought to solve (most of) those issues. It's actually somewhat of a hard issue to solve. Filesystems like UFS aren't complex enough to require the sort of dynamic memory allocations deep in the filesystem that ZFS and HAMMER need to do.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
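[Editor's sketch] The fgrep step lists every line that mentions request->timeout; narrowing that to the assignments of a particular value still has to be done by eye. A small script can do the narrowing, as sketched below. It assumes the assignments are written as a plain integer literal (request->timeout = 5;), which the thread implies but does not show, and it only reports locations without editing anything. The function name and the demo file are made up for illustration; on a real system the directory would be the /usr/src/sys/dev/ata path Matt mentions.

```python
import re
import tempfile
from pathlib import Path

# Assumed form of the assignment: "request->timeout = <integer literal>;"
ASSIGNMENT = re.compile(r"request->timeout\s*=\s*(\d+)\s*;")


def find_timeout_assignments(source_dir, value=5):
    """Return (file name, line number, stripped line) for every line in the
    *.c files of `source_dir` that assigns `value` to request->timeout."""
    hits = []
    for path in sorted(Path(source_dir).glob("*.c")):
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = ASSIGNMENT.search(line)
            if match and int(match.group(1)) == value:
                hits.append((path.name, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Demo against a throwaway file; the file name and contents are invented.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "demo.c").write_text(
            "    request->timeout = 5;\n"
            "    request->timeout = 10;\n"
            "    if (request->timeout == 5)\n"
        )
        for name, lineno, line in find_timeout_assignments(tmp):
            print(f"{name}:{lineno}: {line}")
```

Comparisons such as request->timeout == 5 are deliberately not matched, since only the assignments are meant to change.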