Re: ahcich Timeouts SATA SSD
On 2012-Oct-14 16:03:39 -0700, nate keegan <nate.kee...@gmail.com> wrote:
> Based on what I'm seeing for post types on freebsd-questions this might be the best forum for this issue as it looks like some sort of a strange issue or bug between FreeBSD 8.2/9.0 and SATA SSD drives. This system was commissioned in February of 2012 and ran without issue as a ZFS backup system on our network until about 3 weeks ago. At that time I started getting kernel panics due to timeouts to the on-board SATA devices. The only change to the system since it was built was to add an SSD for swap (32 GB swap device) and this issue did not happen until several months after this was added.

This _does_ sound more like hardware than software - it's difficult to envisage a software bug that does nothing for 6 months and then makes the system hang regularly. Has there been any significant change to the system load, how much data is being transferred, clients, how full the data zpool is, etc. that might correlate with the onset of hangs?

> I then moved to systematically replacing items such as SATA cables, memory, motherboard, etc and the problem continued. For example, I swapped out the 4 SATA cables with brand new SATA cables and waited to see if the problem happened again. Once it did I moved on to replacing the motherboard with an identical motherboard, waited, etc.

Have you tried replacing RAM and PSU?

> The system logs do not show anything prior to event happening and the OS will respond to ping requests after the issue and if you have an active SSH session you will remain connected to the system until you attempt to do something like 'ls', 'ps', etc.

This implies that the kernel is still active but the filesystem is deadlocked. Are you able to drop into DDB? Is anything displayed on the kernel?

> New SSH requests to the system get 'connection refused'.

This implies that sshd has died - a filesystem deadlock should result in connection attempts either timing out or just hanging.
> I'm open to suggestions, direction, etc to see if I can nail down what is going on and put this issue to bed for not only myself but for anyone else who might run into it in the future.

Are you running a GENERIC kernel? If not, what changes have you made? Have you set any loader tunables or sysctls? Have you scrubbed the pools? If you run gstat -a, do any devices have anomalous readings?

I can't offer any definite fixes but can suggest a few more things to try:

1) Try FreeBSD-9.1RC2 and see if the problem persists.
2) Try a new kernel with
     options WITNESS
     options WITNESS_SKIPSPIN
   This may make a software bug more obvious (but will somewhat increase kernel overheads).
3) If you can afford it, detach the L2ARC - which removes one potential issue.
4) If you haven't already, build a kernel with
     makeoptions DEBUG=-g
     options KDB
     options KDB_TRACE
     options KDB_UNATTENDED
     options DDB
   This won't have any impact on normal operation but will simplify debugging.

-- 
Peter Jeremy
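Suggestions 2) and 4) above can be folded into a single custom kernel configuration. The following is only a minimal sketch: the config name DEBUGKERN and the helper function are made up for illustration, the amd64 path in the comments is an assumption about this machine, and GENERIC may already carry some of these lines.

```shell
#!/bin/sh
# Sketch: write a kernel config that layers the debugging options from this
# thread on top of GENERIC. DEBUGKERN is a hypothetical name.

write_debug_kernconf() {
    # $1: path of the config file to create
    cat > "$1" <<'EOF'
include GENERIC
ident DEBUGKERN

# Keep debugging symbols
makeoptions DEBUG=-g

# In-kernel debugger, reachable from the console
options KDB
options KDB_TRACE
options KDB_UNATTENDED
options DDB

# Lock-order checking (adds overhead; drop it once the problem is found)
options WITNESS
options WITNESS_SKIPSPIN
EOF
}

# Typical use as root, assuming amd64 and sources in /usr/src:
#   write_debug_kernconf /usr/src/sys/amd64/conf/DEBUGKERN
#   cd /usr/src && make buildkernel KERNCONF=DEBUGKERN
#   make installkernel KERNCONF=DEBUGKERN
```

Keeping WITNESS in the same file as the always-on debugger options makes it easy to comment out the two WITNESS lines later without losing DDB.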
Re: ahcich Timeouts SATA SSD
On 15 oct. 2012, at 11:58, Peter Jeremy wrote:
> This _does_ sound more like hardware than software

I do agree with that.

> Have you tried replacing RAM and PSU?

I, too, was about to suggest a test or replacement of the PSU. Also, I had a (quite) similar problem years ago (no RAID, no ZFS, older FreeBSD…) where HDDs would detach or be lost by the system on a random basis. I searched for a long time on the software side, but it was cured by a firmware update on the HDDs.

Good luck with this issue.

Patrick
___
freebsd-hardware@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to freebsd-hardware-unsubscr...@freebsd.org
Re: ahcich Timeouts SATA SSD
I took a look at the DDB man page and I am not able to do this when the issue happens as the system is completely blown up (meaning no keyboard input on IPMI console, existing SSH sessions, etc.).

No changes have been seen in the ZFS load on the system. The nature of this system (backup) is such that the heaviest load would be created in the first week or so of going online: we use rsync to copy files down from our Windows servers, and during that first week the system has to 'seed' the initial copies, which is much heavier on I/O than afterwards, when things are relatively constant.

I have 48 GB of Crucial memory that I will put in this system today to replace the 24 GB or so of Kingston memory I have in the system.

If the issue happens again with the memory change I plan on replacing both SSDs (Crucial M4) with two non-SSD SATA disks, with the idea that maybe the Crucial firmware on the disks (002 on both disks) is the culprit somehow.

If neither item turns out to solve the issue I will move on to 9.1RC2, or 9.1-RELEASE if it is out by then, and add the kernel options requested.

The amount of monkeying that I have had to do via /boot/loader.conf and the camcontrol script I run is telling me that the SSD, the firmware on the SSD, etc is somehow causing the issue, as we have plenty of other FreeBSD 8.x and 9.x systems that use non-SSD SATA drives without this issue popping up in their daily workloads.
My /boot/loader.conf looks like this currently:

    # Set in the BIOS as well to activate
    ahci_load="YES"
    # Should be auto-negotiation in FreeBSD 9.x
    # See ahci(4)
    hint.ahcich.0.sata_rev=1
    hint.ahcich.1.sata_rev=1
    hint.ahcich.0.pm_level=1
    hint.ahcich.1.pm_level=1

And /usr/local/etc/rc.d/camcontrol:

    #!/bin/sh
    CAMCONTROL=/sbin/camcontrol
    # Disable NCQ
    $CAMCONTROL tags ada0 -N 1 > /dev/null
    $CAMCONTROL tags ada1 -N 1 > /dev/null
    # Disable APM
    $CAMCONTROL cmd ada0 -a "EF 85 00 00 00 00 00 00 00 00 00 00" > /dev/null
    $CAMCONTROL cmd ada1 -a "EF 85 00 00 00 00 00 00 00 00 00 00" > /dev/null

Without both of these shims in place I get maybe 1.5 to 2 hours before the system goes kablooie, and that is without the system doing any real I/O work, just running FreeBSD during the business day and a few scripts from cron to check for data and shuffle it around.
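For anyone wanting to reuse these workarounds on a different set of disks, the script above can be generalised to a device list. This is only a sketch under assumptions: the function name and the SSD_DEVICES variable are made up for illustration, and the camcontrol invocations are the same two used in the script above, nothing more.

```shell
#!/bin/sh
# Sketch: the same two workarounds, looped over a list of devices.
# SSD_DEVICES and apply_ssd_workarounds are hypothetical names.
CAMCONTROL=${CAMCONTROL:-/sbin/camcontrol}
SSD_DEVICES=${SSD_DEVICES:-"ada0 ada1"}

apply_ssd_workarounds() {
    for dev in $SSD_DEVICES; do
        # Limit the device to one outstanding tag (effectively no NCQ)
        "$CAMCONTROL" tags "$dev" -N 1 > /dev/null || return 1
        # ATA SET FEATURES (0xEF), subcommand 0x85: disable APM
        "$CAMCONTROL" cmd "$dev" -a "EF 85 00 00 00 00 00 00 00 00 00 00" > /dev/null || return 1
        echo "workarounds applied to $dev"
    done
}

# Typical use as root:
#   apply_ssd_workarounds
# To inspect the result afterwards:
#   camcontrol tags ada0 -v
#   camcontrol identify ada0
```

Stopping at the first failing command means a missing or detached disk is reported by the exit status rather than silently skipped.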
Re: ahcich Timeouts SATA SSD
On 2012-Oct-15 07:54:21 -0700, nate keegan <nate.kee...@gmail.com> wrote:
> The system is dual PSU behind a UPS so I don't think that this is an issue.

OK

> I do have a complete set of replacement memory (Crucial vs Kingston that is in the system now) and will swap out the memory in case one of the DIMMs is flaky but not poor enough for the BIOS to notice on a consistent basis.

I presume this is registered ECC RAM - which makes it more robust. Non-ECC RAM can develop pattern-sensitive faults - which are virtually impossible to test for. And BIOS RAM 'tests' generally can't be relied on to do much more than verify that something is responding. Swapping RAM is the best way to rule out RAM issues.

> I am not able to drop into DDB when the issue happens as the system is locked up completely.

That's surprising. I haven't seen a failure mode where the kernel will respond to pings but not the console.

> Will get the output of gstat -a and post it up here.

gstat -a gives a dynamic picture of disk activity. I was hoping you could watch it for a minute or so (on a tall window) whilst the system was running and see if any disks look odd - significantly higher or lower than expected I/O volume or long ms/r or ms/w.

On 2012-Oct-15 10:21:06 -0700, nate keegan <nate.kee...@gmail.com> wrote:
> I took a look at the DDB man page and I am not able to do this when the issue happens as the system is completely blown up (meaning no keyboard input on IPMI console, existing SSH sessions, etc.).

Note that I'm referring to ddb(4), not ddb(8). The former is entered via a magic key sequence on the console and should work even if the system won't react to normal commands. To enter ddb, use Ctrl-Alt-ESC on a graphical console or the character sequence CR ~ Ctrl-B on a serial console (in the latter case, the sysctl debug.kdb.alt_break_to_debugger also needs to be set to 1).
If you do get into ddb, a useful set of initial commands is:

    show all procs
    show alllocks
    show allpcpu
    show lockedvnods
    call doadump

Note that the first 4 commands will generate lots of output - ideally you would have a serial console with logging. The last command generates a crashdump and needs 'dumpdev=AUTO' in /etc/rc.conf (run service dumpon start after editing rc.conf to enable it without rebooting).

> The amount of monkeying that I have had to do via /boot/loader.conf and the camcontrol script I run is telling me that the SSD, the firmware on the SSD, etc is somehow causing the issue as we have plenty of other FreeBSD 8.x and 9.x systems that use non-SSD SATA drives without this issue popping up in their daily workloads.

Are you able to move the SSD(s) to a different type of SATA port? One (not especially likely) possibility is it's an interaction between the SSD and the SATA controller.

-- 
Peter Jeremy
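The two settings mentioned in this message (dumpdev in /etc/rc.conf and the debug.kdb.alt_break_to_debugger sysctl) can be made persistent ahead of the next hang. A minimal sketch, assuming the stock file locations; the ensure_line helper is made up for illustration and only appends a missing line, so it will not correct an existing conflicting entry such as dumpdev="NO".

```shell
#!/bin/sh
# Sketch: append a setting to a config file only if that exact line is absent.
# ensure_line is a hypothetical helper, not a FreeBSD tool.

ensure_line() {
    # $1: config file, $2: exact line that should be present
    # grep -x matches the whole line, -F treats it as a fixed string
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Typical use as root:
#   ensure_line /etc/rc.conf 'dumpdev="AUTO"'
#   service dumpon start
#   ensure_line /etc/sysctl.conf 'debug.kdb.alt_break_to_debugger=1'
#   sysctl debug.kdb.alt_break_to_debugger=1
```

Running it twice is harmless, which matters if it ends up in a provisioning script shared across several machines.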
Re: ahcich Timeouts SATA SSD
> SSD are connected to on-board SATA port on motherboard

Presumably to controllers provided by the Intel Tylersburg 5520 chipset.

> This system was commissioned in February of 2012 and ran without issue as a ZFS backup system on our network until about 3 weeks ago. The system is dual PSU behind a UPS so I don't think that this is an issue.

No changes? E.g. no added hardware to increase power load? Overloading the power supply and/or the wiring (with too many splitters) can result in flaky problems like this.

> OS will respond to ping requests after the issue and if you have an active SSH session you will remain connected to the system until you attempt to do something like 'ls', 'ps', etc. I am not able to drop into DDB when the issue happens as the system is locked up completely. Could be a failure on my part to understand/engage in how to do this, will try if the issue happens again (should on Wednesday AM unless setting camcontrol apm to off for the disks somehow fixes the issue).

If the system is alive enough to respond to ping, I'd expect you should be able to get into DDB. Can you get into DDB when the system is working normally?

> 2 x Crucial M4 64 GB SATA SSD for FreeBSD OS (zroot)
> 2 x Intel 320 MLC 80 GB SATA SSD for L2ARC and swap
>
> I ran the Crucial firmware update ISO and it did not see any firmware updates as necessary on the SSD disks.

Does the problem happen with both the Crucial and the Intel SSDs?

> If software I agree that it would not make sense that this would suddenly pop-up after months of operation with no issues.

If something causes the software/firmware to take a different path, new issues can appear. E.g. error handling or even timing. Infrequently used code paths might not have been tested sufficiently. Does the controller have firmware? Part of the BIOS I suppose. Is there a BIOS update available? Have you considered connecting the SSDs to a different controller?
> the on-board AHCI portion of the BIOS does not always see the disks after the event without a hard system power reset.

That's at least one bug somewhere, probably the hardware isn't getting reset properly. Does Supermicro know about this bug?

> I have 48 GB of Crucial memory that I will put in this system today to replace the 24 GB or so of Kingston memory I have in the system.

Which, in addition to being different memory, should reduce swap activity.

Suggestion: move everything to conventional drives. Keep at least one SSD connected to the system, but normally unused. Now you can beat on the SSD in a controlled manner to debug the problem. Does reading trigger the problem? Writing? Try dd with different blocksizes, accessing multiple SSDs at once, etc. I have to wonder if there is a timing problem, or missing interrupt, or...

> * Ditch FreeBSD for Solaris so I can keep ZFS lovin for the intended purpose of this system

If it fails with FreeBSD but works with Solaris on the same hardware, then it is almost certainly a problem with the device driver. (Or at least a problem that Solaris has a workaround for.)
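The "try dd with different blocksizes" suggestion above could look something like the following. It is a sketch only: the function name, the three block sizes and the 64 MiB default are arbitrary choices for illustration, the device names in the comments are hypothetical, and it covers reads only, since a write test on a raw device destroys its contents.

```shell
#!/bin/sh
# Sketch: read the same device with several block sizes.
# read_sweep is a hypothetical helper; block sizes are given in bytes so the
# same line works with any dd implementation.

read_sweep() {
    # $1: device or file to read, $2: bytes to read per pass (default 64 MiB)
    dev=$1
    total=${2:-67108864}
    for bs in 4096 65536 1048576; do
        count=$((total / bs))
        echo "reading $dev with bs=$bs count=$count"
        # Read-only: data is discarded, dd prints its statistics on stderr
        dd if="$dev" of=/dev/null bs="$bs" count="$count" || return 1
    done
}

# Typical use as root (ada2 and ada3 are hypothetical device names):
#   read_sweep /dev/ada2
# Two devices at once:
#   read_sweep /dev/ada2 & read_sweep /dev/ada3 & wait
```

Running the sweep on one SSD first and then on two in parallel separates "any access triggers it" from "concurrent access triggers it", which is the distinction the suggestion is after.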