On 11/19/2014 08:07 AM, Phillip Susi wrote:
> On 11/18/2014 9:46 PM, Duncan wrote:
>> I'm not sure about normal operation, but certainly, many drives
>> take longer than 30 seconds to stabilize after power-on, and I
>> routinely see resets during this time.
>
> As far as I have seen, typical drive spin-up time is on the order of
> 3-7 seconds. Hell, I remember my pair of first-generation Seagate
> Cheetah 15,000 RPM drives seemed to take *forever* to spin up, and
> that was still maybe only 15 seconds. If a drive takes longer than 30
> seconds, then there is something wrong with it. I figure there is a
> reason why spin-up time is tracked by SMART, so it seems like long
> spin-up time is a sign of a sick drive.
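As an aside: if you want to see what SMART actually records for
spin-up, smartctl (from smartmontools) will dump the raw attribute
table. On most ATA drives Spin_Up_Time is attribute 3, though the
exact attribute layout varies by vendor, and /dev/sda below is just a
placeholder:

    # Dump the full SMART attribute table for the drive:
    smartctl -A /dev/sda
    # Or just the spin-up related lines:
    smartctl -A /dev/sda | grep -i spin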
I was recently refactoring the Underdog (http://underdog.sourceforge.net)
startup scripts to separate out the various startup domains (e.g. lvm,
luks, mdadm) in the prototype init.
So I notice you (Duncan) use the word "stabilize", as do a small number
of drivers in the Linux kernel. That word has very little to do with
"disks" per se.
Between SCSI probing LUNs (where the controller tries every theoretical
address and gives a potential device ample time to reply) and
usb-storage having a simple timer delay set for each volume it sees,
there is a lot of "waiting in the name of safety" going on in the Linux
kernel at device initialization.
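For the usb-storage case that delay is even visible and tunable from
userspace. A minimal sketch, assuming usb-storage is built as a module;
the delay_use parameter is real, but its default has varied across
kernel versions:

    # Seconds usb-storage waits before using a newly attached device:
    cat /sys/module/usb_storage/parameters/delay_use
    # Shorten it persistently via a modprobe option...
    echo 'options usb-storage delay_use=1' > /etc/modprobe.d/usb-storage.conf
    # ...or per-boot on the kernel command line: usb-storage.delay_use=1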
When I added the message "scanning /dev/sd??" to the startup sequence
as I iterate through the disks and partitions present, I discovered
that the first time I called blkid (e.g. right between /dev/sda and
/dev/sda1) I'd get a huge hit of many human seconds (I didn't time it,
but I'd say eight or so) just for having a 2 TB WD My Book USB 3.0 disk
enclosure attached as /dev/sdc. The fact that this enclosure had "spun
up" in the previous boot cycle, and that this was only a soft reboot,
was immaterial. In this case usb-storage is going to take its time and
do its deal regardless of the state of the physical drive itself.
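To put a number on that hit, a loop like the following (a minimal
sketch, not the actual Underdog script) will show exactly which device
is eating the time:

    # Time the first blkid probe of every disk and partition present.
    for dev in /dev/sd? /dev/sd?[0-9]; do
        [ -b "$dev" ] || continue
        echo "scanning $dev"
        time blkid "$dev"
    done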
So there are _lots_ of places where you are going to get delays and very
few of them involve the disk itself going from power-off to ready.
You said it yourself with respect to SSDs.
It's cheaper, less error-prone, and less likely to generate customer
returns if the generic controller chips just "send init, wait a fixed
delay, then request a status" compared to trying to "are-you-there-yet"
poll each device like a nagging child. And you are going to see that at
every level. And you are going to see it multiply on _sparsely_
provisioned buses, where the cycle is going to be retried for absent
LUNs (one disk on a Wide SCSI bus and a controller set to probe all
LUNs is particularly egregious).
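Some of that LUN-by-LUN waiting is visible (and tunable) from
userspace. A sketch against the stock SCSI midlayer; max_luns and scan
are real scsi_mod parameters, the values shown are only examples:

    # How many LUNs the SCSI midlayer will probe per target:
    cat /sys/module/scsi_mod/parameters/max_luns
    # Whether the boot-time bus scan runs synchronously or asynchronously:
    cat /sys/module/scsi_mod/parameters/scan
    # Both can be set on the kernel command line, e.g.:
    #   scsi_mod.max_luns=1 scsi_mod.scan=async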
One of the reasons that the whole industry has started favoring
point-to-point (SATA, SAS) or physical intercessor chaining
point-to-point (eSATA) buses is to remove a lot of those wait-and-see
delays.
That said, you should not see a drive (or target enclosure, or
controller) "reset" during spin-up. In a SCSI setting this is almost
always a cabling, termination, or addressing issue. With IDE it's a
jumper mismatch (master vs. slave vs. cable-select). Less often it's a
partitioning issue (trying to access sectors beyond the end of the
drive).
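That last one is easy to rule out by comparing the drive's real size
with what the partition table claims; /dev/sdc below is just a
stand-in for whichever drive you suspect:

    # Drive size in 512-byte sectors, straight from the kernel:
    blockdev --getsz /dev/sdc
    # The partition table's idea of start/end sectors; the last "End"
    # must not exceed the size reported above.
    parted /dev/sdc unit s print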
Another strong actor is selecting the wrong storage controller chipset
driver. In that case you may be falling back from the high-end device
you think it is, through an intermediate chipset driver, and down to
ACPI or BIOS emulation.
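It is easy to check which driver actually claimed the controller, and
whether it is the one you expected (ahci, say) or a generic fallback:

    # Show storage controllers and the kernel driver bound to each:
    lspci -k | grep -A 3 -i 'sata\|raid\|ide\|storage'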
Another common cause is having a dedicated hardware RAID controller
(Dell likes to put LSI MegaRAID controllers in its boxes, for example,
and many motherboards have hardware RAID support available through the
BIOS), leaving that feature active, then adding a drive and _not_
initializing that drive with the RAID controller's disk setup. In this
case the controller is going to repeatedly probe the drive for its
proprietary signature blocks (and reset the drive after each attempt)
before finally falling back to raw block pass-through. This can take a
long time (thirty seconds to a minute).
But seriously, if you are seeing "reset" anywhere in any storage chain
during a normal power-on cycle then you've got a problem with geometry
or configuration.
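If you do see resets, the kernel log will usually name the offender
and the layer it came from, which narrows things down quickly:

    # Look for link and device resets and who reported them:
    dmesg | grep -iE 'reset|link (up|down)'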