On Thu, 23 Nov 2006, DervishD wrote:

>     Hi Alan :)
> 
>  * Alan Stern <[EMAIL PROTECTED]> dixit:
> > On Wed, 22 Nov 2006, Andrew Morton wrote:
> > >     The problem is the following: whenever I copy a lot of data to
> > > the usb-storage device (more than a few GB's), the copy goes OK,
> > > without an error, but when I compare the copied files with the
> > > original files, sometimes a copied file is different. This does not
> > > happen if I copy the files one by one, and it doesn't happen all the
> > > time; sometimes the copy is perfect.
> > 
> > Intermittent problems like this are very hard to track down.  It
> > sounds like a hardware problem of some sort, but without more
> > information it's impossible to say if the problem lies in your
> > computer, the USB cable, the USB-storage adapters, or the hard disk
> > drives.  Have you tried using different cables?
> 
>     Yes, and the error disappears (or at least it hasn't been
> produced yet) when using a very short cable (less than 0.5m), while a
> USB memory stick works OK with a 1m long cable at the same speed! The
> set of cables causing problems are of different brands, and their
> only common "feature" is their length: about 1m. The same cables work
> OK (no detectable problems) in other computers; I tested that this
> morning.

It could be some sort of electromagnetic interference phenomenon.  Someone 
once reported that merely turning on the fluorescent lights in the room 
with his computer was enough to cause USB errors.

On the other hand, variations in cable length don't appear to relate to 
the error code mentioned below.  So who knows...

>     The fact is that the USB card works OK in another computer I've
> tested on, but that computer uses Windows and I cannot install Linux
> there, so... :(
> 
>     Really the problem is very difficult to track down.
> 
> > >     In addition to this, from time to time the usb-storage
> > > adapters (any of them, with any of the USB cards and any kernel)
> > > report a read error, saying that some sector could not be read.
> > > This is false because if I repeat the operation, the sector is
> > > correctly retrieved.
> > 
> > No, the messages are not false.  They definitely indicate a
> > problem; you mustn't dismiss them so easily.  With borderline
> > hardware it's entirely possible that an operation can fail at one
> > moment and then succeed a few moments later.
> 
>     I've tested the hard disks with a destructive badblocks run and
> with a Seagate diagnostic tool, and all the disks are OK. In fact
> they work reliably (and SMART doesn't show any problem) when used
> directly. That leaves the usb-storage adapters as the cause of
> those failures, but why should they fail on a sector that can be
> read without trouble a while later?

I agree that a likely source is the adapters.  Remember, these things have
to communicate with both the hard disk and the USB subsystem.  So even
though the disk may be working fine, a problem at the USB level could
cause errors to show up.  Or a problem in the connection between the
adapter and the drive.  Or there could be some internal error in the
adapter itself, unrelated to either the drive or the USB connection.

>     Probably the motherboard is the culprit, I don't really know :(
> I've tested the same set (USB-card+USB-cable+usb-storage
> adapter+harddisk) in a Windows computer and it works OK, but I don't
> know if it really works or if Windows is ignoring I/O errors and
> silently retrying :??? Unfortunately, I cannot carry out under Windows
> the same tests I can carry out under Linux (including modifying the
> kernel to add traces or any other debugging helper).
> 
> > >     The fact is that I cannot reproduce the problem reliably, so I
> > > cannot give you a "recipe", except that it happens when I copy a lot
> > > of data at a time.
> > > 
> > >     Any suggestion about how to narrow the problem down? Any more
> > > data that you may need? A known bug? Am I doing something stupid?
> > 
> > Well, you could start by posting some of the error messages!
> 
>     Yep, sorry O:)) Here they are:
> 
> kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 8000002
> kernel: Current sd08:01: sns = 70  4
> kernel: ASC=4b ASCQ= 0
> kernel: Raw sense data:0x70 0x00 0x04 0x00 0x00 0x00 0x00 0x0a 0x00
>                        0x00 0x00 0x00 0x4b 0x00 0x00 0x00 0x00 0x00
> kernel:  I/O error: dev 08:01, sector 1804512

Sense key = 0x04 means Hardware error.  ASC = 0x4b means Data Phase error, 
which is probably a fancy way of saying that the adapter had some 
unspecified difficulty communicating with the drive (although 
manufacturers aren't very careful about the error codes used in their 
hardware, so it could easily mean something else).
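
To make that decoding concrete, here is a minimal Python sketch that pulls the same fields out of the logged values. It assumes the 18 bytes are fixed-format sense data (response code 0x70) and that the logged "return code" is the Linux SCSI result word packed as driver/host/message/status bytes. The function names and the lookup tables, which cover only the codes relevant to this report, are illustrative and not part of any kernel or library API.

```python
# Sketch: decode the SCSI error values from the kernel log above.
# Assumes fixed-format sense data and the Linux result-word layout.

# Raw sense bytes exactly as printed in the log (18 bytes).
RAW_SENSE = bytes([
    0x70, 0x00, 0x04, 0x00, 0x00, 0x00, 0x00, 0x0A, 0x00,
    0x00, 0x00, 0x00, 0x4B, 0x00, 0x00, 0x00, 0x00, 0x00,
])

# Partial SPC sense key table (enough for this report).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

# Only the ASC/ASCQ pair that appears in this report.
ASC_NAMES = {(0x4B, 0x00): "DATA PHASE ERROR"}


def decode_fixed_sense(sense: bytes) -> dict:
    """Extract sense key, ASC and ASCQ from fixed-format sense data."""
    if len(sense) < 14:
        raise ValueError("sense buffer too short for ASC/ASCQ")
    response_code = sense[0] & 0x7F  # bit 7 is the VALID flag
    if response_code not in (0x70, 0x71):
        raise ValueError(f"not fixed-format sense data: {response_code:#04x}")
    key = sense[2] & 0x0F  # sense key is the low nibble of byte 2
    asc, ascq = sense[12], sense[13]
    return {
        "current": response_code == 0x70,  # 0x71 would mean deferred
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "other"),
        "asc": asc,
        "ascq": ascq,
        "asc_name": ASC_NAMES.get((asc, ascq), "unknown"),
    }


def split_result(result: int) -> dict:
    """Split a Linux SCSI result word into its four bytes."""
    return {
        "driver": (result >> 24) & 0xFF,  # 0x08 is DRIVER_SENSE
        "host": (result >> 16) & 0xFF,  # 0x00 is DID_OK
        "msg": (result >> 8) & 0xFF,
        "status": result & 0xFF,  # 0x02 is SAM CHECK CONDITION
    }


if __name__ == "__main__":
    print(split_result(0x8000002))
    print(decode_fixed_sense(RAW_SENSE))
```

For this log, the result word splits into driver byte 0x08 (sense data available) and status 0x02 (CHECK CONDITION), and the sense bytes give key 4 (HARDWARE ERROR) with ASC/ASCQ 4B/00. The result word by itself only says that sense data exists; the diagnosis lives in the sense key and ASC.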

>     The sector numbers vary with each I/O error and are completely
> unrelated. Here is the list of "bad sectors" so far:
> 
> 6133384
> 1804512
> 18490944
> 31794768
> 31177200
> 
>     Again, neither badblocks nor the Seagate diagnostic program has
> revealed any error when the disk is connected directly to the IDE bus.

These all support the idea that the drive is working okay and the problem 
lies in the communication between the drive and the adapter, or within the 
adapter itself.

> 
> > Also, it wouldn't hurt to turn on CONFIG_USB_DEBUG and rebuild the
> > USB drivers in your kernel, then post the entire kernel log
> > starting from when you plug in the drive.
> 
>     I cannot rebuild the kernel because I cannot reboot the machine
> right now, but I already have a kernel prepared with CONFIG_USB_DEBUG.
>
> > It's impossible to tell whether your problem is due to a known bug
> > without more information.  About all you've told us so far is that
> > from time to time something goes wrong.  That's not much to go on.
> 
>     Well, I just wanted some advice to further investigate the
> problem and then provide more information. Now that I've ruled out
> part of the problem (1m cables seem too long for the USB card *in
> this computer*), I'll concentrate on the I/O error.
> 
>     Thanks a lot for the advice :) I'll try to provide more and
> better data, and more tests. I'm going to perform some kind of
> differential analysis to discover the exact combination (if any) or
> conditions (again, if any) that lead to the error. Anyway, this is
> going to be slow, because each test takes half an hour or even more.
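
Since each run takes half an hour or more, it may help to automate the compare step so that every test is checked the same way. The following is a minimal Python sketch, assuming the copy has already been made with whatever tool is under test; the function names, the script name used below, and the choice of SHA-256 are illustrative. One caution: compare only after unmounting and remounting the USB disk, otherwise the reads may be served from the page cache and never go through the adapter at all.

```python
import hashlib
import os
import sys


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large files don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(src_root: str, dst_root: str) -> list:
    """Return sorted (relative path, reason) pairs for bad copies."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst):
                problems.append((rel, "missing"))
            elif sha256_of(src) != sha256_of(dst):
                problems.append((rel, "differs"))
    return sorted(problems)


def main(src_root: str, dst_root: str) -> int:
    """Print every problem found and return a shell-style exit status."""
    problems = compare_trees(src_root, dst_root)
    for rel, why in problems:
        print(f"{why}: {rel}")
    print(f"{len(problems)} problem(s) found")
    return 1 if problems else 0


if __name__ == "__main__" and len(sys.argv) == 3:
    # Act as a command-line tool only when given exactly two directories.
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Saved as, say, compare_trees.py, it would be run as `python3 compare_trees.py ORIGINAL_DIR COPY_DIR`; a nonzero exit status means at least one file is missing or differs, which makes it easy to repeat the same check after every run.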

Sometimes differential analysis doesn't help.  Here's an example:

I've got a USB hard drive adapter.  Plugged in to my home computer, it
doesn't work.  But if I use a different USB cable, then it does work.  
Alternatively, I can use the old USB cable with a USB flash drive, and
that works.  Or, I can move both the adapter/hard drive and the cable to a
different computer, and again they work.  In short, switching any one of
the three components (computer, cable, device) is enough to get things 
working again -- so which component is at fault?

Alan Stern


_______________________________________________
linux-usb-devel@lists.sourceforge.net
To unsubscribe, use the last form field at:
https://lists.sourceforge.net/lists/listinfo/linux-usb-devel
