Bug#714295: Data corruption when using INIC-1623TA2 controller

2013-06-30 Thread Ben Hutchings
Martin Braure de Calignon wrote:
> I'm experiencing frequent data corruption on my raid1 ext4 fs.
> The error is not always the same.
> 
> I first thought it was due to a previous resize of the FS that I had done.
> Several times I got messages about a huge number of multiply-claimed 
> blocks in inode .
> fsck.ext4 never completed fully and always ended with a message like "FS 
> still have error".
> I was unable to copy all the files to another FS to save it.
> So I ended up checking with badblocks (that's where I was dumb: I chose a 
> mode that does not preserve data).
> However, badblocks completed successfully without errors.
> So in the end I lost some data, but I don't know if it's due to the bug 
> or to the badblocks check. So feel free to readjust severity.
> 
> Since then, I have bought two brand new disks, created a completely new ext4 
> FS, and copied over the files that I had successfully recovered.
> Then I ran fsck.ext4 on the FS... it seemed to be almost working.
> I remounted /dev/md0... And each time I start using the system 
> seriously, I get new errors, like the ones I had today:
> (I was just copying files on it)
[...]
> [1436849.120036] EXT4-fs (md0): error count: 6
> [1436849.120044] EXT4-fs (md0): initial error at 1371763084: 
> htree_dirblock_to_tree:587: inode 20971803: block 83894316
> [1436849.120054] EXT4-fs (md0): last error at 1371765809: 
> htree_dirblock_to_tree:587: inode 41813096: block 167256110
> [1446656.923648] EXT4-fs error (device md0): htree_dirblock_to_tree:587: 
> inode #52698372: block 210773049: comm smbd: bad entry in directory: 
> directory entry across blocks - offset=1052(9244), inode=1949184565, 
> rec_len=29816, name_len=24
[...]

The kernel log also showed the CPU was reaching its temperature limit,
but after he cleaned out the CPU cooler and corrected the CPU frequency
the problem persisted.  I suggested swapping disks between controllers:

On Fri, 2013-06-28 at 15:55 +0200, Martin Braure de Calignon wrote:
[...]
> So as planned I unplugged the working non-RAID1 disk from their
> controller, and connected the ext4 RAID1 and the ext3 RAID1 disk to it
> (yeah these are 2 powerful RAID1 with 1 device only ;) for testing
> purposes).
> I also tried re-plugging each PCI card, and connected the video card fan
> that was not connected (yeah it was a bad idea to limit the noise level
> a few years ago).
> 
> I did all the tests I could to try to overheat the system (same as
> yesterday):
> * 4 running dd if=/dev/urandom | gzip >/dev/null for the cpu
> * massive copy from one disk to the other 
> * delete of duplicates between two directories (with many duplicates)
> 
> All that in parallel. Everything seems to work fine. No corruption and no
> CPU overheating message (yesterday I still had some even after removing
> the overclock of the CPU).
[...]
> Here's the lspci - for this card (if I'm not wrong):
> 
> 02:09.0 SATA controller: Initio Corporation INI-1623 PCI SATA-II Controller (rev 02) (prog-if 00 [Vendor specific])
>     Subsystem: Initio Corporation Device 1626
>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>     Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR-  INTx-
>     Latency: 32, Cache Line Size: 32 bytes
>     Interrupt: pin A routed to IRQ 17
>     Region 0: I/O ports at 9000 [size=256]
>     Region 1: Memory at ef022000 (32-bit, non-prefetchable) [size=4K]
>     [virtual] Expansion ROM at 8000 [disabled] [size=128K]
>     Capabilities: [dc] Power Management version 2
>         Flags: PMEClk+ DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
>         Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
>     Kernel driver in use: sata_inic162x
[...]

So this does seem to be a fault in either this card or the driver.  Can
you suggest any further tests that Martin could do?

Ben.

-- 
Ben Hutchings
Sturgeon's Law: Ninety percent of everything is crap.





2013-06-30 Thread Tejun Heo
Hello,

On Sun, Jun 30, 2013 at 03:49:24PM +0100, Ben Hutchings wrote:
> So this does seem to be a fault in either this card or the driver.  Can
> you suggest any further tests that Martin could do?

Unfortunately, I don't have any idea.  That driver never really
matured enough.  I couldn't find enough information and no one from
initio responded, so

-- 
tejun






2013-07-01 Thread Martin Braure de Calignon
Hello,

On Sun., 2013-06-30 at 23:17 -0700, Tejun Heo wrote:
> Unfortunately, I don't have any idea.  That driver never really
> matured enough.  I couldn't find enough information and no one from
> initio responded, so

Thank you Tejun and Ben,

that's totally suxx. I'm gonna try to contact them too, but I doubt it's
gonna change anything :(
I was hoping that we could turn on some logging so that we understand,
at least, what is happening (even if we can't solve it), and I could
then have made some tests by modifying code, recompiling module, ...
If the driver could cause data loss, shouldn't it be flagged as
experimental?
In the meantime, I'm probably going to purchase a new SATA card :(

Thanks again guys for your hard work! I really appreciate it!

Martin

-- 
Martin Braure de Calignon


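
A possible further test, sketched here as an assumption rather than anything proposed in the thread: write known random data to a file on the suspect filesystem, record a checksum per block, then read the file back and compare. The function names `write_blocks` and `verify_blocks` and the 1 MiB block size are hypothetical choices for illustration. The read-back only exercises the controller if the data is not served from the page cache, so the sketch asks the kernel to drop the cached pages; using a file larger than RAM is a more reliable way to get the same effect.

```python
import hashlib
import os

BLOCK_SIZE = 1024 * 1024  # 1 MiB per block (arbitrary choice for this sketch)


def write_blocks(path, n_blocks, block_size=BLOCK_SIZE):
    """Write n_blocks of random data to path.

    Returns the SHA-256 hex digest of each block, in write order.
    """
    digests = []
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            block = os.urandom(block_size)
            digests.append(hashlib.sha256(block).hexdigest())
            f.write(block)
        # Push the data out to the device before reading it back.
        f.flush()
        os.fsync(f.fileno())
        # Hint to the kernel to drop the cached pages so that the later
        # read goes to the disk. This is only a hint and may be ignored.
        if hasattr(os, "posix_fadvise"):
            try:
                os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass
    return digests


def verify_blocks(path, digests, block_size=BLOCK_SIZE):
    """Re-read path and return the indices of blocks whose digest differs."""
    bad = []
    with open(path, "rb") as f:
        for i, expected in enumerate(digests):
            block = f.read(block_size)
            if hashlib.sha256(block).hexdigest() != expected:
                bad.append(i)
    return bad
```

Running this against a file on the md0 mount point with the disks attached to the Initio card, and again with the same disks attached to the other controller, would show whether mismatches follow the card. An empty list from `verify_blocks` means every block read back intact.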