> Well, I was able to run memtest on the system last night, that passed with
> flying colors, so I'm now leaning toward the problem being in the sas card.
> But I'll have to run some more tests.
Seriously, run "stres.sh" for a couple of days. When I was running memtest, it ran continuously for 3 days without an error; after a day of stres.sh, errors started showing up.

Be VERY careful about trusting any tool of that sort: modern CPUs lie to you continuously!

1. You may think you have written the best code on the planet for bypassing the CPU cache, but in reality, since CPUs are multicore, you can end up with overzealous MPMD trapping you inside your cache memory. All your testing will do is write a page (trapped in cache) and read it back from cache (via the coherency mechanism, not the miss/hit one), which keeps you inside L3, so you have no clue that you never touch the RAM. Then the CPU just dumps your page to RAM and "job done".

2. Because of coherency problems, and real problems with non-blocking on MPMD, you can have a DMA controller sucking pages out of your own cache: the RAM is marked as dirty, and the CPU will try to save time and accelerate the operation by pushing the DMA straight out of L3 to somewhere else. (I mention this since some testers use a crazy way of forcing RAM access, via DMA to somewhere and back, to force the data to drop out of L3.)

3. This one is actually funny: some testers didn't claim the pages for the process, so for some reason the pages they were using did not show up as used / dirty etc., and all the testing was done in 32kB of L1 ... the tests were fast though :)

stres.sh will test the operation of the whole system! It shifts a lot of data, so the disks are engaged; the CPU keeps pumping out CRC32 all the time, so it's busy; and the RAM gets hit nicely as well due to the high DMA.

Come to think of it, if your device points change during operation of the system, it might be the LSI card dying -> reinitializing -> rediscovering drives -> drives showing up at a different point. On my system I can hot swap SATA and a drive will come up with a different dev even though it was connected to the same place on the controller.

I think the most important question is: I presume you run non-ECC?
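
P.S. For anyone who wants to see the general idea of the whole-system test described above (shift data through the disks while the CPU computes CRC32), here is a rough sketch in Python. This is NOT stres.sh itself; the function names (crc32_of_file, stress_pass, run), the file sizes and the pass count are all made up for illustration, and a real test would use files far larger than RAM so that the read-back cannot be served from the page cache.

```python
import os
import tempfile
import zlib


def crc32_of_file(path, chunk_size=1 << 20):
    """Stream a file back from storage and return its CRC32."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc


def stress_pass(directory, size_bytes=8 << 20, chunk_size=1 << 20):
    """Write random data, fsync it, read it back and compare CRC32 values.

    Hypothetical helper: returns True when the checksum of the data read
    back matches the checksum computed while writing.
    """
    expected = 0
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            remaining = size_bytes
            while remaining > 0:
                chunk = os.urandom(min(chunk_size, remaining))
                expected = zlib.crc32(chunk, expected)
                f.write(chunk)
                remaining -= len(chunk)
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push the data to the device
        return crc32_of_file(path, chunk_size) == expected
    finally:
        os.remove(path)


def run(directory, passes=3, size_bytes=8 << 20):
    """Run several passes and return the number of checksum mismatches."""
    failures = 0
    for _ in range(passes):
        if not stress_pass(directory, size_bytes):
            failures += 1
    return failures
```

Calling run("/some/dir/on/the/suspect/array", passes=1000, size_bytes=...) in a loop for a few days is the spirit of it; any nonzero return value means data came back different from what was written.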