On 2016-07-08 07:14, Tomasz Kusmierz wrote:
Well, I was able to run memtest on the system last night; it passed with
flying colors, so I'm now leaning toward the problem being in the SAS
card. But I'll have to run some more tests.
Seriously, run "stres.sh" for a couple of days. When I was testing,
memtest ran continuously for 3 days without a single error, while after a
day of stres.sh the errors started showing up.
Be VERY careful about trusting any tool of that sort; modern CPUs lie
to you continuously !!!
1. You may think that you've written the best code on the planet for
bypassing the CPU cache, but in reality, since CPUs are multi-core, you
can end up with an overzealous MPMD setup trapping you inside your cache
memory. All your testing does is write a page (trapped in cache) and read
it back from cache; the coherency mechanism (not the hit/miss one) keeps
you inside L3, so you have no clue you never touched the RAM, and then
the CPU just dumps your page to RAM and "job done".
2. Due to coherency problems and real problems with non-blocking MPMD,
you can have a DMA controller sucking pages out of your own cache: the
RAM is marked as dirty, so the CPU tries to save time and accelerate the
operation by pushing the DMA transfer straight out of L3 to somewhere
else (I mention this because some testers use a crazy trick of forcing
your RAM access via DMA to somewhere and back, to force data to drop out
of L3).
3. This one is actually funny: some testers didn't claim the pages for
their process, so for some reason the pages they were using never showed
up as used / dirty etc., and all the testing was done on 32kB of L1
... the tests were fast, though :)
stres.sh will test the operation of the whole system !!! It shifts a lot
of data, so the disks are engaged, the CPU keeps pumping out CRC32 all
the time so it's busy, and the RAM gets hit nicely as well due to the
heavy DMA traffic.
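A minimal sketch of the same idea (not the actual stres.sh; the paths and
sizes are only placeholders) is to keep data moving through the disks
while checksumming it, so that the CPU, RAM and DMA all stay busy:

  #!/bin/bash
  # Whole-system stress sketch: write data out through DMA, then re-read
  # and verify it so the CPU stays busy computing checksums.
  TARGET_DIR=/mnt/scratch   # scratch area on the disks under test
  SIZE_MB=4096              # per-file size; total should exceed RAM + caches
  while true; do
      for n in 1 2 3 4; do
          dd if=/dev/urandom of="$TARGET_DIR/stress.$n" bs=1M \
              count="$SIZE_MB" conv=fsync status=none
          sha1sum "$TARGET_DIR/stress.$n" > "$TARGET_DIR/stress.$n.sha1"
      done
      # drop the page cache (needs root) so the verify pass hits the disks
      echo 3 > /proc/sys/vm/drop_caches
      sha1sum -c --quiet "$TARGET_DIR"/stress.*.sha1 || echo "MISMATCH $(date)"
  done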
Agreed, never just trust memtest86 or memtest86+.
FWIW, here's the routine I go through to test new RAM:
1. Run regular memtest86 for at least 3 full cycles in full SMP mode (F2
while starting up to force SMP). On some systems this may hang, but
that's an issue in the BIOS's setup of the CPU and MC, not the RAM, and
is generally not indicative of a system which will have issues.
2. Run regular memtest86 for at least 3 full cycles in regular UP mode
(the default on most non-NUMA hardware).
3. Repeat 1 and 2 with memtest86+. It's diverged enough from regular
memtest86 that it's functionally a separate tool, and I've seen RAM that
passes one but not the other on multiple occasions before.
4. Boot SystemRescueCD, download a copy of the Linux sources, and run as
many allmodconfig builds in parallel as I have CPU's, each with a number
of make jobs equal to twice the number of CPU's (so each CPU ends up
running at least two threads). This forces enough context switching to
completely trash even the L3 cache on almost any modern processor, which
means it forces things out to RAM. It won't hit all your RAM, but I've
found it to be a relatively reliable way to verify that the memory bus
and the memory controller work properly. (Rough command sketches for
steps 4 through 7 follow the list.)
5. Still from SystemRescueCD, use a tool called memtester (essentially
memtest86, but run from userspace) to check the RAM.
6. Still from SystemRescueCD, use sha1sum to compute SHA-1 hashes of all
the disks in the system, using at least 8 instances of sha1sum per CPU
core, and make sure that all the sums for a disk match.
7. Do 6 again, but using cat to compute the sum of a concatenation of
all the disks in the system (so the individual commands end up being
`cat /dev/sd? | sha1sum`). This will rapidly use all available memory
on the system and keep it in use for quite a while.
8. If I'm using my home server system, I also have a special virtual
runlevel set up where I spin up 4 times as many VM's as I have CPU cores
(so on my current 8-core system, I spin up 32), all assigned a part of
the RAM not used by the host (which I shrink to the minimum usable size
of about 500MB), all running steps 1-3 in parallel.
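For step 4, the commands are roughly the following (a sketch rather than
my exact script; it assumes a kernel tarball has already been downloaded
into /tmp, and the build count just follows the rule above):

  # Step 4 sketch: one allmodconfig build per CPU, each with 2*NCPU make jobs
  NCPU=$(nproc)
  cd /tmp
  tar xf linux-*.tar.xz
  for i in $(seq 1 "$NCPU"); do
      cp -a linux-*/ "build-$i"
      ( cd "build-$i" && make allmodconfig && make -j"$((NCPU * 2))" ) \
          > "build-$i.log" 2>&1 &
  done
  wait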
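For step 5, memtester just takes an amount of memory to lock and a pass
count, so it's a one-liner (the size here is a placeholder; leave enough
headroom that the rest of the system doesn't start swapping):

  # Step 5 sketch: lock and test 6 GiB of RAM for 3 passes (run as root)
  memtester 6G 3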
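And for steps 6 and 7, the hashing looks roughly like this (device names
are examples, and I've shown 8 instances per disk to keep it short; scale
the count up to match 8 per CPU core. The point is only that every
instance reading the same disk must produce the same sum):

  # Step 6 sketch: hash each disk several times in parallel, then compare
  for dev in /dev/sda /dev/sdb; do
      for i in $(seq 1 8); do
          sha1sum "$dev" > "/tmp/$(basename "$dev").$i.sha1" &
      done
  done
  wait
  cat /tmp/sd*.sha1 | sort | uniq -c  # expect one distinct sum per disk

  # Step 7 sketch: hash the concatenation of all the disks instead
  cat /dev/sd? | sha1sum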
It may also be worth mentioning that I've seen very poorly behaved HBA's
that produce symptoms that look like bad RAM, including issues not
related to the disks themselves, yet show no issues when regular memory
testing is run.
Now that I come to think about it, if your device nodes change during
operation of the system it might be the LSI card dying -> reinitializing
-> rediscovering drives -> drives showing up under different nodes. On my
system I can hot swap SATA and a drive will come up with a different dev
node even though it was connected to the same place on the controller.
Barring a few odd controllers I've seen which support hot-plug but not
hot-remove, that shouldn't happen unless the device is in use, and in
that case it only happens because of the existing open references to the
device being held by whatever is using it.
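If the node names do move around, one quick sanity check (assuming a
normal udev setup; the ID below is just an example) is to look at the
persistent symlinks and see which node each physical drive landed on this
time:

  # Map stable identifiers to whatever /dev node the kernel assigned
  ls -l /dev/disk/by-id/
  # Resolve a specific drive's current node (substitute your own ID)
  readlink -f /dev/disk/by-id/ata-EXAMPLE_MODEL_SERIAL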
I think most important: I presume you run non-ECC?
And if you're not running ECC, how well shielded is your system? You can
often get by with non-ECC RAM if you have good EMI shielding and reboot
regularly. Most servers actually do have good EMI shielding, and many
pre-built desktops do, but a lot of DIY systems don't (especially if it's
a gaming case; the polycarbonate windows many of them have in the side
panel are a _huge_ hole in the EMI shielding).