Hi,

On 09/05/2016 16:53, Niccolò Belli wrote:
> On Sunday 8 May 2016 20:27:55 CEST, Patrik Lundquist wrote:
>> Are you using any power management tweaks?
>
> Yes, as stated in my very first post I use TLP with
> SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the
> bug even without TLP. Also, for the past week I've always been on AC.
>
> On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
>> Memtest doesn't replicate typical usage patterns very well.  My usual
>> testing for RAM involves not just memtest, but also booting into a
>> LiveCD (usually SystemRescueCD), pulling down a copy of the kernel
>> source, and then running as many concurrent kernel builds as cores,
>> each with as many make jobs as cores (so if you've got a quad core
>> CPU (or a dual core with hyperthreading), it would be running 4
>> builds with -j4 passed to make).  GCC seems to have memory usage
>> patterns that reliably trigger memory errors that aren't caught by
>> memtest, so this generally gives good results.
>
> Building a kernel with 4 concurrent threads is not an issue for my
> system; in fact I compile a lot and have never had any issue.

Note: I once had a server that passed memtest86 and repeated kernel
compilations maxing out the CPU threads, but could not reliably compile
a kernel and copy large amounts of data at the same time.
I think I lost my little automated test suite (I should definitely look
for it again or rewrite it from scratch), but here is what I have done
on new servers since then:

1/ create a file larger than the system's RAM with dd if=/dev/urandom
and compute its md5 checksum. The size makes sure all data is really
read from and written to disk rather than served from caches, which
might catch controller hardware problems too, and several gigabytes of
random data exercise many different patterns, far more than what
memtest86 would test.
2/ launch a subprocess that repeatedly compiles the kernel with more
jobs than available CPU threads and stops as soon as the make exit code
is != 0.
3/ launch another subprocess that repeatedly copies the random file to
another location and exits when the md5 checksum of the copy doesn't
match the source.

Let it run as a burn-in test for as long as you can afford (in my
experience, if it is still running after 24 hours, the probability that
the test will find a problem becomes negligible).
If one of the subprocesses stops by itself, your hardware is not stable.
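
Since the original test suite is lost, here is a rough sketch of how the
three steps above could be wired together in Python. It is not the script
I actually used: the function names (create_random_file, build_loop,
copy_loop, burn_in), the use of threads and the parameters are all
assumptions made for illustration. The build command and kernel source
directory are supplied by the caller.

```python
import hashlib
import os
import shutil
import subprocess
import threading


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_random_file(path, size_bytes, chunk_size=1 << 20):
    """Step 1: write size_bytes of random data and return its MD5.

    For a real burn-in, size_bytes should exceed the system's RAM so
    that later reads come from disk and not from the page cache.
    """
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(os.urandom(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())
    return md5_of(path)


def build_loop(command, cwd, stop):
    """Step 2: rerun the build until it fails or stop is set.

    Returns the failing exit code, or None if stopped without failure.
    The stop flag is only checked between builds.
    """
    while not stop.is_set():
        result = subprocess.run(command, cwd=cwd)
        if result.returncode != 0:
            return result.returncode
    return None


def copy_loop(source, dest, expected_md5, stop, max_rounds=None):
    """Step 3: copy the file repeatedly and verify each copy.

    Returns the round number of the first checksum mismatch, or None.
    """
    rounds = 0
    while not stop.is_set() and (max_rounds is None or rounds < max_rounds):
        rounds += 1
        shutil.copyfile(source, dest)
        if md5_of(dest) != expected_md5:
            return rounds
    return None


def burn_in(source, dest, size_bytes, build_cmd, build_dir, duration_s):
    """Run both loops concurrently; return a dict of failures (empty if none)."""
    expected = create_random_file(source, size_bytes)
    stop = threading.Event()
    failures = {}

    def run(name, fn, *args):
        outcome = fn(*args)
        if outcome is not None:
            failures[name] = outcome
            stop.set()  # one failure is enough to end the test

    workers = [
        threading.Thread(target=run, args=("build", build_loop, build_cmd, build_dir, stop)),
        threading.Thread(target=run, args=("copy", copy_loop, source, dest, expected, stop)),
    ]
    for w in workers:
        w.start()
    stop.wait(timeout=duration_s)  # returns early if a worker failed
    stop.set()
    for w in workers:
        w.join()
    return failures
```

A call could look like burn_in("/data/random.bin", "/data/copy.bin",
size, ["make", "-j16"], "/usr/src/linux", 24 * 3600), where the paths,
the size and the job count are placeholders to adapt to the machine. The
build command would also need to force a real rebuild on each pass (for
example by cleaning first), which this sketch leaves to the caller. A
non-empty result means one of the loops stopped by itself.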

This actually caught a few unstable systems for me before they could go
into production.

Lionel
