Hi,

On 09/05/2016 16:53, Niccolò Belli wrote:
> On Sunday 8 May 2016 20:27:55 CEST, Patrik Lundquist wrote:
>> Are you using any power management tweaks?
>
> Yes, as stated in my very first post I use TLP with
> SATA_LINKPWR_ON_BAT=max_performance, but I managed to reproduce the
> bug even without TLP. Also in the past week I've always been on AC.
>
> On Monday 9 May 2016 13:52:16 CEST, Austin S. Hemmelgarn wrote:
>> Memtest doesn't replicate typical usage patterns very well. My usual
>> testing for RAM involves not just memtest, but also booting into a
>> LiveCD (usually SystemRescueCD), pulling down a copy of the kernel
>> source, and then running as many concurrent kernel builds as cores,
>> each with as many make jobs as cores (so if you've got a quad core
>> CPU (or a dual core with hyperthreading), it would be running 4
>> builds with -j4 passed to make). GCC seems to have memory usage
>> patterns that reliably trigger memory errors that aren't caught by
>> memtest, so this generally gives good results.
>
> Building a kernel with 4 concurrent threads is not an issue for my
> system; in fact I compile a lot and I have never had any issue.
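For anyone who wants to script Austin's recipe, a rough sketch (the DRY_RUN guard and the "linux" source-tree path are my illustrative additions, not from his mail):

```shell
#!/bin/sh
# Sketch of the concurrent-build stress test quoted above: as many
# kernel builds as CPU threads, each with -jN.
# DRY_RUN and the tree path are illustrative, not from the thread.
N=$(nproc)
DRY_RUN="${DRY_RUN:-1}"   # set to 0 to actually run the builds

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

i=1
while [ "$i" -le "$N" ]; do
    # give each build its own output directory so they don't collide
    run make -C linux O="build-$i" -j"$N" &
    i=$((i + 1))
done
wait
```

With DRY_RUN=1 it only prints the commands; flip it to 0 on a machine with a kernel tree checked out at ./linux.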
Note: I once had a server which would pass memtest86 and repeated kernel compilations maxing out the CPU threads, but couldn't reliably compile a kernel and copy large amounts of data at the same time. I think I lost my little automated test suite (I should definitely look for it again or code it from scratch), but what I have done on new servers since then is:

1/ Create a file larger than the system's RAM with dd if=/dev/urandom (this makes sure you read and write all data from disk and not only caches, and might catch controller hardware problems too; several gigabytes of random data exercise many different patterns, far more than what memtest86 would test), then compute its md5 checksum.

2/ Launch a subprocess repeatedly compiling the kernel with more jobs than available CPU threads, stopping as soon as make's exit code is != 0.

3/ Launch another subprocess repeatedly copying the random file to another location, exiting when the md5 checksum no longer matches the source.

Let it run as a burn-in test for as long as you can afford (in my experience, if it is still running after 24 hours the probability that the test will find a problem becomes negligible). If one of the subprocesses stops by itself, your hardware is not stable.

This actually caught a few unstable systems for me before they went into production.

Lionel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
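P.S. The three steps above can be sketched roughly like this; SIZE_MB, SRC and DST are illustrative placeholders, and the size is kept tiny here, whereas in real use the file must be larger than RAM and step 2 runs as a parallel build loop:

```shell
#!/bin/sh
# Burn-in sketch of steps 1-3 above. SIZE_MB, SRC and DST are
# illustrative; in real use SIZE_MB must exceed the machine's RAM.
set -u

SIZE_MB="${SIZE_MB:-4}"
SRC="${SRC:-/tmp/burnin.src}"
DST="${DST:-/tmp/burnin.dst}"

# 1/ create a random file and record its checksum
dd if=/dev/urandom of="$SRC" bs=1M count="$SIZE_MB" 2>/dev/null
REF_SUM=$(md5sum "$SRC" | cut -d' ' -f1)

# 2/ (not run here) in parallel, loop kernel builds with more jobs
#    than CPU threads, e.g.:
#    while make -j"$(( $(nproc) + 2 ))"; do make clean; done

# 3/ repeatedly copy and verify; stop on the first mismatch
copy_and_verify() {
    cp "$SRC" "$DST" &&
    [ "$(md5sum "$DST" | cut -d' ' -f1)" = "$REF_SUM" ]
}

if copy_and_verify; then
    echo "copy OK"
else
    echo "checksum mismatch: hardware unstable" >&2
    exit 1
fi
```

In the real test both loops run as background subprocesses for hours, and you watch whether either one exits early.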