Claes Fransson posted on Wed, 24 Jan 2018 20:44:33 +0100 as excerpted:

> So, I have now some results from the PassMark Memtest86! I let the
> default automatic tests run for about 19 hours and 16 passes. It
> reported zero "Errors", but 4 lines of "[Note] RAM may be vulnerable to
> high frequency row hammer bit flips". If I understand it correctly,
> it means that some errors were detected when the RAM was tested at
> higher rates than guaranteed accurate by the vendors.

From Wikipedia:

Row hammer (also written as rowhammer) is an unintended side effect in 
dynamic random-access memory (DRAM) that causes memory cells to leak 
their charges and interact electrically between themselves, possibly 
altering the contents of nearby memory rows that were not addressed in 
the original memory access. This circumvention of the isolation between 
DRAM memory cells results from the high cell density in modern DRAM, and 
can be triggered by specially crafted memory access patterns that rapidly 
activate the same memory rows numerous times.[1][2][3]

The row hammer effect has been used in some privilege escalation computer 
security exploits.

https://en.wikipedia.org/wiki/Row_hammer

So it has nothing to do with (generic) testing of the RAM at higher 
rates than guaranteed by the vendors, but rather with deliberate, rapid, 
repeated access (at normal clock rates) of the same cell rows, in order 
to trigger a bitflip in nearby memory cells that could not normally be 
accessed due to process separation and insufficient privileges.

IOW, it's unlikely to be accidentally tripped, and thus is exceedingly 
unlikely to be relevant here, unless you're being hacked, of course.
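
FWIW, the hammering pattern itself is trivial to express; the hard part 
of a real exploit is choosing addresses that land in different rows of 
the same DRAM bank, and then finding exploitable flips.  A rough sketch 
in C of just the access loop (x86-specific, using SSE2's clflush so each 
read actually hits DRAM; the buffer size and offsets are arbitrary picks 
of mine, with no attempt at bank-awareness):

#include <stdint.h>
#include <stdlib.h>
#include <emmintrin.h>   /* _mm_clflush (SSE2) */

/* Illustration only: the classic rowhammer access loop.  A real
 * attack must arrange that a and b map to different rows of the
 * same DRAM bank; this sketch does NOT do that. */
static void hammer(volatile uint8_t *a, volatile uint8_t *b, long n)
{
    while (n--) {
        (void)*a;                      /* activate the row holding a */
        (void)*b;                      /* activate the row holding b */
        _mm_clflush((const void *)a);  /* evict, so next read hits DRAM */
        _mm_clflush((const void *)b);
    }
}

int main(void)
{
    uint8_t *buf = malloc(1 << 22);    /* 4 MiB; offsets are arbitrary */
    if (!buf)
        return 1;
    hammer(buf, buf + (1 << 21), 10L * 1000 * 1000);
    free(buf);
    return 0;
}

Presumably memtest86's hammer test runs that sort of pattern 
deliberately, which would explain the [Note] lines despite zero errors 
in the normal tests.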


That said, and entirely unrelated to rowhammer, I know the problem of 
memory-test false negatives from experience.

In my case, I was even running ECC RAM.  But the memory I had purchased 
(back in the day when memory was far more expensive and sub-GB memory was 
the norm) was cheap, and as it happened, marked as stable at slightly 
higher clock rates than it actually was.  But I couldn't afford more (or 
I'd have procured less dodgy RAM in the first place) and had little 
recourse but to live with it for a while.  A year or so later a BIOS 
update added finer memory clocking control, and I was able to declock 
the RAM slightly below its rating (IIRC to PC-3000 level; it was 
PC3200-rated, this being the DDR1 era), after which it was /entirely/ 
stable, even after reducing the wait-state settings somewhat to try to 
claw back some of what I'd lost to the underclocking.

I run gentoo, and nearly all of my problems occurred when I was doing 
updates, building packages at 100% CPU with multiple cores accessing the 
same RAM.  FWIW, the most frequent /detected/ problem was bunzip2 
checksum errors as it decompressed and verified the data in memory 
(before writing out) -- errors that would move or go away if I tried 
again.  Occasionally I'd get machine-check errors (MCEs), but not 
frequently, and the ECC RAM subsystem /never/ reported errors.

But the memory tests gave that memory an all-clear.

The problem with the memory tests in this case is that they tend to run 
on an otherwise unloaded system, testing the retention of the memory 
cells, /not/ so much the speed and reliability at which they can be 
accessed under full system load -- and how could they, when memory speed 
is normally set by the BIOS and isn't something the memory tester 
controls?

But my memory problems weren't with the memory cells themselves -- they 
retained their data just fine, and being ECC RAM they'd have triggered 
ECC errors if they hadn't.  The problem was the precision timing of 
memory IO: the RAM wasn't quite up to the specs it claimed to support 
and would occasionally produce in-transit errors (the ECC would have 
detected and possibly corrected errors in storage, not in transit), and 
the memory testers simply didn't exercise that the way a fully loaded 
system unpacking sources and building from them did.
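
If you want to approximate that kind of load without doing actual 
package builds, a handful of threads each writing and verifying patterns 
over large buffers at once keeps all the cores and the memory bus busy 
in a way the idle-system testers don't.  A rough sketch using pthreads 
(sizes and thread count are arbitrary picks of mine; real tools like 
memtester or stressapptest do this far more thoroughly):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4
#define CHUNK    (64UL * 1024 * 1024)   /* 64 MiB per thread */
#define ROUNDS   100

/* Each thread repeatedly fills its chunk with a pattern, then reads
 * it back and compares.  volatile keeps the compiler from proving
 * the verify loop can never fail and optimizing it away. */
static void *worker(void *arg)
{
    long id = (long)arg, errors = 0;
    volatile uint64_t *buf = malloc(CHUNK);
    size_t n = CHUNK / sizeof(uint64_t);

    if (!buf)
        return NULL;
    for (int r = 0; r < ROUNDS; r++) {
        uint64_t pattern = 0x5555555555555555ULL ^ (uint64_t)r;
        for (size_t i = 0; i < n; i++)          /* write phase */
            buf[i] = pattern ^ i;
        for (size_t i = 0; i < n; i++)          /* verify phase */
            if (buf[i] != (pattern ^ i))
                errors++;
    }
    if (errors)
        fprintf(stderr, "thread %ld: %ld mismatches!\n", id, errors);
    free((void *)buf);
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}

In my case the equivalent stress was just gentoo's constant 
decompress-verify-build cycle; the point is the concurrent access, not 
any particular pattern.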

As mentioned, once a BIOS update let me declock the RAM a bit, 
everything was fine, and it remained fine when I upgraded the RAM some 
years later, after prices had fallen.

(The system was a first-gen AMD Opteron on a server-grade Tyan board, 
which I ran from purchase in late 2003 for over eight years, maxing out 
the pair of CPUs to dual-core Opteron 290s and the RAM to 8 gigs over 
time, until the board finally died in 2012 due to burst capacitors.  
Which reminds me, I'm still running the replacement, a Gigabyte board 
with an FX-6100 overclocked a bit to 3.9 GHz and 16 gigs of RAM, and 
it's now nearing six years old.  I've spent those six years upgrading to 
big-screen TVs as monitors, with a 65-inch/165cm 4K as my primary now 
and a 48-inch/122cm as a secondary to put youtube or whatever on 
fullscreen, and to my second generation of SSDs, a pair of 1 TB Samsung 
EVOs -- but at nearing six years the main system's aging too, so I'd 
better start thinking about replacing it again...)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
