On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
> On 1/21/2022 4:31 PM, Jane Chu wrote:
> > On baremetal Intel platform with DCPMEM installed and configured to
> > provision daxfs, say a poison was consumed by a load from a user thread,
> > and then daxfs takes action and clears the poison, confirmed by "ndctl
> > -NM".
> >
> > Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
> > the ghost of the previous poison will surface, and it takes
> > unload/reload the libnvdimm drivers in order to drive the phantom poison
> > away, confirmed by ARS.
> >
> > Turns out, the issue is quite reproducible with the latest stable Linux.
> >
> > Here is the relevant console message after injected 8 poisons in one
> > page via
> > # ndctl inject-error namespace0.0 -n 2 -B 8210
>
> There is a cut-n-paste error, the above line should be
> "# ndctl inject-error namespace0.0 -n 8 -B 8210"

You say "in one page" here. What is the page size?

>
> -jane
>
> > then, cleared them all, and wait for 5+ hours, notice the time stamp.
> > BTW, the system is idle otherwise.
> >
> > [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
> > 1850602400
> > [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
> > fsdax_poison_v1:8457 due to hardware memory corruption
> > [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
> > Recovered
> > [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> > -1850603000 uncached-minus<->write-back
> > [ 2439.769984] x86/PAT: memtype_reserve failed [mem
> > 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> > [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> > [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
> > [mem 0x1850602000-0x1850602fff]

This error is reported in PFN=1850602 (at offset 0x400 = 1K)

> >
> > At this point,
> > # ndctl list -NMu -r 0
> > {
> >   "dev":"namespace0.0",
> >   "mode":"fsdax",
> >   "map":"dev",
> >   "size":"15.75 GiB (16.91 GB)",
> >   "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
> >   "sector_size":4096,
> >   "align":2097152,
> >   "blockdev":"pmem0"
> > }
> >
> > [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [21352.001528] {2}[Hardware Error]: event severity: recoverable
> > [21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
> > [21352.014156] {2}[Hardware Error]: section_type: memory error
> > [21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200

This error is in the following page: PFN=1850603 (at offset 0x200 = 512b)

Is that what you mean by "phantom error" ... from a different address
from those that were injected?

-Tony
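
For reference, the PFN/offset arithmetic in the comments above can be
checked with a small sketch like the one below. It is illustrative only
and rests on assumptions not confirmed in this thread: 4 KiB base pages
(PAGE_SHIFT = 12), a 512-byte unit for the "-B"/"-n" block arguments, and
a namespace data area that starts on a page boundary. The helper names
split_phys_addr and pages_touched are made up for this example.

```python
PAGE_SHIFT = 12   # assumption: 4 KiB base pages
BLOCK_SIZE = 512  # assumption: block unit used by the -B / -n arguments


def split_phys_addr(addr, page_shift=PAGE_SHIFT):
    """Split a physical address into (pfn, offset within the page)."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)


def pages_touched(start_block, count, block_size=BLOCK_SIZE,
                  page_shift=PAGE_SHIFT):
    """Count the pages covered by a block range.

    Offsets are taken relative to the start of the data area, which is
    assumed (hypothetically) to be page aligned.
    """
    first_page = (start_block * block_size) >> page_shift
    last_page = ((start_block + count) * block_size - 1) >> page_shift
    return last_page - first_page + 1


# The two physical addresses that appear in the console messages.
for addr in (0x1850602400, 0x0000001850603200):
    pfn, offset = split_phys_addr(addr)
    print(f"addr={addr:#x} pfn={pfn:#x} offset={offset:#x} ({offset} bytes)")

# Block range taken from the corrected inject-error command line.
print("pages touched:", pages_touched(8210, 8))
```

Under those assumptions the two addresses land in adjacent PFNs
(0x1850602 and 0x1850603), and 8 blocks starting at block 8210 would
cover two pages rather than one. Whether that matches the real layout of
namespace0.0 depends on the actual page size and data offset.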