On Thu, 2019-01-03 at 20:02 +0000, Kamal Kakri wrote:
> My device has errors injected:
> # ndctl inject-error --status namespace2.0
> {
>   "badblocks":[
>     {
>       "block":35000,
>       "count":10
>     }
>   ]
> }
> 
> No problem reading from the bad offsets:
> # dd if=/dev/pmem2 of=/tmp/pmem_out bs=512 count=10 skip=35000
> 10+0 records in
> 10+0 records out
> 5120 bytes (5.1 kB) copied, 0.000108226 s, 47.3 MB/s

Did you ever read from /dev/pmem2 before injecting the error? There is
a possibility that the page is already present in the page cache and
the read gets serviced from there. You can set iflag=direct to ensure
you're reading from the device.
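
As a rough sketch of that direct read (the helper names print_direct_read
and byte_offset are made up for illustration, and 512-byte blocks are
assumed because that is what the dd commands in this thread use), the
command can be built like this. It only prints the dd invocation, so
nothing touches the device until you run the printed line yourself:

```shell
#!/bin/sh
# Print (do not run) a dd command that reads an injected range with O_DIRECT,
# so the read cannot be satisfied from cached pages.
# Assumption: block numbers are in 512-byte units.
print_direct_read() {
    dev=$1      # block device, e.g. /dev/pmem2
    block=$2    # first injected block
    count=$3    # number of injected blocks
    echo "dd if=$dev of=/dev/null bs=512 skip=$block count=$count iflag=direct"
}

# Byte offset of a block, handy for cross-checking against other tools.
byte_offset() {
    echo $(( $1 * 512 ))
}

print_direct_read /dev/pmem2 35000 10
byte_offset 35000
```

For the range in this thread, byte_offset 35000 gives 17920000.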

Other than that, there /should/ have been an MCE/SIGBUS in this case.
I'd check with your hardware/platform vendor to ensure machine checks
are available, and that injecting an error does result in a memory
error/poison consumption by the CPU.

> 
> Kernel doesn't know of the badblocks yet so this should have resulted
> in sigbus for dd:
> # cat /sys/block/pmem2/badblocks
> #
> 
> I don't have the mcelog daemon running, but there is no error in
> /var/log/messages for the pmem device. Is there some setting/config
> that I am missing?
> 
> -KK
> 
> 
> On Thursday, January 3, 2019, 1:39:44 PM EST, Verma, Vishal L <vishal.l.ve...@intel.com> wrote:
> 
> 
> 
> On Thu, 2019-01-03 at 17:13 +0000, Kamal Kakri wrote:
> > I am playing around with ndctl inject-error and have a few questions
> > about the behavior of the application when an error occurs.
> > After successfully injecting an error with --no-notify, I am able to
> > read and write to the namespace device with no problems. For example:
> > 
> > # ndctl inject-error --block=35000 --count=10 --no-notify namespace2.0
> > {
> >  "dev":"namespace2.0",
> >  "mode":"raw",
> >  "size":17179869184,
> >  "blockdev":"pmem2"
> > }
> > 
> > 
> > # dd if=/dev/pmem2 of=/tmp/pmem-dump bs=512 count=10 seek=35000 oflag=direct
> 
> I think you want 'skip=35000' here instead of seek= to read from that
> offset in the input.
> 
> > 10+0 records in
> > 10+0 records out
> > 5120 bytes (5.1 kB) copied, 0.0128088 s, 400 kB/s
> > 
> > # pwd
> > /sys/block/pmem2
> > # cat badblocks
> > #    ----------> empty badblock list
> 
> With --no-notify, badblocks is expected to be empty, as ACPI will not
> notify the OS of new errors.
> 
> > 
> > 
> > [Question] Shouldn't my "dd" get a SIGBUS (default machine-check
> > handling) when it encounters badblocks that it's not aware of
> > (no-notify)?
> 
> Yes it should - I'd be curious to see if you still don't get a machine
> check with the seek/skip fix above.
> 
> 
> > 
> > 
> > I tried both reading from and writing to the badblocks and things just
> > work. If I scrub my NVDIMMs (ndctl start-scrub), the badblocks show up
> > in the device badblock list (/sys/block/pmem2/badblocks), but dd still
> > works, and writing the blocks clears out the badblock list:
> > # cat /sys/block/pmem2/badblocks
> > 35000 10
> > 
> > # dd if=/dev/zero of=/dev/pmem2 bs=512 count=10 seek=35000 oflag=direct
> > 10+0 records in
> > 10+0 records out
> 
> 
> Writing with O_DIRECT is the canonical way to clear errors. What you
> might see here is a corrected machine check notification (CMCI) in
> your kernel logs, but that is just a notification that the platform
> has handled the error and no action is required.
> 
>     -Vishal
> 
> 
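
The seek=/skip= mix-up discussed in the quoted exchange can be reproduced
without any pmem hardware. The following is a small illustration on
scratch files (the file names and the four-block test pattern are
assumptions made up for this sketch): skip= moves the starting point in
the input, while seek= moves the starting point in the output and still
reads from block 0 of the input.

```shell
#!/bin/sh
# Illustration only: scratch files stand in for the device.
tmp=$(mktemp -d)

# Build a 4-block input where block N is filled with the character N.
for n in 0 1 2 3; do
    head -c 512 /dev/zero | tr '\0' "$n"
done > "$tmp/in"

# skip=2 discards two INPUT blocks first, so the data copied is block 2.
dd if="$tmp/in" of="$tmp/skip_out" bs=512 count=1 skip=2 2>/dev/null

# seek=2 skips two OUTPUT blocks first, so the data copied is still block 0,
# written at byte offset 1024 of the output file.
dd if="$tmp/in" of="$tmp/seek_out" bs=512 count=1 seek=2 2>/dev/null

skip_first=$(head -c 1 "$tmp/skip_out")      # first byte read via skip=
seek_last=$(tail -c 1 "$tmp/seek_out")       # last byte written via seek=
seek_size=$(( $(wc -c < "$tmp/seek_out") ))  # output size including the gap

echo "skip= read block: $skip_first"
echo "seek= read block: $seek_last (output is $seek_size bytes)"
rm -rf "$tmp"
```

Applied to the earlier command, this suggests that seek=35000 read the
first ten blocks of /dev/pmem2 and never reached the injected range.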

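For clearing, here is a minimal sketch that turns a badblocks list into
the O_DIRECT writes described in the quoted reply. The function name
print_clear_cmds is hypothetical, and the input is assumed to hold
"<first_sector> <count>" pairs in 512-byte sectors, like the 35000 10
entry in the quoted sysfs output. It prints the commands instead of
running them, since zeroing those blocks discards whatever data they held:

```shell
#!/bin/sh
# Print (do not run) one zeroing direct-I/O write per badblocks entry.
# Assumption: each line of the list is "<first_sector> <count>",
# counted in 512-byte sectors.
print_clear_cmds() {
    dev=$1     # block device, e.g. /dev/pmem2
    list=$2    # badblocks file, e.g. /sys/block/pmem2/badblocks
    while read -r start count; do
        [ -n "$start" ] || continue   # ignore blank lines
        echo "dd if=/dev/zero of=$dev bs=512 count=$count seek=$start oflag=direct"
    done < "$list"
}

# Example with a scratch copy of the list rather than the live sysfs file.
list=$(mktemp)
printf '35000 10\n' > "$list"
print_clear_cmds /dev/pmem2 "$list"
rm -f "$list"
```

An empty list, as seen with --no-notify before a scrub, prints nothing.
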
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
