Re: Detecting NUMA per pmem

2018-01-09 Thread Oren Berman
Hi Dan

Which driver are you referring to?
If it is the dax driver, then it is always loaded - we see /dev/dax0.
If you mean the user-space application which called mmap on the dax
device, then that application is not running anymore.
We used this application to get the virtual address mapping (by doing
mmap on the dax device) and then, by walking /proc/<pid>/pagemap, we got
the physical address.
After that the application terminates and we pass this physical address
to our kernel thread.
Then, from the kernel thread, we access this range by using phys_to_virt
(we know the physical address, so we convert it to a virtual one).
As far as I know, once in kernel space the whole physical address range
should be mapped in the kernel page tables - on a 64-bit architecture, of
course - and thus be accessible via phys_to_virt.
Is this a wrong assumption when dealing with NVRAM?
If I know the physical address of the nvram, isn't it accessible from the
kernel using the simple phys_to_virt conversion?
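A minimal sketch of the user-space translation step described above,
using /proc/self/pagemap (the device path, mapping size, and error
handling are illustrative assumptions, not taken from this thread; it
needs root, since unprivileged pagemap reads return PFN 0):

/*
 * Sketch: mmap an assumed dax device, then translate the virtual
 * address to a physical one via /proc/self/pagemap.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static uint64_t virt_to_phys(void *vaddr)
{
	long psz = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	off_t off = ((uintptr_t)vaddr / psz) * sizeof(entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0 || pread(fd, &entry, sizeof(entry), off) != sizeof(entry)) {
		perror("pagemap");
		exit(1);
	}
	close(fd);
	if (!(entry & (1ULL << 63)))		/* bit 63: page present */
		return 0;
	/* bits 0-54 hold the page frame number */
	return (entry & ((1ULL << 55) - 1)) * psz + (uintptr_t)vaddr % psz;
}

int main(void)
{
	size_t len = 2UL << 20;			/* assumed 2 MiB mapping */
	int fd = open("/dev/dax0.0", O_RDWR);	/* assumed device node */
	void *p;

	if (fd < 0)
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	*(volatile char *)p;			/* touch to fault the page in */
	printf("phys = 0x%llx\n", (unsigned long long)virt_to_phys(p));
	return 0;
}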

Thanks
Oren




On 10 January 2018 at 01:05, Dan Williams  wrote:

> On Tue, Jan 9, 2018 at 2:25 PM, Oren Berman 
> wrote:
> > Hi
> >
> > I would like to know if you encountered such a problem.
> >
> > We are accessing the nvram as memory from within the kernel.
> > By mapping a dax device and reading its mapping we can learn the
> > physical address of the nvram.
> > As a result we can access this address range in the kernel by calling
> > phys_to_virt.
> > This is working in most cases, but we saw an issue where, after a
> > reboot, when trying to read the info saved on the nvram before the
> > power off, one kernel thread was able to read from this range but
> > another kernel thread got a page fault.
> >
> > This is not recreated very easily and we need to run many reboot
> > sequences to get this failure again.
> > Are you aware of any mapping issues of nvram to kernel space?
>
> When are you using phys_to_virt()? That will only return a valid
> virtual address as long as the driver is loaded. It sounds like you
> may be losing a race with the driver setting up or tearing down the
> mappings.
>

Re: Detecting NUMA per pmem

2018-01-09 Thread Dan Williams
On Tue, Jan 9, 2018 at 2:25 PM, Oren Berman  wrote:
> Hi
>
> I would like to know if you encountered such a problem.
>
> We are accessing the nvram as memory from within the kernel.
> By mapping a dax device and reading its mapping we can learn the
> physical address of the nvram.
> As a result we can access this address range in the kernel by calling
> phys_to_virt.
> This is working in most cases, but we saw an issue where, after a
> reboot, when trying to read the info saved on the nvram before the
> power off, one kernel thread was able to read from this range but
> another kernel thread got a page fault.
>
> This is not recreated very easily and we need to run many reboot
> sequences to get this failure again.
> Are you aware of any mapping issues of nvram to kernel space?

When are you using phys_to_virt()? That will only return a valid
virtual address as long as the driver is loaded. It sounds like you
may be losing a race with the driver setting up or tearing down the
mappings.
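To illustrate the point above, a minimal sketch of doing the kernel-side
mapping explicitly instead of trusting the direct map: map the discovered
physical range with memremap() from the consuming module (the base
address, size, and module name are assumptions, not values from this
thread):

/*
 * Sketch only: explicitly map an assumed pmem physical range instead of
 * relying on phys_to_virt().  PMEM_BASE and PMEM_SIZE are placeholders
 * for whatever the user-space step discovered.
 */
#include <linux/io.h>
#include <linux/module.h>

#define PMEM_BASE 0x240000000ULL	/* assumed physical base */
#define PMEM_SIZE (16UL << 20)		/* assumed size: 16 MiB */

static void *pmem_virt;

static int __init pmem_peek_init(void)
{
	/* MEMREMAP_WB asks for a write-back cached mapping */
	pmem_virt = memremap(PMEM_BASE, PMEM_SIZE, MEMREMAP_WB);
	if (!pmem_virt)
		return -ENOMEM;
	pr_info("pmem_peek: first byte = 0x%02x\n", *(u8 *)pmem_virt);
	return 0;
}

static void __exit pmem_peek_exit(void)
{
	memunmap(pmem_virt);
}

module_init(pmem_peek_init);
module_exit(pmem_peek_exit);
MODULE_LICENSE("GPL");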


Re: Detecting NUMA per pmem

2018-01-09 Thread Oren Berman
Hi

I would like to know if you encountered such a problem.

We are accessing the nvram as memory from within the kernel.
By mapping a dax device and reading its mapping we can learn the physical
address of the nvram.
As a result we can access this address range in the kernel by calling
phys_to_virt.
This is working in most cases, but we saw an issue where, after a reboot,
when trying to read the info saved on the nvram before the power off, one
kernel thread was able to read from this range but another kernel thread
got a page fault.

This is not recreated very easily and we need to run many reboot sequences
to get this failure again.
Are you aware of any mapping issues of nvram to kernel space?

Thanks for any suggestions you might have.
BR
Oren

On 31 December 2017 at 10:23, Yigal Korman  wrote:

> You can try to force a legacy pmem device with memmap=XX!YY kernel
> parameter.
>
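For example (an illustrative value, not taken from this thread), booting
with:

    memmap=4G!12G

marks the 4 GiB of RAM starting at the 12 GiB physical offset as an e820
type-12 region, which the kernel then exposes as a legacy /dev/pmem*
device.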
> On Thu, Dec 28, 2017 at 8:16 PM, Dan Williams 
> wrote:
>
>> Type-7 only tells the kernel to reserve the memory range; the NFIT
>> carves that reservation into pmem devices. Type-12 skips the
>> reservation and creates a pmem device directly. There is no workaround
>> if the platform only has a BIOS that produces a type-7 range without an
>> NFIT.
>>
>>
>> On Thursday, December 28, 2017, Oren Berman 
>> wrote:
>>
>> > Thanks Dan.
>> > I understand the shortcomings of using legacy mode, but currently my
>> > problem is that type 12 is detected and I can use dax even in legacy
>> > mode, but for some reason type 7 is not. Is there a way to force it
>> > to be treated as legacy as well?
>> > The reason I am asking is that I am not sure I can change my BIOS,
>> > and I know at least that type 12 NVDIMM is working for me.
>> >
>> > BR
>> > Oren
>> >
>> >
>> >
>> > On 28 December 2017 at 11:14, Dan Williams 
>> > wrote:
>> >
>> > > [sent from my phone, forgive formatting]
>> > >
>> > > Your BIOS would need to put SPA range entries in the ACPI NFIT. The
>> > > problem with legacy pmem ranges in the e820 table is that it omits
>> > > critical details like battery status and whether the platform
>> > > supports flushing memory controller buffers at power loss (ADR).
>> > >
>> > > The NFIT can also reliably communicate NUMA information for NVDIMMs
>> > > that e820 does not.
>> > >
>> > > On Wednesday, December 27, 2017, Oren Berman 
>> > > wrote:
>> > >
>> > >> Hi
>> > >>
>> > >> I have a question regarding NVDIMM detection.
>> > >>
>> > >> When we are working with an NVDIMM of type 12, it is detected by
>> > >> Linux in legacy mode and we can access it as a pmem or dax device.
>> > >> We have an e820 BIOS.
>> > >>
>> > >> When we are using a type 7 NVDIMM, it is reported by Linux as
>> > >> type 7 persistent memory, but there is no pmem or dax device
>> > >> created. The Linux kernel identifies this memory in the e820 table
>> > >> but does not trigger an nvdimm probe for it.
>> > >> Do you know what could be the cause? Is there a workaround for
>> > >> that? Can it still be treated as legacy mode so we can access it
>> > >> through a pmem/dax device?
>> > >>
>> > >> BR
>> > >> Oren Berman
>> > >>
>> > >> On 22 October 2017 at 16:52, Dan Williams 
>> > >> wrote:
>> > >>
>> > >> > On Sun, Oct 22, 2017 at 4:33 AM, Oren Berman <
>> o...@lightbitslabs.com>
>> > >> > wrote:
>> > >> > > Hi Ross
>> > >> > >
>> > >> > > Thanks for the speedy reply. I am also adding the public list to
>> > this
>> > >> > > thread as you suggested.
>> > >> > >
>> > >> > > We have tried to dump the SPA table and this is what we get:
>> > >> > >
>> > >> > > /*
>> > >> > >  * Intel ACPI Component Architecture
>> > >> > >  * AML/ASL+ Disassembler version 20160108-64
>> > >> > >  * Copyright (c) 2000 - 2016 Intel Corporation
>> > >> > >  *
>> > >> > >  * Disassembly of NFIT, Sun Oct 22 10:46:19 2017
>> > >> > >  *
>> > >> > >  * ACPI Data Table [NFIT]
>> > >> > >  *
>> > >> > >  * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue
>> > >> > >  */
>> > >> > >
>> > >> > > [000h 0000   4]              Signature : "NFIT"  [NVDIMM Firmware Interface Table]
>> > >> > > [004h 0004   4]           Table Length : 00000028
>> > >> > > [008h 0008   1]               Revision : 01
>> > >> > > [009h 0009   1]               Checksum : B2
>> > >> > > [00Ah 0010   6]                 Oem ID : "SUPERM"
>> > >> > > [010h 0016   8]           Oem Table ID : "SMCI--MB"
>> > >> > > [018h 0024   4]           Oem Revision : 00000001
>> > >> > > [01Ch 0028   4]        Asl Compiler ID : " "
>> > >> > > [020h 0032   4]  Asl Compiler Revision : 00000001
>> > >> > >
>> > >> > > [024h 0036   4]               Reserved : 00000000
>> > >> > >
>> > >> > > Raw Table Data: Length 40 (0x28)
>> > >> > >
>> > >> > >   0000: 4E 46 49 54 28 00 00 00 01 B2 53 55 50
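For reference, a dump like the one above can typically be produced with
the ACPICA tools (one common approach; package names vary by distro):

    cat /sys/firmware/acpi/tables/NFIT > nfit.dat
    iasl -d nfit.dat    # writes the decoded fields to nfit.dsl

Note that a table length of 0x28 is just the ACPI header, i.e. this NFIT
carries no SPA range (or any other) sub-structures, which matches the
"type 7 but no pmem device" symptom discussed above.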

Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-09 Thread Christoph Hellwig
On Mon, Jan 08, 2018 at 12:01:16PM -0700, Jason Gunthorpe wrote:
> > So I very much disagree about where to place that workaround - the
> > RDMA code is exactly the right place.
> 
> But why? RDMA is using core code to do this. It uses dma_ops in struct
> device and it uses normal dma_map SG. How is it RDMA's problem that
> some PCI drivers provide strange DMA ops?

Because RDMA uses dma_virt_ops to pretend a device does DMA when in
fact it doesn't - at least not for the exact data mapped, or, as far as
I can tell, often not at all - e.g. the QIB/HFI devices might do MMIO
accesses for the data mapped.

This whole problem only exists because a few RDMA HCA drivers lie
with the help of the RDMA core.
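For context, roughly what dma_virt_ops amounts to (a paraphrase of the
idea behind lib/dma-virt.c as of this era, not an exact copy of the
kernel source): the "DMA address" it hands back is simply the kernel
virtual address, so no device-visible mapping is ever created.

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Paraphrased sketch: the "mapping" is just the CPU virtual address. */
static dma_addr_t dma_virt_map_page(struct device *dev, struct page *page,
				    unsigned long offset, size_t size,
				    enum dma_data_direction dir,
				    unsigned long attrs)
{
	return (uintptr_t)(page_address(page) + offset);
}

static int dma_virt_map_sg(struct device *dev, struct scatterlist *sgl,
			   int nents, enum dma_data_direction dir,
			   unsigned long attrs)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		sg_dma_address(sg) = (uintptr_t)sg_virt(sg);
		sg_dma_len(sg) = sg->length;
	}
	return nents;
}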


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-09 Thread Christoph Hellwig
On Mon, Jan 08, 2018 at 12:05:57PM -0700, Logan Gunthorpe wrote:
> Ok, so if we shouldn't touch the dma_map infrastructure how should the 
> workaround to opt-out HFI and QIB look? I'm not that familiar with the RDMA 
> code.

We can add a no_p2p quirk, I'm just not sure what the right place
for it is.  As said, the devices don't really mind P2P at the PCIe
level; it's just that their RDMA implementation is really strange.

So I'd be tempted to do it in RDMA.
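Purely as a hypothetical illustration of such a quirk (the names below
are invented for this sketch and do not exist in the tree), the idea
would be a per-device capability bit that the RW API checks before
attempting a P2P mapping:

#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical sketch: "no_p2p" is an invented flag, not an existing
 * field.  Drivers whose DMA mapping is software-only (qib, hfi1) would
 * set it, and the core would refuse P2P transfers for them. */
struct rdma_dev_quirks {
	bool no_p2p;		/* device cannot target peer PCI BARs */
};

static inline int rdma_check_p2p_allowed(const struct rdma_dev_quirks *q,
					 bool want_p2p)
{
	if (want_p2p && q->no_p2p)
		return -EOPNOTSUPP;	/* caller falls back to system memory */
	return 0;
}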


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-09 Thread Christoph Hellwig
On Mon, Jan 08, 2018 at 12:49:50PM -0700, Jason Gunthorpe wrote:
> Pretty sure P2P capable IOMMU hardware exists.
> 
> With SoCs we also have the scenario that a DMA originating from an
> on-die device wishes to target an off-die PCI BAR (through the IOMMU);
> that definitely exists today, and people care about it :)

Then people will have to help and contribute support for it.


Re: [PATCH v3 2/4] KVM: X86: Fix loss of exception which has not yet injected

2018-01-09 Thread Haozhong Zhang
On 01/09/18 00:57 -0800, Liran Alon wrote:
> 
> - haozhong.zh...@intel.com wrote:
> 
> > On 01/07/18 00:26 -0700, Ross Zwisler wrote:
> > > On Wed, Aug 23, 2017 at 10:21 PM, Wanpeng Li 
> > wrote:
> > > > From: Wanpeng Li 
> > > >
> > > > vmx_complete_interrupts() assumes that the exception is always
> > injected,
> > > > so it would be dropped by kvm_clear_exception_queue(). This patch
> > separates
> > > > exception.pending from exception.injected, exception.inject
> > represents the
> > > > exception is injected or the exception should be reinjected due to
> > vmexit
> > > > occurs during event delivery in VMX non-root operation.
> > exception.pending
> > > > represents the exception is queued and will be cleared when
> > injecting the
> > > > exception to the guest. So exception.pending and
> > exception.injected can
> > > > cooperate to guarantee exception will not be lost.
> > > >
> > > > Reported-by: Radim Krčmář 
> > > > Cc: Paolo Bonzini 
> > > > Cc: Radim Krčmář 
> > > > Signed-off-by: Wanpeng Li 
> > > > ---
> > > 
> > > I'm seeing a regression in my QEMU based NVDIMM testing system, and
> > I
> > > bisected it to this commit.
> > > 
> > > The behavior I'm seeing is that heavy I/O to simulated NVDIMMs in
> > > multiple virtual machines causes the QEMU guests to receive double
> > > faults, crashing them.  Here's an example backtrace:
> > > 
> > > [ 1042.653816] PANIC: double fault, error_code: 0x0
> > > [ 1042.654398] CPU: 2 PID: 30257 Comm: fsstress Not tainted
> > 4.15.0-rc5 #1
> > > [ 1042.655169] Hardware name: QEMU Standard PC (i440FX + PIIX,
> > 1996),
> > > BIOS 1.10.2-2.fc27 04/01/2014
> > > [ 1042.656121] RIP: 0010:memcpy_flushcache+0x4d/0x180
> > > [ 1042.656631] RSP: 0018:ac098c7d3808 EFLAGS: 00010286
> > > [ 1042.657245] RAX: ac0d18ca8000 RBX: 0fe0 RCX:
> > ac0d18ca8000
> > > [ 1042.658085] RDX: 921aaa5df000 RSI: 921aaa5e RDI:
> > 19f26e6c9000
> > > [ 1042.658802] RBP: 1000 R08:  R09:
> > 
> > > [ 1042.659503] R10:  R11:  R12:
> > 921aaa5df020
> > > [ 1042.660306] R13: ac0d18ca8000 R14: f4c102a977c0 R15:
> > 1000
> > > [ 1042.661132] FS:  7f71530b90c0()
> > GS:921b3b28()
> > > knlGS:
> > > [ 1042.662051] CS:  0010 DS:  ES:  CR0: 80050033
> > > [ 1042.662528] CR2: 01156002 CR3: 00012a936000 CR4:
> > 06e0
> > > [ 1042.663093] Call Trace:
> > > [ 1042.663329]  write_pmem+0x6c/0xa0 [nd_pmem]
> > > [ 1042.663668]  pmem_do_bvec+0x15f/0x330 [nd_pmem]
> > > [ 1042.664056]  ? kmem_alloc+0x61/0xe0 [xfs]
> > > [ 1042.664393]  pmem_make_request+0xdd/0x220 [nd_pmem]
> > > [ 1042.664781]  generic_make_request+0x11f/0x300
> > > [ 1042.665135]  ? submit_bio+0x6c/0x140
> > > [ 1042.665436]  submit_bio+0x6c/0x140
> > > [ 1042.665754]  ? next_bio+0x18/0x40
> > > [ 1042.666025]  ? _cond_resched+0x15/0x40
> > > [ 1042.666341]  submit_bio_wait+0x53/0x80
> > > [ 1042.666804]  blkdev_issue_zeroout+0xdc/0x210
> > > [ 1042.667336]  ? __dax_zero_page_range+0xb5/0x140
> > > [ 1042.667810]  __dax_zero_page_range+0xb5/0x140
> > > [ 1042.668197]  ? xfs_file_iomap_begin+0x2bd/0x8e0 [xfs]
> > > [ 1042.668611]  iomap_zero_range_actor+0x7c/0x1b0
> > > [ 1042.668974]  ? iomap_write_actor+0x170/0x170
> > > [ 1042.669318]  iomap_apply+0xa4/0x110
> > > [ 1042.669616]  ? iomap_write_actor+0x170/0x170
> > > [ 1042.669958]  iomap_zero_range+0x52/0x80
> > > [ 1042.670255]  ? iomap_write_actor+0x170/0x170
> > > [ 1042.670616]  xfs_setattr_size+0xd4/0x330 [xfs]
> > > [ 1042.670995]  xfs_ioc_space+0x27e/0x2f0 [xfs]
> > > [ 1042.671332]  ? terminate_walk+0x87/0xf0
> > > [ 1042.671662]  xfs_file_ioctl+0x862/0xa40 [xfs]
> > > [ 1042.672035]  ? _copy_to_user+0x22/0x30
> > > [ 1042.672346]  ? cp_new_stat+0x150/0x180
> > > [ 1042.672663]  do_vfs_ioctl+0xa1/0x610
> > > [ 1042.672960]  ? SYSC_newfstat+0x3c/0x60
> > > [ 1042.673264]  SyS_ioctl+0x74/0x80
> > > [ 1042.673661]  entry_SYSCALL_64_fastpath+0x1a/0x7d
> > > [ 1042.674239] RIP: 0033:0x7f71525a2dc7
> > > [ 1042.674681] RSP: 002b:7ffef97aa778 EFLAGS: 0246
> > ORIG_RAX:
> > > 0010
> > > [ 1042.675664] RAX: ffda RBX: 000112bc RCX:
> > 7f71525a2dc7
> > > [ 1042.676592] RDX: 7ffef97aa7a0 RSI: 40305825 RDI:
> > 0003
> > > [ 1042.677520] RBP: 0009 R08: 0045 R09:
> > 7ffef97aa78c
> > > [ 1042.678442] R10:  R11: 0246 R12:
> > 0003
> > > [ 1042.679330] R13: 00019e38 R14: 000fcca7 R15:
> > 0016
> > > [ 1042.680216] Code: 48 8d 5d e0 4c 8d 62 20 48 89 cf 48 29 d7 48
> > 89
> > > de 48 83 e6 e0 4c 01 e6 48 8d 04 17 4c 8b 02 4c 8b 4a 08 4c 8b 52
> > 10
> > > 4c 8b 5a 18 <4c> 0f c3 00 4c 0f 

Re: [PATCH v3 2/4] KVM: X86: Fix loss of exception which has not yet injected

2018-01-09 Thread Liran Alon

- haozhong.zh...@intel.com wrote:

> On 01/07/18 00:26 -0700, Ross Zwisler wrote:
> > On Wed, Aug 23, 2017 at 10:21 PM, Wanpeng Li 
> wrote:
> > > From: Wanpeng Li 
> > >
> > > vmx_complete_interrupts() assumes that the exception is always
> injected,
> > > so it would be dropped by kvm_clear_exception_queue(). This patch
> separates
> > > exception.pending from exception.injected, exception.inject
> represents the
> > > exception is injected or the exception should be reinjected due to
> vmexit
> > > occurs during event delivery in VMX non-root operation.
> exception.pending
> > > represents the exception is queued and will be cleared when
> injecting the
> > > exception to the guest. So exception.pending and
> exception.injected can
> > > cooperate to guarantee exception will not be lost.
> > >
> > > Reported-by: Radim Krčmář 
> > > Cc: Paolo Bonzini 
> > > Cc: Radim Krčmář 
> > > Signed-off-by: Wanpeng Li 
> > > ---
> > 
> > I'm seeing a regression in my QEMU based NVDIMM testing system, and
> I
> > bisected it to this commit.
> > 
> > The behavior I'm seeing is that heavy I/O to simulated NVDIMMs in
> > multiple virtual machines causes the QEMU guests to receive double
> > faults, crashing them.  Here's an example backtrace:
> > 
> > [ 1042.653816] PANIC: double fault, error_code: 0x0
> > [ 1042.654398] CPU: 2 PID: 30257 Comm: fsstress Not tainted
> 4.15.0-rc5 #1
> > [ 1042.655169] Hardware name: QEMU Standard PC (i440FX + PIIX,
> 1996),
> > BIOS 1.10.2-2.fc27 04/01/2014
> > [ 1042.656121] RIP: 0010:memcpy_flushcache+0x4d/0x180
> > [ 1042.656631] RSP: 0018:ac098c7d3808 EFLAGS: 00010286
> > [ 1042.657245] RAX: ac0d18ca8000 RBX: 0fe0 RCX:
> ac0d18ca8000
> > [ 1042.658085] RDX: 921aaa5df000 RSI: 921aaa5e RDI:
> 19f26e6c9000
> > [ 1042.658802] RBP: 1000 R08:  R09:
> 
> > [ 1042.659503] R10:  R11:  R12:
> 921aaa5df020
> > [ 1042.660306] R13: ac0d18ca8000 R14: f4c102a977c0 R15:
> 1000
> > [ 1042.661132] FS:  7f71530b90c0()
> GS:921b3b28()
> > knlGS:
> > [ 1042.662051] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 1042.662528] CR2: 01156002 CR3: 00012a936000 CR4:
> 06e0
> > [ 1042.663093] Call Trace:
> > [ 1042.663329]  write_pmem+0x6c/0xa0 [nd_pmem]
> > [ 1042.663668]  pmem_do_bvec+0x15f/0x330 [nd_pmem]
> > [ 1042.664056]  ? kmem_alloc+0x61/0xe0 [xfs]
> > [ 1042.664393]  pmem_make_request+0xdd/0x220 [nd_pmem]
> > [ 1042.664781]  generic_make_request+0x11f/0x300
> > [ 1042.665135]  ? submit_bio+0x6c/0x140
> > [ 1042.665436]  submit_bio+0x6c/0x140
> > [ 1042.665754]  ? next_bio+0x18/0x40
> > [ 1042.666025]  ? _cond_resched+0x15/0x40
> > [ 1042.666341]  submit_bio_wait+0x53/0x80
> > [ 1042.666804]  blkdev_issue_zeroout+0xdc/0x210
> > [ 1042.667336]  ? __dax_zero_page_range+0xb5/0x140
> > [ 1042.667810]  __dax_zero_page_range+0xb5/0x140
> > [ 1042.668197]  ? xfs_file_iomap_begin+0x2bd/0x8e0 [xfs]
> > [ 1042.668611]  iomap_zero_range_actor+0x7c/0x1b0
> > [ 1042.668974]  ? iomap_write_actor+0x170/0x170
> > [ 1042.669318]  iomap_apply+0xa4/0x110
> > [ 1042.669616]  ? iomap_write_actor+0x170/0x170
> > [ 1042.669958]  iomap_zero_range+0x52/0x80
> > [ 1042.670255]  ? iomap_write_actor+0x170/0x170
> > [ 1042.670616]  xfs_setattr_size+0xd4/0x330 [xfs]
> > [ 1042.670995]  xfs_ioc_space+0x27e/0x2f0 [xfs]
> > [ 1042.671332]  ? terminate_walk+0x87/0xf0
> > [ 1042.671662]  xfs_file_ioctl+0x862/0xa40 [xfs]
> > [ 1042.672035]  ? _copy_to_user+0x22/0x30
> > [ 1042.672346]  ? cp_new_stat+0x150/0x180
> > [ 1042.672663]  do_vfs_ioctl+0xa1/0x610
> > [ 1042.672960]  ? SYSC_newfstat+0x3c/0x60
> > [ 1042.673264]  SyS_ioctl+0x74/0x80
> > [ 1042.673661]  entry_SYSCALL_64_fastpath+0x1a/0x7d
> > [ 1042.674239] RIP: 0033:0x7f71525a2dc7
> > [ 1042.674681] RSP: 002b:7ffef97aa778 EFLAGS: 0246
> ORIG_RAX:
> > 0010
> > [ 1042.675664] RAX: ffda RBX: 000112bc RCX:
> 7f71525a2dc7
> > [ 1042.676592] RDX: 7ffef97aa7a0 RSI: 40305825 RDI:
> 0003
> > [ 1042.677520] RBP: 0009 R08: 0045 R09:
> 7ffef97aa78c
> > [ 1042.678442] R10:  R11: 0246 R12:
> 0003
> > [ 1042.679330] R13: 00019e38 R14: 000fcca7 R15:
> 0016
> > [ 1042.680216] Code: 48 8d 5d e0 4c 8d 62 20 48 89 cf 48 29 d7 48
> 89
> > de 48 83 e6 e0 4c 01 e6 48 8d 04 17 4c 8b 02 4c 8b 4a 08 4c 8b 52
> 10
> > 4c 8b 5a 18 <4c> 0f c3 00 4c 0f c3 48 08 4c 0f c3 50 10 4c 0f c3 58
> 18
> > 48 83
> > 
> > This appears to be independent of both the guest kernel version (this
> > backtrace has v4.15.0-rc5, but I've seen it with other kernels) and of
> > the host QEMU version (mine happens