Re: [PATCH v3 02/13] dax: require 'struct page' for filesystem dax

2017-10-22 Thread Martin Schwidefsky
On Fri, 20 Oct 2017 18:29:33 +0200
Christoph Hellwig  wrote:

> On Fri, Oct 20, 2017 at 08:23:02AM -0700, Dan Williams wrote:
> > Yes, however it seems these drivers / platforms have been living with
> > the lack of struct page for a long time. So they either don't use DAX,
> > or they have a constrained use case that never triggers
> > get_user_pages(). If it is the latter then they could introduce a new
> > configuration option that bypasses the pfn_t_devmap() check in
> > bdev_dax_supported() and fix up the get_user_pages() paths to fail.
> > So, I'd like to understand how these drivers have been using DAX
> > support without struct page to see if we need a workaround or we can
> > go ahead delete this support. If the usage is limited to
> > execute-in-place perhaps we can do a constrained ->direct_access() for
> > just that case.  
> 
> For axonram I doubt anyone is using it any more - it was only ever for
> the IBM Cell blades, which were produced in rather limited numbers.
> And Cell basically seems to be dead as far as I can tell.
> 
> For S/390 Martin might be able to help out with what the status of xpram
> in general and DAX support in particular is.

This goes back to the time when DAX was called XIP. The initial design
point was *not* to have struct pages for a large read-only memory
area. There is a block device driver for z/VM that maps a DCSS segment
somewhere in memory (no struct page!) with e.g. the complete /usr
filesystem. The xpram driver is a different beast and has nothing to
do with XIP/DAX.

Now, there are very few users of the dcssblk driver out there, if any.
The idea of saving a few megabytes for /usr never really took off.

We have to look at our get_user_pages() implementation to see how hard
it would be to make it fail if the target address is for an area without
struct pages.
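
To illustrate the kind of check I have in mind (purely a sketch, not the
actual s390 gup code; the helper name is made up):

#include <linux/mm.h>

/*
 * Illustrative only: refuse the lookup when the target pfn has no
 * struct page behind it, e.g. a dcssblk XIP mapping that lies outside
 * the memmap. get_user_pages() would then fail cleanly for such areas.
 */
static int gup_pfn_to_page_checked(unsigned long pfn, struct page **pagep)
{
	if (!pfn_valid(pfn))
		return -EFAULT;	/* no struct page for this pfn */

	*pagep = pfn_to_page(pfn);
	get_page(*pagep);
	return 0;
}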

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Fwd: How to establish a systematic and effective EHS management system

2017-10-22 Thread 郭斧
Hello, linux-nvdimm!
EHS managers need to improve not only their professional technical knowledge but also their soft skills in people management; once each task's purpose is fully understood, the work becomes far more effective and they gain greater room for their own development.
The course outline is in the attachment; please review it.


Fwd: Learning market-management methods and tools through hands-on experience

2017-10-22 Thread 江执
linux-nvdimm
Hello! The course outline is in the attachment; please review it.


Re: Enabling peer to peer device transactions for PCIe devices

2017-10-22 Thread Logan Gunthorpe

On 22/10/17 12:13 AM, Petrosyan, Ludwig wrote:
> But at first sight it has to be simple:
> PCIe write transactions are address routed, so if the other endpoint's
> address is written into the TLP header, the TLP has to be routed (by the PCIe
> switch) to that endpoint. A DMA read from the endpoint is really a write
> transaction issued by the endpoint; usually (with the Xilinx core) one starts
> a DMA by writing the destination address into the endpoint's DMA control
> register. So I have changed the device driver to set this register to the
> physical address of the other endpoint (the resource start obtained for the
> other endpoint, which is the same address I can see in the lspci -s output
> for the switch port, under "memory behind bridge"). Now the endpoint should
> start sending write TLPs with the other endpoint's address in the TLP
> header.
> But this is not working (I want to understand why ...), although I can see
> that the first address of the destination endpoint is changed (to the wrong
> value 0xFF).
> Now I want to try preparing the DMA buffer in the driver of one endpoint,
> but using the physical address of the other endpoint.
> It may never work, but I want to understand why and where my error is ...

Hmm, well if I understand you correctly it sounds like, in theory, it
should work. But there could be any number of reasons why it does not.
You may need to get a hold of a PCIe analyzer to figure out what's
actually going on.

Logan


Re: Detecting NUMA per pmem

2017-10-22 Thread Dan Williams
On Sun, Oct 22, 2017 at 4:33 AM, Oren Berman  wrote:
> Hi Ross
>
> Thanks for the speedy reply. I am also adding the public list to this
> thread as you suggested.
>
> We have tried to dump the SPA table and this is what we get:
>
> /*
>  * Intel ACPI Component Architecture
>  * AML/ASL+ Disassembler version 20160108-64
>  * Copyright (c) 2000 - 2016 Intel Corporation
>  *
>  * Disassembly of NFIT, Sun Oct 22 10:46:19 2017
>  *
>  * ACPI Data Table [NFIT]
>  *
>  * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue
>  */
>
> [000h    4]Signature : "NFIT"[NVDIMM Firmware
> Interface Table]
> [004h 0004   4] Table Length : 0028
> [008h 0008   1] Revision : 01
> [009h 0009   1] Checksum : B2
> [00Ah 0010   6]   Oem ID : "SUPERM"
> [010h 0016   8] Oem Table ID : "SMCI--MB"
> [018h 0024   4] Oem Revision : 0001
> [01Ch 0028   4]  Asl Compiler ID : " "
> [020h 0032   4]Asl Compiler Revision : 0001
>
> [024h 0036   4] Reserved : 
>
> Raw Table Data: Length 40 (0x28)
>
>   : 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D  // NFIT(.SUPERM
>   0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00  // SMCI--MB
>   0020: 01 00 00 00 00 00 00 00
>
> As you can see the memory region info is missing.
>
> This specific check was done on a Supermicro server.
> We also performed a BIOS update but the results were the same.
>
> As said before, the pmem devices are detected correctly and we verified
> that they correspond to different NUMA nodes using the PCM utility. However,
> Linux still reports both pmem devices to be on the same NUMA node - node 0.
>
> If this information is missing, why are the pmem devices and address ranges
> still detected correctly?

I suspect your BIOS might be using E820-type-12 to describe the pmem
ranges, which is not compliant with the ACPI specification and would
need a BIOS change.

> Is there another table that we need to check?

You can dump /proc/iomem.  If it shows "Persistent Memory (legacy)"
then the BIOS is using the E820-type-12 description scheme which does
not include NUMA information.
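
For example, something along these lines (illustrative only) prints which
description scheme your BIOS used for each pmem range:

/* Illustrative userspace check: classify pmem ranges from /proc/iomem. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/iomem", "r"); /* run as root to see addresses */

	if (!f) {
		perror("/proc/iomem");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "Persistent Memory (legacy)"))
			printf("E820 type-12 (no NUMA info): %s", line);
		else if (strstr(line, "Persistent Memory"))
			printf("NFIT-described range:       %s", line);
	}
	fclose(f);
	return 0;
}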


Re: Detecting NUMA per pmem

2017-10-22 Thread Oren Berman
Hi Ross

Thanks for the speedy reply. I am also adding the public list to this
thread as you suggested.

We have tried to dump the SPA table and this is what we get:

/*
 * Intel ACPI Component Architecture
 * AML/ASL+ Disassembler version 20160108-64
 * Copyright (c) 2000 - 2016 Intel Corporation
 *
 * Disassembly of NFIT, Sun Oct 22 10:46:19 2017
 *
 * ACPI Data Table [NFIT]
 *
 * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue
 */

[000h    4]Signature : "NFIT"[NVDIMM Firmware
Interface Table]
[004h 0004   4] Table Length : 0028
[008h 0008   1] Revision : 01
[009h 0009   1] Checksum : B2
[00Ah 0010   6]   Oem ID : "SUPERM"
[010h 0016   8] Oem Table ID : "SMCI--MB"
[018h 0024   4] Oem Revision : 0001
[01Ch 0028   4]  Asl Compiler ID : " "
[020h 0032   4]Asl Compiler Revision : 0001

[024h 0036   4] Reserved : 

Raw Table Data: Length 40 (0x28)

  : 4E 46 49 54 28 00 00 00 01 B2 53 55 50 45 52 4D  // NFIT(.SUPERM
  0010: 53 4D 43 49 2D 2D 4D 42 01 00 00 00 01 00 00 00  // SMCI--MB
  0020: 01 00 00 00 00 00 00 00

As you can see the memory region info is missing.

This specific check was done on a Supermicro server.
We also performed a BIOS update but the results were the same.

As said before, the pmem devices are detected correctly and we verified
that they correspond to different NUMA nodes using the PCM utility. However,
Linux still reports both pmem devices to be on the same NUMA node - node 0.

If this information is missing, why are the pmem devices and address ranges
still detected correctly?
Is there another table that we need to check?

I also ran dmidecode and the NVDIMMs are listed (we tested with
Netlist NVDIMMs). I can also see the bank locator showing P0 and P1, which I
think indicates the NUMA node. Here is an example:

Handle 0x002D, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002A
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA3
Bank Locator: P0_Node0_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66F50006
Asset Tag: P1-DIMMA3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown


Handle 0x003B, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0038
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMME3
Bank Locator: P1_Node1_Channel0_Dimm2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Netlist
Serial Number: 66B50010
Asset Tag: P2-DIMME3_AssetTag (date:16/42)
Part Number: NV3A74SBT20-000
Rank: 1
Configured Clock Speed: 1600 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Did you encounter such a case? We would appreciate any insight you might
have.

BR
Oren Berman


On 20 October 2017 at 19:22, Ross Zwisler wrote:

> On Thu, Oct 19, 2017 at 06:12:24PM +0300, Oren Berman wrote:
> >Hi Ross
> >My name is Oren Berman and I am a senior developer at lightbitslabs.
> >We are working with NVDIMMs but we encountered a problem: the kernel
> >does not seem to detect the NUMA id per pmem device.
> >It always reports NUMA node 0 although we have NVDIMM devices on both
> >nodes.
> >We checked that it always returns 0 from sysfs and also from retrieving
> >the pmem device in the kernel and calling dev_to_node.
> >The result is always 0 for both pmem0 and pmem1.
> >In order to make sure that both NUMA sockets are indeed used, we ran
> >Intel's PCM utility. We verified that writing to pmem0 increases socket 0
> >utilization and writing to pmem1 increases socket 1 utilization, so the hw
> >works properly.
> >Only the detection seems to be invalid.
> >Did you encounter such a problem?
> >We are using kernel version 4.9 - are you aware of any fix for this issue
> >or a workaround that we can use?
> >Are we missing something?
> >Thanks for any help you can give us.
> >BR
> >Oren Berman
>
> Hi Oren,
>
> My first guess is that your platform isn't properly filling out the
> "proximity domain" field in the NFIT SPA table.
>
> See section 5.2.25.2 in ACPI 6.2:
> http://uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>
> Here's how to check that:
>
>   # cd /tmp
>   # cp /sys/firmware/acpi/tables/NFIT .
>   # iasl NFIT
>
>   Intel ACPI Component Architecture
>   ASL+ Optimizing Compiler version 20160831-64
>   Copyright (c) 2000 - 2016 Intel Corporation
>
>   Binary file appears to be a valid ACPI table, disassembling
>   Input file NFIT, Length 0xE0 (224) 
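
(For reference, the "proximity domain" field mentioned above lives in the
NFIT SPA range structure. A rough sketch of its layout, following the
kernel's struct acpi_nfit_system_address as far as I understand it -- the
authoritative definition is in include/acpi/actbl1.h and ACPI 6.2 section
5.2.25.2:)

#include <linux/types.h>

struct acpi_nfit_header {
	u16 type;
	u16 length;
};

struct acpi_nfit_system_address {
	struct acpi_nfit_header header;	/* type 0: SPA range structure */
	u16 range_index;
	u16 flags;
	u32 reserved;			/* must be zero */
	u32 proximity_domain;		/* NUMA proximity domain of the range */
	u8 range_guid[16];		/* address range type GUID (e.g. PMEM) */
	u64 address;			/* system physical address base */
	u64 length;
	u64 memory_mapping;		/* memory mapping attributes */
};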

Re: Enabling peer to peer device transactions for PCIe devices

2017-10-22 Thread Petrosyan, Ludwig
Hello Logan

Thank you very much for responding.
It could be that what I have done is stupid...
But at first sight it has to be simple:
PCIe write transactions are address routed, so if the other endpoint's
address is written into the TLP header, the TLP has to be routed (by the PCIe
switch) to that endpoint. A DMA read from the endpoint is really a write
transaction issued by the endpoint; usually (with the Xilinx core) one starts
a DMA by writing the destination address into the endpoint's DMA control
register. So I have changed the device driver to set this register to the
physical address of the other endpoint (the resource start obtained for the
other endpoint, which is the same address I can see in the lspci -s output
for the switch port, under "memory behind bridge"). Now the endpoint should
start sending write TLPs with the other endpoint's address in the TLP
header.
But this is not working (I want to understand why ...), although I can see
that the first address of the destination endpoint is changed (to the wrong
value 0xFF).
Now I want to try preparing the DMA buffer in the driver of one endpoint,
but using the physical address of the other endpoint.
It may never work, but I want to understand why and where my error is ...
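
Roughly, the idea looks like this (a simplified sketch only -- the register
offsets are placeholders, not the real Xilinx DMA core layout, and the real
driver is different):

#include <linux/pci.h>
#include <linux/io.h>
#include <linux/kernel.h>

/* Placeholder register offsets -- the real DMA core layout differs. */
#define DMA_DST_ADDR_LO	0x20
#define DMA_DST_ADDR_HI	0x24
#define DMA_CTRL	0x00
#define DMA_CTRL_START	0x01

/*
 * Point endpoint A's DMA engine at the bus address of endpoint B's BAR,
 * so the write TLPs are routed by the switch directly to the peer and
 * never touch system RAM. Note: treating pci_resource_start() as the bus
 * address only works if no IOMMU remapping sits between the endpoints.
 */
static void dma_write_to_peer(void __iomem *dma_regs, struct pci_dev *peer,
			      int peer_bar)
{
	resource_size_t dst = pci_resource_start(peer, peer_bar);

	writel(lower_32_bits(dst), dma_regs + DMA_DST_ADDR_LO);
	writel(upper_32_bits(dst), dma_regs + DMA_DST_ADDR_HI);
	writel(DMA_CTRL_START, dma_regs + DMA_CTRL);
}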

with best regards

Ludwig

- Original Message -
From: "Logan Gunthorpe" 
To: "Ludwig Petrosyan" , "Deucher, Alexander" 
, "linux-ker...@vger.kernel.org" 
, "linux-r...@vger.kernel.org" 
, "linux-nvdimm@lists.01.org" 
, "linux-me...@vger.kernel.org" 
, "dri-de...@lists.freedesktop.org" 
, "linux-...@vger.kernel.org" 

Cc: "Bridgman, John" , "Kuehling, Felix" 
, "Sagalovitch, Serguei" , 
"Blinzer, Paul" , "Koenig, Christian" 
, "Suthikulpanit, Suravee" 
, "Sander, Ben" 
Sent: Friday, 20 October, 2017 17:48:58
Subject: Re: Enabling peer to peer device transactions for PCIe devices

Hi Ludwig,

P2P transactions are still *very* experimental at the moment and take a 
lot of expertise to get working in a general setup. It will definitely 
require changes to the kernel, including the drivers of all the devices 
you are trying to make talk to each other. If you're up for it, you can 
take a look at:

https://github.com/sbates130272/linux-p2pmem/

Which has our current rough work making NVMe fabrics use p2p transactions.

Logan

On 10/20/2017 6:36 AM, Ludwig Petrosyan wrote:
> Dear Linux kernel group
> 
> my name is Ludwig Petrosyan and I am working at DESY (Germany).
> 
> we are responsible for the control system of all accelerators at DESY.
> 
> For 7-8 years now we have been using MTCA.4 systems, with PCIe as the
> central bus.
> 
> I am mostly responsible for the Linux drivers of the AMC cards (PCIe
> endpoints).
> 
> The idea is to start using peer-to-peer transactions between PCIe endpoints
> (DMA and/or plain read/write).
> 
> Could you please advise me where to start? Is there some documentation on
> how to do it?
> 
> 
> with best regards
> 
> 
> Ludwig
> 
> 
> On 11/21/2016 09:36 PM, Deucher, Alexander wrote:
>> This is certainly not the first time this has been brought up, but I'd 
>> like to try and get some consensus on the best way to move this 
>> forward.  Allowing devices to talk directly improves performance and 
>> reduces latency by avoiding the use of staging buffers in system 
>> memory.  Also in cases where both devices are behind a switch, it 
>> avoids the CPU entirely.  Most current APIs (DirectGMA, PeerDirect, 
>> CUDA, HSA) that deal with this are pointer based.  Ideally we'd be 
>> able to take a CPU virtual address and be able to get to a physical 
>> address taking into account IOMMUs, etc.  Having struct pages for the 
>> memory would allow it to work more generally and wouldn't require as 
>> much explicit support in drivers that wanted to use it.
>> Some use cases:
>> 1. Storage devices streaming directly to GPU device memory
>> 2. GPU device memory to GPU device memory streaming
>> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
>> 4. DVB/V4L/SDI devices streaming directly to storage devices
>> Here is a relatively simple example of how this could work for 
>> testing.  This is obviously not a complete solution.
>> - Device memory will be registered with the Linux memory sub-system by 
>> creating corresponding struct page structures for device memory
>> - get_user_pages_fast() will return corresponding struct pages when 
>> CPU address points to the device memory
>> - put_page() will deal with struct pages for device memory
>> Previously proposed solutions and related proposals:
>> 1. P2P DMA
>> DMA-API/PCI map_peer_resource support for