Re: NVMe vs DMA addressing limitations

2017-01-12 Thread Christoph Hellwig
On Thu, Jan 12, 2017 at 12:56:07PM +0100, Arnd Bergmann wrote:
> That is an interesting question: We actually have the
> "DMA_ATTR_NO_KERNEL_MAPPING" for this case, and ARM implements
> it in the coherent interface, so that might be a good fit.

Yes, my WIP HMB patch uses DMA_ATTR_NO_KERNEL_MAPPING, although I'm
working on x86 at the moment, where it's a no-op.
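
For reference, the allocation side boils down to something like this (a
minimal sketch, not the actual patch; the helper name is made up):

    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>

    /* Allocate a chunk the device can DMA to, telling the DMA layer we
     * never need a CPU-visible mapping for it. */
    static void *hmb_chunk_alloc(struct device *dev, size_t size,
                                 dma_addr_t *dma_addr)
    {
            return dma_alloc_attrs(dev, size, dma_addr, GFP_KERNEL,
                                   DMA_ATTR_NO_KERNEL_MAPPING);
    }

With DMA_ATTR_NO_KERNEL_MAPPING the returned pointer is only an opaque
cookie to hand back to dma_free_attrs(); the host must never dereference
it, which matches how the HMB memory is used.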

> Implementing it in the streaming API makes no sense since we
> already have a kernel mapping here, but using a normal allocation
> (possibly with DMA_ATTR_NON_CONSISTENT or DMA_ATTR_SKIP_CPU_SYNC,
> need to check) might help on other architectures that have
> limited amounts of coherent memory and no CMA.

Thought about that - but in the end DMA_ATTR_NO_KERNEL_MAPPING implies
those, so instead of using lots of flags in the driver I'd rather fix
up more dma_ops implementations to take advantage of
DMA_ATTR_NO_KERNEL_MAPPING.
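
To illustrate what "taking advantage" of it means on the dma_ops side,
here is a deliberately simplified, hypothetical .alloc hook - not any
particular architecture's actual code:

    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>
    #include <linux/mm.h>

    static void *example_dma_alloc(struct device *dev, size_t size,
                                   dma_addr_t *handle, gfp_t gfp,
                                   unsigned long attrs)
    {
            struct page *page = alloc_pages(gfp, get_order(size));

            if (!page)
                    return NULL;
            /* Simplified: a real implementation applies bus offsets or
             * programs the IOMMU here. */
            *handle = (dma_addr_t)page_to_pfn(page) << PAGE_SHIFT;

            /* The caller promised never to touch the memory, so skip
             * the (possibly scarce) kernel remap and hand back the
             * page as an opaque cookie. */
            if (attrs & DMA_ATTR_NO_KERNEL_MAPPING)
                    return page;

            return page_address(page);      /* simplified: assumes lowmem */
    }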


Re: NVMe vs DMA addressing limitations

2017-01-12 Thread Arnd Bergmann
On Thursday, January 12, 2017 12:09:11 PM CET Sagi Grimberg wrote:
> >> Another workaround we might need is to limit the amount of concurrent DMA
> >> in the NVMe driver based on some platform quirk. The way that NVMe works,
> >> it can have very large amounts of data that is concurrently mapped into
> >> the device.
> >
> > That's not really just NVMe - other storage and network controllers can
> > also DMA map giant amounts of memory.  There are a couple of aspects to it:
> >
> >  - dma coherent memory - right now NVMe doesn't use too much of it,
> >    but upcoming low-end NVMe controllers will soon start to require
> >    fairly large amounts of it for the host memory buffer feature that
> >    allows for DRAM-less controller designs.  As an interesting quirk,
> >    that is memory only used by the PCIe devices, and never accessed
> >    by the Linux host at all.
> 
> Would it make sense to convert the nvme driver to use normal allocations
> and use the DMA streaming APIs (dma_sync_single_for_[cpu|device]) for
> both queues and future HMB?

That is an interesting question: We actually have the
"DMA_ATTR_NO_KERNEL_MAPPING" for this case, and ARM implements
it in the coherent interface, so that might be a good fit.

Implementing it in the streaming API makes no sense since we
already have a kernel mapping here, but using a normal allocation
(possibly with DMA_ATTR_NON_CONSISTENT or DMA_ATTR_SKIP_CPU_SYNC,
need to check) might help on other architectures that have
limited amounts of coherent memory and no CMA.
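
To make that concrete, a rough sketch of what a normal allocation plus a
streaming mapping could look like (illustrative only, with the attribute
details still to be checked as said above):

    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Back a buffer with ordinary pages and a streaming mapping instead
     * of coherent memory; ownership then has to be passed back and forth
     * explicitly with dma_sync_single_for_cpu()/_for_device(). */
    static void *streaming_buf_alloc(struct device *dev, size_t size,
                                     dma_addr_t *dma_addr)
    {
            unsigned int order = get_order(size);
            void *buf = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);

            if (!buf)
                    return NULL;

            *dma_addr = dma_map_single(dev, buf, size, DMA_BIDIRECTIONAL);
            if (dma_mapping_error(dev, *dma_addr)) {
                    free_pages((unsigned long)buf, order);
                    return NULL;
            }
            return buf;
    }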

Another benefit of the coherent API for this kind of buffer is
that we can use CMA, where available, to get a large contiguous
chunk of RAM on architectures without an IOMMU, when normal
memory can no longer provide one because of fragmentation.

Arnd



Re: NVMe vs DMA addressing limitations

2017-01-12 Thread Sagi Grimberg



Another workaround we might need is to limit the amount of concurrent DMA
in the NVMe driver based on some platform quirk. The way that NVMe works,
it can have very large amounts of data that is concurrently mapped into
the device.


That's not really just NVMe - other storage and network controllers can
also DMA map giant amounts of memory.  There are a couple of aspects to it:

 - dma coherent memory - right now NVMe doesn't use too much of it,
   but upcoming low-end NVMe controllers will soon start to require
   fairly large amounts of it for the host memory buffer feature that
   allows for DRAM-less controller designs.  As an interesting quirk,
   that is memory only used by the PCIe devices, and never accessed
   by the Linux host at all.


Would it make sense to convert the nvme driver to use normal allocations
and use the DMA streaming APIs (dma_sync_single_for_[cpu|device]) for
both queues and future HMB?


 - size vs number of the dynamic mappings.  We probably want the dma_ops
   to specify a maximum mapping size for a given device.  As long as we
   can make progress with a few mappings, swiotlb / the iommu can just
   fail the mapping and the driver will propagate that to the block layer,
   which throttles I/O.


Isn't a maximum mapping size per device too restrictive? It is possible that
not all devices possess active mappings concurrently.


Re: NVMe vs DMA addressing limitations

2017-01-10 Thread Arnd Bergmann
On Tuesday, January 10, 2017 3:48:39 PM CET Christoph Hellwig wrote:
> On Tue, Jan 10, 2017 at 12:01:05PM +0100, Arnd Bergmann wrote:
> > Another workaround we might need is to limit the amount of concurrent DMA
> > in the NVMe driver based on some platform quirk. The way that NVMe works,
> > it can have very large amounts of data that is concurrently mapped into
> > the device.
> 
> That's not really just NVMe - other storage and network controllers can
> also DMA map giant amounts of memory.  There are a couple of aspects to it:
>
>  - dma coherent memory - right now NVMe doesn't use too much of it,
>    but upcoming low-end NVMe controllers will soon start to require
>    fairly large amounts of it for the host memory buffer feature that
>    allows for DRAM-less controller designs.  As an interesting quirk,
>    that is memory only used by the PCIe devices, and never accessed
>    by the Linux host at all.

Right, that is going to become interesting, as some platforms are
very limited with their coherent allocations.

>  - size vs number of the dynamic mappings.  We probably want the dma_ops
>    to specify a maximum mapping size for a given device.  As long as we
>    can make progress with a few mappings, swiotlb / the iommu can just
>    fail the mapping and the driver will propagate that to the block layer,
>    which throttles I/O.

Good idea.

Arnd


Re: NVMe vs DMA addressing limitations

2017-01-10 Thread Christoph Hellwig
On Tue, Jan 10, 2017 at 12:01:05PM +0100, Arnd Bergmann wrote:
> Another workaround we might need is to limit the amount of concurrent DMA
> in the NVMe driver based on some platform quirk. The way that NVMe works,
> it can have very large amounts of data that is concurrently mapped into
> the device.

That's not really just NVMe - other storage and network controllers can
also DMA map giant amounts of memory.  There are a couple of aspects to it:

 - dma coherent memory - right now NVMe doesn't use too much of it,
   but upcoming low-end NVMe controllers will soon start to require
   fairly large amounts of it for the host memory buffer feature that
   allows for DRAM-less controller designs.  As an interesting quirk,
   that is memory only used by the PCIe devices, and never accessed
   by the Linux host at all.

 - size vs number of the dynamic mappings.  We probably want the dma_ops
   to specify a maximum mapping size for a given device.  As long as we
   can make progress with a few mappings, swiotlb / the iommu can just
   fail the mapping and the driver will propagate that to the block layer,
   which throttles I/O.
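
To sketch what I mean - purely hypothetical, the hook below does not
exist today and the size is just an example:

    #include <linux/device.h>
    #include <linux/sizes.h>
    #include <linux/types.h>

    /* Hypothetical dma_ops extension: let the implementation report the
     * largest single mapping it can service, so drivers can size their
     * requests and let the block layer throttle the rest. */
    struct example_dma_limit_ops {
            size_t (*max_mapping_size)(struct device *dev);
    };

    static size_t example_swiotlb_max_mapping_size(struct device *dev)
    {
            /* e.g. bounded by the largest contiguous bounce buffer
             * swiotlb is willing to hand out. */
            return SZ_256K;
    }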


Re: NVMe vs DMA addressing limitations

2017-01-10 Thread Arnd Bergmann
On Tuesday, January 10, 2017 10:31:47 AM CET Nikita Yushchenko wrote:
> Christoph, thanks for clear input.
> 
> Arnd, I think that given this discussion, the best short-term solution is
> indeed the patch I've submitted yesterday. That is, your version +
> coherent mask support.  With that, set_dma_mask(DMA_BIT_MASK(64)) will
> succeed and the hardware will work with swiotlb.

Ok, good.

> A possible next step is to teach swiotlb to dynamically allocate bounce
> buffers within arm64's entire ZONE_DMA.

That seems reasonable, yes. We probably have to do both, as there are
cases where a device has a dma_mask smaller than ZONE_DMA but the swiotlb
bounce area is low enough to work anyway.

Another workaround we might need is to limit the amount of concurrent DMA
in the NVMe driver based on some platform quirk. The way that NVMe works,
it can have very large amounts of data that is concurrently mapped into
the device. SWIOTLB is one case where this currently fails, but another
example would be old PowerPC servers that have a 256MB window of virtual
I/O addresses per VM guest in their IOMMU. Those will likely fail the same
way that yours does.
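
As a rough illustration of the kind of throttling I mean (made-up
helper, arbitrary numbers, not a proposal for the actual nvme code):

    #include <linux/blkdev.h>

    /* When a platform quirk says the DMA window is tiny, shrink how much
     * each request may map; with a bounded queue depth that also bounds
     * the total amount of concurrently mapped data. */
    static void example_apply_small_dma_window(struct request_queue *q)
    {
            /* Cap each request at 256KB... */
            blk_queue_max_hw_sectors(q, 256 * 1024 >> 9);
            /* ...and at 32 scatter/gather segments. */
            blk_queue_max_segments(q, 32);
    }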

> Also there is some hope that R-Car *can* iommu-translate addresses that
> the PCIe module issues to the system bus, although previous attempts to
> make that work have failed. Additional research is needed here.

Does this IOMMU support remapping data within a virtual machine? I believe
there are some that only do one of the two -- either you can have guest
machines with DMA access to their low memory, or you can remap data on
the fly in the host.

Arnd


Re: NVMe vs DMA addressing limitations

2017-01-10 Thread Arnd Bergmann
On Tuesday, January 10, 2017 8:07:20 AM CET Christoph Hellwig wrote:
> On Tue, Jan 10, 2017 at 09:47:21AM +0300, Nikita Yushchenko wrote:
> > I'm now working with HW that:
> > - is in no way "low end" or "obsolete": it has 4G of RAM and 8 CPU cores,
> > and is being manufactured and developed,
> > - has 75% of its RAM located beyond the first 4G of address space,
> > - can't physically handle incoming PCIe transactions addressed to memory
> > beyond 4G.
> 
> It might not be low end or obsolete, but it's absolutely braindead.
> Your I/O performance will suffer badly for the life of the platform
> because someone tried to save 2 cents, and there is not much we can do
> about it.

Unfortunately it is a common problem for arm64 chips that were designed
by taking a 32-bit SoC and replacing the CPU core. The swiotlb is the
right workaround for this, and I think we all agree that we should
just make it work correctly.

Arnd


Re: NVMe vs DMA addressing limitations

2017-01-09 Thread Nikita Yushchenko
Christoph, thanks for clear input.

Arnd, I think that given this discussion, the best short-term solution is
indeed the patch I've submitted yesterday. That is, your version +
coherent mask support.  With that, set_dma_mask(DMA_BIT_MASK(64)) will
succeed and the hardware will work with swiotlb.
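
For context, the mask setup in question boils down to something like
this (a minimal sketch assuming a PCI probe() context, not the actual
nvme code):

    #include <linux/dma-mapping.h>
    #include <linux/pci.h>

    static int example_setup_dma_masks(struct pci_dev *pdev)
    {
            /* Ask for full 64-bit streaming *and* coherent masks; the
             * call then succeeds and swiotlb transparently bounces
             * anything the bus cannot reach. */
            if (!dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
                    return 0;

            /* Fall back to 32-bit masks if even that is rejected. */
            return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
    }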

A possible next step is to teach swiotlb to dynamically allocate bounce
buffers within arm64's entire ZONE_DMA.

Also there is some hope that R-Car *can* iommu-translate addresses that
the PCIe module issues to the system bus, although previous attempts to
make that work have failed. Additional research is needed here.

Nikita

> On Tue, Jan 10, 2017 at 09:47:21AM +0300, Nikita Yushchenko wrote:
>> I'm now working with HW that:
>> - is in no way "low end" or "obsolete": it has 4G of RAM and 8 CPU cores,
>> and is being manufactured and developed,
>> - has 75% of its RAM located beyond the first 4G of address space,
>> - can't physically handle incoming PCIe transactions addressed to memory
>> beyond 4G.
> 
> It might not be low end or obsolete, but it's absolutely braindead.
> Your I/O performance will suffer badly for the life of the platform
> because someone tried to save 2 cents, and there is not much we can do
> about it.
> 
>> (1) it constantly runs out of swiotlb space, logs are full of warnings
>> despite the rate limiting,
> 
>> Per my current understanding, blk-level bounce buffering will at least
>> help with (1) - if done properly it will allocate bounce buffers within
>> the entire memory below 4G, not within the dedicated swiotlb space (which
>> is small, and enlarging it makes memory permanently unavailable for other
>> uses).  This looks simple and safe (in the sense of not breaking
>> unrelated use cases in any way).
> 
> Yes.  Although there is absolutely no reason why swiotlb could not
> do the same.
> 
>> (2) it runs far from optimally due to bounce-buffering almost all I/O,
>> despite lots of free memory in the area where direct DMA is possible.
> 
>> Addressing (2) looks much more difficult because a different memory
>> allocation policy is required for that.
> 
> It's basically not possible.  Every piece of memory in a Linux
> kernel is a possible source of I/O, and depending on the workload
> type it might even be the prime source of I/O.
> 
>>> NVMe should never bounce, the fact that it currently possibly does
>>> for highmem pages is a bug.
>>
>> The entire topic is absolutely not related to highmem (i.e. memory not
>> directly addressable by a 32-bit kernel).
> 
> I did not say this affects you, but thanks to your mail I noticed that
> NVMe has a suboptimal setting there.  Also note that highmem does not
> have to imply a 32-bit kernel, just physical memory that is not in the
> kernel mapping.
> 
>> What we are discussing is hw-originated restriction on where DMA is
>> possible.
> 
> Yes, where hw means the SOC, and not the actual I/O device, which is an
> important distinction.
> 
>>> Or even better, remove the call to dma_set_mask_and_coherent with
>>> DMA_BIT_MASK(32).  NVMe is designed around having proper 64-bit DMA
>>> addressing, there is no point in trying to pretend it works without that.
>>
>> Are you claiming that the NVMe driver in mainline is intentionally designed
>> to not work on HW that can't do DMA to the entire 64-bit space?
> 
> It is not intended to handle the case where the SOC / chipset
> can't handle DMA to all physical memory, yes.
> 
>> Such setups do exist and there is interest in making them work.
> 
> Sure, but it's not the job of the NVMe driver to work around such a broken
> system.  It's something your architecture code needs to do, maybe with
> a bit of core kernel support.
> 
>> Quite a few pages used for block I/O are allocated by the filemap code -
>> and at allocation time it is known which inode the page is being allocated
>> for. If this inode is from a filesystem located on a known device with
>> known DMA limitations, this knowledge can be used to allocate a page that
>> can be DMAed directly.
> 
> But in other cases we might never DMA to it.  Or we rarely DMA to it, say
> for a machine running databases or qemu and using lots of direct I/O. Or
> a storage target using its local alloc_pages buffers.
> 
>> Sure, there are lots of cases where at allocation time there is no idea
>> which device will run DMA on the page being allocated, or perhaps the page
>> is going to be shared, or whatever. Such cases unavoidably require bounce
>> buffers if the page ends up being used with a device with DMA limitations.
>> But still there are cases where better allocation can remove the need for
>> bounce buffers - without hurting other cases.
> 
> It takes your max 1GB of DMA-addressable memory away from other uses,
> and introduces the crazy highmem VM tuning issues we had with big
> 32-bit x86 systems in the past.
> 


Re: NVMe vs DMA addressing limitations

2017-01-09 Thread Christoph Hellwig
On Tue, Jan 10, 2017 at 09:47:21AM +0300, Nikita Yushchenko wrote:
> I'm now working with HW that:
> - is in no way "low end" or "obsolete": it has 4G of RAM and 8 CPU cores,
> and is being manufactured and developed,
> - has 75% of its RAM located beyond the first 4G of address space,
> - can't physically handle incoming PCIe transactions addressed to memory
> beyond 4G.

It might not be low end or obsolete, but it's absolutely braindead.
Your I/O performance will suffer badly for the life of the platform
because someone tried to save 2 cents, and there is not much we can do
about it.

> (1) it constantly runs out of swiotlb space, logs are full of warnings
> despite the rate limiting,

> Per my current understanding, blk-level bounce buffering will at least
> help with (1) - if done properly it will allocate bounce buffers within
> the entire memory below 4G, not within the dedicated swiotlb space (which
> is small, and enlarging it makes memory permanently unavailable for other
> uses).  This looks simple and safe (in the sense of not breaking
> unrelated use cases in any way).

Yes.  Although there is absolutely no reason why swiotlb could not
do the same.

> (2) it runs far from optimally due to bounce-buffering almost all I/O,
> despite lots of free memory in the area where direct DMA is possible.

> Addressing (2) looks much more difficult because a different memory
> allocation policy is required for that.

It's basically not possible.  Every piece of memory in a Linux
kernel is a possible source of I/O, and depending on the workload
type it might even be the prime source of I/O.

> > NVMe should never bounce, the fact that it currently possibly does
> > for highmem pages is a bug.
> 
> The entire topic is absolutely not related to highmem (i.e. memory not
> directly addressable by a 32-bit kernel).

I did not say this affects you, but thanks to your mail I noticed that
NVMe has a suboptimal setting there.  Also note that highmem does not
have to imply a 32-bit kernel, just physical memory that is not in the
kernel mapping.

> What we are discussing is hw-originated restriction on where DMA is
> possible.

Yes, where hw means the SOC, and not the actual I/O device, which is an
important distinction.

> > Or even better, remove the call to dma_set_mask_and_coherent with
> > DMA_BIT_MASK(32).  NVMe is designed around having proper 64-bit DMA
> > addressing, there is no point in trying to pretend it works without that.
> 
> Are you claiming that the NVMe driver in mainline is intentionally designed
> to not work on HW that can't do DMA to the entire 64-bit space?

It is not intended to handle the case where the SOC / chipset
can't handle DMA to all physical memory, yes.

> Such setups do exist and there is interest in making them work.

Sure, but it's not the job of the NVMe driver to work around such a broken
system.  It's something your architecture code needs to do, maybe with
a bit of core kernel support.

> Quite a few pages used for block I/O are allocated by the filemap code -
> and at allocation time it is known which inode the page is being allocated
> for. If this inode is from a filesystem located on a known device with
> known DMA limitations, this knowledge can be used to allocate a page that
> can be DMAed directly.

But in other cases we might never DMA to it.  Or we rarely DMA to it, say
for a machine running databases or qemu and using lots of direct I/O. Or
a storage target using its local alloc_pages buffers.

> Sure, there are lots of cases where at allocation time there is no idea
> which device will run DMA on the page being allocated, or perhaps the page
> is going to be shared, or whatever. Such cases unavoidably require bounce
> buffers if the page ends up being used with a device with DMA limitations.
> But still there are cases where better allocation can remove the need for
> bounce buffers - without hurting other cases.

It takes your max 1GB of DMA-addressable memory away from other uses,
and introduces the crazy highmem VM tuning issues we had with big
32-bit x86 systems in the past.