Re: [lng-odp] continuous memory allocation for drivers

2016-11-17 Thread Maciej Czekaj

On Fri, 11 Nov 2016 11:13:27 +0100 Francois Ozog wrote:

On 11 November 2016 at 10:10, Brian Brooks  wrote:


On 11/10 18:52:49, Christophe Milard wrote:

Hi,

My hope was that packet segments would all be smaller than one page
(either normal pages or huge pages)

When is this the case? With a 4096 byte page, a couple of 1518 byte Ethernet
packets can fit. A 9038 byte Jumbo won't fit.


[FF] When you allocate a queue with 256 packets for Intel, Virtio, or Mellanox
cards, you need a small area of 256 descriptors that fits in a page. Then the
driver allocates 256 buffers in contiguous memory, which comes to about 512K of
buffers. They may allocate this zone per packet, though. But for high
performance cards such as Chelsio and Netcope this is a strict requirement,
because the obtained memory zone is managed by HW: packet allocation is not
controlled by software. You give the zone to the hardware, which places packets
the way it wants within the zone. The HW informs the SW where the packets are by
updating the ring. VFIO does not change the requirement for a large
contiguous area.
PCI Express has a limitation of 36M DMA transactions per second, which is
lower than the 60 Mpps required for 40 Gbps and much lower than the 148 Mpps
required for 100 Gbps. The only way to achieve line rate is to fit more than
one packet in a DMA transaction. That is what Chelsio, Netcope and others are
doing. This requires HW-controlled memory allocation, which in turn requires
large memory blocks to be supplied to the HW.
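
For reference, those packet rates correspond to minimum-size Ethernet frames: a
64-byte frame occupies 84 bytes on the wire once the preamble/SFD and
inter-frame gap are counted, so 40 Gbps works out to roughly 59.5 Mpps and
100 Gbps to roughly 148.8 Mpps. A minimal check of the arithmetic, assuming
that standard framing overhead:

#include <stdio.h>

/* Minimum-size Ethernet frame on the wire:
 * 64-byte frame + 8-byte preamble/SFD + 12-byte inter-frame gap = 84 bytes. */
#define WIRE_BYTES_MIN_FRAME 84.0

static double line_rate_mpps(double gbps)
{
    return gbps * 1e9 / (WIRE_BYTES_MIN_FRAME * 8.0) / 1e6;
}

int main(void)
{
    printf("40G : %.1f Mpps\n", line_rate_mpps(40.0));   /* ~59.5  */
    printf("100G: %.1f Mpps\n", line_rate_mpps(100.0));  /* ~148.8 */
    return 0;
}
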
As we move forward, I expect all cards to adopt a similar scheme and escape
the "Intel" model of IO.
Now if we look at performance, the cost of calling virt_to_phys() for each
packet, even in the kernel, rules out scattered allocations. You amortize the
cost by getting the physical address of the 256-buffer zone once and using
offsets from it to get the physical address of an individual packet. If you try
to do per-packet translation in userland, then just use the Linux networking
stack, it will be faster ;-)
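
To make the amortization concrete, here is a minimal sketch, assuming the
driver has resolved the physical (or IOVA) base of the buffer zone once at
init time; the struct and function names are illustrative, not an existing
ODP or DPDK API:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of one physically contiguous buffer zone.
 * zone_phys is obtained once at setup (e.g. from a kernel helper or
 * /proc/self/pagemap); no per-packet translation is needed afterwards. */
struct buf_zone {
    void     *zone_virt;   /* virtual base of the zone        */
    uint64_t  zone_phys;   /* physical/IOVA base of the zone  */
    size_t    zone_len;    /* total zone length in bytes      */
};

/* Turn a buffer pointer inside the zone into a DMA address with
 * plain pointer arithmetic instead of a page-table walk. */
static inline uint64_t buf_to_phys(const struct buf_zone *z, const void *buf)
{
    size_t off = (size_t)((const char *)buf - (const char *)z->zone_virt);
    return z->zone_phys + off;
}
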




I second Francois here. H/W requires physically contiguous memory, at least
for DMA queues.
A queue can well exceed the size of a page, or even of a huge page in some
cases. E.g. ThunderX has a single DMA queue of 512K, and it may even have a
4M queue, which is beyond the 2M huge page you get on ARM when the 4K base
page is the default.
64K base pages are preferable and safer, as the huge page is then 512M, but
that may be too much for some clients.
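
Which huge page sizes are actually available can be checked at start-up before
deciding how to back a queue. A small sketch, assuming the usual sysfs layout
(/sys/kernel/mm/hugepages/hugepages-<size>kB):

#include <dirent.h>
#include <stdio.h>

/* Print the huge page sizes exposed by the running kernel, e.g. 2048 kB
 * and 1048576 kB on x86-64, or 524288 kB on an arm64 kernel with 64K
 * base pages. */
int main(void)
{
    DIR *d = opendir("/sys/kernel/mm/hugepages");
    struct dirent *e;

    if (!d) {
        perror("opendir");
        return 1;
    }
    while ((e = readdir(d)) != NULL) {
        unsigned long kb;
        if (sscanf(e->d_name, "hugepages-%lukB", &kb) == 1)
            printf("huge page size: %lu kB\n", kb);
    }
    closedir(d);
    return 0;
}
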


vfio-pci only partially solves that problem, for two reasons:

1. In the guest environment there is no vfio-pci, or at least there is no
extra mapping that could be done.
   From the VM's perspective, all DMA memory should be contiguous with respect
to the Intermediate Physical Address (in ARM nomenclature).


2. IOMMU mappings are not free. In synthetic benchmarks there is no
performance difference due to the IOMMU, but if the system uses the IOMMU
extensively, e.g. because of many VMs, it may well prove otherwise.
   An IOTLB miss has a cost similar to a TLB miss. Ideally, a system
integrator should have the choice of whether to use it.
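
Whether the platform is actually running with an IOMMU enabled can be probed
at start-up, so an allocator could pick between IOVA-contiguous and physically
contiguous strategies. A rough sketch, assuming IOMMU groups are exposed in
the usual place under /sys/kernel/iommu_groups:

#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Returns true if the kernel exposes at least one IOMMU group, i.e. an
 * IOMMU is enabled and devices have been assigned to groups. */
static bool iommu_present(void)
{
    DIR *d = opendir("/sys/kernel/iommu_groups");
    struct dirent *e;
    bool found = false;

    if (!d)
        return false;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0)
            found = true;
    }
    closedir(d);
    return found;
}

int main(void)
{
    printf("IOMMU %s\n", iommu_present() ? "present" : "absent");
    return 0;
}
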





Or is it to ease the memory manager by having a logical array of objects
laid out in virtual memory space, where, depending on the number of objects
and the size of each object, a few are bound to span 2 pages which
might not be adjacent in physical memory?

Or is it when the view of a packet is a bunch of packet segments which
may be of varying sizes and possibly scattered across memory, and the
packet needs to go out on the wire?

Are 2M, 16M, or 1G page sizes used?


to guarantee physical memory
contiguity, which is needed by some drivers (read: non-vfio drivers for
PCI).

[FF] The Linux kernel uses a special allocator for that; huge pages are not
the unit. As said above, some HW requires large contiguous blocks, and VFIO
or an IOMMU does not remove that requirement.



If the IOMMU gives an IO device the same virtual addressing as the CPU by
sharing page tables, would an IO device or IOMMU ever have limitations
on the number of pages supported, or other performance limitations
during the VA->PA translation?

[FF] no information on that



Does the IOMMU remap interrupts from the IO device when the VM
migrates cores? What happens when there is no IRQ remapping? Does a core get
the IRQ and have to inter-processor-interrupt the core where the VM is now running?

[FF] I hope not.



Are non-vfio drivers for PCI that need contiguous physical memory the
design target?


[FF] not related to VFIO but related to HW requirements.



Francois Ozog's experience (with dpdk) shows that this hope will fail
in some cases: not all platforms support the required huge page size.
And it would be nice to be able to run even in the absence of huge
pages.

I am therefore planning to expand drvshm to include a flag requesting
contiguous physical memory. But sadly, from user space, this is
not something we can guarantee... So when this flag is set, the allocator
will allocate until physical memory "happens to be contiguous".
This is a bit like the DPDK approach (trial & error), which I dislike,

Re: [lng-odp] continuous memory allocation for drivers

2016-11-11 Thread Francois Ozog
On 11 November 2016 at 10:10, Brian Brooks  wrote:

> On 11/10 18:52:49, Christophe Milard wrote:
> > Hi,
> >
> > My hope was that packet segments would all be smaller than one page
> > (either normal pages or huge pages)
>
> When is this the case? With a 4096 byte page, a couple of 1518 byte Ethernet
> packets can fit. A 9038 byte Jumbo won't fit.
>

[FF] When you allocate a queue with 256 packets for Intel, Virtio, or Mellanox
cards, you need a small area of 256 descriptors that fits in a page. Then the
driver allocates 256 buffers in contiguous memory, which comes to about 512K of
buffers. They may allocate this zone per packet, though. But for high
performance cards such as Chelsio and Netcope this is a strict requirement,
because the obtained memory zone is managed by HW: packet allocation is not
controlled by software. You give the zone to the hardware, which places packets
the way it wants within the zone. The HW informs the SW where the packets are by
updating the ring. VFIO does not change the requirement for a large
contiguous area.
PCI Express has a limitation of 36M DMA transactions per second, which is
lower than the 60 Mpps required for 40 Gbps and much lower than the 148 Mpps
required for 100 Gbps. The only way to achieve line rate is to fit more than
one packet in a DMA transaction. That is what Chelsio, Netcope and others are
doing. This requires HW-controlled memory allocation, which in turn requires
large memory blocks to be supplied to the HW.
As we move forward, I expect all cards to adopt a similar scheme and escape
the "Intel" model of IO.
Now if we look at performance, the cost of calling virt_to_phys() for each
packet, even in the kernel, rules out scattered allocations. You amortize the
cost by getting the physical address of the 256-buffer zone once and using
offsets from it to get the physical address of an individual packet. If you try
to do per-packet translation in userland, then just use the Linux networking
stack, it will be faster ;-)


> Or is it to ease the memory manager by having a logical array of objects
> laid out in virtual memory space, where, depending on the number of objects
> and the size of each object, a few are bound to span 2 pages which
> might not be adjacent in physical memory?
>
> Or is it when the view of a packet is a bunch of packet segments which
> may be of varying sizes and possibly scattered across memory, and the
> packet needs to go out on the wire?
>
> Are 2M, 16M, or 1G page sizes used?
>
> > to guarantee physical memory
> > contiguity, which is needed by some drivers (read: non-vfio drivers for
> > PCI).
>
[FF] The Linux kernel uses a special allocator for that; huge pages are not
the unit. As said above, some HW requires large contiguous blocks, and VFIO
or an IOMMU does not remove that requirement.


> If the IOMMU gives an IO device the same virtual addressing as the CPU by
> sharing page tables, would an IO device or IOMMU ever have limitations
> on the number of pages supported, or other performance limitations
> during the VA->PA translation?
>
[FF] no information on that


> Does the IOMMU remap interrupts from the IO device when the VM
> migrates cores? What happens when there is no IRQ remapping? Does a core get
> the IRQ and have to inter-processor-interrupt the core where the VM is now running?
>
[FF] I hope not.


> Are non-vfio drivers for PCI that need contiguous physical memory the
> design target?
>

[FF] not related to VFIO but related to HW requirements.


> > Francois Ozog's experience (with dpdk) shows that this hope will fail
> > in some cases: not all platforms support the required huge page size.
> > And it would be nice to be able to run even in the absence of huge
> > pages.
> >
> > I am therefore planning to expand drvshm to include a flag requesting
> > contiguous physical memory. But sadly, from user space, this is
> > not something we can guarantee... So when this flag is set, the allocator
> > will allocate until physical memory "happens to be contiguous".
> > This is a bit like the DPDK approach (trial & error), which I dislike,
> > but there aren't many alternatives from user space. This would be
> > triggered only in case huge page allocation fails, or if the
> > requested size exceeds the HP size.
>
> Are device drivers for the target devices (SoCs, cards) easier to
> program when there's an IOMMU? If so, is this contiguous physical
> memory requirement necessary?
>
[FF] again, depends on what entity is managing packet placement.


> > Last alternative would be to have a kernel module to do this kind of
> > allocation, but I guess we don't really want to depend on that...
> >
> > Any comment?
>



-- 
François-Frédéric Ozog | Director Linaro Networking Group
T: +33.67221.6485
francois.o...@linaro.org | Skype: ffozog


Re: [lng-odp] continuous memory allocation for drivers

2016-11-11 Thread Francois Ozog
There is no such module. If it were upstreamable, Intel would have gotten it
upstream long ago.


On 11 November 2016 at 08:50, Christophe Milard <
christophe.mil...@linaro.org> wrote:

> I hoped such a kernel module would already exist, but I am surprised
> that DPDK would not rely on it. Maybe there is a limitation I cannot
> see. I'll keep searching. Francois, maybe you know the answer?
>
> On 10 November 2016 at 19:10, Mike Holmes  wrote:
> >
> >
> > On 10 November 2016 at 12:52, Christophe Milard
> >  wrote:
> >>
> >> Hi,
> >>
> >> My hope was that packet segments would all be smaller than one page
> >> (either normal pages or huge pages) to guarantee physical memory
> >> contiguity, which is needed by some drivers (read: non-vfio drivers for
> >> PCI).
> >>
> >> Francois Ozog's experience (with dpdk) shows that this hope will fail
> >> in some cases: not all platforms support the required huge page size.
> >> And it would be nice to be able to run even in the absence of huge
> >> pages.
> >>
> >> I am therefore planning to expand drvshm to include a flag requesting
> >> contiguous physical memory. But sadly, from user space, this is
> >> not something we can guarantee... So when this flag is set, the allocator
> >> will allocate until physical memory "happens to be contiguous".
> >> This is a bit like the DPDK approach (trial & error), which I dislike,
> >> but there aren't many alternatives from user space. This would be
> >> triggered only in case huge page allocation fails, or if the
> >> requested size exceeds the HP size.
> >>
> >> Last alternative would be to have a kernel module to do this kind of
> >> allocation, but I guess we don't really want to depend on that...
> >>
> >> Any comment?
> >
> >
> > Would that module be launched from the implementation's global init, or
> > be an independent expectation on some other support lib that exists and
> > has installed the module?
> > It feels like an external support lib would be reusable by all
> > implementations and would not need re-coding.
> >
> >
> >
> >
> > --
> > Mike Holmes
> > Program Manager - Linaro Networking Group
> > Linaro.org │ Open source software for ARM SoCs
> > "Work should be fun and collaborative, the rest follows"
> >
> >
>



-- 
François-Frédéric Ozog | Director Linaro Networking Group
T: +33.67221.6485
francois.o...@linaro.org | Skype: ffozog


Re: [lng-odp] continuous memory allocation for drivers

2016-11-11 Thread Brian Brooks
On 11/10 18:52:49, Christophe Milard wrote:
> Hi,
> 
> My hope was that packet segments would all be smaller than one page
> (either normal pages or huge pages)

When is this the case? With a 4096 byte page, a couple of 1518 byte Ethernet
packets can fit. A 9038 byte Jumbo won't fit.

Or is it to ease the memory manager by having a logical array of objects
laid out in virtual memory space, where, depending on the number of objects
and the size of each object, a few are bound to span 2 pages which
might not be adjacent in physical memory?

Or is it when the view of a packet is a bunch of packet segments which
may be of varying sizes and possibly scattered across memory, and the
packet needs to go out on the wire?

Are 2M, 16M, or 1G page sizes used?

> to guarantee physical memory
> contiguity, which is needed by some drivers (read: non-vfio drivers for
> PCI).

If the IOMMU gives an IO device the same virtual addressing as the CPU by
sharing page tables, would an IO device or IOMMU ever have limitations
on the number of pages supported, or other performance limitations
during the VA->PA translation?

Does the IOMMU remap interrupts from the IO device when the VM
migrates cores? What happens when there is no IRQ remapping? Does a core get
the IRQ and have to inter-processor-interrupt the core where the VM is now running?

Are non-vfio drivers for PCI that need contiguous physical memory the
design target?

> Francois Ozog's experience (with dpdk) shows that this hope will fail
> in some cases: not all platforms support the required huge page size.
> And it would be nice to be able to run even in the absence of huge
> pages.
>
> I am therefore planning to expand drvshm to include a flag requesting
> contiguous physical memory. But sadly, from user space, this is
> not something we can guarantee... So when this flag is set, the allocator
> will allocate until physical memory "happens to be contiguous".
> This is a bit like the DPDK approach (trial & error), which I dislike,
> but there aren't many alternatives from user space. This would be
> triggered only in case huge page allocation fails, or if the
> requested size exceeds the HP size.

Are device drivers for the target devices (SoCs, cards) easier to
program when there's an IOMMU? If so, is this contiguous physical
memory requirement necessary?

> Last alternative would be to have a kernel module to do this kind of
> allocation, but I guess we don't really want to depend on that...
> 
> Any comment?


Re: [lng-odp] continuous memory allocation for drivers

2016-11-10 Thread Christophe Milard
I hoped such a kernel module would already exist, but I am surprised
that DPDK would not rely on it. Maybe there is a limitation I cannot
see. I'll keep searching. Francois, maybe you know the answer?

On 10 November 2016 at 19:10, Mike Holmes  wrote:
>
>
> On 10 November 2016 at 12:52, Christophe Milard
>  wrote:
>>
>> Hi,
>>
>> My hope was that packet segments would all be smaller than one page
>> (either normal pages or huge pages) to guarantee physical memory
>> contiguity, which is needed by some drivers (read: non-vfio drivers for
>> PCI).
>>
>> Francois Ozog's experience (with dpdk) shows that this hope will fail
>> in some cases: not all platforms support the required huge page size.
>> And it would be nice to be able to run even in the absence of huge
>> pages.
>>
>> I am therefore planning to expand drvshm to include a flag requesting
>> contiguous physical memory. But sadly, from user space, this is
>> not something we can guarantee... So when this flag is set, the allocator
>> will allocate until physical memory "happens to be contiguous".
>> This is a bit like the DPDK approach (trial & error), which I dislike,
>> but there aren't many alternatives from user space. This would be
>> triggered only in case huge page allocation fails, or if the
>> requested size exceeds the HP size.
>>
>> Last alternative would be to have a kernel module to do this kind of
>> allocation, but I guess we don't really want to depend on that...
>>
>> Any comment?
>
>
> Would that module be launched from the implementation's global init, or be an
> independent expectation on some other support lib that exists and has
> installed the module?
> It feels like an external support lib would be reusable by all implementations
> and would not need re-coding.
>
>
>
>
> --
> Mike Holmes
> Program Manager - Linaro Networking Group
> Linaro.org │ Open source software for ARM SoCs
> "Work should be fun and collaborative, the rest follows"
>
>


Re: [lng-odp] continuous memory allocation for drivers

2016-11-10 Thread Mike Holmes
On 10 November 2016 at 12:52, Christophe Milard <
christophe.mil...@linaro.org> wrote:

> Hi,
>
> My hope was that packet segments would all be smaller than one page
> (either normal pages or huge pages) to guarantee physical memory
> contiguity, which is needed by some drivers (read: non-vfio drivers for
> PCI).
>
> Francois Ozog's experience (with dpdk) shows that this hope will fail
> in some cases: not all platforms support the required huge page size.
> And it would be nice to be able to run even in the absence of huge
> pages.
>
> I am therefore planning to expand drvshm to include a flag requesting
> contiguous physical memory. But sadly, from user space, this is
> not something we can guarantee... So when this flag is set, the allocator
> will allocate until physical memory "happens to be contiguous".
> This is a bit like the DPDK approach (trial & error), which I dislike,
> but there aren't many alternatives from user space. This would be
> triggered only in case huge page allocation fails, or if the
> requested size exceeds the HP size.
>
> Last alternative would be to have a kernel module to do this kind of
> allocation, but I guess we don't really want to depend on that...
>
> Any comment?
>

Would that module be launched from the implementation's global init, or be
an independent expectation on some other support lib that exists and has
installed the module?
It feels like an external support lib would be reusable by all
implementations and would not need re-coding.




-- 
Mike Holmes
Program Manager - Linaro Networking Group
Linaro.org │ Open source software for ARM SoCs
"Work should be fun and collaborative, the rest follows"


[lng-odp] continuous memory allocation for drivers

2016-11-10 Thread Christophe Milard
Hi,

My hope was that packet segments would all be smaller than one page
(either normal pages or huge pages) to guarantee physical memory
contiguity, which is needed by some drivers (read: non-vfio drivers for
PCI).

Francois Ozog's experience (with dpdk) shows that this hope will fail
in some cases: not all platforms support the required huge page size.
And it would be nice to be able to run even in the absence of huge
pages.

I am therefore planning to expand drvshm to include a flag requesting
contiguous physical memory. But sadly, from user space, this is
not something we can guarantee... So when this flag is set, the allocator
will allocate until physical memory "happens to be contiguous".
This is a bit like the DPDK approach (trial & error), which I dislike,
but there aren't many alternatives from user space. This would be
triggered only in case huge page allocation fails, or if the
requested size exceeds the HP size.
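
A rough sketch of that "try and check" idea, assuming the process is allowed
to read its own /proc/self/pagemap (recent kernels hide the PFNs without
CAP_SYS_ADMIN): after allocating and touching the region, walk pagemap and
verify that the page frame numbers are consecutive. The function name and the
retry policy around it are illustrative, not the proposed drvshm API.

#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGEMAP_PFN_MASK ((1ULL << 55) - 1)   /* bits 0..54 hold the PFN */

/* Check whether [vaddr, vaddr + len) is physically contiguous by walking
 * /proc/self/pagemap. The pages must already be faulted in (touched). */
static bool is_phys_contig(void *vaddr, size_t len)
{
    long pgsz = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    uint64_t prev_pfn = 0;
    bool contig = true;

    if (fd < 0)
        return false;

    for (size_t off = 0; off < len; off += pgsz) {
        uint64_t entry;
        off_t idx = ((uintptr_t)vaddr + off) / pgsz * sizeof(entry);

        if (pread(fd, &entry, sizeof(entry), idx) != sizeof(entry) ||
            !(entry & (1ULL << 63))) {          /* page not present */
            contig = false;
            break;
        }
        uint64_t pfn = entry & PAGEMAP_PFN_MASK;
        if (off != 0 && pfn != prev_pfn + 1) {  /* PFNs must be consecutive */
            contig = false;
            break;
        }
        prev_pfn = pfn;
    }
    close(fd);
    return contig;
}

The allocator could then mmap a candidate region, touch every page, run such
a check, and release and retry (or fall back) when the check fails.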

Last alternative would be to have a kernel module to do this kind of
allocation, but I guess we don't really want to depend on that...

Any comment?