Re: svn commit: r254520 - in head/sys: kern sys

2013-08-21 Thread Andre Oppermann

On 21.08.2013 21:59, Navdeep Parhar wrote:

On 08/21/13 12:41, Scott Long wrote:


On Aug 21, 2013, at 8:59 AM, Andre Oppermann  wrote:


On 19.08.2013 23:45, Navdeep Parhar wrote:

On 08/19/13 13:58, Andre Oppermann wrote:

On 19.08.2013 19:33, Navdeep Parhar wrote:

On 08/19/13 04:16, Andre Oppermann wrote:

Author: andre
Date: Mon Aug 19 11:16:53 2013
New Revision: 254520
URL: http://svnweb.freebsd.org/changeset/base/254520

Log:
Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree
users
for a very long time, if ever.

Should such a functionality ever be needed again the appropriate and
much better way to do it is through a custom EXT_SOMETHING
external mbuf
type together with a dedicated *ext_free function.

Discussed with:trociny, glebius

Modified:
head/sys/kern/kern_mbuf.c
head/sys/kern/uipc_mbuf.c
head/sys/sys/mbuf.h



Hello Andre,

Is this just garbage collection or is there some other reason for this?


This is garbage collection and removal of not quite right, rotten,
functionality.


I recently tried some experiments to reduce the number of mbuf and
cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
very useful and the code changes to the kernel were minimal.  See
user/np/cxl_tuning.  The experiment was quite successful and I was
planning to bring in most of those changes to HEAD.  I was hoping to get
some runtime mileage on the approach in general before tweaking the
ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
within the cluster.  But now M_NOFREE has vanished without a warning...


I'm looking through your experimental code and that is some really good
numbers you're achieving there!

However a couple things don't feel quite right, hackish even, and not
fit for HEAD.  This is a bit the same situation we had with some of the
first 1GigE cards quite a number of years back (mostly ti(4)).  There
we ended up with a couple of just good enough hacks to make it fast.
Most of the remains I've collected today.


If M_NOFREE and EXT_EXTREF are properly supported in the tree (and I'm
arguing that they were, before r254520) then the changes are perfectly
legitimate.  The only hackish part was that I was getting the cluster
from the jumbop zone while bypassing its normal refcnt mechanism.  This
I did so as to use the same zone as m_uiotombuf to keep it "hot" for all
consumers (driver + network stack).


If you insist I'll revert the commit removing M_NOFREE.  EXT_EXTREF isn't
touched yet, but should get better support.

The hackish part for me is that the driver again manages its own memory
pool.  Windows works that way, NetBSD is moving towards it while FreeBSD
has and remains at a central network memory pool.  The latter (our current)
way of doing it seems more efficient overall especially on heavily loaded
networked machines.  There may be significant queues building (think app
blocked having many sockets buffer fill up) up delaying the freeing and
returning of network memory resources.  Together with fragmentation this
can lead to bad very outcomes.  Router applications with many interfaces
also greatly benefit from central memory pools.

So I'm really not sure that we should move back in the driver owned pool
direction with lots of code duplication and copy-pasting (see NetBSD).
Also it is kinda weird to have a kernel based pool for data going down
the stack and another one in each driver for those going up.

Actually I'm of the opinion that we should stay with the central memory
pool and fix so that it works just as well for those cases a driver pool
currently performs better.


The central memory pool approach is too slow, unfortunately.  There's a
reason that other OS's are moving to them.  At Netflix we are
currently working on some approaches to private memory pools in order to
achieve better efficiency, and we're closely watching and anticipating Navdeep's
work.


I should point out that I went to great lengths to use the jumbop zone
in my experiments, and not create my own pool of memory for the rx
buffers.  The hope was to share cache warmth (sounds very cosy :-) with
the likes of m_uiotombuf (which uses jumbop too) etc.  So I'm actually
in the camp that prefers central pools.  I'm just trying out ways to
reduce the trips we have to make to the pool(s) involved.  Laying down
mbufs within clusters, and packing multiple frames per cluster clearly
helps.  Careful cluster recycling within the NIC seems to work too.


What you describe does make a lot of sense.  Jumbop is the optimal size
for the VM.  We should really look at and pushing forward to have a nicer
M_NOFREE+M_EXTREF API for 10 and new HEAD going forward.

As always it seems to depend on the use case and what is being measured
too.  Is it single stream performance?  Concurrent streams?  Both ways
or only one way, in or out?  Each makes very different use of many parts
of the stack and driver leading to different bottlenecks.

The Netfli

Re: svn commit: r254520 - in head/sys: kern sys

2013-08-21 Thread Navdeep Parhar
On 08/21/13 12:41, Scott Long wrote:
> 
> On Aug 21, 2013, at 8:59 AM, Andre Oppermann  wrote:
> 
>> On 19.08.2013 23:45, Navdeep Parhar wrote:
>>> On 08/19/13 13:58, Andre Oppermann wrote:
 On 19.08.2013 19:33, Navdeep Parhar wrote:
> On 08/19/13 04:16, Andre Oppermann wrote:
>> Author: andre
>> Date: Mon Aug 19 11:16:53 2013
>> New Revision: 254520
>> URL: http://svnweb.freebsd.org/changeset/base/254520
>>
>> Log:
>>Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree
>> users
>>for a very long time, if ever.
>>
>>Should such a functionality ever be needed again the appropriate and
>>much better way to do it is through a custom EXT_SOMETHING
>> external mbuf
>>type together with a dedicated *ext_free function.
>>
>>Discussed with:trociny, glebius
>>
>> Modified:
>>head/sys/kern/kern_mbuf.c
>>head/sys/kern/uipc_mbuf.c
>>head/sys/sys/mbuf.h
>>
>
> Hello Andre,
>
> Is this just garbage collection or is there some other reason for this?

 This is garbage collection and removal of not quite right, rotten,
 functionality.

> I recently tried some experiments to reduce the number of mbuf and
> cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
> very useful and the code changes to the kernel were minimal.  See
> user/np/cxl_tuning.  The experiment was quite successful and I was
> planning to bring in most of those changes to HEAD.  I was hoping to get
> some runtime mileage on the approach in general before tweaking the
> ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
> within the cluster.  But now M_NOFREE has vanished without a warning...

 I'm looking through your experimental code and that is some really good
 numbers you're achieving there!

 However a couple things don't feel quite right, hackish even, and not
 fit for HEAD.  This is a bit the same situation we had with some of the
 first 1GigE cards quite a number of years back (mostly ti(4)).  There
 we ended up with a couple of just good enough hacks to make it fast.
 Most of the remains I've collected today.
>>>
>>> If M_NOFREE and EXT_EXTREF are properly supported in the tree (and I'm
>>> arguing that they were, before r254520) then the changes are perfectly
>>> legitimate.  The only hackish part was that I was getting the cluster
>>> from the jumbop zone while bypassing its normal refcnt mechanism.  This
>>> I did so as to use the same zone as m_uiotombuf to keep it "hot" for all
>>> consumers (driver + network stack).
>>
>> If you insist I'll revert the commit removing M_NOFREE.  EXT_EXTREF isn't
>> touched yet, but should get better support.
>>
>> The hackish part for me is that the driver again manages its own memory
>> pool.  Windows works that way, NetBSD is moving towards it while FreeBSD
>> has and remains at a central network memory pool.  The latter (our current)
>> way of doing it seems more efficient overall especially on heavily loaded
>> networked machines.  There may be significant queues building (think app
>> blocked having many sockets buffer fill up) up delaying the freeing and
>> returning of network memory resources.  Together with fragmentation this
>> can lead to bad very outcomes.  Router applications with many interfaces
>> also greatly benefit from central memory pools.
>>
>> So I'm really not sure that we should move back in the driver owned pool
>> direction with lots of code duplication and copy-pasting (see NetBSD).
>> Also it is kinda weird to have a kernel based pool for data going down
>> the stack and another one in each driver for those going up.
>>
>> Actually I'm of the opinion that we should stay with the central memory
>> pool and fix so that it works just as well for those cases a driver pool
>> currently performs better.
> 
> The central memory pool approach is too slow, unfortunately.  There's a
> reason that other OS's are moving to them.  At Netflix we are
> currently working on some approaches to private memory pools in order to
> achieve better efficiency, and we're closely watching and anticipating 
> Navdeep's
> work.

I should point out that I went to great lengths to use the jumbop zone
in my experiments, and not create my own pool of memory for the rx
buffers.  The hope was to share cache warmth (sounds very cosy :-) with
the likes of m_uiotombuf (which uses jumbop too) etc.  So I'm actually
in the camp that prefers central pools.  I'm just trying out ways to
reduce the trips we have to make to the pool(s) involved.  Laying down
mbufs within clusters, and packing multiple frames per cluster clearly
helps.  Careful cluster recycling within the NIC seems to work too.

Regards,
Navdeep

___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-s

Re: svn commit: r254520 - in head/sys: kern sys

2013-08-21 Thread Scott Long

On Aug 21, 2013, at 8:59 AM, Andre Oppermann  wrote:

> On 19.08.2013 23:45, Navdeep Parhar wrote:
>> On 08/19/13 13:58, Andre Oppermann wrote:
>>> On 19.08.2013 19:33, Navdeep Parhar wrote:
 On 08/19/13 04:16, Andre Oppermann wrote:
> Author: andre
> Date: Mon Aug 19 11:16:53 2013
> New Revision: 254520
> URL: http://svnweb.freebsd.org/changeset/base/254520
> 
> Log:
>Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree
> users
>for a very long time, if ever.
> 
>Should such a functionality ever be needed again the appropriate and
>much better way to do it is through a custom EXT_SOMETHING
> external mbuf
>type together with a dedicated *ext_free function.
> 
>Discussed with:trociny, glebius
> 
> Modified:
>head/sys/kern/kern_mbuf.c
>head/sys/kern/uipc_mbuf.c
>head/sys/sys/mbuf.h
> 
 
 Hello Andre,
 
 Is this just garbage collection or is there some other reason for this?
>>> 
>>> This is garbage collection and removal of not quite right, rotten,
>>> functionality.
>>> 
 I recently tried some experiments to reduce the number of mbuf and
 cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
 very useful and the code changes to the kernel were minimal.  See
 user/np/cxl_tuning.  The experiment was quite successful and I was
 planning to bring in most of those changes to HEAD.  I was hoping to get
 some runtime mileage on the approach in general before tweaking the
 ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
 within the cluster.  But now M_NOFREE has vanished without a warning...
>>> 
>>> I'm looking through your experimental code and that is some really good
>>> numbers you're achieving there!
>>> 
>>> However a couple things don't feel quite right, hackish even, and not
>>> fit for HEAD.  This is a bit the same situation we had with some of the
>>> first 1GigE cards quite a number of years back (mostly ti(4)).  There
>>> we ended up with a couple of just good enough hacks to make it fast.
>>> Most of the remains I've collected today.
>> 
>> If M_NOFREE and EXT_EXTREF are properly supported in the tree (and I'm
>> arguing that they were, before r254520) then the changes are perfectly
>> legitimate.  The only hackish part was that I was getting the cluster
>> from the jumbop zone while bypassing its normal refcnt mechanism.  This
>> I did so as to use the same zone as m_uiotombuf to keep it "hot" for all
>> consumers (driver + network stack).
> 
> If you insist I'll revert the commit removing M_NOFREE.  EXT_EXTREF isn't
> touched yet, but should get better support.
> 
> The hackish part for me is that the driver again manages its own memory
> pool.  Windows works that way, NetBSD is moving towards it while FreeBSD
> has and remains at a central network memory pool.  The latter (our current)
> way of doing it seems more efficient overall especially on heavily loaded
> networked machines.  There may be significant queues building (think app
> blocked having many sockets buffer fill up) up delaying the freeing and
> returning of network memory resources.  Together with fragmentation this
> can lead to bad very outcomes.  Router applications with many interfaces
> also greatly benefit from central memory pools.
> 
> So I'm really not sure that we should move back in the driver owned pool
> direction with lots of code duplication and copy-pasting (see NetBSD).
> Also it is kinda weird to have a kernel based pool for data going down
> the stack and another one in each driver for those going up.
> 
> Actually I'm of the opinion that we should stay with the central memory
> pool and fix so that it works just as well for those cases a driver pool
> currently performs better.

The central memory pool approach is too slow, unfortunately.  There's a
reason that other OS's are moving to them.  At Netflix we are
currently working on some approaches to private memory pools in order to
achieve better efficiency, and we're closely watching and anticipating Navdeep's
work.

Scott

___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r254520 - in head/sys: kern sys

2013-08-21 Thread Peter Grehan

Hi Andre,


On 21.08.2013 17:03, Peter Grehan wrote:

The way to go should be 4K clusters as they are native to the
architecture.
IIRC a PCIe DMA can't cross a 4K boundary anyway


  That's a 4G boundary, for some devices.


I meant a single PCIe DMA transaction can be at most 4K before it has to
set up another one?


 4K is the maximum TLP size but that's rarely (ever?) used - 128 or 256 
bytes is more common. A DMA operation is just a sequence of TLPs 
originating from the endpoint.


later,

Peter.

___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r254520 - in head/sys: kern sys

2013-08-21 Thread Andre Oppermann

On 21.08.2013 17:03, Peter Grehan wrote:

The way to go should be 4K clusters as they are native to the architecture.
IIRC a PCIe DMA can't cross a 4K boundary anyway


  That's a 4G boundary, for some devices.


I meant a single PCIe DMA transaction can be at most 4K before it has to set
up another one?

--
Andre

___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r254520 - in head/sys: kern sys

2013-08-21 Thread Peter Grehan

The way to go should be 4K clusters as they are native to the architecture.
IIRC a PCIe DMA can't cross a 4K boundary anyway


 That's a 4G boundary, for some devices.

later,

Peter.

___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r254520 - in head/sys: kern sys

2013-08-21 Thread Andre Oppermann

On 20.08.2013 00:03, Navdeep Parhar wrote:

On 08/19/13 14:08, Andre Oppermann wrote:

On 19.08.2013 19:40, Peter Grehan wrote:

I recently tried some experiments to reduce the number of mbuf and
cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
very useful and the code changes to the kernel were minimal.  See
user/np/cxl_tuning.  The experiment was quite successful and I was
planning to bring in most of those changes to HEAD.  I was hoping to get
some runtime mileage on the approach in general before tweaking the
ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
within the cluster.  But now M_NOFREE has vanished without a warning...


   I also had a virtualization work-in-progress where static mbufs were
allocated in the kernel and
M_NOFREE set.

   Might be worth sending a prior heads-up for these type of changes.


I'm sorry for ambushing but this stuff has to be done.  I have provided
an alternative way of handling it and I'm happy to help you with your
use case to make it good for you and to prevent the mbuf system from
getting bloated and hackish again.


I don't know what Peter's use case is but I'm curious about the
already-available alternative to M_NOFREE, if that's what you meant.
Can you please elaborate?


When you supply your own (*ext_free) function you can simply omit freeing
the mbuf itself.  Should make sure not to leak it though.

--
Andre

___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r254520 - in head/sys: kern sys

2013-08-21 Thread Andre Oppermann

On 19.08.2013 23:45, Navdeep Parhar wrote:

On 08/19/13 13:58, Andre Oppermann wrote:

On 19.08.2013 19:33, Navdeep Parhar wrote:

On 08/19/13 04:16, Andre Oppermann wrote:

Author: andre
Date: Mon Aug 19 11:16:53 2013
New Revision: 254520
URL: http://svnweb.freebsd.org/changeset/base/254520

Log:
Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree
users
for a very long time, if ever.

Should such a functionality ever be needed again the appropriate and
much better way to do it is through a custom EXT_SOMETHING
external mbuf
type together with a dedicated *ext_free function.

Discussed with:trociny, glebius

Modified:
head/sys/kern/kern_mbuf.c
head/sys/kern/uipc_mbuf.c
head/sys/sys/mbuf.h



Hello Andre,

Is this just garbage collection or is there some other reason for this?


This is garbage collection and removal of not quite right, rotten,
functionality.


I recently tried some experiments to reduce the number of mbuf and
cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
very useful and the code changes to the kernel were minimal.  See
user/np/cxl_tuning.  The experiment was quite successful and I was
planning to bring in most of those changes to HEAD.  I was hoping to get
some runtime mileage on the approach in general before tweaking the
ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
within the cluster.  But now M_NOFREE has vanished without a warning...


I'm looking through your experimental code and that is some really good
numbers you're achieving there!

However a couple things don't feel quite right, hackish even, and not
fit for HEAD.  This is a bit the same situation we had with some of the
first 1GigE cards quite a number of years back (mostly ti(4)).  There
we ended up with a couple of just good enough hacks to make it fast.
Most of the remains I've collected today.


If M_NOFREE and EXT_EXTREF are properly supported in the tree (and I'm
arguing that they were, before r254520) then the changes are perfectly
legitimate.  The only hackish part was that I was getting the cluster
from the jumbop zone while bypassing its normal refcnt mechanism.  This
I did so as to use the same zone as m_uiotombuf to keep it "hot" for all
consumers (driver + network stack).


If you insist I'll revert the commit removing M_NOFREE.  EXT_EXTREF isn't
touched yet, but should get better support.

The hackish part for me is that the driver again manages its own memory
pool.  Windows works that way, NetBSD is moving towards it while FreeBSD
has and remains at a central network memory pool.  The latter (our current)
way of doing it seems more efficient overall especially on heavily loaded
networked machines.  There may be significant queues building (think app
blocked having many sockets buffer fill up) up delaying the freeing and
returning of network memory resources.  Together with fragmentation this
can lead to bad very outcomes.  Router applications with many interfaces
also greatly benefit from central memory pools.

So I'm really not sure that we should move back in the driver owned pool
direction with lots of code duplication and copy-pasting (see NetBSD).
Also it is kinda weird to have a kernel based pool for data going down
the stack and another one in each driver for those going up.

Actually I'm of the opinion that we should stay with the central memory
pool and fix so that it works just as well for those cases a driver pool
currently performs better.


I believe most of the improvements you've shown can be implemented in
a more generic and safe way into the mbuf system.  Also a number of things
in your experimental code may have side-effects in situations other than
netperf runs.


Agreed.  As I mentioned my long term plan was to tweak the jumboX zones
to allow them to allocate cluster with embedded mbuf + refcount.  The
M_NOFREE+EXT_EXTREF approach is the perfect bridge from here to there.
It is non-intrusive and lends itself well to experimentation.


Agreed, full support on fixing the refcount issue.


To summarize what I get from your experimental branch commits:
- the Chelsio T4/T5 card can DMA multiple ethernet frames (packets) into
   a single memory buffer, provided it is large enough.
- you make use of that feature and point multiple m_ext mbufs into that
   buffer to separate the packets and send them up the stack.
- you embed the m_ext refcount into the single memory buffer as well.


yes, yes, and yes.


- you recycle mbufs? (I'm not entirely clear on that as I'm not familiar
   with the cxgbe code)


I recycle the cluster (and the embedded mbuf in it) when possible.  All
depends on whether it's still in use by the time the rx queue needs it back.


There's always a couple of problems with driver managed pools.  Driver shutdown/
unloading requires all such managed mbufs to have returned before it can 
proceed.
This may take a undetermined long time as mbufs are sitting in socket buffers or
other q

Re: svn commit: r254520 - in head/sys: kern sys

2013-08-19 Thread Navdeep Parhar
On 08/19/13 14:08, Andre Oppermann wrote:
> On 19.08.2013 19:40, Peter Grehan wrote:
>>> I recently tried some experiments to reduce the number of mbuf and
>>> cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
>>> very useful and the code changes to the kernel were minimal.  See
>>> user/np/cxl_tuning.  The experiment was quite successful and I was
>>> planning to bring in most of those changes to HEAD.  I was hoping to get
>>> some runtime mileage on the approach in general before tweaking the
>>> ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
>>> within the cluster.  But now M_NOFREE has vanished without a warning...
>>
>>   I also had a virtualization work-in-progress where static mbufs were
>> allocated in the kernel and
>> M_NOFREE set.
>>
>>   Might be worth sending a prior heads-up for these type of changes.
> 
> I'm sorry for ambushing but this stuff has to be done.  I have provided
> an alternative way of handling it and I'm happy to help you with your
> use case to make it good for you and to prevent the mbuf system from
> getting bloated and hackish again.

I don't know what Peter's use case is but I'm curious about the
already-available alternative to M_NOFREE, if that's what you meant.
Can you please elaborate?

Regards,
Navdeep
___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r254520 - in head/sys: kern sys

2013-08-19 Thread Navdeep Parhar
On 08/19/13 13:58, Andre Oppermann wrote:
> On 19.08.2013 19:33, Navdeep Parhar wrote:
>> On 08/19/13 04:16, Andre Oppermann wrote:
>>> Author: andre
>>> Date: Mon Aug 19 11:16:53 2013
>>> New Revision: 254520
>>> URL: http://svnweb.freebsd.org/changeset/base/254520
>>>
>>> Log:
>>>Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree
>>> users
>>>for a very long time, if ever.
>>>
>>>Should such a functionality ever be needed again the appropriate and
>>>much better way to do it is through a custom EXT_SOMETHING
>>> external mbuf
>>>type together with a dedicated *ext_free function.
>>>
>>>Discussed with:trociny, glebius
>>>
>>> Modified:
>>>head/sys/kern/kern_mbuf.c
>>>head/sys/kern/uipc_mbuf.c
>>>head/sys/sys/mbuf.h
>>>
>>
>> Hello Andre,
>>
>> Is this just garbage collection or is there some other reason for this?
> 
> This is garbage collection and removal of not quite right, rotten,
> functionality.
> 
>> I recently tried some experiments to reduce the number of mbuf and
>> cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
>> very useful and the code changes to the kernel were minimal.  See
>> user/np/cxl_tuning.  The experiment was quite successful and I was
>> planning to bring in most of those changes to HEAD.  I was hoping to get
>> some runtime mileage on the approach in general before tweaking the
>> ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
>> within the cluster.  But now M_NOFREE has vanished without a warning...
> 
> I'm looking through your experimental code and that is some really good
> numbers you're achieving there!
> 
> However a couple things don't feel quite right, hackish even, and not
> fit for HEAD.  This is a bit the same situation we had with some of the
> first 1GigE cards quite a number of years back (mostly ti(4)).  There
> we ended up with a couple of just good enough hacks to make it fast.
> Most of the remains I've collected today.

If M_NOFREE and EXT_EXTREF are properly supported in the tree (and I'm
arguing that they were, before r254520) then the changes are perfectly
legitimate.  The only hackish part was that I was getting the cluster
from the jumbop zone while bypassing its normal refcnt mechanism.  This
I did so as to use the same zone as m_uiotombuf to keep it "hot" for all
consumers (driver + network stack).

> I believe most of the improvements you've shown can be implemented in
> a more generic and safe way into the mbuf system.  Also a number of things
> in your experimental code may have side-effects in situations other than
> netperf runs.

Agreed.  As I mentioned my long term plan was to tweak the jumboX zones
to allow them to allocate cluster with embedded mbuf + refcount.  The
M_NOFREE+EXT_EXTREF approach is the perfect bridge from here to there.
It is non-intrusive and lends itself well to experimentation.

> 
> To summarize what I get from your experimental branch commits:
> - the Chelsio T4/T5 card can DMA multiple ethernet frames (packets) into
>   a single memory buffer, provided it is large enough.
> - you make use of that feature and point multiple m_ext mbufs into that
>   buffer to separate the packets and send them up the stack.
> - you embed the m_ext refcount into the single memory buffer as well.

yes, yes, and yes.

> - you recycle mbufs? (I'm not entirely clear on that as I'm not familiar
>   with the cxgbe code)

I recycle the cluster (and the embedded mbuf in it) when possible.  All
depends on whether it's still in use by the time the rx queue needs it back.

> 
> Lets examine and discuss these parts:
> - M_NOFREE wasn't really safe with bpf anyway at least for packets going
>   down the stack.

I see.  Can you point out the parts of bpf unable to deal with M_NOFREE?

> - Instead of M_NOFREE a custom *ext_free should be used that has the same
>   and even more functionality.

Yes, that's what I was planning too, with the jumboX zone changes.  It
would be faster than running the m_ext's free function (which is a
function dereference+call).

> - Recycling mbufs may cause imbalances to the mbuf pool with multiple cores
>   and possibly NUMA in the future.  Not that we are very good with it at
>   the moment but bypassing the mbuf allocator shouldn't become the norm.
>   If it is a problem/bottleneck again it should be fixed, not worked around
>   and copy-pasted n-times in so many drivers.

If/when a cluster is recycled, it is given back to the same rx ithread
that originally allocated it, and not not any other queue.  If the
ithread stays in the same NUMA domain (and it really should) then
recycling the cluster in the same queue really shouldn't cause imbalances.

> - jumbo9 and jumbo16 mbufs should not be used because they are more special
>   than necessary with being KVM and physically contiguous.  This probably
>   isn't necessary for the T4/T5 cards and any other modern DMA engine.
>   Under heavy diverse network the kernel memory becom

Re: svn commit: r254520 - in head/sys: kern sys

2013-08-19 Thread Adrian Chadd
fwiw - +1 on everything you've said to date about this.



-adrian
___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r254520 - in head/sys: kern sys

2013-08-19 Thread Andre Oppermann

On 19.08.2013 19:40, Peter Grehan wrote:

I recently tried some experiments to reduce the number of mbuf and
cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
very useful and the code changes to the kernel were minimal.  See
user/np/cxl_tuning.  The experiment was quite successful and I was
planning to bring in most of those changes to HEAD.  I was hoping to get
some runtime mileage on the approach in general before tweaking the
ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
within the cluster.  But now M_NOFREE has vanished without a warning...


  I also had a virtualization work-in-progress where static mbufs were 
allocated in the kernel and
M_NOFREE set.

  Might be worth sending a prior heads-up for these type of changes.


I'm sorry for ambushing but this stuff has to be done.  I have provided
an alternative way of handling it and I'm happy to help you with your
use case to make it good for you and to prevent the mbuf system from
getting bloated and hackish again.

Mbuf is a core system for the kernel and we should avoid to kitchen-sink
it again while at the same time to keep speedy enough to keep up with
the speed requirements.

I believe it would be bad to have Navdeep, you and others invent their
own network buffer management routines over again in slightly different
ways tailored to each immediate use case.

Can you please describe your intended use of M_NOFREE to better understand
the shortcomings of the current mbuf systems and the additional advantages
of the M_NOFREE case?

--
Andre

___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r254520 - in head/sys: kern sys

2013-08-19 Thread Andre Oppermann

On 19.08.2013 19:33, Navdeep Parhar wrote:

On 08/19/13 04:16, Andre Oppermann wrote:

Author: andre
Date: Mon Aug 19 11:16:53 2013
New Revision: 254520
URL: http://svnweb.freebsd.org/changeset/base/254520

Log:
   Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree users
   for a very long time, if ever.

   Should such a functionality ever be needed again the appropriate and
   much better way to do it is through a custom EXT_SOMETHING external mbuf
   type together with a dedicated *ext_free function.

   Discussed with:  trociny, glebius

Modified:
   head/sys/kern/kern_mbuf.c
   head/sys/kern/uipc_mbuf.c
   head/sys/sys/mbuf.h



Hello Andre,

Is this just garbage collection or is there some other reason for this?


This is garbage collection and removal of not quite right, rotten, 
functionality.


I recently tried some experiments to reduce the number of mbuf and
cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
very useful and the code changes to the kernel were minimal.  See
user/np/cxl_tuning.  The experiment was quite successful and I was
planning to bring in most of those changes to HEAD.  I was hoping to get
some runtime mileage on the approach in general before tweaking the
ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
within the cluster.  But now M_NOFREE has vanished without a warning...


I'm looking through your experimental code and that is some really good
numbers you're achieving there!

However a couple things don't feel quite right, hackish even, and not
fit for HEAD.  This is a bit the same situation we had with some of the
first 1GigE cards quite a number of years back (mostly ti(4)).  There
we ended up with a couple of just good enough hacks to make it fast.
Most of the remains I've collected today.

I believe most of the improvements you've shown can be implemented in
a more generic and safe way into the mbuf system.  Also a number of things
in your experimental code may have side-effects in situations other than
netperf runs.

To summarize what I get from your experimental branch commits:
- the Chelsio T4/T5 card can DMA multiple ethernet frames (packets) into
  a single memory buffer, provided it is large enough.
- you make use of that feature and point multiple m_ext mbufs into that
  buffer to separate the packets and send them up the stack.
- you embed the m_ext refcount into the single memory buffer as well.
- you recycle mbufs? (I'm not entirely clear on that as I'm not familiar
  with the cxgbe code)

Lets examine and discuss these parts:
- M_NOFREE wasn't really safe with bpf anyway at least for packets going
  down the stack.
- Instead of M_NOFREE a custom *ext_free should be used that has the same
  and even more functionality.
- Recycling mbufs may cause imbalances to the mbuf pool with multiple cores
  and possibly NUMA in the future.  Not that we are very good with it at
  the moment but bypassing the mbuf allocator shouldn't become the norm.
  If it is a problem/bottleneck again it should be fixed, not worked around
  and copy-pasted n-times in so many drivers.
- jumbo9 and jumbo16 mbufs should not be used because they are more special
  than necessary with being KVM and physically contiguous.  This probably
  isn't necessary for the T4/T5 cards and any other modern DMA engine.
  Under heavy diverse network the kernel memory becomes fragmented and can't
  find memory fulfilling both criteria anymore.  In fact both are an artifact
  of the early GigE hacks when high speed DMA engines were in their infancy.
  Both jumbo9 and jumbo16 should go away without direct replacement.
  In your T4/T5 case the alternative would be either to a) allocate your
  own memory directly from KVM with the necessary properties (KVM and/or phys
  contig); b) have such a generic kernel mbuf type fulfilling the same role.
  There may be some cache line issues on non-x86 systems that have to be
  though and taken care of.
- Refcounts should have the option of being separately allocated.  It was
  mandatory to use the refcount zone so far because there wasn't any other
  type of refcount.  Now that we have one we should revisit the issue.
  Actually the entire mbuf allocation and initialization could be streamlined
  as a whole.

On a side note a different mbuf handling for the RX DMA rings may give some
significant improvements as well: allocate just a cluster without a mbuf through
m_cljget() and put it into the RX ring.  Only when the DMA has put a packet into
it allocated attach the mbuf (in the drivers RX function).  This avoids the 
cache
pollution from touching the mbuf fields during initial allocation, including the
refcount.

Another nice trick would be to shorten the mbuf by 4 bytes (in ext_size) and put
the refcount there.

Lets work on these together.

--
Andre

___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, sen

Re: svn commit: r254520 - in head/sys: kern sys

2013-08-19 Thread Peter Grehan

I recently tried some experiments to reduce the number of mbuf and
cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
very useful and the code changes to the kernel were minimal.  See
user/np/cxl_tuning.  The experiment was quite successful and I was
planning to bring in most of those changes to HEAD.  I was hoping to get
some runtime mileage on the approach in general before tweaking the
ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
within the cluster.  But now M_NOFREE has vanished without a warning...


 I also had a virtualization work-in-progress where static mbufs were 
allocated in the kernel and M_NOFREE set.


 Might be worth sending a prior heads-up for these type of changes.

later,

Peter.
___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r254520 - in head/sys: kern sys

2013-08-19 Thread Navdeep Parhar
On 08/19/13 04:16, Andre Oppermann wrote:
> Author: andre
> Date: Mon Aug 19 11:16:53 2013
> New Revision: 254520
> URL: http://svnweb.freebsd.org/changeset/base/254520
> 
> Log:
>   Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree users
>   for a very long time, if ever.
>   
>   Should such a functionality ever be needed again the appropriate and
>   much better way to do it is through a custom EXT_SOMETHING external mbuf
>   type together with a dedicated *ext_free function.
>   
>   Discussed with: trociny, glebius
> 
> Modified:
>   head/sys/kern/kern_mbuf.c
>   head/sys/kern/uipc_mbuf.c
>   head/sys/sys/mbuf.h
> 

Hello Andre,

Is this just garbage collection or is there some other reason for this?

I recently tried some experiments to reduce the number of mbuf and
cluster allocations in a 40G NIC driver.  M_NOFREE and EXT_EXTREF proved
very useful and the code changes to the kernel were minimal.  See
user/np/cxl_tuning.  The experiment was quite successful and I was
planning to bring in most of those changes to HEAD.  I was hoping to get
some runtime mileage on the approach in general before tweaking the
ctors/dtors for jumpbo, jumbo9, jumbo16 to allow for an mbuf+refcnt
within the cluster.  But now M_NOFREE has vanished without a warning...

Regards,
Navdeep
___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r254520 - in head/sys: kern sys

2013-08-19 Thread Andre Oppermann

On 19.08.2013 13:23, Davide Italiano wrote:

On Mon, Aug 19, 2013 at 1:16 PM, Andre Oppermann  wrote:

Modified: head/sys/sys/mbuf.h
==
--- head/sys/sys/mbuf.h Mon Aug 19 11:08:36 2013(r254519)
+++ head/sys/sys/mbuf.h Mon Aug 19 11:16:53 2013(r254520)
@@ -200,7 +200,7 @@ struct mbuf {
  /* 0x8000free */
  #defineM_VLANTAG   0x0001 /* ether_vtag is valid */
  #defineM_PROMISC   0x0002 /* packet was not for us */
-#defineM_NOFREE0x0004 /* do not free mbuf, embedded in 
cluster */
+/* 0x0004free */


I think you should just use M_UNUSED or something similar here for
consistency, like it's happening in td_pflags (sys/sys/proc.h), e.g.:

#define TDP_UNUSED9 0x0100 /* --available-- */


There's a couple of more changes upcoming that will take care of it.

--
Andre

___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


Re: svn commit: r254520 - in head/sys: kern sys

2013-08-19 Thread Davide Italiano
On Mon, Aug 19, 2013 at 1:16 PM, Andre Oppermann  wrote:
> Author: andre
> Date: Mon Aug 19 11:16:53 2013
> New Revision: 254520
> URL: http://svnweb.freebsd.org/changeset/base/254520
>
> Log:
>   Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree users
>   for a very long time, if ever.
>
>   Should such a functionality ever be needed again the appropriate and
>   much better way to do it is through a custom EXT_SOMETHING external mbuf
>   type together with a dedicated *ext_free function.
>
>   Discussed with:   trociny, glebius
>
> Modified:
>   head/sys/kern/kern_mbuf.c
>   head/sys/kern/uipc_mbuf.c
>   head/sys/sys/mbuf.h
>
> Modified: head/sys/kern/kern_mbuf.c
> ==
> --- head/sys/kern/kern_mbuf.c   Mon Aug 19 11:08:36 2013(r254519)
> +++ head/sys/kern/kern_mbuf.c   Mon Aug 19 11:16:53 2013(r254520)
> @@ -474,7 +474,6 @@ mb_dtor_mbuf(void *mem, int size, void *
> if ((flags & MB_NOTAGS) == 0 && (m->m_flags & M_PKTHDR) != 0)
> m_tag_delete_chain(m, NULL);
> KASSERT((m->m_flags & M_EXT) == 0, ("%s: M_EXT set", __func__));
> -   KASSERT((m->m_flags & M_NOFREE) == 0, ("%s: M_NOFREE set", __func__));
>  #ifdef INVARIANTS
> trash_dtor(mem, size, arg);
>  #endif
>
> Modified: head/sys/kern/uipc_mbuf.c
> ==
> --- head/sys/kern/uipc_mbuf.c   Mon Aug 19 11:08:36 2013(r254519)
> +++ head/sys/kern/uipc_mbuf.c   Mon Aug 19 11:16:53 2013(r254520)
> @@ -278,17 +278,10 @@ m_extadd(struct mbuf *mb, caddr_t buf, u
>  void
>  mb_free_ext(struct mbuf *m)
>  {
> -   int skipmbuf;
>
> KASSERT((m->m_flags & M_EXT) == M_EXT, ("%s: M_EXT not set", 
> __func__));
> KASSERT(m->m_ext.ref_cnt != NULL, ("%s: ref_cnt not set", __func__));
>
> -
> -   /*
> -* check if the header is embedded in the cluster
> -*/
> -   skipmbuf = (m->m_flags & M_NOFREE);
> -
> /* Free attached storage if this mbuf is the only reference to it. */
> if (*(m->m_ext.ref_cnt) == 1 ||
> atomic_fetchadd_int(m->m_ext.ref_cnt, -1) == 1) {
> @@ -329,8 +322,6 @@ mb_free_ext(struct mbuf *m)
> ("%s: unknown ext_type", __func__));
> }
> }
> -   if (skipmbuf)
> -   return;
>
> /*
>  * Free this mbuf back to the mbuf zone with all m_ext
> @@ -395,7 +386,7 @@ m_demote(struct mbuf *m0, int all)
> m_freem(m->m_nextpkt);
> m->m_nextpkt = NULL;
> }
> -   m->m_flags = m->m_flags & (M_EXT|M_RDONLY|M_NOFREE);
> +   m->m_flags = m->m_flags & (M_EXT|M_RDONLY);
> }
>  }
>
>
> Modified: head/sys/sys/mbuf.h
> ==
> --- head/sys/sys/mbuf.h Mon Aug 19 11:08:36 2013(r254519)
> +++ head/sys/sys/mbuf.h Mon Aug 19 11:16:53 2013(r254520)
> @@ -200,7 +200,7 @@ struct mbuf {
>  /* 0x8000free */
>  #defineM_VLANTAG   0x0001 /* ether_vtag is valid */
>  #defineM_PROMISC   0x0002 /* packet was not for us */
> -#defineM_NOFREE0x0004 /* do not free mbuf, embedded in 
> cluster */
> +/* 0x0004free */

I think you should just use M_UNUSED or something similar here for
consistency, like it's happening in td_pflags (sys/sys/proc.h), e.g.:

#define TDP_UNUSED9 0x0100 /* --available-- */


>  #defineM_PROTO60x0008 /* protocol-specific */
>  #defineM_PROTO70x0010 /* protocol-specific */
>  #defineM_PROTO80x0020 /* protocol-specific */
> @@ -515,7 +515,7 @@ m_free(struct mbuf *m)
>
> if (m->m_flags & M_EXT)
> mb_free_ext(m);
> -   else if ((m->m_flags & M_NOFREE) == 0)
> +   else
> uma_zfree(zone_mbuf, m);
> return (n);
>  }



-- 
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"


svn commit: r254520 - in head/sys: kern sys

2013-08-19 Thread Andre Oppermann
Author: andre
Date: Mon Aug 19 11:16:53 2013
New Revision: 254520
URL: http://svnweb.freebsd.org/changeset/base/254520

Log:
  Remove the unused M_NOFREE mbuf flag.  It didn't have any in-tree users
  for a very long time, if ever.
  
  Should such a functionality ever be needed again the appropriate and
  much better way to do it is through a custom EXT_SOMETHING external mbuf
  type together with a dedicated *ext_free function.
  
  Discussed with:   trociny, glebius

Modified:
  head/sys/kern/kern_mbuf.c
  head/sys/kern/uipc_mbuf.c
  head/sys/sys/mbuf.h

Modified: head/sys/kern/kern_mbuf.c
==
--- head/sys/kern/kern_mbuf.c   Mon Aug 19 11:08:36 2013(r254519)
+++ head/sys/kern/kern_mbuf.c   Mon Aug 19 11:16:53 2013(r254520)
@@ -474,7 +474,6 @@ mb_dtor_mbuf(void *mem, int size, void *
if ((flags & MB_NOTAGS) == 0 && (m->m_flags & M_PKTHDR) != 0)
m_tag_delete_chain(m, NULL);
KASSERT((m->m_flags & M_EXT) == 0, ("%s: M_EXT set", __func__));
-   KASSERT((m->m_flags & M_NOFREE) == 0, ("%s: M_NOFREE set", __func__));
 #ifdef INVARIANTS
trash_dtor(mem, size, arg);
 #endif

Modified: head/sys/kern/uipc_mbuf.c
==
--- head/sys/kern/uipc_mbuf.c   Mon Aug 19 11:08:36 2013(r254519)
+++ head/sys/kern/uipc_mbuf.c   Mon Aug 19 11:16:53 2013(r254520)
@@ -278,17 +278,10 @@ m_extadd(struct mbuf *mb, caddr_t buf, u
 void
 mb_free_ext(struct mbuf *m)
 {
-   int skipmbuf;

KASSERT((m->m_flags & M_EXT) == M_EXT, ("%s: M_EXT not set", __func__));
KASSERT(m->m_ext.ref_cnt != NULL, ("%s: ref_cnt not set", __func__));
 
-
-   /*
-* check if the header is embedded in the cluster
-*/ 
-   skipmbuf = (m->m_flags & M_NOFREE);
-   
/* Free attached storage if this mbuf is the only reference to it. */
if (*(m->m_ext.ref_cnt) == 1 ||
atomic_fetchadd_int(m->m_ext.ref_cnt, -1) == 1) {
@@ -329,8 +322,6 @@ mb_free_ext(struct mbuf *m)
("%s: unknown ext_type", __func__));
}
}
-   if (skipmbuf)
-   return;

/*
 * Free this mbuf back to the mbuf zone with all m_ext
@@ -395,7 +386,7 @@ m_demote(struct mbuf *m0, int all)
m_freem(m->m_nextpkt);
m->m_nextpkt = NULL;
}
-   m->m_flags = m->m_flags & (M_EXT|M_RDONLY|M_NOFREE);
+   m->m_flags = m->m_flags & (M_EXT|M_RDONLY);
}
 }
 

Modified: head/sys/sys/mbuf.h
==
--- head/sys/sys/mbuf.h Mon Aug 19 11:08:36 2013(r254519)
+++ head/sys/sys/mbuf.h Mon Aug 19 11:16:53 2013(r254520)
@@ -200,7 +200,7 @@ struct mbuf {
 /* 0x8000free */
 #defineM_VLANTAG   0x0001 /* ether_vtag is valid */
 #defineM_PROMISC   0x0002 /* packet was not for us */
-#defineM_NOFREE0x0004 /* do not free mbuf, embedded in 
cluster */
+/* 0x0004free */
 #defineM_PROTO60x0008 /* protocol-specific */
 #defineM_PROTO70x0010 /* protocol-specific */
 #defineM_PROTO80x0020 /* protocol-specific */
@@ -515,7 +515,7 @@ m_free(struct mbuf *m)
 
if (m->m_flags & M_EXT)
mb_free_ext(m);
-   else if ((m->m_flags & M_NOFREE) == 0)
+   else
uma_zfree(zone_mbuf, m);
return (n);
 }
___
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"