RE: [RFC PATCH 00/30] Kernel NET policy

2016-07-19 Thread Liang, Kan


> > Yes, rtnl will bring some overheads. But the configuration is one time
> > thing for application or socket. It only happens on receiving first
> > packet.
> 
> Thanks for destroying our connection rates.
> 
> This kind of overhead is simply unacceptable.

If so, I think I can make the configuration asynchronous in the next
version, so the connection rate should not be affected.

Thanks,
Kan



Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread David Miller
From: "Liang, Kan" 
Date: Tue, 19 Jul 2016 01:49:41 +

> Yes, rtnl will bring some overheads. But the configuration is one
> time thing for application or socket. It only happens on receiving
> first packet.

Thanks for destroying our connection rates.

This kind of overhead is simply unacceptable.


RE: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Liang, Kan


> 
> > Also of course it would be fundamentally less efficient than kernel
> > code doing that, just because of the additional context switches
> > needed.
> 
> Synchronizing or configuring any kind of queues already requires rtnl_mutex.
> I didn't test it but acquiring rtnl mutex in inet_recvmsg is unlikely to fly
> performance wise and

Yes, rtnl will bring some overhead, but the configuration is a one-time thing for
an application or socket. It only happens on receiving the first packet.
Unless the application/socket only transmits a few packets, the overhead
can be ignored. And if it only transmits a few packets, why would it care
about performance?

> might even be very dangerous under DoS attacks (like
> I see in 24/30).
> 
Patch 29/30 tries to prevent such a case.

Thanks,
Kan


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Hannes Frederic Sowa
Hello,

On Mon, Jul 18, 2016, at 21:43, Andi Kleen wrote:
> > I wonder if this can be attacked from a different angle. What would be
> > missing to add support for this in user space? The first possibility
> > that came to my mind is to just multiplex those hints in the kernel.
> 
> "just" is the handwaving part here -- you're proposing a micro kernel
> approach where part of the multiplexing job that the kernel is doing
> is farmed out to a message passing user space component.
> 
> I suspect this would be far more complicated to get right and
> perform well than a straight forward monolithic kernel subsystem --
> which is traditionally how Linux has approached things.

At the same time, having any kind of policy in the kernel has also always
been avoided.

> The daemon would always need to work with out of date state
> compared to the latest, because it cannot do any locking with the
> kernel state.  So you end up with a complex distributed system with
> multiple
> agents "fighting" with each other, and the tuning agent
> never being able to keep up with the actual work.

But you don't want to have the tuning agents in the fast path? If you
really try to synchronously update all queue mappings/irqs during socket
creation or connect time this would add rtnl lock to basically socket
creation, as drivers require that. This would slow down basic socket
operations a lot and synchronize them with the management interface.

Even dst_entries are not synchronously updated anymore nowadays as that
would require too much locking overhead in the kernel.

> Also of course it would be fundamentally less efficient than
> kernel code doing that, just because of the additional context
> switches needed.

Synchronizing or configuring any kind of queues already requires
rtnl_mutex. I didn't test it but acquiring rtnl mutex in inet_recvmsg is
unlikely to fly performance wise and might even be very dangerous under
DoS attacks (like I see in 24/30).

Bye,
Hannes


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Daniel Borkmann

On 07/18/2016 08:30 PM, Liang, Kan wrote:

On 07/18/2016 08:55 AM, kan.li...@intel.com wrote:

[...]

Looking at the higher-level picture, why shouldn't, for example, a new cgroup in
combination with tc be the one resolving these policies on resource usage?


The NET policy doesn't support cgroups yet, but it's on my todo list.
The granularity for the device resource is per queue. The packet will be
redirected to a specific queue.
I'm not sure if cgroup with tc can do that.


Did you have a look at sch_mqprio, which can be used along with either the
netprio cgroup or the netcls cgroup plus tc on clsact's egress side to set
the priority for mqprio mappings from the application side? At least ixgbe,
i40e, fm10k and a number of other NICs have offload support for it. You
could also use cls_bpf for making the prio assignment if you need to
involve other metadata from the skb (like mark or prio derived from
sockets, etc). Maybe it doesn't cover all of what you need, but could be
a start to extend upon?
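[Editorial illustration of the direction described above: a minimal cls_bpf
sketch, not from this thread, that derives skb->priority from the skb mark so
an mqprio prio-to-tc mapping can place the packet in the corresponding traffic
class. The mark-to-priority mapping is purely illustrative.]

/* Minimal cls_bpf sketch: map skb->mark to skb->priority so that an
 * mqprio priority-to-traffic-class mapping can steer the packet.
 * Compile with: clang -O2 -target bpf -c prio.c -o prio.o */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#ifndef __section
#define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("classifier")
int prio_from_mark(struct __sk_buff *skb)
{
	/* Illustrative mapping: marks 1..3 select priorities 1..3; anything
	 * else keeps the default priority 0. */
	if (skb->mark >= 1 && skb->mark <= 3)
		skb->priority = skb->mark;

	return TC_ACT_OK;
}

[Such an object would be attached on the clsact egress hook mentioned above
(tc filter ... bpf), with mqprio configured on the device to map priorities to
traffic classes.]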

Thanks,
Daniel


RE: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Liang, Kan


> >>
> >> On Mon, Jul 18, 2016 at 8:45 AM, Andi Kleen  wrote:
> >> >> It seems strange to me to add such policies to the kernel.
> >> >> Addmittingly, documentation of some settings is non-existent and
> >> >> one needs various different tools to set this (sysctl, procfs, sysfs,
> ethtool, etc).
> >> >
> >> > The problem is that different applications need different policies.
> >> >
> >> > The only entity which can efficiently negotiate between different
> >> > applications' conflicting requests is the kernel. And that is
> >> > pretty much the basic job description of a kernel: multiplex
> >> > hardware efficiently between different users.
> >> >
> >> > So yes the user space tuning approach works for simple cases ("only
> >> > run workloads that require the same tuning"), but is ultimately not
> >> > very interesting nor scalable.
> >>
> >> I don't read the code yet, just the cover letter.
> >>
> >> We have global tunings, per-network-namespace tunings, per-socket
> tunings.
> >> It is still unclear why you can't just put different applications
> >> into different namespaces/containers to get different policies.
> >
> > In NET policy, we do per queue tunings.
> 
> Is it possible to isolate NIC queues for containers?

Yes, but we don't have container support yet.


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Cong Wang
On Mon, Jul 18, 2016 at 1:14 PM, Liang, Kan  wrote:
>
>
>>
>> On Mon, Jul 18, 2016 at 8:45 AM, Andi Kleen  wrote:
>> >> It seems strange to me to add such policies to the kernel.
>> >> Addmittingly, documentation of some settings is non-existent and one
>> >> needs various different tools to set this (sysctl, procfs, sysfs, 
>> >> ethtool, etc).
>> >
>> > The problem is that different applications need different policies.
>> >
>> > The only entity which can efficiently negotiate between different
>> > applications' conflicting requests is the kernel. And that is pretty
>> > much the basic job description of a kernel: multiplex hardware
>> > efficiently between different users.
>> >
>> > So yes the user space tuning approach works for simple cases ("only
>> > run workloads that require the same tuning"), but is ultimately not
>> > very interesting nor scalable.
>>
>> I don't read the code yet, just the cover letter.
>>
>> We have global tunings, per-network-namespace tunings, per-socket tunings.
>> It is still unclear why you can't just put different applications into 
>> different
>> namespaces/containers to get different policies.
>
> In NET policy, we do per queue tunings.

Is it possible to isolate NIC queues for containers?


RE: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Liang, Kan


> 
> On Mon, Jul 18, 2016 at 8:45 AM, Andi Kleen  wrote:
> >> It seems strange to me to add such policies to the kernel.
> >> Addmittingly, documentation of some settings is non-existent and one
> >> needs various different tools to set this (sysctl, procfs, sysfs, ethtool, 
> >> etc).
> >
> > The problem is that different applications need different policies.
> >
> > The only entity which can efficiently negotiate between different
> > applications' conflicting requests is the kernel. And that is pretty
> > much the basic job description of a kernel: multiplex hardware
> > efficiently between different users.
> >
> > So yes the user space tuning approach works for simple cases ("only
> > run workloads that require the same tuning"), but is ultimately not
> > very interesting nor scalable.
> 
> I don't read the code yet, just the cover letter.
> 
> We have global tunings, per-network-namespace tunings, per-socket tunings.
> It is still unclear why you can't just put different applications into 
> different
> namespaces/containers to get different policies.

In NET policy, we do per queue tunings.


Thanks,
Kan


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Andi Kleen
> > So where is your policy for power saving?  From past experience I can tell 
> > you
> 
> There is no policy for power saving yet. I will add it to my todo list.

Yes, it's interesting to consider. Is the main goal here to maximize CPU
idle residency? I wonder whether that's much different from the CPU policy.

-Andi


RE: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Liang, Kan


> On Sun, Jul 17, 2016 at 11:55 PM,   wrote:
> > From: Kan Liang 
> >
> > It is a big challenge to get good network performance. First, the
> > network performance is not good with default system settings. Second,
> > it is too difficult to do automatic tuning for all possible workloads,
> > since workloads have different requirements. Some workloads may want
> > high throughput. Some may need low latency. Last but not least, there are
> lots of manual configurations.
> > Fine grained configuration is too difficult for users.
> 
> The problem as I see it is that this is just going to end up likely being an 
> even
> more intrusive version of irqbalance.  I really don't like the way that turned
> out as it did a number of really dumb things that usually result in it being
> disabled as soon as you actually want to do anything that will actually 
> involve
> any kind of performance tuning.  If this stuff is pushed into the kernel it 
> will
> be even harder to get rid of and that is definitely a bad thing.
> 
> > NET policy intends to simplify the network configuration and get a
> > good network performance according to the hints(policy) which is
> > applied by user. It provides some typical "policies" for user which
> > can be set per-socket, per-task or per-device. The kernel will
> > automatically figures out how to merge different requests to get good
> network performance.
> 
> So where is your policy for power saving?  From past experience I can tell you

There is no policy for power saving yet. I will add it to my todo list.

> that while performance tuning is a good thing, doing so at the expense of
> power management is bad.  In addition you seem to be making a lot of
> assumptions here that the end users are going to rewrite their applications to
> use the new socket options you added in order to try and tune the

Currently, they can set a per-task policy via /proc to get good performance
without code changes.
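[As an editorial illustration, a hedged sketch of setting the per-task policy
for an unmodified application from the outside, via the /proc/$PID/net_policy
file this series adds. The accepted value format ("latency" as a policy name)
is an assumption; the series' documentation defines the real syntax.]

#include <stdio.h>
#include <sys/types.h>

/* Write a policy name into the per-task proc file added by this series. */
static int set_task_policy(pid_t pid, const char *policy_name)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/net_policy", (int)pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", policy_name);	/* e.g. "latency" (assumed format) */
	return fclose(f);
}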

> performance.  I have a hard time believing most developers are going to go
> to all that trouble.  In addition I suspect that even if they do go to that
> trouble they will probably still screw it up and you will end up with
> applications advertising latency as a goal when they should have specified
> CPU and so on.
> 
> > Net policy is designed for multiqueue network devices. This
> > implementation is only for Intel NICs using i40e driver. But the
> > concepts and generic code should apply to other multiqueue NICs too.
> 
> I would argue that your code is not very generic.  The fact that it is 
> relying on
> flow director already greatly limits what you can do.  If you want to make 
> this
> truly generic I would say you need to find ways to make this work on
> everything all the way down to things like i40evf and igb which don't have
> support for Flow Director.

Actually, the NET policy code uses ethtool's set_rxnfc interface to set rules,
so it should be generic.
I guess I emphasized Flow Director too much in the document, which may have
caused confusion.

> 
> > Net policy is also a combination of generic policy manager code and
> > some ethtool callbacks (per queue coalesce setting, flow
> > classification rules) to configure the driver.
> > This series also supports CPU hotplug and device hotplug.
> >
> > Here are some key Interfaces/APIs for NET policy.
> >
> >/proc/net/netpolicy/$DEV/policy
> >User can set/get per device policy from /proc
> >
> >/proc/$PID/net_policy
> >User can set/get per task policy from /proc
> >prctl(PR_SET_NETPOLICY, POLICY_NAME, NULL, NULL, NULL)
> >An alternative way to set/get per task policy is from prctl.
> >
> >setsockopt(sockfd,SOL_SOCKET,SO_NETPOLICY,,sizeof(int))
> >User can set/get per socket policy by setsockopt
> >
> >
> >int (*ndo_netpolicy_init)(struct net_device *dev,
> >  struct netpolicy_info *info);
> >Initialize device driver for NET policy
> >
> >int (*ndo_get_irq_info)(struct net_device *dev,
> >struct netpolicy_dev_info *info);
> >Collect device irq information
> 
> Instead of making the irq info a part of the ndo ops it might make more
> sense to make it part of an ethtool op.  Maybe you could make it so that you
> could specify a single queue at a time and get things like statistics, IRQ, 
> and
> ring information.

I will think about it. Thanks.

> 
> >int (*ndo_set_net_policy)(struct net_device *dev,
> >  enum netpolicy_name name);
> >Configure device according to policy name
> 
> I really don't like this piece of it.  I really think we shouldn't be leaving 
> so
> much up to the driver to determine how to handle things.

Some settings are device specific. For example, the interrupt moderation for
i40e with the BULK policy is (50, 125). For other devices, the numbers could
be different, and only tuning interrupt moderation may not be enough. So 

Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Andi Kleen
> I wonder if this can be attacked from a different angle. What would be
> missing to add support for this in user space? The first possibility
> that came to my mind is to just multiplex those hints in the kernel.

"just" is the handwaving part here -- you're proposing a micro kernel
approach where part of the multiplexing job that the kernel is doing
is farmed out to a message passing user space component.

I suspect this would be far more complicated to get right and
perform well than a straight forward monolithic kernel subsystem --
which is traditionally how Linux has approached things.

The daemon would always need to work with out of date state
compared to the latest, because it cannot do any locking with the
kernel state.  So you end up with a complex distributed system with multiple
agents "fighting" with each other, and the tuning agent
never being able to keep up with the actual work.

Also of course it would be fundamentally less efficient than
kernel code doing that, just because of the additional context
switches needed.

-Andi


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Hannes Frederic Sowa
On 18.07.2016 17:45, Andi Kleen wrote:
>> It seems strange to me to add such policies to the kernel.
>> Addmittingly, documentation of some settings is non-existent and one needs
>> various different tools to set this (sysctl, procfs, sysfs, ethtool, etc).
> 
> The problem is that different applications need different policies.

I fear that if those policies get changed in the future, people will rely on
some of their side effects, causing us to add more and more policies
which basically just differ in those side effects.

If you compare your policies to madvise or fadvise options, those seem to
have much stricter and narrower effects, which can be reasoned about much
more easily.

> The only entity which can efficiently negotiate between different
> applications' conflicting requests is the kernel. And that is pretty 
> much the basic job description of a kernel: multiplex hardware
> efficiently between different users.

The multiplexing part doesn't seem really relevant for the per-device
settings, so those can be controlled from current user space just fine.
Per-task settings could conflict with per-socket settings, which could
lead to non-deterministic behavior. Semantically it should probably be
made clear what overrides what here (here == cover letter).
Things like indeterminate allocation of sockets in a threaded
environment come to mind. Also, the allocation strategy could very much
depend on the installed RSS key.

> So yes the user space tuning approach works for simple cases
> ("only run workloads that require the same tuning"), but is ultimately not
> very interesting nor scalable.

I wonder if this can be attacked from a different angle. What would be
missing to add support for this in user space? The first possibility
that came to my mind is to just multiplex those hints in the kernel:
implement a generic way to add metadata to sockets and allow tuning
daemons to retrieve it via sock_diag. I could imagine that if the
SO_INCOMING_CPU information were visible in sock_diag, one could
already do more automatic tuning and basically implement your
policy in user space.
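[Editorial illustration of that user-space direction: a minimal sketch using
the existing SO_INCOMING_CPU socket option (available since Linux 3.19).
Reading it per socket is something a tuning daemon or the application itself
could already do today, even before any sock_diag plumbing exists.]

#include <sys/socket.h>

/* Return the CPU that last processed RX traffic for this socket, or -1 if
 * unknown/unsupported.  A tuning agent could use this to co-locate the
 * consuming thread with the RX processing. */
static int last_rx_cpu(int sockfd)
{
	int cpu = -1;
	socklen_t len = sizeof(cpu);

	if (getsockopt(sockfd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0)
		return -1;
	return cpu;
}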

Bye,
Hannes



RE: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Liang, Kan


> 
> Hi Kan,
> 
> On 07/18/2016 08:55 AM, kan.li...@intel.com wrote:
> > From: Kan Liang 
> >
> > It is a big challenge to get good network performance. First, the
> > network performance is not good with default system settings. Second,
> > it is too difficult to do automatic tuning for all possible workloads,
> > since workloads have different requirements. Some workloads may want
> > high throughput. Some may need low latency. Last but not least, there are
> lots of manual configurations.
> > Fine grained configuration is too difficult for users.
> >
> > NET policy intends to simplify the network configuration and get a
> > good network performance according to the hints(policy) which is
> > applied by user. It provides some typical "policies" for user which
> > can be set per-socket, per-task or per-device. The kernel will
> > automatically figures out how to merge different requests to get good
> network performance.
> > Net policy is designed for multiqueue network devices. This
> > implementation is only for Intel NICs using i40e driver. But the
> > concepts and generic code should apply to other multiqueue NICs too.
> > Net policy is also a combination of generic policy manager code and
> > some ethtool callbacks (per queue coalesce setting, flow
> > classification rules) to configure the driver.
> > This series also supports CPU hotplug and device hotplug.
> >
> > Here are some key Interfaces/APIs for NET policy.
> >
> > /proc/net/netpolicy/$DEV/policy
> > User can set/get per device policy from /proc
> >
> > /proc/$PID/net_policy
> > User can set/get per task policy from /proc
> > prctl(PR_SET_NETPOLICY, POLICY_NAME, NULL, NULL, NULL)
> > An alternative way to set/get per task policy is from prctl.
> >
> > setsockopt(sockfd,SOL_SOCKET,SO_NETPOLICY,,sizeof(int))
> > User can set/get per socket policy by setsockopt
> >
> >
> > int (*ndo_netpolicy_init)(struct net_device *dev,
> >   struct netpolicy_info *info);
> > Initialize device driver for NET policy
> >
> > int (*ndo_get_irq_info)(struct net_device *dev,
> > struct netpolicy_dev_info *info);
> > Collect device irq information
> >
> > int (*ndo_set_net_policy)(struct net_device *dev,
> >   enum netpolicy_name name);
> > Configure device according to policy name
> >
> > netpolicy_register(struct netpolicy_reg *reg);
> > netpolicy_unregister(struct netpolicy_reg *reg);
> > NET policy API to register/unregister per task/socket net policy.
> > For each task/socket, an record will be created and inserted into an RCU
> > hash table.
> >
> > netpolicy_pick_queue(struct netpolicy_reg *reg, bool is_rx);
> > NET policy API to find the proper queue for packet receiving and
> > transmitting.
> >
> > netpolicy_set_rules(struct netpolicy_reg *reg, u32 queue_index,
> >  struct netpolicy_flow_spec *flow);
> > NET policy API to add flow director rules.
> >
> > For using NET policy, the per-device policy must be set in advance. It
> > will automatically configure the system and re-organize the resource
> > of the system accordingly. For system configuration, in this series,
> > it will disable irq balance, set device queue irq affinity, and modify
> > interrupt moderation. For re-organizing the resource, current
> > implementation forces that CPU and queue irq are 1:1 mapping. An 1:1
> mapping group is also called net policy object.
> > For each device policy, it maintains a policy list. Once the device
> > policy is applied, the objects will be insert and tracked in that
> > device policy list. The policy list only be updated when cpu/device
> > hotplug, queue number changes or device policy changes.
> > The user can use /proc, prctl and setsockopt to set per-task and
> > per-socket net policy. Once the policy is set, an related record will
> > be inserted into RCU hash table. The record includes ptr, policy and
> > net policy object. The ptr is the pointer address of task/socket. The
> > object will not be assigned until the first package receive/transmit.
> > The object is picked by round-robin from object list. Once the object
> > is determined, the following packets will be set to redirect to the
> queue(object).
> > The object can be shared. The per-task or per-socket policy can be
> inherited.
> >
> > Now NET policy supports four per device policies and three per
> > task/socket policies.
> >  - BULK policy: This policy is designed for high throughput. It can be
> >applied to either per device policy or per task/socket policy.
> >  - CPU policy: This policy is designed for high throughput but lower CPU
> >utilization. It can be applied to either per device policy or
> >per task/socket policy.
> >  - LATENCY policy: This policy is designed for low latency. It can be
> >applied to either 

RE: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Liang, Kan


> On Mon, Jul 18, 2016 at 5:51 PM, Liang, Kan  wrote:
> >
> >
> >> >
> >> > It is a big challenge to get good network performance. First, the
> >> > network performance is not good with default system settings.
> >> > Second, it is too difficult to do automatic tuning for all possible
> >> > workloads, since workloads have different requirements. Some
> >> > workloads may want
> >> high throughput.
> >>
> >> Seems you did lots of tests to find optimal settings for a given base 
> >> policy.
> >>
> > Yes. Current test only base on Intel i40e driver. The optimal settings
> > should vary for other devices. But adding settings for new device is not
> hard.
> >
> The optimal settings are very dependent on system architecture (NUMA
> config, #cpus, memory, etc.) and sometimes kernel version as well. A
> database that provides best configurations across different devices,
> architectures, and kernel version might be interesting; but beware that that 
> is
> a whole bunch of work to maintain, Either way policy like this really should
> be handled in userspace.

The expression "optimal" I used here is not accurate. Sorry for that.
NET policy tries to get good (near-optimal) performance with a very
simple configuration.
I agree that there are lots of dependencies for the optimal settings,
but most of the settings should be very similar. The near-optimal performance
from applying those common settings is good enough for most users.
We don't need to maintain a database of configurations across
devices/architectures/kernel versions.

Thanks,
Kan


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Cong Wang
On Mon, Jul 18, 2016 at 8:45 AM, Andi Kleen  wrote:
>> It seems strange to me to add such policies to the kernel.
>> Addmittingly, documentation of some settings is non-existent and one needs
>> various different tools to set this (sysctl, procfs, sysfs, ethtool, etc).
>
> The problem is that different applications need different policies.
>
> The only entity which can efficiently negotiate between different
> applications' conflicting requests is the kernel. And that is pretty
> much the basic job description of a kernel: multiplex hardware
> efficiently between different users.
>
> So yes the user space tuning approach works for simple cases
> ("only run workloads that require the same tuning"), but is ultimately not
> very interesting nor scalable.

I haven't read the code yet, just the cover letter.

We have global tunings, per-network-namespace tunings, and per-socket
tunings. It is still unclear why you can't just put different applications
into different namespaces/containers to get different policies.


RE: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Liang, Kan


> 
> > > It seems strange to me to add such policies to the kernel.
> >
> > But kernel is the only place which can merge all user's requests.
> 
> I don't think so.
> 
> If different requests conflict in a way where it is possible to do something
> meaningful, then I don't see why a userspace tool cannot do the same thing...
> 

Yes, I should correct my wording.
I think the kernel is a better place to do those things.
The kernel should be more efficient at coordinating those requests to get
good performance.


Thanks,
Kan


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Alexander Duyck
On Sun, Jul 17, 2016 at 11:55 PM,   wrote:
> From: Kan Liang 
>
> It is a big challenge to get good network performance. First, the network
> performance is not good with default system settings. Second, it is too
> difficult to do automatic tuning for all possible workloads, since workloads
> have different requirements. Some workloads may want high throughput. Some may
> need low latency. Last but not least, there are lots of manual configurations.
> Fine grained configuration is too difficult for users.

The problem as I see it is that this is just going to end up likely
being an even more intrusive version of irqbalance.  I really don't
like the way that turned out as it did a number of really dumb things
that usually result in it being disabled as soon as you actually want
to do anything that will actually involve any kind of performance
tuning.  If this stuff is pushed into the kernel it will be even
harder to get rid of and that is definitely a bad thing.

> NET policy intends to simplify the network configuration and get a good 
> network
> performance according to the hints(policy) which is applied by user. It
> provides some typical "policies" for user which can be set per-socket, 
> per-task
> or per-device. The kernel will automatically figures out how to merge 
> different
> requests to get good network performance.

So where is your policy for power saving?  From past experience I can
tell you that while performance tuning is a good thing, doing so at
the expense of power management is bad.  In addition you seem to be
making a lot of assumptions here that the end users are going to
rewrite their applications to use the new socket options you added in
order to try and tune the performance.  I have a hard time believing
most developers are going to go to all that trouble.  In addition I
suspect that even if they do go to that trouble they will probably
still screw it up and you will end up with applications advertising
latency as a goal when they should have specified CPU and so on.

> Net policy is designed for multiqueue network devices. This implementation is
> only for Intel NICs using i40e driver. But the concepts and generic code 
> should
> apply to other multiqueue NICs too.

I would argue that your code is not very generic.  The fact that it is
relying on flow director already greatly limits what you can do.  If
you want to make this truly generic I would say you need to find ways
to make this work on everything all the way down to things like i40evf
and igb which don't have support for Flow Director.

> Net policy is also a combination of generic policy manager code and some
> ethtool callbacks (per queue coalesce setting, flow classification rules) to
> configure the driver.
> This series also supports CPU hotplug and device hotplug.
>
> Here are some key Interfaces/APIs for NET policy.
>
>/proc/net/netpolicy/$DEV/policy
>User can set/get per device policy from /proc
>
>/proc/$PID/net_policy
>User can set/get per task policy from /proc
>prctl(PR_SET_NETPOLICY, POLICY_NAME, NULL, NULL, NULL)
>An alternative way to set/get per task policy is from prctl.
>
>setsockopt(sockfd,SOL_SOCKET,SO_NETPOLICY,,sizeof(int))
>User can set/get per socket policy by setsockopt
>
>
>int (*ndo_netpolicy_init)(struct net_device *dev,
>  struct netpolicy_info *info);
>Initialize device driver for NET policy
>
>int (*ndo_get_irq_info)(struct net_device *dev,
>struct netpolicy_dev_info *info);
>Collect device irq information

Instead of making the irq info a part of the ndo ops it might make
more sense to make it part of an ethtool op.  Maybe you could make it
so that you could specify a single queue at a time and get things like
statistics, IRQ, and ring information.

>int (*ndo_set_net_policy)(struct net_device *dev,
>  enum netpolicy_name name);
>Configure device according to policy name

I really don't like this piece of it.  I really think we shouldn't be
leaving so much up to the driver to determine how to handle things.
In addition just passing one of 4 different types doesn't do much for
actual configuration because the actual configuration of the device is
much more complex then that.  Essentially all this does is provide a
benchmark tuning interface.

>netpolicy_register(struct netpolicy_reg *reg);
>netpolicy_unregister(struct netpolicy_reg *reg);
>NET policy API to register/unregister per task/socket net policy.
>For each task/socket, an record will be created and inserted into an RCU
>hash table.

This piece will take a significant amount of time before it could ever
catch on.  Once again this just looks like a benchmark tuning
interface.  It isn't of much value.

>netpolicy_pick_queue(struct netpolicy_reg *reg, bool is_rx);
>NET policy API to find the proper queue for packet receiving and

Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Tom Herbert
On Mon, Jul 18, 2016 at 5:51 PM, Liang, Kan  wrote:
>
>
>> >
>> > It is a big challenge to get good network performance. First, the
>> > network performance is not good with default system settings. Second,
>> > it is too difficult to do automatic tuning for all possible workloads,
>> > since workloads have different requirements. Some workloads may want
>> high throughput.
>>
>> Seems you did lots of tests to find optimal settings for a given base policy.
>>
> Yes. Current test only base on Intel i40e driver. The optimal settings should
> vary for other devices. But adding settings for new device is not hard.
>
The optimal settings are very dependent on system architecture (NUMA
config, #cpus, memory, etc.) and sometimes kernel version as well. A
database that provides the best configurations across different devices,
architectures, and kernel versions might be interesting; but beware
that that is a whole bunch of work to maintain. Either way, policy like
this really should be handled in userspace.

Tom


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Daniel Borkmann

Hi Kan,

On 07/18/2016 08:55 AM, kan.li...@intel.com wrote:

From: Kan Liang 

It is a big challenge to get good network performance. First, the network
performance is not good with default system settings. Second, it is too
difficult to do automatic tuning for all possible workloads, since workloads
have different requirements. Some workloads may want high throughput. Some may
need low latency. Last but not least, there are lots of manual configurations.
Fine grained configuration is too difficult for users.

NET policy intends to simplify the network configuration and get a good network
performance according to the hints(policy) which is applied by user. It
provides some typical "policies" for user which can be set per-socket, per-task
or per-device. The kernel will automatically figures out how to merge different
requests to get good network performance.
Net policy is designed for multiqueue network devices. This implementation is
only for Intel NICs using i40e driver. But the concepts and generic code should
apply to other multiqueue NICs too.
Net policy is also a combination of generic policy manager code and some
ethtool callbacks (per queue coalesce setting, flow classification rules) to
configure the driver.
This series also supports CPU hotplug and device hotplug.

Here are some key Interfaces/APIs for NET policy.

/proc/net/netpolicy/$DEV/policy
User can set/get per device policy from /proc

/proc/$PID/net_policy
User can set/get per task policy from /proc
prctl(PR_SET_NETPOLICY, POLICY_NAME, NULL, NULL, NULL)
An alternative way to set/get per task policy is from prctl.

setsockopt(sockfd,SOL_SOCKET,SO_NETPOLICY,,sizeof(int))
User can set/get per socket policy by setsockopt


int (*ndo_netpolicy_init)(struct net_device *dev,
  struct netpolicy_info *info);
Initialize device driver for NET policy

int (*ndo_get_irq_info)(struct net_device *dev,
struct netpolicy_dev_info *info);
Collect device irq information

int (*ndo_set_net_policy)(struct net_device *dev,
  enum netpolicy_name name);
Configure device according to policy name

netpolicy_register(struct netpolicy_reg *reg);
netpolicy_unregister(struct netpolicy_reg *reg);
NET policy API to register/unregister per task/socket net policy.
For each task/socket, an record will be created and inserted into an RCU
hash table.

netpolicy_pick_queue(struct netpolicy_reg *reg, bool is_rx);
NET policy API to find the proper queue for packet receiving and
transmitting.

netpolicy_set_rules(struct netpolicy_reg *reg, u32 queue_index,
 struct netpolicy_flow_spec *flow);
NET policy API to add flow director rules.

For using NET policy, the per-device policy must be set in advance. It will
automatically configure the system and re-organize the resource of the system
accordingly. For system configuration, in this series, it will disable irq
balance, set device queue irq affinity, and modify interrupt moderation. For
re-organizing the resource, current implementation forces that CPU and queue
irq are 1:1 mapping. An 1:1 mapping group is also called net policy object.
For each device policy, it maintains a policy list. Once the device policy is
applied, the objects will be insert and tracked in that device policy list. The
policy list only be updated when cpu/device hotplug, queue number changes or
device policy changes.
The user can use /proc, prctl and setsockopt to set per-task and per-socket
net policy. Once the policy is set, an related record will be inserted into RCU
hash table. The record includes ptr, policy and net policy object. The ptr is
the pointer address of task/socket. The object will not be assigned until the
first package receive/transmit. The object is picked by round-robin from object
list. Once the object is determined, the following packets will be set to
redirect to the queue(object).
The object can be shared. The per-task or per-socket policy can be inherited.

Now NET policy supports four per device policies and three per task/socket
policies.
 - BULK policy: This policy is designed for high throughput. It can be
   applied to either per device policy or per task/socket policy.
 - CPU policy: This policy is designed for high throughput but lower CPU
   utilization. It can be applied to either per device policy or
   per task/socket policy.
 - LATENCY policy: This policy is designed for low latency. It can be
   applied to either per device policy or per task/socket policy.
 - MIX policy: This policy can only be applied to per device policy. This
   is designed for the case which miscellaneous types of workload running
   on the device.


I'm missing a bit of discussion of the existing facilities under
networking and why they cannot be adapted to support these kind of 

Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Florian Westphal
Liang, Kan  wrote:
> > What is missing in the kernel UAPI so userspace could do these settings on 
> > its
> > own, without adding this policy stuff to the kernel?
> 
> The main purpose of the proposal is to simplify the configuration. Too many
> options will let them confuse. 
> For normal users, they just need to tell the kernel that they want high 
> throughput
> for the application. The kernel will take care of the rest.
> So, I don't think we need an interface for user to set their own policy 
> settings.

I don't (yet) agree that the kernel is the right place for this.
I agree that current (bare) kernel config interface(s) for this are
hard to use.

> > It seems strange to me to add such policies to the kernel.
> 
> But kernel is the only place which can merge all user's requests.

I don't think so.

If different requests conflict in a way where it is possible to do something
meaningful, then I don't see why a userspace tool cannot do the same
thing...

> > Addmittingly, documentation of some settings is non-existent and one needs
> > various different tools to set this (sysctl, procfs, sysfs, ethtool, etc).
> > 
> > But all of these details could be hidden from user.
> > Have you looked at tuna for instance?
> 
> Not yet. Is there similar settings for network?

Last time I checked, tuna could only set a few network-related sysctls
and handle irq settings/affinity, but not e.g. tune irq coalescing
or any other network-interface-specific settings.


RE: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Liang, Kan


> >
> > It is a big challenge to get good network performance. First, the
> > network performance is not good with default system settings. Second,
> > it is too difficult to do automatic tuning for all possible workloads,
> > since workloads have different requirements. Some workloads may want
> high throughput.
> 
> Seems you did lots of tests to find optimal settings for a given base policy.
> 
Yes. The current tests are only based on the Intel i40e driver. The optimal
settings may vary for other devices, but adding settings for a new device is
not hard.

> What is missing in the kernel UAPI so userspace could do these settings on its
> own, without adding this policy stuff to the kernel?

The main purpose of the proposal is to simplify the configuration. Too many
options will just confuse users.
Normal users only need to tell the kernel that they want high throughput
for the application; the kernel will take care of the rest.
So I don't think we need an interface for users to set their own policy
settings.

> 
> It seems strange to me to add such policies to the kernel.

But the kernel is the only place which can merge all users' requests.

> Addmittingly, documentation of some settings is non-existent and one needs
> various different tools to set this (sysctl, procfs, sysfs, ethtool, etc).
> 
> But all of these details could be hidden from user.
> Have you looked at tuna for instance?

Not yet. Are there similar settings for networking?

Thanks,
Kan


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Andi Kleen
> It seems strange to me to add such policies to the kernel.
> Addmittingly, documentation of some settings is non-existent and one needs
> various different tools to set this (sysctl, procfs, sysfs, ethtool, etc).

The problem is that different applications need different policies.

The only entity which can efficiently negotiate between different
applications' conflicting requests is the kernel. And that is pretty 
much the basic job description of a kernel: multiplex hardware
efficiently between different users.

So yes the user space tuning approach works for simple cases
("only run workloads that require the same tuning"), but is ultimately not
very interesting nor scalable.

-Andi


Re: [RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread Florian Westphal
kan.li...@intel.com  wrote:
> From: Kan Liang 
> 
> It is a big challenge to get good network performance. First, the network
> performance is not good with default system settings. Second, it is too
> difficult to do automatic tuning for all possible workloads, since workloads
> have different requirements. Some workloads may want high throughput.

Seems you did lots of tests to find optimal settings for a given base
policy.

What is missing in the kernel UAPI so userspace could do these settings
on its own, without adding this policy stuff to the kernel?

It seems strange to me to add such policies to the kernel.
Admittedly, documentation of some settings is non-existent and one needs
various different tools to set this (sysctl, procfs, sysfs, ethtool, etc).

But all of these details could be hidden from the user.
Have you looked at tuna, for instance?


[RFC PATCH 00/30] Kernel NET policy

2016-07-18 Thread kan . liang
From: Kan Liang 

It is a big challenge to get good network performance. First, network
performance is not good with default system settings. Second, it is too
difficult to do automatic tuning for all possible workloads, since workloads
have different requirements. Some workloads may want high throughput. Some may
need low latency. Last but not least, there are lots of manual configurations.
Fine-grained configuration is too difficult for users.

NET policy intends to simplify network configuration and get good network
performance according to the hints (policies) applied by the user. It
provides some typical "policies" which can be set per-socket, per-task
or per-device. The kernel will automatically figure out how to merge different
requests to get good network performance.
NET policy is designed for multiqueue network devices. This implementation is
only for Intel NICs using the i40e driver, but the concepts and generic code
should apply to other multiqueue NICs too.
NET policy is also a combination of generic policy manager code and some
ethtool callbacks (per-queue coalescing settings, flow classification rules) to
configure the driver.
This series also supports CPU hotplug and device hotplug.

Here are some key Interfaces/APIs for NET policy.

   /proc/net/netpolicy/$DEV/policy
   User can set/get per device policy from /proc

   /proc/$PID/net_policy
   User can set/get per task policy from /proc
   prctl(PR_SET_NETPOLICY, POLICY_NAME, NULL, NULL, NULL)
   An alternative way to set/get per task policy is from prctl.

   setsockopt(sockfd,SOL_SOCKET,SO_NETPOLICY,,sizeof(int))
   User can set/get per socket policy by setsockopt


   int (*ndo_netpolicy_init)(struct net_device *dev,
 struct netpolicy_info *info);
   Initialize device driver for NET policy

   int (*ndo_get_irq_info)(struct net_device *dev,
   struct netpolicy_dev_info *info);
   Collect device irq information

   int (*ndo_set_net_policy)(struct net_device *dev,
 enum netpolicy_name name);
   Configure device according to policy name

   netpolicy_register(struct netpolicy_reg *reg);
   netpolicy_unregister(struct netpolicy_reg *reg);
   NET policy API to register/unregister per task/socket net policy.
   For each task/socket, an record will be created and inserted into an RCU
   hash table.

   netpolicy_pick_queue(struct netpolicy_reg *reg, bool is_rx);
   NET policy API to find the proper queue for packet receiving and
   transmitting.

   netpolicy_set_rules(struct netpolicy_reg *reg, u32 queue_index,
struct netpolicy_flow_spec *flow);
   NET policy API to add flow director rules.
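
[As a usage illustration of the interfaces listed above, a hedged sketch of
the per-socket path. It assumes a kernel and headers built with this series:
SO_NETPOLICY and the numeric policy values (enum netpolicy_name) come from the
patches, not from mainline, so nothing below compiles against a stock kernel.]

   #include <sys/socket.h>

   /* Apply one of the proposed policies (an int from the series' enum
    * netpolicy_name, e.g. the BULK/CPU/LATENCY values) to a single socket. */
   static int set_socket_policy(int sockfd, int policy)
   {
           return setsockopt(sockfd, SOL_SOCKET, SO_NETPOLICY,
                             &policy, sizeof(policy));
   }

   /* Read the current per-socket policy back; per the listing above, the
    * same option is used for set/get. */
   static int get_socket_policy(int sockfd)
   {
           int policy = -1;
           socklen_t len = sizeof(policy);

           if (getsockopt(sockfd, SOL_SOCKET, SO_NETPOLICY, &policy, &len) < 0)
                   return -1;
           return policy;
   }

[The per-task equivalents would go through /proc/$PID/net_policy or
prctl(PR_SET_NETPOLICY, ...) as listed above, with the device-level policy set
beforehand via /proc/net/netpolicy/$DEV/policy.]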

To use NET policy, the per-device policy must be set in advance. It will
automatically configure the system and re-organize the system's resources
accordingly. For system configuration, in this series, it will disable irq
balance, set device queue irq affinity, and modify interrupt moderation. For
re-organizing the resources, the current implementation forces CPU and queue
irq into a 1:1 mapping. A 1:1 mapping group is also called a net policy object.
For each device policy, it maintains a policy list. Once the device policy is
applied, the objects will be inserted and tracked in that device policy list.
The policy list is only updated on cpu/device hotplug, queue number changes or
device policy changes.
The user can use /proc, prctl and setsockopt to set the per-task and per-socket
net policy. Once the policy is set, a related record will be inserted into an
RCU hash table. The record includes the ptr, the policy and the net policy
object. The ptr is the pointer address of the task/socket. The object will not
be assigned until the first packet is received/transmitted. The object is
picked round-robin from the object list. Once the object is determined,
subsequent packets will be redirected to the queue (object).
The object can be shared. The per-task or per-socket policy can be inherited.

Now NET policy supports four per-device policies and three per-task/socket
policies.
- BULK policy: This policy is designed for high throughput. It can be
  applied as either a per-device policy or a per-task/socket policy.
- CPU policy: This policy is designed for high throughput but lower CPU
  utilization. It can be applied as either a per-device policy or a
  per-task/socket policy.
- LATENCY policy: This policy is designed for low latency. It can be
  applied as either a per-device policy or a per-task/socket policy.
- MIX policy: This policy can only be applied as a per-device policy. It
  is designed for the case where miscellaneous types of workloads are
  running on the device.

Lots of tests were done for NET policy on platforms with Intel Xeon E5 v2
CPUs and an XL710 40G NIC. The baseline is the Linux 4.6.0 kernel.
Netperf is used to evaluate throughput and latency performance.
  - "netperf -f m -t TCP_RR -H