Re: [Web10g-user] Web10g TCP statistics patch - mainlining into kernel?
> The user side of the list has been pretty quiet to be sure. Most of the
> discussion happens internal to the team. The primary KIS developer is out
> of the office until next week but I'll try to make sure he addresses this
> as soon as he can. I agree entirely with the more eyes approach and none
> of us have a lot of ego sunk into the details of the code in and of
> itself. As researchers, more than anything else, we want the concept moved
> forward. We believe, strongly, that having these metrics in the mainline
> can really do a lot to help the entire user community and spur some really
> neat development.
>
> Chris

Hi Chris, Valdis,

This is certainly interesting, and something that we'd really want to try to
get into upstream Linux, since it is so intrusive in the stack (we have
actually done quite a bit of work in this area already that might be
applicable). Posting patches against davem's net-next branch seems
appropriate, and you could certainly post them as RFC to start some
conversation.

Tom
Re: [RFC PATCH 0/5] net: low latency Ethernet device polling
On Wed, Feb 27, 2013 at 10:13 AM, Stephen Hemminger wrote:
>
> Have you looked at netmap? Seems like a cleaner user API for this.
>
That might be a bit orthogonal to this. I believe the intent is to allow
spin polling from socket calls without an API change.

> There is a version for Linux, but it needs work. My plan is to send it to
> staging (after merge window reopens).
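For context on the socket-side knob this line of work evolved into: the
interface that eventually merged is the SO_BUSY_POLL socket option (plus the
net.core.busy_read/busy_poll sysctls). A minimal userspace sketch, assuming
a kernel and libc that expose it; the socket type and the 50us budget are
illustrative, and historically setting the option required CAP_NET_ADMIN:

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46	/* from asm-generic/socket.h */
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	int usecs = 50;	/* spin up to ~50us in the driver before sleeping */

	if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
				 &usecs, sizeof(usecs)) < 0)
		perror("SO_BUSY_POLL");
	/* blocking receives on fd may now spin poll the device queue
	 * instead of immediately sleeping on the socket wait queue */
	return 0;
}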
Re: [RFC PATCH 0/5] net: low latency Ethernet device polling
> This is exactly the kind of approach that makes sense rather than
> trying to put entire TCP stacks in the network card firmware.
>
And should also obviate the need to put a full TCP stack in user space!

> Thanks again for doing this work and I look forward to applying
> this stuff once all the kinks are worked out. The folks in the
> Intel NIC group continue to impress me.
Re: [RFC V2 PATCH 17/25] net/netpolicy: introduce netpolicy_pick_queue
On Thu, Aug 4, 2016 at 12:36 PM, wrote: > From: Kan Liang > > To achieve better network performance, the key step is to distribute the > packets to dedicated queues according to policy and system run time > status. > > This patch provides an interface which can return the proper dedicated > queue for socket/task. Then the packets of the socket/task will be > redirect to the dedicated queue for better network performance. > > For selecting the proper queue, currently it uses round-robin algorithm > to find the available object from the given policy object list. The > algorithm is good enough for now. But it could be improved by some > adaptive algorithm later. > Seriously? You want to all of this code so we revert to TX queue selection by round robin? > The selected object will be stored in hashtable. So it does not need to > go through the whole object list every time. > > Signed-off-by: Kan Liang > --- > include/linux/netpolicy.h | 5 ++ > net/core/netpolicy.c | 136 > ++ > 2 files changed, 141 insertions(+) > > diff --git a/include/linux/netpolicy.h b/include/linux/netpolicy.h > index 5900252..a522015 100644 > --- a/include/linux/netpolicy.h > +++ b/include/linux/netpolicy.h > @@ -97,6 +97,7 @@ extern void update_netpolicy_sys_map(void); > extern int netpolicy_register(struct netpolicy_instance *instance, > enum netpolicy_name policy); > extern void netpolicy_unregister(struct netpolicy_instance *instance); > +extern int netpolicy_pick_queue(struct netpolicy_instance *instance, bool > is_rx); > #else > static inline void update_netpolicy_sys_map(void) > { > @@ -111,6 +112,10 @@ static inline void netpolicy_unregister(struct > netpolicy_instance *instance) > { > } > > +static inline int netpolicy_pick_queue(struct netpolicy_instance *instance, > bool is_rx) > +{ > + return 0; > +} > #endif > > #endif /*__LINUX_NETPOLICY_H*/ > diff --git a/net/core/netpolicy.c b/net/core/netpolicy.c > index 3605761..98ca430 100644 > --- a/net/core/netpolicy.c > +++ b/net/core/netpolicy.c > @@ -290,6 +290,142 @@ static void netpolicy_record_clear_dev_node(struct > net_device *dev) > spin_unlock_bh(&np_hashtable_lock); > } > > +static struct netpolicy_object *get_avail_object(struct net_device *dev, > +enum netpolicy_name policy, > +bool is_rx) > +{ > + int dir = is_rx ? NETPOLICY_RX : NETPOLICY_TX; > + struct netpolicy_object *tmp, *obj = NULL; > + int val = -1; > + > + /* Check if net policy is supported */ > + if (!dev || !dev->netpolicy) > + return NULL; > + > + /* The system should have queues which support the request policy. */ > + if ((policy != dev->netpolicy->cur_policy) && > + (dev->netpolicy->cur_policy != NET_POLICY_MIX)) > + return NULL; > + > + spin_lock_bh(&dev->np_ob_list_lock); > + list_for_each_entry(tmp, &dev->netpolicy->obj_list[dir][policy], > list) { > + if ((val > atomic_read(&tmp->refcnt)) || > + (val == -1)) { > + val = atomic_read(&tmp->refcnt); > + obj = tmp; > + } > + } > + > + if (WARN_ON(!obj)) { > + spin_unlock_bh(&dev->np_ob_list_lock); > + return NULL; > + } > + atomic_inc(&obj->refcnt); > + spin_unlock_bh(&dev->np_ob_list_lock); > + > + return obj; > +} > + > +static int get_avail_queue(struct netpolicy_instance *instance, bool is_rx) > +{ > + struct netpolicy_record *old_record, *new_record; > + struct net_device *dev = instance->dev; > + unsigned long ptr_id = (uintptr_t)instance->ptr; > + int queue = -1; > + > + spin_lock_bh(&np_hashtable_lock); > + old_record = netpolicy_record_search(ptr_id); > + if (!old_record) { > + pr_warn("NETPOLICY: doesn't registered. 
Remove net policy > settings!\n"); > + instance->policy = NET_POLICY_INVALID; > + goto err; > + } > + > + if (is_rx && old_record->rx_obj) { > + queue = old_record->rx_obj->queue; > + } else if (!is_rx && old_record->tx_obj) { > + queue = old_record->tx_obj->queue; > + } else { > + new_record = kzalloc(sizeof(*new_record), GFP_KERNEL); > + if (!new_record) > + goto err; > + memcpy(new_record, old_record, sizeof(*new_record)); > + > + if (is_rx) { > + new_record->rx_obj = get_avail_object(dev, > new_record->policy, is_rx); > + if (!new_record->dev) > + new_record->dev = dev; > + if (!new_record->rx_obj) { > + kfree(new_record); > +
Re: [RFC V2 PATCH 17/25] net/netpolicy: introduce netpolicy_pick_queue
On Thu, Aug 4, 2016 at 5:17 PM, Daniel Borkmann wrote:
> On 08/05/2016 12:54 AM, Andi Kleen wrote:
>>>
>>> +1, I tried to bring this up here [1] in the last spin. I think only very
>>> few changes would be needed, f.e. on eBPF side to add a queue setting
>>> helper function which is probably straight forward ~10loc patch; and with
>>> regards to actually picking it up after clsact egress, we'd need to adapt
>>> __netdev_pick_tx() slightly when CONFIG_XPS so it doesn't override it.
>>
>>
>> You're proposing to rewrite the whole net policy manager as EBPF and run
>> it in a crappy JITer? Is that a serious proposal? It just sounds crazy
>> to me.
>>
>> Especially since we already have a perfectly good compiler and
>> programming language to write system code in.
>>
>> EBPF is ok for temporal instrumentation (if you somehow can accept
>> its security challenges), but using it to replace core
>> kernel functionality (which network policy IMHO is) with some bizarre
>> JITed setup and multiple languages doesn't really make any sense.
>>
>> Especially it doesn't make sense for anything with shared state,
>> which is the core part of network policy: it negotiates with multiple
>> users.
>>
>> After all we're writing Linux here and not some research toy.
>
>
> From what I read I guess you didn't really bother to look any deeper into
> this bizarre "research toy" to double check some of your claims. One of the
> things it's often deployed for by the way is defining policy. And the
> suggestion here was merely to explore existing infrastructure around things
> like tc and whether it already resolves at least a part of your net policy
> manager's requirements (like queue selection) or whether existing
> infrastructure can be extended with fewer complexity this way (as was
> mentioned with a new cls module as one option).

+1. The SO_REUSEPORT + BPF patches have already demonstrated the value of
making policy in the kernel programmable. There's no reason to believe that
model can't be extended to cover packet steering in the data path for TX or
RX as well as other cases.

Tom
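For a concrete sense of the model Tom refers to, here is a minimal sketch of
SO_REUSEPORT + classic BPF socket selection. The port and the pick-by-CPU
policy are illustrative assumptions; a real setup would create one group
socket per CPU so the returned index is always in range:

#include <linux/filter.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_REUSEPORT_CBPF
#define SO_ATTACH_REUSEPORT_CBPF 51	/* from asm-generic/socket.h */
#endif

int main(void)
{
	/* cBPF program: return the receiving CPU number as the index
	 * into the SO_REUSEPORT group, pinning each CPU's traffic to
	 * "its" socket. */
	struct sock_filter code[] = {
		{ BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
		{ BPF_RET | BPF_A, 0, 0, 0 },
	};
	struct sock_fprog prog = {
		.len = sizeof(code) / sizeof(code[0]),
		.filter = code,
	};
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(8000),	/* illustrative port */
	};
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
		       &prog, sizeof(prog)) < 0)
		perror("SO_ATTACH_REUSEPORT_CBPF");
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	return 0;
}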
Re: [RFC V2 PATCH 17/25] net/netpolicy: introduce netpolicy_pick_queue
On Fri, Aug 5, 2016 at 6:55 AM, Liang, Kan wrote:
>
>>
>> On Thu, Aug 4, 2016 at 12:36 PM, wrote:
>> > From: Kan Liang
>> >
>> > To achieve better network performance, the key step is to distribute
>> > the packets to dedicated queues according to policy and system run
>> > time status.
>> >
>> > This patch provides an interface which can return the proper dedicated
>> > queue for socket/task. Then the packets of the socket/task will be
>> > redirected to the dedicated queue for better network performance.
>> >
>> > For selecting the proper queue, currently it uses round-robin
>> > algorithm to find the available object from the given policy object
>> > list. The algorithm is good enough for now. But it could be improved
>> > by some adaptive algorithm later.
>> >
>> Seriously? You want to add all of this code so we revert to TX queue
>> selection by round robin?
>>
>
> I agree that the round robin is not an optimal algorithm.
> For this series, we intend to provide a generic infrastructure for review.
> For the algorithm parts, it's already in our TODO list. We will replace it
> later.
>
Kan,

The justification for this patch is that, to achieve better network
performance, TX packets need to be distributed to queues according to policy
and system runtime status. But the patch doesn't remotely implement that:
there's no data provided that these changes do anything useful or ever will,
and it seems like this is completely ignoring existing mechanisms like XPS
that have proven to improve performance.

Tom
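For reference, XPS (the existing mechanism Tom points to) already lets an
administrator bind each TX queue to a CPU set through sysfs, with no new
kernel code. A small sketch, equivalent to
`echo f > /sys/class/net/eth0/queues/tx-0/xps_cpus`; the device name, queue
number, and CPU mask are assumptions:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Map TX queue 0 of eth0 to CPUs 0-3 (mask 0xf). */
	int fd = open("/sys/class/net/eth0/queues/tx-0/xps_cpus", O_WRONLY);

	if (fd < 0) {
		perror("open xps_cpus");
		return 1;
	}
	if (write(fd, "f\n", 2) != 2)
		perror("write xps_cpus");
	close(fd);
	return 0;
}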
Re: [PATCH] rxrpc: recvmsg: use BUG_ON instead of if condition followed by BUG
Please combine these related patches fixing BUG in rxrpc into a patch set
with proper annotation. Also, can any of these BUG_ONs be replaced by
WARN_ONs? Warnings are generally preferable to crashing the system.

Tom

On Tue, Oct 24, 2017 at 9:20 AM, Gustavo A. R. Silva wrote:
> Use BUG_ON instead of if condition followed by BUG.
>
> This issue was detected with the help of Coccinelle.
>
> Signed-off-by: Gustavo A. R. Silva
> ---
> net/rxrpc/recvmsg.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/net/rxrpc/recvmsg.c b/net/rxrpc/recvmsg.c
> index bdece21..9598b92 100644
> --- a/net/rxrpc/recvmsg.c
> +++ b/net/rxrpc/recvmsg.c
> @@ -243,8 +243,7 @@ static int rxrpc_verify_packet(struct rxrpc_call *call, struct sk_buff *skb,
>  	 */
>  	if ((annotation & RXRPC_RX_ANNO_JUMBO) > 1) {
>  		__be16 tmp;
> -		if (skb_copy_bits(skb, offset - 2, &tmp, 2) < 0)
> -			BUG();
> +		BUG_ON(skb_copy_bits(skb, offset - 2, &tmp, 2) < 0);
>  		cksum = ntohs(tmp);
>  		seq += (annotation & RXRPC_RX_ANNO_JUMBO) - 1;
>  	}
> @@ -503,8 +502,7 @@ int rxrpc_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
>
>  	release_sock(&rx->sk);
>
> -	if (test_bit(RXRPC_CALL_RELEASED, &call->flags))
> -		BUG();
> +	BUG_ON(test_bit(RXRPC_CALL_RELEASED, &call->flags));
>
>  	if (test_bit(RXRPC_CALL_HAS_USERID, &call->flags)) {
>  		if (flags & MSG_CMSG_COMPAT) {
> --
> 2.7.4
>
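A side note on the WARN_ON suggestion: WARN_ON() returns the value of its
condition, so it composes directly with error handling. A sketch of what the
first hunk could become if the call path can fail gracefully (the helper
name and -EPROTO are assumptions for illustration, not the actual rxrpc
code):

#include <linux/skbuff.h>

/* Sketch only: warn with a backtrace and fail the operation rather
 * than BUG() and take the whole machine down. */
static int rxrpc_read_jumbo_cksum(struct sk_buff *skb, unsigned int offset,
				  u16 *cksum)
{
	__be16 tmp;

	if (WARN_ON(skb_copy_bits(skb, offset - 2, &tmp, 2) < 0))
		return -EPROTO;
	*cksum = ntohs(tmp);
	return 0;
}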
Re: WARNING in strp_data_ready
On Mon, Oct 30, 2017 at 2:44 PM, John Fastabend wrote:
> On 10/24/2017 08:20 AM, syzbot wrote:
>> Hello,
>>
>> syzkaller hit the following crash on 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached
>> Raw console output is attached.
>> C reproducer is attached
>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>> for information about syzkaller reproducers
>>
>>
>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me
>> include/net/sock.h:1505 [inline]
>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user
>> include/net/sock.h:1511 [inline]
>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505
>> strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>> Kernel panic - not syncing: panic_on_warn set ...
>>
>> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>>
>> __dump_stack lib/dump_stack.c:16 [inline]
>> dump_stack+0x194/0x257 lib/dump_stack.c:52
>> panic+0x1e4/0x417 kernel/panic.c:181
>> __warn+0x1c4/0x1d9 kernel/panic.c:542
>> report_bug+0x211/0x2d0 lib/bug.c:183
>> fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>> do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>> do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>> do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>> do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>> invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>> RSP: 0018:8801db206b18 EFLAGS: 00010206
>> RAX: 8801d1e02080 RBX: 8801dad74c48 RCX:
>> RDX: 0100 RSI: 8801d29fa0a0 RDI: 85cbede0
>> RBP: 8801db206b38 R08: 0005 R09: 10ce0bcd
>> R10: 8801db206a00 R11: dc00 R12: 8801d29fa000
>> R13: 8801dad74c50 R14: 8801d4350a92 R15: 0001
>> psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>
> Looks like KCM is calling sk_data_ready() without first taking the
> sock lock.
>
> /* Called with lower sock held */
> static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb)
> {
> [...]
> 	if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
>
> In this case kcm->sk is not the same lock the comment is referring to.
> And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>
> @Tom, how about wrapping the sk_data_ready call in {lock|release}_sock?
> I don't have anything better in mind immediately.
>
The sock locks are taken in reverse order in the send path, so grabbing the
kcm sock lock with the lower lock held to call sk_data_ready may lead to a
deadlock, I think. It might be possible to change the order in the send path
to do this. Something like:

  trylock on lower socket lock
  - if trylock fails
    - release kcm sock lock
    - lock lower sock
    - lock kcm sock
  call sendpage locked function

I admit that dealing with two levels of socket locks in the data path is
quite a pain :-)

Tom

> Thanks,
> John
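A rough sketch of the trylock dance Tom describes above, purely to show the
intended lock ordering. Note that lock_sock_try() is a HYPOTHETICAL
stand-in (no such helper exists in the tree; a real patch would have to
open-code a trylock against sk_lock), so this is pseudocode in C clothing:

/* Hypothetical helper names; illustrates ordering only. */
static void kcm_lock_both_for_send(struct sock *kcm_sk, struct sock *lower_sk)
{
	lock_sock(kcm_sk);
	if (!lock_sock_try(lower_sk)) {		/* HYPOTHETICAL trylock */
		/* Back off to avoid an ABBA deadlock with the rx path,
		 * then retake both locks in rx-path order. */
		release_sock(kcm_sk);
		lock_sock(lower_sk);
		lock_sock(kcm_sk);
	}
	/* Both held: safe to call the "locked" sendpage variant and
	 * sk_data_ready() on the kcm sock. */
}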
Re: [RFC PATCH net-next] net/core: initial support for stacked dev feature toggles
On Fri, Oct 23, 2015 at 11:40 PM, Jarod Wilson wrote: > There are some netdev features that make little sense to toggle on and > off in a stacked device setup on only one device in the stack. The prime > example is a bonded connection, where it really doesn't make sense to > disable LRO on the master, but not on any of the slaves, nor does it > really make sense to be able to shut LRO off on a slave when its still > enabled on the master. > > The strategy here is to add a section near the end of > netdev_fix_features() that looks for upper and lower netdevs, then make > sure certain feature flags match both up and down the stack. At present, > only the LRO flag is included. > > This has been successfully tested with bnx2x, qlcnic and netxen network > cards as slaves in a bond interface. Turning LRO on or off on the master > also turns it on or off on each of the slaves, new slaves are added with > LRO in the same state as the master, and LRO can't be toggled on the > slaves. > > Also, this should largely remove the need for dev_disable_lro(), and most, > if not all, of its call sites can be replaced by simply making sure > NETIF_F_LRO isn't included in the relevant device's feature flags. > > Note that this patch is driven by bug reports from users saying it was > confusing that bonds and slaves had different settings for the same > features, and while it won't be 100% in sync if a lower device doesn't > support a feature like LRO, I think this is a good step in the right > direction. > I don't see what real problem this is solving. LRO is purely a feature of physical devices and should be irrelevant to be configured on any type of virtual device. I think the same thing will be true of RX csum and other device RX functions (but this is not true for transmit features). Seems like a better fix might be to disallow setting these features on the bonding device in the first place, then we don't need to worry about syncing them amongst slaves-- if a user needs that it's a simple script. Tom > CC: "David S. 
Miller" > CC: Eric Dumazet > CC: Jay Vosburgh > CC: Veaceslav Falico > CC: Andy Gospodarek > CC: Jiri Pirko > CC: Nikolay Aleksandrov > CC: Michal Kubecek > CC: net...@vger.kernel.org > Signed-off-by: Jarod Wilson > --- > net/core/dev.c | 57 + > 1 file changed, 57 insertions(+) > > diff --git a/net/core/dev.c b/net/core/dev.c > index 1225b4b..26f4e2d 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -6261,9 +6261,57 @@ static void rollback_registered(struct net_device *dev) > list_del(&single); > } > > +static netdev_features_t netdev_sync_upper_features(struct net_device *lower, > + struct net_device *upper, netdev_features_t features) > +{ > + netdev_features_t want = upper->wanted_features & lower->hw_features; > + > + if (!(upper->wanted_features & NETIF_F_LRO) > + && (features & NETIF_F_LRO)) { > + netdev_info(lower, "Dropping LRO, upper dev %s has it off.\n", > + upper->name); > + features &= ~NETIF_F_LRO; > + } else if ((want & NETIF_F_LRO) && !(features & NETIF_F_LRO)) { > + netdev_info(lower, "Keeping LRO, upper dev %s has it on.\n", > + upper->name); > + features |= NETIF_F_LRO; > + } > + > + return features; > +} > + > +static void netdev_sync_lower_features(struct net_device *upper, > + struct net_device *lower, netdev_features_t features) > +{ > + netdev_features_t want = features & lower->hw_features; > + > + if (!(features & NETIF_F_LRO) && (lower->features & NETIF_F_LRO)) { > + netdev_info(upper, "Disabling LRO on lower dev %s.\n", > + lower->name); > + upper->wanted_features &= ~NETIF_F_LRO; > + lower->wanted_features &= ~NETIF_F_LRO; > + netdev_update_features(lower); > + if (unlikely(lower->features & NETIF_F_LRO)) > + netdev_WARN(upper, "failed to disable LRO on %s!\n", > + lower->name); > + } else if ((want & NETIF_F_LRO) && !(lower->features & NETIF_F_LRO)) { > + netdev_info(upper, "Enabling LRO on lower dev %s.\n", > + lower->name); > + upper->wanted_features |= NETIF_F_LRO; > + lower->wanted_features |= NETIF_F_LRO; > + netdev_update_features(lower); > + if (unlikely(!(lower->features & NETIF_F_LRO))) > + netdev_WARN(upper, "failed to enable LRO on %s!\n", > + lower->name); > + } > +} > + > static netdev_features_t netdev_fix_features(struct net_device *dev, > netdev_features_t features) > { > + struct net_device *upper, *lower; > + struct list_head *iter; > + > /* Fix illegal checksum combination
Re: BUG: free active (active state 0) object type: work_struct hint: strp_work
On Thu, Jan 4, 2018 at 4:10 AM, syzbot wrote: > Hello, > > syzkaller hit the following crash on > 6bb8824732f69de0f233ae6b1a8158e149627b38 > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached > Raw console output is attached. > Unfortunately, I don't have any reproducer for this bug yet. > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+3c6c745b0d2f341bb...@syzkaller.appspotmail.com > It will help syzbot understand when the bug is fixed. See footer for > details. > If you forward the report, please keep this part and the footer. > > Use struct sctp_assoc_value instead > sctp: [Deprecated]: syz-executor4 (pid 12483) Use of int in maxseg socket > option. > Use struct sctp_assoc_value instead > [ cut here ] > ODEBUG: free active (active state 0) object type: work_struct hint: > strp_work+0x0/0xf0 net/strparser/strparser.c:381 > WARNING: CPU: 1 PID: 3502 at lib/debugobjects.c:291 > debug_print_object+0x166/0x220 lib/debugobjects.c:288 > Kernel panic - not syncing: panic_on_warn set ... > > CPU: 1 PID: 3502 Comm: kworker/u4:4 Not tainted 4.15.0-rc5+ #170 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Workqueue: kkcmd kcm_tx_work > Call Trace: > __dump_stack lib/dump_stack.c:17 [inline] > dump_stack+0x194/0x257 lib/dump_stack.c:53 > panic+0x1e4/0x41c kernel/panic.c:183 > __warn+0x1dc/0x200 kernel/panic.c:547 > report_bug+0x211/0x2d0 lib/bug.c:184 > fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178 > fixup_bug arch/x86/kernel/traps.c:247 [inline] > do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296 > do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 > invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1061 > RIP: 0010:debug_print_object+0x166/0x220 lib/debugobjects.c:288 > RSP: 0018:8801c0ee7068 EFLAGS: 00010086 > RAX: dc08 RBX: 0003 RCX: 8159bc3e > RDX: RSI: 1100381dcdc8 RDI: 8801db317dd0 > RBP: 8801c0ee70a8 R08: R09: 1100381dcd9a > R10: ed00381dce3c R11: 86137ad8 R12: 0001 > R13: 86113480 R14: 8560dc40 R15: 8146e5f0 > __debug_check_no_obj_freed lib/debugobjects.c:745 [inline] > debug_check_no_obj_freed+0x662/0xf1f lib/debugobjects.c:774 > kmem_cache_free+0x253/0x2a0 mm/slab.c:3745 I believe we just need to defer kmem_cache_free to call_rcu. Tom > unreserve_psock+0x5a1/0x780 net/kcm/kcmsock.c:547 > kcm_write_msgs+0xbae/0x1b80 net/kcm/kcmsock.c:590 > kcm_tx_work+0x2e/0x190 net/kcm/kcmsock.c:731 > process_one_work+0xbbf/0x1b10 kernel/workqueue.c:2112 > worker_thread+0x223/0x1990 kernel/workqueue.c:2246 > kthread+0x33c/0x400 kernel/kthread.c:238 > ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:515 > > == > WARNING: possible circular locking dependency detected > 4.15.0-rc5+ #170 Not tainted > -- > kworker/u4:4/3502 is trying to acquire lock: > ((console_sem).lock){-.-.}, at: [<91214b42>] down_trylock+0x13/0x70 > kernel/locking/semaphore.c:136 > > but task is already holding lock: > (&obj_hash[i].lock){-.-.}, at: [] > __debug_check_no_obj_freed lib/debugobjects.c:736 [inline] > (&obj_hash[i].lock){-.-.}, at: [ ] > debug_check_no_obj_freed+0x1e9/0xf1f lib/debugobjects.c:774 > > which lock already depends on the new lock. 
> > > the existing dependency chain (in reverse order) is: > > -> #3 (&obj_hash[i].lock){-.-.}: >__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] >_raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152 >__debug_object_init+0x109/0x1040 lib/debugobjects.c:343 >debug_object_init+0x17/0x20 lib/debugobjects.c:391 >debug_hrtimer_init kernel/time/hrtimer.c:396 [inline] >debug_init kernel/time/hrtimer.c:441 [inline] >hrtimer_init+0x8c/0x410 kernel/time/hrtimer.c:1122 >init_dl_task_timer+0x1b/0x50 kernel/sched/deadline.c:1023 >__sched_fork+0x2c4/0xb70 kernel/sched/core.c:2188 >init_idle+0x75/0x820 kernel/sched/core.c:5279 >sched_init+0xb19/0xc43 kernel/sched/core.c:5976 >start_kernel+0x452/0x819 init/main.c:582 >x86_64_start_reservations+0x2a/0x2c arch/x86/kernel/head64.c:378 >x86_64_start_kernel+0x77/0x7a arch/x86/kernel/head64.c:359 >secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:237 > > -> #2 (&rq->lock){-.-.}: >__raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] >_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:144 >rq_lock kernel/sched/sched.h:1766 [inline] >task_fork_fair+0x7a/0x690 kernel/sched/fair.c:9449 >sched_fork+0x435/0xc00 kernel/sched/core.c:2404 >copy_process.part
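Tom's suggested fix, deferring the kmem_cache_free() with call_rcu(), would
look roughly like the sketch below. The rcu_head member is an assumed
addition; kcm_psockp follows the cache pointer name used in
net/kcm/kcmsock.c:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct kcm_psock {
	/* ... existing fields ... */
	struct rcu_head rcu;	/* assumed new member */
};

static struct kmem_cache *kcm_psockp;

static void kcm_psock_free_rcu(struct rcu_head *head)
{
	struct kcm_psock *psock = container_of(head, struct kcm_psock, rcu);

	kmem_cache_free(kcm_psockp, psock);
}

/* At the old kmem_cache_free(kcm_psockp, psock) site, defer instead:
 *
 *	call_rcu(&psock->rcu, kcm_psock_free_rcu);
 *
 * so concurrent users (and the queued work) drain before the free. */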
Re: KASAN: use-after-free Read in psock_write_space
On Tue, Jan 16, 2018 at 12:40 PM, syzbot wrote: > syzkaller has found reproducer for the following crash on > a8750ddca918032d6349adbf9a4b6555e7db20da > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached > Raw console output is attached. > C reproducer is attached > syzkaller reproducer is attached. See https://goo.gl/kgGztJ > for information about syzkaller reproducers > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+8865eaff7f9acd593...@syzkaller.appspotmail.com > It will help syzbot understand when the bug is fixed. > > == > BUG: KASAN: use-after-free in psock_write_space+0x143/0x160 > net/kcm/kcmsock.c:418 > Read of size 8 at addr 8801bb731d20 by task syzkaller858097/3665 > > CPU: 0 PID: 3665 Comm: syzkaller858097 Not tainted 4.15.0-rc8+ #263 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:17 [inline] > dump_stack+0x194/0x257 lib/dump_stack.c:53 > print_address_description+0x73/0x250 mm/kasan/report.c:252 > kasan_report_error mm/kasan/report.c:351 [inline] > kasan_report+0x25b/0x340 mm/kasan/report.c:409 > __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430 > psock_write_space+0x143/0x160 net/kcm/kcmsock.c:418 > sock_wfree+0x10b/0x140 net/core/sock.c:1805 > skb_orphan include/linux/skbuff.h:2521 [inline] > loopback_xmit+0x12e/0x6b0 drivers/net/loopback.c:78 > __netdev_start_xmit include/linux/netdevice.h:4042 [inline] > netdev_start_xmit include/linux/netdevice.h:4051 [inline] > xmit_one net/core/dev.c:3003 [inline] > dev_hard_start_xmit+0x24e/0xac0 net/core/dev.c:3019 > __dev_queue_xmit+0x206d/0x2920 net/core/dev.c:3500 > dev_queue_xmit+0x17/0x20 net/core/dev.c:3533 > neigh_hh_output include/net/neighbour.h:472 [inline] > neigh_output include/net/neighbour.h:480 [inline] > ip6_finish_output2+0x1498/0x23a0 net/ipv6/ip6_output.c:120 > ip6_finish_output+0x302/0x930 net/ipv6/ip6_output.c:146 > NF_HOOK_COND include/linux/netfilter.h:239 [inline] > ip6_output+0x1eb/0x840 net/ipv6/ip6_output.c:163 > dst_output include/net/dst.h:460 [inline] > NF_HOOK include/linux/netfilter.h:250 [inline] > ip6_xmit+0xd84/0x2090 net/ipv6/ip6_output.c:269 > inet6_csk_xmit+0x2fc/0x580 net/ipv6/inet6_connection_sock.c:139 > dccp_transmit_skb+0x9ac/0x10f0 net/dccp/output.c:142 > dccp_send_reset+0x21c/0x2a0 net/dccp/output.c:530 > dccp_disconnect+0x90e/0xbb0 net/dccp/proto.c:276 > inet_child_forget+0x6b/0x320 net/ipv4/inet_connection_sock.c:899 > inet_csk_listen_stop+0x128/0x920 net/ipv4/inet_connection_sock.c:987 > dccp_close+0x780/0xc20 net/dccp/proto.c:1007 > inet_release+0xed/0x1c0 net/ipv4/af_inet.c:426 > inet6_release+0x50/0x70 net/ipv6/af_inet6.c:432 > sock_release+0x8d/0x1e0 net/socket.c:602 > sock_close+0x16/0x20 net/socket.c:1131 > __fput+0x327/0x7e0 fs/file_table.c:210 > fput+0x15/0x20 fs/file_table.c:244 > task_work_run+0x199/0x270 kernel/task_work.c:113 > exit_task_work include/linux/task_work.h:22 [inline] > do_exit+0x9bb/0x1ad0 kernel/exit.c:865 > do_group_exit+0x149/0x400 kernel/exit.c:968 > get_signal+0x73f/0x16c0 kernel/signal.c:2335 > do_signal+0x90/0x1eb0 arch/x86/kernel/signal.c:809 > exit_to_usermode_loop+0x214/0x310 arch/x86/entry/common.c:158 > prepare_exit_to_usermode arch/x86/entry/common.c:195 [inline] > syscall_return_slowpath+0x490/0x550 arch/x86/entry/common.c:264 > entry_SYSCALL_64_fastpath+0x9e/0xa0 > RIP: 0033:0x445819 > RSP: 002b:7fad17780da8 EFLAGS: 
0246 ORIG_RAX: 00ca > RAX: fe00 RBX: 006dac3c RCX: 00445819 > RDX: RSI: RDI: 006dac3c > RBP: R08: R09: > R10: R11: 0246 R12: 006dac38 > R13: 0100 R14: 656c6c616b7a7973 R15: 000b > > Allocated by task 3664: > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > set_track mm/kasan/kasan.c:459 [inline] > kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551 > kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489 > kmem_cache_alloc+0x12e/0x760 mm/slab.c:3544 > kmem_cache_zalloc include/linux/slab.h:678 [inline] > kcm_attach net/kcm/kcmsock.c:1394 [inline] > kcm_attach_ioctl net/kcm/kcmsock.c:1460 [inline] > kcm_ioctl+0x2d2/0x1690 net/kcm/kcmsock.c:1665 > sock_do_ioctl+0x65/0xb0 net/socket.c:966 > sock_ioctl+0x2c2/0x440 net/socket.c:1063 > vfs_ioctl fs/ioctl.c:46 [inline] > do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686 > SYSC_ioctl fs/ioctl.c:701 [inline] > SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692 > entry_SYSCALL_64_fastpath+0x29/0xa0 > > Freed by task 3665: > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > set_track mm/kasan/kasan.c:459 [inline] > kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524 > __cache_free mm/slab.c:3488 [in
Re: KASAN: use-after-free Read in get_work_pool
On Sun, Mar 11, 2018 at 2:34 PM, Eric Biggers wrote: > On Wed, Feb 14, 2018 at 02:45:05PM +0100, 'Dmitry Vyukov' via syzkaller-bugs > wrote: >> On Wed, Dec 6, 2017 at 1:50 PM, Dmitry Vyukov wrote: >> > On Fri, Oct 27, 2017 at 11:18 PM, Cong Wang >> > wrote: >> >> On Thu, Oct 26, 2017 at 11:00 PM, Dmitry Vyukov >> >> wrote: >> >>> On Thu, Oct 26, 2017 at 7:58 PM, Tejun Heo wrote: >> Hello, >> >> On Thu, Oct 26, 2017 at 09:35:44AM -0700, syzbot wrote: >> > BUG: KASAN: use-after-free in __read_once_size >> > include/linux/compiler.h:276 [inline] >> > BUG: KASAN: use-after-free in atomic64_read >> > arch/x86/include/asm/atomic64_64.h:21 [inline] >> > BUG: KASAN: use-after-free in atomic_long_read >> > include/asm-generic/atomic-long.h:44 [inline] >> > BUG: KASAN: use-after-free in get_work_pool+0x1c2/0x1e0 >> > kernel/workqueue.c:709 >> > Read of size 8 at addr 8801cc58c378 by task syz-executor5/21326 >> > >> > CPU: 1 PID: 21326 Comm: syz-executor5 Not tainted 4.13.0+ #43 >> > Hardware name: Google Google Compute Engine/Google Compute Engine, >> > BIOS Google 01/01/2011 >> > Call Trace: >> > __dump_stack lib/dump_stack.c:16 [inline] >> > dump_stack+0x194/0x257 lib/dump_stack.c:52 >> > print_address_description+0x73/0x250 mm/kasan/report.c:252 >> > kasan_report_error mm/kasan/report.c:351 [inline] >> > kasan_report+0x24e/0x340 mm/kasan/report.c:409 >> > __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430 >> > __read_once_size include/linux/compiler.h:276 [inline] >> > atomic64_read arch/x86/include/asm/atomic64_64.h:21 [inline] >> > atomic_long_read include/asm-generic/atomic-long.h:44 [inline] >> > get_work_pool+0x1c2/0x1e0 kernel/workqueue.c:709 >> > __queue_work+0x235/0x1150 kernel/workqueue.c:1401 >> > queue_work_on+0x16a/0x1c0 kernel/workqueue.c:1486 >> > queue_work include/linux/workqueue.h:489 [inline] >> > strp_check_rcv+0x25/0x30 net/strparser/strparser.c:553 >> > kcm_attach net/kcm/kcmsock.c:1439 [inline] >> > kcm_attach_ioctl net/kcm/kcmsock.c:1460 [inline] >> > kcm_ioctl+0x826/0x1610 net/kcm/kcmsock.c:1695 >> > sock_do_ioctl+0x65/0xb0 net/socket.c:961 >> > sock_ioctl+0x2c2/0x440 net/socket.c:1058 >> > vfs_ioctl fs/ioctl.c:45 [inline] >> > do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685 >> > SYSC_ioctl fs/ioctl.c:700 [inline] >> > SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691 >> > entry_SYSCALL_64_fastpath+0x1f/0xbe >> >> Looks like kcm is trying to reuse a work item whose last workqueue has >> been destroyed without re-initing it. A work item needs to be >> reinit'd. >> >>> >> >>> +kcm maintainers >> >> >> >> Can you try the fix below? There is no C reproducer so I can't verify it. >> > >> > >> > Hi Cong, >> > >> > syzbot can now test proposed patches, see >> > https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot >> > for details. Please give it a try. >> > >> >> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c >> >> index af4e76ac88ff..7816f44c576a 100644 >> >> --- a/net/kcm/kcmsock.c >> >> +++ b/net/kcm/kcmsock.c >> >> @@ -1433,11 +1433,12 @@ static int kcm_attach(struct socket *sock, >> >> struct socket *csock, >> >> KCM_STATS_INCR(mux->stats.psock_attach); >> >> mux->psocks_cnt++; >> >> psock_now_avail(psock); >> >> - spin_unlock_bh(&mux->lock); >> >> >> >> /* Schedule RX work in case there are already bytes queued */ >> >> strp_check_rcv(&psock->strp); >> >> >> >> + spin_unlock_bh(&mux->lock); >> >> + >> >> return 0; >> >> } >> >> >> Hi Cong, >> >> Was this ever merged? Is it still necessary? 
>> > > syzbot is no longer hitting this bug for some reason but it's still there. > Tom, > it looks like you wrote the buggy code (it's yet another KCM bug, apparently); > can you please look into it? > Yes. Thank you for the simple reproducer. Tom > I've put together a C reproducer that works on latest linux-next > (next-20180309, > commit 61530b14b059d). It works as an unprivileged user provided that KCM is > enabled, and that KASAN is enabled so you see the use-after-free report: > > #include > #include > #include > #include > #include > #include > #include > > int main() > { > union bpf_attr prog = { > .prog_type = BPF_PROG_TYPE_SOCKET_FILTER, > .insn_cnt = 2, > .insns = (__u64)(__u64[]){ 0xb7, 0x95 }, > .license = (__u64)"", > }; > int tcp_fd, bpf_fd, kcm_fd; > struct sockaddr_in addr = { > .sin_family = AF_INET, > .sin_port = __constant_htons(3270), > .sin_addr = { __constant_htonl(0x7f01) } > }; > > tcp_fd = socket(AF_INET, SOCK_STREAM, 0); > bind(tcp_fd, (void *)&addr, sizeof(addr)); > listen(tcp_fd, 1
Re: [PATCHv3 net] i40e: Implement ndo_gso_check()
On Thu, Dec 4, 2014 at 10:39 AM, Joe Stringer wrote:
> ndo_gso_check() was recently introduced to allow NICs to report the
> offloading support that they have on a per-skb basis. Add an
> implementation for this driver which checks for IPIP, GRE, UDP tunnels.
>
> Signed-off-by: Joe Stringer
> ---
> v3: Drop IPIP and GRE (no driver support even though hw supports it).
>     Check for UDP outer protocol for UDP tunnels.
> v2: Expand to include IP in IP and IPv4/IPv6 inside GRE/UDP tunnels.
>     Add MAX_INNER_LENGTH (as 80).
> ---
> drivers/net/ethernet/intel/i40e/i40e_main.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c
> b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index c3a7f4a..0d6493a 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -7447,6 +7447,31 @@ static int i40e_ndo_fdb_dump(struct sk_buff *skb,
>
> #endif /* USE_DEFAULT_FDB_DEL_DUMP */
> #endif /* HAVE_FDB_OPS */
> +static bool i40e_gso_check(struct sk_buff *skb, struct net_device *dev)
> +{
> +	if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL) {
> +		unsigned char *ihdr;
> +
> +		if (skb->protocol != IPPROTO_UDP ||
> +		    skb->inner_protocol_type != ENCAP_TYPE_ETHER)
> +			return false;
> +
> +		if (skb->inner_protocol == htons(ETH_P_TEB))
> +			ihdr = skb_inner_mac_header(skb);
> +		else if (skb->inner_protocol == htons(ETH_P_IP) ||
> +			 skb->inner_protocol == htons(ETH_P_IPV6))
> +			ihdr = skb_inner_network_header(skb);
> +		else
> +			return false;
> +
Wow, this is getting complicated! :-(

It's not clear that the protocol specific checks are needed here since it
looks like the header length is being passed to the device later on. Also,
I think we need skb_inner_mac_header(skb) - skb_transport_header(skb) to
always work to give the length of the encapsulation headers (in case there
is no real inner mac header, then that offset should be for the network
header). So would a simple check like this work:

	if (skb->encapsulation &&
	    (skb_inner_mac_header(skb) - skb_transport_header(skb)) >
	    MAX_TUNNEL_HDR_LEN)
		return false;

> +#define MAX_TUNNEL_HDR_LEN 80
> +		if (ihdr - skb_transport_header(skb) > MAX_TUNNEL_HDR_LEN)
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> static const struct net_device_ops i40e_netdev_ops = {
> 	.ndo_open		= i40e_open,
> 	.ndo_stop		= i40e_close,
> @@ -7487,6 +7512,7 @@ static const struct net_device_ops i40e_netdev_ops = {
> 	.ndo_fdb_dump		= i40e_ndo_fdb_dump,
> #endif
> #endif
> +	.ndo_gso_check		= i40e_gso_check,
> };
>
> /**
> --
> 1.7.10.4
>
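Written out in full, Tom's proposed simplification amounts to the sketch
below (the MAX_TUNNEL_HDR_LEN of 80 is taken from Joe's patch; whether the
protocol-specific checks can really be dropped is the open question):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define MAX_TUNNEL_HDR_LEN 80

/* Allow GSO unless the packet is encapsulated and the outer headers
 * exceed what the hardware can parse. */
static bool i40e_gso_check(struct sk_buff *skb, struct net_device *dev)
{
	return !skb->encapsulation ||
	       skb_inner_mac_header(skb) - skb_transport_header(skb) <=
	       MAX_TUNNEL_HDR_LEN;
}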
Re: [PATCH] include:linux:Optimizations to __skb_push
On Thu, Jan 22, 2015 at 10:02 AM, Mohammad Jamal wrote:
> This patch optimizes the __skb_push function
>
> Signed-off-by: Mohammad Jamal
> ---
> include/linux/skbuff.h | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 85ab7d7..9acffb2 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1671,9 +1671,8 @@ static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
>  unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
>  static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
>  {
> -	skb->data -= len;
>  	skb->len += len;
> -	return skb->data;
> +	return skb->data -= len;
>  }
>
Hmm, this seems less readable to me. What is the effect of this patch? Does
the compiler even produce different assembly?

Thanks,
Tom

>  unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
> --
> 1.7.9.5
>
Re: [lkp] [flow_dissector] 6a74fcf426f: RIP: 0010:[] [] __skb_flow_dissect+0x659/0x830
On Sun, Jun 14, 2015 at 11:12 PM, Huang Ying wrote: > FYI, we noticed the below changes on > > git://flatbed.openfabrics.org/~amirv/linux.git for-upstream > commit 6a74fcf426f51aaa569f0b973d10ac20468df238 ("flow_dissector: add support > for dst, hop-by-hop and routing ext hdrs") > Fix was pushed to net-next, commit 1e98a0f08abddde87f0f93237f10629ecb4880ef ("flow_dissector: fix ipv6 dst, hop-by-hop and routing ext hdrs") Thanks > > ++++ > || 611d23c559 | 6a74fcf426 | > ++++ > | boot_successes | 13 | 14 | > | boot_failures | 0 | 12 | > | RIP:__skb_flow_dissect | 0 | 8 | > | Kernel_panic-not_syncing:softlockup:hung_tasks | 0 | 12 | > | RIP:skb_copy_bits | 0 | 4 | > ++++ > > > [ 10.691510] random: systemd urandom read with 16 bits of entropy available > [ 30.147484] random: nonblocking pool is initialized > [ 38.270110] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! > [swapper/7:0] > [ 38.278272] Modules linked in: > [ 38.281688] CPU: 7 PID: 0 Comm: swapper/7 Not tainted > 4.1.0-rc7-05001-gd94988d #1 > [ 38.290042] Hardware name: Intel Corporation S5520UR/S5520UR, BIOS > S5500.86B.01.00.0050.050620101605 05/06/2010 > [ 38.301305] task: 8801e86258e0 ti: 8801e8634000 task.ti: > 8801e8634000 > [ 38.309660] RIP: 0010:[] [] > __skb_flow_dissect+0x659/0x830 > [ 38.319415] RSP: :88036fac39b8 EFLAGS: 0292 > [ 38.325343] RAX: fff2 RBX: dd86 RCX: > 0002 > [ 38.07] RDX: 88036fac39e0 RSI: 3388fa16 RDI: > 88008299a400 > [ 38.341273] RBP: 88036fac3a38 R08: dd86 R09: > 3388fa16 > [ 38.349238] R10: R11: R12: > 88036fac3928 > [ 38.357202] R13: 818d123e R14: 88036fac3a38 R15: > 81e688d0 > [ 38.365175] FS: () GS:88036fac() > knlGS: > [ 38.374209] CS: 0010 DS: ES: CR0: 8005003b > [ 38.380621] CR2: CR3: 00036ecaf000 CR4: > 06e0 > [ 38.388584] Stack: > [ 38.390825] 223080006fac39f8 8800dd86 c9003388fa16 > 88036fac3a64 > [ 38.399121] 00035eb53802 2230 > 88008299a400 > [ 38.407416] 88036fac3ad0 88036fac3ac8 > 88008299a400 > [ 38.415712] Call Trace: > [ 38.418441] > [ 38.420585] [] __skb_get_hash+0x8a/0x2f0 > [ 38.427011] [] ? dev_hard_start_xmit+0x274/0x400 > [ 38.434006] [] __skb_tx_hash+0x99/0xb0 > [ 38.440030] [] __netdev_pick_tx+0x156/0x170 > [ 38.446541] [] netdev_pick_tx+0xab/0xf0 > [ 38.452662] [] __dev_queue_xmit+0x91/0x550 > [ 38.459077] [] dev_queue_xmit_sk+0x13/0x20 > [ 38.465491] [] ip6_finish_output2+0x2dc/0x490 > [ 38.472196] [] ? selinux_ipv6_postroute+0x1a/0x20 > [ 38.479290] [] ? nf_iterate+0x57/0x80 > [ 38.485220] [] ip6_finish_output+0x8f/0xf0 > [ 38.491634] [] ip6_output+0x48/0xf0 > [ 38.497368] [] ? ip6_fragment+0xaa0/0xaa0 > [ 38.503688] [] NF_HOOK_THRESH+0x27/0x80 > [ 38.511071] [] ? icmp6_dst_alloc+0x121/0x190 > [ 38.517679] [] mld_sendpack+0x16c/0x270 > [ 38.523802] [] mld_ifc_timer_expire+0x19d/0x2f0 > [ 38.530701] [] ? mld_dad_timer_expire+0x70/0x70 > [ 38.537601] [] call_timer_fn+0x39/0x140 > [ 38.543724] [] ? mld_dad_timer_expire+0x70/0x70 > [ 38.550624] [] run_timer_softirq+0x26f/0x340 > [ 38.557232] [] __do_softirq+0x119/0x2e0 > [ 38.563355] [] irq_exit+0x125/0x130 > [ 38.569091] [] smp_apic_timer_interrupt+0x46/0x60 > [ 38.576182] [] apic_timer_interrupt+0x6e/0x80 > [ 38.582886] > [ 38.585032] [] ? clockevents_switch_state+0x1b/0x60 > [ 38.592525] [] ? cpuidle_enter_state+0xac/0x210 > [ 38.599414] [] ? 
cpuidle_enter_state+0x7f/0x210 > [ 38.606312] [] cpuidle_enter+0x17/0x20 > [ 38.612338] [] cpu_startup_entry+0x38e/0x410 > [ 38.618948] [] start_secondary+0x143/0x170 > [ 38.625360] Code: 32 4d 85 ed 74 2d 48 8d 55 a8 44 89 ce b9 02 00 00 00 4c > 89 ef 44 89 45 88 44 88 55 8c 44 89 4d 90 e8 4c 2f ff ff 44 0f b6 55 8c <44> > 8b 45 88 44 8b 4d 90 0f b6 45 a9 44 0f b6 5d a8 8d 04 c5 08 > [ 38.647067] Kernel panic - not syncing: softlockup: hung tasks > [ 38.653576] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G L > 4.1.0-rc7-05001-gd94988d #1 > [ 38.663288] Hardware name: Intel Corporation S5520UR/S5520UR, BIOS > S5500.86B.01.00.0050.050620101605 05/06/2010 > [ 38.674552] 000d 88036fac3750 81
Re: [LKP] [net] 11ef7a8996d: +27.3% netperf.Throughput_Mbps
Are you pointing out a regression in this?

On Thu, Jul 24, 2014 at 11:31 PM, Aaron Lu wrote:
> FYI, we noticed the below changes on
>
> commit 11ef7a8996d5d433c9cd75d80651297eccbf6d42 ("net: Performance fix for
> process_backlog")
>
> test case: lkp-t410/netperf/300s-200%-10K-SCTP_STREAM_MANY
>
> 68b7107b62983f2    11ef7a8996d5d433c9cd75d80
> ---------------    -------------------------
>       1023 ~ 3%      +27.3%       1302 ~ 0%  TOTAL netperf.Throughput_Mbps
>       0.72 ~10%      -91.4%       0.06 ~23%  TOTAL turbostat.%c3
>      13385 ~12%      -92.6%        987 ~12%  TOTAL cpuidle.C6-NHM.usage
>      22745 ~10%      -95.5%       1016 ~40%  TOTAL cpuidle.C3-NHM.usage
>   42675736 ~12%      -84.6%    6571705 ~19%  TOTAL cpuidle.C6-NHM.time
>      88342 ~11%      -82.1%      15811 ~ 9%  TOTAL softirqs.SCHED
>   19148006 ~12%      -95.7%     821873 ~27%  TOTAL cpuidle.C3-NHM.time
>     439.94 ~10%      -77.5%      99.05 ~ 5%  TOTAL uptime.idle
>       1.35 ~ 6%      -65.8%       0.46 ~24%  TOTAL turbostat.%c6
>          4 ~23%     +114.3%          9 ~ 0%  TOTAL vmstat.procs.r
>       1680 ~ 3%      +40.4%       2359 ~ 3%  TOTAL proc-vmstat.nr_alloc_batch
>     447921 ~ 4%      +36.2%     610047 ~ 0%  TOTAL softirqs.TIMER
>       9.09 ~ 9%      +27.6%      11.60 ~ 8%  TOTAL perf-profile.cpu-cycles.memcpy.sctp_outq_flush.sctp_outq_uncork.sctp_cmd_interpreter.sctp_do_sm
>    2350916 ~ 4%      +35.4%    3184088 ~ 0%  TOTAL proc-vmstat.pgalloc_dma
>       7.82 ~ 9%      +24.1%       9.71 ~ 4%  TOTAL perf-profile.cpu-cycles.copy_user_generic_string.sctp_user_addto_chunk.sctp_datamsg_from_user.sctp_sendmsg.inet_sendmsg
>   38837537 ~ 3%      +27.1%   49358639 ~ 0%  TOTAL proc-vmstat.numa_local
>   38837537 ~ 3%      +27.1%   49358639 ~ 0%  TOTAL proc-vmstat.numa_hit
>      50216 ~ 1%      +17.0%      58745 ~ 6%  TOTAL softirqs.RCU
>       1.41 ~ 4%      +11.6%       1.58 ~ 1%  TOTAL perf-profile.cpu-cycles.get_page_from_freelist.__alloc_pages_nodemask.alloc_kmem_pages_node.kmalloc_large_node.__kmalloc_node_track_caller
>      48539 ~ 1%      -25.5%      36171 ~ 1%  TOTAL vmstat.system.cs
>      75.64 ~ 4%      +30.6%      98.82 ~ 0%  TOTAL turbostat.%c0
>       3949 ~ 2%      +13.4%       4478 ~ 0%  TOTAL vmstat.system.in
>
> Legend:
>         ~XX%    - stddev percent
>         [+-]XX% - change percent
>
>
> [ vmstat.system.cs time-series chart elided: bisect-good (*) samples
>   fluctuate around 46000-50000 context switches/s, bisect-bad (O)
>   samples around 34000-38000 ]
>
> [*] bisect-good sample
> [O] bisect-bad sample
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
> Thanks,
> Aaron
Re: [LKP] [net] 11ef7a8996d: +27.3% netperf.Throughput_Mbps
On Thu, Jul 24, 2014 at 11:49 PM, Aaron Lu wrote:
> On 07/25/2014 02:48 PM, David Miller wrote:
>> From: Tom Herbert
>> Date: Thu, 24 Jul 2014 23:37:07 -0700
>>
>>> Are you pointing out a regression in this?
>>
>> I think he's letting you know that performance has improved, in fact :)
>
> Yes, exactly :-)
>
Very nice, thanks for posting the results!

> -Aaron
>
Re: [PATCH] net: fix setting csum_start in skb_segment()
On Wed, Jun 25, 2014 at 12:51 PM, Eric Dumazet wrote:
> From: Tom Herbert
>
> Dave Jones reported that a crash is occurring in
>
> csum_partial
> tcp_gso_segment
> inet_gso_segment
> ? update_dl_migration
> skb_mac_gso_segment
> __skb_gso_segment
> dev_hard_start_xmit
> sch_direct_xmit
> __dev_queue_xmit
> ? dev_hard_start_xmit
> dev_queue_xmit
> ip_finish_output
> ? ip_output
> ip_output
> ip_forward_finish
> ip_forward
> ip_rcv_finish
> ip_rcv
> __netif_receive_skb_core
> ? __netif_receive_skb_core
> ? trace_hardirqs_on
> __netif_receive_skb
> netif_receive_skb_internal
> napi_gro_complete
> ? napi_gro_complete
> dev_gro_receive
> ? dev_gro_receive
> napi_gro_receive
>
> It looks like a likely culprit is that SKB_GSO_CB()->csum_start is
> not set correctly when doing non-scatter gather. We are using
> offset as opposed to doffset.
>
Acked-by: Tom Herbert

> Reported-by: Dave Jones
> Tested-by: Dave Jones
> Signed-off-by: Tom Herbert
> Signed-off-by: Eric Dumazet
> Fixes: 7e2b10c1e52ca ("net: Support for multiple checksums with gso")
> ---
> net/core/skbuff.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 9cd5344fad73..c1a33033cbe2 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2993,7 +2993,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>  						   skb_put(nskb, len),
>  						   len, 0);
>  		SKB_GSO_CB(nskb)->csum_start =
> -			skb_headroom(nskb) + offset;
> +			skb_headroom(nskb) + doffset;
>  		continue;
>  	}
>
[-next] Regression: ssh log in slowdown
>
> I assume this is the series "[PATCH 0/4] Checksum fixes"
> (marc.info/?l=linux-netdev&m=140261417832399&w=2)?
>
Yes.

> As I'm not subscribed to netdev, I cannot reply to that thread.
>
> "[PATCH 1/4] net: Fix save software checksum complete" fixes the issue
> for me.
> However, "[PATCH 2/4] udp: call __skb_checksum_complete when doing full
> checksum" reintroduces the exact same bad behavior :-(
>
This implies the problem is happening in the UDP path. AFAICT skb->csum is
correct, and I don't seem to be able to reproduce the issue on my setup. It
is possible that an skb copy is happening, in which case we don't copy
csum_valid, so that checksum_unnecessary would fail in this case. Can you
try with the patch below?

Thanks!

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bf92824..9cd5344 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -689,6 +689,9 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->ooo_okay = old->ooo_okay;
 	new->no_fcs = old->no_fcs;
 	new->encapsulation = old->encapsulation;
+	new->encap_hdr_csum = old->encap_hdr_csum;
+	new->csum_valid = old->csum_valid;
+	new->csum_complete_sw = old->csum_complete_sw;
 #ifdef CONFIG_XFRM
 	new->sp = secpath_get(old->sp);
 #endif
[-next] Regression: ssh log in slowdown
> Thanks, I applied the series "[PATCH 0/4] Checksum fixes", and the fix
> above, but it doesn't help.
>
> Note that I'm also using NFS root, which doesn't seem to be affected.
> I can happily run "ls -R /" on the serial console during the 10 s delay
> in ssh.
>
Geert,

Thanks for your patience! Can you try one more patch below with the series
applied? Also, can you look at 'netstat -s' to see if UDP checksum errors
are being reported?

Thanks,
Tom

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fdb510c..4b722bc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2820,8 +2820,8 @@ static inline __sum16 __skb_checksum_validate_complete(struct sk_buff *skb,
 	if (complete || skb->len <= CHECKSUM_BREAK) {
 		__sum16 csum;
 
+		/* skb->csum valid set in __skb_checksum_complete */
 		csum = __skb_checksum_complete(skb);
-		skb->csum_valid = !csum;
 		return csum;
 	}
 
diff --git a/net/core/datagram.c b/net/core/datagram.c
index cf6cc4e..488dd1a 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -744,6 +744,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
 		    !skb->csum_complete_sw)
 			netdev_rx_csum_fault(skb->dev);
 	}
+	skb->csum_valid = !sum;
 	return sum;
 }
 EXPORT_SYMBOL(__skb_checksum_complete_head);
@@ -767,6 +768,7 @@ __sum16 __skb_checksum_complete(struct sk_buff *skb)
 	skb->csum = csum;
 	skb->ip_summed = CHECKSUM_COMPLETE;
 	skb->csum_complete_sw = 1;
+	skb->csum_valid = !sum;
 	return sum;
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bf92824..9cd5344 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -689,6 +689,9 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->ooo_okay = old->ooo_okay;
 	new->no_fcs = old->no_fcs;
 	new->encapsulation = old->encapsulation;
+	new->encap_hdr_csum = old->encap_hdr_csum;
+	new->csum_valid = old->csum_valid;
+	new->csum_complete_sw = old->csum_complete_sw;
 #ifdef CONFIG_XFRM
 	new->sp = secpath_get(old->sp);
 #endif
Re: [PATCH 08/24] net, diet: Make TCP metrics optional
On Tue, May 6, 2014 at 11:32 AM, Andi Kleen wrote:
>> We simply can not compete with user space, as a programmer is free to
>> keep what he really wants/needs.
>
> Not true.
>
> With my patches and LTO Linux can be competitive with LWIP+socket layer
> (about 60K more text). And it's easier to use because it's just
> the standard interface.
>
>> I have started using linux on 386/486 pcs which had more than 2MB of
>> memory, it makes me sad we want linux-3.16 to run on this kind of
>> hardware, and consuming time to save few KB here and here.
>
> Linux has always been a system from very small to big.
> That's been one of its strengths. It is very adaptable.
>
> Many subsystems are very configurable for this.
> For example that is why we have both SLOB and SLUB.
> That is why we have NOMMU MM and lots of other tuning
> knobs for small systems.
>
> So if the other subsystems can do this, why should it be
> impossible for networking?
>
Can this at least be done without the combinatorial explosion in the number
of configurations? As Yuchung pointed out, these patches introduce at least
one unresolved configuration dependency.

CONFIG_SMP works quite well since with a single parameter we can
enable/disable a whole bunch of functionality in bulk, and it's quite clear
that new development cannot break smp or non-smp configurations. Maybe you
want something similar like CONFIG_NETWORK_SMALL?

Tom

> -Andi
>
> --
> a...@linux.intel.com -- Speaking for myself only
Re: [GIT] Networking
> tcp_gso_segment() makes sure that the headers are reachable in the linear
> area with the pskb_may_pull(skb, sizeof(*th)) call, and gso_make_checksum()
> is only working with the area up to SKB_GSO_CB()->csum_start which should
> be within this area for sure.
>
Seems likely that csum_start is not properly initialized in this path. I am
thinking that this may have happened in the GRO path on a checksum error
where CHECKSUM_PARTIAL (and hence csum) is not set. That would explain the
infrequency of the occurrence, and also previously not setting csum would
have just resulted in sending a corrupted packet, not a crash.

> Well, that's the precondition we seem to be relying upon, I suppose an
> assert is in order.

An assert on SKB_GSO_CB()->csum_start == 0 would confirm my suspicion.
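The assert would be cheap to add; a sketch (the wrapper, its placement, and
the use of WARN_ON_ONCE are assumptions for illustration, not a merged
patch):

#include <linux/skbuff.h>

/* Catch segments whose csum_start was never initialized before we
 * derive a checksum start offset from it. */
static inline __sum16 gso_make_checksum_checked(struct sk_buff *skb,
						__wsum res)
{
	WARN_ON_ONCE(SKB_GSO_CB(skb)->csum_start == 0);
	return gso_make_checksum(skb, res);
}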
Re: [GIT] Networking
I believe csum_start is not set correctly in the no scatter-gather case of
skb_segment. Will post a patch momentarily.

On Tue, Jun 24, 2014 at 8:05 PM, Tom Herbert wrote:
>> tcp_gso_segment() makes sure that the headers are reachable in the linear
>> area with the pskb_may_pull(skb, sizeof(*th)) call, and gso_make_checksum()
>> is only working with the area up to SKB_GSO_CB()->csum_start which should
>> be within this area for sure.
>>
> Seems likely that csum_start is not properly initialized in this path.
> I am thinking that this may have happened in GRO path on a checksum
> error where CHECKSUM_PARTIAL (and hence csum) is not set. That would
> explain the infrequency of the occurrence, and also previously not
> setting csum would have just resulted in sending a corrupted packet
> not a crash.
>
>> Well, that's the precondition we seem to be relying upon, I suppose an
>> assert is in order.
>
> Assert on SKB_GSO_CB()->csum_start == 0 would confirm my suspicion.
[PATCH] tcp: fix setting csum_start in tcp_gso_segment
Dave Jones reported that a crash is occurring in

csum_partial
tcp_gso_segment
inet_gso_segment
? update_dl_migration
skb_mac_gso_segment
__skb_gso_segment
dev_hard_start_xmit
sch_direct_xmit
__dev_queue_xmit
? dev_hard_start_xmit
dev_queue_xmit
ip_finish_output
? ip_output
ip_output
ip_forward_finish
ip_forward
ip_rcv_finish
ip_rcv
__netif_receive_skb_core
? __netif_receive_skb_core
? trace_hardirqs_on
__netif_receive_skb
netif_receive_skb_internal
napi_gro_complete
? napi_gro_complete
dev_gro_receive
? dev_gro_receive
napi_gro_receive

It looks like a likely culprit is that SKB_GSO_CB()->csum_start is
not set correctly when doing non-scatter gather. We are using
offset as opposed to doffset.

Reported-by: Dave Jones
Signed-off-by: Tom Herbert
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9cd5344..c1a3303 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2993,7 +2993,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 						   skb_put(nskb, len),
 						   len, 0);
 		SKB_GSO_CB(nskb)->csum_start =
-			skb_headroom(nskb) + offset;
+			skb_headroom(nskb) + doffset;
 		continue;
 	}
 
-- 
2.0.0.526.g5318336
Re: [RFC PATCH] net, tun: remove the flow cache
On Mon, Dec 16, 2013 at 11:26 PM, Zhi Yong Wu wrote: > From: Zhi Yong Wu > > The flow cache is an extremely broken concept, and it usually brings up > growth issues and DoS attacks, so this patch is trying to remove it from > the tuntap driver, and insteadly use a simpler way for its flow control. > Yes , in it's current state it's broken. But maybe we can try to fix it instead of arbitrarily removing it. Please see my patches on plumbing RFS into tuntap which may start to make it useful. Tom > Signed-off-by: Zhi Yong Wu > --- > drivers/net/tun.c | 208 > +++-- > 1 files changed, 10 insertions(+), 198 deletions(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 7c8343a..7c27fdc 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -32,12 +32,15 @@ > * > * Daniel Podlejski > *Modifications for 2.3.99-pre5 kernel. > + * > + * Zhi Yong Wu > + *Remove the flow cache. > */ > > #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > > #define DRV_NAME "tun" > -#define DRV_VERSION"1.6" > +#define DRV_VERSION"1.7" > #define DRV_DESCRIPTION"Universal TUN/TAP device driver" > #define DRV_COPYRIGHT "(C) 1999-2004 Max Krasnyansky " > > @@ -146,18 +149,6 @@ struct tun_file { > struct tun_struct *detached; > }; > > -struct tun_flow_entry { > - struct hlist_node hash_link; > - struct rcu_head rcu; > - struct tun_struct *tun; > - > - u32 rxhash; > - int queue_index; > - unsigned long updated; > -}; > - > -#define TUN_NUM_FLOW_ENTRIES 1024 > - > /* Since the socket were moved to tun_file, to preserve the behavior of > persist > * device, socket filter, sndbuf and vnet header size were restore when the > * file were attached to a persist device. > @@ -184,163 +175,11 @@ struct tun_struct { > int debug; > #endif > spinlock_t lock; > - struct hlist_head flows[TUN_NUM_FLOW_ENTRIES]; > - struct timer_list flow_gc_timer; > - unsigned long ageing_time; > unsigned int numdisabled; > struct list_head disabled; > void *security; > - u32 flow_count; > }; > > -static inline u32 tun_hashfn(u32 rxhash) > -{ > - return rxhash & 0x3ff; > -} > - > -static struct tun_flow_entry *tun_flow_find(struct hlist_head *head, u32 > rxhash) > -{ > - struct tun_flow_entry *e; > - > - hlist_for_each_entry_rcu(e, head, hash_link) { > - if (e->rxhash == rxhash) > - return e; > - } > - return NULL; > -} > - > -static struct tun_flow_entry *tun_flow_create(struct tun_struct *tun, > - struct hlist_head *head, > - u32 rxhash, u16 queue_index) > -{ > - struct tun_flow_entry *e = kmalloc(sizeof(*e), GFP_ATOMIC); > - > - if (e) { > - tun_debug(KERN_INFO, tun, "create flow: hash %u index %u\n", > - rxhash, queue_index); > - e->updated = jiffies; > - e->rxhash = rxhash; > - e->queue_index = queue_index; > - e->tun = tun; > - hlist_add_head_rcu(&e->hash_link, head); > - ++tun->flow_count; > - } > - return e; > -} > - > -static void tun_flow_delete(struct tun_struct *tun, struct tun_flow_entry *e) > -{ > - tun_debug(KERN_INFO, tun, "delete flow: hash %u index %u\n", > - e->rxhash, e->queue_index); > - hlist_del_rcu(&e->hash_link); > - kfree_rcu(e, rcu); > - --tun->flow_count; > -} > - > -static void tun_flow_flush(struct tun_struct *tun) > -{ > - int i; > - > - spin_lock_bh(&tun->lock); > - for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) { > - struct tun_flow_entry *e; > - struct hlist_node *n; > - > - hlist_for_each_entry_safe(e, n, &tun->flows[i], hash_link) > - tun_flow_delete(tun, e); > - } > - spin_unlock_bh(&tun->lock); > -} > - > -static void tun_flow_delete_by_queue(struct tun_struct *tun, u16 queue_index) > -{ > - int i; > - > - 
spin_lock_bh(&tun->lock); > - for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) { > - struct tun_flow_entry *e; > - struct hlist_node *n; > - > - hlist_for_each_entry_safe(e, n, &tun->flows[i], hash_link) { > - if (e->queue_index == queue_index) > - tun_flow_delete(tun, e); > - } > - } > - spin_unlock_bh(&tun->lock); > -} > - > -static void tun_flow_cleanup(unsigned long data) > -{ > - struct tun_struct *tun = (struct tun_struct *)data; > - unsigned long delay = tun->ageing_time; > - unsigned long next_timer = jiffies + delay; > - unsigned long count = 0; > - int i; > - > - tun_debug(KERN_INFO, tun, "tun_flow_cleanu
Re: [RFC PATCH] net, tun: remove the flow cache
>> Yes , in it's current state it's broken. But maybe we can try to fix >> it instead of arbitrarily removing it. Please see my patches on >> plumbing RFS into tuntap which may start to make it useful. > Do you mean you patch [5/5] tun: Added support for RFS on tun flows? > Sorry, can you say with more details? Correct. It was RFC since I didn't have a good way to test, if you do please try it and see if there's any effect. We should also be able to do something similar for KVM guests, either doing the flow lookup on each packet from the guest, or use aRFS interface from the guest driver for end to end RFS (more exciting prospect). We are finding that guest to driver accelerations like this (and tso, lro) are quite important in getting virtual networking performance up. > >> >> Tom >> >>> Signed-off-by: Zhi Yong Wu >>> --- >>> drivers/net/tun.c | 208 >>> +++-- >>> 1 files changed, 10 insertions(+), 198 deletions(-) >>> >>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c >>> index 7c8343a..7c27fdc 100644 >>> --- a/drivers/net/tun.c >>> +++ b/drivers/net/tun.c >>> @@ -32,12 +32,15 @@ >>> * >>> * Daniel Podlejski >>> *Modifications for 2.3.99-pre5 kernel. >>> + * >>> + * Zhi Yong Wu >>> + *Remove the flow cache. >>> */ >>> >>> #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt >>> >>> #define DRV_NAME "tun" >>> -#define DRV_VERSION"1.6" >>> +#define DRV_VERSION"1.7" >>> #define DRV_DESCRIPTION"Universal TUN/TAP device driver" >>> #define DRV_COPYRIGHT "(C) 1999-2004 Max Krasnyansky " >>> >>> @@ -146,18 +149,6 @@ struct tun_file { >>> struct tun_struct *detached; >>> }; >>> >>> -struct tun_flow_entry { >>> - struct hlist_node hash_link; >>> - struct rcu_head rcu; >>> - struct tun_struct *tun; >>> - >>> - u32 rxhash; >>> - int queue_index; >>> - unsigned long updated; >>> -}; >>> - >>> -#define TUN_NUM_FLOW_ENTRIES 1024 >>> - >>> /* Since the socket were moved to tun_file, to preserve the behavior of >>> persist >>> * device, socket filter, sndbuf and vnet header size were restore when the >>> * file were attached to a persist device. 
>>> @@ -184,163 +175,11 @@ struct tun_struct { >>> int debug; >>> #endif >>> spinlock_t lock; >>> - struct hlist_head flows[TUN_NUM_FLOW_ENTRIES]; >>> - struct timer_list flow_gc_timer; >>> - unsigned long ageing_time; >>> unsigned int numdisabled; >>> struct list_head disabled; >>> void *security; >>> - u32 flow_count; >>> }; >>> >>> -static inline u32 tun_hashfn(u32 rxhash) >>> -{ >>> - return rxhash & 0x3ff; >>> -} >>> - >>> -static struct tun_flow_entry *tun_flow_find(struct hlist_head *head, u32 >>> rxhash) >>> -{ >>> - struct tun_flow_entry *e; >>> - >>> - hlist_for_each_entry_rcu(e, head, hash_link) { >>> - if (e->rxhash == rxhash) >>> - return e; >>> - } >>> - return NULL; >>> -} >>> - >>> -static struct tun_flow_entry *tun_flow_create(struct tun_struct *tun, >>> - struct hlist_head *head, >>> - u32 rxhash, u16 queue_index) >>> -{ >>> - struct tun_flow_entry *e = kmalloc(sizeof(*e), GFP_ATOMIC); >>> - >>> - if (e) { >>> - tun_debug(KERN_INFO, tun, "create flow: hash %u index %u\n", >>> - rxhash, queue_index); >>> - e->updated = jiffies; >>> - e->rxhash = rxhash; >>> - e->queue_index = queue_index; >>> - e->tun = tun; >>> - hlist_add_head_rcu(&e->hash_link, head); >>> - ++tun->flow_count; >>> - } >>> - return e; >>> -} >>> - >>> -static void tun_flow_delete(struct tun_struct *tun, struct tun_flow_entry >>> *e) >>> -{ >>> - tun_debug(KERN_INFO, tun, "delete flow: hash %u index %u\n", >>> - e->rxhash, e->queue_index); >>> - hlist_del_rcu(&e->hash_link); >>> - kfree_rcu(e, rcu); >>> - --tun->flow_count; >>> -} >>> - >>> -static void tun_flow_flush(struct tun_struct *tun) >>> -{ >>> - int i; >>> - >>> - spin_lock_bh(&tun->lock); >>> - for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) { >>> - struct tun_flow_entry *e; >>> - struct hlist_node *n; >>> - >>> - hlist_for_each_entry_safe(e, n, &tun->flows[i], hash_link) >>> - tun_flow_delete(tun, e); >>> - } >>> - spin_unlock_bh(&tun->lock); >>> -} >>> - >>> -static void tun_flow_delete_by_queue(struct tun_struct *tun, u16 >>> queue_index) >>> -{ >>> - int i; >>> - >>> - spin_lock_bh(&tun->lock); >>> - for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++) { >>> - struct tun_flow_entry *e; >>> - struct hlist_node *n; >>> - >>> - hlist_for_each_entry_safe
Re: [PATCH net-next] hyperv: Add support for Virtual Receive Side Scaling (vRSS)
I posted an implementation of library functions for Toeplitz (see [PATCH 1/2] net: Toeplitz library functions). This includes some pre-computation of the table to get reasonable performance in the host. Please take a look. On the other hand, if you're computing a hash in the host, do you really need Toeplitz? flow_dissector already supports a good hash computation and can parse many more packets than just plain UDP/TCP. We should probably only be doing Toeplitz in the host if we need to match HW-computed values. On Thu, Dec 19, 2013 at 11:21 AM, Haiyang Zhang wrote: > > >> -Original Message- >> From: Daniel Borkmann [mailto:dbork...@redhat.com] >> Sent: Thursday, December 19, 2013 1:45 PM >> To: Haiyang Zhang >> Cc: Ben Hutchings; da...@davemloft.net; net...@vger.kernel.org; KY >> Srinivasan; o...@aepfle.de; jasow...@redhat.com; linux- >> ker...@vger.kernel.org; driverdev-de...@linuxdriverproject.org >> Subject: Re: [PATCH net-next] hyperv: Add support for Virtual Receive Side >> Scaling (vRSS) >> >> On 12/19/2013 07:36 PM, Haiyang Zhang wrote: >> >> > Thank you for the suggestions! I will re-write the send queue >> > selection, enhance the hash calculation, also fix the initialization >> > sequence. >> >> Btw, Toeplitz hash function should either go into lib/hash.c as well or >> include/linux/hash.h to avoid ending up w/ various implementations in >> multiple >> places. > > Will do. > > Thanks, > - Haiyang
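For comparison, the flow_dissector route mentioned above would be along these lines; skb_get_hash() is the real in-tree entry point, while the queue-scaling helper around it is an illustrative assumption:

	/* Sketch: software flow hash via the flow dissector instead of
	 * Toeplitz; sufficient when no HW RSS value must be matched. */
	static u16 pick_tx_queue_by_flow_hash(struct sk_buff *skb, u16 num_queues)
	{
		u32 hash = skb_get_hash(skb);	/* parses L3/L4, jhash-based */

		/* map the 32-bit hash onto [0, num_queues) */
		return (u16)(((u64)hash * num_queues) >> 32);
	}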
Re: [PATCH net-next] hyperv: Add support for Virtual Receive Side Scaling (vRSS)
Patch is below. This version did the most pre-computation of the variants I built, but results in the largest table (40*256*4 bytes). This gives performance roughly comparable with jhash (roughly the same as jhash for IPv4, about 30% more cycles for IPv6). I have the simpler, less memory-intensive versions as well if you're interested; these are 10x worse in cycles so I wouldn't want those in the critical path.

Introduce Toeplitz hash functions. Toeplitz is a hash used primarily in NICs to perform RSS flow steering. This is a software implementation of that. In order to make the hash calculation efficient, we precompute the possible hash values for each individual byte of input. The input length is up to 40 bytes, so we make an array of cache[40][256]. The implementation was verified against MSDN "Verify RSS hash" sample values.

Signed-off-by: Tom Herbert
---
 include/linux/netdevice.h |  3 +++
 include/linux/toeplitz.h  | 27 +++
 lib/Kconfig               |  3 +++
 lib/Makefile              |  2 ++
 lib/toeplitz.c            | 66 +++
 net/Kconfig               |  5
 net/core/dev.c            | 11
 7 files changed, 117 insertions(+)
 create mode 100644 include/linux/toeplitz.h
 create mode 100644 lib/toeplitz.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3de49ac..546caf2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -38,6 +38,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -195,6 +196,8 @@ struct net_device_stats {
 extern struct static_key rps_needed;
 #endif

+extern struct toeplitz *toeplitz_net;
+
 struct neighbour;
 struct neigh_parms;
 struct sk_buff;
diff --git a/include/linux/toeplitz.h b/include/linux/toeplitz.h
new file mode 100644
index 000..bc0b8e8
--- /dev/null
+++ b/include/linux/toeplitz.h
@@ -0,0 +1,27 @@
+#ifndef __LINUX_TOEPLITZ_H
+#define __LINUX_TOEPLITZ_H
+
+#define TOEPLITZ_KEY_LEN 40
+
+struct toeplitz {
+	u8 key_vals[TOEPLITZ_KEY_LEN];
+	u32 key_cache[TOEPLITZ_KEY_LEN][256];
+};
+
+static inline unsigned int
+toeplitz_hash(const unsigned char *bytes,
+	      struct toeplitz *toeplitz, int n)
+{
+	int i;
+	unsigned int result = 0;
+
+	for (i = 0; i < n; i++)
+		result ^= toeplitz->key_cache[i][bytes[i]];
+
+	return result;
+};
+
+extern struct toeplitz *toeplitz_alloc(void);
+extern void toeplitz_init(struct toeplitz *toeplitz, u8 *key_vals);
+
+#endif /* __LINUX_TOEPLITZ_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index b3c8be0..463b2b1 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -359,6 +359,9 @@ config CPU_RMAP
 config DQL
 	bool

+config TOEPLITZ
+	bool
+
 #
 # Netlink attribute parsing support is select'ed if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index f3bb2cb..a28349b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -133,6 +133,8 @@ obj-$(CONFIG_CORDIC) += cordic.o

 obj-$(CONFIG_DQL) += dynamic_queue_limits.o

+obj-$(CONFIG_TOEPLITZ) += toeplitz.o
+
 obj-$(CONFIG_MPILIB) += mpi/
 obj-$(CONFIG_SIGNATURE) += digsig.o
diff --git a/lib/toeplitz.c b/lib/toeplitz.c
new file mode 100644
index 000..0951dd9
--- /dev/null
+++ b/lib/toeplitz.c
@@ -0,0 +1,66 @@
+/*
+ * Toeplitz hash implementation. See include/linux/toeplitz.h
+ *
+ * Copyright (c) 2011, Tom Herbert
+ */
+#include
+#include
+#include
+#include
+#include
+
+struct toeplitz *toeplitz_alloc(void)
+{
+	return kmalloc(sizeof(struct toeplitz), GFP_KERNEL);
+}
+
+static u32 toeplitz_get_kval(unsigned char *key, int idx)
+{
+	u32 v, r;
+	int off, rem;
+
+	off = idx / 8;
+	rem = idx % 8;
+
+	v = (((unsigned int)key[off]) << 24) +
+	    (((unsigned int)key[off + 1]) << 16) +
+	    (((unsigned int)key[off + 2]) << 8) +
+	    (((unsigned int)key[off + 3]));
+
+	r = v << rem | (unsigned int)key[off + 4] >> (8 - rem);
+	return r;
+}
+
+static inline int idx8(int idx)
+{
+#ifdef __LITTLE_ENDIAN
+	idx = (idx / 8) * 8 + (8 - (idx % 8 + 1));
+#endif
+	return idx;
+}
+
+void toeplitz_init(struct toeplitz *toeplitz, u8 *key_vals)
+{
+	int i;
+	unsigned long a, j;
+	unsigned int result = 0;
+
+	/* Set up key val table */
+	if (key_vals)
+		for (i = 0; i < TOEPLITZ_KEY_LEN; i++)
+			toeplitz->key_vals[i] = key_vals[i];
+	else
+		prandom_bytes(toeplitz->key_vals, TOEPLITZ_KEY_LEN);
+
+	/* Set up key cache table */
+	for (i = 0; i < TOEPLITZ_KEY_LEN; i++) {
+		for (j = 0; j < 256; j++) {
+			result = 0;
+			for (a = find_first_bit(&j, 8); a < 8;
+			     a = find_next_bit(&j, 8, a + 1))
+				result ^= toeplitz_get_kval(
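To see how the posted interface would be used, here is a hypothetical caller; the 2-tuple input layout and variable names are assumptions for illustration, only toeplitz_alloc()/toeplitz_init()/toeplitz_hash() come from the patch above:

	/* Usage sketch: hash an IPv4 source/destination pair through the
	 * precomputed Toeplitz cache, as RSS hardware would. */
	struct toeplitz *t = toeplitz_alloc();
	u8 input[8];		/* saddr + daddr, network byte order */
	u32 hash;

	toeplitz_init(t, NULL);	/* NULL key => random key */
	memcpy(input, &iph->saddr, 4);
	memcpy(input + 4, &iph->daddr, 4);
	hash = toeplitz_hash(input, t, sizeof(input));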
Re: [PATCH net-next] hyperv: Add support for Virtual Receive Side Scaling (vRSS)
On Thu, Dec 19, 2013 at 3:15 PM, Haiyang Zhang wrote: >> -Original Message- >> From: Tom Herbert [mailto:therb...@google.com] >> Sent: Thursday, December 19, 2013 4:43 PM >> To: Haiyang Zhang >> Cc: Daniel Borkmann; Ben Hutchings; da...@davemloft.net; >> net...@vger.kernel.org; KY Srinivasan; o...@aepfle.de; >> jasow...@redhat.com; linux-kernel@vger.kernel.org; driverdev- >> de...@linuxdriverproject.org >> Subject: Re: [PATCH net-next] hyperv: Add support for Virtual Receive Side >> Scaling (vRSS) >> >> Patch is below. This version did most pre-computation of the variants I >> built, >> but results in largest table (40*256*4 bytes), This gives performance roughly >> comparable with jhash (roughly same as jhash for IPv4, about 30% more >> cycles for IPv6). I have the simpler less memory intensive versions also if >> you're interested, these are 10x worse cycles so I wouldn't want those in >> critical path. >> > > Thank you for the code. We like the fast implementation even it uses a bit > more > memory. Are you going to address the comments and re-submit the code soon? > I'll take another look now that there's some new motivation :-) > Thanks, > - Haiyang
Re: [PATCH net 0/5] Implement ndo_gso_check() for vxlan nics
On Wed, Nov 5, 2014 at 9:50 AM, Joe Stringer wrote: > > On 5 November 2014 04:38, Or Gerlitz wrote: >> >> On Tue, Nov 4, 2014 at 11:56 PM, Joe Stringer wrote: >> > Most NICs that report NETIF_F_GSO_UDP_TUNNEL support VXLAN, and not other >> > UDP-based encapsulation protocols where the format and size of the header >> > may >> > differ. This patch series implements ndo_gso_check() for these NICs, >> > restricting the GSO handling to something that looks and smells like VXLAN. >> > >> > Implementation shamelessly stolen from Tom Herbert (with minor fixups): >> > http://thread.gmane.org/gmane.linux.network/332428/focus=333111 >> >> >> Hi Joe, >> >> 1st, thanks for picking this task...2nd, for drivers that currently >> support only pure VXLAN, I don't see the point >> to replicate the helper suggested by Tom (good catch on the size check >> to be 16 and not 12) four times and who know how more in the future. >> Let's just have one generic helper and make the mlx4/be/fm10k/benet >> drivers to have it as their ndo, OK? > > > Thanks for taking a look. > > I had debated whether to do this or not as the actual support on each NIC may > differ, and each implementation may morph over time to match these > capabilities better. Obviously the vendors will know better than me on this, > so I'm posing this series to prod them for more information. At this point > I've had just one maintainer come back and confirm that this helper is a good > fit for their hardware, so I'd like to confirm that multiple drivers will use > a ndo_gso_check_vxlan_helper() function before I go and create it. Thanks for implementing this fix! Personally, I would rather not have the helper. This is already a small number of drivers, and each driver owner should consider what the limitations of their device are and try to allow the maximum number of use cases possible. I'm also hoping that new devices will implement the more generic mechanism so that VXLAN is just one supported protocol.
Re: [PATCH net 0/5] Implement ndo_gso_check() for vxlan nics
On Wed, Nov 5, 2014 at 6:15 PM, David Miller wrote: > From: Joe Stringer > Date: Wed, 5 Nov 2014 17:06:46 -0800 > >> My impression was that the changes are more likely to be >> hardware-specific (like the i40e changes) rather than software-specific, >> like changes that might be integrated into the helper. > > I think there is more commonality amongst hardware capabilities, > and this is why I want the helper to play itself out. > >> That said, I can rework for one helper. The way I see it would be the >> same code as these patches, as "vxlan_gso_check(struct sk_buff *)" in >> drivers/net/vxlan.c which would be called from each driver. Is that what >> you had in mind? > > Yes. Note that this code is not VXLAN specific, it will also accept NVGRE and GRE/UDP with keyid and TEB. I imagine all these cases should be indistinguishable to the hardware so they probably just work (which would be cool!). It might be better to name and locate the helper function to reflect that.
Re: [PATCH net 0/5] Implement ndo_gso_check() for vxlan nics
On Wed, Nov 5, 2014 at 10:16 PM, Sathya Perla wrote: >> -Original Message- >> From: Tom Herbert [mailto:therb...@google.com] >> >> On Wed, Nov 5, 2014 at 6:15 PM, David Miller >> wrote: >> > From: Joe Stringer >> > Date: Wed, 5 Nov 2014 17:06:46 -0800 >> > >> >> My impression was that the changes are more likely to be >> >> hardware-specific (like the i40e changes) rather than software-specific, >> >> like changes that might be integrated into the helper. >> > >> > I think there is more commonality amongst hardware capabilities, >> > and this is why I want the helper to play itself out. >> > >> >> That said, I can rework for one helper. The way I see it would be the >> >> same code as these patches, as "vxlan_gso_check(struct sk_buff *)" in >> >> drivers/net/vxlan.c which would be called from each driver. Is that what >> >> you had in mind? >> > >> > Yes. >> >> Note that this code is not VXLAN specific, it will also accept NVGRE >> and GRE/UDP with keyid and TEB. I imagine all these cases should be >> indistinguishable to the hardware so they probably just work (which >> would be cool!). It might be better to name and locate the helper >> function to reflect that. > > Tom, I'm confused as to how the value of (skb_inner_mac_header(skb) - > skb_transport_header(skb)) > would be the same for VxLAN and NVGRE encapsulated packets. Wouldn't this > value be 16 for VxLAN > and 8 for NVGRE? > The inner headers are reset in iptunnel_handle_offloads. This is called in the xmit encapsulation functions (GRE, fou, VXLAN, etc.) before adding in encapsulation headers (skb_push), so the mac_inner_header will point to the encapsulation payload, i.e. the encapsulated packet. This should not change after being set, although inner network and inner transport can. The headers are only set on the first encapsulation, so with nested tunnels the inner headers point to the innermost encapsulated packet. Since VXLAN and NVGRE have the same size of encapsulation (8 UDP + 8 header), skb_inner_mac_header(skb) - skb_transport_header(skb) should always be 16. Tom > thks, > -Sathya
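Concretely, the shared helper being converged on here would presumably reduce to something like the following; this is a sketch reconstructed from the earlier referenced patch and the 16-byte reasoning above, not necessarily the code that was finally merged:

	/* Accept a GSO skb only if it is a UDP tunnel whose encapsulation
	 * header is 16 bytes: 8 (UDP) + 8 (VXLAN or NVGRE-style header). */
	static bool vxlan_gso_check(struct sk_buff *skb)
	{
		if ((skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL) &&
		    (skb_inner_mac_header(skb) - skb_transport_header(skb) !=
		     sizeof(struct udphdr) + sizeof(struct vxlanhdr)))
			return false;

		return true;
	}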
Re: [PATCH] net: allow sleeping when modifying store_rps_map
On Thu, Aug 13, 2015 at 11:03 AM, Sasha Levin wrote: > Commit 10e4ea751 ("net: Fix race condition in store_rps_map") has moved the > manipulation of the rps_needed jump label under a spinlock. Since changing > the state of a jump label may sleep this is incorrect and causes warnings > during runtime. > > Make rps_map_lock a mutex to allow sleeping under it. > > Fixes: 10e4ea751 ("net: Fix race condition in store_rps_map") > Signed-off-by: Sasha Levin > --- > net/core/net-sysfs.c |8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c > index 39ec694..b279077 100644 > --- a/net/core/net-sysfs.c > +++ b/net/core/net-sysfs.c > @@ -689,7 +689,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue > *queue, > struct rps_map *old_map, *map; > cpumask_var_t mask; > int err, cpu, i; > - static DEFINE_SPINLOCK(rps_map_lock); > + static DEFINE_MUTEX(rps_map_mutex); > > if (!capable(CAP_NET_ADMIN)) > return -EPERM; > @@ -722,9 +722,9 @@ static ssize_t store_rps_map(struct netdev_rx_queue > *queue, > map = NULL; > } > > - spin_lock(&rps_map_lock); > + mutex_lock(&rps_map_mutex); > old_map = rcu_dereference_protected(queue->rps_map, > - lockdep_is_held(&rps_map_lock)); > + mutex_is_locked(&rps_map_mutex)); > rcu_assign_pointer(queue->rps_map, map); > > if (map) > @@ -732,7 +732,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue > *queue, > if (old_map) > static_key_slow_dec(&rps_needed); > > - spin_unlock(&rps_map_lock); > + mutex_unlock(&rps_map_mutex); > > if (old_map) > kfree_rcu(old_map, rcu); > -- > 1.7.10.4 > Thanks Sasha! Acked-by: Tom Herbert
Re: [PATCH v2] jhash: Deinline jhash, jhash2 and __jhash_nwords
On Thu, Jul 16, 2015 at 5:40 AM, Denys Vlasenko wrote: > This patch deinlines jhash, jhash2 and __jhash_nwords. > > It also removes rhashtable_jhash2(key, length, seed) > because it was merely calling jhash2(key, length, seed). > > With this .config: http://busybox.net/~vda/kernel_config, > after deinlining these functions have sizes and callsite counts > as follows: > > __jhash_nwords: 72 bytes, 75 calls > jhash: 297 bytes, 111 calls > jhash2: 205 bytes, 136 calls > jhash is used in several places in the critical data path. Does the decrease in text size justify performance impact of not inlining it? Tom > Total size decrease is about 38,000 bytes: > > text data bss dec hex filename > 90663567 17221960 36659200 144544727 89d93d7 vmlinux5 > 90625577 17221864 36659200 144506641 89cff11 vmlinux.after > > Signed-off-by: Denys Vlasenko > CC: Thomas Graf > CC: Alexander Duyck > CC: Jozsef Kadlecsik > CC: Herbert Xu > CC: net...@vger.kernel.org > CC: linux-kernel@vger.kernel.org > --- > Changes in v2: created a new source file, jhash.c > > include/linux/jhash.h | 123 + > lib/Makefile | 2 +- > lib/jhash.c | 149 > ++ > lib/rhashtable.c | 13 +++-- > 4 files changed, 160 insertions(+), 127 deletions(-) > create mode 100644 lib/jhash.c > > diff --git a/include/linux/jhash.h b/include/linux/jhash.h > index 348c6f4..0b3f55d 100644 > --- a/include/linux/jhash.h > +++ b/include/linux/jhash.h > @@ -31,131 +31,14 @@ > /* Mask the hash value, i.e (value & jhash_mask(n)) instead of (value % n) */ > #define jhash_mask(n) (jhash_size(n)-1) > > -/* __jhash_mix -- mix 3 32-bit values reversibly. */ > -#define __jhash_mix(a, b, c) \ > -{ \ > - a -= c; a ^= rol32(c, 4); c += b; \ > - b -= a; b ^= rol32(a, 6); a += c; \ > - c -= b; c ^= rol32(b, 8); b += a; \ > - a -= c; a ^= rol32(c, 16); c += b; \ > - b -= a; b ^= rol32(a, 19); a += c; \ > - c -= b; c ^= rol32(b, 4); b += a; \ > -} > - > -/* __jhash_final - final mixing of 3 32-bit values (a,b,c) into c */ > -#define __jhash_final(a, b, c) \ > -{ \ > - c ^= b; c -= rol32(b, 14); \ > - a ^= c; a -= rol32(c, 11); \ > - b ^= a; b -= rol32(a, 25); \ > - c ^= b; c -= rol32(b, 16); \ > - a ^= c; a -= rol32(c, 4); \ > - b ^= a; b -= rol32(a, 14); \ > - c ^= b; c -= rol32(b, 24); \ > -} > - > /* An arbitrary initial parameter */ > #define JHASH_INITVAL 0xdeadbeef > > -/* jhash - hash an arbitrary key > - * @k: sequence of bytes as key > - * @length: the length of the key > - * @initval: the previous hash, or an arbitray value > - * > - * The generic version, hashes an arbitrary sequence of bytes. > - * No alignment or length assumptions are made about the input key. > - * > - * Returns the hash value of the key. The result depends on endianness. 
> - */ > -static inline u32 jhash(const void *key, u32 length, u32 initval) > -{ > - u32 a, b, c; > - const u8 *k = key; > - > - /* Set up the internal state */ > - a = b = c = JHASH_INITVAL + length + initval; > - > - /* All but the last block: affect some 32 bits of (a,b,c) */ > - while (length > 12) { > - a += __get_unaligned_cpu32(k); > - b += __get_unaligned_cpu32(k + 4); > - c += __get_unaligned_cpu32(k + 8); > - __jhash_mix(a, b, c); > - length -= 12; > - k += 12; > - } > - /* Last block: affect all 32 bits of (c) */ > - /* All the case statements fall through */ > - switch (length) { > - case 12: c += (u32)k[11]<<24; > - case 11: c += (u32)k[10]<<16; > - case 10: c += (u32)k[9]<<8; > - case 9: c += k[8]; > - case 8: b += (u32)k[7]<<24; > - case 7: b += (u32)k[6]<<16; > - case 6: b += (u32)k[5]<<8; > - case 5: b += k[4]; > - case 4: a += (u32)k[3]<<24; > - case 3: a += (u32)k[2]<<16; > - case 2: a += (u32)k[1]<<8; > - case 1: a += k[0]; > -__jhash_final(a, b, c); > - case 0: /* Nothing left to add */ > - break; > - } > - > - return c; > -} > - > -/* jhash2 - hash an array of u32's > - * @k: the key which must be an array of u32's > - * @length: the number of u32's in the key > - * @initval: the previous hash, or an arbitray value > - * > - * Returns the hash value of the key. > - */ > -static inline u32 jhash2(const u32 *k, u32 length, u32 initval) > -{ > - u32 a, b, c; > - > - /* Set up the internal state */ > - a = b = c = JHASH_INITVAL + (length<<2) + initval; > -
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
We've put considerable effort into cleaning up the checksum interface to make it as unambiguous as possible, please be very careful to follow it. Broken checksum processing is really hard to detect and debug. CHECKSUM_UNNECESSARY means that some number of _specific_ checksums (indicated by csum_level) have been verified to be correct in a packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is never right. If CHECKSUM_UNNECESSARY is set in such a manner but the checksum it would refer to has not been verified and is incorrect this is a major bug. Tom On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear wrote: > > > On 04/30/2016 11:33 AM, Ben Hutchings wrote: >> >> On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote: >>> >>> Hello, > > http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a >>> >>> Actually, no, this is not really a regression. >> >> [...] >> >> It really is. Even though the old behaviour was a bug (raw packets >> should not be changed), if there are real applications that depend on >> that then we have to keep those applications working somehow. > > > To be honest, I fail to see why the old behaviour is a bug when sending > raw packets from user-space. If raw packets should not be changed, then > we need some way to specify what the checksum setting is to begin with, > otherwise, user-space has not enough control. > > A socket option for new programs, and sysctl configurable defaults for raw > sockets > for old binary programs would be sufficient I think. > > > Thanks, > Ben > > -- > Ben Greear > Candela Technologies Inc http://www.candelatech.com
Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
On Sat, Apr 30, 2016 at 1:59 PM, Ben Greear wrote: > On 04/30/2016 12:54 PM, Tom Herbert wrote: >> We've put considerable effort into cleaning up the checksum interface >> to make it as unambiguous as possible, please be very careful to >> follow it. Broken checksum processing is really hard to detect and >> debug. >> >> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums >> (indicated by csum_level) have been verified to be correct in a >> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is >> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the >> checksum it would refer to has not been verified and is incorrect this >> is a major bug. > > Suppose I know that the packet received on a packet-socket has > already been verified by a NIC that supports hardware checksumming. > > Then, I want to transmit it on a veth interface using a second > packet socket. I do not want veth to recalculate the checksum on > transmit, nor to validate it on the peer veth on receive, because I do > not want to waste the CPU cycles. I am assuming that my app is not > accidentally corrupting frames, so the checksum can never be bad. > > How should the checksumming be configured for the packets going into > the packet-socket from user-space? > Checksum unnecessary conversion to checksum complete should be done and then the computed value can be sent to user space. If you want to take advantage of the value on transmit then we would need to change the interface. But I'm not sure why recalculating the checksum in the host is even a problem. With all the advances in checksum offload, LCO, RCO it seems like that shouldn't be a common case anyway. > Also, I might want to send raw frames that do have > broken checksums (lets assume a real NIC, not veth), and I want them > to hit the wire with those bad checksums. > > How do I configure the checksumming in this case? Just send raw packets with bad checksums (no checksum offload). Tom
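The conversion described above can be sketched as follows; skb_checksum() and the csum_level field are real kernel interfaces, but the placement and the level-0 restriction are illustrative assumptions:

	/* Sketch: promote a verified CHECKSUM_UNNECESSARY to
	 * CHECKSUM_COMPLETE so the actual checksum value is preserved
	 * and can be handed to user space or reused later. */
	if (skb->ip_summed == CHECKSUM_UNNECESSARY && skb->csum_level == 0) {
		skb->csum = skb_checksum(skb, 0, skb->len, 0);
		skb->ip_summed = CHECKSUM_COMPLETE;
	}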
Re: [PATCH] gre6: add Kconfig dependency for NET_IPGRE_DEMUX
On Tue, May 3, 2016 at 8:19 AM, Arnd Bergmann wrote: > The ipv6 gre implementation was cleaned up to share more code > with the ipv4 version, but it can be enabled even when NET_IPGRE_DEMUX > is disabled, resulting in a link error: > > net/built-in.o: In function `gre_rcv': > :(.text+0x17f5d0): undefined reference to `gre_parse_header' > ERROR: "gre_parse_header" [net/ipv6/ip6_gre.ko] undefined! > > This adds a Kconfig dependency to prevent that now invalid > configuration. > > Signed-off-by: Arnd Bergmann > Fixes: 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE > functions") > --- Acked-by: Tom Herbert > net/ipv6/Kconfig | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig > index 11e875ffd7ac..3f8411328de5 100644 > --- a/net/ipv6/Kconfig > +++ b/net/ipv6/Kconfig > @@ -218,6 +218,7 @@ config IPV6_GRE > tristate "IPv6: GRE tunnel" > select IPV6_TUNNEL > select NET_IP_TUNNEL > + depends on NET_IPGRE_DEMUX > ---help--- > Tunneling means encapsulating data of one protocol type within > another protocol and sending it over a channel that understands the > -- > 2.7.0 >
Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
On Mon, Dec 1, 2014 at 3:35 PM, Joe Stringer wrote: > On 21 November 2014 at 09:59, Joe Stringer wrote: >> On 20 November 2014 16:19, Jesse Gross wrote: >>> I don't know if we need to have the check at all for IPIP though - >>> after all the driver doesn't expose support for it all (actually it >>> doesn't expose GRE either). This raises kind of an interesting >>> question about the checks though - it's pretty easy to add support to >>> the driver for a new GSO type (and I imagine that people will be >>> adding GRE soon) and forget to update the check. >> >> If the check is more conservative, then testing would show that it's >> not working and lead people to figure out why (and update the check). > > More concretely, one suggestion would be something like following at > the start of each gso_check(): > > + const int supported = SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_FCOE | > + SKB_GSO_UDP | SKB_GSO_UDP_TUNNEL; > + > + if (skb_shinfo(skb)->gso_type & ~supported) > + return false; This should already be handled by net_gso_ok. > > ..followed by checking the specifics for each. So far the patches have > only been concerned with further checking on UDP tunnels.
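For reference, the check net_gso_ok() performs is essentially the following (paraphrased from include/linux/netdevice.h of this era); each SKB_GSO_* flag maps bit-for-bit onto a NETIF_F_* GSO feature:

	static inline bool net_gso_ok(netdev_features_t features, int gso_type)
	{
		netdev_features_t feature = gso_type << NETIF_F_GSO_SHIFT;

		/* every GSO type bit set in the skb must be advertised */
		return (features & feature) == feature;
	}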
Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
On Mon, Dec 1, 2014 at 3:53 PM, Jesse Gross wrote: > On Mon, Dec 1, 2014 at 3:47 PM, Tom Herbert wrote: >> On Mon, Dec 1, 2014 at 3:35 PM, Joe Stringer wrote: >>> On 21 November 2014 at 09:59, Joe Stringer wrote: >>>> On 20 November 2014 16:19, Jesse Gross wrote: >>>>> I don't know if we need to have the check at all for IPIP though - >>>>> after all the driver doesn't expose support for it all (actually it >>>>> doesn't expose GRE either). This raises kind of an interesting >>>>> question about the checks though - it's pretty easy to add support to >>>>> the driver for a new GSO type (and I imagine that people will be >>>>> adding GRE soon) and forget to update the check. >>>> >>>> If the check is more conservative, then testing would show that it's >>>> not working and lead people to figure out why (and update the check). >>> >>> More concretely, one suggestion would be something like following at >>> the start of each gso_check(): >>> >>> + const int supported = SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_FCOE | >>> + SKB_GSO_UDP | SKB_GSO_UDP_TUNNEL; >>> + >>> + if (skb_shinfo(skb)->gso_type & ~supported) >>> + return false; >> >> This should already be handled by net_gso_ok. > > My original point wasn't so much that this isn't handled at the moment > but that it's easy to add a supported GSO type but then forget to > update this check - i.e. if a driver already supports UDP_TUNNEL and > adds support for GRE with the same constraints. It seems not entirely > ideal that this function is acting as a blacklist rather than a > whitelist. Agreed, it would be nice to have all the checking logic in one place. If all the drivers end up implementing ndo_gso_check then we could potentially get rid of the GSO types as features. This probably wouldn't be a bad thing since we already know that the features mechanism doesn't scale (for instance there's no way to indicate that certain combinations of GSO types are supported by a device).
Re: linux-next: Tree for Nov 10 (net/ipv4/ip_tunnel.c)
I am looking at it. On Mon, Nov 10, 2014 at 11:24 AM, David Miller wrote: > From: Randy Dunlap > Date: Mon, 10 Nov 2014 10:15:11 -0800 > >> On 11/10/14 01:59, Stephen Rothwell wrote: >>> Hi all, >>> >>> Changes since 20141106: >>> >> >> on x86_64: >> when CONFIG_NET_IP_TUNNEL=y and CONFIG_NET_FOU=m: >> >> net/built-in.o: In function `ip_tunnel_encap': >> (.text+0xb04d8): undefined reference to `gue_build_header' >> net/built-in.o: In function `ip_tunnel_encap': >> (.text+0xb04ea): undefined reference to `fou_build_header' > > Tom, this has now been reported twice, please fix this. > > Thanks.
Re: [PATCH] Allow TCP connections to cache SYN packet for userspace inspection
On Fri, May 1, 2015 at 11:42 AM, Eric Dumazet wrote: > On Fri, 2015-05-01 at 13:43 -0400, Eric B Munson wrote: >> In order to enable policy decisions in userspace, the data contained in >> the SYN packet would be useful for tracking or identifying connections. >> Only parts of this data are available to userspace after the hand shake >> is completed. This patch exposes a new setsockopt() option that will, >> when used with a listening socket, ask the kernel to cache the skb >> holding the SYN packet for retrieval later. The SYN skbs will not be >> saved while the kernel is in syn cookie mode. >> >> The same option will ask the kernel for the packet headers when used >> with getsockopt() with the socket returned from accept(). The cached >> packet will only be available for the first getsockopt() call, the skb >> is consumed after the requested data is copied to userspace. Subsequent >> calls will return -ENOENT. Because of this behavior, getsockopt() will >> return -E2BIG if the caller supplied a buffer that is too small to hold >> the skb header. >> >> Signed-off-by: Eric B Munson >> Cc: Alexey Kuznetsov >> Cc: James Morris >> Cc: Hideaki YOSHIFUJI >> Cc: Patrick McHardy >> Cc: net...@vger.kernel.org >> Cc: linux-...@vger.kernel.org >> Cc: linux-kernel@vger.kernel.org >> --- > > We have a similar patch here at Google, but we do not hold one skb and > dst per saved syn. That can be ~4KB for some drivers. > > Only a kmalloc() with the needed part (headers), usually less than 128 > bytes. We store the length in first byte of this allocation. > > This has a huge difference if you want to have ~4 million request socks. > +1 on kmalloc solution. I posted a similar patch a couple of years ago https://patchwork.ozlabs.org/patch/146034/. There was pushback on memory usage and this having too narrow of a use case. Tom
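A sketch of the header-only kmalloc() scheme outlined above; the saved_syn field and function name here are hypothetical placeholders, while the length-prefix layout is what the email describes:

	/* Save just the SYN's headers: a u32 length, then the bytes from
	 * the network header through the end of the TCP header. */
	static void save_syn_sketch(struct request_sock *req,
				    const struct sk_buff *skb)
	{
		u32 len = skb_network_header_len(skb) + tcp_hdrlen(skb);
		u32 *copy = kmalloc(len + sizeof(u32), GFP_ATOMIC);

		if (copy) {
			copy[0] = len;
			memcpy(&copy[1], skb_network_header(skb), len);
			req->saved_syn = copy;	/* hypothetical field */
		}
	}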
Re: [RFCv2 net-next 0/7] OVS conntrack support
On Mon, Mar 2, 2015 at 1:54 PM, Joe Stringer wrote: > The goal of this series is to allow OVS to send packets through the Linux > kernel connection tracker, and subsequently match on fields populated by > conntrack. > > Sending this out as another RFC change as this is the first time IP fragment > support is included. Only IPv4 is added right now, as we'd like to get some > feedback on that approach before we implement IPv6 frag support. > > Helper support is also yet to be addressed, for tracking a particular flow a > la > iptables CT targets. I think this is just a matter of having userspace specify > the helper to use (eg via 8-bit field in conntrack action), and setting up the > conntrack template accordingly when OVS first installs the flow containing a > conntrack action. > > There are some additional related items that I intend to work on, which I do > not see as prerequisite for this series: > - OVS Connlabel support. > - Allow OVS to register logging facilities for conntrack. > - Conntrack per-zone configuration. > > The branch below has been updated with the corresponding userspace pieces: > https://github.com/justinpettit/ovs/tree/conntrack > > > RFCv2: > - Support IPv4 fragments > - Warn when ct->net is different from skb net in skb_has_valid_nfct(). > - Set OVS_CS_F_TRACKED when a flow cannot be identified ("invalid") > - Continue processing packets when conntrack marks the flow invalid. > - Use PF_INET6 family when sending IPv6 packets to conntrack. > - Verify conn_* matches when deserializing metadata from netlink. > - Only allow conntrack action on IPv4/IPv6 packets. > - Remove explicit dependencies on conn_zone, conn_mark. > - General tidyups > > RFCv1: > - Rebase to net-next. > - Add conn_zone field to the flow key. > - Add explicit dependencies on conn_zone, conn_mark. > - Refactor conntrack changes into net/openvswitch/ovs_conntrack.*. > - Don't allow set_field() actions to change conn_state, conn_zone. > - Add OVS_CS_F_* flags to indicate connection state. > - Add "invalid" connection state. > > > Andy Zhou (3): > net: refactor ip_fragment() > net: Refactor ip_defrag() APIs > openvswitch: Support fragmented IPv4 packets for conntrack > > Joe Stringer (2): > openvswitch: Serialize acts with original netlink len > openvswitch: Move MASKED* macros to datapath.h > > Justin Pettit (2): > openvswitch: Add conntrack action > openvswitch: Allow matching on conntrack mark > > drivers/net/macvlan.c |2 +- > include/net/ip.h| 13 +- > include/uapi/linux/openvswitch.h| 42 +++- > net/ipv4/ip_fragment.c | 46 ++-- > net/ipv4/ip_input.c |5 +- > net/ipv4/ip_output.c| 113 + This is a lot of change to core IP. It probably should be done in its own patch set. 
> net/ipv4/netfilter/nf_defrag_ipv4.c |2 +- > net/netfilter/ipvs/ip_vs_core.c |2 +- > net/openvswitch/Kconfig | 11 + > net/openvswitch/Makefile|1 + > net/openvswitch/actions.c | 140 +--- > net/openvswitch/conntrack.c | 427 +++ > net/openvswitch/conntrack.h | 91 > net/openvswitch/datapath.c | 60 +++-- > net/openvswitch/datapath.h | 10 + > net/openvswitch/flow.c |4 + > net/openvswitch/flow.h |4 + > net/openvswitch/flow_netlink.c | 95 ++-- > net/openvswitch/flow_netlink.h |4 +- > net/openvswitch/vport.c |1 + > net/packet/af_packet.c |2 +- > 21 files changed, 938 insertions(+), 137 deletions(-) > create mode 100644 net/openvswitch/conntrack.c > create mode 100644 net/openvswitch/conntrack.h > > -- > 1.7.10.4 >
Re: [PATCH] net: Initialize all members in skb_gro_remcsum_init()
On Wed, Feb 18, 2015 at 2:38 AM, Geert Uytterhoeven wrote: > skb_gro_remcsum_init() initializes the gro_remcsum.delta member only, > leading to compiler warnings about a possibly uninitialized > gro_remcsum.offset member: > > drivers/net/vxlan.c: In function ‘vxlan_gro_receive’: > drivers/net/vxlan.c:602: warning: ‘grc.offset’ may be used uninitialized in > this function > net/ipv4/fou.c: In function ‘gue_gro_receive’: > net/ipv4/fou.c:262: warning: ‘grc.offset’ may be used uninitialized in this > function > > While these are harmless for now: > - skb_gro_remcsum_process() sets offset before changing delta, > - skb_gro_remcsum_cleanup() checks if delta is non-zero before > accessing offset, > it's safer to let the initialization function initialize all members. > Acked-by: Tom Herbert > Signed-off-by: Geert Uytterhoeven > --- > include/linux/netdevice.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index 5897b4ea5a3f9e0f..429d1790a27e85f3 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -2342,6 +2342,7 @@ struct gro_remcsum { > > static inline void skb_gro_remcsum_init(struct gro_remcsum *grc) > { > + grc->offset = 0; > grc->delta = 0; > } > > -- > 1.9.1 >
Re: [PATCH 0/3] crypto: af_alg - add TLS type encryption
On Thu, Apr 7, 2016 at 11:52 PM, Herbert Xu wrote: > On Wed, Apr 06, 2016 at 10:56:12AM -0700, Tadeusz Struk wrote: >> >> The intend is to enable HW acceleration of the TLS protocol. >> The way it will work is that the user space will send a packet of data >> via AF_ALG and HW will authenticate and encrypt it in one go. > > There have been suggestions to implement TLS data-path within > the kernel. So we should decide whether we pursue that or go > with your approach before we start adding algorithms. > Yes, please see Dave Watson's patches on this. Tom > Cheers, > -- > Email: Herbert Xu > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: GSO with udp_tunnel_xmit_skb
On Thu, Nov 5, 2015 at 7:52 PM, Jason A. Donenfeld wrote: > Hi folks, > > When sending arbitrary SKBs with udp_tunnel_xmit_skb, the networking > stack does not appear to be utilizing UFO on the outgoing UDP packets, > which significantly caps the transmission speed. I see about 50% CPU > usage in this send path, triggered for every single outgoing packet. > Is there a particular skb option I need to set to enable this? I read > Tom's patch [1] from last year, but this seems to be about setting the > inner packet type. In my case, the inner type is opaque encrypted > data, so there's not a relevant setting. > Jason, Is this about UFO or GSO (in email subject)? UFO should operate independently of encapsulation or inner packet settings. Tom > Thanks, > Jason > > [1] http://thread.gmane.org/gmane.linux.network/332194
Re: linux-next network throughput performance regression
On Sun, Nov 8, 2015 at 7:31 PM, Dexuan Cui wrote: >> From: David Miller [mailto:da...@davemloft.net] >> Sent: Monday, November 9, 2015 11:24 >> ... >> > Thanks, David! >> > I understand 1 TX queue is the bottleneck (however in Simon's >> > test, TX=1 => 36.7Gb/s, TX=8 => 37.7 Gb/s, so it looks the TX=1 bottleneck >> > is not so obvious). >> > I'm just wondering how the bottleneck became much narrower with >> > recent linux-next in Simon's result (36.7 Gb/s vs. 18.2 Gb/s). IMO there >> > must be some latency somewhere. >> >> I think the whole thing here is that you misinterpreted what Eric said. >> >> He is not arguing that some regression did, or did not, happen. >> >> He instead was making the basic statement about the fact that due to >> the lack of paralellness a single stream TCP case is harder to >> optimize for high speed NICs. >> >> That is all. > Thanks, I got it. > I'm actually new to network performance tuning, trying to understand > all the related details. :-) > You might want to look at https://www.kernel.org/doc/Documentation/networking/scaling.txt as an introduction to the scaling capabilities of the stack. Tom > Thanks, > -- Dexuan
Re: [PATCH v2] kcm: remove any offset before parsing messages
On Thu, Feb 14, 2019 at 5:00 PM Dominique Martinet wrote: > > Dominique Martinet wrote on Wed, Oct 31, 2018: > > Anyway, that probably explains I have no problem with bigger VM > > (uselessly more memory available) or without KASAN (I guess there's > > overhead?), but I'm sending at most 300k of data and the VM has a 1.5GB > > of ram, so if there's an allocation failure there I think there's a > > problem ! . . . > > > > So, well, I'm not sure on the way forward. Adding a bpf helper and > > document that kcm users should mind the offset? > > bump on this - I had mostly forgotten about it but the nfs-ganesha > community that could make use of KCM reminded me of my patch that's > waiting for this. > > Summary for people coming back after four months: > - kcm is great, but the bpf function that's supposed to be called for > each packet does not automatically adjust the offset so that it can > assume the skb starts with the packet it needs to look at > > - there is some workaround code that is far from obvious and > undocumented, see the original thread[1]: > [1] > http://lkml.kernel.org/r/20180822183852.jnwlxnz54gbbf...@davejwatson-mba.dhcp.thefacebook.com > > - my patch here tried to automatically pull the corresponding packet to > the front, but apparently with KASAN can trigger out of memory > behaviours on "small" VMs, so even if it doesn't seem to impact > performance much without KASAN I don't think it's really ok to > potentially hang the connection due to oom under severe conditions. > > > The best alternative I see is adding a proper helper to get > "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as > lost as I have been; I'm not quite sure how/where to add such a helper > though as I've barely looked at the bpf code until now, but should we go > for that? Dominique, Thanks for looking into this. I'd rather not complicate the bpf code for this. Can we just always do an pskb_pull after skb_clone? Tom > > > (it really feels wrong to me that some random person who just started by > trying to use kcm has to put this much effort to keep the ball rolling, > if nobody cares about kcm I'm also open to have it removed completely > despite the obvious performance gain I benchmarked for ganesha[2] ; > barely maintained feature is worse than no feature) > > [2] https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/421314 > > Thanks, > -- > Dominique
Re: [PATCH v2] kcm: remove any offset before parsing messages
On Thu, Feb 14, 2019 at 5:57 PM Dominique Martinet wrote: > > Tom Herbert wrote on Thu, Feb 14, 2019: > > > The best alternative I see is adding a proper helper to get > > > "kcm_rx_msg(skb)->offset" from bpf and document it so users aren't as > > > lost as I have been; I'm not quite sure how/where to add such a helper > > > though as I've barely looked at the bpf code until now, but should we go > > > for that? > > > > I'd rather not complicate the bpf code for this. Can we just always do > > an pskb_pull after skb_clone? > > Which skb_clone are you thinking of? > If you're referring to the one in strparser, that was my very first > approach here[1], but Dordon shot it down saying that this is not an > strparser bug but a kcm bug since there are ways for users to properly > get the offset and use it -- and the ktls code does it right. > > Frankly, my opinion still is that it'd be better in strparser because > there also is some bad use in bpf sockmap (they never look at the offset > either, while kcm uses it for rcv but not for parse), but looking at it > from an optimal performance point of view I agree the user can make > better decision than strparser so I understand where he comes from. > > > This second patch[2] (the current thread) now does an extra clone if > there is an offset, but the problem really isn't in the clone but the > pull itself that can fail and return NULL when there is memory pressure. > For some reason I hadn't been able to reproduce that behaviour with > strparser doing the pull, but I assume it would also be possible to hit > in extreme situations, I'm not sure... > This option looks the best to me, at least as a way to fix the issue without requiring a change to the API. If the pull fails, doesn't that just mean that the parser fails? Is there some behavior with this patch that is not being handled gracefully? Thanks, Tom > So anyway, we basically have three choices that I can see: > - push harder on strparser and go back to my first patch ; it's simple > and makes using strparser easier/safer but has a small overhead for > ktls, which uses the current strparser implementation correctly. > I hadn't been able to get this to fail with KASAN last time I tried but > I assume it should still be possible somehow. > > - the current patch, that I could only get to fail with KASAN, but does > complexify kcm a bit; this also does not fix bpf sockmap at all. > Would still require to fix the hang, so make strparser retry a few times > if strp->cb.parse_msg failed maybe? Or at least get the error back to > userspace somehow. > > - my last suggestion to document the kcm behaviour, and if possible add > a bpf helper to make proper usage easier ; this will make kcm harder to > use on end users but it's better than not being documented and seeing > random unappropriate lengths being parsed. > > > > [1] first strparser patch > http://lkml.kernel.org/r/1534855906-22870-1-git-send-email-asmad...@codewreck.org > [2] current thread's patch > http://lkml.kernel.org/r/1536657703-27577-1-git-send-email-asmad...@codewreck.org > > > Thanks, > -- > Dominique
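To make the approach being debated concrete, the clone-then-pull variant would look roughly like this inside KCM's parse callback; kcm_rx_msg() is the accessor discussed in this thread, while run_parser() stands in for the existing BPF invocation and is hypothetical:

	static int kcm_parse_sketch(struct strparser *strp, struct sk_buff *skb)
	{
		struct kcm_psock *psock = container_of(strp, struct kcm_psock, strp);
		unsigned int offset = kcm_rx_msg(skb)->offset;
		struct sk_buff *clone;
		int len;

		if (!offset)
			return run_parser(psock, skb);

		/* Present the message at offset zero to the BPF parser. */
		clone = skb_clone(skb, GFP_ATOMIC);
		if (!clone)
			return -ENOMEM;
		if (!pskb_pull(clone, offset)) {
			kfree_skb(clone);
			return -ENOMEM;	/* strparser aborts; csock sees POLLERR */
		}
		len = run_parser(psock, clone);
		kfree_skb(clone);
		return len;
	}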
Re: [PATCH v2] kcm: remove any offset before parsing messages
On Thu, Feb 14, 2019 at 7:31 PM Dominique Martinet wrote: > > Tom Herbert wrote on Thu, Feb 14, 2019: > > > This second patch[2] (the current thread) now does an extra clone if > > > there is an offset, but the problem really isn't in the clone but the > > > pull itself that can fail and return NULL when there is memory pressure. > > > For some reason I hadn't been able to reproduce that behaviour with > > > strparser doing the pull, but I assume it would also be possible to hit > > > in extreme situations, I'm not sure... > > > > This option looks the best to me, at least as a way to fix the issue > > without requiring a change to the API. If the pull fails, doesn't that > > just mean that the parser fails? Is there some behavior with this > > patch that is not being handled gracefully? > > Yes, the parser fails with -ENOMEM; that is not handled gracefully at > all: from a user point of view, the connection just hangs (recvmsg never > returns), without so much as a message in dmesg either. > Dominique, That's by design. Here is the description in kcm.txt: "When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and write space available (POLLOUT) events are handled by the multiplexor. If there is a state change (disconnection) or other error on a TCP socket, an error is posted on the TCP socket so that a POLLERR event happens and KCM discontinues using the socket. When the application gets the error notification for a TCP socket, it should unattach the socket from KCM and then handle the error condition (the typical response is to close the socket and create a new connection if necessary)." > It took me a while to figure out what failed exactly as I did indeed > expect strparser/kcm to handle that better, but ultimately as things > stand if the parser fails it calls strp_parser_err() with the error > which ends up in strp_abort_strp that should call > sk->sk_error_report(sk) but in kcm case sk is the csk and I guess > failing csk does not make a pending recv on the kcm sock fail... > > I'm not sure whether propagating the error to the upper socket is the > right thing to do, kcm is meant to be able to work with multiple > underlying sockets so I feel we must be cautious about that, but Right, that's the motivation for the design. > strparser might be able to retry somehow. We could retry on -ENOMEM since it is potentially a transient error, but generally for errors (like the connection simply being broken) it seems like it's working as intended. I suppose we could add a socket option to fail the KCM socket when all attached lower sockets have failed, but I'm not sure that would be a significant benefit for the additional complexity. > This is what I said below: > > > [...] > > > - the current patch, that I could only get to fail with KASAN, but does > > > complexify kcm a bit; this also does not fix bpf sockmap at all. > > > Would still require fixing the hang, so make strparser retry a few times > > > if strp->cb.parse_msg failed maybe? Or at least get the error back to > > > userspace somehow. The error should be getting to userspace via the TCP socket. Tom > > Thanks, > -- > Dominique
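The kcm.txt flow quoted above, as a userspace sketch (error handling abbreviated; the file descriptors are assumed to already exist):

#include <errno.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/kcm.h>
#include <linux/sockios.h>

/* On POLLERR from an attached TCP socket: query the error, unattach
 * the socket from the KCM multiplexor, then close and reconnect as
 * the application sees fit.
 */
static void handle_csock_error(int kcmfd, int csock)
{
        struct kcm_unattach req = { .fd = csock };
        int err = 0;
        socklen_t len = sizeof(err);

        /* see what actually went wrong on the TCP socket */
        getsockopt(csock, SOL_SOCKET, SO_ERROR, &err, &len);

        ioctl(kcmfd, SIOCKCMUNATTACH, &req);
        close(csock);
        /* typical response per kcm.txt: create a new connection and
         * attach it if the transport is still needed
         */
}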
Re: [PATCH v2] kcm: remove any offset before parsing messages
On Tue, Feb 19, 2019 at 8:12 PM Dominique Martinet wrote: > > Dominique Martinet wrote on Fri, Feb 15, 2019: > > With all that said I guess my patch should work correctly then, I'll try > > to find some time to check the error does come back up the tcp socket in > > my reproducer but I have no reason to believe it doesn't. > > Ok, so I can confirm this part - the 'csock' does come back up with > POLLERR if the parse function returns ENOMEM in the current code. > Good. > It also comes back up with POLLERR when the remote side closes the > connection, which is expected, but I'm having a very hard time > understanding how an application is supposed to deal with these > POLLERR after having read the documentation and a bit of > experimentation. > I'm not sure how much it would matter for real life (if the other end > closes the connection most servers would not care about what they said > just before closing, but I can imagine some clients doing that in real > life e.g. a POST request where they don't care if it succeeds or not)... > My test program works like this: > - client connects, sends 10k messages and close()s the socket > - server loops recving and close()s after 10k messages; it used to be > recvmsg() directly but it's now using poll then recvmsg. > > > When the client closes the socket, some messages are obviously still "in > flight", and the server will recv a POLLERR notification on the csock at > some point with many messages left. > The documentation says to unattach the csock when you get POLLERR. If I > do that, the kcm socket will no longer give me any message, so all the > messages still in flight at the time are lost. So basically it sounds like you're interested in supporting TCP connections that are half closed. I believe that the error in half closed is EPIPE, so if the TCP socket returns that it can be ignored and the socket can continue being attached and used to send data. Another possibility is to add some linger semantics to an attached socket. For instance, a large message might be sent so that part of the message is queued in TCP and part is queued in the KCM socket. Unattach would probably break that message. We probably want a linger option similar to SO_LINGER (or maybe just use the option on the TCP socket) that means don't complete the detach until any message being transmitted on the lower socket has been queued. > > If I just ignore the csock like I used to, all the messages do come just > fine, but as said previously on a real error this will just make recvmsg > or the polling hang forever and I see no way to distinguish a "real" > error vs. a connection shut down from the remote side with data left in > the pipe. > I thought of checking POLLERR on csock and POLLIN not set on kcmsock, > but even that seems to happen fairly regularly - the kcm sock hasn't > been filled up, it's still reading from the csock. > > > On the other hand, checking POLLIN on the csock does say there is still > data left, so I know there is data left on the csock, but this is also > the case on a real error (e.g. if parser returns -ENOMEM) > ... And this made me try to read from the csock after detaching it and I > can resume manual tcp parsing for a few messages until read() fails with > EPROTO?! and I cannot seem to be able to get anything out of attaching > it back to kcm (for e.g. an ENOMEM error that was transient)...
> > > > I'm honestly not sure how the POLLERR notification mechanism works but I > think it would be much easier to use KCM if we could somehow delay that > error until KCM is done feeding from the csock (when netparser really > stops reading from it like on real error, e.g. abort callback maybe?) > I think it's fine if the csock is closed before the kcm sock message is > read, but we should not lose messages like this. Sounds like linger semantics are needed then. > > > > > I'd like to see some retry on ENOMEM before this is merged though, so > > while I'm there I'll resend this with a second patch doing that > > retry... I think just not setting strp->interrupted and not reporting > > the error up might be enough? Will have to try either way. > > I also tried playing with that without much success. > I had assumed just not calling strp_parser_err() (which calls the > abort_parser cb) would be enough, eventually calling strp_start_timer() > like the !len case, but no can do. I think you need to ignore the ENOMEM and have a timer or other callback to retry the operation in the future. > With that said, returning 0 from the parse function also raises POLLERR > on the csock and hangs netparser, so things aren't that simple... Can you point to where this is happening? If the parse_msg callback returns 0 that is supposed to indicate that more bytes are needed. > > > I could use a bit of help again. > > Thanks, > -- > Dominique
Re: [PATCH v2] kcm: remove any offset before parsing messages
On Thu, Feb 21, 2019 at 12:22 AM Dominique Martinet wrote: > > Tom Herbert wrote on Wed, Feb 20, 2019: > > > When the client closes the socket, some messages are obviously still "in > > > flight", and the server will recv a POLLERR notification on the csock at > > > some point with many messages left. > > > The documentation says to unattach the csock when you get POLLERR. If I > > > do that, the kcm socket will no longer give me any message, so all the > > > messages still in flight at the time are lost. > > > > So basically it sounds like you're interested in supporting TCP > > connections that are half closed. I believe that the error in half > > closed is EPIPE, so if the TCP socket returns that it can be ignored > > and the socket can continue being attached and used to send data. > > Did you mean 'can continue being attached and used to receive data'? > No, I meant shutdown on the receive side when a FIN is received. TX is still allowed to drain any queued bytes. To support shutdown on the TX side would require additional logic since we need to effectively detach the transmit path but retain the receive path. I'm not sure this is a compelling use case to support. > I can confirm getsockopt with SO_ERROR gets me EPIPE, but I don't see > how to efficiently ignore EPIPE until POLLIN gets unset -- polling on > both the csock and kcm socket will do many needless wakeups on only the > csock from what I can see, so I'd need some holdoff timer or something. > I guess it's possible though. We might need to clear the error somehow. Maybe a read of zero bytes? > > > Another possibility is to add some linger semantics to an attached > > socket. For instance, a large message might be sent so that part of > > the messge is queued in TCP and part is queued in the KCM socket. > > Unattach would probably break that message. We probably want to linger > > option similar to SO_LINGER (or maybe just use the option on the TCP > > socket) that means don't complete the detach until any message being > > transmitted on the lower socket has been queued. > > That would certainly work, even if non-obvious from a user standpoint. > > > > > > I'd like to see some retry on ENOMEM before this is merged though, so > > > > while I'm there I'll resend this with a second patch doing that > > > > retry,.. I think just not setting strp->interrupted and not reporting > > > > the error up might be enough? Will have to try either way. > > > > > > I also tried playing with that without much success. > > > I had assumed just not calling strp_parser_err() (which calls the > > > abort_parser cb) would be enough, eventually calling strp_start_timer() > > > like the !len case, but no can do. > > > > I think you need to ignore the ENOMEM and have a timer or other > > callback to retry the operation in the future. > > Yes, that's what I had intended to try; basically just break out and > schedule timer as said above. You might want to look at some other systems, I don't recall if there's a hook that can be used for when memory pressure is relieved.
> > After a bit more debugging, this part works (__strp_recv() is called > again); but the next packet that is treated properly is rejected because > by the time __strp_recv() was called again a new skb was read and the > length isn't large enough to go all the way into the new packet, so this > test fails: > } else if (len <= (ssize_t)head->len - > skb->len - stm->strp.offset) { > /* Length must be into new skb (and also > * greater than zero) > */ > STRP_STATS_INCR(strp->stats.bad_hdr_len); > strp_parser_err(strp, -EPROTO, desc); > > So I need to figure a way to say "call this function again without > reading more data" somehow, or make this check more lax e.g. accept any > len > 0 after a retry maybe... > Removing that branch altogether seems to work at least but I'm not sure > we'd want to? I like the check since it's conservative and covers the normal case. Maybe just need some more logic? > (grmbl at this slow VM and strparser not being possible to enable as a > module, it takes ages to test) > > > > > With that said, returning 0 from the parse function also raises POLLERR > > > on the csock and hangs netparser, so things aren't that simple... > > > > Can you point to where this is happening. If the parse_msg callback > > returns 0 that is suppose to indicate that more bytes are needed. > > I just blindly returned 0 "from time to time" in the kcm parser > function, but looking at above it was failing on the same check. > This somewhat makes sense for this one to fail here if a new packet was > read, no sane parser function should give a length smaller than what > they require to determine the length. > > > Thanks, > -- > Dominique
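A sketch of the EPIPE check being discussed, assuming EPIPE really is what SO_ERROR reports for the half-close case (same userspace includes as the earlier sketch):

/* Returns nonzero when the csock error is fatal; a half-close (EPIPE)
 * is tolerated so messages still queued behind the FIN can drain
 * through KCM.
 */
static int csock_error_is_fatal(int csock)
{
        int err = 0;
        socklen_t len = sizeof(err);

        if (getsockopt(csock, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
                return 1;       /* cannot even query: treat as fatal */

        return err != 0 && err != EPIPE;
}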
Re: [PATCH v2] kcm: remove any offset before parsing messages
On Fri, Feb 22, 2019 at 12:28 PM Dominique Martinet wrote: > > Tom Herbert wrote on Fri, Feb 22, 2019: > > > > So basically it sounds like you're interested in supporting TCP > > > > connections that are half closed. I believe that the error in half > > > > closed is EPIPE, so if the TCP socket returns that it can be ignored > > > > and the socket can continue being attached and used to send data. > > > > > > Did you mean 'can continue being attached and used to receive data'? > > > > No, I meant shutdown on the receive side when a FIN is received. TX is still > > allowed to drain any queued bytes. To support shutdown on the TX side > > would require additional logic since we need to effectively detach the > > transmit path but retain the receive path. I'm not sure this is a > > compelling use case to support. > Dominique, > Hm, it must be a matter of how we see things but from what I understand > it's exactly the other way around. The remote closed the connection, so > trying to send anything would just yield a RST, so TX doesn't make > sense. > On the other hand, anything that had been sent by the remote before the > FIN and is on the local side's memory should still be receivable. That's right, any data sent before the FIN can be received. After the FIN, there is no more data to receive. But, the FIN doesn't say anything about the reverse path. For instance, if the peer did shutdown(SHUT_WR), then it sent the FIN so it no longer transmits on the connection, but still can receive data. > > When you think about it as a TCP stream it's really weird: data coming, > data coming, data coming, FIN received. > But in the networking stack that received FIN short-circuits all the > data that was left around and immediately raises an EPIPE error. > Right, it doesn't help that most people automatically assume a received FIN means the connection is closed! (although in practice that assumption works pretty well for most applications). > I don't see what makes this FIN packet so great that it should be > processed before the data; we should only see that EPIPE when we're > done reading the data before it or trying to send something. It's not supposed to be. If you're seeing a problem, it might be because state_change is processed before the received data is drained. If plain recvmsg were being used, zero bytes is only returned after all outstanding received data on the socket has been read. We might need some additional logic in KCM to handle this. > > I'll check tomorrow/next week but I'm pretty sure the packets before > that have been ack'd at a tcp level as well, so losing them at the > application level is really unexpected. > > > > I can confirm getsockopt with SO_ERROR gets me EPIPE, but I don't see > > > how to efficiently ignore EPIPE until POLLIN gets unset -- polling on > > > both the csock and kcm socket will do many needless wakeups on only the > > > csock from what I can see, so I'd need some holdoff timer or something. > > > I guess it's possible though. > > > > We might need to clear the error somehow. Maybe a read of zero bytes? > > Can try.
> > > > After a bit more debugging, this part works (__strp_recv() is called > > > again); but the next packet that is treated properly is rejected because > > > by the time __strp_recv() was called again a new skb was read and the > > > length isn't large enough to go all the way into the new packet, so this > > > test fails: > > > } else if (len <= (ssize_t)head->len - > > > skb->len - stm->strp.offset) { > > > /* Length must be into new skb (and also > > > * greater than zero) > > > */ > > > STRP_STATS_INCR(strp->stats.bad_hdr_len); > > > strp_parser_err(strp, -EPROTO, desc); > > > > > > So I need to figure a way to say "call this function again without > > > reading more data" somehow, or make this check more lax e.g. accept any > > > len > 0 after a retry maybe... > > > Removing that branch altogether seems to work at least but I'm not sure > > > we'd want to? > > > > I like the check since it's conservative and covers the normal case. > > Maybe just need some more logic? > > I can add a "retrying" state and not fail here if we were retrying for > whatever reason perhaps... > But I'm starting to wonder how this would work if my client didn't keep > on sending data, I'll try to fail on the last client's packet and see > how __strp_recv is called again through the timer, with the same skb > perhaps? Right, a timer or some sort of asynchronous callback is needed. Like I said, I don't think dealing with memory failure like this is unique to KCM. Thanks again for looking into this, I think this is good progress! Tom > > -- > Dominique
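What the ENOMEM retry could look like inside __strp_recv(), as a sketch; strp_start_timer(), the mem_fail counter and the surrounding variables exist in net/strparser/strparser.c, but treating -ENOMEM as transient, and whether a later retry sees a compatible skb layout, is exactly what is unresolved above:

len = (*strp->cb.parse_msg)(strp, head);
if (len == -ENOMEM) {
        /* transient failure: rearm the message timer and retry
         * later instead of aborting the stream via strp_parser_err()
         */
        STRP_STATS_INCR(strp->stats.mem_fail);
        strp_start_timer(strp, timeo);
        desc->error = 0;
        break;
}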
Re: [PATCH 7/8] rhashtable: add rhashtable_walk_prev()
On Sat, May 5, 2018 at 2:43 AM, Herbert Xu wrote: > On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote: >> rhashtable_walk_prev() returns the object returned by >> the previous rhashtable_walk_next(), providing it is still in the >> table (or was during this grace period). >> This works even if rhashtable_walk_stop() and rhashtable_walk_start() >> have been called since the last rhashtable_walk_next(). >> >> If there have been no calls to rhashtable_walk_next(), or if the >> object is gone from the table, then NULL is returned. >> >> This can usefully be used in a seq_file ->start() function. >> If the pos is the same as was returned by the last ->next() call, >> then rhashtable_walk_prev() can be used to re-establish the >> current location in the table. If it returns NULL, then >> rhashtable_walk_next() should be used. >> >> Signed-off-by: NeilBrown > > I will ack this if Tom is OK with replacing peek with it. > I'm not following why this is any better than peek. The point of having rhashtable_walk_peek is to allow the caller to get the current element, not the next one. This is needed when the table is read in multiple parts and we need to pick up with what was returned from the last call to rhashtable_walk_next (which apparently is what this patch is also trying to do). There is one significant difference in that peek will return the element in the table regardless of where the iterator is at (this is why peek can move the iterator) and only returns NULL at end of table. As mentioned above, rhashtable_walk_prev can return NULL, in which case the caller then has to fall back to rhashtable_walk_next to get the element. Doing a peek is a lot cleaner and a more straightforward API in this regard. Tom > Cheers, > -- > Email: Herbert Xu > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
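For reference, the seq_file pattern both functions are aimed at, sketched with rhashtable_walk_peek(); the private-data layout is illustrative:

static void *obj_seq_start(struct seq_file *seq, loff_t *pos)
{
        struct rhashtable_iter *iter = seq->private;

        rhashtable_walk_start(iter);

        /* First call: advance to the first element. Later calls
         * (after a stop/start cycle): peek re-establishes the current
         * position without advancing, returning NULL only at end of
         * table (ERR_PTR(-EAGAIN) on a concurrent resize is omitted
         * here for brevity).
         */
        return *pos ? rhashtable_walk_peek(iter)
                    : rhashtable_walk_next(iter);
}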
Re: [RFC PATCH 00/30] Kernel NET policy
On Mon, Jul 18, 2016 at 5:51 PM, Liang, Kan wrote: > > >> > >> > It is a big challenge to get good network performance. First, the >> > network performance is not good with default system settings. Second, >> > it is too difficult to do automatic tuning for all possible workloads, >> > since workloads have different requirements. Some workloads may want >> high throughput. >> >> Seems you did lots of tests to find optimal settings for a given base policy. >> > Yes. Current tests are only based on the Intel i40e driver. The optimal settings should > vary for other devices. But adding settings for a new device is not hard. > The optimal settings are very dependent on system architecture (NUMA config, #cpus, memory, etc.) and sometimes kernel version as well. A database that provides best configurations across different devices, architectures, and kernel versions might be interesting, but beware that that is a whole bunch of work to maintain. Either way, policy like this really should be handled in userspace. Tom
Re: [PATCH net-next 5/5] ipv6: Compute multipath hash for forwarded ICMP errors from offending packet
On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki wrote: > Same as for the transmit path, let's do our best to ensure that received > ICMP errors that may be subject to forwarding will be routed the same > path as flow that triggered the error, if it was going in the opposite > direction. > Unfortunately our ability to do this is generally quite limited. This patch will select the route for multipath, but I don't believe sets the same link in LAG and definitely can't help switches doing ECMP to route the ICMP packet in the same way as the flow would be. Did you see a problem that warrants solving this case? Tom > Signed-off-by: Jakub Sitnicki > --- > net/ipv6/route.c | 26 ++ > 1 file changed, 26 insertions(+) > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c > index 1184c2b..c0f38ea 100644 > --- a/net/ipv6/route.c > +++ b/net/ipv6/route.c > @@ -1150,6 +1150,30 @@ struct dst_entry *ip6_route_input_lookup(struct net > *net, > } > EXPORT_SYMBOL_GPL(ip6_route_input_lookup); > > +static u32 ip6_multipath_icmp_hash(const struct sk_buff *skb) > +{ > + const struct icmp6hdr *icmph = icmp6_hdr(skb); > + const struct ipv6hdr *inner_iph; > + struct ipv6hdr _inner_iph; > + > + if (icmph->icmp6_type != ICMPV6_DEST_UNREACH && > + icmph->icmp6_type != ICMPV6_PKT_TOOBIG && > + icmph->icmp6_type != ICMPV6_TIME_EXCEED && > + icmph->icmp6_type != ICMPV6_PARAMPROB) > + goto standard_hash; > + > + inner_iph = skb_header_pointer( > + skb, skb_transport_offset(skb) + sizeof(*icmph), > + sizeof(_inner_iph), &_inner_iph); > + if (!inner_iph) > + goto standard_hash; > + > + return icmpv6_multipath_hash(inner_iph); > + > +standard_hash: > + return 0; /* compute it later, if needed */ > +} > + > void ip6_route_input(struct sk_buff *skb) > { > const struct ipv6hdr *iph = ipv6_hdr(skb); > @@ -1168,6 +1192,8 @@ void ip6_route_input(struct sk_buff *skb) > tun_info = skb_tunnel_info(skb); > if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX)) > fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id; > + if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6)) > + fl6.mp_hash = ip6_multipath_icmp_hash(skb); I will point out that this is only > skb_dst_drop(skb); > skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, &fl6, flags)); > } > -- > 2.7.4 >
Re: [PATCH net-next 5/5] ipv6: Compute multipath hash for forwarded ICMP errors from offending packet
On Fri, Oct 28, 2016 at 1:32 AM, Jakub Sitnicki wrote: > On Thu, Oct 27, 2016 at 10:35 PM GMT, Tom Herbert wrote: >> On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki wrote: >>> Same as for the transmit path, let's do our best to ensure that received >>> ICMP errors that may be subject to forwarding will be routed the same >>> path as flow that triggered the error, if it was going in the opposite >>> direction. >>> >> Unfortunately our ability to do this is generally quite limited. This >> patch will select the route for multipath, but I don't believe sets >> the same link in LAG and definitely can't help switches doing ECMP to >> route the ICMP packet in the same way as the flow would be. Did you >> see a problem that warrants solving this case? > > The motivation here is to bring IPv6 ECMP routing on par with IPv4 to > enable its wider use, targeting anycast services. Forwarding ICMP errors > back to the source host, at the L3 layer, is what we thought would be a > step forward. > > Similar to change in IPv4 routing introduced in commit 79a131592dbb > ("ipv4: ICMP packet inspection for multipath", [1]) we do our best at > L3, leaving any potential problems with LAG at lower layer (L2) > unaddressed. > ICMP will almost certainly take a different path in the network than TCP or UDP due to ECMP. If we ever get proper flow label support for ECMP then that could solve the problem if all the devices do a hash just on <srcIP, dstIP, flow label>. If this patch is being done to be compatible with IPv4 I guess that's okay, but it would be false advertisement to say this makes ICMP follow the same path as the flow being targeted in an error. Fortunately, I doubt anyone can have a dependency on this for ICMP. In the realm of OAM with UDP encapsulation this requirement does come up (that OAM messages can follow the same path as a particular flow). That case is solvable by always using a UDP encapsulation with the same addresses, ports, and flow label. Unfortunately for that we still have a few devices that insist on looking into the UDP payload to do ECMP... Tom >>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c >>> index 1184c2b..c0f38ea 100644 >>> --- a/net/ipv6/route.c >>> +++ b/net/ipv6/route.c > > [...] > >>> @@ -1168,6 +1192,8 @@ void ip6_route_input(struct sk_buff *skb) >>> tun_info = skb_tunnel_info(skb); >>> if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX)) >>> fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id; >>> + if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6)) >>> + fl6.mp_hash = ip6_multipath_icmp_hash(skb); >> >> I will point out that this is only > > Sorry, looks like part of your reply got cut short. Could you repost? > > -Jakub > > [1] https://git.kernel.org/torvalds/c/79a131592dbb81a2dba208622a2ffbfc53f28bc0
Re: [PATCH] flow_dissector: avoid uninitialized variable access
On Sat, Oct 22, 2016 at 8:57 AM, Eric Garver wrote: > On Sat, Oct 22, 2016 at 12:16:29AM +0200, Arnd Bergmann wrote: >> On Friday, October 21, 2016 11:05:45 PM CEST Arnd Bergmann wrote: >> > >> > Can you explain why "dissector_uses_key(flow_dissector, >> > FLOW_DISSECTOR_KEY_VLAN) && skb_vlan_tag_present(skb)" implies >> > "eth_type_vlan(proto))"? >> > >> > If I add uninitialized_var() here, I would at least put that in >> > a comment here. >> >> Found it now myself: if skb_vlan_tag_present(skb), then we don't >> access 'vlan', otherwise we know it is initialized because >> eth_type_vlan(proto) has to be true. >> >> > On a related note, I also don't see how >> > "dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_VLAN)" >> > implies that skb is non-NULL. I guess this is related to the >> > first one. >> >> I'm still unsure about this one. > > Only skb_flow_dissect_flow_keys_buf() calls this function with skb == > NULL. It uses flow_keys_buf_dissector_keys which does not specify > FLOW_DISSECTOR_KEY_VLAN, so the if statement is false. > > A similar assumption is made for FLOW_DISSECTOR_KEY_ETH_ADDRS higher up. > This is a serious problem. We can't rely on the callers to know which keys they are allowed to use to avoid crashing the kernel. We should fix those to check if skb is NULL, add a comment to the head of the function warning people to never assume skb is non-NULL, and also maybe add a defensive check that both the data argument and skb are not NULL. Tom >> I also found something else that is suspicious: 'vlan' points >> to the local _vlan variable, but that has gone out of scope >> by the time we access the pointer, which doesn't seem safe. > > I see no harm in moving _vlan to the same scope as vlan.
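A sketch of the defensive check Tom is suggesting, at the top of __skb_flow_dissect(); the surrounding assignments follow net/core/flow_dissector.c of this era:

if (!data) {
        /* never trust callers to only request keys that are safe
         * with a NULL skb
         */
        if (WARN_ON_ONCE(!skb))
                return false;
        data = skb->data;
        proto = skb_vlan_tag_present(skb) ?
                skb->vlan_proto : skb->protocol;
        nhoff = skb_network_offset(skb);
        hlen = skb_headlen(skb);
}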
Re: [PATCH net-next 5/5] ipv6: Compute multipath hash for forwarded ICMP errors from offending packet
On Sun, Oct 30, 2016 at 6:03 AM, Jakub Sitnicki wrote: > On Fri, Oct 28, 2016 at 02:25 PM GMT, Tom Herbert wrote: >> On Fri, Oct 28, 2016 at 1:32 AM, Jakub Sitnicki wrote: >>> On Thu, Oct 27, 2016 at 10:35 PM GMT, Tom Herbert wrote: >>>> On Mon, Oct 24, 2016 at 2:28 AM, Jakub Sitnicki wrote: >>>>> Same as for the transmit path, let's do our best to ensure that received >>>>> ICMP errors that may be subject to forwarding will be routed the same >>>>> path as flow that triggered the error, if it was going in the opposite >>>>> direction. >>>>> >>>> Unfortunately our ability to do this is generally quite limited. This >>>> patch will select the route for multipath, but I don't believe sets >>>> the same link in LAG and definitely can't help switches doing ECMP to >>>> route the ICMP packet in the same way as the flow would be. Did you >>>> see a problem that warrants solving this case? >>> >>> The motivation here is to bring IPv6 ECMP routing on par with IPv4 to >>> enable its wider use, targeting anycast services. Forwarding ICMP errors >>> back to the source host, at the L3 layer, is what we thought would be a >>> step forward. >>> >>> Similar to change in IPv4 routing introduced in commit 79a131592dbb >>> ("ipv4: ICMP packet inspection for multipath", [1]) we do our best at >>> L3, leaving any potential problems with LAG at lower layer (L2) >>> unaddressed. >>> >> ICMP will almost certainly take a different path in the network than >> TCP or UDP due to ECMP. If we ever get proper flow label support for >> ECMP then that could solve the problem if all the devices do a hash >> just on <srcIP, dstIP, flow label>. > > Sorry for my late reply, I have been traveling. > > I think that either I am missing something here, or the proposed changes > address just the problem that you have described. > > Yes, if we compute the hash that drives the route choice over the IP > header of the ICMP error, then there is no guarantee it will travel back > to the sender of the offending packet that triggered the error. > > That is why, we look at the offending packet carried by an ICMP error > and hash over its fields, instead. We need, however, to take care of two > things: > > 1) swap the source with the destination address, because we are > forwarding the ICMP error in the opposite direction than the > offending packet was going (see icmpv6_multipath_hash() introduced in > patch 4/5); and > > 2) ensure the flow labels used in both directions are the same (either > reflected by one side, or fixed, e.g. not used and set to 0), so that > the 4-tuple we hash over when forwarding, <src addr, dst addr, flow label, next hdr>, is the same both ways, modulo the order of > addresses. > >> If this patch is being done to be compatible with IPv4 I guess that's >> okay, but it would be false advertisement to say this makes ICMP >> follow the same path as the flow being targeted in an error. >> Fortunately, I doubt anyone can have a dependency on this for ICMP. > > I wouldn't want to propose anything that would be useless. If you think > that this is the case here, I would very much like to understand what > and why cannot work in practice. > The normal hash for TCP or UDP using ECMP is over <srcIP, dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be done over <srcIP, dstIP>. There really is no way to ensure that an ICMP packet will follow the same path as TCP or any other protocol. Fortunately, this really isn't so terrible. The Internet has worked this way ever since routers started using ports as input to ECMP and that hasn't caused any major meltdown. Tom > Thanks for reviewing this series, > Jakub
Re: [PATCH net-next 5/5] ipv6: Compute multipath hash for forwarded ICMP errors from offending packet
On Tue, Nov 1, 2016 at 9:25 AM, Hannes Frederic Sowa wrote: > On 31.10.2016 20:25, Tom Herbert wrote: >> The normal hash for TCP or UDP using ECMP is over <srcIP, dstIP, srcPort, dstPort>. For an ICMP packet ECMP would most likely be >> done over <srcIP, dstIP>. There really is no way to ensure >> that an ICMP packet will follow the same path as TCP or any other >> protocol. Fortunately, this is really isn't so terrible. The Internet >> has worked this way ever since routers started using ports as input to >> ECMP and that hasn't caused any major meltdown. > > The normal hash for forwarding is without srcPort or dstPort, so the > same as ICMP, and especially because fragmentation is problematic I > don't think a lot of routers use L4 port information for ECMP either. > I don't think we can define a "normal hash". There is no requirement that routers do ECMP a certain way, or that they do ECMP, or that for that matter that they even consistently route packets for the same flow. All of this is optimization, not something we can rely on operationally. So in the general case, regardless of anything we do in the stack, either ICMP packets will follow the same path as the flow or they won't. If they don't, things still need to work. So I still don't see what material benefit this patch gives us. Tom > Bye, > Hannes >
Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
On Wed, Dec 14, 2016 at 10:46 AM, Jason A. Donenfeld wrote: > SipHash is a 64-bit keyed hash function that is actually a > cryptographically secure PRF, like HMAC. Except SipHash is super fast, > and is meant to be used as a hashtable keyed lookup function. > "super fast" is relative. My quick test shows that this is faster than Toeplitz (good, but not exactly hard to achieve), but is about 4x slower than jhash. > SipHash isn't just some new trendy hash function. It's been around for a > while, and there really isn't anything that comes remotely close to > being useful in the way SipHash is. With that said, why do we need this? > I don't think we need advertising nor a lesson on hashing. It would be much more useful if you just point us to the paper on siphash (which I assume is http://cr.yp.to/siphash/siphash-20120918.pdf ?). > There are a variety of attacks known as "hashtable poisoning" in which an > attacker forms some data such that the hash of that data will be the > same, and then proceeds to fill up all entries of a hashbucket. This is > a realistic and well-known denial-of-service vector. > > Linux developers already seem to be aware that this is an issue, and > various places that use hash tables in, say, a network context, use a > non-cryptographically secure function (usually jhash) and then try to > twiddle with the key on a time basis (or in many cases just do nothing > and hope that nobody notices). While this is an admirable attempt at > solving the problem, it doesn't actually fix it. SipHash fixes it. > Key rotation is important anyway; without any key rotation, even if the key is compromised in siphash by some external means, we would have an insecure hash until the system reboots. > (It fixes it in such a sound way that you could even build a stream > cipher out of SipHash that would resist the modern cryptanalysis.) > > There are a modicum of places in the kernel that are vulnerable to > hashtable poisoning attacks, either via userspace vectors or network > vectors, and there's not a reliable mechanism inside the kernel at the > moment to fix it. The first step toward fixing these issues is actually > getting a secure primitive into the kernel for developers to use. Then > we can, bit by bit, port things over to it as deemed appropriate. > > Secondly, a few places are using MD5 for creating secure sequence > numbers, port numbers, or fast random numbers. Siphash is a faster, more > fitting, and more secure replacement for MD5 in those situations. > > Dozens of languages are already using this internally for their hash > tables. Some of the BSDs already use this in their kernels. SipHash is > a widely known high-speed solution to a widely known problem, and it's > time we catch up. > Maybe so, but we need to do due diligence before considering adopting siphash as the primary hashing in the network stack. Consider that we may very well perform a hash over L4 tuples on _every_ packet. We've done a good job at limiting this to be at most one hash per packet, but nevertheless the performance of the hash function must be taken into account. A few points: 1) My quick test shows siphash is about four times more expensive than jhash. On my test system, computing a hash over an IPv4 tuple (two 32 bit addresses and two 16 bit ports) is 6.9 nsecs in Jenkins hash, 33 nsecs with siphash. Given that we have eliminated most of the packet header hashes this might be tolerable, but we should still be looking at ways to optimize.
2) I like moving to use u64 (quad words) in the hash; this is an improvement over Jenkins which is based on 32 bit words. If we put this in the kernel we probably want to have several variants of siphash for specific sizes (e.g. siphash1, siphash2, siphash3, siphashn for a hash over one, two, three, or n sixty four bit words). 3) I also tested siphash distribution and Avalanche Effect (these really should have been covered in the paper :-( ). Both of these are good with siphash; in fact they have almost identical characteristics to Jenkins hash. Tom > Signed-off-by: Jason A. Donenfeld > Cc: Jean-Philippe Aumasson > Cc: Daniel J. Bernstein > Cc: Linus Torvalds > Cc: Eric Biggers > Cc: David Laight > --- > Changes from v2->v3: > > - There is now a fast aligned version of the function and a not-as-fast > unaligned version. The requirements for each have been documented in > a docbook-style comment. As well, the header now contains a constant > for the expected alignment. > > - The test suite has been updated to check both the unaligned and aligned > version of the function. > > include/linux/siphash.h | 30 ++ > lib/Kconfig.debug | 6 +- > lib/Makefile | 5 +- > lib/siphash.c | 153 > > lib/test_siphash.c | 85 +++ > 5 files changed, 274 insertions(+), 5 deletions(-) > create mode 100644 include/linux/siphash.h > create mode 100644 lib/siphash.c
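For the curious, the fixed-size entry points sketched in point 2 ended up looking roughly like this in the API that later landed in include/linux/siphash.h (the prototypes follow that header; the IPv4 helper below is illustrative, not part of it):

u64 siphash_1u64(const u64 a, const siphash_key_t *key);
u64 siphash_2u64(const u64 a, const u64 b, const siphash_key_t *key);
u64 siphash_3u64(const u64 a, const u64 b, const u64 c,
                 const siphash_key_t *key);

/* e.g. an IPv4 tuple (two u32 addresses, two u16 ports) packs into
 * exactly two quad words, so no tail/length handling is needed:
 */
static inline u64 ipv4_tuple_hash(__be32 saddr, __be32 daddr,
                                  __be16 sport, __be16 dport,
                                  const siphash_key_t *key)
{
        return siphash_2u64(((u64)saddr << 32) | daddr,
                            ((u64)sport << 16) | dport, key);
}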
Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform
On Wed, Dec 14, 2016 at 4:53 AM, Jason A. Donenfeld wrote: > Hi David, > > On Wed, Dec 14, 2016 at 10:51 AM, David Laight > wrote: >> From: Jason A. Donenfeld >>> Sent: 14 December 2016 00:17 >>> This gives a clear speed and security improvement. Rather than manually >>> filling MD5 buffers, we simply create a layout by a simple anonymous >>> struct, for which gcc generates rather efficient code. >> ... >>> + const struct { >>> + struct in6_addr saddr; >>> + struct in6_addr daddr; >>> + __be16 sport; >>> + __be16 dport; >>> + } __packed combined = { >>> + .saddr = *(struct in6_addr *)saddr, >>> + .daddr = *(struct in6_addr *)daddr, >>> + .sport = sport, >>> + .dport = dport >>> + }; >> >> You need to look at the effect of marking this (and the other) >> structures 'packed' on architectures like sparc64. > > In all current uses of __packed in the code, I think the impact is > precisely zero, because all structures have members in descending > order of size, with each member being a perfect multiple of the one > below it. The __packed is therefore just there for safety, in case > somebody comes in and screws everything up by sticking a u8 in > between. In that case, it wouldn't be desirable to hash the structure > padding bits. In the worst case, I don't believe the impact would be > worse than a byte-by-byte memcpy, which is what the old code did. But > anyway, these structures are already naturally packed anyway, so the > present impact is nil. > If you pad the data structure to 64 bits then we can call the version of siphash that only deals in 64 bit words. Writing a zero in the padding will be cheaper than dealing with odd lengths in siphash24. Tom > Jason
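Concretely, Tom's suggestion amounts to something like this sketch: the explicit pad field brings the struct from 36 to 40 bytes, i.e. five quad words, and SIPHASH_ALIGNMENT is the alignment constant mentioned in the patch notes:

const struct {
        struct in6_addr saddr;
        struct in6_addr daddr;
        __be16 sport;
        __be16 dport;
        u32 pad;        /* zeroed, contributes nothing to the hash */
} __aligned(SIPHASH_ALIGNMENT) combined = {
        .saddr = *(struct in6_addr *)saddr,
        .daddr = *(struct in6_addr *)daddr,
        .sport = sport,
        .dport = dport,
};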
Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
On Wed, Dec 14, 2016 at 12:55 PM, Jason A. Donenfeld wrote: > Hey Tom, > > Just following up on what I mentioned in my last email... > > On Wed, Dec 14, 2016 at 8:35 PM, Jason A. Donenfeld wrote: >> I think your suggestion for (2) will contribute to further >> optimizations for (1). In v2, I had another patch in there adding >> siphash_1word, siphash_2words, etc, like jhash, but I implemented it >> by taking u32 variables and then just concatenating these into a >> buffer and passing them to the main siphash function. I removed it >> from v3 because I thought that these kind of missed the whole point. >> In particular: >> >> a) siphash24_1word, siphash24_2words, siphash24_3words, etc should >> take u64, not u32, since that's what siphash operates on natively > > I implemented these here: > https://git.zx2c4.com/linux-dev/commit/?h=siphash&id=4652b6f3643bdba217e2194d89661348bbac48a0 > Those look good, although I would probably just do 1,2,3 words and then have a function that takes n words like jhash. Might want to call these dword to distinguish from 32 bit words in jhash. Also, what is the significance of "24" in the function and constant names? Can we just drop that and call this siphash? Tom > This will be part of the next version of the series I submit. It's not > immediately clear that using it is strictly faster than the struct > trick though. However, I'm not yet sure why this would be. > > Jason
Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function
On Wed, Dec 14, 2016 at 2:56 PM, Jason A. Donenfeld wrote: > Hey Tom, > > On Wed, Dec 14, 2016 at 10:35 PM, Tom Herbert wrote: >> Those look good, although I would probably just do 1,2,3 words and >> then have a function that takes n words like jhash. Might want to call >> these dword to distinguish from 32 bit words in jhash. > > So actually jhash_Nwords makes no sense, since it takes dwords > (32-bits) not words (16-bits). The siphash analog should be called > siphash24_Nqwords. > Yeah, that's a "bug" with jhash function names. > I think what I'll do is change what I already have to: > siphash24_1qword > siphash24_2qword > siphash24_3qword > siphash24_4qword > > And then add some static inline helpers to assist with smaller u32s > like ipv4 addresses called: > > siphash24_2dword > siphash24_4dword > siphash24_6dword > siphash24_8dword > > While we're having something new, might as well call it the right thing. > I'm confused, doesn't 2dword == 1qword? Anyway, I think the qword functions are good enough. If someone needs to hash over some odd length they can either put them in a structure padded to 64 bits or call the hash function that takes a byte length. > >> Also, what is the significance of "24" in the function and constant >> names? Can we just drop that and call this siphash? > > SipHash is actually a family of PRFs, differentiated by the number of > SIPROUNDs after each 64-bit input is processed and the number of > SIPROUNDs at the very end of the function. The best trade-off of speed > and security for kernel usage is 2 rounds after each 64-bit input and > 4 rounds at the end of the function. This doesn't fall to any known > cryptanalysis and it's very fast. I'd still drop the "24" unless you really think we're going to have multiple variants coming into the kernel. Tom
Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF
On Fri, Dec 16, 2016 at 4:39 AM, Jason A. Donenfeld wrote: > Hey JP, > > On Fri, Dec 16, 2016 at 9:08 AM, Jean-Philippe Aumasson > wrote: >> Here's a tentative HalfSipHash: >> https://github.com/veorq/SipHash/blob/halfsiphash/halfsiphash.c >> >> Haven't computed the cycle count nor measured its speed. > Tested this. Distribution and avalanche effect are still good. Speed-wise I see about a 33% improvement over siphash (20 nsecs/op versus 32 nsecs). That's still about 3x the cost of jhash (7 nsecs), so that might be closer to a palatable replacement for jhash. Do we lose any security advantages with halfsiphash? Tom > This is incredible. Really. Wow! > > I'll integrate this into my patchset and will write up some > documentation about when one should be used over the other. > > Thanks again. Quite exciting. > > Jason
Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF
On Fri, Dec 16, 2016 at 12:41 PM, George Spelvin wrote: > Tom Herbert wrote: >> Tested this. Distribution and avalanche effect are still good. Speed >> wise I see about a 33% improvement over siphash (20 nsecs/op versus 32 >> nsecs). That's about 3x of jhash speed (7 nsecs). So that might closer >> to a more palatable replacement for jhash. Do we lose any security >> advantages with halfsiphash? > > What are you testing on? And what input size? And does "33% improvement" > mean 4/3 the rate and 3/4 the time? Or 2/3 the time and 3/2 the rate? > Sorry, that is over an IPv4 tuple. Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz. Recoded the function I was using to look more like the 64 bit version and yes it is indeed slower. > These are very odd results. On a 64-bit machine, SipHash should be the > same speed per round, and faster because it hashes more data per round. > (Unless you're hitting some unexpected cache/decode effect due to REX > prefixes.) > > On a 32-bit machine (other than ARM, where your results might make sense, > or maybe if you're hashing large amounts of data), the difference should > be larger. > > And yes, there is a *significant* security loss. SipHash is 128 bits > ("don't worry about it"). hsiphash is 64 bits, which is known breakable > ("worry about it"), so we have to do a careful analysis of the cost of > a successful attack. > > As mentioned in the e-mails that just flew by, hsiphash is intended > *only* for 32-bit machines which bog down on full SipHash. On all 64-bit > machines, it will be implemented as an alias for SipHash and the security > concerns will Just Go Away. > > The place where hsiphash is expected to make a big difference is 32-bit > x86. If you only see 33% difference with "gcc -m32", I'm going to be > very confused.
Re: [PATCH net-next v4 1/2] New kernel function to get IP overhead on a socket.
On Wed, Mar 22, 2017 at 3:59 PM, R. Parameswaran wrote: > > A new function, kernel_sock_ip_overhead(), is provided > to calculate the cumulative overhead imposed by the IP > Header and IP options, if any, on a socket's payload. > The new function returns an overhead of zero for sockets > that do not belong to the IPv4 or IPv6 address families. > Can you provide some context as to why this is needed? Tom > Signed-off-by: R. Parameswaran > --- > include/linux/net.h | 3 +++ > net/socket.c| 44 > 2 files changed, 47 insertions(+) > > diff --git a/include/linux/net.h b/include/linux/net.h > index 0620f5e..a42fab2 100644 > --- a/include/linux/net.h > +++ b/include/linux/net.h > @@ -298,6 +298,9 @@ int kernel_sendpage(struct socket *sock, struct page > *page, int offset, > int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg); > int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how); > > +/* Following routine returns the IP overhead imposed by a socket. */ > +u32 kernel_sock_ip_overhead(struct sock *sk); > + > #define MODULE_ALIAS_NETPROTO(proto) \ > MODULE_ALIAS("net-pf-" __stringify(proto)) > > diff --git a/net/socket.c b/net/socket.c > index e034fe4..69598e1 100644 > --- a/net/socket.c > +++ b/net/socket.c > @@ -3345,3 +3345,47 @@ int kernel_sock_shutdown(struct socket *sock, enum > sock_shutdown_cmd how) > return sock->ops->shutdown(sock, how); > } > EXPORT_SYMBOL(kernel_sock_shutdown); > + > +/* This routine returns the IP overhead imposed by a socket i.e. > + * the length of the underlying IP header, depending on whether > + * this is an IPv4 or IPv6 socket and the length from IP options turned > + * on at the socket. > + */ > +u32 kernel_sock_ip_overhead(struct sock *sk) > +{ > + struct inet_sock *inet; > + struct ipv6_pinfo *np; > + struct ip_options_rcu *opt; > + struct ipv6_txoptions *optv6 = NULL; > + u32 overhead = 0; > + bool owned_by_user; > + > + if (!sk) > + return overhead; > + > + owned_by_user = sock_owned_by_user(sk); > + switch (sk->sk_family) { > + case AF_INET: > + inet = inet_sk(sk); > + overhead += sizeof(struct iphdr); > + opt = rcu_dereference_protected(inet->inet_opt, > + owned_by_user); > + if (opt) > + overhead += opt->opt.optlen; > + return overhead; > +#if IS_ENABLED(CONFIG_IPV6) > + case AF_INET6: > + np = inet6_sk(sk); > + overhead += sizeof(struct ipv6hdr); > + if (np) > + optv6 = rcu_dereference_protected(np->opt, > + owned_by_user); > + if (optv6) > + overhead += (optv6->opt_flen + optv6->opt_nflen); > + return overhead; > +#endif /* IS_ENABLED(CONFIG_IPV6) */ > + default: /* Returns 0 overhead if the socket is not ipv4 or ipv6 */ > + return overhead; > + } > +} > +EXPORT_SYMBOL(kernel_sock_ip_overhead); > -- > 2.1.4 >
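A sketch of the kind of caller a helper like this serves, e.g. an encapsulation driver sizing a session MTU off its tunnel socket; the function and names below are illustrative, not taken from the series:

#include <linux/udp.h>

/* underlay IP header (+options/ext hdrs) + UDP + encap header */
static u32 tunnel_session_mtu(struct net_device *underlay_dev,
                              struct sock *tunnel_sk, u32 encap_hdr_len)
{
        u32 overhead = kernel_sock_ip_overhead(tunnel_sk);

        overhead += sizeof(struct udphdr) + encap_hdr_len;
        return underlay_dev->mtu - overhead;
}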
Re: [PATCH 01/12] net: mediatek: fix DQL support
On Tue, Jun 7, 2016 at 4:01 PM, David Miller wrote: > From: John Crispin > Date: Mon, 6 Jun 2016 08:43:13 +0200 > >> i think one solution would be to add some code to have 2 devices share >> the same dql instance. would that be an acceptable solution ? > > You still need to address the issue of synchronization. > > dql purposefully doesn't use locking, always because a higher level > object (in this case the netdev TX queue) it is contained within > provides the synchronization. > > That breaks apart once you share the dql between two netdevs, as you > are proposing here. You'll have to add locking, which is expensive. > > That's why I'm trying to encourage you to think out of the box and > find some way to solve the issue without having to access shared > state shared between multiple devices. > I think you guys mean BQL not DQL :-) If two netdevs share the same DMA ring then is using two netdevs the right approach? Seems like this would have other consequences beyond BQL... Tom > Thanks. >
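For context, the lockless model being referred to is the standard per-queue BQL pattern, sketched below; both calls happen under that one queue's own synchronization, which is exactly what sharing a dql between two netdevs would break:

/* transmit path, under this TX queue's lock */
netdev_tx_sent_queue(netdev_get_tx_queue(dev, qid), skb->len);

/* completion path, only ever from this queue's NAPI poll */
netdev_tx_completed_queue(netdev_get_tx_queue(dev, qid),
                          pkts_completed, bytes_completed);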
Re: [PATCH] net: Fix typos and whitespace.
On Wed, Mar 23, 2016 at 11:27 AM, David Miller wrote: > From: Bjorn Helgaas > Date: Wed, 23 Mar 2016 08:45:30 -0500 > >> Fix typos. Capitalize CPU, NAPI, RCU consistently. Align structure >> indentation. No functional change intended; only comment and whitespace >> changes. >> >> Signed-off-by: Bjorn Helgaas > > Does not apply to the current 'net' tree, please respin. Why is this for net and not net-next?
Re: [PATCH] flow_dissector: Pre-initialize ip_proto in __skb_flow_dissect()
On Thu, Jun 25, 2015 at 6:10 AM, Geert Uytterhoeven wrote: > net/core/flow_dissector.c: In function ‘__skb_flow_dissect’: > net/core/flow_dissector.c:132: warning: ‘ip_proto’ may be used uninitialized > in this function > > Signed-off-by: Geert Uytterhoeven > --- > This may be a false positive, but the state machine in > __skb_flow_dissect() is a bit hard to follow. > As I believe it is controlled by a packet received from the network, the > only safe thing to do is to pre-initialize ip_proto. > --- > net/core/flow_dissector.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c > index 476e5dda59e19822..2a834c6179b9973e 100644 > --- a/net/core/flow_dissector.c > +++ b/net/core/flow_dissector.c > @@ -129,7 +129,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb, > struct flow_dissector_key_ports *key_ports; > struct flow_dissector_key_tags *key_tags; > struct flow_dissector_key_keyid *key_keyid; > - u8 ip_proto; > + u8 ip_proto = 0; > > if (!data) { > data = skb->data; > -- > 1.9.1 > Acked-by: Tom Herbert
Re: [PATCH 3/4] netfilter: ipv4: use preferred kernel types
On Mon, Feb 1, 2016 at 11:41 AM, David Miller wrote: > From: David Laight > Date: Mon, 1 Feb 2016 16:37:41 + > >> From: Lucas Tanure >>> Sent: 30 January 2016 13:18 >>> As suggested by checkpatch.pl: >>> CHECK: Prefer kernel type 'uX' over 'uintX_t' >> >> One might ask why? > > We have consistently done this, and consistency is enough of an > argument. Also, uintX_t is verbose and does not add anything to readability.
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar wrote: > > * Ingo Molnar wrote: > >> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> >> > + >> > + /* Check length */ >> > +10:cmpl$8, %esi >> > + jg 30f >> > + jl 20f >> > + >> > + /* Exactly 8 bytes length */ >> > + addl(%rdi), %eax >> > + adcl4(%rdi), %eax >> > + RETURN >> > + >> > + /* Less than 8 bytes length */ >> > +20:clc >> > + jmpq *branch_tbl_len(, %rsi, 8) >> > + >> > + /* Greater than 8 bytes length. Determine number of quads (n). Sum >> > +* over first n % 16 quads >> > +*/ >> > +30:movl%esi, %ecx >> > + shrl$3, %ecx >> > + andl$0xf, %ecx >> > + negq%rcx >> > + lea 40f(, %rcx, 4), %r11 >> > + clc >> > + jmp *%r11 >> >> Are you absolutely sure that a jump table is the proper solution here? It >> defeats branch prediction on most x86 uarchs. Why not label the loop stages >> and >> jump in directly with a binary tree of branches? > > So just to expand on this a bit, attached below is a quick & simple & stupid > testcase that generates a 16 entries call table. (Indirect jumps and indirect > calls are predicted similarly on most x86 uarchs.) Just built it with: > > gcc -Wall -O2 -o jump-table jump-table.c > > Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel > CPU > and also on AMD Opteron. The numbers are from the Intel box.) this gives a > high > branch miss rate with a 16 entries jump table: > > triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16 > ... using 16 jump table entries. > ... using 1 loop iterations. > ... result: 10001 > [...] > > Performance counter stats for './jump-table 16' (10 runs): > >1386.131780 task-clock (msec) #1.001 CPUs utilized > ( +- 0.18% ) > 33 context-switches #0.024 K/sec > ( +- 1.71% ) > 0 cpu-migrations#0.000 K/sec > 52 page-faults #0.037 K/sec > ( +- 0.71% ) > 6,247,215,683 cycles#4.507 GHz > ( +- 0.18% ) > 3,895,337,877 stalled-cycles-frontend # 62.35% frontend cycles > idle ( +- 0.30% ) > 1,404,014,996 instructions #0.22 insns per cycle > #2.77 stalled cycles > per insn ( +- 0.02% ) >300,820,988 branches # 217.022 M/sec > ( +- 0.02% ) > 87,518,741 branch-misses # 29.09% of all branches > ( +- 0.01% ) > >1.385240076 seconds time elapsed >( +- 0.21% ) > > ... as you can see the branch miss rate is very significant, causing a stalled > decoder and very low instruction throughput. > > I have to reduce the jump table to a single entry (!) to get good performance > on > this CPU: > > Performance counter stats for './jump-table 1' (10 runs): > > 739.173505 task-clock (msec) #1.001 CPUs utilized > ( +- 0.26% ) > 37 context-switches #0.051 K/sec > ( +- 16.79% ) > 0 cpu-migrations#0.000 K/sec > 52 page-faults #0.070 K/sec > ( +- 0.41% ) > 3,331,328,405 cycles#4.507 GHz > ( +- 0.26% ) > 2,012,973,596 stalled-cycles-frontend # 60.43% frontend cycles > idle ( +- 0.47% ) > 1,403,880,792 instructions #0.42 insns per cycle > #1.43 stalled cycles > per insn ( +- 0.05% ) >300,817,064 branches # 406.964 M/sec > ( +- 0.05% ) > 12,177 branch-misses #0.00% of all branches > ( +- 12.39% ) > >0.738616356 seconds time elapsed >( +- 0.26% ) > > Note how the runtime got halved: that is because stalls got halved and the > instructions per cycle throughput doubled. 
> > Even a two entries jump table performs poorly: > > Performance counter stats for './jump-table 2' (10 runs): > >1493.790686 task-clock (msec) #1.001 CPUs utilized > ( +- 0.06% ) > 39 context-switches #0.026 K/sec > ( +- 4.73% ) > 0 cpu-migrations#0.000 K/sec > 52 page-faults #0.035 K/sec > ( +- 0.26% ) > 6,732,372,612 cycles#4.507 GHz > ( +- 0.06% ) > 4,229,130,302 stalled-cycl
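Ingo's jump-table.c was attached to the original mail and is not reproduced in this archive; a minimal sketch of the same idea, N distinct indirect-call targets dispatched through a table, would be:

#include <stdio.h>
#include <stdlib.h>

#define LOOPS 100000000

/* 16 distinct functions, so different table slots really are
 * different branch targets for the indirect call predictor
 */
#define F(n) static long f##n(long x) { return x + n; }
F(0) F(1) F(2) F(3) F(4) F(5) F(6) F(7)
F(8) F(9) F(10) F(11) F(12) F(13) F(14) F(15)

typedef long (*fn_t)(long);

static const fn_t table[16] = {
        f0, f1, f2,  f3,  f4,  f5,  f6,  f7,
        f8, f9, f10, f11, f12, f13, f14, f15,
};

int main(int argc, char **argv)
{
        int entries = argc > 1 ? atoi(argv[1]) : 16;
        long sum = 0;
        int i;

        if (entries < 1 || entries > 16)
                entries = 16;
        printf("... using %d jump table entries.\n", entries);

        /* few live entries: well-predicted; many: frequent misses */
        for (i = 0; i < LOOPS; i++)
                sum = table[i % entries](sum);

        printf("... result: %ld\n", sum);
        return 0;
}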
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds wrote: > I missed the original email (I don't have net-devel in my mailbox), > but based on Ingo's quoting have a more fundamental question: > > Why wasn't that done with C code instead of asm with odd numerical targets? > The reason I did this in assembly is precisely about your point of having to close the carry chains with adcq $0. I do have a first implementation in C which uses switch() to handle alignment, excess length less than 8 bytes, and the odd number of quads to sum in the main loop. gcc turns these switch statements into jump tables (not function tables, which is what Ingo's example code was using). The problem I hit was that for each case I needed to close the carry chain in the inline asm so fall through wouldn't have much value and each case is expanded. The C version using switch gave a nice performance gain; moving to all assembly was somewhat better. There is also the question of alignment. If we really don't need to worry about alignment at all on x86, then we should be able to eliminate the complexity of dealing with it. > It seems likely that the real issue is avoiding the short loops (that > will cause branch prediction problems) and use a lookup table instead. > > But we can probably do better than that asm. > > For example, for the actual "8 bytes or shorter" case, I think > something like this might just work fine: > > unsigned long csum_partial_8orless(const unsigned char *buf, > unsigned long len, unsigned long sum) > { > static const unsigned long mask[9] = { > 0x0000000000000000, > 0x000000000000ff00, > 0x000000000000ffff, > 0x00000000ffffff00, > 0x00000000ffffffff, > 0x0000ffffffffff00, > 0x0000ffffffffffff, > 0xffffffffffffff00, > 0xffffffffffffffff }; > unsigned long val = load_unaligned_zeropad(buf + (len & 1)); > val &= mask[len]; > asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum)); > return sum; > } > I will look at doing that. Thanks, Tom > NOTE! The above is 100% untested. But I _think_ it may handle the > odd-byte-case correctly, and it should result in just one 8-byte load > (the "load_unaligned_zeropad()" is just in case that ends up > overflowing and we have page-alloc-debug triggering a page fault at > the end). All without looping or any conditional branches that might > mispredict. > > My point is that going to assembly results in pretty hard-to-read > code, but it's also fairly inflexible. If we stay with C, we can much > more easily play tricks. So for example, make the above an inline > function, and now you can do things like this: > > static inline unsigned long csum_partial_64bit(void *buf, unsigned > long len, unsigned long sum) > { > if (len <= 8) > return csum_partial_8orless(buf, len, sum); > > /* We know it's larger than 8 bytes, so handle alignment */ > align = 7 & -(unsigned long)buf; > sum = csum_partial_8orless(buf, align, sum); > buf += align; > > /* We need to do the big-endian thing */ > sum = rotate_by8_if_odd(sum, align); > > /* main loop for big packets */ > .. do the unrolled inline asm thing that we already do .. > > sum = rotate_by8_if_odd(sum, align); > > /* Handle the last bytes */ > return csum_partial_8orless(buf, len, sum); > } > > /* Fold the 64-bit sum we computed down to 32 bits __wsum */ > __wsum csum_partial(void *buf, unsigned int len, __wsum partial) > { > unsigned long sum = csum_partial_64bit(buf, len, partial); > asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum)); > return sum; > } > > or something like that. > > NOTE NOTE NOTE!
> I did a half-arsed attempt at getting the whole
> "big-endian 16-bit add" thing right by doing the odd byte masking in
> the end cases, and by rotating the sum by 8 bits around the
> 8-byte-unrolled-loop, but I didn't test the above. It's literally
> written in my mail client. So I can almost guarantee that it is buggy,
> but it is meant as an *example* of "why not do it this way" rather
> than production code.
>
> I think that writing it in C and trying to be intelligent about it
> like the above would result in more maintainable code, and it is
> possible that it would even be faster.
>
> Yes, writing it in C *does* likely result in a few more cases of "adcq
> $0" in order to finish up the carry calculations. The *only* advantage
> of inline asm is how it allows you to keep the carry flag around. So
> there is a downside to the C model, and it might cause a cycle or two of
> extra work, but the upside of C is that you can try to do clever
> things without turning the code completely unreadable.
>
> For example, doing the exception handling (that will never actually
> trigger) for
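[The sketch above calls rotate_by8_if_odd() without defining it. A minimal guess at what such a helper could look like, assuming it compensates for an odd starting alignment by byte-rotating the 64-bit accumulator; this is a hypothetical illustration, not code from the thread:]

    /*
     * Hypothetical sketch of the rotate_by8_if_odd() helper referenced
     * above. The 16-bit ones'-complement sum of a buffer that starts at
     * an odd address is the byte-swapped sum of the even-aligned case,
     * so a rotate by 8 bits realigns the byte lanes of the 64-bit
     * accumulator when 'align' is odd.
     */
    static inline unsigned long rotate_by8_if_odd(unsigned long sum,
                                                  unsigned long align)
    {
            if (align & 1)
                    sum = (sum << 8) | (sum >> 56);  /* rol $8 on x86-64 */
            return sum;
    }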
Re: [PATCH net-next] hv_netvsc: Add feature flags NETIF_F_IPV6_CSUM and NETIF_F_TSO6 for netvsc
On Wed, Feb 3, 2016 at 1:30 PM, Simon Xiao wrote:
> 1. Adding NETIF_F_IPV6_CSUM and NETIF_F_TSO6 feature flags which are
> supported by Hyper-V platform.

NETIF_F_IPV6_CSUM and NETIF_F_IP_CSUM are being deprecated. Please
change to use NETIF_F_HW_CSUM (calling helper functions if needed).

Thanks,
Tom

> 2. Cleanup the coding style of flag assignment by using macro.
>
> Signed-off-by: Simon Xiao
> Reviewed-by: K. Y. Srinivasan
> Reviewed-by: Haiyang Zhang
> ---
>  drivers/net/hyperv/netvsc_drv.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index 1d3a665..0cde741 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -43,6 +43,12 @@
>
>  #define RING_SIZE_MIN 64
>  #define LINKCHANGE_INT (2 * HZ)
> +#define NETVSC_HW_FEATURES (NETIF_F_RXCSUM | \
> +                            NETIF_F_SG | \
> +                            NETIF_F_TSO | \
> +                            NETIF_F_TSO6 | \
> +                            NETIF_F_IP_CSUM | \
> +                            NETIF_F_IPV6_CSUM)
>  static int ring_size = 128;
>  module_param(ring_size, int, S_IRUGO);
>  MODULE_PARM_DESC(ring_size, "Ring buffer size (# of pages)");
> @@ -1081,10 +1087,8 @@ static int netvsc_probe(struct hv_device *dev,
>
>         net->netdev_ops = &device_ops;
>
> -       net->hw_features = NETIF_F_RXCSUM | NETIF_F_SG | NETIF_F_IP_CSUM |
> -                          NETIF_F_TSO;
> -       net->features = NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_SG | NETIF_F_RXCSUM |
> -                       NETIF_F_IP_CSUM | NETIF_F_TSO;
> +       net->hw_features = NETVSC_HW_FEATURES;
> +       net->features = NETVSC_HW_FEATURES | NETIF_F_HW_VLAN_CTAG_TX;
>
>         net->ethtool_ops = &ethtool_ops;
>         SET_NETDEV_DEV(net, &dev->device);
> --
> 2.5.0
>
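[For what the NETIF_F_HW_CSUM conversion suggested above might look like, here is a hedged sketch, not the actual netvsc change: NETIF_F_HW_CSUM advertises protocol-independent checksum offload, and a driver falls back to the real kernel helper skb_checksum_help() for packets its hardware cannot handle. netvsc_can_offload_csum() is a hypothetical predicate.]

    /* Sketch only: replace the two protocol-specific flags with the
     * protocol-independent one. */
    #define NETVSC_HW_FEATURES      (NETIF_F_RXCSUM | \
                                     NETIF_F_SG | \
                                     NETIF_F_TSO | \
                                     NETIF_F_TSO6 | \
                                     NETIF_F_HW_CSUM)

    /* In the transmit path, resolve the checksum in software when the
     * hardware cannot offload this particular packet.
     * netvsc_can_offload_csum() is hypothetical. */
    static int netvsc_handle_csum(struct sk_buff *skb)
    {
            if (skb->ip_summed == CHECKSUM_PARTIAL &&
                !netvsc_can_offload_csum(skb))
                    return skb_checksum_help(skb);  /* real kernel helper */
            return 0;
    }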
Re: BUG: free active (active state 0) object type: work_struct hint: strp_work
On Tue, Feb 13, 2018 at 12:15 PM, Dmitry Vyukov wrote:
>
> On Thu, Jan 4, 2018 at 8:36 PM, Tom Herbert wrote:
> > On Thu, Jan 4, 2018 at 4:10 AM, syzbot wrote:
> >> Hello,
> >>
> >> syzkaller hit the following crash on
> >> 6bb8824732f69de0f233ae6b1a8158e149627b38
> >> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> >> compiler: gcc (GCC) 7.1.1 20170620
> >> .config is attached
> >> Raw console output is attached.
> >> Unfortunately, I don't have any reproducer for this bug yet.
> >>
> >>
> >> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> >> Reported-by: syzbot+3c6c745b0d2f341bb...@syzkaller.appspotmail.com
> >> It will help syzbot understand when the bug is fixed. See footer for
> >> details.
> >> If you forward the report, please keep this part and the footer.
> >>
> >> Use struct sctp_assoc_value instead
> >> sctp: [Deprecated]: syz-executor4 (pid 12483) Use of int in maxseg socket
> >> option.
> >> Use struct sctp_assoc_value instead
> >> ------------[ cut here ]------------
> >> ODEBUG: free active (active state 0) object type: work_struct hint:
> >> strp_work+0x0/0xf0 net/strparser/strparser.c:381
> >> WARNING: CPU: 1 PID: 3502 at lib/debugobjects.c:291
> >> debug_print_object+0x166/0x220 lib/debugobjects.c:288
> >> Kernel panic - not syncing: panic_on_warn set ...
> >>
> >> CPU: 1 PID: 3502 Comm: kworker/u4:4 Not tainted 4.15.0-rc5+ #170
> >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> >> Google 01/01/2011
> >> Workqueue: kkcmd kcm_tx_work
> >> Call Trace:
> >>  __dump_stack lib/dump_stack.c:17 [inline]
> >>  dump_stack+0x194/0x257 lib/dump_stack.c:53
> >>  panic+0x1e4/0x41c kernel/panic.c:183
> >>  __warn+0x1dc/0x200 kernel/panic.c:547
> >>  report_bug+0x211/0x2d0 lib/bug.c:184
> >>  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
> >>  fixup_bug arch/x86/kernel/traps.c:247 [inline]
> >>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
> >>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
> >>  invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1061
> >> RIP: 0010:debug_print_object+0x166/0x220 lib/debugobjects.c:288
> >> RSP: 0018:8801c0ee7068 EFLAGS: 00010086
> >> RAX: dc08 RBX: 0003 RCX: 8159bc3e
> >> RDX: RSI: 1100381dcdc8 RDI: 8801db317dd0
> >> RBP: 8801c0ee70a8 R08: R09: 1100381dcd9a
> >> R10: ed00381dce3c R11: 86137ad8 R12: 0001
> >> R13: 86113480 R14: 8560dc40 R15: 8146e5f0
> >>  __debug_check_no_obj_freed lib/debugobjects.c:745 [inline]
> >>  debug_check_no_obj_freed+0x662/0xf1f lib/debugobjects.c:774
> >>  kmem_cache_free+0x253/0x2a0 mm/slab.c:3745
> >
> > I believe we just need to defer kmem_cache_free to call_rcu.
>
> Hi Tom,
>
> Was this ever submitted? I don't see any such change in net/kcm/kcmsock.c.

Hi Dmitry,

I am looking at it. I'm not yet convinced that call_rcu is the right fix.

Tom
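[A rough sketch of the call_rcu deferral Tom floats above, under the assumption that the freed object gains an embedded rcu_head; the struct and function names here are hypothetical, and kcm_psockp is assumed to be the kmem_cache used by kcmsock.c. This is not the eventual fix, only an illustration of the technique.]

    /* Hedged sketch: defer the free until after an RCU grace period so
     * concurrent readers (and queued work) cannot touch freed memory. */
    struct kcm_psock_example {
            struct rcu_head rcu;    /* added so the object can ride call_rcu */
            /* ... existing fields ... */
    };

    static void psock_free_rcu(struct rcu_head *head)
    {
            struct kcm_psock_example *psock =
                    container_of(head, struct kcm_psock_example, rcu);

            kmem_cache_free(kcm_psockp, psock);     /* kcm_psockp: assumed cache */
    }

    static void psock_destroy_deferred(struct kcm_psock_example *psock)
    {
            /* instead of calling kmem_cache_free() directly */
            call_rcu(&psock->rcu, psock_free_rcu);
    }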
Re: [RFC v2 6/6] flow_dissector: Parse batman-adv unicast headers
On Tue, Dec 5, 2017 at 6:35 AM, Sven Eckelmann wrote:
> The batman-adv unicast packets contain a full layer 2 frame in encapsulated
> form. The flow dissector must therefore be able to parse the batman-adv
> unicast header to reach the layer 2+3 information.
>
>   +--------------------+
>   |     ip(v6)hdr      |
>   +--------------------+
>   |    inner ethhdr    |
>   +--------------------+
>   | batadv unicast hdr |
>   +--------------------+
>   |    outer ethhdr    |
>   +--------------------+
>
> The obtained information from the upper layer can then be used by RPS to
> schedule the processing on separate cores. This allows better distribution
> of multiple flows from the same neighbor to different cores.
>
> Signed-off-by: Sven Eckelmann
> ---
>  net/core/flow_dissector.c | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
>
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index 15ce30063765..784cc07fc58e 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -24,6 +24,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  static void dissector_set_key(struct flow_dissector *flow_dissector,
>                               enum flow_dissector_key_id key_id)
> @@ -696,6 +697,35 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>
>                 break;
>         }
> +       case htons(ETH_P_BATMAN): {
> +               struct {
> +                       struct batadv_unicast_packet batadv_unicast;
> +                       struct ethhdr eth;
> +               } *hdr, _hdr;
> +
> +               hdr = __skb_header_pointer(skb, nhoff, sizeof(_hdr), data, hlen,
> +                                          &_hdr);
> +               if (!hdr) {
> +                       fdret = FLOW_DISSECT_RET_OUT_BAD;
> +                       break;
> +               }
> +
> +               if (hdr->batadv_unicast.version != BATADV_COMPAT_VERSION) {
> +                       fdret = FLOW_DISSECT_RET_OUT_BAD;
> +                       break;
> +               }
> +
> +               if (hdr->batadv_unicast.packet_type != BATADV_UNICAST) {
> +                       fdret = FLOW_DISSECT_RET_OUT_BAD;
> +                       break;
> +               }
> +
> +               proto = hdr->eth.h_proto;
> +               nhoff += sizeof(*hdr);
> +
> +               fdret = FLOW_DISSECT_RET_PROTO_AGAIN;
> +               break;
> +       }
>         case htons(ETH_P_8021AD):
>         case htons(ETH_P_8021Q): {
>                 const struct vlan_hdr *vlan;
> --
> 2.11.0
>

Switch statements with cases having many LOC are hard to read, and
__skb_flow_dissect is already quite convoluted to begin with. I
suggest putting this in a static function, similar to how MPLS and GRE
are handled.

Tom
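[What the refactor Tom asks for might look like, modeled on how the existing MPLS/GRE helpers return an enum flow_dissect_ret. This is a sketch of the shape, not the final upstream code; the helper name and parameter list are assumptions.]

    /* Hedged sketch: hoist the case body into a helper, mirroring the
     * MPLS/GRE pattern in flow_dissector.c. */
    static enum flow_dissect_ret
    __skb_flow_dissect_batadv(const struct sk_buff *skb, __be16 *proto,
                              int *nhoff, void *data, int hlen)
    {
            struct {
                    struct batadv_unicast_packet batadv_unicast;
                    struct ethhdr eth;
            } *hdr, _hdr;

            hdr = __skb_header_pointer(skb, *nhoff, sizeof(_hdr), data, hlen,
                                       &_hdr);
            if (!hdr)
                    return FLOW_DISSECT_RET_OUT_BAD;

            if (hdr->batadv_unicast.version != BATADV_COMPAT_VERSION)
                    return FLOW_DISSECT_RET_OUT_BAD;

            if (hdr->batadv_unicast.packet_type != BATADV_UNICAST)
                    return FLOW_DISSECT_RET_OUT_BAD;

            /* Continue dissection at the inner Ethernet payload. */
            *proto = hdr->eth.h_proto;
            *nhoff += sizeof(*hdr);

            return FLOW_DISSECT_RET_PROTO_AGAIN;
    }

[The switch case then collapses to a two-line body: fdret = __skb_flow_dissect_batadv(skb, &proto, &nhoff, data, hlen); break;]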
Re: WARNING in strp_data_ready
Did you try the patch I posted?

On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov wrote:
> On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov wrote:
>>> wrote:
>>>> On 10/24/2017 08:20 AM, syzbot wrote:
>>>>> Hello,
>>>>>
>>>>> syzkaller hit the following crash on
>>>>> 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>>>> compiler: gcc (GCC) 7.1.1 20170620
>>>>> .config is attached
>>>>> Raw console output is attached.
>>>>> C reproducer is attached
>>>>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>>>> for information about syzkaller reproducers
>>>>>
>>>>>
>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_me
>>>>> include/net/sock.h:1505 [inline]
>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505 sock_owned_by_user
>>>>> include/net/sock.h:1511 [inline]
>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505
>>>>> strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>> Kernel panic - not syncing: panic_on_warn set ...
>>>>>
>>>>> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>>>>> Google 01/01/2011
>>>>> Call Trace:
>>>>>
>>>>>  __dump_stack lib/dump_stack.c:16 [inline]
>>>>>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>>>>>  panic+0x1e4/0x417 kernel/panic.c:181
>>>>>  __warn+0x1c4/0x1d9 kernel/panic.c:542
>>>>>  report_bug+0x211/0x2d0 lib/bug.c:183
>>>>>  fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
>>>>>  do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
>>>>>  do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
>>>>>  do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
>>>>>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
>>>>>  invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
>>>>> RIP: 0010:sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>> RIP: 0010:sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>> RIP: 0010:strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>> RSP: 0018:8801db206b18 EFLAGS: 00010206
>>>>> RAX: 8801d1e02080 RBX: 8801dad74c48 RCX:
>>>>> RDX: 0100 RSI: 8801d29fa0a0 RDI: 85cbede0
>>>>> RBP: 8801db206b38 R08: 0005 R09: 10ce0bcd
>>>>> R10: 8801db206a00 R11: dc00 R12: 8801d29fa000
>>>>> R13: 8801dad74c50 R14: 8801d4350a92 R15: 0001
>>>>>  psock_data_ready+0x56/0x70 net/kcm/kcmsock.c:353
>>>>
>>>> Looks like KCM is calling sk_data_ready() without first taking the
>>>> sock lock.
>>>>
>>>>   /* Called with lower sock held */
>>>>   static void kcm_rcv_strparser(struct strparser *strp, struct sk_buff *skb)
>>>>   {
>>>>   [...]
>>>>           if (kcm_queue_rcv_skb(&kcm->sk, skb)) {
>>>>
>>>> In this case kcm->sk is not the same lock the comment is referring
>>>> to. And kcm_queue_rcv_skb() will eventually call sk_data_ready().
>>>>
>>>> @Tom, how about wrapping the sk_data_ready call in
>>>> {lock|release}_sock? I don't have anything better in mind immediately.
>>>
>>> The sock locks are taken in reverse order in the send path, so
>>> grabbing the kcm sock lock with the lower lock held to call
>>> sk_data_ready may lead to deadlock, I think.
>>>
>>> It might be possible to change the order in the send path to do this.
>>> Something like:
>>>
>>>   trylock on lower socket lock
>>>   - if trylock fails
>>>     - release kcm sock lock
>>>     - lock lower sock
>>>     - lock kcm sock
>>>   - call sendpage locked function
>>>
>>> I admit that dealing with two levels of socket locks in the data path
>>> is quite a pain :-)
>>
>> up
>>
>> still happening and we've lost 50K+ test VMs on this
>
> up
>
> Still happens and the number of crashes has crossed 60K; can we do
> something about this, please?
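[In pseudo-C, the lock reordering Tom outlines above might look like the following sketch. Note that try_lock_sock() is hypothetical; lock_sock() has no public trylock variant, so this only illustrates the intended ordering, not a buildable fix.]

    /* Illustration only: try_lock_sock() is a made-up primitive used to
     * show the ordering idea (lower sock first, then kcm sock). */
    static int kcm_sendpage_ordering_sketch(struct sock *kcm_sk, struct sock *csk)
    {
            /* kcm_sk lock is held on entry */
            if (!try_lock_sock(csk)) {      /* hypothetical trylock */
                    release_sock(kcm_sk);   /* drop the upper lock ...   */
                    lock_sock(csk);         /* ... take the lower first  */
                    lock_sock(kcm_sk);      /* ... then retake the upper */
            }

            /* Both locks are now held in lower-then-upper order, so the
             * locked sendpage path can be called without inverting the
             * order used elsewhere. */

            release_sock(csk);
            return 0;
    }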
Re: WARNING in strp_data_ready
On Wed, Dec 27, 2017 at 12:20 PM, Ozgur wrote:
>
> 27.12.2017, 23:14, "Dmitry Vyukov":
>> On Wed, Dec 27, 2017 at 9:08 PM, Ozgur wrote:
>>> 27.12.2017, 22:21, "Dmitry Vyukov":
>>>> On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert wrote:
>>>>> Did you try the patch I posted?
>>>>
>>>> Hi Tom,
>>>
>>> Hello Dmitry,
>>>
>>>> No. And I didn't know I need to. Why?
>>>> If you think the patch needs additional testing, you can ask syzbot to
>>>> test it. See
>>>> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot
>>>> Otherwise proceed with committing it. Or what are we waiting for?
>>>>
>>>> Thanks
>>>
>>> I think we need a fixed patch for the crash; in fact, we should check
>>> the patch code and test whether it solves the bug.
>>> How do we test it when there is no patch in the following bug?
>>
>> Hi Ozgur,
>>
>> I am not sure I completely understand what you mean. But the
>> reproducer for this bug (which one can use for testing) is here:
>> https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY
>> Tom also mentions there is some patch for this, but I don't know where
>> it is, it doesn't seem to be referenced from this thread.
>
> Hello Dmitry,
>
> Ah, I'm sorry, I hadn't seen Tom's mail, and I don't have the patch and
> haven't tested it :)
> I think Tom sent the patch only to you; have you tested it?
>
> kcmsock.c will change, and strp_data_ready will, I think, be locked.
>
> Tom, could you please send the patch to me? I can test it and inform you.

Hi Ozgur,

I reposted the patches as RFC "kcm: Fix lockdep issue". Please test if
you can!

Thanks,
Tom

> Regards
>
> Ozgur
>
>>> The fix patch should be for the net/kcm/kcmsock.c file, and lock
>>> functions must be added around the sk_data_ready() call.
>>> Regards
>>>
>>> Ozgur
>>>
>>>>> On Wed, Dec 27, 2017 at 10:25 AM, Dmitry Vyukov wrote:
>>>>>> On Wed, Dec 6, 2017 at 4:44 PM, Dmitry Vyukov wrote:
>>>>>>>> wrote:
>>>>>>>>> On 10/24/2017 08:20 AM, syzbot wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> syzkaller hit the following crash on
>>>>>>>>>> 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
>>>>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>>>>>>>>> compiler: gcc (GCC) 7.1.1 20170620
>>>>>>>>>> .config is attached
>>>>>>>>>> Raw console output is attached.
>>>>>>>>>> C reproducer is attached
>>>>>>>>>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>>>>>>>>> for information about syzkaller reproducers
>>>>>>>>>>
>>>>>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505
>>>>>>>>>> sock_owned_by_me include/net/sock.h:1505 [inline]
>>>>>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505
>>>>>>>>>> sock_owned_by_user include/net/sock.h:1511 [inline]
>>>>>>>>>> WARNING: CPU: 0 PID: 2996 at ./include/net/sock.h:1505
>>>>>>>>>> strp_data_ready+0x2b7/0x390 net/strparser/strparser.c:404
>>>>>>>>>> Kernel panic - not syncing: panic_on_warn set ...
>>>>>>>>>>
>>>>>>>>>> CPU: 0 PID: 2996 Comm: syzkaller142210 Not tainted 4.14.0-rc5+ #138
>>>>>>>>>> Hardware name: Google Google Compute Engine/Google Compute Engine,
>>>>>>>>>> BIOS Google 01/01/2011
>>>>>>>>>> Call Trace:
>>>>>>>>>>
>>>>>>>>>>  __dump_stack lib/dump_stack.c:16 [inline]
>>>>>>>>>>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>>>>>>>>>>  panic+0x1e4/0x417 kernel/panic.c:181
>>>>>>>>>>  __warn+0x1c4/0x1d9 kernel/panic.c:542
>>>>>>>>>>  report_bug+0x211/0x2d0 lib/bug.c:183
>>>>>>>>>>  fi
Re: WARNING in strp_data_ready
On Thu, Dec 28, 2017 at 12:59 AM, Ozgur wrote:
>
> 28.12.2017, 04:19, "Tom Herbert":
>> On Wed, Dec 27, 2017 at 12:20 PM, Ozgur wrote:
>>> 27.12.2017, 23:14, "Dmitry Vyukov":
>>>> On Wed, Dec 27, 2017 at 9:08 PM, Ozgur wrote:
>>>>> 27.12.2017, 22:21, "Dmitry Vyukov":
>>>>>> On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert wrote:
>>>>>>> Did you try the patch I posted?
>>>>>>
>>>>>> Hi Tom,
>>>>>
>>>>> Hello Dmitry,
>>>>>
>>>>>> No. And I didn't know I need to. Why?
>>>>>> If you think the patch needs additional testing, you can ask syzbot to
>>>>>> test it. See
>>>>>> https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot
>>>>>> Otherwise proceed with committing it. Or what are we waiting for?
>>>>>>
>>>>>> Thanks
>>>>>
>>>>> I think we need a fixed patch for the crash; in fact, we should check
>>>>> the patch code and test whether it solves the bug.
>>>>> How do we test it when there is no patch in the following bug?
>>>>
>>>> Hi Ozgur,
>>>>
>>>> I am not sure I completely understand what you mean. But the
>>>> reproducer for this bug (which one can use for testing) is here:
>>>> https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY
>>>> Tom also mentions there is some patch for this, but I don't know where
>>>> it is, it doesn't seem to be referenced from this thread.
>>>
>>> Hello Dmitry,
>>>
>>> Ah, I'm sorry, I hadn't seen Tom's mail, and I don't have the patch and
>>> haven't tested it :)
>>> I think Tom sent the patch only to you; have you tested it?
>>>
>>> kcmsock.c will change, and strp_data_ready will, I think, be locked.
>>>
>>> Tom, could you please send the patch to me? I can test it and inform you.
>>
>> Hi Ozgur,
>>
>> I reposted the patches as RFC "kcm: Fix lockdep issue". Please test if
>> you can!
>>
>> Thanks,
>> Tom
>
> Hello Tom,
>
> Which repos are you using? I pulled but I don't see these patches.

They are not in any public repo yet. I posted the patches to the netdev
list so they can be reviewed and tested by third parties. Posting
patches to the list is the normal path to get patches into the kernel
(http://nickdesaulniers.github.io/blog/2017/05/16/submitting-your-first-patch-to-the-linux-kernel-and-responding-to-feedback/).
These patches were applied to net-next but are simple enough that they
should apply to other branches. I will repost and target net per Dave's
directive once they are verified to fix the issue.

Tom
Re: [PATCH] strparser: initialize all callbacks
On Thu, Aug 24, 2017 at 2:38 PM, Eric Biggers wrote:
> From: Eric Biggers
>
> commit bbb03029a899 ("strparser: Generalize strparser") added more
> function pointers to 'struct strp_callbacks'; however, kcm_attach() was
> not updated to initialize them. This could cause the ->lock() and/or
> ->unlock() function pointers to be set to garbage values, causing a
> crash in strp_work().
>
> Fix the bug by moving the callback structs into static memory, so
> unspecified members are zeroed. Also constify them while we're at it.
>
> This bug was found by syzkaller, which encountered the following splat:
>
>     IP: 0x55
>     PGD 3b1ca067
>     P4D 3b1ca067
>     PUD 3b12f067
>     PMD 0
>
>     Oops: 0010 [#1] SMP KASAN
>     Dumping ftrace buffer:
>        (ftrace buffer empty)
>     Modules linked in:
>     CPU: 2 PID: 1194 Comm: kworker/u8:1 Not tainted 4.13.0-rc4-next-20170811 #2
>     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>     Workqueue: kstrp strp_work
>     task: 88006bb0e480 task.stack: 88006bb1
>     RIP: 0010:0x55
>     RSP: 0018:88006bb17540 EFLAGS: 00010246
>     RAX: dc00 RBX: 88006ce4bd60 RCX:
>     RDX: 11000d9c97bd RSI: RDI: 88006ce4bc48
>     RBP: 88006bb17558 R08: 81467ab2 R09:
>     R10: 88006bb17438 R11: 88006bb17940 R12: 88006ce4bc48
>     R13: 88003c683018 R14: 88006bb17980 R15: 88003c683000
>     FS: () GS:88006de0() knlGS:
>     CS: 0010 DS: ES: CR0: 80050033
>     CR2: 0055 CR3: 3c145000 CR4: 06e0
>     DR0: DR1: DR2:
>     DR3: DR6: fffe0ff0 DR7: 0400
>     Call Trace:
>      process_one_work+0xbf3/0x1bc0 kernel/workqueue.c:2098
>      worker_thread+0x223/0x1860 kernel/workqueue.c:2233
>      kthread+0x35e/0x430 kernel/kthread.c:231
>      ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
>     Code: Bad RIP value.
>     RIP: 0x55 RSP: 88006bb17540
>     CR2: 0055
>     ---[ end trace f0e4920047069cee ]---
>
> Here is a C reproducer (requires CONFIG_BPF_SYSCALL=y and
> CONFIG_AF_KCM=y):
>
>     #include <linux/bpf.h>
>     #include <linux/kcm.h>
>     #include <linux/types.h>
>     #include <stdint.h>
>     #include <sys/ioctl.h>
>     #include <sys/socket.h>
>     #include <sys/syscall.h>
>     #include <unistd.h>
>
>     static const struct bpf_insn bpf_insns[3] = {
>         { .code = 0xb7 }, /* BPF_MOV64_IMM(0, 0) */
>         { .code = 0x95 }, /* BPF_EXIT_INSN() */
>     };
>
>     static const union bpf_attr bpf_attr = {
>         .prog_type = 1,
>         .insn_cnt = 2,
>         .insns = (uintptr_t)&bpf_insns,
>         .license = (uintptr_t)"",
>     };
>
>     int main(void)
>     {
>         int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD,
>                              &bpf_attr, sizeof(bpf_attr));
>         int inet_fd = socket(AF_INET, SOCK_STREAM, 0);
>         int kcm_fd = socket(AF_KCM, SOCK_DGRAM, 0);
>
>         ioctl(kcm_fd, SIOCKCMATTACH,
>               &(struct kcm_attach) { .fd = inet_fd, .bpf_fd = bpf_fd });
>     }
>
> Fixes: bbb03029a899 ("strparser: Generalize strparser")
> Cc: Dmitry Vyukov
> Cc: Tom Herbert
> Signed-off-by: Eric Biggers
> ---
>  Documentation/networking/strparser.txt |  2 +-
>  include/net/strparser.h                |  2 +-
>  kernel/bpf/sockmap.c                   | 10 +-
>  net/kcm/kcmsock.c                      | 11 +--
>  net/strparser/strparser.c              |  2 +-
>  5 files changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/networking/strparser.txt b/Documentation/networking/strparser.txt
> index fe01302471ae..13081b3decef 100644
> --- a/Documentation/networking/strparser.txt
> +++ b/Documentation/networking/strparser.txt
> @@ -35,7 +35,7 @@ Functions
>  =========
>
>  strp_init(struct strparser *strp, struct sock *sk,
> -          struct strp_callbacks *cb)
> +          const struct strp_callbacks *cb)
>
>  Called to initialize a stream parser. strp is a struct of type
>  strparser that is allocated by the upper layer.
> sk is the TCP
> diff --git a/include/net/strparser.h b/include/net/strparser.h
> index 4fe966a0ad92..7dc131d62ad5 100644
> --- a/include/net/strparser.h
> +++ b/include/net/strparser.h
> @@ -138,7 +138,7 @@ void strp_done(struct strparser *strp);
>  void strp_stop(struct strparser *strp);
>  void strp_check_rcv(struct strparser *strp);
>  int strp_init(struct st
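[Restating the shape of the fix as a sketch: a file-scope const struct leaves unspecified members (here ->lock and ->unlock) NULL instead of stack garbage. The callback names below are assumed from net/kcm/kcmsock.c, not quoted from the patch body above.]

    /* Sketch of the fix's effect in kcm_attach(); names are assumptions. */
    static const struct strp_callbacks cb = {
            .rcv_msg        = kcm_rcv_strparser,
            .parse_msg      = kcm_parse_func_strparser,
            .read_sock_done = kcm_read_sock_done,
            /* .lock and .unlock stay NULL: static storage is
             * zero-initialized, unlike the old on-stack struct whose
             * new members were never set */
    };

    err = strp_init(&psock->strp, csk, &cb);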
Re: [PATCH net-next 0/3] support changing steering policies in tuntap
On Wed, Sep 27, 2017 at 4:25 PM, Willem de Bruijn wrote:
>>> In the future, both simple and sophisticated policy like RSS or other guest
>>> driven steering policies could be done on top.
>>
>> IMHO there should be a more practical example before adding all this
>> indirection. And it would be nice to understand why this queue selection
>> needs to be tun specific.
>
> I was thinking the same, and this reminds me of the various strategies
> implemented in packet fanout. tun_cpu_select_queue is analogous to
> fanout_demux_cpu, though it is tun-specific in that it requires
> tun->numqueues.
>
> Fanout accrued various strategies until it gained an eBPF variant. Just
> supporting BPF is probably sufficient here, too.

+1. In addition to packet fanout, we have SO_REUSEPORT with BPF, RPS,
RFS, etc. It would be nice if existing packet steering mechanisms could
be leveraged for tun.
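[As one concrete instance of the existing mechanisms mentioned above, here is a sketch (not tun code) of CPU-based steering with a classic-BPF program attached via SO_ATTACH_REUSEPORT_CBPF, which is analogous to fanout's CPU mode. Availability of SO_ATTACH_REUSEPORT_CBPF through <sys/socket.h> depends on libc headers.]

    #include <linux/filter.h>
    #include <sys/socket.h>

    /* Sketch: steer SO_REUSEPORT socket selection by the receiving CPU.
     * The program's return value indexes into the reuseport group.
     * Error handling omitted for brevity. */
    static int attach_cpu_steering(int fd)
    {
            struct sock_filter code[] = {
                    /* A = current CPU (classic-BPF ancillary load) */
                    { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
                    /* return A */
                    { BPF_RET | BPF_A, 0, 0, 0 },
            };
            struct sock_fprog prog = {
                    .len = sizeof(code) / sizeof(code[0]),
                    .filter = code,
            };

            return setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                              &prog, sizeof(prog));
    }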
Re: [PATCH] ila_xlat: add missing hash secret initialization
On Thu, Jun 8, 2017 at 12:54 AM, Arnd Bergmann wrote:
> While discussing the possible merits of clang warning about unused
> initialized functions, I found one function that was clearly meant to be
> called but never actually is.
>
> __ila_hash_secret_init() initializes the hash value for the ila locator.
> Apparently this is intended to prevent hash collision attacks, but it
> ends up being a read-only zero constant since there is no caller. I could
> find no indication of why it was never called; the earliest patch
> submission for the module was already like this. If my interpretation is
> right, we certainly want to backport the patch to stable kernels as well.
>
> I considered adding it to the ila_xlat_init callback, but for best effect
> the random data is read as late as possible, just before it is first used.
> The underlying net_get_random_once() is already highly optimized to avoid
> overhead when called frequently.
>
> Fixes: 7f00feaf1076 ("ila: Add generic ILA translation facility")
> Cc: sta...@vger.kernel.org
> Link: https://www.spinics.net/lists/kernel/msg2527243.html
> Signed-off-by: Arnd Bergmann
> ---
>  net/ipv6/ila/ila_xlat.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
> index 2fd5ca151dcf..77f7f8c7d93d 100644
> --- a/net/ipv6/ila/ila_xlat.c
> +++ b/net/ipv6/ila/ila_xlat.c
> @@ -62,6 +62,7 @@ static inline u32 ila_locator_hash(struct ila_locator loc)
>  {
>         u32 *v = (u32 *)loc.v32;
>
> +       __ila_hash_secret_init();
>         return jhash_2words(v[0], v[1], hashrnd);
>  }
>
> --
> 2.9.0
>

Thanks Arnd!

Acked-by: Tom Herbert
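[For context, the function being wired up follows the standard once-only secret pattern; a sketch approximated from the description above, close to but not quoted from ila_xlat.c:]

    /* The hash secret starts as zero and is filled with random bytes
     * exactly once, on first use. net_get_random_once() is a static-key
     * fast path after the first call, so calling it per packet is cheap. */
    static u32 hashrnd __read_mostly;

    static __always_inline void __ila_hash_secret_init(void)
    {
            net_get_random_once(&hashrnd, sizeof(hashrnd));
    }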
Re: [RFC v2 6/6] flow_dissector: Parse batman-adv unicast headers
On Wed, Dec 6, 2017 at 8:54 AM, Willem de Bruijn wrote:
> On Wed, Dec 6, 2017 at 5:26 AM, Sven Eckelmann wrote:
>> On Dienstag, 5. Dezember 2017 09:19:45 CET Tom Herbert wrote:
>> [...]
>>> Switch statements with cases having many LOC are hard to read, and
>>> __skb_flow_dissect is already quite convoluted to begin with.
>>>
>>> I suggest putting this in a static function similar to how MPLS and
>>> GRE are handled.
>
> Perhaps it can even be behind a static key depending on whether any
> devices are active, adjusted in batadv_hardif_(en|dis)able_interface.
>
It's already in a switch statement, so a static key shouldn't make a
difference. Also, we have made the flow dissector operate independently
of whether the end protocol is enabled (for instance, GRE is dissected
regardless of whether any GRE tunnels are configured).

Tom

>> Thanks for the feedback.
>>
>> I was not sure whether "inline" or an extra function would be preferred. I've
>> then decided to implement it like most of the other protocols. But since an
>> extra function is the preferred method for handling new protos, I will move it
>> to an extra function.
>>
>> The change can already be found in
>>
>> https://git.open-mesh.org/linux-merge.git/shortlog/refs/heads/ecsv/flowdissector
>>
>> I also saw that you've introduced in
>> commit 823b96939578 ("flow_dissector: Add control/reporting of encapsulation")
>> a flag to stop dissecting when something encapsulated was detected. It is not
>> 100% clear to me when FLOW_DIS_ENCAPSULATION should be set and
>> FLOW_DISSECTOR_F_STOP_AT_ENCAP be checked. Only when there is IP/eth in IP
>> (like in the two examples from the commit message) or also when there is an
>> ethernet header, followed by a batman-adv unicast header and again followed by
>> an ethernet header?
>
> Please implement FLOW_DISSECTOR_F_STOP_AT_ENCAP. It may
> be used in more flow dissector paths in the future.
>
> The features are also used by GRE, which can encap Ethernet, for an example
> that is closer to this protocol.
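[What honoring the flag might look like inside the batman-adv case, as a sketch under the assumption that it mirrors the existing IP-in-IP handling in __skb_flow_dissect; not the final patch:]

    /* Hedged sketch: record that an encapsulation boundary was seen,
     * and stop there when the caller asked for it; otherwise continue
     * dissecting into the inner frame. Mirrors the IPIP handling. */
    key_control->flags |= FLOW_DIS_ENCAPSULATION;
    if (flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP) {
            fdret = FLOW_DISSECT_RET_OUT_GOOD;
            break;
    }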
Re: [RFC V3 PATCH 18/26] net/netpolicy: set tx queues according to policy
On Mon, Sep 12, 2016 at 7:55 AM, wrote:
> From: Kan Liang
>
> When the device tries to transmit a packet, netdev_pick_tx is called to
> find the available tx queues. If the net policy is applied, it picks up
> the assigned tx queue from the net policy subsystem, and redirects the
> traffic to the assigned queue.
>
> Signed-off-by: Kan Liang
> ---
>  include/net/sock.h |  9 +++++++++
>  net/core/dev.c     | 20 ++++++++++++++++++--
>  2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e1e9e3d..ca97f35 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2280,4 +2280,13 @@ extern int sysctl_optmem_max;
>  extern __u32 sysctl_wmem_default;
>  extern __u32 sysctl_rmem_default;
>
> +/* Return netpolicy instance information from socket. */
> +static inline struct netpolicy_instance *netpolicy_find_instance(struct sock *sk)
> +{
> +#ifdef CONFIG_NETPOLICY
> +       if (is_net_policy_valid(sk->sk_netpolicy.policy))
> +               return &sk->sk_netpolicy;
> +#endif
> +       return NULL;
> +}
>  #endif /* _SOCK_H */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 34b5322..b9a8044 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3266,6 +3266,7 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
>                                     struct sk_buff *skb,
>                                     void *accel_priv)
>  {
> +       struct sock *sk = skb->sk;
>         int queue_index = 0;
>
>  #ifdef CONFIG_XPS
> @@ -3280,8 +3281,23 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
>                 if (ops->ndo_select_queue)
>                         queue_index = ops->ndo_select_queue(dev, skb, accel_priv,
>                                                             __netdev_pick_tx);
> -               else
> -                       queue_index = __netdev_pick_tx(dev, skb);
> +               else {
> +#ifdef CONFIG_NETPOLICY
> +                       struct netpolicy_instance *instance;
> +
> +                       queue_index = -1;
> +                       if (dev->netpolicy && sk) {
> +                               instance = netpolicy_find_instance(sk);
> +                               if (instance) {
> +                                       if (!instance->dev)
> +                                               instance->dev = dev;
> +                                       queue_index = netpolicy_pick_queue(instance, false);
> +                               }
> +                       }
> +                       if (queue_index < 0)
> +#endif

I doubt this produces the intended effect. Several drivers use
ndo_select_queue (such as mlx4), where they might do something special
for a few packets but end up calling the default handler,
__netdev_pick_tx, for most packets. So in such cases the netpolicy path
would be routinely bypassed. Maybe this code should be in
__netdev_pick_tx.

Tom

> +                               queue_index = __netdev_pick_tx(dev, skb);
> +               }
>
>                 if (!accel_priv)
>                         queue_index = netdev_cap_txqueue(dev, queue_index);
> --
> 2.5.5
>
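[A sketch of the relocation Tom suggests: hooking netpolicy inside __netdev_pick_tx itself, so drivers whose ndo_select_queue falls back to the default helper still pass through it. This reuses the netpolicy calls from the patch above at a different point; the surrounding logic is elided, not implemented.]

    /* Hedged sketch, not a complete implementation: apply netpolicy at
     * the common fallback point instead of only in netdev_pick_tx. */
    static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
    {
            struct sock *sk = skb->sk;

    #ifdef CONFIG_NETPOLICY
            if (dev->netpolicy && sk) {
                    struct netpolicy_instance *instance =
                            netpolicy_find_instance(sk);

                    if (instance) {
                            int queue_index;

                            if (!instance->dev)
                                    instance->dev = dev;
                            queue_index = netpolicy_pick_queue(instance, false);
                            if (queue_index >= 0)
                                    return queue_index;
                    }
            }
    #endif
            /* ... the existing XPS / flow-hash based selection would
             * continue here unchanged ... */
            return 0;       /* placeholder for the existing logic */
    }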