RE: [ PATCH v1 2/2] ibmveth: Identify ingress large send packets.

2020-10-12 Thread dwilder

On 2020-10-11 11:31, Jakub Kicinski wrote:

On Sat, 10 Oct 2020 12:51:30 -0400 Willem de Bruijn wrote:

> > @@ -1385,7 +1386,17 @@ static int ibmveth_poll(struct napi_struct *napi, 
int budget)
> > skb_put(skb, length);
> > skb->protocol = eth_type_trans(skb, netdev);
> >
> > -   if (length > netdev->mtu + ETH_HLEN) {
> > +   /* PHYP without PLSO support places a -1 in the ip
> > +* checksum for large send frames.
> > +*/
> > +   if (be16_to_cpu(skb->protocol) == ETH_P_IP) {


You can byteswap the constant, so its done at compilation time.


Thanks for the comments.

For V2 of patch I will change above to BE16_TO_CPU()




> > +   struct iphdr *iph = (struct iphdr 
*)skb->data;
> > +
> > +   iph_check = iph->check;
>
> Check against truncated/bad packets.

.. unless I missed context. Other code in this driver seems to peek in
the network and transport layer headers without additional geometry
and integrity checks, too.


Good catch, even if we trust the hypervisor to only forward valid
frames this needs to be at least mentioned in the commit message.

Also please add Fixes tags to both patches.


For V2: ( posting soon )
-Will add Fix tags
-update commit message re: validity of frames from Hypervisor.
-switch be16_to_cpu() to BE16_TO_CPU().

Thanks


RE: [(RFC) PATCH ] NULL pointer dereference on rmmod iptable_mangle.

2020-06-15 Thread dwilder

On 2020-06-15 04:44, Florian Westphal wrote:

Florian Westphal  wrote:

dwilder  wrote:
> > Since the netns core already does an unconditional synchronize_rcu after
> > the pre_exit hooks this would avoid the problem as well.
>
> Something like this?  (un-tested)

Yes.

> diff --git a/net/ipv4/netfilter/iptable_mangle.c
> b/net/ipv4/netfilter/iptable_mangle.c
> index bb9266ea3785..0d448e4d5213 100644
> --- a/net/ipv4/netfilter/iptable_mangle.c
> +++ b/net/ipv4/netfilter/iptable_mangle.c
> @@ -100,15 +100,26 @@ static int __net_init iptable_mangle_table_init(struct
> net *net)
> return ret;
>  }
>
> +static void __net_exit iptable_mangle_net_pre_exit(struct net *net)
> +{
> +   struct xt_table *table = net->ipv4.iptable_mangle;
> +
> +   if (mangle_ops)
> +   nf_unregister_net_hooks(net, mangle_ops,
> +   hweight32(table->valid_hooks));
> +}

You probably need if (table) here, not mangle_ops.
I'm not sure if it makes sense to add a new

xt_unregister_table_hook() helper, I guess one would have to try
and see if that reduces copy&paste programming or not.

>  static void __net_exit iptable_mangle_net_exit(struct net *net)
>  {
> if (!net->ipv4.iptable_mangle)
> return;
> -   ipt_unregister_table(net, net->ipv4.iptable_mangle, mangle_ops);
> +   ipt_unregister_table(net, net->ipv4.iptable_mangle, NULL);

I guess the 3rd arg could be removed from the helper.

But yes, this looks like what I had in mind.


Will there be a followup?  Otherwise I will place this on my todo-list.

Thanks.


Hi Florian

I am working on a patch.  Will post soon.  Sorry for the delay.


RE: [(RFC) PATCH ] be2net: Allow a VF to use physical link state.

2020-06-10 Thread dwilder

On 2020-06-09 14:58, Jakub Kicinski wrote:

On Mon,  8 Jun 2020 17:00:59 -0700 David Wilder wrote:
Hyper-visors owning a PF are allowed by Emulex specification to 
provide

a VF with separate physical and/or logical link states. However, on
linux, a VF driver must chose one or the other.

My scenario is a proprietary driver controlling the PF, be2net is the 
VF.

When I do a physical cable pull test the PF sends a link event
notification to the VF with the "physical" link status but this is
ignored in be2net (see be_async_link_state_process() ).

The PF is reporting the adapter type as:
0xe228   /* Device id for VF in Lancer */

I added a module parameter "use_pf_link_state". When set the VF should
ignore logical link state and use the physical link state.

However I have an issue making this work.  When the cable is pulled I
see two link statuses reported:
[1706100.767718] be2net 8002:01:00.0 eth1: Link is Down
[1706101.189298] be2net 8002:01:00.0 eth1: Link is Up

be_link_status_update() is called twice, the first with the physical 
link
status called from be_async_link_state_process(), and the second with 
the

logical link status from be_get_link_ksettings().

I am unsure why be_async_link_state_process() is called from
be_get_link_ksettings(), it results in multiple link state changes
(even in the un-patched case). If I eliminate this call then it works.
But I am un-sure of this change.

Signed-off-by: David Wilder 


Hm. Just looking at the code in __be_cmd_set_logical_link_config():


if (link_state == IFLA_VF_LINK_STATE_ENABLE ||
link_state == IFLA_VF_LINK_STATE_AUTO)
link_config |= PLINK_ENABLE;

if (link_state == IFLA_VF_LINK_STATE_AUTO)
link_config |= PLINK_TRACK;

Maybe we shouldn't set ENABLE for AUTO?


If I am understanding this correctly, this is used by the linux PF 
driver to configure

how link status is delivered to a VF.

My problem is one of interoperability between the PF (not linux) and the 
VF is running on linux.
The PF driver is implemented to the Emulex/Broadcom spec, which allows a 
PF driver to be configured such that the VF can be notified of both 
physical and logical link status, separately.




The module parameter is definitely not a good idea, what you're asking
for seems to be well within the expectation from the
.ndo_set_vf_link_state config, so it seems the driver / firmware is 
just

not implementing that right.


I am attempting to resolve an issue that the linux implementation cant 
conform to the the Emulex specification due to the implementation on 
linux.


RE: [(RFC) PATCH ] NULL pointer dereference on rmmod iptable_mangle.

2020-06-03 Thread dwilder

On 2020-06-03 15:05, Florian Westphal wrote:

David Wilder  wrote:
This crash happened on a ppc64le system running ltp network tests when 
ltp script ran "rmmod iptable_mangle".


[213425.602369] BUG: Kernel NULL pointer dereference at 0x0010
[213425.602388] Faulting instruction address: 0xc00800550bdc

[..]

In the crash we find in iptable_mangle_hook() that 
state->net->ipv4.iptable_mangle=NULL causing a NULL pointer 
dereference. net->ipv4.iptable_mangle is set to NULL in 
iptable_mangle_net_exit() and called when ip_mangle modules is 
unloaded. A rmmod task was found in the crash dump.  A 2nd crash 
showed the same problem when running "rmmod iptable_filter" 
(net->ipv4.iptable_filter=NULL).


Once a hook is registered packets will picked up a pointer from: 
net->ipv4.iptable_$table. The patch adds a call to synchronize_net() 
in ipt_unregister_table() to insure no packets are in flight that have 
picked up the pointer before completing the un-register.


This change has has prevented the problem in our testing.  However, we 
have concerns with this change as it would mean that on netns cleanup, 
we would need one synchronize_net() call for every table in use. Also, 
on module unload, there would be one synchronize_net() for every 
existing netns.


Yes, I agree with the analysis.


Signed-off-by: David Wilder 
---
 net/ipv4/netfilter/ip_tables.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/netfilter/ip_tables.c 
b/net/ipv4/netfilter/ip_tables.c

index c2670ea..97c4121 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -1800,8 +1800,10 @@ int ipt_register_table(struct net *net, const 
struct xt_table *table,

 void ipt_unregister_table(struct net *net, struct xt_table *table,
  const struct nf_hook_ops *ops)
 {
-   if (ops)
+   if (ops) {
nf_unregister_net_hooks(net, ops, 
hweight32(table->valid_hooks));
+   synchronize_net();
+   }


I'd wager ebtables, arptables and ip6tables have the same bug.

The extra synchronize_net() isn't ideal.  We could probably do it this
way and then improve in a second patch.

One way to fix this without a new synchronize_net() is to switch all
iptable_foo.c to use ".pre_exit" hook as well.

pre_exit would unregister the underlying hook and .exit would to the
table freeing.

Since the netns core already does an unconditional synchronize_rcu 
after

the pre_exit hooks this would avoid the problem as well.


Something like this?  (un-tested)

diff --git a/net/ipv4/netfilter/iptable_mangle.c 
b/net/ipv4/netfilter/iptable_mangle.c

index bb9266ea3785..0d448e4d5213 100644
--- a/net/ipv4/netfilter/iptable_mangle.c
+++ b/net/ipv4/netfilter/iptable_mangle.c
@@ -100,15 +100,26 @@ static int __net_init 
iptable_mangle_table_init(struct net *net)

return ret;
 }

+static void __net_exit iptable_mangle_net_pre_exit(struct net *net)
+{
+   struct xt_table *table = net->ipv4.iptable_mangle;
+
+   if (mangle_ops)
+   nf_unregister_net_hooks(net, mangle_ops,
+   hweight32(table->valid_hooks));
+}
+
+
 static void __net_exit iptable_mangle_net_exit(struct net *net)
 {
if (!net->ipv4.iptable_mangle)
return;
-   ipt_unregister_table(net, net->ipv4.iptable_mangle, mangle_ops);
+   ipt_unregister_table(net, net->ipv4.iptable_mangle, NULL);
net->ipv4.iptable_mangle = NULL;
 }

 static struct pernet_operations iptable_mangle_net_ops = {
+   .pre_exit = iptable_mangle_net_pre_exit,
.exit = iptable_mangle_net_exit,
 };



[patch] netlink.7: srcfix Change buffer size in example code about reading netlink message.

2016-11-14 Thread dwilder
The example code in netlink(7) (for reading netlink message) suggests 
using
a 4k read buffer with recvmsg.  This can cause truncated messages on 
systems

using a page size is >4096.  Please see:
linux/include/linux/netlink.h (in the kernel source)


/*
 *  skb should fit one page. This choice is good for headerless 
malloc.

 *  But we should limit to 8K so that userspace does not have to
 *  use enormous buffer sizes on recvmsg() calls just to avoid
 *  MSG_TRUNC when PAGE_SIZE is very large.
 */
#if PAGE_SIZE < 8192UL
#define NLMSG_GOODSIZE  SKB_WITH_OVERHEAD(PAGE_SIZE)
#else
#define NLMSG_GOODSIZE  SKB_WITH_OVERHEAD(8192UL)
#endif

#define NLMSG_DEFAULT_SIZE (NLMSG_GOODSIZE - NLMSG_HDRLEN)


I was troubleshooting some up-stream code on a ppc64le system
(page:size of 64k) This code had duplicated the example from netlink(7) 
and
was using a 4k buffer.  On x86-64 with a 4k page size this is not a 
problem,
however on the 64k page system some messages were truncated.  Using an 
8k buffer

as implied in netlink.h prevents problems with any page size.

Lets change the example so others don't propagate the problem further.

Signed-off-by David Wilder 

--- man7/netlink.7.orig 2016-11-14 13:30:36.522101156 -0800
+++ man7/netlink.7  2016-11-14 13:30:51.002086354 -0800
@@ -511,7 +511,7 @@
 .in +4n
 .nf
 int len;
-char buf[4096];
+char buf[8192];
 struct iovec iov = { buf, sizeof(buf) };
 struct sockaddr_nl sa;
 struct msghdr msg;



Re: Double free of dst_entry in ipv4_dst_destroy()

2015-12-14 Thread dwilder

Eric -
With this patch applied the test ran clean for 2 days.

Thanks for your help.

Quoting Eric Dumazet :


On Fri, 2015-12-11 at 07:48 -0800, Eric Dumazet wrote:

On Fri, 2015-12-11 at 06:23 -0800, Eric Dumazet wrote:
> On Sun, 2015-12-06 at 17:58 -0800, Eric Dumazet wrote:
> > On Sun, 2015-12-06 at 13:03 -0800, Eric Dumazet wrote:
> >
> > > But then when later we promote a skb->dst to a refctounted one
> > > (skb_dst_force(), we might make sure we abort the operation  
if __refcnt

> > > == 0 ( and DST_NOCACHE is in dst->flags)
> > >
> >
> > Minimum patch would be :
> >
>
> Here is a more complete patch, it should fix the issue I think :

Hmm, I'll send a v3, I forgot to test DST_NOCACHE properly.



David, please test the following patch, thanks !

 include/net/dst.h   |   33 +
 include/net/sock.h  |2 +-
 net/ipv4/tcp_ipv4.c |5 ++---
 net/ipv6/tcp_ipv6.c |3 +--
 4 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 1279f9b09791..c7329dcd90cc 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -322,6 +322,39 @@ static inline void skb_dst_force(struct sk_buff *skb)
}
 }

+/**
+ * dst_hold_safe - Take a reference on a dst if possible
+ * @dst: pointer to dst entry
+ *
+ * This helper returns false if it could not safely
+ * take a reference on a dst.
+ */
+static inline bool dst_hold_safe(struct dst_entry *dst)
+{
+   if (dst->flags & DST_NOCACHE)
+   return atomic_inc_not_zero(&dst->__refcnt);
+   dst_hold(dst);
+   return true;
+}
+
+/**
+ * skb_dst_force_safe - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted and not destroyed, grab a ref on it.
+ */
+static inline void skb_dst_force_safe(struct sk_buff *skb)
+{
+   if (skb_dst_is_noref(skb)) {
+   struct dst_entry *dst = skb_dst(skb);
+
+   if (!dst_hold_safe(dst))
+   dst = NULL;
+
+   skb->_skb_refdst = (unsigned long)dst;
+   }
+}
+

 /**
  * __skb_tunnel_rx - prepare skb for rx reinsert
diff --git a/include/net/sock.h b/include/net/sock.h
index eaef41433d7a..18322bded064 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -816,7 +816,7 @@ void sk_stream_write_space(struct sock *sk);
 static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
 {
/* dont let skb dst not refcounted, we are going to leave rcu lock */
-   skb_dst_force(skb);
+   skb_dst_force_safe(skb);

if (!sk->sk_backlog.tail)
sk->sk_backlog.head = skb;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index db003438aaf5..d8841a2f1569 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1493,7 +1493,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
if (likely(sk->sk_rx_dst))
skb_dst_drop(skb);
else
-   skb_dst_force(skb);
+   skb_dst_force_safe(skb);

__skb_queue_tail(&tp->ucopy.prequeue, skb);
tp->ucopy.memory += skb->truesize;
@@ -1721,8 +1721,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const  
struct sk_buff *skb)

 {
struct dst_entry *dst = skb_dst(skb);

-   if (dst) {
-   dst_hold(dst);
+   if (dst && dst_hold_safe(dst)) {
sk->sk_rx_dst = dst;
inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index e7aab561b7b4..6b8a8a9091fa 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -93,10 +93,9 @@ static void inet6_sk_rx_dst_set(struct sock *sk,  
const struct sk_buff *skb)

 {
struct dst_entry *dst = skb_dst(skb);

-   if (dst) {
+   if (dst && dst_hold_safe(dst)) {
const struct rt6_info *rt = (const struct rt6_info *)dst;

-   dst_hold(dst);
sk->sk_rx_dst = dst;
inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Double free of dst_entry in ipv4_dst_destroy()

2015-12-06 Thread dwilder

Hi-

I am seeing a crash on a distro V4.2.3 kernel caused by a double  
release of a dst_entry.  In ipv4_dst_destroy() the call to  
list_empty() finds a poisoned next pointer, indicating the dst_entry  
has already been removed from the list and freed.   The crash occurs  
18 to 24 hours into a run of a network stress exerciser.


[172135.963496] Unable to handle kernel paging request for data at  
address 0x00100108  << poison value

[172135.963913] Faulting instruction address: 0xc097f5ac
[172135.964184] Oops: Kernel access of bad area, sig: 11 [#1]
[172135.964327] SMP NR_CPUS=2048 NUMA PowerNV
[172135.964649] Modules linked in: iptable_filter ip_tables x_tables  
bnx2x cxlflash cxl mdio ses libcrc32c enclosure uio_pdrv_genirq uio  
powernv_rng sunrpc autofs4 ipr
[172135.965403] CPU: 51 PID: 65452 Comm: hxecpu Tainted: GW 
   4.2.3 #2
[172135.965482] task: c01d8a370ff0 ti: c01e2d57c000 task.ti:  
c01e2d57c000
[172135.965564] NIP: c097f5ac LR: c097f59c CTR:  

[172135.965664] REGS: c01e2d57f8e0 TRAP: 0300   Tainted: G 
W(4.2.3)
[172135.965782] MSR: 90009033   CR:  
22322442  XER: 
[172135.966033] CFAR: c0008468 DAR: 00100108 DSISR:  
4200 SOFTE: 1
GPR00: c097f59c c01e2d57fb60  
c151ad00 c01e504de300
GPR04: 0001   
c01fff8af370 c141ad00
GPR08: c16aad00 00200200  
00100100 3d7473696c5f6465
GPR12: 6531303030303063 c7b5e480  
0012 0001
GPR16:  c1431280  
c0ad38f0 7fff
GPR20:  c01e28caf000  
c01e2d57c000 c1429b80
GPR24: 000a   
c01e504ddb30 
GPR28:  c01e2d57c000  
c01e504de300 c01e28caf000  <<< pointer to dst

[172135.967772] NIP [c097f5ac] ipv4_dst_destroy+0x8c/0xc0
[172135.967921] LR [c097f59c] ipv4_dst_destroy+0x7c/0xc0
[172135.968076] Call Trace:
[172135.968133] [c01e2d57fb60] [c097f59c]  
ipv4_dst_destroy+0x7c/0xc0 (unreliable)

[172135.968306] [c01e2d57fbd0] [c093b8e0] dst_destroy+0xf0/0x1a0
[172135.968452] [c01e2d57fc10] [c093bc68]  
dst_destroy_rcu+0x28/0x50
[172135.968600] [c01e2d57fc40] [c01397a0]  
rcu_process_callbacks+0x340/0x6f0

[172135.968768] [c01e2d57fcf0] [c00ba7d8] __do_softirq+0x188/0x3a0
[172135.968913] [c01e2d57fde0] [c00bac68] irq_exit+0xc8/0x100
[172135.969056] [c01e2d57fe00] [c001f734]  
timer_interrupt+0xa4/0xe0
[172135.969223] [c01e2d57fe30] [c0002714]  
decrementer_common+0x114/0x180

[172135.969395] Instruction dump:
[172135.969492] 7fe5fb78 38842968 7fc6f378 3863e580 4811c989 6000  
7fc3f378 481154c1
[172135.969748] 6000 e93f00b8 e95f00b0 7fc3f378   
f949 3d200010 61290100

[172135.970009] ---[ end trace 34f3693ddc2d5aea ]---
--

I added a call to dump_stack() into dst_release() in an attempt to  
catch the two calls to dst_release().


-- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -307,6 +307,12 @@ void dst_release(struct dst_entry *dst)
net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
 __func__, dst, newrefcnt);
if (!newrefcnt && unlikely(dst->flags & DST_NOCACHE))
+
+   if (!list_empty(&rt->rt_uncached)) {
+   printk("%s: dst=%p\n",__func__,dst);
+   dump_stack();
+   }
+
call_rcu(&dst->rcu_head, dst_destroy_rcu);
}
 }

I got lucky and caught the following stack traces on my next run.

[26211.699357] tcp_v4_do_rcv: sk=c01d10a0 skb=c01d22c61d00  
dst=c01d22c62500 sk->sk_rx_dst=c01d22c62500

[26211.699527] dst_release: dst=c01d22c62500
[26211.699626] CPU: 51 PID: 23317 Comm: hxecom Tainted: GW  
  4.2.3 #4

[26211.699632] Call Trace:
[26211.699678] [c01cf0387440] [c0a9dcd4]  
dump_stack+0x90/0xbc (unreliable)

[26211.699829] [c01cf0387470] [c093bf80] dst_release+0x110/0x120
[26211.699875] [c01cf03874e0] [c09b4024] tcp_v4_do_rcv+0x4d4/0x4f0
[26211.699979] [c01cf0387580] [c09b7834] tcp_v4_rcv+0xb74/0xb90
[26211.700027] [c01cf0387660] [c0984bb8]  
ip_local_deliver_finish+0x178/0x350
[26211.700123] [c01cf03876b0] [c09853bc]  
ip_local_deliver+0x4c/0x120

[26211.700181] [c01cf0387720] [c0984eb4] ip_rcv_finish+0x124/0x420
[26211.700255] [c01cf03877a0] [c09857a4] ip_rcv+0x314/0x440
[26211.700312] [c01cf0387830] [