Re: net: deadlock on genl_mutex

2016-11-28 Thread subashab


Issue was reported yesterday and is under investigation.


http://marc.info/?l=linux-netdev&m=148014004331663&w=2


Thanks !


Hi Dmitry

Can you try the patch below with your reproducer? I haven't seen similar 
crashes reported after this (or even with Eric's patch).


https://patchwork.ozlabs.org/patch/699937/

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: Crash due to mutex genl_lock called from RCU context

2016-11-25 Thread subashab

Oh well, this won't work, since sk->sk_destruct will be called from RCU
callback.

Grabbing the mutex should not be done from netlink_sock_destruct() but
from netlink_release()

Maybe this patch would be better :

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 62bea4591054..cce10e3c9b68 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -324,16 +324,6 @@ static void netlink_skb_set_owner_r(struct sk_buff *skb, struct sock *sk)

 static void netlink_sock_destruct(struct sock *sk)
 {
-   struct netlink_sock *nlk = nlk_sk(sk);
-
-   if (nlk->cb_running) {
-   if (nlk->cb.done)
-   nlk->cb.done(&nlk->cb);
-
-   module_put(nlk->cb.module);
-   kfree_skb(nlk->cb.skb);
-   }
-
skb_queue_purge(&sk->sk_receive_queue);

if (!sock_flag(sk, SOCK_DEAD)) {
@@ -456,8 +446,9 @@ static struct sock *netlink_lookup(struct net *net, int protocol, u32 portid)

rcu_read_lock();
sk = __netlink_lookup(table, portid, net);
-   if (sk)
-   sock_hold(sk);
+   if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
+   sk = NULL;
+
rcu_read_unlock();

return sk;
@@ -581,6 +572,7 @@ static int __netlink_create(struct net *net, struct socket *sock,
}
init_waitqueue_head(&nlk->wait);

+   sock_set_flag(sk, SOCK_RCU_FREE);
sk->sk_destruct = netlink_sock_destruct;
sk->sk_protocol = protocol;
return 0;
@@ -645,13 +637,6 @@ static int netlink_create(struct net *net, struct socket *sock, int protocol,
goto out;
 }

-static void deferred_put_nlk_sk(struct rcu_head *head)
-{
-	struct netlink_sock *nlk = container_of(head, struct netlink_sock, rcu);
-
-	sock_put(&nlk->sk);
-}
-
 static int netlink_release(struct socket *sock)
 {
struct sock *sk = sock->sk;
@@ -724,7 +709,19 @@ static int netlink_release(struct socket *sock)
local_bh_disable();
sock_prot_inuse_add(sock_net(sk), &netlink_proto, -1);
local_bh_enable();
-   call_rcu(&nlk->rcu, deferred_put_nlk_sk);
+   if (nlk->cb_running) {
+   mutex_lock(nlk->cb_mutex);
+   if (nlk->cb_running) {
+   if (nlk->cb.done)
+   nlk->cb.done(&nlk->cb);
+
+   module_put(nlk->cb.module);
+   kfree_skb(nlk->cb.skb);
+   nlk->cb_running = false;
+   }
+   mutex_unlock(nlk->cb_mutex);
+   }
+   sock_put(sk);
return 0;
 }

diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 3cfd6cc60504..5dc08a7b0a2b 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -32,7 +32,6 @@ struct netlink_sock {
struct module   *module;

struct rhash_head   node;
-   struct rcu_head rcu;
 };

 static inline struct netlink_sock *nlk_sk(struct sock *sk)
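
For reference, here is a minimal, self-contained sketch (my own
illustration, not code from the patch) of the lookup pattern the change
relies on: once the socket is freed only after an RCU grace period
(SOCK_RCU_FREE), a reader that finds it in the hash table may take a
reference only while the refcount is still non-zero; otherwise the entry
must be treated as already gone.

/* find_in_table() is a hypothetical stand-in for __netlink_lookup(). */
static struct sock *find_in_table(u32 portid);

static struct sock *example_rcu_lookup(u32 portid)
{
	struct sock *sk;

	rcu_read_lock();
	sk = find_in_table(portid);
	if (sk && !atomic_inc_not_zero(&sk->sk_refcnt))
		sk = NULL;	/* refcount already dropped to zero; being freed */
	rcu_read_unlock();

	return sk;	/* caller must sock_put() when done */
}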


Thanks Eric! I'll try this and get back with results over this weekend.

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project


Crash due to mutex genl_lock called from RCU context

2016-11-25 Thread subashab
We are seeing a crash due to the genl_lock mutex being acquired in RCU
context.

Crash is seen on a 4.4 based kernel ARM64 device. This occurred in a
regression rack, so unfortunately I don't have steps for a reproducer.

It looks like freeing socket in RCU was brought in through commit
21e4902aea80ef35afc00ee8d2abdea4f519b7f7 ("netlink: Lockless lookup with
RCU grace period in socket release").
I am not very familiar with generic netlink sockets so I am not sure
if there is any other way to fix this apart from reverting this patch.

Any pointers to debug this would be appreciated.

Here is the call stack -

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:98

in_atomic(): 1, irqs_disabled(): 0, pid: 16400, name: busybox
[] ___might_sleep+0x134/0x144
[] __might_sleep+0x7c/0x8c
[] mutex_lock+0x2c/0x4c
[] genl_lock+0x1c/0x24
[] genl_lock_done+0x2c/0x50
[] netlink_sock_destruct+0x30/0x94
[] sk_destruct+0x2c/0x150
[] __sk_free+0x9c/0xc4
[] sk_free+0x40/0x4c
[] deferred_put_nlk_sk+0x40/0x4c
[] rcu_process_callbacks+0x4d4/0x644
[] __do_softirq+0x1b8/0x3c4
[] irq_exit+0x80/0xd4
[] handle_IPI+0x1c0/0x364
[] gic_handle_irq+0x154/0x1a4
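
To make the constraint explicit, here is a minimal sketch (my own
illustration, not kernel code) of the pattern the stack above runs into:
call_rcu() callbacks execute in softirq context, so any mutex taken
underneath them - here genl_lock() via the dump's done handler invoked
from netlink_sock_destruct() - trips might_sleep().

#include <linux/mutex.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

static DEFINE_MUTEX(example_lock);	/* stands in for genl_mutex */

struct example_obj {
	struct rcu_head rcu;
};

static void example_rcu_free(struct rcu_head *head)
{
	struct example_obj *obj = container_of(head, struct example_obj, rcu);

	mutex_lock(&example_lock);	/* BUG: sleeping call in softirq context */
	/* ... the cleanup genl_lock_done() would perform ... */
	mutex_unlock(&example_lock);
	kfree(obj);
}

/* call_rcu(&obj->rcu, example_rcu_free) defers example_rcu_free() to
 * softirq context after a grace period, where sleeping is not allowed. */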

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: [PATCH net] net: Check for fullsock in sock_i_uid()

2016-11-02 Thread subashab

This would be a bug in the caller.

Can you give us the complete stack trace leading to the problem you
had ?

Thanks !


Thanks Eric for the clarification. In that case, the bug is in the
IDLETIMER target in the Android kernel.

https://android.googlesource.com/kernel/common/+/android-4.4/net/netfilter/xt_IDLETIMER.c#356

Here is the call stack.

-003|rwlock_bug(?, ?)
-004|arch_read_lock(inline)
-004|do_raw_read_lock(lock = 0xFFC0354E79C8)
-005|raw_read_lock_bh(lock = 0xFFC0354E79C8)
-006|sock_i_uid(sk = 0xFFC0354E77B0)
-007|from_kuid_munged(inline)
-007|reset_timer(info = 0xFFC04D17D218, skb = 0xFFC018AB98C0)
-008|idletimer_tg_target(skb = 0xFFC018AB98C0, ?)
-009|ipt_do_table(skb = 0xFFC018AB98C0, state = 0xFFC0017E6F30, 
?)
-010|iptable_mangle_hook(?, skb = 0xFFC018AB98C0, state = 
0xFFC0017E6F30)
-011|nf_iterate(head = 0xFFC0019D55B8, skb = 0xFFC018AB98C0, 
state = 0xFFC0017E6F30, elemp =

-012|nf_hook_slow(skb = 0xFFC018AB98C0, state = 0xFFC0017E6F30)
-013|NF_HOOK_COND(inline)
-013|ip_output(net = 0xFFC0019D4B00, sk = 0xFFC0354E77B0, skb = 
0xFFC018AB98C0)
-014|ip_local_out(net = 0xFFC0019D4B00, sk = 0xFFC0354E77B0, skb 
= 0xFFC018AB98C0)
-015|ip_build_and_send_pkt(skb = 0xFFC018AB98C0, sk = 
0xFFC023F2E880, saddr = 1688053952, daddr =
-016|tcp_v4_send_synack(sk = 0xFFC023F2E880, ?, ?, req = 
0xFFC0354E77B0, foc = 0xFFC0017E7110

-017|atomic_sub_return(inline)
-017|reqsk_put(inline)
-017|tcp_conn_request(?, af_ops = 0xFFC001080FC8, sk = 
0xFFC023F2E880, ?)

-018|tcp_v4_conn_request(?, ?)
-019|tcp_rcv_state_process(sk = 0xFFC023F2E880, skb = 
0xFFC018ABAD00)

-020|tcp_v4_do_rcv(sk = 0xFFC023F2E880, skb = 0xFFC018ABAD00)
-021|tcp_v4_rcv(skb = 0xFFC018ABAD00)
-022|ip_local_deliver_finish(net = 0xFFC0019D4B00, ?, skb = 
0xFFC018ABAD00)

-023|NF_HOOK_THRESH(inline)
-023|NF_HOOK(inline)
-023|ip_local_deliver(skb = 0xFFC018ABAD00)
-024|ip_rcv_finish(net = 0xFFC0019D4B00, ?, skb = 
0xFFC018ABAD00)

-025|NF_HOOK_THRESH(inline)
-025|NF_HOOK(inline)
-025|ip_rcv(skb = 0xFFC018ABAD00, dev = 0xFFC023474000, ?, ?)
-026|deliver_skb(inline)
-026|deliver_ptype_list_skb(inline)
-026|__netif_receive_skb_core(skb = 0x0A73, pfmemalloc = FALSE)
-027|__netif_receive_skb(skb = 0xFFC0BA455D40)
-028|netif_receive_skb_internal(skb = 0xFFC0BA455D40)
-029|netif_receive_skb(skb = 0xFFC0BA455D40)
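
For what it's worth, judging by the matching pointers above, the sk passed
to sock_i_uid() is the request socket from tcp_v4_send_synack(). A minimal
sketch of the guard the caller would need (my own illustration, assuming
sk_fullsock() is available on the target kernel) looks like this; request
and timewait minisocks are not full sockets, so sock_i_uid() must not be
called on them:

#include <linux/skbuff.h>
#include <net/sock.h>

/* Hypothetical helper for a caller such as xt_IDLETIMER's reset_timer(). */
static kuid_t example_skb_uid(const struct sk_buff *skb)
{
	struct sock *sk = skb->sk;

	if (sk && sk_fullsock(sk))
		return sock_i_uid(sk);

	return GLOBAL_ROOT_UID;		/* no full socket: fall back to root */
}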

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: [RFC PATCH] xfrm: Add option to reset oif in xfrm lookup

2016-08-05 Thread subashab

I need to do some additional testing next week (taking PTO the next 2
days), but this should fix your problem. Can you confirm? This is
better than a sysctl to handle the known use cases, but it does not
handle a combination of the 2 known use cases (e.g., throw your use
case into a VRF).

diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index b644a23c3db0..41f5b504a782 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -29,7 +29,7 @@ static struct dst_entry *__xfrm4_dst_lookup(struct net *net, struct flowi4 *fl4,
memset(fl4, 0, sizeof(*fl4));
fl4->daddr = daddr->a4;
fl4->flowi4_tos = tos;
-   fl4->flowi4_oif = oif;
+   fl4->flowi4_oif = l3mdev_master_ifindex_by_index(net, oif);
if (saddr)
fl4->saddr = saddr->a4;


Thanks David. This works for me on 4.4 (along with commit 
1a8524794fc7c70f44ac28e3a6e8fd637bc41f14 ('net: l3mdev: Add master 
device lookup by index')).
Let me know if you have some other approach in mind or if this needs to 
be sent as an official patch.


--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: [RFC PATCH] xfrm: Add option to reset oif in xfrm lookup

2016-08-03 Thread subashab

I can't explain the iptables output but from a FIB lookup perspective
it is using table 8 per the FIB rules, the xfrm is hit and packets
shift to 192.168.77.1 and go out what you have as eth0.

Take a look at:
  perf record -e fib:* -a -g
  perf script

And then run tcpdump on both eth0 and eth1. For me on "eth0" (which is
really eth11 for my VM setup) I see this on the ping:



You can try running these commands as is on UML.
We tried these out on 3.18 as well as on 4.4.

20:50:11.389837 ARP, Request who-has 192.168.77.2 tell 192.168.77.1, 
length 28
20:50:11.390079 ARP, Reply 192.168.77.2 is-at 02:00:12:34:02:0a, length 
28

20:50:11.390101 IP 192.168.77.1 > 192.168.77.2: ICMP 192.168.77.1 udp
port 4500 unreachable, length 168

So the packets are going out "eth0" as expected.

That said, the commands you have given do not totally transfer to
another setup. In my case I have 2 VMs with eth11 and eth12 directly
connected (VM1 eth11 <--> VM2 eth11 and ditto for eth12). You have
given one side of the commands and I have configured the other side
with the .1 addresses but not bothered to translate the xfrm commands.

That said, this seems like a contrived example -- you pin ping to
device eth1 (-I eth1), you are pinging a host on the network for eth1
but want packets to go out eth0 via the xfrm. Can you elaborate on the
real use case and problem here?


Applications may be bound to a specific interface but would try to send 
data over multiple types of networks.
Our use case here is wifi calling. In this case, we try to force packets 
to go over wifi after encryption.
The rules which we were using worked on 3.18 but we ran into issues on 
4.4.

Debugging narrowed this down to the oif being preserved through the xfrm lookup.

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project


Re: [RFC PATCH] xfrm: Add option to reset oif in xfrm lookup

2016-07-29 Thread subashab

Please don't try to workaround a bug with a sysctl.
If we have a bug here, we should fix it. Choosing
between bug A and bug B with a sysctl is not what
we are doing ;)


Sure, this was just a quick hack.


Can you give an example of your use case -- e.g., commands for others
(me) to reproduce?


Here is an equivalent set of rules. We see a difference in the oif when
we reset it versus when we preserve it.
eth1 is the interface from which traffic is generated while eth0 is the 
tunnel.


--
#Commands
echo 1 > /proc/sys/net/ipv4/ip_forward
echo 1 > /proc/sys/net/ipv4/conf/all/accept_local
echo 1 > /proc/sys/net/ipv4/conf/eth0/accept_local
echo 1 > /proc/sys/net/ipv4/conf/eth1/accept_local

ip addr add 192.168.77.2/24 dev eth0
ip link set eth0 mtu 1400
ip link set eth0 up

ip addr add 192.168.33.2/24 dev eth1
ip link set eth1 mtu 1400
ip link set eth1 up

ip ru add to 192.168.33.1 lookup 8 prio 4000
ip ru add oif eth1 lookup 8 prio 4010
ip ru add to 192.168.77.1 lookup 9 prio 4030

ip route add default dev eth1 table 8
ip route add default dev eth0 table 9

iptables -t raw -A OUTPUT -j LOG --log-prefix "RAW-OUT >> "
iptables -t mangle -A POSTROUTING -j LOG --log-prefix "MAN-PST >> "

echo 0 > /proc/sys/net/ipv4/tcp_timestamps

# out direction
ip xfrm state add src 192.168.77.2 dst 192.168.77.1 proto esp spi 0x1234 mode tunnel enc 'cbc(aes)' 0xbb31df5b207dc1c7a8512eeda0b2d0691e27bc8059dbb82df616bb9955058cd5 auth 'hmac(sha1)' 0x93b43b527d564efb9eac8cd04510b86e409f8ea7 flag af-unspec encap espinudp 4500 4500 0.0.0.0


ip xfrm policy add dir out src 192.168.33.2 tmpl src 192.168.77.2 dst 192.168.77.1 proto esp spi 0x1234 mode tunnel


# in direction
ip xfrm state add src 192.168.77.1 dst 192.168.77.2 proto esp spi 0x4321 mode tunnel enc 'cbc(aes)' 0x5d3ca96d1af2eaa9cf8f1c1cace88f550e2a5b7b82027023287e1fe2a42f7f54 auth 'hmac(sha1)' 0xcd09f850d7c0dd6dc0ed342619c1165571452f9d flag af-unspec encap espinudp 4500 4500 0.0.0.0


ip xfrm policy add dir in dst 192.168.33.2 tmpl src 192.168.77.1 dst 192.168.77.2 proto esp spi 0x4321 mode tunnel
ip xfrm policy add dir fwd dst 192.168.33.2 tmpl src 192.168.77.1 dst 192.168.77.2 proto esp spi 0x4321 mode tunnel

--

Output when resetting oif (3.18)

root@vm:~# ping -c 1 -I eth1 192.168.33.1
PING 192.168.33.1 (192.168.33.1) 56(84) bytes of data.
RAW-OUT >> IN= OUT=eth0 SRC=192.168.33.2 DST=192.168.33.1 LEN=84 
TOS=0x00 PREC=0x00 TTL=64 ID=801 DF PROTO=ICMP TYPE=8 CODE=0 ID=2040 
SEQ=1
MAN-PST >> IN= OUT=eth0 SRC=192.168.33.2 DST=192.168.33.1 LEN=84 
TOS=0x00 PREC=0x00 TTL=64 ID=801 DF PROTO=ICMP TYPE=8 CODE=0 ID=2040 
SEQ=1
RAW-OUT >> IN= OUT=eth0 SRC=192.168.77.2 DST=192.168.77.1 LEN=160 
TOS=0x00 PREC=0x00 TTL=64 ID=41757 DF PROTO=UDP SPT=4500 DPT=4500 
LEN=140
MAN-PST >> IN= OUT=eth0 SRC=192.168.77.2 DST=192.168.77.1 LEN=160 
TOS=0x00 PREC=0x00 TTL=64 ID=41757 DF PROTO=UDP SPT=4500 DPT=4500 
LEN=140


--

Output when preserving oif (4.4)

root@vm:~# ping -c 1 -I eth1 192.168.33.1
PING 192.168.33.1 (192.168.33.1) 56(84) bytes of data.
RAW-OUT >> IN= OUT=eth1 SRC=192.168.33.2 DST=192.168.33.1 LEN=84 
TOS=0x00 PREC=0x00 TTL=64 ID=20191 DF PROTO=ICMP TYPE=8 CODE=0 ID=2043 
SEQ=1
MAN-PST >> IN= OUT=eth1 SRC=192.168.33.2 DST=192.168.33.1 LEN=84 
TOS=0x00 PREC=0x00 TTL=64 ID=20191 DF PROTO=ICMP TYPE=8 CODE=0 ID=2043 
SEQ=1
RAW-OUT >> IN= OUT=eth1 SRC=192.168.77.2 DST=192.168.77.1 LEN=160 
TOS=0x00 PREC=0x00 TTL=64 ID=49515 DF PROTO=UDP SPT=4500 DPT=4500 
LEN=140
MAN-PST >> IN= OUT=eth1 SRC=192.168.77.2 DST=192.168.77.1 LEN=160 
TOS=0x00 PREC=0x00 TTL=64 ID=49515 DF PROTO=UDP SPT=4500 DPT=4500 
LEN=140


--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project




Re: [RFC] Handle error writing UINT_MAX to u32 fields

2016-06-14 Thread subashab

On 2016-06-12 20:30, subas...@codeaurora.org wrote:

The suggested change would extend the usable range of positive numbers
by one bit only. As many systems are 64 bit this does not seem forward
looking.

I would prefer to have a routine that can handle 64 bit integers with
limits (let's call it proc_doint64vec_minmax) which uses fields extra1
and extra2 of ctl_table as min and max.

Then set xfrm_table[].extra1 = 0 and xfrm_table[].extra2 = UINT_MAX if
you need a result in the u32 range.



Thanks Heinrich. Do you think we can use proc_doulongvec_minmax for 
this?


Actually proc_doulongvec_minmax does not work here.
I would expect similar problems due to casting if we use u64 
(proc_doint64vec_minmax) here.


static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table,
				       int write, void __user *buffer,
				       size_t *lenp, loff_t *ppos,
				       unsigned long convmul,
				       unsigned long convdiv)
{
	unsigned long *i, *min, *max;
	int vleft, first = 1, err = 0;

	i = (unsigned long *) data;   //This cast causes a read beyond the size of the data (u32)

	min = (unsigned long *) table->extra1;
	max = (unsigned long *) table->extra2;
	vleft = table->maxlen / sizeof(unsigned long); //vleft is 0 because maxlen is sizeof(u32), which is less than sizeof(unsigned long) on x86_64.
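
If it helps, here is a rough sketch (my own, hypothetical helper - not
something that exists in mainline) of how a u32-safe handler could
round-trip through a correctly sized unsigned long, so that neither the
data cast nor the maxlen / sizeof(unsigned long) division above goes
wrong. It assumes extra1/extra2, if set, point at unsigned long bounds as
proc_doulongvec_minmax() expects.

#include <linux/kernel.h>
#include <linux/sysctl.h>

static int example_douintvec_minmax(struct ctl_table *table, int write,
				    void __user *buffer, size_t *lenp,
				    loff_t *ppos)
{
	struct ctl_table tmp = *table;
	unsigned long val = *(u32 *)table->data;
	int ret;

	/* Point the generic handler at a correctly sized temporary. */
	tmp.data = &val;
	tmp.maxlen = sizeof(val);

	ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
	if (ret || !write)
		return ret;

	if (val > UINT_MAX)
		return -EINVAL;

	*(u32 *)table->data = val;
	return 0;
}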


--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum, a Linux Foundation Collaborative Project.


Re: [RFC] Handle error writing UINT_MAX to u32 fields

2016-06-12 Thread subashab

The suggested change would extend the usable range of positive numbers
by one bit only. As many systems are 64 bit this does not seem forward
looking.

I would prefer to have a routine that can handle 64 bit integers with
limits (let's call it proc_doint64vec_minmax) which uses fields extra1
and extra2 of ctl_table as min and max.

Then set xfrm_table[].extra1 = 0 and xfrm_table[].extra2 = UINT_MAX if
you need a result in the u32 range.



Thanks Heinrich. Do you think we can use proc_doulongvec_minmax for 
this?


Re: [PATCH iproute2] ss: Tell user about -EOPNOTSUPP for SOCK_DESTROY

2016-05-17 Thread subashab

On 2016-05-16 20:29, Lorenzo Colitti wrote:
On Tue, May 17, 2016 at 11:24 AM, David Ahern  
wrote:
As I mentioned we can print the unsupported once or per socket matched 
and

with the socket params. e.g.,

+   } else if (errno == EOPNOTSUPP) {
+   printf("Operation not supported for:\n");
+   inet_show_sock(h, diag_arg->f, diag_arg->protocol);


Actively suppressing all error messages is just wrong. I get the 
flooding

issue so I'm fine with just printing it once.


I disagree, but then I'm the one who wrote it in the first place, so
you wouldn't expect me to agree. :-) Let's see what Stephen says.


Hi Lorenzo

Would it be acceptable to have a separate column which displays the
result of the sock destroy operation per socket?

State... Killed
ESTAB Y
TIME_WAIT N

If it is not supported by the kernel, maybe print U (unsupported) for
this.


--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum, a Linux Foundation Collaborative Project.


Re: [net PATCH v2 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-04 Thread subashab

On 2016-04-04 10:31, Alexander Duyck wrote:
RFC 6864 states that the IPv4 ID field MUST NOT be used for purposes other
than fragmentation and reassembly.  Currently we are looking at this field
as a way of identifying what frames can be aggregated and which cannot for
GRO.  While this is valid for frames that do not have DF set, it is invalid
to do so if the bit is set.

In addition we were generating IPv4 ID collisions when 2 or more flows were
interleaved over the same tunnel.  To prevent that we store the result of
all IP ID checks via a "|=" instead of overwriting previous values.

With this patch we support two different approaches for the IP ID field.
The first is a non-incrementing IP ID with DF bit set.  In such a case we
simply won't write to the flush_id field in the GRO context block.  The
other option is the legacy option in which the IP ID must increment by 1
for every packet we aggregate.

In the case of the non-incrementing IP ID we will end up losing the data
that the IP ID is fixed.  However as per RFC 6864 we should be able to
write any value into the IP ID when the DF bit is set so this should cause
minimal harm.

Signed-off-by: Alexander Duyck 
---

v2: Updated patch so that we now only support one of two options.  Either
    the IP ID is fixed with DF bit set, or the IP ID is incrementing.  That
    allows us to support the fixed ID case as occurs with IPv6 to IPv4
    header translation and what is likely already out there for some
    devices with tunnel headers.

 net/core/dev.c |1 +
 net/ipv4/af_inet.c |   25 ++---
 net/ipv6/ip6_offload.c |3 ---
 3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 77a71cd68535..3429632398a4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4352,6 +4352,7 @@ static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
unsigned long diffs;

NAPI_GRO_CB(p)->flush = 0;
+   NAPI_GRO_CB(p)->flush_id = 0;

if (hash != skb_get_hash_raw(p)) {
NAPI_GRO_CB(p)->same_flow = 0;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9e481992dbae..33f6335448a2 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1324,6 +1324,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,

for (p = *head; p; p = p->next) {
struct iphdr *iph2;
+   u16 flush_id;

if (!NAPI_GRO_CB(p)->same_flow)
continue;
@@ -1347,14 +1348,24 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
(iph->tos ^ iph2->tos) |
((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));

-   /* Save the IP ID check to be included later when we get to
-* the transport layer so only the inner most IP ID is checked.
-* This is because some GSO/TSO implementations do not
-* correctly increment the IP ID for the outer hdrs.
-*/
-   NAPI_GRO_CB(p)->flush_id =
-   ((u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id);
NAPI_GRO_CB(p)->flush |= flush;
+
+   /* We must save the offset as it is possible to have multiple
+* flows using the same protocol and address pairs so we
+* need to wait until we can validate this is part of the
+* same flow with a 5-tuple or better to avoid unnecessary
+* collisions between flows.  We can support one of two
+* possible scenarios, either a fixed value with DF bit set
+* or an incrementing value with DF either set or unset.
+* In the case of a fixed value we will end up losing the
+* data that the IP ID was a fixed value, however per RFC
+* 6864 in such a case the actual value of the IP ID is
+* meant to be ignored anyway.
+*/
+   flush_id = (u16)(id - ntohs(iph2->id));
+   if (flush_id || !(iph2->frag_off & htons(IP_DF)))
+   NAPI_GRO_CB(p)->flush_id |= flush_id ^
+   NAPI_GRO_CB(p)->count;
}

NAPI_GRO_CB(skb)->flush |= flush;
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 82e9f3076028..9aa53f64dffd 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -238,9 +238,6 @@ static struct sk_buff **ipv6_gro_receive(struct sk_buff **head,
/* flush if Traffic Class fields are different */
NAPI_GRO_CB(p)->flush |= !!(first_word & htonl(0x0FF0));
NAPI_GRO_CB(p)->flush |= flush;
-
-   /* Clear flush_id, there's really no concept of ID in IPv6. */
-   NAPI_GRO_CB(p)->flush_id = 0;
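
As a quick sanity check of the arithmetic in the inet_gro_receive() hunk
above (my own example, not from the patch): if the packet already on the
GRO list has iph2->id = 1000 with count = 3 aggregated segments, an
incrementing flow presents id = 1003, so flush_id = 3 and
flush_id ^ count = 0, i.e. nothing is flushed; a fixed-ID flow with DF set
presents id = 1000, flush_id is 0 and the branch skips writing flush_id
entirely, which is exactly the "don't look at the ID when DF is set" case
the commit message describes.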

[RFC] xfrm: netdevice unregistration during decryption

2016-03-08 Thread subashab

I am observing a crash originating from the XFRM framework on a 3.18 ARM64
kernel.

get_rps_cpu() tries to dereference the skb->dev field, but the poison
pattern indicates that the device has already been freed.
The following is the crash call stack -

 55428.227024:   <2> [] get_rps_cpu+0x94/0x2f0
 55428.227027:   <2> [] netif_rx_internal+0x140/0x1cc
 55428.227030:   <2> [] netif_rx+0x74/0x94
 55428.227035:   <2> [] xfrm_input+0x754/0x7d0
 55428.227038:   <2> [] xfrm_input_resume+0x10/0x1c
 55428.227044:   <2> [] esp_input_done+0x20/0x30
 55428.227056:   <2> [] process_one_work+0x244/0x3fc
 55428.227060:   <2> [] worker_thread+0x2f8/0x418
 55428.227064:   <2> [] kthread+0xe0/0xec

-013|get_rps_cpu(
|dev = 0xFFC08B688000,
|skb = 0xFFC0C76AAC00 -> (
|  dev = 0xFFC08B688000 -> (
|name = 
"..
|name_hlist = (next = 0x, pprev = 
0xAAA


Following are the sequence of events observed -

1. Encrypted packet in receive path from netdevice queued to network 
stack


2. Encrypted packet queued for decryption (asynchronous)

static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
...
 aead_request_set_callback(req, 0, esp_input_done, skb);

3. Netdevice brought down and freed

4. Packet is decrypted and returned through callback in esp_input_done.

5. Packet is queued again for process in network stack using netif_rx.

The device appears to have been freed and, as a result, the dereference of
skb->dev in get_rps_cpu() leads to an unhandled page fault exception.

Would it make sense to detect the device going away using a netdev
notifier callback and free the packets after the asynchronous callback
returns?

Additionally, since the callback is from a worker thread, would it be
better to use netif_rx_ni() instead of netif_rx()?

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 85d1d47..f791128 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -351,7 +351,7 @@ resume:

if (decaps) {
skb_dst_drop(skb);
-   netif_rx(skb);
+   netif_rx_ni(skb);
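
One possible direction, sketched below purely as an illustration of the
idea (my own, untested, not a tested fix): pin the ingress device for the
lifetime of the asynchronous request so that skb->dev stays valid until
the resume path has finished with it.

#include <linux/netdevice.h>
#include <net/xfrm.h>

/* Submit side (would live in esp_input() or equivalent): pin the device
 * before handing the skb to the async cipher. */
static void example_pin_dev(struct sk_buff *skb)
{
	dev_hold(skb->dev);		/* paired with dev_put() in the callback */
}

/* Completion side (would live in esp_input_done()): finish, then unpin. */
static void example_input_done(struct sk_buff *skb, int nexthdr)
{
	struct net_device *dev = skb->dev;

	xfrm_input_resume(skb, nexthdr);	/* may consume the skb */
	dev_put(dev);				/* we still hold our own reference */
}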


Re: 4.1.12 kernel crash in rtnetlink_put_metrics

2016-03-07 Thread subashab
> Hmm, if it was 4.1.X like in original reporter case, I might have thought
> something like commit 0a1f59620068 ("ipv6: Initialize rt6_info properly
> in ip6_blackhole_route()") ... any chance on reproducing this on a latest
> kernel?
>

Unfortunately, I haven't encountered a similar crash on newer kernels as of now.



Re: 4.1.12 kernel crash in rtnetlink_put_metrics

2016-03-07 Thread subashab

On , Daniel Borkmann wrote:

Hi Andrew,

thanks for the report!

( Making the trace a bit more readable ... )

[41358.475254]BUG:unable to handle kernel NULL pointer dereference at 
(null)

[41358.475333]IP:[]rtnetlink_put_metrics+0x50/0x180
[...]
CallTrace:
[41358.476522][]?__nla_reserve+0x23/0xe0
[41358.476557][]?__nla_put+0x9/0xb0
[41358.476595][]?fib_dump_info+0x15e/0x3e0
[41358.476636][]?irq_entries_start+0x639/0x678
[41358.476671][]?fib_table_dump+0xf3/0x180
[41358.476708][]?inet_dump_fib+0x7d/0x100
[41358.476746][]?netlink_dump+0x121/0x270
[41358.476781][]?skb_free_datagram+0x12/0x40
[41358.476818][]?netlink_recvmsg+0x244/0x360
[41358.476855][]?sock_recvmsg+0x1d/0x30
[41358.476890][]?sock_recvmsg_nosec+0x30/0x30
[41358.476924][]?___sys_recvmsg+0x9c/0x120
[41358.476958][]?sock_recvmsg_nosec+0x30/0x30
[41358.476994][]?update_cfs_rq_blocked_load+0xc4/0x130
[41358.477030][]?hrtimer_forward+0xa4/0x1c0
[41358.477065][]?sockfd_lookup_light+0x1d/0x80
[41358.477099][]?__sys_recvmsg+0x3e/0x80
[41358.477134][]?SyS_socketcall+0xb1/0x2a0
[41358.477168][]?handle_irq_event+0x3c/0x60
[41358.477203][]?handle_edge_irq+0x7d/0x100
[41358.477238][]?rps_trigger_softirq+0x26/0x30
[41358.477273][]?flush_smp_call_function_queue+0x83/0x120
[41358.477307][]?syscall_call+0x7/0x7
[...]

Strange that rtnetlink_put_metrics() itself is not part of the above
call trace (it's an exported symbol).

So, your analysis suggests that metrics itself is NULL in this case?
(Can you confirm that?)

How frequently does this trigger? Are the seen call traces all the same 
kind?


Is there an easy way to reproduce this?

I presume you don't use any per route congestion control settings, 
right?


Thanks,
Daniel


Hi Daniel

I am observing a similar crash as well. This is on a 3.10 based ARM64 
kernel.
Unfortunately, the crash is occurring in a regression test rack, so I am 
not

sure of the exact test case to reproduce this crash. This seems to have
occurred twice so far with both cases having metrics as NULL.

|  rt = 0xFFC012DA4300 -> (
|dst = (
|  callback_head = (next = 0x0, func = 0xFF800262D040),
|  child = 0xFFC03B8BC2B0,
|  dev = 0xFFC012DA4318,
|  ops = 0xFFC012DA4318,
|  _metrics = 0,
|  expires = 0,
|  path = 0x0,
|  from = 0x0,
|  xfrm = 0x0,
|  input = 0xFFC0AD498000,
|  output = 0x00010401C411,
|  flags = 0,
|  pending_confirm = 0,
|  error = 0,
|  obsolete = 0,
|  header_len = 3,
|  trailer_len = 0,
|  __pad2 = 4096,

168539.549000:   <6> Process ip (pid: 28473, stack limit = 
0xffc04b584060)

168539.549006:   <2> Call trace:
168539.549016:   <2> [] 
rtnetlink_put_metrics+0x4c/0xec
168539.549027:   <2> [] 
rt6_fill_node.isra.34+0x2b8/0x3c8

168539.549035:   <2> [] rt6_dump_route+0x68/0x7c
168539.549043:   <2> [] fib6_dump_node+0x2c/0x74
168539.549051:   <2> [] fib6_walk_continue+0xf8/0x1b4
168539.549059:   <2> [] fib6_walk+0x5c/0xb8
168539.549067:   <2> [] inet6_dump_fib+0x104/0x234
168539.549076:   <2> [] netlink_dump+0x7c/0x1cc
168539.549084:   <2> [] 
__netlink_dump_start+0x128/0x170

168539.549093:   <2> [] rtnetlink_rcv_msg+0x12c/0x1a0
168539.549101:   <2> [] netlink_rcv_skb+0x64/0xc8
168539.549110:   <2> [] rtnetlink_rcv+0x1c/0x2c
168539.549117:   <2> [] netlink_unicast+0x108/0x1b8
168539.549125:   <2> [] netlink_sendmsg+0x27c/0x2d4
168539.549134:   <2> [] sock_sendmsg+0x8c/0xb0
168539.549143:   <2> [] SyS_sendto+0xcc/0x110

I am using the following patch as a workaround now. I do not have any
per route congestion control settings enabled.
Any pointers to debug this would be greatly appreciated.

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a67310e..c63098e 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -566,7 +566,7 @@ int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics)

int i, valid = 0;

mx = nla_nest_start(skb, RTA_METRICS);
-   if (mx == NULL)
+   if (mx == NULL || metrics == NULL)
return -ENOBUFS;

for (i = 0; i < RTAX_MAX; i++) {





[PATCH net] ipv6: addrconf: Fix recursive spin lock call

2016-02-01 Thread subashab
An RCU stall with the following backtrace was seen on a system with
forwarding, optimistic_dad and use_optimistic set. To reproduce,
set these flags and then start ipv6 autoconf.

This occurs because the device write_lock is acquired while already
holding the read_lock. Back trace below -

INFO: rcu_preempt self-detected stall on CPU { 1}  (t=2100 jiffies
 g=3992 c=3991 q=4471)
<6> Task dump for CPU 1:
<2> kworker/1:0 R  running task1216815  2 0x0002
<2> Workqueue: ipv6_addrconf addrconf_dad_work
<6> Call trace:
<2> [] el1_irq+0x68/0xdc
<2> [] _raw_write_lock_bh+0x20/0x30
<2> [] __ipv6_dev_ac_inc+0x64/0x1b4
<2> [] addrconf_join_anycast+0x9c/0xc4
<2> [] __ipv6_ifa_notify+0x160/0x29c
<2> [] ipv6_ifa_notify+0x50/0x70
<2> [] addrconf_dad_work+0x314/0x334
<2> [] process_one_work+0x244/0x3fc
<2> [] worker_thread+0x2f8/0x418
<2> [] kthread+0xe0/0xec

Change-Id: I270c05598622d400b178d758fdbd8296cf521ee8
Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 net/ipv6/addrconf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2163871..ae8ac1a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3301,7 +3301,9 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
/* Because optimistic nodes can use this address,
 * notify listeners. If DAD fails, RTM_DELADDR is sent.
 */
+   read_unlock_bh(&idev->lock);
ipv6_ifa_notify(RTM_NEWADDR, ifp);
+   read_lock_bh(&idev->lock);
}
}

-- 



[PATCH v2] ipv6: addrconf: Fix recursive spin lock call

2016-02-01 Thread subashab
An RCU stall with the following backtrace was seen on a system with
forwarding, optimistic_dad and use_optimistic set. To reproduce,
set these flags and allow ipv6 autoconf.

This occurs because the device write_lock is acquired while already
holding the read_lock. Back trace below -

INFO: rcu_preempt self-detected stall on CPU { 1}  (t=2100 jiffies
 g=3992 c=3991 q=4471)
<6> Task dump for CPU 1:
<2> kworker/1:0 R  running task1216815   2 0x0002
<2> Workqueue: ipv6_addrconf addrconf_dad_work
<6> Call trace:
<2> [] el1_irq+0x68/0xdc
<2> [] _raw_write_lock_bh+0x20/0x30
<2> [] __ipv6_dev_ac_inc+0x64/0x1b4
<2> [] addrconf_join_anycast+0x9c/0xc4
<2> [] __ipv6_ifa_notify+0x160/0x29c
<2> [] ipv6_ifa_notify+0x50/0x70
<2> [] addrconf_dad_work+0x314/0x334
<2> [] process_one_work+0x244/0x3fc
<2> [] worker_thread+0x2f8/0x418
<2> [] kthread+0xe0/0xec

v2: do addrconf_dad_kick inside read lock and then acquire write
lock for ipv6_ifa_notify as suggested by Eric

Fixes: 7fd2561e4ebdd ("net: ipv6: Add a sysctl to make optimistic
addresses useful candidates")

Cc: Eric Dumazet 
Cc: Erik Kline 
Cc: Hannes Frederic Sowa 
Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 net/ipv6/addrconf.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 38eedde..9efd9ff 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3538,6 +3538,7 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
 {
struct inet6_dev *idev = ifp->idev;
struct net_device *dev = idev->dev;
+   bool notify = false;

addrconf_join_solict(dev, &ifp->addr);

@@ -3583,7 +3584,7 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
/* Because optimistic nodes can use this address,
 * notify listeners. If DAD fails, RTM_DELADDR is sent.
 */
-   ipv6_ifa_notify(RTM_NEWADDR, ifp);
+   notify = true;
}
}

@@ -3591,6 +3592,8 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
 out:
spin_unlock(&ifp->lock);
read_unlock_bh(&idev->lock);
+   if (notify)
+   ipv6_ifa_notify(RTM_NEWADDR, ifp);
 }

 static void addrconf_dad_start(struct inet6_ifaddr *ifp)
-- 



Re: [PATCH net] ipv6: addrconf: Fix recursive spin lock call

2016-02-01 Thread subashab
> On Mon, 2016-02-01 at 11:37 -0800, Eric Dumazet wrote:
>
>> I would rather try :
>>
>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
>> index 38eeddedfc21..d6b7ab07f914 100644
>> --- a/net/ipv6/addrconf.c
>> +++ b/net/ipv6/addrconf.c
>> @@ -3538,6 +3538,7 @@ static void addrconf_dad_begin(struct inet6_ifaddr
>> *ifp)
>>  {
>>  struct inet6_dev *idev = ifp->idev;
>>  struct net_device *dev = idev->dev;
>> +bool notify = false;
>>
>>  addrconf_join_solict(dev, &ifp->addr);
>>
>> @@ -3583,7 +3584,8 @@ static void addrconf_dad_begin(struct inet6_ifaddr
>> *ifp)
>>  /* Because optimistic nodes can use this address,
>>   * notify listeners. If DAD fails, RTM_DELADDR is sent.
>>   */
>> -ipv6_ifa_notify(RTM_NEWADDR, ifp);
>> +notify = true;
>> +in6_ifa_hold(ifp);
>
> Actually the in6_ifa_hold() is not needed.
>
>>  }
>>  }
>>
>> @@ -3591,6 +3593,10 @@ static void addrconf_dad_begin(struct
>> inet6_ifaddr *ifp)
>>  out:
>>  spin_unlock(&ifp->lock);
>>  read_unlock_bh(&idev->lock);
>> +if (notify) {
>> +ipv6_ifa_notify(RTM_NEWADDR, ifp);
>> +in6_ifa_put(ifp);
>
> And in6_ifa_put() not needed once in6_ifa_hold() is removed.
>
>> +}
>>  }
>>
>>  static void addrconf_dad_start(struct inet6_ifaddr *ifp)
>>
>>

Thanks Eric. I tested the scenario with your suggestion and I don't see an
RCU stall now. I will send out v2 of this patch.




Re: WARN due to local_bh_disable called with interrupts disabled

2015-11-19 Thread subashab
>>
>> The call gic_handle_irq() sounds like a hardware IRQ func/context.
>>
>> The flush_backlog() call is due to the device is being unregistered.
>>

Yes, this is the ARM interrupt controller. It appeared as if wifi was
getting torn down around this.

>> I'm surprised to see kfree_skb() being called from hardirq context, I
>> though that was not allowed.
>>
>> AFAIK this is the reason we have: __dev_kfree_skb_any() which defer
>> freeing the SKB if (in_irq() || irqs_disabled()).
>>
>> Code:
>>  void __dev_kfree_skb_any(struct sk_buff *skb, enum skb_free_reason
>> reason)
>>  {
>>  if (in_irq() || irqs_disabled())
>>  __dev_kfree_skb_irq(skb, reason);
>>  else
>>  dev_kfree_skb(skb);
>>  }
>
> Right, but flush_backlog() is processing packets coming from RX, that
> should have no conntracking attached at all.
>
> Might be a bug in a tunnel ?

Thanks Jesper / Eric. I'll explore why a conntrack entry is
associated with this skb.




KASan warning with ping call stack

2015-09-21 Thread subashab
We are seeing a Kernel address sanitizer (KASan) out of bounds warning
during a stability test. The call stack is as follows -

<3>BUG: KASan: out of bounds on stack in
csum_partial_copy_fromiovecend+0xac/0x36c at addr ffc019537d38
<3>Read of size 8 by task Thread-208/7505
<0>page:ffbac23d4260 count:0 mapcount:0 mapping:  (null)
index:0x0
<0>flags: 0x0()
<1>page dumped because: kasan: bad access detected
<0>Call trace:
<6>[] dump_backtrace+0x0/0x1c4
<6>[] show_stack+0x10/0x1c
<6>[] dump_stack+0x74/0xc8
<6>[] kasan_report_error+0x2b0/0x408
<6>[] kasan_report+0x34/0x40
<6>[] __asan_load8+0x84/0x90
<6>[] csum_partial_copy_fromiovecend+0xa8/0x36c
<6>[] ping_getfrag+0x54/0xf0
<6>[] __ip_append_data.isra.2+0x88c/0xe5c
<6>[] ip_append_data.part.3+0xd0/0xf4
<6>[] ip_append_data+0x18/0x30
<6>[] ping_v4_sendmsg+0x5d4/0x724
<6>[] inet_sendmsg+0xe0/0x12c
<6>[] sock_aio_write+0x1ac/0x1ec
<6>[] do_sync_write+0xf4/0x14c
<6>[] vfs_write+0x128/0x210
<6>[] SyS_write+0xa8/0x114
<3>Memory state around the buggy address:
<3>ffc019537c00: f1 f1 f1 f1 00 00 00 00 00 00 00 00 00 00 00 00
<3>ffc019537c80: 00 00 00 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00
<3>ffc019537d00: f1 f1 f1 f1 00 00 f4 f4 f2 f2 f2 f2 00 00 00 00
<3>   ^
<3>ffc019537d80: 00 00 00 00 00 00 00 f4 f3 f3 f3 f3 00 00 00 00
<3>ffc019537e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1

One possible suspect is the offset check below, with a possible invalid
access past the supplied iovec array, but I am not sure about it.

int csum_partial_copy_fromiovecend(unsigned char *kdata, struct iovec *iov,
 int offset, unsigned int len, __wsum *csump)
{
  __wsum csum = *csump;
  int partial_cnt = 0, err = 0;

	/* Skip over the finished iovecs */
	while (offset >= iov->iov_len) {  //Problem might occur here
		offset -= iov->iov_len;
		iov++;
	}
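
If that suspicion is right, a defensive variant of the skip loop would
have to know how many iovecs the caller actually supplied - something like
the sketch below. Note that nr_segs is not a parameter of
csum_partial_copy_fromiovecend() today, so this is illustration only:

	/* Hypothetical: bound the skip by the caller's iovec count. */
	while (nr_segs && offset >= iov->iov_len) {
		offset -= iov->iov_len;
		iov++;
		nr_segs--;
	}
	if (!nr_segs)
		return -EFAULT;	/* offset ran past the supplied iovecs */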

The test case for this is unknown as the issue was reported in a stability
run. The hardware is an ARM64 based system with 3.18 kernel.

Any tips to debug this is highly appreciated.

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project



Re: [PATCH net-next] inet: Always increment refcount in inet_twsk_schedule

2015-07-23 Thread subashab
 Actually we do increment refcnt, for every socket found in ehash.

 Carefully read again __inet_lookup_established()

 This code is generic for ESTABLISH and TIME-WAIT sockets

 If you found a code that performed the lookup without taking the refcnt,
 please point me at it, this would be a serious bug.

From my previous observations, it appears as if
1. this check is bypassed
2. the refcount is incremented here but is decremented before it reaches
the packet processing in tcp_timewait_state_process()

I will try to debug this and update.

 Is it some Android kernel ?

 Android had private modules that needed an update in 3.18

Yes, the kernel is based on Android 3.18.



[PATCH net-next] dev: Spelling fix in comments

2015-07-23 Thread subashab
Fix the following typo
- unchainged -> unchanged

Signed-off-by: Subash Abhinov Kasiviswanathan subas...@codeaurora.org
---
 net/core/dev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 488ba6c..44d1384 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4986,7 +4986,7 @@ EXPORT_SYMBOL(netdev_all_upper_get_next_dev_rcu);
  * Gets the next netdev_adjacent-private from the dev's lower neighbour
  * list, starting from iter position. The caller must hold either hold the
  * RTNL lock or its own locking that guarantees that the neighbour lower
- * list will remain unchainged.
+ * list will remain unchanged.
  */
 void *netdev_lower_get_next_private(struct net_device *dev,
struct list_head **iter)
@@ -5041,7 +5041,7 @@ EXPORT_SYMBOL(netdev_lower_get_next_private_rcu);
  * Gets the next netdev_adjacent from the dev's lower neighbour
  * list, starting from iter position. The caller must hold RTNL lock or
  * its own locking that guarantees that the neighbour lower
- * list will remain unchainged.
+ * list will remain unchanged.
  */
 void *netdev_lower_get_next(struct net_device *dev, struct list_head **iter)
 {
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project



Re: [PATCH net-next] inet: Always increment refcount in inet_twsk_schedule

2015-07-20 Thread subashab
 //Initialize time wait socket and setup timer
 inet_twsk_alloc() tw_refcnt = 0
 __inet_twsk_hashdance() tw_refcnt = 3
 inet_twsk_schedule() tw_refcnt = 4
 inet_twsk_put() tw_refcnt = 3

 //Receive packet 1 in timewait state
 tcp_timewait_state_process() - inet_twsk_schedule tw_refcnt = 3 (no
 change)

 This is obviously wrong.

 If a timewait socket is found, do we increment its refcnt before
 proceeding.
We do not increment refcount currently when we find a timewait socket.

 I've received some private mails about tw issues, that turned to be
 caused by buggy drivers or buggy arch specific code.

 Are you crashed observed on x86 ?

This is observed on ARM devices. In the current debugging, all time wait
socket refcount changes were happening in the TCP stack only and there was
no platform / driver code involved.

According to my understanding, we would need to increment the time wait
socket refcount first before proceeding with any subsequent operations.
However, I request your expert opinion on this.




[PATCH net-next] inet: Always increment refcount in inet_twsk_schedule

2015-07-18 Thread subashab
I am seeing an issue with the reference count of time wait sockets which
leads to freeing of an active timer object. This occurs in some data stress
test setups, so I am unable to determine the exact step when it occurred.
However, I logged the refcount and was able to find out the code path
which leads to this problem.

//Initialize time wait socket and setup timer
inet_twsk_alloc() tw_refcnt = 0
__inet_twsk_hashdance() tw_refcnt = 3
inet_twsk_schedule() tw_refcnt = 4
inet_twsk_put() tw_refcnt = 3

//Receive packet 1 in timewait state
tcp_timewait_state_process() - inet_twsk_schedule tw_refcnt = 3 (no change)
TCP: tcp_v4_timewait_ack() - inet_twsk_put() tw_refcnt = 2

//Receive packet 2 in timewait state
tcp_timewait_state_process() - inet_twsk_schedule tw_refcnt = 2 (no change)
TCP: tcp_v4_timewait_ack() - inet_twsk_put() tw_refcnt = 1

//Receive packet 3 in timewait state
tcp_timewait_state_process() - inet_twsk_schedule tw_refcnt = 1 (no change)
TCP: tcp_v4_timewait_ack() - inet_twsk_put() tw_refcnt = 0

After this step, the time wait socket is destroyed along with the active
timer object. This leads to a warning being printed which eventually leads
to a crash.

ODEBUG: free active (active state 0) object type: timer_list hint:
tw_timer_handler+0x0/0x68

It appears that inet_twsk_schedule needs to increment the reference count
unconditionally, otherwise the socket will be destroyed since reference
count will be decremented each time an ack is sent out as a response for
an incoming packet.

Signed-off-by: Subash Abhinov Kasiviswanathan subas...@codeaurora.org
---
 net/ipv4/inet_timewait_sock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index cbeb022..99c349a 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -246,9 +246,9 @@ void inet_twsk_schedule(struct inet_timewait_sock *tw, const int timeo)

	tw->tw_kill = timeo <= 4*HZ;
	if (!mod_timer_pinned(&tw->tw_timer, jiffies + timeo)) {
-		atomic_inc(&tw->tw_refcnt);
		atomic_inc(&tw->tw_dr->tw_count);
	}
+	atomic_inc(&tw->tw_refcnt);
 }
 EXPORT_SYMBOL_GPL(inet_twsk_schedule);

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project



[PATCH] netfilter: nf_nat: Fix possible null dereference

2015-06-29 Thread subashab
Fix an issue where __nf_ct_ext_find() could return NULL for nat in
nf_nat_masquerade_ipv4(), and the NULL pointer could then be dereferenced.

This was detected by static analysis software.

Signed-off-by: Subash Abhinov Kasiviswanathan subas...@codeaurora.org
---
 net/ipv4/netfilter/nf_nat_masquerade_ipv4.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c b/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c
index c6eb421..4be5d70 100644
--- a/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c
+++ b/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c
@@ -38,6 +38,8 @@ nf_nat_masquerade_ipv4(struct sk_buff *skb, unsigned int hooknum,

ct = nf_ct_get(skb, &ctinfo);
nat = nfct_nat(ct);
+   if (!nat)
+   return NF_DROP;

NF_CT_ASSERT(ct && (ctinfo == IP_CT_NEW || ctinfo == IP_CT_RELATED ||
		    ctinfo == IP_CT_RELATED_REPLY));
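
An alternative worth considering (assuming nf_ct_nat_ext_add() is
available on the target kernel): attach the NAT extension on demand
instead of dropping, since a freshly created conntrack entry may
legitimately not have it yet.

	ct = nf_ct_get(skb, &ctinfo);
	nat = nf_ct_nat_ext_add(ct);	/* adds the extension if it is missing */
	if (!nat)
		return NF_DROP;		/* only fails on allocation error */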
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project



[PATCH] rps: Handle double list_add at __napi_schedule

2015-06-15 Thread subashab
When NAPI_STATE_SCHED is not set, enqueue_to_backlog()
will queue an IPI and add the backlog queue to the poll list. A packet
added by RPS onto the core could also add the NAPI backlog struct to the
poll list. This double addition to the list causes a crash -

2920.540304:   2 list_add double add: new=ffc076ed2930,
prev=ffc076ed2930, next=ffc076ed2850.
   [ffc000460dd4] __list_add+0xcc/0xf0
2921.064962:   2 [ffc000b44880] rps_trigger_softirq+0x1c/0x40
2921.070779:   2 [ffc000284a14]
generic_smp_call_function_single_interrupt+0xe8/0x12c
2921.078678:   2 [ffc00020d9ac] handle_IPI+0x8c/0x1ec
2921.083796:   2 [ffc000200714] gic_handle_irq+0x94/0xb0

Fix this race of double addition to the list by checking the NAPI state.

Acked-by: Sharat Masetty smase...@qti.qualcomm.com
Signed-off-by: Subash Abhinov Kasiviswanathan subas...@codeaurora.org

diff --git a/net/core/dev.c b/net/core/dev.c
index 6f561de..57d6d39 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3225,7 +3225,8 @@ static void rps_trigger_softirq(void *data)
 {
struct softnet_data *sd = data;

-	____napi_schedule(sd, &sd->backlog);
+	if (!test_bit(NAPI_STATE_SCHED, &sd->backlog.state))
+		____napi_schedule(sd, &sd->backlog);
	sd->received_rps++;
 }
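
For comparison, the way this invariant is usually expressed elsewhere in
net/core/dev.c is with an atomic test-and-set, so that only the CPU which
actually flips NAPI_STATE_SCHED from clear to set adds the backlog to the
poll list. A sketch of that variant (my own, meant to live inside dev.c
where ____napi_schedule() is visible, not what the patch above does):

static void example_rps_trigger_softirq(void *data)
{
	struct softnet_data *sd = data;

	/* Schedule the backlog only if we are the ones setting SCHED. */
	if (!test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state))
		____napi_schedule(sd, &sd->backlog);
	sd->received_rps++;
}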

