Re: [bisected] Forwarded packets occasionally has loopback output interface in Netfilter

2018-02-14 Thread Anders K. Pedersen | Cohaesio
On man, 2018-01-22 at 15:54 +0100, Anders K. Pedersen | Cohaesio wrote:
> On tor, 2018-01-11 at 10:18 -0800, Wei Wang wrote:
> > On Thu, Jan 11, 2018 at 9:25 AM, Anders K. Pedersen | Cohaesio
> > <a...@cohaesio.com> wrote:
> > > On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio
> > > wrote:
> > > > Hello,
> > > > 
> > > > On one of our border routers, Netfilter is occasionally logging
> > > > packets
> > > > with "OUT=lo" (output interface lo) even though the packet should
> > > > be
> > > > going out via a regular interface. This behavior is present on
> > > > Linux
> > > > 4.13.0 to 4.14.9, and a bisection of the problem points to
> > > > 
> > > > [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call
> > > > dst_dev_put() properly
> > > > 
> > > > as the first bad commit. This commit adds dst_dev_put() calls
> > > > before
> > > > some dst_release() calls, and dst_dev_put() does
> > > > 
> > > > dst->dev = dev_net(dst->dev)->loopback_dev;
> > > > 
> > > > (among other things), which fits the problem we're seeing.
> > > > 
> > > > The essential part of our nftables rule set that shows this
> > > > behavior is
> > > > 
> > > > chain forward {
> > > > type filter hook forward priority 0;
> > > > 
> > > > meta oif { $internal_interfaces } accept
> > > > 
> > > > meta oif lo ip daddr != 127.0.0.0/8 \
> > > > log group 0 snaplen 80 prefix "oif-lo"
> > > > counter
> > > > 
> > > > ip saddr { $our_ip_series } \
> > > > flow table acct_out \
> > > > { meta oif . rt nexthop . ip saddr
> > > > timeout 12m counter } \
> > > > accept
> > > > 
> > > > log group 0 snaplen 80 prefix "DROP" counter drop
> > > > }
> > > > 
> > > > The router only does stateless packet filtering and no
> > > > redirection or
> > > > rewriting of the packets (connection tracking, NAT, ipvs etc. are
> > > > not
> > > > even compiled for this kernel).
> > > > 
> > > > As a result of this problem we see packets that should be going
> > > > to an
> > > > internal interface (and thus accepted by the first rule above)
> > > > being
> > > > logged and dropped by the last rule. Some examples:
> > > > 
> > > > Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118
> > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF
> > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705
> > > > WINDOW=3295 ACK URGP=0 MARK=0
> > > > Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118
> > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF
> > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705
> > > > WINDOW=3295 ACK URGP=0 MARK=0
> > > > 
> > > > Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF
> > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF
> > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > 
> > > > Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34
> > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF
> > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612
> > > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > > Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo
> > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=

Re: [bisected] Forwarded packets occasionally has loopback output interface in Netfilter

2018-01-22 Thread Anders K. Pedersen | Cohaesio
On tor, 2018-01-11 at 10:18 -0800, Wei Wang wrote:
> On Thu, Jan 11, 2018 at 9:25 AM, Anders K. Pedersen | Cohaesio
> <a...@cohaesio.com> wrote:
> > On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio
> > wrote:
> > > Hello,
> > > 
> > > On one of our border routers, Netfilter is occasionally logging
> > > packets
> > > with "OUT=lo" (output interface lo) even though the packet should
> > > be
> > > going out via a regular interface. This behavior is present on
> > > Linux
> > > 4.13.0 to 4.14.9, and a bisection of the problem points to
> > > 
> > > [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call
> > > dst_dev_put() properly
> > > 
> > > as the first bad commit. This commit adds dst_dev_put() calls
> > > before
> > > some dst_release() calls, and dst_dev_put() does
> > > 
> > > dst->dev = dev_net(dst->dev)->loopback_dev;
> > > 
> > > (among other things), which fits the problem we're seeing.
> > > 
> > > The essential part of our nftables rule set that shows this
> > > behavior is
> > > 
> > > chain forward {
> > > type filter hook forward priority 0;
> > > 
> > > meta oif { $internal_interfaces } accept
> > > 
> > > meta oif lo ip daddr != 127.0.0.0/8 \
> > > log group 0 snaplen 80 prefix "oif-lo"
> > > counter
> > > 
> > > ip saddr { $our_ip_series } \
> > > flow table acct_out \
> > > { meta oif . rt nexthop . ip saddr
> > > timeout 12m counter } \
> > > accept
> > > 
> > > log group 0 snaplen 80 prefix "DROP" counter drop
> > > }
> > > 
> > > The router only does stateless packet filtering and no
> > > redirection or
> > > rewriting of the packets (connection tracking, NAT, ipvs etc. are
> > > not
> > > even compiled for this kernel).
> > > 
> > > As a result of this problem we see packets that should be going
> > > to an
> > > internal interface (and thus accepted by the first rule above)
> > > being
> > > logged and dropped by the last rule. Some examples:
> > > 
> > > Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo
> > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118
> > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF
> > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705
> > > WINDOW=3295 ACK URGP=0 MARK=0
> > > Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo
> > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118
> > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF
> > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705
> > > WINDOW=3295 ACK URGP=0 MARK=0
> > > 
> > > Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo
> > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18
> > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF
> > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103
> > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo
> > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18
> > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF
> > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103
> > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > 
> > > Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo
> > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34
> > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF
> > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612
> > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo
> > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34
> > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF
> > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612
> > > WINDOW=65535 ACK PSH URGP=0 MARK=0
> > > 
> > > It also happens for outbound traffic, where the packets are
> > > logged and
> > > counted in the acct_out flow table with "meta oif" = "lo", but a
> > > correct "rt nexthop" - an example:
> >

Re: [bisected] Forwarded packets occasionally has loopback output interface in Netfilter

2018-01-11 Thread Anders K. Pedersen | Cohaesio
On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio wrote:
> Hello,
> 
> On one of our border routers, Netfilter is occasionally logging packets
> with "OUT=lo" (output interface lo) even though the packet should be
> going out via a regular interface. This behavior is present on Linux
> 4.13.0 to 4.14.9, and a bisection of the problem points to
> 
> [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly
> 
> as the first bad commit. This commit adds dst_dev_put() calls before
> some dst_release() calls, and dst_dev_put() does
> 
> dst->dev = dev_net(dst->dev)->loopback_dev;
> 
> (among other things), which fits the problem we're seeing.
> 
> The essential part of our nftables rule set that shows this behavior is
> 
> chain forward {
> type filter hook forward priority 0;
> 
> meta oif { $internal_interfaces } accept
> 
> meta oif lo ip daddr != 127.0.0.0/8 \
> log group 0 snaplen 80 prefix "oif-lo" counter
> 
> ip saddr { $our_ip_series } \
> flow table acct_out \
> { meta oif . rt nexthop . ip saddr timeout 12m 
> counter } \
> accept
> 
> log group 0 snaplen 80 prefix "DROP" counter drop
> }
> 
> The router only does stateless packet filtering and no redirection or
> rewriting of the packets (connection tracking, NAT, ipvs etc. are not
> even compiled for this kernel).
> 
> As a result of this problem we see packets that should be going to an
> internal interface (and thus accepted by the first rule above) being
> logged and dropped by the last rule. Some examples:
> 
> Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo 
> MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 
> DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP 
> SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0
> Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo 
> MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 
> DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP 
> SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0
> 
> Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo 
> MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 
> DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP 
> SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 
> MARK=0
> Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo 
> MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 
> DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP 
> SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 
> MARK=0
> 
> Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo 
> MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 
> DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP 
> SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 
> MARK=0
> Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo 
> MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 
> DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP 
> SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 
> MARK=0
> 
> It also happens for outbound traffic, where the packets are logged and
> counted in the acct_out flow table with "meta oif" = "lo", but a
> correct "rt nexthop" - an example:
> 
> Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo 
> MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25 
> DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF PROTO=TCP 
> SPT=443 DPT=52560 SEQ=3034827396 ACK=2862814901 WINDOW=12618 ACK URGP=0 MARK=0
> 
> # nft list flow table filter acct_out|tr ',' '\n'|grep lo
> flow table acct_out {
>  "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter packets 1 
> bytes 40
> 
> I don't know if these packets are actually sent out on the correct
> outbound interface thanks to the proper nexthop (the MAC= information
> in the Netfilter log is from the received packet and thus not useful
> here).
> 
> I tried running a tcpdump on the lo interface to see if these packets
> would show up there, but during the three days I had it running, it
> only logged one such packet, while Netfilter logs 20+ outbound packets
> every day, and the one packet logged by tcpdump was *not* logged by
> Netfilter.

Further testing of the individual parts of the first bad commit shows
that the five first a

[bisected] Forwarded packets occasionally has loopback output interface in Netfilter

2017-12-26 Thread Anders K. Pedersen | Cohaesio
Hello,

On one of our border routers, Netfilter is occasionally logging packets
with "OUT=lo" (output interface lo) even though the packet should be
going out via a regular interface. This behavior is present on Linux
4.13.0 to 4.14.9, and a bisection of the problem points to

[95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly

as the first bad commit. This commit adds dst_dev_put() calls before
some dst_release() calls, and dst_dev_put() does

dst->dev = dev_net(dst->dev)->loopback_dev;

(among other things), which fits the problem we're seeing.

The essential part of our nftables rule set that shows this behavior is

chain forward {
type filter hook forward priority 0;

meta oif { $internal_interfaces } accept

meta oif lo ip daddr != 127.0.0.0/8 \
log group 0 snaplen 80 prefix "oif-lo" counter

ip saddr { $our_ip_series } \
flow table acct_out \
{ meta oif . rt nexthop . ip saddr timeout 12m counter 
} \
accept

log group 0 snaplen 80 prefix "DROP" counter drop
}

The router only does stateless packet filtering and no redirection or
rewriting of the packets (connection tracking, NAT, ipvs etc. are not
even compiled for this kernel).

As a result of this problem we see packets that should be going to an
internal interface (and thus accepted by the first rule above) being
logged and dropped by the last rule. Some examples:

Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo 
MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 
DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP 
SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0
Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo 
MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 
DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP 
SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0

Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo 
MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 
DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP 
SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 
MARK=0
Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo 
MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 
DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP 
SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 
MARK=0

Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo 
MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 
DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP 
SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 
MARK=0
Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo 
MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 
DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP 
SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 
MARK=0

It also happens for outbound traffic, where the packets are logged and
counted in the acct_out flow table with "meta oif" = "lo", but a
correct "rt nexthop" - an example:

Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo 
MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25 
DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF PROTO=TCP SPT=443 
DPT=52560 SEQ=3034827396 ACK=2862814901 WINDOW=12618 ACK URGP=0 MARK=0

# nft list flow table filter acct_out|tr ',' '\n'|grep lo
flow table acct_out {
 "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter packets 1 bytes 
40

I don't know if these packets are actually sent out on the correct
outbound interface thanks to the proper nexthop (the MAC= information
in the Netfilter log is from the received packet and thus not useful
here).

I tried running a tcpdump on the lo interface to see if these packets
would show up there, but during the three days I had it running, it
only logged one such packet, while Netfilter logs 20+ outbound packets
every day, and the one packet logged by tcpdump was *not* logged by
Netfilter.

If it helps, I can try adding each of the six dst_dev_put() calls from
the first bad commit individually to see which of them that has an
effect on this.

Regards,
Anders K. Pedersen

Re: general protection fault in dst_destroy() - 4.13.9

2017-11-26 Thread Anders K. Pedersen | Cohaesio
On man, 2017-11-20 at 17:13 +0200, Ido Schimmel wrote:
> On Sun, Nov 19, 2017 at 12:45:41PM +0000, Anders K. Pedersen |
> Cohaesio wrote:
> > Hello,
> > 
> > A few days ago, one of our routers (running Linux 4.13.9) crashed
> > due
> > to a general protection fault in dst_destroy(). At the time, it had
> > run
> > for several weeks without any problems, but then crashed three
> > times in
> > a row within a few minutes - all due to a general protection fault
> > at
> > dst_destroy()+0x35. Since then, it has run for several days without
> > any
> > further problems, so I suspect that this was triggered by a traffic
> > pattern in the routed packets, but I don't have a way to reproduce
> > it.
> > 
> > Disassembly shows that this is in the inlined dev_put(), which does
> > this_cpu_dec(*dev->pcpu_refcnt). As far as I can tell there haven't
> > been any fixes in this area since 4.13, and a Google search didn't
> > find
> > anything recent, so I'm guessing this is not a known problem.
> > 
> > I have included the kernel output via serial console below as well
> > as
> > gdb and objdump information. Please let me know, if I can provide
> > any
> > additional information.
> > 
> > 
> > [2024260.461401] general protection fault:  [#1] SMP
> > [2024260.467193] Modules linked in:
> > [2024260.470897] CPU: 15 PID: 0 Comm: swapper/15 Tainted:
> > GW   4.13.9 #2
> > [2024260.479488] Hardware name: Dell Inc. PowerEdge R730/0H21J3,
> > BIOS 2.5.5 08/16/2017
> > [2024260.488279] task: 88085b625cc0 task.stack:
> > c90e4000
> > [2024260.495277] RIP: 0010:dst_destroy+0x35/0xa0
> > [2024260.500277] RSP: 0018:88085f5c3f08 EFLAGS: 00010286
> > [2024260.506474] RAX: 88085ac0e880 RBX: 88082cf9fb00 RCX:
> > 0020
> > [2024260.514868] RDX: 88082cf9fbc0 RSI:  RDI:
> > 816786c0
> > [2024260.523258] RBP:  R08: ff00 R09:
> > 
> > [2024260.531649] R10:  R11:  R12:
> > 88085f5da678
> > [2024260.540040] R13: 000a R14: 88085b625cc0 R15:
> > 88085b625cc0
> > [2024260.548431] FS:  ()
> > GS:88085f5c() knlGS:
> > [2024260.557924] CS:  0010 DS:  ES:  CR0: 80050033
> > [2024260.564719] CR2: 7fc800e48e88 CR3: 01809000 CR4:
> > 001406e0
> > [2024260.573112] Call Trace:
> > [2024260.576113]  
> > [2024260.578618]  ? rcu_process_callbacks+0x18f/0x460
> > [2024260.584126]  ? rebalance_domains+0xe2/0x290
> > [2024260.589128]  ? __do_softirq+0x100/0x292
> > [2024260.593727]  ? irq_exit+0x92/0xa0
> > [2024260.597729]  ? smp_apic_timer_interrupt+0x39/0x50
> > [2024260.603328]  ? apic_timer_interrupt+0x7c/0x90
> > [2024260.608528]  
> > [2024260.611134]  ? cpuidle_enter_state+0x14c/0x2b0
> > [2024260.616432]  ? cpuidle_enter_state+0x128/0x2b0
> > [2024260.621731]  ? do_idle+0xf9/0x190
> > [2024260.625733]  ? cpu_startup_entry+0x5f/0x70
> > [2024260.630636]  ? start_secondary+0x12a/0x130
> > [2024260.635536]  ? secondary_startup_64+0x9f/0x9f
> > [2024260.640731] Code: f6 47 60 08 48 8b 6f 18 74 62 48 8b 43 20 48
> > 8b 40 30 48 85 c0 74 05 48
> > 89 df ff d0 48 8b 03 48 85 c0 74 0a 48 8b 80 e0 03 00 00 <65> ff 08
> > f6 43 60 80 74 26 48 8d bb
> > e0 00 00 00 e8 e6 7f 01 00
> > [2024260.662626] RIP: dst_destroy+0x35/0xa0 RSP: 88085f5c3f08
> > [2024260.669333] ---[ end trace 3c1827251806827c ]---
> > [2024260.724173] Kernel panic - not syncing: Fatal exception in
> > interrupt
> > [2024261.102792] Kernel Offset: disabled
> > [2024261.156022] Rebooting in 60 seconds..
> > [2024321.167958] ACPI MEMORY or I/O RESET_REG.
> 
> This looks very similar to a bug Eric already fixed here:
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/co
> mmit/?id=222d7dbd258dad4cd5241c43ef818141fad5a87a
> 
> I don't see it in v4.13.9 which might explain why you're still
> hitting
> it. Can you please try to reproduce with mentioned patch?

Yes, it looks like it could be related. I see that it is included in
v4.14, so we'll update to that and see if it comes back.

Thanks,
Anders

general protection fault in dst_destroy() - 4.13.9

2017-11-19 Thread Anders K. Pedersen | Cohaesio
Hello,

A few days ago, one of our routers (running Linux 4.13.9) crashed due
to a general protection fault in dst_destroy(). At the time, it had run
for several weeks without any problems, but then crashed three times in
a row within a few minutes - all due to a general protection fault at
dst_destroy()+0x35. Since then, it has run for several days without any
further problems, so I suspect that this was triggered by a traffic
pattern in the routed packets, but I don't have a way to reproduce it.

Disassembly shows that this is in the inlined dev_put(), which does
this_cpu_dec(*dev->pcpu_refcnt). As far as I can tell there haven't
been any fixes in this area since 4.13, and a Google search didn't find
anything recent, so I'm guessing this is not a known problem.

I have included the kernel output via serial console below as well as
gdb and objdump information. Please let me know, if I can provide any
additional information.


[2024260.461401] general protection fault:  [#1] SMP
[2024260.467193] Modules linked in:
[2024260.470897] CPU: 15 PID: 0 Comm: swapper/15 Tainted: GW   
4.13.9 #2
[2024260.479488] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.5.5 
08/16/2017
[2024260.488279] task: 88085b625cc0 task.stack: c90e4000
[2024260.495277] RIP: 0010:dst_destroy+0x35/0xa0
[2024260.500277] RSP: 0018:88085f5c3f08 EFLAGS: 00010286
[2024260.506474] RAX: 88085ac0e880 RBX: 88082cf9fb00 RCX: 
0020
[2024260.514868] RDX: 88082cf9fbc0 RSI:  RDI: 
816786c0
[2024260.523258] RBP:  R08: ff00 R09: 

[2024260.531649] R10:  R11:  R12: 
88085f5da678
[2024260.540040] R13: 000a R14: 88085b625cc0 R15: 
88085b625cc0
[2024260.548431] FS:  () GS:88085f5c() 
knlGS:
[2024260.557924] CS:  0010 DS:  ES:  CR0: 80050033
[2024260.564719] CR2: 7fc800e48e88 CR3: 01809000 CR4: 
001406e0
[2024260.573112] Call Trace:
[2024260.576113]  
[2024260.578618]  ? rcu_process_callbacks+0x18f/0x460
[2024260.584126]  ? rebalance_domains+0xe2/0x290
[2024260.589128]  ? __do_softirq+0x100/0x292
[2024260.593727]  ? irq_exit+0x92/0xa0
[2024260.597729]  ? smp_apic_timer_interrupt+0x39/0x50
[2024260.603328]  ? apic_timer_interrupt+0x7c/0x90
[2024260.608528]  
[2024260.611134]  ? cpuidle_enter_state+0x14c/0x2b0
[2024260.616432]  ? cpuidle_enter_state+0x128/0x2b0
[2024260.621731]  ? do_idle+0xf9/0x190
[2024260.625733]  ? cpu_startup_entry+0x5f/0x70
[2024260.630636]  ? start_secondary+0x12a/0x130
[2024260.635536]  ? secondary_startup_64+0x9f/0x9f
[2024260.640731] Code: f6 47 60 08 48 8b 6f 18 74 62 48 8b 43 20 48 8b 40 30 48 
85 c0 74 05 48
89 df ff d0 48 8b 03 48 85 c0 74 0a 48 8b 80 e0 03 00 00 <65> ff 08 f6 43 60 80 
74 26 48 8d bb
e0 00 00 00 e8 e6 7f 01 00
[2024260.662626] RIP: dst_destroy+0x35/0xa0 RSP: 88085f5c3f08
[2024260.669333] ---[ end trace 3c1827251806827c ]---
[2024260.724173] Kernel panic - not syncing: Fatal exception in interrupt
[2024261.102792] Kernel Offset: disabled
[2024261.156022] Rebooting in 60 seconds..
[2024321.167958] ACPI MEMORY or I/O RESET_REG.


[   36.620034] general protection fault:  [#1] SMP
[   36.625637] Modules linked in:
[   36.629141] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.13.9 #2
[   36.635938] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.5.5 
08/16/2017
[   36.644532] task: 88085b46a7c0 task.stack: c907c000
[   36.651333] RIP: 0010:dst_destroy+0x35/0xa0
[   36.656133] RSP: 0018:88085f283f08 EFLAGS: 00010286
[   36.662133] RAX: 2e37307830203a65 RBX: 88082ac1 RCX: 0020
[   36.670326] RDX: 88082ac100c0 RSI:  RDI: 816786c0
[   36.678521] RBP:  R08: 30e3e201 R09: 00010080007a
[   36.686714] R10: 88085f283e20 R11: ea0020c38e00 R12: 88085f29a678
[   36.694906] R13: 000a R14: 88085b46a7c0 R15: 88085b46a7c0
[   36.703102] FS:  () GS:88085f28() 
knlGS:
[   36.712395] CS:  0010 DS:  ES:  CR0: 80050033
[   36.718992] CR2: 55568c725558 CR3: 01809000 CR4: 001406e0
[   36.727184] Call Trace:
[   36.729987]  
[   36.732287]  ? rcu_process_callbacks+0x18f/0x460
[   36.737588]  ? rebalance_domains+0xe2/0x290
[   36.742388]  ? __do_softirq+0x100/0x292
[   36.746790]  ? irq_exit+0x92/0xa0
[   36.750590]  ? smp_apic_timer_interrupt+0x39/0x50
[   36.755990]  ? apic_timer_interrupt+0x7c/0x90
[   36.760987]  
[   36.763392]  ? poll_idle+0x46/0x7a
[   36.767295]  ? cpuidle_enter_state+0x102/0x2b0
[   36.772396]  ? do_idle+0xf9/0x190
[   36.776197]  ? cpu_startup_entry+0x5f/0x70
[   36.780892]  ? start_secondary+0x12a/0x130
[   36.785592]  ? secondary_startup_64+0x9f/0x9f
[   36.790590] Code: f6 47 60 08 48 8b 6f 18 74 62 48 8b 43 20 48 8b 40 30 48 
85 

Re: [jkirsher/net-queue PATCH] i40e: Add programming descriptors to cleaned_count

2017-10-22 Thread Anders K. Pedersen | Cohaesio
On lør, 2017-10-21 at 18:12 -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> This patch updates the i40e driver to include programming descriptors
> in
> the cleaned_count. Without this change it becomes possible for us to
> leak
> memory as we don't trigger a large enough allocation when the time
> comes to
> allocate new buffers and we end up overwriting a number of rx_buffers
> equal
> to the number of programming descriptors we encountered.
> 
> Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming
> status")
> Signed-off-by: Alexander Duyck 

This patch solves the remaining memory leak we've seen, so

Tested-by: Anders K. Pedersen 

Regards,
Anders

Re: Linux 4.12+ memory leak on router with i40e NICs

2017-10-22 Thread Anders K. Pedersen | Cohaesio
On tor, 2017-10-19 at 08:40 -0700, Alexander Duyck wrote:
> On Thu, Oct 19, 2017 at 5:19 AM, Anders K. Pedersen | Cohaesio
> <a...@cohaesio.com> wrote:
> > Hi Alex,
> > 
> > On ons, 2017-10-18 at 16:37 -0700, Alexander Duyck wrote:
> > > When we last talked I had asked if you could do a git bisect to
> > > find
> > > the memory leak and you said you would look into it. The most
> > > useful
> > > way to solve this would be to do a git bisect between your
> > > current
> > > kernel and the 4.11 kernel to find the point at which this
> > > started.
> > > If
> > > we can do that then fixing this becomes much simpler as we just
> > > have
> > > to fix the patch that introduced the issue.
> > 
> > We're also seeing a smaller memory leak (about 1 GB per day) than
> > the
> > original one even with the "Fix memory leak related filter
> > programming
> > status" fix applied. So far I've determined that the leak is
> > present on
> > 4.13.7 and was introduced between 4.11 and 4.12, so I'll do another
> > round of bisection to identify the patch that introduced this.
> > 
> > Since the router must run for a couple of hours before I can be
> > sure
> > whether a kernel is good or bad, and I can't reboot it during
> > working
> > hours, it'll probably be about a week before I have a result.
> > 
> > --
> > Venlig hilsen / Best Regards
> > 
> > Anders K. Pedersen
> > Senior Technical Manager
> 
> Anders,
> 
> I'll do some digging on my side to see if I can find any other memory
> leaks that might be floating around in the driver that could have
> been
> introduced during that time-frame.
> 
> One thing you might try that would help with your testing would be to
> just disable the ATR functionality in i40e. You can do that with the
> ethtool command "ethtool --set-priv-flags  flow-director-atr
> off". That should allow you to bisect this without needing to deal
> with the "programming status" patches since you won't be programming
> ATR filters which is what caused that leak.
> 
> Thanks for looking into this.
> 
> - Alex

Hi Alex,

I began bisecting, where I applied the known fix patches to the steps,
where they were applicable (i.e. without changing the flow-director-atr 
flag), but some of the steps had a high amount of packet drops, which
caused problems for our network, so I couldn't leave them running for
several hours, which is necessary to determine if the leak is present
or not. The part of the bisection I got through had the same outcome as
the last bisection, which led to "i40e: Fix support for flow
director programming status".

After that I experimented a bit with the flow-director-atr flag, and it
turns out that if I disable this flag on all the NICs, then the memory
leak is gone, so I suspected that the smaller memory leak was also
caused by "i40e: Fix support for flow director programming status".

I tried to revert this patch from 4.13 (with manual fixup for the trace
point that had been added later), but that brought back the packet
drops, so I couldn't let it run.

This morning I saw your "i40e: Add programming descriptors to
cleaned_count" patch, so I tried 4.13.9 with that patch and the
previous "i40e: Fix memory leak related filter programming status"
without turning off the flow-director-atr flag. So far this combination
is running stable without any memory leaks.

Thanks for fixing this.

Regards,
Anders

Re: Linux 4.12+ memory leak on router with i40e NICs

2017-10-19 Thread Anders K. Pedersen | Cohaesio
Hi Alex,

On ons, 2017-10-18 at 16:37 -0700, Alexander Duyck wrote:
> When we last talked I had asked if you could do a git bisect to find
> the memory leak and you said you would look into it. The most useful
> way to solve this would be to do a git bisect between your current
> kernel and the 4.11 kernel to find the point at which this started.
> If
> we can do that then fixing this becomes much simpler as we just have
> to fix the patch that introduced the issue.

We're also seeing a smaller memory leak (about 1 GB per day) than the
original one even with the "Fix memory leak related filter programming
status" fix applied. So far I've determined that the leak is present on
4.13.7 and was introduced between 4.11 and 4.12, so I'll do another
round of bisection to identify the patch that introduced this.

Since the router must run for a couple of hours before I can be sure
whether a kernel is good or bad, and I can't reboot it during working
hours, it'll probably be about a week before I have a result.

-- 
Venlig hilsen / Best Regards

Anders K. Pedersen
Senior Technical Manager

Cohaesio A/S  email:  a...@cohaesio.com
Parken | Per Henrik Lings Allé 4, 4   main:   +45 45 88 08 88
DK-2100 Copenhagen | Denmark  direct: +45 45 20 78 11

Visit us at www.cohaesio.com

Cohaesio is a leading IT solutions provider in Scandinavia, providing
certified IT outsourcing, service desk and managed hosting solutions to
professional customers.

Re: [jkirsher/net-queue PATCH] i40e: Fix memory leak related filter programming status

2017-10-10 Thread Anders K. Pedersen | Cohaesio
On ons, 2017-10-04 at 08:44 -0700, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> It looks like we weren't correctly placing the pages from buffers
> that had
> been used to return a filter programming status back on the ring. As
> a
> result they were being overwritten and tracking of the pages was
> lost.
> 
> This change works to correct that by incorporating part of
> i40e_put_rx_buffer into the programming status handler code. As a
> result we
> should now be correctly placing the pages for those buffers on the
> re-allocation list instead of letting them stay in place.
> 
> Fixes: 0e626ff7ccbf ("i40e: Fix support for flow director programming
> status")
> Reported-by: Anders K. Pedersen 
> Signed-off-by: Alexander Duyck 

We rebooted the router that had memory leaks on 4.12 and 4.13 kernels 
with 4.13.5 plus this patch this morning, and the memory leak is gone,
so

Tested-by: Anders K. Pedersen 

Thanks,
Anders

> ---
> 
> I'm submitting this for Jeff's net queue to undergo some additional
> testing
> before being submitted for net or stable to address the memory leak
> isue. The
> testing for this should be pretty straight forward since ATR filters
> seem
> to cause the issue it should be possible to trigger a pretty
> signficant
> amount of memory loss running something like a netperf TCP_CRR test
> for an
> extended period of time which will trigger multiple socket
> creation/destruction.
> 
> Anders, feel free to test this patch. If you need to grab a copy of
> the
> diff instead of trying to work with applying the patch through email
> you
> should be able to find a copy at the following URL shortly after I
> submit
> this to intel-wired-lan:
>   http://patchwork.ozlabs.org/project/intel-wired-lan/list/
> 
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c |   63 +++
> 
>  1 file changed, 36 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 94311e3e4f43..d4ae24674a70 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -1038,6 +1038,32 @@ static bool i40e_set_new_dynamic_itr(struct
> i40e_ring_container *rc)
>  }
>  
>  /**
> + * i40e_reuse_rx_page - page flip buffer and store it back on the
> ring
> + * @rx_ring: rx descriptor ring to store buffers on
> + * @old_buff: donor buffer to have page reused
> + *
> + * Synchronizes page for reuse by the adapter
> + **/
> +static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
> +struct i40e_rx_buffer *old_buff)
> +{
> + struct i40e_rx_buffer *new_buff;
> + u16 nta = rx_ring->next_to_alloc;
> +
> + new_buff = _ring->rx_bi[nta];
> +
> + /* update, and store next to alloc */
> + nta++;
> + rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
> +
> + /* transfer page from old buffer to new buffer */
> + new_buff->dma   = old_buff->dma;
> + new_buff->page  = old_buff->page;
> + new_buff->page_offset   = old_buff->page_offset;
> + new_buff->pagecnt_bias  = old_buff->pagecnt_bias;
> +}
> +
> +/**
>   * i40e_rx_is_programming_status - check for programming status
> descriptor
>   * @qw: qword representing status_error_len in CPU ordering
>   *
> @@ -1071,15 +1097,24 @@ static void
> i40e_clean_programming_status(struct i40e_ring *rx_ring,
> union i40e_rx_desc
> *rx_desc,
> u64 qw)
>  {
> - u32 ntc = rx_ring->next_to_clean + 1;
> + struct i40e_rx_buffer *rx_buffer;
> + u32 ntc = rx_ring->next_to_clean;
>   u8 id;
>  
>   /* fetch, update, and store next to clean */
> + rx_buffer = _ring->rx_bi[ntc++];
>   ntc = (ntc < rx_ring->count) ? ntc : 0;
>   rx_ring->next_to_clean = ntc;
>  
>   prefetch(I40E_RX_DESC(rx_ring, ntc));
>  
> + /* place unused page back on the ring */
> + i40e_reuse_rx_page(rx_ring, rx_buffer);
> + rx_ring->rx_stats.page_reuse_count++;
> +
> + /* clear contents of buffer_info */
> + rx_buffer->page = NULL;
> +
>   id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
> I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
>  
> @@ -1648,32 +1683,6 @@ static bool i40e_cleanup_headers(struct
> i40e_ring *rx_ring, struct sk_buff *skb,
>  }
>  
>  /**
> - * i40e_reuse_rx_page - page flip buffer and store it back on the
> ring
> - * @rx_ring: rx descriptor ring to store buffers on
> - * @old_buff: donor buffer to have page reused
> - *
> - * Synchronizes page for reuse by the adapter
> - **/
> -static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
> -struct i40e_rx_buffer *old_buff)
> -{
> - struct i40e_rx_buffer *new_buff;
> - u16 nta = rx_ring->next_to_alloc;
> -
> - new_buff = 

Re: Linux 4.12+ memory leak on router with i40e NICs

2017-10-04 Thread Anders K. Pedersen | Cohaesio
On ons, 2017-10-04 at 08:32 -0700, Alexander Duyck wrote:
> On Wed, Oct 4, 2017 at 5:56 AM, Anders K. Pedersen | Cohaesio
> <a...@cohaesio.com> wrote:
> > Hello,
> > 
> > After updating one of our Linux based routers to kernel 4.13 it
> > began
> > leaking memory quite fast (about 1 GB every half hour). To narrow
> > we
> > tried various kernel versions and found that 4.11.12 is okay, while
> > 4.12 also leaks, so we did a bisection between 4.11 and 4.12.
> > 
> > The first bisection ended at
> > "[6964e53f55837b0c49ed60d36656d2e0ee4fc27b] i40e: fix handling of
> > HW
> > ATR eviction", which fixes some flag handling that was broken by
> > 47994c119a36 "i40e: remove hw_disabled_flags in favor of using
> > separate
> > flag bits", so I did a second bisection, where I added 6964e53f5583
> > "i40e: fix handling of HW ATR eviction" to the steps that had
> > 47994c119a36 "i40e: remove hw_disabled_flags in favor of using
> > separate
> > flag bits" in them.
> > 
> > The second bisection ended at
> > "[0e626ff7ccbfc43c6cc4aeea611c40b899682382] i40e: Fix support for
> > flow
> > director programming status", where I don't see any obvious
> > problems,
> > so I'm hoping for some assistance.
> > 
> > The router is a PowerEdge R730 server (Haswell based) with three
> > Intel
> > NICs (all using the i40e driver):
> > 
> > X710 quad port 10 GbE SFP+: eth0 eth1 eth2 eth3
> > X710 quad port 10 GbE SFP+: eth4 eth5 eth6 eth7
> > XL710 dual port 40 GbE QSFP+: eth8 eth9
> > 
> > The NICs are aggregated with LACP with the team driver:
> > 
> > team0: eth9 (40 GbE selected primary), and eth3, eth7 (10 GbE non-
> > selected backups)
> > team1: eth0, eth1, eth4, eth5 (all 10 GbE selected)
> > 
> > team0 is used for internal networks and has one untagged and four
> > tagged VLAN interfaces, while team1 has an external uplink
> > connection
> > without any VLANs.
> > 
> > The router runs an eBGP session on team1 to one of our uplinks, and
> > iBGP via team0 to our other border routers. It also runs OSPF on
> > the
> > internal VLANs on team0. One thing I've noticed is that when OSPF
> > is
> > not announcing a default gateway to the internal networks, so there
> > is
> > almost no traffic coming in on team0 and out on team1, but still
> > plenty
> > of traffic coming in on team1 and out via team0, there's no memory
> > leak
> > (or at least it is so small that we haven't detected it). But as
> > soon
> > as we configure OSPF to announce a default gateway to the internal
> > VLANs, so we get traffic from team0 to team1 the leaking begins.
> > Stopping the OSPF default gateway announcement again also stops the
> > leaking, but does not release already leaked memory.
> > 
> > So this leads to me suspect that the leaking is related to RX on
> > team0
> > (where XL710 eth9 is normally the only active interface) or TX on
> > team1
> > (X710 eth0, eth1, eth4, eth5). The first bad commit is related to
> > RX
> > cleaning, which suggests RX on team0. Since we're only seeing the
> > leak
> > for our outbound traffic, I suspect either a difference between the
> > X710 vs. XL710 NICs, or that the inbound traffic is for relatively
> > few
> > destination addresses (only our own systems) while the outbound
> > traffic
> > is for many different addresses on the internet. But I'm just
> > guessing
> > here.
> > 
> > I've tried kmemleak, but it only found a few kB of suspected memory
> > leaks (several of which disappeared again after a while).
> > 
> > Below I've included more details - git bisect logs, ethtool -i,
> > dmesg,
> > Kernel .config, and various memory related /proc files. Any help or
> > suggestions would be much appreciated, and please let me know if
> > more
> > information is needed or there's something I should try.
> > 
> > Regards,
> > Anders K. Pedersen
> > 
> 
> Hi Anders,
> 
> I think I see the problem and should have a patch submitted shortly
> to
> address it. From what I can tell it looks like the issue is that we
> weren't properly recycling the pages associated with descriptors that
> contained an Rx programming status. For now the workaround would be
> to
> try disabling ATR via the "ethtool --set-priv-flags" command. I
> should
> have a patch out in the next hour or so that you can try testing to
> verify if it addresses the issue.
> 
> Thanks.
> 
> - Alex

Thanks Alex,

I will test the patch in our next service window on Tuesday morning.

Regards,
Anders