[PATCH] filter.txt: update 'tools/net/' to 'tools/bpf/'

2018-04-15 Thread Wang Sheng-Hui
The tools are located at tools/bpf/ instead of tools/net/.
Update the filter.txt doc.

Signed-off-by: Wang Sheng-Hui 
---
 Documentation/networking/filter.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index a4508ec1816b..fd55c7de9991 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -169,7 +169,7 @@ access to BPF code as well.
 BPF engine and instruction set
 --
 
-Under tools/net/ there's a small helper tool called bpf_asm which can
+Under tools/bpf/ there's a small helper tool called bpf_asm which can
 be used to write low-level filters for example scenarios mentioned in the
 previous section. Asm-like syntax mentioned here has been implemented in
 bpf_asm and will be used for further explanations (instead of dealing with
@@ -359,7 +359,7 @@ $ ./bpf_asm -c foo
 In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
 filters that might not be obvious at first, it's good to test filters before
 attaching to a live system. For that purpose, there's a small tool called
-bpf_dbg under tools/net/ in the kernel source directory. This debugger allows
+bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
 for testing BPF filters against given pcap files, single stepping through the
 BPF code on the pcap's packets and to do BPF machine register dumps.
 
@@ -483,7 +483,7 @@ Example output from dmesg:
 [ 3389.935851] JIT code: 0030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
 [ 3389.935852] JIT code: 0040: eb 02 31 c0 c9 c3
 
-In the kernel source tree under tools/net/, there's bpf_jit_disasm for
+In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
 generating disassembly out of the kernel log's hexdump:
 
 # ./bpf_jit_disasm
-- 
2.11.0





Re: [Patch net] llc: properly handle dev_queue_xmit() return value

2018-04-15 Thread Noam Rathaus
Hi,

Is there any update?

On Fri, Apr 13, 2018 at 7:49 PM, Noam Rathaus  wrote:
> Hi
>
> Any update?
>
> On Thu, 29 Mar 2018 at 14:11, Noam Rathaus  wrote:
>>
>> Hi,
>>
>> Will you notify me when it's been accepted? If not, how can I check
>> this myself to see if it was accepted?
>>
>> On Tue, Mar 27, 2018 at 8:13 PM, David Miller  wrote:
>> > From: Noam Rathaus 
>> > Date: Tue, 27 Mar 2018 16:27:49 +
>> >
>> >> Guys please fill me in on the next step?
>> >>
>> >> If it’s applied it means it’s part of the official code of the kernel
>> >> now?
>> >
>> > It means it is in my networking GIT tree and will make its way to Linus
>> > in the not so distant future.
>>
>>
>>
>> --
>>
>> Thanks,
>> Noam Rathaus
>> Beyond Security
>>
>> PGP Key ID: 7EF920D3C045D63F (Exp 2019-03)
>
> --
> Thanks,
> Noam Rathaus



-- 

Thanks,
Noam Rathaus
Beyond Security

PGP Key ID: 7EF920D3C045D63F (Exp 2019-03)


One question about __tcp_select_window()

2018-04-15 Thread Wang Jian
Hi all,

While reading the __tcp_select_window() code, I found that it may return
a smaller window.
Below is one scenario I came up with; it may not be right.
In __tcp_select_window(), assume:
full_space is 6 MSS, free_space is 2 MSS, and tp->rcv_wnd is 3 MSS.
Also assume window scaling is disabled. Then
window = tp->rcv_wnd, so window > free_space,
and the function rounds down free_space and returns it.

Is this expected behavior? The comment says
"Get the largest window that is a nice multiple of mss."

Should we do something like the change below, or am I missing something?
I don't know how to verify it yet.

--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2680,9 +2680,9 @@ u32 __tcp_select_window(struct sock *sk)
 * We also don't do any window rounding when the free space
 * is too small.
 */
-   if (window <= free_space - mss || window > free_space)
+   if (window <= free_space - mss)
window = rounddown(free_space, mss);
-   else if (mss == full_space &&
+   else if (window <= free_space && mss == full_space &&
 free_space > window + (full_space >> 1))
window = free_space;
}

Thanks.
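The arithmetic in the scenario above can be checked with a small userspace model. This is a sketch, not the kernel function: it keeps only the no-window-scaling branch shown in the diff, and the helper names `model_round_down()`, `select_window_orig()` and `select_window_proposed()` are hypothetical. The MSS value used below (1000 bytes) is arbitrary.

```c
#include <assert.h>

/* Round x down to a multiple of y, like the kernel's rounddown() for
 * non-negative x. Hypothetical helper for this model. */
static int model_round_down(int x, int y)
{
	assert(y > 0);
	return x - (x % y);
}

/* The branch as currently written (no window scaling). */
static int select_window_orig(int rcv_wnd, int free_space, int full_space,
			      int mss)
{
	int window = rcv_wnd;

	if (window <= free_space - mss || window > free_space)
		window = model_round_down(free_space, mss);
	else if (mss == full_space &&
		 free_space > window + (full_space >> 1))
		window = free_space;
	return window;
}

/* The same branch with the change proposed in the diff. */
static int select_window_proposed(int rcv_wnd, int free_space, int full_space,
				  int mss)
{
	int window = rcv_wnd;

	if (window <= free_space - mss)
		window = model_round_down(free_space, mss);
	else if (window <= free_space && mss == full_space &&
		 free_space > window + (full_space >> 1))
		window = free_space;
	return window;
}
```

With mss = 1000, full_space = 6000, free_space = 2000 and rcv_wnd = 3000, `select_window_orig()` returns 2000 (the offered window drops from 3 MSS to 2 MSS), while `select_window_proposed()` keeps 3000. The model only shows which branch each version takes; whether offering a window larger than the current free space is acceptable is the question being asked.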


Re: linux-next on x60: network manager often complains "network is disabled" after resume

2018-04-15 Thread Pavel Machek
On Mon 2018-03-26 10:33:55, Dan Williams wrote:
> On Sun, 2018-03-25 at 08:19 +0200, Pavel Machek wrote:
> > > > > Ok, what does 'nmcli dev' and 'nmcli radio' show?
> > > > 
> > > > Broken state.
> > > > 
> > > > pavel@amd:~$ nmcli dev
> > > > DEVICE  TYPE      STATE        CONNECTION
> > > > eth1    ethernet  unavailable  --
> > > > lo      loopback  unmanaged    --
> > > > wlan0   wifi      unmanaged    --
> > > 
> > > If the state is "unmanaged" on resume, that would indicate a
> > > problem
> > > with sleep/wake and likely not a kernel network device issue.
> > > 
> > > We should probably move this discussion to the NM lists to debug
> > > further.  Before you suspend, run "nmcli gen log level trace" to
> > > turn
> > > on full debug logging, then reproduce the issue, and send a pointer
> > > to
> > > those logs (scrubbed for anything you consider sensitive) to the NM
> > > mailing list.
> > 
> > Hmm :-)
> > 
> > root@amd:/data/pavel# nmcli gen log level trace
> > Error: Unknown log level 'trace'
> 
> What NM version?  'trace' is pretty old (since 1.0 from December 2014)
> so unless you're using a really, really old version of Debian I'd
> expect you'd have it.  Anyway, debug would do.

Hmm.

pavel@duo:~$ /usr/sbin/NetworkManager --version
You must be root to run NetworkManager!
pavel@duo:~$ sudo /usr/sbin/NetworkManager --version
0.9.10.0

So I set the log level, but I still don't see much in the log:

Apr 14 18:14:29 duo dbus[3009]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Apr 14 18:14:29 duo nm-dispatcher: Dispatching action 'down' for wlan1
Apr 14 18:14:29 duo systemd[1]: Started Network Manager Script Dispatcher Service.
Apr 14 18:14:29 duo systemd-sleep[6853]: Suspending system...
Apr 14 21:27:53 duo systemd[1]: systemd-journald.service watchdog timeout (limit 1min)!
pavel@duo:~$ date
Sun Apr 15 12:26:32 CEST 2018
pavel@duo:~$

Is it possible that time handling across suspend changed in v4.17?

I get some weird effects. With display backlight...

> > Where do I get the logs? I don't see much in the syslog...
> 
> > And.. It seems that it is "every other suspend". One resume results
> > in
> > broken network, one in working one, one in broken one...
> 
> Does your distro use pm-utils, upower, or systemd for suspend/resume
> handling?

upower, I guess:

pavel@duo:/data/l/linux$ ps aux | grep upower
root      3820  0.0  0.1  42848  7984 ?        Ssl  Apr14   0:01 /usr/lib/upower/upowerd

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: linux-next on x60: network manager often complains "network is disabled" after resume

2018-04-15 Thread Pavel Machek
On Tue 2018-03-20 21:11:54, Woody Suwalski wrote:
> Woody Suwalski wrote:
> >Pavel Machek wrote:
> >>On Mon 2018-03-19 05:17:45, Woody Suwalski wrote:
> >>>Pavel Machek wrote:
> Hi!
> 
> With recent linux-next, after resume networkmanager often claims that
> "network is disabled". Sometimes suspend/resume clears that.
> 
> Any ideas? Does it work for you?
>     Pavel
> >>>Tried the 4.16-rc6 with nm 1.4.4 - I do not see the issue.
> >>Thanks for testing... but yes, 4.16 should be ok. If not fixed,
> >>problem will appear in 4.17-rc1.
> >>
> >Works here OK. Tried ~10 suspends, all restarted OK.
> >kernel next-20180320
> >nmcli shows that Wifi always connects OK
> >
> >Woody
> >
> Contrary, it just happened to me on a 64-bit build 4.16-rc5 on T440.
> I think that Dan's suspicion is correct - it is a snafu in the PM: trying to
> hibernate results in a message:
> Failed to hibernate system via logind: There's already a shutdown or sleep
> operation in progress.
> 
> And ps shows "Ds /lib/systemd/systemd-sleep suspend"...

Problem now seems to be in the mainline.

But no, I don't see systemd-sleep in my process list :-(.

I guess you can't reproduce it easily? I tried bisecting, but while it
happens often enough to make v4.17 hard to use, it does not permit
reliable bisect.

These should be bad according to my notes

b04240a33b99b32cf6fbdf5c943c04e505a0cb07
ed80dc19e4dd395c951f745acd1484d61c4cfb20
52113a0d3889d6e2738cf09bf79bc9cac7b5e1c6
4fc97ef94bbfa185d16b3e44199b7559d0668747
14ebdb2c814f508936fe178a2abc906a16a3ab48
639adbeef5ae1bb8eeebbb0cde0b885397bde192

bisection claimed

c16add24522547bf52c189b3c0d1ab6f5c2b4375

is first bad commit, but I'm not sure if I trust that.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: [PATCH iproute2] utils: Do not reset family for default, any, all addresses

2018-04-15 Thread Thomas Deutschmann
Hi,

I can confirm that this patch solves the issue for us and restores
previous behavior.

Thank you.


-- 
Regards,
Thomas Deutschmann / Gentoo Linux Developer
C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5


Re: [PATCH net] team: avoid adding twice the same option to the event list

2018-04-15 Thread Paolo Abeni
On Fri, 2018-04-13 at 14:07 -0400, David Miller wrote:
> From: Paolo Abeni 
> Date: Fri, 13 Apr 2018 13:59:25 +0200
> 
> > When parsing the options provided by user space,
> > team_nl_cmd_options_set() inserts them into a temporary list to send
> > multiple events with a single message.
> > While each option's attribute is correctly validated, the code does
> > not check for duplicate entries before inserting into the event
> > list.
> > 
> > Exploiting the above, the syzbot was able to trigger the following
> > splat:
>  ...
> > This changeset addresses the issue by avoiding list_add() if the
> > current option is already present in the event list.
> > 
> > Reported-and-tested-by: syzbot+4d4af685432dc0e56...@syzkaller.appspotmail.com
> > Signed-off-by: Paolo Abeni 
> > Fixes: 2fcdb2c9e659 ("team: allow to send multiple set events in one message")
> 
> Looks good to me.
> 
> It's too bad that the tmp list entries don't get marked as they are
> added, or get unlinked by the list processor.  Either scheme would
> make the "already added" test a lot simpler.

Yes, I considered both changes, but then opted for this solution,
believing it would be less invasive and more suitable for -net.

Cheers,

Paolo
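The fix discussed above, skipping the insertion when an option is already queued, is the usual check-before-insert pattern. Below is a self-contained userspace sketch using a hand-rolled singly linked list; `struct opt_event`, `event_list_contains()` and `event_list_add_unique()` are hypothetical names and do not mirror the team driver's actual structures, which use the kernel's `list_head`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event entry: one per option that changed. */
struct opt_event {
	const void *option;      /* identity of the option being reported */
	struct opt_event *next;
};

/* Return true if @option is already queued on the list starting at @head. */
static bool event_list_contains(const struct opt_event *head,
				const void *option)
{
	for (; head; head = head->next)
		if (head->option == option)
			return true;
	return false;
}

/* Link @ev at the front of *@head unless its option is already queued.
 * Returns true if the entry was linked, false if it was a duplicate. */
static bool event_list_add_unique(struct opt_event **head,
				  struct opt_event *ev)
{
	if (event_list_contains(*head, ev->option))
		return false;
	ev->next = *head;
	*head = ev;
	return true;
}
```

The linear scan costs O(n) per insertion. Marking entries as they are added, as David suggests, would reduce the test to a flag check, at the cost of changing the option structures, which is the invasiveness trade-off Paolo mentions.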


[PATCH] ibmvnic: Clear pending interrupt after device reset

2018-04-15 Thread Thomas Falcon
Due to a firmware bug, the hypervisor can send an interrupt to a
transmit or receive queue just prior to a partition migration, not
allowing the device enough time to handle it and send an EOI. When
the partition migrates, the interrupt is lost but an "EOI-pending"
flag for the interrupt line is still set in firmware. No further
interrupts will be sent until that flag is cleared, effectively
freezing that queue. To work around this, the driver will disable the
hardware interrupt and send an H_EOI signal prior to re-enabling it.
This will flush the pending EOI and allow the driver to continue
operation.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index f84a920..ef7995fc 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1034,16 +1034,14 @@ static int __ibmvnic_open(struct net_device *netdev)
netdev_dbg(netdev, "Enabling rx_scrq[%d] irq\n", i);
if (prev_state == VNIC_CLOSED)
enable_irq(adapter->rx_scrq[i]->irq);
-   else
-   enable_scrq_irq(adapter, adapter->rx_scrq[i]);
+   enable_scrq_irq(adapter, adapter->rx_scrq[i]);
}
 
for (i = 0; i < adapter->req_tx_queues; i++) {
netdev_dbg(netdev, "Enabling tx_scrq[%d] irq\n", i);
if (prev_state == VNIC_CLOSED)
enable_irq(adapter->tx_scrq[i]->irq);
-   else
-   enable_scrq_irq(adapter, adapter->tx_scrq[i]);
+   enable_scrq_irq(adapter, adapter->tx_scrq[i]);
}
 
rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_UP);
@@ -1184,6 +1182,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
if (adapter->tx_scrq[i]->irq) {
netdev_dbg(netdev,
   "Disabling tx_scrq[%d] irq\n", i);
+   disable_scrq_irq(adapter, adapter->tx_scrq[i]);
disable_irq(adapter->tx_scrq[i]->irq);
}
}
@@ -1193,6 +1192,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
if (adapter->rx_scrq[i]->irq) {
netdev_dbg(netdev,
   "Disabling rx_scrq[%d] irq\n", i);
+   disable_scrq_irq(adapter, adapter->rx_scrq[i]);
disable_irq(adapter->rx_scrq[i]->irq);
}
}
@@ -2601,12 +2601,19 @@ static int enable_scrq_irq(struct ibmvnic_adapter *adapter,
 {
struct device *dev = &adapter->vdev->dev;
unsigned long rc;
+   u64 val;
 
if (scrq->hw_irq > 0x1ULL) {
dev_err(dev, "bad hw_irq = %lx\n", scrq->hw_irq);
return 1;
}
 
+   val = (0xff00) | scrq->hw_irq;
+   rc = plpar_hcall_norets(H_EOI, val);
+   if (rc)
+   dev_err(dev, "H_EOI FAILED irq 0x%llx. rc=%ld\n",
+   val, rc);
+
rc = plpar_hcall_norets(H_VIOCTL, adapter->vdev->unit_address,
H_ENABLE_VIO_INTERRUPT, scrq->hw_irq, 0, 0);
if (rc)
-- 
1.8.3.1
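The failure mode in the ibmvnic commit message above can be pictured as a small state machine: while the firmware's EOI-pending flag is set, no new interrupt is delivered, so a flag left over from before a migration freezes the queue until something issues an EOI. The sketch below models only that ordering; `struct model_irq_line`, `model_raise()`, `model_eoi()` and `model_enable()` are hypothetical names and do not correspond to hypervisor calls or ibmvnic functions.

```c
#include <stdbool.h>

/* Hypothetical model of one interrupt line as tracked by firmware. */
struct model_irq_line {
	bool enabled;      /* interrupt enabled by the driver */
	bool eoi_pending;  /* firmware still waiting for an EOI */
};

/* Firmware tries to raise an interrupt; returns true if it was delivered. */
static bool model_raise(struct model_irq_line *l)
{
	if (!l->enabled || l->eoi_pending)
		return false;        /* suppressed: the queue looks frozen */
	l->eoi_pending = true;   /* delivered; now waits for an EOI */
	return true;
}

/* Driver acknowledges the interrupt (the role H_EOI plays in the patch). */
static void model_eoi(struct model_irq_line *l)
{
	l->eoi_pending = false;
}

/* Enable path; with eoi_first set, a stale pending flag is flushed
 * before the line is re-enabled, as the patch does. */
static void model_enable(struct model_irq_line *l, bool eoi_first)
{
	if (eoi_first)
		model_eoi(l);
	l->enabled = true;
}
```

Without the EOI (`eoi_first == false`), a line whose pending flag survived the migration never delivers again; with it, delivery resumes and later interrupts follow the normal raise/EOI cycle.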



Re: [PATCH] ibmvnic: Clear pending interrupt after device reset

2018-04-15 Thread Thomas Falcon
On 04/15/2018 06:27 PM, Thomas Falcon wrote:
> Due to a firmware bug, the hypervisor can send an interrupt to a
> transmit or receive queue just prior to a partition migration, not
> allowing the device enough time to handle it and send an EOI. When
> the partition migrates, the interrupt is lost but an "EOI-pending"
> flag for the interrupt line is still set in firmware. No further
> interrupts will be sent until that flag is cleared, effectively
> freezing that queue. To work around this, the driver will disable the
> hardware interrupt and send an H_EOI signal prior to re-enabling it.
> This will flush the pending EOI and allow the driver to continue
> operation.

Excuse me, I misspelled the linuxppc-dev email address.

Tom

> Signed-off-by: Thomas Falcon 
> ---
>  drivers/net/ethernet/ibm/ibmvnic.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
> index f84a920..ef7995fc 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -1034,16 +1034,14 @@ static int __ibmvnic_open(struct net_device *netdev)
>   netdev_dbg(netdev, "Enabling rx_scrq[%d] irq\n", i);
>   if (prev_state == VNIC_CLOSED)
>   enable_irq(adapter->rx_scrq[i]->irq);
> - else
> - enable_scrq_irq(adapter, adapter->rx_scrq[i]);
> + enable_scrq_irq(adapter, adapter->rx_scrq[i]);
>   }
>
>   for (i = 0; i < adapter->req_tx_queues; i++) {
>   netdev_dbg(netdev, "Enabling tx_scrq[%d] irq\n", i);
>   if (prev_state == VNIC_CLOSED)
>   enable_irq(adapter->tx_scrq[i]->irq);
> - else
> - enable_scrq_irq(adapter, adapter->tx_scrq[i]);
> + enable_scrq_irq(adapter, adapter->tx_scrq[i]);
>   }
>
>   rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_UP);
> @@ -1184,6 +1182,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
>   if (adapter->tx_scrq[i]->irq) {
>   netdev_dbg(netdev,
>  "Disabling tx_scrq[%d] irq\n", i);
> + disable_scrq_irq(adapter, adapter->tx_scrq[i]);
>   disable_irq(adapter->tx_scrq[i]->irq);
>   }
>   }
> @@ -1193,6 +1192,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
>   if (adapter->rx_scrq[i]->irq) {
>   netdev_dbg(netdev,
>  "Disabling rx_scrq[%d] irq\n", i);
> + disable_scrq_irq(adapter, adapter->rx_scrq[i]);
>   disable_irq(adapter->rx_scrq[i]->irq);
>   }
>   }
> @@ -2601,12 +2601,19 @@ static int enable_scrq_irq(struct ibmvnic_adapter *adapter,
>  {
>   struct device *dev = &adapter->vdev->dev;
>   unsigned long rc;
> + u64 val;
>
>   if (scrq->hw_irq > 0x1ULL) {
>   dev_err(dev, "bad hw_irq = %lx\n", scrq->hw_irq);
>   return 1;
>   }
>
> + val = (0xff00) | scrq->hw_irq;
> + rc = plpar_hcall_norets(H_EOI, val);
> + if (rc)
> + dev_err(dev, "H_EOI FAILED irq 0x%llx. rc=%ld\n",
> + val, rc);
> +
>   rc = plpar_hcall_norets(H_VIOCTL, adapter->vdev->unit_address,
>   H_ENABLE_VIO_INTERRUPT, scrq->hw_irq, 0, 0);
>   if (rc)




[PATCH] ibmvnic: Clear pending interrupt after device reset

2018-04-15 Thread Thomas Falcon
Due to a firmware bug, the hypervisor can send an interrupt to a
transmit or receive queue just prior to a partition migration, not
allowing the device enough time to handle it and send an EOI. When
the partition migrates, the interrupt is lost but an "EOI-pending"
flag for the interrupt line is still set in firmware. No further
interrupts will be sent until that flag is cleared, effectively
freezing that queue. To work around this, the driver will disable the
hardware interrupt and send an H_EOI signal prior to re-enabling it.
This will flush the pending EOI and allow the driver to continue
operation.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index f84a920..ef7995fc 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1034,16 +1034,14 @@ static int __ibmvnic_open(struct net_device *netdev)
netdev_dbg(netdev, "Enabling rx_scrq[%d] irq\n", i);
if (prev_state == VNIC_CLOSED)
enable_irq(adapter->rx_scrq[i]->irq);
-   else
-   enable_scrq_irq(adapter, adapter->rx_scrq[i]);
+   enable_scrq_irq(adapter, adapter->rx_scrq[i]);
}
 
for (i = 0; i < adapter->req_tx_queues; i++) {
netdev_dbg(netdev, "Enabling tx_scrq[%d] irq\n", i);
if (prev_state == VNIC_CLOSED)
enable_irq(adapter->tx_scrq[i]->irq);
-   else
-   enable_scrq_irq(adapter, adapter->tx_scrq[i]);
+   enable_scrq_irq(adapter, adapter->tx_scrq[i]);
}
 
rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_UP);
@@ -1184,6 +1182,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
if (adapter->tx_scrq[i]->irq) {
netdev_dbg(netdev,
   "Disabling tx_scrq[%d] irq\n", i);
+   disable_scrq_irq(adapter, adapter->tx_scrq[i]);
disable_irq(adapter->tx_scrq[i]->irq);
}
}
@@ -1193,6 +1192,7 @@ static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter)
if (adapter->rx_scrq[i]->irq) {
netdev_dbg(netdev,
   "Disabling rx_scrq[%d] irq\n", i);
+   disable_scrq_irq(adapter, adapter->rx_scrq[i]);
disable_irq(adapter->rx_scrq[i]->irq);
}
}
@@ -2601,12 +2601,19 @@ static int enable_scrq_irq(struct ibmvnic_adapter *adapter,
 {
struct device *dev = &adapter->vdev->dev;
unsigned long rc;
+   u64 val;
 
if (scrq->hw_irq > 0x1ULL) {
dev_err(dev, "bad hw_irq = %lx\n", scrq->hw_irq);
return 1;
}
 
+   val = (0xff00) | scrq->hw_irq;
+   rc = plpar_hcall_norets(H_EOI, val);
+   if (rc)
+   dev_err(dev, "H_EOI FAILED irq 0x%llx. rc=%ld\n",
+   val, rc);
+
rc = plpar_hcall_norets(H_VIOCTL, adapter->vdev->unit_address,
H_ENABLE_VIO_INTERRUPT, scrq->hw_irq, 0, 0);
if (rc)
-- 
1.8.3.1



Re: [PATCH v2 iproute2-next 1/1] tc: jsonify skbedit action

2018-04-15 Thread David Ahern
On 4/10/18 12:04 PM, Roman Mashak wrote:
> v2:
>FIxed strings format in print_string()
> 
> Signed-off-by: Roman Mashak 
> ---
>  tc/m_skbedit.c | 53 +
>  1 file changed, 29 insertions(+), 24 deletions(-)
> 

applied to iproute2-next



Re: [PATCH iproute2-next 1/1] tc: jsonify ife action

2018-04-15 Thread David Ahern
On 4/13/18 3:40 PM, Roman Mashak wrote:
> Signed-off-by: Roman Mashak 
> ---
>  tc/m_ife.c | 54 --
>  1 file changed, 32 insertions(+), 22 deletions(-)
> 

applied to iproute2-next



Re: [PATCH] filter.txt: update 'tools/net/' to 'tools/bpf/'

2018-04-15 Thread David Miller
From: Wang Sheng-Hui 
Date: Sun, 15 Apr 2018 16:07:12 +0800

> The tools are located at tools/bpf/ instead of tools/net/.
> Update the filter.txt doc.
> 
> Signed-off-by: Wang Sheng-Hui 

Applied, thank you.


[PATCH net] net: af_packet: fix race in PACKET_{R|T}X_RING

2018-04-15 Thread Eric Dumazet
In order to remove the race caught by syzbot [1], we need
to lock the socket before using po->tp_version as this could
change under us otherwise.

This means lock_sock() and release_sock() must be done by
packet_set_ring() callers.

[1] :
BUG: KMSAN: uninit-value in packet_set_ring+0x1254/0x3870 net/packet/af_packet.c:4249
CPU: 0 PID: 20195 Comm: syzkaller707632 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 packet_set_ring+0x1254/0x3870 net/packet/af_packet.c:4249
 packet_setsockopt+0x12c6/0x5a90 net/packet/af_packet.c:3662
 SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849
 SyS_setsockopt+0x76/0xa0 net/socket.c:1828
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x449099
RSP: 002b:7f42b5307ce8 EFLAGS: 0246 ORIG_RAX: 0036
RAX: ffda RBX: 0070003c RCX: 00449099
RDX: 0005 RSI: 0107 RDI: 0003
RBP: 00700038 R08: 001c R09: 
R10: 20c0 R11: 0246 R12: 
R13: 0080eecf R14: 7f42b53089c0 R15: 0001

Local variable description: req_u@packet_setsockopt
Variable was created at:
 packet_setsockopt+0x13f/0x5a90 net/packet/af_packet.c:3612
 SYSC_setsockopt+0x4b8/0x570 net/socket.c:1849

Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
 net/packet/af_packet.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 616cb9c18f88edd759dfb461051670c225978afa..c31b0687396a6ef45413f06efcc7c3f923e91d01 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3008,6 +3008,7 @@ static int packet_release(struct socket *sock)
 
packet_flush_mclist(sk);
 
+   lock_sock(sk);
if (po->rx_ring.pg_vec) {
memset(&req_u, 0, sizeof(req_u));
packet_set_ring(sk, &req_u, 1, 0);
@@ -3017,6 +3018,7 @@ static int packet_release(struct socket *sock)
memset(&req_u, 0, sizeof(req_u));
packet_set_ring(sk, &req_u, 1, 1);
}
+   release_sock(sk);
 
f = fanout_release(sk);
 
@@ -3643,6 +3645,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
union tpacket_req_u req_u;
int len;
 
+   lock_sock(sk);
switch (po->tp_version) {
case TPACKET_V1:
case TPACKET_V2:
@@ -3653,12 +3656,17 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
len = sizeof(req_u.req3);
break;
}
-   if (optlen < len)
-   return -EINVAL;
-   if (copy_from_user(&req_u.req, optval, len))
-   return -EFAULT;
-   return packet_set_ring(sk, &req_u, 0,
-   optname == PACKET_TX_RING);
+   if (optlen < len) {
+   ret = -EINVAL;
+   } else {
+   if (copy_from_user(&req_u.req, optval, len))
+   ret = -EFAULT;
+   else
+   ret = packet_set_ring(sk, &req_u, 0,
+   optname == PACKET_TX_RING);
+   }
+   release_sock(sk);
+   return ret;
}
case PACKET_COPY_THRESH:
{
@@ -4208,8 +4216,6 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
/* Added to avoid minimal code churn */
struct tpacket_req *req = &req_u->req;
 
-   lock_sock(sk);
-
rb = tx_ring ? &po->tx_ring : &po->rx_ring;
rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
 
@@ -4347,7 +4353,6 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
if (pg_vec)
free_pg_vec(pg_vec, order, req->tp_block_nr);
 out:
-   release_sock(sk);
return err;
 }
 
-- 
2.17.0.484.g0c8726318c-goog
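The race closed by the af_packet patch above is a check-then-act pattern: `po->tp_version` decides how many bytes are copied from user space, so it must not change between that decision and the ring setup. Below is a userspace analogue in which a pthread mutex stands in for `lock_sock()`/`release_sock()`; `struct model_sock`, `model_set_version()`, `model_set_ring()` and the request sizes (16 and 28 bytes) are assumptions for illustration, not the af_packet definitions.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

enum { MODEL_V1 = 1, MODEL_V2, MODEL_V3 };

struct model_sock {
	pthread_mutex_t lock;   /* stands in for lock_sock()/release_sock() */
	int tp_version;
	unsigned char req[32];  /* stands in for union tpacket_req_u */
	size_t req_len;
};

/* Hypothetical request sizes: V3 requests are larger than V1/V2 ones. */
static size_t model_req_len(int version)
{
	return version == MODEL_V3 ? 28 : 16;
}

static void model_set_version(struct model_sock *s, int version)
{
	pthread_mutex_lock(&s->lock);
	s->tp_version = version;
	pthread_mutex_unlock(&s->lock);
}

/* Mirrors the restructured PACKET_RX_RING path: the version is read, the
 * length checked and the request copied under one lock, and every outcome
 * goes through a single unlock before returning. */
static int model_set_ring(struct model_sock *s, const void *optval,
			  size_t optlen)
{
	size_t len;
	int ret;

	pthread_mutex_lock(&s->lock);
	len = model_req_len(s->tp_version);
	if (optlen < len) {
		ret = -EINVAL;
	} else {
		memcpy(s->req, optval, len);
		s->req_len = len;   /* the ring setup itself would go here */
		ret = 0;
	}
	pthread_mutex_unlock(&s->lock);
	return ret;
}
```

Because the callers now hold the lock, `packet_set_ring()` no longer takes it itself, which is why the patch also adds `lock_sock()`/`release_sock()` around the ring teardown in `packet_release()`.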



Re: [PATCH] ibmvnic: Clear pending interrupt after device reset

2018-04-15 Thread David Miller
From: Thomas Falcon 
Date: Sun, 15 Apr 2018 18:53:36 -0500

> Due to a firmware bug, the hypervisor can send an interrupt to a
> transmit or receive queue just prior to a partition migration, not
> allowing the device enough time to handle it and send an EOI. When
> the partition migrates, the interrupt is lost but an "EOI-pending"
> flag for the interrupt line is still set in firmware. No further
> interrupts will be sent until that flag is cleared, effectively
> freezing that queue. To work around this, the driver will disable the
> hardware interrupt and send an H_EOI signal prior to re-enabling it.
> This will flush the pending EOI and allow the driver to continue
> operation.
> 
> Signed-off-by: Thomas Falcon 

Hey Thomas, I see two copies of this patch posted.  Any special
reason for that?

Thanks.


[PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device

2018-04-15 Thread Zhu Yanjun
When a faulty cable is used or the HCA firmware hits an error, the HCA
device goes offline. When the driver accesses this offline device, the
following call trace appears.

"
...
  [] dump_stack+0x63/0x81
  [] panic+0xcc/0x21b
  [] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
  [] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
  [] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
  [] __mlx4_cmd+0xb0/0x160 [mlx4_core]
  [] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
  [] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
...
"
In the above call trace, mlx4_cmd_poll() calls mlx4_cmd_post() to access
the HCA while the HCA is offline. mlx4_cmd_post() then returns -EIO, and
on -EIO mlx4_cmd_poll() calls mlx4_cmd_reset_flow() to reset the HCA,
which produces the call trace above.

This is not reasonable: since the HCA device is already offline when it
is accessed, it should not be reset again.

With this patch, when the HCA is offline, mlx4_cmd_post() returns
-EINVAL, and on -EINVAL mlx4_cmd_poll() returns directly instead of
resetting the HCA.

CC: Srinivas Eeda 
CC: Junxiao Bi 
Suggested-by: Håkon Bugge 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 6a9086d..f1c8c42 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param,
 * Device is going through error recovery
 * and cannot accept commands.
 */
+   mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
+   ret = -EINVAL;
goto out;
}
 
@@ -657,6 +659,9 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
}
 
 out_reset:
+   if (err == -EINVAL)
+   goto out;
+
if (err)
err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
 out:
@@ -766,6 +771,9 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
*out_param = context->out_param;
 
 out_reset:
+   if (err == -EINVAL)
+   goto out;
+
if (err)
err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
 out:
-- 
2.7.4
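The control flow the mlx4 patch above introduces can be sketched in isolation: an `-EINVAL` from the post step means the device is already in error recovery, so the reset flow is skipped, while other errors still trigger it. The functions `model_post()`, `model_reset_flow()` and `model_cmd_poll()` are hypothetical stand-ins, and the counter exists only to make the behavior observable.

```c
#include <errno.h>
#include <stdbool.h>

static int reset_calls;  /* how many times the reset flow would have run */

/* Hypothetical stand-in for mlx4_cmd_post(): -EINVAL while the device is
 * in error recovery, -EIO on a hardware failure, 0 otherwise. */
static int model_post(bool in_recovery, bool hw_error)
{
	if (in_recovery)
		return -EINVAL;
	return hw_error ? -EIO : 0;
}

/* Hypothetical stand-in for mlx4_cmd_reset_flow(): counts the call and
 * returns err unchanged. */
static int model_reset_flow(int err)
{
	reset_calls++;
	return err;
}

/* Exit-path shape of mlx4_cmd_poll() after the patch. */
static int model_cmd_poll(bool in_recovery, bool hw_error)
{
	int err = model_post(in_recovery, hw_error);

	/* (completion polling would happen here on success) */

	/* out_reset: */
	if (err == -EINVAL)
		return err;   /* device already in recovery: skip the reset */
	if (err)
		err = model_reset_flow(err);
	return err;
}
```

As written, the check keys on the error value alone, so any `-EINVAL` reaching `out_reset` skips the reset, not only the one produced by the recovery path; the model makes that easy to see.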



Re: tcp hang when socket fills up ?

2018-04-15 Thread Dominique Martinet
Eric Dumazet wrote on Fri, Apr 13, 2018:
> That might be caused by some TS val/ecr breakage :
> 
> Many acks were received by the server tcpdump,
> but none of them was accepted by TCP stack, for some reason.
> 
> Try to disable TCP timestamps, it will give some hint if bug does not 
> reproduce.

This was spot on, after disabling tcp timestamps I cannot reproduce the
hang anymore.

I've had another look at the original sequence (as seen by the server)
and I don't see much wrong; tell me what I missed:
 - the replayed packet has seq 32004:33378, so the first ignored ack
would be the one with ack 33378, is that right? (meaning the server did
accept the one for 32004 and none after that)

Assuming it is, here is an excerpt from around then (the first emission
of that packet, then the client's replies):
16:49:26.700531 IP .13317 > .31872: Flags [.], seq 32004:33378, ack 4190, win 307, options [nop,nop,TS val 1313937607 ecr 1617129440], length 1374
...
16:49:26.728084 IP .31872 > .13317: Flags [.], ack 32004, win 759, options [nop,nop,TS val 1617129473 ecr 1313937602], length 0
...
16:49:26.729531 IP .31872 > .13317: Flags [.], ack 33378, win 782, options [nop,nop,TS val 1617129475 ecr 1313937607], length 0
16:49:26.730002 IP .31872 > .13317: Flags [.], ack 34752, win 805, options [nop,nop,TS val 1617129475 ecr 1313937607], length 0
...
16:49:26.731634 IP .31872 > .13317: Flags [.], ack 36126, win 827, options [nop,nop,TS val 1617129476 ecr 1313937607], length 0


 - the ecr value matches the val of the packet it acks
 - the val is >= that of previous packet (won't be considered
reorder/should pass paws check?)
 - even if the packets are processed in parallel and some kind of race
occurs, a "bigger" ack should ack all the previous packets, right?

 - Just to make sure, I checked /proc/net/netstat for PAWSEstab but that
is 0:
TcpExt: SyncookiesSent SyncookiesRecv SyncookiesFailed EmbryonicRsts 
PruneCalled RcvPruned OfoPruned OutOfWindowIcmps LockDroppedIcmps ArpFilter TW 
TWRecycled TWKilled PAWSActive PAWSEstab DelayedACKs DelayedACKLocked 
DelayedACKLost ListenOverflows ListenDrops TCPHPHits TCPPureAcks TCPHPAcks 
TCPRenoRecovery TCPSackRecovery TCPSACKReneging TCPFACKReorder TCPSACKReorder 
TCPRenoReorder TCPTSReorder TCPFullUndo TCPPartialUndo TCPDSACKUndo TCPLossUndo 
TCPLostRetransmit TCPRenoFailures TCPSackFailures TCPLossFailures 
TCPFastRetrans TCPSlowStartRetrans TCPTimeouts TCPLossProbes 
TCPLossProbeRecovery TCPRenoRecoveryFail TCPSackRecoveryFail TCPRcvCollapsed 
TCPDSACKOldSent TCPDSACKOfoSent TCPDSACKRecv TCPDSACKOfoRecv TCPAbortOnData 
TCPAbortOnClose TCPAbortOnMemory TCPAbortOnTimeout TCPAbortOnLinger 
TCPAbortFailed TCPMemoryPressures TCPMemoryPressuresChrono TCPSACKDiscard 
TCPDSACKIgnoredOld TCPDSACKIgnoredNoUndo TCPSpuriousRTOs TCPMD5NotFound 
TCPMD5Unexpected TCPMD5Failure TCPSackShifted TCPSackMerged 
TCPSackShiftFallback TCPBacklogDrop PFMemallocDrop TCPMinTTLDrop 
TCPDeferAcceptDrop IPReversePathFilter TCPTimeWaitOverflow TCPReqQFullDoCookies 
TCPReqQFullDrop TCPRetransFail TCPRcvCoalesce TCPOFOQueue TCPOFODrop 
TCPOFOMerge TCPChallengeACK TCPSYNChallenge TCPFastOpenActive 
TCPFastOpenActiveFail TCPFastOpenPassive TCPFastOpenPassiveFail 
TCPFastOpenListenOverflow TCPFastOpenCookieReqd TCPFastOpenBlackhole 
TCPSpuriousRtxHostQueues BusyPollRxPackets TCPAutoCorking TCPFromZeroWindowAdv 
TCPToZeroWindowAdv TCPWantZeroWindowAdv TCPSynRetrans TCPOrigDataSent 
TCPHystartTrainDetect TCPHystartTrainCwnd TCPHystartDelayDetect 
TCPHystartDelayCwnd TCPACKSkippedSynRecv TCPACKSkippedPAWS TCPACKSkippedSeq 
TCPACKSkippedFinWait2 TCPACKSkippedTimeWait TCPACKSkippedChallenge TCPWinProbe 
TCPKeepAlive TCPMTUPFail TCPMTUPSuccess
TcpExt: 0 0 0 0 58 0 0 26 0 0 50 0 0 0 0 75402 17 201 0 0 6876848 59804 2258387 
0 33 0 0 3 0 0 0 0 1 102 15 0 0 0 1306 60 386 292 10 0 0 108750 201 1 228 1 8 4 
0 63 0 3 0 0 0 0 107 1 0 0 0 2834 1962 622 0 0 0 0 0 0 0 0 0 1065022 54160 0 1 
3 3 0 0 0 0 0 0 0 475 0 9578 6 8 71 257 5116325 0 0 0 0 0 0 6 0 0 0 61 85 0 0
IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts 
OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets 
OutBcastOctets InCsumErrors InNoECTPkts InECT1Pkts InECT0Pkts InCEPkts
IpExt: 0 0 0 0 206 0 16405602866 8921427728 0 0 57928 0 0 16121410 0 5388 0
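The bullets above rely on two wraparound-safe comparisons: one for sequence and ack numbers, and one for the PAWS timestamp test. The sketch below reimplements them in userspace in the style of the kernel's `before()`/`after()` helpers (signed difference of two 32-bit values, as in RFC 7323). `seq_before()`, `seq_after()` and `paws_reject()` are hypothetical names, and the PAWS check is simplified: no tolerance window and no idle-time or RST exceptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* true if seq1 comes before seq2 in 32-bit sequence space */
static bool seq_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

/* true if seq1 comes after seq2 in 32-bit sequence space */
static bool seq_after(uint32_t seq1, uint32_t seq2)
{
	return seq_before(seq2, seq1);
}

/* Simplified PAWS: reject a segment whose TSval is older than ts_recent. */
static bool paws_reject(uint32_t ts_recent, uint32_t tsval)
{
	return (int32_t)(tsval - ts_recent) < 0;
}
```

Applied to the capture, each ack (32004, 33378, 34752, 36126) is after the previous one, and each TSval (1617129473, 1617129475, 1617129475, 1617129476) is at least the previous one, so none would be rejected by this check, which is consistent with the reasoning in the bullets and with PAWSEstab being 0.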

Also, here are the per-socket stats I could find (ss -i after having
reproduced hang):
 reno wscale:7,7 rto:7456 backoff:5 rtt:32.924/1.41 ato:40 mss:1374
 pmtu:1500 rcvmss:1248 advmss:1448 cwnd:1 ssthresh:16
 bytes_acked:32004 bytes_received:4189 segs_out:85 segs_in:54
 data_segs_out:78 data_segs_in:18 send 333.9Kbps lastsnd:3912
 lastrcv:11464 lastack:11387 pacing_rate 21.4Mbps delivery_rate
 3.5Mbps busy:12188ms unacked:33 retrans:1/5 lost:33 rcv_rtt:37
 rcv_space:29200 rcv_ssthresh:39184 notsent:28796 minrtt:24.986


Here are the same stats with tcp timestamp disabled (after running my
reproducer, e.g. outputting a big chunk of text

[PATCH net] net: Fix one possible memleak in ip_setup_cork

2018-04-15 Thread gfree . wind
From: Gao Feng 

This function allocates memory when cork->opt is NULL, but if the later
rt check fails, it returns an error directly without freeing that memory.
This causes a memleak when the caller is ip_make_skb, which also doesn't
free cork->opt on error.

Now move the rt check ahead to avoid the memleak.

Signed-off-by: Gao Feng 
---
 net/ipv4/ip_output.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4c11b81..83c73ba 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1109,6 +1109,10 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 	struct ip_options_rcu *opt;
 	struct rtable *rt;
 
+	rt = *rtp;
+	if (unlikely(!rt))
+		return -EFAULT;
+
 	/*
 	 * setup for corking.
 	 */
@@ -1124,9 +1128,7 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 		cork->flags |= IPCORK_OPT;
 		cork->addr = ipc->addr;
 	}
-	rt = *rtp;
-	if (unlikely(!rt))
-		return -EFAULT;
+
 	/*
 	 * We steal reference to this route, caller should not release it
 	 */
-- 
1.9.1
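The ordering problem described in the commit message can be illustrated outside the kernel. This is a minimal userspace sketch, not the kernel code: `struct cork_sketch`, `struct route_sketch` and `setup_cork_sketch()` are hypothetical stand-ins for `struct inet_cork`, `struct rtable` and `ip_setup_cork()`, and the 40-byte buffer only mirrors the options allocation. It shows the principle the patch applies: run the check that can fail before allocating, so an early error return leaves nothing behind for a caller that does not clean up.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for struct rtable / struct inet_cork. */
struct route_sketch {
	int id;
};

struct cork_sketch {
	void *opt;                 /* lazily allocated options buffer */
	struct route_sketch *rt;   /* route reference taken over from the caller */
};

/*
 * Mirrors the fixed ordering: the route check that may fail runs first,
 * so returning -EFAULT can never leave cork->opt freshly allocated.
 */
static int setup_cork_sketch(struct cork_sketch *cork, int want_opt,
			     struct route_sketch **rtp)
{
	struct route_sketch *rt;

	assert(cork != NULL && rtp != NULL);

	rt = *rtp;
	if (!rt)
		return -EFAULT;    /* nothing allocated yet, so nothing leaks */

	if (want_opt && !cork->opt) {
		cork->opt = malloc(40);
		if (!cork->opt)
			return -ENOBUFS;
	}

	cork->rt = rt;             /* steal the caller's reference */
	*rtp = NULL;
	return 0;
}
```

With the pre-patch ordering (allocate first, then check the route), a call with a NULL route would return -EFAULT while leaving `cork->opt` allocated.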




Re: tcp hang when socket fills up ?

2018-04-15 Thread Eric Dumazet


On 04/15/2018 06:47 PM, Dominique Martinet wrote:

> Also, here are the per-socket stats I could find (ss -i after having
> reproduced hang):
> reno wscale:7,7 rto:7456 backoff:5 rtt:32.924/1.41 ato:40 mss:1374
> pmtu:1500 rcvmss:1248 advmss:1448 cwnd:1 ssthresh:16
> bytes_acked:32004 bytes_received:4189 segs_out:85 segs_in:54
> data_segs_out:78 data_segs_in:18 send 333.9Kbps lastsnd:3912
> lastrcv:11464 lastack:11387 pacing_rate 21.4Mbps delivery_rate
> 3.5Mbps busy:12188ms unacked:33 retrans:1/5 lost:33 rcv_rtt:37
> rcv_space:29200 rcv_ssthresh:39184 notsent:28796 minrtt:24.986
> 

ss -temoi might give us more info

Really it looks like at some point, all incoming packets are shown by tcpdump 
but do not reach the TCP socket anymore.

(segs_in might be steady; look at the d0 counter shown by ss -temoi,
where dX is the drop counter, sk->sk_drops.)


Are you sure you do not have some iptables/netfilter stuff ?

While running your experiment, try this on the server:

perf record -a -g -e skb:kfree_skb  sleep 30
perf report
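To track the dN drop counter across repeated runs, it can be pulled out of saved `ss -temoi` output. A minimal sketch, assuming the comma-separated `skmem:(r0,rb...,d1)` layout shown in this thread; `skmem_drops()` is a hypothetical helper, not part of ss or iproute2.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the dN value (sk->sk_drops) from the "skmem:(...)" group of one
 * line of ss -m / ss -temoi output, or -1 if no such field is present.
 */
static long skmem_drops(const char *line)
{
	const char *p;

	assert(line != NULL);
	p = strstr(line, "skmem:(");
	if (!p)
		return -1;
	p += strlen("skmem:(");

	while (*p && *p != ')') {
		/* each field is "<letters><digits>", e.g. rb369280 or d1 */
		if (p[0] == 'd' && isdigit((unsigned char)p[1]))
			return strtol(p + 1, NULL, 10);
		p += strcspn(p, ",)");	/* skip to the end of this field */
		if (*p == ',')
			p++;		/* step over the separator */
	}
	return -1;
}
```

Comparing the returned value before and after a replay shows whether the socket itself is dropping packets or whether they never reach it.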





linux-next: build failure after merge of the bpf tree

2018-04-15 Thread Stephen Rothwell
Hi all,

After merging the bpf tree, today's linux-next build (arm
multi_v7_defconfig) failed like this:

kernel/bpf/core.o: In function `sock_map_release':
core.c:(.text+0xd04): multiple definition of `sock_map_release'
kernel/sysctl.o:sysctl.c:(.text+0x1a50): first defined here
kernel/events/core.o: In function `sock_map_release':
core.c:(.text+0x85cc): multiple definition of `sock_map_release'
kernel/sysctl.o:sysctl.c:(.text+0x1a50): first defined here
block/blk-core.o: In function `sock_map_release':
blk-core.c:(.text+0x58e8): multiple definition of `sock_map_release'
kernel/sysctl.o:sysctl.c:(.text+0x1a50): first defined here
drivers/net/virtio_net.o: In function `sock_map_release':
virtio_net.c:(.text+0x53ec): multiple definition of `sock_map_release'
kernel/sysctl.o:sysctl.c:(.text+0x1a50): first defined here
net/core/dev.o: In function `sock_map_release':
dev.c:(.text+0x6c68): multiple definition of `sock_map_release'
kernel/sysctl.o:sysctl.c:(.text+0x1a50): first defined here
net/core/rtnetlink.o: In function `sock_map_release':
rtnetlink.c:(.text+0x63e0): multiple definition of `sock_map_release'
kernel/sysctl.o:sysctl.c:(.text+0x1a50): first defined here
net/core/filter.o: In function `sock_map_release':
filter.c:(.text+0x8c8c): multiple definition of `sock_map_release'
kernel/sysctl.o:sysctl.c:(.text+0x1a50): first defined here
net/core/sock_reuseport.o: In function `sock_map_release':
sock_reuseport.c:(.text+0x398): multiple definition of `sock_map_release'
kernel/sysctl.o:sysctl.c:(.text+0x1a50): first defined here
net/bpf/test_run.o: In function `sock_map_release':
test_run.c:(.text+0x3dc): multiple definition of `sock_map_release'
kernel/sysctl.o:sysctl.c:(.text+0x1a50): first defined here
net/packet/af_packet.o: In function `sock_map_release':
af_packet.c:(.text+0x6958): multiple definition of `sock_map_release'
kernel/sysctl.o:sysctl.c:(.text+0x1a50): first defined here

Caused by commit

  9b2e8bbc4e7a ("bpf: sockmap, map_release does not hold refcnt for pinned maps")

I applied the following patch for today:

From: Stephen Rothwell 
Date: Mon, 16 Apr 2018 12:27:24 +1000
Subject: [PATCH] fix for "bpf: sockmap, map_release does not hold refcnt for
 pinned maps"

Signed-off-by: Stephen Rothwell 
---
 include/linux/bpf.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f46561de5154..3b6c2b66f414 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -660,7 +660,7 @@ static inline int sock_map_prog(struct bpf_map *map,
 	return -EOPNOTSUPP;
 }
 
-void sock_map_release(struct bpf_map *map) {}
+static inline void sock_map_release(struct bpf_map *map) {}
 #endif
 
 /* verifier prototypes for helper functions called from eBPF programs */
-- 
2.16.3

-- 
Cheers,
Stephen Rothwell




[PATCH] net: mediatek: use of_device_get_match_data()

2018-04-15 Thread Ryder Lee
Using of_device_get_match_data() reduces the code size a bit.

Also, the only way to call mtk_probe() is to match an entry in
of_mtk_match[], so match cannot be NULL.

Signed-off-by: Ryder Lee 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index e0b72bf..d8ebf0a 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -2503,7 +2503,6 @@ static int mtk_probe(struct platform_device *pdev)
 {
 	struct resource *res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
 	struct device_node *mac_np;
-	const struct of_device_id *match;
 	struct mtk_eth *eth;
 	int err;
 	int i;
@@ -2512,8 +2511,7 @@ static int mtk_probe(struct platform_device *pdev)
 	if (!eth)
 		return -ENOMEM;
 
-	match = of_match_device(of_mtk_match, &pdev->dev);
-	eth->soc = (struct mtk_soc_data *)match->data;
+	eth->soc = of_device_get_match_data(&pdev->dev);
 
 	eth->dev = &pdev->dev;
 	eth->base = devm_ioremap_resource(&pdev->dev, res);
-- 
1.9.1
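The equivalence the commit message relies on can be modelled in a few lines. This is a userspace sketch, not the kernel's implementation: `struct of_id_sketch`, `match_device_sketch()` and `get_match_data_sketch()` are hypothetical analogues of `struct of_device_id`, `of_match_device()` and `of_device_get_match_data()`, matching on a single compatible string. When an entry matches, both routes yield the same `data` pointer; the one-call helper also returns NULL instead of dereferencing a missing match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model of a match-table entry. */
struct of_id_sketch {
	const char *compatible;   /* NULL terminates the table */
	const void *data;         /* per-SoC data, like mtk_soc_data */
};

/* Old style: find the matching entry; the caller then reads ->data. */
static const struct of_id_sketch *
match_device_sketch(const struct of_id_sketch *tbl, const char *compat)
{
	assert(tbl != NULL && compat != NULL);
	for (; tbl->compatible; tbl++)
		if (strcmp(tbl->compatible, compat) == 0)
			return tbl;
	return NULL;
}

/* New style: one call returning the entry's data, or NULL on no match. */
static const void *
get_match_data_sketch(const struct of_id_sketch *tbl, const char *compat)
{
	const struct of_id_sketch *m = match_device_sketch(tbl, compat);

	return m ? m->data : NULL;
}
```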



Re: [PATCH net] net: Fix one possible memleak in ip_setup_cork

2018-04-15 Thread David Miller
From: gfree.w...@vip.163.com
Date: Mon, 16 Apr 2018 10:16:45 +0800

> From: Gao Feng 
> 
> This function allocates memory when cork->opt is NULL, but if the later
> rt check fails, it returns an error directly without freeing that memory.
> This causes a memleak when the caller is ip_make_skb, which also doesn't
> free cork->opt on error.
> 
> Now move the rt check ahead to avoid the memleak.
> 
> Signed-off-by: Gao Feng 

Why did you post this patch twice?


Re:Re: [PATCH net] net: Fix one possible memleak in ip_setup_cork

2018-04-15 Thread Gao Feng
At 2018-04-16 10:55:56, "David Miller"  wrote:
>From: gfree.w...@vip.163.com
>Date: Mon, 16 Apr 2018 10:16:45 +0800
>
>> From: Gao Feng 
>> 
>> This function allocates memory when cork->opt is NULL, but if the later
>> rt check fails, it returns an error directly without freeing that memory.
>> This causes a memleak when the caller is ip_make_skb, which also doesn't
>> free cork->opt on error.
>> 
>> Now move the rt check ahead to avoid the memleak.
>> 
>> Signed-off-by: Gao Feng 
>
>Why did you post this patch twice?

Sorry, that was my input error. I typed "yes" instead of "all" the first
time I ran git-send-email, then corrected it the second time.

Best Regards
Feng


Re: tcp hang when socket fills up ?

2018-04-15 Thread Dominique Martinet
Eric Dumazet wrote on Sun, Apr 15, 2018:
> Are you sure you do not have some iptables/netfilter stuff ?

I have a basic firewall setup with default rules, e.g. it starts with
-m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
in the INPUT chain...
That said, I just dropped it on the server to check and that seems to
work around the issue?!
When logging everything dropped, it appears to decide that the connection
is no longer established at some point, but only if tcp_timestamp is
enabled. Just, err, how?

And sure enough, if I restore the firewall while a connection is up,
that connection just hangs; at some point conntrack no longer considers
it connected (but it worked for a while!)

Here's the kind of logs I get from iptables:
IN=wlp1s0 OUT= MAC=00:c2:c6:b4:7e:c7:a4:12:42:b5:5d:fc:08:00 SRC=client 
DST=server LEN=52 TOS=0x00 PREC=0x00 TTL=52 ID=17038 DF PROTO=TCP SPT=41558 
DPT=15609 WINDOW=1212 RES=0x00 ACK URGP=0 


> ss -temoi might give us more info

hang
ESTAB   0      81406  server:15609   client:41558
users:(("socat",pid=17818,fd=5)) timer:(on,48sec,11) uid:1000 ino:137253 sk:6a 
<->
 skmem:(r0,rb369280,t0,tb147456,f2050,w104446,o0,bl0,d1) ts sack
 reno wscale:7,7 rto:15168 backoff:6 rtt:36.829/6.492 ato:40
 mss:1374 pmtu:1500 rcvmss:1248 advmss:1448 cwnd:1 ssthresh:16
 bytes_acked:32004 bytes_received:4189 segs_out:84 segs_in:55
 data_segs_out:77 data_segs_in:18 send 298.5Kbps lastsnd:12483
 lastrcv:27801 lastack:27726 pacing_rate 19.1Mbps delivery_rate
 4.1Mbps busy:28492ms unacked:31 retrans:1/6 lost:31 rcv_rtt:29
 rcv_space:29200 rcv_ssthresh:39184 notsent:38812 minrtt:25.152

working (tcp_timestamp=0)
ESTAB   0      36     server:15080   client:32979
users:(("socat",pid=17047,fd=5)) timer:(on,226ms,0) uid:1000 ino:90917 sk:23 <->
 skmem:(r0,rb369280,t0,tb1170432,f1792,w2304,o0,bl0,d3) sack reno
 wscale:7,7 rto:230 rtt:29.413/5.345 ato:64 mss:1386 pmtu:1500
 rcvmss:1248 advmss:1460 cwnd:4 ssthresh:3 bytes_acked:17391762
 bytes_received:62397 segs_out:13964 segs_in:8642
 data_segs_out:13895 data_segs_in:1494 send 1.5Mbps lastsnd:4
 lastrcv:5 lastack:5 pacing_rate 1.8Mbps delivery_rate 1.2Mbps
 busy:56718ms unacked:1 retrans:0/11 rcv_rtt:9112.95 rcv_space:29233
 rcv_ssthresh:41680 minrtt:25.95

working (no iptables)
ESTAB   0      0      server:61460   client:20468
users:(("socat",pid=17880,fd=5)) uid:1000 ino:129982 sk:6f <->
 skmem:(r0,rb369280,t0,tb1852416,f0,w0,o0,bl0,d1) ts sack reno
 wscale:7,7 rto:244 rtt:43.752/7.726 ato:40 mss:1374 pmtu:1500
 rcvmss:1248 advmss:1448 cwnd:10 bytes_acked:2617302
 bytes_received:5441 segs_out:1929 segs_in:976 data_segs_out:1919
 data_segs_in:41 send 2.5Mbps lastsnd:2734 lastrcv:2734 lastack:2705
 pacing_rate 5.0Mbps delivery_rate 12.7Mbps busy:1884ms rcv_rtt:30
 rcv_space:29200 rcv_ssthresh:39184 minrtt:26.156

> Really it looks like at some point, all incoming packets are shown by
> tcpdump but do not reach the TCP socket anymore.
> 
> (segs_in: might be steady, look at the d0 counter shown by ss -temoi
> (dX : drop counters, sk->sk_drops)

segs_in does not increase with replays; the d1 seems stable.


> While running your experiment, try on the server.
> 
> perf record -a -g -e skb:kfree_skb  sleep 30
> perf report

While I understand what that should do, I am not sure why I do not get
any graph so that doesn't help tell what called kfree_skb and thus what
decided to drop the packet (although we no longer really need that
now..)
perf script just shows kfree_skb e.g.
swapper 0 [001] 237244.869321: skb:kfree_skb: skbaddr=0x8800360fda00 
protocol=2048 location=0x817a1a77
  9458e3 kfree_skb
(/usr/lib/debug/lib/modules/4.16.0-300.fc28.x86_64/vmlinux)


---

So I guess that ultimately the problem is why conntrack suddenly decides
that an established connection isn't established anymore, despite ss
still listing it as established.
I'm discovering `conntrack(8)`, but what strikes me as interesting is
that even that points at the connection being established (looking at a
new connection after iptables started dropping packets)
# conntrack -L | grep 21308
tcp  6 267 ESTABLISHED src=server dst=client sport=21308 dport=37552 
src=client dst=server sport=37552 dport=21308 [ASSURED] mark=0 use=1

compared to another that isn't dropped (the old connection without
tcp_timestamp)
tcp  6 299 ESTABLISHED src=server dst=client sport=15080 dport=32979 
src=client dst=server sport=32979 dport=15080 [ASSURED] mark=0 use=1

The expect/dying/unconfirmed tables all are empty.

. . . Oh, there is something interesting there, the connection doesn't
come up with -G?
working:
conntrack -G --protonum tcp --src server --dst client --sport 15080 --dport 
32979
tcp  6 299 ESTABLISHED src=server dst=cl

Re: tcp hang when socket fills up ?

2018-04-15 Thread Dominique Martinet
Dominique Martinet wrote on Mon, Apr 16, 2018:
> . . . Oh, there is something interesting there, the connection doesn't
> come up with -G?

Hm, sorry, I take this last part back. I cannot reproduce -G not working
reliably.
I'll dig around the conntrack table a bit more.


-- 
Dominique Martinet | Asmadeus


KASAN: use-after-free Read in tipc_nametbl_stop

2018-04-15 Thread syzbot

Hello,

syzbot hit the following crash on net-next commit
5d1365940a68dd57b031b6e3c07d7d451cd69daf (Thu Apr 12 18:09:05 2018 +)
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=d64b64afc55660106556


So far this crash happened 5 times on net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6319968803094528
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=6099825221173248
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=4953018151731200
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-5947642240294114534

compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+d64b64afc55660106...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.

If you forward the report, please keep this part and the footer.

Failed to remove local publication {0,0,0}/20641
IPVS: ftp: loaded support on port[0] = 21
IPVS: ftp: loaded support on port[0] = 21
IPVS: ftp: loaded support on port[0] = 21
==
BUG: KASAN: use-after-free in tipc_service_delete net/tipc/name_table.c:751 [inline]
BUG: KASAN: use-after-free in tipc_nametbl_stop+0x94e/0xd70 net/tipc/name_table.c:780

Read of size 8 at addr 8801c4c25130 by task kworker/u4:2/30

CPU: 0 PID: 30 Comm: kworker/u4:2 Not tainted 4.16.0+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Workqueue: netns cleanup_net
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
 tipc_service_delete net/tipc/name_table.c:751 [inline]
 tipc_nametbl_stop+0x94e/0xd70 net/tipc/name_table.c:780
 tipc_exit_net+0x2d/0x40 net/tipc/core.c:103
 ops_exit_list.isra.7+0xb0/0x160 net/core/net_namespace.c:152
 cleanup_net+0x51d/0xb20 net/core/net_namespace.c:523
 process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
 worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
 kthread+0x345/0x410 kernel/kthread.c:238
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411

Allocated by task 4535:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
 kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
 kmalloc include/linux/slab.h:512 [inline]
 kzalloc include/linux/slab.h:701 [inline]
 tipc_service_create_range net/tipc/name_table.c:183 [inline]
 tipc_service_insert_publ net/tipc/name_table.c:207 [inline]
 tipc_nametbl_insert_publ+0x569/0x1910 net/tipc/name_table.c:371
 tipc_nametbl_publish+0x6c3/0xba0 net/tipc/name_table.c:618
 tipc_sk_publish+0x22a/0x510 net/tipc/socket.c:2604
 tipc_bind+0x206/0x330 net/tipc/socket.c:647
 __sys_bind+0x331/0x440 net/socket.c:1484
 SYSC_bind net/socket.c:1495 [inline]
 SyS_bind+0x24/0x30 net/socket.c:1493
 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Freed by task 30:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kfree+0xd9/0x260 mm/slab.c:3813
 tipc_service_remove_publ.isra.8+0x909/0xc30 net/tipc/name_table.c:283
 tipc_service_delete net/tipc/name_table.c:753 [inline]
 tipc_nametbl_stop+0x746/0xd70 net/tipc/name_table.c:780
 tipc_exit_net+0x2d/0x40 net/tipc/core.c:103
 ops_exit_list.isra.7+0xb0/0x160 net/core/net_namespace.c:152
 cleanup_net+0x51d/0xb20 net/core/net_namespace.c:523
 process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
 worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
 kthread+0x345/0x410 kernel/kthread.c:238
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411

The buggy address belongs to the object at 8801c4c25100
 which belongs to the cache kmalloc-64 of size 64
The buggy address is located 48 bytes inside of
 64-byte region [8801c4c25100, 8801c4c25140)
The buggy address belongs to the page:
page:ea0007130940 count:1 mapcount:0 mapping:8801c4c25000 index:0x0
flags: 0x2fffc000100(slab)
raw: 02fffc000100 8801c4c25000  00010020
raw: ea0006ccf860 ea00070840a0 8801dac00340 
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 8801c4c25000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 8801c4c25080: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>8801c4c25100: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                                 ^
 8801c4c2518
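The report shows the object freed by `tipc_service_remove_publ()` called from `tipc_service_delete()` (name_table.c:753) and then read by `tipc_service_delete()` itself (name_table.c:751), which is consistent with a loop touching an entry after an earlier iteration freed it. The report does not by itself identify the fix. As a general illustration of this bug class only, here is a userspace sketch of freeing a linked list: the next pointer must be read before the current node is freed, which is the idea behind the kernel's `list_for_each_entry_safe()`. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node standing in for a kernel list entry. */
struct node_sketch {
	int val;
	struct node_sketch *next;
};

/*
 * Free every node and return how many were freed.
 *
 * The buggy shape reads cur->next after free(cur):
 *	for (cur = head; cur; cur = cur->next)
 *		free(cur);
 * KASAN (or ASan in userspace) reports that read as a use-after-free.
 * Caching next before freeing avoids it.
 */
static size_t free_all_sketch(struct node_sketch *head)
{
	struct node_sketch *cur, *next;
	size_t n = 0;

	for (cur = head; cur; cur = next) {
		next = cur->next;	/* read before the node is freed */
		free(cur);
		n++;
	}
	return n;
}
```

A `_safe` iterator only protects against removal of the current entry; if a callee can also free the saved next entry, a different fix is needed.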

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-15 Thread Jesper Dangaard Brouer
On Sat, 14 Apr 2018 21:29:26 +0200
David Woodhouse  wrote:

> On Fri, 2018-04-13 at 19:26 +0200, Christoph Hellwig wrote:
> > On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:  
> > > I guess there is nothing we need to do!
> > >
> > > On x86, in case of no intel iommu or iommu is disabled, you end up in
> > > swiotlb for DMA API calls when system has 4G memory.
> > > However, AFAICT, for 64bit DMA capable devices swiotlb DMA APIs do not
> > > use bounce buffer until and unless you have swiotlb=force specified in
> > > kernel commandline.  
> > 
> > Sure.  But that means every sync_*_to_device and sync_*_to_cpu now
> > involves an indirect call to do exactly nothing, which in the workload
> > Jesper is looking at is causing a huge performance degradation due to
> > retpolines.  

Yes, exactly.

> 
> We should look at using the
> 
>  if (dma_ops == swiotlb_dma_ops)
>     swiotlb_map_page()
>  else
>     dma_ops->map_page()
> 
> trick for this. Perhaps with alternatives so that when an Intel or AMD
> IOMMU is detected, it's *that* which is checked for as the special
> case.

Yes, this trick is basically what I'm asking for :-)

It did sound like Hellwig wanted to first avoid/fix x86 ending up
defaulting to swiotlb.  In that case, we just have to do the same trick
with the new default fall-through dma_ops.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer