date:20170913

Re: RFC: Audit Kernel Container IDs

2017-09-13 Thread Richard Guy Briggs

On 2017-09-13 14:33, Carlos O'Donell wrote:
> On 09/13/2017 12:13 PM, Richard Guy Briggs wrote:
> > Containers are a userspace concept.  The kernel knows nothing of them.
> 
> I am looking at this RFC from a userspace perspective, particularly from
> the loader's point of view and the unshare syscall and the semantics that
> arise from the use of it.
> 
> At a high level what you are doing is providing a way to group, without
> hierarchy, processes and namespaces. The processes can move between
> container's if they have CAP_CONTAINER_ADMIN and can open and write to
> a special proc file.
> 
> * With unshare a thread may dissociate part of its execution context and
>   therefore see a distinct mount namespace. When you say "process" in this
>   particular RFC do you exclude the fact that a thread might be in a
>   distinct container from the rest of the threads in the process?
> 
> > The Linux audit system needs a way to be able to track the container
> > provenance of events and actions.  Audit needs the kernel's help to do
> > this.
> 
> * Why does the Linux audit system need to tracker container provenance?

- ability to filter unwanted, irrelevant or unimportant messages before
  they fill queue so important messages don't get lost.  This is a
  certification requirement.

- ability to make security claims about containers, require tracking of
  actions within those containers to ensure compliance with established
  security policies.

- ability to route messages from events to relevant audit daemon
  instance or host audit daemon instance or both, as required or
  determined by user-initiated rules

>   - How does it help to provide better audit messages?
> 
>   - Is it be enough to list the namespace that a process occupies?

We started with that approach back more than 4 years ago and found it
helped, but didn't go far enough in terms of quick and inexpensive
record filtering and left some doubt about provenance of events in the
case of non-user context events (incoming network packets).

> * Why does it need the kernel's help?
> 
>   - Is there a race condition that is only fixable with kernel support?

This was a concern, but relatively minor compared with the other benefits.

>   - Or is it easier with kernel help but not required?

It is much easier and much less expensive.

> Providing background on these questions would help clarify the
> design requirements.

Here are some references that should help provide some background:
https://github.com/linux-audit/audit-kernel/issues/32
RFE: add namespace IDs to audit records


https://github.com/linux-audit/audit-documentation/wiki/SPEC-Virtualization-Manager-Guest-Lifecycle-Events
SPEC Virtualization Manager Guest Lifecycle Events

https://lwn.net/Articles/699819/
Audit, namespaces, and containers

https://lwn.net/Articles/723561/
Containers as kernel objects
(my reply, with references: https://lkml.org/lkml/2017/8/14/15 )

https://bugzilla.redhat.com/show_bug.cgi?id=1045666
audit: add namespace IDs to log records

> > Since the concept of a container is entirely a userspace concept, a
> > trigger signal from the userspace container orchestration system
> > initiates this.  This will define a point in time and a set of resources
> > associated with a particular container with an audit container ID.
> 
> Please don't use the word 'signal', I suggest 'register' since you are
> writing to a filesystem.

Ok, that's a very reasonable request.  'signal' has a previous meaning.

> > The trigger is a pseudo filesystem (proc, since PID tree already exists)
> > write of a u64 representing the container ID to a file representing a
> > process that will become the first process in a new container.
> > This might place restrictions on mount namespaces required to define a
> > container, or at least careful checking of namespaces in the kernel to
> > verify permissions of the orchestrator so it can't change its own
> > container ID.
> > A bind mount of nsfs may be necessary in the container orchestrator's
> > mntNS.
> > 
> > Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
> > filesystem to have this action permitted.  At that time, record the
> > child container's user-supplied 64-bit container identifier along with
> 
> What is a "child container?" Containers don't have any hierarchy.

Maybe some don't, but that's not likely to last long given the
abstraction and nesting of orchestration tools.  This must be nestable.

> I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents
> your continued operation as we have today?

Correct.  It won't prevent processes that otherwise have permissions
today from creating all the namespaces it wishes.

> > the child container's first process (which may become the container's
> > "init" process) process ID (referenced from the initial PID namespace),
> > all namespace IDs (in the form of a nsfs device

Re: [RFC PATCH] can: m_can: Support higher speed CAN-FD bitrates

2017-09-13 Thread Sekhar Nori


On Thursday 14 September 2017 03:28 AM, Franklin S Cooper Jr wrote:
> 
> 
> On 08/18/2017 02:39 PM, Franklin S Cooper Jr wrote:
>> During test transmitting using CAN-FD at high bitrates (4 Mbps) only
>> resulted in errors. Scoping the signals I noticed that only a single bit
>> was being transmitted and with a bit more investigation realized the actual
>> MCAN IP would go back to initialization mode automatically.
>>
>> It appears this issue is due to the MCAN needing to use the Transmitter
>> Delay Compensation Mode as defined in the MCAN User's Guide. When this
>> mode is used the User's Guide indicates that the Transmitter Delay
>> Compensation Offset register should be set. The document mentions that this
>> register should be set to (1/dbitrate)/2*(Func Clk Freq).
>>
>> Additional CAN-CIA's "Bit Time Requirements for CAN FD" document indicates
>> that this TDC mode is only needed for data bit rates above 2.5 Mbps.
>> Therefore, only enable this mode and only set TDCO when the data bit rate
>> is above 2.5 Mbps.
>>
>> Signed-off-by: Franklin S Cooper Jr 
>> ---
>> I'm pretty surprised that this hasn't been implemented already since
>> the primary purpose of CAN-FD is to go beyond 1 Mbps and the MCAN IP
>> supports up to 10 Mbps.
>>
>> So it will be nice to get comments from users of this driver to understand
>> if they have been able to use CAN-FD beyond 2.5 Mbps without this patch.
>> If they haven't what did they do to get around it if they needed higher
>> speeds.
>>
>> Meanwhile I plan on testing this using a more "realistic" CAN bus to insure
>> everything still works at 5 Mbps which is the max speed of my CAN
>> transceiver.
> 
> ping. Anyone has any thoughts on this?

I added Dong who authored the m_can driver and Wenyou who added the only
in-kernel user of the driver for any help.

Thanks,
Sekhar

>>
>>  drivers/net/can/m_can/m_can.c | 24 +++-
>>  1 file changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c
>> index f4947a7..720e073 100644
>> --- a/drivers/net/can/m_can/m_can.c
>> +++ b/drivers/net/can/m_can/m_can.c
>> @@ -126,6 +126,12 @@ enum m_can_mram_cfg {
>>  #define DBTP_DSJW_SHIFT 0
>>  #define DBTP_DSJW_MASK  (0xf << DBTP_DSJW_SHIFT)
>>  
>> +/* Transmitter Delay Compensation Register (TDCR) */
>> +#define TDCR_TDCO_SHIFT 8
>> +#define TDCR_TDCO_MASK  (0x7F << TDCR_TDCO_SHIFT)
>> +#define TDCR_TDCF_SHIFT 0
>> +#define TDCR_TDCF_MASK  (0x7F << TDCR_TDCO_SHIFT)
>> +
>>  /* Test Register (TEST) */
>>  #define TEST_LBCK   BIT(4)
>>  
>> @@ -977,6 +983,8 @@ static int m_can_set_bittiming(struct net_device *dev)
>>  const struct can_bittiming *dbt = >can.data_bittiming;
>>  u16 brp, sjw, tseg1, tseg2;
>>  u32 reg_btp;
>> +u32 enable_tdc = 0;
>> +u32 tdco;
>>  
>>  brp = bt->brp - 1;
>>  sjw = bt->sjw - 1;
>> @@ -991,9 +999,23 @@ static int m_can_set_bittiming(struct net_device *dev)
>>  sjw = dbt->sjw - 1;
>>  tseg1 = dbt->prop_seg + dbt->phase_seg1 - 1;
>>  tseg2 = dbt->phase_seg2 - 1;
>> +
>> +/* TDC is only needed for bitrates beyond 2.5 MBit/s
>> + * Specified in the "Bit Time Requirements for CAN FD" document
>> + */
>> +if (dbt->bitrate > 250) {
>> +enable_tdc = DBTP_TDC;
>> +/* Equation based on Bosch's M_CAN User Manual's
>> + * Transmitter Delay Compensation Section
>> + */
>> +tdco = priv->can.clock.freq / (dbt->bitrate * 2);
>> +m_can_write(priv, M_CAN_TDCR, tdco << TDCR_TDCO_SHIFT);
>> +}
>> +
>>  reg_btp = (brp << DBTP_DBRP_SHIFT) | (sjw << DBTP_DSJW_SHIFT) |
>>  (tseg1 << DBTP_DTSEG1_SHIFT) |
>> -(tseg2 << DBTP_DTSEG2_SHIFT);
>> +(tseg2 << DBTP_DTSEG2_SHIFT) | enable_tdc;
>> +
>>  m_can_write(priv, M_CAN_DBTP, reg_btp);
>>  }
>>  
>>

[PATCH] staging: irda: Remove typedef struct

2017-09-13 Thread Haneen Mohammed

This patch remove typedef from a structure with all its ocurrences
since using typedefs for structures is discouraged.
Issue found using Coccinelle:

@r1@
type T;
@@

typedef struct { ... } T;

@script:python c1@
T2;
T << r1.T;
@@
if T[-2:] =="_t" or T[-2:] == "_T":
coccinelle.T2 = T[:-2];
else:
coccinelle.T2 = T;

print T, coccinelle.T2

@r2@
type r1.T;
identifier c1.T2;
@@
-typedef
struct
+ T2
{ ... }
-T
;

@r3@
type r1.T;
identifier c1.T2;
@@
-T
+struct T2

Signed-off-by: Haneen Mohammed 
---
 drivers/staging/irda/include/net/irda/qos.h | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/staging/irda/include/net/irda/qos.h 
b/drivers/staging/irda/include/net/irda/qos.h
index 05a5a24..a0315b5 100644
--- a/drivers/staging/irda/include/net/irda/qos.h
+++ b/drivers/staging/irda/include/net/irda/qos.h
@@ -58,23 +58,23 @@
 #define IR_1600 0x02
 
 /* Quality of Service information */
-typedef struct {
+struct qos_value {
__u32 value;
__u16 bits; /* LSB is first byte, MSB is second byte */
-} qos_value_t;
+};
 
 struct qos_info {
magic_t magic;
 
-   qos_value_t baud_rate;   /* IR_11520O | ... */
-   qos_value_t max_turn_time;
-   qos_value_t data_size;
-   qos_value_t window_size;
-   qos_value_t additional_bofs;
-   qos_value_t min_turn_time;
-   qos_value_t link_disc_time;
+   struct qos_value baud_rate;   /* IR_11520O | ... */
+   struct qos_value max_turn_time;
+   struct qos_value data_size;
+   struct qos_value window_size;
+   struct qos_value additional_bofs;
+   struct qos_value min_turn_time;
+   struct qos_value link_disc_time;

-   qos_value_t power;
+   struct qos_value power;
 };
 
 extern int sysctl_max_baud_rate;
-- 
2.7.4

Re: [PATCH 01/10] arch:powerpc: return -ENOMEM on failed allocation

2017-09-13 Thread Allen

> I think the changelog for this series of conversions
> should show that you've validated the change by
> inspecting the return call chain at each modified line.
>
> Also, it seems you've cc'd the same mailing lists for
> all of the patches modified by this series.
>
> It would be better to individually address each patch
> in the series only cc'ing the appropriate maintainers
> and mailing lists.
>
> A cover letter would be good too.

 Point noted. Thanks.

- Allen

Re: Regression in throughput between kvm guests over virtual bridge

2017-09-13 Thread Jason Wang




On 2017年09月14日 00:59, Matthew Rosato wrote:

On 09/13/2017 04:13 AM, Jason Wang wrote:


On 2017年09月13日 09:16, Jason Wang wrote:


On 2017年09月13日 01:56, Matthew Rosato wrote:

We are seeing a regression for a subset of workloads across KVM guests
over a virtual bridge between host kernel 4.12 and 4.13. Bisecting
points to c67df11f "vhost_net: try batch dequing from skb array"

In the regressed environment, we are running 4 kvm guests, 2 running as
uperf servers and 2 running as uperf clients, all on a single host.
They are connected via a virtual bridge.  The uperf client profile looks
like:




  

  
  

  
  

  



So, 1 tcp streaming instance per client.  When upgrading the host kernel
from 4.12->4.13, we see about a 30% drop in throughput for this
scenario.  After the bisect, I further verified that reverting c67df11f
on 4.13 "fixes" the throughput for this scenario.

On the other hand, if we increase the load by upping the number of
streaming instances to 50 (nprocs="50") or even 10, we see instead a
~10% increase in throughput when upgrading host from 4.12->4.13.

So it may be the issue is specific to "light load" scenarios.  I would
expect some overhead for the batching, but 30% seems significant...  Any
thoughts on what might be happening here?


Hi, thanks for the bisecting. Will try to see if I can reproduce.
Various factors could have impact on stream performance. If possible,
could you collect the #pkts and average packet size during the test?
And if you guest version is above 4.12, could you please retry with
napi_tx=true?

Original runs were done with guest kernel 4.4 (from ubuntu 16.04.3 -
4.4.0-93-generic specifically).  Here's a throughput report (uperf) and
#pkts and average packet size (tcpstat) for one of the uperf clients:

host 4.12 / guest 4.4:
throughput: 29.98Gb/s
#pkts=33465571 avg packet size=33755.70

host 4.13 / guest 4.4:
throughput: 20.36Gb/s
#pkts=21233399 avg packet size=36130.69


I test guest 4.4 on Intel machine, still can reproduce :(



I ran the test again using net-next.git as guest kernel, with and
without napi_tx=true.  napi_tx did not seem to have any significant
impact on throughput.  However, the guest kernel shift from
4.4->net-next improved things.  I can still see a regression between
host 4.12 and 4.13, but it's more on the order of 10-15% - another sample:

host 4.12 / guest net-next (without napi_tx):
throughput: 28.88Gb/s
#pkts=31743116 avg packet size=33779.78

host 4.13 / guest net-next (without napi_tx):
throughput: 24.34Gb/s
#pkts=25532724 avg packet size=35963.20


Thanks for the numbers. I originally suspect batching will lead more 
pkts but less size, but looks not. The less packets is also a hint that 
there's delay somewhere.





Thanks

Unfortunately, I could not reproduce it locally. I'm using net-next.git
as guest. I can get ~42Gb/s on Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
for both before and after the commit. I use 1 vcpu and 1 queue, and pin
vcpu and vhost threads into separate cpu on host manually (in same numa
node).

The environment is quite a bit different -- I'm running in an LPAR on a
z13 (s390x).  We've seen the issue in various configurations, the
smallest thus far was a host partition w/ 40G and 20 CPUs defined (the
numbers above were gathered w/ this configuration).  Each guest has 4GB
and 4 vcpus.  No pinning / affinity configured.


Unfortunately, I don't have s390x on hand. Will try to get one.




Can you hit this regression constantly and what's you qemu command line

Yes, the regression seems consistent.  I can try tweaking some of the
host and guest definitions to see if it makes a difference.


Is the issue gone if you reduce VHOST_RX_BATCH to 1? And it would be 
also helpful to collect perf diff to see if anything interesting. 
(Consider 4.4 shows more obvious regression, please use 4.4).




The guests are instantiated from libvirt - Here's one of the resulting
qemu command lines:

/usr/bin/qemu-system-s390x -name guest=mjrs34g1,debug-threads=on -S
-object
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-mjrs34g1/master-key.aes
-machine s390-ccw-virtio-2.10,accel=kvm,usb=off,dump-guest-core=off -m
4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid
44710587-e783-4bd8-8590-55ff421431b1 -display none -no-user-config
-nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-mjrs34g1/monitor.sock,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -boot strict=on -drive
file=/dev/disk/by-id/scsi-3600507630bffc0381803,format=raw,if=none,id=drive-virtio-disk0
-device
virtio-blk-ccw,scsi=off,devno=fe.0.,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device
virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:de:26:53:14:01,devno=fe.0.0001
-netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device

Re: [PATCH RFC v1 0/3] Support for tap user-space access with veth interfaces

2017-09-13 Thread Jason Wang




On 2017年09月07日 08:34, sainath.gran...@intel.com wrote:

From: Sainath Grandhi 

This patchset adds a tap device driver for veth virtual network interface.
With this implementation, tap character interface can be added only to the
peer veth interface. Adding tap interface to veth is for usecases that forwards
packets between host and VMs. This eliminates the need for an additional
software bridge. This can be extended to create both the peer interfaces as
tap interfaces. These patches are a step in that direction.

Sainath Grandhi (3):
   net: Adding API to parse IFLA_LINKINFO attribute
   net: Abstracting out common routines from veth for use by vethtap
   vethtap: veth based tap driver

  drivers/net/Kconfig |   1 +
  drivers/net/Makefile|   2 +
  drivers/net/{veth.c => veth_main.c} |  80 ++---
  drivers/net/vethtap.c   | 216 
  include/linux/if_veth.h |  13 +++
  include/net/rtnetlink.h |   3 +
  net/core/rtnetlink.c|   8 ++
  7 files changed, 308 insertions(+), 15 deletions(-)
  rename drivers/net/{veth.c => veth_main.c} (89%)
  create mode 100644 drivers/net/vethtap.c
  create mode 100644 include/linux/if_veth.h



Interesting, plan to add vhost support for this? And we can enable 
zerocopy without any worries I think.


Thanks

[PATCH net] tcp: update skb->skb_mstamp more carefully

2017-09-13 Thread Eric Dumazet

From: Eric Dumazet 

liujian reported a problem in TCP_USER_TIMEOUT processing with a patch
in tcp_probe_timer() :
  https://www.spinics.net/lists/netdev/msg454496.html

After investigations, the root cause of the problem is that we update
skb->skb_mstamp of skbs in write queue, even if the attempt to send a
clone or copy of it failed. One reason being a routing problem.

This patch prevents this, solving liujian issue.

It also removes a potential RTT miscalculation, since
__tcp_retransmit_skb() is not OR-ing TCP_SKB_CB(skb)->sacked with
TCPCB_EVER_RETRANS if a failure happens, but skb->skb_mstamp has
been changed.

A future ACK would then lead to a very small RTT sample and min_rtt
would then be lowered to this too small value.

Tested:

# cat user_timeout.pkt
--local_ip=192.168.102.64

0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 `ifconfig tun0 192.168.102.64/16; ip ro add 192.0.2.1 dev tun0`

   +0 < S 0:0(0) win 0 
   +0 > S. 0:0(0) ack 1 

  +.1 < . 1:1(0) ack 1 win 65530
   +0 accept(3, ..., ...) = 4

   +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
   +0 write(4, ..., 24) = 24
   +0 > P. 1:25(24) ack 1 win 29200
   +.1 < . 1:1(0) ack 25 win 65530

//change the ipaddress
   +1 `ifconfig tun0 192.168.0.10/16`

   +1 write(4, ..., 24) = 24
   +1 write(4, ..., 24) = 24
   +1 write(4, ..., 24) = 24
   +1 write(4, ..., 24) = 24

   +0 `ifconfig tun0 192.168.102.64/16`
   +0 < . 1:2(1) ack 25 win 65530
   +0 `ifconfig tun0 192.168.0.10/16`

   +3 write(4, ..., 24) = -1

# ./packetdrill user_timeout.pkt

Signed-off-by: Eric Dumazet 
Reported-by: liujian 
---
 net/ipv4/tcp_output.c |   19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
5b6690d05abb98884adfa1693f97d896dd202893..a85a8c2948e54b931f8cd956aa7938f7efd355bd
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -991,6 +991,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff 
*skb, int clone_it,
struct tcp_skb_cb *tcb;
struct tcp_out_options opts;
unsigned int tcp_options_size, tcp_header_size;
+   struct sk_buff *oskb = NULL;
struct tcp_md5sig_key *md5;
struct tcphdr *th;
int err;
@@ -998,12 +999,12 @@ static int tcp_transmit_skb(struct sock *sk, struct 
sk_buff *skb, int clone_it,
BUG_ON(!skb || !tcp_skb_pcount(skb));
tp = tcp_sk(sk);
 
-   skb->skb_mstamp = tp->tcp_mstamp;
if (clone_it) {
TCP_SKB_CB(skb)->tx.in_flight = TCP_SKB_CB(skb)->end_seq
- tp->snd_una;
tcp_rate_skb_sent(sk, skb);
 
+   oskb = skb;
if (unlikely(skb_cloned(skb)))
skb = pskb_copy(skb, gfp_mask);
else
@@ -1011,6 +1012,7 @@ static int tcp_transmit_skb(struct sock *sk, struct 
sk_buff *skb, int clone_it,
if (unlikely(!skb))
return -ENOBUFS;
}
+   skb->skb_mstamp = tp->tcp_mstamp;
 
inet = inet_sk(sk);
tcb = TCP_SKB_CB(skb);
@@ -1122,12 +1124,14 @@ static int tcp_transmit_skb(struct sock *sk, struct 
sk_buff *skb, int clone_it,
 
err = icsk->icsk_af_ops->queue_xmit(sk, skb, >cork.fl);
 
-   if (likely(err <= 0))
-   return err;
-
-   tcp_enter_cwr(sk);
+   if (unlikely(err > 0)) {
+   tcp_enter_cwr(sk);
+   err = net_xmit_eval(err);
+   }
+   if (!err && oskb)
+   oskb->skb_mstamp = tp->tcp_mstamp;
 
-   return net_xmit_eval(err);
+   return err;
 }
 
 /* This routine just queues the buffer for sending.
@@ -2869,10 +2873,11 @@ int __tcp_retransmit_skb(struct sock *sk, struct 
sk_buff *skb, int segs)
 skb_headroom(skb) >= 0x)) {
struct sk_buff *nskb;
 
-   skb->skb_mstamp = tp->tcp_mstamp;
nskb = __pskb_copy(skb, MAX_TCP_HEADER, GFP_ATOMIC);
err = nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :
 -ENOBUFS;
+   if (!err)
+   skb->skb_mstamp = tp->tcp_mstamp;
} else {
err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
}

[PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook

2017-09-13 Thread nixiaoming

If fanout_add is preempted after running po-> fanout = match
and before running __fanout_link,
it will cause BUG_ON when __unregister_prot_hook call __fanout_unlink

so, we need add mutex_lock(_mutex) to __unregister_prot_hook
or add spin_lock(>bind_lock) before po-> fanout = match

test on linux 4.1.12:
./trinity -c setsockopt -C 2 -X &

BUG: failure at net/packet/af_packet.c:1414/__fanout_unlink()!
Kernel panic - not syncing: BUG!
CPU: 2 PID: 2271 Comm: trinity-c0 Tainted: GW  O4.1.12 #1
Hardware name: Hisilicon PhosphorHi1382 FPGA (DT)
Call trace:
[] dump_backtrace+0x0/0xf8
[] show_stack+0x20/0x28
[] dump_stack+0xac/0xe4
[] panic+0xf8/0x268
[] __unregister_prot_hook+0xa0/0x144
[] packet_set_ring+0x280/0x5b4
[] packet_setsockopt+0x320/0x950
[] SyS_setsockopt+0xa4/0xd4

Signed-off-by: nixiaoming 
Tested-by: wudesheng 
---
 net/packet/af_packet.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 008a45c..0300146 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -365,10 +365,12 @@ static void __unregister_prot_hook(struct sock *sk, bool 
sync)
 
po->running = 0;
 
+   mutex_lock(_mutex);
if (po->fanout)
__fanout_unlink(sk, po);
else
__dev_remove_pack(>prot_hook);
+   mutex_unlock(_mutex);
 
__sock_put(sk);
 
-- 
2.11.0.1

[PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook

2017-09-13 Thread nixiaoming

If fanout_add is preempted after running po-> fanout = match
and before running __fanout_link,
it will cause BUG_ON when __unregister_prot_hook call __fanout_unlink

so, we need add mutex_lock(_mutex) to __unregister_prot_hook
or add spin_lock(>bind_lock) before po-> fanout = match

test on linux 4.1.42:
./trinity -c setsockopt -C 2 -X &

BUG: failure at net/packet/af_packet.c:1414/__fanout_unlink()!
Kernel panic - not syncing: BUG!
CPU: 2 PID: 2271 Comm: trinity-c0 Tainted: GW  O4.1.12 #1
Hardware name: Hisilicon PhosphorHi1382 FPGA (DT)
Call trace:
[] dump_backtrace+0x0/0xf8
[] show_stack+0x20/0x28
[] dump_stack+0xac/0xe4
[] panic+0xf8/0x268
[] __unregister_prot_hook+0xa0/0x144
[] packet_set_ring+0x280/0x5b4
[] packet_setsockopt+0x320/0x950
[] SyS_setsockopt+0xa4/0xd4

Signed-off-by: nixiaoming 
Tested-by: wudesheng 
---
 net/packet/af_packet.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 008a45c..0300146 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -365,10 +365,12 @@ static void __unregister_prot_hook(struct sock *sk, bool 
sync)
 
po->running = 0;
 
+   mutex_lock(_mutex);
if (po->fanout)
__fanout_unlink(sk, po);
else
__dev_remove_pack(>prot_hook);
+   mutex_unlock(_mutex);
 
__sock_put(sk);
 
-- 
2.11.0.1

RE: [Intel-wired-lan] [PATCH] igb: check memory allocation failure

2017-09-13 Thread Brown, Aaron F

> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On Behalf
> Of Christophe JAILLET
> Sent: Monday, August 28, 2017 10:13 AM
> To: Waskiewicz Jr, Peter ; Kirsher, Jeffrey T
> 
> Cc: netdev@vger.kernel.org; kernel-janit...@vger.kernel.org; intel-wired-
> l...@lists.osuosl.org; linux-ker...@vger.kernel.org
> Subject: Re: [Intel-wired-lan] [PATCH] igb: check memory allocation failure
> 
> Le 28/08/2017 à 01:09, Waskiewicz Jr, Peter a écrit :
> > On 8/27/17 2:42 AM, Christophe JAILLET wrote:
> >> Check memory allocation failures and return -ENOMEM in such cases, as
> >> already done for other memory allocations in this function.
> >>
> >> This avoids NULL pointers dereference.
> >>
> >> Signed-off-by: Christophe JAILLET 
> >> ---
> >>drivers/net/ethernet/intel/igb/igb_main.c | 2 ++
> >>1 file changed, 2 insertions(+)
> >>

This seems to be fine from a "it does not break in testing" perspective, so...

Tested-by: Aaron Brown  > -PJ
> >
> Hi,
> 
> in fact, there is no leak because the only caller of 'igb_sw_init()'
> (i.e. 'igb_probe()'), already frees these resources in case of error,
> see [1]
> 
> These resources are also freed  in 'igb_remove()'.
> 
> Best reagrds,
> CJ
> 
> [1]:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-
> next.git/tree/drivers/net/ethernet/intel/igb/igb_main.c#n2775

But is PJ's comment saying that it is not really necessary?  If so I tend to 
lean towards the don't touch it if it's not broken perspective.

[PATCH net] udpv6: Fix the checksum computation when HW checksum does not apply

2017-09-13 Thread Subash Abhinov Kasiviswanathan

While trying an ESP transport mode encryption for UDPv6 packets of
datagram size 1436 with MTU 1500, checksum error was observed in
the secondary fragment.

This error occurs due to the UDP payload checksum being missed out
when computing the full checksum for these packets in
udp6_hwcsum_outgoing().

Fixes: d39d938c8228 ("ipv6: Introduce udpv6_send_skb()")
Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 net/ipv6/udp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e2ecfb1..40d7234 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1015,6 +1015,7 @@ static void udp6_hwcsum_outgoing(struct sock *sk, struct 
sk_buff *skb,
 */
offset = skb_transport_offset(skb);
skb->csum = skb_checksum(skb, offset, skb->len - offset, 0);
+   csum = skb->csum;
 
skb->ip_summed = CHECKSUM_NONE;
 
-- 
1.9.1

Re: [RFC net-next 2/5] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-09-13 Thread Vinicius Costa Gomes

Hi Henrik,

Henrik Austad  writes:

> On Thu, Aug 31, 2017 at 06:26:22PM -0700, Vinicius Costa Gomes wrote:
>> This queueing discipline implements the shaper algorithm defined by
>> the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.
>>
>> It's primary usage is to apply some bandwidth reservation to user
>> defined traffic classes, which are mapped to different queues via the
>> mqprio qdisc.
>>
>> Initially, it only supports offloading the traffic shaping work to
>> supporting controllers.
>>
>> Later, when a software implementation is added, the current dependency
>> on being installed "under" mqprio can be lifted.
>>
>> Signed-off-by: Vinicius Costa Gomes 
>> Signed-off-by: Jesus Sanchez-Palencia 
>
> So, I started testing this in my VM to make sure I didn't do anything silly
> and it exploded spectacularly as it used the underlying e1000 driver which
> does not have a ndo_setup_tc present.
>
> I commented inline below, but as a tl;dr
>
> The cbs_init() would call into cbs_change() that would correctly detect
> that ndo_setup_tc is missing and abort. However, at this stage it would try
> to desroy the qdisc and in cbs_destroy() there's no such guard leading to a
>
> BUG: unable to handle kernel NULL pointer dereference at 0010
>
> Fixing that, another NULL would be found when the code walks into
> qdisc_destroy(q->qdisc).
>
> Apart from this, it loaded fine after some wrestling with tc :)
>
> Awesome! :D
>
>> ---
>>  include/linux/netdevice.h |   1 +
>>  net/sched/Kconfig |  11 ++
>>  net/sched/Makefile|   1 +
>>  net/sched/sch_cbs.c   | 286 
>> ++
>>  4 files changed, 299 insertions(+)
>>  create mode 100644 net/sched/sch_cbs.c
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 35de8312e0b5..dd9a2ecd0c03 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -775,6 +775,7 @@ enum tc_setup_type {
>>  TC_SETUP_CLSFLOWER,
>>  TC_SETUP_CLSMATCHALL,
>>  TC_SETUP_CLSBPF,
>> +TC_SETUP_CBS,
>>  };
>>
>>  /* These structures hold the attributes of xdp state that are being passed
>> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
>> index e70ed26485a2..c03d86a7775e 100644
>> --- a/net/sched/Kconfig
>> +++ b/net/sched/Kconfig
>> @@ -172,6 +172,17 @@ config NET_SCH_TBF
>>To compile this code as a module, choose M here: the
>>module will be called sch_tbf.
>>
>> +config NET_SCH_CBS
>
> Shouldn't this depend on NET_SCH_MQPRIO as it is supposed to hook into
> that?

Right now, the CBS shaper only makes sense with mqprio, later it may use
some other qdisc (like "taprio" mentioned in the cover letter) to hook
into, so, yes, you are right, there's a dependency here, better make it
clear. Will fix.

>
> @@ -173,6 +173,7 @@ config NET_SCH_TBF
>   module will be called sch_tbf.
>
>  config NET_SCH_CBS
> +   depends on NET_SCH_MQPRIO
> tristate "Credit Based Shaper (CBS)"
> ---help---
>   Say Y here if you want to use the Credit Based Shaper (CBS) packet
>
>> +tristate "Credit Based Shaper (CBS)"
>> +---help---
>> +  Say Y here if you want to use the Credit Based Shaper (CBS) packet
>> +  scheduling algorithm.
>> +
>> +  See the top of  for more details.
>> +
>> +  To compile this code as a module, choose M here: the
>> +  module will be called sch_cbs.
>> +
>>  config NET_SCH_GRED
>>  tristate "Generic Random Early Detection (GRED)"
>>  ---help---
>> diff --git a/net/sched/Makefile b/net/sched/Makefile
>> index 7b915d226de7..80c8f92d162d 100644
>> --- a/net/sched/Makefile
>> +++ b/net/sched/Makefile
>> @@ -52,6 +52,7 @@ obj-$(CONFIG_NET_SCH_FQ_CODEL) += sch_fq_codel.o
>>  obj-$(CONFIG_NET_SCH_FQ)+= sch_fq.o
>>  obj-$(CONFIG_NET_SCH_HHF)   += sch_hhf.o
>>  obj-$(CONFIG_NET_SCH_PIE)   += sch_pie.o
>> +obj-$(CONFIG_NET_SCH_CBS)   += sch_cbs.o
>>
>>  obj-$(CONFIG_NET_CLS_U32)   += cls_u32.o
>>  obj-$(CONFIG_NET_CLS_ROUTE4)+= cls_route.o
>> diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c
>> new file mode 100644
>> index ..1c86a9e14150
>> --- /dev/null
>> +++ b/net/sched/sch_cbs.c
>> @@ -0,0 +1,286 @@
>> +/*
>> + * net/sched/sch_cbs.c  Credit Based Shaper
>> + *
>> + *  This program is free software; you can redistribute it and/or
>> + *  modify it under the terms of the GNU General Public License
>> + *  as published by the Free Software Foundation; either version
>> + *  2 of the License, or (at your option) any later version.
>> + *
>> + * Authors: Vininicius Costa Gomes 
>> + *
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +struct cbs_sched_data {
>> +struct Qdisc *qdisc; /* Inner qdisc, default - pfifo

[PATCH] dt-bindings: net: renesas-ravb: Add support for R8A77995 RAVB

2017-09-13 Thread Yoshihiro Shimoda

Add a new compatible string for the R8A77995 (R-Car D3) RAVB.

Acked-by: Geert Uytterhoeven 
Signed-off-by: Yoshihiro Shimoda 
---
 Changes from v1:
  - Based on r8a77970 patch.
   https://marc.info/?l=linux-renesas-soc=150524655515476
  - Add Acked-by (Thanks Geert-san!).

 Documentation/devicetree/bindings/net/renesas,ravb.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt 
b/Documentation/devicetree/bindings/net/renesas,ravb.txt
index 2689211..c902261 100644
--- a/Documentation/devicetree/bindings/net/renesas,ravb.txt
+++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt
@@ -18,6 +18,7 @@ Required properties:
   - "renesas,etheravb-r8a7795" for the R8A7795 SoC.
   - "renesas,etheravb-r8a7796" for the R8A7796 SoC.
   - "renesas,etheravb-r8a77970" for the R8A77970 SoC.
+  - "renesas,etheravb-r8a77995" for the R8A77995 SoC.
   - "renesas,etheravb-rcar-gen3" as a fallback for the above
R-Car Gen3 devices.
 
-- 
1.9.1

[PATCH] net: ipv4: fix l3slave check for index returned in IP_PKTINFO

2017-09-13 Thread David Ahern

rt_iif is only set to the actual egress device for the output path. The
recent change to consider the l3slave flag when returning IP_PKTINFO
works for local traffic (the correct device index is returned), but it
broke the more typical use case of packets received from a remote host
always returning the VRF index rather than the original ingress device.
Update the fixup to consider l3slave and rt_iif actually getting set.

Fixes: 1dfa76390bf05 ("net: ipv4: add check for l3slave for index returned in 
IP_PKTINFO")
Signed-off-by: David Ahern 
---
 net/ipv4/ip_sockglue.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index e558e4f9597b..a599aa83fdad 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -1207,7 +1207,6 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb)
 {
struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb);
-   bool l3slave = ipv4_l3mdev_skb(IPCB(skb)->flags);
bool prepare = (inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO) ||
   ipv6_sk_rxinfo(sk);
 
@@ -1221,8 +1220,13 @@ void ipv4_pktinfo_prepare(const struct sock *sk, struct 
sk_buff *skb)
 * (e.g., process binds socket to eth0 for Tx which is
 * redirected to loopback in the rtable/dst).
 */
-   if (pktinfo->ipi_ifindex == LOOPBACK_IFINDEX || l3slave)
+   struct rtable *rt = skb_rtable(skb);
+   bool l3slave = ipv4_l3mdev_skb(IPCB(skb)->flags);
+
+   if (pktinfo->ipi_ifindex == LOOPBACK_IFINDEX)
pktinfo->ipi_ifindex = inet_iif(skb);
+   else if (l3slave && rt && rt->rt_iif)
+   pktinfo->ipi_ifindex = rt->rt_iif;
 
pktinfo->ipi_spec_dst.s_addr = fib_compute_spec_dst(skb);
} else {
-- 
2.1.4

RE: [PATCH] dt-bindings: net: renesas-ravb: Add support for R8A77995 RAVB

2017-09-13 Thread Yoshihiro Shimoda

Hello!

> From: Sergei Shtylyov
> Sent: Thursday, September 14, 2017 1:02 AM
> 
> Hello!
> 
> On 09/13/2017 03:17 PM, Yoshihiro Shimoda wrote:
> 
> > Add a new compatible string for the R8A77995 (R-Car D3) RAVB.
> >
> > Signed-off-by: Yoshihiro Shimoda 
> > ---
> >   Documentation/devicetree/bindings/net/renesas,ravb.txt | 1 +
> >   1 file changed, 1 insertion(+)
> >
> > diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt
> b/Documentation/devicetree/bindings/net/renesas,ravb.txt
> > index 1672353..ae2213f 100644
> > --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt
> > +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt
> > @@ -17,6 +17,7 @@ Required properties:
> >
> > - "renesas,etheravb-r8a7795" for the R8A7795 SoC.
> > - "renesas,etheravb-r8a7796" for the R8A7796 SoC.
> > +  - "renesas,etheravb-r8a77995" for the R8A77995 SoC.
> 
> That would conflict with the R8A77970 bindings patch posted by me
> yesterday. Please respin atop of it.

Yes, I submitted v2 patch now.

> > - "renesas,etheravb-rcar-gen3" as a fallback for the above
> > R-Car Gen3 devices.
> >
> 
> One more fragment is needed here, about the mandatory 'interrupt-names"
> prop. My patch makes this change unnecessary, however...

I understood it.

Best regards,
Yoshihiro Shimoda

> MBR, Sergei

Re: [PATCH v2 net] sctp: potential read out of bounds in sctp_ulpevent_type_enabled()

2017-09-13 Thread David Miller

From: Dan Carpenter 
Date: Thu, 14 Sep 2017 02:00:54 +0300

> This code causes a static checker warning because Smatch doesn't trust
> anything that comes from skb->data.  I've reviewed this code and I do
> think skb->data can be controlled by the user here.
> 
> The sctp_event_subscribe struct has 13 __u8 fields and we want to see
> if ours is non-zero.  sn_type can be any value in the 0-USHRT_MAX range.
> We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read
> either before the start of the struct or after the end.
> 
> This is a very old bug and it's surprising that it would go undetected
> for so long but my theory is that it just doesn't have a big impact so
> it would be hard to notice.
> 
> Signed-off-by: Dan Carpenter 
> ---
> v2:  Use reverse-christmas-tree local variable ordering.

Applied and queued up for -stable, thanks.

Job Offer In USA

2017-09-13 Thread Valero Energy Recruitment

Valero Energy Oil & Gas Company USA
Employment Opportunity in Texas, USA.
Oil & Gas Sector ( NEW PLANT RIG )
Salary Range $7,000 USD - $11,700 Per Month
Kindly send your CV/RESUME for more details regarding our Job Program, 
Employment offers for all Offices.

CC: Forward All Documents/Certificates & Resume to: 
valeroenergy.recr...@gmail.com

Re: [PATCH v2] net: smsc911x: Quieten netif during suspend

2017-09-13 Thread Florian Fainelli



On 09/13/2017 10:42 AM, Geert Uytterhoeven wrote:
> If the network interface is kept running during suspend, the net core
> may call net_device_ops.ndo_start_xmit() while the Ethernet device is
> still suspended, which may lead to a system crash.
> 
> E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
> driven by a PM controlled clock.  If the Ethernet registers are accessed
> while the clock is not running, the system will crash with an imprecise
> external abort.
> 
> As this is a race condition with a small time window, it is not so easy
> to trigger at will.  Using pm_test may increase your chances:
> 
> # echo 0 > /sys/module/printk/parameters/console_suspend
> # echo platform > /sys/power/pm_test
> # echo mem > /sys/power/state
> 
> To fix this, make sure the network interface is quietened during
> suspend.
> 
> Signed-off-by: Geert Uytterhoeven 

Reviewed-by: Florian Fainelli 

You may want to take the opportunity to suspend the PHY device
(conversely resume it) if WoL is not enabled on this device.

Thanks!

> ---
> This is v2 of the series "[PATCH 0/2] net: Fix crashes due to activity
> during suspend", which degenerated into a single patch after commit
> ebc8254aeae34226 ("Revert "net: phy: Correctly process PHY_HALTED in
> phy_stop_machine()"") made "[PATCH 1/2] net: phy: Freeze PHY polling before
> suspending devices" no longer needed.
> 
> v2:
>   - Spelling s/quit/quiet/g.
> 
> No stacktrace is provided, as the imprecise external abort is usually
> reported from an innocent looking and unrelated function like
> __loop_delay(), cpu_idle_poll(), or arch_timer_read_counter_long().
> ---
>  drivers/net/ethernet/smsc/smsc911x.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/smsc/smsc911x.c 
> b/drivers/net/ethernet/smsc/smsc911x.c
> index 0b6a39b003a4e188..012fb66eed8dd618 100644
> --- a/drivers/net/ethernet/smsc/smsc911x.c
> +++ b/drivers/net/ethernet/smsc/smsc911x.c
> @@ -2595,6 +2595,11 @@ static int smsc911x_suspend(struct device *dev)
>   struct net_device *ndev = dev_get_drvdata(dev);
>   struct smsc911x_data *pdata = netdev_priv(ndev);
>  
> + if (netif_running(ndev)) {
> + netif_stop_queue(ndev);
> + netif_device_detach(ndev);
> + }
> +
>   /* enable wake on LAN, energy detection and the external PME
>* signal. */
>   smsc911x_reg_write(pdata, PMT_CTRL,
> @@ -2628,7 +2633,15 @@ static int smsc911x_resume(struct device *dev)
>   while (!(smsc911x_reg_read(pdata, PMT_CTRL) & PMT_CTRL_READY_) && --to)
>   udelay(1000);
>  
> - return (to == 0) ? -EIO : 0;
> + if (to == 0)
> + return -EIO;
> +
> + if (netif_running(ndev)) {
> + netif_device_attach(ndev);
> + netif_start_queue(ndev);
> + }
> +
> + return 0;
>  }
>  
>  static const struct dev_pm_ops smsc911x_pm_ops = {
> 

-- 
Florian

[PATCH v2 net] sctp: potential read out of bounds in sctp_ulpevent_type_enabled()

2017-09-13 Thread Dan Carpenter

This code causes a static checker warning because Smatch doesn't trust
anything that comes from skb->data.  I've reviewed this code and I do
think skb->data can be controlled by the user here.

The sctp_event_subscribe struct has 13 __u8 fields and we want to see
if ours is non-zero.  sn_type can be any value in the 0-USHRT_MAX range.
We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read
either before the start of the struct or after the end.

This is a very old bug and it's surprising that it would go undetected
for so long but my theory is that it just doesn't have a big impact so
it would be hard to notice.

Signed-off-by: Dan Carpenter 
---
v2:  Use reverse-christmas-tree local variable ordering.

diff --git a/include/net/sctp/ulpevent.h b/include/net/sctp/ulpevent.h
index 1060494ac230..b8c86ec1a8f5 100644
--- a/include/net/sctp/ulpevent.h
+++ b/include/net/sctp/ulpevent.h
@@ -153,8 +153,12 @@ __u16 sctp_ulpevent_get_notification_type(const struct 
sctp_ulpevent *event);
 static inline int sctp_ulpevent_type_enabled(__u16 sn_type,
 struct sctp_event_subscribe *mask)
 {
+   int offset = sn_type - SCTP_SN_TYPE_BASE;
char *amask = (char *) mask;
-   return amask[sn_type - SCTP_SN_TYPE_BASE];
+
+   if (offset >= sizeof(struct sctp_event_subscribe))
+   return 0;
+   return amask[offset];
 }
 
 /* Given an event subscription, is this event enabled? */

Re: [PATCH] net: Convert int functions to bool

2017-09-13 Thread Joe Perches

On Thu, 2017-09-14 at 01:31 +0300, Alexey Dobriyan wrote:
> > Global function ipv6_rcv_saddr_equal and static functions
> > ipv6_rcv_saddr_equal and ipv4_rcv_saddr_equal currently return int.
> > 
> > bool is slightly more descriptive for these functions so change
> > their return type from int to bool.
> 
> From code generation POV "int" is better for non-inlined functions
> especially on non-x86.

I don't agree with you here.  For instance:

$ size net/ipv4/*.o*
   text    data bss dec hex filename
   9832  26   0    9858    2682 
net/ipv4/inet_connection_sock.o.arm.defconfig.new
   9860  26   0    9886    269e 
net/ipv4/inet_connection_sock.o.arm.defconfig.old

Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression

2017-09-13 Thread Cole Robinson

On 09/13/2017 03:44 PM, Josef Bacik wrote:
> Alright thanks, this should fix it.
> 

Still no luck with all three patches applied to fedora 4.12.8-300 RPM. Pretty
sure I didn't mess up the testing but since I rarely do kernel builds it's not
impossible...

Thanks,
Cole

Re: [PATCH] net: Convert int functions to bool

2017-09-13 Thread Alexey Dobriyan

> Global function ipv6_rcv_saddr_equal and static functions
> ipv6_rcv_saddr_equal and ipv4_rcv_saddr_equal currently return int.
>
> bool is slightly more descriptive for these functions so change
> their return type from int to bool.

>From code generation POV "int" is better for non-inlined functions
especially on non-x86.

Patch bloats even x86_64:

$ ./scripts/bloat-o-meter ../vmlinux-002-do-while ../obj/vmlinux
add/remove: 0/0 grow/shrink: 2/0 up/down: 3/0 (3)
function old new   delta
inet_rcv_saddr_equal  38  40  +2
inet_csk_get_port   13001301  +1
Total: Before=6084708, After=6084711, chg +0.00%

Re: [RFC PATCH] can: m_can: Support higher speed CAN-FD bitrates

2017-09-13 Thread Franklin S Cooper Jr



On 08/18/2017 02:39 PM, Franklin S Cooper Jr wrote:
> During test transmitting using CAN-FD at high bitrates (4 Mbps) only
> resulted in errors. Scoping the signals I noticed that only a single bit
> was being transmitted and with a bit more investigation realized the actual
> MCAN IP would go back to initialization mode automatically.
> 
> It appears this issue is due to the MCAN needing to use the Transmitter
> Delay Compensation Mode as defined in the MCAN User's Guide. When this
> mode is used the User's Guide indicates that the Transmitter Delay
> Compensation Offset register should be set. The document mentions that this
> register should be set to (1/dbitrate)/2*(Func Clk Freq).
> 
> Additional CAN-CIA's "Bit Time Requirements for CAN FD" document indicates
> that this TDC mode is only needed for data bit rates above 2.5 Mbps.
> Therefore, only enable this mode and only set TDCO when the data bit rate
> is above 2.5 Mbps.
> 
> Signed-off-by: Franklin S Cooper Jr 
> ---
> I'm pretty surprised that this hasn't been implemented already since
> the primary purpose of CAN-FD is to go beyond 1 Mbps and the MCAN IP
> supports up to 10 Mbps.
> 
> So it will be nice to get comments from users of this driver to understand
> if they have been able to use CAN-FD beyond 2.5 Mbps without this patch.
> If they haven't what did they do to get around it if they needed higher
> speeds.
> 
> Meanwhile I plan on testing this using a more "realistic" CAN bus to insure
> everything still works at 5 Mbps which is the max speed of my CAN
> transceiver.

ping. Anyone has any thoughts on this?
> 
>  drivers/net/can/m_can/m_can.c | 24 +++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c
> index f4947a7..720e073 100644
> --- a/drivers/net/can/m_can/m_can.c
> +++ b/drivers/net/can/m_can/m_can.c
> @@ -126,6 +126,12 @@ enum m_can_mram_cfg {
>  #define DBTP_DSJW_SHIFT  0
>  #define DBTP_DSJW_MASK   (0xf << DBTP_DSJW_SHIFT)
>  
> +/* Transmitter Delay Compensation Register (TDCR) */
> +#define TDCR_TDCO_SHIFT  8
> +#define TDCR_TDCO_MASK   (0x7F << TDCR_TDCO_SHIFT)
> +#define TDCR_TDCF_SHIFT  0
> +#define TDCR_TDCF_MASK   (0x7F << TDCR_TDCO_SHIFT)
> +
>  /* Test Register (TEST) */
>  #define TEST_LBCKBIT(4)
>  
> @@ -977,6 +983,8 @@ static int m_can_set_bittiming(struct net_device *dev)
>   const struct can_bittiming *dbt = >can.data_bittiming;
>   u16 brp, sjw, tseg1, tseg2;
>   u32 reg_btp;
> + u32 enable_tdc = 0;
> + u32 tdco;
>  
>   brp = bt->brp - 1;
>   sjw = bt->sjw - 1;
> @@ -991,9 +999,23 @@ static int m_can_set_bittiming(struct net_device *dev)
>   sjw = dbt->sjw - 1;
>   tseg1 = dbt->prop_seg + dbt->phase_seg1 - 1;
>   tseg2 = dbt->phase_seg2 - 1;
> +
> + /* TDC is only needed for bitrates beyond 2.5 MBit/s
> +  * Specified in the "Bit Time Requirements for CAN FD" document
> +  */
> + if (dbt->bitrate > 250) {
> + enable_tdc = DBTP_TDC;
> + /* Equation based on Bosch's M_CAN User Manual's
> +  * Transmitter Delay Compensation Section
> +  */
> + tdco = priv->can.clock.freq / (dbt->bitrate * 2);
> + m_can_write(priv, M_CAN_TDCR, tdco << TDCR_TDCO_SHIFT);
> + }
> +
>   reg_btp = (brp << DBTP_DBRP_SHIFT) | (sjw << DBTP_DSJW_SHIFT) |
>   (tseg1 << DBTP_DTSEG1_SHIFT) |
> - (tseg2 << DBTP_DTSEG2_SHIFT);
> + (tseg2 << DBTP_DTSEG2_SHIFT) | enable_tdc;
> +
>   m_can_write(priv, M_CAN_DBTP, reg_btp);
>   }
>  
>

Re: [patch net] net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker

2017-09-13 Thread David Miller

From: Jiri Pirko 
Date: Wed, 13 Sep 2017 22:50:06 +0200

> Wed, Sep 13, 2017 at 06:34:28PM CEST, da...@davemloft.net wrote:
>>From: Jiri Pirko 
>>Date: Wed, 13 Sep 2017 17:32:37 +0200
>>
>>> From: Jiri Pirko 
>>> 
>>> Recent commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu") removed
>>> freeing in call_rcu, which changed already existing hard-to-hit
>>> race condition into 100% hit:
>>> 
>>> [  598.599825] BUG: unable to handle kernel NULL pointer dereference at 
>>> 0030
>>> [  598.607782] IP: tcf_action_destroy+0xc0/0x140
>>> 
>>> Or:
>>> 
>>> [   40.858924] BUG: unable to handle kernel NULL pointer dereference at 
>>> 0030
>>> [   40.862840] IP: tcf_generic_walker+0x534/0x820
>>> 
>>> Fix this by storing the ops and use them directly for module_put call.
>>> 
>>> Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common")
>>> Signed-off-by: Jiri Pirko 
>>
>>Applied, thanks Jiri.
> 
> Oh, I forgot to mention, this would be nice to push to stable.

Ok, queued up.

[PATCH] net: Convert int functions to bool

2017-09-13 Thread Joe Perches

Global function ipv6_rcv_saddr_equal and static functions
ipv6_rcv_saddr_equal and ipv4_rcv_saddr_equal currently return int.

bool is slightly more descriptive for these functions so change
their return type from int to bool.

Signed-off-by: Joe Perches 
---
 include/net/addrconf.h  |  4 ++--
 net/ipv4/inet_connection_sock.c | 36 ++--
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index f44ff2476758..87981cd63180 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -94,8 +94,8 @@ int __ipv6_get_lladdr(struct inet6_dev *idev, struct in6_addr 
*addr,
  u32 banned_flags);
 int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr,
u32 banned_flags);
-int inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2,
-bool match_wildcard);
+bool inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2,
+ bool match_wildcard);
 void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr);
 void addrconf_leave_solict(struct inet6_dev *idev, const struct in6_addr 
*addr);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 4089c013cb03..054b26295638 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -39,11 +39,11 @@ EXPORT_SYMBOL(inet_csk_timer_bug_msg);
  *  IPV6_ADDR_ANY only equals to IPV6_ADDR_ANY,
  *  and 0.0.0.0 equals to 0.0.0.0 only
  */
-static int ipv6_rcv_saddr_equal(const struct in6_addr *sk1_rcv_saddr6,
-   const struct in6_addr *sk2_rcv_saddr6,
-   __be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr,
-   bool sk1_ipv6only, bool sk2_ipv6only,
-   bool match_wildcard)
+static bool ipv6_rcv_saddr_equal(const struct in6_addr *sk1_rcv_saddr6,
+const struct in6_addr *sk2_rcv_saddr6,
+__be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr,
+bool sk1_ipv6only, bool sk2_ipv6only,
+bool match_wildcard)
 {
int addr_type = ipv6_addr_type(sk1_rcv_saddr6);
int addr_type2 = sk2_rcv_saddr6 ? ipv6_addr_type(sk2_rcv_saddr6) : 
IPV6_ADDR_MAPPED;
@@ -52,29 +52,29 @@ static int ipv6_rcv_saddr_equal(const struct in6_addr 
*sk1_rcv_saddr6,
if (addr_type == IPV6_ADDR_MAPPED && addr_type2 == IPV6_ADDR_MAPPED) {
if (!sk2_ipv6only) {
if (sk1_rcv_saddr == sk2_rcv_saddr)
-   return 1;
+   return true;
if (!sk1_rcv_saddr || !sk2_rcv_saddr)
return match_wildcard;
}
-   return 0;
+   return false;
}
 
if (addr_type == IPV6_ADDR_ANY && addr_type2 == IPV6_ADDR_ANY)
-   return 1;
+   return true;
 
if (addr_type2 == IPV6_ADDR_ANY && match_wildcard &&
!(sk2_ipv6only && addr_type == IPV6_ADDR_MAPPED))
-   return 1;
+   return true;
 
if (addr_type == IPV6_ADDR_ANY && match_wildcard &&
!(sk1_ipv6only && addr_type2 == IPV6_ADDR_MAPPED))
-   return 1;
+   return true;
 
if (sk2_rcv_saddr6 &&
ipv6_addr_equal(sk1_rcv_saddr6, sk2_rcv_saddr6))
-   return 1;
+   return true;
 
-   return 0;
+   return false;
 }
 #endif
 
@@ -82,20 +82,20 @@ static int ipv6_rcv_saddr_equal(const struct in6_addr 
*sk1_rcv_saddr6,
  * match_wildcard == false: addresses must be exactly the same, i.e.
  *  0.0.0.0 only equals to 0.0.0.0
  */
-static int ipv4_rcv_saddr_equal(__be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr,
-   bool sk2_ipv6only, bool match_wildcard)
+static bool ipv4_rcv_saddr_equal(__be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr,
+bool sk2_ipv6only, bool match_wildcard)
 {
if (!sk2_ipv6only) {
if (sk1_rcv_saddr == sk2_rcv_saddr)
-   return 1;
+   return true;
if (!sk1_rcv_saddr || !sk2_rcv_saddr)
return match_wildcard;
}
-   return 0;
+   return false;
 }
 
-int inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2,
-bool match_wildcard)
+bool inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2,
+ bool match_wildcard)
 {
 #if IS_ENABLED(CONFIG_IPV6)
if (sk->sk_family == AF_INET6)
-- 
2.10.0.rc2.1.g053435c

Re: [patch net] net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker

2017-09-13 Thread Jiri Pirko

Wed, Sep 13, 2017 at 06:34:28PM CEST, da...@davemloft.net wrote:
>From: Jiri Pirko 
>Date: Wed, 13 Sep 2017 17:32:37 +0200
>
>> From: Jiri Pirko 
>> 
>> Recent commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu") removed
>> freeing in call_rcu, which changed already existing hard-to-hit
>> race condition into 100% hit:
>> 
>> [  598.599825] BUG: unable to handle kernel NULL pointer dereference at 
>> 0030
>> [  598.607782] IP: tcf_action_destroy+0xc0/0x140
>> 
>> Or:
>> 
>> [   40.858924] BUG: unable to handle kernel NULL pointer dereference at 
>> 0030
>> [   40.862840] IP: tcf_generic_walker+0x534/0x820
>> 
>> Fix this by storing the ops and use them directly for module_put call.
>> 
>> Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common")
>> Signed-off-by: Jiri Pirko 
>
>Applied, thanks Jiri.

Oh, I forgot to mention, this would be nice to push to stable.

Re: [PATCH] MAINTAINERS: review Renesas DT bindings as well

2017-09-13 Thread David Miller

From: Sergei Shtylyov 
Date: Wed, 13 Sep 2017 22:28:53 +0300

> When adding myself  as a reviewer for  the Renesas  Ethernet drivers
> I somehow forgot about the bindings -- I want to review them as well.
> 
> Fixes: 8e6569af3a1b ("MAINTAINERS: add myself as Renesas Ethernet drivers 
> reviewer")
> Signed-off-by: Sergei Shtylyov 

Applied.

Re: [PATCH net] net_sched: gen_estimator: fix scaling error in bytes/packets samples

2017-09-13 Thread David Miller

From: Eric Dumazet 
Date: Wed, 13 Sep 2017 11:16:45 -0700

> From: Eric Dumazet 
> 
> Denys reported wrong rate estimations with HTB classes.
> 
> It appears the bug was added in linux-4.10, since my tests
> where using intervals of one second only.
> 
> HTB using 4 sec default rate estimators, reported rates
> were 4x higher.
> 
> We need to properly scale the bytes/packets samples before
> integrating them in EWMA.
> 
> Tested:
 ...
> Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate 
> estimators")
> Signed-off-by: Eric Dumazet 
> Reported-by: Denys Fedoryshchenko 

Applied and queued up for -stable, thanks Eric.

Re: [PATCH v2] geneve: Fix setting ttl value in collect metadata mode

2017-09-13 Thread Pravin Shelar

On Wed, Sep 13, 2017 at 4:15 AM, 严海双  wrote:
>
>
>> On 2017年9月13日, at 上午7:43, Pravin Shelar  wrote:
>>
>> On Tue, Sep 12, 2017 at 12:05 AM, Haishuang Yan
>>  wrote:
>>> Similar to vxlan/ipip tunnel, if key->tos is zero in collect metadata
>>> mode, tos should also fallback to ip{4,6}_dst_hoplimit.
>>>
>>> Signed-off-by: Haishuang Yan 
>>>
>>> ---
>>> Changes since v2:
>>>  * Make the commit message more clearer.
>>> ---
>>> drivers/net/geneve.c | 6 ++
>>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
>>> index f640407..d52a65f 100644
>>> --- a/drivers/net/geneve.c
>>> +++ b/drivers/net/geneve.c
>>> @@ -834,11 +834,10 @@ static int geneve_xmit_skb(struct sk_buff *skb, 
>>> struct net_device *dev,
>>>sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
>>>if (geneve->collect_md) {
>>>tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
>>> -   ttl = key->ttl;
>>>} else {
>>>tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
>>> -   ttl = key->ttl ? : ip4_dst_hoplimit(>dst);
>>>}
>>> +   ttl = key->ttl ? : ip4_dst_hoplimit(>dst);
>>>df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
>>>
>> This changes user API of Geneve collect-metadata mode. I do not see
>> good reason for this. Why user can not set right TTL for the flow?
>>
>
> For example, I test this case with script samples/bpf/test_tunnel_bpf.sh,
> and modify samples/bpf/tcbpf2_kern.c with following patch:
>
But user is suppose to specify the TTL in collect-md mode for Geneve
tunnel. That is the API.

Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression

2017-09-13 Thread Josef Bacik


> On Sep 13, 2017, at 12:46 PM, Chuck Ebbert  wrote:
> 
> On Wed, 13 Sep 2017 17:28:25 +
> Josef Bacik  wrote:
> 
>> Sorry I thought I had made this other fix, can you apply this on top
>> of the other one and try that?  I have more things to try if this
>> doesn’t work, sorry you are playing go between, but I want to make
>> sure I know _which_ fix actually fixes the problem, and then clean up
>> in followup patches.  Thanks,
>> 
>> Josef
>> 
>> On 9/13/17, 8:45 AM, "Laura Abbott"  wrote:
>> 
>> On 09/12/2017 04:12 PM, Josef Bacik wrote:
>>> First I’m super sorry for the top post, I’m at plumbers and I
>>> forgot to upload my muttrc to my new cloud instance, so I’m screwed
>>> using outlook.
>>> 
>>> I have a completely untested, uncompiled patch that I think will
>>> fix the problem, would you mind giving it a go?  Thanks,
>>> 
>>> Josef  
>> 
>> Thanks for the quick turnaround. Unfortunately, the problem is still
>> reproducible according to the reporter.
>> 
>> Thanks,
>> Laura
> 
> I am confused by the patch that originally caused this:
> 
>if (sk->sk_family == AF_INET6)
>return ipv6_rcv_saddr_equal(>sk_v6_rcv_saddr,
> -   >sk_v6_rcv_saddr,
> +   inet6_rcv_saddr(sk2),
>sk->sk_rcv_saddr,
>sk2->sk_rcv_saddr,
> 
> Shouldn't the first argument also be changed to use inet6_rcv_saddr()?

No we know sk is IPv6 so it's alright to use directly.  Thanks,

Josef

Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression

2017-09-13 Thread Chuck Ebbert

On Wed, 13 Sep 2017 17:28:25 +
Josef Bacik  wrote:

> Sorry I thought I had made this other fix, can you apply this on top
> of the other one and try that?  I have more things to try if this
> doesn’t work, sorry you are playing go between, but I want to make
> sure I know _which_ fix actually fixes the problem, and then clean up
> in followup patches.  Thanks,
> 
> Josef
> 
> On 9/13/17, 8:45 AM, "Laura Abbott"  wrote:
> 
> On 09/12/2017 04:12 PM, Josef Bacik wrote:
> > First I’m super sorry for the top post, I’m at plumbers and I
> > forgot to upload my muttrc to my new cloud instance, so I’m screwed
> > using outlook.
> > 
> > I have a completely untested, uncompiled patch that I think will
> > fix the problem, would you mind giving it a go?  Thanks,
> > 
> > Josef  
> 
> Thanks for the quick turnaround. Unfortunately, the problem is still
> reproducible according to the reporter.
> 
> Thanks,
> Laura

I am confused by the patch that originally caused this:

if (sk->sk_family == AF_INET6)
return ipv6_rcv_saddr_equal(>sk_v6_rcv_saddr,
-   >sk_v6_rcv_saddr,
+   inet6_rcv_saddr(sk2),
sk->sk_rcv_saddr,
sk2->sk_rcv_saddr,

Shouldn't the first argument also be changed to use inet6_rcv_saddr()?

Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression

2017-09-13 Thread Josef Bacik

Alright thanks, this should fix it.

Josef

On 9/13/17, 12:14 PM, "Cole Robinson"  wrote:

On 09/13/2017 01:40 PM, Cole Robinson wrote:
> On 09/13/2017 01:28 PM, Josef Bacik wrote:
>> Sorry I thought I had made this other fix, can you apply this on top of the 
>> other one and try that?  I have more things to try if this doesn’t work, 
>> sorry you are playing go between, but I want to make sure I know _which_ fix 
>> actually fixes the problem, and then clean up in followup patches.  Thanks,
>>
> 
> I'm the bug reporter. I'll combine the two patches and report back
> 

Nope, issue is still present with both patches applied. Tried my own build and
a package Laura provided

Thanks,
Cole





0001-net-don-t-fast-patch-mismatched-sockets-in-STRICT-mo.patch
Description: 0001-net-don-t-fast-patch-mismatched-sockets-in-STRICT-mo.patch

Re: RFC: Audit Kernel Container IDs

2017-09-13 Thread Carlos O'Donell

On 09/13/2017 12:13 PM, Richard Guy Briggs wrote:
> Containers are a userspace concept.  The kernel knows nothing of them.

I am looking at this RFC from a userspace perspective, particularly from
the loader's point of view and the unshare syscall and the semantics that
arise from the use of it.

At a high level what you are doing is providing a way to group, without
hierarchy, processes and namespaces. The processes can move between
container's if they have CAP_CONTAINER_ADMIN and can open and write to
a special proc file.

* With unshare a thread may dissociate part of its execution context and
  therefore see a distinct mount namespace. When you say "process" in this
  particular RFC do you exclude the fact that a thread might be in a
  distinct container from the rest of the threads in the process?

> The Linux audit system needs a way to be able to track the container
> provenance of events and actions.  Audit needs the kernel's help to do
> this.

* Why does the Linux audit system need to tracker container provenance?

  - How does it help to provide better audit messages?

  - Is it be enough to list the namespace that a process occupies?

* Why does it need the kernel's help?

  - Is there a race condition that is only fixable with kernel support?

  - Or is it easier with kernel help but not required?

Providing background on these questions would help clarify the
design requirements.

> Since the concept of a container is entirely a userspace concept, a
> trigger signal from the userspace container orchestration system
> initiates this.  This will define a point in time and a set of resources
> associated with a particular container with an audit container ID.

Please don't use the word 'signal', I suggest 'register' since you are
writing to a filesystem.

> The trigger is a pseudo filesystem (proc, since PID tree already exists)
> write of a u64 representing the container ID to a file representing a
> process that will become the first process in a new container.
> This might place restrictions on mount namespaces required to define a
> container, or at least careful checking of namespaces in the kernel to
> verify permissions of the orchestrator so it can't change its own
> container ID.
> A bind mount of nsfs may be necessary in the container orchestrator's
> mntNS.
> 
> Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
> filesystem to have this action permitted.  At that time, record the
> child container's user-supplied 64-bit container identifier along with

What is a "child container?" Containers don't have any hierarchy.

I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents
your continued operation as we have today?

> the child container's first process (which may become the container's
> "init" process) process ID (referenced from the initial PID namespace),
> all namespace IDs (in the form of a nsfs device number and inode number
> tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying
> op=$action field.

What kind of requirement is there on the first tid/pid registering
the container ID? What if the 8th tid/pid does the registration?
Would that mean that the first process of the container did not
register? It seems like you are suggesting that the registration
by the 8th tid/pid causes a cascading registration progress,
registering all tid/pids in the same grouping? Is that true?

> Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
> container ID present on an auditable action or event.
> 
> Forked and cloned processes inherit their parent's container ID,
> referenced in the process' audit_context struct.

So a cloned process with CLONE_NEWNS has the came container ID
as the parent process that called clone, at least until the clone
has time to change to a new container ID?

Do you forsee any case where someone might need a semantic that is
slightly different? For example wanting to set the container ID on
clone?

> Log the creation of every namespace, inheriting/adding its spawning
> process' containerID(s), if applicable.  Include the spawning and
> spawned namespace IDs (device and inode number tuples).
> [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
> Note: At this point it appears only network namespaces may need to track
> container IDs apart from processes since incoming packets may cause an
> auditable event before being associated with a process.

OK.

> Log the destruction of every namespace when it is no longer used by any
> process, include the namespace IDs (device and inode number tuples).
> [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
> 
> Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action)
> the parent and child namespace IDs for any changes to a process'
> namespaces. [setns(2)]
> Note: It may be possible to combine AUDIT_NS_* record formats and
> distinguish them with an op=$action field depending on the fields
> required for each message type.

[PATCH] MAINTAINERS: review Renesas DT bindings as well

2017-09-13 Thread Sergei Shtylyov

When adding myself  as a reviewer for  the Renesas  Ethernet drivers
I somehow forgot about the bindings -- I want to review them as well.

Fixes: 8e6569af3a1b ("MAINTAINERS: add myself as Renesas Ethernet drivers 
reviewer")
Signed-off-by: Sergei Shtylyov 

---
The patch is against DaveM's 'net.git' repo.

 MAINTAINERS |2 ++
 1 file changed, 2 insertions(+)

Index: net/MAINTAINERS
===
--- net.orig/MAINTAINERS
+++ net/MAINTAINERS
@@ -11379,6 +11379,8 @@ RENESAS ETHERNET DRIVERS
 R: Sergei Shtylyov 
 L: netdev@vger.kernel.org
 L: linux-renesas-...@vger.kernel.org
+F: Documentation/devicetree/bindings/net/renesas,*.txt
+F: Documentation/devicetree/bindings/net/sh_eth.txt
 F: drivers/net/ethernet/renesas/
 F: include/linux/sh_eth.h

Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression

2017-09-13 Thread Cole Robinson

On 09/13/2017 01:40 PM, Cole Robinson wrote:
> On 09/13/2017 01:28 PM, Josef Bacik wrote:
>> Sorry I thought I had made this other fix, can you apply this on top of the 
>> other one and try that?  I have more things to try if this doesn’t work, 
>> sorry you are playing go between, but I want to make sure I know _which_ fix 
>> actually fixes the problem, and then clean up in followup patches.  Thanks,
>>
> 
> I'm the bug reporter. I'll combine the two patches and report back
> 

Nope, issue is still present with both patches applied. Tried my own build and
a package Laura provided

Thanks,
Cole

[PATCH] ath10k: make ath10k_hw_ce_regs const

2017-09-13 Thread Bhumika Goyal

Make them const as they are not modified in the file referencing
them. They are only stored in the const field 'hw_ce_reg' of an ath10k
structure. Also, make the declarations in the header const.

Signed-off-by: Bhumika Goyal 
---
 drivers/net/wireless/ath/ath10k/hw.c | 4 ++--
 drivers/net/wireless/ath/ath10k/hw.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/hw.c 
b/drivers/net/wireless/ath/ath10k/hw.c
index a860691..07df7c6 100644
--- a/drivers/net/wireless/ath/ath10k/hw.c
+++ b/drivers/net/wireless/ath/ath10k/hw.c
@@ -310,7 +310,7 @@
.wm_high= _dst_wm_high,
 };
 
-struct ath10k_hw_ce_regs wcn3990_ce_regs = {
+const struct ath10k_hw_ce_regs wcn3990_ce_regs = {
.sr_base_addr   = 0x,
.sr_size_addr   = 0x0008,
.dr_base_addr   = 0x000c,
@@ -457,7 +457,7 @@ struct ath10k_hw_ce_regs wcn3990_ce_regs = {
.wm_high= _dst_wm_high,
 };
 
-struct ath10k_hw_ce_regs qcax_ce_regs = {
+const struct ath10k_hw_ce_regs qcax_ce_regs = {
.sr_base_addr   = 0x,
.sr_size_addr   = 0x0004,
.dr_base_addr   = 0x0008,
diff --git a/drivers/net/wireless/ath/ath10k/hw.h 
b/drivers/net/wireless/ath/ath10k/hw.h
index 0c089f6..f80840f 100644
--- a/drivers/net/wireless/ath/ath10k/hw.h
+++ b/drivers/net/wireless/ath/ath10k/hw.h
@@ -369,8 +369,8 @@ struct ath10k_hw_values {
 extern const struct ath10k_hw_values qca9888_values;
 extern const struct ath10k_hw_values qca4019_values;
 extern const struct ath10k_hw_values wcn3990_values;
-extern struct ath10k_hw_ce_regs wcn3990_ce_regs;
-extern struct ath10k_hw_ce_regs qcax_ce_regs;
+extern const struct ath10k_hw_ce_regs wcn3990_ce_regs;
+extern const struct ath10k_hw_ce_regs qcax_ce_regs;
 
 void ath10k_hw_fill_survey_time(struct ath10k *ar, struct survey_info *survey,
u32 cc, u32 rcc, u32 cc_prev, u32 rcc_prev);
-- 
1.9.1

Re: scheduling while atomic from vmci_transport_recv_stream_cb in 3.16 kernels

2017-09-13 Thread Jorgen S. Hansen


> On Sep 13, 2017, at 5:19 PM, Michal Hocko  wrote:
> 
> On Wed 13-09-17 15:07:26, Jorgen S. Hansen wrote:
>> 
>>> On Sep 12, 2017, at 11:08 AM, Michal Hocko  wrote:
>>> 
>>> Hi,
>>> we are seeing the following splat with Debian 3.16 stable kernel
>>> 
>>> BUG: scheduling while atomic: MATLAB/26771/0x0100
>>> Modules linked in: veeamsnap(O) hmac cbc cts nfsv4 dns_resolver 
>>> rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache 
>>> sunrpc vmw_vso$
>>> CPU: 0 PID: 26771 Comm: MATLAB Tainted: G   O  3.16.0-4-amd64 #1 
>>> Debian 3.16.7-ckt20-1+deb8u3
>>> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference 
>>> Platform, BIOS 6.00 09/21/2015
>>> 88315c1e4c20 8150db3f 88193f803dc8 8150acdf
>>> 815103a2 00012f00 8819423dbfd8 00012f00
>>> 88315c1e4c20 88193f803dc8 88193f803d50 88193f803dc0
>>> Call Trace:
>>>   [] ? dump_stack+0x41/0x51
>>> [] ? __schedule_bug+0x48/0x55
>>> [] ? __schedule+0x5d2/0x700
>>> [] ? schedule_timeout+0x229/0x2a0
>>> [] ? select_task_rq_fair+0x390/0x700
>>> [] ? check_preempt_wakeup+0x120/0x1d0
>>> [] ? wait_for_completion+0xa8/0x120
>>> [] ? wake_up_state+0x10/0x10
>>> [] ? call_rcu_bh+0x20/0x20
>>> [] ? wait_rcu_gp+0x4b/0x60
>>> [] ? ftrace_raw_output_rcu_utilization+0x40/0x40
>>> [] ? vmci_event_unsubscribe+0x75/0xb0 [vmw_vmci]
>>> [] ? vmci_transport_destruct+0x1d/0xe0 
>>> [vmw_vsock_vmci_transport]
>>> [] ? vsock_sk_destruct+0x13/0x60 [vsock]
>>> [] ? __sk_free+0x1a/0x130
>>> [] ? vmci_transport_recv_stream_cb+0x1e8/0x2d0 
>>> [vmw_vsock_vmci_transport]
>>> [] ? vmci_datagram_invoke_guest_handler+0xaa/0xd0 
>>> [vmw_vmci]
>>> [] ? vmci_dispatch_dgs+0xc1/0x200 [vmw_vmci]
>>> [] ? tasklet_action+0xf4/0x100
>>> [] ? __do_softirq+0xf1/0x290
>>> [] ? irq_exit+0x95/0xa0
>>> [] ? do_IRQ+0x52/0xe0
>>> [] ? common_interrupt+0x6d/0x6d
>>> 
>>> AFAICS this has been fixed by 4ef7ea9195ea ("VSOCK: sock_put wasn't safe
>>> to call in interrupt context") but this patch hasn't been backported to
>>> stable trees. It applies cleanly on top of 3.16 stable tree but I am not
>>> familiar with the code to send the backport to the stable maintainer
>>> directly.
>>> 
>>> Could you double check that the patch below (just a blind cherry-pick)
>>> is correct and it doesn't need additional patches on top?
>> 
>> Hi,
>> 
>> The patch below has been used to fix the above issue by other distros
>> - among them Redhat for the 3.10 kernel, so it should work for 3.16 as
>> well.
> 
> Thanks for the confirmation. I do not see 4ef7ea9195ea ("VSOCK: sock_put
> wasn't safe to call in interrupt context") in 3.10 stable branch
> though.
> 
>> In addition to the patch above, there are two other patches that
>> need to be applied on top for the fix to be correct:
>> 
>> 8566b86ab9f0f45bc6f7dd422b21de9d0cf5415a "VSOCK: Fix lockdep issue."
>> 
>> and
>> 
>> 8ab18d71de8b07d2c4d6f984b718418c09ea45c5 "VSOCK: Detach QP check should 
>> filter out non matching QPs."
> 
> Good to know. I will send all three patches cherry-picked on top of the
> current 3.16 stable branch. Could you have a look please?

The patch series look good to me.

Thanks for taking care of this,
Jorgen

Re: [PATCH v5 05/10] dt-bindings: net: dwmac-sun8i: update documentation about integrated PHY

2017-09-13 Thread Rob Herring

On Fri, Sep 08, 2017 at 09:43:25AM +0200, Corentin Labbe wrote:
> On Fri, Sep 08, 2017 at 09:25:38AM +0200, Maxime Ripard wrote:
> > On Fri, Sep 08, 2017 at 09:11:51AM +0200, Corentin Labbe wrote:
> > > This patch add documentation about the MDIO switch used on sun8i-h3-emac
> > > for integrated PHY.
> > > 
> > > Signed-off-by: Corentin Labbe 
> > > ---
> > >  .../devicetree/bindings/net/dwmac-sun8i.txt| 127 
> > > +++--
> > >  1 file changed, 120 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt 
> > > b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
> > > index 725f3b187886..3fa0e54825ea 100644
> > > --- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
> > > +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
> > > @@ -39,7 +39,7 @@ Optional properties for the following compatibles:
> > >  - allwinner,leds-active-low: EPHY LEDs are active low
> > >  
> > >  Required child node of emac:
> > > -- mdio bus node: should be named mdio
> > > +- mdio bus node: should be labelled mdio
> > 
> > labels do not end up in the final DT (while the names do) so why are
> > you making this change?
> > 
> 
> I misunderstood label/name.
> Anyway, this contrainst should leave due to "snps,dwmac-mdio MDIOs are 
> automatically registered"
> 
> > >  
> > >  Required properties of the mdio node:
> > >  - #address-cells: shall be 1
> > > @@ -48,14 +48,28 @@ Required properties of the mdio node:
> > >  The device node referenced by "phy" or "phy-handle" should be a child 
> > > node
> > >  of the mdio node. See phy.txt for the generic PHY bindings.
> > >  
> > > -Required properties of the phy node with the following compatibles:
> > > +The following compatibles require an mdio-mux node called "mdio-mux":
> > > +  - "allwinner,sun8i-h3-emac"
> > > +  - "allwinner,sun8i-v3s-emac":
> > > +Required properties for the mdio-mux node:
> > > +  - compatible = "mdio-mux"
> > > +  - one child mdio for the integrated mdio
> > > +  - one child mdio for the external mdio if present (V3s have none)
> > > +Required properties for the mdio-mux children node:
> > > +  - reg: 0 for internal MDIO bus, 1 for external MDIO bus
> > > +
> > > +The following compatibles require a PHY node representing the integrated
> > > +PHY, under the integrated MDIO bus node if an mdio-mux node is used:
> > >- "allwinner,sun8i-h3-emac",
> > >- "allwinner,sun8i-v3s-emac":
> > > +
> > > +Required properties of the integrated phy node:
> > >  - clocks: a phandle to the reference clock for the EPHY
> > >  - resets: a phandle to the reset control for the EPHY
> > > +- phy-is-integrated
> > > +- Should be a child of the integrated mdio
> > 
> > I'm not sure what you mean by that, you ask that it should (so not
> > required?) be a child of the integrated mdio...
> > 
> 
> I will change words to "must"
> 
> > >  
> > > -Example:
> > > -
> > > +Example with integrated PHY:
> > >  emac: ethernet@1c0b000 {
> > >   compatible = "allwinner,sun8i-h3-emac";
> > >   syscon = <>;
> > > @@ -72,13 +86,112 @@ emac: ethernet@1c0b000 {
> > >   phy-handle = <_mii_phy>;
> > >   phy-mode = "mii";
> > >   allwinner,leds-active-low;
> > > +
> > > + mdio0: mdio {
> > 
> > (You don't label it mdio here, unlike what was asked before)
> > 
> > > + #address-cells = <1>;
> > > + #size-cells = <0>;
> > > + compatible = "snps,dwmac-mdio";
> > > + };
> > 
> > I think Rob wanted that node gone?
> > 
> 
> MDIO mux does not work without a parent MDIO, either gived by "parent-bus" or 
> directly via mdio_mux_init() (like it is the case in dwmac-sun8i)

Is the MDIO controller "allwinner,sun8i-h3-emac" or "snps,dwmac-mdio"? 
If the latter, then I think the node is fine, but then the mux should be 
a child node of it. IOW, the child of an MDIO controller should either 
be a mux node or slave devices.

> 
> > > + mdio-mux {
> > > + compatible = "mdio-mux";
> > > + #address-cells = <1>;
> > > + #size-cells = <0>;
> > > +
> > > + int_mdio: mdio@1 {
> > > + reg = <0>;

unit address of 1 and reg prop of 0 don't match. Build your dtb with 
W=2.

> > > + #address-cells = <1>;
> > > + #size-cells = <0>;
> > > + int_mii_phy: ethernet-phy@1 {
> > > + reg = <1>;
> > > + clocks = < CLK_BUS_EPHY>;
> > > + resets = < RST_BUS_EPHY>;
> > > + phy-is-integrated

Missing ;

> > > + };
> > > + };
> > 
> > ... And in your example it's a child of the mdio mux?
> > 
> 
> So I confirm, integrated PHY must be a child of integrated MDIO (that must be 
> a child of mdio-mux).
> The example is good.
> 
> > > + ext_mdio: mdio@0 {
> > > + reg = <1>;

Another unit address mismatch.

Rob

[PATCH net] net_sched: gen_estimator: fix scaling error in bytes/packets samples

2017-09-13 Thread Eric Dumazet

From: Eric Dumazet 

Denys reported wrong rate estimations with HTB classes.

It appears the bug was added in linux-4.10, since my tests
where using intervals of one second only.

HTB using 4 sec default rate estimators, reported rates
were 4x higher.

We need to properly scale the bytes/packets samples before
integrating them in EWMA.

Tested:
 echo 1 >/sys/module/sch_htb/parameters/htb_rate_est

 Setup HTB with one class with a rate/cail of 5Gbit

 Generate traffic on this class

 tc -s -d cl sh dev eth0 classid 7002:11 
class htb 7002:11 parent 7002:1 prio 5 quantum 20 rate 5Gbit ceil
5Gbit linklayer ethernet burst 8b/1 mpu 0b cburst 8b/1 mpu 0b
level 0 rate_handle 1 
 Sent 1488215421648 bytes 982969243 pkt (dropped 0, overlimits 0
requeues 0) 
 rate 5Gbit 412814pps backlog 136260b 2p requeues 0 
 TCP pkts/rtx 982969327/45 bytes 1488215557414/68130
 lended: 22732826 borrowed: 0 giants: 0
 tokens: -1684 ctokens: -1684
 
Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate 
estimators")
Signed-off-by: Eric Dumazet 
Reported-by: Denys Fedoryshchenko 
---
 net/core/gen_estimator.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 
0385dece1f6fe5e26df1ce5f40956a79a2eebbf4..7c1ffd6f950172c1915d8e5fa2b5e3f77e4f4c78
 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -83,10 +83,10 @@ static void est_timer(unsigned long arg)
u64 rate, brate;
 
est_fetch_counters(est, );
-   brate = (b.bytes - est->last_bytes) << (8 - est->ewma_log);
+   brate = (b.bytes - est->last_bytes) << (10 - est->ewma_log - 
est->intvl_log);
brate -= (est->avbps >> est->ewma_log);
 
-   rate = (u64)(b.packets - est->last_packets) << (8 - est->ewma_log);
+   rate = (u64)(b.packets - est->last_packets) << (10 - est->ewma_log - 
est->intvl_log);
rate -= (est->avpps >> est->ewma_log);
 
write_seqcount_begin(>seq);

Re: [PATCH v5 02/10] dt-bindings: net: Restore sun8i dwmac binding

2017-09-13 Thread Rob Herring

On Fri, Sep 08, 2017 at 09:11:48AM +0200, Corentin Labbe wrote:
> This patch restore dt-bindings documentation about dwmac-sun8i
> This reverts commit 8aa33ec2f481 ("dt-bindings: net: Revert sun8i dwmac 
> binding")

Why?

> 
> Signed-off-by: Corentin Labbe 
> ---
>  .../devicetree/bindings/net/dwmac-sun8i.txt| 84 
> ++
>  1 file changed, 84 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/dwmac-sun8i.txt

Otherwise,

Acked-by: Rob Herring

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Eric Dumazet

On Wed, 2017-09-13 at 10:58 -0700, Eric Dumazet wrote:
> On Wed, 2017-09-13 at 20:35 +0300, Denys Fedoryshchenko wrote:
> 
> > Overlimits never appear in HTB as i know, here is simulation on this 
> > class that have constant "at least" 1G traffic, i throttled it to 1Kbit 
> > to simulate forced drops:
> > 
> > shapernew ~ # sh /etc/shaper.cfg;sleep 1;tc -s -d class show dev 
> > eth3.777 classid 1:111;tc qdisc del dev eth3.777 root
> > class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 1Kbit 
> > ceil 1Kbit linklayer ethernet burst 31280b/1 mpu 0b cburst 31280b/1 mpu 
> > 0b level 0
> >   Sent 134350019 bytes 117520 pkt (dropped 7819, overlimits 0 requeues 0)
> >   backlog 7902126b 4976p requeues 0
> >   lended: 86694 borrowed: 0 giants: 0
> >   tokens: -93750 ctokens: -93750
> > 
> 
> Oh right, I am using a local patch for this. Time to upstream :/

Please try :

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 
7e148376ba528efabe5a53a09653f9161c264be7..c6d7ae81b41f4e277afb93a3003fefcd3f27de35
 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -142,6 +142,7 @@ struct htb_class {
struct rb_node  node[TC_HTB_NUMPRIO];   /* node for self or 
feed tree */
 
unsigned int drops cacheline_aligned_in_smp;
+   unsigned intoverlimits;
 };
 
 struct htb_level {
@@ -533,6 +534,9 @@ htb_change_class_mode(struct htb_sched *q, struct htb_class 
*cl, s64 *diff)
if (new_mode == cl->cmode)
return;
 
+   if (new_mode == HTB_CANT_SEND)
+   cl->overlimits++;
+
if (cl->prio_activity) {/* not necessary: speed optimization */
if (cl->cmode != HTB_CANT_SEND)
htb_deactivate_prios(q, cl);
@@ -1143,6 +1147,7 @@ htb_dump_class_stats(struct Qdisc *sch, unsigned long 
arg, struct gnet_dump *d)
struct htb_class *cl = (struct htb_class *)arg;
struct gnet_stats_queue qs = {
.drops = cl->drops,
+   .overlimits = cl->overlimits,
};
__u32 qlen = 0;

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Eric Dumazet

On Wed, 2017-09-13 at 20:35 +0300, Denys Fedoryshchenko wrote:

> Overlimits never appear in HTB as i know, here is simulation on this 
> class that have constant "at least" 1G traffic, i throttled it to 1Kbit 
> to simulate forced drops:
> 
> shapernew ~ # sh /etc/shaper.cfg;sleep 1;tc -s -d class show dev 
> eth3.777 classid 1:111;tc qdisc del dev eth3.777 root
> class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 1Kbit 
> ceil 1Kbit linklayer ethernet burst 31280b/1 mpu 0b cburst 31280b/1 mpu 
> 0b level 0
>   Sent 134350019 bytes 117520 pkt (dropped 7819, overlimits 0 requeues 0)
>   backlog 7902126b 4976p requeues 0
>   lended: 86694 borrowed: 0 giants: 0
>   tokens: -93750 ctokens: -93750
> 

Oh right, I am using a local patch for this. Time to upstream :/

Re: Memory leaks in conntrack

2017-09-13 Thread Cong Wang

On Wed, Sep 13, 2017 at 9:45 AM, Cong Wang  wrote:
> On Wed, Sep 13, 2017 at 1:05 AM, Florian Westphal  wrote:
>> Cong Wang  wrote:
>>> While testing my TC filter patches (so not related to conntrack), the
>>> following memory leaks are shown up:
>>>
>>> unreferenced object 0x9b19ba551228 (size 128):
>>>   comm "chronyd", pid 338, jiffies 4294910829 (age 53.188s)
>>>   hex dump (first 32 bytes):
>>> 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  
>>> 00 00 00 00 18 00 00 30 00 00 00 00 00 00 00 00  ...0
>>>   backtrace:
>>> [] create_object+0x169/0x2aa
>>> [] kmemleak_alloc+0x25/0x41
>>> [] slab_post_alloc_hook+0x44/0x65
>>> [] __kmalloc_track_caller+0x113/0x146
>>> [] __krealloc+0x4a/0x69
>>> [] nf_ct_ext_add+0xe1/0x145
>>> [] init_conntrack+0x1f7/0x36e
>>> [] nf_conntrack_in+0x1d3/0x326
>>> [] ipv4_conntrack_local+0x4d/0x50
>>> [] nf_hook_slow+0x3c/0x9b
>>> [] nf_hook.constprop.40+0xbe/0xd8
>>> [] __ip_local_out+0xb3/0xbf
>>> [] ip_local_out+0x1c/0x36
>>> [] ip_send_skb+0x19/0x3d
>>> [] udp_send_skb+0x17e/0x1df
>>> [] udp_sendmsg+0x5a2/0x77c
>>> unreferenced object 0x9b19a69b3340 (size 336):
>>>   comm "chronyd", pid 338, jiffies 4294910868 (age 53.032s)
>>>   hex dump (first 32 bytes):
>>> 01 00 00 00 5a 5a 5a 5a 00 00 00 00 ad 4e ad de  .N..
>>> ff ff ff ff 5a 5a 5a 5a ff ff ff ff ff ff ff ff  
>>>   backtrace:
>>> [] create_object+0x169/0x2aa
>>> [] kmemleak_alloc+0x25/0x41
>>> [] slab_post_alloc_hook+0x44/0x65
>>> [] kmem_cache_alloc+0xd7/0x1f1
>>> [] __nf_conntrack_alloc+0xa2/0x146
>>> [] init_conntrack+0xb2/0x36e
>>> [] nf_conntrack_in+0x1d3/0x326
>>> [] ipv4_conntrack_local+0x4d/0x50
>>> [] nf_hook_slow+0x3c/0x9b
>>> [] nf_hook.constprop.40+0xbe/0xd8
>>> [] __ip_local_out+0xb3/0xbf
>>> [] ip_local_out+0x1c/0x36
>>> [] ip_send_skb+0x19/0x3d
>>> [] udp_send_skb+0x17e/0x1df
>>> [] udp_sendmsg+0x5a2/0x77c
>>> [] inet_sendmsg+0x37/0x5e
>>>
>>> I don't touch chronyd in my VM, so I have no idea why it sends out UDP
>>> packets, my guess is it is some periodical packet.
>>>
>>> I don't think I use conntrack either, since /proc/net/ip_conntrack
>>> does not exist.
>>
>> You probably do, can you try "cat /proc/net/nf_conntrack" instead?
>>
>> (otherwise there should be no ipv4_conntrack_local() invocation
>>  since we would not register this hook at all).
>
> Yeah it is very weird but it is true:
>
> [root@localhost ~]# echo scan > /sys/kernel/debug/kmemleak
> [  133.450823] kmemleak: 18 new suspected memory leaks (see
> /sys/kernel/debug/kmemleak)
> [root@localhost ~]# cat /proc/net/ip_conntrack
> cat: /proc/net/ip_conntrack: No such file or directory
> [root@localhost ~]# cat /sys/kernel/debug/kmemleak
> unreferenced object 0x95c1e0b24040 (size 336):
> ...

Oops, you mean nf_conntrack... Here we go:

[root@localhost ~]# cat /proc/net/nf_conntrack
ipv4 2 udp  17 116 src=192.168.124.6 dst=204.2.134.162
sport=123 dport=123 src=204.2.134.162 dst=192.168.124.6 sport=123
dport=123 [ASSURED] mark=0 zone=0 use=2
ipv4 2 udp  17 117 src=192.168.124.6 dst=45.79.187.10
sport=123 dport=123 src=45.79.187.10 dst=192.168.124.6 sport=123
dport=123 [ASSURED] mark=0 zone=0 use=2
ipv4 2 udp  17 110 src=192.168.124.6 dst=192.168.124.1
sport=35486 dport=53 src=192.168.124.1 dst=192.168.124.6 sport=53
dport=35486 [ASSURED] mark=0 zone=0 use=2
ipv4 2 udp  17 110 src=192.168.124.6 dst=192.168.124.1
sport=52373 dport=53 src=192.168.124.1 dst=192.168.124.6 sport=53
dport=52373 [ASSURED] mark=0 zone=0 use=2
ipv4 2 unknown  2 518 src=192.168.124.6 dst=224.0.0.22 [UNREPLIED]
src=224.0.0.22 dst=192.168.124.6 mark=0 zone=0 use=2
ipv4 2 udp  17 110 src=192.168.124.6 dst=192.168.124.1
sport=43242 dport=53 src=192.168.124.1 dst=192.168.124.6 sport=53
dport=43242 [ASSURED] mark=0 zone=0 use=2
ipv4 2 udp  17 116 src=192.168.124.6 dst=96.226.123.196
sport=123 dport=123 src=96.226.123.196 dst=192.168.124.6 sport=123
dport=123 [ASSURED] mark=0 zone=0 use=2
ipv4 2 udp  17 110 src=192.168.124.6 dst=192.168.124.1
sport=42838 dport=53 src=192.168.124.1 dst=192.168.124.6 sport=53
dport=42838 [ASSURED] mark=0 zone=0 use=2
ipv4 2 udp  17 117 src=192.168.124.6 dst=97.127.104.4
sport=123 dport=123 src=97.127.104.4 dst=192.168.124.6 sport=123
dport=123 [ASSURED] mark=0 zone=0 use=2

Re: [PATCH/RFC net-next] ravb: RX checksum offload

2017-09-13 Thread Sergei Shtylyov


Hello!

On 09/12/2017 04:04 PM, Simon Horman wrote:


Add support for RX checksum offload. This is enabled by default and
may be disabled and re-enabled using ethtool:

  # ethtool -K eth0 rx off
  # ethtool -K eth0 rx on

The RAVB provides a simple checksumming scheme which appears to be
completely compatible with CHECKSUM_COMPLETE: a 1's complement sum of


   Hm, the gen2/3 manuals say calculation doesn't involve bit inversion...


all packet data after the L2 header is appended to packet data; this may
be trivially read by the driver and used to update the skb accordingly.

>

In terms of performance throughput is close to gigabit line-rate both with
and without RX checksum offload enabled. Perf output, however, appears to
indicate that significantly less time is spent in do_csum(). This is as
expected.


[...]


By inspection this also appears to be compatible with the ravb found
on R-Car Gen 2 SoCs, however, this patch is currently untested on such
hardware.


   I probably won't be able to test it on gen2 too...


Signed-off-by: Simon Horman 


   I'm generally OK with the patch but have some questions/comments below...


---
  drivers/net/ethernet/renesas/ravb_main.c | 58 +++-
  1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c 
b/drivers/net/ethernet/renesas/ravb_main.c
index fdf30bfa403b..7c6438cd7de7 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c

[...]

@@ -1842,6 +1859,41 @@ static int ravb_do_ioctl(struct net_device *ndev, struct 
ifreq *req, int cmd)
return phy_mii_ioctl(phydev, req, cmd);
  }
  
+static void ravb_set_rx_csum(struct net_device *ndev, bool enable)

+{
+   struct ravb_private *priv = netdev_priv(ndev);
+   unsigned long flags;
+
+   spin_lock_irqsave(>lock, flags);
+
+   /* Disable TX and RX */
+   ravb_rcv_snd_disable(ndev);
+
+   /* Modify RX Checksum setting */
+   if (enable)
+   ravb_modify(ndev, ECMR, 0, ECMR_RCSC);


   Please use ECMR_RCSC as the 3rd argument too to conform the common driver 
style.



+   else
+   ravb_modify(ndev, ECMR, ECMR_RCSC, 0);


   This *if* can easily be folded into a single ravb_modify() call...

[...]

@@ -2004,6 +2057,9 @@ static int ravb_probe(struct platform_device *pdev)
if (!ndev)
return -ENOMEM;
  
+	ndev->features |= NETIF_F_RXCSUM;

+   ndev->hw_features |= ndev->features;


   Hum, both fields are 0 before this? Then why not use '=' instead of '|='?
Even if not, why not just use the same value as both the rvalues?

[...]

MBR, Sergei

[PATCH v2] net: smsc911x: Quieten netif during suspend

2017-09-13 Thread Geert Uytterhoeven

If the network interface is kept running during suspend, the net core
may call net_device_ops.ndo_start_xmit() while the Ethernet device is
still suspended, which may lead to a system crash.

E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
driven by a PM controlled clock.  If the Ethernet registers are accessed
while the clock is not running, the system will crash with an imprecise
external abort.

As this is a race condition with a small time window, it is not so easy
to trigger at will.  Using pm_test may increase your chances:

# echo 0 > /sys/module/printk/parameters/console_suspend
# echo platform > /sys/power/pm_test
# echo mem > /sys/power/state

To fix this, make sure the network interface is quietened during
suspend.

Signed-off-by: Geert Uytterhoeven 
---
This is v2 of the series "[PATCH 0/2] net: Fix crashes due to activity
during suspend", which degenerated into a single patch after commit
ebc8254aeae34226 ("Revert "net: phy: Correctly process PHY_HALTED in
phy_stop_machine()"") made "[PATCH 1/2] net: phy: Freeze PHY polling before
suspending devices" no longer needed.

v2:
  - Spelling s/quit/quiet/g.

No stacktrace is provided, as the imprecise external abort is usually
reported from an innocent looking and unrelated function like
__loop_delay(), cpu_idle_poll(), or arch_timer_read_counter_long().
---
 drivers/net/ethernet/smsc/smsc911x.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/smsc/smsc911x.c 
b/drivers/net/ethernet/smsc/smsc911x.c
index 0b6a39b003a4e188..012fb66eed8dd618 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -2595,6 +2595,11 @@ static int smsc911x_suspend(struct device *dev)
struct net_device *ndev = dev_get_drvdata(dev);
struct smsc911x_data *pdata = netdev_priv(ndev);
 
+   if (netif_running(ndev)) {
+   netif_stop_queue(ndev);
+   netif_device_detach(ndev);
+   }
+
/* enable wake on LAN, energy detection and the external PME
 * signal. */
smsc911x_reg_write(pdata, PMT_CTRL,
@@ -2628,7 +2633,15 @@ static int smsc911x_resume(struct device *dev)
while (!(smsc911x_reg_read(pdata, PMT_CTRL) & PMT_CTRL_READY_) && --to)
udelay(1000);
 
-   return (to == 0) ? -EIO : 0;
+   if (to == 0)
+   return -EIO;
+
+   if (netif_running(ndev)) {
+   netif_device_attach(ndev);
+   netif_start_queue(ndev);
+   }
+
+   return 0;
 }
 
 static const struct dev_pm_ops smsc911x_pm_ops = {
-- 
2.7.4

Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression

2017-09-13 Thread Cole Robinson

On 09/13/2017 01:28 PM, Josef Bacik wrote:
> Sorry I thought I had made this other fix, can you apply this on top of the 
> other one and try that?  I have more things to try if this doesn’t work, 
> sorry you are playing go between, but I want to make sure I know _which_ fix 
> actually fixes the problem, and then clean up in followup patches.  Thanks,
> 

I'm the bug reporter. I'll combine the two patches and report back

Thanks,
Cole

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Denys Fedoryshchenko


On 2017-09-13 20:20, Eric Dumazet wrote:

On Wed, 2017-09-13 at 20:12 +0300, Denys Fedoryshchenko wrote:

For my case, as load increased now, i am hitting same issue (i tried 
to

play with quantum / bursts as well, didnt helped):

tc -s -d class show dev eth3.777 classid 1:111;sleep 5;tc -s -d class
show dev eth3.777 classid 1:111
class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit
ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 
10b/1

mpu 0b level 0
  Sent 864151559 bytes 730566 pkt (dropped 15111, overlimits 0 
requeues

0)
  backlog 73968000b 39934p requeues 0
  lended: 499867 borrowed: 0 giants: 0
  tokens: 608 ctokens: 121



You have drops (and ~40,000 packets in backlog)



class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit
ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 
10b/1

mpu 0b level 0
  Sent 1469352160 bytes 1243649 pkt (dropped 42933, overlimits 0 
requeues

0)
  backlog 82536047b 39963p requeues 0
  lended: 810475 borrowed: 0 giants: 0
  tokens: 612 ctokens: 122

(1469352160-864151559)/5*8
968320961.6000
Less than 1Gbit and it's being throttled


It is not : "overlimits 0"  means this class was not throttled.
Overlimits never appear in HTB as i know, here is simulation on this 
class that have constant "at least" 1G traffic, i throttled it to 1Kbit 
to simulate forced drops:


shapernew ~ # sh /etc/shaper.cfg;sleep 1;tc -s -d class show dev 
eth3.777 classid 1:111;tc qdisc del dev eth3.777 root
class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 1Kbit 
ceil 1Kbit linklayer ethernet burst 31280b/1 mpu 0b cburst 31280b/1 mpu 
0b level 0

 Sent 134350019 bytes 117520 pkt (dropped 7819, overlimits 0 requeues 0)
 backlog 7902126b 4976p requeues 0
 lended: 86694 borrowed: 0 giants: 0
 tokens: -93750 ctokens: -93750

Re: [PATCH 0/2] net: Fix crashes due to activity during suspend

2017-09-13 Thread Geert Uytterhoeven

Hi Florian,

On Thu, Sep 7, 2017 at 3:09 PM, Florian Fainelli  wrote:
> On 08/23/2017 10:13 AM, Florian Fainelli wrote:
>> On 08/23/2017 04:45 AM, Geert Uytterhoeven wrote:
>>> On Tue, Aug 22, 2017 at 8:49 PM, Florian Fainelli  
>>> wrote:
 On 08/22/2017 11:37 AM, Geert Uytterhoeven wrote:
> If an Ethernet device is used while the device is suspended, the system 
> may
> crash.
>
> E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is
> driven by a PM controlled clock.  If the Ethernet registers are accessed
> while the clock is not running, the system will crash with an imprecise
> external abort.
>
> This patch series fixes two of such crashes:
>   1. The first patch prevents the PHY polling state machine from accessing
>  PHY registers while a device is suspended,
>   2. The second patch prevents the net core from trying to transmit 
> packets
>  when an smsc911x device is suspended.
>
> Both crashes can be reproduced on sh73a0/kzm9g and r8a73a4/ape6evm during
> s2ram (rarely), or by using pm_test (more likely to trigger):
>
> # echo 0 > /sys/module/printk/parameters/console_suspend
> # echo platform > /sys/power/pm_test
> # echo mem > /sys/power/state
>
> With this series applied, my test systems survive a loop of 100 test
> suspends.

 It seems to me like part, if not the entire problem is that smsc91xx's
 suspend and resume functions are way too simplistic and absolutely do
 not manage the PHY during suspend/resume, the PHY state machine is not
 even stopped, so of course, this will cause bus errors if you access
 those registers.

 You are addressing this as part of patch 2, but this seems to me like
 this is still a bit incomplete and you'd need at least phy_stop() and/or
 phy_suspend() (does a power down of the PHY) and phy_start() and/or
 phy_resume() calls to complete the PHY state machine shutdown during
 suspend.

 Have you tried that?
>>>
>>> Unfortunately that doesn't help.
>>> In state PHY_HALTED, the PHY state machine still calls the .adjust_link()
>>> callback while the device is suspended.
>>
>> Humm that is correct yes.
>>
>>> Do you have a clue? This is too far beyond my phy-foo...
>>
>> I was initially contemplating a revert of
>> 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy: Correctly process
>> PHY_HALTED in phy_stop_machine()") but this is not the root of the
>> problem. The problem really is that phy_stop() does not wait for the PHY
>> state machine to be stopped so you cannot rely on that and past the
>> function return be offered any guarantees that adjust_link is not called.
>>
>> We seem to be getting away with that in most drivers because when we see
>> phydev->link = 0, we either do nothing or actually turn of the HW block.
>>
>> How about we export phy_stop_machine() to drivers which would provide a
>> synchronization point that would ensure that no HW accesses are done
>> past this point?
>>
>> I am absolutely not clear on the implications of using a freezable
>> workqueue with respect to the PHY state machine and how devices are
>> going to wind-up being powered down or not...
>
> Geert, as you may have notice a revert of the change was sent so 4.13
> should be fine, but ultimately I would like to put the non-reverted code
> back in after we add a few safeguards:

With the revert, I no longer need "[PATCH 1/2] net: phy: Freeze PHY polling
before suspending devices".
I just did more than 50 successful suspend/resume cycles to verify that.

I still need "[PATCH 2/2] net: smsc911x: Quiten netif during suspend", so
I'll submit a v2 for that.

> - and you reported the bus errors on smsc911x when we call adjust_link
> during suspend, and due to a lack of hard synchronization so phy_stop()
> here does not give you enough guarantees to let you turn off power to
> the smsc911x block
>
> If that seems accurate then we can work on something that should be
> working again (famous last words).

Sounds accurate to me.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression

2017-09-13 Thread Josef Bacik

Sorry I thought I had made this other fix, can you apply this on top of the 
other one and try that?  I have more things to try if this doesn’t work, sorry 
you are playing go between, but I want to make sure I know _which_ fix actually 
fixes the problem, and then clean up in followup patches.  Thanks,

Josef

On 9/13/17, 8:45 AM, "Laura Abbott"  wrote:

On 09/12/2017 04:12 PM, Josef Bacik wrote:
> First I’m super sorry for the top post, I’m at plumbers and I forgot to 
> upload my muttrc to my new cloud instance, so I’m screwed using outlook.
> 
> I have a completely untested, uncompiled patch that I think will fix the 
> problem, would you mind giving it a go?  Thanks,
> 
> Josef

Thanks for the quick turnaround. Unfortunately, the problem is still
reproducible according to the reporter.

Thanks,
Laura

0001-net-use-inet6_rcv_saddr-to-compare-sockets.patch
Description: 0001-net-use-inet6_rcv_saddr-to-compare-sockets.patch

Re: [PATCH net] perf/bpf: fix a clang compilation issue

2017-09-13 Thread Nick Desaulniers

great, thanks!

On Mon, Sep 11, 2017 at 2:28 PM, David Miller  wrote:
> From: Yonghong Song 
> Date: Thu, 7 Sep 2017 18:36:15 -0700
>
>> clang does not support variable length array for structure member.
>> It has the following error during compilation:
>>
>> kernel/trace/trace_syscalls.c:568:17: error: fields must have a constant 
>> size:
>> 'variable length array in structure' extension will never be supported
>> unsigned long args[sys_data->nb_args];
>>   ^
>>
>> The fix is to use a fixed array length instead.
>>
>> Reported-by: Nick Desaulniers 
>> Signed-off-by: Yonghong Song 
>
> Applied.



-- 
Thanks,
~Nick Desaulniers

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Eric Dumazet

On Wed, 2017-09-13 at 20:12 +0300, Denys Fedoryshchenko wrote:

> For my case, as load increased now, i am hitting same issue (i tried to 
> play with quantum / bursts as well, didnt helped):
> 
> tc -s -d class show dev eth3.777 classid 1:111;sleep 5;tc -s -d class 
> show dev eth3.777 classid 1:111
> class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit 
> ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 
> mpu 0b level 0
>   Sent 864151559 bytes 730566 pkt (dropped 15111, overlimits 0 requeues 
> 0)
>   backlog 73968000b 39934p requeues 0
>   lended: 499867 borrowed: 0 giants: 0
>   tokens: 608 ctokens: 121
> 

You have drops (and ~40,000 packets in backlog)


> class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit 
> ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 
> mpu 0b level 0
>   Sent 1469352160 bytes 1243649 pkt (dropped 42933, overlimits 0 requeues 
> 0)
>   backlog 82536047b 39963p requeues 0
>   lended: 810475 borrowed: 0 giants: 0
>   tokens: 612 ctokens: 122
> 
> (1469352160-864151559)/5*8
> 968320961.6000
> Less than 1Gbit and it's being throttled

It is not : "overlimits 0"  means this class was not throttled.

[PATCH net v2 1/3] nfp: add whitelist of supported flow dissector

2017-09-13 Thread Jakub Kicinski

From: Pieter Jansen van Vuuren 

Previously we did not check the flow dissector against a list of allowed
and supported flow key dissectors. This patch introduces such a list and
correctly rejects unsupported flow keys.

Fixes: 43f84b72c50d ("nfp: add metadata to each flow offload")
Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Simon Horman 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index d396183108f7..a18b4d2b1d3e 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -44,6 +44,16 @@
 #include "../nfp_net.h"
 #include "../nfp_port.h"
 
+#define NFP_FLOWER_WHITELIST_DISSECTOR \
+   (BIT(FLOW_DISSECTOR_KEY_CONTROL) | \
+BIT(FLOW_DISSECTOR_KEY_BASIC) | \
+BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) | \
+BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) | \
+BIT(FLOW_DISSECTOR_KEY_PORTS) | \
+BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) | \
+BIT(FLOW_DISSECTOR_KEY_VLAN) | \
+BIT(FLOW_DISSECTOR_KEY_IP))
+
 static int
 nfp_flower_xmit_flow(struct net_device *netdev,
 struct nfp_fl_payload *nfp_flow, u8 mtype)
@@ -112,6 +122,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls 
*ret_key_ls,
u8 key_layer;
int key_size;
 
+   if (flow->dissector->used_keys & ~NFP_FLOWER_WHITELIST_DISSECTOR)
+   return -EOPNOTSUPP;
+
if (dissector_uses_key(flow->dissector,
   FLOW_DISSECTOR_KEY_ENC_CONTROL)) {
struct flow_dissector_key_control *mask_enc_ctl =
-- 
2.14.1

[PATCH net v2 0/3] nfp: wait more carefully for card init

2017-09-13 Thread Jakub Kicinski

Hi!

The first patch is a small fix for flower offload, we need a whitelist
of supported matches, otherwise the unsupported ones will be ignored.

The second and the third patch are adding wait/polling to the probe path.
We had reports of driver failing probe because it couldn't find the 
control process (NSP) on the card.  Turns out the NSP will only announce
its existence after it's fully initialized.  Until now we assumed it 
will be reachable, just not processing commands (hence we wait for
a NOOP command to execute successfully).

v2:
 - fix a bad merge which resulted in a build warning and retest.

Jakub Kicinski (2):
  nfp: wait for board state before talking to the NSP
  nfp: wait for the NSP resource to appear on boot

Pieter Jansen van Vuuren (1):
  nfp: add whitelist of supported flow dissector

 .../net/ethernet/netronome/nfp/flower/offload.c| 13 ++
 drivers/net/ethernet/netronome/nfp/nfp_main.c  | 47 ++
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c  | 23 ---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h   |  2 +
 .../ethernet/netronome/nfp/nfpcore/nfp_resource.c  | 45 +
 5 files changed, 107 insertions(+), 23 deletions(-)

-- 
2.14.1

[PATCH net v2 3/3] nfp: wait for the NSP resource to appear on boot

2017-09-13 Thread Jakub Kicinski

The control process (NSP) may take some time to complete its
initialization.  This is not a problem on most servers, but
on very fast-booting machines it may not be ready for operation
when driver probes the device.  There is also a version of the
flash in the wild where NSP tries to train the links as part
of init.  To wait for NSP initialization we should make sure
its resource has already been added to the resource table.
NSP adds itself there as last step of init.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c  |  4 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h   |  2 +
 .../ethernet/netronome/nfp/nfpcore/nfp_resource.c  | 45 ++
 3 files changed, 51 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index 424707d41fbd..f8fa63b66739 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -351,6 +351,10 @@ static int nfp_nsp_init(struct pci_dev *pdev, struct 
nfp_pf *pf)
struct nfp_nsp *nsp;
int err;
 
+   err = nfp_resource_wait(pf->cpp, NFP_RESOURCE_NSP, 30);
+   if (err)
+   return err;
+
nsp = nfp_nsp_open(pf->cpp);
if (IS_ERR(nsp)) {
err = PTR_ERR(nsp);
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
index 1a8d04a1e113..3ce51f03126f 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
@@ -97,6 +97,8 @@ nfp_resource_acquire(struct nfp_cpp *cpp, const char *name);
 
 void nfp_resource_release(struct nfp_resource *res);
 
+int nfp_resource_wait(struct nfp_cpp *cpp, const char *name, unsigned int 
secs);
+
 u32 nfp_resource_cpp_id(struct nfp_resource *res);
 
 const char *nfp_resource_name(struct nfp_resource *res);
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c
index 072612263dab..b1dd13ff282b 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c
@@ -249,6 +249,51 @@ void nfp_resource_release(struct nfp_resource *res)
kfree(res);
 }
 
+/**
+ * nfp_resource_wait() - Wait for resource to appear
+ * @cpp:   NFP CPP handle
+ * @name:  Name of the resource
+ * @secs:  Number of seconds to wait
+ *
+ * Wait for resource to appear in the resource table, grab and release
+ * its lock.  The wait is jiffies-based, don't expect fine granularity.
+ *
+ * Return: 0 on success, errno otherwise.
+ */
+int nfp_resource_wait(struct nfp_cpp *cpp, const char *name, unsigned int secs)
+{
+   unsigned long warn_at = jiffies + NFP_MUTEX_WAIT_FIRST_WARN * HZ;
+   unsigned long err_at = jiffies + secs * HZ;
+   struct nfp_resource *res;
+
+   while (true) {
+   res = nfp_resource_acquire(cpp, name);
+   if (!IS_ERR(res)) {
+   nfp_resource_release(res);
+   return 0;
+   }
+
+   if (PTR_ERR(res) != -ENOENT) {
+   nfp_err(cpp, "error waiting for resource %s: %ld\n",
+   name, PTR_ERR(res));
+   return PTR_ERR(res);
+   }
+   if (time_is_before_eq_jiffies(err_at)) {
+   nfp_err(cpp, "timeout waiting for resource %s\n", name);
+   return -ETIMEDOUT;
+   }
+   if (time_is_before_eq_jiffies(warn_at)) {
+   warn_at = jiffies + NFP_MUTEX_WAIT_NEXT_WARN * HZ;
+   nfp_info(cpp, "waiting for NFP resource %s\n", name);
+   }
+   if (msleep_interruptible(10)) {
+   nfp_err(cpp, "wait for resource %s interrupted\n",
+   name);
+   return -ERESTARTSYS;
+   }
+   }
+}
+
 /**
  * nfp_resource_cpp_id() - Return the cpp_id of a resource handle
  * @res:   NFP Resource handle
-- 
2.14.1

[PATCH net v2 2/3] nfp: wait for board state before talking to the NSP

2017-09-13 Thread Jakub Kicinski

Board state informs us which low-level initialization stages the card
has completed.  We should wait for the card to be fully initialized
before trying to communicate with it, not only before we configure
passing traffic.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c | 43 +++
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 23 
 2 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index f055b1774d65..424707d41fbd 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -74,6 +74,45 @@ static const struct pci_device_id nfp_pci_device_ids[] = {
 };
 MODULE_DEVICE_TABLE(pci, nfp_pci_device_ids);
 
+static bool nfp_board_ready(struct nfp_pf *pf)
+{
+   const char *cp;
+   long state;
+   int err;
+
+   cp = nfp_hwinfo_lookup(pf->hwinfo, "board.state");
+   if (!cp)
+   return false;
+
+   err = kstrtol(cp, 0, );
+   if (err < 0)
+   return false;
+
+   return state == 15;
+}
+
+static int nfp_pf_board_state_wait(struct nfp_pf *pf)
+{
+   const unsigned long wait_until = jiffies + 10 * HZ;
+
+   while (!nfp_board_ready(pf)) {
+   if (time_is_before_eq_jiffies(wait_until)) {
+   nfp_err(pf->cpp, "NFP board initialization timeout\n");
+   return -EINVAL;
+   }
+
+   nfp_info(pf->cpp, "waiting for board initialization\n");
+   if (msleep_interruptible(500))
+   return -ERESTARTSYS;
+
+   /* Refresh cached information */
+   kfree(pf->hwinfo);
+   pf->hwinfo = nfp_hwinfo_read(pf->cpp);
+   }
+
+   return 0;
+}
+
 static int nfp_pcie_sriov_read_nfd_limit(struct nfp_pf *pf)
 {
int err;
@@ -425,6 +464,10 @@ static int nfp_pci_probe(struct pci_dev *pdev,
 nfp_hwinfo_lookup(pf->hwinfo, "assembly.revision"),
 nfp_hwinfo_lookup(pf->hwinfo, "cpld.version"));
 
+   err = nfp_pf_board_state_wait(pf);
+   if (err)
+   goto err_hwinfo_free;
+
err = devlink_register(devlink, >dev);
if (err)
goto err_hwinfo_free;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
index 5abb9ba31e7d..ff373acd28f3 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
@@ -64,23 +64,6 @@
 
 #define NFP_PF_CSR_SLICE_SIZE  (32 * 1024)
 
-static int nfp_is_ready(struct nfp_pf *pf)
-{
-   const char *cp;
-   long state;
-   int err;
-
-   cp = nfp_hwinfo_lookup(pf->hwinfo, "board.state");
-   if (!cp)
-   return 0;
-
-   err = kstrtol(cp, 0, );
-   if (err < 0)
-   return 0;
-
-   return state == 15;
-}
-
 /**
  * nfp_net_get_mac_addr() - Get the MAC address.
  * @pf:   NFP PF handle
@@ -725,12 +708,6 @@ int nfp_net_pci_probe(struct nfp_pf *pf)
 
INIT_WORK(>port_refresh_work, nfp_net_refresh_vnics);
 
-   /* Verify that the board has completed initialization */
-   if (!nfp_is_ready(pf)) {
-   nfp_err(pf->cpp, "NFP is not ready for NIC operation.\n");
-   return -EINVAL;
-   }
-
if (!pf->rtbl) {
nfp_err(pf->cpp, "No %s, giving up.\n",
pf->fw_loaded ? "symbol table" : "firmware found");
-- 
2.14.1

RFC: Audit Kernel Container IDs

2017-09-13 Thread Richard Guy Briggs

Containers are a userspace concept.  The kernel knows nothing of them.

The Linux audit system needs a way to be able to track the container
provenance of events and actions.  Audit needs the kernel's help to do
this.

Since the concept of a container is entirely a userspace concept, a
trigger signal from the userspace container orchestration system
initiates this.  This will define a point in time and a set of resources
associated with a particular container with an audit container ID.

The trigger is a pseudo filesystem (proc, since PID tree already exists)
write of a u64 representing the container ID to a file representing a
process that will become the first process in a new container.
This might place restrictions on mount namespaces required to define a
container, or at least careful checking of namespaces in the kernel to
verify permissions of the orchestrator so it can't change its own
container ID.
A bind mount of nsfs may be necessary in the container orchestrator's
mntNS.

Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
filesystem to have this action permitted.  At that time, record the
child container's user-supplied 64-bit container identifier along with
the child container's first process (which may become the container's
"init" process) process ID (referenced from the initial PID namespace),
all namespace IDs (in the form of a nsfs device number and inode number
tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying
op=$action field.

Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.

Forked and cloned processes inherit their parent's container ID,
referenced in the process' audit_context struct.

Log the creation of every namespace, inheriting/adding its spawning
process' containerID(s), if applicable.  Include the spawning and
spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process.

Log the destruction of every namespace when it is no longer used by any
process, include the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]

Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.

A process can be moved from one container to another by using the
container assignment method outlined above a second time.

When a container ceases to exist because the last process in that
container has exited and hence the last namespace has been destroyed and
its refcount dropping to zero, log the fact.
(This latter is likely needed for certification accountability.)  A
container object may need a list of processes and/or namespaces.

A namespace cannot directly migrate from one container to another but
could be assigned to a newly spawned container.  A namespace can be
moved from one container to another indirectly by having that namespace
used in a second process in another container and then ending all the
processes in the first container.

Feedback please.

- RGB

--
Richard Guy Briggs 
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Denys Fedoryshchenko


On 2017-09-13 19:55, Eric Dumazet wrote:

On Wed, 2017-09-13 at 09:42 -0700, Eric Dumazet wrote:

On Wed, 2017-09-13 at 19:27 +0300, Denys Fedoryshchenko wrote:
> On 2017-09-13 19:16, Eric Dumazet wrote:
> > On Wed, 2017-09-13 at 18:34 +0300, Denys Fedoryshchenko wrote:
> >> Well, probably i am answering my own question, removing estimator from
> >> classes seems drastically improve situation.
> >> It seems estimator has some issues that cause shaper to behave
> >> incorrectly (throttling traffic while it should not).
> >> But i guess thats a bug?
> >> As i was not able to predict such bottleneck by CPU load measurements.
> >
> > Well, there was a reason we disabled HTB class estimators by default ;)
> >
> >
> > 
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=64153ce0a7b61b2a5cacb01805cbf670142339e9
>
> As soon as disabling it solve my problem - i'm fine, hehe, but i guess
> other people who might hit this problem, should be aware how to find
> reason.
> They should not be disappointed in Linux :)

Well, if they enable rate estimators while kernel does not set them by
default, they get what they want, at a cost.

> Because i can't measure this bottleneck before it happens, i'm seeing on
> mpstat all cpu's are idle, and same time traffic is throttled.

Normally things were supposed to get much better in linux-4.10

( 
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=1c0d32fde5bdf1184bc274f864c09799278a1114 
)


But I apparently added a scaling bug.

I will try :

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 
0385dece1f6fe5e26df1ce5f40956a79a2eebbf4..7c1ffd6f950172c1915d8e5fa2b5e3f77e4f4c78 
100644

--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -83,10 +83,10 @@ static void est_timer(unsigned long arg)
u64 rate, brate;

est_fetch_counters(est, );
-   brate = (b.bytes - est->last_bytes) << (8 - est->ewma_log);
+   brate = (b.bytes - est->last_bytes) << (10 - est->ewma_log - 
est->intvl_log);

brate -= (est->avbps >> est->ewma_log);

-   rate = (u64)(b.packets - est->last_packets) << (8 - 
est->ewma_log);
+   rate = (u64)(b.packets - est->last_packets) << (10 - 
est->ewma_log - est->intvl_log);

rate -= (est->avpps >> est->ewma_log);

write_seqcount_begin(>seq);



Much better indeed

# tc -s -d class sh dev eth0 classid 7002:11 ; sleep 10 ;tc -s -d class
sh dev eth0 classid 7002:11

class htb 7002:11 parent 7002:1 prio 5 quantum 20 rate 5Gbit ceil
5Gbit linklayer ethernet burst 8b/1 mpu 0b cburst 8b/1 mpu 0b
level 0 rate_handle 1
 Sent 389085117074 bytes 256991500 pkt (dropped 0, overlimits 5926926
requeues 0)
 rate 4999Mbit 412762pps backlog 136260b 2p requeues 0
 TCP pkts/rtx 256991584/0 bytes 389085252840/0
 lended: 5961250 borrowed: 0 giants: 0
 tokens: -1664 ctokens: -1664

class htb 7002:11 parent 7002:1 prio 5 quantum 20 rate 5Gbit ceil
5Gbit linklayer ethernet burst 8b/1 mpu 0b cburst 8b/1 mpu 0b
level 0 rate_handle 1
 Sent 395336315580 bytes 261120429 pkt (dropped 0, overlimits 6021776
requeues 0)
 rate 4999Mbit 412788pps backlog 68Kb 2p requeues 0
 TCP pkts/rtx 261120469/0 bytes 395336384730/0
 lended: 6056793 borrowed: 0 giants: 0
 tokens: -1478 ctokens: -1478


echo "(395336315580-389085117074)/10*8" | bc
5000958800
For my case, as load increased now, i am hitting same issue (i tried to 
play with quantum / bursts as well, didnt helped):


tc -s -d class show dev eth3.777 classid 1:111;sleep 5;tc -s -d class 
show dev eth3.777 classid 1:111
class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit 
ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 
mpu 0b level 0
 Sent 864151559 bytes 730566 pkt (dropped 15111, overlimits 0 requeues 
0)

 backlog 73968000b 39934p requeues 0
 lended: 499867 borrowed: 0 giants: 0
 tokens: 608 ctokens: 121

class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit 
ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 
mpu 0b level 0
 Sent 1469352160 bytes 1243649 pkt (dropped 42933, overlimits 0 requeues 
0)

 backlog 82536047b 39963p requeues 0
 lended: 810475 borrowed: 0 giants: 0
 tokens: 612 ctokens: 122

(1469352160-864151559)/5*8
968320961.6000
Less than 1Gbit and it's being throttled

Total bandwidth:

class htb 1:1 root rate 100Gbit ceil 100Gbit linklayer ethernet burst 
10b/1 mpu 0b cburst 10b/1 mpu 0b level 7

 Sent 7839730635 bytes 8537393 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 123 ctokens: 123

class htb 1:1 root rate 100Gbit ceil 100Gbit linklayer ethernet burst 
10b/1 mpu 0b cburst 10b/1 mpu 0b level 7
 Sent 11043190453 bytes 12008366 pkt (dropped 0, overlimits 0 requeues 
0)

 backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: 124 ctokens: 124

694kpps
5.1Gbit

Re: Regression in throughput between kvm guests over virtual bridge

2017-09-13 Thread Matthew Rosato

On 09/13/2017 04:13 AM, Jason Wang wrote:
> 
> 
> On 2017年09月13日 09:16, Jason Wang wrote:
>>
>>
>> On 2017年09月13日 01:56, Matthew Rosato wrote:
>>> We are seeing a regression for a subset of workloads across KVM guests
>>> over a virtual bridge between host kernel 4.12 and 4.13. Bisecting
>>> points to c67df11f "vhost_net: try batch dequing from skb array"
>>>
>>> In the regressed environment, we are running 4 kvm guests, 2 running as
>>> uperf servers and 2 running as uperf clients, all on a single host.
>>> They are connected via a virtual bridge.  The uperf client profile looks
>>> like:
>>>
>>> 
>>> 
>>>
>>>  
>>>
>>>  
>>>  
>>>
>>>  
>>>  
>>>
>>>  
>>>
>>> 
>>>
>>> So, 1 tcp streaming instance per client.  When upgrading the host kernel
>>> from 4.12->4.13, we see about a 30% drop in throughput for this
>>> scenario.  After the bisect, I further verified that reverting c67df11f
>>> on 4.13 "fixes" the throughput for this scenario.
>>>
>>> On the other hand, if we increase the load by upping the number of
>>> streaming instances to 50 (nprocs="50") or even 10, we see instead a
>>> ~10% increase in throughput when upgrading host from 4.12->4.13.
>>>
>>> So it may be the issue is specific to "light load" scenarios.  I would
>>> expect some overhead for the batching, but 30% seems significant...  Any
>>> thoughts on what might be happening here?
>>>
>>
>> Hi, thanks for the bisecting. Will try to see if I can reproduce.
>> Various factors could have impact on stream performance. If possible,
>> could you collect the #pkts and average packet size during the test?
>> And if you guest version is above 4.12, could you please retry with
>> napi_tx=true?

Original runs were done with guest kernel 4.4 (from ubuntu 16.04.3 -
4.4.0-93-generic specifically).  Here's a throughput report (uperf) and
#pkts and average packet size (tcpstat) for one of the uperf clients:

host 4.12 / guest 4.4:
throughput: 29.98Gb/s
#pkts=33465571 avg packet size=33755.70

host 4.13 / guest 4.4:
throughput: 20.36Gb/s
#pkts=21233399 avg packet size=36130.69

I ran the test again using net-next.git as guest kernel, with and
without napi_tx=true.  napi_tx did not seem to have any significant
impact on throughput.  However, the guest kernel shift from
4.4->net-next improved things.  I can still see a regression between
host 4.12 and 4.13, but it's more on the order of 10-15% - another sample:

host 4.12 / guest net-next (without napi_tx):
throughput: 28.88Gb/s
#pkts=31743116 avg packet size=33779.78

host 4.13 / guest net-next (without napi_tx):
throughput: 24.34Gb/s
#pkts=25532724 avg packet size=35963.20

>>
>> Thanks
> 
> Unfortunately, I could not reproduce it locally. I'm using net-next.git
> as guest. I can get ~42Gb/s on Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
> for both before and after the commit. I use 1 vcpu and 1 queue, and pin
> vcpu and vhost threads into separate cpu on host manually (in same numa
> node).

The environment is quite a bit different -- I'm running in an LPAR on a
z13 (s390x).  We've seen the issue in various configurations, the
smallest thus far was a host partition w/ 40G and 20 CPUs defined (the
numbers above were gathered w/ this configuration).  Each guest has 4GB
and 4 vcpus.  No pinning / affinity configured.

> 
> Can you hit this regression constantly and what's you qemu command line

Yes, the regression seems consistent.  I can try tweaking some of the
host and guest definitions to see if it makes a difference.

The guests are instantiated from libvirt - Here's one of the resulting
qemu command lines:

/usr/bin/qemu-system-s390x -name guest=mjrs34g1,debug-threads=on -S
-object
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-mjrs34g1/master-key.aes
-machine s390-ccw-virtio-2.10,accel=kvm,usb=off,dump-guest-core=off -m
4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid
44710587-e783-4bd8-8590-55ff421431b1 -display none -no-user-config
-nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-mjrs34g1/monitor.sock,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -boot strict=on -drive
file=/dev/disk/by-id/scsi-3600507630bffc0381803,format=raw,if=none,id=drive-virtio-disk0
-device
virtio-blk-ccw,scsi=off,devno=fe.0.,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device
virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:de:26:53:14:01,devno=fe.0.0001
-netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device
virtio-net-ccw,netdev=hostnet1,id=net1,mac=02:54:00:89:d4:01,devno=fe.0.00a1
-chardev pty,id=charconsole0 -device
sclpconsole,chardev=charconsole0,id=console0 -device
virtio-balloon-ccw,id=balloon0,devno=fe.0.0002 -msg timestamp=on

In the above, net0 is used for a macvtap connection (not used in the
experiment, just for a reliable ssh connection - can remove if needed).
net1 is the

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Eric Dumazet

On Wed, 2017-09-13 at 09:42 -0700, Eric Dumazet wrote:
> On Wed, 2017-09-13 at 19:27 +0300, Denys Fedoryshchenko wrote:
> > On 2017-09-13 19:16, Eric Dumazet wrote:
> > > On Wed, 2017-09-13 at 18:34 +0300, Denys Fedoryshchenko wrote:
> > >> Well, probably i am answering my own question, removing estimator from
> > >> classes seems drastically improve situation.
> > >> It seems estimator has some issues that cause shaper to behave
> > >> incorrectly (throttling traffic while it should not).
> > >> But i guess thats a bug?
> > >> As i was not able to predict such bottleneck by CPU load measurements.
> > > 
> > > Well, there was a reason we disabled HTB class estimators by default ;)
> > > 
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=64153ce0a7b61b2a5cacb01805cbf670142339e9
> > 
> > As soon as disabling it solve my problem - i'm fine, hehe, but i guess 
> > other people who might hit this problem, should be aware how to find 
> > reason.
> > They should not be disappointed in Linux :)
> 
> Well, if they enable rate estimators while kernel does not set them by
> default, they get what they want, at a cost.
> 
> > Because i can't measure this bottleneck before it happens, i'm seeing on 
> > mpstat all cpu's are idle, and same time traffic is throttled.
> 
> Normally things were supposed to get much better in linux-4.10
> 
> ( 
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=1c0d32fde5bdf1184bc274f864c09799278a1114
>  )
> 
> But I apparently added a scaling bug.
> 
> I will try :
> 
> diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
> index 
> 0385dece1f6fe5e26df1ce5f40956a79a2eebbf4..7c1ffd6f950172c1915d8e5fa2b5e3f77e4f4c78
>  100644
> --- a/net/core/gen_estimator.c
> +++ b/net/core/gen_estimator.c
> @@ -83,10 +83,10 @@ static void est_timer(unsigned long arg)
> u64 rate, brate;
>  
> est_fetch_counters(est, );
> -   brate = (b.bytes - est->last_bytes) << (8 - est->ewma_log);
> +   brate = (b.bytes - est->last_bytes) << (10 - est->ewma_log - 
> est->intvl_log);
> brate -= (est->avbps >> est->ewma_log);
>  
> -   rate = (u64)(b.packets - est->last_packets) << (8 - est->ewma_log);
> +   rate = (u64)(b.packets - est->last_packets) << (10 - est->ewma_log - 
> est->intvl_log);
> rate -= (est->avpps >> est->ewma_log);
>  
> write_seqcount_begin(>seq);


Much better indeed

# tc -s -d class sh dev eth0 classid 7002:11 ; sleep 10 ;tc -s -d class
sh dev eth0 classid 7002:11

class htb 7002:11 parent 7002:1 prio 5 quantum 20 rate 5Gbit ceil
5Gbit linklayer ethernet burst 8b/1 mpu 0b cburst 8b/1 mpu 0b
level 0 rate_handle 1 
 Sent 389085117074 bytes 256991500 pkt (dropped 0, overlimits 5926926
requeues 0) 
 rate 4999Mbit 412762pps backlog 136260b 2p requeues 0 
 TCP pkts/rtx 256991584/0 bytes 389085252840/0
 lended: 5961250 borrowed: 0 giants: 0
 tokens: -1664 ctokens: -1664

class htb 7002:11 parent 7002:1 prio 5 quantum 20 rate 5Gbit ceil
5Gbit linklayer ethernet burst 8b/1 mpu 0b cburst 8b/1 mpu 0b
level 0 rate_handle 1 
 Sent 395336315580 bytes 261120429 pkt (dropped 0, overlimits 6021776
requeues 0) 
 rate 4999Mbit 412788pps backlog 68Kb 2p requeues 0 
 TCP pkts/rtx 261120469/0 bytes 395336384730/0
 lended: 6056793 borrowed: 0 giants: 0
 tokens: -1478 ctokens: -1478


echo "(395336315580-389085117074)/10*8" | bc
5000958800

Re: [PATCH net 0/3] nfp: wait more carefully for card init

2017-09-13 Thread Jakub Kicinski

On Wed, 13 Sep 2017 09:39:02 -0700 (PDT), David Miller wrote:
> From: Jakub Kicinski 
> Date: Wed, 13 Sep 2017 08:51:28 -0700
> 
> > The first patch is a small fix for flower offload, we need a whitelist
> > of supported matches, otherwise the unsupported ones will be ignored.
> > 
> > The second and the third patch are adding wait/polling to the probe path.
> > We had reports of driver failing probe because it couldn't find the 
> > control process (NSP) on the card.  Turns out the NSP will only announce
> > its existence after it's fully initialized.  Until now we assumed it 
> > will be reachable, just not processing commands (hence we wait for
> > a NOOP command to execute successfully).  
> 
> Please build test your changes and look at what the compiler says:
> 
> drivers/net/ethernet/netronome/nfp/nfp_main.c: In function ‘nfp_fw_unload’:
> drivers/net/ethernet/netronome/nfp/nfp_main.c:395:10: warning: ‘return’ with 
> a value, in function returning void
>return err;
>   ^~~
> drivers/net/ethernet/netronome/nfp/nfp_main.c:388:13: note: declared here
>  static void nfp_fw_unload(struct nfp_pf *pf)
>  ^

Sorry about that!

Re: Memory leaks in conntrack

2017-09-13 Thread Cong Wang

On Wed, Sep 13, 2017 at 1:05 AM, Florian Westphal  wrote:
> Cong Wang  wrote:
>> While testing my TC filter patches (so not related to conntrack), the
>> following memory leaks are shown up:
>>
>> unreferenced object 0x9b19ba551228 (size 128):
>>   comm "chronyd", pid 338, jiffies 4294910829 (age 53.188s)
>>   hex dump (first 32 bytes):
>> 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  
>> 00 00 00 00 18 00 00 30 00 00 00 00 00 00 00 00  ...0
>>   backtrace:
>> [] create_object+0x169/0x2aa
>> [] kmemleak_alloc+0x25/0x41
>> [] slab_post_alloc_hook+0x44/0x65
>> [] __kmalloc_track_caller+0x113/0x146
>> [] __krealloc+0x4a/0x69
>> [] nf_ct_ext_add+0xe1/0x145
>> [] init_conntrack+0x1f7/0x36e
>> [] nf_conntrack_in+0x1d3/0x326
>> [] ipv4_conntrack_local+0x4d/0x50
>> [] nf_hook_slow+0x3c/0x9b
>> [] nf_hook.constprop.40+0xbe/0xd8
>> [] __ip_local_out+0xb3/0xbf
>> [] ip_local_out+0x1c/0x36
>> [] ip_send_skb+0x19/0x3d
>> [] udp_send_skb+0x17e/0x1df
>> [] udp_sendmsg+0x5a2/0x77c
>> unreferenced object 0x9b19a69b3340 (size 336):
>>   comm "chronyd", pid 338, jiffies 4294910868 (age 53.032s)
>>   hex dump (first 32 bytes):
>> 01 00 00 00 5a 5a 5a 5a 00 00 00 00 ad 4e ad de  .N..
>> ff ff ff ff 5a 5a 5a 5a ff ff ff ff ff ff ff ff  
>>   backtrace:
>> [] create_object+0x169/0x2aa
>> [] kmemleak_alloc+0x25/0x41
>> [] slab_post_alloc_hook+0x44/0x65
>> [] kmem_cache_alloc+0xd7/0x1f1
>> [] __nf_conntrack_alloc+0xa2/0x146
>> [] init_conntrack+0xb2/0x36e
>> [] nf_conntrack_in+0x1d3/0x326
>> [] ipv4_conntrack_local+0x4d/0x50
>> [] nf_hook_slow+0x3c/0x9b
>> [] nf_hook.constprop.40+0xbe/0xd8
>> [] __ip_local_out+0xb3/0xbf
>> [] ip_local_out+0x1c/0x36
>> [] ip_send_skb+0x19/0x3d
>> [] udp_send_skb+0x17e/0x1df
>> [] udp_sendmsg+0x5a2/0x77c
>> [] inet_sendmsg+0x37/0x5e
>>
>> I don't touch chronyd in my VM, so I have no idea why it sends out UDP
>> packets, my guess is it is some periodical packet.
>>
>> I don't think I use conntrack either, since /proc/net/ip_conntrack
>> does not exist.
>
> You probably do, can you try "cat /proc/net/nf_conntrack" instead?
>
> (otherwise there should be no ipv4_conntrack_local() invocation
>  since we would not register this hook at all).

Yeah it is very weird but it is true:

[root@localhost ~]# echo scan > /sys/kernel/debug/kmemleak
[  133.450823] kmemleak: 18 new suspected memory leaks (see
/sys/kernel/debug/kmemleak)
[root@localhost ~]# cat /proc/net/ip_conntrack
cat: /proc/net/ip_conntrack: No such file or directory
[root@localhost ~]# cat /sys/kernel/debug/kmemleak
unreferenced object 0x95c1e0b24040 (size 336):
...


>
> I tried to reproduce this but so far I had no success.
> If you can identify something that could give a hint when this
> is happening (only once after boot, periodically, only with udp, etc)
> please let us know.
>
> (A reproducer would be even better of course ;-) )

Actually, it is even simpler to reproduce, nothing is needed
but wait. I thought it is somewhat triggered by my tests, but
actually no. For me, just boot the VM and wait for several
seconds, memleak will show up.

(chronyd is started by systemd during boot, not me.)

>
> Is this with current net tree?

Yes, I just pulled DaveM's net tree and recompiled the kernel,
still 100% reproducible here.

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Eric Dumazet

On Wed, 2017-09-13 at 19:27 +0300, Denys Fedoryshchenko wrote:
> On 2017-09-13 19:16, Eric Dumazet wrote:
> > On Wed, 2017-09-13 at 18:34 +0300, Denys Fedoryshchenko wrote:
> >> Well, probably i am answering my own question, removing estimator from
> >> classes seems drastically improve situation.
> >> It seems estimator has some issues that cause shaper to behave
> >> incorrectly (throttling traffic while it should not).
> >> But i guess thats a bug?
> >> As i was not able to predict such bottleneck by CPU load measurements.
> > 
> > Well, there was a reason we disabled HTB class estimators by default ;)
> > 
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=64153ce0a7b61b2a5cacb01805cbf670142339e9
> 
> As soon as disabling it solve my problem - i'm fine, hehe, but i guess 
> other people who might hit this problem, should be aware how to find 
> reason.
> They should not be disappointed in Linux :)

Well, if they enable rate estimators while kernel does not set them by
default, they get what they want, at a cost.

> Because i can't measure this bottleneck before it happens, i'm seeing on 
> mpstat all cpu's are idle, and same time traffic is throttled.

Normally things were supposed to get much better in linux-4.10

( 
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=1c0d32fde5bdf1184bc274f864c09799278a1114
 )

But I apparently added a scaling bug.

I will try :

diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 
0385dece1f6fe5e26df1ce5f40956a79a2eebbf4..7c1ffd6f950172c1915d8e5fa2b5e3f77e4f4c78
 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -83,10 +83,10 @@ static void est_timer(unsigned long arg)
u64 rate, brate;
 
est_fetch_counters(est, );
-   brate = (b.bytes - est->last_bytes) << (8 - est->ewma_log);
+   brate = (b.bytes - est->last_bytes) << (10 - est->ewma_log - 
est->intvl_log);
brate -= (est->avbps >> est->ewma_log);
 
-   rate = (u64)(b.packets - est->last_packets) << (8 - est->ewma_log);
+   rate = (u64)(b.packets - est->last_packets) << (10 - est->ewma_log - 
est->intvl_log);
rate -= (est->avpps >> est->ewma_log);
 
write_seqcount_begin(>seq);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c

Re: [PATCH net 0/3] nfp: wait more carefully for card init

2017-09-13 Thread David Miller

From: Jakub Kicinski 
Date: Wed, 13 Sep 2017 08:51:28 -0700

> The first patch is a small fix for flower offload, we need a whitelist
> of supported matches, otherwise the unsupported ones will be ignored.
> 
> The second and the third patch are adding wait/polling to the probe path.
> We had reports of driver failing probe because it couldn't find the 
> control process (NSP) on the card.  Turns out the NSP will only announce
> its existence after it's fully initialized.  Until now we assumed it 
> will be reachable, just not processing commands (hence we wait for
> a NOOP command to execute successfully).

Please build test your changes and look at what the compiler says:

drivers/net/ethernet/netronome/nfp/nfp_main.c: In function ‘nfp_fw_unload’:
drivers/net/ethernet/netronome/nfp/nfp_main.c:395:10: warning: ‘return’ with a 
value, in function returning void
   return err;
  ^~~
drivers/net/ethernet/netronome/nfp/nfp_main.c:388:13: note: declared here
 static void nfp_fw_unload(struct nfp_pf *pf)
 ^

Re: [patch net] net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker

2017-09-13 Thread David Miller

From: Jiri Pirko 
Date: Wed, 13 Sep 2017 17:32:37 +0200

> From: Jiri Pirko 
> 
> Recent commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu") removed
> freeing in call_rcu, which changed already existing hard-to-hit
> race condition into 100% hit:
> 
> [  598.599825] BUG: unable to handle kernel NULL pointer dereference at 
> 0030
> [  598.607782] IP: tcf_action_destroy+0xc0/0x140
> 
> Or:
> 
> [   40.858924] BUG: unable to handle kernel NULL pointer dereference at 
> 0030
> [   40.862840] IP: tcf_generic_walker+0x534/0x820
> 
> Fix this by storing the ops and use them directly for module_put call.
> 
> Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common")
> Signed-off-by: Jiri Pirko 

Applied, thanks Jiri.

Re: [PATCH net] be2net: fix TSO6/GSO issue causing TX-stall on Lancer/BEx

2017-09-13 Thread David Miller

From: Suresh Reddy 
Date: Wed, 13 Sep 2017 11:12:42 -0400

> IPv6 TSO requests with extension hdrs are a problem to the
> Lancer and BEx chips. Workaround is to disable TSO6 feature
> for such packets.
> 
> Also in Lancer chips, MSS less than 256 was resulting in TX stall.
> Fix this by disabling GSO when MSS less than 256.
> 
> Signed-off-by: Suresh Reddy 

Applied.

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Denys Fedoryshchenko


On 2017-09-13 19:16, Eric Dumazet wrote:

On Wed, 2017-09-13 at 18:34 +0300, Denys Fedoryshchenko wrote:

Well, probably i am answering my own question, removing estimator from
classes seems drastically improve situation.
It seems estimator has some issues that cause shaper to behave
incorrectly (throttling traffic while it should not).
But i guess thats a bug?
As i was not able to predict such bottleneck by CPU load measurements.


Well, there was a reason we disabled HTB class estimators by default ;)


https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=64153ce0a7b61b2a5cacb01805cbf670142339e9


As soon as disabling it solve my problem - i'm fine, hehe, but i guess 
other people who might hit this problem, should be aware how to find 
reason.

They should not be disappointed in Linux :)
Because i can't measure this bottleneck before it happens, i'm seeing on 
mpstat all cpu's are idle, and same time traffic is throttled.

Re: [PATCH net] sctp: potential read out of bounds in sctp_ulpevent_type_enabled()

2017-09-13 Thread David Miller

From: Dan Carpenter 
Date: Wed, 13 Sep 2017 12:20:28 +0300

> @@ -154,7 +154,11 @@ static inline int sctp_ulpevent_type_enabled(__u16 
> sn_type,
>struct sctp_event_subscribe *mask)
>  {
>   char *amask = (char *) mask;
> - return amask[sn_type - SCTP_SN_TYPE_BASE];
> + int offset = sn_type - SCTP_SN_TYPE_BASE;

Please use reverse-christmas-tree local variable ordering.

Thank you.

Re: [PATCH 06/10] drivers:ethernet: return -ENOMEM on allocation failure.

2017-09-13 Thread David Miller

From: Allen Pais 
Date: Wed, 13 Sep 2017 13:02:15 +0530

> Signed-off-by: Allen Pais 

This is quite pointless as the caller doesn't do anything with
the value, it just tests whether a negative value is returned
or not.

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Eric Dumazet

On Wed, 2017-09-13 at 18:34 +0300, Denys Fedoryshchenko wrote:
> Well, probably i am answering my own question, removing estimator from 
> classes seems drastically improve situation.
> It seems estimator has some issues that cause shaper to behave 
> incorrectly (throttling traffic while it should not).
> But i guess thats a bug?
> As i was not able to predict such bottleneck by CPU load measurements.

Well, there was a reason we disabled HTB class estimators by default ;)


https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=64153ce0a7b61b2a5cacb01805cbf670142339e9

Re: [PATCH] dt-bindings: net: renesas-ravb: Add support for R8A77995 RAVB

2017-09-13 Thread Sergei Shtylyov


On 09/13/2017 07:02 PM, Sergei Shtylyov wrote:


Add a new compatible string for the R8A77995 (R-Car D3) RAVB.

Signed-off-by: Yoshihiro Shimoda 
---
  Documentation/devicetree/bindings/net/renesas,ravb.txt | 1 +
  1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt 
b/Documentation/devicetree/bindings/net/renesas,ravb.txt

index 1672353..ae2213f 100644
--- a/Documentation/devicetree/bindings/net/renesas,ravb.txt
+++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt
@@ -17,6 +17,7 @@ Required properties:
- "renesas,etheravb-r8a7795" for the R8A7795 SoC.
- "renesas,etheravb-r8a7796" for the R8A7796 SoC.
+  - "renesas,etheravb-r8a77995" for the R8A77995 SoC.


That would conflict with the R8A77970 bindings patch posted by me 
yesterday. Please respin atop of it.


   Here's the link for your convenience:

https://marc.info/?l=linux-renesas-soc=150524655515476

[...]

MBR, Sergei

Re: [PATCH] dt-bindings: net: renesas-ravb: Add support for R8A77995 RAVB

2017-09-13 Thread Sergei Shtylyov


Hello!

On 09/13/2017 03:17 PM, Yoshihiro Shimoda wrote:


Add a new compatible string for the R8A77995 (R-Car D3) RAVB.

Signed-off-by: Yoshihiro Shimoda 
---
  Documentation/devicetree/bindings/net/renesas,ravb.txt | 1 +
  1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt 
b/Documentation/devicetree/bindings/net/renesas,ravb.txt
index 1672353..ae2213f 100644
--- a/Documentation/devicetree/bindings/net/renesas,ravb.txt
+++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt
@@ -17,6 +17,7 @@ Required properties:
  
- "renesas,etheravb-r8a7795" for the R8A7795 SoC.

- "renesas,etheravb-r8a7796" for the R8A7796 SoC.
+  - "renesas,etheravb-r8a77995" for the R8A77995 SoC.


   That would conflict with the R8A77970 bindings patch posted by me 
yesterday. Please respin atop of it.



- "renesas,etheravb-rcar-gen3" as a fallback for the above
R-Car Gen3 devices.
  


   One more fragment is needed here, about the mandatory 'interrupt-names" 
prop. My patch makes this change unnecessary, however...


MBR, Sergei

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Denys Fedoryshchenko


On 2017-09-13 18:51, Eric Dumazet wrote:

On Wed, 2017-09-13 at 18:20 +0300, Denys Fedoryshchenko wrote:

Hi,

I noticed after increasing bandwidth over some amount HTB started to
throttle classes it should not throttle.
Also estimated rate in htb totally wrong, while byte counters is
correct.

Is there any overflow or something?


Thanks Denys for the report, I will take a look at this, since I
probably introduced some regression.
It's definitely not something recent, this system was on older kernel 
with uptime over 200 days, and this bottleneck was present, i noticed it 
long time before.
But never tried to remove estimators (increasing burst/cburst to insane 
values saved me for a while).

[PATCH net 3/3] nfp: wait for the NSP resource to appear on boot

2017-09-13 Thread Jakub Kicinski

The control process (NSP) may take some time to complete its
initialization.  This is not a problem on most servers, but
on very fast-booting machines it may not be ready for operation
when driver probes the device.  There is also a version of the
flash in the wild where NSP tries to train the links as part
of init.  To wait for NSP initialization we should make sure
its resource has already been added to the resource table.
NSP adds itself there as last step of init.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c  |  4 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h   |  2 +
 .../ethernet/netronome/nfp/nfpcore/nfp_resource.c  | 45 ++
 3 files changed, 51 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index 424707d41fbd..1c17b261a21c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -390,6 +390,10 @@ static void nfp_fw_unload(struct nfp_pf *pf)
struct nfp_nsp *nsp;
int err;
 
+   err = nfp_resource_wait(pf->cpp, NFP_RESOURCE_NSP, 30);
+   if (err)
+   return err;
+
nsp = nfp_nsp_open(pf->cpp);
if (IS_ERR(nsp)) {
nfp_err(pf->cpp, "Reset failed, can't open NSP\n");
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
index 1a8d04a1e113..3ce51f03126f 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
@@ -97,6 +97,8 @@ nfp_resource_acquire(struct nfp_cpp *cpp, const char *name);
 
 void nfp_resource_release(struct nfp_resource *res);
 
+int nfp_resource_wait(struct nfp_cpp *cpp, const char *name, unsigned int 
secs);
+
 u32 nfp_resource_cpp_id(struct nfp_resource *res);
 
 const char *nfp_resource_name(struct nfp_resource *res);
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c
index 072612263dab..b1dd13ff282b 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c
@@ -249,6 +249,51 @@ void nfp_resource_release(struct nfp_resource *res)
kfree(res);
 }
 
+/**
+ * nfp_resource_wait() - Wait for resource to appear
+ * @cpp:   NFP CPP handle
+ * @name:  Name of the resource
+ * @secs:  Number of seconds to wait
+ *
+ * Wait for resource to appear in the resource table, grab and release
+ * its lock.  The wait is jiffies-based, don't expect fine granularity.
+ *
+ * Return: 0 on success, errno otherwise.
+ */
+int nfp_resource_wait(struct nfp_cpp *cpp, const char *name, unsigned int secs)
+{
+   unsigned long warn_at = jiffies + NFP_MUTEX_WAIT_FIRST_WARN * HZ;
+   unsigned long err_at = jiffies + secs * HZ;
+   struct nfp_resource *res;
+
+   while (true) {
+   res = nfp_resource_acquire(cpp, name);
+   if (!IS_ERR(res)) {
+   nfp_resource_release(res);
+   return 0;
+   }
+
+   if (PTR_ERR(res) != -ENOENT) {
+   nfp_err(cpp, "error waiting for resource %s: %ld\n",
+   name, PTR_ERR(res));
+   return PTR_ERR(res);
+   }
+   if (time_is_before_eq_jiffies(err_at)) {
+   nfp_err(cpp, "timeout waiting for resource %s\n", name);
+   return -ETIMEDOUT;
+   }
+   if (time_is_before_eq_jiffies(warn_at)) {
+   warn_at = jiffies + NFP_MUTEX_WAIT_NEXT_WARN * HZ;
+   nfp_info(cpp, "waiting for NFP resource %s\n", name);
+   }
+   if (msleep_interruptible(10)) {
+   nfp_err(cpp, "wait for resource %s interrupted\n",
+   name);
+   return -ERESTARTSYS;
+   }
+   }
+}
+
 /**
  * nfp_resource_cpp_id() - Return the cpp_id of a resource handle
  * @res:   NFP Resource handle
-- 
2.14.1

[PATCH net 1/3] nfp: add whitelist of supported flow dissector

2017-09-13 Thread Jakub Kicinski

From: Pieter Jansen van Vuuren 

Previously we did not check the flow dissector against a list of allowed
and supported flow key dissectors. This patch introduces such a list and
correctly rejects unsupported flow keys.

Fixes: 43f84b72c50d ("nfp: add metadata to each flow offload")
Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Simon Horman 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index d396183108f7..a18b4d2b1d3e 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -44,6 +44,16 @@
 #include "../nfp_net.h"
 #include "../nfp_port.h"
 
+#define NFP_FLOWER_WHITELIST_DISSECTOR \
+   (BIT(FLOW_DISSECTOR_KEY_CONTROL) | \
+BIT(FLOW_DISSECTOR_KEY_BASIC) | \
+BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) | \
+BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) | \
+BIT(FLOW_DISSECTOR_KEY_PORTS) | \
+BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) | \
+BIT(FLOW_DISSECTOR_KEY_VLAN) | \
+BIT(FLOW_DISSECTOR_KEY_IP))
+
 static int
 nfp_flower_xmit_flow(struct net_device *netdev,
 struct nfp_fl_payload *nfp_flow, u8 mtype)
@@ -112,6 +122,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls 
*ret_key_ls,
u8 key_layer;
int key_size;
 
+   if (flow->dissector->used_keys & ~NFP_FLOWER_WHITELIST_DISSECTOR)
+   return -EOPNOTSUPP;
+
if (dissector_uses_key(flow->dissector,
   FLOW_DISSECTOR_KEY_ENC_CONTROL)) {
struct flow_dissector_key_control *mask_enc_ctl =
-- 
2.14.1

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Eric Dumazet

On Wed, 2017-09-13 at 18:20 +0300, Denys Fedoryshchenko wrote:
> Hi,
> 
> I noticed after increasing bandwidth over some amount HTB started to 
> throttle classes it should not throttle.
> Also estimated rate in htb totally wrong, while byte counters is 
> correct.
> 
> Is there any overflow or something?

Thanks Denys for the report, I will take a look at this, since I
probably introduced some regression.

[PATCH net 2/3] nfp: wait for board state before talking to the NSP

2017-09-13 Thread Jakub Kicinski

Board state informs us which low-level initialization stages the card
has completed.  We should wait for the card to be fully initialized
before trying to communicate with it, not only before we configure
passing traffic.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c | 43 +++
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 23 
 2 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index f055b1774d65..424707d41fbd 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -74,6 +74,45 @@ static const struct pci_device_id nfp_pci_device_ids[] = {
 };
 MODULE_DEVICE_TABLE(pci, nfp_pci_device_ids);
 
+static bool nfp_board_ready(struct nfp_pf *pf)
+{
+   const char *cp;
+   long state;
+   int err;
+
+   cp = nfp_hwinfo_lookup(pf->hwinfo, "board.state");
+   if (!cp)
+   return false;
+
+   err = kstrtol(cp, 0, );
+   if (err < 0)
+   return false;
+
+   return state == 15;
+}
+
+static int nfp_pf_board_state_wait(struct nfp_pf *pf)
+{
+   const unsigned long wait_until = jiffies + 10 * HZ;
+
+   while (!nfp_board_ready(pf)) {
+   if (time_is_before_eq_jiffies(wait_until)) {
+   nfp_err(pf->cpp, "NFP board initialization timeout\n");
+   return -EINVAL;
+   }
+
+   nfp_info(pf->cpp, "waiting for board initialization\n");
+   if (msleep_interruptible(500))
+   return -ERESTARTSYS;
+
+   /* Refresh cached information */
+   kfree(pf->hwinfo);
+   pf->hwinfo = nfp_hwinfo_read(pf->cpp);
+   }
+
+   return 0;
+}
+
 static int nfp_pcie_sriov_read_nfd_limit(struct nfp_pf *pf)
 {
int err;
@@ -425,6 +464,10 @@ static int nfp_pci_probe(struct pci_dev *pdev,
 nfp_hwinfo_lookup(pf->hwinfo, "assembly.revision"),
 nfp_hwinfo_lookup(pf->hwinfo, "cpld.version"));
 
+   err = nfp_pf_board_state_wait(pf);
+   if (err)
+   goto err_hwinfo_free;
+
err = devlink_register(devlink, >dev);
if (err)
goto err_hwinfo_free;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
index 5abb9ba31e7d..ff373acd28f3 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c
@@ -64,23 +64,6 @@
 
 #define NFP_PF_CSR_SLICE_SIZE  (32 * 1024)
 
-static int nfp_is_ready(struct nfp_pf *pf)
-{
-   const char *cp;
-   long state;
-   int err;
-
-   cp = nfp_hwinfo_lookup(pf->hwinfo, "board.state");
-   if (!cp)
-   return 0;
-
-   err = kstrtol(cp, 0, );
-   if (err < 0)
-   return 0;
-
-   return state == 15;
-}
-
 /**
  * nfp_net_get_mac_addr() - Get the MAC address.
  * @pf:   NFP PF handle
@@ -725,12 +708,6 @@ int nfp_net_pci_probe(struct nfp_pf *pf)
 
INIT_WORK(>port_refresh_work, nfp_net_refresh_vnics);
 
-   /* Verify that the board has completed initialization */
-   if (!nfp_is_ready(pf)) {
-   nfp_err(pf->cpp, "NFP is not ready for NIC operation.\n");
-   return -EINVAL;
-   }
-
if (!pf->rtbl) {
nfp_err(pf->cpp, "No %s, giving up.\n",
pf->fw_loaded ? "symbol table" : "firmware found");
-- 
2.14.1

[PATCH net 0/3] nfp: wait more carefully for card init

2017-09-13 Thread Jakub Kicinski

Hi!

The first patch is a small fix for flower offload, we need a whitelist
of supported matches, otherwise the unsupported ones will be ignored.

The second and the third patch are adding wait/polling to the probe path.
We had reports of driver failing probe because it couldn't find the 
control process (NSP) on the card.  Turns out the NSP will only announce
its existence after it's fully initialized.  Until now we assumed it 
will be reachable, just not processing commands (hence we wait for
a NOOP command to execute successfully).

Jakub Kicinski (2):
  nfp: wait for board state before talking to the NSP
  nfp: wait for the NSP resource to appear on boot

Pieter Jansen van Vuuren (1):
  nfp: add whitelist of supported flow dissector

 .../net/ethernet/netronome/nfp/flower/offload.c| 13 ++
 drivers/net/ethernet/netronome/nfp/nfp_main.c  | 47 ++
 drivers/net/ethernet/netronome/nfp/nfp_net_main.c  | 23 ---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h   |  2 +
 .../ethernet/netronome/nfp/nfpcore/nfp_resource.c  | 45 +
 5 files changed, 107 insertions(+), 23 deletions(-)

-- 
2.14.1

Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression

2017-09-13 Thread Laura Abbott


On 09/12/2017 04:12 PM, Josef Bacik wrote:

First I’m super sorry for the top post, I’m at plumbers and I forgot to upload 
my muttrc to my new cloud instance, so I’m screwed using outlook.

I have a completely untested, uncompiled patch that I think will fix the 
problem, would you mind giving it a go?  Thanks,

Josef


Thanks for the quick turnaround. Unfortunately, the problem is still
reproducible according to the reporter.

Thanks,
Laura

Re: Memory leaks in conntrack

2017-09-13 Thread Florian Westphal

Florian Westphal  wrote:
> Cong Wang  wrote:
> > While testing my TC filter patches (so not related to conntrack), the
> > following memory leaks are shown up:
> > 
> > unreferenced object 0x9b19ba551228 (size 128):
> >   comm "chronyd", pid 338, jiffies 4294910829 (age 53.188s)
> >   hex dump (first 32 bytes):
> > 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  
> > 00 00 00 00 18 00 00 30 00 00 00 00 00 00 00 00  ...0
> >   backtrace:
> > [] create_object+0x169/0x2aa
> > [] kmemleak_alloc+0x25/0x41
> > [] slab_post_alloc_hook+0x44/0x65
> > [] __kmalloc_track_caller+0x113/0x146
> > [] __krealloc+0x4a/0x69
> > [] nf_ct_ext_add+0xe1/0x145
> > [] init_conntrack+0x1f7/0x36e
> > [] nf_conntrack_in+0x1d3/0x326
> > [] ipv4_conntrack_local+0x4d/0x50
> > [] nf_hook_slow+0x3c/0x9b
> > [] nf_hook.constprop.40+0xbe/0xd8
> > [] __ip_local_out+0xb3/0xbf
> > [] ip_local_out+0x1c/0x36
> > [] ip_send_skb+0x19/0x3d
> > [] udp_send_skb+0x17e/0x1df
> > [] udp_sendmsg+0x5a2/0x77c
> > unreferenced object 0x9b19a69b3340 (size 336):
> >   comm "chronyd", pid 338, jiffies 4294910868 (age 53.032s)
> >   hex dump (first 32 bytes):
> > 01 00 00 00 5a 5a 5a 5a 00 00 00 00 ad 4e ad de  .N..
> > ff ff ff ff 5a 5a 5a 5a ff ff ff ff ff ff ff ff  
> >   backtrace:
> > [] create_object+0x169/0x2aa
> > [] kmemleak_alloc+0x25/0x41
> > [] slab_post_alloc_hook+0x44/0x65
> > [] kmem_cache_alloc+0xd7/0x1f1
> > [] __nf_conntrack_alloc+0xa2/0x146
> > [] init_conntrack+0xb2/0x36e
> > [] nf_conntrack_in+0x1d3/0x326
> > [] ipv4_conntrack_local+0x4d/0x50
> > [] nf_hook_slow+0x3c/0x9b
> > [] nf_hook.constprop.40+0xbe/0xd8
> > [] __ip_local_out+0xb3/0xbf
> > [] ip_local_out+0x1c/0x36
> > [] ip_send_skb+0x19/0x3d
> > [] udp_send_skb+0x17e/0x1df
> > [] udp_sendmsg+0x5a2/0x77c
> > [] inet_sendmsg+0x37/0x5e
> >
> > I don't touch chronyd in my VM, so I have no idea why it sends out UDP
> > packets, my guess is it is some periodical packet.
> > 
> > I don't think I use conntrack either, since /proc/net/ip_conntrack
> > does not exist.
> 
> You probably do, can you try "cat /proc/net/nf_conntrack" instead?
> 
> (otherwise there should be no ipv4_conntrack_local() invocation
>  since we would not register this hook at all).
> 
> I tried to reproduce this but so far I had no success.
> If you can identify something that could give a hint when this
> is happening (only once after boot, periodically, only with udp, etc)
> please let us know.

FWIW i managed to obtain a similar backtrace, but in that case it was a
false positive (peeking at the address content showed it was my ssh connection
to the vm and timeout and tcp conntrackk struct fields were changing;
i.e. the nf_conn reported was still in the conntrack hash.

Why this address was reported i do not know, afaik kmemleak
does scan for addresses anywhere in the object (we use
container_of() to get back nf_conn from the hlist_node), so it
should have found the address linked via the main conntrack hash table.

Right now I don't have enough info to dig any further, sorry :-/

Re: netfilter: xt_bpf: ABI issue in xt_bpf_info_v1?

2017-09-13 Thread Willem de Bruijn

On Wed, Sep 13, 2017 at 11:05 AM, Willem de Bruijn
 wrote:
> On Wed, Sep 13, 2017 at 10:22 AM, Jan Engelhardt  wrote:
>>
>> On Wednesday 2017-09-13 15:24, Shmulik Ladkani wrote:
>>>
>>>One way to fix is to have iptables open the object (using the stored
>>>xt_bpf_info_v1->path), gaining a new process local fd for the object,
>>>just after getting the rules from IPT_SO_GET_ENTRIES.
>>>However we didn't see any other extensions doing something like that in
>>>iptables.
>
> The binary should call bpf_obj_get on the filepath each time. These are
> not regular files, but references to a pinned object in the bpf filesystem.
>
> Blindly passing back the fd received from the kernel is clearly wrong. I'm
> really surprised that I did not run into this problem when I wrote the
> feature.
>
>>>
>>>Another way to solve is to fix the ABI (or have a v2 one), that does NOT
>>>pass the fd from userspace, only the path of the pinned object.
>>>Then, 'bpf_mt_check_v1' will open the file from the given path in order
>>>to get the bpf_prog.
>>
>> But a path has a similar problem like a file descriptor - it is local to a
>> certain mount namespace.
>
> Because these are pinned objects in the bpf filesystem, and there is
> only one of those, it may be possible to lookup the object in the kernel
> without relying on a process-local view of mount points.

It would be preferable to fix this for the existing v1 users as well.

That said, the new bpf identifier feature allows passing a globally
unique id instead of a filepath.

>
>>
>> To load "large" blobs into the kernel, a pointer to user memory is a possible
>> option. The downside is that such extra data is not retrievable from the 
>> kernel
>> via the iptables setsockopts anymore - one could work around it with procfs, 
>> or
>> just let it be.
>>
>> https://sourceforge.net/p/xtables-addons/xtables-addons/ci/master/tree/extensions/xt_geoip.c
>> line 64+.

Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Denys Fedoryshchenko

Well, probably i am answering my own question, removing estimator from 
classes seems drastically improve situation.
It seems estimator has some issues that cause shaper to behave 
incorrectly (throttling traffic while it should not).

But i guess thats a bug?
As i was not able to predict such bottleneck by CPU load measurements.

On 2017-09-13 18:20, Denys Fedoryshchenko wrote:

Hi,

I noticed after increasing bandwidth over some amount HTB started to
throttle classes it should not throttle.
Also estimated rate in htb totally wrong, while byte counters is 
correct.


Is there any overflow or something?

X520 card (but XL710 same)
br1 8000.90e2ba86c38c   no  eth3.1777
eth3.777
br2 8000.90e2ba86c38d   no  eth3.360
eth3.361

Inbound traffic is coming over one vlan, leaving another vlan.
Shaper is just bunch of classes and u32 filters, with few fw filters.
qdisc is pie

I put totally high values to not reach them, tried to change
quantum/burst/cburst but... stats below.

First, "root" class is 1:1 showing  rate 18086Mbit, which is
physically impossible.

Class 1:111 showing 5355Mbit, while real traffic is ~1.5Gbit

shaper /etc # tc -s -d class show dev eth3.777 classid 1:111;sleep
5;tc -s -d class show dev eth3.777 classid 1:111
class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 10Gbit
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu
0b level 0
 Sent 6487632263 bytes 5235525 pkt (dropped 0, overlimits 0 requeues 0)
 rate 5529Mbit 557534pps backlog 0b 0p requeues 0
 lended: 2423323 borrowed: 0 giants: 0
 tokens: 124 ctokens: -1

class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 10Gbit
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu
0b level 0
 Sent 7438601731 bytes 6003811 pkt (dropped 0, overlimits 0 requeues 0)
 rate 5631Mbit 568214pps backlog 36624b 8p requeues 0
 lended: 2772486 borrowed: 0 giants: 0
 tokens: 124 ctokens: -1

(7438601731-6487632263)/5*8 = 1.521.551.148

And most important some classes suffering, while they should not (not
reaching limits)
class htb 1:95 parent 1:1 leaf 95: prio 0 quantum 5 rate 10Gbit
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu
0b level 0
 Sent 13556762059 bytes 17474559 pkt (dropped 16017, overlimits 0 
requeues 0)

 rate 2524Mbit 414197pps backlog 31969245b 34513p requeues 0
 lended: 13995723 borrowed: 0 giants: 0
 tokens: 111 ctokens: -2





Full classes stats:

class htb 1:100 parent 1:1 leaf 100: prio 0 quantum 5 rate 10Gbit
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu
0b level 0
 Sent 116 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
 rate 8bit 0pps backlog 0b 0p requeues 0
 lended: 2 borrowed: 0 giants: 0
 tokens: 124 ctokens: -1

class htb 1:120 parent 1:1 leaf 120: prio 0 quantum 5 rate 10Gbit
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu
0b level 0
 Sent 531230043 bytes 782130 pkt (dropped 0, overlimits 0 requeues 0)
 rate 132274Kbit 25240pps backlog 0b 0p requeues 0
 lended: 540693 borrowed: 0 giants: 0
 tokens: 109 ctokens: -2

class htb 1:50 parent 1:1 leaf 50: prio 0 quantum 5 rate 10Gbit
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu
0b level 0
 Sent 773472109 bytes 587335 pkt (dropped 0, overlimits 0 requeues 0)
 rate 215929Kbit 20503pps backlog 0b 0p requeues 0
 lended: 216614 borrowed: 0 giants: 0
 tokens: 91 ctokens: -4

class htb 1:70 parent 1:1 leaf 70: prio 0 quantum 5 rate 10Gbit
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu
0b level 0
 Sent 1574768 bytes 6194 pkt (dropped 0, overlimits 0 requeues 0)
 rate 406272bit 214pps backlog 0b 0p requeues 0
 lended: 5758 borrowed: 0 giants: 0
 tokens: 101 ctokens: -3

class htb 1:90 parent 1:1 leaf 90: prio 0 quantum 5 rate 1Kbit
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu
0b level 0
 Sent 3206 bytes 53 pkt (dropped 0, overlimits 0 requeues 0)
 rate 848bit 1pps backlog 0b 0p requeues 0
 lended: 53 borrowed: 0 giants: 0

class htb 1:110 parent 1:1 leaf 110: prio 0 quantum 5 rate 10Gbit
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu
0b level 0
 Sent 17205952113 bytes 12926008 pkt (dropped 239, overlimits 0 
requeues 0)

 rate 4433Mbit 416825pps backlog 5847785b 2394p requeues 0
 lended: 7021696 borrowed: 0 giants: 0
 tokens: 91 ctokens: -4

class htb 1:45 root leaf 45: prio 0 quantum 5 rate 80Mbit ceil
80Mbit linklayer ethernet burst 1b/1 mpu 0b cburst 1b/1 mpu 0b
level 0
 Sent 2586 bytes 45 pkt (dropped 0, overlimits 0 requeues 0)
 rate 456bit 1pps backlog 0b 0p requeues 0
 lended: 45 borrowed: 0 giants: 0
 tokens: 15540 ctokens: 15540

class htb 1:1 root rate 100Gbit ceil 100Gbit linklayer ethernet burst
0b/1 mpu 0b cburst 0b/1 mpu 0b level 7
 Sent 72277215121 bytes

[patch net] net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker

2017-09-13 Thread Jiri Pirko

From: Jiri Pirko 

Recent commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu") removed
freeing in call_rcu, which changed already existing hard-to-hit
race condition into 100% hit:

[  598.599825] BUG: unable to handle kernel NULL pointer dereference at 
0030
[  598.607782] IP: tcf_action_destroy+0xc0/0x140

Or:

[   40.858924] BUG: unable to handle kernel NULL pointer dereference at 
0030
[   40.862840] IP: tcf_generic_walker+0x534/0x820

Fix this by storing the ops and use them directly for module_put call.

Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common")
Signed-off-by: Jiri Pirko 
---
 net/sched/act_api.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index fcd7dc7..da6fa82 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -180,7 +180,7 @@ static int tcf_del_walker(struct tcf_idrinfo *idrinfo, 
struct sk_buff *skb,
idr_for_each_entry_ext(idr, p, id) {
ret = __tcf_idr_release(p, false, true);
if (ret == ACT_P_DELETED) {
-   module_put(p->ops->owner);
+   module_put(ops->owner);
n_i++;
} else if (ret < 0) {
goto nla_put_failure;
@@ -514,13 +514,15 @@ EXPORT_SYMBOL(tcf_action_exec);
 
 int tcf_action_destroy(struct list_head *actions, int bind)
 {
+   const struct tc_action_ops *ops;
struct tc_action *a, *tmp;
int ret = 0;
 
list_for_each_entry_safe(a, tmp, actions, list) {
+   ops = a->ops;
ret = __tcf_idr_release(a, bind, true);
if (ret == ACT_P_DELETED)
-   module_put(a->ops->owner);
+   module_put(ops->owner);
else if (ret < 0)
return ret;
}
-- 
2.9.3

[PATCH stable-3.16 2/3] VSOCK: Fix lockdep issue.

2017-09-13 Thread Michal Hocko

From: Jorgen Hansen 

commit 8566b86ab9f0f45bc6f7dd422b21de9d0cf5415a upstream.

The recent fix for the vsock sock_put issue used the wrong
initializer for the transport spin_lock causing an issue when
running with lockdep checking.

Testing: Verified fix on kernel with lockdep enabled.

Reviewed-by: Thomas Hellstrom 
Signed-off-by: Jorgen Hansen 
Signed-off-by: David S. Miller 
Signed-off-by: Michal Hocko 
---
 net/vmw_vsock/vmci_transport.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index aed136d27b01..314312272e08 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -1570,7 +1570,7 @@ static int vmci_transport_socket_init(struct vsock_sock 
*vsk,
vmci_trans(vsk)->notify_ops = NULL;
INIT_LIST_HEAD(_trans(vsk)->elem);
vmci_trans(vsk)->sk = >sk;
-   vmci_trans(vsk)->lock = __SPIN_LOCK_UNLOCKED(vmci_trans(vsk)->lock);
+   spin_lock_init(_trans(vsk)->lock);
if (psk) {
vmci_trans(vsk)->queue_pair_size =
vmci_trans(psk)->queue_pair_size;
-- 
2.14.1

[PATCH stable-3.16 1/3] VSOCK: sock_put wasn't safe to call in interrupt context

2017-09-13 Thread Michal Hocko

From: Jorgen Hansen 

commit 4ef7ea9195ea73262cd9730fb54e1eb726da157b upstream.

In the vsock vmci_transport driver, sock_put wasn't safe to call
in interrupt context, since that may call the vsock destructor
which in turn calls several functions that should only be called
from process context. This change defers the callling of these
functions  to a worker thread. All these functions were
deallocation of resources related to the transport itself.

Furthermore, an unused callback was removed to simplify the
cleanup.

Multiple customers have been hitting this issue when using
VMware tools on vSphere 2015.

Also added a version to the vmci transport module (starting from
1.0.2.0-k since up until now it appears that this module was
sharing version with vsock that is currently at 1.0.1.0-k).

Reviewed-by: Aditya Asarwade 
Reviewed-by: Thomas Hellstrom 
Signed-off-by: Jorgen Hansen 
Signed-off-by: David S. Miller 
Signed-off-by: Michal Hocko 
---
 net/vmw_vsock/vmci_transport.c | 173 -
 net/vmw_vsock/vmci_transport.h |   4 +-
 2 files changed, 86 insertions(+), 91 deletions(-)

diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 9bb63ffec4f2..aed136d27b01 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -40,13 +40,11 @@
 
 static int vmci_transport_recv_dgram_cb(void *data, struct vmci_datagram *dg);
 static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram *dg);
-static void vmci_transport_peer_attach_cb(u32 sub_id,
- const struct vmci_event_data *ed,
- void *client_data);
 static void vmci_transport_peer_detach_cb(u32 sub_id,
  const struct vmci_event_data *ed,
  void *client_data);
 static void vmci_transport_recv_pkt_work(struct work_struct *work);
+static void vmci_transport_cleanup(struct work_struct *work);
 static int vmci_transport_recv_listen(struct sock *sk,
  struct vmci_transport_packet *pkt);
 static int vmci_transport_recv_connecting_server(
@@ -75,6 +73,10 @@ struct vmci_transport_recv_pkt_info {
struct vmci_transport_packet pkt;
 };
 
+static LIST_HEAD(vmci_transport_cleanup_list);
+static DEFINE_SPINLOCK(vmci_transport_cleanup_lock);
+static DECLARE_WORK(vmci_transport_cleanup_work, vmci_transport_cleanup);
+
 static struct vmci_handle vmci_transport_stream_handle = { VMCI_INVALID_ID,
   VMCI_INVALID_ID };
 static u32 vmci_transport_qp_resumed_sub_id = VMCI_INVALID_ID;
@@ -791,44 +793,6 @@ static int vmci_transport_recv_stream_cb(void *data, 
struct vmci_datagram *dg)
return err;
 }
 
-static void vmci_transport_peer_attach_cb(u32 sub_id,
- const struct vmci_event_data *e_data,
- void *client_data)
-{
-   struct sock *sk = client_data;
-   const struct vmci_event_payload_qp *e_payload;
-   struct vsock_sock *vsk;
-
-   e_payload = vmci_event_data_const_payload(e_data);
-
-   vsk = vsock_sk(sk);
-
-   /* We don't ask for delayed CBs when we subscribe to this event (we
-* pass 0 as flags to vmci_event_subscribe()).  VMCI makes no
-* guarantees in that case about what context we might be running in,
-* so it could be BH or process, blockable or non-blockable.  So we
-* need to account for all possible contexts here.
-*/
-   local_bh_disable();
-   bh_lock_sock(sk);
-
-   /* XXX This is lame, we should provide a way to lookup sockets by
-* qp_handle.
-*/
-   if (vmci_handle_is_equal(vmci_trans(vsk)->qp_handle,
-e_payload->handle)) {
-   /* XXX This doesn't do anything, but in the future we may want
-* to set a flag here to verify the attach really did occur and
-* we weren't just sent a datagram claiming it was.
-*/
-   goto out;
-   }
-
-out:
-   bh_unlock_sock(sk);
-   local_bh_enable();
-}
-
 static void vmci_transport_handle_detach(struct sock *sk)
 {
struct vsock_sock *vsk;
@@ -871,28 +835,38 @@ static void vmci_transport_peer_detach_cb(u32 sub_id,
  const struct vmci_event_data *e_data,
  void *client_data)
 {
-   struct sock *sk = client_data;
+   struct vmci_transport *trans = client_data;
const struct vmci_event_payload_qp *e_payload;
-   struct vsock_sock *vsk;
 
e_payload = vmci_event_data_const_payload(e_data);
-   vsk = vsock_sk(sk);
-   if

[PATCH stable-3.16 3/3] VSOCK: Detach QP check should filter out non matching QPs.

2017-09-13 Thread Michal Hocko

From: Jorgen Hansen 

commit 8ab18d71de8b07d2c4d6f984b718418c09ea45c5 upstream.

The check in vmci_transport_peer_detach_cb should only allow a
detach when the qp handle of the transport matches the one in
the detach message.

Testing: Before this change, a detach from a peer on a different
socket would cause an active stream socket to register a detach.

Reviewed-by: George Zhang 
Signed-off-by: Jorgen Hansen 
Signed-off-by: David S. Miller 
Signed-off-by: Michal Hocko 
---
 net/vmw_vsock/vmci_transport.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 314312272e08..c69c990ec4a2 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -844,7 +844,7 @@ static void vmci_transport_peer_detach_cb(u32 sub_id,
 * qp_handle.
 */
if (vmci_handle_is_invalid(e_payload->handle) ||
-   vmci_handle_is_equal(trans->qp_handle, e_payload->handle))
+   !vmci_handle_is_equal(trans->qp_handle, e_payload->handle))
return;
 
/* We don't ask for delayed CBs when we subscribe to this event (we
@@ -2158,7 +2158,7 @@ module_exit(vmci_transport_exit);
 
 MODULE_AUTHOR("VMware, Inc.");
 MODULE_DESCRIPTION("VMCI transport for Virtual Sockets");
-MODULE_VERSION("1.0.2.0-k");
+MODULE_VERSION("1.0.3.0-k");
 MODULE_LICENSE("GPL v2");
 MODULE_ALIAS("vmware_vsock");
 MODULE_ALIAS_NETPROTO(PF_VSOCK);
-- 
2.14.1

HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)

2017-09-13 Thread Denys Fedoryshchenko


Hi,

I noticed after increasing bandwidth over some amount HTB started to 
throttle classes it should not throttle.
Also estimated rate in htb totally wrong, while byte counters is 
correct.


Is there any overflow or something?

X520 card (but XL710 same)
br1 8000.90e2ba86c38c   no  eth3.1777
eth3.777
br2 8000.90e2ba86c38d   no  eth3.360
eth3.361

Inbound traffic is coming over one vlan, leaving another vlan.
Shaper is just bunch of classes and u32 filters, with few fw filters.
qdisc is pie

I put totally high values to not reach them, tried to change 
quantum/burst/cburst but... stats below.


First, "root" class is 1:1 showing  rate 18086Mbit, which is physically 
impossible.


Class 1:111 showing 5355Mbit, while real traffic is ~1.5Gbit

shaper /etc # tc -s -d class show dev eth3.777 classid 1:111;sleep 5;tc 
-s -d class show dev eth3.777 classid 1:111
class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 10Gbit 
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b 
level 0

 Sent 6487632263 bytes 5235525 pkt (dropped 0, overlimits 0 requeues 0)
 rate 5529Mbit 557534pps backlog 0b 0p requeues 0
 lended: 2423323 borrowed: 0 giants: 0
 tokens: 124 ctokens: -1

class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 10Gbit 
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b 
level 0

 Sent 7438601731 bytes 6003811 pkt (dropped 0, overlimits 0 requeues 0)
 rate 5631Mbit 568214pps backlog 36624b 8p requeues 0
 lended: 2772486 borrowed: 0 giants: 0
 tokens: 124 ctokens: -1

(7438601731-6487632263)/5*8 = 1.521.551.148

And most important some classes suffering, while they should not (not 
reaching limits)
class htb 1:95 parent 1:1 leaf 95: prio 0 quantum 5 rate 10Gbit ceil 
100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b 
level 0
 Sent 13556762059 bytes 17474559 pkt (dropped 16017, overlimits 0 
requeues 0)

 rate 2524Mbit 414197pps backlog 31969245b 34513p requeues 0
 lended: 13995723 borrowed: 0 giants: 0
 tokens: 111 ctokens: -2





Full classes stats:

class htb 1:100 parent 1:1 leaf 100: prio 0 quantum 5 rate 10Gbit 
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b 
level 0

 Sent 116 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
 rate 8bit 0pps backlog 0b 0p requeues 0
 lended: 2 borrowed: 0 giants: 0
 tokens: 124 ctokens: -1

class htb 1:120 parent 1:1 leaf 120: prio 0 quantum 5 rate 10Gbit 
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b 
level 0

 Sent 531230043 bytes 782130 pkt (dropped 0, overlimits 0 requeues 0)
 rate 132274Kbit 25240pps backlog 0b 0p requeues 0
 lended: 540693 borrowed: 0 giants: 0
 tokens: 109 ctokens: -2

class htb 1:50 parent 1:1 leaf 50: prio 0 quantum 5 rate 10Gbit ceil 
100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b 
level 0

 Sent 773472109 bytes 587335 pkt (dropped 0, overlimits 0 requeues 0)
 rate 215929Kbit 20503pps backlog 0b 0p requeues 0
 lended: 216614 borrowed: 0 giants: 0
 tokens: 91 ctokens: -4

class htb 1:70 parent 1:1 leaf 70: prio 0 quantum 5 rate 10Gbit ceil 
100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b 
level 0

 Sent 1574768 bytes 6194 pkt (dropped 0, overlimits 0 requeues 0)
 rate 406272bit 214pps backlog 0b 0p requeues 0
 lended: 5758 borrowed: 0 giants: 0
 tokens: 101 ctokens: -3

class htb 1:90 parent 1:1 leaf 90: prio 0 quantum 5 rate 1Kbit ceil 
100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b 
level 0

 Sent 3206 bytes 53 pkt (dropped 0, overlimits 0 requeues 0)
 rate 848bit 1pps backlog 0b 0p requeues 0
 lended: 53 borrowed: 0 giants: 0

class htb 1:110 parent 1:1 leaf 110: prio 0 quantum 5 rate 10Gbit 
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b 
level 0
 Sent 17205952113 bytes 12926008 pkt (dropped 239, overlimits 0 requeues 
0)

 rate 4433Mbit 416825pps backlog 5847785b 2394p requeues 0
 lended: 7021696 borrowed: 0 giants: 0
 tokens: 91 ctokens: -4

class htb 1:45 root leaf 45: prio 0 quantum 5 rate 80Mbit ceil 
80Mbit linklayer ethernet burst 1b/1 mpu 0b cburst 1b/1 mpu 0b 
level 0

 Sent 2586 bytes 45 pkt (dropped 0, overlimits 0 requeues 0)
 rate 456bit 1pps backlog 0b 0p requeues 0
 lended: 45 borrowed: 0 giants: 0
 tokens: 15540 ctokens: 15540

class htb 1:1 root rate 100Gbit ceil 100Gbit linklayer ethernet burst 
0b/1 mpu 0b cburst 0b/1 mpu 0b level 7
 Sent 72277215121 bytes 72693012 pkt (dropped 0, overlimits 0 requeues 
0)

 rate 18086Mbit 2304729pps backlog 0b 0p requeues 0
 lended: 0 borrowed: 0 giants: 0
 tokens: -4 ctokens: -4

class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 10Gbit 
ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b 
level 0
 Sent 21977384237 bytes

Re: scheduling while atomic from vmci_transport_recv_stream_cb in 3.16 kernels

2017-09-13 Thread Michal Hocko

On Wed 13-09-17 15:07:26, Jorgen S. Hansen wrote:
> 
> > On Sep 12, 2017, at 11:08 AM, Michal Hocko  wrote:
> > 
> > Hi,
> > we are seeing the following splat with Debian 3.16 stable kernel
> > 
> > BUG: scheduling while atomic: MATLAB/26771/0x0100
> > Modules linked in: veeamsnap(O) hmac cbc cts nfsv4 dns_resolver 
> > rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache 
> > sunrpc vmw_vso$
> > CPU: 0 PID: 26771 Comm: MATLAB Tainted: G   O  3.16.0-4-amd64 #1 
> > Debian 3.16.7-ckt20-1+deb8u3
> > Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference 
> > Platform, BIOS 6.00 09/21/2015
> > 88315c1e4c20 8150db3f 88193f803dc8 8150acdf
> > 815103a2 00012f00 8819423dbfd8 00012f00
> > 88315c1e4c20 88193f803dc8 88193f803d50 88193f803dc0
> > Call Trace:
> >   [] ? dump_stack+0x41/0x51
> > [] ? __schedule_bug+0x48/0x55
> > [] ? __schedule+0x5d2/0x700
> > [] ? schedule_timeout+0x229/0x2a0
> > [] ? select_task_rq_fair+0x390/0x700
> > [] ? check_preempt_wakeup+0x120/0x1d0
> > [] ? wait_for_completion+0xa8/0x120
> > [] ? wake_up_state+0x10/0x10
> > [] ? call_rcu_bh+0x20/0x20
> > [] ? wait_rcu_gp+0x4b/0x60
> > [] ? ftrace_raw_output_rcu_utilization+0x40/0x40
> > [] ? vmci_event_unsubscribe+0x75/0xb0 [vmw_vmci]
> > [] ? vmci_transport_destruct+0x1d/0xe0 
> > [vmw_vsock_vmci_transport]
> > [] ? vsock_sk_destruct+0x13/0x60 [vsock]
> > [] ? __sk_free+0x1a/0x130
> > [] ? vmci_transport_recv_stream_cb+0x1e8/0x2d0 
> > [vmw_vsock_vmci_transport]
> > [] ? vmci_datagram_invoke_guest_handler+0xaa/0xd0 
> > [vmw_vmci]
> > [] ? vmci_dispatch_dgs+0xc1/0x200 [vmw_vmci]
> > [] ? tasklet_action+0xf4/0x100
> > [] ? __do_softirq+0xf1/0x290
> > [] ? irq_exit+0x95/0xa0
> > [] ? do_IRQ+0x52/0xe0
> > [] ? common_interrupt+0x6d/0x6d
> > 
> > AFAICS this has been fixed by 4ef7ea9195ea ("VSOCK: sock_put wasn't safe
> > to call in interrupt context") but this patch hasn't been backported to
> > stable trees. It applies cleanly on top of 3.16 stable tree but I am not
> > familiar with the code to send the backport to the stable maintainer
> > directly.
> > 
> > Could you double check that the patch below (just a blind cherry-pick)
> > is correct and it doesn't need additional patches on top?
> 
> Hi,
> 
> The patch below has been used to fix the above issue by other distros
> - among them Redhat for the 3.10 kernel, so it should work for 3.16 as
> well.

Thanks for the confirmation. I do not see 4ef7ea9195ea ("VSOCK: sock_put
wasn't safe to call in interrupt context") in 3.10 stable branch
though.

> In addition to the patch above, there are two other patches that
> need to be applied on top for the fix to be correct:
> 
> 8566b86ab9f0f45bc6f7dd422b21de9d0cf5415a "VSOCK: Fix lockdep issue."
> 
> and
> 
> 8ab18d71de8b07d2c4d6f984b718418c09ea45c5 "VSOCK: Detach QP check should 
> filter out non matching QPs."

Good to know. I will send all three patches cherry-picked on top of the
current 3.16 stable branch. Could you have a look please?
-- 
Michal Hocko
SUSE Labs

[PATCH net] be2net: fix TSO6/GSO issue causing TX-stall on Lancer/BEx

2017-09-13 Thread Suresh Reddy

IPv6 TSO requests with extension hdrs are a problem to the
Lancer and BEx chips. Workaround is to disable TSO6 feature
for such packets.

Also in Lancer chips, MSS less than 256 was resulting in TX stall.
Fix this by disabling GSO when MSS less than 256.

Signed-off-by: Suresh Reddy 
---
 drivers/net/ethernet/emulex/benet/be.h  |  8 
 drivers/net/ethernet/emulex/benet/be_main.c | 14 ++
 2 files changed, 22 insertions(+)

diff --git a/drivers/net/ethernet/emulex/benet/be.h 
b/drivers/net/ethernet/emulex/benet/be.h
index 674cf9d..8984c49 100644
--- a/drivers/net/ethernet/emulex/benet/be.h
+++ b/drivers/net/ethernet/emulex/benet/be.h
@@ -930,6 +930,14 @@ static inline bool is_ipv4_pkt(struct sk_buff *skb)
return skb->protocol == htons(ETH_P_IP) && ip_hdr(skb)->version == 4;
 }
 
+static inline bool is_ipv6_ext_hdr(struct sk_buff *skb)
+{
+   if (ip_hdr(skb)->version == 6)
+   return ipv6_ext_hdr(ipv6_hdr(skb)->nexthdr);
+   else
+   return false;
+}
+
 #define be_error_recovering(adapter)   \
(adapter->flags & BE_FLAGS_TRY_RECOVERY)
 
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c 
b/drivers/net/ethernet/emulex/benet/be_main.c
index 319eee3..0e3d9f39 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -5089,6 +5089,20 @@ static netdev_features_t be_features_check(struct 
sk_buff *skb,
struct be_adapter *adapter = netdev_priv(dev);
u8 l4_hdr = 0;
 
+   if (skb_is_gso(skb)) {
+   /* IPv6 TSO requests with extension hdrs are a problem
+* to Lancer and BE3 HW. Disable TSO6 feature.
+*/
+   if (!skyhawk_chip(adapter) && is_ipv6_ext_hdr(skb))
+   features &= ~NETIF_F_TSO6;
+
+   /* Lancer cannot handle the packet with MSS less than 256.
+* Disable the GSO support in such cases
+*/
+   if (lancer_chip(adapter) && skb_shinfo(skb)->gso_size < 256)
+   features &= ~NETIF_F_GSO_MASK;
+   }
+
/* The code below restricts offload features for some tunneled and
 * Q-in-Q packets.
 * Offload features for normal (non tunnel) packets are unchanged.
-- 
1.8.3.1

Re: [PATCH 10/10] fs:btrfs: return -ENOMEM on allocation failure.

2017-09-13 Thread David Sterba

On Wed, Sep 13, 2017 at 01:02:19PM +0530, Allen Pais wrote:
> Signed-off-by: Allen Pais 
> ---
>  fs/btrfs/check-integrity.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index 7d5a9b5..efa4c23 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -2913,7 +2913,7 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
>   state = kvzalloc(sizeof(*state), GFP_KERNEL);
>   if (!state) {
>   pr_info("btrfs check-integrity: allocation failed!\n");
> - return -1;
> + return -ENOMEM;

Makes sense, also please fix the -1 a few lines below that also result
from failed memory allocation, indirectly from btrfsic_dev_state_alloc().

>   }
>  
>   if (!btrfsic_is_initialized) {
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: scheduling while atomic from vmci_transport_recv_stream_cb in 3.16 kernels

2017-09-13 Thread Jorgen S. Hansen


> On Sep 12, 2017, at 11:08 AM, Michal Hocko  wrote:
> 
> Hi,
> we are seeing the following splat with Debian 3.16 stable kernel
> 
> BUG: scheduling while atomic: MATLAB/26771/0x0100
> Modules linked in: veeamsnap(O) hmac cbc cts nfsv4 dns_resolver 
> rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache 
> sunrpc vmw_vso$
> CPU: 0 PID: 26771 Comm: MATLAB Tainted: G   O  3.16.0-4-amd64 #1 
> Debian 3.16.7-ckt20-1+deb8u3
> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference 
> Platform, BIOS 6.00 09/21/2015
> 88315c1e4c20 8150db3f 88193f803dc8 8150acdf
> 815103a2 00012f00 8819423dbfd8 00012f00
> 88315c1e4c20 88193f803dc8 88193f803d50 88193f803dc0
> Call Trace:
>   [] ? dump_stack+0x41/0x51
> [] ? __schedule_bug+0x48/0x55
> [] ? __schedule+0x5d2/0x700
> [] ? schedule_timeout+0x229/0x2a0
> [] ? select_task_rq_fair+0x390/0x700
> [] ? check_preempt_wakeup+0x120/0x1d0
> [] ? wait_for_completion+0xa8/0x120
> [] ? wake_up_state+0x10/0x10
> [] ? call_rcu_bh+0x20/0x20
> [] ? wait_rcu_gp+0x4b/0x60
> [] ? ftrace_raw_output_rcu_utilization+0x40/0x40
> [] ? vmci_event_unsubscribe+0x75/0xb0 [vmw_vmci]
> [] ? vmci_transport_destruct+0x1d/0xe0 
> [vmw_vsock_vmci_transport]
> [] ? vsock_sk_destruct+0x13/0x60 [vsock]
> [] ? __sk_free+0x1a/0x130
> [] ? vmci_transport_recv_stream_cb+0x1e8/0x2d0 
> [vmw_vsock_vmci_transport]
> [] ? vmci_datagram_invoke_guest_handler+0xaa/0xd0 [vmw_vmci]
> [] ? vmci_dispatch_dgs+0xc1/0x200 [vmw_vmci]
> [] ? tasklet_action+0xf4/0x100
> [] ? __do_softirq+0xf1/0x290
> [] ? irq_exit+0x95/0xa0
> [] ? do_IRQ+0x52/0xe0
> [] ? common_interrupt+0x6d/0x6d
> 
> AFAICS this has been fixed by 4ef7ea9195ea ("VSOCK: sock_put wasn't safe
> to call in interrupt context") but this patch hasn't been backported to
> stable trees. It applies cleanly on top of 3.16 stable tree but I am not
> familiar with the code to send the backport to the stable maintainer
> directly.
> 
> Could you double check that the patch below (just a blind cherry-pick)
> is correct and it doesn't need additional patches on top?

Hi,

The patch below has been used to fix the above issue by other distros - among 
them Redhat for the 3.10 kernel, so it should work for 3.16 as well. In 
addition to the patch above, there are two other patches that need to be 
applied on top for the fix to be correct:

8566b86ab9f0f45bc6f7dd422b21de9d0cf5415a "VSOCK: Fix lockdep issue."

and

8ab18d71de8b07d2c4d6f984b718418c09ea45c5 "VSOCK: Detach QP check should filter 
out non matching QPs."

Thanks,
Jorgen

> 
> Ben could you take this to your stable 3.16 branch if the patch is correct?
> 
> I am CCing Sasha for 4.1 stable tree as well. I haven't checked whether
> pre 3.16 kernels are affected as well.
> ---
> commit fac774c40b5c512113b6373cad498f35bee7a409
> Author: Jorgen Hansen 
> Date:   Wed Oct 21 04:53:56 2015 -0700
> 
>VSOCK: sock_put wasn't safe to call in interrupt context
> 
>commit 4ef7ea9195ea73262cd9730fb54e1eb726da157b upstream.
> 
>In the vsock vmci_transport driver, sock_put wasn't safe to call
>in interrupt context, since that may call the vsock destructor
>which in turn calls several functions that should only be called
>from process context. This change defers the callling of these
>functions  to a worker thread. All these functions were
>deallocation of resources related to the transport itself.
> 
>Furthermore, an unused callback was removed to simplify the
>cleanup.
> 
>Multiple customers have been hitting this issue when using
>VMware tools on vSphere 2015.
> 
>Also added a version to the vmci transport module (starting from
>1.0.2.0-k since up until now it appears that this module was
>sharing version with vsock that is currently at 1.0.1.0-k).
> 
>Reviewed-by: Aditya Asarwade 
>Reviewed-by: Thomas Hellstrom 
>Signed-off-by: Jorgen Hansen 
>Signed-off-by: David S. Miller 
> 
> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> index 9bb63ffec4f2..aed136d27b01 100644
> --- a/net/vmw_vsock/vmci_transport.c
> +++ b/net/vmw_vsock/vmci_transport.c
> @@ -40,13 +40,11 @@
> 
> static int vmci_transport_recv_dgram_cb(void *data, struct vmci_datagram *dg);
> static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram 
> *dg);
> -static void vmci_transport_peer_attach_cb(u32 sub_id,
> -   const struct vmci_event_data *ed,
> -   void *client_data);
> static void vmci_transport_peer_detach_cb(u32 sub_id,
> const struct vmci_event_data *ed,
> void *client_data);
> static void vmci_transport_recv_pkt_work(struct work_struct *work);
> +static

Re: netfilter: xt_bpf: ABI issue in xt_bpf_info_v1?

2017-09-13 Thread Willem de Bruijn

On Wed, Sep 13, 2017 at 10:22 AM, Jan Engelhardt  wrote:
>
> On Wednesday 2017-09-13 15:24, Shmulik Ladkani wrote:
>>
>>One way to fix is to have iptables open the object (using the stored
>>xt_bpf_info_v1->path), gaining a new process local fd for the object,
>>just after getting the rules from IPT_SO_GET_ENTRIES.
>>However we didn't see any other extensions doing something like that in
>>iptables.

The binary should call bpf_obj_get on the filepath each time. These are
not regular files, but references to a pinned object in the bpf filesystem.

Blindly passing back the fd received from the kernel is clearly wrong. I'm
really surprised that I did not run into this problem when I wrote the
feature.

>>
>>Another way to solve is to fix the ABI (or have a v2 one), that does NOT
>>pass the fd from userspace, only the path of the pinned object.
>>Then, 'bpf_mt_check_v1' will open the file from the given path in order
>>to get the bpf_prog.
>
> But a path has a similar problem like a file descriptor - it is local to a
> certain mount namespace.

Because these are pinned objects in the bpf filesystem, and there is
only one of those, it may be possible to lookup the object in the kernel
without relying on a process-local view of mount points.

>
> To load "large" blobs into the kernel, a pointer to user memory is a possible
> option. The downside is that such extra data is not retrievable from the 
> kernel
> via the iptables setsockopts anymore - one could work around it with procfs, 
> or
> just let it be.
>
> https://sourceforge.net/p/xtables-addons/xtables-addons/ci/master/tree/extensions/xt_geoip.c
> line 64+.

Re: [PATCH 01/10] arch:powerpc: return -ENOMEM on failed allocation

2017-09-13 Thread Joe Perches

On Wed, 2017-09-13 at 13:02 +0530, Allen Pais wrote:
> Signed-off-by: Allen Pais 

I think the changelog for this series of conversions
should show that you've validated the change by
inspecting the return call chain at each modified line.

Also, it seems you've cc'd the same mailing lists for
all of the patches modified by this series.

It would be better to individually address each patch
in the series only cc'ing the appropriate maintainers
and mailing lists.

A cover letter would be good too.

> ---
>  arch/powerpc/platforms/cell/spider-pci.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/cell/spider-pci.c 
> b/arch/powerpc/platforms/cell/spider-pci.c
> index d1e61e2..82aa3f7 100644
> --- a/arch/powerpc/platforms/cell/spider-pci.c
> +++ b/arch/powerpc/platforms/cell/spider-pci.c
> @@ -106,7 +106,7 @@ static int __init spiderpci_pci_setup_chip(struct 
> pci_controller *phb,
>   dummy_page_va = kmalloc(PAGE_SIZE, GFP_KERNEL);
>   if (!dummy_page_va) {
>   pr_err("SPIDERPCI-IOWA:Alloc dummy_page_va failed.\n");
> - return -1;
> + return -ENOMEM;
>   }
>  
>   dummy_page_da = dma_map_single(phb->parent, dummy_page_va,
> @@ -137,7 +137,7 @@ int __init spiderpci_iowa_init(struct iowa_bus *bus, void 
> *data)
>   if (!priv) {
>   pr_err("SPIDERPCI-IOWA:"
>  "Can't allocate struct spiderpci_iowa_private");
> - return -1;
> + return -ENOMEM;
>   }
>  
>   if (of_address_to_resource(np, 0, )) {

Re: [PATCH] dt-bindings: net: renesas-ravb: Add support for R8A77995 RAVB

2017-09-13 Thread Geert Uytterhoeven

On Wed, Sep 13, 2017 at 2:17 PM, Yoshihiro Shimoda
 wrote:
> Add a new compatible string for the R8A77995 (R-Car D3) RAVB.
>
> Signed-off-by: Yoshihiro Shimoda 

Acked-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

Re: [PATCH] tcp: TCP_USER_TIMEOUT can not work in tcp_probe_timer()

2017-09-13 Thread Eric Dumazet

On Wed, 2017-09-13 at 15:15 +0800, liujian wrote:
> 
> 在 2017/9/13 14:56, liujian 写道:
> > 
> > 
> > 在 2017/9/12 23:38, Eric Dumazet 写道:
> >> On Tue, 2017-09-12 at 08:05 -0700, Eric Dumazet wrote:
> >>> On Tue, 2017-09-12 at 14:08 +0800, liujian wrote:
>  Hi,
> 
>  In the scenario, tcp server side IP changed, and at that memont,
>  userspace application still send data continuously;
>  tcp_send_head(sk)'s timestamp always be refreshed.
> 
>  Here is the packetdrill script:
> 
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
> 
> +0 < S 0:0(0) win 0 
> +0 > S. 0:0(0) ack 1 
> 
>    +.1 < . 1:1(0) ack 1 win 65530
> +0 accept(3, ..., ...) = 4
> 
> +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
> +0 write(4, ..., 24) = 24
> +0 > P. 1:25(24) ack 1 win 229
> +.1 < . 1:1(0) ack 25 win 65530
> 
>  //change the ipaddress
> +1 `ifconfig tun0 192.168.0.10/16`
> 
> +1 write(4, ..., 24) = 24
> +1 write(4, ..., 24) = 24
> +1 write(4, ..., 24) = 24
> +1 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> +3 write(4, ..., 24) = 24
> 
> +0 `ifconfig tun0 192.168.0.1/16`
> +0 < . 1:1(0) ack 1 win 1000
> +0 write(4, ..., 24) = -1
> 
> 
> >>>
> >>> This has nothing to do with the code patch you have changed.
> >>>
> >>> How have you tested your patch exactly ?
> >>>
> > I tested the patch, it can work.
> > 
> > [root@localhost ~]# time ./gtests/net/packetdrill/packetdrill test.pkt
> > test.pkt:24: runtime error in write call: Expected result 24 but got -1 
> > with errno 110 (Connection timed out)
> > 
> > real0m5.356s
> > user0m0.026s
> > sys 0m0.104s
> > 
> > [root@localhost ~]# ss -toenmi src :8080
> > State   Recv-Q Send-Q   Local Address:Port  
> > Peer Address:Port
> > ESTAB   0  48 192.168.0.1:8080  
> >192.0.2.1:39559   timer:(persist,186ms,2) 
> > ino:37178 sk:6 <->
> >  skmem:(r0,rb369280,t0,tb87040,f1792,w2304,o0,bl0) sack cubic 
> > wscale:7,7 rto:301 backoff:2 rtt:100.253/37.643 mss:1460 cwnd:10 
> > bytes_acked:24 segs_out:5 segs_in:3 send 1.2Mbps lastsnd:976 lastrcv:4082 
> > lastack:3982 pacing_rate 2.3Mbps rcv_space:29200
> > 
> > 
> > if change the TCP_USER_TIMEOUT to 30s, test result as below:
> > 
> > [root@localhost ~]#   time ./gtests/net/packetdrill/packetdrill test.pkt
> > test.pkt:37: runtime error in write call: Expected result 24 but got -1 
> > with errno 110 (Connection timed out)
> > 
> > real0m44.362s
> > user0m0.018s
> > sys 0m0.110s
> > 
> > ESTAB   0  360 192.168.0.1:8080 
> >  192.0.2.1:47577 timer:(persist,516ms,6) 
> > ino:18806 sk:5 <->
> >  skmem:(r0,rb369280,t0,tb87040,f1792,w2304,o0,bl0) sack cubic 
> > wscale:7,7 rto:301 backoff:6 rtt:100.228/37.623 mss:10 bytes_acked:24 
> > segs_out:22 segs_in:3 send 1.2Mbps lastsnd:2343 lastrcv:40450 lastack:40350 
> > pacing_rate 2.3Mbps rcv_sp
> > 
> >>
> >> lpaa23:~# ss -toenmi src :8080
> >> State  Recv-Q Send-Q Local Address:Port   Peer
> >> Address:Port  
> >> ESTAB  0  144192.168.134.161:8080
> >> 192.0.2.1:51165   timer:(persist,8.262ms,5) ino:1
> >> 82083 sk:3 <->
> >> skmem:(r0,rb359040,t0,tb46080,f1792,w2304,o0,bl0,d0) sack cubic
> >> wscale:7,8 rto:301 backoff:5 rtt:100.127/37.576 
> >> mss:1460 rcvmss:536 advmss:1460 cwnd:10 bytes_acked:24 segs_out:12
> >> segs_in:3 data_segs_out:12 send 1.2Mbps lastsnd:1370 l
> >> astrcv:13348 lastack:13248 pacing_rate 2.3Mbps delivery_rate 116.7Kbps
> >> app_limited busy:11346ms rcv_space:29200 notsent:1
> >> 44 minrtt:100.043userspace application still send data continuously
> >>
> >> This is the typical RTO timer, not zero window probe.
> >>
> > with the script, it is not zero window; but the code enter tcp_probe_timer.
> > ->tcp_sendmsg
> > -->tcp_push
> >

Re: netfilter: xt_bpf: ABI issue in xt_bpf_info_v1?

2017-09-13 Thread Jan Engelhardt

On Wednesday 2017-09-13 15:24, Shmulik Ladkani wrote:
>
>One way to fix is to have iptables open the object (using the stored
>xt_bpf_info_v1->path), gaining a new process local fd for the object,
>just after getting the rules from IPT_SO_GET_ENTRIES.
>However we didn't see any other extensions doing something like that in
>iptables.
>
>Another way to solve is to fix the ABI (or have a v2 one), that does NOT
>pass the fd from userspace, only the path of the pinned object.
>Then, 'bpf_mt_check_v1' will open the file from the given path in order
>to get the bpf_prog.

But a path has a similar problem like a file descriptor - it is local to a
certain mount namespace.

To load "large" blobs into the kernel, a pointer to user memory is a possible
option. The downside is that such extra data is not retrievable from the kernel
via the iptables setsockopts anymore - one could work around it with procfs, or
just let it be.

https://sourceforge.net/p/xtables-addons/xtables-addons/ci/master/tree/extensions/xt_geoip.c
line 64+.

Re: [PATCH net] sctp: potential read out of bounds in sctp_ulpevent_type_enabled()

2017-09-13 Thread Neil Horman

On Wed, Sep 13, 2017 at 12:20:28PM +0300, Dan Carpenter wrote:
> This code causes a static checker warning because Smatch doesn't trust
> anything that comes from skb->data.  I've reviewed this code and I do
> think skb->data can be controlled by the user here.
> 
> The sctp_event_subscribe struct has 13 __u8 fields and we want to see
> if ours is non-zero.  sn_type can be any value in the 0-USHRT_MAX range.
> We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read
> either before the start of the struct or after the end.
> 
> This is a very old bug and it's surprising that it would go undetected
> for so long but my theory is that it just doesn't have a big impact so
> it would be hard to notice.
> 
> Signed-off-by: Dan Carpenter 
> ---
> I'm not terribly familiar with sctp and this is a static checker fix.
> Please review it carefully.
> 
> diff --git a/include/net/sctp/ulpevent.h b/include/net/sctp/ulpevent.h
> index 1060494ac230..e6873176bea7 100644
> --- a/include/net/sctp/ulpevent.h
> +++ b/include/net/sctp/ulpevent.h
> @@ -154,7 +154,11 @@ static inline int sctp_ulpevent_type_enabled(__u16 
> sn_type,
>struct sctp_event_subscribe *mask)
>  {
>   char *amask = (char *) mask;
> - return amask[sn_type - SCTP_SN_TYPE_BASE];
> + int offset = sn_type - SCTP_SN_TYPE_BASE;
> +
> + if (offset >= sizeof(struct sctp_event_subscribe))
> + return 0;
> + return amask[offset];
>  }
Acked-by: Neil Horman 

>  
>  /* Given an event subscription, is this event enabled? */
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Re: [oss-drivers] Re: [PATCH/RFC net-next 2/2] net/sched: allow flower to match tunnel options

2017-09-13 Thread Simon Horman

On Wed, Sep 13, 2017 at 03:38:01PM +0300, Or Gerlitz wrote:
> On Wed, Sep 13, 2017 at 2:59 PM, Simon Horman
>  wrote:
> > On Wed, Sep 13, 2017 at 01:03:43PM +0300, Or Gerlitz wrote:
> >> On Wed, Sep 13, 2017 at 12:25 PM, Simon Horman
> >>  wrote:
> >> > On Tue, Sep 12, 2017 at 11:23:55PM +0300, Or Gerlitz wrote:
> >> >> On Tue, Sep 12, 2017 at 5:20 PM, Simon Horman
> >> >>  wrote:
> >> >> > Allow matching on options in tunnel headers.
> >> >> > This makes use of existing tunnel metadata support.
> >> >>
> >> >> Simon,
> >> >>
> >> >> This patch is about matching on tunnel options, right? but
> >> >>
> >> >> > Options are a bytestring of up to 256 bytes.
> >> >> > Tunnel implementations may support less or more options,
> >> >> > or no options at all.
> >> >> >
> >> >> >  # ip link add name geneve0 type geneve dstport 0 external
> >> >> >  # tc qdisc add dev eth0 ingress
> >> >> >  # tc qdisc del dev eth0 ingress; tc qdisc add dev eth0 ingress
> >> >> >  # tc filter add dev eth0 protocol ip parent : \
> >> >> >  flower indev eth0 \
> >> >> > ip_proto udp \
> >> >> > action tunnel_key \
> >> >> > set src_ip 10.0.99.192 \
> >> >> > dst_ip 10.0.99.193 \
> >> >> > dst_port 4789 \
> >> >> > id 11 \
> >> >> > opts 0102800100800022 \
> >> >> > action mirred egress redirect dev geneve0
> >> >>
> >> >> the example here is on how to use tunnel options in the tunnel set key 
> >> >> actions..
> >> >>
> >> >> And the other way around in the other patch... the patch is about the
> >> >> tunnel key set action and the example shows how to match that in
> >> >> flower... I guess you want to swap the relevant of the change log.
> >> >
> >> > Yes, it seems so. Sorry about that.
> >>
> >> no worries, you can do the swap, but before that
> >>
> >> >> Anyway, is there any human readable/understandable representation of
> >> >> these options? e.g what does 0102800100800022 means for geneve?
> >>
> >> > Thanks, I had not thought of that. My feeling is that could
> >> > be added to the tc tool as follow-up work.
> >>
> >> could you spend few words on the nature of these options now when we review
> >> the kernel patches? I guess this is somehow related to protocols such
> >> as geneve and vxlan-gpe  -- it would be good if you elaborate on that
> >> a bit, does
> >> the kernel does any usage with these options beyond matching on them or 
> >> stiching
> >> them to packet headers?
> >
> > This feature is to be used in conjunction with tunnels in collect metadata
> > (external) mode. As I understand it there are three tunnel netdevs that use
> > options metadata in the kernel at this time.
> >
> > * Geneve
> >
> >   In the case of Geneve options are TLVs[1]. My reading is that in collect
> >   metadata mode the kernel does not appear to do anything other than pass
> >   them around as a bytestring.
> >
> >   [1] https://tools.ietf.org/html/draft-ietf-nvo3-geneve-05#section-3.5
> >
> > * VXLAN-GBP
> >
> >   In the case of VXLAN-GBP on RX in collect metadata mode options are used
> >   to carry information parsed in vxlan_parse_gbp_hdr() from the VXLAN Group
> >   Based Policy Extension[2]. On RX the options data is used to create an
> >   extension (header) by vxlan_build_gbp_hdr().
> >
> >   [2] 
> > https://tools.ietf.org/html/draft-smith-vxlan-group-policy-03#section-2.1
> >
> > * ERSPAN (GRE)
> >
> >   In the case of ERSPAN, which is a variant of GRE, on RX in collect
> >   metadata mode options are used to carry the index parsed from the ERSPAN
> >   Type II feature header[3] in erspan_rcv().  The converse is true on TX
> >   and is handled by erspan_fb_xmit().
> >
> >   [3] https://tools.ietf.org/html/draft-foschiano-erspan-03#section-4.2
> >
> > Users of options:
> >
> > * There are eBPF hooks to allow getting on and setting tunnel metadata:
> >   bpf_skb_set_tunnel_opt, bpf_skb_get_tunnel_opt.
> >
> > * Open vSwitch is able to match and set Geneve and VXLAN-GBP options.
> >
> > Neither of the above appear to assume any structure for the data.
> >
> > I hope the above is the kind of information you are after.
> 
> yeah, this helps, do we expect HW offloads to be able to work with
> these options? e.g place/match on packets
> or even do something beyond that?

Perhaps unsurprisingly given my posting I would like to offload
"place/match" of options. I do not currently have any plans beyond that
and my crystal ball broke some time ago.

Re: [RFC PATCH v3 7/7] i40e: Enable cloud filters via tc-flower

2017-09-13 Thread Jiri Pirko

Wed, Sep 13, 2017 at 11:59:50AM CEST, amritha.namb...@intel.com wrote:
>This patch enables tc-flower based hardware offloads. tc flower
>filter provided by the kernel is configured as driver specific
>cloud filter. The patch implements functions and admin queue
>commands needed to support cloud filters in the driver and
>adds cloud filters to configure these tc-flower filters.
>
>The only action supported is to redirect packets to a traffic class
>on the same device.

So basically you are not doing redirect, you are just setting tclass for
matched packets, right? Why you use mirred for this? I think that
you might consider extending g_act for that:

# tc filter add dev eth0 protocol ip ingress \
  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw \
  action tclass 0


>
># tc qdisc add dev eth0 ingress
># ethtool -K eth0 hw-tc-offload on
>
># tc filter add dev eth0 protocol ip parent :\
>  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\
>  action mirred ingress redirect dev eth0 tclass 0
>
># tc filter add dev eth0 protocol ip parent :\
>  prio 2 flower dst_ip 192.168.3.5/32\
>  ip_proto udp dst_port 25 skip_sw\
>  action mirred ingress redirect dev eth0 tclass 1
>
># tc filter add dev eth0 protocol ipv6 parent :\
>  prio 3 flower dst_ip fe8::200:1\
>  ip_proto udp dst_port 66 skip_sw\
>  action mirred ingress redirect dev eth0 tclass 1
>
>Delete tc flower filter:
>Example:
>
># tc filter del dev eth0 parent : prio 3 handle 0x1 flower
># tc filter del dev eth0 parent :
>
>Flow Director Sideband is disabled while configuring cloud filters
>via tc-flower and until any cloud filter exists.
>
>Unsupported matches when cloud filters are added using enhanced
>big buffer cloud filter mode of underlying switch include:
>1. source port and source IP
>2. Combined MAC address and IP fields.
>3. Not specifying L4 port
>
>These filter matches can however be used to redirect traffic to
>the main VSI (tc 0) which does not require the enhanced big buffer
>cloud filter support.
>
>v3: Cleaned up some lengthy function names. Changed ipv6 address to
>__be32 array instead of u8 array. Used macro for IP version. Minor
>formatting changes.
>v2:
>1. Moved I40E_SWITCH_MODE_MASK definition to i40e_type.h
>2. Moved dev_info for add/deleting cloud filters in else condition
>3. Fixed some format specifier in dev_err logs
>4. Refactored i40e_get_capabilities to take an additional
>   list_type parameter and use it to query device and function
>   level capabilities.
>5. Fixed parsing tc redirect action to check for the is_tcf_mirred_tc()
>   to verify if redirect to a traffic class is supported.
>6. Added comments for Geneve fix in cloud filter big buffer AQ
>   function definitions.
>7. Cleaned up setup_tc interface to rebase and work with Jiri's
>   updates, separate function to process tc cls flower offloads.
>8. Changes to make Flow Director Sideband and Cloud filters mutually
>   exclusive.
>
>Signed-off-by: Amritha Nambiar 
>Signed-off-by: Kiran Patil 
>Signed-off-by: Anjali Singhai Jain 
>Signed-off-by: Jingjing Wu 
>---
> drivers/net/ethernet/intel/i40e/i40e.h |   49 +
> drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |3 
> drivers/net/ethernet/intel/i40e/i40e_common.c  |  189 
> drivers/net/ethernet/intel/i40e/i40e_main.c|  971 +++-
> drivers/net/ethernet/intel/i40e/i40e_prototype.h   |   16 
> drivers/net/ethernet/intel/i40e/i40e_type.h|1 
> .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h|3 
> 7 files changed, 1202 insertions(+), 30 deletions(-)
>
>diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
>b/drivers/net/ethernet/intel/i40e/i40e.h
>index 6018fb6..b110519 100644
>--- a/drivers/net/ethernet/intel/i40e/i40e.h
>+++ b/drivers/net/ethernet/intel/i40e/i40e.h
>@@ -55,6 +55,8 @@
> #include 
> #include 
> #include 
>+#include 
>+#include 
> #include "i40e_type.h"
> #include "i40e_prototype.h"
> #include "i40e_client.h"
>@@ -252,9 +254,52 @@ struct i40e_fdir_filter {
>   u32 fd_id;
> };
> 
>+#define IPV4_VERSION 4
>+#define IPV6_VERSION 6
>+
>+#define I40E_CLOUD_FIELD_OMAC 0x01
>+#define I40E_CLOUD_FIELD_IMAC 0x02
>+#define I40E_CLOUD_FIELD_IVLAN0x04
>+#define I40E_CLOUD_FIELD_TEN_ID   0x08
>+#define I40E_CLOUD_FIELD_IIP  0x10
>+
>+#define I40E_CLOUD_FILTER_FLAGS_OMAC  I40E_CLOUD_FIELD_OMAC
>+#define I40E_CLOUD_FILTER_FLAGS_IMAC  I40E_CLOUD_FIELD_IMAC
>+#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN(I40E_CLOUD_FIELD_IMAC | \
>+   I40E_CLOUD_FIELD_IVLAN)
>+#define I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID   (I40E_CLOUD_FIELD_IMAC | \
>+   I40E_CLOUD_FIELD_TEN_ID)
>+#define I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC (I40E_CLOUD_FIELD_OMAC | \
>+I40E_CLOUD_FIELD_IMAC | \
>+

netfilter: xt_bpf: ABI issue in xt_bpf_info_v1?

2017-09-13 Thread Shmulik Ladkani

Hi,

Commit 2c16d60 'netfilter: xt_bpf: support ebpf' introduced
'xt_bpf_info_v1', to support attaching an eBPF object by fd.

Alas, seems this ABI is problematic, as the 'fd', which is local to the
process attaching the ebpf object (namely iptables) is stored in the
matchinfo structure.

This leads to subsequent invocation of iptables to fail:

# iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
# iptables -A INPUT -s 5.6.7.8 -j ACCEPT
iptables: Invalid argument. Run `dmesg' for more information.

Tracing 2nd invocation shows:

setsockopt(4, SOL_IP, IPT_SO_SET_REPLACE, 
"filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 5400) = -1 
EINVAL

In the 2nd invocation, iptables loads existing rules from the kernel
using IPT_SO_GET_ENTRIES, adds the new innocent one, and then issues
IPT_SO_SET_REPLACE.
However the existing rules contain the former ebpf rule which contains
the fd number from the initial 'iptables -m bpf' invocation.

Thus, when IPT_SO_SET_REPLACE eventually hits 'bpf_mt_check_v1', and
'bpf_prog_get_type' is called, the arbitrary fd has either no associated
file, or has no associated bpf_prog - leading 'bpf_prog_get_type' to
fail.

One way to fix is to have iptables open the object (using the stored
xt_bpf_info_v1->path), gaining a new process local fd for the object,
just after getting the rules from IPT_SO_GET_ENTRIES.
However we didn't see any other extensions doing something like that in
iptables.

Another way to solve is to fix the ABI (or have a v2 one), that does NOT
pass the fd from userspace, only the path of the pinned object.
Then, 'bpf_mt_check_v1' will open the file from the given path in order
to get the bpf_prog.

Please advise.

Best,
Shmulik

Re: [RFC PATCH v3 2/7] sched: act_mirred: Traffic class option for mirror/redirect action

2017-09-13 Thread Jiri Pirko

Wed, Sep 13, 2017 at 11:59:24AM CEST, amritha.namb...@intel.com wrote:
>Adds optional traffic class parameter to the mirror/redirect action.
>The mirror/redirect action is extended to forward to a traffic
>class on the device if the traffic class index is provided in
>addition to the device's ifindex.

Do I understand it correctly that you just abuse mirred to pas tcclass
index down to the driver, without actually doing anything with the value
inside mirred-code ? That is a bit confusing for me.

1 2 >

1 - 100 of 155 matches

Mail list logo