date:20170807

[PATCH net-next] cxgb4: Clear On FLASH config file after a FW upgrade

2017-08-07 Thread Ganesh Goudar

From: Arjun Vynipadath 

Because the Firmware and the Firmware Configuration File need to be
in sync, clear out any On-FLASH Firmware Configuration File when new
Firmware is loaded.  This avoids difficult-to-diagnose problems with a
mismatched Firmware Configuration File, which can prevent the adapter
from being initialized.

Original work by: Casey Leedom 
Signed-off-by: Arjun Vynipadath 
Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |  1 +
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 70 ++
 2 files changed, 71 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 1978abb..daa3775 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1405,6 +1405,7 @@ int t4_fw_upgrade(struct adapter *adap, unsigned int mbox,
 int t4_fl_pkt_align(struct adapter *adap);
 unsigned int t4_flash_cfg_addr(struct adapter *adapter);
 int t4_check_fw_version(struct adapter *adap);
+int t4_load_cfg(struct adapter *adapter, const u8 *cfg_data, unsigned int size);
 int t4_get_fw_version(struct adapter *adapter, u32 *vers);
 int t4_get_bs_version(struct adapter *adapter, u32 *vers);
 int t4_get_tp_version(struct adapter *adapter, u32 *vers);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index 24087c8..fff8fba 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -6664,6 +6664,17 @@ int t4_fw_upgrade(struct adapter *adap, unsigned int mbox,
goto out;
 
/*
+* If there was a Firmware Configuration File stored in FLASH,
+* there's a good chance that it won't be compatible with the new
+* Firmware.  In order to prevent difficult to diagnose adapter
+* initialization issues, we clear out the Firmware Configuration File
+* portion of the FLASH.  The user will need to re-FLASH a new
+* Firmware Configuration File which is compatible with the new
+* Firmware if that's desired.
+*/
+   (void)t4_load_cfg(adap, NULL, 0);
+
+   /*
 * Older versions of the firmware don't understand the new
 * PCIE_FW.HALT flag and so won't know to perform a RESET when they
 * restart.  So for newly loaded older firmware we'll have to do the
@@ -8896,6 +8907,65 @@ void t4_idma_monitor(struct adapter *adapter,
 }
 
 /**
+ * t4_load_cfg - download config file
+ * @adap: the adapter
+ * @cfg_data: the cfg text file to write
+ * @size: text file size
+ *
+ * Write the supplied config text file to the card's serial flash.
+ */
+int t4_load_cfg(struct adapter *adap, const u8 *cfg_data, unsigned int size)
+{
+   int ret, i, n, cfg_addr;
+   unsigned int addr;
+   unsigned int flash_cfg_start_sec;
+   unsigned int sf_sec_size = adap->params.sf_size / adap->params.sf_nsec;
+
+   cfg_addr = t4_flash_cfg_addr(adap);
+   if (cfg_addr < 0)
+   return cfg_addr;
+
+   addr = cfg_addr;
+   flash_cfg_start_sec = addr / SF_SEC_SIZE;
+
+   if (size > FLASH_CFG_MAX_SIZE) {
+   dev_err(adap->pdev_dev, "cfg file too large, max is %u bytes\n",
+   FLASH_CFG_MAX_SIZE);
+   return -EFBIG;
+   }
+
+   i = DIV_ROUND_UP(FLASH_CFG_MAX_SIZE,/* # of sectors spanned */
+sf_sec_size);
+   ret = t4_flash_erase_sectors(adap, flash_cfg_start_sec,
+flash_cfg_start_sec + i - 1);
+   /* If size == 0 then we're simply erasing the FLASH sectors associated
+* with the on-adapter Firmware Configuration File.
+*/
+   if (ret || size == 0)
+   goto out;
+
+   /* this will write to the flash up to SF_PAGE_SIZE at a time */
+   for (i = 0; i < size; i += SF_PAGE_SIZE) {
+   if ((size - i) <  SF_PAGE_SIZE)
+   n = size - i;
+   else
+   n = SF_PAGE_SIZE;
+   ret = t4_write_flash(adap, addr, n, cfg_data);
+   if (ret)
+   goto out;
+
+   addr += SF_PAGE_SIZE;
+   cfg_data += SF_PAGE_SIZE;
+   }
+
+out:
+   if (ret)
+   dev_err(adap->pdev_dev, "config file %s failed %d\n",
+   (size == 0 ? "clear" : "download"), ret);
+   return ret;
+}
+
+/**
  * t4_set_vf_mac - Set MAC address for the specified VF
  * @adapter: The adapter
  * @vf: one of the VFs instantiated by the specified PF
-- 
2.1.0
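
The erase-then-write arithmetic in t4_load_cfg() above can be sketched in plain C. This is only an illustration, not the driver code: the ex_ names are hypothetical, and the 256-byte page size is an assumed value chosen for the example (the real constants come from the cxgb4 headers).

```c
/* Illustrative page size only; the driver uses SF_PAGE_SIZE from its headers. */
#define EX_SF_PAGE_SIZE 256u

#define EX_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Number of flash sectors that must be erased to cover max_size bytes. */
static unsigned int ex_sectors_spanned(unsigned int max_size,
				       unsigned int sec_size)
{
	return EX_DIV_ROUND_UP(max_size, sec_size);
}

/* Length of the write that starts at byte offset off of a size-byte file. */
static unsigned int ex_chunk_len(unsigned int size, unsigned int off)
{
	unsigned int left = size - off;

	return left < EX_SF_PAGE_SIZE ? left : EX_SF_PAGE_SIZE;
}

/* Number of page-sized writes issued; zero means "erase only". */
static unsigned int ex_write_count(unsigned int size)
{
	unsigned int off, n = 0;

	for (off = 0; off < size; off += EX_SF_PAGE_SIZE)
		n++;
	return n;
}
```

With size == 0 the write loop never runs, which is how the t4_load_cfg(adap, NULL, 0) call in t4_fw_upgrade() ends up only erasing the configuration sectors.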

Re: [PATCHv2 net] net: sched: set xt_tgchk_param par.net properly in ipt_init_target

2017-08-07 Thread Jiri Pirko

Tue, Aug 08, 2017 at 04:13:27AM CEST, lucien@gmail.com wrote:
>Now xt_tgchk_param par in ipt_init_target is a local variable,
>par.net is not initialized there. Later when xt_check_target
>calls target's checkentry in which it may access par.net, it
>would cause kernel panic.
>
>Jaroslav found this panic when running:
>
>  # ip link add TestIface type dummy
>  # tc qd add dev TestIface ingress handle :
>  # tc filter add dev TestIface parent : u32 match u32 0 0 \
>action xt -j CONNMARK --set-mark 4
>
>This patch is to pass net param into ipt_init_target and set
>par.net with it properly in there.
>
>v1->v2:
>  As Wang Cong pointed, I missed ipt_net_id != xt_net_id, so fix
>  it by also passing net_id to __tcf_ipt_init.
>
>Reported-by: Jaroslav Aster 
>Signed-off-by: Xin Long 

Fixes what? You need to have "Fixes" tag for net patches.

Re: [Patch net-next] net_sched: get rid of some forward declarations

2017-08-07 Thread Jiri Pirko

Tue, Aug 08, 2017 at 12:26:50AM CEST, xiyou.wangc...@gmail.com wrote:
>If we move up tcf_fill_node() we can get rid of these
>forward declarations.
>
>Also, move down tfilter_notify_chain() to group them together.
>
>Reported-by: Jamal Hadi Salim 
>Cc: Jamal Hadi Salim 
>Signed-off-by: Cong Wang 

Acked-by: Jiri Pirko

Re: [PATCH] net: Reduce skb_warn_bad_offload() noise.

2017-08-07 Thread Tonghao Zhang

Hi Willem

There is another case that also produces the warning. The test topology is shown below.

VM01: veth1 and eth0 in the VM01 are inserted to ovs br0.
veth0(IP: 172.16.34.100/24) -- veth1--br0--eth0

iperf3  -c 172.168.100.13 -i 1 -P 10 -t 10 -u -b 1000M -l 10K



VM02
eth0(IP: 172.16.34.200/24)
iperf3  -s

The warning is shown below [1]. If we change CHECKSUM_NONE to
CHECKSUM_UNNECESSARY in udp4_ufo_fragment(), we should also add a
check in skb_needs_check() for the transmit path.

diff --git a/net/core/dev.c b/net/core/dev.c
index 416137c..8fe12a7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2670,6 +2670,7 @@ static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
 {
if (tx_path)
return skb->ip_summed != CHECKSUM_PARTIAL &&
+  skb->ip_summed != CHECKSUM_UNNECESSARY &&
   skb->ip_summed != CHECKSUM_NONE;

return skb->ip_summed == CHECKSUM_NONE;
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 7812501..0932c85 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -235,7 +235,7 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
if (uh->check == 0)
uh->check = CSUM_MANGLED_0;

-   skb->ip_summed = CHECKSUM_NONE;
+   skb->ip_summed = CHECKSUM_UNNECESSARY;

/* If there is no outer header we can fake a checksum offload
 * due to the fact that we have already done the checksum in


[1]:
[ 1291.596232] vmxnet3: caps=(0x006000214ba9, 0x)
len=10282 data_len=10240 gso_size=1480 gso_type=2 ip_summed=1
[ 1291.596239] [ cut here ]
[ 1291.596242] WARNING: CPU: 1 PID: 2203 at net/core/dev.c:2564
skb_warn_bad_offload+0xd3/0xde
[ 1291.596242] Modules linked in: veth udp_tunnel gre openvswitch
nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4
nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack cfg80211 rfkill ext4
jbd2 mbcache sb_edac coretemp crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper ppdev
vmw_balloon cryptd vmw_vmci sg i2c_piix4 pcspkr parport_pc parport
shpchp ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi vmwgfx
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm
crc32c_intel serio_raw vmxnet3 ata_piix mptspi libata
scsi_transport_spi mptscsih mptbase i2c_core floppy dm_mirror
dm_region_hash dm_log dm_mod
[ 1291.596280] CPU: 1 PID: 2203 Comm: iperf3 Tainted: GW
4.12.0+ #1
[ 1291.596280] Hardware name: VMware, Inc. VMware Virtual
Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[ 1291.596281] task: 8be36f5e9680 task.stack: b6c840d44000
[ 1291.596283] RIP: 0010:skb_warn_bad_offload+0xd3/0xde
[ 1291.596284] RSP: 0018:8be3796438c8 EFLAGS: 00010246
[ 1291.596285] RAX: 0074 RBX: 8be363ff9f00 RCX: 0006
[ 1291.596286] RDX:  RSI: 0086 RDI: 8be37964e0a0
[ 1291.596287] RBP: 8be3796438f0 R08:  R09: 0bf2
[ 1291.596287] R10: 0004 R11: 0bf1 R12: 8be3660f
[ 1291.596288] R13: 0001 R14:  R15: 8be3660f
[ 1291.596289] FS:  7f8d93bdb740() GS:8be37964()
knlGS:
[ 1291.596290] CS:  0010 DS:  ES:  CR0: 80050033
[ 1291.596291] CR2: 555f94a6e0c8 CR3: 00012b84a000 CR4: 001406e0
[ 1291.596296] Call Trace:
[ 1291.596297]  
[ 1291.596300]  __skb_gso_segment+0x15d/0x170
[ 1291.596301]  validate_xmit_skb+0x12d/0x2b0
[ 1291.596303]  validate_xmit_skb_list+0x42/0x70
[ 1291.596306]  sch_direct_xmit+0xd0/0x1b0
[ 1291.596308]  __dev_queue_xmit+0x42e/0x630
[ 1291.596310]  ? hrtimer_interrupt+0xd2/0x1a0
[ 1291.596312]  dev_queue_xmit+0x10/0x20
[ 1291.596315]  ovs_vport_send+0xc2/0x150 [openvswitch]
[ 1291.596318]  do_output+0x53/0xf0 [openvswitch]
[ 1291.596322]  do_execute_actions+0x9bc/0x9d0 [openvswitch]
[ 1291.596324]  ? __bpf_prog_run+0x385/0x1310
[ 1291.596327]  ovs_execute_actions+0x40/0x120 [openvswitch]
[ 1291.596330]  ovs_dp_process_packet+0x84/0x120 [openvswitch]
[ 1291.596333]  ? ovs_ct_update_key+0x9a/0xe0 [openvswitch]
[ 1291.596336]  ovs_vport_receive+0x73/0xd0 [openvswitch]
[ 1291.596339]  ? handle_irq_event_percpu+0x54/0x80
[ 1291.596340]  ? handle_irq_event+0x46/0x60
[ 1291.596342]  ? handle_edge_irq+0x8d/0x130
[ 1291.596344]  ? handle_irq+0xab/0x120
[ 1291.596346]  ? irq_exit+0x77/0xf0
[ 1291.596348]  ? do_IRQ+0x51/0xd0
[ 1291.596352]  netdev_frame_hook+0xd3/0x160 [openvswitch]
[ 1291.596355]  __netif_receive_skb_core+0x1da/0x9e0
[ 1291.596358]  ? vport_netdev_free+0x30/0x30 [openvswitch]
[ 1291.596360]  ? kfree_skbmem+0x5a/0x60
[ 1291.596361]  ? consume_skb+0x34/0x90
[ 1291.596363]  __netif_receive_skb+0x18/0x60
[ 1291.596365]  process_backlog+0x95/0x140
[ 1291.596367]  net_rx_action+0x26c/0x3b0
[ 1291.596369]  __do_softirq+0xc9/0x269
[ 1291.596371]
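
The effect of the proposed skb_needs_check() change can be sketched outside the kernel. This is a minimal userspace model, not kernel code: the ex_ names are hypothetical, and the enum values are assumed to mirror the kernel's CHECKSUM_* definitions (which matches the ip_summed=1 in the log above being CHECKSUM_UNNECESSARY).

```c
#include <stdbool.h>

/* Assumed to mirror the kernel's CHECKSUM_* values. */
enum ex_csum {
	EX_CHECKSUM_NONE = 0,
	EX_CHECKSUM_UNNECESSARY = 1,
	EX_CHECKSUM_COMPLETE = 2,
	EX_CHECKSUM_PARTIAL = 3
};

/* Predicate as it appears in the context lines of the diff. */
static bool ex_needs_check(enum ex_csum s, bool tx_path)
{
	if (tx_path)
		return s != EX_CHECKSUM_PARTIAL && s != EX_CHECKSUM_NONE;

	return s == EX_CHECKSUM_NONE;
}

/* Predicate with the extra CHECKSUM_UNNECESSARY test from the diff. */
static bool ex_needs_check_proposed(enum ex_csum s, bool tx_path)
{
	if (tx_path)
		return s != EX_CHECKSUM_PARTIAL &&
		       s != EX_CHECKSUM_UNNECESSARY &&
		       s != EX_CHECKSUM_NONE;

	return s == EX_CHECKSUM_NONE;
}
```

On the transmit path, an skb marked CHECKSUM_UNNECESSARY is flagged by the first predicate but not by the second; the receive path behaves the same in both.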

[PATCH net-next] openvswitch: add NSH support

2017-08-07 Thread Yi Yang

The OVS master and 2.8 branches have merged the NSH userspace
patch series. This patch enables NSH support in the kernel
datapath so that OVS can support NSH in the 2.8 release in
compat mode by porting it.

Signed-off-by: Yi Yang 
---
 drivers/net/vxlan.c  |   7 ++
 include/net/nsh.h| 126 ++
 include/uapi/linux/openvswitch.h |  33 
 net/openvswitch/actions.c| 165 +++
 net/openvswitch/flow.c   |  41 ++
 net/openvswitch/flow.h   |   1 +
 net/openvswitch/flow_netlink.c   |  54 -
 7 files changed, 426 insertions(+), 1 deletion(-)
 create mode 100644 include/net/nsh.h

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index dbca067..843714c 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #if IS_ENABLED(CONFIG_IPV6)
 #include 
@@ -1267,6 +1268,9 @@ static bool vxlan_parse_gpe_hdr(struct vxlanhdr *unparsed,
case VXLAN_GPE_NP_IPV6:
*protocol = htons(ETH_P_IPV6);
break;
+   case VXLAN_GPE_NP_NSH:
+   *protocol = htons(ETH_P_NSH);
+   break;
case VXLAN_GPE_NP_ETHERNET:
*protocol = htons(ETH_P_TEB);
break;
@@ -1806,6 +1810,9 @@ static int vxlan_build_gpe_hdr(struct vxlanhdr *vxh, u32 vxflags,
case htons(ETH_P_IPV6):
gpe->next_protocol = VXLAN_GPE_NP_IPV6;
return 0;
+   case htons(ETH_P_NSH):
+   gpe->next_protocol = VXLAN_GPE_NP_NSH;
+   return 0;
case htons(ETH_P_TEB):
gpe->next_protocol = VXLAN_GPE_NP_ETHERNET;
return 0;
diff --git a/include/net/nsh.h b/include/net/nsh.h
new file mode 100644
index 000..96477a1
--- /dev/null
+++ b/include/net/nsh.h
@@ -0,0 +1,126 @@
+#ifndef __NET_NSH_H
+#define __NET_NSH_H 1
+
+
+/*
+ * Network Service Header:
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |Ver|O|C|R|R|R|R|R|R|Length   |   MD Type   |  Next Proto   |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |Service Path ID| Service Index |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |   |
+ * ~   Mandatory/Optional Context Header   ~
+ * |   |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * Ver = The version field is used to ensure backward compatibility
+ *   going forward with future NSH updates.  It MUST be set to 0x0
+ *   by the sender, in this first revision of NSH.
+ *
+ * O = OAM. when set to 0x1 indicates that this packet is an operations
+ * and management (OAM) packet.  The receiving SFF and SFs nodes
+ * MUST examine the payload and take appropriate action.
+ *
+ * C = context. Indicates that a critical metadata TLV is present.
+ *
+ * Length : total length, in 4-byte words, of NSH including the Base
+ *  Header, the Service Path Header and the optional variable
+ *  TLVs.
+ * MD Type: indicates the format of NSH beyond the mandatory Base Header
+ *  and the Service Path Header.
+ *
+ * Next Protocol: indicates the protocol type of the original packet. A
+ *  new IANA registry will be created for protocol type.
+ *
+ * Service Path Identifier (SPI): identifies a service path.
+ *  Participating nodes MUST use this identifier for Service
+ *  Function Path selection.
+ *
+ * Service Index (SI): provides location within the SFP.
+ *
+ * [0] https://tools.ietf.org/html/draft-ietf-sfc-nsh-13
+ */
+
+/**
+ * struct nsh_md1_ctx - Keeps track of NSH context data
+ * @c: NSH contexts (four 32-bit words).
+ */
+struct nsh_md1_ctx {
+   __be32 c[4];
+};
+
+struct nsh_md2_tlv {
+   __be16 md_class;
+   u8 type;
+   u8 length;
+   u8 md_value[];
+};
+
+struct nsh_hdr {
+   __be16 ver_flags_len;
+   u8 md_type;
+   u8 next_proto;
+   __be32 path_hdr;
+   union {
+   struct nsh_md1_ctx md1;
+   struct nsh_md2_tlv md2[0];
+   };
+};
+
+/* Masking NSH header fields. */
+#define NSH_VER_MASK    0xc000
+#define NSH_VER_SHIFT   14
+#define NSH_FLAGS_MASK  0x3fc0
+#define NSH_FLAGS_SHIFT 6
+#define NSH_LEN_MASK    0x003f
+#define NSH_LEN_SHIFT   0
+
+#define NSH_SPI_MASK    0xff00
+#define NSH_SPI_SHIFT   8
+#define NSH_SI_MASK     0x00ff
+#define NSH_SI_SHIFT    0
+
+#define NSH_DST_PORT    4790     /* UDP Port for NSH on VXLAN. */
+#define ETH_P_NSH       0x894F   /* Ethertype for NSH. */
+
+/* NSH Base Header Next Protocol. */
+#define NSH_P_IPV4      0x01
+#define NSH_P_IPV6      0x02
+#define NSH_P_ETHERNET  0x03
+#define NSH_P_NSH
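
How the base-header masks above would be applied can be sketched as follows. This is a minimal sketch, not part of the patch: the ex_ helper names are hypothetical, the mask and shift values are copied from the header above, and the input is assumed to already be in host byte order (real code would convert the __be16 field first, for example with ntohs()).

```c
#include <stdint.h>

/* Mask/shift values copied from the proposed include/net/nsh.h. */
#define EX_NSH_VER_MASK    0xc000
#define EX_NSH_VER_SHIFT   14
#define EX_NSH_FLAGS_MASK  0x3fc0
#define EX_NSH_FLAGS_SHIFT 6
#define EX_NSH_LEN_MASK    0x003f
#define EX_NSH_LEN_SHIFT   0

/* Version field: the top two bits of ver_flags_len. */
static unsigned int ex_nsh_ver(uint16_t ver_flags_len)
{
	return (ver_flags_len & EX_NSH_VER_MASK) >> EX_NSH_VER_SHIFT;
}

/* Flags field: the eight bits between the version and the length. */
static unsigned int ex_nsh_flags(uint16_t ver_flags_len)
{
	return (ver_flags_len & EX_NSH_FLAGS_MASK) >> EX_NSH_FLAGS_SHIFT;
}

/* Header length in bytes; the Length field counts 4-byte words. */
static unsigned int ex_nsh_len_bytes(uint16_t ver_flags_len)
{
	return ((ver_flags_len & EX_NSH_LEN_MASK) >> EX_NSH_LEN_SHIFT) * 4;
}
```

For instance, a Length field of 6 words corresponds to a 24-byte header.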

Re: [PATCH 0/6] In-kernel QMI handling

2017-08-07 Thread Bjorn Andersson

On Mon 07 Aug 12:19 PDT 2017, Marcel Holtmann wrote:

> Hi Bjorn,
> 
> >>> This series starts by moving the common definitions of the QMUX
> >>> protocol to the
> >>> uapi header, as they are shared with clients - both in kernel and
> >>> userspace.
> >>> 
> >>> This series then introduces in-kernel helper functions for aiding the
> >>> handling
> >>> of QMI encoded messages in the kernel. QMI encoding is a wire-format
> >>> used in
> >>> exchanging messages between the majority of QRTR clients and
> >>> services.
> >> 
> >> This raises a few red-flags for me.
> > 
> > I'm glad it does. In discussions with the responsible team within
> > Qualcomm I've highlighted a number of concerns about enabling this
> > support in the kernel. Together we're continuously looking into what
> > should be pushed out to user space, and trying to not introduce
> > unnecessary new users.
> > 
> >> So far, we've kept almost everything QMI related in userspace and
> >> handled all QMI control-channel messages from libraries like libqmi or
> >> uqmi via the cdc-wdm driver and the "rmnet" interface via the qmi_wwan
> >> driver.  The kernel drivers just serve as the transport.
> >> 
> > 
> > The path that was taken to support the MSM-style devices was to
> > implement net/qrtr, which exposes a socket interface to abstract the
> > physical transports (QMUX or IPCROUTER in Qualcomm terminology).
> > 
> > As I share your view on letting the kernel handle the transportation only,
> > the task of keeping track of registered services (service id -> node and
> > port mapping) was done in a user space process and so far we've only
> > ever had to deal with QMI encoded messages in various user space tools.
> 
> I think that the transport and multiplexing can be in the kernel as
> long as it is done as proper subsystem. Similar to Phonet or CAIF.
> Meaning it should have a well defined socket interface that can be
> easily used from userspace, but also a clean in-kernel interface
> handling.
> 

In a mobile Qualcomm device there are a few different components involved
here: message routing, the QMUX protocol and QMI encoding.

The downstream Qualcomm kernel implements the first two in the
IPCROUTER; upstream this is split between the kernel's net/qrtr and a user
space service register implementing the QMUX protocol for knowing where
services are located.

The common encoding of messages passed between endpoints of the message
routing is QMI, which is left entirely to each client.

> If Qualcomm is supportive of this effort and is willing to actually
> assist and/or open some of the specs or interface descriptions, then
> this is a good thing. Service registration and cleanup is really done
> best in the kernel. Same applies to multiplexing. Trying to do
> multiplexing in userspace is always cumbersome and leads to overhead
> that is of no gain. For example within oFono, we had to force
> everything to go via oFono since it was the only sane way of handling
> it. Other approaches were error prone and full of race conditions. You
> need a central entity that can clean up.
> 

The current upstream solution depends on a collaboration between
net/qrtr and the user space service register for figuring out whom to
send messages to. After that muxing et al is handled by the socket
interface and service registry does not need to be involved.

Qualcomm is very supportive of this solution and we're collaborating on
transitioning "downstream" to use this implementation.

> For the definition of an UAPI to share some code, I am actually not
> sure that is such a good idea. For example the QMI code in oFono
> follows a way simpler approach. And I am not convinced that all the
> macros are actually beneficial. For example, the whole netlink macros
> are pretty cumbersome. Adding some Documentation/qmi.txt on how the
> wire format looks like and what is expected seems to be a way better
> approach.
> 

The socket interface provided by the kernel expects some knowledge of
the QMUX protocol, for service management. The majority of this
knowledge is already public, but I agree that it would be good to gather
this in a document. The common data structure for the control message is
what I've put in the uapi, as this is used by anyone dealing with
control messages.

When it comes to the QMI-encoded messages these are application
specific, just like e.g. protobuf definitions are application specific.

As the core infrastructure is becoming available upstream and boards
like the DB410c and DB820c aim to be supported by open solutions we will
have a natural place to discuss publication of at least some of the
application level protocols.

Regards,
Bjorn

Re: [PATCH net-next 03/14] sctp: remove the typedef sctp_scope_policy_t

2017-08-07 Thread Xin Long

On Mon, Aug 7, 2017 at 9:28 PM, David Laight  wrote:
> From: Xin Long
>> Sent: 05 August 2017 13:00
>> This patch is to remove the typedef sctp_scope_policy_t and keep
>> its members as an anonymous enum.
>>
>> It is also to define SCTP_SCOPE_POLICY_MAX to replace the num 3
>> in sysctl.c to make the code clearer.
>>
>> Signed-off-by: Xin Long 
>> ---
>>  include/net/sctp/constants.h | 6 --
>>  net/sctp/sysctl.c| 2 +-
>>  2 files changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/net/sctp/constants.h b/include/net/sctp/constants.h
>> index 922fba5..acb03eb 100644
>> --- a/include/net/sctp/constants.h
>> +++ b/include/net/sctp/constants.h
>> @@ -341,12 +341,14 @@ typedef enum {
>>   SCTP_SCOPE_UNUSABLE,/* IPv4 unusable addresses */
>>  } sctp_scope_t;
>>
>> -typedef enum {
>> +enum {
>>   SCTP_SCOPE_POLICY_DISABLE,  /* Disable IPv4 address scoping */
>>   SCTP_SCOPE_POLICY_ENABLE,   /* Enable IPv4 address scoping */
>>   SCTP_SCOPE_POLICY_PRIVATE,  /* Follow draft but allow IPv4 private addresses */
>>   SCTP_SCOPE_POLICY_LINK, /* Follow draft but allow IPv4 link local addresses */
>> -} sctp_scope_policy_t;
>> +};
>> +
>> +#define SCTP_SCOPE_POLICY_MAX	SCTP_SCOPE_POLICY_LINK
>
> Perhaps slightly better to end the enum with:
> SCTP_SCOPE_POLICY_COUNT, /* Number of policies */
> SCTP_SCOPE_POLICY_MAX = SCTP_SCOPE_POLICY_COUNT - 1 /* Last policy */
> };
It might be, so that adding a new member later would not change too much.

I just copied the idea of SCTP_EVENT__MAX, SCTP_STATE_MAX :-)
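
For reference, the enum layout David suggests could look roughly like this. It is a standalone sketch with hypothetical ex_/EX_ names rather than the actual constants.h change; the point is that the MAX value tracks the last real member automatically.

```c
/* Sketch of the suggested layout; names are prefixed to mark them as
 * illustrative rather than the real SCTP constants.
 */
enum ex_scope_policy {
	EX_SCOPE_POLICY_DISABLE,	/* Disable IPv4 address scoping */
	EX_SCOPE_POLICY_ENABLE,		/* Enable IPv4 address scoping */
	EX_SCOPE_POLICY_PRIVATE,	/* Allow IPv4 private addresses */
	EX_SCOPE_POLICY_LINK,		/* Allow IPv4 link local addresses */
	EX_SCOPE_POLICY_COUNT,		/* Number of policies */
	EX_SCOPE_POLICY_MAX = EX_SCOPE_POLICY_COUNT - 1	/* Last policy */
};

/* Range check of the kind a sysctl bound would enforce. */
static int ex_scope_policy_valid(int v)
{
	return v >= 0 && v <= EX_SCOPE_POLICY_MAX;
}
```

Adding a fifth policy before EX_SCOPE_POLICY_COUNT would raise the maximum without touching any other line.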

Re: [PATCH net] sctp: use __GFP_NOWARN for sctpw.fifo allocation

2017-08-07 Thread Xin Long

On Mon, Aug 7, 2017 at 11:39 AM, Marcelo Ricardo Leitner
 wrote:
> On Sun, Aug 06, 2017 at 06:14:39PM +1200, Xin Long wrote:
>> On Sun, Aug 6, 2017 at 5:08 AM, Marcelo Ricardo Leitner
>>  wrote:
>> > On Sat, Aug 05, 2017 at 08:31:09PM +0800, Xin Long wrote:
>> >> Chen Wei found a kernel call trace when modprobe sctp_probe with
>> >> bufsize set with a huge value.
>> >>
>> >> It's because in sctpprobe_init when alloc memory for sctpw.fifo,
>> >> the size is got from userspace. If it is too large, kernel will
>> >> fail and give a warning.
>> >
>> > Yes but sctp_probe can only be loaded by an admin and it would happen
>> > only during modprobe. It's different from the commit mentioned below, on
>> > which any user could trigger it.
>> yeah, in this way it's different, I think generally it's acceptable to have
>> this kinda warning call trace by admin.
>>
>> But it could get the feedback from the return value and the warning
>> call trace seems not useful. sometimes users may be confused
>
> users or admins?
admins.
>
>> with this call trace. So it may be better not to dump the warning ?
>>
>> Or you think it can be helpful if we leave it here ?
>
> I'm afraid we may be exaggerating here. There are several other ways that
> an admin can trigger scary warnings, this one is not special. I'd rather
> leave this one to the mm defaults instead.
OK, I'm all for that.
>
>>
>> >
>> >>
>> >> As there will be a fallback allocation later, this patch is just
>> >> to fail silently and return ret, just as commit 0ccc22f425e5
>> >> ("sit: use __GFP_NOWARN for user controlled allocation") did.
>> >>
>> >> Reported-by: Chen Wei 
>> >> Signed-off-by: Xin Long 
>> >> ---
>> >>  net/sctp/probe.c | 2 +-
>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >>
>> >> diff --git a/net/sctp/probe.c b/net/sctp/probe.c
>> >> index 6cc2152..5bf3164 100644
>> >> --- a/net/sctp/probe.c
>> >> +++ b/net/sctp/probe.c
>> >> @@ -210,7 +210,7 @@ static __init int sctpprobe_init(void)
>> >>
>> >>   init_waitqueue_head(&sctpw.wait);
>> >>   spin_lock_init(&sctpw.lock);
>> >> - if (kfifo_alloc(&sctpw.fifo, bufsize, GFP_KERNEL))
>> >> + if (kfifo_alloc(&sctpw.fifo, bufsize, GFP_KERNEL | __GFP_NOWARN))
>> >>   return ret;
>> >>
>> >>   if (!proc_create(procname, S_IRUSR, init_net.proc_net,
>> >> --
>> >> 2.1.0
>> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>> >> the body of a message to majord...@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>
>>

Re: [PATCH net] net: sched: fix NULL pointer dereference when action calls some targets

2017-08-07 Thread Xin Long

On Tue, Aug 8, 2017 at 9:15 AM, Cong Wang  wrote:
> (Cc'ing netfilter and Jamal)
>
> On Sat, Aug 5, 2017 at 4:35 AM, Xin Long  wrote:
>> As we know in some target's checkentry it may dereference par.entryinfo
>> to check entry stuff inside. But when sched action calls xt_check_target,
>> par.entryinfo is set with NULL. It would cause kernel panic when calling
>> some targets.
>>
>> It can be reproduced with:
>>   # tc qd add dev eth1 ingress handle :
>>   # tc filter add dev eth1 parent : u32 match u32 0 0 action xt \
>> -j ECN --ecn-tcp-remove
>>
>> It could also crash kernel when using target CLUSTERIP or TPROXY.
>>
[1]
>> By now there's no proper value for par.entryinfo in ipt_init_target,
>> but it can not be set with NULL. This patch is to avoid all these
>> panics by setting it with an ipt_entry obj with all members 0.
>>
>> Signed-off-by: Xin Long 
>> ---
>>  net/sched/act_ipt.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/sched/act_ipt.c b/net/sched/act_ipt.c
>> index 7c4816b..0f09f70 100644
>> --- a/net/sched/act_ipt.c
>> +++ b/net/sched/act_ipt.c
>> @@ -41,6 +41,7 @@ static int ipt_init_target(struct net *net, struct xt_entry_target *t,
>>  {
>> struct xt_tgchk_param par;
>> struct xt_target *target;
>> +   struct ipt_entry e;
>> int ret = 0;
>>
>> target = xt_request_find_target(AF_INET, t->u.user.name,
>> @@ -48,10 +49,11 @@ static int ipt_init_target(struct net *net, struct xt_entry_target *t,
>> if (IS_ERR(target))
>> return PTR_ERR(target);
>>
>> +   memset(&e, 0, sizeof(e));
>> t->u.kernel.target = target;
>> par.net   = net;
>> par.table = table;
>> -   par.entryinfo = NULL;
>> +   par.entryinfo = &e;
>
> This looks like a completely API burden?
Netfilter xt targets are not really compatible with net_sched actions.
I have to say, this patch is just a way to make checkentry return
false and avoid the panic, as [1] says.

Re: [PATCH net] net: sched: set xt_tgchk_param par.nft_compat with false in ipt_init_target

2017-08-07 Thread Xin Long

On Tue, Aug 8, 2017 at 9:03 AM, Cong Wang  wrote:
> On Sat, Aug 5, 2017 at 4:32 AM, Xin Long  wrote:
>> Commit 55917a21d0cc ("netfilter: x_tables: add context to know if
>> extension runs from nft_compat") introduced a member nft_compat to
>> xt_tgchk_param structure.
>>
>> But it didn't set its value for ipt_init_target. With an unexpected
>> value in par.nft_compat, it may return an unexpected result in some
>> target's checkentry.
>>
>> This patch is to set par.nft_compat with false in ipt_init_target.
>
> It's time to set all these fields to 0 and only initialize those non-zero
> fields, in case we will add more fields in the future.
OK, the new fix depends on the net_id one.
I will post v2 after that one gets accepted. Thanks.
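
The pattern Cong suggests (zero the whole parameter block, then set only the non-zero fields) can be sketched like this. It is an illustration under stated assumptions: struct ex_tgchk_param is a hypothetical stand-in and does not reproduce the real layout of struct xt_tgchk_param, and ex_init_param() is not a kernel function.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for struct xt_tgchk_param; not the kernel layout. */
struct ex_tgchk_param {
	const void *net;
	const char *table;
	const void *entryinfo;
	unsigned int hook_mask;
	bool nft_compat;
};

/* Zero everything first so that fields added later default to 0/NULL/false,
 * then fill in only the members that need a non-zero value.
 */
static void ex_init_param(struct ex_tgchk_param *par, const void *net,
			  const char *table, unsigned int hook)
{
	memset(par, 0, sizeof(*par));
	par->net = net;
	par->table = table;
	par->hook_mask = hook;
}
```

With this shape, a newly added member such as nft_compat starts out as false without needing its own assignment.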

[PATCHv2 net] net: sched: set xt_tgchk_param par.net properly in ipt_init_target

2017-08-07 Thread Xin Long

Now xt_tgchk_param par in ipt_init_target is a local variable,
par.net is not initialized there. Later when xt_check_target
calls target's checkentry in which it may access par.net, it
would cause kernel panic.

Jaroslav found this panic when running:

  # ip link add TestIface type dummy
  # tc qd add dev TestIface ingress handle :
  # tc filter add dev TestIface parent : u32 match u32 0 0 \
action xt -j CONNMARK --set-mark 4

This patch is to pass net param into ipt_init_target and set
par.net with it properly in there.

v1->v2:
  As Wang Cong pointed, I missed ipt_net_id != xt_net_id, so fix
  it by also passing net_id to __tcf_ipt_init.

Reported-by: Jaroslav Aster 
Signed-off-by: Xin Long 
---
 net/sched/act_ipt.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/sched/act_ipt.c b/net/sched/act_ipt.c
index 36f0ced..94ba5cf 100644
--- a/net/sched/act_ipt.c
+++ b/net/sched/act_ipt.c
@@ -36,8 +36,8 @@ static struct tc_action_ops act_ipt_ops;
 static unsigned int xt_net_id;
 static struct tc_action_ops act_xt_ops;
 
-static int ipt_init_target(struct xt_entry_target *t, char *table,
-  unsigned int hook)
+static int ipt_init_target(struct net *net, struct xt_entry_target *t,
+  char *table, unsigned int hook)
 {
struct xt_tgchk_param par;
struct xt_target *target;
@@ -49,6 +49,7 @@ static int ipt_init_target(struct xt_entry_target *t, char *table,
return PTR_ERR(target);
 
t->u.kernel.target = target;
+   par.net   = net;
par.table = table;
par.entryinfo = NULL;
par.target= target;
@@ -91,10 +92,11 @@ static const struct nla_policy ipt_policy[TCA_IPT_MAX + 1] = {
[TCA_IPT_TARG]  = { .len = sizeof(struct xt_entry_target) },
 };
 
-static int __tcf_ipt_init(struct tc_action_net *tn, struct nlattr *nla,
+static int __tcf_ipt_init(struct net *net, unsigned int id, struct nlattr *nla,
  struct nlattr *est, struct tc_action **a,
  const struct tc_action_ops *ops, int ovr, int bind)
 {
+   struct tc_action_net *tn = net_generic(net, id);
struct nlattr *tb[TCA_IPT_MAX + 1];
struct tcf_ipt *ipt;
struct xt_entry_target *td, *t;
@@ -159,7 +161,7 @@ static int __tcf_ipt_init(struct tc_action_net *tn, struct nlattr *nla,
if (unlikely(!t))
goto err2;
 
-   err = ipt_init_target(t, tname, hook);
+   err = ipt_init_target(net, t, tname, hook);
if (err < 0)
goto err3;
 
@@ -193,18 +195,16 @@ static int tcf_ipt_init(struct net *net, struct nlattr *nla,
struct nlattr *est, struct tc_action **a, int ovr,
int bind)
 {
-   struct tc_action_net *tn = net_generic(net, ipt_net_id);
-
-   return __tcf_ipt_init(tn, nla, est, a, &act_ipt_ops, ovr, bind);
+   return __tcf_ipt_init(net, ipt_net_id, nla, est, a, &act_ipt_ops, ovr,
+ bind);
 }
 
 static int tcf_xt_init(struct net *net, struct nlattr *nla,
   struct nlattr *est, struct tc_action **a, int ovr,
   int bind)
 {
-   struct tc_action_net *tn = net_generic(net, xt_net_id);
-
-   return __tcf_ipt_init(tn, nla, est, a, &act_xt_ops, ovr, bind);
+   return __tcf_ipt_init(net, xt_net_id, nla, est, a, &act_xt_ops, ovr,
+ bind);
 }
 
 static int tcf_ipt(struct sk_buff *skb, const struct tc_action *a,
-- 
2.1.0

Re: [PATCH v9 0/4] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-08-07 Thread Bjorn Helgaas

On Mon, Aug 07, 2017 at 02:14:48PM -0700, David Miller wrote:
> From: Ding Tianhong 
> Date: Mon, 7 Aug 2017 12:13:17 +0800
> 
> > Hi David：
> > 
> > I think merging it through the networking tree is the better choice, as
> > it is mainly used to tell the NIC drivers how to use the Relaxed Ordering
> > Attribute, and later we need to send a patch to enable RO for the ixgbe
> > driver based on this patch. But I am not sure whether Bjorn has his own
> > view. :)
> > 
> > Hi Bjorn:
> > 
> > Could you help review this patch or give some feedback ?
> 
> I'm still waiting on this...
> 
> Bjorn?

I was on vacation Friday-today, but I'll look at this series this week.

Re: [PATCH RFC v2 3/5] samples/bpf: Fix inline asm issues building samples on arm64

2017-08-07 Thread Joel Fernandes

Hi Dave,

On Mon, Aug 7, 2017 at 11:28 AM, David Miller  wrote:
>
> Please, no.

Sorry you dislike it. I had intentionally marked it as RFC, as it's an
idea I was just toying with, and I posted it early to get
feedback.

>
> The amount of hellish hacks we are adding to deal with this is getting
> way out of control.

I agree with you that hellish hacks are being added, which is why it
keeps breaking. I think one of the things my series does is to add
back inclusion of asm headers that were previously removed (in my
opinion, that removal is the worst hack that exists in mainline). So in
that respect my patch is an improvement and makes it possible to build
for arm64 platforms (which is currently broken in mainline).

>
> BPF programs MUST have their own set of asm headers, this is the
> only way to get around this issue in the long term.

Wouldn't that break scripts or bpf code that instruments/trace arch
specific code?

>
> I am also strongly against adding -static to the build.

I can drop -static if you prefer, that's not an issue.

As I understand it, there are no other cleaner alternatives and this
patchset makes the samples work. I would even argue that it's more
functional than previous attempts and fixes something broken in
mainline in a more generic way. If you can provide an example of where
my patchset may not work, I would love to hear it. My whole idea was
to do it in a way that makes future breakage not happen. I don't think
that leaving things broken in this state for extended periods of time
makes sense and IMHO will slow usage of bpf samples on other
platforms.

thanks,

-Joel

Re:Re:Re: Re: [PATCH net] ppp: Fix a scheduling-while-atomic bug in del_chan

2017-08-07 Thread Gao Feng

At 2017-08-08 01:17:02, "Cong Wang"  wrote:
>On Sun, Aug 6, 2017 at 6:32 PM, Gao Feng  wrote:
>> I think RCU is supposed to prevent the race between del_chan and
>> lookup_chan.
>
>More precisely, it is callid_sock which is protected by RCU.
>
>Unless I miss any other code path, pptp_exit_module() is
>problematic too, I don't think it can just vfree() the whole thing.
>
>
>> synchronize_rcu makes sure that any lookup_chan call running during this
>> period has finished and that the sock refcnt has been increased if
>> necessary.
>>
>> So I think it is OK to invoke sock_put directly without SOCK_RCU_FREE,
>> because the lookup_chan caller already holds the sock refcnt.

Hi Cong,
I was just thinking about this issue last night, and then I got your email
this morning.

>If you mean the sock_hold() inside lookup_chan(), no,
>it doesn't help because we already dereference the sock
>before it.

Sorry, I don't quite follow. Why doesn't the sock_hold() help?
pptp_release invokes synchronize_rcu after del_chan, which makes sure that
any concurrent lookup has finished and has increased the sock refcnt if
necessary.
Nobody can get the sock after synchronize_rcu in pptp_release.

But I thought of another problem.
It seems pptp_sock_destruct should not invoke del_chan and
pppox_unbind_sock, because by the time the sock refcnt is 0, pptp_release
must already have been invoked.

There are two cases in total.
1. When pptp_release invokes sock_put, the refcnt drops to 0. del_chan and
pppox_unbind_sock have already been invoked.
2. When pptp_release invokes sock_put, the refcnt is not 0. This means
someone held the sock during the period in which pptp_release invoked
del_chan. When that holder later invokes sock_put and the sock refcnt
reaches 0, sk_free is invoked, which in turn invokes pptp_sock_destruct.
So it is unnecessary to invoke del_chan and pppox_unbind_sock again.
And having pptp_sock_destruct invoke del_chan would even introduce a race.

If so, I will send another patch for it.

>Also, lookup_chan_dst() does not have a refcnt, I don't
>find any code preventing it deref'ing other sock in callid_sock
>than the calling one.

Sorry, the last email was in HTML format, not text,
so I am sending it again as plain text.

Best Regards
Feng

Re: [PATCH net] net: sched: set xt_tgchk_param par.net properly in ipt_init_target

2017-08-07 Thread Xin Long

On Tue, Aug 8, 2017 at 9:00 AM, Cong Wang  wrote:
> On Sat, Aug 5, 2017 at 1:48 AM, Xin Long  wrote:
>> -static int __tcf_ipt_init(struct tc_action_net *tn, struct nlattr *nla,
>> +static int __tcf_ipt_init(struct net *net, struct nlattr *nla,
>>   struct nlattr *est, struct tc_action **a,
>>   const struct tc_action_ops *ops, int ovr, int bind)
>>  {
>> +   struct tc_action_net *tn = net_generic(net, xt_net_id);
>
> ...
>
>> @@ -193,18 +195,14 @@ static int tcf_ipt_init(struct net *net, struct nlattr *nla,
>> struct nlattr *est, struct tc_action **a, int ovr,
>> int bind)
>>  {
>> -   struct tc_action_net *tn = net_generic(net, ipt_net_id);
>> -
>> -   return __tcf_ipt_init(tn, nla, est, a, &act_ipt_ops, ovr, bind);
>> +   return __tcf_ipt_init(net, nla, est, a, &act_ipt_ops, ovr, bind);
>>  }
>>
>>  static int tcf_xt_init(struct net *net, struct nlattr *nla,
>>struct nlattr *est, struct tc_action **a, int ovr,
>>int bind)
>>  {
>> -   struct tc_action_net *tn = net_generic(net, xt_net_id);
>> -
>> -   return __tcf_ipt_init(tn, nla, est, a, &act_xt_ops, ovr, bind);
>> +   return __tcf_ipt_init(net, nla, est, a, &act_xt_ops, ovr, bind);
>
> This is not correct.
>
> You miss ipt_net_id != xt_net_id.
Right, that's a silly mistake. There seems to be no better way than to pass
both net and net_id to __tcf_ipt_init. Will send v2. Thanks.

Re: [PATCH v5 net-next 00/12] bpf: rewrite value tracking in verifier

2017-08-07 Thread Daniel Borkmann


On 08/07/2017 04:21 PM, Edward Cree wrote:

> This series simplifies alignment tracking, generalises bounds tracking and
>   fixes some bounds-tracking bugs in the BPF verifier.  Pointer arithmetic on
>   packet pointers, stack pointers, map value pointers and context pointers has
>   been unified, and bounds on these pointers are only checked when the pointer
>   is dereferenced.
> Operations on pointers which destroy all relation to the original pointer
>   (such as multiplies and shifts) are disallowed if !env->allow_ptr_leaks,
>   otherwise they convert the pointer to an unknown scalar and feed it to the
>   normal scalar arithmetic handling.
> Pointer types have been unified with the corresponding adjusted-pointer types
>   where those existed (e.g. PTR_TO_MAP_VALUE[_ADJ] or FRAME_PTR vs
>   PTR_TO_STACK); similarly, CONST_IMM and UNKNOWN_VALUE have been unified into
>   SCALAR_VALUE.
> Pointer types (except CONST_PTR_TO_MAP, PTR_TO_MAP_VALUE_OR_NULL and
>   PTR_TO_PACKET_END, which do not allow arithmetic) have a 'fixed offset' and
>   a 'variable offset'; the former is used when e.g. adding an immediate or a
>   known-constant register, as long as it does not overflow.  Otherwise the
>   latter is used, and any operation creating a new variable offset creates a
>   new 'id' (and, for PTR_TO_PACKET, clears the 'range').
> SCALAR_VALUEs use the 'variable offset' fields to track the range of possible
>   values; the 'fixed offset' should never be set on a scalar.


Been testing and reviewing the series over the last several days, looks
reasonable to me as far as I can tell. Thanks for all the hard work on
unifying this, Edward!

Acked-by: Daniel Borkmann

[PATCH] Allow passing tid or pid in SCM_CREDENTIALS without CAP_SYS_ADMIN

2017-08-07 Thread Prakash Sangappa

Currently, passing the tid (gettid(2)) of a thread in struct ucred in an
SCM_CREDENTIALS message requires the CAP_SYS_ADMIN capability; otherwise
it fails with an EPERM error. Some applications deal with the thread id
(tid) of a thread, so it would help to allow the tid in SCM_CREDENTIALS
messages. Basically, either the tgid (pid of the process) or the tid of
the thread should be allowed without the need for the CAP_SYS_ADMIN
capability.

SCM_CREDENTIALS will be used to determine the global id of a process or
a thread running inside a pid namespace.

This patch adds the necessary check to accept a tid in the SCM_CREDENTIALS
struct ucred.

Signed-off-by: Prakash Sangappa 
---
 net/core/scm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/scm.c b/net/core/scm.c
index b1ff8a4..9274197 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -55,6 +55,7 @@ static __inline__ int scm_check_creds(struct ucred *creds)
return -EINVAL;
 
if ((creds->pid == task_tgid_vnr(current) ||
+creds->pid == task_pid_vnr(current) ||
 ns_capable(task_active_pid_ns(current)->user_ns, CAP_SYS_ADMIN)) &&
((uid_eq(uid, cred->uid)   || uid_eq(uid, cred->euid) ||
  uid_eq(uid, cred->suid)) || ns_capable(cred->user_ns, CAP_SETUID)) &&
-- 
2.7.4

Re: [PATCH] net: Reduce skb_warn_bad_offload() noise.

2017-08-07 Thread Tonghao Zhang

That is fine with me. I have tested it. Thanks.

On Mon, Aug 7, 2017 at 12:42 PM, Willem de Bruijn
 wrote:
>> The openvswitch kernel module calls __skb_gso_segment() (with
>> tx_path = false) when passing packets to userspace. UFO sets
>> ip_summed to CHECKSUM_NONE, so there are a lot of warning logs. The
>> warning log is shown below. I guess we should revert the patch.
>
> Indeed, the software UFO code computes the checksum and
> sets ip_summed to CHECKSUM_NONE, as is correct on the
> egress path.
>
> Commit 6e7bc478c9a0 ("net: skb_needs_check() accepts
> CHECKSUM_NONE for tx") revised the tx_path case in
> skb_needs_check to avoid the warning exactly for the UFO case.
>
> We cannot just make an exception for CHECKSUM_NONE in the
> !tx_path case, as the entire statement then becomes false:
>
>return skb->ip_summed == CHECKSUM_NONE;
>
> Since on egress CHECKSUM_UNNECESSARY is equivalent to
> CHECKSUM_NONE, it should be fine to update the UFO code
> to set that, instead:
>
> @@ -235,7 +235,7 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb,
> if (uh->check == 0)
> uh->check = CSUM_MANGLED_0;
>
> -   skb->ip_summed = CHECKSUM_NONE;
> +   skb->ip_summed = CHECKSUM_UNNECESSARY;

Re: Qdisc->u32_node - licence to kill

2017-08-07 Thread Cong Wang

On Mon, Aug 7, 2017 at 12:54 PM, John Fastabend
 wrote:
> On 08/07/2017 12:06 PM, Jiri Pirko wrote:
>> Mon, Aug 07, 2017 at 07:47:14PM CEST, john.fastab...@gmail.com wrote:
>>> On 08/07/2017 09:41 AM, Jiri Pirko wrote:
>>>> Hi Jamal/Cong/David/all.
>>>>
>>>> Digging in the u32 code deeper now. I need to get rid of tp->q for shared
>>>> blocks, but I found out about this:
>>>>
>>>> struct Qdisc {
>>>> ..
>>>> void *u32_node;
>>>> ..
>>>> };
>>>>
>>>> Yeah, ugly. u32 uses it to store some shared data, tp_c. It actually
>>>> stores a linked list of all hashtables added to one qdisc.
>>>>
>>>> So basically what you have is, you have 1 root ht per prio/pref. Then
>>>> you can have multiple hts, linked from any other ht, does not matter in
>>>> which prio/pref they are.
>>>>
>>>
>>> We can create arbitrary hash tables here independent of prio/pref via
>>> TCA_U32_DIVISOR. Then these can be linked to other hash tables via
>>> TCA_U32_LINK commands.
>>
>> Yeah, that's what I thought.
>>
>>
>>>
>>> prio/pref does not really play any part here from my reading, except as
>>> a further specifier in the walk callbacks. Making it a useful filter on
>>> dump operations.
>>
>> Not correct. prio/pref is one level up priority, independent on specific
>> cls implementation. You can have cls_u32 instance on prio 10 and
>> cls_flower instance on prio 20. Both work.
>
> ah right, let's make sure I got this right then (it's been a while since I've
> read this code). So the tcf_ctl_tfilter hook walks classifiers, inserting the
> classifier by prio. Then tcf_classify walks the list of classifiers looking
> for any matches, specifically any return codes it recognizes or a return code
> greater than zero. u32 though has this link notion that allows users to jump
> to other u32 classifiers that are in this list, because it has a global hash
> table list. So the per prio classifier isolation is not true in u32 case.

u32 filter supports multiple hash tables within a qdisc, struct
tc_u_common is supposed to link them together. This has to be
per qdisc because all of these hash tables belong to one qdisc
and their ID's are unique within the qdisc.

I dislike it too, and I actually tried to improve it in the past,
unfortunately didn't make any real progress. I think we can
definitely make it less ugly, but I don't think we can totally
get rid of it because of the design of u32.

Similar for tp->data.

Re: [PATCH net-next] net: dsa: lan9303: Only allocate 3 ports

2017-08-07 Thread Vivien Didelot

Egil Hjelmeland  writes:

> Save 2628 bytes on arm eabi by allocating only the required 3 ports.
>
> Now that ds->num_ports is correct: In net/dsa/tag_lan9303.c
> eliminate duplicate LAN9303_MAX_PORTS, use ds->num_ports.
> (Matching the pattern of other net/dsa/tag_xxx.c files.)
>
> Signed-off-by: Egil Hjelmeland 

Reviewed-by: Vivien Didelot

Re: [PATCH net-next] selftests: bpf: add a test for XDP redirect

2017-08-07 Thread John Fastabend

On 08/07/2017 01:14 PM, William Tu wrote:
> Add test for xdp_redirect by creating two namespaces with two
> veth peers, then forward packets in-between.
> 
> Signed-off-by: William Tu 
> Cc: Daniel Borkmann 
> Cc: John Fastabend 
> ---

Thanks for doing this.

Acked-by: John Fastabend

Re: [PATCH] ip/link_vti*.c: Fix output for ikey/okey

2017-08-07 Thread Stephen Hemminger

On Mon, 7 Aug 2017 11:59:28 +0200
Christian Langrock  wrote:

> ikey and okey are normal u32 values. There's no reason to print them as
> IPv4/IPv6 addresses.
> 
> Signed-off-by: Christian Langrock 

Changing output format breaks scripts that parse output.
But on the other hand, the VTI code breaks the assumption that ip command
output should be the same as input.

More likely the original output format was done to match Cisco output.

Why not print in hex like fwmark?

Re: [PATCH net-next] net: dsa: lan9303: Only allocate 3 ports

2017-08-07 Thread Florian Fainelli

On 08/07/2017 03:22 PM, Egil Hjelmeland wrote:
> Save 2628 bytes on arm eabi by allocating only the required 3 ports.
> 
> Now that ds->num_ports is correct: In net/dsa/tag_lan9303.c
> eliminate duplicate LAN9303_MAX_PORTS, use ds->num_ports.
> (Matching the pattern of other net/dsa/tag_xxx.c files.)
> 
> Signed-off-by: Egil Hjelmeland 

Reviewed-by: Florian Fainelli 
-- 
Florian

[Patch net-next] net_sched: get rid of some forward declarations

2017-08-07 Thread Cong Wang

If we move up tcf_fill_node() we can get rid of these
forward declarations.

Also, move down tfilter_notify_chain() to group them together.

Reported-by: Jamal Hadi Salim 
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
---
 net/sched/cls_api.c | 214 +---
 1 file changed, 103 insertions(+), 111 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 668afb6e9885..8d1157aebaf7 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -100,25 +100,6 @@ int unregister_tcf_proto_ops(struct tcf_proto_ops *ops)
 }
 EXPORT_SYMBOL(unregister_tcf_proto_ops);
 
-static int tfilter_notify(struct net *net, struct sk_buff *oskb,
- struct nlmsghdr *n, struct tcf_proto *tp,
- void *fh, int event, bool unicast);
-
-static int tfilter_del_notify(struct net *net, struct sk_buff *oskb,
- struct nlmsghdr *n, struct tcf_proto *tp,
- void *fh, bool unicast, bool *last);
-
-static void tfilter_notify_chain(struct net *net, struct sk_buff *oskb,
-struct nlmsghdr *n,
-struct tcf_chain *chain, int event)
-{
-   struct tcf_proto *tp;
-
-   for (tp = rtnl_dereference(chain->filter_chain);
-tp; tp = rtnl_dereference(tp->next))
-   tfilter_notify(net, oskb, n, tp, 0, event, false);
-}
-
 /* Select new prio value from the range, managed by kernel. */
 
 static inline u32 tcf_auto_prio(struct tcf_proto *tp)
@@ -411,6 +392,109 @@ static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
return tp;
 }
 
+static int tcf_fill_node(struct net *net, struct sk_buff *skb,
+struct tcf_proto *tp, void *fh, u32 portid,
+u32 seq, u16 flags, int event)
+{
+   struct tcmsg *tcm;
+   struct nlmsghdr  *nlh;
+   unsigned char *b = skb_tail_pointer(skb);
+
+   nlh = nlmsg_put(skb, portid, seq, event, sizeof(*tcm), flags);
+   if (!nlh)
+   goto out_nlmsg_trim;
+   tcm = nlmsg_data(nlh);
+   tcm->tcm_family = AF_UNSPEC;
+   tcm->tcm__pad1 = 0;
+   tcm->tcm__pad2 = 0;
+   tcm->tcm_ifindex = qdisc_dev(tp->q)->ifindex;
+   tcm->tcm_parent = tp->classid;
+   tcm->tcm_info = TC_H_MAKE(tp->prio, tp->protocol);
+   if (nla_put_string(skb, TCA_KIND, tp->ops->kind))
+   goto nla_put_failure;
+   if (nla_put_u32(skb, TCA_CHAIN, tp->chain->index))
+   goto nla_put_failure;
+   if (!fh) {
+   tcm->tcm_handle = 0;
+   } else {
+   if (tp->ops->dump && tp->ops->dump(net, tp, fh, skb, tcm) < 0)
+   goto nla_put_failure;
+   }
+   nlh->nlmsg_len = skb_tail_pointer(skb) - b;
+   return skb->len;
+
+out_nlmsg_trim:
+nla_put_failure:
+   nlmsg_trim(skb, b);
+   return -1;
+}
+
+static int tfilter_notify(struct net *net, struct sk_buff *oskb,
+ struct nlmsghdr *n, struct tcf_proto *tp,
+ void *fh, int event, bool unicast)
+{
+   struct sk_buff *skb;
+   u32 portid = oskb ? NETLINK_CB(oskb).portid : 0;
+
+   skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+   if (!skb)
+   return -ENOBUFS;
+
+   if (tcf_fill_node(net, skb, tp, fh, portid, n->nlmsg_seq,
+ n->nlmsg_flags, event) <= 0) {
+   kfree_skb(skb);
+   return -EINVAL;
+   }
+
+   if (unicast)
+   return netlink_unicast(net->rtnl, skb, portid, MSG_DONTWAIT);
+
+   return rtnetlink_send(skb, net, portid, RTNLGRP_TC,
+ n->nlmsg_flags & NLM_F_ECHO);
+}
+
+static int tfilter_del_notify(struct net *net, struct sk_buff *oskb,
+ struct nlmsghdr *n, struct tcf_proto *tp,
+ void *fh, bool unicast, bool *last)
+{
+   struct sk_buff *skb;
+   u32 portid = oskb ? NETLINK_CB(oskb).portid : 0;
+   int err;
+
+   skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+   if (!skb)
+   return -ENOBUFS;
+
+   if (tcf_fill_node(net, skb, tp, fh, portid, n->nlmsg_seq,
+ n->nlmsg_flags, RTM_DELTFILTER) <= 0) {
+   kfree_skb(skb);
+   return -EINVAL;
+   }
+
+   err = tp->ops->delete(tp, fh, last);
+   if (err) {
+   kfree_skb(skb);
+   return err;
+   }
+
+   if (unicast)
+   return netlink_unicast(net->rtnl, skb, portid, MSG_DONTWAIT);
+
+   return rtnetlink_send(skb, net, portid, RTNLGRP_TC,
+ n->nlmsg_flags & NLM_F_ECHO);
+}
+
+static void tfilter_notify_chain(struct net *net, struct sk_buff *oskb,
+struct nlmsghdr *n,
+struct

multi-queue over IFF_NO_QUEUE "virtual" devices

2017-08-07 Thread Florian Fainelli

Hi,

Most DSA supported Broadcom switches have multiple queues per port
(usually 8) and each of these queues can be configured with different
pause, drop, hysteresis thresholds and so on in order to make use of the
switch's internal buffering scheme and have some queues achieve some
kind of lossless behavior (e.g: LAN to LAN traffic for Q7 has a higher
priority than LAN to WAN for Q0).

This is obviously very workload specific, so I'd want as much
programmability as possible.

This brings me to a few questions:

1) If we have the DSA slave network devices currently flagged with
IFF_NO_QUEUE becoming multi-queue (on TX) aware such that an application
can control exactly which switch egress queue is used on a per-flow
basis, would that be a problem (this is the dynamic selection of the TX
queue)?

2) The conduit interface (CPU) port network interface has a congestion
control scheme which requires each of its TX queues (32 or 16) to be
statically mapped to each of the underlying switch port queues because
the congestion/ HW needs to inspect the queue depths of the switch to
accept/reject a packet at the CPU's TX ring level. Do we have a good way
with tc to map a virtual/stacked device's queue(s) on-top of its
physical/underlying device's queues (this is the static queue mapping
necessary for congestion to work)?

Let me know if you think this is the right approach or not.

Thanks!
-- 
Florian

[RFC PATCH 1/2] bpf: Fix bpf_trace_printk on 32-bit architectures

2017-08-07 Thread James Hogan

bpf_trace_printk() uses conditional operators to attempt to pass
different types to __trace_printk() depending on the format operators.
This doesn't work as intended on 32-bit architectures where u32 & long
are passed differently to u64, since the result of C conditional
operators follows the "usual arithmetic conversions" rules, such that
the values passed to __trace_printk() will always be u64.

For example the samples/bpf/tracex5 test printed lines like below on
MIPS, where the fd and buf have come from the u64 fd argument, and the
size from the buf argument:
  dd-1176  [000]   1180.941542: 0x0001: write(fd=1, buf=  (null), size=6258688)

Instead of this:
  dd-1217  [000]   1625.616026: 0x0001: write(fd=1, buf=009e4000, size=512)

Work around this with an ugly hack which expands each combination of
argument types for the 3 arguments. On 64-bit kernels it is assumed that
u32, long & u64 are all passed the same way so no casting takes place
(it has apparently worked implicitly until now). On 32-bit kernels it is
assumed that long and u32 pass the same way so there are 8 combinations.

On 32-bit kernels bpf_trace_printk() increases in size but should now
work correctly. On 64-bit kernels it actually reduces in size slightly,
I presume due to removal of some of the casts (which as far as I can
tell are unnecessary for printk anyway due to the controlled nature of
the interpretation):

arch   function          old   new  delta
x86_64 bpf_trace_printk  532   412   -120
x86    bpf_trace_printk  676  1120   +444
MIPS64 bpf_trace_printk  760   612   -148
MIPS32 bpf_trace_printk  768   996   +228

Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call bpf_trace_printk()")
Signed-off-by: James Hogan 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Steven Rostedt 
Cc: Ingo Molnar 
Cc: netdev@vger.kernel.org
---
I'm open to nicer ways of fixing this.

This is tested with samples/bpf/tracex5 on MIPS32 and MIPS64. Only build
tested on x86.
---
 kernel/trace/bpf_trace.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 37385193a608..32dcbe1b48f2 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -204,10 +204,28 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1,
fmt_cnt++;
}
 
-   return __trace_printk(1/* fake ip will not be printed */, fmt,
- mod[0] == 2 ? arg1 : mod[0] == 1 ? (long) arg1 : (u32) arg1,
- mod[1] == 2 ? arg2 : mod[1] == 1 ? (long) arg2 : (u32) arg2,
- mod[2] == 2 ? arg3 : mod[2] == 1 ? (long) arg3 : (u32) arg3);
+   /*
+* This is a horribly ugly hack to allow different combinations of
+* argument types to be used, particularly on 32-bit architectures where
+* u32 & long pass the same as one another, but differently to u64.
+*
+* On 64-bit architectures it is assumed u32, long & u64 pass in the
+* same way.
+*/
+
+#define __BPFTP_P(...) __trace_printk(1/* fake ip will not be printed */, \
+  fmt, ##__VA_ARGS__)
+#define __BPFTP_1(...) ((mod[0] == 2 || __BITS_PER_LONG == 64) \
+? __BPFTP_P(arg1, ##__VA_ARGS__)   \
+: __BPFTP_P((long)arg1, ##__VA_ARGS__))
+#define __BPFTP_2(...) ((mod[1] == 2 || __BITS_PER_LONG == 64) \
+? __BPFTP_1(arg2, ##__VA_ARGS__)   \
+: __BPFTP_1((long)arg2, ##__VA_ARGS__))
+#define __BPFTP_3(...) ((mod[2] == 2 || __BITS_PER_LONG == 64) \
+? __BPFTP_2(arg3, ##__VA_ARGS__)   \
+: __BPFTP_2((long)arg3, ##__VA_ARGS__))
+
+   return __BPFTP_3();
 }
 
 static const struct bpf_func_proto bpf_trace_printk_proto = {
-- 
2.13.2

[RFC PATCH 0/2] bpf_trace_printk() fixes

2017-08-07 Thread James Hogan

A couple of RFC fixes for bpf_trace_printk(). The first affects 32-bit
architectures in particular, the second is a theoretical uninitialised
variable fix.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Steven Rostedt 
Cc: Ingo Molnar 
Cc: netdev@vger.kernel.org

James Hogan (2):
  bpf: Fix bpf_trace_printk on 32-bit architectures
  bpf: Initialise mod[] in bpf_trace_printk

 kernel/trace/bpf_trace.c | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

-- 
2.13.2

[RFC PATCH 2/2] bpf: Initialise mod[] in bpf_trace_printk

2017-08-07 Thread James Hogan

In bpf_trace_printk(), the elements in mod[] are left uninitialised, but
they are then incremented to track the width of the formats. Zero
initialise the array just in case the memory contains non-zero values on
entry.

Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call bpf_trace_printk()")
Signed-off-by: James Hogan 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Steven Rostedt 
Cc: Ingo Molnar 
Cc: netdev@vger.kernel.org
---
When I checked (on MIPS32), the elements tended to have the value zero
anyway (does BPF zero the stack or something clever?), so this is a
purely theoretical fix.
---
 kernel/trace/bpf_trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 32dcbe1b48f2..86a52857d941 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -129,7 +129,7 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1,
   u64, arg2, u64, arg3)
 {
bool str_seen = false;
-   int mod[3] = {};
+   int mod[3] = { 0, 0, 0 };
int fmt_cnt = 0;
u64 unsafe_addr;
char buf[64];
-- 
2.13.2

[RFC PATCH] net: don't set __LINK_STATE_START until after dev->open() call

2017-08-07 Thread Jacob Keller

Fix an issue with relying on netif_running(), which could be true while
the dev->open() handler is being called, even if it would exit with
a failure. This ensures the state is not set and then cleared again, which
would leave a narrow window in which other callers could read the device
as open when in fact it never finished opening.

Signed-off-by: Jacob Keller 
---
I found this as a result of debugging a race condition in the i40evf
driver, in which we assumed that netif_running() would not be true until
after dev->open() had been called and succeeded. Unfortunately we can't
hold the rtnl_lock() while checking netif_running() because it would
cause a deadlock between our reset task and our ndo_open handler.

I am wondering whether the proposed change is acceptable here, or
whether some ndo_open handlers rely on __LINK_STATE_START being true
prior to their being called?

 net/core/dev.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1d75499add72..11953af90427 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1362,8 +1362,6 @@ static int __dev_open(struct net_device *dev)
if (ret)
return ret;
 
-   set_bit(__LINK_STATE_START, &dev->state);
-
if (ops->ndo_validate_addr)
ret = ops->ndo_validate_addr(dev);
 
@@ -1372,9 +1370,8 @@ static int __dev_open(struct net_device *dev)
 
netpoll_poll_enable(dev);
 
-   if (ret)
-   clear_bit(__LINK_STATE_START, &dev->state);
-   else {
+   if (!ret)
+   set_bit(__LINK_STATE_START, &dev->state);
dev->flags |= IFF_UP;
dev_set_rx_mode(dev);
dev_activate(dev);
-- 
2.14.0.rc1.251.g593d8d6362ce

[PATCH net-next] net: dsa: lan9303: Only allocate 3 ports

2017-08-07 Thread Egil Hjelmeland

Save 2628 bytes on arm eabi by allocating only the required 3 ports.

Now that ds->num_ports is correct: In net/dsa/tag_lan9303.c
eliminate duplicate LAN9303_MAX_PORTS, use ds->num_ports.
(Matching the pattern of other net/dsa/tag_xxx.c files.)

Signed-off-by: Egil Hjelmeland 
---
 drivers/net/dsa/lan9303-core.c | 2 +-
 net/dsa/tag_lan9303.c  | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
index 15befd155251..46fc1d5d3c9e 100644
--- a/drivers/net/dsa/lan9303-core.c
+++ b/drivers/net/dsa/lan9303-core.c
@@ -811,7 +811,7 @@ static struct dsa_switch_ops lan9303_switch_ops = {
 
 static int lan9303_register_switch(struct lan9303 *chip)
 {
-   chip->ds = dsa_switch_alloc(chip->dev, DSA_MAX_PORTS);
+   chip->ds = dsa_switch_alloc(chip->dev, LAN9303_NUM_PORTS);
if (!chip->ds)
return -ENOMEM;
 
diff --git a/net/dsa/tag_lan9303.c b/net/dsa/tag_lan9303.c
index 247774d149f9..e23e7635fa00 100644
--- a/net/dsa/tag_lan9303.c
+++ b/net/dsa/tag_lan9303.c
@@ -39,7 +39,6 @@
  */
 
 #define LAN9303_TAG_LEN 4
-#define LAN9303_MAX_PORTS 3
 
 static struct sk_buff *lan9303_xmit(struct sk_buff *skb, struct net_device *dev)
 {
@@ -104,7 +103,7 @@ static struct sk_buff *lan9303_rcv(struct sk_buff *skb, struct net_device *dev,
 
source_port = ntohs(lan9303_tag[1]) & 0x3;
 
-   if (source_port >= LAN9303_MAX_PORTS) {
+   if (source_port >= ds->num_ports) {
dev_warn_ratelimited(&dev->dev, "Dropping packet due to invalid source port\n");
return NULL;
}
-- 
2.11.0

Re: [PATCH net-next] net: vrf: Add extack messages for newlink failures

2017-08-07 Thread David Miller

From: David Ahern 
Date: Mon,  7 Aug 2017 10:08:10 -0700

> Add extack error messages for failure paths creating vrf devices. Once
> extack support is added to iproute2, we go from the unhelpful:
> $  ip li add foobar type vrf
> RTNETLINK answers: Invalid argument
> 
> to:
> $ ip li add foobar type vrf
> Error: VRF table id is missing
> 
> Signed-off-by: David Ahern 

Applied, thanks David.

[PATCH RFC 1/2] bpf: Add a BPF return code to disconnect a connection

2017-08-07 Thread Tom Herbert

When using a BPF program against a flow, a possible verdict is that the
packet should not only be dropped, but that the flow the packet
was received on should also be terminated.

Signed-off-by: Tom Herbert 
---
 include/uapi/linux/bpf.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1d06be1569b1..324e886c3490 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -708,6 +708,7 @@ enum bpf_ret_code {
BPF_DROP = 2,
/* 3-6 reserved */
BPF_REDIRECT = 7,
+   BPF_DISCONNECT = 8,
/* >127 are reserved for prog type specific return codes */
 };
 
-- 
2.11.0

[PATCH RFC 2/2] stap: Socket tap

2017-08-07 Thread Tom Herbert

The socket tap uses the ULP mechanism to insert functionality between the
socket system calls and the connection itself. Basically, this is a
means to tap into a socket. To implement the tap, the data operations
(sendmsg, recvmsg, sendpage, and read_sock) are intercepted using the
ULP infrastructure. In each direction a strparser instance is used to
delineate a stream into application messages. A BPF "verdict" program
is run on the output (application messages) of each strparser. The verdict
program can allow the message, drop it, or drop it and also kill the
connection.

The sendmsg path looks something like:

sendmsg (ULP) -> strparser -> verdict -> send_sock_skb

The receive path looks like:

tcp_read_sock -> strparser -> verdict -> recvmsg

Note the socket tap does not introduce any new locks or queuing
(except for messages under construction by strparser). Also,
the buffer limits of the socket are respected in all operations.

Signed-off-by: Tom Herbert 
---
 include/net/stap.h|  43 +++
 include/uapi/linux/stap.h |  21 ++
 net/Kconfig   |   1 +
 net/Makefile  |   1 +
 net/stap/Kconfig  |   8 +
 net/stap/Makefile |   3 +
 net/stap/stap_main.c  | 769 ++
 7 files changed, 846 insertions(+)
 create mode 100644 include/net/stap.h
 create mode 100644 include/uapi/linux/stap.h
 create mode 100644 net/stap/Kconfig
 create mode 100644 net/stap/Makefile
 create mode 100644 net/stap/stap_main.c

diff --git a/include/net/stap.h b/include/net/stap.h
new file mode 100644
index ..dfc96a116db2
--- /dev/null
+++ b/include/net/stap.h
@@ -0,0 +1,43 @@
+/*
+ * Socket tap
+ *
+ * Copyright (c) 2017 Tom Herbert 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ */
+
+#ifndef __NET_STAP_H_
+#define __NET_STAP_H_
+
+#include 
+#include 
+#include 
+
+struct stap_bops {
+   struct strparser strp;
+   struct bpf_prog *parse_prog;
+   struct bpf_prog *verdict_prog;
+};
+
+struct stap_sock {
+   struct sock *sk; /* Associated socket */
+
+   const struct proto_ops *orig_ops;
+
+   void (*save_data_ready)(struct sock *sk);
+   void (*save_write_space)(struct sock *sk);
+   void (*save_state_change)(struct sock *sk);
+
+   /* Send items */
+   struct stap_bops send_bops;
+   struct sk_buff_head build_list;
+   struct sk_buff_head ready_list;
+
+   /* Receive items */
+   struct stap_bops recv_bops;
+   struct sk_buff *recv_skb;
+};
+
+#endif /* __NET_STAP_H_ */
diff --git a/include/uapi/linux/stap.h b/include/uapi/linux/stap.h
new file mode 100644
index ..fa8545628fd2
--- /dev/null
+++ b/include/uapi/linux/stap.h
@@ -0,0 +1,21 @@
+/*
+ * Socket tap
+ *
+ * Copyright (c) 2017 Tom Herbert 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ */
+
+#ifndef _UAPI_LINUX_STAP_H
+#define _UAPI_LINUX_STAP_H
+
+struct stap_params {
+   int bpf_send_parse_fd;
+   int bpf_send_verdict_fd;
+   int bpf_recv_parse_fd;
+   int bpf_recv_verdict_fd;
+};
+
+#endif /* _UAPI_LINUX_STAP_H */
diff --git a/net/Kconfig b/net/Kconfig
index 2b8d2d88bc2b..8a1bdd269e84 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -368,6 +368,7 @@ source "net/irda/Kconfig"
 source "net/bluetooth/Kconfig"
 source "net/rxrpc/Kconfig"
 source "net/kcm/Kconfig"
+source "net/stap/Kconfig"
 source "net/strparser/Kconfig"
 
 config FIB_RULES
diff --git a/net/Makefile b/net/Makefile
index bed80fa398b7..3ef1d8ae8e58 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_BT)  += bluetooth/
 obj-$(CONFIG_SUNRPC)   += sunrpc/
 obj-$(CONFIG_AF_RXRPC) += rxrpc/
 obj-$(CONFIG_AF_KCM)   += kcm/
+obj-$(CONFIG_STAP) += stap/
 obj-$(CONFIG_STREAM_PARSER)+= strparser/
 obj-$(CONFIG_ATM)  += atm/
 obj-$(CONFIG_L2TP) += l2tp/
diff --git a/net/stap/Kconfig b/net/stap/Kconfig
new file mode 100644
index ..adb3149ea624
--- /dev/null
+++ b/net/stap/Kconfig
@@ -0,0 +1,8 @@
+config STAP
+   tristate "Socket tap"
+   depends on INET
+   select BPF_SYSCALL
+   select STREAM_PARSER
+   ---help---
+ Socket tap.
+
diff --git a/net/stap/Makefile b/net/stap/Makefile
new file mode 100644
index ..f5bce0f59da5
--- /dev/null
+++ b/net/stap/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_STAP) += stap.o
+
+stap-y := stap_main.o
diff --git a/net/stap/stap_main.c b/net/stap/stap_main.c
new file mode 100644
index ..a34dcf19e6cd
--- /dev/null
+++ b/net/stap/stap_main.c
@@ -0,0 +1,769 @@
+/*
+ * Socket tap
+ *
+ * Copyright (c) 2017 Tom

[PATCH RFC 0/2] stap: Socket tap

2017-08-07 Thread Tom Herbert

This patch set introduces generic socket tap support. This allows a
means to run BPF programs or implement other functionality in both the
send and receive path of a socket.

The socket tap uses the ULP mechanism recently introduced for kTLS.
The data operations (sendmsg, recvmsg, sendpage, and splice_recv)
are intercepted. In both directions strparser is used to break the
stream into discrete application layer messages. Each message is
then run through a BPF verdict program that indicates the message
is okay, should be dropped, or should be dropped and the connection
terminated.
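
As a rough illustration of the three verdict outcomes described above,
here is a minimal user-space sketch. The enum values, the stats
structure, and the function name are all hypothetical and are not taken
from the patches; the sketch only shows the intended dispatch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdict codes; the real values would come from the BPF UAPI. */
enum stap_verdict {
	STAP_VERDICT_OK,         /* message is okay, pass it through */
	STAP_VERDICT_DROP,       /* drop this message only */
	STAP_VERDICT_DROP_CLOSE, /* drop the message and terminate the connection */
};

/* Hypothetical per-connection bookkeeping, used only for this illustration. */
struct stap_stats {
	unsigned int passed;
	unsigned int dropped;
	int closed;
};

/* Apply one verdict to one parsed message.
 * Returns 1 if the message should be forwarded, 0 otherwise.
 */
int stap_apply_verdict(struct stap_stats *st, enum stap_verdict v)
{
	assert(st != NULL);

	if (st->closed) /* nothing is forwarded once the connection is terminated */
		return 0;

	switch (v) {
	case STAP_VERDICT_OK:
		st->passed++;
		return 1;
	case STAP_VERDICT_DROP_CLOSE:
		st->closed = 1;
		st->dropped++;
		return 0;
	case STAP_VERDICT_DROP:
	default:
		st->dropped++;
		return 0;
	}
}
```

Treating an unknown verdict as a plain drop is a choice made for the
sketch, not something the patch set specifies.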

The data path for socket tap for TCP is illustrated below:

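  +-------------------------+
  |  Application            |
  |                         |
  |  sendmsg       recvmsg  |
  +-------------------------+
     |               ^
     |               |
     |---------------|-------------------------------------------------
     |               |
     |   +-------------+    +--------------+    +----------------+
     |   | Socket tap  |<---| Verdict prog |<---| strparser      |
     |   | recvmsg     |    |              |    | TCP data ready |
     |   +-------------+    +--------------+    +----------------+
     |                                                  ^
     |                                                  |
     |                                          TCP receive queue
     V
  +-------------+   +-----------+   +--------------+   +---------------+
  | Socket tap  |-->| strparser |-->| Verdict prog |-->| skb_send_sock |
  | sendmsg     |   | send mode |   |              |   | locked        |
  +-------------+   +-----------+   +--------------+   +---------------+
                                                               |
                                                               V
                                                        TCP write queue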

Interface:

A socket tap is enabled on socket using SO_ULP socket option with
ulp type "stap". The socket option takes ULP specific configuration
in the stap_params structure. The parameters consist of four file
descriptors for BPF programs. These are for the parser (strparser)
in the send path, the verdict program for send path, the parser
and verdict programs in the receive path.

Example configuration to set a socket tap. In this case the same
parse and verdict programs are used on send and receive sides of the
socket:

struct {
struct ulp_config ulpc;
struct stap_params sp;
} my_config;

load_bpf_file("parse_kern.o");
load_bpf_file("verdict_kern.o");

my_config.ulpc.ulp_name = "stap";
my_config.sp.bpf_send_parse_fd = prog_fd[0];
my_config.sp.bpf_send_verdict_fd = prog_fd[1];
my_config.sp.bpf_recv_parse_fd = prog_fd[0];
my_config.sp.bpf_recv_verdict_fd = prog_fd[1];

setsockopt(fd, SOL_SOCKET, SO_ULP, &my_config, sizeof(my_config));

Future work:

  - Fill in all the expected semantics. A goal of socket tap is
transparency with applications.
  - Performance evaluation.
  - Add a mechanism to allow an admin process to tap other users'
sockets.
  - Add userspace tap that can divert application messages through
userspace. I'm thinking to connect tapped sockets to KCM to provide
the interface.
  - Integrate with kTLS.
  - Support for BPF_REDIRECT. This would be useful to redirect messages
to different sockets like in John Fastabend's socket redirect.

Tom Herbert (2):
  bpf: Add a BPF return code to disconnect a connection
  stap: Socket tap

 include/net/stap.h|  43 +++
 include/uapi/linux/bpf.h  |   1 +
 include/uapi/linux/stap.h |  21 ++
 net/Kconfig   |   1 +
 net/Makefile  |   1 +
 net/stap/Kconfig  |   8 +
 net/stap/Makefile |   3 +
 net/stap/stap_main.c  | 769 ++
 8 files changed, 847 insertions(+)
 create mode 100644 include/net/stap.h
 create mode 100644 include/uapi/linux/stap.h
 create mode 100644 net/stap/Kconfig
 create mode 100644 net/stap/Makefile
 create mode 100644 net/stap/stap_main.c

-- 
2.11.0

Re: [PATCH] wan: dscc4: add checks for dma mapping errors

2017-08-07 Thread Francois Romieu

Alexey Khoroshilov  :
> The driver does not check if mapping dma memory succeed.
> The patch adds the checks and failure handling.
> 
> Found by Linux Driver Verification project (linuxtesting.org).
> 
> Signed-off-by: Alexey Khoroshilov 

Please amend your subject line as:

Subject: [PATCH net-next v2 1/1] dscc4: add checks for dma mapping errors.

Rationale: davem is not supposed to guess the branch the patch should be
applied to.

[...]
> diff --git a/drivers/net/wan/dscc4.c b/drivers/net/wan/dscc4.c
> index 799830ffcae2..1a94f0a95b2c 100644
> --- a/drivers/net/wan/dscc4.c
> +++ b/drivers/net/wan/dscc4.c
> @@ -522,19 +522,27 @@ static inline int try_get_rx_skb(struct dscc4_dev_priv 
> *dpriv,
>   struct RxFD *rx_fd = dpriv->rx_fd + dirty;
>   const int len = RX_MAX(HDLC_MAX_MRU);
>   struct sk_buff *skb;
> - int ret = 0;
> + dma_addr_t addr;
>  
>   skb = dev_alloc_skb(len);
>   dpriv->rx_skbuff[dirty] = skb;
> - if (skb) {
> - skb->protocol = hdlc_type_trans(skb, dev);
> - rx_fd->data = cpu_to_le32(pci_map_single(dpriv->pci_priv->pdev,
> -   skb->data, len, PCI_DMA_FROMDEVICE));
> - } else {
> - rx_fd->data = 0;
> - ret = -1;
> - }
> - return ret;
> + if (!skb)
> + goto err_out;
> +
> + skb->protocol = hdlc_type_trans(skb, dev);
> + addr = pci_map_single(dpriv->pci_priv->pdev,
> +   skb->data, len, PCI_DMA_FROMDEVICE);
> + if (pci_dma_mapping_error(dpriv->pci_priv->pdev, addr))
> + goto err_free_skb;

Nit: please use a local 'struct pci_dev *pdev = dpriv->pci_priv->pdev;'

[...]
> @@ -1147,14 +1155,22 @@ static netdev_tx_t dscc4_start_xmit(struct sk_buff 
> *skb,
>   struct dscc4_dev_priv *dpriv = dscc4_priv(dev);
>   struct dscc4_pci_priv *ppriv = dpriv->pci_priv;
>   struct TxFD *tx_fd;
> + dma_addr_t addr;
>   int next;
>  
> + addr = pci_map_single(ppriv->pdev, skb->data, skb->len,
> +   PCI_DMA_TODEVICE);
> + if (pci_dma_mapping_error(ppriv->pdev, addr)) {
> + dev_kfree_skb_any(skb);
> + dev->stats.tx_errors++;

It should read 'dev->stats.tx_dropped++'.

-- 
Ueimor

Re: [PATCH net-next v3 00/13] Update DSA's FDB API and perform switchdev cleanup

2017-08-07 Thread David Miller

From: Arkadi Sharshevsky 
Date: Sun,  6 Aug 2017 16:15:38 +0300

> The patchset adds support for configuring static FDB entries via the
> switchdev notification chain. The current method for FDB configuration
> uses the switchdev's bridge bypass implementation. In order to support 
> this legacy way and to perform the switchdev cleanup, the implementation
> is moved inside DSA.
> 
> The DSA drivers cannot sync the software bridge with hardware learned
> entries and use the switchdev's implementation of bypass FDB dumping.
> Because they are the only ones using this functionality, the fdb_dump
> implementation is moved from switchdev code into DSA.
> 
> Finally, after these changes a major cleanup in switchdev can be done.
> ---
> Please see individual patches for patch specific change logs.
> v1->v2
> - Split MDB/vlan dump removal into core/driver removal.
> 
> v2->v3
> - The self implementation for FDB add/del is moved inside DSA.

Series applied, thank you.

Re: [PATCH 0/3] ARM: dts: keystone-k2g: Add DCAN instances to 66AK2G

2017-08-07 Thread Franklin S Cooper Jr


Hi Santosh,
On 08/04/2017 12:07 PM, Santosh Shilimkar wrote:
> Hi Franklin,
> 
> On 8/2/2017 1:18 PM, Franklin S Cooper Jr wrote:
>> Add D CAN nodes to 66AK2G based SoC dtsi.
>>
>> Franklin S Cooper Jr (2):
>>dt-bindings: net: c_can: Update binding for clock and power-domains
>>  property
>>ARM: configs: keystone: Enable D_CAN driver
>>
>> Lokesh Vutla (1):
>>ARM: dts: k2g: Add DCAN nodes
>>
> Any DCAN driver dependency with this patchset? If not, I can
> queue this up so do let me know.

There aren't any dependencies.
> 
> Regards,
> Santosh

Re: [PATCH net-next] selftests: bpf: add a test for XDP redirect

2017-08-07 Thread Daniel Borkmann


On 08/07/2017 10:14 PM, William Tu wrote:

Add test for xdp_redirect by creating two namespaces with two
veth peers, then forward packets in-between.

Signed-off-by: William Tu 
Cc: Daniel Borkmann 
Cc: John Fastabend 


Acked-by: Daniel Borkmann

Re: [PATCH] hamradio: baycom: make hdlcdrv_ops const

2017-08-07 Thread David Miller

From: Bhumika Goyal 
Date: Sun,  6 Aug 2017 14:21:45 +0530

> Make hdlcdrv_ops structures const as they are only passed to
> hdlcdrv_register function. The corresponding argument is of type const,
> so make the structures const.
> 
> Signed-off-by: Bhumika Goyal 

Applied, thanks.

Re: [PATCH ipsec-next] xfrm: check that cached bundle is still valid

2017-08-07 Thread David Miller

From: Florian Westphal 
Date: Sun,  6 Aug 2017 10:19:07 +0200

> Quoting Ilan Tayari:
>   1. Set up a host-to-host IPSec tunnel (or transport, doesn't matter)
>   2. Ping over IPSec, or do something to populate the pcpu cache
>   3. Join a MC group, then leave MC group
>   4. Try to ping again using same CPU as before -> traffic
>  doesn't egress the machine at all
> 
> Ilan debugged the problem down to the fact that one of the path dsts
> devices point to lo due to earlier dst_dev_put().
> In this case, dst is marked as DEAD and we cannot reuse the bundle.
> 
> The cache only asserted that the requested policy and that of the cached
> bundle match, but it's not enough - also verify the path is still valid.
> 
> Fixes: ec30d78c14a813 ("xfrm: add xdst pcpu cache")
> Reported-by: Ayham Masood 
> Tested-by: Ilan Tayari 
> Signed-off-by: Florian Westphal 

Since this regression is from the flow cache removal that went directly
into my tree, I'll apply this directly to net-next as well.

Thanks Florian.

Re: [PATCH net-next v2 0/3] net: dsa: remove useless arguments

2017-08-07 Thread David Miller

From: Vivien Didelot 
Date: Sat,  5 Aug 2017 16:20:16 -0400

> Several DSA core setup functions take many arguments, mostly because of
> the legacy code. This patch series removes the useless args of these
> functions, where either the dsa_switch or dsa_port argument is enough.
> 
> Changes in v2:
>   - ds->dev is already assigned by dsa_switch_alloc

Series applied, thanks.

Re: [PATCH][net-next] net: hns3: fix spelling mistake: "capabilty" -> "capability"

2017-08-07 Thread David Miller

From: Colin King 
Date: Sat,  5 Aug 2017 14:46:35 +0100

> From: Colin Ian King 
> 
> Trivial fix to spelling mistake in dev_err error message and also
> split overly long line to avoid a checkpatch warning.
> 
> Signed-off-by: Colin Ian King 

Applied, thank you.

Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-08-07 Thread Stephen Hemminger

On Mon, 07 Aug 2017 13:26:03 -0700 (PDT)
David Miller  wrote:

> From: Stephen Hemminger 
> Date: Mon, 7 Aug 2017 12:12:35 -0700
> 
> > Dave, I asked for test cases, and received none.  
> 
> You don't need a test case to type make and make sure the build succeeds.

It did succeed for libmnl but was not doing anything.

Re: [PATCH v4 net-next 0/5] Refactor lan9303_xxx_packet_processing

2017-08-07 Thread David Miller

From: Egil Hjelmeland 
Date: Sat,  5 Aug 2017 13:05:45 +0200

> This series is purely non functional. 
> 
> It changes the lan9303_enable_packet_processing,
> lan9303_disable_packet_processing() to pass port number (0,1,2) as
> parameter instead of port offset. This aligns them with
> other functions in the module, and makes it possible to simplify the code.
> 
> The lan9303_enable_packet_processing, lan9303_disable_packet_processing
> functions operate on port. Therefore rename the functions to reflect that
> as well.
> 
> Reviewer pointed out lan9303_get_ethtool_stats would be better off with
> the use of a lan9303_read_switch_port(). So that was added to the series.
 ...

Series applied, thank you.

Re: [PATCH net-next v2 0/5] ipv6: sr: add support for advanced local segment processing

2017-08-07 Thread David Miller

From: David Lebrun 
Date: Sat, 5 Aug 2017 12:38:23 +0200

> v2: use EXPORT_SYMBOL_GPL
> 
> The current implementation of IPv6 SR supports SRH insertion/encapsulation
> and basic segment endpoint behavior (i.e., processing of an SRH contained in
> a packet whose active segment (IPv6 DA) is routed to the local node). This
> behavior simply consists of updating the DA to the next segment and forwarding
> the packet accordingly. This processing is realised for all such packets,
> regardless of the active segment.
> 
> The most recent specifications of IPv6 SR [1] [2] extend the SRH processing
> features as follows. Each segment endpoint defines a MyLocalSID table.
> This table maps segments to operations to perform. For each ingress IPv6
> packet whose DA is part of a given prefix, the segment endpoint looks
> up the active segment (i.e., the IPv6 DA) in the MyLocalSID table and
> applies the corresponding operation. Such specifications enable to specify
> arbitrary operations besides the basic SRH processing and allow for a more
> fine-grained classification.
> 
> This patch series implements those extended specifications by leveraging
> a new type of lightweight tunnel, seg6local.
 ...

Series applied, thanks David.

Re: [PATCH net] net: sched: fix NULL pointer dereference when action calls some targets

2017-08-07 Thread Cong Wang

(Cc'ing netfilter and Jamal)

On Sat, Aug 5, 2017 at 4:35 AM, Xin Long  wrote:
> As we know in some target's checkentry it may dereference par.entryinfo
> to check entry stuff inside. But when sched action calls xt_check_target,
> par.entryinfo is set with NULL. It would cause kernel panic when calling
> some targets.
>
> It can be reproduced with:
>   # tc qd add dev eth1 ingress handle :
>   # tc filter add dev eth1 parent : u32 match u32 0 0 action xt \
> -j ECN --ecn-tcp-remove
>
> It could also crash kernel when using target CLUSTERIP or TPROXY.
>
> By now there's no proper value for par.entryinfo in ipt_init_target,
> but it cannot be set to NULL. This patch is to avoid all these
> panics by setting it to an ipt_entry obj with all members 0.
>
> Signed-off-by: Xin Long 
> ---
>  net/sched/act_ipt.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/sched/act_ipt.c b/net/sched/act_ipt.c
> index 7c4816b..0f09f70 100644
> --- a/net/sched/act_ipt.c
> +++ b/net/sched/act_ipt.c
> @@ -41,6 +41,7 @@ static int ipt_init_target(struct net *net, struct 
> xt_entry_target *t,
>  {
> struct xt_tgchk_param par;
> struct xt_target *target;
> +   struct ipt_entry e;
> int ret = 0;
>
> target = xt_request_find_target(AF_INET, t->u.user.name,
> @@ -48,10 +49,11 @@ static int ipt_init_target(struct net *net, struct 
> xt_entry_target *t,
> if (IS_ERR(target))
> return PTR_ERR(target);
>
> +   memset(&e, 0, sizeof(e));
> t->u.kernel.target = target;
> par.net   = net;
> par.table = table;
> -   par.entryinfo = NULL;
> +   par.entryinfo = &e;

This looks like a complete API burden?

Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-07 Thread John Stultz

On Mon, Aug 7, 2017 at 2:05 PM, John Stultz  wrote:
> So, with recent testing with my HiKey board, I've been noticing some
> quirky behavior with my USB eth adapter.
>
> Basically, after plugging the USB eth adapter in and then removing it, when
> plugging it back in I often find that it's not detected, and the system
> slowly spits out the following message over and over:
>   unregister_netdevice: waiting for eth0 to become free. Usage count = 1

The other bit is that after this starts printing, the board will no
longer reboot (it hangs continuing to occasionally print the above
message), and I have to manually reset the device.

thanks
-john

Re: [PATCH v9 0/4] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-08-07 Thread David Miller

From: Ding Tianhong 
Date: Mon, 7 Aug 2017 12:13:17 +0800

> Hi David:
> 
> I think merging it through the networking tree is the better choice, as
> it is mainly used to tell the NIC drivers how to use the Relaxed Ordering
> Attribute, and later we need to send a patch to enable RO for the ixgbe
> driver based on this patch. But I am not sure whether Bjorn has his own
> view. :)
> 
> Hi Bjorn:
> 
> Could you help review this patch or give some feedback ?

I'm still waiting on this...

Bjorn?

Re: [net-next PATCH v2] bpf: devmap fix mutex in rcu critical section

2017-08-07 Thread David Miller

From: John Fastabend 
Date: Fri, 04 Aug 2017 22:02:19 -0700

> Originally we used a mutex to protect concurrent devmap update
> and delete operations from racing with netdev unregister notifier
> callbacks.
> 
> The notifier hook is needed because we increment the netdev ref
> count when a dev is added to the devmap. This ensures the netdev
> reference is valid in the datapath. However, we don't want to block
> unregister events, hence the initial mutex and notifier handler.
> 
> The concern was in the notifier hook we search the map for dev
> entries that hold a refcnt on the net device being torn down. But,
> in order to do this we require two steps,
 ...
> Fortunately, by writing slightly better code we can avoid the
> mutex altogether. If CPU 1 in the above example uses a cmpxchg
> and _only_ replaces the dev reference in the map when it is in
> fact the expected dev the race is removed completely. The two
> cases being illustrated here, first the race condition,
 ...
> And voilà, the original race we tried to solve with a mutex is
> corrected and the trace noted by Sasha below is resolved due
> to removal of the mutex.
> 
> Note: When walking the devmap and removing dev references as needed
> we depend on the core to fail any calls to dev_get_by_index() using
> the ifindex of the device being removed. This way we do not race with
> the user while searching the devmap.
> 
> Additionally, the mutex was also protecting list add/del/read on
> the list of maps in-use. This patch converts this to an RCU list
> and spinlock implementation. This protects the list from concurrent
> alloc/free operations. The notifier hook walks this list so it uses
> RCU read semantics.
> 
> BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:747
> in_atomic(): 1, irqs_disabled(): 0, pid: 16315, name: syz-executor1
> 1 lock held by syz-executor1/16315:
>  #0:  (rcu_read_lock){..}, at: [] map_delete_elem 
> kernel/bpf/syscall.c:577 [inline]
>  #0:  (rcu_read_lock){..}, at: [] SYSC_bpf 
> kernel/bpf/syscall.c:1427 [inline]
>  #0:  (rcu_read_lock){..}, at: [] SyS_bpf+0x1d32/0x4ba0 
> kernel/bpf/syscall.c:1388
> 
> Fixes: 2ddf71e23cc2 ("net: add notifier hooks for devmap bpf map")
> Reported-by: Sasha Levin 
> Signed-off-by: Daniel Borkmann 
> Signed-off-by: John Fastabend 

Applied, thanks John.
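
For readers following the thread, the compare-and-exchange idea in the
quoted commit message can be sketched in user-space C11 as below. This
is not the devmap code: the structure and function names are
hypothetical, and C11 atomic_compare_exchange_strong() stands in for the
kernel's cmpxchg(). The point is only that the slot is cleared when, and
only when, it still holds the device being torn down.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a net device. */
struct dev_sketch {
	int ifindex;
};

/* Hypothetical map slot holding one device pointer. */
struct slot_sketch {
	_Atomic(struct dev_sketch *) dev;
};

/* Clear the slot only if it still points at 'expected'.
 * Returns 1 if the slot was cleared, 0 if another writer already
 * replaced the entry, in which case the new entry is left alone.
 */
int slot_clear_if(struct slot_sketch *s, struct dev_sketch *expected)
{
	assert(s != NULL);
	return atomic_compare_exchange_strong(&s->dev, &expected, NULL);
}
```

If a concurrent update has already installed a different device, the
exchange fails and the notifier side does not clobber it, which is why
no mutex is needed around the two steps.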

Re: [Patch net-next 0/2] net_sched: clean up filter handle

2017-08-07 Thread David Miller

From: Cong Wang 
Date: Fri,  4 Aug 2017 21:31:41 -0700

> This patchset sits in my local branch for a long time, it is time to
> send it out. It cleans up the ambiguous use of 'unsigned long fh',
> please see each of them for details.

Series applied, thanks Cong.

Re: [PATCH net-next v2] lwtunnel: replace EXPORT_SYMBOL with EXPORT_SYMBOL_GPL

2017-08-07 Thread David Miller

From: Roopa Prabhu 
Date: Fri,  4 Aug 2017 18:19:18 -0700

> From: Roopa Prabhu 
> 
> Signed-off-by: Roopa Prabhu 
> ---
> v2 - fixed a incorrect replace

Applied, thanks.

Re: [PATCH net-next v4 0/2] bpf: add support for sys_{enter|exit}_* tracepoints

2017-08-07 Thread David Miller

From: Yonghong Song 
Date: Fri, 4 Aug 2017 16:00:08 -0700

> Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
> style tracepoints. The main reason is that syscalls/sys_enter_* and 
> syscalls/sys_exit_*
> tracepoints are treated differently from other tracepoints and there
> is no bpf hook to it.
> 
> This patch set adds bpf support for these syscalls tracepoints and also
> adds a test case for it.
> 
> Changelogs:
> v3 -> v4:
>  - Check the legality of ctx offset access for syscall tracepoint as well.
>trace_event_get_offsets will return correct max offset for each
>specific syscall tracepoint.
>  - Use variable length array to avoid hardcode 6 as the maximum
>arguments beyond syscall_nr.
> v2 -> v3:
>  - Fix a build issue
> v1 -> v2:
>  - Do not use TRACE_EVENT_FL_CAP_ANY to identify syscall tracepoint.
>Instead use trace_event_call->class.

Series applied, thank you.

Re: [PATCH] of_mdio: use of_property_read_u32_array()

2017-08-07 Thread David Miller

From: Sergei Shtylyov 
Date: Sat, 05 Aug 2017 00:43:43 +0300

> The "fixed-link" prop support predated of_property_read_u32_array(), so
> basically had to open-code it. Using the modern API saves 24 bytes of the
> object code (ARM gcc 4.8.5); the only behavior change would be that the
> prop length check is now less strict (however the strict pre-check done
> in of_phy_is_fixed_link() is left intact anyway)...
> 
> Signed-off-by: Sergei Shtylyov 

Applied to net-next.

Re: [PATCH] wan: dscc4: add checks for dma mapping errors

2017-08-07 Thread David Miller

From: Alexey Khoroshilov 
Date: Fri,  4 Aug 2017 23:23:24 +0300

> The driver does not check if mapping dma memory succeed.
> The patch adds the checks and failure handling.
> 
> Found by Linux Driver Verification project (linuxtesting.org).
> 
> Signed-off-by: Alexey Khoroshilov 

This is a great example of why it can be irritating to see these
mechanical "bug fixes" for drivers that very few people use and
actually test: they introduce new bugs.

> @@ -522,19 +522,27 @@ static inline int try_get_rx_skb(struct dscc4_dev_priv 
> *dpriv,
>   struct RxFD *rx_fd = dpriv->rx_fd + dirty;
>   const int len = RX_MAX(HDLC_MAX_MRU);
>   struct sk_buff *skb;
> - int ret = 0;
> + dma_addr_t addr;
>  
>   skb = dev_alloc_skb(len);
>   dpriv->rx_skbuff[dirty] = skb;

skb recorded here.

> +err_free_skb:
> + dev_kfree_skb_any(skb);

Yet freed here in the error path.

dpriv->rx_skbuff[dirty] should not be set to 'skb' until all possible
failure tests have passed.
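
The ordering being asked for can be illustrated with a small user-space
sketch. Everything here is a stand-in: the ring structure, map_buffer()
and the simulate_failure flag are hypothetical, malloc()/free() replace
the skb allocation and release, and this is not a proposed patch for
dscc4.

```c
#include <assert.h>
#include <stdlib.h>

#define RING_SIZE 4

/* Hypothetical ring; slot[] plays the role of dpriv->rx_skbuff[]. */
struct ring {
	void *slot[RING_SIZE];
};

/* Stand-in for the DMA mapping step; fails when told to. */
static int map_buffer(void *buf, int simulate_failure)
{
	(void)buf;
	return simulate_failure ? -1 : 0;
}

/* Refill one empty slot. The slot is written only after allocation and
 * mapping have both succeeded, so it never points at freed memory.
 */
int ring_refill(struct ring *r, unsigned int idx, size_t len,
		int simulate_failure)
{
	void *buf;

	assert(r != NULL && idx < RING_SIZE);

	buf = malloc(len);
	if (!buf)
		goto err_out;

	if (map_buffer(buf, simulate_failure))
		goto err_free;

	r->slot[idx] = buf; /* publish last, after every check passed */
	return 0;

err_free:
	free(buf);
err_out:
	r->slot[idx] = NULL;
	return -1;
}
```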

unregister_netdevice: waiting for eth0 to become free. Usage count = 1

2017-08-07 Thread John Stultz

So, with recent testing with my HiKey board, I've been noticing some
quirky behavior with my USB eth adapter.

Basically, after plugging the USB eth adapter in and then removing it, when
plugging it back in I often find that it's not detected, and the system
slowly spits out the following message over and over:
  unregister_netdevice: waiting for eth0 to become free. Usage count = 1

I've tried to bisect it, but the issue isn't reliably reproducible,
so I'm getting lots of false negatives (the same kernel may or may
not show the problem from one boot to the next).

I've done three bisection passes (always restarting with the "first
bad commit" from the previous bisection as the initial bad commit for
the following pass), and it does seem to keep moving back. But it
seems much easier to trigger with newer kernels than older ones (and so far
I've not seen it with 4.12).

Wanted to see if anyone had any ideas what might be going wrong, and
how I should further debug this.

The last bisect log I generated was:

# good: [6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c] Linux 4.12
git bisect good 6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c
# bad: [98fdd857a3bd6a3bf0003d3f68f07c25c85dcde3] net: ethernet: ti:
cpsw: move skb timestamp to packet_submit
git bisect bad 98fdd857a3bd6a3bf0003d3f68f07c25c85dcde3
# good: [48b6bbef9a1789f0365c1a385879a1fea4460016] Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good 48b6bbef9a1789f0365c1a385879a1fea4460016
# good: [a2e8bbd2ef5457485f00b6b947bbbfa2778e5b1e] bpf: Fix
test_obj_id.c for llvm 5.0
git bisect good a2e8bbd2ef5457485f00b6b947bbbfa2778e5b1e
# good: [273889e306256e95ea55d5ebaef99310cf589def] Merge tag
'mlx5-updates-2017-06-16' of
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
git bisect good 273889e306256e95ea55d5ebaef99310cf589def
# bad: [8f46d46715a12f509e13200033a1ed4d6cf335ff] cxgb4: Use Firmware
params to get buffer-group map
git bisect bad 8f46d46715a12f509e13200033a1ed4d6cf335ff
# bad: [f5c306470ed0a8f03ba7017f397da2555b5800d4] Merge tag
'mlx5-updates-2017-06-20' of
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
git bisect bad f5c306470ed0a8f03ba7017f397da2555b5800d4
# bad: [e289ef0ded13021db292be9aef134451546e7c60] net: dsa: mv88e6xxx:
clarify SMI PHY functions
git bisect bad e289ef0ded13021db292be9aef134451546e7c60
# bad: [836d57e5c08e13bb206dcd559d96ee9355e8316e] liquidio: implement
vlan filter enable and disable
git bisect bad 836d57e5c08e13bb206dcd559d96ee9355e8316e
# bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call
dst_hold_safe() properly
git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112
# good: [0830106c53900181d336350581119af09e123bf3] ipv4: take
dst->__refcnt when caching dst in fib
git bisect good 0830106c53900181d336350581119af09e123bf3
# good: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC
and remove the operation of dst_free()
git bisect good b838d5e1c5b6e57b10ec8af2268824041e3ea911
# bad: [9514528d92d4cbe086499322370155ed69f5d06c] ipv6: call
dst_dev_put() properly
git bisect bad 9514528d92d4cbe086499322370155ed69f5d06c
# good: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take
dst->__refcnt for insertion into fib6 tree
git bisect good 1cfb71eeb12047bcdbd3e6730ffed66e810a0855
# first bad commit: [9514528d92d4cbe086499322370155ed69f5d06c] ipv6:
call dst_dev_put() properly


But again, reverting the "ipv6: call dst_dev_put() properly" commit
doesn't seem to completely resolve the issue on newer kernels (though
it may make it harder to trigger), and I suspect with further
bisection passes I might move further back.

Ideas?  I don't seem to have similar issues with USB mass storage
devices, so it seems to be networking specific.

thanks
-john

Re: [PATCH net] net: sched: set xt_tgchk_param par.nft_compat with false in ipt_init_target

2017-08-07 Thread Cong Wang

On Sat, Aug 5, 2017 at 4:32 AM, Xin Long  wrote:
> Commit 55917a21d0cc ("netfilter: x_tables: add context to know if
> extension runs from nft_compat") introduced a member nft_compat to
> xt_tgchk_param structure.
>
> But it didn't set its value for ipt_init_target. With unexpected
> value in par.nft_compat, it may return unexpected result in some
> target's checkentry.
>
> This patch is to set par.nft_compat with false in ipt_init_target.

It's time to set all these fields to 0 and only initialize those non-zero
fields, in case we will add more fields in the future.
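
A minimal sketch of that pattern, assuming a simplified stand-in
structure (the struct and function names below are hypothetical; the
real xt_tgchk_param has a different layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a check-parameter structure. */
struct tgchk_param_sketch {
	const void *net;
	const char *table;
	const void *entryinfo;
	unsigned int hook_mask;
	bool nft_compat;
};

/* Zero the whole structure first, then set only the non-zero fields.
 * Any member added later starts out as 0/NULL/false without touching
 * this function.
 */
void tgchk_param_init(struct tgchk_param_sketch *par, const void *net,
		      const char *table, unsigned int hook_mask)
{
	assert(par != NULL);

	memset(par, 0, sizeof(*par));
	par->net = net;
	par->table = table;
	par->hook_mask = hook_mask;
}
```

With this shape, nft_compat does not need an explicit assignment; it is
false because the memset() already cleared it.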

Re: [PATCH net v2] net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets

2017-08-07 Thread David Miller

From: Davide Caratti 
Date: Thu,  3 Aug 2017 22:54:48 +0200

> if the NIC fails to validate the checksum on TCP/UDP, and validation of IP
> checksum is successful, the driver subtracts the pseudo-header checksum
> from the value obtained by the hardware and sets CHECKSUM_COMPLETE. Don't
> do that if protocol is IPPROTO_SCTP, otherwise CRC32c validation fails.
> 
> V2: don't test MLX4_CQE_STATUS_IPV6 if MLX4_CQE_STATUS_IPV4 is set
> 
> Reported-by: Shuang Li 
> Fixes: f8c6455bb04b ("net/mlx4_en: Extend checksum offloading by CHECKSUM 
> COMPLETE")
> Signed-off-by: Davide Caratti 

Can I get reviews from some Mellanox folks please?

Re: [PATCH net-next] ibmvnic: Report rx buffer return codes as netdev_dbg

2017-08-07 Thread David Miller

From: John Allen 
Date: Mon, 7 Aug 2017 15:42:30 -0500

> Reporting any return code for a receive buffer as an "rx error" only
> produces alarming noise and the only values that have been observed to be
> used in this field are not error conditions. Change this to a netdev_dbg
> with a more descriptive message.
> 
> Signed-off-by: John Allen 

Applied, thanks John.

Re: [PATCH net] net: sched: set xt_tgchk_param par.net properly in ipt_init_target

2017-08-07 Thread Cong Wang

On Sat, Aug 5, 2017 at 1:48 AM, Xin Long  wrote:
> -static int __tcf_ipt_init(struct tc_action_net *tn, struct nlattr *nla,
> +static int __tcf_ipt_init(struct net *net, struct nlattr *nla,
>   struct nlattr *est, struct tc_action **a,
>   const struct tc_action_ops *ops, int ovr, int bind)
>  {
> +   struct tc_action_net *tn = net_generic(net, xt_net_id);

...

> @@ -193,18 +195,14 @@ static int tcf_ipt_init(struct net *net, struct nlattr 
> *nla,
> struct nlattr *est, struct tc_action **a, int ovr,
> int bind)
>  {
> -   struct tc_action_net *tn = net_generic(net, ipt_net_id);
> -
> -   return __tcf_ipt_init(tn, nla, est, a, &act_ipt_ops, ovr, bind);
> +   return __tcf_ipt_init(net, nla, est, a, &act_ipt_ops, ovr, bind);
>  }
>
>  static int tcf_xt_init(struct net *net, struct nlattr *nla,
>struct nlattr *est, struct tc_action **a, int ovr,
>int bind)
>  {
> -   struct tc_action_net *tn = net_generic(net, xt_net_id);
> -
> -   return __tcf_ipt_init(tn, nla, est, a, &act_xt_ops, ovr, bind);
> +   return __tcf_ipt_init(net, nla, est, a, &act_xt_ops, ovr, bind);

This is not correct.

You missed that ipt_net_id != xt_net_id.

[PATCH net-next] ibmvnic: Report rx buffer return codes as netdev_dbg

2017-08-07 Thread John Allen

Reporting any return code for a receive buffer as an "rx error" only
produces alarming noise and the only values that have been observed to be
used in this field are not error conditions. Change this to a netdev_dbg
with a more descriptive message.

Signed-off-by: John Allen 
---
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 5932160..99576ba 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1579,7 +1579,8 @@ static int ibmvnic_poll(struct napi_struct *napi, int 
budget)
  rx_comp.correlator);
/* do error checking */
if (next->rx_comp.rc) {
-   netdev_err(netdev, "rx error %x\n", next->rx_comp.rc);
+   netdev_dbg(netdev, "rx buffer returned with rc %x\n",
+  be16_to_cpu(next->rx_comp.rc));
/* free the entry */
next->rx_comp.first = 0;
remove_buff_from_pool(adapter, rx_buff);

Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-08-07 Thread David Miller

From: Stephen Hemminger 
Date: Mon, 7 Aug 2017 12:12:35 -0700

> Dave, I asked for test cases, and received none.

You don't need a test case to type make and make sure the build succeeds.

[PATCH net-next] selftests: bpf: add a test for XDP redirect

2017-08-07 Thread William Tu

Add test for xdp_redirect by creating two namespaces with two
veth peers, then forward packets in-between.

Signed-off-by: William Tu 
Cc: Daniel Borkmann 
Cc: John Fastabend 
---
 tools/include/uapi/linux/bpf.h   |  3 +-
 tools/testing/selftests/bpf/Makefile |  4 +-
 tools/testing/selftests/bpf/test_xdp_redirect.c  | 28 
 tools/testing/selftests/bpf/test_xdp_redirect.sh | 54 
 4 files changed, 86 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_xdp_redirect.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect.sh

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1579cab49717..8d9bfcca3fe4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -592,7 +592,8 @@ union bpf_attr {
FN(get_socket_uid), \
FN(set_hash),   \
FN(setsockopt), \
-   FN(skb_adjust_room),
+   FN(skb_adjust_room),\
+   FN(redirect_map),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 153c3a181a4c..3c2e67da4b41 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -15,9 +15,9 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps 
test_lru_map test_lpm_map test
test_align
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
-   test_pkt_md_access.o
+   test_pkt_md_access.o test_xdp_redirect.o
 
-TEST_PROGS := test_kmod.sh
+TEST_PROGS := test_kmod.sh test_xdp_redirect.sh
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/bpf/test_xdp_redirect.c 
b/tools/testing/selftests/bpf/test_xdp_redirect.c
new file mode 100644
index ..ef9e704be140
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_xdp_redirect.c
@@ -0,0 +1,28 @@
+/* Copyright (c) 2017 VMware
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+int _version SEC("version") = 1;
+
+SEC("redirect_to_111")
+int xdp_redirect_to_111(struct xdp_md *xdp)
+{
+   return bpf_redirect(111, 0);
+}
+SEC("redirect_to_222")
+int xdp_redirect_to_222(struct xdp_md *xdp)
+{
+   return bpf_redirect(222, 0);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_xdp_redirect.sh 
b/tools/testing/selftests/bpf/test_xdp_redirect.sh
new file mode 100755
index ..d8c73ed6e040
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_xdp_redirect.sh
@@ -0,0 +1,54 @@
+#!/bin/sh
+# Create 2 namespaces with two veth peers, and
+# forward packets in-between using generic XDP
+#
+# NS1(veth11) NS2(veth22)
+# |   |
+# |   |
+#   (veth1, -- (veth2,
+#   id:111) id:222)
+# | xdp forwarding |
+# --
+
+cleanup()
+{
+   if [ "$?" = "0" ]; then
+   echo "selftests: test_xdp_redirect [PASS]";
+   else
+   echo "selftests: test_xdp_redirect [FAILED]";
+   fi
+
+   set +e
+   ip netns del ns1 2> /dev/null
+   ip netns del ns2 2> /dev/null
+}
+
+set -e
+
+ip netns add ns1
+ip netns add ns2
+
+trap cleanup 0 2 3 6 9
+
+ip link add veth1 index 111 type veth peer name veth11
+ip link add veth2 index 222 type veth peer name veth22
+
+ip link set veth11 netns ns1
+ip link set veth22 netns ns2
+
+ip link set veth1 up
+ip link set veth2 up
+
+ip netns exec ns1 ip addr add 10.1.1.11/24 dev veth11
+ip netns exec ns2 ip addr add 10.1.1.22/24 dev veth22
+
+ip netns exec ns1 ip link set dev veth11 up
+ip netns exec ns2 ip link set dev veth22 up
+
+ip link set dev veth1 xdpgeneric obj test_xdp_redirect.o sec redirect_to_222
+ip link set dev veth2 xdpgeneric obj test_xdp_redirect.o sec redirect_to_111
+
+ip netns exec ns1 ping -c 1 10.1.1.22
+ip netns exec ns2 ping -c 1 10.1.1.11
+
+exit 0
-- 
2.7.4

Re: Qdisc->u32_node - licence to kill

2017-08-07 Thread John Fastabend

On 08/07/2017 12:06 PM, Jiri Pirko wrote:
> Mon, Aug 07, 2017 at 07:47:14PM CEST, john.fastab...@gmail.com wrote:
>> On 08/07/2017 09:41 AM, Jiri Pirko wrote:
>>> Hi Jamal/Cong/David/all.
>>>
>>> Digging in the u32 code deeper now. I need to get rid of tp->q for shared
>>> blocks, but I found out about this:
>>>
>>> struct Qdisc {
>>> ..
>>> void*u32_node;
>>> ..
>>> };
>>>
>>> Yeah, ugly. u32 uses it to store some shared data, tp_c. It actually
>>> stores a linked list of all hashtables added to one qdiscs.
>>>
>>> So basically what you have is, you have 1 root ht per prio/pref. Then
>>> you can have multiple hts, linked from any other ht, does not matter in
>>> which prio/pref they are.
>>>
>>
>> We can create arbitrary hash tables here independent of prio/pref via
>> TCA_U32_DIVISOR. Then these can be linked to other hash tables via
>> TCA_U32_LINK commands.
> 
> Yeah, that's what I thought.
> 
> 
>>
>> prio/pref does not really play any part here from my reading, except as
>> a further specifier in the walk callbacks. Making it a useful filter on
>> dump operations.
> 
> Not correct. prio/pref is one level up priority, independent on specific
> cls implementation. You can have cls_u32 instance on prio 10 and
> cls_flower instance on prio 20. Both work.

Ah right, let's make sure I got this right then (it's been a while since I've
read this code). So the tc_ctl_tfilter hook walks classifiers, inserting the
classifier by prio. Then tcf_classify walks the list of classifiers looking
for any matches, specifically any return codes it recognizes or a return code
greater than zero. u32, though, has this link notion that allows users to jump
to other u32 classifiers that are in this list, because it has a global hash
table list. So the per-prio classifier isolation does not hold in the u32 case.

> 
> In fact, the current u32 "linking" ignores the upper level
> prio/pref and breaks user assumptions when he inserts rules with
> specific prio.
> 
> 

Hmm, yep. I guess users of u32 have a "different" set of assumptions when
working with u32 hash tables than users of the rest of the classifiers.

>>
>>> Do I understand that correctly that prio/pref only has meaning if
>>> linking does not take place, because if there is linking, the prio/pref
>>> of inserted rule is simply ignored?
>>
>> I think even then the prio/pref meaning is dubious, from u32_change,
> 
> Please see tc_ctl_tfilter. That is where prio/pref is processed. What
> you describe is one level down.
> 

got it.

> 
>>
>>for (pins = rtnl_dereference(*ins); pins;
>> ins = &pins->next, pins = rtnl_dereference(*ins))
>>if (TC_U32_NODE(handle) < TC_U32_NODE(pins->handle))
>>break;
>>
>> I think the list insert is done via handle not via prio/pref.
>>
>>>
>>> That is the most confusing thing I saw in net/sched/ so far. 
>>> Is this a bug? Sounds like one.
>>>
>>
>> I don't think this is a bug at very least I don't see how we can
>> change it without breaking users. I know people depend on the hash map
>> capabilities and linking logic.
> 
> Do they insert rules into multiple hashtables with different prio? Why?
> What is the usecase?
> 

A single u32 classifier with multiple hash tables linked together is, I would
think, the normal way. Because the API never disallowed it and the user API is
a bit tricky, it's possible that users use multiple prios, but it is probably
not needed.

Maybe Jamal has some use case where this is required?

> 
>>
>>> Did someone introduce *u32_node (formerly static struct tc_u_common
>>> *u32_list;) just to allow this weirdness?
>>>
>>> Can I just remove this shared tp_c and make the linking to other
>>> hashtables only possible within the same prio/pref? That would make
>>> sense to me.
>>>
>>
>> The idea to make linking hash tables only possible within the same
>> prio/pref will break existing programs. We can't do this its part of
>> UAPI now and people depend on it.
> 
> That's why I asked if that is a bug. I still feel it is. But I
> definitely understand your concern. I'm just trying to figure out how
> to resolve this misdesign :(
> 

I don't have a good argument for the current design, but just want to be
sure we don't break existing users.

.John
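
The u32_change loop quoted above picks the insert position by comparing handles, not prio/pref. A minimal user-space sketch of that pointer-to-pointer idiom follows. The struct knode type, the bucket_insert() helper and the NODE_ID() macro are hypothetical stand-ins (NODE_ID is assumed to behave like TC_U32_NODE, masking the low 12 bits), and RCU dereferencing is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a u32 key node; only what the sketch needs. */
struct knode {
	uint32_t handle;
	struct knode *next;
};

/* Assumption: same idea as TC_U32_NODE(), the low 12 bits of the handle. */
#define NODE_ID(h) ((h) & 0xFFF)

/* Insert n into a singly linked bucket, keeping it sorted by node id. */
static void bucket_insert(struct knode **head, struct knode *n)
{
	struct knode **ins, *pins;

	assert(head && n);

	/* Walk until the first entry whose node id is larger than ours. */
	for (ins = head, pins = *ins; pins; ins = &pins->next, pins = *ins)
		if (NODE_ID(n->handle) < NODE_ID(pins->handle))
			break;

	/* Splice n in front of pins; *ins is either the head or a next link. */
	n->next = pins;
	*ins = n;
}
```

Nothing in the ordering consults a priority: two nodes inserted under different prios would still end up ordered by handle alone, which is the point being made in the thread.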

Re: [PATCH iproute2] lib: Dump ext-ack string by default

2017-08-07 Thread David Ahern

On 8/7/17 1:28 PM, David Ahern wrote:
> @@ -99,7 +95,12 @@ static int nl_dump_ext_err(const struct nlmsghdr *nlh, 
> nl_ext_ack_fn_t errfn)
>   err_nlh = &err->msg;
>   }
>  
> - return errfn(errmsg, off, err_nlh);
> + if (errfn)
> + return errfn(errmsg, off, err_nlh);
> +
> + fprintf(stderr, "Error: %s\n", errmsg);
> +
> + return 1;


Dang it, this is missing an 'if (errmsg)' check, since the message does not
have to exist. Will send a v2.
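
A minimal sketch of how the fallback could look with the missing check added. This is not the posted v2; ext_ack_fn and dump_ext_err() are hypothetical names modelled on nl_ext_ack_fn_t and nl_dump_ext_err(), and the output stream is passed in as a parameter only so the sketch can be exercised on its own.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical callback type, modelled on nl_ext_ack_fn_t. */
typedef int (*ext_ack_fn)(const char *errmsg, unsigned int off,
			  const void *hdr);

/*
 * Returns non-zero when an error was reported to the user.
 * A caller-supplied errfn takes precedence; otherwise the kernel's
 * string is printed, but only if the kernel actually sent one.
 */
static int dump_ext_err(FILE *out, const char *errmsg, unsigned int off,
			const void *hdr, ext_ack_fn errfn)
{
	assert(out);

	if (errfn)
		return errfn(errmsg, off, hdr);

	if (errmsg) {
		fprintf(out, "Error: %s\n", errmsg);
		return 1;
	}

	return 0;
}
```

With no callback and no message the function reports nothing and returns 0, so the caller can fall back to its usual errno-based output.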

Re: [pull request][for-next 0/8] Mellanox, mlx5 shared 2017-08-07

2017-08-07 Thread Saeed Mahameed

On Mon, Aug 7, 2017 at 9:20 PM, David Miller  wrote:
> From: Saeed Mahameed 
> Date: Mon,  7 Aug 2017 13:18:00 +0300
>
>> Hi Dave & Doug,
>>
>> This series contains some low level updates for mlx5 core driver,
>> to be shared as base code for net-next and rdma for-next mlx5
>> 4.14 submissions.
>>
>> Please find more information in the tag message below.
>>
>> Please pull and let me know if there's any problem.
>>
>> Side note:
>> This series merges cleanly with current net-next, but it will conflict with 
>> Jiri's patch
>> "mlx5e: push cls_flower and mqprio setup_tc processing into separate 
>> functions"
>> Which is under review.
>> since this is shared code and must go to both rdma and net-next it has to be
>> based on 4.13-rc4, so there is not much I can do about this.
>
> I resolved the merge conflict as best as I could, please take a look.
>

Thanks Dave!

Looks good. I will do some compilation testing later, and if I find
anything I will post a patch.

Thanks a lot,
-Saeed.

[PATCH iproute2] lib: Dump ext-ack string by default

2017-08-07 Thread David Ahern

In time, errfn can be implemented for link, route, etc. commands to
give a much more detailed response (e.g., point to the attribute
that failed). Doing so requires much more work to process the
message and convert attribute ids to names.

In any case the error string returned by the kernel should be dumped
to the user, so make that happen now.

Signed-off-by: David Ahern 
---
 lib/libnetlink.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 145de2cb0ccf..ee78f768a8bd 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -61,7 +61,6 @@ static int err_attr_cb(const struct nlattr *attr, void *data)
return MNL_CB_OK;
 }
 
-
 /* dump netlink extended ack error message */
 static int nl_dump_ext_err(const struct nlmsghdr *nlh, nl_ext_ack_fn_t errfn)
 {
@@ -72,9 +71,6 @@ static int nl_dump_ext_err(const struct nlmsghdr *nlh, 
nl_ext_ack_fn_t errfn)
const char *errmsg = NULL;
uint32_t off = 0;
 
-   if (!errfn)
-   return 0;
-
/* no TLVs, nothing to do here */
if (!(nlh->nlmsg_flags & NLM_F_ACK_TLVS))
return 0;
@@ -99,7 +95,12 @@ static int nl_dump_ext_err(const struct nlmsghdr *nlh, 
nl_ext_ack_fn_t errfn)
err_nlh = &err->msg;
}
 
-   return errfn(errmsg, off, err_nlh);
+   if (errfn)
+   return errfn(errmsg, off, err_nlh);
+
+   fprintf(stderr, "Error: %s\n", errmsg);
+
+   return 1;
 }
 #else
 #warning "libmnl required for error support"
-- 
2.1.4

[PATCH net-next] liquidio: fix misspelled firmware image filenames

2017-08-07 Thread Felix Manlunas

From: Derek Chickles 

Fix misspelled firmware image filenames advertised via MODULE_FIRMWARE().

Signed-off-by: Derek Chickles 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 8c2cd80..3ec0dd9 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -39,10 +39,14 @@ MODULE_AUTHOR("Cavium Networks, ");
 MODULE_DESCRIPTION("Cavium LiquidIO Intelligent Server Adapter Driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(LIQUIDIO_VERSION);
-MODULE_FIRMWARE(LIO_FW_DIR LIO_FW_BASE_NAME LIO_210SV_NAME LIO_FW_NAME_SUFFIX);
-MODULE_FIRMWARE(LIO_FW_DIR LIO_FW_BASE_NAME LIO_210NV_NAME LIO_FW_NAME_SUFFIX);
-MODULE_FIRMWARE(LIO_FW_DIR LIO_FW_BASE_NAME LIO_410NV_NAME LIO_FW_NAME_SUFFIX);
-MODULE_FIRMWARE(LIO_FW_DIR LIO_FW_BASE_NAME LIO_23XX_NAME LIO_FW_NAME_SUFFIX);
+MODULE_FIRMWARE(LIO_FW_DIR LIO_FW_BASE_NAME LIO_210SV_NAME
+   "_" LIO_FW_NAME_TYPE_NIC LIO_FW_NAME_SUFFIX);
+MODULE_FIRMWARE(LIO_FW_DIR LIO_FW_BASE_NAME LIO_210NV_NAME
+   "_" LIO_FW_NAME_TYPE_NIC LIO_FW_NAME_SUFFIX);
+MODULE_FIRMWARE(LIO_FW_DIR LIO_FW_BASE_NAME LIO_410NV_NAME
+   "_" LIO_FW_NAME_TYPE_NIC LIO_FW_NAME_SUFFIX);
+MODULE_FIRMWARE(LIO_FW_DIR LIO_FW_BASE_NAME LIO_23XX_NAME
+   "_" LIO_FW_NAME_TYPE_NIC LIO_FW_NAME_SUFFIX);
 
 static int ddr_timeout = 1;
 module_param(ddr_timeout, int, 0644);
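
The fix relies on C's concatenation of adjacent string literals inside MODULE_FIRMWARE(). A small sketch of the before and after names is shown here; the macro names and their values are assumptions chosen for illustration, not copied from the driver's headers.

```c
#include <assert.h>
#include <string.h>

/* Assumed values, for illustration only. */
#define FW_DIR        "liquidio/"
#define FW_BASE_NAME  "lio_"
#define CARD_210SV    "210sv"
#define FW_TYPE_NIC   "nic"
#define FW_SUFFIX     ".bin"

/* Name advertised before the fix: the firmware type is missing. */
static const char fw_before[] = FW_DIR FW_BASE_NAME CARD_210SV FW_SUFFIX;

/* Name advertised after the fix: "_" plus the type precede the suffix. */
static const char fw_after[] =
	FW_DIR FW_BASE_NAME CARD_210SV "_" FW_TYPE_NIC FW_SUFFIX;

/* True if the advertised name carries the "_<type>" component. */
static int fw_name_has_type(const char *name)
{
	assert(name);
	return strstr(name, "_" FW_TYPE_NIC FW_SUFFIX) != NULL;
}
```

Under these assumed values the old string names a file that would not exist on disk, so tools that collect firmware from modinfo output would miss the real image.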

Re: [PATCH 0/6] In-kernel QMI handling

2017-08-07 Thread Marcel Holtmann

Hi Bjorn,

>>> This series starts by moving the common definitions of the QMUX
>>> protocol to the
>>> uapi header, as they are shared with clients - both in kernel and
>>> userspace.
>>> 
>>> This series then introduces in-kernel helper functions for aiding the
>>> handling
>>> of QMI encoded messages in the kernel. QMI encoding is a wire-format
>>> used in
>>> exchanging messages between the majority of QRTR clients and
>>> services.
>> 
>> This raises a few red-flags for me.
> 
> I'm glad it does. In discussions with the responsible team within
> Qualcomm I've highlighted a number of concerns about enabling this
> support in the kernel. Together we're continuously looking into what
> should be pushed out to user space, and trying to not introduce
> unnecessary new users.
> 
>> So far, we've kept almost everything QMI related in userspace and
>> handled all QMI control-channel messages from libraries like libqmi or
>> uqmi via the cdc-wdm driver and the "rmnet" interface via the qmi_wwan
>> driver.  The kernel drivers just serve as the transport.
>> 
> 
> The path that was taken to support the MSM-style devices was to
> implement net/qrtr, which exposes a socket interface to abstract the
> physical transports (QMUX or IPCROUTER in Qualcomm terminology).
> 
> As I share your view on letting the kernel handle the transportation only,
> the task of keeping track of registered services (service id -> node and
> port mapping) was done in a user space process, and so far we've only
> ever had to deal with QMI encoded messages in various user space tools.

I think that the transport and multiplexing can be in the kernel as long as it 
is done as a proper subsystem, similar to Phonet or CAIF. That means it should 
have a well-defined socket interface that can easily be used from userspace, 
but also clean in-kernel interface handling.

If Qualcomm is supportive of this effort and is willing to actually assist 
and/or open some of the specs or interface descriptions, then this is a good 
thing. Service registration and cleanup is really done best in the kernel. Same 
applies to multiplexing. Trying to do multiplexing in userspace is always 
cumbersome and leads to overhead that is of no gain. For example within oFono, 
we had to force everything to go via oFono since it was the only sane way of 
handling it. Other approaches were error prone and full of race conditions. You 
need a central entity that can clean up.

As for defining a UAPI to share some code, I am actually not sure that 
is such a good idea. For example, the QMI code in oFono follows a much simpler 
approach. And I am not convinced that all the macros are actually beneficial. 
For example, the whole set of netlink macros is pretty cumbersome. Adding some 
Documentation/qmi.txt on what the wire format looks like and what is expected 
seems to be a much better approach.

Regards

Marcel

Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-08-07 Thread Stephen Hemminger

On Mon, 07 Aug 2017 11:45:17 -0700 (PDT)
David Miller  wrote:

> From: David Ahern 
> Date: Mon, 7 Aug 2017 12:09:31 -0600
> 
> > On 8/7/17 12:06 PM, Stephen Hemminger wrote:  
> >>> Does not work. Seems like you pushed the RFC commit which was known to
> >>> be incomplete.  
> >> 
> >> Patches welcome.  
> > 
> > What exists does not even compile. Patches will be sent once you fix that.  
> 
> Yeah seriously Stephen, you created this huge mess so the onus is really
> on you to fix it up.

It is fixed now.
Dave, I asked for test cases, and received none.

Re: Qdisc->u32_node - licence to kill

2017-08-07 Thread Jiri Pirko

Mon, Aug 07, 2017 at 07:47:14PM CEST, john.fastab...@gmail.com wrote:
>On 08/07/2017 09:41 AM, Jiri Pirko wrote:
>> Hi Jamal/Cong/David/all.
>> 
>> Digging in the u32 code deeper now. I need to get rid of tp->q for shared
>> blocks, but I found out about this:
>> 
>> struct Qdisc {
>>  ..
>> void*u32_node;
>>  ..
>> };
>> 
>> Yeah, ugly. u32 uses it to store some shared data, tp_c. It actually
>> stores a linked list of all hashtables added to one qdiscs.
>> 
>> So basically what you have is, you have 1 root ht per prio/pref. Then
>> you can have multiple hts, linked from any other ht, does not matter in
>> which prio/pref they are.
>> 
>
>We can create arbitrary hash tables here independent of prio/pref via
>TCA_U32_DIVISOR. Then these can be linked to other hash tables via
>TCA_U32_LINK commands.

Yeah, that's what I thought.


>
>prio/pref does not really play any part here from my reading, except as
>a further specifier in the walk callbacks. Making it a useful filter on
>dump operations.

Not correct. prio/pref is a priority one level up, independent of the specific
cls implementation. You can have a cls_u32 instance on prio 10 and a
cls_flower instance on prio 20. Both work.

In fact, the current u32 "linking" ignores the upper level
prio/pref and breaks the user's assumptions when he inserts rules with a
specific prio.


>
>> Do I understand that correctly that prio/pref only has meaning if
>> linking does not take place, because if there is linking, the prio/pref
>> of inserted rule is simply ignored?
>
>I think even then the prio/pref meaning is dubious, from u32_change,

Please see tc_ctl_tfilter. That is where prio/pref is processed. What
you describe is one level down.


>
>for (pins = rtnl_dereference(*ins); pins;
> ins = &pins->next, pins = rtnl_dereference(*ins))
>if (TC_U32_NODE(handle) < TC_U32_NODE(pins->handle))
>break;
>
>I think the list insert is done via handle not via prio/pref.
>
>> 
>> That is the most confusing thing I saw in net/sched/ so far. 
>> Is this a bug? Sounds like one.
>> 
>
>I don't think this is a bug at very least I don't see how we can
>change it without breaking users. I know people depend on the hash map
>capabilities and linking logic.

Do they insert rules into multiple hashtables with different prio? Why?
What is the usecase?


>
>> Did someone introduce *u32_node (formerly static struct tc_u_common
>> *u32_list;) just to allow this weirdness?
>> 
>> Can I just remove this shared tp_c and make the linking to other
>> hashtables only possible within the same prio/pref? That would make
>> sense to me.
>> 
>
>The idea to make linking hash tables only possible within the same
>prio/pref will break existing programs. We can't do this its part of
>UAPI now and people depend on it.

That's why I asked if that is a bug. I still feel it is. But I
definitely understand your concern. I'm just trying to figure out how
to resolve this misdesign :(

Re: [PATCH net-next v4 1/2] bpf: add support for sys_enter_* and sys_exit_* tracepoints

2017-08-07 Thread Alexei Starovoitov


On 8/4/17 1:00 PM, Yonghong Song wrote:

> Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
> style tracepoints. The iovisor/bcc issue #748
> (https://github.com/iovisor/bcc/issues/748) documents this issue.
> For example, if you try to attach a bpf program to tracepoints
> syscalls/sys_enter_newfstat, you will get the following error:
>    # ./tools/trace.py t:syscalls:sys_enter_newfstat
>    Ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
>    Failed to attach BPF to tracepoint
>
> The main reason is that syscalls/sys_enter_* and syscalls/sys_exit_*
> tracepoints are treated differently from other tracepoints and there
> is no bpf hook to it.
>
> This patch adds bpf support for these syscalls tracepoints by
>   . permitting bpf attachment in ioctl PERF_EVENT_IOC_SET_BPF
>   . calling bpf programs in perf_syscall_enter and perf_syscall_exit
>
> The legality of bpf program ctx access is also checked.
> Function trace_event_get_offsets returns correct max offset for each
> specific syscall tracepoint, which is compared against the maximum offset
> access in bpf program.
>
> Signed-off-by: Yonghong Song 


lgtm
Acked-by: Alexei Starovoitov

Re: [PATCH net-next v3 00/13] Update DSA's FDB API and perform switchdev cleanup

2017-08-07 Thread Florian Fainelli

On 08/07/2017 07:59 AM, Vivien Didelot wrote:
> Hi Arkadi,
> 
> Arkadi Sharshevsky  writes:
> 
>> The patchset adds support for configuring static FDB entries via the
>> switchdev notification chain. The current method for FDB configuration
>> uses the switchdev's bridge bypass implementation. In order to support 
>> this legacy way and to perform the switchdev cleanup, the implementation
>> is moved inside DSA.
>>
>> The DSA drivers cannot sync the software bridge with hardware learned
>> entries and use the switchdev's implementation of bypass FDB dumping.
>> Because they are the only ones using this functionality, the fdb_dump
>> implementation is moved from switchdev code into DSA.
>>
>> Finally, after these changes a major cleanup in switchdev can be done.
>> ---
>> Please see individual patches for patch specific change logs.
>> v1->v2
>> - Split MDB/vlan dump removal into core/driver removal.
>>
>> v2->v3
>> - The self implementation for FDB add/del is moved inside DSA.
> 
> v3 behaves correctly:
> 
> # bridge fdb add e4:1d:2d:a5:f0:2a dev lan3
> # bridge fdb add e4:1d:2d:a5:f0:4a dev lan4 master
> # bridge fdb show
> 01:00:5e:00:00:01 dev eth0 self permanent
> 01:00:5e:00:00:01 dev eth1 self permanent
> b6:f2:c8:3a:1c:71 dev lan0 master br0 permanent
> e4:1d:2d:a5:f0:2a dev lan3 self static
> e4:1d:2d:a5:f0:4a dev lan4 offload master br0 permanent
> e4:1d:2d:a5:f0:4a dev lan4 self static
> 01:00:5e:00:00:01 dev br0 self permanent
> # bridge fdb del e4:1d:2d:a5:f0:2a dev lan3
> # bridge fdb del e4:1d:2d:a5:f0:4a dev lan4 master
> # bridge fdb show
> 01:00:5e:00:00:01 dev eth0 self permanent
> 01:00:5e:00:00:01 dev eth1 self permanent
> b6:f2:c8:3a:1c:71 dev lan0 master br0 permanent
> 01:00:5e:00:00:01 dev br0 self permanent
> 
> Tested-by: Vivien Didelot 

Same here:

Tested-by: Florian Fainelli 

thanks!
-- 
Florian

[PATCH net-next v2 2/2] bpf: Extend check_uarg_tail_zero() checks

2017-08-07 Thread Mickaël Salaün

The function check_uarg_tail_zero() was created from bpf(2) for
BPF_OBJ_GET_INFO_BY_FD without taking either the access_ok() or the
PAGE_SIZE checks. Make these checks more generally available (though they
are unlikely to be triggered), extend the memory range check, and add an
explanation including why the ToCToU should not be a security concern.

Signed-off-by: Mickaël Salaün 
Acked-by: Daniel Borkmann 
Cc: Alexei Starovoitov 
Cc: David S. Miller 
Cc: Kees Cook 
Cc: Martin KaFai Lau 
Link: 
https://lkml.kernel.org/r/CAGXu5j+vRGFvJZmjtAcT8Hi8B+Wz0e1b6VKYZHfQP_=dxzc...@mail.gmail.com
---
 kernel/bpf/syscall.c | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c653ee0bd162..fbe09a0cccf4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -48,6 +48,15 @@ static const struct bpf_map_ops * const bpf_map_types[] = {
 #undef BPF_MAP_TYPE
 };
 
+/*
+ * If we're handed a bigger struct than we know of, ensure all the unknown bits
+ * are 0 - i.e. new user-space does not rely on any kernel feature extensions
+ * we don't know about yet.
+ *
+ * There is a ToCToU between this function call and the following
+ * copy_from_user() call. However, this is not a concern since this function is
+ * meant to be a future-proofing of bits.
+ */
 static int check_uarg_tail_zero(void __user *uaddr,
size_t expected_size,
size_t actual_size)
@@ -57,6 +66,12 @@ static int check_uarg_tail_zero(void __user *uaddr,
unsigned char val;
int err;
 
+   if (unlikely(actual_size > PAGE_SIZE))  /* silly large */
+   return -E2BIG;
+
+   if (unlikely(!access_ok(VERIFY_READ, uaddr, actual_size)))
+   return -EFAULT;
+
if (actual_size <= expected_size)
return 0;
 
@@ -1393,17 +1408,6 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
if (!capable(CAP_SYS_ADMIN) && sysctl_unprivileged_bpf_disabled)
return -EPERM;
 
-   if (!access_ok(VERIFY_READ, uattr, 1))
-   return -EFAULT;
-
-   if (size > PAGE_SIZE)   /* silly large */
-   return -E2BIG;
-
-   /* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
err = check_uarg_tail_zero(uattr, sizeof(attr), size);
if (err)
return err;
-- 
2.13.3
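
The tail check described in this patch can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel function: tail_is_zero() is a hypothetical name, it reads a plain buffer instead of user memory (so there is no access_ok() or get_user() step), and SKETCH_MAX_SIZE stands in for PAGE_SIZE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for PAGE_SIZE; the real value depends on the architecture. */
#define SKETCH_MAX_SIZE 4096

/*
 * Accept a struct that is larger than the one we know about only if
 * every byte past the known size is zero, i.e. the caller is not
 * relying on fields we do not understand.
 */
static int tail_is_zero(const unsigned char *buf, size_t expected_size,
			size_t actual_size)
{
	size_t i;

	assert(buf);

	if (actual_size > SKETCH_MAX_SIZE)	/* silly large */
		return -E2BIG;

	if (actual_size <= expected_size)
		return 0;

	for (i = expected_size; i < actual_size; i++)
		if (buf[i])
			return -E2BIG;

	return 0;
}
```

The size test comes first, as in the v2 ordering above, so an oversized request is rejected before any byte is read.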

[PATCH net-next v2 1/2] bpf: Move check_uarg_tail_zero() upward

2017-08-07 Thread Mickaël Salaün

The function check_uarg_tail_zero() may be useful for other parts of the
code in syscall.c. Move this function to the beginning of the file.

Signed-off-by: Mickaël Salaün 
Acked-by: Daniel Borkmann 
Cc: Alexei Starovoitov 
Cc: David S. Miller 
Cc: Kees Cook 
Cc: Martin KaFai Lau 
---

This is needed for the Landlock patch series. :)
---
 kernel/bpf/syscall.c | 52 ++--
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 6c772adabad2..c653ee0bd162 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -48,6 +48,32 @@ static const struct bpf_map_ops * const bpf_map_types[] = {
 #undef BPF_MAP_TYPE
 };
 
+static int check_uarg_tail_zero(void __user *uaddr,
+   size_t expected_size,
+   size_t actual_size)
+{
+   unsigned char __user *addr;
+   unsigned char __user *end;
+   unsigned char val;
+   int err;
+
+   if (actual_size <= expected_size)
+   return 0;
+
+   addr = uaddr + expected_size;
+   end  = uaddr + actual_size;
+
+   for (; addr < end; addr++) {
+   err = get_user(val, addr);
+   if (err)
+   return err;
+   if (val)
+   return -E2BIG;
+   }
+
+   return 0;
+}
+
 static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
 {
struct bpf_map *map;
@@ -1246,32 +1272,6 @@ static int bpf_map_get_fd_by_id(const union bpf_attr 
*attr)
return fd;
 }
 
-static int check_uarg_tail_zero(void __user *uaddr,
-   size_t expected_size,
-   size_t actual_size)
-{
-   unsigned char __user *addr;
-   unsigned char __user *end;
-   unsigned char val;
-   int err;
-
-   if (actual_size <= expected_size)
-   return 0;
-
-   addr = uaddr + expected_size;
-   end  = uaddr + actual_size;
-
-   for (; addr < end; addr++) {
-   err = get_user(val, addr);
-   if (err)
-   return err;
-   if (val)
-   return -E2BIG;
-   }
-
-   return 0;
-}
-
 static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
   const union bpf_attr *attr,
   union bpf_attr __user *uattr)
-- 
2.13.3

Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-08-07 Thread David Miller

From: David Ahern 
Date: Mon, 7 Aug 2017 12:09:31 -0600

> On 8/7/17 12:06 PM, Stephen Hemminger wrote:
>>> Does not work. Seems like you pushed the RFC commit which was known to
>>> be incomplete.
>> 
>> Patches welcome.
> 
> What exists does not even compile. Patches will be sent once you fix that.

Yeah seriously Stephen, you created this huge mess so the onus is really
on you to fix it up.

Re: [PATCH] of_mdio: use of_property_read_u32_array()

2017-08-07 Thread Rob Herring

On Mon, Aug 7, 2017 at 1:01 PM, Florian Fainelli  wrote:
> On 08/07/2017 09:18 AM, Sergei Shtylyov wrote:
>> Hello!
>>
>> On 08/07/2017 05:18 PM, Rob Herring wrote:
>>
>>>> The "fixed-link" prop support predated of_property_read_u32_array(), so
>>>> basically had to open-code it. Using the modern API saves 24 bytes of
>>>> the object code (ARM gcc 4.8.5); the only behavior change would be that
>>>> the prop length check is now less strict (however the strict pre-check
>>>> done in of_phy_is_fixed_link() is left intact anyway)...
>>>>
>>>> Signed-off-by: Sergei Shtylyov 
>>>>
>>>> ---
>>>> The patch is against the 'dt/next' branch of Rob Herring's
>>>> 'linux-git' repo
>>>> plus the previously posted patch killing the useless local variable in
>>>> of_phy_register_fixed_link().
>>>
>>> It shouldn't depend on anything in my tree and David normally takes
>>> of_mdio.c changes.
>>
>>MAINTAINERS still only point at the DT repo, perhaps it should be
>> updated?
>
> More or less done with this (minus the repo part):
>
> http://patchwork.ozlabs.org/patch/795887/

Really I'd like to see this fixed by moving of_mdio.c and of_net.c to
drivers/net/ as we've done for all other subsystems.

Rob

Re: [PATCH v3 net-next 0/7] net: l3mdev: Support for sockets bound to enslaved device

2017-08-07 Thread David Ahern

On 8/7/17 12:39 PM, David Miller wrote:
> Series applied, let's see if it builds this time :-)

I did an allyesconfig build before sending just to make sure, so our
mileage better not vary.

Re: [PATCH v3 net-next 0/7] net: l3mdev: Support for sockets bound to enslaved device

2017-08-07 Thread David Miller

From: David Ahern 
Date: Mon,  7 Aug 2017 08:44:15 -0700

> A missing piece to the VRF puzzle is the ability to bind sockets to
> devices enslaved to a VRF. This patch set adds the enslaved device
> index, sdif, to IPv4 and IPv6 socket lookups. The end result for users
> is the following scope options for services:
> 
> 1. "global" services - sockets not bound to any device
> 
>Allows 1 service to work across all network interfaces with
>connected sockets bound to the VRF the connection originates
>(Requires net.ipv4.tcp_l3mdev_accept=1 for TCP and
> net.ipv4.udp_l3mdev_accept=1 for UDP)
> 
> 2. "VRF" local services - sockets bound to a VRF
> 
>Sockets work across all network interfaces enslaved to a VRF but
>are limited to just the one VRF.
> 
> 3. "device" services - sockets bound to a specific network interface
> 
>Service works only through the one specific interface.
> 
> v3
> - convert __inet_lookup_established in dccp_v4_err; missed in v2
> 
> v2
> - remove sk_lookup struct and add sdif as an argument to existing
>   functions
> 
> Changes since RFC:
> - no significant logic changes; mainly whitespace cleanups

Series applied, let's see if it builds this time :-)
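
The three service scopes listed in the cover letter can be modelled as a single predicate on the socket's bound device. This is a simplified sketch, not the kernel's lookup code: sk_scope_matches() and its parameters are hypothetical, it assumes dif is the VRF device index when the ingress port is enslaved, and it folds the per-protocol l3mdev_accept sysctls into one flag.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * bound: ifindex the socket is bound to, 0 if unbound ("global")
 * dif:   ifindex the packet is received on (assumed to be the VRF
 *        device when the ingress port is enslaved)
 * sdif:  ifindex of the enslaved port, 0 if the port is not enslaved
 */
static bool sk_scope_matches(int bound, int dif, int sdif,
			     bool l3mdev_accept)
{
	assert(bound >= 0 && dif > 0 && sdif >= 0);

	/* 1. Global service: crosses into a VRF only if the sysctl allows. */
	if (bound == 0)
		return sdif == 0 || l3mdev_accept;

	/* 2. VRF-local service (bound == dif) or
	 * 3. device service on the enslaved port (bound == sdif).
	 */
	return bound == dif || bound == sdif;
}
```

For example, with a VRF at index 10 and an enslaved port at index 3, a socket bound to 3 matches only traffic arriving on that port, while a socket bound to 10 matches traffic from every port in the VRF.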

Re: pull-request: wireless-drivers-next 2017-08-07

2017-08-07 Thread David Miller

From: Kalle Valo 
Date: Mon, 07 Aug 2017 17:55:40 +0300

> here's the first pull request to net-next for 4.14, more info in the
> signed tag below. This time there's a simple conflict in iwlwifi but
> you can fix it just like Stephen did:
> 
> https://lkml.kernel.org/r/20170804120408.0d147...@canb.auug.org.au
> 
> Please let me know if you have any problems.

Pulled, thanks Kalle.

Re: [PATCH net-next v1 2/2] bpf: Extend check_uarg_tail_zero() checks

2017-08-07 Thread Daniel Borkmann


On 08/07/2017 06:36 PM, Mickaël Salaün wrote:

> The function check_uarg_tail_zero() was created from bpf(2) for
> BPF_OBJ_GET_INFO_BY_FD without taking the access_ok() nor the PAGE_SIZE
> checks. Make this checks more generally available while unlikely to be
> triggered, extend the memory range check and add an explanation
> including why the ToCToU should not be a security concern.
>
> Signed-off-by: Mickaël Salaün 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: David S. Miller 
> Cc: Kees Cook 
> Cc: Martin KaFai Lau 
> Link: 
> https://lkml.kernel.org/r/CAGXu5j+vRGFvJZmjtAcT8Hi8B+Wz0e1b6VKYZHfQP_=dxzc...@mail.gmail.com
> ---
>   kernel/bpf/syscall.c | 26 +++---
>   1 file changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index c653ee0bd162..b884fdc371e0 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -48,6 +48,15 @@ static const struct bpf_map_ops * const bpf_map_types[] = {
>   #undef BPF_MAP_TYPE
>   };
>
> +/*
> + * If we're handed a bigger struct than we know of, ensure all the unknown bits
> + * are 0 - i.e. new user-space does not rely on any kernel feature extensions
> + * we dont know about yet.


Nit: don't


> + *
> + * There is a ToCToU between this function call and the following
> + * copy_from_user() call. However, this should not be a concern since this

Let's make it a bit clearer to the reader: s/should not/is not/


> + * function is meant to be a future-proofing of bits.
> + */
>   static int check_uarg_tail_zero(void __user *uaddr,
> size_t expected_size,
> size_t actual_size)
> @@ -57,6 +66,12 @@ static int check_uarg_tail_zero(void __user *uaddr,
> unsigned char val;
> int err;
>
> +   if (unlikely(!access_ok(VERIFY_READ, uaddr, actual_size)))
> +   return -EFAULT;
> +
> +   if (unlikely(actual_size > PAGE_SIZE))   /* silly large */
> +   return -E2BIG;
> +


Yeah, moving the checks into check_uarg_tail_zero() is
fine by me. Can we make the 'silly large' test first, so
we don't generate unnecessary work if we bail out later
anyway?

Other than that:

Acked-by: Daniel Borkmann 

Thanks,
Daniel


if (actual_size <= expected_size)
return 0;

@@ -1393,17 +1408,6 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
if (!capable(CAP_SYS_ADMIN) && sysctl_unprivileged_bpf_disabled)
return -EPERM;

-   if (!access_ok(VERIFY_READ, uattr, 1))
-   return -EFAULT;
-
-   if (size > PAGE_SIZE)/* silly large */
-   return -E2BIG;
-
-   /* If we're handed a bigger struct than we know of,
-* ensure all the unknown bits are 0 - i.e. new
-* user-space does not rely on any kernel feature
-* extensions we dont know about yet.
-*/
err = check_uarg_tail_zero(uattr, sizeof(attr), size);
if (err)
return err;
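
The review comment above (run the "silly large" size test before any other work) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel function: tail_is_zero(), tail_is_zero_demo() and SKETCH_PAGE_SIZE are hypothetical names, the buffer is ordinary memory so the access_ok()/get_user() steps are left out, and returning -E2BIG for a non-zero tail is assumed to mirror the kernel helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumption: stand-in for the kernel's PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * User-space model of the tail check: returns 0 when every byte past
 * expected_size is zero, -E2BIG otherwise.
 */
int tail_is_zero(const unsigned char *buf, size_t expected_size,
		 size_t actual_size)
{
	size_t i;

	/* Cheap size test first, so oversized requests bail out early. */
	if (actual_size > SKETCH_PAGE_SIZE)
		return -E2BIG;

	/* Caller's struct is not larger than ours: nothing unknown to check. */
	if (actual_size <= expected_size)
		return 0;

	/* Any non-zero byte in the unknown tail means an unsupported feature. */
	for (i = expected_size; i < actual_size; i++)
		if (buf[i] != 0)
			return -E2BIG;

	return 0;
}

/* Small self-check exercising the three outcomes. */
void tail_is_zero_demo(void)
{
	unsigned char attr[16] = { 1, 2, 3, 4 };

	assert(tail_is_zero(attr, 8, 16) == 0);      /* tail is all zero */
	attr[12] = 0xff;
	assert(tail_is_zero(attr, 8, 16) == -E2BIG); /* unknown bit set */
	assert(tail_is_zero(attr, 16, 8) == 0);      /* smaller struct */
	/* Rejected by the size test before the buffer is touched. */
	assert(tail_is_zero(attr, 8, SKETCH_PAGE_SIZE + 1) == -E2BIG);
}
```

With the size test first, the oversized case returns before the loop ever reads the buffer, which is the ordering asked for in the review.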

[PATCH net-next 1/1] netvsc: make sure and unregister datapath

2017-08-07 Thread Stephen Hemminger

Go back to switching datapath directly in the notifier callback.
Otherwise datapath might not get switched on unregister.

No need for calling the NOTIFY_PEERS notifier since that is only for
a gratuitous ARP/ND packet; but that is not required with Hyper-V
because both VF and synthetic NIC have the same MAC address.

Reported-by: Vitaly Kuznetsov 
Fixes: 0c195567a8f6 ("netvsc: transparent VF management")
Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h |  3 --
 drivers/net/hyperv/netvsc.c |  2 --
 drivers/net/hyperv/netvsc_drv.c | 71 -
 3 files changed, 28 insertions(+), 48 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index c701b059c5ac..d1ea99a12cf2 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -724,14 +724,11 @@ struct net_device_context {
struct net_device __rcu *vf_netdev;
struct netvsc_vf_pcpu_stats __percpu *vf_stats;
struct work_struct vf_takeover;
-   struct work_struct vf_notify;
 
/* 1: allocated, serial number is valid. 0: not allocated */
u32 vf_alloc;
/* Serial number of the VF to team with */
u32 vf_serial;
-
-   bool datapath;  /* 0 - synthetic, 1 - VF nic */
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 9598220b3bcc..208f03aa83de 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -60,8 +60,6 @@ void netvsc_switch_datapath(struct net_device *ndev, bool vf)
   sizeof(struct nvsp_message),
   (unsigned long)init_pkt,
   VM_PKT_DATA_INBAND, 0);
-
-   net_device_ctx->datapath = vf;
 }
 
 static struct netvsc_device *alloc_net_device(void)
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index e75c0f852a63..eb0023f55fe1 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1649,55 +1649,35 @@ static int netvsc_register_vf(struct net_device *vf_netdev)
return NOTIFY_OK;
 }
 
-/* Change datapath */
-static void netvsc_vf_update(struct work_struct *w)
+static int netvsc_vf_up(struct net_device *vf_netdev)
 {
-   struct net_device_context *ndev_ctx
-   = container_of(w, struct net_device_context, vf_notify);
-   struct net_device *ndev = hv_get_drvdata(ndev_ctx->device_ctx);
+   struct net_device_context *net_device_ctx;
struct netvsc_device *netvsc_dev;
-   struct net_device *vf_netdev;
-   bool vf_is_up;
-
-   if (!rtnl_trylock()) {
-   schedule_work(w);
-   return;
-   }
+   struct net_device *ndev;
 
-   vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
-   if (!vf_netdev)
-   goto unlock;
+   ndev = get_netvsc_byref(vf_netdev);
+   if (!ndev)
+   return NOTIFY_DONE;
 
-   netvsc_dev = rtnl_dereference(ndev_ctx->nvdev);
+   net_device_ctx = netdev_priv(ndev);
+   netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
if (!netvsc_dev)
-   goto unlock;
-
-   vf_is_up = netif_running(vf_netdev);
-   if (vf_is_up != ndev_ctx->datapath) {
-   if (vf_is_up) {
-   netdev_info(ndev, "VF up: %s\n", vf_netdev->name);
-   rndis_filter_open(netvsc_dev);
-   netvsc_switch_datapath(ndev, true);
-   netdev_info(ndev, "Data path switched to VF: %s\n",
-   vf_netdev->name);
-   } else {
-   netdev_info(ndev, "VF down: %s\n", vf_netdev->name);
-   netvsc_switch_datapath(ndev, false);
-   rndis_filter_close(netvsc_dev);
-   netdev_info(ndev, "Data path switched from VF: %s\n",
-   vf_netdev->name);
-   }
+   return NOTIFY_DONE;
 
-   /* Now notify peers through VF device. */
-   call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, ndev);
-   }
-unlock:
-   rtnl_unlock();
+   /* Bump refcount when datapath is active - Why? */
+   rndis_filter_open(netvsc_dev);
+
+   /* notify the host to switch the data path. */
+   netvsc_switch_datapath(ndev, true);
+   netdev_info(ndev, "Data path switched to VF: %s\n", vf_netdev->name);
+
+   return NOTIFY_OK;
 }
 
-static int netvsc_vf_notify(struct net_device *vf_netdev)
+static int netvsc_vf_down(struct net_device *vf_netdev)
 {
struct net_device_context *net_device_ctx;
+   struct netvsc_device *netvsc_dev;
struct net_device *ndev;
 
ndev = get_netvsc_byref(vf_netdev);
@@ -1705,7 +1685,13 @@ static int netvsc_vf_notify(struct net_device *vf_netdev)
return

[PATCH net-next 0/1] netvsc: another VF datapath fix

2017-08-07 Thread Stephen Hemminger

Previous fix was incomplete.

Stephen Hemminger (1):
  netvsc: make sure and unregister datapath

 drivers/net/hyperv/hyperv_net.h |  3 --
 drivers/net/hyperv/netvsc.c |  2 --
 drivers/net/hyperv/netvsc_drv.c | 71 -
 3 files changed, 28 insertions(+), 48 deletions(-)

-- 
2.11.0

Re: [PATCH RFC v2 3/5] samples/bpf: Fix inline asm issues building samples on arm64

2017-08-07 Thread David Miller


Please, no.

The amount of hellish hacks we are adding to deal with this is getting
way out of control.

BPF programs MUST have their own set of asm headers, this is the
only way to get around this issue in the long term.

I am also strongly against adding -static to the build.

Re: [PATCH net 1/1] s390/qeth: fix L3 next-hop in xmit qeth hdr

2017-08-07 Thread David Miller

From: Ursula Braun 
Date: Mon,  7 Aug 2017 13:28:39 +0200

> From: Julian Wiedmann 
> 
> On L3, the qeth_hdr struct needs to be filled with the next-hop
> IP address.
> The current code accesses rtable->rt_gateway without checking that
> rtable is a valid address. The accidental access to a lowcore area
> results in a random next-hop address in the qeth_hdr.
> rtable (or more precisely, skb_dst(skb)) can be NULL in rare cases
> (for instance together with AF_PACKET sockets).
> This patch adds the missing NULL-ptr checks.
> 
> Signed-off-by: Julian Wiedmann 
> Signed-off-by: Ursula Braun 
> Fixes: 87e7597b5a3 qeth: Move away from using neighbour entries in qeth_l3_fill_header()

Applied.

Re: [PATCH v2] hysdn: fix to a race condition in put_log_buffer

2017-08-07 Thread David Miller

From: Anton Volkov 
Date: Mon,  7 Aug 2017 15:54:14 +0300

> The synchronization type that was used earlier to guard the loop that
> deletes unused log buffers may lead to a situation that prevents any
> thread from going through the loop.
> 
> The patch deletes the previously used synchronization mechanism and moves
> the loop under the spin_lock so that similar cases won't be feasible in
> the future.
> 
> Found by Linux Driver Verification project (linuxtesting.org).
> 
> Signed-off-by: Anton Volkov 
> ---
> v2: Fixed coding style issues

Applied.

Re: [PATCH net-next v1 1/2] bpf: Move check_uarg_tail_zero() upward

2017-08-07 Thread Daniel Borkmann

On 08/07/2017 06:36 PM, Mickaël Salaün wrote:

The function check_uarg_tail_zero() may be useful for other parts of the
code in the syscall.c file. Move this function to the beginning of the
file.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: Kees Cook 
Cc: Martin KaFai Lau 

Acked-by: Daniel Borkmann

Re: [PATCH] hns3: fix unused function warning

2017-08-07 Thread David Miller

From: Arnd Bergmann 
Date: Mon,  7 Aug 2017 12:41:53 +0200

> Without CONFIG_PCI_IOV, we get a harmless warning about an
> unused function:
> 
> drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:2273:13: error: 'hclge_disable_sriov' defined but not used [-Werror=unused-function]
> 
> The #ifdefs in this driver are obviously wrong, so this just
> removes them and uses an IS_ENABLED() check that does the same
> thing correctly in a more readable way.
> 
> Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
> Signed-off-by: Arnd Bergmann 

Applied.
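
The commit message above replaces #ifdef blocks with an IS_ENABLED() check. As a rough illustration of why that silences the unused-function warning, here is a user-space sketch. Everything in it is an assumption made for the example: IS_ENABLED_SKETCH and the SK_* helpers are simplified stand-ins modeled on the kernel's kconfig.h trick (built-in options only, no module handling), and CONFIG_DEMO_FEATURE_A, CONFIG_DEMO_FEATURE_B, disable_feature() and teardown() are hypothetical names, not taken from the hns3 driver.

```c
#include <assert.h>

/*
 * Simplified stand-ins for the kernel's kconfig.h helpers.
 * A macro defined to 1 expands to 1; an undefined one expands to 0.
 */
#define SK_ARG_PLACEHOLDER_1 0,
#define SK_TAKE_SECOND(ignored, val, ...) val
#define SK_IS_DEFINED_2(arg1_or_junk) SK_TAKE_SECOND(arg1_or_junk 1, 0, 0)
#define SK_IS_DEFINED_1(val) SK_IS_DEFINED_2(SK_ARG_PLACEHOLDER_##val)
#define IS_ENABLED_SKETCH(option) SK_IS_DEFINED_1(option)

#define CONFIG_DEMO_FEATURE_A 1	/* hypothetical option, enabled */
/* CONFIG_DEMO_FEATURE_B is intentionally left undefined. */

static int disable_feature_calls;

/*
 * Always compiled and always referenced, so the compiler has no
 * "defined but not used" complaint even when the option is off.
 */
static void disable_feature(void)
{
	disable_feature_calls++;
}

int teardown(void)
{
	/* Constant-false branch: the call is visible to the compiler
	 * but can be dropped as dead code.
	 */
	if (IS_ENABLED_SKETCH(CONFIG_DEMO_FEATURE_B))
		disable_feature();

	return disable_feature_calls;
}
```

Unlike an #ifdef around the call site, the function stays referenced in every configuration, so both variants get compile coverage.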

Re: [pull request][for-next 0/8] Mellanox, mlx5 shared 2017-08-07

2017-08-07 Thread David Miller

From: Saeed Mahameed 
Date: Mon,  7 Aug 2017 13:18:00 +0300

> Hi Dave & Doug,
> 
> This series contains some low level updates for mlx5 core driver,
> to be shared as base code for net-next and rdma for-next mlx5
> 4.14 submissions.
> 
> Please find more information in the tag message below.
> 
> Please pull and let me know if there's any problem.
> 
> Side note:
> This series merges cleanly with current net-next, but it will conflict with
> Jiri's patch "mlx5e: push cls_flower and mqprio setup_tc processing into
> separate functions", which is under review.
> Since this is shared code and must go to both rdma and net-next, it has to be
> based on 4.13-rc4, so there is not much I can do about this.

I resolved the merge conflict as best as I could, please take a look.

Pulled, thanks.

[PATCH v1 net] TCP_USER_TIMEOUT and tcp_keepalive should conform to RFC5482

2017-08-07 Thread Rao Shoaib

Change from version 0: Rationale behind the change:

The man page for tcp(7) states

when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will
override keepalive to  determine  when to close a connection due to keepalive
failure.

This is ambiguous at best. The user's expectation is most likely that the
connection will be reset after TCP_USER_TIMEOUT milliseconds of inactivity.

The code, however, waits for the keepalive to kick in (default 2 hours) and
then resets the connection after one failure.

What is the rationale for that? The same effect can be obtained by simply
changing the value of tcp_keep_alive_probes.

Since the TCP_USER_TIMEOUT option was added based on RFC 5482, we need to
follow the RFC, which states:

4.2 TCP keep-Alives:
   Some TCP implementations, such as those in BSD systems, use a
   different abort policy for TCP keep-alives than for user data.  Thus,
   the TCP keep-alive mechanism might abort a connection that would
   otherwise have survived the transient period without connectivity.
   Therefore, if a connection that enables keep-alives is also using the
   TCP User Timeout Option, then the keep-alive timer MUST be set to a
   value larger than that of the adopted USER TIMEOUT.

This patch enforces the MUST and also dissociates the user timeout from
keepalive.  A man page patch will be submitted separately.

Signed-off-by: Rao Shoaib 
---
 net/ipv4/tcp.c   | 10 --
 net/ipv4/tcp_timer.c |  9 +
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 71ce33d..f2af44d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2628,7 +2628,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
break;
 
case TCP_KEEPIDLE:
-   if (val < 1 || val > MAX_TCP_KEEPIDLE)
+   /* Per RFC5482 keepalive_time must be > user_timeout */
+   if (val < 1 || val > MAX_TCP_KEEPIDLE ||
+   ((val * HZ) <= icsk->icsk_user_timeout))
err = -EINVAL;
else {
tp->keepalive_time = val * HZ;
@@ -2724,8 +2726,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
case TCP_USER_TIMEOUT:
/* Cap the max time in ms TCP will retry or probe the window
 * before giving up and aborting (ETIMEDOUT) a connection.
+* Per RFC5482 TCP user timeout must be < keepalive_time.
+* If the default value changes later -- all bets are off.
 */
-   if (val < 0)
+   if (val < 0 || (tp->keepalive_time &&
+   tp->keepalive_time <= msecs_to_jiffies(val)) ||
+  net->ipv4.sysctl_tcp_keepalive_time <= msecs_to_jiffies(val))
err = -EINVAL;
else
icsk->icsk_user_timeout = msecs_to_jiffies(val);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c0f..d39fe60 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -664,14 +664,7 @@ static void tcp_keepalive_timer (unsigned long data)
elapsed = keepalive_time_elapsed(tp);
 
if (elapsed >= keepalive_time_when(tp)) {
-   /* If the TCP_USER_TIMEOUT option is enabled, use that
-* to determine when to timeout instead.
-*/
-   if ((icsk->icsk_user_timeout != 0 &&
-   elapsed >= icsk->icsk_user_timeout &&
-   icsk->icsk_probes_out > 0) ||
-   (icsk->icsk_user_timeout == 0 &&
-   icsk->icsk_probes_out >= keepalive_probes(tp))) {
+   if (icsk->icsk_probes_out >= keepalive_probes(tp)) {
tcp_send_active_reset(sk, GFP_ATOMIC);
tcp_write_err(sk);
goto out;
-- 
2.7.4
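
The rule the patch enforces (keepalive idle time strictly greater than the user timeout) can be sketched as a pure comparison. This is a user-space illustration only: keepidle_exceeds_user_timeout() and keepidle_demo() are hypothetical helpers, the comparison is done in milliseconds rather than jiffies, and a user timeout of 0 is treated as "not set", matching how the TCP_KEEPIDLE hunk behaves when icsk_user_timeout is 0.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when a TCP_KEEPIDLE value (seconds) is strictly greater
 * than a TCP_USER_TIMEOUT value (milliseconds), i.e. when the pair
 * satisfies the RFC 5482 section 4.2 requirement quoted above.
 */
bool keepidle_exceeds_user_timeout(unsigned int keepidle_secs,
				   unsigned int user_timeout_ms)
{
	/* Widen before multiplying so large idle values cannot overflow. */
	return (unsigned long long)keepidle_secs * 1000ULL > user_timeout_ms;
}

/* Small self-check covering the accepted and rejected cases. */
void keepidle_demo(void)
{
	/* Default 2-hour keepalive with a 30 s user timeout: accepted. */
	assert(keepidle_exceeds_user_timeout(7200, 30000));
	/* Equal values are rejected; the patch uses "<=" for -EINVAL. */
	assert(!keepidle_exceeds_user_timeout(30, 30000));
	/* Keepalive shorter than the user timeout: rejected. */
	assert(!keepidle_exceeds_user_timeout(10, 30000));
	/* User timeout of 0 (unset) never blocks a valid idle time. */
	assert(keepidle_exceeds_user_timeout(1, 0));
}
```

Under the proposed change, an application that wants a short user timeout together with keepalives would have to set the timeout first and then pick a keepalive idle time above it; a pair that fails this comparison would get -EINVAL from setsockopt().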

Re: [PATCH net-next v3 13/13] net: switchdev: Remove bridge bypass support from switchdev

2017-08-07 Thread Florian Fainelli

On 08/06/2017 06:15 AM, Arkadi Sharshevsky wrote:
> Currently the bridge port flags, vlans, FDBs and MDBs can be offloaded
> through the bridge code, making switchdev's SELF bridge bypass
> implementation redundant. This implies several changes:
> - No need for dump infra in switchdev, DSA's special case is handled
>   privately.
> - Remove obj_dump from switchdev_ops.
> - FDBs are removed from obj_add/del routines, due to the fact that they
>   are offloaded through the bridge notification chain.
> - The switchdev_port_bridge_xx() and switchdev_port_fdb_xx() functions
>   can be removed.
> 
> Signed-off-by: Arkadi Sharshevsky 
> Reviewed-by: Vivien Didelot 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH net-next v4 2/2] bpf: add a test case for syscalls/sys_{enter|exit}_* tracepoints

2017-08-07 Thread Daniel Borkmann


On 08/05/2017 01:00 AM, Yonghong Song wrote:

Signed-off-by: Yonghong Song 


Acked-by: Daniel Borkmann

Re: [PATCH net-next v3 12/13] net: bridge: Remove FDB deletion through switchdev object

2017-08-07 Thread Florian Fainelli

On 08/06/2017 06:15 AM, Arkadi Sharshevsky wrote:
> At this point no driver supports FDB add/del through switchdev object
> but rather via notification chain, thus, it is removed.
> 
> Signed-off-by: Arkadi Sharshevsky 
> Reviewed-by: Vivien Didelot 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH net-next v3 09/13] net: dsa: Remove support for MDB dump from DSA's drivers

2017-08-07 Thread Florian Fainelli

On 08/06/2017 06:15 AM, Arkadi Sharshevsky wrote:
> This is done as a preparation before removing support for MDB dump from
> DSA core. The MDBs are synced with the bridge and thus there is no
> need for special dump operation support.
> 
> Signed-off-by: Arkadi Sharshevsky 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-08-07 Thread David Ahern

On 8/7/17 12:06 PM, Stephen Hemminger wrote:
>> Does not work. Seems like you pushed the RFC commit which was known to
>> be incomplete.
> 
> Patches welcome.

What exists does not even compile. Patches will be sent once you fix that.

Re: [PATCH net-next v3 07/13] net: dsa: Remove support for vlan dump from DSA's drivers

2017-08-07 Thread Florian Fainelli

On 08/06/2017 06:15 AM, Arkadi Sharshevsky wrote:
> This is done as a preparation before removing support for vlan dump from
> DSA core. The vlans are synced with the bridge and thus there is no
> need for special dump operation support.
> 
> Signed-off-by: Arkadi Sharshevsky 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH net-next v3 08/13] net: dsa: Remove support for bypass bridge port attributes/vlan set

2017-08-07 Thread Florian Fainelli

On 08/06/2017 06:15 AM, Arkadi Sharshevsky wrote:
> The bridge port attributes/vlan for DSA devices should be set only
> from bridge code. Furthermore, the vlans are fully synced with the
> bridge, so there is no need for special dump support.
> 
> Signed-off-by: Arkadi Sharshevsky 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH net-next v3 05/13] net: dsa: Move FDB add/del implementation inside DSA

2017-08-07 Thread Florian Fainelli

On 08/06/2017 06:15 AM, Arkadi Sharshevsky wrote:
> Currently DSA uses switchdev's implementation of FDB add/del ndos. This
> patch moves the implementation inside DSA in order to support the legacy
> way for static FDB configuration.
> 
> Signed-off-by: Arkadi Sharshevsky 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH net-next v3 04/13] net: dsa: Add support for learning FDB through notification

2017-08-07 Thread Florian Fainelli

On 08/06/2017 06:15 AM, Arkadi Sharshevsky wrote:
> Add support for learning FDB through notification. The driver defers
> the hardware update via ordered work queue. In case of a successful
> FDB add a notification is sent back to bridge.
> 
> In case of hw FDB del failure the static FDB will be deleted from
> the bridge; thus, the interface is moved to the down state in order to
> indicate the inconsistent situation.
> 
> Signed-off-by: Arkadi Sharshevsky 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-08-07 Thread Stephen Hemminger

On Mon, 7 Aug 2017 10:48:23 -0600
David Ahern  wrote:

> On 8/4/17 10:47 AM, Stephen Hemminger wrote:
> > I will put in the libmnl version. If it doesn't work because no one sent
> > me test cases, then fine. send a patch for that.  
> 
> This commit:
> 
> commit b6432e68ac2f1f6b4ea50aa0d6d47e72c445c71c
> Author: Stephen Hemminger 
> Date:   Fri Aug 4 09:52:15 2017 -0700
> 
> iproute: Add support for extended ack to rtnl_talk
> 
> 
> Does not work. Seems like you pushed the RFC commit which was known to
> be incomplete.

Patches welcome.

> First, the Config is HAVE_MNL not HAVE_LIBMNL which is in lib/libnetlink.c.
> 
> Second, changing that to HAVE_MNL does not work -- something is not
> getting passed in correctly. Just remove the semicolon on the else path:
> 
> +#else
> +/* No extended error ack without libmnl */
> +static int nl_dump_ext_err(const struct nlmsghdr *nlh, nl_ext_ack_fn_t errfn)
> +{
> +   return 0;
> +}
> +#endif
> 
> and you will see that HAVE_MNL is never defined.

Ok, that I will fix.

Re: [PATCH net-next v3 10/13] net: dsa: Remove redundant MDB dump support

2017-08-07 Thread Florian Fainelli

On 08/06/2017 06:15 AM, Arkadi Sharshevsky wrote:
> Currently the MDB HW database is synced with the bridge's one; thus,
> there is no need to support special dump functionality.
> 
> Signed-off-by: Arkadi Sharshevsky 

Reviewed-by: Florian Fainelli 
-- 
Florian
