Re: [net-next PATCH 2/4] samples/bpf: make bpf_load.c code compatible with ELF maps section changes

2017-05-02 Thread Jesper Dangaard Brouer
On Tue, 2 May 2017 17:54:51 -0700
Alexei Starovoitov  wrote:

> On Tue, May 02, 2017 at 02:31:56PM +0200, Jesper Dangaard Brouer wrote:
> > This patch does proper parsing of the ELF "maps" section, in-order to
> > be both backwards and forwards compatible with changes to the map
> > definition struct bpf_map_def, which gets compiled into the ELF file.
> > 
> > The assumption is that new features with value zero, means that they
> > are not in-use.  For backward compatibility where loading an ELF file
> > with a smaller struct bpf_map_def, only copy objects ELF size, leaving
> > rest of loaders struct zero.  For forward compatibility where ELF file
> > have a larger struct bpf_map_def, only copy loaders own struct size
> > and verify that rest of the larger struct is zero, assuming this means
> > the newer feature was not activated, thus it should be safe for this
> > older loader to load this newer ELF file.
> > 
> > Fixes: fb30d4b71214 ("bpf: Add tests for map-in-map")
> > Fixes: 409526bea3c3 ("samples/bpf: bpf_load.c detect and abort if ELF maps 
> > section size is wrong")
> > Signed-off-by: Jesper Dangaard Brouer   
> 
> I would just merge patches 2 and 3 to reduce churn,
> but it looks like great improvement already.

I could have combined them, but I prefer keeping them separate to keep
the ELF changes separated from changing a sample program e.g.
map_perf_test_user.c.  IHMO is is cleaner this way.

> Acked-by: Alexei Starovoitov 
 
Thanks

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net v3] driver: veth: Fix one possbile memleak when fail to register_netdevice

2017-05-02 Thread Xin Long
On Wed, May 3, 2017 at 10:07 AM, Gao Feng  wrote:
>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
>> On Behalf Of Xin Long
>> Sent: Wednesday, May 3, 2017 12:59 AM
>> On Tue, May 2, 2017 at 7:03 PM, Gao Feng  wrote:
>> >> From: Xin Long [mailto:lucien@gmail.com]
>> >> Sent: Tuesday, May 2, 2017 3:56 PM
>> >> On Sat, Apr 29, 2017 at 11:51 AM,   wrote:
>> >> > From: Gao Feng 
>> > [...]
>> >> > -static void veth_dev_free(struct net_device *dev)
>> >> > +static void veth_destructor_free(struct net_device *dev)
>> >> >  {
>> >> > free_percpu(dev->vstats);
>> >> > +}
>> >> not sure why you needed to add this function.
>> >> to use free_percpu() directly may be clearer.
>> >
>> > Because both of ndo_uninit and destructor need to perform same free
>> statements.
>> > It is good at maintain the codes with the common function.
>> >>
>> >> > +
>> >> > +static void veth_dev_uninit(struct net_device *dev) {
>> >> call free_percpu() here, no need to check dev->reg_state.
>> >> free_percpu will just return if dev->vstats is NULL.
>> >
>> > It would break the original design if don't check the reg_state.
>> > The original logic is that free the resources in the destructor, not in 
>> > ndo_init.
>> I got what you're doing now, can you pls try to fix this with:
>>
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -219,10 +219,9 @@ static int veth_dev_init(struct net_device *dev)
>> return 0;
>>  }
>>
>> -static void veth_dev_free(struct net_device *dev)
>> +static void veth_dev_uninit(struct net_device *dev)
>>  {
>> free_percpu(dev->vstats);
>> -   free_netdev(dev);
>>  }
>>
>>  #ifdef CONFIG_NET_POLL_CONTROLLER
>> @@ -279,6 +278,7 @@ static void veth_set_rx_headroom(struct net_device
>> *dev, int new_hr)
>>
>>  static const struct net_device_ops veth_netdev_ops = {
>> .ndo_init= veth_dev_init,
>> +   .ndo_uninit  = veth_dev_uninit,
>> .ndo_open= veth_open,
>> .ndo_stop= veth_close,
>> .ndo_start_xmit  = veth_xmit,
>> @@ -317,7 +317,7 @@ static void veth_setup(struct net_device *dev)
>>NETIF_F_HW_VLAN_STAG_TX |
>>NETIF_F_HW_VLAN_CTAG_RX |
>>NETIF_F_HW_VLAN_STAG_RX);
>> -   dev->destructor = veth_dev_free;
>> +   dev->destructor = free_netdev;
>> dev->max_mtu = ETH_MAX_MTU;
>>
>> dev->hw_features = VETH_FEATURES;
>>
>>
>> just as what other virtual nic drivers do (vxlan, geneve, macsec, bridge 
>> )
>>
>
> The fix you mentioned change the original logic.
> The dev->vstats is freed in advance in the ndo_uninit, not destructor.
> It may break the backward.
Sorry, I didn't get your "backward"
I can't see there will be any problem caused by it.

can you say this patch also break the 'backward' ?
https://patchwork.ozlabs.org/patch/748964/

It's really weird to do dev->reg_state check in ndo_unint
ndo_unint is supposed to free the memory alloced in ndo_init.

>
> Regards
> Feng
>
>


Re: [PATCH 1/1] IB/mlx5: Add port_xmit_wait to counter registers read

2017-05-02 Thread Leon Romanovsky
On Mon, May 01, 2017 at 05:30:08PM +0100, Tim Wright wrote:
> Add port_xmit_wait to the error counters read by mlx5_ib_process_mad to
> ensure sysfs port counter provides correct value for PortXmitWait.
> Otherwise the sysfs port_xmit_wait file always contains zero.
>
> The previous MAD_IFC implementation populated this counter, but it was
> removed during the migration to PPCNT for error counters (32-bit only).
>
> Signed-off-by: Tim Wright 
> ---
>  drivers/infiniband/hw/mlx5/mad.c | 2 ++
>  include/linux/mlx5/mlx5_ifc.h| 4 +++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
>

Thanks,
Acked-by: Leon Romanovsky 


signature.asc
Description: PGP signature


Re: [PATCH] xdp: use common helper for netlink extended ack reporting

2017-05-02 Thread Johannes Berg
On Wed, 2017-05-03 at 00:39 +0200, Daniel Borkmann wrote:
> Small follow-up to d74a32acd59a ("xdp: use netlink extended ACK
> reporting")
> in order to let drivers all use the same NL_SET_ERR_MSG_MOD() helper
> macro
> for reporting. This also ensures that we consistently add the
> driver's
> prefix for dumping the report in user space to indicate that the
> error
> message is driver specific and not coming from core code.
> Furthermore,
> NL_SET_ERR_MSG_MOD() now reuses NL_SET_ERR_MSG() and thus makes all
> macros
> check the pointer as suggested.
> 
> References: https://www.spinics.net/lists/netdev/msg433267.html
> Signed-off-by: Daniel Borkmann 

I did wonder about the whole _TRY_ thing. LGTM

Reviewed-by: Johannes Berg 

johannes


Re: [PATCH v4 net-next 00/10] net/ncsi: Add debugging functionality

2017-05-02 Thread David Miller

Sorry, the net-next tree is closed right now as we are in the merge
window.

Please resubmit this when the net-next tree opens back up.

Thank you.


[PATCH v4 net-next 05/10] net/ncsi: Ethtool operation to get NCSI channel info

2017-05-02 Thread Gavin Shan
This adds ethtool command (ETHTOOL_GNCSICINFO) to retrieve the
NCSI channel information for the specified one. The simplified
output of this command is shown as follows from the modified
(private) ethtool:

 # ethtool --ncsi eth0 info
 NCSI channel 0:0 version:
   version:
   alpha2: 
   f/w name:
   f/w version:
   PCI IDs:   
   Manufacture ID: 

   Generic capability:   000f (000f)
   Multicast capability: 0007 ()
   Buffer capability:4000
   AEN capability:   0007 ()
   VLAN capability:  0007 ()
   Filters:  VLAN (2), Mixed (0), MC (2), UC (4)

   Link status: 000ff66b  
   MAC filter entries:
   00:00:00:00:00:00
   48:c4:b6:f0:a6:d4
   b6:fa:8f:70:00:00
   00:00:00:01:07:34
   VLAN filter entries:
   0004
   bc84

Signed-off-by: Gavin Shan 
---
 include/linux/ethtool.h  |   2 +
 include/uapi/linux/ethtool.h | 151 +++
 net/core/ethtool.c   |  22 +++
 net/ncsi/ncsi-ethtool.c  | 120 ++
 4 files changed, 295 insertions(+)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 720bb4d..5704b8b 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -376,5 +376,7 @@ struct ethtool_ops {
  const struct ethtool_link_ksettings *);
int (*get_ncsi_channels)(struct net_device *,
 struct ethtool_ncsi_channels *);
+   int (*get_ncsi_channel_info)(struct net_device *,
+struct ethtool_ncsi_channel_info *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index e43aacf..81fbd51 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1332,6 +1332,7 @@ struct ethtool_per_queue_op {
 #define ETHTOOL_PHY_STUNABLE   0x004f /* Set PHY tunable configuration */
 
 #define ETHTOOL_GNCSICHANNELS  0x0050 /* Get NCSI channels */
+#define ETHTOOL_GNCSICINFO 0x0051 /* Get NCSI channel information */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
@@ -1780,4 +1781,154 @@ struct ethtool_ncsi_channels {
 #define ETHTOOL_NCSI_CHANNEL_ACTIVE(1 << 8)
 #define ETHTOOL_NCSI_CHANNEL_FLAGS 0x100
 };
+
+/**
+ * struct ethtool_ncsi_channel_info - NCSI channel information
+ *
+ * @cmd: Command number = %ETHTOOL_GNCSICINFO
+ * @id:  NCSI channel identifier
+ * @version: BCD encoded NCSI version
+ * @alpha2: BCD encoded NCSI version
+ * @fw_name: Firmware name string
+ * @fw_version: Firmware version
+ * @pci_ids: PCI identifier
+ * @mf_id: Manufacture identifier
+ * @cap_generic: Generic capability list
+ * @cap_bc: Broadcast capability list
+ * @setting_bc: Broadcast filtering setting
+ * @cap_mc: Multicast capability list
+ * @setting_mc: Multicast filtering setting
+ * @cap_buf: Length of receive buffer
+ * @cap_aen: AEN capability list
+ * @setting_aen: AEN setting
+ * @cap_vlan: VLAN filtering capability list
+ * @setting_vlan: VLAN filltering setting
+ * @cap_vlan_filter: Number of VLAN filtering entries
+ * @cap_mixed_filter: Number of mixed filtering entries
+ * @cap_mc_filter: Number of multicast filtering entries
+ * @cap_uc_filter: Number of unicast filtering entries
+ * @link_status: Link status
+ * @link_other_ind: Link other indication
+ * @link_oem: Link OEM information
+ * @mac_valid_bits: Bitmap for valid MAC filtering entries
+ * @mac: MAC filtering entries
+ * @vlan_valid_bits: Bitmap for valid VLAN filtering entries
+ * @vlan: VLAN filtering entries
+ */
+struct ethtool_ncsi_channel_info {
+   __u32   cmd;
+   __u32   id;
+   __u32   version;
+   __u32   alpha2;
+   __u8fw_name[12];
+   __u32   fw_version;
+   __u16   pci_ids[4];
+   __u32   mf_id;
+   __u32   cap_generic;
+#define ETHTOOL_NCSI_G_HWA (1 << 0) /* HW arbitration   */
+#define ETHTOOL_NCSI_G_HDS (1 << 1) /* HNC driver status change */
+#define ETHTOOL_NCSI_G_FC  (1 << 2) /* HNC to MC flow control   */
+#define ETHTOOL_NCSI_G_FC1 (1 << 3) /* MC to HNC flow control   */
+#define ETHTOOL_NCSI_G_MC  (1 << 4) /* Global MC filtering  */
+#define ETHTOOL_NCSI_G_HWA_MASK0x60
+#define ETHTOOL_NCSI_G_HWA_UNKNOWN 0x00 /* Unknown HW arbitration   */
+#define ETHTOOL_NCSI_G_HWA_SUPPORT 0x20 /* Supported HW arbitration */
+#define ETHTOOL_NCSI_G_HWA_NOT_SUPPORT 0x40 /* No HW arbitration*/
+#define ETHTOOL_NCSI_G_HWA_RESERVED0x60 /* Reserved HW arbitration  */
+#define ETHTOOL_NCSI_G_MASK0x7f
+   __u32   cap_bc;
+   __u32   setting_bc;
+#define ETHTOOL_NCSI_BC_ARP(1 << 0) /* ARP packet filtering */

[PATCH v4 net-next 06/10] net/ncsi: Ethtool operation to get NCSI hw statistics

2017-05-02 Thread Gavin Shan
This adds ethtool command (ETHTOOL_GNCSISTATS) to retrieve the
NCSI hardware statistics. The simplified output of this command is
shown as follows from the modified (private) ethtool. It's obvious
the HW statistics isn't fetched from hardware yet, which is to be
sorted out later.

 # ethtool --ncsi eth0 stats
 NCSI statistics as below

   hnc_cnt_hi:   0
   hnc_cnt_lo:   0
   hnc_rx_bytes: 0
   hnc_tx_bytes: 0
   hnc_rx_uc_pkts:   0
   hnc_rx_mc_pkts:   0
   hnc_rx_bc_pkts:   0
   hnc_tx_uc_pkts:   0
   hnc_tx_mc_pkts:   0
   hnc_tx_bc_pkts:   0
   hnc_fcs_err:  0
   hnc_align_err:0
   hnc_false_carrier:0
   hnc_runt_pkts:0
   hnc_jabber_pkts:  0
   hnc_rx_pause_xon: 0
   hnc_rx_pause_xoff:0
   hnc_tx_pause_xon: 0
   hnc_tx_pause_xoff:0
   hnc_tx_s_collision:   0
   hnc_tx_m_collision:   0
   hnc_l_collision:  0
   hnc_e_collision:  0
   hnc_rx_ctl_frames:0
   hnc_rx_64_frames: 0
   hnc_rx_127_frames:0
   hnc_rx_255_frames:0
   hnc_rx_511_frames:0
   hnc_rx_1023_frames:   0
   hnc_rx_1522_frames:   0
   hnc_rx_9022_frames:   0
   hnc_tx_64_frames: 0
   hnc_tx_127_frames:0
   hnc_tx_255_frames:0
   hnc_tx_511_frames:0
   hnc_tx_1023_frames:   0
   hnc_tx_1522_frames:   0
   hnc_tx_9022_frames:   0
   hnc_rx_valid_bytes:   0
   hnc_rx_runt_pkts: 0
   hnc_rx_jabber_pkts:   0
   ncsi_rx_cmds: 0
   ncsi_dropped_cmds:0
   ncsi_cmd_type_errs:   0
   ncsi_cmd_csum_errs:   0
   ncsi_rx_pkts: 0
   ncsi_tx_pkts: 0
   ncsi_tx_aen_pkts: 0
   pt_tx_pkts:   0
   pt_tx_dropped:0
   pt_tx_channel_err:0
   pt_tx_us_err: 0
   pt_rx_pkts:   0
   pt_rx_dropped:0
   pt_rx_channel_err:0
   pt_rx_us_err: 0
   pt_rx_os_err: 0

Signed-off-by: Gavin Shan 
---
 include/linux/ethtool.h  |   2 +
 include/uapi/linux/ethtool.h | 124 +++
 net/core/ethtool.c   |  29 ++
 net/ncsi/ncsi-ethtool.c  |  87 ++
 4 files changed, 242 insertions(+)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 5704b8b..6d712ca 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -378,5 +378,7 @@ struct ethtool_ops {
 struct ethtool_ncsi_channels *);
int (*get_ncsi_channel_info)(struct net_device *,
 struct ethtool_ncsi_channel_info *);
+   int (*get_ncsi_stats)(struct net_device *,
+ struct ethtool_ncsi_stats *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 81fbd51..472773c 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1333,6 +1333,7 @@ struct ethtool_per_queue_op {
 
 #define ETHTOOL_GNCSICHANNELS  0x0050 /* Get NCSI channels */
 #define ETHTOOL_GNCSICINFO 0x0051 /* Get NCSI channel information */
+#define ETHTOOL_GNCSISTATS 0x0052 /* Get NCSI HW statistics */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
@@ -1931,4 +1932,127 @@ struct ethtool_ncsi_channel_info {
__u32   vlan_valid_bits;
__u16   vlan[16];
 };
+
+/**
+ * struct ethtool_ncsi_stats - NCSI hardware statistics
+ *
+ * @cmd: Command number = %ETHTOOL_GNCSISTATS
+ * @hnc_cnt_hi: Counter cleared
+ * @hnc_cnt_lo: Counter cleared
+ * @hnc_rx_bytes: Rx bytes
+ * @hnc_tx_bytes: Tx bytes
+ * @hnc_rx_uc_pkts: Rx UC packets
+ * @hnc_rx_mc_pkts: Rx MC packets
+ * @hnc_rx_bc_pkts: Rx BC packets
+ * @hnc_tx_uc_pkts: Tx UC packets
+ * @hnc_tx_mc_pkts: Tx MC packets
+ * @hnc_tx_bc_pkts: Tx BC packets
+ * @hnc_fcs_err: FCS errors
+ * @hnc_align_err: Alignment errors
+ * @hnc_false_carrier: False carrier detection
+ * @hnc_runt_pkts: Rx runt packets
+ * @hnc_jabber_pkts: Rx jabber packets
+ * @hnc_rx_pause_xon: Rx pause XON frames
+ * @hnc_rx_pause_xoff: Rx XOFF frames
+ * @hnc_tx_pause_xon: Tx XON frames
+ * @hnc_tx_pause_xoff: Tx XOFF frames
+ * @hnc_tx_s_collision: Single collision frames
+ * @hnc_tx_m_collision: Multiple collision frames
+ * @hnc_l_collision: Late collision frames
+ * @hnc_e_collision: Excessive collision frames
+ * @hnc_rx_ctl_frames: Rx control frames
+ * @hnc_rx_64_frames: Rx 64-bytes frames
+ * @hnc_rx_127_frames: Rx 65-127 bytes frames
+ * @hnc_rx_255_frames: Rx 128-255 bytes frames
+ * @hnc_rx_511_frames: Rx 256-511 bytes frames
+ * @hnc_rx_1023_frames: Rx 512-1023 bytes frames
+ * @hnc_rx_1522_frames: Rx 1024-1522 bytes frames
+ * @hnc_rx_9022_frames: Rx 1523-9022 bytes frames
+ * @hnc_tx_64_frames: Tx 64-bytes frames
+ * @hnc_tx_127_frames: Tx 65-127 bytes frames
+ * @hnc_tx_255_frames: Tx 128-255 bytes frames
+ * @hnc_tx_511_frames: Tx 256-511 bytes frames
+ * @hnc_tx_1023_frames: Tx 

[PATCH v4 net-next 00/10] net/ncsi: Add debugging functionality

2017-05-02 Thread Gavin Shan
This series supports NCSI debugging infrastructure by adding several
ethtool commands and one debugfs file. It was inspired by the reported
issues: No available package and channel are probed successfully.
Obviously, we don't have a debugging infrastructure for NCSI stack yet.

The first 3 patches, fixing some issues, aren't relevant to the
subject. I included them because I expect they can be merged beofre
the code for debugging infrastructure. PATCH[4,5,6,7] adds ethtool
commands to dump NCSI topology, channel information, HW statistics
and SW statistics. PATCH[8] adds debugfs entry /sys/kernel/debug/
ncsi/eth0/pkt, which sends NCSI command packet on write. The NCSI
response packet are dumped on read. PATCH[9,10] fix two issues
found by the debugging functionality.

Changelog
=
v4:
   * Replace debugfs entries with ethtool commands   (Dave)
   * Add ethtool commands to dump NCSI topology, channel info and
 HW statistics   (Gavin)
v3:
   * Use pr_debug() instead of pr_warn() upon failure to create
 debugfs directory or file   (Joe 
Perches)
   * Use relative debugfs path/file names in debug messages in
 ncsi-debug.c(Joe 
Perches)
   * Use const specifier for @ncsi_pkt_handlers and @ranges  (Joe 
Perches)
   * Eliminate CONFIG_NET_NCSI_DEBUG ifdef's in *.c  (Jakub 
Kicinski)
v2:
   * Use debugfs instead of procfs   (Joe 
Perches)

Gavin Shan (10):
  net/ncsi: Disable HWA mode when no channels are found
  net/ncsi: Properly track channel monitor timer state
  net/ncsi: Enforce failover on link monitor timeout
  net/ncsi: Ethtool operation to get NCSI topology
  net/ncsi: Ethtool operation to get NCSI channel info
  net/ncsi: Ethtool operation to get NCSI hw statistics
  net/ncsi: Ethtool operation to get NCSI sw statistics
  net/ncsi: Support NCSI packet generation
  net/ncsi: No error report on DP response to non-existing package
  net/ncsi: Fix length of GVI response packet

 include/linux/ethtool.h  |   8 +
 include/uapi/linux/ethtool.h | 312 ++
 net/core/ethtool.c   | 120 +++
 net/ncsi/Kconfig |   9 +
 net/ncsi/Makefile|   4 +-
 net/ncsi/internal.h  |  71 
 net/ncsi/ncsi-aen.c  |  14 +-
 net/ncsi/ncsi-cmd.c  |  13 +-
 net/ncsi/ncsi-debug.c| 759 +++
 net/ncsi/ncsi-ethtool.c  | 321 ++
 net/ncsi/ncsi-manage.c   |  56 +++-
 net/ncsi/ncsi-rsp.c  |  34 +-
 12 files changed, 1708 insertions(+), 13 deletions(-)
 create mode 100644 net/ncsi/ncsi-debug.c
 create mode 100644 net/ncsi/ncsi-ethtool.c

-- 
2.7.4



[PATCH v4 net-next 07/10] net/ncsi: Ethtool operation to get NCSI sw statistics

2017-05-02 Thread Gavin Shan
This adds ethtool command (ETHTOOL_GNCSISWSTATS) to retrieve the
NCSI software statistics. The simplified output of this command is
shown as follows from the modified (private) ethtool.

 COMMAND  OK   TIMEOUT  ERROR
 
 CIS  32   29   0
 SP   10   70
 DP   17   14   0
 EC   100
 ECNT 100
 AE   100
 GLS  10   00
 SMA  100
 DBF  100
 GC   200
 GP   200

 RESPONSE OK   TIMEOUT  ERROR
 
 CIS  300
 SP   300
 DP   201
 EC   100
 ECNT 100
 AE   100
 GLS  10   00
 SMA  100
 DBF  100
 GC   002
 GP   200

 AEN  OK   TIMEOUT  ERROR
 

Signed-off-by: Gavin Shan 
---
 include/linux/ethtool.h  |  2 ++
 include/uapi/linux/ethtool.h | 20 
 net/core/ethtool.c   | 29 
 net/ncsi/Kconfig |  9 +
 net/ncsi/Makefile|  1 +
 net/ncsi/internal.h  | 13 +
 net/ncsi/ncsi-aen.c  | 14 +-
 net/ncsi/ncsi-cmd.c  | 12 +++-
 net/ncsi/ncsi-debug.c| 45 
 net/ncsi/ncsi-ethtool.c  | 34 +
 net/ncsi/ncsi-manage.c   |  4 
 net/ncsi/ncsi-rsp.c  | 19 ++-
 12 files changed, 199 insertions(+), 3 deletions(-)
 create mode 100644 net/ncsi/ncsi-debug.c

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 6d712ca..eb57142 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -380,5 +380,7 @@ struct ethtool_ops {
 struct ethtool_ncsi_channel_info *);
int (*get_ncsi_stats)(struct net_device *,
  struct ethtool_ncsi_stats *);
+   int (*get_ncsi_sw_stats)(struct net_device *,
+struct ethtool_ncsi_sw_stats *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 472773c..bf6fa2b 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1334,6 +1334,7 @@ struct ethtool_per_queue_op {
 #define ETHTOOL_GNCSICHANNELS  0x0050 /* Get NCSI channels */
 #define ETHTOOL_GNCSICINFO 0x0051 /* Get NCSI channel information */
 #define ETHTOOL_GNCSISTATS 0x0052 /* Get NCSI HW statistics */
+#define ETHTOOL_GNCSISWSTATS   0x0053 /* Get NCSI software statistics */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
@@ -2055,4 +2056,23 @@ struct ethtool_ncsi_stats {
__u64   pt_rx_us_err;
__u64   pt_rx_os_err;
 };
+
+/**
+ * struct ethtool_ncsi_sw_stats - NCSI software statistics
+ *
+ * @cmd: Command number = %ETHTOOL_GNCSISWSTATS
+ * @command: Statistics for sent command packets
+ * @response: Statistics for received response packets
+ * @aen: Statistics for received AEN packets
+ */
+struct ethtool_ncsi_sw_stats {
+   __u32   cmd;
+#define ETHTOOL_NCSI_SW_STAT_OK0
+#define ETHTOOL_NCSI_SW_STAT_TIMEOUT   1
+#define ETHTOOL_NCSI_SW_STAT_ERROR 2
+#define ETHTOOL_NCSI_SW_STAT_MAX   3
+   __u64   command[128][ETHTOOL_NCSI_SW_STAT_MAX];
+   __u64   response[128][ETHTOOL_NCSI_SW_STAT_MAX];
+   __u64   aen[256][ETHTOOL_NCSI_SW_STAT_MAX];
+};
 #endif /* _UAPI_LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index f26aa36..998d29b 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -838,6 +838,32 @@ static int ethtool_get_ncsi_stats(struct net_device *dev,
return ret;
 }
 
+static int ethtool_get_ncsi_sw_stats(struct net_device *dev,
+void __user *useraddr)
+{
+   struct ethtool_ncsi_sw_stats *enss;
+   int ret;
+
+   if (!dev->ethtool_ops->get_ncsi_sw_stats)
+   return -EOPNOTSUPP;
+
+   enss = kzalloc(sizeof(*enss), GFP_KERNEL);
+   if (!enss)
+   return -ENOMEM;
+
+   if (copy_from_user(>cmd, useraddr, sizeof(enss->cmd))) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   ret = dev->ethtool_ops->get_ncsi_sw_stats(dev, enss);
+   if (!ret && copy_to_user(useraddr, enss, sizeof(*enss)))
+   ret = -EFAULT;
+out:
+   kfree(enss);
+   return ret;
+}
+
 static void
 warn_incomplete_ethtool_legacy_settings_conversion(const char *details)
 {
@@ -2884,6 

[PATCH v4 net-next 04/10] net/ncsi: Ethtool operation to get NCSI topology

2017-05-02 Thread Gavin Shan
This adds ethtool command (ETHTOOL_GNCSICHANNELS) to retrieve the
NCSI channels that are associated with the specified netdev. The
ethtool operation (get_ncsi_channels()) is initialized or destroyed
when the NCSI device is registerred or unregistered. The userspace
and kernel has to negotiate on the total number of NCSI channels
so that userspace can allocate enough memory to convey data. Here
is the example output from modified (private) ethtool:

 # ethtool --ncsi eth0 channels
 2 channels:
 0:0 Active
 0:1

Signed-off-by: Gavin Shan 
---
 include/linux/ethtool.h  |  2 ++
 include/uapi/linux/ethtool.h | 17 ++
 net/core/ethtool.c   | 40 ++
 net/ncsi/Makefile|  3 +-
 net/ncsi/internal.h  |  4 +++
 net/ncsi/ncsi-ethtool.c  | 80 
 net/ncsi/ncsi-manage.c   |  6 
 7 files changed, 151 insertions(+), 1 deletion(-)
 create mode 100644 net/ncsi/ncsi-ethtool.c

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 83cc986..720bb4d 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -374,5 +374,7 @@ struct ethtool_ops {
  struct ethtool_link_ksettings *);
int (*set_link_ksettings)(struct net_device *,
  const struct ethtool_link_ksettings *);
+   int (*get_ncsi_channels)(struct net_device *,
+struct ethtool_ncsi_channels *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 5f4ea28..e43aacf 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -1331,6 +1331,8 @@ struct ethtool_per_queue_op {
 #define ETHTOOL_PHY_GTUNABLE   0x004e /* Get PHY tunable configuration */
 #define ETHTOOL_PHY_STUNABLE   0x004f /* Set PHY tunable configuration */
 
+#define ETHTOOL_GNCSICHANNELS  0x0050 /* Get NCSI channels */
+
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
 #define SPARC_ETH_SSET ETHTOOL_SSET
@@ -1763,4 +1765,19 @@ struct ethtool_link_settings {
 * __u32 map_lp_advertising[link_mode_masks_nwords];
 */
 };
+
+/**
+ * struct ethtool_ncsi_channels - NCSI channels
+ *
+ * @cmd: Command number = %ETHTOOL_GNCSICHANNELS
+ * @nr_channels: Number of available channels
+ * @id: Array of NCSI channel IDs
+ */
+struct ethtool_ncsi_channels {
+   __u32   cmd;
+   __s16   nr_channels;
+   __u32   id[0];
+#define ETHTOOL_NCSI_CHANNEL_ACTIVE(1 << 8)
+#define ETHTOOL_NCSI_CHANNEL_FLAGS 0x100
+};
 #endif /* _UAPI_LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 03111a2..7644765 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -756,6 +756,43 @@ static int ethtool_set_link_ksettings(struct net_device 
*dev,
return dev->ethtool_ops->set_link_ksettings(dev, _ksettings);
 }
 
+static int ethtool_get_ncsi_channels(struct net_device *dev,
+void __user *useraddr)
+{
+   struct ethtool_ncsi_channels *enc;
+   short nr_channels;
+   ssize_t size = 0;
+   int ret;
+
+   if (!dev->ethtool_ops->get_ncsi_channels)
+   return -EOPNOTSUPP;
+
+   if (copy_from_user(_channels, useraddr + sizeof(enc->cmd),
+  sizeof(nr_channels)))
+   return -EFAULT;
+
+   size = sizeof(*enc);
+   if (nr_channels > 0)
+   size += nr_channels * sizeof(enc->id[0]);
+
+   enc = kzalloc(size, GFP_KERNEL);
+   if (!enc)
+   return -ENOMEM;
+
+   if (copy_from_user(enc, useraddr, size)) {
+   ret = -EFAULT;
+   goto out;
+   }
+
+   ret = dev->ethtool_ops->get_ncsi_channels(dev, enc);
+   if (copy_to_user(useraddr, enc, size))
+   ret = -EFAULT;
+
+out:
+   kfree(enc);
+   return ret;
+}
+
 static void
 warn_incomplete_ethtool_legacy_settings_conversion(const char *details)
 {
@@ -2793,6 +2830,9 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
case ETHTOOL_PHY_STUNABLE:
rc = set_phy_tunable(dev, useraddr);
break;
+   case ETHTOOL_GNCSICHANNELS:
+   rc = ethtool_get_ncsi_channels(dev, useraddr);
+   break;
default:
rc = -EOPNOTSUPP;
}
diff --git a/net/ncsi/Makefile b/net/ncsi/Makefile
index dd12b56..71a258a 100644
--- a/net/ncsi/Makefile
+++ b/net/ncsi/Makefile
@@ -1,4 +1,5 @@
 #
 # Makefile for NCSI API
 #
-obj-$(CONFIG_NET_NCSI) += ncsi-cmd.o ncsi-rsp.o ncsi-aen.o ncsi-manage.o
+obj-$(CONFIG_NET_NCSI) += ncsi-cmd.o ncsi-rsp.o ncsi-aen.o ncsi-manage.o \
+ ncsi-ethtool.o
diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 1308a56..09a7ba7 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -337,4 +337,8 

[PATCH v4 net-next 03/10] net/ncsi: Enforce failover on link monitor timeout

2017-05-02 Thread Gavin Shan
The NCSI channel has been configured to provide service if its link
monitor timer is enabled, regardless of its state (inactive or active).
So the timeout event on the link monitor indicates the out-of-service
on that channel, for which a failover is needed.

This sets NCSI_DEV_RESHUFFLE flag to enforce failover on link monitor
timeout, regardless the channel's original state (inactive or active).
Also, the link is put into "down" state to give the failing channel
lowest priority when selecting for the active channel. The state of
failing channel should be set to active in order for deinitialization
and failover to be done.

Signed-off-by: Gavin Shan 
---
 net/ncsi/ncsi-manage.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index c71a3a5..13ad1f26 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -170,6 +170,7 @@ static void ncsi_channel_monitor(unsigned long data)
struct ncsi_channel *nc = (struct ncsi_channel *)data;
struct ncsi_package *np = nc->package;
struct ncsi_dev_priv *ndp = np->ndp;
+   struct ncsi_channel_mode *ncm;
struct ncsi_cmd_arg nca;
bool enabled, chained;
unsigned int monitor_state;
@@ -214,20 +215,21 @@ static void ncsi_channel_monitor(unsigned long data)
case NCSI_CHANNEL_MONITOR_WAIT ... NCSI_CHANNEL_MONITOR_WAIT_MAX:
break;
default:
-   if (!(ndp->flags & NCSI_DEV_HWA) &&
-   state == NCSI_CHANNEL_ACTIVE) {
+   if (!(ndp->flags & NCSI_DEV_HWA)) {
ncsi_report_link(ndp, true);
ndp->flags |= NCSI_DEV_RESHUFFLE;
}
 
+   ncm = >modes[NCSI_MODE_LINK];
spin_lock_irqsave(>lock, flags);
nc->state = NCSI_CHANNEL_INVISIBLE;
+   ncm->data[2] &= ~0x1;
spin_unlock_irqrestore(>lock, flags);
 
ncsi_stop_channel_monitor(nc);
 
spin_lock_irqsave(>lock, flags);
-   nc->state = NCSI_CHANNEL_INACTIVE;
+   nc->state = NCSI_CHANNEL_ACTIVE;
list_add_tail_rcu(>link, >channel_queue);
spin_unlock_irqrestore(>lock, flags);
ncsi_process_next_channel(ndp);
-- 
2.7.4



[PATCH v4 net-next 01/10] net/ncsi: Disable HWA mode when no channels are found

2017-05-02 Thread Gavin Shan
When there are no NCSI channels probed, HWA (Hardware Arbitration)
mode is enabled. It's not correct because HWA depends on the fact:
NCSI channels exist and all of them support HWA mode. This disables
HWA when no channels are probed.

Signed-off-by: Gavin Shan 
---
 net/ncsi/ncsi-manage.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index a3bd5fa..5073e15 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -839,12 +839,15 @@ static bool ncsi_check_hwa(struct ncsi_dev_priv *ndp)
struct ncsi_package *np;
struct ncsi_channel *nc;
unsigned int cap;
+   bool has_channel = false;
 
/* The hardware arbitration is disabled if any one channel
 * doesn't support explicitly.
 */
NCSI_FOR_EACH_PACKAGE(ndp, np) {
NCSI_FOR_EACH_CHANNEL(np, nc) {
+   has_channel = true;
+
cap = nc->caps[NCSI_CAP_GENERIC].cap;
if (!(cap & NCSI_CAP_GENERIC_HWA) ||
(cap & NCSI_CAP_GENERIC_HWA_MASK) !=
@@ -855,8 +858,13 @@ static bool ncsi_check_hwa(struct ncsi_dev_priv *ndp)
}
}
 
-   ndp->flags |= NCSI_DEV_HWA;
-   return true;
+   if (has_channel) {
+   ndp->flags |= NCSI_DEV_HWA;
+   return true;
+   }
+
+   ndp->flags &= ~NCSI_DEV_HWA;
+   return false;
 }
 
 static int ncsi_enable_hwa(struct ncsi_dev_priv *ndp)
-- 
2.7.4



[PATCH v4 net-next 02/10] net/ncsi: Properly track channel monitor timer state

2017-05-02 Thread Gavin Shan
The field @monitor.enabled in the NCSI channel descriptor is used
to track the state of channel monitor timer. It indicates the timer's
state (pending or not). We could not start the timer again in its
handler. In that case, We missed to update @monitor.enabled to false.
It leads to below warning printed by WARN_ON_ONCE() when the monitor
is restarted afterwards.

   [ cut here ]
   WARNING: CPU: 0 PID: 411 at /var/lib/jenkins/workspace/openbmc-build \
   /distro/ubuntu/target/palmetto/openbmc/build/tmp/work-shared/palmetto \
   net/ncsi/ncsi-manage.c:240 ncsi_start_channel_monitor+0x44/0x7c
   CPU: 0 PID: 411 Comm: kworker/0:3 Not tainted \
   4.7.10-f26558191540830589fe03932d05577957670b8d #1
   Hardware name: ASpeed SoC
   Workqueue: events ncsi_dev_work
   [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
   [] (show_stack) from [] (__warn+0xc4/0xf0)
   [] (__warn) from [] (warn_slowpath_null+0x1c/0x24)
   [] (warn_slowpath_null) from [] 
(ncsi_start_channel_monitor+0x44/0x7c)
   [] (ncsi_start_channel_monitor) from [] 
(ncsi_configure_channel+0x27c/0x2dc)
   [] (ncsi_configure_channel) from [] 
(ncsi_dev_work+0x39c/0x3e8)
   [] (ncsi_dev_work) from [] (process_one_work+0x1b8/0x2fc)
   [] (process_one_work) from [] (worker_thread+0x2c0/0x3f8)
   [] (worker_thread) from [] (kthread+0xd0/0xe8)
   [] (kthread) from [] (ret_from_fork+0x14/0x24)
   ---[ end trace 110cccf2b038c44d ]---

This fixes the issue by updating @monitor.enabled to false if needed.

Reported-by: Sridevi Ramesh 
Signed-off-by: Gavin Shan 
---
 net/ncsi/ncsi-manage.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 5073e15..c71a3a5 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -183,11 +183,16 @@ static void ncsi_channel_monitor(unsigned long data)
monitor_state = nc->monitor.state;
spin_unlock_irqrestore(>lock, flags);
 
-   if (!enabled || chained)
+   if (!enabled || chained) {
+   ncsi_stop_channel_monitor(nc);
return;
+   }
+
if (state != NCSI_CHANNEL_INACTIVE &&
-   state != NCSI_CHANNEL_ACTIVE)
+   state != NCSI_CHANNEL_ACTIVE) {
+   ncsi_stop_channel_monitor(nc);
return;
+   }
 
switch (monitor_state) {
case NCSI_CHANNEL_MONITOR_START:
@@ -199,6 +204,7 @@ static void ncsi_channel_monitor(unsigned long data)
nca.req_flags = 0;
ret = ncsi_xmit_cmd();
if (ret) {
+   ncsi_stop_channel_monitor(nc);
netdev_err(ndp->ndev.dev, "Error %d sending GLS\n",
   ret);
return;
@@ -218,6 +224,8 @@ static void ncsi_channel_monitor(unsigned long data)
nc->state = NCSI_CHANNEL_INVISIBLE;
spin_unlock_irqrestore(>lock, flags);
 
+   ncsi_stop_channel_monitor(nc);
+
spin_lock_irqsave(>lock, flags);
nc->state = NCSI_CHANNEL_INACTIVE;
list_add_tail_rcu(>link, >channel_queue);
@@ -257,6 +265,10 @@ void ncsi_stop_channel_monitor(struct ncsi_channel *nc)
nc->monitor.enabled = false;
spin_unlock_irqrestore(>lock, flags);
 
+   /* The timer isn't in pending state if we're deleting the timer
+* in its handler. del_timer_sync() can detect it and just does
+* nothing.
+*/
del_timer_sync(>monitor.timer);
 }
 
-- 
2.7.4



[PATCH v4 net-next 09/10] net/ncsi: No error report on DP response to non-existing package

2017-05-02 Thread Gavin Shan
The issue was found from /sys/kernel/debug/ncsi/eth0/stats. The
first step in NCSI package/channel enumeration is deselect all
packages by sending DP (Deselect Package) commands. The remote
NIC replies with response while the corresponding package isn't
populated yet and it is treated as an error wrongly.

 # ethtool --ncsi eth0 swstats
 :
 RESPONSE OK   TIMEOUT  ERROR
 ===
 DP   201

This fixes the issue by ignoring the error in DP response handler,
when the corresponding package isn't existing. With this applied,
no error reported from DP response packets.

 # ethtool --ncsi eth0 swstats
 :
 RESPONSE OK   TIMEOUT  ERROR
 ===
 DP   300

Signed-off-by: Gavin Shan 
---
 net/ncsi/ncsi-rsp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ncsi/ncsi-rsp.c b/net/ncsi/ncsi-rsp.c
index 41479a4..5097d86 100644
--- a/net/ncsi/ncsi-rsp.c
+++ b/net/ncsi/ncsi-rsp.c
@@ -118,7 +118,7 @@ static int ncsi_rsp_handler_dp(struct ncsi_request *nr)
ncsi_find_package_and_channel(ndp, rsp->rsp.common.channel,
  , NULL);
if (!np)
-   return -ENODEV;
+   return 0;
 
/* Change state of all channels attached to the package */
NCSI_FOR_EACH_CHANNEL(np, nc) {
-- 
2.7.4



[PATCH v4 net-next 10/10] net/ncsi: Fix length of GVI response packet

2017-05-02 Thread Gavin Shan
The length of GVI (GetVersionInfo) response packet should be 40 instead
of 36. This issue was found from /sys/kernel/debug/ncsi/eth0/stats.

 # ethtool --ncsi eth0 swstats
 :
 RESPONSE OK   TIMEOUT  ERROR
 ===
 GVI  002

With this applied, no error reported on GVI response packets:

 # ethtool --ncsi eth0 swstats
 :
 RESPONSE OK   TIMEOUT  ERROR
 ===
 GVI  200

Signed-off-by: Gavin Shan 
---
 net/ncsi/ncsi-rsp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ncsi/ncsi-rsp.c b/net/ncsi/ncsi-rsp.c
index 5097d86..a3e0f4f 100644
--- a/net/ncsi/ncsi-rsp.c
+++ b/net/ncsi/ncsi-rsp.c
@@ -951,7 +951,7 @@ static struct ncsi_rsp_handler {
{ NCSI_PKT_RSP_EGMF,4, ncsi_rsp_handler_egmf},
{ NCSI_PKT_RSP_DGMF,4, ncsi_rsp_handler_dgmf},
{ NCSI_PKT_RSP_SNFC,4, ncsi_rsp_handler_snfc},
-   { NCSI_PKT_RSP_GVI,36, ncsi_rsp_handler_gvi },
+   { NCSI_PKT_RSP_GVI,40, ncsi_rsp_handler_gvi },
{ NCSI_PKT_RSP_GC, 32, ncsi_rsp_handler_gc  },
{ NCSI_PKT_RSP_GP, -1, ncsi_rsp_handler_gp  },
{ NCSI_PKT_RSP_GCPS,  172, ncsi_rsp_handler_gcps},
-- 
2.7.4



[PATCH v4 net-next 08/10] net/ncsi: Support NCSI packet generation

2017-05-02 Thread Gavin Shan
This introduces /sys/kernel/debug/ncsi/eth0/pkt. The debugfs entry
can accept parameters to produce NCSI command packet. The received
NCSI response packet is dumped on read. Below is an example to send
CIS command and dump its response.

   # echo CIS,0,0 > /sys/kernel/debug/ncsi/eth0/pkt
   # cat /sys/kernel/debug/ncsi/eth0/pkt
   NCSI response [CIS] packet received

   00 01 dd 80 00 0004  

Signed-off-by: Gavin Shan 
---
 net/ncsi/internal.h|  54 
 net/ncsi/ncsi-cmd.c|   1 +
 net/ncsi/ncsi-debug.c  | 714 +
 net/ncsi/ncsi-manage.c |  10 +
 net/ncsi/ncsi-rsp.c|  11 +
 5 files changed, 790 insertions(+)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 5a6cd74..0a9210d 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -215,6 +215,9 @@ struct ncsi_request {
bool used;/* Request that has been assigned  */
unsigned int flags;   /* NCSI request property   */
 #define NCSI_REQ_FLAG_EVENT_DRIVEN 1
+#ifdef CONFIG_NET_NCSI_DEBUG
+#define NCSI_REQ_FLAG_DEBUG2
+#endif
struct ncsi_dev_priv *ndp;/* Associated NCSI device  */
struct sk_buff   *cmd;/* Associated NCSI command packet  */
struct sk_buff   *rsp;/* Associated NCSI response packet */
@@ -277,6 +280,15 @@ struct ncsi_dev_priv {
struct packet_type  ptype;   /* NCSI packet Rx handler */
 #ifdef CONFIG_NET_NCSI_DEBUG
struct ethtool_ncsi_sw_stats stats;  /* NCSI software statistics   */
+   struct dentry   *dentry; /* Debugfs directory   */
+   struct {
+   struct dentry  *dentry;
+   unsigned int   req;
+#define NCSI_PKT_REQ_FREE  0
+#define NCSI_PKT_REQ_BUSY  0x
+   interrno;
+   struct sk_buff *rsp;
+   } pkt;
 #endif /* CONFIG_NET_NCSI_DEBUG */
struct list_headnode;/* Form NCSI device list  */
 };
@@ -348,10 +360,52 @@ void ncsi_ethtool_unregister_dev(struct net_device *dev);
 #ifdef CONFIG_NET_NCSI_DEBUG
 void ncsi_dev_update_stats(struct ncsi_dev_priv *ndp,
   int type, int subtype, int errno);
+int ncsi_dev_init_debug(struct ncsi_dev_priv *ndp);
+void ncsi_dev_release_debug(struct ncsi_dev_priv *ndp);
+void ncsi_dev_reset_debug_pkt(struct ncsi_dev_priv *ndp,
+ struct sk_buff *skb, int errno);
+
+static inline bool ncsi_dev_is_debug_pkt(struct ncsi_dev_priv *ndp,
+struct ncsi_request *nr)
+{
+   return ((nr->flags & NCSI_REQ_FLAG_DEBUG) && ndp->pkt.req == nr->id);
+}
+
+static inline void ncsi_dev_set_debug_pkt(struct ncsi_dev_priv *ndp,
+ struct ncsi_request *nr)
+{
+   if (nr->flags & NCSI_REQ_FLAG_DEBUG)
+   ndp->pkt.req = nr->id;
+}
 #else
 static inline void ncsi_dev_update_stats(struct ncsi_dev_priv *ndp,
 int type, int subtype, int errno)
 {
 }
+
+static inline int ncsi_dev_init_debug(struct ncsi_dev_priv *ndp)
+{
+   return -ENOTTY;
+}
+
+static inline void ncsi_dev_release_debug(struct ncsi_dev_priv *ndp)
+{
+}
+
+static inline bool ncsi_dev_is_debug_pkt(struct ncsi_dev_priv *ndp,
+struct ncsi_request *nr)
+{
+   return false;
+}
+
+static inline void ncsi_dev_set_debug_pkt(struct ncsi_dev_priv *ndp,
+ struct ncsi_request *nr)
+{
+}
+
+static inline void ncsi_dev_reset_debug_pkt(struct ncsi_dev_priv *ndp,
+   struct sk_buff *skb, int errno)
+{
+}
 #endif /* CONFIG_NET_NCSI_DEBUG */
 #endif /* __NCSI_INTERNAL_H__ */
diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index 875ff07..cfcda87 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -361,6 +361,7 @@ int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
 */
nr->enabled = true;
mod_timer(>timer, jiffies + 1 * HZ);
+   ncsi_dev_set_debug_pkt(nca->ndp, nr);
 
/* Send NCSI packet */
skb_get(nr->cmd);
diff --git a/net/ncsi/ncsi-debug.c b/net/ncsi/ncsi-debug.c
index 0e6c038..5a5f058 100644
--- a/net/ncsi/ncsi-debug.c
+++ b/net/ncsi/ncsi-debug.c
@@ -22,6 +22,9 @@
 #include "internal.h"
 #include "ncsi-pkt.h"
 
+static struct dentry *ncsi_dentry;
+static const char *ncsi_pkt_type_name(unsigned int type);
+
 void ncsi_dev_update_stats(struct ncsi_dev_priv *ndp,
   int type, int subtype, int errno)
 {
@@ -43,3 +46,714 @@ void ncsi_dev_update_stats(struct ncsi_dev_priv *ndp,
 
spin_unlock_irqrestore(>lock, flags);
 }
+
+void ncsi_dev_reset_debug_pkt(struct ncsi_dev_priv *ndp,
+ struct sk_buff *skb, int errno)
+{
+   unsigned long flags;
+   struct sk_buff *old;
+
+   

[PATCH 1/1] forcedeth: remove unnecessary carrier status check

2017-05-02 Thread Zhu Yanjun
Since netif_carrier_on() will do nothing if device's
carrier is already on, so it's unnecessary to do
carrier status check.

It's the same for netif_carrier_off().

Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 978d329..aa912f4 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -4248,11 +4248,9 @@ static int nv_get_link_ksettings(struct net_device *dev,
/* We do not track link speed / duplex setting if the
 * interface is disabled. Force a link check */
if (nv_update_linkspeed(dev)) {
-   if (!netif_carrier_ok(dev))
-   netif_carrier_on(dev);
+   netif_carrier_on(dev);
} else {
-   if (netif_carrier_ok(dev))
-   netif_carrier_off(dev);
+   netif_carrier_off(dev);
}
}
 
-- 
2.7.4



Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-02 Thread Casey Leedom
| From: Alexander Duyck 
| Date: Tuesday, May 2, 2017 11:10 AM
| ...
| So for example, in the case of x86 it seems like there are multiple
| root complexes that have issues, and the gains for enabling it with
| standard DMA to host memory are small. As such we may want to default
| it to off via the architecture specific PCIe code and then look at
| having "white-list" cases where we enable it for things like
| peer-to-peer accesses. In the case of SPARC we could look at
| defaulting it to on, and only "black-list" any cases where there might
| be issues since SPARC relies on this in a significant way for
| performance. In the case of ARM and other architectures it is a bit of
| a toss-up. I would say we could just default it to on for now and
| "black-list" anything that doesn't work, or we could go the other way
| like I suggested for x86. It all depends on what the ARM community
| would want to agree on for this. I would say unless it makes a
| significant difference like it does in the case of SPARC we are
| probably better off just defaulting it to off.

  Sorry, I forgot to respond to this earlier when someone was rushing me out
to a customer dinner.

  I'm unaware of any other x86 Root Complex Port that has a problem with
Relaxed Ordering other than the performance issue with the current Intel
implementation.  Ashok tells me that Intel is in the final stages of putting
together a technical notice on this issue but I don't know when that will
come out.  Hopefully that will shed much more light on the cause and actual
use of Relaxed Ordering when directed to Coherent Memory on current and past
Intel platforms.  (Note that the performance bug seems to limit us to
~75-85Gb/s DMA Write speed to Coherent Host Memory.)

  The only other Device that I currently know of which has issues with
Relaxed Ordering TLPs directed at it, is also a Root Complex Port on the new
AMD A1100 ARM SoC ("SEATTLE").  There we have an actual bug which could lead
to Data Corruption.

  But I don't know anything about other x86 Root Complex Ports having issues
with this -- we've been using it ever since commit ef306b50b from December
2010.

  Also, I'm not aware of any performance data which has been gathered on the
use of Relaxed Ordering when directed at Host Memory.  From your note, it
sounds like it's important on SPARC architectures.  But it could conceivably
be important on any architecture or Root Complex/Memory Controller
implementation.  We use it to direct Ingress Packet Data to Free List
Buffers, followed by a TLP without Relaxed Ordering directed at a Host
Memory Message Queue.  The idea here is to give the Root Complex options on
which DMA Memory Write TLPs to process in order to optimize data placement
in memory.  And with the next Ethernet speed bump to 200Gb/s this may be
even more important.

  Basically, I'd hate to come up with a solution where we write off the use
of Relaxed Ordering for DMA Writes to Host Memory.  I don't think you're
suggesting that, but there are a number of x86 Root Complex implementations
out there -- and some like the new AMD Ryzen have yet to be tested -- as
well as other architectures.

  And, as Ashok and I have both nothed, any solution we come up with needs
to cope with a heterogeneous situation where, on the same PCIe Fabric, it
may be necessesary/desireable to support Relaxed Ordering TLPs directed at
some End Nodes but not others.

Casey


[PATCH net-next] selftests/bpf: get rid of -D__x86_64__

2017-05-02 Thread Alexei Starovoitov
-D__x86_64__ workaround was used to make /usr/include/features.h
to follow expected path through the system include headers.
This is not portable.
Instead define dummy stubs.h which is used by 'clang -target bpf'

Fixes: 6882804c916b ("selftests/bpf: add a test for overlapping packet range 
checks")
Signed-off-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/Makefile| 4 ++--
 tools/testing/selftests/bpf/gnu/stubs.h | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/gnu/stubs.h

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index f4341e160de6..91edd0566237 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -34,6 +34,6 @@ $(BPFOBJ): force
 CLANG ?= clang
 
 %.o: %.c
-   $(CLANG) -I../../../include/uapi -I../../../../samples/bpf/ \
-   -D__x86_64__ -Wno-compare-distinct-pointer-types \
+   $(CLANG) -I. -I../../../include/uapi -I../../../../samples/bpf/ \
+   -Wno-compare-distinct-pointer-types \
-O2 -target bpf -c $< -o $@
diff --git a/tools/testing/selftests/bpf/gnu/stubs.h 
b/tools/testing/selftests/bpf/gnu/stubs.h
new file mode 100644
index ..719225b16626
--- /dev/null
+++ b/tools/testing/selftests/bpf/gnu/stubs.h
@@ -0,0 +1 @@
+/* dummy .h to trick /usr/include/features.h to work with 'clang -target bpf' 
*/
-- 
2.9.3



Re: linux-next: manual merge of the net-next tree with Linus' tree

2017-05-02 Thread David Miller
From: Stephen Rothwell 
Date: Wed, 3 May 2017 11:07:03 +1000

> Hi all,
> 
> Today's linux-next merge of the net-next tree got a conflict in:
> 
>   arch/avr32/include/uapi/asm/socket.h
> 
> between commit:
> 
>   26202873bb51 ("avr32: remove support for AVR32 architecture")
> 
> from Linus' tree and commits:
> 
>   a2d133b1d465 ("sock: introduce SO_MEMINFO getsockopt")
>   6d4339028b35 ("net: Introduce SO_INCOMING_NAPI_ID"
>   5daab9db7b65 ("New getsockopt option to get socket cookie")
> 
> from the net-next tree.
> 
> I fixed it up (I removed the file) and can carry the fix as
> necessary. This is now fixed as far as linux-next is concerned, but any
> non trivial conflicts should be mentioned to your upstream maintainer
> when your tree is submitted for merging.  You may also want to consider
> cooperating with the maintainer of the conflicting tree to minimise any
> particularly complex conflicts.

Duly noted in the pull request I sent to Linus, and now that he took
net-next in it is resolved in his tree.

Thanks!


Re: TPACKET_V3 timeout bug?

2017-05-02 Thread Guy Harris
On May 2, 2017, at 10:16 AM, chetan loke  wrote:

> Commit that caused it:
> 
> https://github.com/torvalds/linux/commit/41a50d621a321b4c15273cc1b5ed41437f4acdfb
> 
> Reverting that change is what we need.

As long as you do *not* revert


https://github.com/torvalds/linux/commit/da413eec729dae5dcb150e2eb34c5e7e5e4e1b49

"packet: Fixed TPACKET V3 to signal poll when block is closed rather than every 
packet

Make TPACKET_V3 signal poll when block is closed rather than for every
packet. Side effect is that poll will be signaled when block retire
timer expires which didn't previously happen. Issue was visible when
sending packets at a very low frequency such that all blocks are retired
before packets are received by TPACKET_V3. This caused avoidable packet
loss. The fix ensures that the signal is sent when blocks are closed
which covers the normal path where the block is filled as well as the
path where the timer expires. The case where a block is filled without
moving to the next block (ie. all blocks are full) will still cause poll
to be signaled."

The behavior in the commit message exactly what libpcap (and probably at least 
some other users of TPACKET_V3) need.  libpcap doesn't care whether empty 
blocks are passed up (as indicated, "you get woken up when the packet buffer 
timeout expires" is *not* part of libpcap's contract, and code shouldn't depend 
on that), so, as long as you preserve the "signal poll when a block is closed, 
even if it's an empty block" behavior, libpcap will work fine.


[PATCH net-next] selftests/bpf: add a test case to check verifier pointer arithmetic

2017-05-02 Thread Alexei Starovoitov
From: Yonghong Song 

With clang/llvm 4.0+, the test case is able to generate
the following pattern:

440: (b7) r1 = 15
441: (05) goto pc+73
515: (79) r6 = *(u64 *)(r10 -152)
516: (bf) r7 = r10
517: (07) r7 += -112
518: (bf) r2 = r7
519: (0f) r2 += r1
520: (71) r1 = *(u8 *)(r8 +0)
521: (73) *(u8 *)(r2 +45) = r1


commit 332270fdc8b6 ("bpf: enhance verifier to understand stack
pointer arithmetic") improved verifier to handle such a pattern.
This patch adds a C test case to actually generate such a pattern.
A dummy tracepoint interface is used to load the program
into the kernel.

Signed-off-by: Yonghong Song 
Acked-by: Martin KaFai Lau 
Acked-by: Daniel Borkmann 
Signed-off-by: Alexei Starovoitov 
---
We were planning to submit this test case along with
the commit 332270fdc8b6 for completeness, but it took few
extra days to cleanup.
Hope it's not too late.
---
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/test_progs.c  |  16 ++
 tools/testing/selftests/bpf/test_tcp_estats.c | 258 ++
 3 files changed, 275 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_tcp_estats.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index d8d94b9bd76c..f4341e160de6 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -13,7 +13,7 @@ LDLIBS += -lcap -lelf
 
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map 
test_progs
 
-TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o
+TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o
 
 TEST_PROGS := test_kmod.sh
 
diff --git a/tools/testing/selftests/bpf/test_progs.c 
b/tools/testing/selftests/bpf/test_progs.c
index 4ed049a0b14b..b59f5ed4ae40 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -268,6 +268,21 @@ static void test_l4lb(void)
bpf_object__close(obj);
 }
 
+static void test_tcp_estats(void)
+{
+   const char *file = "./test_tcp_estats.o";
+   int err, prog_fd;
+   struct bpf_object *obj;
+   __u32 duration = 0;
+
+   err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, , _fd);
+   CHECK(err, "", "err %d errno %d\n", err, errno);
+   if (err)
+   return;
+
+   bpf_object__close(obj);
+}
+
 int main(void)
 {
struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY };
@@ -277,6 +292,7 @@ int main(void)
test_pkt_access();
test_xdp();
test_l4lb();
+   test_tcp_estats();
 
printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
return 0;
diff --git a/tools/testing/selftests/bpf/test_tcp_estats.c 
b/tools/testing/selftests/bpf/test_tcp_estats.c
new file mode 100644
index ..bee3bbecc0c4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcp_estats.c
@@ -0,0 +1,258 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+
+/* This program shows clang/llvm is able to generate code pattern
+ * like:
+ *   _tcp_send_active_reset:
+ *  0:   bf 16 00 00 00 00 00 00 r6 = r1
+ *..
+ *335:   b7 01 00 00 0f 00 00 00 r1 = 15
+ *336:   05 00 48 00 00 00 00 00 goto 72
+ *
+ *   LBB0_3:
+ *337:   b7 01 00 00 01 00 00 00 r1 = 1
+ *338:   63 1a d0 ff 00 00 00 00 *(u32 *)(r10 - 48) = r1
+ *408:   b7 01 00 00 03 00 00 00 r1 = 3
+ *
+ *   LBB0_4:
+ *409:   71 a2 fe ff 00 00 00 00 r2 = *(u8 *)(r10 - 2)
+ *410:   bf a7 00 00 00 00 00 00 r7 = r10
+ *411:   07 07 00 00 b8 ff ff ff r7 += -72
+ *412:   bf 73 00 00 00 00 00 00 r3 = r7
+ *413:   0f 13 00 00 00 00 00 00 r3 += r1
+ *414:   73 23 2d 00 00 00 00 00 *(u8 *)(r3 + 45) = r2
+ *
+ * From the above code snippet, the code generated by the compiler
+ * is reasonable. The "r1" is assigned to different values in basic
+ * blocks "_tcp_send_active_reset" and "LBB0_3", and used in "LBB0_4".
+ * The verifier should be able to handle such code patterns.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+#define _(P) ({typeof(P) val = 0; bpf_probe_read(, sizeof(val), ); val;})
+#define TCP_ESTATS_MAGIC 0xBAADBEEF
+
+/* This test case needs "sock" and "pt_regs" data structure.
+ * Recursively, "sock" needs "sock_common" and "inet_sock".
+ * However, this is a unit test case only for
+ * verifier purpose without bpf program execution.
+ * We can safely mock much simpler data structures, basically
+ * only taking the necessary fields from kernel headers.
+ */

RE: [PATCH net v3] driver: veth: Fix one possbile memleak when fail to register_netdevice

2017-05-02 Thread Gao Feng
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Xin Long
> Sent: Wednesday, May 3, 2017 12:59 AM
> On Tue, May 2, 2017 at 7:03 PM, Gao Feng  wrote:
> >> From: Xin Long [mailto:lucien@gmail.com]
> >> Sent: Tuesday, May 2, 2017 3:56 PM
> >> On Sat, Apr 29, 2017 at 11:51 AM,   wrote:
> >> > From: Gao Feng 
> > [...]
> >> > -static void veth_dev_free(struct net_device *dev)
> >> > +static void veth_destructor_free(struct net_device *dev)
> >> >  {
> >> > free_percpu(dev->vstats);
> >> > +}
> >> not sure why you needed to add this function.
> >> to use free_percpu() directly may be clearer.
> >
> > Because both of ndo_uninit and destructor need to perform same free
> statements.
> > It is good at maintain the codes with the common function.
> >>
> >> > +
> >> > +static void veth_dev_uninit(struct net_device *dev) {
> >> call free_percpu() here, no need to check dev->reg_state.
> >> free_percpu will just return if dev->vstats is NULL.
> >
> > It would break the original design if don't check the reg_state.
> > The original logic is that free the resources in the destructor, not in 
> > ndo_init.
> I got what you're doing now, can you pls try to fix this with:
> 
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -219,10 +219,9 @@ static int veth_dev_init(struct net_device *dev)
> return 0;
>  }
> 
> -static void veth_dev_free(struct net_device *dev)
> +static void veth_dev_uninit(struct net_device *dev)
>  {
> free_percpu(dev->vstats);
> -   free_netdev(dev);
>  }
> 
>  #ifdef CONFIG_NET_POLL_CONTROLLER
> @@ -279,6 +278,7 @@ static void veth_set_rx_headroom(struct net_device
> *dev, int new_hr)
> 
>  static const struct net_device_ops veth_netdev_ops = {
> .ndo_init= veth_dev_init,
> +   .ndo_uninit  = veth_dev_uninit,
> .ndo_open= veth_open,
> .ndo_stop= veth_close,
> .ndo_start_xmit  = veth_xmit,
> @@ -317,7 +317,7 @@ static void veth_setup(struct net_device *dev)
>NETIF_F_HW_VLAN_STAG_TX |
>NETIF_F_HW_VLAN_CTAG_RX |
>NETIF_F_HW_VLAN_STAG_RX);
> -   dev->destructor = veth_dev_free;
> +   dev->destructor = free_netdev;
> dev->max_mtu = ETH_MAX_MTU;
> 
> dev->hw_features = VETH_FEATURES;
> 
> 
> just as what other virtual nic drivers do (vxlan, geneve, macsec, bridge )
> 

The fix you mentioned change the original logic.
The dev->vstats is freed in advance in the ndo_uninit, not destructor.
It may break the backward.

Regards
Feng




Re: [PATCH net-next RFC 1/1] net netlink: Add new type NLA_FLAG_BITS

2017-05-02 Thread Jamal Hadi Salim

On 17-05-02 03:03 PM, David Miller wrote:

From: Jamal Hadi Salim 
Date: Sun, 30 Apr 2017 10:28:39 -0400


Generic bitflags attribute content sent to the kernel by user.
With this type the user can either set or unset a flag in the
kernel.


You asked for feedback, here it is :-)

I think this is overengineered.

Just define a u32 for the value, and mask which defines which bits are
legitimate and defined.  Any bit outside of the legitimate mask must
be zero.



That is what it does but as a nla type. It has a validator ops() in
case someone wants to override the default validator.
Is the ops the over-engineering you refer to or merely making it an
nla type is a problem?

cheers,
jamal



RE: [Intel-wired-lan] [PATCH] igb: make a few local functions static

2017-05-02 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> Behalf Of Colin King
> Sent: Thursday, April 27, 2017 10:59 AM
> To: Kirsher, Jeffrey T ; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org
> Cc: kernel-janit...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH] igb: make a few local functions static
> 
> From: Colin Ian King 
> 
> clean up a few sparse warnings, these following
> functions can be made static:
> 
> drivers/net/ethernet/intel/igb/igb_main.c: warning: symbol
>   'igb_add_mac_filter' was not declared. Should it be static?
> drivers/net/ethernet/intel/igb/igb_main.c: warning: symbol
>   'igb_del_mac_filter' was not declared. Should it be static?
> drivers/net/ethernet/intel/igb/igb_main.c: warning: symbol
>   'igb_set_vf_mac_filter' was not declared. Should it be static?
> 
> Signed-off-by: Colin Ian King 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)

Tested-by: Aaron Brown 


RE: [Intel-wired-lan] [PATCH net-next] igb: mark PM functions as __maybe_unused

2017-05-02 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On
> Behalf Of Arnd Bergmann
> Sent: Thursday, April 27, 2017 12:10 PM
> To: Kirsher, Jeffrey T 
> Cc: Arnd Bergmann ; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; intel-wired-...@lists.osuosl.org; David S. Miller
> 
> Subject: [Intel-wired-lan] [PATCH net-next] igb: mark PM functions as
> __maybe_unused
> 
> The new wake function is only used by the suspend/resume handlers that
> are defined in inside of an #ifdef, which can cause this harmless
> warning:
> 
> drivers/net/ethernet/intel/igb/igb_main.c:7988:13: warning:
> 'igb_deliver_wake_packet' defined but not used [-Wunused-function]
> 
> Removing the #ifdef, instead using a __maybe_unused annotation
> simplifies the code and avoids the warning.
> 
> Fixes: b90fa8763560 ("igb: Enable reading of wake up packet")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 18 +-
>  1 file changed, 5 insertions(+), 13 deletions(-)

Tested-by: Aaron Brown 


linux-next: manual merge of the net-next tree with Linus' tree

2017-05-02 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  arch/avr32/include/uapi/asm/socket.h

between commit:

  26202873bb51 ("avr32: remove support for AVR32 architecture")

from Linus' tree and commits:

  a2d133b1d465 ("sock: introduce SO_MEMINFO getsockopt")
  6d4339028b35 ("net: Introduce SO_INCOMING_NAPI_ID"
  5daab9db7b65 ("New getsockopt option to get socket cookie")

from the net-next tree.

I fixed it up (I removed the file) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell


Re: [net-next PATCH 2/4] samples/bpf: make bpf_load.c code compatible with ELF maps section changes

2017-05-02 Thread Alexei Starovoitov
On Tue, May 02, 2017 at 02:31:56PM +0200, Jesper Dangaard Brouer wrote:
> This patch does proper parsing of the ELF "maps" section, in-order to
> be both backwards and forwards compatible with changes to the map
> definition struct bpf_map_def, which gets compiled into the ELF file.
> 
> The assumption is that new features with value zero, means that they
> are not in-use.  For backward compatibility where loading an ELF file
> with a smaller struct bpf_map_def, only copy objects ELF size, leaving
> rest of loaders struct zero.  For forward compatibility where ELF file
> have a larger struct bpf_map_def, only copy loaders own struct size
> and verify that rest of the larger struct is zero, assuming this means
> the newer feature was not activated, thus it should be safe for this
> older loader to load this newer ELF file.
> 
> Fixes: fb30d4b71214 ("bpf: Add tests for map-in-map")
> Fixes: 409526bea3c3 ("samples/bpf: bpf_load.c detect and abort if ELF maps 
> section size is wrong")
> Signed-off-by: Jesper Dangaard Brouer 

I would just merge patches 2 and 3 to reduce churn,
but it looks like great improvement already.
Acked-by: Alexei Starovoitov 



Re: [net-next PATCH 1/4] samples/bpf: adjust rlimit RLIMIT_MEMLOCK for traceex2, tracex3 and tracex4

2017-05-02 Thread Alexei Starovoitov
On Tue, May 02, 2017 at 02:31:50PM +0200, Jesper Dangaard Brouer wrote:
> Needed to adjust max locked memory RLIMIT_MEMLOCK for testing these bpf 
> samples
> as these are using more and larger maps than can fit in distro default 
> 64Kbytes limit.
> 
> Signed-off-by: Jesper Dangaard Brouer 
...
> + struct rlimit r = {1024*1024, RLIM_INFINITY};
...
> + struct rlimit r = {1024*1024, RLIM_INFINITY};

why magic numbers?
All other samples do
struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
 
> + if (setrlimit(RLIMIT_MEMLOCK, )) {
> + perror("setrlimit(RLIMIT_MEMLOCK)");

ip_tunnel.c test does:
perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
Few others do:
assert(!setrlimit(RLIMIT_MEMLOCK, ));
and the rest just:
setrlimit(RLIMIT_MEMLOCK, );

We probalby need to move this to a helper.

> + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};

here it's consistent :)

> + if (setrlimit(RLIMIT_MEMLOCK, )) {
> + perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");

but with different perror ?
Let's do a common helper for all?



RE: [PATCH net v4 00/12] Fix possbile memleaks when fail to register_netdevice

2017-05-02 Thread Gao Feng
Hi David

> From: David Miller [mailto:da...@davemloft.net]
> Sent: Wednesday, May 3, 2017 3:30 AM
> From: gfree.w...@foxmail.com
> Date: Tue,  2 May 2017 13:58:42 +0800
> 
[...]
> >
> > This solution doesn't only make sure free all resources in any case,
> > but also follows the original desgin that some resources could be kept
> > until the destructor executes normally after register the device
> > successfully.
> 
> I want to think about this some more.
> 
> It is really unfortunate that resources are allocated strictly from the
ndo_init()
> yet released in two different callbacks which are invoked only in certain
> (different) situations.
> 
> Just the fact that we have to make an internal netdev state test in the
> ndo_uninit callback to get this right is a big red flag to me.

Yes, I am very agree with you.
This fix is just like a workaround under current framework.

The root is that allocate in one spot, but free them at two spots.
It means all ndo_uninit need to handle this case if allocate some resource
and free them in the destructor.
It should be done by the framework.

I thought about if there was a better solution to fix it.
But I think it need to modify the framework of net_device, it seems not good
as a bug fix to net.git.

Best Regards
Feng






Re: [PATCH] ipx: call ipxitf_put() in ioctl error path

2017-05-02 Thread Kees Cook
On Tue, May 2, 2017 at 3:58 AM, Dan Carpenter  wrote:
> We should call ipxitf_put() if the copy_to_user() fails.
>
> Reported-by: 李强 
> Signed-off-by: Dan Carpenter 
>
> diff --git a/net/ipx/af_ipx.c b/net/ipx/af_ipx.c
> index 8a9219ff2e77..fa31ef29e3fa 100644
> --- a/net/ipx/af_ipx.c
> +++ b/net/ipx/af_ipx.c
> @@ -1168,11 +1168,10 @@ static int ipxitf_ioctl(unsigned int cmd, void __user 
> *arg)
> sipx->sipx_network  = ipxif->if_netnum;
> memcpy(sipx->sipx_node, ipxif->if_node,
> sizeof(sipx->sipx_node));
> -   rc = -EFAULT;
> +   rc = 0;
> if (copy_to_user(arg, , sizeof(ifr)))
> -   break;
> +   rc = -EFAULT;
> ipxitf_put(ipxif);
> -   rc = 0;
> break;
> }
> case SIOCAIPXITFCRT:

This refcount overflow flaw (and resulting use-after-free) appears to
be reachable from unprivileged userspace, though I think it requires
an interface already be configured, so this is likely not much risk to
most users. Someone more familiar with IPX should double-check...

And, of course, I should point out this flaw would have been blocked
entirely by using refcount_t:
https://lkml.org/lkml/2017/3/17/193

And if it didn't require a configured interface, it would be mitigated
with module autoload blocking:
https://lkml.org/lkml/2017/4/19/1088

(Yes, yes, I know both are still being worked on, but this is a good
example to show another case where they'd have been useful.)

-Kees

-- 
Kees Cook
Pixel Security


[PATCH] Fix for new version of realtek r8153

2017-05-02 Thread jake Briggs
From: jake 

Fix for new version of realtek r8153 not being recognised by the r8152
driver. The new version of the device with idVendor=0bda, idProduct=8153 is 
6010 and treating it as if it was a 5c30 allow it to work.

Error log:

Apr 19 09:55:13 devotron kernel: [  330.167074] usb 2-2.2: New USB
device found, idVendor=0bda, idProduct=8153
Apr 19 09:55:13 devotron kernel: [  330.167078] usb 2-2.2: New USB
device strings: Mfr=1, Product=2, SerialNumber=6
Apr 19 09:55:13 devotron kernel: [  330.167080] usb 2-2.2: Product: USB
10/100/1000 LAN
Apr 19 09:55:13 devotron kernel: [  330.167081] usb 2-2.2: Manufacturer:
Realtek
Apr 19 09:55:13 devotron kernel: [  330.167083] usb 2-2.2: SerialNumber:
00101
Apr 19 09:55:13 devotron kernel: [  330.258781] usb 2-2.2: reset
SuperSpeed USB device number 5 using xhci_hcd
Apr 19 09:55:13 devotron kernel: [  330.279430] r8152 2-2.2:1.0 (unnamed
net_device) (uninitialized): Unknown version 0x6010
Apr 19 09:55:13 devotron kernel: [  330.279434] r8152 2-2.2:1.0 (unnamed
net_device) (uninitialized): Unknown Device

Signed-off-by: jake 
---
 drivers/net/usb/r8152.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 07f788c49d57..2a55459fdfac 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -4277,6 +4277,7 @@ static void r8152b_get_version(struct r8152 *tp)
tp->mii.supports_gmii = 1;
break;
case 0x5c30:
+   case 0x6010:
tp->version = RTL_VER_06;
tp->mii.supports_gmii = 1;
break;
-- 
2.11.0



Re: [PATCH] xdp: use common helper for netlink extended ack reporting

2017-05-02 Thread Jakub Kicinski
On Wed,  3 May 2017 00:39:17 +0200, Daniel Borkmann wrote:
> Small follow-up to d74a32acd59a ("xdp: use netlink extended ACK reporting")
> in order to let drivers all use the same NL_SET_ERR_MSG_MOD() helper macro
> for reporting. This also ensures that we consistently add the driver's
> prefix for dumping the report in user space to indicate that the error
> message is driver specific and not coming from core code. Furthermore,
> NL_SET_ERR_MSG_MOD() now reuses NL_SET_ERR_MSG() and thus makes all macros
> check the pointer as suggested.
> 
> References: https://www.spinics.net/lists/netdev/msg433267.html
> Signed-off-by: Daniel Borkmann 

Acked-by: Jakub Kicinski 

Perhaps I was overthinking loosing the message if extack is NULL.  
I certainly like how this patch simplifies things :)


Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-02 Thread Alexander Duyck
On Tue, May 2, 2017 at 12:34 PM, Raj, Ashok  wrote:
> On Tue, May 02, 2017 at 11:10:22AM -0700, Alexander Duyck wrote:
>> On Tue, May 2, 2017 at 9:53 AM, Raj, Ashok  wrote:
>> > On Tue, May 02, 2017 at 09:39:34AM -0700, Alexander Duyck wrote:
>> >> On Mon, May 1, 2017 at 4:13 PM, Casey Leedom  wrote:
>> >> > The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the 
>> >> > Relaxed
>> >> > Ordering Attribute should not be used on Transaction Layer Packets 
>> >> > destined
>> >> > for the PCIe End Node so flagged.  Initially flagged this way are Intel
>> >> > E5-26xx Root Complex Ports which suffer from a Flow Control Credit
>> >> > Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports 
>> >> > which
>> >> > don't obey PCIe 3.0 ordering rules which can lead to Data Corruption.
>> >>
>> >> So this is a good first step though might I suggest one other change.
>> >>
>> >> We may want to add logic to the core PCIe code that clears the "Enable
>> >> Relaxed Ordering" bit in the device control register for all devices
>> >> hanging off of this root complex. Assuming the devices conform to the
>> >> PCIe spec doing that should disable relaxed ordering in a device
>> >> agnostic way that then enables us at a driver level to just enable the
>> >> feature always without having to perform any checks for your flag. We
>> >> could probably do that as a part of probing the PCIe interfaces
>> >> hanging off of these devices.
>> >
>> > I suppose you don't want to turn off RO completely on the device. When
>> > traffic is targetted to mmio for peer to peer reasons RO has performance
>> > upside. The specific issue with these root ports indicate limitation using
>> > RO for traffic targetting coherent memory.
>>
>> Actually my main concern here is virtualization. If I take the PCIe
>> function and direct assign it I have no way of seeing the root complex
>> flag as it is now virtualized away. In the meantime the guest now has
>> the ability to enable the function and sees nothing that says you
>> can't enable relaxed ordering which in turn ends up potentially
>> causing data corruption on the system. I want relaxed ordering
>> disabled before I even consider assigning it to the guest on the
>> systems where this would be an issue.
>>
>> I prefer to err on the side of caution with this. Enabling Relaxed
>> Ordering is technically a performance enhancement, so we function but
>> not as well as we would like, while having it enabled when there are
>> issues can lead to data corruption. I would weigh the risk of data
>> corruption the thing to be avoided and of much higher priority than
>> enabling improved performance. As such I say we should default the
>> relaxed ordering attribute to off in general and look at
>> "white-listing" it in for various architectures and/or chipsets that
>> support/need it rather than having it enabled by default and trying to
>> switch it off after the fact when we find some new issue.
>
> I agree, after thinking about it a bit more.. even for transactions going to
> p2p, i'm just reading the pcie spec and some sections aren't super clear
> about completion redirect and ACS rules for p2p.
>
> Also it appears the device control default value is 1 for enabling
> Relaxed Ordering. This means we should probably save these states across
> resets/FLR for e.g. To ensure perf isn't affected after a FLR.

Right. That should happen automatically with the PCIe configuration
being saved/restored.

- Alex


[PATCH] xdp: use common helper for netlink extended ack reporting

2017-05-02 Thread Daniel Borkmann
Small follow-up to d74a32acd59a ("xdp: use netlink extended ACK reporting")
in order to let drivers all use the same NL_SET_ERR_MSG_MOD() helper macro
for reporting. This also ensures that we consistently add the driver's
prefix for dumping the report in user space to indicate that the error
message is driver specific and not coming from core code. Furthermore,
NL_SET_ERR_MSG_MOD() now reuses NL_SET_ERR_MSG() and thus makes all macros
check the pointer as suggested.

References: https://www.spinics.net/lists/netdev/msg433267.html
Signed-off-by: Daniel Borkmann 
---
 ( Would still be good to have this so that also virtio_net sets the
   prefix for user error messages. Thanks! )

 drivers/net/ethernet/netronome/nfp/nfp_net_common.c |  4 ++--
 drivers/net/virtio_net.c|  8 
 include/linux/netlink.h | 19 ---
 3 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index db20376..82bd6b0 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -2532,11 +2532,11 @@ struct nfp_net_dp *nfp_net_clone_dp(struct nfp_net *nn)
if (!dp->xdp_prog)
return 0;
if (dp->fl_bufsz > PAGE_SIZE) {
-   NL_MOD_TRY_SET_ERR_MSG(extack, "MTU too large w/ XDP enabled");
+   NL_SET_ERR_MSG_MOD(extack, "MTU too large w/ XDP enabled");
return -EINVAL;
}
if (dp->num_tx_rings > nn->max_tx_rings) {
-   NL_MOD_TRY_SET_ERR_MSG(extack, "Insufficient number of TX rings 
w/ XDP enabled");
+   NL_SET_ERR_MSG_MOD(extack, "Insufficient number of TX rings w/ 
XDP enabled");
return -EINVAL;
}
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3d0bc48..1c6d392 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1891,17 +1891,17 @@ static int virtnet_xdp_set(struct net_device *dev, 
struct bpf_prog *prog,
virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_ECN) ||
virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_UFO)) {
-   NL_SET_ERR_MSG(extack, "can't set XDP while host is 
implementing LRO, disable LRO first");
+   NL_SET_ERR_MSG_MOD(extack, "Can't set XDP while host is 
implementing LRO, disable LRO first");
return -EOPNOTSUPP;
}
 
if (vi->mergeable_rx_bufs && !vi->any_header_sg) {
-   NL_SET_ERR_MSG(extack, "XDP expects header/data in single page, 
any_header_sg required");
+   NL_SET_ERR_MSG_MOD(extack, "XDP expects header/data in single 
page, any_header_sg required");
return -EINVAL;
}
 
if (dev->mtu > max_sz) {
-   NL_SET_ERR_MSG(extack, "MTU too large to enable XDP");
+   NL_SET_ERR_MSG_MOD(extack, "MTU too large to enable XDP");
netdev_warn(dev, "XDP requires MTU less than %lu\n", max_sz);
return -EINVAL;
}
@@ -1912,7 +1912,7 @@ static int virtnet_xdp_set(struct net_device *dev, struct 
bpf_prog *prog,
 
/* XDP requires extra queues for XDP_TX */
if (curr_qp + xdp_qp > vi->max_queue_pairs) {
-   NL_SET_ERR_MSG(extack, "Too few free TX rings available");
+   NL_SET_ERR_MSG_MOD(extack, "Too few free TX rings available");
netdev_warn(dev, "request %i queues but max is %i\n",
curr_qp + xdp_qp, vi->max_queue_pairs);
return -ENOMEM;
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index c20395e..5fff5ba 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -86,19 +86,16 @@ struct netlink_ext_ack {
  * Currently string formatting is not supported (due
  * to the lack of an output buffer.)
  */
-#define NL_SET_ERR_MSG(extack, msg) do {   \
-   static const char _msg[] = (msg);   \
-   \
-   (extack)->_msg = _msg;  \
+#define NL_SET_ERR_MSG(extack, msg) do {   \
+   static const char __msg[] = (msg);  \
+   struct netlink_ext_ack *__extack = (extack);\
+   \
+   if (__extack)   \
+   __extack->_msg = __msg; \
 } while (0)
 
-#define NL_MOD_TRY_SET_ERR_MSG(extack, msg) do {   \
-   static const char _msg[] = KBUILD_MODNAME ": " msg; \
-   struct netlink_ext_ack *_extack = (extack); \
-   \
-   if (_extack)\
-   _extack->_msg = _msg; 

Re: [PATCH net-next V3 2/2] rtnl: Add support for netdev event attribute to link messages

2017-05-02 Thread David Ahern
On 5/1/17 7:35 AM, Vlad Yasevich wrote:
> So, it looks like the notifier might be there to account for the ioctl/sysfs
> interfaces.

hmmm... I guess we have to leave it then.

> 
> Additionally, the message is not generated from do_setlink() if the devices is
> down, so notifier accounts for that as well.
> 
> I guess, basic question is whether data carried in NETDEV_CHANGEINFODATA is 
> useful
> to user space?  If yes (I can possibly see some managements apps that might 
> be interested
> in it), then we shouldn't just remove it.  Possible solutions to eliminate 
> duplicates
> would be to move the notifier call into non-rtnl code paths...
> 
> Also, may be netdev_state_change() should call rtmsg_ifinfo() unconditionally?

Link up will generate a message, so userspace gets notified at some point.


RE: [PATCH v6 net-next]smsc911x: Adding support for Micochip LAN9250 Ethernet controller

2017-05-02 Thread David.Cai
Andrew:

Got it, Thanks!

David Cai

-Original Message-
From: Andrew Lunn [mailto:and...@lunn.ch] 
Sent: Tuesday, May 02, 2017 2:39 PM
To: David Cai - C24226
Cc: netdev@vger.kernel.org; da...@davemloft.net; UNGLinuxDriver; 
steve.glendinn...@shawell.net
Subject: Re: [PATCH v6 net-next]smsc911x: Adding support for Micochip LAN9250 
Ethernet controller

On Tue, May 02, 2017 at 08:59:14PM +, david@microchip.com wrote:
> From: David Cai 
> 
> Adding support for Microchip LAN9250 Ethernet controller.
> 
> Signed-off-by: David Cai 

Hi David

FYI: It is normal to add here any Reviewed-by: or Tested-by, etc you
received from previous versions of the patch, so long as you don't
make major changes.

Reviewed-by: Andrew Lunn 

Andrew


[PATCH net] net: ipv6: Do not duplicate DAD on link up

2017-05-02 Thread David Ahern
Andrey reported a warning triggered by the rcu code:

[ cut here ]
WARNING: CPU: 1 PID: 5911 at lib/debugobjects.c:289
debug_print_object+0x175/0x210
ODEBUG: activate active (active state 1) object type: rcu_head hint:
(null)
Modules linked in:
CPU: 1 PID: 5911 Comm: a.out Not tainted 4.11.0-rc8+ #271
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x192/0x22d lib/dump_stack.c:52
 __warn+0x19f/0x1e0 kernel/panic.c:549
 warn_slowpath_fmt+0xe0/0x120 kernel/panic.c:564
 debug_print_object+0x175/0x210 lib/debugobjects.c:286
 debug_object_activate+0x574/0x7e0 lib/debugobjects.c:442
 debug_rcu_head_queue kernel/rcu/rcu.h:75
 __call_rcu.constprop.76+0xff/0x9c0 kernel/rcu/tree.c:3229
 call_rcu_sched+0x12/0x20 kernel/rcu/tree.c:3288
 rt6_rcu_free net/ipv6/ip6_fib.c:158
 rt6_release+0x1ea/0x290 net/ipv6/ip6_fib.c:188
 fib6_del_route net/ipv6/ip6_fib.c:1461
 fib6_del+0xa42/0xdc0 net/ipv6/ip6_fib.c:1500
 __ip6_del_rt+0x100/0x160 net/ipv6/route.c:2174
 ip6_del_rt+0x140/0x1b0 net/ipv6/route.c:2187
 __ipv6_ifa_notify+0x269/0x780 net/ipv6/addrconf.c:5520
 addrconf_ifdown+0xe60/0x1a20 net/ipv6/addrconf.c:3672
...

Andrey's reproducer program runs in a very tight loop, calling
'unshare -n' and then spawning 2 sets of 14 threads running random ioctl
calls. The relevant networking sequence:

1. New network namespace created via unshare -n
- ip6tnl0 device is created in down state

2. address added to ip6tnl0
- equivalent to ip -6 addr add dev ip6tnl0 fd00::bb/1
- DAD is started on the address and when it completes the host
  route is inserted into the FIB

3. ip6tnl0 is brought up
- the new fixup_permanent_addr function restarts DAD on the address

4. exit namespace
- teardown / cleanup sequence starts
- once in a blue moon, lo teardown appears to happen BEFORE teardown
  of ip6tunl0
  + down on 'lo' removes the host route from the FIB since the dst->dev
for the route is loobback
  + host route added to rcu callback list
* rcu callback has not run yet, so rt is NOT on the gc list so it has
  NOT been marked obsolete

5. in parallel to 4. worker_thread runs addrconf_dad_completed
- DAD on the address on ip6tnl0 completes
- calls ipv6_ifa_notify which inserts the host route

All of that happens very quickly. The result is that a host route that
has been deleted from the IPv6 FIB and added to the RCU list is re-inserted
into the FIB.

The exit namespace eventually gets to cleaning up ip6tnl0 which removes the
host route from the FIB again, calls the rcu function for cleanup -- and
triggers the double rcu trace.

The root cause is duplicate DAD on the address -- steps 2 and 3. Arguably,
DAD should not be started in step 2. The interface is in the down state,
so it can not really send out requests for the address which makes starting
DAD pointless.

Since the second DAD was introduced by a recent change, seems appropriate
to use it for the Fixes tag and have the fixup function only start DAD for
addresses in the PREDAD state which occurs in addrconf_ifdown if the
address is retained.

Big thanks to Andrey for isolating a reliable reproducer for this problem.
Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
Reported-by: Andrey Konovalov 
Signed-off-by: David Ahern 
---
 net/ipv6/addrconf.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 67ec87ea5fb6..4fc4e8634e65 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3306,7 +3306,8 @@ static int fixup_permanent_addr(struct inet6_dev *idev,
  idev->dev, 0, 0);
}
 
-   addrconf_dad_start(ifp);
+   if (ifp->state == INET6_IFADDR_STATE_PREDAD)
+   addrconf_dad_start(ifp);
 
return 0;
 }
@@ -3656,7 +3657,7 @@ static int addrconf_ifdown(struct net_device *dev, int 
how)
!addr_is_local(>addr)) {
/* set state to skip the notifier below */
state = INET6_IFADDR_STATE_DEAD;
-   ifa->state = 0;
+   ifa->state = INET6_IFADDR_STATE_PREDAD;
if (!(ifa->flags & IFA_F_NODAD))
ifa->flags |= IFA_F_TENTATIVE;
 
-- 
2.1.4



Re: [PATCH v6 net-next]smsc911x: Adding support for Micochip LAN9250 Ethernet controller

2017-05-02 Thread Andrew Lunn
On Tue, May 02, 2017 at 08:59:14PM +, david@microchip.com wrote:
> From: David Cai 
> 
> Adding support for Microchip LAN9250 Ethernet controller.
> 
> Signed-off-by: David Cai 

Hi David

FYI: It is normal to add here any Reviewed-by: or Tested-by, etc you
received from previous versions of the patch, so long as you don't
make major changes.

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next iproute2 1/3] netlink: import netlink message parsing from kernel

2017-05-02 Thread Stephen Hemminger
On Tue, 2 May 2017 14:39:40 -0600
David Ahern  wrote:

> On 5/2/17 1:49 PM, Stephen Hemminger wrote:
> > I am not disagreeing that iproute2 should handle the extended
> > error format. Just want the solution to be as small as possible;
> > ie do no more than is absolutely necessary. And future proof
> > for the inevitable growth in new area.  
> 
> Understood. I was trying to not reinvent a wheel. nla_parse and
> nla_validate have not really been touched in 10 years, and both are well
> written with a good API to the rest of the code base.
> 
> From there, I grabbed whole snippets as opposed to just taking what is
> needed for the ext-ack feature. I left the name as nlattr.c for easy
> diff if future code is wanted. The header file was renamed to nlattr.h
> to avoid confusion with other netlink.h files. Not adding it to
> libnetlink.h facilitates pulling more code via diff as needed.
> 
> In short, I think it is good to re-use code from the kernel (lib and
> tools/lib) where possible and doing so in a way that makes updates as
> easy as header files.

The problem with copy and paste is that the code diverges and rots.
Also the security and error model in user space are different

> >   
> >> +
> >> +static const __u8 nla_attr_minlen[NLA_TYPE_MAX+1] = {
> >> +  [NLA_U8]= sizeof(__u8),
> >> +  [NLA_U16]   = sizeof(__u16),
> >> +  [NLA_U32]   = sizeof(__u32),
> >> +  [NLA_U64]   = sizeof(__u64),
> >> +  [NLA_MSECS] = sizeof(__u64),
> >> +  [NLA_NESTED]= NLA_HDRLEN,
> >> +  [NLA_S8]= sizeof(__s8),
> >> +  [NLA_S16]   = sizeof(__s16),
> >> +  [NLA_S32]   = sizeof(__s32),
> >> +  [NLA_S64]   = sizeof(__s64),
> >> +};
> >> +  
> > 
> > This patch makes iproute2 now doing validation of netlink attributes
> > coming back from the kernel. What is the point, userspace should be
> > trusting the kernel.  
> 
> The kernel has bugs too; userspace should verify what it sends. In this
> case the policy validation is just data types -- a string was expected
> and a string was returned, or an attribute should be a u32 and it is.
> You can argue it is overkill for iproute2, but it is good coding practice.
> 
> And for many netlink based features iproute2 tends to be the model that
> is copied into other code bases.

Then why only for new code.

I am not trying to be overly picky. Just that review is the time to play
devil's advocate and look for issues.


Re: [net-next PATCH 0/4] Improve bpf ELF-loader under samples/bpf

2017-05-02 Thread Daniel Borkmann

On 05/02/2017 02:31 PM, Jesper Dangaard Brouer wrote:

This series improves and fixes bpf ELF loader and programs under
samples/bpf.  The bpf_load.c created some hard to debug issues when
the struct (bpf_map_def) used in the ELF maps section format changed
in commit fb30d4b71214 ("bpf: Add tests for map-in-map").

This was hotfixed in commit 409526bea3c3 ("samples/bpf: bpf_load.c
detect and abort if ELF maps section size is wrong") by detecting the
issue and aborting the program.

In most situations the bpf-loader should be able to handle these kind
of changes to the struct size.  This patch series aim to do proper
backward and forward compabilility handling when loading ELF files.

This series also adjust the callback that was introduced in commit
9fd63d05f3e8 ("bpf: Allow bpf sample programs (*_user.c) to change
bpf_map_def") to use the new bpf_map_data structure, before more users
start to use this callback.

Hoping these changes can make the merge window, as above mentioned
commits have not been merged yet, and it would be good to avoid users
hitting these issues.


Overall, set looks good to me. The last patch doesn't have a
user yet, so probably better to drop it until there is an actual
user in the tree.

Long term, I'd like to see the samples being migrated to use the
tools/lib/bpf/ library from the tree, so that we can avoid duplicating
effort with having two libs in the tree (f.e. elf map validation is
performed to a certain degree in the other one, but w/o compat
support last time I looked).

Anyway, other than that:

Acked-by: Daniel Borkmann 


[PATCH v6 net-next]smsc911x: Adding support for Micochip LAN9250 Ethernet controller

2017-05-02 Thread David.Cai
From: David Cai 

Adding support for Microchip LAN9250 Ethernet controller.

Signed-off-by: David Cai 
---
Changes
V2
 - email format changed
 - remove unnecessary text in commit log Changes
V3
 - defined all supported Ethernet controller chip ID.
V4
 - changed 'if (pdata->generation == 4 && pdata->sub_generation)' to
   'if ((pdata->idrev & 0x) == LAN9250)' for more readable
V5
 - removed the variable 'sub_generation'.
V6
 - correct email format error. 

 drivers/net/ethernet/smsc/smsc911x.c | 49 ++--
 drivers/net/ethernet/smsc/smsc911x.h | 19 ++
 2 files changed, 49 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/smsc/smsc911x.c 
b/drivers/net/ethernet/smsc/smsc911x.c
index fa5ca09..ea1bbc3 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -25,7 +25,7 @@
  *   LAN9215, LAN9216, LAN9217, LAN9218
  *   LAN9210, LAN9211
  *   LAN9220, LAN9221
- *   LAN89218
+ *   LAN89218,LAN9250
  *
  */
 
@@ -1450,6 +1450,8 @@ static int smsc911x_soft_reset(struct smsc911x_data 
*pdata)
unsigned int timeout;
unsigned int temp;
int ret;
+   unsigned int reset_offset = HW_CFG;
+   unsigned int reset_mask = HW_CFG_SRST_;
 
/*
 * Make sure to power-up the PHY chip before doing a reset, otherwise
@@ -1476,15 +1478,23 @@ static int smsc911x_soft_reset(struct smsc911x_data 
*pdata)
}
}
 
+   if ((pdata->idrev & 0x) == LAN9250) {
+   /* special reset for  LAN9250 */
+   reset_offset = RESET_CTL;
+   reset_mask = RESET_CTL_DIGITAL_RST_;
+   }
+
/* Reset the LAN911x */
-   smsc911x_reg_write(pdata, HW_CFG, HW_CFG_SRST_);
+   smsc911x_reg_write(pdata, reset_offset, reset_mask);
+
+   /* verify reset bit is cleared */
timeout = 10;
do {
udelay(10);
-   temp = smsc911x_reg_read(pdata, HW_CFG);
-   } while ((--timeout) && (temp & HW_CFG_SRST_));
+   temp = smsc911x_reg_read(pdata, reset_offset);
+   } while ((--timeout) && (temp & reset_mask));
 
-   if (unlikely(temp & HW_CFG_SRST_)) {
+   if (unlikely(temp & reset_mask)) {
SMSC_WARN(pdata, drv, "Failed to complete reset");
return -EIO;
}
@@ -2253,28 +2263,29 @@ static int smsc911x_init(struct net_device *dev)
 
pdata->idrev = smsc911x_reg_read(pdata, ID_REV);
switch (pdata->idrev & 0x) {
-   case 0x0118:
-   case 0x0117:
-   case 0x0116:
-   case 0x0115:
-   case 0x218A:
+   case LAN9118:
+   case LAN9117:
+   case LAN9116:
+   case LAN9115:
+   case LAN89218:
/* LAN911[5678] family */
pdata->generation = pdata->idrev & 0x;
break;
 
-   case 0x118A:
-   case 0x117A:
-   case 0x116A:
-   case 0x115A:
+   case LAN9218:
+   case LAN9217:
+   case LAN9216:
+   case LAN9215:
/* LAN921[5678] family */
pdata->generation = 3;
break;
 
-   case 0x9210:
-   case 0x9211:
-   case 0x9220:
-   case 0x9221:
-   /* LAN9210/LAN9211/LAN9220/LAN9221 */
+   case LAN9210:
+   case LAN9211:
+   case LAN9220:
+   case LAN9221:
+   case LAN9250:
+   /* LAN9210/LAN9211/LAN9220/LAN9221/LAN9250 */
pdata->generation = 4;
break;
 
diff --git a/drivers/net/ethernet/smsc/smsc911x.h 
b/drivers/net/ethernet/smsc/smsc911x.h
index 54d6489..8d75508 100644
--- a/drivers/net/ethernet/smsc/smsc911x.h
+++ b/drivers/net/ethernet/smsc/smsc911x.h
@@ -20,6 +20,22 @@
 #ifndef __SMSC911X_H__
 #define __SMSC911X_H__
 
+/*Chip ID*/
+#define LAN91150x0115
+#define LAN91160x0116
+#define LAN91170x0117
+#define LAN91180x0118
+#define LAN92150x115A
+#define LAN92160x116A
+#define LAN92170x117A
+#define LAN92180x118A
+#define LAN92100x9210
+#define LAN92110x9211
+#define LAN92200x9220
+#define LAN92210x9221
+#define LAN92500x9250
+#define LAN89218   0x218A
+
 #define TX_FIFO_LOW_THRESHOLD  ((u32)1600)
 #define SMSC911X_EEPROM_SIZE   ((u32)128)
 #define USE_DEBUG  0
@@ -303,6 +319,9 @@
 #define E2P_DATA_EEPROM_DATA_  0x00FF
 #define LAN_REGISTER_EXTENT0x0100
 
+#define RESET_CTL  0x1F8
+#define RESET_CTL_DIGITAL_RST_ 0x0001
+
 /*
  * MAC Control and Status Register (Indirect Address)
  * Offset (through the MAC_CSR CMD and DATA port)
-- 
2.7.4



Re: [PATCH net-next iproute2 1/3] netlink: import netlink message parsing from kernel

2017-05-02 Thread David Ahern
On 5/2/17 1:49 PM, Stephen Hemminger wrote:
> I am not disagreeing that iproute2 should handle the extended
> error format. Just want the solution to be as small as possible;
> ie do no more than is absolutely necessary. And future proof
> for the inevitable growth in new area.

Understood. I was trying to not reinvent a wheel. nla_parse and
nla_validate have not really been touched in 10 years, and both are well
written with a good API to the rest of the code base.

>From there, I grabbed whole snippets as opposed to just taking what is
needed for the ext-ack feature. I left the name as nlattr.c for easy
diff if future code is wanted. The header file was renamed to nlattr.h
to avoid confusion with other netlink.h files. Not adding it to
libnetlink.h facilitates pulling more code via diff as needed.

In short, I think it is good to re-use code from the kernel (lib and
tools/lib) where possible and doing so in a way that makes updates as
easy as header files.

> 
>> +
>> +static const __u8 nla_attr_minlen[NLA_TYPE_MAX+1] = {
>> +[NLA_U8]= sizeof(__u8),
>> +[NLA_U16]   = sizeof(__u16),
>> +[NLA_U32]   = sizeof(__u32),
>> +[NLA_U64]   = sizeof(__u64),
>> +[NLA_MSECS] = sizeof(__u64),
>> +[NLA_NESTED]= NLA_HDRLEN,
>> +[NLA_S8]= sizeof(__s8),
>> +[NLA_S16]   = sizeof(__s16),
>> +[NLA_S32]   = sizeof(__s32),
>> +[NLA_S64]   = sizeof(__s64),
>> +};
>> +
> 
> This patch makes iproute2 now doing validation of netlink attributes
> coming back from the kernel. What is the point, userspace should be
> trusting the kernel.

The kernel has bugs too; userspace should verify what it sends. In this
case the policy validation is just data types -- a string was expected
and a string was returned, or an attribute should be a u32 and it is.
You can argue it is overkill for iproute2, but it is good coding practice.

And for many netlink based features iproute2 tends to be the model that
is copied into other code bases.


Re: [net-next PATCH 0/4] Improve bpf ELF-loader under samples/bpf

2017-05-02 Thread Daniel Borkmann

On 05/02/2017 09:40 PM, David Miller wrote:

From: Jesper Dangaard Brouer 
Date: Tue, 02 May 2017 14:31:45 +0200


This series improves and fixes bpf ELF loader and programs under
samples/bpf.  The bpf_load.c created some hard to debug issues when
the struct (bpf_map_def) used in the ELF maps section format changed
in commit fb30d4b71214 ("bpf: Add tests for map-in-map").

This was hotfixed in commit 409526bea3c3 ("samples/bpf: bpf_load.c
detect and abort if ELF maps section size is wrong") by detecting the
issue and aborting the program.

In most situations the bpf-loader should be able to handle these kind
of changes to the struct size.  This patch series aim to do proper
backward and forward compabilility handling when loading ELF files.

This series also adjust the callback that was introduced in commit
9fd63d05f3e8 ("bpf: Allow bpf sample programs (*_user.c) to change
bpf_map_def") to use the new bpf_map_data structure, before more users
start to use this callback.

Hoping these changes can make the merge window, as above mentioned
commits have not been merged yet, and it would be good to avoid users
hitting these issues.


Alexei and Daniel, please review.


I'm on it.


Re: [RFC PATCH v2 07/17] net: sched: drop qdisc_reset from dev_graft_qdisc

2017-05-02 Thread Jesper Dangaard Brouer
On Tue, 02 May 2017 08:38:19 -0700
John Fastabend  wrote:

> @@ -991,20 +996,25 @@ void dev_deactivate_many(struct list_head *head)
>_qdisc);
>  
>   dev_watchdog_down(dev);
> - sync_needed |= !dev->dismantle;
>   }
>  
>   /* Wait for outstanding qdisc-less dev_queue_xmit calls.
>* This is avoided if all devices are in dismantle phase :
>* Caller will call synchronize_net() for us
>*/

Is the comment still correct after this change?

> - if (sync_needed)
> - synchronize_net();
> + synchronize_net();

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [net-next 0/2] tipc: refactor socket receive functions

2017-05-02 Thread David Miller
From: Jon Maloy 
Date: Tue, 2 May 2017 18:16:52 +0200

> We try to make the functions tipc_sk_recvmsg() and
> tipc_sk_recvstream() more readable.

Series applied, thanks.


Re: [PATCH net-next iproute2 1/3] netlink: import netlink message parsing from kernel

2017-05-02 Thread Stephen Hemminger
I am not disagreeing that iproute2 should handle the extended
error format. Just want the solution to be as small as possible;
ie do no more than is absolutely necessary. And future proof
for the inevitable growth in new area.

> +
> +static const __u8 nla_attr_minlen[NLA_TYPE_MAX+1] = {
> + [NLA_U8]= sizeof(__u8),
> + [NLA_U16]   = sizeof(__u16),
> + [NLA_U32]   = sizeof(__u32),
> + [NLA_U64]   = sizeof(__u64),
> + [NLA_MSECS] = sizeof(__u64),
> + [NLA_NESTED]= NLA_HDRLEN,
> + [NLA_S8]= sizeof(__s8),
> + [NLA_S16]   = sizeof(__s16),
> + [NLA_S32]   = sizeof(__s32),
> + [NLA_S64]   = sizeof(__s64),
> +};
> +

This patch makes iproute2 now doing validation of netlink attributes
coming back from the kernel. What is the point, userspace should be
trusting the kernel.


Re: [PATCH v3 net-next 00/11] ibmvnic: Updated reset handler and code fixes

2017-05-02 Thread David Miller
From: Nathan Fontenot 
Date: Tue, 02 May 2017 14:52:11 -0400

> This set of patches multiple code fixes and a new rest handler
> for the ibmvnic driver. In order to implement the new reset handler
> for the ibmvnic driver resource initialization needed to be moved to
> its own routine, a state variable is introduced to replace the
> various is_* flags in the driver, and a new routine to handle the
> assorted reasons the driver can be reset.
> 
> v3 updates:
> 
> Patch 10/11: Correct patch subject line to be a description of the patch.
> 
> v2 updates:
> 
> Patch 11/11: Use __netif_subqueue_stopped() instead of
> netif_subqueue_stopped() to avoid possible use of an un-initialized
> skb variable.

Your patches have trailing whitespace errors, please fix:

Applying: ibmvnic: Move resource initialization to its own routine
Applying: ibmvnic: Replace is_closed with state field
Applying: ibmvnic: Updated reset handling
/home/davem/src/GIT/net-next/.git/rebase-apply/patch:104: trailing whitespace.

warning: 1 line adds whitespace errors.
Applying: ibmvnic: Delete napi's when releasing driver resources
Applying: ibmvnic: Whitespace correction in release_rx_pools
Applying: ibmvnic: Clean up tx pools when closing
Applying: ibmvnic: Wait for any pending scrqs entries at driver close
/home/davem/src/GIT/net-next/.git/rebase-apply/patch:68: trailing whitespace.

warning: 1 line adds whitespace errors.
Applying: ibmvnic: Check for driver reset first in ibmvnic_xmit
Applying: ibmvnic: Continue skb processing after skb completion error
Applying: ibmvnic: Record SKB RX queue during poll
Applying: ibmvnic: Move queue restarting in ibmvnic_tx_complete


Re: [PATCH 0/9] net: thunderx: Adds XDP support

2017-05-02 Thread David Miller
From: sunil.kovv...@gmail.com
Date: Tue,  2 May 2017 18:36:49 +0530

> From: Sunil Goutham 
> 
> This patch series adds support for XDP to ThunderX NIC driver
> which is used on CN88xx, CN81xx and CN83xx platforms. 
> 
> Patches 1-4 are performance improvement and cleanup patches
> which are done keeping XDP performance bottlenecks in view.
> Rest of the patches adds actual XDP support.

Series applied, thanks for doing this work.

Do you have any performance numbers?


Re: [net-next PATCH 0/4] Improve bpf ELF-loader under samples/bpf

2017-05-02 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Tue, 02 May 2017 14:31:45 +0200

> This series improves and fixes bpf ELF loader and programs under
> samples/bpf.  The bpf_load.c created some hard to debug issues when
> the struct (bpf_map_def) used in the ELF maps section format changed
> in commit fb30d4b71214 ("bpf: Add tests for map-in-map").
> 
> This was hotfixed in commit 409526bea3c3 ("samples/bpf: bpf_load.c
> detect and abort if ELF maps section size is wrong") by detecting the
> issue and aborting the program.
> 
> In most situations the bpf-loader should be able to handle these kind
> of changes to the struct size.  This patch series aim to do proper
> backward and forward compabilility handling when loading ELF files.
> 
> This series also adjust the callback that was introduced in commit
> 9fd63d05f3e8 ("bpf: Allow bpf sample programs (*_user.c) to change
> bpf_map_def") to use the new bpf_map_data structure, before more users
> start to use this callback.
> 
> Hoping these changes can make the merge window, as above mentioned
> commits have not been merged yet, and it would be good to avoid users
> hitting these issues.

Alexei and Daniel, please review.


Re: [PATCH] ipx: call ipxitf_put() in ioctl error path

2017-05-02 Thread David Miller
From: Dan Carpenter 
Date: Tue, 2 May 2017 13:58:53 +0300

> We should call ipxitf_put() if the copy_to_user() fails.
> 
> Reported-by: 李强 
> Signed-off-by: Dan Carpenter 

Applied, thanks Dan.


Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-02 Thread Raj, Ashok
On Tue, May 02, 2017 at 11:10:22AM -0700, Alexander Duyck wrote:
> On Tue, May 2, 2017 at 9:53 AM, Raj, Ashok  wrote:
> > On Tue, May 02, 2017 at 09:39:34AM -0700, Alexander Duyck wrote:
> >> On Mon, May 1, 2017 at 4:13 PM, Casey Leedom  wrote:
> >> > The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed
> >> > Ordering Attribute should not be used on Transaction Layer Packets 
> >> > destined
> >> > for the PCIe End Node so flagged.  Initially flagged this way are Intel
> >> > E5-26xx Root Complex Ports which suffer from a Flow Control Credit
> >> > Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports 
> >> > which
> >> > don't obey PCIe 3.0 ordering rules which can lead to Data Corruption.
> >>
> >> So this is a good first step though might I suggest one other change.
> >>
> >> We may want to add logic to the core PCIe code that clears the "Enable
> >> Relaxed Ordering" bit in the device control register for all devices
> >> hanging off of this root complex. Assuming the devices conform to the
> >> PCIe spec doing that should disable relaxed ordering in a device
> >> agnostic way that then enables us at a driver level to just enable the
> >> feature always without having to perform any checks for your flag. We
> >> could probably do that as a part of probing the PCIe interfaces
> >> hanging off of these devices.
> >
> > I suppose you don't want to turn off RO completely on the device. When
> > traffic is targetted to mmio for peer to peer reasons RO has performance
> > upside. The specific issue with these root ports indicate limitation using
> > RO for traffic targetting coherent memory.
> 
> Actually my main concern here is virtualization. If I take the PCIe
> function and direct assign it I have no way of seeing the root complex
> flag as it is now virtualized away. In the meantime the guest now has
> the ability to enable the function and sees nothing that says you
> can't enable relaxed ordering which in turn ends up potentially
> causing data corruption on the system. I want relaxed ordering
> disabled before I even consider assigning it to the guest on the
> systems where this would be an issue.
> 
> I prefer to err on the side of caution with this. Enabling Relaxed
> Ordering is technically a performance enhancement, so we function but
> not as well as we would like, while having it enabled when there are
> issues can lead to data corruption. I would weigh the risk of data
> corruption the thing to be avoided and of much higher priority than
> enabling improved performance. As such I say we should default the
> relaxed ordering attribute to off in general and look at
> "white-listing" it in for various architectures and/or chipsets that
> support/need it rather than having it enabled by default and trying to
> switch it off after the fact when we find some new issue.

I agree, after thinking about it a bit more.. even for transactions going to
p2p, i'm just reading the pcie spec and some sections aren't super clear
about completion redirect and ACS rules for p2p.

Also it appears the device control default value is 1 for enabling
Relaxed Ordering. This means we should probably save these states across
resets/FLR for e.g. To ensure perf isn't affected after a FLR.
> 
> So for example, in the case of x86 it seems like there are multiple
> root complexes that have issues, and the gains for enabling it with
> standard DMA to host memory are small. As such we may want to default
> it to off via the architecture specific PCIe code and then look at
> having "white-list" cases where we enable it for things like
> peer-to-peer accesses. In the case of SPARC we could look at
> defaulting it to on, and only "black-list" any cases where there might
> be issues since SPARC relies on this in a significant way for
> performance. In the case of ARM and other architectures it is a bit of
> a toss-up. I would say we could just default it to on for now and
> "black-list" anything that doesn't work, or we could go the other way
> like I suggested for x86. It all depends on what the ARM community
> would want to agree on for this. I would say unless it makes a
> significant difference like it does in the case of SPARC we are
> probably better off just defaulting it to off.
> 
> - Alex


Re: [patch net-next repost] net: sched: add helpers to handle extended actions

2017-05-02 Thread David Miller
From: Jiri Pirko 
Date: Tue,  2 May 2017 10:12:00 +0200

> From: Jiri Pirko 
> 
> Jump is now the only one using value action opcode. This is going to
> change soon. So introduce helpers to work with this. Convert TC_ACT_JUMP.
> 
> This also fixes the TC_ACT_JUMP check, which is incorrectly done as a
> bit check, not a value check.
> 
> Fixes: e0ee84ded796 ("net sched actions: Complete the JUMPX opcode")
> Signed-off-by: Jiri Pirko 
> ---
> Dave, I'm sending this for -net-next although I know it is closed. But
> the mentioned commit is not yet in -net. Feel free to take this either
> to -net-next or -net, whatever suits you better. Thanks.

Applied to net-next, thanks Jiri.


Re: [PATCH net 0/2] qed*: PTP bug fixes.

2017-05-02 Thread David Miller
From: Sudarsana Reddy Kalluru 
Date: Tue, 2 May 2017 01:11:01 -0700

> The series addresses couple of issues in the PTP implementation.

Series applied.


Re: [PATCH v2] stmmac: Add support for SIMATIC IOT2000 platform

2017-05-02 Thread David Miller
From: Jan Kiszka 
Date: Tue, 2 May 2017 09:58:00 +0200

> The IOT2000 is industrial controller platform, derived from the Intel
> Galileo Gen2 board. The variant IOT2020 comes with one LAN port, the
> IOT2040 has two of them. They can be told apart based on the board asset
> tag in the DMI table.
> 
> Based on patch by Sascha Weisenberger.
> 
> Signed-off-by: Jan Kiszka 
> Signed-off-by: Sascha Weisenberger 
> ---
> 
> Changes in v2:
> - reformatted match conditions [Andy]

Applied, thanks.


Re: [PATCH net v4 00/12] Fix possbile memleaks when fail to register_netdevice

2017-05-02 Thread David Miller
From: gfree.w...@foxmail.com
Date: Tue,  2 May 2017 13:58:42 +0800

> These following drivers allocate kinds of resources in its ndo_init
> func, free some of them or all in the destructor func. Then there is
> one memleak that some errors happen after register_netdevice invokes
> the ndo_init callback. Because only the ndo_uninit callback is invoked
> in the error handler of register_netdevice, but destructor not.
> 
> In my original approach, I tried to free the resources in the newlink
> func when fail to register_netdevice, like destructor did except not
> free the net_dev. This method is not good when destructor is changed,
> and the memleak could be not fixed when there is no newlink callback.
> 
> Now create one new func used to free the resources in the destructor,
> and the ndo_uninit func also could invokes it when fail to register
> the net_device by comparing the dev->reg_state with NETREG_UNINITIALIZED.
> If there is no existing ndo_uninit, just add one.
> 
> This solution doesn't only make sure free all resources in any case,
> but also follows the original desgin that some resources could be kept
> until the destructor executes normally after register the device
> successfully.

I want to think about this some more.

It is really unfortunate that resources are allocated strictly from
the ndo_init() yet released in two different callbacks which are
invoked only in certain (different) situations.

Just the fact that we have to make an internal netdev state test
in the ndo_uninit callback to get this right is a big red flag
to me.


Re: [PATCH net v2] net: hns: fix ethtool_get_strings overflow in hns driver

2017-05-02 Thread David Miller
From: Timmy Li 
Date: Tue, 2 May 2017 10:46:52 +0800

> hns_get_sset_count() returns HNS_NET_STATS_CNT and the data space allocated
> is not enough for ethtool_get_strings(), which will cause random memory
> corruption.
> 
> When SLAB and DEBUG_SLAB are both enabled, memory corruptions like the
> the following can be observed without this patch:
> [   43.115200] Slab corruption (Not tainted): Acpi-ParseExt 
> start=801fb0b69030, len=80
> [   43.115206] Redzone: 0x9f911029d006462/0x5f78745f31657070.
> [   43.115208] Last user: [<5f7272655f746b70>](0x5f7272655f746b70)
> [   43.115214] 010: 70 70 65 31 5f 74 78 5f 70 6b 74 00 6b 6b 6b 6b  
> ppe1_tx_pkt.
> [   43.115217] 030: 70 70 65 31 5f 74 78 5f 70 6b 74 5f 6f 6b 00 6b  
> ppe1_tx_pkt_ok.k
> [   43.115218] Next obj: start=801fb0b69098, len=80
> [   43.115220] Redzone: 0x706d655f6f666966/0x9f911029d74e35b.
> [   43.115229] Last user: 
> [](acpi_os_release_object+0x28/0x38)
> [   43.115231] 000: 74 79 00 6b 6b 6b 6b 6b 70 70 65 31 5f 74 78 5f  
> ty.kppe1_tx_
> [   43.115232] 010: 70 6b 74 5f 65 72 72 5f 63 73 75 6d 5f 66 61 69  
> pkt_err_csum_fai
> 
> Signed-off-by: Timmy Li 
> ---
> Changelog:
> 
> v1 -> v2:
> * Remove unnecessary parenthesis

Applied, thank you.


Re: [PATCH net] tcp: fix wraparound issue in tcp_lp

2017-05-02 Thread David Miller
From: Eric Dumazet 
Date: Mon, 01 May 2017 15:29:48 -0700

> From: Eric Dumazet 
> 
> Be careful when comparing tcp_time_stamp to some u32 quantity,
> otherwise result can be surprising.
> 
> Fixes: 7c106d7e782b ("[TCP]: TCP Low Priority congestion control")
> Signed-off-by: Eric Dumazet 

Applied and queued up for -stable, thanks.


Re: [PATCH net] bpf, arm64: fix jit branch offset related to ldimm64

2017-05-02 Thread David Miller
From: Daniel Borkmann 
Date: Tue,  2 May 2017 20:34:54 +0200

> When the instruction right before the branch destination is
> a 64 bit load immediate, we currently calculate the wrong
> jump offset in the ctx->offset[] array as we only account
> one instruction slot for the 64 bit load immediate although
> it uses two BPF instructions. Fix it up by setting the offset
> into the right slot after we incremented the index.
 ...
> Also, add a couple of test cases to make sure JITs pass
> this test. Tested on Cavium ThunderX ARMv8. The added
> test cases all pass after the fix.
> 
> Fixes: 8eee539ddea0 ("arm64: bpf: fix out-of-bounds read in bpf2a64_offset()")
> Reported-by: David S. Miller 
> Signed-off-by: Daniel Borkmann 
> Acked-by: Alexei Starovoitov 

Applied and queued up for -stable, thanks!

I also applied your XADD patch as well.

Thanks again.


Re: [PATCH net-next RFC 1/1] net netlink: Add new type NLA_FLAG_BITS

2017-05-02 Thread David Miller
From: Jamal Hadi Salim 
Date: Sun, 30 Apr 2017 10:28:39 -0400

> Generic bitflags attribute content sent to the kernel by user.
> With this type the user can either set or unset a flag in the
> kernel.

You asked for feedback, here it is :-)

I think this is overengineered.

Just define a u32 for the value, and mask which defines which bits are
legitimate and defined.  Any bit outside of the legitimate mask must
be zero.

Simple.


Re: [PATCH] net: ethernet: stmmac: properly set PS bit in MII configurations during reset

2017-05-02 Thread David Miller

Someone needs to review this patch.


Re: [PATCH net-next v2] net: ipv6: make sure multicast packets are not forwarded beyond the different scopes

2017-05-02 Thread David Miller
From: Donatas Abraitis 
Date: Thu, 27 Apr 2017 10:12:02 +0300

>   RFC4291 2.7 Routers must not forward any multicast packets
>   beyond of the scope indicated by the scop field in the
>   destination multicast address.
> 
> Signed-off-by: Donatas Abraitis 

I think it's a ">=" test which is needed here, not pure equality.
Scopes are subsets of other scopes and are therefore allowed within
eachother.

Did you actually see misbehavior due to this issue, or see a real
bonafide conformance test fail?

If you're just reading the RFC and sticking tests here and there based
upon what you read, without any testing or real life verification of
the issue, this is _strongly_ discouraged.

It would even be ok if you merely showed how another open source
networking stack makes this test.


[PATCH] bonding: Use seq_putc() in bond_info_show_master()

2017-05-02 Thread SF Markus Elfring
From: Markus Elfring 
Date: Tue, 2 May 2017 20:48:36 +0200

A few single characters should be put into a sequence.
Thus use the corresponding function "seq_putc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 drivers/net/bonding/bond_procfs.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bonding/bond_procfs.c 
b/drivers/net/bonding/bond_procfs.c
index d8d4ada034b7..bdc4db184d94 100644
--- a/drivers/net/bonding/bond_procfs.c
+++ b/drivers/net/bonding/bond_procfs.c
@@ -72,7 +72,7 @@ static void bond_info_show_master(struct seq_file *seq)
seq_printf(seq, " (fail_over_mac %s)", optval->string);
}
 
-   seq_printf(seq, "\n");
+   seq_putc(seq, '\n');
 
if (bond_mode_uses_xmit_hash(bond)) {
optval = bond_opt_get_val(BOND_OPT_XMIT_HASH,
@@ -117,11 +117,11 @@ static void bond_info_show_master(struct seq_file *seq)
if (!bond->params.arp_targets[i])
break;
if (printed)
-   seq_printf(seq, ",");
+   seq_putc(seq, ',');
seq_printf(seq, " %pI4", >params.arp_targets[i]);
printed = 1;
}
-   seq_printf(seq, "\n");
+   seq_putc(seq, '\n');
}
 
if (BOND_MODE(bond) == BOND_MODE_8023AD) {
-- 
2.12.2



Re: [PATCH net-next iproute2 1/3] netlink: import netlink message parsing from kernel

2017-05-02 Thread David Miller
From: David Ahern 
Date: Tue, 2 May 2017 12:39:51 -0600

> On 5/2/17 12:03 PM, Stephen Hemminger wrote:
>> Then use libmnl it is already used in several other places in iproute2.
>> Eventually, I would like to use it everywhere and get rid of old netlink 
>> parser.
>> 
> 
> Why? libmnl is not going to simplify the iproute2 code.

Agreed.

> Look at attribute validation. Importing the kernel code into iproute2,
> the API is very familiar to anyone hacking on the kernel side:
> 
> +   if (nla_parse(tb, NLMSGERR_ATTR_MAX, attr, alen, extack_policy)
> != 0) {
> +   fprintf(stderr,
> +   "Failed to parse extended error attributes\n");
> +   return 0;
> +   }
> +
> 
> ie., you pass a policy to the parse routine and the checking is part of
> nla_parse. The implementation of nla_parse and validate_nla are quite
> easy to read.
> 
> Now take a look at what devlink has for validation - attr_cb. IMO very
> unreadable and puts the burden on the app using the libmnl API.

Also agreed.

Stephen I totally agree with David, asking him to use libmnl for this is
not reasonable nor is it even a good idea given the above.


Re: [PATCH net-next iproute2 1/3] netlink: import netlink message parsing from kernel

2017-05-02 Thread David Ahern
On 5/2/17 12:03 PM, Stephen Hemminger wrote:
> Then use libmnl it is already used in several other places in iproute2.
> Eventually, I would like to use it everywhere and get rid of old netlink 
> parser.
> 

Why? libmnl is not going to simplify the iproute2 code.

Look at attribute validation. Importing the kernel code into iproute2,
the API is very familiar to anyone hacking on the kernel side:

+   if (nla_parse(tb, NLMSGERR_ATTR_MAX, attr, alen, extack_policy)
!= 0) {
+   fprintf(stderr,
+   "Failed to parse extended error attributes\n");
+   return 0;
+   }
+

ie., you pass a policy to the parse routine and the checking is part of
nla_parse. The implementation of nla_parse and validate_nla are quite
easy to read.

Now take a look at what devlink has for validation - attr_cb. IMO very
unreadable and puts the burden on the app using the libmnl API.


Re: net/smc and the RDMA core

2017-05-02 Thread Bart Van Assche
On Tue, 2017-05-02 at 14:25 +0200, Ursula Braun wrote:
> if you can point out specific issues, we will be happy to work with you
> to get them addressed!

Hello Ursula,

My list of issues that I would like to see addressed can be found below. Doug,
Christoph and others may have additional inputs. The issues that have not yet
been mentioned in other e-mails are:
- The SMC driver only supports one RDMA transport type (RoCE v1) but
  none of the other RDMA transport types (RoCE v2, IB and iWARP). New
  RDMA drivers should support all RDMA transport types transparently.
  The traditional approach to support multiple RDMA transport types is
  by using the RDMA/CM to establish connections.
- The implementation of the SMC driver only supports RoCEv1. This is
  a very unfortunate choice because:
  - RoCEv1 is not routable and hence is limited to a single Ethernet
broadcast domain.
  - RoCEv1 packets escape a whole bunch of mechanisms that only work
for IP packets. Firewalls pass all RoCEv1 packets and switches
do not restrict RoCEv1 packets to a single VLAN. This means that
if the network configuration is changed after an SMC connection
has been set up such that IP communication between the endpoints
of an SMC connection is blocked that the SMC RoCEv1 packets will
not be blocked by the network equipment of which the configuration
has just been changed.
- As already mentioned by Christoph, the SMC implementation uses RDMA
  calls that probably will be deprecated soon (ib_create_cq()) and
  should use the RDMA R/W API instead of building sge lists itself.

Bart.

[PATCH net] bpf, arm64: fix jit branch offset related to ldimm64

2017-05-02 Thread Daniel Borkmann
When the instruction right before the branch destination is
a 64 bit load immediate, we currently calculate the wrong
jump offset in the ctx->offset[] array as we only account
one instruction slot for the 64 bit load immediate although
it uses two BPF instructions. Fix it up by setting the offset
into the right slot after we incremented the index.

Before (ldimm64 test 1):

  [...]
  0020:  5287  mov w7, #0x0 // #0
  0024:  d2800060  mov x0, #0x3 // #3
  0028:  d2800041  mov x1, #0x2 // #2
  002c:  eb01001f  cmp x0, x1
  0030:  5482  b.cs 0x0020
  0034:  d29fffe7  mov x7, #0x // #65535
  0038:  f2bfffe7  movk x7, #0x, lsl #16
  003c:  f2dfffe7  movk x7, #0x, lsl #32
  0040:  f2e7  movk x7, #0x, lsl #48
  0044:  d29dddc7  mov x7, #0x // #61166
  0048:  f2bdddc7  movk x7, #0x, lsl #16
  004c:  f2c7  movk x7, #0x, lsl #32
  0050:  f2fdddc7  movk x7, #0x, lsl #48
  [...]

After (ldimm64 test 1):

  [...]
  0020:  5287  mov w7, #0x0 // #0
  0024:  d2800060  mov x0, #0x3 // #3
  0028:  d2800041  mov x1, #0x2 // #2
  002c:  eb01001f  cmp x0, x1
  0030:  54a2  b.cs 0x0044
  0034:  d29fffe7  mov x7, #0x // #65535
  0038:  f2bfffe7  movk x7, #0x, lsl #16
  003c:  f2dfffe7  movk x7, #0x, lsl #32
  0040:  f2e7  movk x7, #0x, lsl #48
  0044:  d29dddc7  mov x7, #0x // #61166
  0048:  f2bdddc7  movk x7, #0x, lsl #16
  004c:  f2c7  movk x7, #0x, lsl #32
  0050:  f2fdddc7  movk x7, #0x, lsl #48
  [...]

Also, add a couple of test cases to make sure JITs pass
this test. Tested on Cavium ThunderX ARMv8. The added
test cases all pass after the fix.

Fixes: 8eee539ddea0 ("arm64: bpf: fix out-of-bounds read in bpf2a64_offset()")
Reported-by: David S. Miller 
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
Cc: Xi Wang 
---
 ( Based against net-next where BPF related patches are usually
   routed, if something else is preferred please let me know. )

 arch/arm64/net/bpf_jit_comp.c |  8 
 lib/test_bpf.c| 45 +++
 2 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index a785554..ce8ab04 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -779,14 +779,14 @@ static int build_body(struct jit_ctx *ctx)
int ret;
 
ret = build_insn(insn, ctx);
-
-   if (ctx->image == NULL)
-   ctx->offset[i] = ctx->idx;
-
if (ret > 0) {
i++;
+   if (ctx->image == NULL)
+   ctx->offset[i] = ctx->idx;
continue;
}
+   if (ctx->image == NULL)
+   ctx->offset[i] = ctx->idx;
if (ret)
return ret;
}
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 0362da0..2e38502 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -4656,6 +4656,51 @@ static int bpf_fill_ld_abs_vlan_push_pop(struct bpf_test 
*self)
{ },
{ { 0, 1 } },
},
+   {
+   /* Mainly testing JIT + imm64 here. */
+   "JMP_JGE_X: ldimm64 test 1",
+   .u.insns_int = {
+   BPF_ALU32_IMM(BPF_MOV, R0, 0),
+   BPF_LD_IMM64(R1, 3),
+   BPF_LD_IMM64(R2, 2),
+   BPF_JMP_REG(BPF_JGE, R1, R2, 2),
+   BPF_LD_IMM64(R0, 0xUL),
+   BPF_LD_IMM64(R0, 0xUL),
+   BPF_EXIT_INSN(),
+   },
+   INTERNAL,
+   { },
+   { { 0, 0xU } },
+   },
+   {
+   "JMP_JGE_X: ldimm64 test 2",
+   .u.insns_int = {
+   BPF_ALU32_IMM(BPF_MOV, R0, 0),
+   BPF_LD_IMM64(R1, 3),
+   BPF_LD_IMM64(R2, 2),
+   BPF_JMP_REG(BPF_JGE, R1, R2, 0),
+   BPF_LD_IMM64(R0, 0xUL),
+   BPF_EXIT_INSN(),
+   },
+   INTERNAL,
+   { },
+   { { 0, 0xU } },
+   },
+   {
+   "JMP_JGE_X: ldimm64 test 3",
+   .u.insns_int = {
+   BPF_ALU32_IMM(BPF_MOV, R0, 1),
+   BPF_LD_IMM64(R1, 3),
+   BPF_LD_IMM64(R2, 2),
+   BPF_JMP_REG(BPF_JGE, R1, R2, 4),
+   BPF_LD_IMM64(R0, 0xUL),
+   BPF_LD_IMM64(R0, 0xUL),
+   BPF_EXIT_INSN(),
+  

[PATCH] net: ipv6: Fix warning of freeing alive inet6 address

2017-05-02 Thread Mike Manning
While this is not reproducible manually, Andrey's syzkaller program hit
the warning "IPv6: Freeing alive inet6 address" with this part trace:

inet6_ifa_finish_destroy+0x12e/0x190 c:894
in6_ifa_put ./include/net/addrconf.h:330
addrconf_dad_work+0x4e9/0x1040 net/ipv6/addrconf.c:3963

The fix is to call in6_ifa_put() for the inet6_ifaddr before rather
than after calling addrconf_ifdown(), as the latter may remove it from
the address hash table.

Fixes: 85b51b12115c ("net: ipv6: Remove addresses for failures with strict DAD")
Reported-by: Andrey Konovalov 
Signed-off-by: Mike Manning 
---
 net/ipv6/addrconf.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 80ce478..361993a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3902,8 +3902,11 @@ static void addrconf_dad_work(struct work_struct *w)
} else if (action == DAD_ABORT) {
in6_ifa_hold(ifp);
addrconf_dad_stop(ifp, 1);
-   if (disable_ipv6)
+   if (disable_ipv6) {
+   in6_ifa_put(ifp);
addrconf_ifdown(idev->dev, 0);
+   goto unlock;
+   }
goto out;
}
 
@@ -3950,6 +3953,7 @@ static void addrconf_dad_work(struct work_struct *w)
  ifp->dad_nonce);
 out:
in6_ifa_put(ifp);
+unlock:
rtnl_unlock();
 }
 
-- 
2.1.4



Re: TPACKET_V3 timeout bug?

2017-05-02 Thread Guy Harris
On May 2, 2017, at 10:54 AM, Guy Harris  wrote:

> Yes, there's a case where user space wasn't being woken up.

See also this thread from almost 3 years ago, beginning with

http://marc.info/?l=linux-netdev=140633612828824=2

and this patch thread from almost 2 1/2 years ago:

https://lkml.org/lkml/2014/12/18/466


Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-02 Thread Alexander Duyck
On Tue, May 2, 2017 at 9:53 AM, Raj, Ashok  wrote:
> On Tue, May 02, 2017 at 09:39:34AM -0700, Alexander Duyck wrote:
>> On Mon, May 1, 2017 at 4:13 PM, Casey Leedom  wrote:
>> > The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed
>> > Ordering Attribute should not be used on Transaction Layer Packets destined
>> > for the PCIe End Node so flagged.  Initially flagged this way are Intel
>> > E5-26xx Root Complex Ports which suffer from a Flow Control Credit
>> > Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which
>> > don't obey PCIe 3.0 ordering rules which can lead to Data Corruption.
>>
>> So this is a good first step though might I suggest one other change.
>>
>> We may want to add logic to the core PCIe code that clears the "Enable
>> Relaxed Ordering" bit in the device control register for all devices
>> hanging off of this root complex. Assuming the devices conform to the
>> PCIe spec doing that should disable relaxed ordering in a device
>> agnostic way that then enables us at a driver level to just enable the
>> feature always without having to perform any checks for your flag. We
>> could probably do that as a part of probing the PCIe interfaces
>> hanging off of these devices.
>
> I suppose you don't want to turn off RO completely on the device. When
> traffic is targetted to mmio for peer to peer reasons RO has performance
> upside. The specific issue with these root ports indicate limitation using
> RO for traffic targetting coherent memory.

Actually my main concern here is virtualization. If I take the PCIe
function and direct assign it I have no way of seeing the root complex
flag as it is now virtualized away. In the meantime the guest now has
the ability to enable the function and sees nothing that says you
can't enable relaxed ordering which in turn ends up potentially
causing data corruption on the system. I want relaxed ordering
disabled before I even consider assigning it to the guest on the
systems where this would be an issue.

I prefer to err on the side of caution with this. Enabling Relaxed
Ordering is technically a performance enhancement, so we function but
not as well as we would like, while having it enabled when there are
issues can lead to data corruption. I would weigh the risk of data
corruption the thing to be avoided and of much higher priority than
enabling improved performance. As such I say we should default the
relaxed ordering attribute to off in general and look at
"white-listing" it in for various architectures and/or chipsets that
support/need it rather than having it enabled by default and trying to
switch it off after the fact when we find some new issue.

So for example, in the case of x86 it seems like there are multiple
root complexes that have issues, and the gains for enabling it with
standard DMA to host memory are small. As such we may want to default
it to off via the architecture specific PCIe code and then look at
having "white-list" cases where we enable it for things like
peer-to-peer accesses. In the case of SPARC we could look at
defaulting it to on, and only "black-list" any cases where there might
be issues since SPARC relies on this in a significant way for
performance. In the case of ARM and other architectures it is a bit of
a toss-up. I would say we could just default it to on for now and
"black-list" anything that doesn't work, or we could go the other way
like I suggested for x86. It all depends on what the ARM community
would want to agree on for this. I would say unless it makes a
significant difference like it does in the case of SPARC we are
probably better off just defaulting it to off.

- Alex


Re: [PATCH net-next iproute2 1/3] netlink: import netlink message parsing from kernel

2017-05-02 Thread Stephen Hemminger
On Tue, 02 May 2017 13:00:32 -0400 (EDT)
David Miller  wrote:

> From: David Ahern 
> Date: Tue, 2 May 2017 10:51:23 -0600
> 
> > On 5/2/17 9:25 AM, Stephen Hemminger wrote:  
> >> Please either use existing netlink attribute code in libnetlink.h
> >> (rta_getattr_u32 etc) or use libmnl like devlink.  
> > 
> > All of the existing rta_ functions take a struct rta_attr; netlink
> > messages use struct nlattr. It's just wrong to use rta functions for
> > netlink messages.  
> 
> Agreed.
> 
> > There is no existing parse function for nlattr. There is no existing
> > validate function - for nlattr or rta_attr. So I do not see any overlap
> > with existing code.  
> 
> Also agreed.

Then use libmnl it is already used in several other places in iproute2.
Eventually, I would like to use it everywhere and get rid of old netlink parser.



Re: TPACKET_V3 timeout bug?

2017-05-02 Thread Guy Harris
On May 2, 2017, at 8:04 AM, chetan loke  wrote:

> On Sat, Apr 15, 2017 at 7:41 PM, Guy Harris  wrote:
>> On Apr 15, 2017, at 7:10 PM, Andrew Lunn  wrote:
>> 
>>> Do you think this is a kernel problem, libpcap problem, or an
>>> application problem?
>> 
> 
> Its clearly a kernel regression.
> 
> If you look at if_packet.h, I have explicitly called out all the cases
> for the return/status codes. When I first merged the functionality in
> 3.11(or 3.12 I think) I had the logic in place to retire the block
> with or without packets in it. I think there was one case where we
> wouldn't wake up userspace. Someone checked in a fix for that. Now I
> am not sure the regression happened as part of that bug fix or
> sometime later. If you diff 3.12 against the latest you will find the
> regression. Look for prb_retire_rx_blk_timer_expired().

Yes, there's a case where user space wasn't being woken up.

As I said in


https://github.com/the-tcpdump-group/libpcap/issues/335#issuecomment-30280794

It appeared, at the time, that PF_PACKET sockets delivered a wakeup when a 
packet is put in a buffer block or dropped due to no buffer blocks being empty, 
but not when a buffer block is handed to userland.

This means that if the kernel's timer expires, and there are no packets in the 
current buffer block being filled by the kernel, that buffer block will be 
handed to userland, but userland won't be woken up to tell it to consume that 
block.

Thus, libpcap will consume that block only if either:

* a packet is put in a buffer block, meaning it must pass the filter 
and there must be a current buffer block, belonging to the kernel, into which 
to put it;
* a packet arrives and passes the filter, but there are no current 
buffer blocks belonging to the kernel, so it's dropped;
* the poll() times out.

So, with a low packet acceptance rate (either because there isn't much network 
traffic or because there is but most of it is rejected by the packet filter), 
and with a poll() timeout of -1, meaning "block forever", 1) will happen 
infrequently, and 3) will never happen. With an in-kernel timeout rate 
significantly lower than the rate of packet acceptance, the timeout will often 
occur when there are no packets in the current buffer block, in which case the 
kernel will hand an empty buffer block to userland and not tell userland about 
it.

If that happens often enough in sequence to cause all buffer blocks to be 
handed to userland before any wakeups occur, the kernel now has no buffer 
blocks into which to put packets, and the next time a packet arrives, it will 
be dropped, and a wakeup will finally occur. libpcap will drain the ring, 
handing all buffer blocks to the kernel, but it won't have any packets to 
process!

So this is ultimately a problem with the TPACKET_V3 code in the kernel. I 
personally think that it should not deliver empty buffer blocks to userland, 
and that it also should not deliver a wakeup when a packet is accepted, and 
should deliver a wakeup whenever a buffer block is handed to userland. I'll 
report this to somebody and let them decide which of those changes should be 
done.

If you want to deliver empty buffer blocks to userland, that's fine, but make 
sure you wake up userland so that it can process those packets rather than 
leaving them there taking up space in the ring buffer.

And if you insist on delivering a wakeup when a packet is accepted - a wakeup 
that libpcap, at least, won't do anything with, as there's nothing useful for 
it to do with that wakeup - also make sure you deliver a wakeup when a buffer 
block is handed to userland, which is what libpcap cares about.

> I cannot speak on behalf of user-space wrappers developed around
> tpacket_v3 but the intention(from the kernel POV) of the block_timer
> *is* to unblock the capture/user process/thread so that it does NOT
> stay blocked for an indefinite period of time. The header explicitly
> specifies that contract.

That's not part of the contract for libpcap, as it's a question of what the 
underlying capture mechanism does, and we don't necessarily have any control 
over that; if a particular capture mechanism used by libpcap has that as part 
of its contract, that's OK, but libpcap-based applications shouldn't depend on 
it.


[PATCH 3/3] net/atm: Add some spaces for better code readability

2017-05-02 Thread SF Markus Elfring
From: Markus Elfring 
Date: Tue, 2 May 2017 19:19:14 +0200

Use space characters at some source code places according to
the Linux coding style convention.

Signed-off-by: Markus Elfring 
---
 net/atm/mpoa_proc.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/atm/mpoa_proc.c b/net/atm/mpoa_proc.c
index 324c4f95f4bf..6a52606557f0 100644
--- a/net/atm/mpoa_proc.c
+++ b/net/atm/mpoa_proc.c
@@ -154,8 +154,8 @@ static int mpc_show(struct seq_file *m, void *v)
seq_printf(m, "%-16s%s%-14lu%-12u",
   ip_string,
   ingress_state_string(in_entry->entry_state),
-  in_entry->ctrl_info.holding_time -
-  (now.tv_sec-in_entry->tv.tv_sec),
+  in_entry->ctrl_info.holding_time
+  - (now.tv_sec - in_entry->tv.tv_sec),
   in_entry->packets_fwded);
if (in_entry->shortcut)
seq_printf(m, "   %-3d  %-3d",
@@ -173,8 +173,8 @@ static int mpc_show(struct seq_file *m, void *v)
seq_printf(m, "\n%-16lu%s%-14lu%-15u",
   (unsigned long)ntohl(eg_entry->ctrl_info.cache_id),
   egress_state_string(eg_entry->entry_state),
-  (eg_entry->ctrl_info.holding_time -
-   (now.tv_sec-eg_entry->tv.tv_sec)),
+  eg_entry->ctrl_info.holding_time
+  - (now.tv_sec - eg_entry->tv.tv_sec),
   eg_entry->packets_rcvd);
 
/* latest IP address */
@@ -213,7 +213,7 @@ static ssize_t proc_mpc_write(struct file *file, const char 
__user *buff,
return 0;
 
if (nbytes >= PAGE_SIZE)
-   nbytes = PAGE_SIZE-1;
+   nbytes = PAGE_SIZE - 1;
 
page = (char *)__get_free_page(GFP_KERNEL);
if (!page)
@@ -251,18 +251,21 @@ static int parse_qos(const char *buff)
memset(, 0, sizeof(struct atm_qos));
 
if (sscanf(buff, "del %hhu.%hhu.%hhu.%hhu",
-   ip, ip+1, ip+2, ip+3) == 4) {
+  ip, ip + 1, ip + 2, ip + 3) == 4) {
ipaddr = *(__be32 *)ip;
return atm_mpoa_delete_qos(atm_mpoa_search_qos(ipaddr));
}
 
if (sscanf(buff, "add %hhu.%hhu.%hhu.%hhu tx=%d,%d rx=tx",
-   ip, ip+1, ip+2, ip+3, _pcr, _sdu) == 6) {
+  ip, ip + 1, ip + 2, ip + 3, _pcr, _sdu) == 6) {
rx_pcr = tx_pcr;
rx_sdu = tx_sdu;
-   } else if (sscanf(buff, "add %hhu.%hhu.%hhu.%hhu tx=%d,%d rx=%d,%d",
-   ip, ip+1, ip+2, ip+3, _pcr, _sdu, _pcr, _sdu) != 8)
-   return 0;
+   } else {
+   if (sscanf(buff, "add %hhu.%hhu.%hhu.%hhu tx=%d,%d rx=%d,%d",
+  ip, ip + 1, ip + 2, ip + 3,
+  _pcr, _sdu, _pcr, _sdu) != 8)
+   return 0;
+   }
 
ipaddr = *(__be32 *)ip;
qos.txtp.traffic_class = ATM_CBR;
-- 
2.12.2



[PATCH 2/3] net/atm: Use seq_putc() in mpc_show()

2017-05-02 Thread SF Markus Elfring
From: Markus Elfring 
Date: Tue, 2 May 2017 18:58:08 +0200

Single characters (line breaks) should be put into a sequence.
Thus use the corresponding function "seq_putc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 net/atm/mpoa_proc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/atm/mpoa_proc.c b/net/atm/mpoa_proc.c
index 6ea6028fd865..324c4f95f4bf 100644
--- a/net/atm/mpoa_proc.c
+++ b/net/atm/mpoa_proc.c
@@ -161,7 +161,7 @@ static int mpc_show(struct seq_file *m, void *v)
seq_printf(m, "   %-3d  %-3d",
   in_entry->shortcut->vpi,
   in_entry->shortcut->vci);
-   seq_printf(m, "\n");
+   seq_putc(m, '\n');
}
 
seq_printf(m,
@@ -185,9 +185,9 @@ static int mpc_show(struct seq_file *m, void *v)
seq_printf(m, " %-3d %-3d",
   eg_entry->shortcut->vpi,
   eg_entry->shortcut->vci);
-   seq_printf(m, "\n");
+   seq_putc(m, '\n');
}
-   seq_printf(m, "\n");
+   seq_putc(m, '\n');
return 0;
 }
 
-- 
2.12.2



[PATCH 1/3] net/atm: Combine four seq_printf() calls in mpc_show()

2017-05-02 Thread SF Markus Elfring
From: Markus Elfring 
Date: Tue, 2 May 2017 18:52:58 +0200

Some data were put into a sequence by four separate function calls.
Print the same data by two function calls instead.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 net/atm/mpoa_proc.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/atm/mpoa_proc.c b/net/atm/mpoa_proc.c
index 2df34eb5d65f..6ea6028fd865 100644
--- a/net/atm/mpoa_proc.c
+++ b/net/atm/mpoa_proc.c
@@ -145,8 +145,8 @@ static int mpc_show(struct seq_file *m, void *v)
return 0;
}
 
-   seq_printf(m, "\nInterface %d:\n\n", mpc->dev_num);
-   seq_printf(m, "Ingress Entries:\nIP address  State  Holding 
time  Packets fwded  VPI  VCI\n");
+   seq_printf(m, "\nInterface %d:\n\nIngress Entries:\nIP address  
State  Holding time  Packets fwded  VPI  VCI\n",
+  mpc->dev_num);
do_gettimeofday();
 
for (in_entry = mpc->in_cache; in_entry; in_entry = in_entry->next) {
@@ -165,7 +165,7 @@ static int mpc_show(struct seq_file *m, void *v)
}
 
-   seq_printf(m, "\n");
-   seq_printf(m, "Egress Entries:\nIngress MPC ATM addr\nCache-id
State  Holding time  Packets recvd  Latest IP addr   VPI VCI\n");
+   seq_printf(m,
+  "\nEgress Entries:\nIngress MPC ATM addr\nCache-id
State  Holding time  Packets recvd  Latest IP addr   VPI VCI\n");
for (eg_entry = mpc->eg_cache; eg_entry; eg_entry = eg_entry->next) {
unsigned char *p = eg_entry->ctrl_info.in_MPC_data_ATM_addr;
for (i = 0; i < ATM_ESA_LEN; i++)
-- 
2.12.2



[PATCH 0/3] net/atm: Fine-tuning for three function implementations

2017-05-02 Thread SF Markus Elfring
From: Markus Elfring 
Date: Tue, 2 May 2017 19:37:39 +0200

A few update suggestions were taken into account
from static source code analysis.

Markus Elfring (3):
  Combine four seq_printf() calls in mpc_show()
  Use seq_putc() in mpc_show()
  Add some spaces for better code readability

 net/atm/mpoa_proc.c | 37 -
 1 file changed, 20 insertions(+), 17 deletions(-)

-- 
2.12.2



Re: net/ipv6: use-after-free in __call_rcu/in6_dev_finish_destroy_rcu

2017-05-02 Thread David Ahern
On 5/2/17 10:58 AM, Andrey Konovalov wrote:
> Do you have a patch that I could test?

not yet.

> 
> I also reported another issue recently, that might also be related to this 
> one:
> https://groups.google.com/forum/#!topic/syzkaller/Rt0pgY4wfiw

different problem. I can still trigger this one with the reproducer you sent


Re: TPACKET_V3 timeout bug?

2017-05-02 Thread chetan loke
On Tue, May 2, 2017 at 8:04 AM, chetan loke  wrote:
> On Sat, Apr 15, 2017 at 7:41 PM, Guy Harris  wrote:
>> On Apr 15, 2017, at 7:10 PM, Andrew Lunn  wrote:
>>
>>> Do you think this is a kernel problem, libpcap problem, or an
>>> application problem?
>>
>
> Its clearly a kernel regression.
>

Commit that caused it:

https://github.com/torvalds/linux/commit/41a50d621a321b4c15273cc1b5ed41437f4acdfb

Reverting that change is what we need.

When monitoring aggregated links (example: request going out on one
link and response coming in on another) you have to know when to start
mining packets for time-interval[s] to report anomalies etc. And the
block-retirement was designed such that the kernel would fill those
values for the app in the ts_first[/last]_pkt.

We have already amortized the cost by doing a bulk wakeup. And if
user-space holds on to the block even when it is empty (instead of
just using/copying the start/end time interval and releasing the
block) then its a bug in your app.

And if user-space is lagging behind then you will always run out of
buffer space. So I don't buy the entire commit argument of the patch.


Re: [PATCH net-next iproute2 1/3] netlink: import netlink message parsing from kernel

2017-05-02 Thread David Miller
From: David Ahern 
Date: Tue, 2 May 2017 10:51:23 -0600

> On 5/2/17 9:25 AM, Stephen Hemminger wrote:
>> Please either use existing netlink attribute code in libnetlink.h
>> (rta_getattr_u32 etc) or use libmnl like devlink.
> 
> All of the existing rta_ functions take a struct rta_attr; netlink
> messages use struct nlattr. It's just wrong to use rta functions for
> netlink messages.

Agreed.

> There is no existing parse function for nlattr. There is no existing
> validate function - for nlattr or rta_attr. So I do not see any overlap
> with existing code.

Also agreed.


Re: [PATCH net v3] driver: veth: Fix one possbile memleak when fail to register_netdevice

2017-05-02 Thread Xin Long
On Tue, May 2, 2017 at 7:03 PM, Gao Feng  wrote:
>> From: Xin Long [mailto:lucien@gmail.com]
>> Sent: Tuesday, May 2, 2017 3:56 PM
>> On Sat, Apr 29, 2017 at 11:51 AM,   wrote:
>> > From: Gao Feng 
> [...]
>> > -static void veth_dev_free(struct net_device *dev)
>> > +static void veth_destructor_free(struct net_device *dev)
>> >  {
>> > free_percpu(dev->vstats);
>> > +}
>> not sure why you needed to add this function.
>> to use free_percpu() directly may be clearer.
>
> Because both of ndo_uninit and destructor need to perform same free 
> statements.
> It is good at maintain the codes with the common function.
>>
>> > +
>> > +static void veth_dev_uninit(struct net_device *dev) {
>> call free_percpu() here, no need to check dev->reg_state.
>> free_percpu will just return if dev->vstats is NULL.
>
> It would break the original design if don't check the reg_state.
> The original logic is that free the resources in the destructor, not in 
> ndo_init.
I got what you're doing now, can you pls try to fix this with:

--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -219,10 +219,9 @@ static int veth_dev_init(struct net_device *dev)
return 0;
 }

-static void veth_dev_free(struct net_device *dev)
+static void veth_dev_uninit(struct net_device *dev)
 {
free_percpu(dev->vstats);
-   free_netdev(dev);
 }

 #ifdef CONFIG_NET_POLL_CONTROLLER
@@ -279,6 +278,7 @@ static void veth_set_rx_headroom(struct net_device
*dev, int new_hr)

 static const struct net_device_ops veth_netdev_ops = {
.ndo_init= veth_dev_init,
+   .ndo_uninit  = veth_dev_uninit,
.ndo_open= veth_open,
.ndo_stop= veth_close,
.ndo_start_xmit  = veth_xmit,
@@ -317,7 +317,7 @@ static void veth_setup(struct net_device *dev)
   NETIF_F_HW_VLAN_STAG_TX |
   NETIF_F_HW_VLAN_CTAG_RX |
   NETIF_F_HW_VLAN_STAG_RX);
-   dev->destructor = veth_dev_free;
+   dev->destructor = free_netdev;
dev->max_mtu = ETH_MAX_MTU;

dev->hw_features = VETH_FEATURES;


just as what other virtual nic drivers do (vxlan, geneve, macsec, bridge )

>
> BTW, because I send multiple patches too fast today, the email server blocks 
> my account.
> So I have to reply you with a different email account. Sorry.
>
> Best Regards
> Feng
>
>>
> [...]
>
>


Re: net/ipv6: use-after-free in __call_rcu/in6_dev_finish_destroy_rcu

2017-05-02 Thread Andrey Konovalov
On Tue, May 2, 2017 at 4:44 AM, David Ahern  wrote:
> On 4/26/17 9:15 AM, Andrey Konovalov wrote:
>> +David
>>
>> I've enabled CONFIG_DEBUG_OBJECTS_RCU_HEAD and this is what I get.
>>
>> Apparently the rcu warning is related to the fib6_del_route bug I've
>> been trying to reproduce:
>> https://groups.google.com/forum/#!msg/syzkaller/3SS80JbVPKA/2tfIAcW7DwAJ
>>
>> Adding David, who provided the fix:
>> https://patchwork.ozlabs.org/patch/754913/
>>
>> I've managed to extract a reproducer, attached together with the
>> .config that I used.
>>
>> On commit 5a7ad1146caa895ad718a534399e38bd2ba721b7 (4.11-rc8) with
>> David's patch applied.
>>
>> [ cut here ]
>> WARNING: CPU: 1 PID: 5911 at lib/debugobjects.c:289
>> debug_print_object+0x175/0x210
>> ODEBUG: activate active (active state 1) object type: rcu_head hint:
>> (null)
>> Modules linked in:
>> CPU: 1 PID: 5911 Comm: a.out Not tainted 4.11.0-rc8+ #271
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:16
>>  dump_stack+0x192/0x22d lib/dump_stack.c:52
>>  __warn+0x19f/0x1e0 kernel/panic.c:549
>>  warn_slowpath_fmt+0xe0/0x120 kernel/panic.c:564
>>  debug_print_object+0x175/0x210 lib/debugobjects.c:286
>>  debug_object_activate+0x574/0x7e0 lib/debugobjects.c:442
>>  debug_rcu_head_queue kernel/rcu/rcu.h:75
>>  __call_rcu.constprop.76+0xff/0x9c0 kernel/rcu/tree.c:3229
>>  call_rcu_sched+0x12/0x20 kernel/rcu/tree.c:3288
>>  rt6_rcu_free net/ipv6/ip6_fib.c:158
>>  rt6_release+0x1ea/0x290 net/ipv6/ip6_fib.c:188
>>  fib6_del_route net/ipv6/ip6_fib.c:1461
>
> I think I got to the bottom of this one.
>
> With your config, ip6_tunnel is compiled in.
>
> The program runs in a very tight loop, calling 'unshare -n' and then
> spawns 2 sets of 14 threads running random ioctl calls. The networking
> sequence:
>
> 1. New network namespace created via unshare -n
> - ip6tnl0 device is created in down state
>
> 2. address added to ip6tnl0 (equivalent to ip -6 addr add dev ip6tnl0
> fd00::bb/1)
> - the host route is created and inserted into FIB
>
> 3. ip6tnl0 is brought up - starts DAD on the address
>
> 4. exit namespace
> - teardown / cleanup sequence starts
> - lo teardown appears to happen BEFORE teardown of ip6tunl0
>   + removes host route from FIB
>   + host route added to rcu callback list: call_rcu(>dst.rcu_head,
> dst_rcu_free);
>   + rcu callback has not run yet, so rt is NOT on the gc list so it has
> NOT been marked obsolete
>
> 5. worker_thread runs addrconf_dad_completed
> - calls ipv6_ifa_notify which inserts the host route
>
> All of that happens very quickly. The result is that a route that has
> been deleted and added to the RCU list is re-inserted into the FIB. What
> happens next depends on order -- in this case the exit namespace
> eventually gets to cleaning up ip6tnl0 which removes the host route from
> the FIB, calls the rcu function for cleanup -- and triggers the double
> rcu trace.
>
> I have a hack that flags this sequence and prevents the re-insertion
> following DAD. That allows the command to run until it consumes all 2G
> of memory the VM has -- about 600+ iterations without triggering any
> stack traces.

Hi David,

Thanks for looking into this!

Do you have a patch that I could test?

I also reported another issue recently, that might also be related to this one:
https://groups.google.com/forum/#!topic/syzkaller/Rt0pgY4wfiw

Thanks!


Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-02 Thread Raj, Ashok
On Tue, May 02, 2017 at 09:39:34AM -0700, Alexander Duyck wrote:
> On Mon, May 1, 2017 at 4:13 PM, Casey Leedom  wrote:
> > The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed
> > Ordering Attribute should not be used on Transaction Layer Packets destined
> > for the PCIe End Node so flagged.  Initially flagged this way are Intel
> > E5-26xx Root Complex Ports which suffer from a Flow Control Credit
> > Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which
> > don't obey PCIe 3.0 ordering rules which can lead to Data Corruption.
> 
> So this is a good first step though might I suggest one other change.
> 
> We may want to add logic to the core PCIe code that clears the "Enable
> Relaxed Ordering" bit in the device control register for all devices
> hanging off of this root complex. Assuming the devices conform to the
> PCIe spec doing that should disable relaxed ordering in a device
> agnostic way that then enables us at a driver level to just enable the
> feature always without having to perform any checks for your flag. We
> could probably do that as a part of probing the PCIe interfaces
> hanging off of these devices.

I suppose you don't want to turn off RO completely on the device. When
traffic is targetted to mmio for peer to peer reasons RO has performance
upside. The specific issue with these root ports indicate limitation using 
RO for traffic targetting coherent memory.

> 
> > ---
> >  drivers/pci/quirks.c | 38 ++
> >  include/linux/pci.h  |  2 ++
> >  2 files changed, 40 insertions(+)
> >
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index f754453..4ae78b3 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3979,6 +3979,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev)
> >   quirk_tw686x_class);
> >
> >  /*
> > + * Some devices have problems with Transaction Layer Packets with the 
> > Relaxed
> > + * Ordering Attribute set.  Such devices should mark themselves and other
> > + * Device Drivers should check before sending TLPs with RO set.
> > + */
> > +static void quirk_relaxedordering_disable(struct pci_dev *dev)
> > +{
> > +   dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
> > +}
> > +
> > +/*
> > + * Intel E5-26xx Root Complex has a Flow Control Credit issue which can
> > + * cause performance problems with Upstream Transaction Layer Packets with
> > + * Relaxed Ordering set.
> > + */
> > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f02, PCI_CLASS_NOT_DEFINED, 8,
> > + quirk_relaxedordering_disable);
> > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f04, PCI_CLASS_NOT_DEFINED, 8,
> > + quirk_relaxedordering_disable);
> > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f08, PCI_CLASS_NOT_DEFINED, 8,
> > + quirk_relaxedordering_disable);
> > +
> > +/*
> > + * The AMD ARM A1100 (AKA "SEATTLE") SoC has a bug in its PCIe Root Complex
> > + * where Upstream Transaction Layer Packets with the Relaxed Ordering
> > + * Attribute clear are allowed to bypass earlier TLPs with Relaxed Ordering
> > + * set.  This is a violation of the PCIe 3.0 Transaction Ordering Rules
> > + * outlined in Section 2.4.1 (PCI Express(r) Base Specification Revision 
> > 3.0
> > + * November 10, 2010).  As a result, on this platform we can't use Relaxed
> > + * Ordering for Upstream TLPs.
> > + */
> > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a00, PCI_CLASS_NOT_DEFINED, 8,
> > + quirk_relaxedordering_disable);
> > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a01, PCI_CLASS_NOT_DEFINED, 8,
> > + quirk_relaxedordering_disable);
> > +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a02, PCI_CLASS_NOT_DEFINED, 8,
> > + quirk_relaxedordering_disable);
> > +
> > +/*
> >   * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same
> >   * values for the Attribute as were supplied in the header of the
> >   * corresponding Request, except as explicitly allowed when IDO is used."
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index eb3da1a..6764f66 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -178,6 +178,8 @@ enum pci_dev_flags {
> > PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7),
> > /* Get VPD from function 0 VPD */
> > PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8),
> > +   /* Don't use Relaxed Ordering for TLPs directed at this device */
> > +   PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 
> > 9),
> >  };
> >
> >  enum pci_irq_reroute_variant {
> > --
> > 1.9.1
> >


Re: [PATCH net-next iproute2 1/3] netlink: import netlink message parsing from kernel

2017-05-02 Thread David Ahern
On 5/2/17 9:25 AM, Stephen Hemminger wrote:
> Please either use existing netlink attribute code in libnetlink.h
> (rta_getattr_u32 etc) or use libmnl like devlink.

All of the existing rta_ functions take a struct rta_attr; netlink
messages use struct nlattr. It's just wrong to use rta functions for
netlink messages.

There is no existing parse function for nlattr. There is no existing
validate function - for nlattr or rta_attr. So I do not see any overlap
with existing code.

If you prefer ip to gain a dependency on libmnl, I'll take a look.


Re: net/key: slab-out-of-bounds in pfkey_compile_policy

2017-05-02 Thread Andrey Konovalov
On Tue, May 2, 2017 at 6:45 PM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On commit d3b5d35290d729a2518af00feca867385a1b08fa (4.11).
>
> A reproducer and .config are attached.
>
> ==
> BUG: KASAN: slab-out-of-bounds in pfkey_compile_policy+0x8e6/0xd40 at
> addr 88006701f798
> Read of size 1280 by task a.out/4181
> CPU: 0 PID: 4181 Comm: a.out Not tainted 4.11.0+ #306
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16
>  dump_stack+0x292/0x395 lib/dump_stack.c:52
>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
>  print_address_description mm/kasan/report.c:202
>  kasan_report_error mm/kasan/report.c:291
>  kasan_report+0x252/0x510 mm/kasan/report.c:347
>  check_memory_region_inline mm/kasan/kasan.c:326
>  check_memory_region+0x139/0x190 mm/kasan/kasan.c:333
>  memcpy+0x23/0x50 mm/kasan/kasan.c:368
>  pfkey_sadb2xfrm_user_sec_ctx net/key/af_key.c:474
>  pfkey_compile_policy+0x8e6/0xd40 net/key/af_key.c:3294
>  xfrm_user_policy+0x349/0x560 net/xfrm/xfrm_state.c:1892
>  do_ip_setsockopt.isra.12+0x1d05/0x38c0 net/ipv4/ip_sockglue.c:1175
>  ip_setsockopt+0x3a/0xb0 net/ipv4/ip_sockglue.c:1264
>  tcp_setsockopt+0x82/0xd0 net/ipv4/tcp.c:2732
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2750
>  SYSC_setsockopt net/socket.c:1798
>  SyS_setsockopt+0x270/0x3a0 net/socket.c:1777
>  entry_SYSCALL_64_fastpath+0x1f/0xbe arch/x86/entry/entry_64.S:204
> RIP: 0033:0x7f13a0968b79
> RSP: 002b:7fff131bc3f8 EFLAGS: 0206 ORIG_RAX: 0036
> RAX: ffda RBX: 7fff131bc550 RCX: 7f13a0968b79
> RDX: 0010 RSI:  RDI: 0003
> RBP: 004004e0 R08: 00c2 R09: 
> R10: 20a2ff3e R11: 0206 R12: 
> R13: 7fff131bc550 R14:  R15: 
> Object at 88006701f780, in cache kmalloc-256 size: 256
> Allocated:
> PID = 4181
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
>  __kmalloc+0xa0/0x2d0 mm/slub.c:3745
>  kmalloc ./include/linux/slab.h:495
>  xfrm_user_policy+0xd8/0x560 net/xfrm/xfrm_state.c:1881
>  do_ip_setsockopt.isra.12+0x1d05/0x38c0 net/ipv4/ip_sockglue.c:1175
>  ip_setsockopt+0x3a/0xb0 net/ipv4/ip_sockglue.c:1264
>  tcp_setsockopt+0x82/0xd0 net/ipv4/tcp.c:2732
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2750
>  SYSC_setsockopt net/socket.c:1798
>  SyS_setsockopt+0x270/0x3a0 net/socket.c:1777
>  entry_SYSCALL_64_fastpath+0x1f/0xbe arch/x86/entry/entry_64.S:204
> Freed:
> PID = 3951
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525
>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
>  slab_free_hook mm/slub.c:1357
>  slab_free_freelist_hook mm/slub.c:1379
>  slab_free mm/slub.c:2961
>  kfree+0xe8/0x2b0 mm/slub.c:3882
>  free_bprm+0x19d/0x200 fs/exec.c:1382
>  do_execveat_common.isra.34+0x19ad/0x2220 fs/exec.c:1778
>  do_execve fs/exec.c:1813
>  SYSC_execve fs/exec.c:1894
>  SyS_execve+0x39/0x50 fs/exec.c:1889
>  do_syscall_64+0x2c7/0x7a0 arch/x86/entry/common.c:281
>  return_from_SYSCALL_64+0x0/0x7a arch/x86/entry/entry_64.S:246
> Memory state around the buggy address:
>  88006701f700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>  88006701f780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>88006701f800: 00 00 00 00 00 00 00 00 02 fc fc fc fc fc fc fc
>^
>  88006701f880: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
>  88006701f900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==

+ syzkaller


net/key: slab-out-of-bounds in pfkey_compile_policy

2017-05-02 Thread Andrey Konovalov
Hi,

I've got the following error report while fuzzing the kernel with syzkaller.

On commit d3b5d35290d729a2518af00feca867385a1b08fa (4.11).

A reproducer and .config are attached.

==
BUG: KASAN: slab-out-of-bounds in pfkey_compile_policy+0x8e6/0xd40 at
addr 88006701f798
Read of size 1280 by task a.out/4181
CPU: 0 PID: 4181 Comm: a.out Not tainted 4.11.0+ #306
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x292/0x395 lib/dump_stack.c:52
 kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
 print_address_description mm/kasan/report.c:202
 kasan_report_error mm/kasan/report.c:291
 kasan_report+0x252/0x510 mm/kasan/report.c:347
 check_memory_region_inline mm/kasan/kasan.c:326
 check_memory_region+0x139/0x190 mm/kasan/kasan.c:333
 memcpy+0x23/0x50 mm/kasan/kasan.c:368
 pfkey_sadb2xfrm_user_sec_ctx net/key/af_key.c:474
 pfkey_compile_policy+0x8e6/0xd40 net/key/af_key.c:3294
 xfrm_user_policy+0x349/0x560 net/xfrm/xfrm_state.c:1892
 do_ip_setsockopt.isra.12+0x1d05/0x38c0 net/ipv4/ip_sockglue.c:1175
 ip_setsockopt+0x3a/0xb0 net/ipv4/ip_sockglue.c:1264
 tcp_setsockopt+0x82/0xd0 net/ipv4/tcp.c:2732
 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2750
 SYSC_setsockopt net/socket.c:1798
 SyS_setsockopt+0x270/0x3a0 net/socket.c:1777
 entry_SYSCALL_64_fastpath+0x1f/0xbe arch/x86/entry/entry_64.S:204
RIP: 0033:0x7f13a0968b79
RSP: 002b:7fff131bc3f8 EFLAGS: 0206 ORIG_RAX: 0036
RAX: ffda RBX: 7fff131bc550 RCX: 7f13a0968b79
RDX: 0010 RSI:  RDI: 0003
RBP: 004004e0 R08: 00c2 R09: 
R10: 20a2ff3e R11: 0206 R12: 
R13: 7fff131bc550 R14:  R15: 
Object at 88006701f780, in cache kmalloc-256 size: 256
Allocated:
PID = 4181
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
 __kmalloc+0xa0/0x2d0 mm/slub.c:3745
 kmalloc ./include/linux/slab.h:495
 xfrm_user_policy+0xd8/0x560 net/xfrm/xfrm_state.c:1881
 do_ip_setsockopt.isra.12+0x1d05/0x38c0 net/ipv4/ip_sockglue.c:1175
 ip_setsockopt+0x3a/0xb0 net/ipv4/ip_sockglue.c:1264
 tcp_setsockopt+0x82/0xd0 net/ipv4/tcp.c:2732
 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2750
 SYSC_setsockopt net/socket.c:1798
 SyS_setsockopt+0x270/0x3a0 net/socket.c:1777
 entry_SYSCALL_64_fastpath+0x1f/0xbe arch/x86/entry/entry_64.S:204
Freed:
PID = 3951
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525
 kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
 slab_free_hook mm/slub.c:1357
 slab_free_freelist_hook mm/slub.c:1379
 slab_free mm/slub.c:2961
 kfree+0xe8/0x2b0 mm/slub.c:3882
 free_bprm+0x19d/0x200 fs/exec.c:1382
 do_execveat_common.isra.34+0x19ad/0x2220 fs/exec.c:1778
 do_execve fs/exec.c:1813
 SYSC_execve fs/exec.c:1894
 SyS_execve+0x39/0x50 fs/exec.c:1889
 do_syscall_64+0x2c7/0x7a0 arch/x86/entry/common.c:281
 return_from_SYSCALL_64+0x0/0x7a arch/x86/entry/entry_64.S:246
Memory state around the buggy address:
 88006701f700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 88006701f780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>88006701f800: 00 00 00 00 00 00 00 00 02 fc fc fc fc fc fc fc
   ^
 88006701f880: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
 88006701f900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==
// autogenerated by syzkaller (http://github.com/google/syzkaller)

#ifndef __NR_mmap
#define __NR_mmap 9
#endif
#ifndef __NR_socket
#define __NR_socket 41
#endif
#ifndef __NR_setsockopt
#define __NR_setsockopt 54
#endif

#define _GNU_SOURCE

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static uintptr_t execute_syscall(int nr, uintptr_t a0, uintptr_t a1,
 uintptr_t a2, uintptr_t a3,
 uintptr_t a4, uintptr_t a5,
 uintptr_t a6, uintptr_t a7,
 uintptr_t a8)
{
  switch (nr) {
  default:
return syscall(nr, a0, a1, a2, a3, a4, a5);
  }
}

long r[4];

void main()
{
  memset(r, -1, sizeof(r));
  r[0] = execute_syscall(__NR_mmap, 0x2000ul, 0xeed000ul, 0x3ul,
 0x32ul, 0xul, 0x0ul, 0, 0, 0);
  r[1] = 

Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-02 Thread Raj, Ashok
Hi Casey


On Mon, May 01, 2017 at 04:13:50PM -0700, Casey Leedom wrote:
> The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed
> Ordering Attribute should not be used on Transaction Layer Packets destined
> for the PCIe End Node so flagged.  Initially flagged this way are Intel
> E5-26xx Root Complex Ports which suffer from a Flow Control Credit
> Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which
> don't obey PCIe 3.0 ordering rules which can lead to Data Corruption.
> ---
>  drivers/pci/quirks.c | 38 ++
>  include/linux/pci.h  |  2 ++
>  2 files changed, 40 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index f754453..4ae78b3 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3979,6 +3979,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev)
> quirk_tw686x_class);
>  
>  /*
> + * Some devices have problems with Transaction Layer Packets with the Relaxed
> + * Ordering Attribute set.  Such devices should mark themselves and other
> + * Device Drivers should check before sending TLPs with RO set.
> + */
> +static void quirk_relaxedordering_disable(struct pci_dev *dev)
> +{
> + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
> +}
> +
> +/*
> + * Intel E5-26xx Root Complex has a Flow Control Credit issue which can
> + * cause performance problems with Upstream Transaction Layer Packets with
> + * Relaxed Ordering set.
> + */
> +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f02, PCI_CLASS_NOT_DEFINED, 8,
> +   quirk_relaxedordering_disable);
> +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f04, PCI_CLASS_NOT_DEFINED, 8,
> +   quirk_relaxedordering_disable);
> +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f08, PCI_CLASS_NOT_DEFINED, 8,
> +   quirk_relaxedordering_disable);
> +

You might want to add the RP ID's for both HSX/BDX. Tne entire range 
is 2F01H-2F0EH & 6F01H-6F0EH.

Cheers,
Ashok


Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-02 Thread Alexander Duyck
On Mon, May 1, 2017 at 4:13 PM, Casey Leedom  wrote:
> The new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING indicates that the Relaxed
> Ordering Attribute should not be used on Transaction Layer Packets destined
> for the PCIe End Node so flagged.  Initially flagged this way are Intel
> E5-26xx Root Complex Ports which suffer from a Flow Control Credit
> Performance Problem and AMD A1100 ARM ("SEATTLE") Root Complex Ports which
> don't obey PCIe 3.0 ordering rules which can lead to Data Corruption.

So this is a good first step though might I suggest one other change.

We may want to add logic to the core PCIe code that clears the "Enable
Relaxed Ordering" bit in the device control register for all devices
hanging off of this root complex. Assuming the devices conform to the
PCIe spec doing that should disable relaxed ordering in a device
agnostic way that then enables us at a driver level to just enable the
feature always without having to perform any checks for your flag. We
could probably do that as a part of probing the PCIe interfaces
hanging off of these devices.

> ---
>  drivers/pci/quirks.c | 38 ++
>  include/linux/pci.h  |  2 ++
>  2 files changed, 40 insertions(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index f754453..4ae78b3 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3979,6 +3979,44 @@ static void quirk_tw686x_class(struct pci_dev *pdev)
>   quirk_tw686x_class);
>
>  /*
> + * Some devices have problems with Transaction Layer Packets with the Relaxed
> + * Ordering Attribute set.  Such devices should mark themselves and other
> + * Device Drivers should check before sending TLPs with RO set.
> + */
> +static void quirk_relaxedordering_disable(struct pci_dev *dev)
> +{
> +   dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
> +}
> +
> +/*
> + * Intel E5-26xx Root Complex has a Flow Control Credit issue which can
> + * cause performance problems with Upstream Transaction Layer Packets with
> + * Relaxed Ordering set.
> + */
> +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f02, PCI_CLASS_NOT_DEFINED, 8,
> + quirk_relaxedordering_disable);
> +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f04, PCI_CLASS_NOT_DEFINED, 8,
> + quirk_relaxedordering_disable);
> +DECLARE_PCI_FIXUP_CLASS_EARLY(0x8086, 0x6f08, PCI_CLASS_NOT_DEFINED, 8,
> + quirk_relaxedordering_disable);
> +
> +/*
> + * The AMD ARM A1100 (AKA "SEATTLE") SoC has a bug in its PCIe Root Complex
> + * where Upstream Transaction Layer Packets with the Relaxed Ordering
> + * Attribute clear are allowed to bypass earlier TLPs with Relaxed Ordering
> + * set.  This is a violation of the PCIe 3.0 Transaction Ordering Rules
> + * outlined in Section 2.4.1 (PCI Express(r) Base Specification Revision 3.0
> + * November 10, 2010).  As a result, on this platform we can't use Relaxed
> + * Ordering for Upstream TLPs.
> + */
> +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a00, PCI_CLASS_NOT_DEFINED, 8,
> + quirk_relaxedordering_disable);
> +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a01, PCI_CLASS_NOT_DEFINED, 8,
> + quirk_relaxedordering_disable);
> +DECLARE_PCI_FIXUP_CLASS_EARLY(0x1022, 0x1a02, PCI_CLASS_NOT_DEFINED, 8,
> + quirk_relaxedordering_disable);
> +
> +/*
>   * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same
>   * values for the Attribute as were supplied in the header of the
>   * corresponding Request, except as explicitly allowed when IDO is used."
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index eb3da1a..6764f66 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -178,6 +178,8 @@ enum pci_dev_flags {
> PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7),
> /* Get VPD from function 0 VPD */
> PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8),
> +   /* Don't use Relaxed Ordering for TLPs directed at this device */
> +   PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 
> 9),
>  };
>
>  enum pci_irq_reroute_variant {
> --
> 1.9.1
>


Re: [PATCH v2 net-next 4/7] net: add new control message for incoming HW-timestamped packets

2017-05-02 Thread kbuild test robot
Hi Miroslav,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Miroslav-Lichvar/Extend-socket-timestamping-API/20170502-212515
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

   net/socket.c: In function 'put_ts_pktinfo':
>> net/socket.c:685:28: error: 'SCM_TIMESTAMPING_PKTINFO' undeclared (first use 
>> in this function)
 put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING_PKTINFO,
   ^
   net/socket.c:685:28: note: each undeclared identifier is reported only once 
for each function it appears in

vim +/SCM_TIMESTAMPING_PKTINFO +685 net/socket.c

   679  
   680  if (ifindex == 0)
   681  return;
   682  
   683  ts_pktinfo.if_index = ifindex;
   684  ts_pktinfo.pkt_length = skb->len - skb_mac_offset(skb);
 > 685  put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING_PKTINFO,
   686   sizeof(ts_pktinfo), _pktinfo);
   687  }
   688  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[net-next 2/2] tipc: refactor function tipc_sk_recv_stream()

2017-05-02 Thread Jon Maloy
We try to make this function more readable by improving variable names
and comments, using more stack variables, and doing some smaller changes
to the logics. We also rename the function to make it consistent with
naming conventions used elsewhere in the code.

Reviewed-by: Parthasarathy Bhuvaragan 
Signed-off-by: Jon Maloy 
---
 net/tipc/socket.c | 155 +-
 1 file changed, 71 insertions(+), 84 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 3855bfd..7e45ef9 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1375,9 +1375,9 @@ static int tipc_recvmsg(struct socket *sock, struct 
msghdr *m,
 }
 
 /**
- * tipc_recv_stream - receive stream-oriented data
+ * tipc_recvstream - receive stream-oriented data
  * @m: descriptor for message info
- * @buf_len: total size of user buffer area
+ * @buflen: total size of user buffer area
  * @flags: receive flags
  *
  * Used for SOCK_STREAM messages only.  If not enough data is available
@@ -1385,111 +1385,98 @@ static int tipc_recvmsg(struct socket *sock, struct 
msghdr *m,
  *
  * Returns size of returned message data, errno otherwise
  */
-static int tipc_recv_stream(struct socket *sock, struct msghdr *m,
-   size_t buf_len, int flags)
+static int tipc_recvstream(struct socket *sock, struct msghdr *m,
+  size_t buflen, int flags)
 {
struct sock *sk = sock->sk;
struct tipc_sock *tsk = tipc_sk(sk);
-   struct sk_buff *buf;
-   struct tipc_msg *msg;
-   long timeo;
-   unsigned int sz;
-   int target;
-   int sz_copied = 0;
-   u32 err;
-   int res = 0, hlen;
+   struct sk_buff *skb;
+   struct tipc_msg *hdr;
+   struct tipc_skb_cb *skb_cb;
+   bool peek = flags & MSG_PEEK;
+   int offset, required, copy, copied = 0;
+   int hlen, dlen, err, rc;
+   long timeout;
 
/* Catch invalid receive attempts */
-   if (unlikely(!buf_len))
+   if (unlikely(!buflen))
return -EINVAL;
 
lock_sock(sk);
 
if (unlikely(sk->sk_state == TIPC_OPEN)) {
-   res = -ENOTCONN;
-   goto exit;
-   }
-
-   target = sock_rcvlowat(sk, flags & MSG_WAITALL, buf_len);
-   timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
-
-restart:
-   /* Look for a message in receive queue; wait if necessary */
-   res = tipc_wait_for_rcvmsg(sock, );
-   if (res)
+   rc = -ENOTCONN;
goto exit;
-
-   /* Look at first message in receive queue */
-   buf = skb_peek(>sk_receive_queue);
-   msg = buf_msg(buf);
-   sz = msg_data_sz(msg);
-   hlen = msg_hdr_sz(msg);
-   err = msg_errcode(msg);
-
-   /* Discard an empty non-errored message & try again */
-   if ((!sz) && (!err)) {
-   tsk_advance_rx_queue(sk);
-   goto restart;
}
+   required = sock_rcvlowat(sk, flags & MSG_WAITALL, buflen);
+   timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
 
-   /* Optionally capture sender's address & ancillary data of first msg */
-   if (sz_copied == 0) {
-   set_orig_addr(m, msg);
-   res = tipc_sk_anc_data_recv(m, msg, tsk);
-   if (res)
-   goto exit;
-   }
-
-   /* Capture message data (if valid) & compute return value (always) */
-   if (!err) {
-   u32 offset = TIPC_SKB_CB(buf)->bytes_read;
-   u32 needed;
-   int sz_to_copy;
-
-   sz -= offset;
-   needed = (buf_len - sz_copied);
-   sz_to_copy = min(sz, needed);
+   do {
+   /* Look at first msg in receive queue; wait if necessary */
+   rc = tipc_wait_for_rcvmsg(sock, );
+   if (unlikely(rc))
+   break;
+   skb = skb_peek(>sk_receive_queue);
+   skb_cb = TIPC_SKB_CB(skb);
+   hdr = buf_msg(skb);
+   dlen = msg_data_sz(hdr);
+   hlen = msg_hdr_sz(hdr);
+   err = msg_errcode(hdr);
 
-   res = skb_copy_datagram_msg(buf, hlen + offset, m, sz_to_copy);
-   if (res)
-   goto exit;
+   /* Discard any empty non-errored (SYN-) message */
+   if (unlikely(!dlen && !err)) {
+   tsk_advance_rx_queue(sk);
+   continue;
+   }
 
-   sz_copied += sz_to_copy;
+   /* Collect msg meta data, incl. error code and rejected data */
+   if (!copied) {
+   set_orig_addr(m, hdr);
+   rc = tipc_sk_anc_data_recv(m, hdr, tsk);
+   if (rc)
+   break;
+   }
 
-   if (sz_to_copy < sz) {
-   if 

[net-next 0/2] tipc: refactor socket receive functions

2017-05-02 Thread Jon Maloy
We try to make the functions tipc_sk_recvmsg() and tipc_sk_recvstream() more 
readable.

Jon Maloy (2):
  tipc: refactor function tipc_sk_recvmsg()
  tipc: refactor function tipc_sk_recv_stream()

 net/tipc/socket.c | 266 +-
 1 file changed, 122 insertions(+), 144 deletions(-)

-- 
2.1.4



[net-next 1/2] tipc: refactor function tipc_sk_recvmsg()

2017-05-02 Thread Jon Maloy
We try to make this function more readable by improving variable names
and comments, plus some minor changes to the logics.

Reviewed-by: Parthasarathy Bhuvaragan 
Signed-off-by: Jon Maloy 
---
 net/tipc/socket.c | 109 +-
 1 file changed, 50 insertions(+), 59 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 8a4e9fe..3855bfd 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -51,6 +51,7 @@
 #define TIPC_FWD_MSG   1
 #define TIPC_MAX_PORT  0x
 #define TIPC_MIN_PORT  1
+#define TIPC_ACK_RATE  4   /* ACK at 1/4 of of rcv window size */
 
 enum {
TIPC_LISTEN = TCP_LISTEN,
@@ -1290,7 +1291,7 @@ static int tipc_wait_for_rcvmsg(struct socket *sock, long 
*timeop)
 /**
  * tipc_recvmsg - receive packet-oriented message
  * @m: descriptor for message info
- * @buf_len: total size of user buffer area
+ * @buflen: length of user buffer area
  * @flags: receive flags
  *
  * Used for SOCK_DGRAM, SOCK_RDM, and SOCK_SEQPACKET messages.
@@ -1298,89 +1299,79 @@ static int tipc_wait_for_rcvmsg(struct socket *sock, 
long *timeop)
  *
  * Returns size of returned message data, errno otherwise
  */
-static int tipc_recvmsg(struct socket *sock, struct msghdr *m, size_t buf_len,
-   int flags)
+static int tipc_recvmsg(struct socket *sock, struct msghdr *m,
+   size_t buflen,  int flags)
 {
struct sock *sk = sock->sk;
struct tipc_sock *tsk = tipc_sk(sk);
-   struct sk_buff *buf;
-   struct tipc_msg *msg;
-   bool is_connectionless = tipc_sk_type_connectionless(sk);
-   long timeo;
-   unsigned int sz;
-   u32 err;
-   int res, hlen;
+   struct sk_buff *skb;
+   struct tipc_msg *hdr;
+   bool connected = !tipc_sk_type_connectionless(sk);
+   int rc, err, hlen, dlen, copy;
+   long timeout;
 
/* Catch invalid receive requests */
-   if (unlikely(!buf_len))
+   if (unlikely(!buflen))
return -EINVAL;
 
lock_sock(sk);
-
-   if (!is_connectionless && unlikely(sk->sk_state == TIPC_OPEN)) {
-   res = -ENOTCONN;
+   if (unlikely(connected && sk->sk_state == TIPC_OPEN)) {
+   rc = -ENOTCONN;
goto exit;
}
+   timeout = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
 
-   timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
-restart:
-
-   /* Look for a message in receive queue; wait if necessary */
-   res = tipc_wait_for_rcvmsg(sock, );
-   if (res)
-   goto exit;
-
-   /* Look at first message in receive queue */
-   buf = skb_peek(>sk_receive_queue);
-   msg = buf_msg(buf);
-   sz = msg_data_sz(msg);
-   hlen = msg_hdr_sz(msg);
-   err = msg_errcode(msg);
-
-   /* Discard an empty non-errored message & try again */
-   if ((!sz) && (!err)) {
+   do {
+   /* Look at first msg in receive queue; wait if necessary */
+   rc = tipc_wait_for_rcvmsg(sock, );
+   if (unlikely(rc))
+   goto exit;
+   skb = skb_peek(>sk_receive_queue);
+   hdr = buf_msg(skb);
+   dlen = msg_data_sz(hdr);
+   hlen = msg_hdr_sz(hdr);
+   err = msg_errcode(hdr);
+   if (likely(dlen || err))
+   break;
tsk_advance_rx_queue(sk);
-   goto restart;
-   }
+   } while (1);
 
-   /* Capture sender's address (optional) */
-   set_orig_addr(m, msg);
-
-   /* Capture ancillary data (optional) */
-   res = tipc_sk_anc_data_recv(m, msg, tsk);
-   if (res)
+   /* Collect msg meta data, including error code and rejected data */
+   set_orig_addr(m, hdr);
+   rc = tipc_sk_anc_data_recv(m, hdr, tsk);
+   if (unlikely(rc))
goto exit;
 
-   /* Capture message data (if valid) & compute return value (always) */
-   if (!err) {
-   if (unlikely(buf_len < sz)) {
-   sz = buf_len;
+   /* Capture data if non-error msg, otherwise just set return value */
+   if (likely(!err)) {
+   copy = min_t(int, dlen, buflen);
+   if (unlikely(copy != dlen))
m->msg_flags |= MSG_TRUNC;
-   }
-   res = skb_copy_datagram_msg(buf, hlen, m, sz);
-   if (res)
-   goto exit;
-   res = sz;
+   rc = skb_copy_datagram_msg(skb, hlen, m, copy);
} else {
-   if (is_connectionless || err == TIPC_CONN_SHUTDOWN ||
-   m->msg_control)
-   res = 0;
-   else
-   res = -ECONNRESET;
+   copy = 0;
+   rc = 0;
+   if (err != TIPC_CONN_SHUTDOWN && 

Re: [PATCH v2 net-next 3/7] net: add function to retrieve original skb device using NAPI ID

2017-05-02 Thread Willem de Bruijn
On Tue, May 2, 2017 at 6:10 AM, Miroslav Lichvar  wrote:
> Since commit b68581778cd0 ("net: Make skb->skb_iif always track
> skb->dev") skbs don't have the original index of the interface which
> received the packet. This information is now needed for a new control
> message related to hardware timestamping.
>
> Instead of adding a new field to skb, we can find the device by the NAPI
> ID if it is available, i.e. CONFIG_NET_RX_BUSY_POLL is enabled and the
> driver is using NAPI. Add dev_get_by_napi_id() and also skb_napi_id() to
> hide the CONFIG_NET_RX_BUSY_POLL ifdef.
>
> CC: Richard Cochran 
> CC: Willem de Bruijn 
> Suggested-by: Willem de Bruijn 
> Signed-off-by: Miroslav Lichvar 

>  /**
> + * dev_get_by_napi_id - find a device by napi_id
> + * @napi_id: ID of the NAPI struct
> + *
> + * Search for an interface by NAPI ID. Returns %NULL if the device
> + * is not found or a pointer to the device. The device has not had
> + * its reference counter increased so the caller must be careful
> + * about locking. The caller must hold RCU lock.

Instead of a comment, can check with

  WARN_ON_ONCE(!rcu_read_lock_held());

> + */
> +
> +struct net_device *dev_get_by_napi_id(unsigned int napi_id)
> +{
> +   struct napi_struct *napi;
> +
> +   if (napi_id < MIN_NAPI_ID)
> +   return NULL;
> +
> +   napi = napi_by_id(napi_id);
> +   if (napi)
> +   return napi->dev;
> +
> +   return NULL;

make return NULL the branch expected not taken.

  return napi ? napi->dev : NULL;


Re: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses

2017-05-02 Thread Dan Williams
On Tue, 2017-05-02 at 15:08 +, Chiappero, Marco wrote:
> > -Original Message-
> > From: Dan Williams [mailto:d...@redhat.com]
> > On Thu, 2017-04-27 at 11:20 -0500, Dan Williams wrote:
> > > On Thu, 2017-04-27 at 15:51 +0100, Marco Chiappero wrote:
> > > > Currently all the slave devices belonging to the same port
> > > > inherit
> > > > their MAC address from its master device. This patch removes
> > > > this
> > > > limitation and allows every slave device to obtain a unique MAC
> > > > address, by default randomly generated at creation time.
> > > > 
> > > > Moreover it is now possible to correctly modify the MAC address
> > > > at
> > > > any time, fixing an existing bug as MAC address changes on the
> > > > master were not reflected on the slaves. It also avoids
> > > > multiple
> > > > interfaces sharing the same IPv6 link-local address.
> > > 
> > > How is this different than macvlan now?
> 
> The same way it was before. The purpose of the patch is to make
> possible to change the MAC address on slaves, not to change the
> external behavior of ipvlan: ipvlan will still behave as ipvlan,
> macvlan will still behave as macvlan.

Ok, it was completely unclear from the commit message that the
"internal" MAC addresses of the ipvlan interfaces were not reflected
"on the wire", but that this was essentially (as you say below) MAC
NAT.

I think everyone agrees that being able to change the MAC is useful and
was a bug.

What I'm still not clear on is, if IPv6 is already solved, why is it
useful to have assign ipvlan interface a unique MAC address?  Is it
only to make interface lookups via MAC easier?

Dan

> > >   Why would you use unique
> > > addressed ipvlan instances instead of macvlan?  Wouldn't the same
> > > problems around external switches not expecting multiple MACs
> > > from the
> > > same switch port apply now to ipvlan?
> 
> No. I'm willing to rework the commit messages if this point is not
> clear.
> The idea is to effectively NAT communications, which is more or less
> the whole concept behind ipvlan.
> 
> > > The whole point of ipvlan AIUI was to get around macvlan problems
> > > related to multiple MACs on the same port.
> > 
> > Another issue is the unicast MAC limits on cards.  ipvlan is now
> > much more likely
> > to hit the unicast MAC limit of the NIC and thus trigger
> > promiscuous mode and
> > the resulting performance drop, where before it would not.
> 
> The outside world will still see and use only one unicast MAC
> address, belonging to the master interface. No need to store any
> other unicast MAC in the filter.
> 
> Multicast filters will work as before, the only difference is that
> now slaves have different IPv6 link-local addresses, with an
> associated L2 multicast address). If this turns out to be an issue
> it's possible to keep the same MAC address by default but still allow
> it change. Or maybe force the same default link-local address
> somehow.
>  
> However, if the MAC filter is really the issue, I strongly encourage
> you to start working with HW vendors in order to better support
> modern data center workloads as a long term solution.
> 
> > > Also, I think the IPv6 thing you mention is incorrect and has
> > > long
> > > since been solved.  Originally, ipvlan did not include a "dev_id"
> > > property that differened between child interfaces, and thus the
> > > IID of
> > > the each interface was the same.  That has now been fixed, and
> > > each
> > > ipvlan slave should now have a different IID and thus a different
> > > link-
> > > local address.
> 
> Yes, thank you for pointing it out. I started this change a few
> months ago when such fix not present and simply rebased lately
> without noticing it.
> 
> Please let me know if you have further questions.
> 
> Thank you,
> Marco
> --
> Intel Research and Development Ireland Limited
> Registered in Ireland
> Registered Office: Collinstown Industrial Park, Leixlip, County
> Kildare
> Registered Number: 308263
> 
> 
> This e-mail and any attachments may contain confidential material for
> the sole
> use of the intended recipient(s). Any review or distribution by
> others is
> strictly prohibited. If you are not the intended recipient, please
> contact the
> sender and delete all copies.
> 


Re: those bpf binutils testsuite failures..

2017-05-02 Thread Aaron Conole
David Miller  writes:

> I wonder if whatever tool you used to get rid of trailing whitespace
> did so in the testsuite foo.d files too?  That definitely is
> undesirable :-)

Most likely I was being too clever by half.  I'll make sure I omit the
.d files from scanning.


remaining sparc bpf hacks...

2017-05-02 Thread David Miller

So I just have two issues remaining for test_progs to run properly
on sparc.  We've discussed them before.

The first is the clang Makefile hack:

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index d8d94b9..2ed63b6 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,5 +35,5 @@ CLANG ?= clang
 
 %.o: %.c
$(CLANG) -I../../../include/uapi -I../../../../samples/bpf/ \
-   -D__x86_64__ -Wno-compare-distinct-pointer-types \
+   -D__sparc__ -Wno-compare-distinct-pointer-types \
-O2 -target bpf -c $< -o $@

which should be reasonably easy to deal with.

The second is more difficult:

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c2ff608..bb99677 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -801,8 +801,12 @@ static int check_ptr_alignment(const struct bpf_reg_state 
*reg,
 {
switch (reg->type) {
case PTR_TO_PACKET:
+#if 1
+   return 0;
+#else
return IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ? 0 :
   check_pkt_ptr_alignment(reg, off, size);
+#endif
case PTR_TO_MAP_VALUE_ADJ:
return IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ? 0 :
   check_val_ptr_alignment(reg, size);

Studying over the verifier yesterday I figure we can add an
"alignment" attribute to registers, and track this.

The packet pointer entering a function starts with a weird "modulo
NET_IP_ALIGN" relative alignment, so we have to take that into
consideration.

Then for operations like shift left and multiply by constants, we can
record a known minimum alignment.  The most common case is, of course,
"iph->ihl << 2"

And later when registers with known alignment are added to a packet
pointer, we can constrain the alignment of the destination register.



Re: [PATCH v2 net-next 4/7] net: add new control message for incoming HW-timestamped packets

2017-05-02 Thread Willem de Bruijn
On Tue, May 2, 2017 at 6:11 AM, Miroslav Lichvar  wrote:
> Add SOF_TIMESTAMPING_OPT_PKTINFO option to request a new control message
> for incoming packets with hardware timestamps. It contains the index of
> the real interface which received the packet and the length of the
> packet at layer 2.
>
> The index is useful with bonding, bridges and other interfaces, where
> IP_PKTINFO doesn't allow applications to determine which PHC made the
> timestamp. With the L2 length (and link speed) it is possible to
> transpose preamble timestamps to trailer timestamps, which are used in
> the NTP protocol.
>
> While this information could be provided by two new socket options
> independently from timestamping, it doesn't look like it would be very
> useful. With this option any performance impact is limited to hardware
> timestamping.
>
> Use dev_get_by_napi_id() to look up the device and its index. This
> limits the option to kernels with enabled CONFIG_NET_RX_BUSY_POLL and
> drivers using napi, but it should cover all current MAC drivers that
> support hardware timestamping.
>
> CC: Richard Cochran 
> CC: Willem de Bruijn 
> Signed-off-by: Miroslav Lichvar 
> ---
>  Documentation/networking/timestamping.txt |  8 
>  include/uapi/asm-generic/socket.h |  2 ++
>  include/uapi/linux/net_tstamp.h   |  9 -
>  net/socket.c  | 30 +-
>  4 files changed, 47 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/networking/timestamping.txt 
> b/Documentation/networking/timestamping.txt
> index 96f5069..6c07e7c 100644
> --- a/Documentation/networking/timestamping.txt
> +++ b/Documentation/networking/timestamping.txt
> @@ -193,6 +193,14 @@ SOF_TIMESTAMPING_OPT_STATS:
>the transmit timestamps, such as how long a certain block of
>data was limited by peer's receiver window.
>
> +SOF_TIMESTAMPING_OPT_PKTINFO:
> +
> +  Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming
> +  packets with hardware timestamps. The message contains struct
> +  scm_ts_pktinfo, which supplies the index of the real interface
> +  which received the packet and its length at layer 2. This option
> +  works only if CONFIG_NET_RX_BUSY_POLL is enabled.
> +
>  New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to
>  disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate
>  regardless of the setting of sysctl net.core.tstamp_allow_data.
> diff --git a/include/uapi/asm-generic/socket.h 
> b/include/uapi/asm-generic/socket.h
> index 2b48856..a5f6e81 100644
> --- a/include/uapi/asm-generic/socket.h
> +++ b/include/uapi/asm-generic/socket.h
> @@ -100,4 +100,6 @@
>
>  #define SO_COOKIE  57
>
> +#define SCM_TIMESTAMPING_PKTINFO   58
> +
>  #endif /* __ASM_GENERIC_SOCKET_H */
> diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
> index 0749fb1..8fcae35 100644
> --- a/include/uapi/linux/net_tstamp.h
> +++ b/include/uapi/linux/net_tstamp.h
> @@ -26,8 +26,9 @@ enum {
> SOF_TIMESTAMPING_OPT_CMSG = (1<<10),
> SOF_TIMESTAMPING_OPT_TSONLY = (1<<11),
> SOF_TIMESTAMPING_OPT_STATS = (1<<12),
> +   SOF_TIMESTAMPING_OPT_PKTINFO = (1<<13),
>
> -   SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_STATS,
> +   SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_PKTINFO,
> SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
>  SOF_TIMESTAMPING_LAST
>  };
> @@ -130,4 +131,10 @@ enum hwtstamp_rx_filters {
> HWTSTAMP_FILTER_NTP_ALL,
>  };
>
> +/* SCM_TIMESTAMPING_PKTINFO control message */
> +struct scm_ts_pktinfo {
> +   int if_index;
> +   int pkt_length;
> +};

Use fixed width integer types across the ABI.

> +static void put_ts_pktinfo(struct msghdr *msg, struct sk_buff *skb)
> +{
> +   struct scm_ts_pktinfo ts_pktinfo;
> +   struct net_device *orig_dev;
> +   int ifindex = 0;
> +
> +   if (!skb_mac_header_was_set(skb))
> +   return;
> +
> +   rcu_read_lock();
> +   orig_dev = dev_get_by_napi_id(skb_napi_id(skb));
> +   if (orig_dev)
> +   ifindex = orig_dev->ifindex;
> +   rcu_read_unlock();
> +
> +   if (ifindex == 0)
> +   return;

If the caller requests PKTINFO, it might be better to explicitly return
a structure with illegal value 0, than to silently skip the message if
no device can be found.


Re: [PATCH 0/2] Fix some bpf program testing framework bugs

2017-05-02 Thread Alexei Starovoitov

On 5/2/17 8:36 AM, David Miller wrote:


This series fixes two issue:

1) Accidental user pointer dereference in bpf_test_finish()

2) The packet data given to the test programs is not aligned correctly

The first issue is fixed simply because we have a kernel side copy
of the datastructure in question already.  And the second bug is
a simple matter of applying NET_IP_ALIGN where needed.

Signed-off-by: David S. Miller 


thanks a lot for the fixes.

Acked-by: Alexei Starovoitov 




Re: [PATCH 2/2] bpf: Align packet data properly in program testing framework.

2017-05-02 Thread Daniel Borkmann

On 05/02/2017 05:36 PM, David Miller wrote:


Make sure we apply NET_IP_ALIGN when reserving headroom for SKB
and XDP test runs, just like a real driver would.

Signed-off-by: David S. Miller 


Thanks for fixing!

Acked-by: Daniel Borkmann 


[RFC PATCH v2 17/17] net: sched: lock once per bulk dequeue

2017-05-02 Thread John Fastabend

---
 net/sched/sch_generic.c |   53 +--
 1 file changed, 24 insertions(+), 29 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index be5a201..0f0831a 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -205,33 +205,22 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool 
*validate,
 {
const struct netdev_queue *txq = q->dev_queue;
struct sk_buff *skb = NULL;
+   spinlock_t *lock = NULL;
 
-   *packets = 1;
-   if (unlikely(!skb_queue_empty(>gso_skb))) {
-   spinlock_t *lock = NULL;
-
-   if (q->flags & TCQ_F_NOLOCK) {
-   lock = qdisc_lock(q);
-   spin_lock(lock);
-   }
-
-   skb = skb_peek(>gso_skb);
-
-   /* skb may be null if another cpu pulls gso_skb off in between
-* empty check and lock.
-*/
-   if (!skb) {
-   if (lock)
-   spin_unlock(lock);
-   goto validate;
-   }
+   if (q->flags & TCQ_F_NOLOCK) {
+   lock = qdisc_lock(q);
+   spin_lock(lock);
+   }
 
+   *packets = 1;
+   skb = skb_peek(>gso_skb);
+   if (unlikely(skb)) {
/* skb in gso_skb were already validated */
*validate = false;
/* check the reason of requeuing without tx lock first */
txq = skb_get_tx_queue(txq->dev, skb);
if (!netif_xmit_frozen_or_stopped(txq)) {
-   skb = __skb_dequeue(>gso_skb);
+   __skb_unlink(skb, >gso_skb);
if (qdisc_is_percpu_stats(q)) {
qdisc_qstats_cpu_backlog_dec(q, skb);
qdisc_qstats_cpu_qlen_dec(q);
@@ -246,17 +235,18 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool 
*validate,
spin_unlock(lock);
return skb;
}
-validate:
-   *validate = true;
 
-   if ((q->flags & TCQ_F_ONETXQUEUE) &&
-   netif_xmit_frozen_or_stopped(txq))
-   return skb;
+   *validate = true;
 
skb = qdisc_dequeue_skb_bad_txq(q);
-   if (unlikely(skb))
+   if (unlikely(skb)) {
goto bulk;
-   skb = q->dequeue(q);
+   }
+
+   if (!(q->flags & TCQ_F_ONETXQUEUE) ||
+   !netif_xmit_frozen_or_stopped(txq))
+   skb = q->dequeue(q);
+
if (skb) {
 bulk:
if (qdisc_may_bulk(q))
@@ -264,6 +254,11 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool 
*validate,
else
try_bulk_dequeue_skb_slow(q, skb, packets);
}
+
+blocked:
+   if (lock)
+   spin_unlock(lock);
+
return skb;
 }
 
@@ -621,7 +616,7 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc 
*qdisc)
if (__skb_array_empty(q))
continue;
 
-   skb = skb_array_consume_bh(q);
+   skb = __skb_array_consume(q);
}
if (likely(skb)) {
qdisc_qstats_cpu_backlog_dec(qdisc, skb);
@@ -658,7 +653,7 @@ static void pfifo_fast_reset(struct Qdisc *qdisc)
struct skb_array *q = band2list(priv, band);
struct sk_buff *skb;
 
-   while ((skb = skb_array_consume_bh(q)) != NULL)
+   while ((skb = __skb_array_consume(q)) != NULL)
__skb_array_destroy_skb(skb);
}
 



[RFC PATCH v2 16/17] net: skb_array additions for unlocked consumer

2017-05-02 Thread John Fastabend
Signed-off-by: John Fastabend 
---
 include/linux/skb_array.h |5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index 33f1f0c..b28db83 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -117,6 +117,11 @@ static inline struct sk_buff *skb_array_consume_bh(struct 
skb_array *a)
return ptr_ring_consume_bh(>ring);
 }
 
+static inline struct sk_buff *__skb_array_consume(struct skb_array *a)
+{
+   return __ptr_ring_consume(>ring);
+}
+
 static inline int __skb_array_len_with_tag(struct sk_buff *skb)
 {
if (likely(skb)) {



  1   2   3   >