date:20180605

Re: [RFC PATCH 2/2] net: macb: Disable TX checksum offloading on all Zynq

2018-06-05 Thread Michal Simek

On 4.6.2018 17:06, Nicolas Ferre wrote:
> On 25/05/2018 at 23:44, Jennifer Dahm wrote:
>> The Zynq ethernet hardware has checksum offloading bugs that cause
>> small UDP packets (<= 2 bytes) to be sent with an incorrect checksum
>> (0x) and forwarded UDP packets to be re-checksummed, which is
>> illegal behavior. The best solution we have right now is to disable
>> hardware TX checksum offloading entirely.
>>
>> Signed-off-by: Jennifer Dahm 
> 
> Adding some xilinx people I know...

Harini: Please look at this.

Thanks,
Michal

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Ilias Apalodimas

Hi Andrew,

>On Wed, Jun 06, 2018 at 01:53:56AM +0200, Andrew Lunn wrote:
> > And I just have to look a little bit in the future as selected approach
> > expected to be extended on future SoC (and other parts of existing SoCs - 
> > ICSS-G SW switch)
> > where we going to have more features, like TSN, EST and packet Policing and 
> > Classification.
> 
> You should probably took a closer look at the Mellonex driver. It has
> a lot of TC offload, which is what policing and classification is.
> 
I did take a close look to both Mellanox and rocker drivers before issuing this
RFC and we seem to be close on the approach. What Grygorii is reffering to, is
that for CBS to work properly on CPSW there *must* be a way to configure the
CPU port individually.

> TSN is being worked on in general, and i think the i210 is taking the
> lead. So you probably want to keep an eye on that, and join the
> discussion.
> 
TSN is not just "one thing". It's a few IEEE standards bundled up to provide the
needed functionality. i210 is only implementing CBS at the moment and there's
an RFC out there to support what they call "Time based scheduling".
I am already having discussions with Jesus on their current work.

The idea behind using switchdev is that due to it's design, it's a very
convenient way of utilizing netlink and iproute2/tc to configure any kind of
future shapers. Since net_device_ops now has a callback to configure hardware
schedulers being able to expose netdevs as hardware ports and configure them
individually is great. As you can understand you'll end up with TSN capable
switches and NICs being configured with the same commands from a userspace point
of view. I am not sure this is the proper way to go, but at least for me, 
sounds like a viable solution.

Thanks
Ilias

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Samudrala, Sridhar





On 6/5/2018 11:00 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 22:39:12 -0700
"Samudrala, Sridhar"  wrote:


On 6/5/2018 8:51 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 16:52:22 -0700
"Samudrala, Sridhar"  wrote:
  

On 6/5/2018 2:52 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 22:38:43 +0300
"Michael S. Tsirkin"  wrote:
 

See:
  https://patchwork.ozlabs.org/patch/851711/

Let me try to summarize that:

You wanted to speed up the delayed link up.  You had an idea to
additionally take link up when userspace renames the interface (standby
one which is also the failover for netvsc).

But userspace might not do any renames, in which case there will
still be the delay, and so this never got applied.

Is this a good summary?

Davem said delay should go away completely as it's not robust, and I
think I agree.  So I don't think we should make all failover users use
delay. IIUC failover kept a delay option especially for netvsc to
minimize the surprise factor. Hopefully we can come up with
something more robust and drop that option completely.

The timeout was the original solution to how to complete setup after
userspace has had a chance to rename the device. Unfortunately, the whole 
network
device initialization (cooperation with udev and userspace) is a a mess because
there is no well defined specification, and there are multiple ways userspace
does this in old and new distributions.  The timeout has its own issues
(how long, handling errors during that window, what if userspace modifies other
device state); and open to finding a better solution.

My point was that if name change can not be relied on (or used) by netvsc,
then we can't allow it for net_failover either.

I think the push back was with the usage of the delay, not bringing up the 
primary/standby
device in the name change event handler.
Can't netvsc use this mechanism instead of depending on the delay?

  

The patch that was rejected for netvsc was about using name change.
Also, you can't depend on name change; you still need a timer. Not all 
distributions
change name of devices. Or user has blocked that by udev rules.

In the net_failover_slave_register() we do a dev_open() and ignore any failure 
due to
EBUSY and do another dev_open() in the name change event handler.
If the name is not expected to change, i would think the dev_open() at the time 
of
register will succeed.

The problem is your first dev_open will bring device up and lockout
udev from changing the network device name.


I have tried with/without udev and didn't see any issue with the naming of the 
primary/standby
devices in my testing.

With the 3-netdev failover model, we are only interested in setting the right 
name for the failover
netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it 
really matter if udev fails
to rename the lower primary/standby netdevs, i don't think it will matter? The 
user is not expected
to touch the lower netdevs.

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Stephen Hemminger

On Tue, 5 Jun 2018 22:39:12 -0700
"Samudrala, Sridhar"  wrote:

> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:
> > On Tue, 5 Jun 2018 16:52:22 -0700
> > "Samudrala, Sridhar"  wrote:
> >  
> >> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:  
> >>> On Tue, 5 Jun 2018 22:38:43 +0300
> >>> "Michael S. Tsirkin"  wrote:
> >>> 
> > See:
> >  https://patchwork.ozlabs.org/patch/851711/  
>  Let me try to summarize that:
> 
>   You wanted to speed up the delayed link up.  You had an idea to
>   additionally take link up when userspace renames the interface (standby
>   one which is also the failover for netvsc).
> 
>   But userspace might not do any renames, in which case there will
>   still be the delay, and so this never got applied.
> 
>   Is this a good summary?
> 
>  Davem said delay should go away completely as it's not robust, and I
>  think I agree.  So I don't think we should make all failover users use
>  delay. IIUC failover kept a delay option especially for netvsc to
>  minimize the surprise factor. Hopefully we can come up with
>  something more robust and drop that option completely.  
> >>> The timeout was the original solution to how to complete setup after
> >>> userspace has had a chance to rename the device. Unfortunately, the whole 
> >>> network
> >>> device initialization (cooperation with udev and userspace) is a a mess 
> >>> because
> >>> there is no well defined specification, and there are multiple ways 
> >>> userspace
> >>> does this in old and new distributions.  The timeout has its own issues
> >>> (how long, handling errors during that window, what if userspace modifies 
> >>> other
> >>> device state); and open to finding a better solution.
> >>>
> >>> My point was that if name change can not be relied on (or used) by netvsc,
> >>> then we can't allow it for net_failover either.  
> >> I think the push back was with the usage of the delay, not bringing up the 
> >> primary/standby
> >> device in the name change event handler.
> >> Can't netvsc use this mechanism instead of depending on the delay?
> >>
> >>  
> > The patch that was rejected for netvsc was about using name change.
> > Also, you can't depend on name change; you still need a timer. Not all 
> > distributions
> > change name of devices. Or user has blocked that by udev rules.  
> 
> In the net_failover_slave_register() we do a dev_open() and ignore any 
> failure due to
> EBUSY and do another dev_open() in the name change event handler.
> If the name is not expected to change, i would think the dev_open() at the 
> time of
> register will succeed.

The problem is your first dev_open will bring device up and lockout
udev from changing the network device name.

ATENCIÓN

2018-06-05 Thread Sistemas administrador

ATENCIÓN;

Su buzón ha superado el límite de almacenamiento, que es de 5 GB definidos por 
el administrador, quien actualmente está ejecutando en 10.9GB, no puede ser 
capaz de enviar o recibir correo nuevo hasta que vuelva a validar su buzón de 
correo electrónico. Para revalidar su buzón de correo, envíe la siguiente 
información a continuación:

nombre: 
Nombre de usuario: 
contraseña:
Confirmar contraseña:
E-mail: 
teléfono: 
Si usted no puede revalidar su buzón, el buzón se deshabilitará!

Disculpa las molestias.
Código de verificación: es: 006524
Correo Soporte Técnico © 2018

¡gracias
Sistemas administrador

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Samudrala, Sridhar


On 6/5/2018 8:51 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 16:52:22 -0700
"Samudrala, Sridhar"  wrote:


On 6/5/2018 2:52 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 22:38:43 +0300
"Michael S. Tsirkin"  wrote:
  

See:
 https://patchwork.ozlabs.org/patch/851711/

Let me try to summarize that:

You wanted to speed up the delayed link up.  You had an idea to
additionally take link up when userspace renames the interface (standby
one which is also the failover for netvsc).

But userspace might not do any renames, in which case there will
still be the delay, and so this never got applied.

Is this a good summary?

Davem said delay should go away completely as it's not robust, and I
think I agree.  So I don't think we should make all failover users use
delay. IIUC failover kept a delay option especially for netvsc to
minimize the surprise factor. Hopefully we can come up with
something more robust and drop that option completely.

The timeout was the original solution to how to complete setup after
userspace has had a chance to rename the device. Unfortunately, the whole 
network
device initialization (cooperation with udev and userspace) is a a mess because
there is no well defined specification, and there are multiple ways userspace
does this in old and new distributions.  The timeout has its own issues
(how long, handling errors during that window, what if userspace modifies other
device state); and open to finding a better solution.

My point was that if name change can not be relied on (or used) by netvsc,
then we can't allow it for net_failover either.

I think the push back was with the usage of the delay, not bringing up the 
primary/standby
device in the name change event handler.
Can't netvsc use this mechanism instead of depending on the delay?



The patch that was rejected for netvsc was about using name change.
Also, you can't depend on name change; you still need a timer. Not all 
distributions
change name of devices. Or user has blocked that by udev rules.


In the net_failover_slave_register() we do a dev_open() and ignore any failure 
due to
EBUSY and do another dev_open() in the name change event handler.
If the name is not expected to change, i would think the dev_open() at the time 
of
register will succeed.

Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan

2018-06-05 Thread Or Gerlitz

On Wed, Jun 6, 2018 at 12:27 AM, Jakub Kicinski  wrote:
> On Tue, 05 Jun 2018 15:06:40 -0400 (EDT), David Miller wrote:
>> From: Jakub Kicinski 
>> Date: Tue, 5 Jun 2018 11:57:47 -0700
>>
>> > Do we still care about correctness and not breaking backward
>> > compatibility?
>>
>> Jakub let me know if you want me to revert this change.
>
> Yes, I think this patch introduces a regression when block is shared
> between offload capable and in-capable device, therefore it should be
> reverted.  Shared blocks went through a number of review cycles to
> ensure such cases are handled correctly.
>
>
> Longer story about the actual issue which is never explained in the
> commit message is this: in kernels 4.10 - 4.14 skip_sw flag was
> supported on tunnels in cls_flower only:
>
> static int fl_hw_replace_filter(struct tcf_proto *tp,
> [...]
> if (!tc_can_offload(dev)) {
> if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev) ||
> (f->hw_dev && !tc_can_offload(f->hw_dev))) {
> f->hw_dev = dev;
> return tc_skip_sw(f->flags) ? -EINVAL : 0;
> }
> dev = f->hw_dev;
> cls_flower.egress_dev = true;
> } else {
> f->hw_dev = dev;
> }
>
>
> In 4.15 - 4.17 with addition of shared blocks egdev mechanism was
> promoted to a generic TC thing supported on all filters but it no
> longer works with skip_sw.
>
> I'd argue skip_sw is incorrect for tunnels, because the rule will only
> apply to traffic ingressing on ASIC ports, not all traffic which hits
> the tunnel netdev.

This argument makes sense, however, skip_sw for tunnel decap rules
 **is** allowed since 4.10 and we have some sort of regression here (turns
out that before and after the patch..)

> Therefore we should keep the 4.15 - 4.17 behaviour.
> But that's a side note, I don't think we should be breaking offload on
> shared blocks whether we decide to support skip_sw on tunnels or not.

skip_sw on tunnels was there before shared-block, newer features should
take care not to break existing ones.

Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan

2018-06-05 Thread Or Gerlitz

On Tue, Jun 5, 2018 at 9:59 PM, Jakub Kicinski  wrote:
> On Tue, 5 Jun 2018 11:57:47 -0700, Jakub Kicinski wrote:
>> On Tue,  5 Jun 2018 11:04:03 +0300, Paul Blakey wrote:
>> > When using a vxlan device as the ingress dev, we count it as a
>> > "no offload dev", so when such a rule comes and err stop is true,
>> > we fail early and don't try the egdev route which can offload it
>> > through the egress device.
>> >
>> > Fix that by not calling the block offload if one of the devices
>> > attached to it is not offload capable, but make sure egress on such case
>> > is capable instead.
>> >
>> > Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
>> > Reviewed-by: Roi Dayan 
>> > Acked-by: Jiri Pirko 
>> > Signed-off-by: Paul Blakey 
>>
>> Very poor commit message.  What you're doing is re-enabling skip_sw
>> filters on tunnel devices which semantically make no sense and
>> shouldn't have been allowed in the first place.
>>
>> This will breaks block sharing between tunnels and HW netdevs (because
>> you skip the tcf_block_cb_call() completely).  The entire egdev idea
>> remains badly broken accepting filters like this:
>>
>> # tc filter add dev vxlan0 ingress \
>>   matchall action skip_sw \
>>   mirred egress redirect dev lo \
>>   mirred egress redirect dev sw1np0
>
> For above we probably need something like this (untested):
>
> diff --git a/net/sched/act_api.c b/net/sched/act_api.c
> index 3f4cf930f809..71e5eebec31a 100644
> --- a/net/sched/act_api.c
> +++ b/net/sched/act_api.c
> @@ -1511,7 +1511,7 @@ int tc_setup_cb_egdev_call(const struct net_device *dev,
> struct tcf_action_egdev *egdev = tcf_action_egdev_lookup(dev);
>
> if (!egdev)
> -   return 0;
> +   return err_stop ? -EOPNOTSUPP : 0;
> return tcf_action_egdev_cb_call(egdev, type, type_data, err_stop);
>  }
>  EXPORT_SYMBOL_GPL(tc_setup_cb_egdev_call);

Jakub,

We will test it out and let you know

> But the correct fix is to remove egdev crutch completely IMO.

Not against it, sometimes designs should change and be replaced
with better ones, happens

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Stephen Hemminger

On Tue, 5 Jun 2018 16:52:22 -0700
"Samudrala, Sridhar"  wrote:

> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:
> > On Tue, 5 Jun 2018 22:38:43 +0300
> > "Michael S. Tsirkin"  wrote:
> >  
> >>> See:
> >>> https://patchwork.ozlabs.org/patch/851711/  
> >> Let me try to summarize that:
> >>
> >>You wanted to speed up the delayed link up.  You had an idea to
> >>additionally take link up when userspace renames the interface (standby
> >>one which is also the failover for netvsc).
> >>
> >>But userspace might not do any renames, in which case there will
> >>still be the delay, and so this never got applied.
> >>
> >>Is this a good summary?
> >>
> >> Davem said delay should go away completely as it's not robust, and I
> >> think I agree.  So I don't think we should make all failover users use
> >> delay. IIUC failover kept a delay option especially for netvsc to
> >> minimize the surprise factor. Hopefully we can come up with
> >> something more robust and drop that option completely.  
> > The timeout was the original solution to how to complete setup after
> > userspace has had a chance to rename the device. Unfortunately, the whole 
> > network
> > device initialization (cooperation with udev and userspace) is a a mess 
> > because
> > there is no well defined specification, and there are multiple ways 
> > userspace
> > does this in old and new distributions.  The timeout has its own issues
> > (how long, handling errors during that window, what if userspace modifies 
> > other
> > device state); and open to finding a better solution.
> >
> > My point was that if name change can not be relied on (or used) by netvsc,
> > then we can't allow it for net_failover either.  
> 
> I think the push back was with the usage of the delay, not bringing up the 
> primary/standby
> device in the name change event handler.
> Can't netvsc use this mechanism instead of depending on the delay?
> 
> 

The patch that was rejected for netvsc was about using name change.
Also, you can't depend on name change; you still need a timer. Not all 
distributions
change name of devices. Or user has blocked that by udev rules.

Re: [PATCH bpf-next v7 3/6] bpf: Add IPv6 Segment Routing helpers

2018-06-05 Thread Mathieu Xhonneux

2018-05-30 13:00 GMT+02:00 Daniel Borkmann :
>> Instead of doing this inside the helper you can reject the program already
>> in the lwt_*_func_proto() by returning NULL when !CONFIG_IPV6_SEG6_BPF. That
>> way programs get rejected at verification time instead of runtime, so the
>> user can probe availability more easily.
>

My initial idea here was that we could still allow End.BPF when
!CONFIG_IPV6_SEG6_BPF, even if the helpers aren't accessible. But
indeed, having a homogeneous behavior is probably a better solution,
that would make things more clearer for users.

> Mathieu, before this gets lost in archives, plan to follow-up on this one?
Sure, I have been totally overbooked the last two weeks and will still
be for one week or two. I'm planning to send some patches ASAP, there
are a few things to improve and debug. iproute2 support has still not
been sent upstream as well.

[PATCH net-next] enic: fix UDP rss bits

2018-06-05 Thread Govindarajulu Varadarajan

In commit 48398b6e7065 ("enic: set UDP rss flag") driver needed to set a
single bit to enable UDP rss. This is changed to two bit. One for UDP
IPv4 and other bit for UDP IPv6. The hardware which supports this is not
released yet. When released, driver should set 2 bit to enable UDP rss for
both IPv4 and IPv6.

Also add spinlock around vnic_dev_capable_rss_hash_type().

Signed-off-by: Govindarajulu Varadarajan 
---
 .../net/ethernet/cisco/enic/enic_ethtool.c| 18 +
 drivers/net/ethernet/cisco/enic/enic_main.c   | 20 +--
 drivers/net/ethernet/cisco/enic/enic_res.c|  7 ++-
 drivers/net/ethernet/cisco/enic/vnic_dev.c| 18 ++---
 drivers/net/ethernet/cisco/enic/vnic_dev.h|  2 +-
 drivers/net/ethernet/cisco/enic/vnic_devcmd.h | 20 ++-
 drivers/net/ethernet/cisco/enic/vnic_nic.h|  3 ++-
 7 files changed, 67 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic_ethtool.c 
b/drivers/net/ethernet/cisco/enic/enic_ethtool.c
index 869006c2002d..f42f7a6e1559 100644
--- a/drivers/net/ethernet/cisco/enic/enic_ethtool.c
+++ b/drivers/net/ethernet/cisco/enic/enic_ethtool.c
@@ -476,18 +476,28 @@ static int enic_grxclsrule(struct enic *enic, struct 
ethtool_rxnfc *cmd)
 
 static int enic_get_rx_flow_hash(struct enic *enic, struct ethtool_rxnfc *cmd)
 {
+   u8 rss_hash_type = 0;
cmd->data = 0;
 
+   spin_lock_bh(&enic->devcmd_lock);
+   (void)vnic_dev_capable_rss_hash_type(enic->vdev, &rss_hash_type);
+   spin_unlock_bh(&enic->devcmd_lock);
switch (cmd->flow_type) {
case TCP_V6_FLOW:
case TCP_V4_FLOW:
-   cmd->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3;
-   /* Fall through */
+   cmd->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3 |
+RXH_IP_SRC | RXH_IP_DST;
+   break;
case UDP_V6_FLOW:
+   cmd->data |= RXH_IP_SRC | RXH_IP_DST;
+   if (rss_hash_type & NIC_CFG_RSS_HASH_TYPE_UDP_IPV6)
+   cmd->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3;
+   break;
case UDP_V4_FLOW:
-   if (vnic_dev_capable_udp_rss(enic->vdev))
+   cmd->data |= RXH_IP_SRC | RXH_IP_DST;
+   if (rss_hash_type & NIC_CFG_RSS_HASH_TYPE_UDP_IPV4)
cmd->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3;
-   /* Fall through */
+   break;
case SCTP_V4_FLOW:
case AH_ESP_V4_FLOW:
case AH_V4_FLOW:
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c 
b/drivers/net/ethernet/cisco/enic/enic_main.c
index 8a8b12b720ef..30d2eaa18c04 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -2320,16 +2320,24 @@ static int enic_set_rss_nic_cfg(struct enic *enic)
 {
struct device *dev = enic_get_dev(enic);
const u8 rss_default_cpu = 0;
-   u8 rss_hash_type = NIC_CFG_RSS_HASH_TYPE_IPV4 |
-   NIC_CFG_RSS_HASH_TYPE_TCP_IPV4 |
-   NIC_CFG_RSS_HASH_TYPE_IPV6 |
-   NIC_CFG_RSS_HASH_TYPE_TCP_IPV6;
const u8 rss_hash_bits = 7;
const u8 rss_base_cpu = 0;
+   u8 rss_hash_type;
+   int res;
u8 rss_enable = ENIC_SETTING(enic, RSS) && (enic->rq_count > 1);
 
-   if (vnic_dev_capable_udp_rss(enic->vdev))
-   rss_hash_type |= NIC_CFG_RSS_HASH_TYPE_UDP;
+   spin_lock_bh(&enic->devcmd_lock);
+   res = vnic_dev_capable_rss_hash_type(enic->vdev, &rss_hash_type);
+   spin_unlock_bh(&enic->devcmd_lock);
+   if (res) {
+   /* defaults for old adapters
+*/
+   rss_hash_type = NIC_CFG_RSS_HASH_TYPE_IPV4  |
+   NIC_CFG_RSS_HASH_TYPE_TCP_IPV4  |
+   NIC_CFG_RSS_HASH_TYPE_IPV6  |
+   NIC_CFG_RSS_HASH_TYPE_TCP_IPV6;
+   }
+
if (rss_enable) {
if (!enic_set_rsskey(enic)) {
if (enic_set_rsscpu(enic, rss_hash_bits)) {
diff --git a/drivers/net/ethernet/cisco/enic/enic_res.c 
b/drivers/net/ethernet/cisco/enic/enic_res.c
index 9c96911fb2c8..40b20817ddd5 100644
--- a/drivers/net/ethernet/cisco/enic/enic_res.c
+++ b/drivers/net/ethernet/cisco/enic/enic_res.c
@@ -149,6 +149,7 @@ int enic_set_nic_cfg(struct enic *enic, u8 rss_default_cpu, 
u8 rss_hash_type,
u8 rss_hash_bits, u8 rss_base_cpu, u8 rss_enable, u8 tso_ipid_split_en,
u8 ig_vlan_strip_en)
 {
+   enum vnic_devcmd_cmd cmd = CMD_NIC_CFG;
u64 a0, a1;
u32 nic_cfg;
int wait = 1000;
@@ -160,7 +161,11 @@ int enic_set_nic_cfg(struct enic *enic, u8 
rss_default_cpu, u8 rss_hash_type,
a0 = nic_cfg;
a1 = 0;
 
-   return vnic_dev_cmd(enic->vdev, CMD_NIC_CFG, &a0, &a1, wait);
+   if (rss_hash_type & (NIC_CFG_RSS_HASH_TYPE_UDP_IPV4 |
+NIC_CFG_RSS_HASH_TYPE_UDP_IP

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Andrew Lunn

> And I just have to look a little bit in the future as selected approach
> expected to be extended on future SoC (and other parts of existing SoCs - 
> ICSS-G SW switch)
> where we going to have more features, like TSN, EST and packet Policing and 
> Classification.

You should probably took a closer look at the Mellonex driver. It has
a lot of TC offload, which is what policing and classification is.

TSN is being worked on in general, and i think the i210 is taking the
lead. So you probably want to keep an eye on that, and join the
discussion.

Andrew

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Samudrala, Sridhar


On 6/5/2018 2:52 PM, Stephen Hemminger wrote:

On Tue, 5 Jun 2018 22:38:43 +0300
"Michael S. Tsirkin"  wrote:


See:
https://patchwork.ozlabs.org/patch/851711/

Let me try to summarize that:

You wanted to speed up the delayed link up.  You had an idea to
additionally take link up when userspace renames the interface (standby
one which is also the failover for netvsc).

But userspace might not do any renames, in which case there will
still be the delay, and so this never got applied.

Is this a good summary?

Davem said delay should go away completely as it's not robust, and I
think I agree.  So I don't think we should make all failover users use
delay. IIUC failover kept a delay option especially for netvsc to
minimize the surprise factor. Hopefully we can come up with
something more robust and drop that option completely.

The timeout was the original solution to how to complete setup after
userspace has had a chance to rename the device. Unfortunately, the whole 
network
device initialization (cooperation with udev and userspace) is a a mess because
there is no well defined specification, and there are multiple ways userspace
does this in old and new distributions.  The timeout has its own issues
(how long, handling errors during that window, what if userspace modifies other
device state); and open to finding a better solution.

My point was that if name change can not be relied on (or used) by netvsc,
then we can't allow it for net_failover either.


I think the push back was with the usage of the delay, not bringing up the 
primary/standby
device in the name change event handler.
Can't netvsc use this mechanism instead of depending on the delay?

Re: umh build...

2018-06-05 Thread David Miller

From: Alexei Starovoitov 
Date: Tue, 5 Jun 2018 15:29:34 -0700

> On Tue, Jun 05, 2018 at 05:03:59PM -0400, David Miller wrote:
>> 
>> Alexei, I tried to build bpfilter on sparc64 and it shows that
>> CONFIG_OUTPUT_FORMAT is an x86 specific Kconfig value.
>> 
>> And, even if I added it, it's not clear what it should even be set to.
>> 
>> Right now, for example, my userland builds default to 32-bit sparc
>> even though I'm building a 64-bit kernel.
>> 
>> I think what ends up happening has to be in some way in response to
>> what kind of "native" binaries HOSTCC is actually building.
> 
> I guess you mean native-ness not of HOSTCC, but what target rootfs will be.

Target rootfs is where I am building the kernel in this case.

Kernel running is 64-bit, most of userland is 32-bit.  Kernel is built
explicitly with "-m64" but gcc by default is building 32-bit.

> I guess it's probably equal to 32 or 64-bitness of the kernel from
> CC.

Nope, they are and can be different, even natively.

> I completely forgot that I need to fix CONFIG_OUTPUT_FORMAT.
> On my arm64 I just did
> export CONFIG_OUTPUT_FORMAT=elf64-littleaarch64
> before doing 'make'.
> I can do a hack to extract it from objdump
> similar to the hack I do for '-B ...'
> Like the patch below...
> and it works on x64 too, but I'm not sure about sparc.
 ...
> @@ -21,7 +21,7 @@ endif
>  # which bpfilter_kern.c passes further into umh blob loader at run-time
>  quiet_cmd_copy_umh = GEN $@
>cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> -  $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
> +  $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \
>-B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
>--rename-section .data=.init.rodata $< $@

Yeah this doesn't do it in the 64-bit kernel 32-bit userspace scenerio.

We build UMH which ends up being a 32-bit ELF binary, then we later
try to link it:

  ld -m elf64_sparc   -r -o net/bpfilter/bpfilter.o 
net/bpfilter/bpfilter_kern.o net/bpfilter/bpfilter_umh.o ; scripts/mod/modpost 
net/bpfilter/bpfilter.o
ld: sparc:v8plusb architecture of input file `net/bpfilter/bpfilter_umh.o' is 
incompatible with sparc:v9 output

But your patch is a step in the right direction because it gets us
away from depending upon CONFIG_OUTPUT_FORMAT.

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Andrew Lunn

On Tue, Jun 05, 2018 at 06:23:45PM -0500, Grygorii Strashko wrote:
> 
> 
> On 06/02/2018 07:26 PM, Andrew Lunn wrote:
> >> *After this patch set*: goal keep things working the same as max as
> >> possible and get rid of TI custom tool.
> > 
> > We are happy to keep things the same, if they fit with the switchdev
> > model. Anything in your customer TI tool/model which does not fit the
> > switchdev model you won't be able to keep, except if we agree to
> > extend the model.
> 
> Right. That's the main goal of RFC to identify those gaps.
> 
> > 
> > I can say now, sw0p0 is going to cause problems. I really do suggest
> > you drop it for the moment in order to get a minimal driver
> > accepted. sw0p0 does not fit the switchdev model.
> 
> Honestly, this is not the first patchset and we started without sw0p0,
> but then (with current LKML)
> - default vlan offloading breaks traffic reception to P0
>  (Ilias saying it's fixed in next - good)
> - adding vlan to P1/P2 works, but not for P0 (again as per Ilias -fixed)
> - mcast - no way to manually add static record and include or exclude P0.
> 
> 
> :( above are basic functionality required.

For a DSA driver, this is way more than basic. A basic DSA driver just
provides interfaces, and does everything in software. No offload at
all. Generally, FDB offload is next, then MDB, and then VLAN, each as
separate patch sets.

> Unfortunately, I'm not sure if all this reworking and switchdev conversation 
> would make sense
> if we will not be able to fit Ivan's work in new CPSW driver model ;..(
> and do AVB bridge.

AVB bridge should fit the switchdev model. You can offload TC via
switchdev e.g. the b53 has mirred, mellonex has flower and a lot more.

  Andrew

wir bieten 2% Kredite

2018-06-05 Thread Ronald Bernstein

Sehr geehrte Damen und Herren,

Sie brauchen Geld? Sie sind auf der suche nach einem Darlehen? Seriös und
unkompliziert?
Dann sind Sie hier bei uns genau richtig.
Durch unsere jahrelange Erfahrung und kompetente Beratung sind wir
Europaweit tätig.

Wir bieten jedem ein GÜNSTIGES Darlehen zu TOP Konditionen an.
Darlehnen zwischen 5000 CHF/Euro bis zu 20 Millionen CHF/Euro möglich.
Wir erheben dazu 2% Zinssatz.

Lassen Sie sich von unserem kompetenten Team beraten.

Zögern Sie nicht und kontaktieren Sie mich unter für weitere Infos &
Anfragen unter der eingeblendeten Email Adresse.


Ich freue mich von Ihnen zu hören.

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Andrew Lunn

> So, my understanding for (1) "blocked FDB entry support" is reasonable
> extension for bridge/switchdev ("green").

You might have to justify it, but yes.

> > Does the network stack need for forward specific multicast MAC
> > addresses between bridge ports independent of the state? If there is
> > no need for it, you don't need to accelerate it.
> 
> Assume this is about item 2 - this question is related to STP packets.
> CPSW/ALE will drop STP packets if receiving port is in blocking/learning 
> states 
> unless corresponding  mcast entry exist in ALE entry with (Supervisory 
> Packet) flag set
> (Actually ALE mcast entry has two fields (TRM): 
> Supervisory Packet (SUPER): When set, this field indicates that the packet
>  with a matching multicast destination address is a supervisory packet.
> Multicast Forward State (MCAST_FWD_STATE): Indicates the port state(s) 
> required for the received port
> on a destination address lookup in order for the multicast packet to be 
> forwarded to
> the transmit port(s). A transmit port must be in the Forwarding state in
> order to forward the packet.)

So i this case, i would expect your driver to just add these entries
to the ALE. No need for configuration for above. Just do it as soon as
a port is made a member of a bridge. And remove it when the port
leaves the bridge.

DSA devices are smarter. They all have hardware which looks for BPDU
and forwards them to the host independent of the state of the port.
They also tend to have hardware looking for IGMP packets, and will
forward them to the host, even if the multicast address is not being
forwarded to the host.

> ** "unknown vlan configuration"
> 
> This is about following use case. Non static network configuration when
> CPSW based device knows what traffic it can accept (Host port 0), but
> it has no knowledge (or limited) about network segments attached to Port 1 
> and Port 2.
> 
> For example: Host 0 can accept only vlan 100 traffic coming from Port 1.
> ALE entry: vid =100, port_mask 0x3
> 
> But there can be vlan 50 created in attached network segments.
> Unknown VLAN Force Untagged Egress ports Mask = 0x0
> Unknown VLAN Registered Multicast Flood Ports Mask = 0x6 (P1|P2)
> Unknown VLAN Multicast Flood ports Mask = 0x6 (P1|P2)
> Unknown VLAN Member ports List  = 0x6 (P1|P2)
> 
> with above configuration packets with "unknown vlan" (no ALE entry) will
> still be forwarded between port 1 and 2, but not Port 0. 
> 
> So, is it reasonable to add "unknown vlan configuration" to bridge/switchdev
> if not exist yet? will any other hw known benefit from it?

You need to think about the case of two e1000e. How do you configure
this setup to do what you want? It probably is already possible.

 Andrew

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Grygorii Strashko




On 06/02/2018 07:26 PM, Andrew Lunn wrote:
>> *After this patch set*: goal keep things working the same as max as
>> possible and get rid of TI custom tool.
> 
> We are happy to keep things the same, if they fit with the switchdev
> model. Anything in your customer TI tool/model which does not fit the
> switchdev model you won't be able to keep, except if we agree to
> extend the model.

Right. That's the main goal of RFC to identify those gaps.

> 
> I can say now, sw0p0 is going to cause problems. I really do suggest
> you drop it for the moment in order to get a minimal driver
> accepted. sw0p0 does not fit the switchdev model.

Honestly, this is not the first patchset and we started without sw0p0,
but then (with current LKML)
- default vlan offloading breaks traffic reception to P0
 (Ilias saying it's fixed in next - good)
- adding vlan to P1/P2 works, but not for P0 (again as per Ilias -fixed)
- mcast - no way to manually add static record and include or exclude P0.


:( above are basic functionality required.

> 
>> Below I've described some tested use cases (not include full static 
>> configuration),
>> but regarding sw0p0 - there is work done by Ivan Khoronzhuk [1] which enables
>> adds MQPRIO and CBS Qdisc and targets AVB network features. It required to
>> offload MQPRIO and CBS parameters on all ports including P0. In case of P0,
>> CPDMA TX channels shapers need to be configured, and in case
>> of sw0p1/sw0p2 internal FIFOS.
>> sw0p0 also expected to be used to configure CPDMA interface in general -
>> number of tx/rx channels, rates, ring sizes.
> 
> Can this be derives from the configuration on sw0p1 and sw0p2?
> sw0p1 has 1 tx channel, sw0p2 has 2 tx channels, so give p0 3 tx
> channels?

This not exactly what is required, sry I probably will need just repeat what 
Ivan 
described in https://lkml.org/lkml/2018/5/18/1135. I'd try to provide this info 
tomorrow.

Unfortunately, I'm not sure if all this reworking and switchdev conversation 
would make sense
if we will not be able to fit Ivan's work in new CPSW driver model ;..(
and do AVB bridge.

> 
>> In addition there is set of global CPSW parameters (not related to P1/P2, 
>> like
>> MAC Authorization Mode, OUI Deny Mode, crc ) which I've
>> thought can be added to sw0p0 (using ethtool -priv-flags).
> 
> You should describe these features, and then we can figure out how
> best to model them. devlink might be an option if they are switch
> global.

Ok. that can be postponed.

-- 
regards,
-grygorii

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Stephen Hemminger

On Tue, 5 Jun 2018 22:38:43 +0300
"Michael S. Tsirkin"  wrote:

> > 
> > See:
> >https://patchwork.ozlabs.org/patch/851711/  
> 
> Let me try to summarize that:
> 
>   You wanted to speed up the delayed link up.  You had an idea to
>   additionally take link up when userspace renames the interface (standby
>   one which is also the failover for netvsc).
> 
>   But userspace might not do any renames, in which case there will
>   still be the delay, and so this never got applied.
> 
>   Is this a good summary?
> 
> Davem said delay should go away completely as it's not robust, and I
> think I agree.  So I don't think we should make all failover users use
> delay. IIUC failover kept a delay option especially for netvsc to
> minimize the surprise factor. Hopefully we can come up with
> something more robust and drop that option completely.

The timeout was the original solution to how to complete setup after
userspace has had a chance to rename the device. Unfortunately, the whole 
network
device initialization (cooperation with udev and userspace) is a a mess because
there is no well defined specification, and there are multiple ways userspace
does this in old and new distributions.  The timeout has its own issues
(how long, handling errors during that window, what if userspace modifies other
device state); and open to finding a better solution.

My point was that if name change can not be relied on (or used) by netvsc,
then we can't allow it for net_failover either.

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Grygorii Strashko

On 06/02/2018 09:08 AM, Andrew Lunn wrote:
> On Fri, Jun 01, 2018 at 04:29:08PM -0500, Grygorii Strashko wrote:
>> Hi Ilias,
> 
> 
>> Second, Thanks a lot for your great work. I'm still testing it with different
>>   use cases and trying to consolidate my reply for all questions.
>>
>> All, thanks for your comments.
> 
> Hi Grygorii
> 
> Something i've said to Ilias already. I would recommend you don't try
> to cover all your uses cases with the first version. Keep it simple
> and clean, don't do anything controversial and get it merged. Then add
> more features one by one. We can then discuss any odd ball features
> while being able to look at the complete system, driver, switchdev and
> the network stack.
> 

yes. It definitely no problem from my side, except basic customer use-cases 
simply not working without sw0p0, at least with current LKML :(

And I just have to look a little bit in the future as selected approach
expected to be extended on future SoC (and other parts of existing SoCs - 
ICSS-G SW switch)
 where we going to have more features, like TSN, EST and packet Policing and 
Classification.

And I very, very appreciated for your (and all others) time and comments.

Thank you.

-- 
regards,
-grygorii

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Grygorii Strashko

Hi Andrew,

Thanks a lot for you comments.

On 06/02/2018 07:49 PM, Andrew Lunn wrote:
> Hi Grygorii
> 
>> Don't know howto:
>> 1) add FDB entry with "blocked" flag - ALE can discard all packets with 
>> SRC/DST
>> address = blocked MAC
>> 2) add multicast MAC address with Supervisory Packet flag set.
>> Such packets will bypass most of checks inside ALE and will be forwarded in 
>> all port's
>> states except "disabled".
>> 3) add "unknown vlan configuration" : ALE provides possibility to configure
>> default behavior for tagged packets with "unknown vlan" by configuring
>> - Unknown VLAN Force Untagged Egress ports Mask.
>> - Unknown VLAN Registered Multicast Flood Ports Mask
>> - Unknown VLAN Multicast Flood ports Mask
>> - Unknown VLAN Member ports List
>> 4) The way to detect "brctl stp br0 on/off"
> 
> You are probably looking at this from the wrong direction. Yes, the
> switch can do these things. But the real question is, why would the
> network stack want to do this? As i've said before, you are
> accelerating the network stack by offloading things to the hardware.

Right. Thanks. 

> 
> Does the software bridge support FDB with a blocked flag? I don't
> think it does. So you first need to extend the software bridge with
> this concept. Then you can offload it to the hardware to accelerate
> it.

Sry, for possible misunderstanding: in "Don't know howto" i've listed
things I was not able to discover from code or documentation with hope
that expert opinion will help to confirm if this this a real/possible gap
or I/we've just missed smth. And if this is a real gap - get "green" or "red"
flag for future work (which need to be planned somehow). 

So, my understanding for (1) "blocked FDB entry support" is reasonable
extension for bridge/switchdev ("green").

> 
> Does the network stack need for forward specific multicast MAC
> addresses between bridge ports independent of the state? If there is
> no need for it, you don't need to accelerate it.

Assume this is about item 2 - this question is related to STP packets.
CPSW/ALE will drop STP packets if receiving port is in blocking/learning states 
unless corresponding  mcast entry exist in ALE entry with (Supervisory Packet) 
flag set
(Actually ALE mcast entry has two fields (TRM): 
Supervisory Packet (SUPER): When set, this field indicates that the packet
 with a matching multicast destination address is a supervisory packet.
Multicast Forward State (MCAST_FWD_STATE): Indicates the port state(s) required 
for the received port
on a destination address lookup in order for the multicast packet to be 
forwarded to
the transmit port(s). A transmit port must be in the Forwarding state in
order to forward the packet.)

Question 4 was asked with assumption that if (2) not supported and "red" flag
- then option (4) can be used as w/a (again if "green" flag) and STP mcast 
address
can be added in ALE on event "stp on".

** "unknown vlan configuration"

This is about following use case. Non static network configuration when
CPSW based device knows what traffic it can accept (Host port 0), but
it has no knowledge (or limited) about network segments attached to Port 1 and 
Port 2.

For example: Host 0 can accept only vlan 100 traffic coming from Port 1.
ALE entry: vid =100, port_mask 0x3

But there can be vlan 50 created in attached network segments.
Unknown VLAN Force Untagged Egress ports Mask = 0x0
Unknown VLAN Registered Multicast Flood Ports Mask = 0x6 (P1|P2)
Unknown VLAN Multicast Flood ports Mask = 0x6 (P1|P2)
Unknown VLAN Member ports List  = 0x6 (P1|P2)

with above configuration packets with "unknown vlan" (no ALE entry) will
still be forwarded between port 1 and 2, but not Port 0. 

So, is it reasonable to add "unknown vlan configuration" to bridge/switchdev
if not exist yet? will any other hw known benefit from it?

-- 
regards,
-grygorii

Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP

2018-06-05 Thread Julian Anastasov

Hello,

On Tue, 5 Jun 2018, Martin KaFai Lau wrote:

> On Sat, May 05, 2018 at 03:58:25PM +0300, Julian Anastasov wrote:
> > So, except the RTF_LOCAL check in __ip6_rt_update_pmtu
> > we should have no other issues. 
> Hi Julian,
> 
> Do you have a chance to work on the IPv6 side?

I was chasing other IPVS issues while preparing
to finalize these changes. I plan to do tests this weekend
and to submit my patch but without gso_size modifications.
Can post patch for this check in __ip6_rt_update_pmtu too,
separately after making sure all is fine.

Regards

--
Julian Anastasov

Re: [PATCH iproute2-next v2 1/1] tc: add json support in csum action

2018-06-05 Thread David Ahern

On 6/5/18 1:44 PM, Keara Leibovitz wrote:
> Add json output support for checksum action.
> 
...

> 
> Signed-off-by: Keara Leibovitz 
> ---
>  tc/m_csum.c | 16 ++--
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/tc/m_csum.c b/tc/m_csum.c
> index 8391071d73f2..0bdbcf361a28 100644
> --- a/tc/m_csum.c
> +++ b/tc/m_csum.c
> @@ -162,6 +162,7 @@ print_csum(struct action_util *au, FILE *f, struct rtattr 
> *arg)
>   char *uflag_5 = "";
>   char *uflag_6 = "";
>   char *uflag_7 = "";
> + char buf[64];

changed that line to 'SPRINT_BUF(buf);' per Stephen's comment and
applied to iproute2-next.

Re: umh build...

2018-06-05 Thread Alexei Starovoitov

On Tue, Jun 05, 2018 at 05:03:59PM -0400, David Miller wrote:
> 
> Alexei, I tried to build bpfilter on sparc64 and it shows that
> CONFIG_OUTPUT_FORMAT is an x86 specific Kconfig value.
> 
> And, even if I added it, it's not clear what it should even be set to.
> 
> Right now, for example, my userland builds default to 32-bit sparc
> even though I'm building a 64-bit kernel.
> 
> I think what ends up happening has to be in some way in response to
> what kind of "native" binaries HOSTCC is actually building.

I guess you mean native-ness not of HOSTCC, but what target rootfs will be.
I guess it's probably equal to 32 or 64-bitness of the kernel from CC.

I completely forgot that I need to fix CONFIG_OUTPUT_FORMAT.
On my arm64 I just did
export CONFIG_OUTPUT_FORMAT=elf64-littleaarch64
before doing 'make'.
I can do a hack to extract it from objdump
similar to the hack I do for '-B ...'
Like the patch below...
and it works on x64 too, but I'm not sure about sparc.

>From e216123e54041f73ecf1676e355bc83e80c63d1d Mon Sep 17 00:00:00 2001
Date: Tue, 5 Jun 2018 15:27:07 -0700
Subject: [PATCH] bpfilter: fix OUTPUT_FORMAT

Signed-off-by: Alexei Starovoitov 
---
 net/bpfilter/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index aafa72001fcd..e0bbe7583e58 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -21,7 +21,7 @@ endif
 # which bpfilter_kern.c passes further into umh blob loader at run-time
 quiet_cmd_copy_umh = GEN $@
   cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
-  $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
+  $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \
   -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
   --rename-section .data=.init.rodata $< $@

--
2.9.5

Re: [PATCH bpf-next v4 2/2] samples/bpf: Add xdp_sample_pkts example

2018-06-05 Thread Jakub Kicinski

On Tue, 05 Jun 2018 16:50:00 +0200, Toke Høiland-Jørgensen wrote:
> Add an example program showing how to sample packets from XDP using the
> perf event buffer. The example userspace program just prints the ethernet
> header for every packet sampled.
> 
> Signed-off-by: Toke Høiland-Jørgensen 

> diff --git a/samples/bpf/xdp_sample_pkts_kern.c 
> b/samples/bpf/xdp_sample_pkts_kern.c
> new file mode 100644
> index ..4560522ca015
> --- /dev/null
> +++ b/samples/bpf/xdp_sample_pkts_kern.c
> @@ -0,0 +1,62 @@
> +#include 
> +#include 
> +#include 
> +#include "bpf_helpers.h"
> +
> +#define SAMPLE_SIZE 64ul
> +#define MAX_CPUS 24

That may be a lil' too few for modern HW with hyper-threading on ;)  
My development machine says:

$ ncpus
28

128, maybe?

> +#define bpf_printk(fmt, ...) \
> +({   \
> +char fmt[] = fmt;\
> +bpf_trace_printk(fmt, sizeof(fmt),   \
> + ##__VA_ARGS__); \
> +})
> +
> +struct bpf_map_def SEC("maps") my_map = {
> + .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
> + .key_size = sizeof(int),
> + .value_size = sizeof(u32),
> + .max_entries = MAX_CPUS,
> +};
> +
> +SEC("xdp_sample")
> +int xdp_sample_prog(struct xdp_md *ctx)
> +{
> + void *data_end = (void *)(long)ctx->data_end;
> + void *data = (void *)(long)ctx->data;
> +
> +/* Metadata will be in the perf event before the packet data. */
> + struct S {
> + u16 cookie;
> + u16 pkt_len;
> + } __attribute__((packed)) metadata;
> +
> + if (data + SAMPLE_SIZE < data_end) {
> + /* The XDP perf_event_output handler will use the upper 32 bits
> +  * of the flags argument as a number of bytes to include of the
> +  * packet payload in the event data. If the size is too big, the
> +  * call to bpf_perf_event_output will fail and return -EFAULT.
> +  *
> +  * See bpf_xdp_event_output in net/core/filter.c.
> +  *
> +  * The BPF_F_CURRENT_CPU flag means that the event output fd
> +  * will be indexed by the CPU number in the event map.
> +  */
> + u64 flags = (SAMPLE_SIZE << 32) | BPF_F_CURRENT_CPU;
> + int ret;
> +
> + metadata.cookie = 0xdead;
> + metadata.pkt_len = (u16)(data_end - data);
> +
> + ret = bpf_perf_event_output(ctx, &my_map, flags,
> +   &metadata, sizeof(metadata));
> + if(ret)

Please run checkpatch --strict on the samples.

> + bpf_printk("perf_event_output failed: %d\n", ret);
> + }
> +
> + return XDP_PASS;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> +u32 _version SEC("version") = LINUX_VERSION_CODE;

Re: [PATCH bpf-next v4 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events

2018-06-05 Thread Jakub Kicinski

On Tue, 05 Jun 2018 16:50:00 +0200, Toke Høiland-Jørgensen wrote:
> Add two new helper functions to trace_helpers that supports polling
> multiple perf file descriptors for events. These are used to the XDP
> perf_event_output example, which needs to work with one perf fd per CPU.
> 
> Signed-off-by: Toke Høiland-Jørgensen 
> ---
>  tools/testing/selftests/bpf/trace_helpers.c |   47 
> ++-
>  tools/testing/selftests/bpf/trace_helpers.h |4 ++
>  2 files changed, 49 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/trace_helpers.c 
> b/tools/testing/selftests/bpf/trace_helpers.c
> index 3868dcb63420..1e62d89f34cf 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.c
> +++ b/tools/testing/selftests/bpf/trace_helpers.c
> @@ -88,7 +88,7 @@ static int page_size;
>  static int page_cnt = 8;
>  static struct perf_event_mmap_page *header;
>  
> -int perf_event_mmap(int fd)
> +int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header)
>  {
>   void *base;
>   int mmap_size;
> @@ -102,10 +102,15 @@ int perf_event_mmap(int fd)
>   return -1;
>   }
>  
> - header = base;
> + *header = base;
>   return 0;
>  }
>  
> +int perf_event_mmap(int fd)
> +{
> + return perf_event_mmap_header(fd, &header);
> +}
> +
>  static int perf_event_poll(int fd)
>  {
>   struct pollfd pfd = { .fd = fd, .events = POLLIN };
> @@ -163,3 +168,41 @@ int perf_event_poller(int fd, perf_event_print_fn 
> output_fn)
>  
>   return ret;
>  }
> +
> +int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
> + int num_fds, perf_event_print_fn output_fn)
> +{
> + enum bpf_perf_event_ret ret;
> + struct pollfd *pfds;
> + void *buf = NULL;
> + size_t len = 0;
> + int i;
> +
> + pfds = malloc(sizeof(*pfds) * num_fds);
> + if (!pfds)
> + return -1;

nit: this is correct, but better use LIBBPF_PERF_EVENT_ERROR?  This
function is supposed to return enum bpf_perf_event_ret values I assume?
In case someone moves this code to libbpf later on...

> + memset(pfds, 0, sizeof(*pfds) * num_fds);
> + for (i = 0; i < num_fds; i++) {
> + pfds[i].fd = fds[i];
> + pfds[i].events = POLLIN;
> + }
> +
> + for (;;) {
> + poll(pfds, num_fds, 1000);
> + for (i = 0; i < num_fds; i++) {
> + if (pfds[i].revents) {

nit: save yourself a lever of indentation by doing:

if (!pfds[i].revents)
continue;

and you won't have to go over 80 chars.

> + ret = bpf_perf_event_read_simple(headers[i], 
> page_cnt * page_size,
> + page_size, 
> &buf, &len,
> + 
> bpf_perf_event_print,
> + output_fn);
> + if (ret != LIBBPF_PERF_EVENT_CONT)
> + break;
> + }
> + }
> + }
> + free(buf);
> + free(pfds);
> +
> + return ret;
> +}

[PATCH net] net: ipv6: ip6_output: alloc skb with tailroom

2018-06-05 Thread Alexander Aring

This patch adds care about tailroom length for allocate a skb from ipv6
level stack. In case of 6lowpan we had the problem the skb runs into a
skb_over_panic() in some special length cases. The root was there was no
tailroom allocated for the IEEE 802.15.4 checksum, although we had
the necessary tailroom specified inside the netdev structure.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195059
Reported-by: David Palma 
Reported-by: Rabi Narayan Sahoo 
Signed-off-by: Alexander Aring 
---
Hi,

nasty bug, I suppose this is the correct fix to my last question for
what dev->needed_tailroom is designed for.

I added two Reported-by here, David Palma reported this bug one year ago
and I didn't had time to investigate. Interesting is that he told this
bug doesn't occur (in case of 6lowpan 802.15.4) on 32 bit systems.
Maybe alignment is related to the question why it works on 32 bit.

Anyway, a week ago "Rabi Narayan Sahoo" reported the bug again and I
needed to investigate something "why", since I also use a 64 bit vm.

David Palma did a nice job for reproduce this bug and he (I think) lives
at least one year with it, so I put him at first.

Anyway, Rabi Narayan Sahoo was very very close to fix it and found the
right code part which I also found. I read his mail afterwards because
it was received messed on the linux-wpan mailinglist. So it's correct
to give him credits too. :-)

I hope there are no other cases where tailroom is missing.
The second one is not needed to fix my bug but I think we need it there.
Also hh_len is also used inside a skb_resever() in this function,
but this is for headroom only.

- Alex

 net/ipv6/ip6_output.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7b6d1689087b..b4e521cfe3cf 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1262,6 +1262,7 @@ static int __ip6_append_data(struct sock *sk,
int exthdrlen = 0;
int dst_exthdrlen = 0;
int hh_len;
+   int t_len;
int copy;
int err;
int offset = 0;
@@ -1283,6 +1284,7 @@ static int __ip6_append_data(struct sock *sk,
orig_mtu = mtu;
 
hh_len = LL_RESERVED_SPACE(rt->dst.dev);
+   t_len = rt->dst.dev->needed_tailroom;
 
fragheaderlen = sizeof(struct ipv6hdr) + rt->rt6i_nfheader_len +
(opt ? opt->opt_nflen : 0);
@@ -1425,13 +1427,13 @@ static int __ip6_append_data(struct sock *sk,
}
if (transhdrlen) {
skb = sock_alloc_send_skb(sk,
-   alloclen + hh_len,
+   alloclen + hh_len + t_len,
(flags & MSG_DONTWAIT), &err);
} else {
skb = NULL;
if (refcount_read(&sk->sk_wmem_alloc) + 
wmem_alloc_delta <=
2 * sk->sk_sndbuf)
-   skb = alloc_skb(alloclen + hh_len,
+   skb = alloc_skb(alloclen + hh_len + 
t_len,
sk->sk_allocation);
if (unlikely(!skb))
err = -ENOBUFS;
-- 
2.11.0

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Andrew Lunn

> PHC only one, but hw timestamping blocks are per port.

Yes, same as the Marvell. Per port, there are two receive time stamps
and one transmit time stamp.

Andrew

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Grygorii Strashko





On 06/05/2018 04:28 PM, Andrew Lunn wrote:

I hope you are right - question is always in number of available options
and which one to select - and, most important, explain it to the end user :(


The end customer being ptp4linux? At least for Marvell switches, it is
happy about everything except that the switch is a bit slow, so we
need to modify some of the time outs in the configuration file.


For example:
phc_index is returned as part of .get_ts_info() = cpsw_get_ts_info(),
so which intf should return phc_index?


It is not a 1:1 relationship. See:

https://elixir.bootlin.com/linux/latest/source/drivers/net/dsa/mv88e6xxx/hwtstamp.c#L61

All interfaces return the same index.

In fact, for a switch, having a PHC per port would be odd. That would
mean you need to sync the PHCs in order to act as a boundary clock.


PHC only one, but hw timestamping blocks are per port.

--
regards,
-grygorii

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Andrew Lunn

> Right, thanky for the info, but still (sry, to be annoying) why default vlan 
> is added by bridge
> when vlan_filtering == 0?

I have no idea!

I just made sure the Marvell driver works with this.

You might want to do a git blame and find out who added it, and it
might say why.

  Andrew

Re: [PATCH 4/4] cpsw: add switchdev support

2018-06-05 Thread Florian Fainelli

On 06/05/2018 02:03 PM, Grygorii Strashko wrote:
> 
> 
> On 06/02/2018 05:34 AM, Ilias Apalodimas wrote:
>> Hi Florian,
>>
>> Thanks for taking time to look into this
>>
>> On Fri, Jun 01, 2018 at 02:48:48PM -0700, Florian Fainelli wrote:
>>>
>>>
>>> On 05/24/2018 09:56 PM, Ilias Apalodimas wrote:
 On Thu, May 24, 2018 at 06:39:04PM +0200, Andrew Lunn wrote:
> On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:
>> On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:
>>> Device tree is supposed to describe the hardware. Using that hardware
>>> in different ways is not something you should describe in DT.
>>>
>> The new switchdev mode is applied with a .config option in the kernel. 
>> What you
>> see is pre-existing code, so i am not sure if i should change it in this
>> patchset.
>
> If you break the code up into a library and two drivers, it becomes a
> moot point.
 Agree

>
> But what i don't like here is that the device tree says to do dual
> mac. But you ignore that and do sometime else. I would prefer that if
> DT says dual mac, and switchdev is compiled in, the probe fails with
> EINVAL. Rather than ignore something, make it clear it is invalid.
 The switch has 3 modes of operation as is.
 1. switch mode, to enable that you don't need to add anything on
 the DTS and linux registers a single netdev interface.
 2. dual mac mode, this is when you need to add dual_emac; on the DTS.
 3. switchdev mode which is controlled by a .config option, since as you
 pointed out DTS was not made for controlling config options.

 I agree that this is far from beautiful. If the driver remains as in 
 though,
 i'd prefer either keeping what's there or making "switchdev" a DTS option,
 following the pre-existing erroneous usage rather than making the device
 unusable.  If we end up returning some error and refuse to initialize, 
 users
 that remote upgrade their equipment, without taking a good look at 
 changelog,
 will loose access to their devices with no means of remotely fixing that.
>>>
>>> It seems to me that the mistake here is seeing multiple modes of
>>> operations for the cpsw. There are not actually many, there is one
>>> usage, and then there is what you can and cannot offload.
> 
>> CPSW has in fact 2 modes of operation, different FIFO usage/lookup entry(it's
>> called ALE in the current driver) by-pass(which is used in dual emac for
>> example) and other features. Again Grygorii is better suited to answer the
>> exact differences.
> 
> dual_mac is HW enabled mode of operation, so having DT option is pretty
> reasonable as for me. 
> 1) when enabled it configures internal FIFOs in special way so both
> external Ports became equal in the direction toward to Host port 0.
> 
> TRM "The intention of this mode is to allow packets from both ethernet ports
> to be written into the FIFO without one port starving the other port."
> 
> 2) ALE, out of the box, does not support this mode and, as result, two
> default vlan have to be created to direct traffic P1->P0 (vlan1) and
> P2-P0 (vlan2), but any packets P0->P1 and P0->P2 have to be directed
> (and will bypass ALE). This way traffic separated on cpsw egress towards to 
> P0, 
> 
>>> The basic
>>> premise with switchdev and DSA (which uses switchdev) is that each
>>> user-facing port of your switch needs to work as if it were a normal
>>> Ethernet NIC, that is what you call dual-MAC I believe. Then, when you
>>> create a bridge and you enslave those ports into the bridge, you need to
>>> have forwarding done in hardware between these two ports when the
>>> SMAC/DMAC are not for the host/CPU/management interface and you must
>>> simultaneously still have the host have the ability to send/receive
>>> traffic through the bridge device.
> 
> TRM
> "When operating in dual mac mode the intention is to transfer packets between 
> ports 0 and 1 and ports 0
> and 2, but not between ports 1 and 2. Each CPGMAC_SL appears as a single MAC 
> with no bridging
> between MAC’s. Each CPGMAC_SL has at least one unique (not the same) mac 
> address."
> 
> So, we cant't talk about forwarding in dual_mac mode. Interfaces expected to 
> be 
> completely independent without any packet leaking between interfaces.
> 
> !! BUT !! CPSW ALE (switch core) still expected to be used, but for packet 
> filtering
> - only registered vlans
> - only registered mcast/bcast
> - ingress mcast/bcast rate limiting (it's actually more like coalescing -
>  limits number of mcast/bcast packets per sec. 
> 
> And all offloading ALE (val/mdb) entries should always contain two ports in 
> masks:
> p1&p0 or p2&p0. Never ever all three ports. 
> 
>> Yes dual emac does that. But dual emac configures the port facing VLAN to the
>> CPU port as well. So dual emac splits and uses 2 interfaces. VLAN 1 is
>> configured on port1 + CPU port and VLAN 2 is con

Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP

2018-06-05 Thread Martin KaFai Lau

On Sat, May 05, 2018 at 03:58:25PM +0300, Julian Anastasov wrote:
>   So, except the RTF_LOCAL check in __ip6_rt_update_pmtu
> we should have no other issues. 
Hi Julian,

Do you have a chance to work on the IPv6 side?

Thanks,
Martin

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Grygorii Strashko




On 06/02/2018 07:37 PM, Andrew Lunn wrote:
>> 1) boot, ping no vlan
>>
>> # ip link add name br0 type bridge
>> # echo 0 > /sys/class/net/br0/bridge/default_pvid
>> # ip link set dev eth2 master br0
>> # ip link set dev eth0 master br0
>> # ip link set dev eth1 master br0
>> # ifconfig br0 192.168.1.2
>>
>> *Note*: I've had to disable default_pvid as otherwise linux Bridge adds
>> and offloads default vlan 1, but default configuration for CPSW driver is 
>> vid 0.
>> +  CPSW specific - it can't untag packets for P0.
>> Another option I've found:
>> # ip link set dev br0 type bridge vlan_filtering 1.
>> but anyway, I've found it confusing that Linux bridge adds default vlan when 
>> vlan_filtering == 0
> 
> There are three different configurations here you need to worry about,
> with respect to vlans:
> 
> # CONFIG_VLAN_8021Q is not set
> 
> So you don't have any vlan support in the kernel.
> 
> CONFIG_VLAN_8021Q=y, vlan_filtering = 0
> 
> So you have vlans, but filtering is off
> 
> CONFIG_VLAN_8021Q=y, vlan_filtering = 1
> 
> So you have vlans, and filtering is on.
> 
> Even with vlan_filtering off, the bridge still does a little with
> vlans.
> 
> And you need all three to work correctly.
> 

Right, thanky for the info, but still (sry, to be annoying) why default vlan is 
added by bridge
when vlan_filtering == 0?

-- 
regards,
-grygorii

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Andrew Lunn

> I hope you are right - question is always in number of available options
> and which one to select - and, most important, explain it to the end user :( 

The end customer being ptp4linux? At least for Marvell switches, it is
happy about everything except that the switch is a bit slow, so we
need to modify some of the time outs in the configuration file.

> For example:
> phc_index is returned as part of .get_ts_info() = cpsw_get_ts_info(),
> so which intf should return phc_index?

It is not a 1:1 relationship. See:

https://elixir.bootlin.com/linux/latest/source/drivers/net/dsa/mv88e6xxx/hwtstamp.c#L61

All interfaces return the same index.

In fact, for a switch, having a PHC per port would be odd. That would
mean you need to sync the PHCs in order to act as a boundary clock.

Andrew

Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan

2018-06-05 Thread Jakub Kicinski

On Tue, 05 Jun 2018 15:06:40 -0400 (EDT), David Miller wrote:
> From: Jakub Kicinski 
> Date: Tue, 5 Jun 2018 11:57:47 -0700
> 
> > Do we still care about correctness and not breaking backward
> > compatibility?  
> 
> Jakub let me know if you want me to revert this change.

Yes, I think this patch introduces a regression when block is shared
between offload capable and in-capable device, therefore it should be
reverted.  Shared blocks went through a number of review cycles to
ensure such cases are handled correctly.

Longer story about the actual issue which is never explained in the
commit message is this: in kernels 4.10 - 4.14 skip_sw flag was
supported on tunnels in cls_flower only:

static int fl_hw_replace_filter(struct tcf_proto *tp,
[...]
if (!tc_can_offload(dev)) {
if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev) ||
(f->hw_dev && !tc_can_offload(f->hw_dev))) {
f->hw_dev = dev;
return tc_skip_sw(f->flags) ? -EINVAL : 0;
}
dev = f->hw_dev;
cls_flower.egress_dev = true;
} else {
f->hw_dev = dev;
}

In 4.15 - 4.17 with addition of shared blocks egdev mechanism was
promoted to a generic TC thing supported on all filters but it no
longer works with skip_sw.

I'd argue skip_sw is incorrect for tunnels, because the rule will only
apply to traffic ingressing on ASIC ports, not all traffic which hits
the tunnel netdev.  Therefore we should keep the 4.15 - 4.17 behaviour.

But that's a side note, I don't think we should be breaking offload on
shared blocks whether we decide to support skip_sw on tunnels or not.

Re: [PATCH bpf-next v5 00/10] BTF: BPF Type Format

2018-06-05 Thread Martin KaFai Lau

On Thu, Apr 19, 2018 at 04:40:34PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Wed, Apr 18, 2018 at 03:55:56PM -0700, Martin KaFai Lau escreveu:
> > This patch introduces BPF Type Format (BTF).
> > 
> > BTF (BPF Type Format) is the meta data format which describes
> > the data types of BPF program/map.  Hence, it basically focus
> > on the C programming language which the modern BPF is primary
> > using.  The first use case is to provide a generic pretty print
> > capability for a BPF map.
> > 
> > A modified pahole that can convert dwarf to BTF is here:
> > https://github.com/iamkafai/pahole/tree/btf
> > (Arnaldo, there is some BTF_KIND numbering changes on
> >  Apr 18th, d61426c1571)
> 
> Thanks for letting me know, I'm starting to look at this,
Hi Arnaldo,

Do you have a chance to take a look and pull it?  The kernel
changes will be in 4.18, so it will be handy if it is available in
the pahole repository.

[ btw, the latest commit (1 commit) should be 94a11b59e592 ].

Thanks,
Martin

> 
> - Arnaldo
>  
> > Please see individual patch for details.
> > 
> > v5:
> > - Remove BTF_KIND_FLOAT and BTF_KIND_FUNC which are not
> >   currently used.  They can be added in the future.
> >   Some bpf_df_xxx() are removed together.
> > - Add comment in patch 7 to clarify that the new bpffs_map_fops
> >   should not be extended further.
> > 
> > v4:
> > - Fix warning (remove unneeded semicolon)
> > - Remove a redundant variable (nr_bytes) from btf_int_check_meta() in
> >   patch 1.  Caught by W=1.
> > 
> > v3:
> > - Rebase to bpf-next
> > - Fix sparse warning (by adding static)
> > - Add BTF header logging: btf_verifier_log_hdr()
> > - Fix the alignment test on btf->type_off
> > - Add tests for the BTF header
> > - Lower the max BTF size to 16MB.  It should be enough
> >   for some time.  We could raise it later if it would
> >   be needed.
> > 
> > v2:
> > - Use kvfree where needed in patch 1 and 2
> > - Also consider BTF_INT_OFFSET() in the btf_int_check_meta()
> >   in patch 1
> > - Fix an incorrect goto target in map_create() during
> >   the btf-error-path in patch 7
> > - re-org some local vars to keep the rev xmas tree in btf.c
> > 
> > Martin KaFai Lau (10):
> >   bpf: btf: Introduce BPF Type Format (BTF)
> >   bpf: btf: Validate type reference
> >   bpf: btf: Check members of struct/union
> >   bpf: btf: Add pretty print capability for data with BTF type info
> >   bpf: btf: Add BPF_BTF_LOAD command
> >   bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd
> >   bpf: btf: Add pretty print support to the basic arraymap
> >   bpf: btf: Sync bpf.h and btf.h to tools/
> >   bpf: btf: Add BTF support to libbpf
> >   bpf: btf: Add BTF tests
> > 
> >  include/linux/bpf.h  |   20 +-
> >  include/linux/btf.h  |   48 +
> >  include/uapi/linux/bpf.h |   12 +
> >  include/uapi/linux/btf.h |  130 ++
> >  kernel/bpf/Makefile  |1 +
> >  kernel/bpf/arraymap.c|   50 +
> >  kernel/bpf/btf.c | 2064 
> > ++
> >  kernel/bpf/inode.c   |  156 +-
> >  kernel/bpf/syscall.c |   51 +-
> >  tools/include/uapi/linux/bpf.h   |   12 +
> >  tools/include/uapi/linux/btf.h   |  130 ++
> >  tools/lib/bpf/Build  |2 +-
> >  tools/lib/bpf/bpf.c  |   92 +-
> >  tools/lib/bpf/bpf.h  |   16 +
> >  tools/lib/bpf/btf.c  |  374 +
> >  tools/lib/bpf/btf.h  |   22 +
> >  tools/lib/bpf/libbpf.c   |  148 +-
> >  tools/lib/bpf/libbpf.h   |3 +
> >  tools/testing/selftests/bpf/Makefile |   26 +-
> >  tools/testing/selftests/bpf/test_btf.c   | 1669 +
> >  tools/testing/selftests/bpf/test_btf_haskv.c |   48 +
> >  tools/testing/selftests/bpf/test_btf_nokv.c  |   43 +
> >  22 files changed, 5076 insertions(+), 41 deletions(-)
> >  create mode 100644 include/linux/btf.h
> >  create mode 100644 include/uapi/linux/btf.h
> >  create mode 100644 kernel/bpf/btf.c
> >  create mode 100644 tools/include/uapi/linux/btf.h
> >  create mode 100644 tools/lib/bpf/btf.c
> >  create mode 100644 tools/lib/bpf/btf.h
> >  create mode 100644 tools/testing/selftests/bpf/test_btf.c
> >  create mode 100644 tools/testing/selftests/bpf/test_btf_haskv.c
> >  create mode 100644 tools/testing/selftests/bpf/test_btf_nokv.c
> > 
> > -- 
> > 2.9.5

Re: [PATCH 0/4] RFC CPSW switchdev mode

2018-06-05 Thread Grygorii Strashko




On 06/02/2018 07:08 PM, Andrew Lunn wrote:
> On Sat, Jun 02, 2018 at 06:28:22PM -0500, Grygorii Strashko wrote:
> 
> Hi Grygorii
> 
> I'm just picking out one thing here... there is lots more good stuff here.
> 
>> Additional headache is PTP: we have on PHC, but both external interfaces 
>> P1/P2
>> can timestamp packets.
>   
> This should not be a problem. The Marvell switches have one PHC, but
> each port can time stamp packets using this counter. Each port has its
> own receive and transmit time stamp registers. So i don't think this
> will cause you problems.

I hope you are right - question is always in number of available options
and which one to select - and, most important, explain it to the end user :( 


For example:
phc_index is returned as part of .get_ts_info() = cpsw_get_ts_info(),
so which intf should return phc_index?

Still not tested, so jut hope ...


-- 
regards,
-grygorii

Re: [PATCH iproute2-next v2 1/1] tc: add json support in csum action

2018-06-05 Thread Stephen Hemminger

On Tue,  5 Jun 2018 16:44:19 -0400
Keara Leibovitz  wrote:

> diff --git a/tc/m_csum.c b/tc/m_csum.c
> index 8391071d73f2..0bdbcf361a28 100644
> --- a/tc/m_csum.c
> +++ b/tc/m_csum.c
> @@ -162,6 +162,7 @@ print_csum(struct action_util *au, FILE *f, struct rtattr 
> *arg)
>   char *uflag_5 = "";
>   char *uflag_6 = "";
>   char *uflag_7 = "";
> + char buf[64];

Looks good.
In iproute2 the common macro for declaring these kind of buffers is:
SPRINT_BUF(buf);

Doesn't matter that much, but nice to have consistency.

Re: [PATCH 4/4] cpsw: add switchdev support

2018-06-05 Thread Grygorii Strashko




On 06/02/2018 05:34 AM, Ilias Apalodimas wrote:
> Hi Florian,
> 
> Thanks for taking time to look into this
> 
> On Fri, Jun 01, 2018 at 02:48:48PM -0700, Florian Fainelli wrote:
>>
>>
>> On 05/24/2018 09:56 PM, Ilias Apalodimas wrote:
>>> On Thu, May 24, 2018 at 06:39:04PM +0200, Andrew Lunn wrote:
 On Thu, May 24, 2018 at 04:32:34PM +0300, Ilias Apalodimas wrote:
> On Thu, May 24, 2018 at 03:12:29PM +0200, Andrew Lunn wrote:
>> Device tree is supposed to describe the hardware. Using that hardware
>> in different ways is not something you should describe in DT.
>>
> The new switchdev mode is applied with a .config option in the kernel. 
> What you
> see is pre-existing code, so i am not sure if i should change it in this
> patchset.

 If you break the code up into a library and two drivers, it becomes a
 moot point.
>>> Agree
>>>

 But what i don't like here is that the device tree says to do dual
 mac. But you ignore that and do sometime else. I would prefer that if
 DT says dual mac, and switchdev is compiled in, the probe fails with
 EINVAL. Rather than ignore something, make it clear it is invalid.
>>> The switch has 3 modes of operation as is.
>>> 1. switch mode, to enable that you don't need to add anything on
>>> the DTS and linux registers a single netdev interface.
>>> 2. dual mac mode, this is when you need to add dual_emac; on the DTS.
>>> 3. switchdev mode which is controlled by a .config option, since as you
>>> pointed out DTS was not made for controlling config options.
>>>
>>> I agree that this is far from beautiful. If the driver remains as in though,
>>> i'd prefer either keeping what's there or making "switchdev" a DTS option,
>>> following the pre-existing erroneous usage rather than making the device
>>> unusable.  If we end up returning some error and refuse to initialize, users
>>> that remote upgrade their equipment, without taking a good look at 
>>> changelog,
>>> will loose access to their devices with no means of remotely fixing that.
>>
>> It seems to me that the mistake here is seeing multiple modes of
>> operations for the cpsw. There are not actually many, there is one
>> usage, and then there is what you can and cannot offload.

> CPSW has in fact 2 modes of operation, different FIFO usage/lookup entry(it's
> called ALE in the current driver) by-pass(which is used in dual emac for
> example) and other features. Again Grygorii is better suited to answer the
> exact differences.

dual_mac is HW enabled mode of operation, so having DT option is pretty
reasonable as for me. 
1) when enabled it configures internal FIFOs in special way so both
external Ports became equal in the direction toward to Host port 0.

TRM "The intention of this mode is to allow packets from both ethernet ports
to be written into the FIFO without one port starving the other port."

2) ALE, out of the box, does not support this mode and, as result, two
default vlan have to be created to direct traffic P1->P0 (vlan1) and
P2-P0 (vlan2), but any packets P0->P1 and P0->P2 have to be directed
(and will bypass ALE). This way traffic separated on cpsw egress towards to P0, 

>> The basic
>> premise with switchdev and DSA (which uses switchdev) is that each
>> user-facing port of your switch needs to work as if it were a normal
>> Ethernet NIC, that is what you call dual-MAC I believe. Then, when you
>> create a bridge and you enslave those ports into the bridge, you need to
>> have forwarding done in hardware between these two ports when the
>> SMAC/DMAC are not for the host/CPU/management interface and you must
>> simultaneously still have the host have the ability to send/receive
>> traffic through the bridge device.

TRM
"When operating in dual mac mode the intention is to transfer packets between 
ports 0 and 1 and ports 0
and 2, but not between ports 1 and 2. Each CPGMAC_SL appears as a single MAC 
with no bridging
between MAC’s. Each CPGMAC_SL has at least one unique (not the same) mac 
address."

So, we cant't talk about forwarding in dual_mac mode. Interfaces expected to be 
completely independent without any packet leaking between interfaces.

!! BUT !! CPSW ALE (switch core) still expected to be used, but for packet 
filtering
- only registered vlans
- only registered mcast/bcast
- ingress mcast/bcast rate limiting (it's actually more like coalescing -
 limits number of mcast/bcast packets per sec. 

And all offloading ALE (val/mdb) entries should always contain two ports in 
masks:
p1&p0 or p2&p0. Never ever all three ports. 

> Yes dual emac does that. But dual emac configures the port facing VLAN to the
> CPU port as well. So dual emac splits and uses 2 interfaces. VLAN 1 is
> configured on port1 + CPU port and VLAN 2 is confired on port 2 + CPU port
> That's exactly what the current RFC does as well, with the addition of
> registering a sw0p0 (i'll explain why later on this mail)
> A little more detail on the issue we are havin

umh build...

2018-06-05 Thread David Miller



Alexei, I tried to build bpfilter on sparc64 and it shows that
CONFIG_OUTPUT_FORMAT is an x86 specific Kconfig value.

And, even if I added it, it's not clear what it should even be set to.

Right now, for example, my userland builds default to 32-bit sparc
even though I'm building a 64-bit kernel.

I think what ends up happening has to be in some way in response to
what kind of "native" binaries HOSTCC is actually building.

[PATCH iproute2-next v2 1/1] tc: add json support in csum action

2018-06-05 Thread Keara Leibovitz

Add json output support for checksum action.

Example output:

~$ $TC actions add action csum udp continue index 7
~$ $TC actions add action csum icmp iph igmp pipe index 200 cookie 112233
~$ $TC -j actions ls action csum

[{
"total acts":2
}, {
"actions": [{
"order":0,
"csum":"udp",
"control_action": {
"type":"continue"
},
"index":7,
"ref":1,
"bind":0
}, {
"order":1,
"csum":"iph, icmp, igmp",
"control_action": {
"type":"pipe"
},
"index":200,
"ref":1,
"bind":0,
"cookie":"112233"
}]
}]

v2:
Don't initialized char buf[64];
Add output example

Signed-off-by: Keara Leibovitz 
---
 tc/m_csum.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/tc/m_csum.c b/tc/m_csum.c
index 8391071d73f2..0bdbcf361a28 100644
--- a/tc/m_csum.c
+++ b/tc/m_csum.c
@@ -162,6 +162,7 @@ print_csum(struct action_util *au, FILE *f, struct rtattr 
*arg)
char *uflag_5 = "";
char *uflag_6 = "";
char *uflag_7 = "";
+   char buf[64];
 
int uflag_count = 0;
 
@@ -198,12 +199,15 @@ print_csum(struct action_util *au, FILE *f, struct rtattr 
*arg)
uflag_1 = "?empty";
}
 
-   fprintf(f, "csum (%s%s%s%s%s%s%s) ",
-   uflag_1, uflag_2, uflag_3,
-   uflag_4, uflag_5, uflag_6, uflag_7);
+   snprintf(buf, sizeof(buf), "%s%s%s%s%s%s%s",
+uflag_1, uflag_2, uflag_3,
+uflag_4, uflag_5, uflag_6, uflag_7);
+   print_string(PRINT_ANY, "csum", "csum (%s) ", buf);
+
print_action_control(f, "action ", sel->action, "\n");
-   fprintf(f, "\tindex %u ref %d bind %d", sel->index, sel->refcnt,
-   sel->bindcnt);
+   print_uint(PRINT_ANY, "index", "\tindex %u", sel->index);
+   print_int(PRINT_ANY, "ref", " ref %d", sel->refcnt);
+   print_int(PRINT_ANY, "bind", " bind %d", sel->bindcnt);
 
if (show_stats) {
if (tb[TCA_CSUM_TM]) {
@@ -212,7 +216,7 @@ print_csum(struct action_util *au, FILE *f, struct rtattr 
*arg)
print_tm(f, tm);
}
}
-   fprintf(f, "\n");
+   print_string(PRINT_FP, NULL, "%s", "\n");
 
return 0;
 }
-- 
2.7.4

Re: [PATCH iproute2-next 1/1] tc: add json support in csum action

2018-06-05 Thread David Ahern

On 6/5/18 12:30 PM, Keara Leibovitz wrote:

please add some words here. e.g., add example output

> Signed-off-by: Keara Leibovitz 
> ---
>  tc/m_csum.c | 16 ++--
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/tc/m_csum.c b/tc/m_csum.c
> index 8391071d73f2..67481667d9d2 100644
> --- a/tc/m_csum.c
> +++ b/tc/m_csum.c
> @@ -162,6 +162,7 @@ print_csum(struct action_util *au, FILE *f, struct rtattr 
> *arg)
>   char *uflag_5 = "";
>   char *uflag_6 = "";
>   char *uflag_7 = "";
> + char buf[64] = {0};

initialization is not needed. It is set before use.

[PATCH iproute2-next 1/1] tc: add json support in csum action

2018-06-05 Thread Keara Leibovitz

Signed-off-by: Keara Leibovitz 
---
 tc/m_csum.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/tc/m_csum.c b/tc/m_csum.c
index 8391071d73f2..67481667d9d2 100644
--- a/tc/m_csum.c
+++ b/tc/m_csum.c
@@ -162,6 +162,7 @@ print_csum(struct action_util *au, FILE *f, struct rtattr 
*arg)
char *uflag_5 = "";
char *uflag_6 = "";
char *uflag_7 = "";
+   char buf[64] = {0};
 
int uflag_count = 0;
 
@@ -198,12 +199,15 @@ print_csum(struct action_util *au, FILE *f, struct rtattr 
*arg)
uflag_1 = "?empty";
}
 
-   fprintf(f, "csum (%s%s%s%s%s%s%s) ",
-   uflag_1, uflag_2, uflag_3,
-   uflag_4, uflag_5, uflag_6, uflag_7);
+   snprintf(buf, sizeof(buf), "%s%s%s%s%s%s%s",
+uflag_1, uflag_2, uflag_3,
+uflag_4, uflag_5, uflag_6, uflag_7);
+   print_string(PRINT_ANY, "csum", "csum (%s) ", buf);
+
print_action_control(f, "action ", sel->action, "\n");
-   fprintf(f, "\tindex %u ref %d bind %d", sel->index, sel->refcnt,
-   sel->bindcnt);
+   print_uint(PRINT_ANY, "index", "\tindex %u", sel->index);
+   print_int(PRINT_ANY, "ref", " ref %d", sel->refcnt);
+   print_int(PRINT_ANY, "bind", " bind %d", sel->bindcnt);
 
if (show_stats) {
if (tb[TCA_CSUM_TM]) {
@@ -212,7 +216,7 @@ print_csum(struct action_util *au, FILE *f, struct rtattr 
*arg)
print_tm(f, tm);
}
}
-   fprintf(f, "\n");
+   print_string(PRINT_FP, NULL, "%s", "\n");
 
return 0;
 }
-- 
2.7.4

Re: [PATCH net-next v2 0/2] net: phy: improve PHY suspend/resume

2018-06-05 Thread Heiner Kallweit

On 02.06.2018 22:27, Heiner Kallweit wrote:
> On 01.06.2018 02:10, Andrew Lunn wrote:
>>> Configuring the different WoL options isn't handled by writing to
>>> the PHY registers but by writing to chip / MAC registers.
>>> Therefore phy_suspend() isn't able to figure out whether WoL is
>>> enabled or not. Only the parent has the full picture.
>>
>> Hi Heiner
>>
>> I think you need to look at your different runtime PM domains.  If i
>> understand the code right, you runtime suspend if there is no
>> link. But for this to work correctly, your PHY needs to keep working.
>> You also cannot assume all accesses to the PHY go via the MAC. Some
>> calls will go direct to the PHY, and they can trigger MDIO bus
>> accesses.  So i think you need two runtime PM domains. MAC and MDIO
>> bus.  Maybe just the pll? An MDIO bus is a device, so it can have its
>> on PM callbacks. It is not clear what you need to resume in order to
>> make MDIO work.
>>
> Thanks for your comments!
> The actual problem is quite small: I get an error at MDIO suspend,
> the PHY however is suspended later by the driver's suspend callback
> anyway. Because the problem is small I'm somewhat reluctant to
> consider bigger changes like introducing different PM domains.
> 
> Primary reason for the error is that the network chip is in PCI D3hot
> at that moment. In addition to that for some of the chips supported by
> the driver also MDIO-relevant PLL's might be disabled.
> 
> By the way:
> When checking PM handling for PHY/MDIO I stumbled across something
> that can be improved IMO, I'll send a patch for your review.
> 
I experimented a little and with adding Runtime PM to MDIO bus and
PHY device I can make it work:
PHY runtime-resumes before entering suspend and resumes its parent
(MDIO bus) which then resumes its parent (PCI device).
However this needs quite some code and is hard to read / understand
w/o reading through this mail thread.

And in general I still have doubts this is the right way. Let's
consider the following scenario:

A network driver configures WoL in its suspend callback
(incl. setting the device to wakeup-enabled).
The suspend callback of the PHY is called before this and therefore
has no clue that WoL will be configured a little later, with the
consequence that it will do an unsolicited power-down.
The network driver then has to detect this and power-up the PHY again.
This doesn't seem to make much sense and still my best idea is to
establish a mechanism that a device can state: I orchestrate PM
of my children.

Heiner

>> It might also help if you do the phy_connect in .ndo_open and
>> disconnect in .ndo_stop. This is a common pattern in drivers. But some
>> also do it is probe and remove.
>>
> Thanks for the hint. I will move phy_connect_direct accordingly.
> 
>>  Andrew
>>
> Heiner
>

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Michael S. Tsirkin

On Tue, Jun 05, 2018 at 11:53:05AM -0700, Stephen Hemminger wrote:
> > >   * Now, netvsc and net_failover use the same delayed work type
> > > mechanism for setup. Previously, net_failover code was triggering off
> > > name change but a similar policy was rejected for netvsc.
> > > "what is good for the goose is good for the gander"  
> > 
> > I don't really understand what you are saying here.  I think the delayed
> > hack is kind of ugly and seems racy.  Current failover code was rejected
> > by whom?  Why is new one good and for whom?  Did you want to do a name
> > change in netvsc but it was rejected? Could you clarify please?
> 
> See:
>https://patchwork.ozlabs.org/patch/851711/

Let me try to summarize that:

You wanted to speed up the delayed link up.  You had an idea to
additionally take link up when userspace renames the interface (standby
one which is also the failover for netvsc).

But userspace might not do any renames, in which case there will
still be the delay, and so this never got applied.

Is this a good summary?

Davem said delay should go away completely as it's not robust, and I
think I agree.  So I don't think we should make all failover users use
delay. IIUC failover kept a delay option especially for netvsc to
minimize the surprise factor. Hopefully we can come up with
something more robust and drop that option completely.

> > >   * Set permanent and current address of net_failover device
> > > to match the primary.
> > > 
> > >   * Carrier should be marked off before registering device
> > > the net_failover device.  
> > 
> > Are above two bugfixes?
> 
> Yes.

Maybe fix these two as first patches in the set?

> > > Although this patch needs to go into 4.18 (linux-net),  
> > 
> > I'd rather we focused on fixing bugs in 4.18, and left refactoring to
> > 4.19.
> >
> 
> Either we fix or revert the current code in 4.18.
> Sorry, I am not having callback hell code in any vendor or upstream kernel.

I agree callbacks add complexity which often isn't necessary, so
removing them where possible is a good cleanup.  But maybe a patch
shouldn't mix bugfixes, cleanups and behaviour changes all together.  If
nothing else it makes review harder.  Splitting patches up might make it
more likely they can go into 4.18 which seems to be what you want.

HTH,
-- 
MST

Re: pull-request: bpf-next 2018-06-05

2018-06-05 Thread David Miller

From: Daniel Borkmann 
Date: Tue,  5 Jun 2018 18:39:16 +0200

> The following pull-request contains BPF updates for your *net-next* tree.
> 
> The main changes are:
 ...
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

I've pulled this in and will push back out after some build testing.

Thanks!

Re: suspicius csum initialization in vmxnet3_rx_csum

2018-06-05 Thread Ronak Doshi




On Tue, 5 Jun 2018, Paolo Abeni wrote:

> Hi,
> 
> I'm sorry for the long delay in my answer, I've been travelling.
> 
> On Fri, 2018-06-01 at 11:10 -0700, Ronak Doshi wrote:
> > On Thu, 31 May 2018, Neil Horman wrote:
> > > What packet types will rcd.csum be set for?
> > > Neil 
> > 
> > I looked thorugh the emulation code and found that rcd.csum is not set. 
> > For valid v4/v6, TCP/UDP packets the code block above the mentioend "if" 
> > block will be executed or else it will go through checksum none.
> > 
> > That's why I wanted to know (in previous emails) which ESX build is being 
> > used while this was tested. The code block under "if (gdesc->rcd.csum)" 
> > block might seem incorrect but it shouldn't be hit as rcd.csum is not set. 
> 
> I'm unsure if I read the above correctly. Do you mean that the relevant
> code-path is never hit? If so, can we simply drop it, as we agreed that
> such code is uncorrect? Elsewhere, could you plese specify under which
> circumstances gdesc->rcd.csum is filled by the hypervisor?
>
I do not see hypervisor populating rcd.csum field or may be the code has 
changed over the years. So, the codepath should not be hit as it is not 
populated. I will check and fix it or remove the block if not required. 
But as far as your issue is concerned, that code block is not hit.
 
Thanks,
Ronak

Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan

2018-06-05 Thread David Miller

From: Jakub Kicinski 
Date: Tue, 5 Jun 2018 11:57:47 -0700

> Do we still care about correctness and not breaking backward
> compatibility?

Jakub let me know if you want me to revert this change.

Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan

2018-06-05 Thread Jakub Kicinski

On Tue, 5 Jun 2018 11:57:47 -0700, Jakub Kicinski wrote:
> On Tue,  5 Jun 2018 11:04:03 +0300, Paul Blakey wrote:
> > When using a vxlan device as the ingress dev, we count it as a
> > "no offload dev", so when such a rule comes and err stop is true,
> > we fail early and don't try the egdev route which can offload it
> > through the egress device.
> > 
> > Fix that by not calling the block offload if one of the devices
> > attached to it is not offload capable, but make sure egress on such case
> > is capable instead.
> > 
> > Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
> > Reviewed-by: Roi Dayan 
> > Acked-by: Jiri Pirko 
> > Signed-off-by: Paul Blakey   
> 
> Very poor commit message.  What you're doing is re-enabling skip_sw
> filters on tunnel devices which semantically make no sense and
> shouldn't have been allowed in the first place.
> 
> This will breaks block sharing between tunnels and HW netdevs (because
> you skip the tcf_block_cb_call() completely).  The entire egdev idea
> remains badly broken accepting filters like this:
> 
> # tc filter add dev vxlan0 ingress \
>   matchall action skip_sw \
>   mirred egress redirect dev lo \
>   mirred egress redirect dev sw1np0

For above we probably need something like this (untested):

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 3f4cf930f809..71e5eebec31a 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1511,7 +1511,7 @@ int tc_setup_cb_egdev_call(const struct net_device *dev,
struct tcf_action_egdev *egdev = tcf_action_egdev_lookup(dev);
 
if (!egdev)
-   return 0;
+   return err_stop ? -EOPNOTSUPP : 0;
return tcf_action_egdev_cb_call(egdev, type, type_data, err_stop);
 }
 EXPORT_SYMBOL_GPL(tc_setup_cb_egdev_call);

But the correct fix is to remove egdev crutch completely IMO.

Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan

2018-06-05 Thread Jakub Kicinski

On Tue,  5 Jun 2018 11:04:03 +0300, Paul Blakey wrote:
> When using a vxlan device as the ingress dev, we count it as a
> "no offload dev", so when such a rule comes and err stop is true,
> we fail early and don't try the egdev route which can offload it
> through the egress device.
> 
> Fix that by not calling the block offload if one of the devices
> attached to it is not offload capable, but make sure egress on such case
> is capable instead.
> 
> Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
> Reviewed-by: Roi Dayan 
> Acked-by: Jiri Pirko 
> Signed-off-by: Paul Blakey 

Very poor commit message.  What you're doing is re-enabling skip_sw
filters on tunnel devices which semantically make no sense and
shouldn't have been allowed in the first place.

This will breaks block sharing between tunnels and HW netdevs (because
you skip the tcf_block_cb_call() completely).  The entire egdev idea
remains badly broken accepting filters like this:

# tc filter add dev vxlan0 ingress \
matchall action skip_sw \
mirred egress redirect dev lo \
mirred egress redirect dev sw1np0

Do we still care about correctness and not breaking backward
compatibility?

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Stephen Hemminger

On Tue, 5 Jun 2018 21:35:26 +0300
"Michael S. Tsirkin"  wrote:

> Thanks, I think this is nice patch but I wonder whether it can be split
> up somewhat. Not all of it is uncontroversial.

I started that way, but then I was fixing code that was later deleted.
The big change was eliminating the callbacks.

> 
> On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
> >   * The matching of secondary device to primary device policy
> > is up to the network device. Both net_failover and netvsc
> > will use MAC for now but can change separately.  
> 
> I actually suspect both will change to a serial number
> down the road.
> 
> >   * The match policy is only used during initial discovery; after
> > that the secondary device knows what the upper device is because
> > of the parent/child relationship; no searching is required.  
> 
> That would obviously be an improvement - does it have to be tied with
> rest of changes?

This was not possible with the version of the common code that
is in net now.

> 
> >   * Now, netvsc and net_failover use the same delayed work type
> > mechanism for setup. Previously, net_failover code was triggering off
> > name change but a similar policy was rejected for netvsc.
> > "what is good for the goose is good for the gander"  
> 
> I don't really understand what you are saying here.  I think the delayed
> hack is kind of ugly and seems racy.  Current failover code was rejected
> by whom?  Why is new one good and for whom?  Did you want to do a name
> change in netvsc but it was rejected? Could you clarify please?

See:
   https://patchwork.ozlabs.org/patch/851711/

> 
> >   * The net_failover private device info 'struct net_failover_info'
> > should have been private to the driver file, not a visible
> > API.
> > 
> >   * The net_failover device should use SET_NETDEV_DEV
> > that is intended only for physical devices not virtual devices.  
> 
> You mean should not.

Yes. Virtual device should not set device parent.

> 
> >   * No point in having DocBook style comments on a driver file.
> > They only make sense on an external exposed API.
> > 
> >   * net_failover only supports Ethernet, so use ether_addr_copy.  
> 
> It is since you need to know about all the things you need to copy, and
> because of mac matching.  But it isn't too much effort to add more
> transports and I don't see value in going in the reverse direction and
> making it more ethernet specific that it already is.

Sure, then do memcpy base on addr_len

> 
> >   * Set permanent and current address of net_failover device
> > to match the primary.
> > 
> >   * Carrier should be marked off before registering device
> > the net_failover device.  
> 
> Are above two bugfixes?

Yes.

> 
> >   * Use netdev_XXX for log messages, in net_failover (not dev_xxx)
> > 
> >   * Since failover infrastructure is about linking devices just
> > use RTNL no need for other locking in init and teardown.
> > 
> >   * Don't bother with ERR_PTR() style return if only possible
> > return is success or no memory.
> > 
> >   * As much as possible, the terms master and slave should be avoided
> > because of their cultural connotations.  
> 
> Also for consistency, failover is calling these primary and standby now.

Good, let's standardize on that. 

> 
> > Note; this code has been tested on Hyper-V
> > but is compile tested only on virtio.
> > 
> > Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
> > Signed-off-by: Stephen Hemminger 
> > ---
> > 
> > Although this patch needs to go into 4.18 (linux-net),  
> 
> I'd rather we focused on fixing bugs in 4.18, and left refactoring to
> 4.19.
>

Either we fix or revert the current code in 4.18.
Sorry, I am not having callback hell code in any vendor or upstream kernel.

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Michael S. Tsirkin

Thanks, I think this is nice patch but I wonder whether it can be split
up somewhat. Not all of it is uncontroversial.

On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
>   * The matching of secondary device to primary device policy
> is up to the network device. Both net_failover and netvsc
> will use MAC for now but can change separately.

I actually suspect both will change to a serial number
down the road.

>   * The match policy is only used during initial discovery; after
> that the secondary device knows what the upper device is because
> of the parent/child relationship; no searching is required.

That would obviously be an improvement - does it have to be tied with
rest of changes?

>   * Now, netvsc and net_failover use the same delayed work type
> mechanism for setup. Previously, net_failover code was triggering off
> name change but a similar policy was rejected for netvsc.
> "what is good for the goose is good for the gander"

I don't really understand what you are saying here.  I think the delayed
hack is kind of ugly and seems racy.  Current failover code was rejected
by whom?  Why is new one good and for whom?  Did you want to do a name
change in netvsc but it was rejected? Could you clarify please?

>   * The net_failover private device info 'struct net_failover_info'
> should have been private to the driver file, not a visible
> API.
> 
>   * The net_failover device should use SET_NETDEV_DEV
> that is intended only for physical devices not virtual devices.

You mean should not.

>   * No point in having DocBook style comments on a driver file.
> They only make sense on an external exposed API.
> 
>   * net_failover only supports Ethernet, so use ether_addr_copy.

It is since you need to know about all the things you need to copy, and
because of mac matching.  But it isn't too much effort to add more
transports and I don't see value in going in the reverse direction and
making it more ethernet specific that it already is.

>   * Set permanent and current address of net_failover device
> to match the primary.
> 
>   * Carrier should be marked off before registering device
> the net_failover device.

Are above two bugfixes?

>   * Use netdev_XXX for log messages, in net_failover (not dev_xxx)
> 
>   * Since failover infrastructure is about linking devices just
> use RTNL no need for other locking in init and teardown.
> 
>   * Don't bother with ERR_PTR() style return if only possible
> return is success or no memory.
> 
>   * As much as possible, the terms master and slave should be avoided
> because of their cultural connotations.

Also for consistency, failover is calling these primary and standby now.

> Note; this code has been tested on Hyper-V
> but is compile tested only on virtio.
> 
> Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
> Signed-off-by: Stephen Hemminger 
> ---
> 
> Although this patch needs to go into 4.18 (linux-net),

I'd rather we focused on fixing bugs in 4.18, and left refactoring to
4.19.

At some point you said refactoring is needed to support matching using
the serial number, but I see this didn't make 4.18. So no rush IMHO.

> this version is based against net-next because net-next
> hasn't been merged into linux-net yet.
> 
> 
>  drivers/net/hyperv/hyperv_net.h |   3 +-
>  drivers/net/hyperv/netvsc_drv.c | 173 +++--
>  drivers/net/net_failover.c  | 312 ---
>  drivers/net/virtio_net.c|   9 +-
>  include/net/failover.h  |  31 +---
>  include/net/net_failover.h  |  32 +---
>  net/Kconfig |  13 +-
>  net/core/failover.c | 316 
>  8 files changed, 373 insertions(+), 516 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 99d8e7398a5b..c7d25d10765e 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -902,6 +902,8 @@ struct net_device_context {
>   struct hv_device *device_ctx;
>   /* netvsc_device */
>   struct netvsc_device __rcu *nvdev;
> + /* list of netvsc net_devices */
> + struct list_head list;
>   /* reconfigure work */
>   struct delayed_work dwork;
>   /* last reconfig time */
> @@ -933,7 +935,6 @@ struct net_device_context {
>   /* Serial number of the VF to team with */
>   u32 vf_serial;
>  
> - struct failover *failover;
>  };
>  
>  /* Per channel data */
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index bef4d55a108c..074e6b8578df 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -70,6 +70,8 @@ static int debug = -1;
>  module_param(debug, int, 0444);
>  MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
>  
> +static LIST_HEAD(netvsc_dev_list);
> +
>  static void netvsc_change_rx_flags(struct net_device *ne

Re: [Patch net v2] netdev-FAQ: clarify DaveM's position for stable backports

2018-06-05 Thread David Miller

From: Cong Wang 
Date: Tue,  5 Jun 2018 09:48:13 -0700

> Per discussion with David at netconf 2018, let's clarify
> DaveM's position of handling stable backports in netdev-FAQ.
> 
> This is important for people relying on upstream -stable
> releases.
> 
> Cc: sta...@vger.kernel.org
> Cc: Greg Kroah-Hartman 
> Signed-off-by: Cong Wang 

Applied and queued up for -stable, thanks Cong.

Re: [PATCH iproute2] configure: require libmnl

2018-06-05 Thread Simon Horman

On Thu, May 31, 2018 at 03:32:09PM -0400, Stephen Hemminger wrote:
> Several users of BPF and other features are trying to build without
> libmnl, then complaining that features don't work.  The time has
> come to require libmnl to build iproute2.
> 
> Signed-off-by: Stephen Hemminger 

Reviewed-by: Simon Horman

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread David Miller

From: Stephen Hemminger 
Date: Tue, 5 Jun 2018 10:45:10 -0700

> I said it wasn't tested. Not surprising. Don't have a version of KVM
> that supports standby (and not going to build KVM from scratch for
> this).

It would definitely help me if you put "RFC" in the subject line
for patches which aren't tested :-)

Thanks.

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Stephen Hemminger

On Tue, 5 Jun 2018 10:22:13 -0700
"Samudrala, Sridhar"  wrote:

> On 6/4/2018 8:42 PM, Stephen Hemminger wrote:
> > The net failover should be a simple library, not a virtual
> > object with function callbacks (see callback hell).
> > The code is simpler is smaller both for the netvsc and virtio use case.  
> 
> I quickly tried this patch and it breaks virtio-net in standby mode.
> I don't see failover netdev, unloading virtio-net causes a crash.

I said it wasn't tested. Not surprising. Don't have a version of KVM
that supports standby (and not going to build KVM from scratch for this).

> 
> With these changes, there is very minimal code that is shared between
> netvsc and virtio-net. The notifier and event handling code and the
> lookup_bymac routines are now duplicated in both the drivers. I thought
> we wanted to keep this code common between the 2 drivers and we went through
> multiple revisions to make sure that it works with both netvsc's 2 netdev
> and virtio-net's 3 netdev models.

The sharing is what needs to be shared; it turns out not much
is common really. The total code size is less with this version.
Also, since even with nested virtualization, both drivers will not
be present on same guest at same time.

> The reason for the indirect ops is to support these 2 different models and
> i am not sure if the overhead of the callbacks is that significant considering
> that they are not called in the hot path.

The callbacks invert the functionality. It makes everything bottom up.

Re: Qualcomm rmnet driver and qmi_wwan

2018-06-05 Thread Subash Abhinov Kasiviswanathan


On 2018-06-05 08:54, Dan Williams wrote:

On Tue, 2018-06-05 at 11:38 +0200, Daniele Palmas wrote:

Hi,

2018-02-21 20:47 GMT+01:00 Subash Abhinov Kasiviswanathan
:
> On 2018-02-21 04:38, Daniele Palmas wrote:
> >
> > Hello,
> >
> > in rmnet kernel documentation I read:
> >
> > "This driver can be used to register onto any physical network
> > device in
> > IP mode. Physical transports include USB, HSIC, PCIe and IP
> > accelerator."
> >
> > Does this mean that it can be used in association with the
> > qmi_wwan
> > driver?
> >
> > If yes, can someone give me an hint on the steps to follow?
> >
> > If not, does anyone know if it is possible to modify qmi_wwan in
> > order
> > to take advantage of the features provided by the rmnet driver?
> >
> > In this case hint on the changes for modifying qmi_wwan are
> > welcome.
> >
> > Thanks in advance,
> > Daniele
>
>
> Hi
>
> I havent used qmi_wwan so the following comment is based on code
> inspection.
> qmimux_register_device() is creating qmimux devices with usb net
> device as
> real_dev. The Multiplexing and aggregation header (qmimux_hdr) is
> stripped
> off
> in qmimux_rx_fixup() and the packet is passed on to stack.
>
> You could instead create rmnet devices with the usb netdevice as
> real dev.
> The packets from the usb net driver can be queued to network stack
> directly
> as rmnet driver will setup a RX handler. rmnet driver will process
> the
> packets
> further and then queue to network stack.
>

in kernel documentation I read that rmnet user space configuration is
done through librmnetctl available at

https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource
/dataservices/tree/rmnetctl

However it seems to me that this is a bit outdated (e.g. it does not
properly build since it is looking for kernel header
linux/rmnet_data.h that, as far as I understand, is no more present).

Is there available a more recent version of the tool?


Hi Daniele

The attached patch should have an updated version of the tool.
Usage -

rmnetcli -n newlink wwan0 rmnet0 1 1
where wwan0 is the physical device
rmnet0 is the virtual device to be created
1 is the mux id
the other 1 is the flag to configure DL de-aggregation by default

To delete a device -

ip link delete rmnet0



I'd expect that somebody (Subash?) would add support for the
rmnet/qmimux options to iproute2 via 'ip link' like exists for most
other device types.


Hi Dan

Yes, I can do that and update the documentation to point to using 
iproute2.


--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative ProjectFrom 94abb589a8e60c49d9cc4598a27b50c9f757b623 Mon Sep 17 00:00:00 2001
From: Subash Abhinov Kasiviswanathan 
Date: Tue, 5 Jun 2018 11:20:56 -0600
Subject: [PATCH] librmnetctl: add initial library code

Add initial snapshot of Rmnet Control library for
upstream rmnet driver.

---
 rmnetlib/Makefile.am|2 +
 rmnetlib/cli/Makefile.am|   12 +
 rmnetlib/cli/rmnetcli.c |  502 ++
 rmnetlib/cli/rmnetcli.h |   61 ++
 rmnetlib/inc/librmnetctl.h  |  604 +
 rmnetlib/inc/librmnetctl_hndl.h |   65 ++
 rmnetlib/src/Makefile.am|   14 +
 rmnetlib/src/librmnetctl.c  | 1369 +++
 8 files changed, 2629 insertions(+)
 create mode 100644 rmnetlib/Makefile.am
 create mode 100644 rmnetlib/cli/Makefile.am
 create mode 100644 rmnetlib/cli/rmnetcli.c
 create mode 100644 rmnetlib/cli/rmnetcli.h
 create mode 100644 rmnetlib/inc/librmnetctl.h
 create mode 100644 rmnetlib/inc/librmnetctl_hndl.h
 create mode 100644 rmnetlib/src/Makefile.am
 create mode 100644 rmnetlib/src/librmnetctl.c

diff --git a/rmnetlib/Makefile.am b/rmnetlib/Makefile.am
new file mode 100644
index 000..7cafa82
--- /dev/null
+++ b/rmnetlib/Makefile.am
@@ -0,0 +1,2 @@
+SUBDIRS=src cli
+
diff --git a/rmnetlib/cli/Makefile.am b/rmnetlib/cli/Makefile.am
new file mode 100644
index 000..499ef88
--- /dev/null
+++ b/rmnetlib/cli/Makefile.am
@@ -0,0 +1,12 @@
+AM_CFLAGS = $(CODE_COVERAGE_CFLAGS)
+AM_CFLAGS += -Wall -Werror -Wundef -Wstrict-prototypes -Wno-trigraphs -g
+AM_CFLAGS += -I./../inc
+AM_LDFLAGS = $(CODE_COVERAGE_LDFLAGS)
+
+rmnetcli_SOURCES = rmnetcli.c
+bin_PROGRAMS = rmnetcli
+requiredlibs = ../src/librmnetctl.la
+rmnetcli_LDADD = $(requiredlibs)
+LOCAL_MODULE := librmnetctl
+LOCAL_PRELINK_MODULE := false
+include $(BUILD_SHARED_LIBRARY)
diff --git a/rmnetlib/cli/rmnetcli.c b/rmnetlib/cli/rmnetcli.c
new file mode 100644
index 000..dc81054
--- /dev/null
+++ b/rmnetlib/cli/rmnetcli.c
@@ -0,0 +1,502 @@
+/**
+
+			R M N E T C L I . C
+
+Copyright (c) 2013-2015, 2017-2018 The Linux Foundation. All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+	* Redistributions of source code must retain

Re: [PATCH net] failover: eliminate callback hell

2018-06-05 Thread Samudrala, Sridhar


On 6/4/2018 8:42 PM, Stephen Hemminger wrote:

The net failover should be a simple library, not a virtual
object with function callbacks (see callback hell).
The code is simpler is smaller both for the netvsc and virtio use case.


I quickly tried this patch and it breaks virtio-net in standby mode.
I don't see failover netdev, unloading virtio-net causes a crash.

With these changes, there is very minimal code that is shared between
netvsc and virtio-net. The notifier and event handling code and the
lookup_bymac routines are now duplicated in both the drivers. I thought
we wanted to keep this code common between the 2 drivers and we went through
multiple revisions to make sure that it works with both netvsc's 2 netdev
and virtio-net's 3 netdev models.

The reason for the indirect ops is to support these 2 different models and
i am not sure if the overhead of the callbacks is that significant considering
that they are not called in the hot path.






The code is restructured in many ways. I should have given these
as review comments to net_failover during review
but did not want to overwhelm the original submitter.
Therefore it was merged prematurely.

Some of the many items changed are:

   * The support routines should just be selected as needed in
 kernel config, no need for them to be visible config items.

   * Both netvsc and net_failover should keep their list of their
 own devices. Not a common list.

  * The matching of secondary device to primary device policy
 is up to the network device. Both net_failover and netvsc
 will use MAC for now but can change separately.

   * The match policy is only used during initial discovery; after
 that the secondary device knows what the upper device is because
 of the parent/child relationship; no searching is required.

   * Now, netvsc and net_failover use the same delayed work type
 mechanism for setup. Previously, net_failover code was triggering off
 name change but a similar policy was rejected for netvsc.
 "what is good for the goose is good for the gander"

   * The net_failover private device info 'struct net_failover_info'
 should have been private to the driver file, not a visible
 API.

   * The net_failover device should use SET_NETDEV_DEV
 that is intended only for physical devices not virtual devices.

   * No point in having DocBook style comments on a driver file.
 They only make sense on an external exposed API.

   * net_failover only supports Ethernet, so use ether_addr_copy.

   * Set permanent and current address of net_failover device
 to match the primary.

   * Carrier should be marked off before registering device
 the net_failover device.

   * Use netdev_XXX for log messages, in net_failover (not dev_xxx)

   * Since failover infrastructure is about linking devices just
 use RTNL no need for other locking in init and teardown.

   * Don't bother with ERR_PTR() style return if only possible
 return is success or no memory.

   * As much as possible, the terms master and slave should be avoided
 because of their cultural connotations.

Note; this code has been tested on Hyper-V
but is compile tested only on virtio.

Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
Signed-off-by: Stephen Hemminger 
---

Although this patch needs to go into 4.18 (linux-net),
this version is based against net-next because net-next
hasn't been merged into linux-net yet.


  drivers/net/hyperv/hyperv_net.h |   3 +-
  drivers/net/hyperv/netvsc_drv.c | 173 +++--
  drivers/net/net_failover.c  | 312 ---
  drivers/net/virtio_net.c|   9 +-
  include/net/failover.h  |  31 +---
  include/net/net_failover.h  |  32 +---
  net/Kconfig |  13 +-
  net/core/failover.c | 316 
  8 files changed, 373 insertions(+), 516 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 99d8e7398a5b..c7d25d10765e 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -902,6 +902,8 @@ struct net_device_context {
struct hv_device *device_ctx;
/* netvsc_device */
struct netvsc_device __rcu *nvdev;
+   /* list of netvsc net_devices */
+   struct list_head list;
/* reconfigure work */
struct delayed_work dwork;
/* last reconfig time */
@@ -933,7 +935,6 @@ struct net_device_context {
/* Serial number of the VF to team with */
u32 vf_serial;
  
-	struct failover *failover;

  };
  
  /* Per channel data */

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index bef4d55a108c..074e6b8578df 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -70,6 +70,8 @@ static int debug = -1;
  module_param(debug, int, 0444);
  MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all

Re: [PATCH iproute2 v2 1/2] ip: display netns name instead of nsid

2018-06-05 Thread Stephen Hemminger

On Tue,  5 Jun 2018 15:08:30 +0200
Nicolas Dichtel  wrote:

>  
> +char *get_name_from_nsid(int nsid)
> +{
> + struct nsid_cache *c;
> +
> + netns_nsid_socket_init();
> + netns_map_init();
> +
> + c = netns_map_get_by_nsid(nsid);
> + if (c)
> + return c->name;
> +
> + return NULL;
> +}
> +

This is better, but now there is a different problem.
When doing multiple interfaces, won't the initialization code be called twice?

[Patch net v2] netdev-FAQ: clarify DaveM's position for stable backports

2018-06-05 Thread Cong Wang

Per discussion with David at netconf 2018, let's clarify
DaveM's position of handling stable backports in netdev-FAQ.

This is important for people relying on upstream -stable
releases.

Cc: sta...@vger.kernel.org
Cc: Greg Kroah-Hartman 
Signed-off-by: Cong Wang 
---
 Documentation/networking/netdev-FAQ.txt | 9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/networking/netdev-FAQ.txt 
b/Documentation/networking/netdev-FAQ.txt
index 2a3278d5cf35..fa951b820b25 100644
--- a/Documentation/networking/netdev-FAQ.txt
+++ b/Documentation/networking/netdev-FAQ.txt
@@ -179,6 +179,15 @@ A: No.  See above answer.  In short, if you think it 
really belongs in
dash marker line as described in 
Documentation/process/submitting-patches.rst to
temporarily embed that information into the patch that you send.
 
+Q: Are all networking bug fixes backported to all stable releases?
+
+A: Due to capacity, Dave could only take care of the backports for the last
+   2 stable releases. For earlier stable releases, each stable branch 
maintainer
+   is supposed to take care of them. If you find any patch is missing from an
+   earlier stable branch, please notify sta...@vger.kernel.org with either a
+   commit ID or a formal patch backported, and CC Dave and other relevant
+   networking developers.
+
 Q: Someone said that the comment style and coding convention is different
for the networking content.  Is this true?
 
-- 
2.13.0

Re: [PATCH net-next] rtnetlink: validate attributes in do_setlink()

2018-06-05 Thread David Miller

From: Eric Dumazet 
Date: Tue,  5 Jun 2018 09:25:19 -0700

> It seems that rtnl_group_changelink() can call do_setlink
> while a prior call to validate_linkmsg(dev = NULL, ...) could
> not validate IFLA_ADDRESS / IFLA_BROADCAST
> 
> Make sure do_setlink() calls validate_linkmsg() instead
> of letting its callers having this responsibility.
> 
> With help from Dmitry Vyukov, thanks a lot !
 ...
> Fixes: e7ed828f10bd ("netlink: support setting devgroup parameters")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 

Applied and queued up for -stable.

Re: [PATCH net-next] rtnetlink: validate attributes in do_setlink()

2018-06-05 Thread David Miller

From: Eric Dumazet 
Date: Tue, 5 Jun 2018 09:42:54 -0700

> On 06/05/2018 09:41 AM, David Miller wrote:
>> From: Eric Dumazet 
>> Date: Tue,  5 Jun 2018 09:25:19 -0700
>> 
>>> It seems that rtnl_group_changelink() can call do_setlink
>>> while a prior call to validate_linkmsg(dev = NULL, ...) could
>>> not validate IFLA_ADDRESS / IFLA_BROADCAST
>>>
>>> Make sure do_setlink() calls validate_linkmsg() instead
>>> of letting its callers having this responsibility.
>> 
>> But now rtnl_newlink() will validate_linkmsg() twice
>> 
> 
> Yes, is it a problem ? That is hardly fast path :)

Not a problem, just making sure you were aware.

Re: [Patch net-next] netdev-FAQ: clarify DaveM's position for stable backports

2018-06-05 Thread Cong Wang

On Tue, Jun 5, 2018 at 6:43 AM, David Miller  wrote:
> From: Cong Wang 
> Date: Mon,  4 Jun 2018 11:07:19 -0700
>
>> +Q: Are all networking bug fixes backported to all stable releases?
>> +
>> +A: Due to capacity, Dave could only take care of the backports for the last
>> +   3 stable releases.
>
> As Greg stated, I only do 2 not 3.

Sure, will send v2.
We just need a number here. :)

Thanks!

Re: [PATCH net-next] rtnetlink: validate attributes in do_setlink()

2018-06-05 Thread Eric Dumazet




On 06/05/2018 09:41 AM, David Miller wrote:
> From: Eric Dumazet 
> Date: Tue,  5 Jun 2018 09:25:19 -0700
> 
>> It seems that rtnl_group_changelink() can call do_setlink
>> while a prior call to validate_linkmsg(dev = NULL, ...) could
>> not validate IFLA_ADDRESS / IFLA_BROADCAST
>>
>> Make sure do_setlink() calls validate_linkmsg() instead
>> of letting its callers having this responsibility.
> 
> But now rtnl_newlink() will validate_linkmsg() twice
> 

Yes, is it a problem ? That is hardly fast path :)

Re: [PATCH net-next] rtnetlink: validate attributes in do_setlink()

2018-06-05 Thread David Miller

From: Eric Dumazet 
Date: Tue,  5 Jun 2018 09:25:19 -0700

> It seems that rtnl_group_changelink() can call do_setlink
> while a prior call to validate_linkmsg(dev = NULL, ...) could
> not validate IFLA_ADDRESS / IFLA_BROADCAST
> 
> Make sure do_setlink() calls validate_linkmsg() instead
> of letting its callers having this responsibility.

But now rtnl_newlink() will validate_linkmsg() twice

pull-request: bpf-next 2018-06-05

2018-06-05 Thread Daniel Borkmann

Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add a new BPF hook for sendmsg similar to existing hooks for bind and
   connect: "This allows to override source IP (including the case when it's
   set via cmsg(3)) and destination IP:port for unconnected UDP (slow path).
   TCP and connected UDP (fast path) are not affected. This makes UDP support
   complete, that is, connected UDP is handled by connect hooks, unconnected
   by sendmsg ones.", from Andrey.

2) Rework of the AF_XDP API to allow extending it in future for type writer
   model if necessary. In this mode a memory window is passed to hardware
   and multiple frames might be filled into that window instead of just one
   that is the case in the current fixed frame-size model. With the new
   changes made this can be supported without having to add a new descriptor
   format. Also, core bits for the zero-copy support for AF_XDP have been
   merged as agreed upon, where i40e bits will be routed via Jeff later on.
   Various improvements to documentation and sample programs included as
   well, all from Björn and Magnus.

3) Given BPF's flexibility, a new program type has been added to implement
   infrared decoders. Quote: "The kernel IR decoders support the most
   widely used IR protocols, but there are many protocols which are not
   supported. [...] There is a 'long tail' of unsupported IR protocols,
   for which lircd is need to decode the IR. IR encoding is done in such
   a way that some simple circuit can decode it; therefore, BPF is ideal.
   [...] user-space can define a decoder in BPF, attach it to the rc
   device through the lirc chardev.", from Sean.

4) Several improvements and fixes to BPF core, among others, dumping map
   and prog IDs into fdinfo which is a straight forward way to correlate
   BPF objects used by applications, removing an indirect call and therefore
   retpoline in all map lookup/update/delete calls by invoking the callback
   directly for 64 bit archs, adding a new bpf_skb_cgroup_id() BPF helper
   for tc BPF programs to have an efficient way of looking up cgroup v2 id
   for policy or other use cases. Fixes to make sure we zero tunnel/xfrm
   state that hasn't been filled, to allow context access wrt pt_regs in
   32 bit archs for tracing, and last but not least various test cases
   for fixes that landed in bpf earlier, from Daniel.

5) Get rid of the ndo_xdp_flush API and extend the ndo_xdp_xmit with
   a XDP_XMIT_FLUSH flag instead which allows to avoid one indirect
   call as flushing is now merged directly into ndo_xdp_xmit(), from Jesper.

6) Add a new bpf_get_current_cgroup_id() helper that can be used in
   tracing to retrieve the cgroup id from the current process in order
   to allow for e.g. aggregation of container-level events, from Yonghong.

7) Two follow-up fixes for BTF to reject invalid input values and
   related to that also two test cases for BPF kselftests, from Martin.

8) Various API improvements to the bpf_fib_lookup() helper, that is,
   dropping MPLS bits which are not fully hashed out yet, rejecting
   invalid helper flags, returning error for unsupported address
   families as well as renaming flowlabel to flowinfo, from David.

9) Various fixes and improvements to sockmap BPF kselftests in particular
   in proper error detection and data verification, from Prashant.

10) Two arm32 BPF JIT improvements. One is to fix imm range check with
regards to whether immediate fits into 24 bits, and a naming cleanup
to get functions related to rsh handling consistent to those handling
lsh, from Wang.

11) Two compile warning fixes in BPF, one for BTF and a false positive
to silent gcc in stack_map_get_build_id_offset(), from Arnd.

12) Add missing seg6.h header into tools include infrastructure in order
to fix compilation of BPF kselftests, from Mathieu.

13) Several formatting cleanups in the BPF UAPI helper description that
also fix an error during rst2man compilation, from Quentin.

14) Hide an unused variable in sk_msg_convert_ctx_access() when IPv6 is
not built into the kernel, from Yue.

15) Remove a useless double assignment in dev_map_enqueue(), from Colin.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!



The following changes since commit 5b79c2af667c0e2684f2a6dbf6439074b78f490c:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2018-05-26 
19:46:15 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to 9fa06104a235f64d6a2bf3012cc9966e8e4be5eb:

  Merge branch 'bpf-af-xdp-zc-api' (2018-06-05 15:58:07 +0200)


Alexei Starovoitov (4):
  Merge branch 'btf-fixes'
  Merge branch '

Re: [PATCH net] l2tp: fix refcount leakage on PPPoL2TP sockets

2018-06-05 Thread Guillaume Nault

On Tue, Jun 05, 2018 at 09:41:24AM -0400, David Miller wrote:
> From: Guillaume Nault 
> Date: Mon, 4 Jun 2018 18:52:19 +0200
> 
> > Commit d02ba2a6110c ("l2tp: fix race in pppol2tp_release with session
> > object destroy") tried to fix a race condition where a PPPoL2TP socket
> > would disappear while the L2TP session was still using it. However, it
> > missed the root issue which is that an L2TP session may accept to be
> > reconnected if its associated socket has entered the release process.
> > 
> > The tentative fix makes the session hold the socket it is connected to.
> > That saves the kernel from crashing, but introduces refcount leakage,
> > preventing the socket from completing the release process. Once stalled,
> > everything the socket depends on can't be released anymore, including
> > the L2TP session and the l2tp_ppp module.
>  ...
> > So it all boils down to pppol2tp_connect() failing to realise that the
> > session has already been connected. This patch drops the unneeded extra
> > reference counting (mostly reverting d02ba2a6110c) and checks that
> > neither ->sk nor ->__sk is set before allowing a session to be
> > connected.
> > 
> > Fixes: d02ba2a6110c ("l2tp: fix race in pppol2tp_release with session 
> > object destroy")
> > Signed-off-by: Guillaume Nault 
> 
> So much fidgeting around in this area over the past year or two :-)
> 
Putting L2TP into production without adding custom workarounds has been
such a long journey, but we're almost there :-)

> Applied and queued up for -stable, thanks for fixing this.
> 
I still have a handful of issues to fix, though.

Re: [net-next PATCH v3 0/5] Symmetric queue selection using XPS for Rx queues

2018-06-05 Thread David Miller

From: Amritha Nambiar 
Date: Tue, 05 Jun 2018 01:37:45 -0700

> This patch series implements support for Tx queue selection based on
> Rx queue(s) map. This is done by configuring Rx queue(s) map per Tx-queue
> using sysfs attribute. If the user configuration for Rx queues does
> not apply, then the Tx queue selection falls back to XPS using CPUs and
> finally to hashing.
> 
> XPS is refactored to support Tx queue selection based on either the
> CPUs map or the Rx-queues map. The config option CONFIG_XPS needs to be
> enabled. By default no receive queues are configured for the Tx queue.
> 
> - /sys/class/net//queues/tx-*/xps_rxqs
> 
> A set of receive queues can be mapped to a set of transmit queues (many:many),
> although the common use case is a 1:1 mapping. This will enable sending
> packets on the same Tx-Rx queue pair as this is useful for busy polling
> multi-threaded workloads where it is not possible to pin the threads to
> a CPU. This is a rework of Sridhar's patch for symmetric queueing via
> socket option:
> https://www.spinics.net/lists/netdev/msg453106.html
 ...

Thanks for doing this work.

I think this needs more time to sit and get reviews, and therefore needs
to be deferred to the next merge window.

Re: [PATCH v2 net-next 0/3] devlink: Add extack messages for reload and port split/unsplit

2018-06-05 Thread David Miller

From: dsah...@kernel.org
Date: Tue,  5 Jun 2018 08:14:08 -0700

> From: David Ahern 
> 
> Patch 1 adds extack arg to reload, port_split and port_unsplit devlink
> operations.
> 
> Patch 2 adds extack messages for reload operation in netdevsim.
> 
> Patch 3 adds extack messages to port split/unsplit in mlxsw driver.
> 
> v2
> - make the extack messages align with existing dev_err

Series applied, thanks David.

Re: [PATCH net v2 2/2] ipmr: fix error path when ipmr_new_table fails

2018-06-05 Thread David Miller

From: Sabrina Dubroca 
Date: Tue,  5 Jun 2018 15:02:00 +0200

> commit 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
> refactored ipmr_new_table, so that it now returns NULL when
> mr_table_alloc fails. Unfortunately, all callers of ipmr_new_table
> expect an ERR_PTR.
> 
> This can result in NULL deref, for example when ipmr_rules_exit calls
> ipmr_free_table with NULL net->ipv4.mrt in the
> !CONFIG_IP_MROUTE_MULTIPLE_TABLES version.
> 
> This patch makes mr_table_alloc return errors, and changes
> ip6mr_new_table and its callers to return/expect error pointers as
> well. It also removes the version of mr_table_alloc defined under
> !CONFIG_IP_MROUTE_COMMON, since it is never used.
> 
> Fixes: 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
> Signed-off-by: Sabrina Dubroca 
> ---
> v2: - fixed brainfart that shadowed mrt variable in ip6_mroute_setsockopt
> - rebased on top of ip6_mroute_setsockopt fix

Applied and queued up for -stable.

Re: [PATCH v2 net-next] net: metrics: add proper netlink validation

2018-06-05 Thread David Miller

From: Eric Dumazet 
Date: Tue,  5 Jun 2018 06:06:19 -0700

> Before using nla_get_u32(), better make sure the attribute
> is of the proper size.
> 
> Code recently was changed, but bug has been there from beginning
> of git.
 ...
> Fixes: a919525ad832 ("net: Move fib_convert_metrics to metrics file")
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 
> Cc: David Ahern 
> ---
> v2: fixed a typo.

Applied and queued up for -stable, thanks Eric.

Re: [PATCH net v2 1/2] ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds

2018-06-05 Thread David Miller

From: Sabrina Dubroca 
Date: Tue,  5 Jun 2018 15:01:59 +0200

> Currently, raw6_sk(sk)->ip6mr_table is set unconditionally during
> ip6_mroute_setsockopt(MRT6_TABLE). A subsequent attempt at the same
> setsockopt will fail with -ENOENT, since we haven't actually created
> that table.
> 
> A similar fix for ipv4 was included in commit 5e1859fbcc3c ("ipv4: ipmr:
> various fixes and cleanups").
> 
> Fixes: d1db275dd3f6 ("ipv6: ip6mr: support multiple tables")
> Signed-off-by: Sabrina Dubroca 

Applied and queued up for -stable.

[PATCH net-next] rtnetlink: validate attributes in do_setlink()

2018-06-05 Thread Eric Dumazet

It seems that rtnl_group_changelink() can call do_setlink
while a prior call to validate_linkmsg(dev = NULL, ...) could
not validate IFLA_ADDRESS / IFLA_BROADCAST

Make sure do_setlink() calls validate_linkmsg() instead
of letting its callers having this responsibility.

With help from Dmitry Vyukov, thanks a lot !

BUG: KMSAN: uninit-value in is_valid_ether_addr include/linux/etherdevice.h:199 
[inline]
BUG: KMSAN: uninit-value in eth_prepare_mac_addr_change net/ethernet/eth.c:275 
[inline]
BUG: KMSAN: uninit-value in eth_mac_addr+0x203/0x2b0 net/ethernet/eth.c:308
CPU: 1 PID: 8695 Comm: syz-executor3 Not tainted 4.17.0-rc5+ #103
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:113
 kmsan_report+0x149/0x260 mm/kmsan/kmsan.c:1084
 __msan_warning_32+0x6e/0xc0 mm/kmsan/kmsan_instr.c:686
 is_valid_ether_addr include/linux/etherdevice.h:199 [inline]
 eth_prepare_mac_addr_change net/ethernet/eth.c:275 [inline]
 eth_mac_addr+0x203/0x2b0 net/ethernet/eth.c:308
 dev_set_mac_address+0x261/0x530 net/core/dev.c:7157
 do_setlink+0xbc3/0x5fc0 net/core/rtnetlink.c:2317
 rtnl_group_changelink net/core/rtnetlink.c:2824 [inline]
 rtnl_newlink+0x1fe9/0x37a0 net/core/rtnetlink.c:2976
 rtnetlink_rcv_msg+0xa32/0x1560 net/core/rtnetlink.c:4646
 netlink_rcv_skb+0x378/0x600 net/netlink/af_netlink.c:2448
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4664
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x1678/0x1750 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg net/socket.c:639 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2117
 __sys_sendmsg net/socket.c:2155 [inline]
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
 do_syscall_64+0x152/0x230 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x455a09
RSP: 002b:7fc07480ec68 EFLAGS: 0246 ORIG_RAX: 002e
RAX: ffda RBX: 7fc07480f6d4 RCX: 00455a09
RDX:  RSI: 23c0 RDI: 0014
RBP: 0072bea0 R08:  R09: 
R10:  R11: 0246 R12: 
R13: 05d0 R14: 006fdc20 R15: 

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
 kmsan_save_stack mm/kmsan/kmsan.c:294 [inline]
 kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:685
 kmsan_memcpy_origins+0x11d/0x170 mm/kmsan/kmsan.c:527
 __msan_memcpy+0x109/0x160 mm/kmsan/kmsan_instr.c:478
 do_setlink+0xb84/0x5fc0 net/core/rtnetlink.c:2315
 rtnl_group_changelink net/core/rtnetlink.c:2824 [inline]
 rtnl_newlink+0x1fe9/0x37a0 net/core/rtnetlink.c:2976
 rtnetlink_rcv_msg+0xa32/0x1560 net/core/rtnetlink.c:4646
 netlink_rcv_skb+0x378/0x600 net/netlink/af_netlink.c:2448
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4664
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x1678/0x1750 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg net/socket.c:639 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2117
 __sys_sendmsg net/socket.c:2155 [inline]
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
 do_syscall_64+0x152/0x230 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:189
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:315
 kmsan_slab_alloc+0x10/0x20 mm/kmsan/kmsan.c:322
 slab_post_alloc_hook mm/slab.h:446 [inline]
 slab_alloc_node mm/slub.c:2753 [inline]
 __kmalloc_node_track_caller+0xb32/0x11b0 mm/slub.c:4395
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cb/0x9e0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:988 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1182 [inline]
 netlink_sendmsg+0x76e/0x1350 net/netlink/af_netlink.c:1876
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg net/socket.c:639 [inline]
 ___sys_sendmsg+0xec0/0x1310 net/socket.c:2117
 __sys_sendmsg net/socket.c:2155 [inline]
 __do_sys_sendmsg net/socket.c:2164 [inline]
 __se_sys_sendmsg net/socket.c:2162 [inline]
 __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
 do_syscall_64+0x152/0x230 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: e7ed828f10bd ("netlink: support setting devgroup parameters")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Cc: Dmitry Vyukov 
---
 net/core/rtnetlink.c | 8 
 1 file changed

Re: suspicius csum initialization in vmxnet3_rx_csum

2018-06-05 Thread Paolo Abeni

Hi,

I'm sorry for the long delay in my answer, I've been travelling.

On Fri, 2018-06-01 at 11:10 -0700, Ronak Doshi wrote:
> On Thu, 31 May 2018, Neil Horman wrote:
> > What packet types will rcd.csum be set for?
> > Neil 
> 
> I looked thorugh the emulation code and found that rcd.csum is not set. 
> For valid v4/v6, TCP/UDP packets the code block above the mentioend "if" 
> block will be executed or else it will go through checksum none.
> 
> That's why I wanted to know (in previous emails) which ESX build is being 
> used while this was tested. The code block under "if (gdesc->rcd.csum)" 
> block might seem incorrect but it shouldn't be hit as rcd.csum is not set. 

I'm unsure if I read the above correctly. Do you mean that the relevant
code-path is never hit? If so, can we simply drop it, as we agreed that
such code is uncorrect? Elsewhere, could you plese specify under which
circumstances gdesc->rcd.csum is filled by the hypervisor?

> Hence, I asked did the fix provided by Paolo worked for the icmp test?

Unfortunatelly so far I've not been able to reproduce the issue outside
a production environment and I can't run test kernel there.

Thanks,

Paolo

Re: [RFC PATCH] kcm: hold rx mux lock when updating the receive queue.

2018-06-05 Thread Paolo Abeni

Hi,

On Tue, 2018-06-05 at 08:35 -0700, Tom Herbert wrote:
> On Tue, Jun 5, 2018 at 7:53 AM, David Miller  wrote:
> > From: Paolo Abeni 
> > Date: Tue,  5 Jun 2018 12:32:33 +0200
> >
> >> @@ -1157,7 +1158,9 @@ static int kcm_recvmsg(struct socket *sock, struct 
> >> msghdr *msg,
> >>   /* Finished with message */
> >>   msg->msg_flags |= MSG_EOR;
> >>   KCM_STATS_INCR(kcm->stats.rx_msgs);
> >> + spin_lock_bh(&kcm->mux->rx_lock);
> >>   skb_unlink(skb, &sk->sk_receive_queue);
> >> + spin_unlock_bh(&kcm->mux->rx_lock);
> >
> > Hmmm, maybe I don't understand the corruption.
> >
> > But, skb_unlink() takes the sk->sk_receive_queue.lock which should
> > prevent SKB list corruption.
> 
> It looks like there is a case where the list is being manipulated
> without the queue lock. That is in requeue_rx_msgs where
> __skb_dequeue is being called instead of skb_dequeue which is in
> requeue_rx_msgs. requeue_rx_msgs holds the mux rx_lock which would
> explain why the suggested patch avoids the issue.

Yep, I belive this is the correct explanation. Sorry for the noise with
the previous patch, I underlooked the skb_queue lock already in place.

> Paolo, thanks for looking into this! Can you try replacing
> __skb_dequeue in requeue_rx_msgs with skb_dequeue to see if that is
> the fix.

Sure, I'll retrigger the test, and report the result here (or directly
a new patch, should the test be succesful)

Thanks,

Paolo

Re: [PATCH v2 net-next 3/3] mlxsw: Add extack messages for port_{un,}split failures

2018-06-05 Thread Jiri Pirko

Tue, Jun 05, 2018 at 05:14:11PM CEST, dsah...@kernel.org wrote:
>From: David Ahern 
>
>Return messages in extack for port split/unsplit errors. e.g.,
>$ devlink port split swp1s1 count 4
>Error: mlxsw_spectrum: Port cannot be split further.
>devlink answers: Invalid argument
>
>$ devlink port unsplit swp4
>Error: mlxsw_spectrum: Port was not split.
>devlink answers: Invalid argument
>
>Signed-off-by: David Ahern 
>Reviewed-by: Ido Schimmel 

Acked-by: Jiri Pirko

[PATCH v2 net-next 1/3] devlink: Add extack to reload and port_{un,}split operations

2018-06-05 Thread dsahern

From: David Ahern 

Add extack argument to reload, port_split and port_unsplit operations.

Signed-off-by: David Ahern 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/core.c   |  9 ++---
 drivers/net/ethernet/netronome/nfp/nfp_devlink.c |  5 +++--
 drivers/net/netdevsim/devlink.c  |  3 ++-
 include/net/devlink.h|  7 ---
 net/core/devlink.c   | 18 ++
 5 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 8a766fe28fa0..7ed38d80bc08 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -770,7 +770,8 @@ static void mlxsw_core_driver_put(const char *kind)
 
 static int mlxsw_devlink_port_split(struct devlink *devlink,
unsigned int port_index,
-   unsigned int count)
+   unsigned int count,
+   struct netlink_ext_ack *extack)
 {
struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
 
@@ -782,7 +783,8 @@ static int mlxsw_devlink_port_split(struct devlink *devlink,
 }
 
 static int mlxsw_devlink_port_unsplit(struct devlink *devlink,
- unsigned int port_index)
+ unsigned int port_index,
+ struct netlink_ext_ack *extack)
 {
struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
 
@@ -963,7 +965,8 @@ mlxsw_devlink_sb_occ_tc_port_bind_get(struct devlink_port 
*devlink_port,
 pool_type, p_cur, p_max);
 }
 
-static int mlxsw_devlink_core_bus_device_reload(struct devlink *devlink)
+static int mlxsw_devlink_core_bus_device_reload(struct devlink *devlink,
+   struct netlink_ext_ack *extack)
 {
struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
int err;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c 
b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
index 71c2edd83031..db463e20a876 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_devlink.c
@@ -92,7 +92,7 @@ nfp_devlink_set_lanes(struct nfp_pf *pf, unsigned int idx, 
unsigned int lanes)
 
 static int
 nfp_devlink_port_split(struct devlink *devlink, unsigned int port_index,
-  unsigned int count)
+  unsigned int count, struct netlink_ext_ack *extack)
 {
struct nfp_pf *pf = devlink_priv(devlink);
struct nfp_eth_table_port eth_port;
@@ -123,7 +123,8 @@ nfp_devlink_port_split(struct devlink *devlink, unsigned 
int port_index,
 }
 
 static int
-nfp_devlink_port_unsplit(struct devlink *devlink, unsigned int port_index)
+nfp_devlink_port_unsplit(struct devlink *devlink, unsigned int port_index,
+struct netlink_ext_ack *extack)
 {
struct nfp_pf *pf = devlink_priv(devlink);
struct nfp_eth_table_port eth_port;
diff --git a/drivers/net/netdevsim/devlink.c b/drivers/net/netdevsim/devlink.c
index bef7db5d129a..e8366cf372ff 100644
--- a/drivers/net/netdevsim/devlink.c
+++ b/drivers/net/netdevsim/devlink.c
@@ -147,7 +147,8 @@ static int devlink_resources_register(struct devlink 
*devlink)
return err;
 }
 
-static int nsim_devlink_reload(struct devlink *devlink)
+static int nsim_devlink_reload(struct devlink *devlink,
+  struct netlink_ext_ack *extack)
 {
enum nsim_resource_id res_ids[] = {
NSIM_RESOURCE_IPV4_FIB, NSIM_RESOURCE_IPV4_FIB_RULES,
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 9686a1aa4ec9..e336ea9c73df 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -296,12 +296,13 @@ struct devlink_resource {
 #define DEVLINK_RESOURCE_ID_PARENT_TOP 0
 
 struct devlink_ops {
-   int (*reload)(struct devlink *devlink);
+   int (*reload)(struct devlink *devlink, struct netlink_ext_ack *extack);
int (*port_type_set)(struct devlink_port *devlink_port,
 enum devlink_port_type port_type);
int (*port_split)(struct devlink *devlink, unsigned int port_index,
- unsigned int count);
-   int (*port_unsplit)(struct devlink *devlink, unsigned int port_index);
+ unsigned int count, struct netlink_ext_ack *extack);
+   int (*port_unsplit)(struct devlink *devlink, unsigned int port_index,
+   struct netlink_ext_ack *extack);
int (*sb_pool_get)(struct devlink *devlink, unsigned int sb_index,
   u16 pool_index,
   struct devlink_sb_pool_info *pool_info);
diff --git a/net/core/devlink.c b/net/core/devlink.c
index f75ee022e6b2..2209

[PATCH v2 net-next 3/3] mlxsw: Add extack messages for port_{un,}split failures

2018-06-05 Thread dsahern

From: David Ahern 

Return messages in extack for port split/unsplit errors. e.g.,
$ devlink port split swp1s1 count 4
Error: mlxsw_spectrum: Port cannot be split further.
devlink answers: Invalid argument

$ devlink port unsplit swp4
Error: mlxsw_spectrum: Port was not split.
devlink answers: Invalid argument

Signed-off-by: David Ahern 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 14 ++
 drivers/net/ethernet/mellanox/mlxsw/core.h |  5 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 15 ---
 3 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 7ed38d80bc08..f9c724752a32 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -775,11 +775,14 @@ static int mlxsw_devlink_port_split(struct devlink 
*devlink,
 {
struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
 
-   if (port_index >= mlxsw_core->max_ports)
+   if (port_index >= mlxsw_core->max_ports) {
+   NL_SET_ERR_MSG_MOD(extack, "Port index exceeds maximum number 
of ports");
return -EINVAL;
+   }
if (!mlxsw_core->driver->port_split)
return -EOPNOTSUPP;
-   return mlxsw_core->driver->port_split(mlxsw_core, port_index, count);
+   return mlxsw_core->driver->port_split(mlxsw_core, port_index, count,
+ extack);
 }
 
 static int mlxsw_devlink_port_unsplit(struct devlink *devlink,
@@ -788,11 +791,14 @@ static int mlxsw_devlink_port_unsplit(struct devlink 
*devlink,
 {
struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
 
-   if (port_index >= mlxsw_core->max_ports)
+   if (port_index >= mlxsw_core->max_ports) {
+   NL_SET_ERR_MSG_MOD(extack, "Port index exceeds maximum number 
of ports");
return -EINVAL;
+   }
if (!mlxsw_core->driver->port_unsplit)
return -EOPNOTSUPP;
-   return mlxsw_core->driver->port_unsplit(mlxsw_core, port_index);
+   return mlxsw_core->driver->port_unsplit(mlxsw_core, port_index,
+   extack);
 }
 
 static int
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h 
b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 4a8d4c7f89d9..552cfa29c2f7 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -274,8 +274,9 @@ struct mlxsw_driver {
int (*port_type_set)(struct mlxsw_core *mlxsw_core, u8 local_port,
 enum devlink_port_type new_type);
int (*port_split)(struct mlxsw_core *mlxsw_core, u8 local_port,
- unsigned int count);
-   int (*port_unsplit)(struct mlxsw_core *mlxsw_core, u8 local_port);
+ unsigned int count, struct netlink_ext_ack *extack);
+   int (*port_unsplit)(struct mlxsw_core *mlxsw_core, u8 local_port,
+   struct netlink_ext_ack *extack);
int (*sb_pool_get)(struct mlxsw_core *mlxsw_core,
   unsigned int sb_index, u16 pool_index,
   struct devlink_sb_pool_info *pool_info);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index fc39f22e5c70..968b88af2ef5 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3092,7 +3092,8 @@ static void mlxsw_sp_port_unsplit_create(struct mlxsw_sp 
*mlxsw_sp,
 }
 
 static int mlxsw_sp_port_split(struct mlxsw_core *mlxsw_core, u8 local_port,
-  unsigned int count)
+  unsigned int count,
+  struct netlink_ext_ack *extack)
 {
struct mlxsw_sp *mlxsw_sp = mlxsw_core_driver_priv(mlxsw_core);
struct mlxsw_sp_port *mlxsw_sp_port;
@@ -3104,6 +3105,7 @@ static int mlxsw_sp_port_split(struct mlxsw_core 
*mlxsw_core, u8 local_port,
if (!mlxsw_sp_port) {
dev_err(mlxsw_sp->bus_info->dev, "Port number \"%d\" does not 
exist\n",
local_port);
+   NL_SET_ERR_MSG_MOD(extack, "Port number does not exist");
return -EINVAL;
}
 
@@ -3112,11 +3114,13 @@ static int mlxsw_sp_port_split(struct mlxsw_core 
*mlxsw_core, u8 local_port,
 
if (count != 2 && count != 4) {
netdev_err(mlxsw_sp_port->dev, "Port can only be split into 2 
or 4 ports\n");
+   NL_SET_ERR_MSG_MOD(extack, "Port can only be split into 2 or 4 
ports");
return -EINVAL;
}
 
if (cur_width != MLXSW_PORT_MODULE_MAX_WIDTH) {
netdev_err(mlxsw_sp_port->dev, "Port cannot be split 
further\n");
+   NL_SET_ERR_MSG_MOD(extack, "Port cannot be sp

[PATCH v2 net-next 0/3] devlink: Add extack messages for reload and port split/unsplit

2018-06-05 Thread dsahern

From: David Ahern 

Patch 1 adds extack arg to reload, port_split and port_unsplit devlink
operations.

Patch 2 adds extack messages for reload operation in netdevsim.

Patch 3 adds extack messages to port split/unsplit in mlxsw driver.

v2
- make the extack messages align with existing dev_err

David Ahern (3):
  devlink: Add extack to reload and port_{un,}split operations
  netdevsim: Add extack error message for devlink reload
  mlxsw: Add extack messages for port_{un,}split failures

 drivers/net/ethernet/mellanox/mlxsw/core.c   | 23 ---
 drivers/net/ethernet/mellanox/mlxsw/core.h   |  5 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c   | 15 ---
 drivers/net/ethernet/netronome/nfp/nfp_devlink.c |  5 +++--
 drivers/net/netdevsim/devlink.c  |  7 ---
 drivers/net/netdevsim/fib.c  |  9 ++---
 drivers/net/netdevsim/netdevsim.h|  3 ++-
 include/net/devlink.h|  7 ---
 net/core/devlink.c   | 18 ++
 9 files changed, 60 insertions(+), 32 deletions(-)

-- 
2.11.0

[PATCH v2 net-next 2/3] netdevsim: Add extack error message for devlink reload

2018-06-05 Thread dsahern

From: David Ahern 

devlink reset command can fail if a FIB resource limit is set to a value
lower than the current occupancy. Return a proper message indicating the
reason for the failure.

$ devlink resource sh netdevsim/netdevsim0
netdevsim/netdevsim0:
  name IPv4 size unlimited unit entry size_min 0 size_max unlimited size_gran 1 
dpipe_tables none
resources:
  name fib size unlimited occ 43 unit entry size_min 0 size_max unlimited 
size_gran 1 dpipe_tables none
  name fib-rules size unlimited occ 4 unit entry size_min 0 size_max 
unlimited size_gran 1 dpipe_tables none
  name IPv6 size unlimited unit entry size_min 0 size_max unlimited size_gran 1 
dpipe_tables none
resources:
  name fib size unlimited occ 54 unit entry size_min 0 size_max unlimited 
size_gran 1 dpipe_tables none
  name fib-rules size unlimited occ 3 unit entry size_min 0 size_max 
unlimited size_gran 1 dpipe_tables none

$ devlink resource set netdevsim/netdevsim0 path /IPv4/fib size 40

$ devlink dev  reload netdevsim/netdevsim0
Error: netdevsim: New size is less than current occupancy.
devlink answers: Invalid argument

Signed-off-by: David Ahern 
Acked-by: Jakub Kicinski 
---
 drivers/net/netdevsim/devlink.c   | 4 ++--
 drivers/net/netdevsim/fib.c   | 9 ++---
 drivers/net/netdevsim/netdevsim.h | 3 ++-
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/net/netdevsim/devlink.c b/drivers/net/netdevsim/devlink.c
index e8366cf372ff..ba663e5af168 100644
--- a/drivers/net/netdevsim/devlink.c
+++ b/drivers/net/netdevsim/devlink.c
@@ -163,7 +163,7 @@ static int nsim_devlink_reload(struct devlink *devlink,
 
err = devlink_resource_size_get(devlink, res_ids[i], &val);
if (!err) {
-   err = nsim_fib_set_max(net, res_ids[i], val);
+   err = nsim_fib_set_max(net, res_ids[i], val, extack);
if (err)
return err;
}
@@ -181,7 +181,7 @@ static void nsim_devlink_net_reset(struct net *net)
int i;
 
for (i = 0; i < ARRAY_SIZE(res_ids); ++i) {
-   if (nsim_fib_set_max(net, res_ids[i], (u64)-1)) {
+   if (nsim_fib_set_max(net, res_ids[i], (u64)-1, NULL)) {
pr_err("Failed to reset limit for resource %u\n",
   res_ids[i]);
}
diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c
index 9bfe9e151e13..f61d094746c0 100644
--- a/drivers/net/netdevsim/fib.c
+++ b/drivers/net/netdevsim/fib.c
@@ -64,7 +64,8 @@ u64 nsim_fib_get_val(struct net *net, enum nsim_resource_id 
res_id, bool max)
return max ? entry->max : entry->num;
 }
 
-int nsim_fib_set_max(struct net *net, enum nsim_resource_id res_id, u64 val)
+int nsim_fib_set_max(struct net *net, enum nsim_resource_id res_id, u64 val,
+struct netlink_ext_ack *extack)
 {
struct nsim_fib_data *fib_data = net_generic(net, nsim_fib_net_id);
struct nsim_fib_entry *entry;
@@ -90,10 +91,12 @@ int nsim_fib_set_max(struct net *net, enum nsim_resource_id 
res_id, u64 val)
/* not allowing a new max to be less than curren occupancy
 * --> no means of evicting entries
 */
-   if (val < entry->num)
+   if (val < entry->num) {
+   NL_SET_ERR_MSG_MOD(extack, "New size is less than current 
occupancy");
err = -EINVAL;
-   else
+   } else {
entry->max = val;
+   }
 
return err;
 }
diff --git a/drivers/net/netdevsim/netdevsim.h 
b/drivers/net/netdevsim/netdevsim.h
index 3a8581af3b85..8ca50b72c328 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -126,7 +126,8 @@ void nsim_devlink_exit(void);
 int nsim_fib_init(void);
 void nsim_fib_exit(void);
 u64 nsim_fib_get_val(struct net *net, enum nsim_resource_id res_id, bool max);
-int nsim_fib_set_max(struct net *net, enum nsim_resource_id res_id, u64 val);
+int nsim_fib_set_max(struct net *net, enum nsim_resource_id res_id, u64 val,
+struct netlink_ext_ack *extack);
 #else
 static inline int nsim_devlink_setup(struct netdevsim *ns)
 {
-- 
2.11.0

Re: [PATCH net-next 3/3] mlxsw: Add extack messages for port_{un,}split failures?

2018-06-05 Thread Jiri Pirko

Tue, Jun 05, 2018 at 04:58:44PM CEST, dsah...@gmail.com wrote:
>On 6/5/18 1:18 AM, Jiri Pirko wrote:
>> Tue, Jun 05, 2018 at 10:05:28AM CEST, ido...@idosch.org wrote:
>>> On Tue, Jun 05, 2018 at 09:52:30AM +0200, Jiri Pirko wrote:
 Tue, Jun 05, 2018 at 12:15:03AM CEST, dsah...@kernel.org wrote:
>   if (!mlxsw_sp_port->split) {
>   netdev_err(mlxsw_sp_port->dev, "Port wasn't split\n");
> + NL_SET_ERR_MSG_MOD(extack, "Port was not split");

 I wonder if we need the dmesg for these as well. Plus it is not the same
 (wasn't/was not) which is maybe confusing. Any objection against the
 original dmesg messages removal?
>>>
>>> We had this discussion about three months ago and decided to keep the
>>> existing messages:
>>> https://marc.info/?l=linux-netdev&m=151982813309466&w=2
>> 
>> I forgot. Thanks for reminding me. So could we at least have the
>> messages 100% same? Thanks.
>> 
>
>ok if I convert the current message to 'was not' and avoid the
>contraction in messages?

Sure.

Re: [PATCH net-next 3/3] mlxsw: Add extack messages for port_{un,}split failures?

2018-06-05 Thread David Miller

From: David Ahern 
Date: Tue, 5 Jun 2018 07:58:44 -0700

> On 6/5/18 1:18 AM, Jiri Pirko wrote:
>> Tue, Jun 05, 2018 at 10:05:28AM CEST, ido...@idosch.org wrote:
>>> On Tue, Jun 05, 2018 at 09:52:30AM +0200, Jiri Pirko wrote:
 Tue, Jun 05, 2018 at 12:15:03AM CEST, dsah...@kernel.org wrote:
>   if (!mlxsw_sp_port->split) {
>   netdev_err(mlxsw_sp_port->dev, "Port wasn't split\n");
> + NL_SET_ERR_MSG_MOD(extack, "Port was not split");

 I wonder if we need the dmesg for these as well. Plus it is not the same
 (wasn't/was not) which is maybe confusing. Any objection against the
 original dmesg messages removal?
>>>
>>> We had this discussion about three months ago and decided to keep the
>>> existing messages:
>>> https://marc.info/?l=linux-netdev&m=151982813309466&w=2
>> 
>> I forgot. Thanks for reminding me. So could we at least have the
>> messages 100% same? Thanks.
>> 
> 
> ok if I convert the current message to 'was not' and avoid the
> contraction in messages?

Sure.

Re: [PATCH net-next 3/3] mlxsw: Add extack messages for port_{un,}split failures?

2018-06-05 Thread David Ahern

On 6/5/18 1:18 AM, Jiri Pirko wrote:
> Tue, Jun 05, 2018 at 10:05:28AM CEST, ido...@idosch.org wrote:
>> On Tue, Jun 05, 2018 at 09:52:30AM +0200, Jiri Pirko wrote:
>>> Tue, Jun 05, 2018 at 12:15:03AM CEST, dsah...@kernel.org wrote:
if (!mlxsw_sp_port->split) {
netdev_err(mlxsw_sp_port->dev, "Port wasn't split\n");
 +  NL_SET_ERR_MSG_MOD(extack, "Port was not split");
>>>
>>> I wonder if we need the dmesg for these as well. Plus it is not the same
>>> (wasn't/was not) which is maybe confusing. Any objection against the
>>> original dmesg messages removal?
>>
>> We had this discussion about three months ago and decided to keep the
>> existing messages:
>> https://marc.info/?l=linux-netdev&m=151982813309466&w=2
> 
> I forgot. Thanks for reminding me. So could we at least have the
> messages 100% same? Thanks.
> 

ok if I convert the current message to 'was not' and avoid the
contraction in messages?

Re: Qualcomm rmnet driver and qmi_wwan

2018-06-05 Thread Dan Williams

On Tue, 2018-06-05 at 11:38 +0200, Daniele Palmas wrote:
> Hi,
> 
> 2018-02-21 20:47 GMT+01:00 Subash Abhinov Kasiviswanathan
> :
> > On 2018-02-21 04:38, Daniele Palmas wrote:
> > > 
> > > Hello,
> > > 
> > > in rmnet kernel documentation I read:
> > > 
> > > "This driver can be used to register onto any physical network
> > > device in
> > > IP mode. Physical transports include USB, HSIC, PCIe and IP
> > > accelerator."
> > > 
> > > Does this mean that it can be used in association with the
> > > qmi_wwan
> > > driver?
> > > 
> > > If yes, can someone give me an hint on the steps to follow?
> > > 
> > > If not, does anyone know if it is possible to modify qmi_wwan in
> > > order
> > > to take advantage of the features provided by the rmnet driver?
> > > 
> > > In this case hint on the changes for modifying qmi_wwan are
> > > welcome.
> > > 
> > > Thanks in advance,
> > > Daniele
> > 
> > 
> > Hi
> > 
> > I havent used qmi_wwan so the following comment is based on code
> > inspection.
> > qmimux_register_device() is creating qmimux devices with usb net
> > device as
> > real_dev. The Multiplexing and aggregation header (qmimux_hdr) is
> > stripped
> > off
> > in qmimux_rx_fixup() and the packet is passed on to stack.
> > 
> > You could instead create rmnet devices with the usb netdevice as
> > real dev.
> > The packets from the usb net driver can be queued to network stack
> > directly
> > as rmnet driver will setup a RX handler. rmnet driver will process
> > the
> > packets
> > further and then queue to network stack.
> > 
> 
> in kernel documentation I read that rmnet user space configuration is
> done through librmnetctl available at
> 
> https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource
> /dataservices/tree/rmnetctl
> 
> However it seems to me that this is a bit outdated (e.g. it does not
> properly build since it is looking for kernel header
> linux/rmnet_data.h that, as far as I understand, is no more present).
> 
> Is there available a more recent version of the tool?

I'd expect that somebody (Subash?) would add support for the
rmnet/qmimux options to iproute2 via 'ip link' like exists for most
other device types.

Dan

> Thanks,
> Daniele
> 
> > --
> > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> > a Linux Foundation Collaborative Project

Re: [RFC PATCH] kcm: hold rx mux lock when updating the receive queue.

2018-06-05 Thread David Miller

From: Paolo Abeni 
Date: Tue,  5 Jun 2018 12:32:33 +0200

> @@ -1157,7 +1158,9 @@ static int kcm_recvmsg(struct socket *sock, struct 
> msghdr *msg,
>   /* Finished with message */
>   msg->msg_flags |= MSG_EOR;
>   KCM_STATS_INCR(kcm->stats.rx_msgs);
> + spin_lock_bh(&kcm->mux->rx_lock);
>   skb_unlink(skb, &sk->sk_receive_queue);
> + spin_unlock_bh(&kcm->mux->rx_lock);

Hmmm, maybe I don't understand the corruption.

But, skb_unlink() takes the sk->sk_receive_queue.lock which should
prevent SKB list corruption.

[PATCH bpf-next v4 2/2] samples/bpf: Add xdp_sample_pkts example

2018-06-05 Thread Toke Høiland-Jørgensen

Add an example program showing how to sample packets from XDP using the
perf event buffer. The example userspace program just prints the ethernet
header for every packet sampled.

Signed-off-by: Toke Høiland-Jørgensen 
---
 samples/bpf/Makefile   |4 +
 samples/bpf/xdp_sample_pkts_kern.c |   62 +
 samples/bpf/xdp_sample_pkts_user.c |  176 
 3 files changed, 242 insertions(+)
 create mode 100644 samples/bpf/xdp_sample_pkts_kern.c
 create mode 100644 samples/bpf/xdp_sample_pkts_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1303af10e54d..9ea2f7b64869 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -52,6 +52,7 @@ hostprogs-y += xdp_adjust_tail
 hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
+hostprogs-y += xdp_sample_pkts
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -107,6 +108,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
 xdpsock-objs := bpf_load.o xdpsock_user.o
 xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
+xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -163,6 +165,7 @@ always += xdp_adjust_tail_kern.o
 always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
+always += xdp_sample_pkts_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -179,6 +182,7 @@ HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/
 HOSTCFLAGS_task_fd_query_user.o += -I$(srctree)/tools/lib/bpf/
+HOSTCFLAGS_xdp_sample_pkts_user.o += -I$(srctree)/tools/lib/bpf/
 
 HOST_LOADLIBES += $(LIBBPF) -lelf
 HOSTLOADLIBES_tracex4  += -lrt
diff --git a/samples/bpf/xdp_sample_pkts_kern.c 
b/samples/bpf/xdp_sample_pkts_kern.c
new file mode 100644
index ..4560522ca015
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_kern.c
@@ -0,0 +1,62 @@
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+#define SAMPLE_SIZE 64ul
+#define MAX_CPUS 24
+
+#define bpf_printk(fmt, ...)   \
+({ \
+  char fmt[] = fmt;\
+  bpf_trace_printk(fmt, sizeof(fmt),   \
+   ##__VA_ARGS__); \
+})
+
+struct bpf_map_def SEC("maps") my_map = {
+   .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(u32),
+   .max_entries = MAX_CPUS,
+};
+
+SEC("xdp_sample")
+int xdp_sample_prog(struct xdp_md *ctx)
+{
+   void *data_end = (void *)(long)ctx->data_end;
+   void *data = (void *)(long)ctx->data;
+
+/* Metadata will be in the perf event before the packet data. */
+   struct S {
+   u16 cookie;
+   u16 pkt_len;
+   } __attribute__((packed)) metadata;
+
+   if (data + SAMPLE_SIZE < data_end) {
+   /* The XDP perf_event_output handler will use the upper 32 bits
+* of the flags argument as a number of bytes to include of the
+* packet payload in the event data. If the size is too big, the
+* call to bpf_perf_event_output will fail and return -EFAULT.
+*
+* See bpf_xdp_event_output in net/core/filter.c.
+*
+* The BPF_F_CURRENT_CPU flag means that the event output fd
+* will be indexed by the CPU number in the event map.
+*/
+   u64 flags = (SAMPLE_SIZE << 32) | BPF_F_CURRENT_CPU;
+   int ret;
+
+   metadata.cookie = 0xdead;
+   metadata.pkt_len = (u16)(data_end - data);
+
+   ret = bpf_perf_event_output(ctx, &my_map, flags,
+ &metadata, sizeof(metadata));
+   if(ret)
+   bpf_printk("perf_event_output failed: %d\n", ret);
+   }
+
+   return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/xdp_sample_pkts_user.c 
b/samples/bpf/xdp_sample_pkts_user.c
new file mode 100644
index ..672392d48ce3
--- /dev/null
+++ b/samples/bpf/xdp_sample_pkts_user.c
@@ -0,0 +1,176 @@
+/* This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#inc

[PATCH bpf-next v4 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events

2018-06-05 Thread Toke Høiland-Jørgensen

Add two new helper functions to trace_helpers that supports polling
multiple perf file descriptors for events. These are used to the XDP
perf_event_output example, which needs to work with one perf fd per CPU.

Signed-off-by: Toke Høiland-Jørgensen 
---
 tools/testing/selftests/bpf/trace_helpers.c |   47 ++-
 tools/testing/selftests/bpf/trace_helpers.h |4 ++
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/trace_helpers.c 
b/tools/testing/selftests/bpf/trace_helpers.c
index 3868dcb63420..1e62d89f34cf 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -88,7 +88,7 @@ static int page_size;
 static int page_cnt = 8;
 static struct perf_event_mmap_page *header;
 
-int perf_event_mmap(int fd)
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header)
 {
void *base;
int mmap_size;
@@ -102,10 +102,15 @@ int perf_event_mmap(int fd)
return -1;
}
 
-   header = base;
+   *header = base;
return 0;
 }
 
+int perf_event_mmap(int fd)
+{
+   return perf_event_mmap_header(fd, &header);
+}
+
 static int perf_event_poll(int fd)
 {
struct pollfd pfd = { .fd = fd, .events = POLLIN };
@@ -163,3 +168,41 @@ int perf_event_poller(int fd, perf_event_print_fn 
output_fn)
 
return ret;
 }
+
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+   int num_fds, perf_event_print_fn output_fn)
+{
+   enum bpf_perf_event_ret ret;
+   struct pollfd *pfds;
+   void *buf = NULL;
+   size_t len = 0;
+   int i;
+
+   pfds = malloc(sizeof(*pfds) * num_fds);
+   if (!pfds)
+   return -1;
+
+   memset(pfds, 0, sizeof(*pfds) * num_fds);
+   for (i = 0; i < num_fds; i++) {
+   pfds[i].fd = fds[i];
+   pfds[i].events = POLLIN;
+   }
+
+   for (;;) {
+   poll(pfds, num_fds, 1000);
+   for (i = 0; i < num_fds; i++) {
+   if (pfds[i].revents) {
+   ret = bpf_perf_event_read_simple(headers[i], 
page_cnt * page_size,
+   page_size, 
&buf, &len,
+   
bpf_perf_event_print,
+   output_fn);
+   if (ret != LIBBPF_PERF_EVENT_CONT)
+   break;
+   }
+   }
+   }
+   free(buf);
+   free(pfds);
+
+   return ret;
+}
diff --git a/tools/testing/selftests/bpf/trace_helpers.h 
b/tools/testing/selftests/bpf/trace_helpers.h
index 3b4bcf7f5084..18924f23db1b 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -3,6 +3,7 @@
 #define __TRACE_HELPER_H
 
 #include 
+#include 
 
 struct ksym {
long addr;
@@ -16,6 +17,9 @@ long ksym_get_addr(const char *name);
 typedef enum bpf_perf_event_ret (*perf_event_print_fn)(void *data, int size);
 
 int perf_event_mmap(int fd);
+int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header);
 /* return LIBBPF_PERF_EVENT_DONE or LIBBPF_PERF_EVENT_ERROR */
 int perf_event_poller(int fd, perf_event_print_fn output_fn);
+int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
+   int num_fds, perf_event_print_fn output_fn);
 #endif

Re: [PATCH] net-tcp: remove useless tw_timeout field

2018-06-05 Thread David Miller

From: Eric Dumazet 
Date: Tue, 5 Jun 2018 05:59:27 -0700

> On 06/05/2018 03:07 AM, Maciej Żenczykowski wrote:
>> From: Maciej Żenczykowski 
>> 
>> Tested: 'git grep tw_timeout' comes up empty and it builds :-)
>> 
>> Signed-off-by: Maciej Żenczykowski 
>> Cc: Eric Dumazet 
> 
> This field became no longer needed when tcp_tw_recycle was removed in 
> linux-4.12
> 
> Signed-off-by: Eric Dumazet 

Applied, thanks everyone.

Re: [PATCH bpf-next v3 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events

2018-06-05 Thread Toke Høiland-Jørgensen

Daniel Borkmann  writes:

> Hi Toke,
>
> On 06/05/2018 01:14 PM, Toke Høiland-Jørgensen wrote:
>> Signed-off-by: Toke Høiland-Jørgensen 
>
> Please no empty commit message. Not sure why from the previous patch
> you removed it here.

Ah, right, sorry; think I got patch versions mixed up :/

Will resend

-Toke

Re: [PATCH net] net: sched: cls: Fix offloading when ingress dev is vxlan

2018-06-05 Thread David Miller

From: Paul Blakey 
Date: Tue,  5 Jun 2018 11:04:03 +0300

> When using a vxlan device as the ingress dev, we count it as a
> "no offload dev", so when such a rule comes and err stop is true,
> we fail early and don't try the egdev route which can offload it
> through the egress device.
> 
> Fix that by not calling the block offload if one of the devices
> attached to it is not offload capable, but make sure egress on such case
> is capable instead.
> 
> Fixes: caa7260156eb ("net: sched: keep track of offloaded filters [..]")
> Reviewed-by: Roi Dayan 
> Acked-by: Jiri Pirko 
> Signed-off-by: Paul Blakey 

Applied and queued up for -stable, thanks.

Re: [PATCH net-next 3/3] mlxsw: Add extack messages for port_{un,}split failures?

2018-06-05 Thread David Miller

From: Jiri Pirko 
Date: Tue, 5 Jun 2018 10:18:36 +0200

> Tue, Jun 05, 2018 at 10:05:28AM CEST, ido...@idosch.org wrote:
>>On Tue, Jun 05, 2018 at 09:52:30AM +0200, Jiri Pirko wrote:
>>> Tue, Jun 05, 2018 at 12:15:03AM CEST, dsah...@kernel.org wrote:
>>> >   if (!mlxsw_sp_port->split) {
>>> >   netdev_err(mlxsw_sp_port->dev, "Port wasn't split\n");
>>> >+  NL_SET_ERR_MSG_MOD(extack, "Port was not split");
>>> 
>>> I wonder if we need the dmesg for these as well. Plus it is not the same
>>> (wasn't/was not) which is maybe confusing. Any objection against the
>>> original dmesg messages removal?
>>
>>We had this discussion about three months ago and decided to keep the
>>existing messages:
>>https://marc.info/?l=linux-netdev&m=151982813309466&w=2
> 
> I forgot. Thanks for reminding me. So could we at least have the
> messages 100% same? Thanks.

Seems like a reasonable request, David A.?

Re: [PATCH net] sctp: not allow transport timeout value less than HZ/5 for hb_timer

2018-06-05 Thread David Miller

From: Xin Long 
Date: Tue,  5 Jun 2018 12:16:58 +0800

> syzbot reported a rcu_sched self-detected stall on CPU which is caused
> by too small value set on rto_min with SCTP_RTOINFO sockopt. With this
> value, hb_timer will get stuck there, as in its timer handler it starts
> this timer again with this value, then goes to the timer handler again.
> 
> This problem is there since very beginning, and thanks to Eric for the
> reproducer shared from a syzbot mail.
> 
> This patch fixes it by not allowing sctp_transport_timeout to return a
> smaller value than HZ/5 for hb_timer, which is based on TCP's min rto.
> 
> Note that it doesn't fix this issue by limiting rto_min, as some users
> are still using small rto and no proper value was found for it yet.
> 
> Reported-by: syzbot+3dcd59a1f907245f8...@syzkaller.appspotmail.com
> Suggested-by: Marcelo Ricardo Leitner 
> Signed-off-by: Xin Long 

Applied and queued up for -stable, thanks Xin.

Re: [PATCH net-next] net/mlx5e: fix error return code in mlx5e_alloc_rq()

2018-06-05 Thread David Miller

From: Wei Yongjun 
Date: Tue, 5 Jun 2018 02:42:56 +

> Fix to return error code -ENOMEM from the kvzalloc_node() error handling
> case instead of 0, as done elsewhere in this function.
> 
> Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory 
> scheme")
> Signed-off-by: Wei Yongjun 

Applied.

Re: [PATCH net-next] net/mlx5e: Make function mlx5e_change_rep_mtu() static

2018-06-05 Thread David Miller

From: Wei Yongjun 
Date: Tue, 5 Jun 2018 02:42:45 +

> Fixes the following sparse warning:
> 
> drivers/net/ethernet/mellanox/mlx5/core/en_rep.c:903:5: warning:
>  symbol 'mlx5e_change_rep_mtu' was not declared. Should it be static?
> 
> Signed-off-by: Wei Yongjun 

Applied.

Re: [PATCH net-next v2] net: qualcomm: rmnet: Fix use after free while sending command ack

2018-06-05 Thread David Miller

From: Subash Abhinov Kasiviswanathan 
Date: Mon,  4 Jun 2018 19:43:38 -0600

> When sending an ack to a command packet, the skb is still referenced
> after it is sent to the real device. Since the real device could
> free the skb, the device pointer would be invalid.
> Also, remove an unnecessary variable.
> 
> Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial 
> implementation")
> Signed-off-by: Subash Abhinov Kasiviswanathan 

Applied.

Re: [PATCH net-next v2] net: ipv6: Generate random IID for addresses on RAWIP devices

2018-06-05 Thread David Miller

From: Subash Abhinov Kasiviswanathan 
Date: Mon,  4 Jun 2018 19:26:07 -0600

> RAWIP devices such as rmnet do not have a hardware address and
> instead require the kernel to generate a random IID for the
> IPv6 addresses.
> 
> Signed-off-by: Sean Tranchetti 
> Signed-off-by: Subash Abhinov Kasiviswanathan 

Applied.

bpf-next is CLOSED

2018-06-05 Thread Daniel Borkmann

Please only submit bug fixes at this time due to merge window, thank you.

Re: [PATCH bpf-next v3 1/2] trace_helpers.c: Add helpers to poll multiple perf FDs for events

2018-06-05 Thread Daniel Borkmann

Hi Toke,

On 06/05/2018 01:14 PM, Toke Høiland-Jørgensen wrote:
> Signed-off-by: Toke Høiland-Jørgensen 

Please no empty commit message. Not sure why from the previous patch
you removed it here.

> ---
>  tools/testing/selftests/bpf/trace_helpers.c |   47 
> ++-
>  tools/testing/selftests/bpf/trace_helpers.h |4 ++
>  2 files changed, 49 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/trace_helpers.c 
> b/tools/testing/selftests/bpf/trace_helpers.c
> index 3868dcb63420..1e62d89f34cf 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.c
> +++ b/tools/testing/selftests/bpf/trace_helpers.c
> @@ -88,7 +88,7 @@ static int page_size;
>  static int page_cnt = 8;
>  static struct perf_event_mmap_page *header;
>  
> -int perf_event_mmap(int fd)
> +int perf_event_mmap_header(int fd, struct perf_event_mmap_page **header)
>  {
>   void *base;
>   int mmap_size;
> @@ -102,10 +102,15 @@ int perf_event_mmap(int fd)
>   return -1;
>   }
>  
> - header = base;
> + *header = base;
>   return 0;
>  }
>  
> +int perf_event_mmap(int fd)
> +{
> + return perf_event_mmap_header(fd, &header);
> +}
> +
>  static int perf_event_poll(int fd)
>  {
>   struct pollfd pfd = { .fd = fd, .events = POLLIN };
> @@ -163,3 +168,41 @@ int perf_event_poller(int fd, perf_event_print_fn 
> output_fn)
>  
>   return ret;
>  }
> +
> +int perf_event_poller_multi(int *fds, struct perf_event_mmap_page **headers,
> + int num_fds, perf_event_print_fn output_fn)
> +{
> + enum bpf_perf_event_ret ret;
> + struct pollfd *pfds;
> + void *buf = NULL;
> + size_t len = 0;
> + int i;
> +
> + pfds = malloc(sizeof(*pfds) * num_fds);
> + if (!pfds)
> + return -1;

Also, just noticed here you mix -1 as return code with LIBBPF_* return
codes. Would be better not not overlap such usage.

> + memset(pfds, 0, sizeof(*pfds) * num_fds);
> + for (i = 0; i < num_fds; i++) {
> + pfds[i].fd = fds[i];
> + pfds[i].events = POLLIN;
> + }
> +
> + for (;;) {
> + poll(pfds, num_fds, 1000);
> + for (i = 0; i < num_fds; i++) {
> + if (pfds[i].revents) {
> + ret = bpf_perf_event_read_simple(headers[i], 
> page_cnt * page_size,
> + page_size, 
> &buf, &len,
> + 
> bpf_perf_event_print,
> + output_fn);
> + if (ret != LIBBPF_PERF_EVENT_CONT)
> + break;
> + }
> + }
> + }
> + free(buf);
> + free(pfds);
> +
> + return ret;

Thanks,
Daniel

Re: AF_XDP. Was: [net-next 00/12][pull request] Intel Wired LAN Driver Updates 2018-06-04

2018-06-05 Thread Daniel Borkmann

On 06/05/2018 10:44 AM, Björn Töpel wrote:
> Den tis 5 juni 2018 kl 03:46 skrev Alexander Duyck 
> :
>> On Mon, Jun 4, 2018 at 4:32 PM, Alexei Starovoitov
>>  wrote:
>>> On Mon, Jun 04, 2018 at 03:02:31PM -0700, Alexander Duyck wrote:
 On Mon, Jun 4, 2018 at 2:27 PM, David Miller  wrote:
> From: Or Gerlitz 
> Date: Tue, 5 Jun 2018 00:11:35 +0300
>
>> Just to make sure, is the AF_XDP ZC (Zero Copy) UAPI going to be
>> merged for this window -- AFAIU from [1], it's still under
>> examination/development/research for non Intel HWs, am I correct or
>> this is going to get in now?
>
> All of the pending AF_XDP changes will be merged this merge window.
>
> I think Intel folks need to review things as fast as possible because
> I pretty much refuse to revert the series or disable it in Kconfig at
> this point.
>
> Thank you.

 My understanding of things is that the current AF_XDP patches were
 going to be updated to have more of a model agnostic API such that
 they would work for either the "typewriter" mode or the descriptor
 ring based approach. The current plan was to have the zero copy
 patches be a follow-on after the vendor agnostic API bits in the
 descriptors and such had been sorted out. I believe you guys have the
 descriptor fixes already right?

 In my opinion the i40e code isn't mature enough yet to really go into
 anything other than maybe net-next in a couple weeks. We are going to
 need a while to get adequate testing in order to flush out all the
 bugs and performance regressions we are likely to see coming out of
 this change.
>>>
>>> I think the work everyone did in this release cycle increased my confidence
>>> that the way descriptors are defined and the rest of uapi are stable enough
>>> and i40e zero copy bits can land in the next release without uapi changes.
>>> In that sense even if we merge i40e parts now, the other nic vendors
>>> will be in the same situation and may find things that they would like
>>> to improve in uapi.
>>> So I propose we merge the first 7 patches of the last series now and
>>> let 3 remaining i40e patches go via intel trees for the next release.
>>> In the mean time other NIC vendors should start actively working
>>> on AF_XDP support as well.
>>> If somehow uapi would need tweaks, we can still do minor adjustments
>>> since 4.18 won't be released for ~10 weeks.
>>
>> That works for me. Actually I think patch 11 can probably be included
>> as well since that is just sample code and could probably be used by
>> whatever drivers end up implementing this.
> 
> The approach suggested by Alexei and Alex sounds good to us. Alex's
> review items are very much valid, and require more time to address.
> Therefore addressing i40e in the next merge windows sounds like a
> great idea.
> 
> As Alex suggests, including patch 11 together with the first seven makes 
> sense.

Ok with it as well, and I've pushed just that, thanks everyone!

Re: [PATCH net-next] tcp: refactor tcp_ecn_check_ce to remove sk type cast

2018-06-05 Thread David Miller

From: Yousuk Seung 
Date: Mon,  4 Jun 2018 15:29:51 -0700

> Refactor tcp_ecn_check_ce and __tcp_ecn_check_ce to accept struct sock*
> instead of tcp_sock* to clean up type casts. This is a pure refactor
> patch.
> 
> Signed-off-by: Yousuk Seung 
> Signed-off-by: Neal Cardwell 
> Signed-off-by: Yuchung Cheng 
> Signed-off-by: Eric Dumazet 
> Acked-by: Soheil Hassas Yeganeh 

Applied.

1 2 >

1 - 100 of 154 matches

Mail list logo