VRF: Ingress IPv6 Linklocal/Multicast destined pkt from slave VRF device does not map to Master device socket

2018-04-23 Thread Sukumar Gopalakrishnan
VRF: Ingress IPv6 Linklocal/Multicast pkt from slave VRF device does
not map to Master device socket.

KERNEL VERSION:

4.14.28

BUG REPORT:

https://bugzilla.kernel.org/show_bug.cgi?id=199409

CONFIGURATION  AND PROBLEM ROOT CAUSE:


1) Created VRF device(Vrf_258) and enslaved network device(v1_F4246) to this
VRF.

/exos/bin # ip link show v1_F4246
54: v1_F4246:  mtu 1500 qdisc noqueue
master vrf_258 state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 00:04:96:98:c9:18 brd ff:ff:ff:ff:ff:ff

/exos/bin # ip link show vrf_258
14: vrf_258:  mtu 65536 qdisc noqueue state
UP mode DEFAULT group default qlen 1000
link/ether 00:04:96:98:c9:18 brd ff:ff:ff:ff:ff:ff


2) Opened PIM protocol raw socket for AF_INET6 family

pim_socket = socket(AF_INET6, SOCK_RAW , IPPROTO_PIM )


3) PIM user daemon process per VRF so opened RX socket SO_BINDTODEVICE
to VRF_258 netdevice.
PIM control packets ingressing any slave devices belongs to this
master VRF device should be sent to this socket.


4) Ingressing PIM hello control packets which is having SrcIP =
fe80::204:96ff:fe98:c918 (IPv6 Link-local) and DestIP = ff02::0d
(Multicast pkt)
does not mapped to vrf_258 bounded socket and gets dropped in socket
lookup function.


5)  inet6_iif() is returning v1_F4246's ifindex 54 and inet6_sdif()
returns value zero.

__raw_v6_lookup(net, sk, nexthdr, daddr, saddr, inet6_iif(skb),
inet6_sdif(skb));


sk->sk_bound_dev_if is having vrf_258(ifIndex value 14)  but dif(value
54) and sdif(value 0) does not match this socket hence socket not
found.

struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
unsigned short num, const struct in6_addr *loc_addr,
const struct in6_addr *rmt_addr, int dif, int sdif) {

..
if (sk->sk_bound_dev_if &&
sk->sk_bound_dev_if != dif &&
sk->sk_bound_dev_if != sdif)
..



}


6) This problem is seen for Raw, Udp and TCP socket look up function
for IPv6 packets destined to linklocal or multicast address.

7) This issue do not occur for all types of IPV4 address and IPv6
unicast global address.



TEMP FIX:
=

Get master device address from (skb->dev) and  pass master  to socket
lookup up function for Ipv6 Linklocal/Multicast address.

ipv6_raw_deliver()
{
int mdif;
..
..
mdif = (((nexthdr == IPPROTO_PIM || nexthdr == 89 /* IPPROTO_OSPF */ ||
nexthdr == IPPROTO_ICMPV6 || nexthdr == 112 /*IPPROTO_VRRP*/) &&
(ipv6_addr_type(daddr) &
(IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL))) ?
l3mdev_master_ifindex_rcu(skb->dev) : inet6_iif(skb));


sk = __raw_v6_lookup(net, sk, nexthdr, daddr, saddr, mdif,
inet6_sdif(skb));

...
..
}


Regards,
Sukumar


Re: [PATCH net] sfc: ARFS filter IDs

2018-04-23 Thread kbuild test robot
Hi Edward,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:
https://github.com/0day-ci/linux/commits/Edward-Cree/sfc-ARFS-filter-IDs/20180424-080737
config: i386-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/sfc/farch.c: In function 
'efx_farch_filter_rfs_expire_one':
>> drivers/net/ethernet/sfc/farch.c:2938:7: warning: 'rule' may be used 
>> uninitialized in this function [-Wmaybe-uninitialized]
   if (rule)
  ^

coccinelle warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/sfc/efx.c:3032:1-20: alloc with no test, possible model 
>> on line 3041
   drivers/net/ethernet/sfc/efx.c:3032:1-20: alloc with no test, possible model 
on line 3062

vim +/rule +2938 drivers/net/ethernet/sfc/farch.c

  2902  
  2903  bool efx_farch_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
  2904   unsigned int index)
  2905  {
  2906  struct efx_farch_filter_state *state = efx->filter_state;
  2907  struct efx_farch_filter_table *table;
  2908  bool ret = false, force = false;
  2909  u16 arfs_id;
  2910  
  2911  down_write(>lock);
  2912  spin_lock_bh(>rps_hash_lock);
  2913  table = >table[EFX_FARCH_FILTER_TABLE_RX_IP];
  2914  if (test_bit(index, table->used_bitmap) &&
  2915  table->spec[index].priority == EFX_FILTER_PRI_HINT) {
  2916  struct efx_filter_spec spec;
  2917  struct efx_arfs_rule *rule;
  2918  
  2919  efx_farch_filter_to_gen_spec(, 
>spec[index]);
  2920  if (!efx->rps_hash_table) {
  2921  /* In the absence of the table, we always 
returned 0 to
  2922   * ARFS, so use the same to query it.
  2923   */
  2924  arfs_id = 0;
  2925  } else {
  2926  rule = efx_rps_hash_find(efx, );
  2927  if (!rule) {
  2928  /* ARFS table doesn't know of this 
filter, remove it */
  2929  force = true;
  2930  } else {
  2931  arfs_id = rule->arfs_id;
  2932  if (!efx_rps_check_rule(rule, index, 
))
  2933  goto out_unlock;
  2934  }
  2935  }
  2936  if (force || rps_may_expire_flow(efx->net_dev, 
spec.dmaq_id,
  2937   flow_id, arfs_id)) {
> 2938  if (rule)
  2939  rule->filter_id = 
EFX_ARFS_FILTER_ID_REMOVING;
  2940  efx_rps_hash_del(efx, );
  2941  efx_farch_filter_table_clear_entry(efx, table, 
index);
  2942  ret = true;
  2943  }
  2944  }
  2945  out_unlock:
  2946  spin_unlock_bh(>rps_hash_lock);
  2947  up_write(>lock);
  2948  return ret;
  2949  }
  2950  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [Patch nf] ipvs: initialize tbl->entries in ip_vs_lblc_init_svc()

2018-04-23 Thread Julian Anastasov

Hello,

On Mon, 23 Apr 2018, Cong Wang wrote:

> Similarly, tbl->entries is not initialized after kmalloc(),
> therefore causes an uninit-value warning in ip_vs_lblc_check_expire(),
> as reported by syzbot.
> 
> Reported-by: 
> Cc: Simon Horman 
> Cc: Julian Anastasov 
> Cc: Pablo Neira Ayuso 
> Signed-off-by: Cong Wang 

Thanks!

Acked-by: Julian Anastasov 

> ---
>  net/netfilter/ipvs/ip_vs_lblc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
> index 3057e453bf31..83918119ceb8 100644
> --- a/net/netfilter/ipvs/ip_vs_lblc.c
> +++ b/net/netfilter/ipvs/ip_vs_lblc.c
> @@ -371,6 +371,7 @@ static int ip_vs_lblc_init_svc(struct ip_vs_service *svc)
>   tbl->counter = 1;
>   tbl->dead = false;
>   tbl->svc = svc;
> + atomic_set(>entries, 0);
>  
>   /*
>*Hook periodic timer for garbage collection
> -- 
> 2.13.0

Regards

--
Julian Anastasov 


Re: [Patch nf] ipvs: initialize tbl->entries after allocation

2018-04-23 Thread Julian Anastasov

Hello,

On Mon, 23 Apr 2018, Cong Wang wrote:

> tbl->entries is not initialized after kmalloc(), therefore
> causes an uninit-value warning in ip_vs_lblc_check_expire()
> as reported by syzbot.
> 
> Reported-by: 
> Cc: Simon Horman 
> Cc: Julian Anastasov 
> Cc: Pablo Neira Ayuso 
> Signed-off-by: Cong Wang 

Thanks!

Acked-by: Julian Anastasov 

> ---
>  net/netfilter/ipvs/ip_vs_lblcr.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c 
> b/net/netfilter/ipvs/ip_vs_lblcr.c
> index 92adc04557ed..bc2bc5eebcb8 100644
> --- a/net/netfilter/ipvs/ip_vs_lblcr.c
> +++ b/net/netfilter/ipvs/ip_vs_lblcr.c
> @@ -534,6 +534,7 @@ static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc)
>   tbl->counter = 1;
>   tbl->dead = false;
>   tbl->svc = svc;
> + atomic_set(>entries, 0);
>  
>   /*
>*Hook periodic timer for garbage collection
> -- 
> 2.13.0

Regards

--
Julian Anastasov 


Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Stephen Hemminger
On Tue, 24 Apr 2018 04:42:22 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Apr 23, 2018 at 06:25:03PM -0700, Stephen Hemminger wrote:
> > On Mon, 23 Apr 2018 12:44:39 -0700
> > Siwei Liu  wrote:
> >   
> > > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin  
> > > wrote:  
> > > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
> > > >> On Mon, 23 Apr 2018 20:24:56 +0300
> > > >> "Michael S. Tsirkin"  wrote:
> > > >>
> > > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:   
> > > >> >  
> > > >> > > > >
> > > >> > > > >I will NAK patches to change to common code for netvsc 
> > > >> > > > >especially the
> > > >> > > > >three device model.  MS worked hard with distro vendors to 
> > > >> > > > >support transparent
> > > >> > > > >mode, ans we really can't have a new model; or do backport.
> > > >> > > > >
> > > >> > > > >Plus, DPDK is now dependent on existing model.
> > > >> > > >
> > > >> > > > Sorry, but nobody here cares about dpdk or other similar 
> > > >> > > > oddities.
> > > >> > >
> > > >> > > The network device model is a userspace API, and DPDK is a 
> > > >> > > userspace application.
> > > >> >
> > > >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> > > >> > AFAIK it's normally banging device registers directly.
> > > >> >
> > > >> > > You can't go breaking userspace even if you don't like the 
> > > >> > > application.
> > > >> >
> > > >> > Could you please explain how is the proposed patchset breaking
> > > >> > userspace? Ignoring DPDK for now, I don't think it changes the 
> > > >> > userspace
> > > >> > API at all.
> > > >> >
> > > >>
> > > >> The DPDK has a device driver vdev_netvsc which scans the Linux network 
> > > >> devices
> > > >> to look for Linux netvsc device and the paired VF device and setup the
> > > >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) 
> > > >> instance
> > > >> and sets up TAP support over the Linux netvsc device as well as the 
> > > >> Mellanox
> > > >> VF device.
> > > >>
> > > >> So it depends on existing 2 device model. You can't go to a 3 device 
> > > >> model
> > > >> or start hiding devices from userspace.
> > > >
> > > > Okay so how does the existing patch break that? IIUC does not go to
> > > > a 3 device model since netvsc calls failover_register directly.
> > > >
> > > >> Also, I am working on associating netvsc and VF device based on serial 
> > > >> number
> > > >> rather than MAC address. The serial number is how Windows works now, 
> > > >> and it makes
> > > >> sense for Linux and Windows to use the same mechanism if possible.
> > > >
> > > > Maybe we should support same for virtio ...
> > > > Which serial do you mean? From vpd?
> > > >
> > > > I guess you will want to keep supporting MAC for old hypervisors?  
> > 
> > The serial number has always been in the hypervisor since original support 
> > of SR-IOV
> > in WS2008.  So no backward compatibility special cases would be needed.  
> 
> Is that a serial from real hardware or a hypervisor thing?
> 
> 

It is a hypervisor thing in the PCI hyperv code and the hyperv Netvsc interface.
It might also be in the PCI spec, but the value in Hyper-V is being generated 
by the host.


Re: [PATCH net v2] net: ethtool: Add missing kernel doc for FEC parameters

2018-04-23 Thread Roopa Prabhu
On Mon, Apr 23, 2018 at 3:51 PM, Florian Fainelli  wrote:
> While adding support for ethtool::get_fecparam and set_fecparam, kernel
> doc for these functions was missed, add those.
>
> Fixes: 1a5f3da20bd9 ("net: ethtool: add support for forward error correction 
> modes")
> Signed-off-by: Florian Fainelli 

Acked-by: Roopa Prabhu 

Thanks Florian.


Re: [PATCH net-next] net: init sk_cookie for inet socket

2018-04-23 Thread Yafang Shao
On Tue, Apr 24, 2018 at 12:09 AM, Eric Dumazet  wrote:
>
>
> On 04/23/2018 08:58 AM, David Miller wrote:
>> From: Yafang Shao 
>> Date: Sun, 22 Apr 2018 21:50:04 +0800
>>
>>> With sk_cookie we can identify a socket, that is very helpful for
>>> traceing and statistic, i.e. tcp tracepiont and ebpf.
>>> So we'd better init it by default for inet socket.
>>> When using it, we just need call atomic64_read(>sk_cookie).
>>>
>>> Signed-off-by: Yafang Shao 
>>
>> Applied, thank you.
>>
>
> This is adding yet another atomic_inc on a global cache line.
>

That's a trade-off.

> Most applications do not need the cookie being ever set.
>
> The existing mechanism was fine. Set it on demand.

There are some drawback in the existing mechanism.
- we have to set the net->cookie_gen and then sk->sk_cookie when we
want to get the sk_cookie, that's a little expensive as well.
  After that change, sock_gen_cookie() could be replaced by
atomic64_read(>sk_cookie) in most places.

- If the application want to get the sk_cookie, it must set it first.
   What if the application don't have the permision to write?
   Furthermore, maybe it is a security concern ?


Thanks
Yafang


Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue

2018-04-23 Thread Eric Dumazet


On 04/23/2018 07:04 PM, Andy Lutomirski wrote:
> On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet  wrote:
>> Hi Andy
>>
>> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
> 
>>> I would suggest that you rework the interface a bit.  First a user would 
>>> call mmap() on a TCP socket, which would create an empty VMA.  (It would 
>>> set vm_ops to point to tcp_vm_ops or similar so that the TCP code could 
>>> recognize it, but it would have no effect whatsoever on the TCP state 
>>> machine.  Reading the VMA would get SIGBUS.)  Then a user would call a new 
>>> ioctl() or setsockopt() function and pass something like:
>>
>>
>>>
>>> struct tcp_zerocopy_receive {
>>>   void *address;
>>>   size_t length;
>>> };
>>>
>>> The kernel would verify that [address, address+length) is entirely inside a 
>>> single TCP VMA and then would do the vm_insert_range magic.
>>
>> I have no idea what is the proper API for that.
>> Where the TCP VMA(s) would be stored ?
>> In TCP socket, or MM layer ?
> 
> MM layer.  I haven't tested this at all, and the error handling is
> totally wrong, but I think you'd do something like:
> 
> len = get_user(...);
> 
> down_read(>mm->mmap_sem);
> 
> vma = find_vma(mm, start);
> if (!vma || vma->vm_start > start)
>   return -EFAULT;
> 
> /* This is buggy.  You also need to check that the file is a socket.
> This is probably trivial. */
> if (vma->vm_file->private_data != sock)
>   return -EINVAL;
> 
> if (len > vma->vm_end - start)
>   return -EFAULT;  /* too big a request. */
> 
> and now you'd do the vm_insert_page() dance, except that you don't
> have to abort the whole procedure if you discover that something isn't
> aligned right.  Instead you'd just stop and tell the caller that you
> didn't map the full requested size.  You might also need to add some
> code to charge the caller for the pages that get pinned, but that's an
> orthogonal issue.
> 
> You also need to provide some way for user programs to signal that
> they're done with the page in question.  MADV_DONTNEED might be
> sufficient.
> 
> In the mmap() helper, you might want to restrict the mapped size to
> something reasonable.  And it might be nice to hook mremap() to
> prevent user code from causing too much trouble.
> 
> With my x86-writer-of-TLB-code hat on, I expect the major performance
> costs to be the generic costs of mmap() and munmap() (which only
> happen once per socket instead of once per read if you like my idea),
> the cost of a TLB miss when the data gets read (really not so bad on
> modern hardware), and the cost of the TLB invalidation when user code
> is done with the buffers.  The latter is awful, especially in
> multithreaded programs.  In fact, it's so bad that it might be worth
> mentioning in the documentation for this code that it just shouldn't
> be used in multithreaded processes.  (Also, on non-PCID hardware,
> there's an annoying situation in which a recently-migrated thread that
> removes a mapping sends an IPI to the CPU that the thread used to be
> on.  I thought I had a clever idea to get rid of that IPI once, but it
> turned out to be wrong.)
> 
> Architectures like ARM that have superior TLB handling primitives will
> not be hurt as badly if this is used my a multithreaded program.
> 
>>
>>
>> And I am not sure why the error handling would be better (point 4), unless 
>> we can return smaller @length than requested maybe ?
> 
> Exactly.  If I request 10MB mapped and only the first 9MB are aligned
> right, I still want the first 9 MB.
> 
>>
>> Also how the VMA space would be accounted (point 3) when creating an empty 
>> VMA (no pages in there yet)
> 
> There's nothing to account.  It's the same as mapping /dev/null or
> similar -- the mm core should take care of it for you.
> 

Thanks Andy, I am working on all this, and initial patch looks sane enough.

 include/uapi/linux/tcp.h |7 +
 net/ipv4/tcp.c   |  175 +++
 2 files changed, 93 insertions(+), 89 deletions(-)


I will test all this before sending for review asap.

( I have not done the compat code yet, this can be done later I guess)




Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

2018-04-23 Thread Matthew Wilcox
On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> Some bugs (such as buffer overflows) are better detected
> with kmalloc code, so we must test the kmalloc path too.

Well now, this brings up another item for the collective TODO list --
implement redzone checks for vmalloc.  Unless this is something already
taken care of by kasan or similar.


Re: [PATCH 2/2] alx: add disable_wol paramenter

2018-04-23 Thread AceLan Kao
Hi,

May I know the final decision of this patch?
Thanks.

Best regards,
AceLan Kao.

2018-04-10 10:40 GMT+08:00 AceLan Kao :
> The problem is I don't have a machine with that wakeup issue, and I
> need WoL feature.
> Instead of spreading "alx with WoL" dkms package everywhere, I would
> like to see it's supported in the driver and is disabled by default.
>
> Moreover, the wakeup issue may come from old Atheros chips, or result
> from buggy BIOS.
> With the WoL has been removed from the driver, no one will report
> issue about that, and we don't have any chance to find a fix for it.
>
> Adding this feature back is not covering a paper on the issue, it
> makes people have a chance to examine this feature.
>
> 2018-04-09 22:50 GMT+08:00 David Miller :
>> From: Andrew Lunn 
>> Date: Mon, 9 Apr 2018 14:39:10 +0200
>>
>>> On Mon, Apr 09, 2018 at 07:35:14PM +0800, AceLan Kao wrote:
 The WoL feature was reported broken and will lead to
 the system resume immediately after suspending.
 This symptom is not happening on every system, so adding
 disable_wol option and disable WoL by default to prevent the issue from
 happening again.
>>>
  const char alx_drv_name[] = "alx";

 +/* disable WoL by default */
 +bool disable_wol = 1;
 +module_param(disable_wol, bool, 0);
 +MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature");
 +
>>>
>>> Hi AceLan
>>>
>>> This seems like you are papering over the cracks. And module
>>> parameters are not liked.
>>>
>>> Please try to find the real problem.
>>
>> Agreed.


Re: [PATCH net-next] net: fib_rules: fix l3mdev netlink attr processing

2018-04-23 Thread David Ahern
On 4/23/18 9:21 PM, David Miller wrote:
> From: Roopa Prabhu 
> Date: Mon, 23 Apr 2018 20:08:41 -0700
> 
>> From: Roopa Prabhu 
>>
>> Fixes: b16fb418b1bf ("net: fib_rules: add extack support")
>> Signed-off-by: Roopa Prabhu 
> 
> Applied.
> 
> It would be nice to get rid of these if() conditionals dangling
> around ifdef blocks.  They are quite error prone.
> 

I'll send a patch. I'd prefer a different message when NET_L3_MASTER_DEV
is not enabled.


Re: [PATCH 1/1] Revert "rds: ib: add error handle"

2018-04-23 Thread santosh.shilim...@oracle.com

On 4/23/18 6:39 PM, Zhu Yanjun wrote:

This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc.

After long time discussion and investigations, it seems that there
is no mem leak. So this patch is reverted.

Signed-off-by: Zhu Yanjun 
---

Well your fix was not for any leaks but just proper labels for
graceful exits. I don't know which long time discussion
you are referring but there is no need to revert this change
unless you see any issue with your change.

Regards,
Santosh


Re: [PATCH net-next] net: fib_rules: fix l3mdev netlink attr processing

2018-04-23 Thread David Miller
From: Roopa Prabhu 
Date: Mon, 23 Apr 2018 20:08:41 -0700

> From: Roopa Prabhu 
> 
> Fixes: b16fb418b1bf ("net: fib_rules: add extack support")
> Signed-off-by: Roopa Prabhu 

Applied.

It would be nice to get rid of these if() conditionals dangling
around ifdef blocks.  They are quite error prone.


Re: [PATCH net-next v2 0/2] fib rules extack support

2018-04-23 Thread Roopa Prabhu
On Mon, Apr 23, 2018 at 7:21 AM, David Miller  wrote:
> From: Roopa Prabhu 
> Date: Sat, 21 Apr 2018 09:41:29 -0700
>
>> From: Roopa Prabhu 
>>
>> First patch refactors code to move fib rule netlink handling
>> into a common function. This became obvious when adding
>> duplicate extack msgs in add and del paths. Second patch
>> adds extack msgs.
>>
>> v2 - Dropped the ip route get support and selftests from
>>  the series to look at the input path some more (as pointed
>>  out by ido). Will come back to that next week when i have
>>  some time. resending just the extack part for now.
>
> Series applied, but I was really looking forward to having those
> nice test cases in the tree.


yes, am on it.  (and also excited about my first selftest!)

thanks (I just sent a follow-up bug fix for vrf).


[PATCH] mac80211_hwsim: fix a possible memory leak in hwsim_new_radio_nl()

2018-04-23 Thread YueHaibing
'hwname' should be freed before leaving from the error handling cases,
otherwise it will cause mem leak

Fixes: cb1a5bae5684 ("mac80211_hwsim: add permanent mac address option for new 
radios")
Signed-off-by: YueHaibing 
---
 drivers/net/wireless/mac80211_hwsim.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/mac80211_hwsim.c 
b/drivers/net/wireless/mac80211_hwsim.c
index 96d26cf..4a017a0 100644
--- a/drivers/net/wireless/mac80211_hwsim.c
+++ b/drivers/net/wireless/mac80211_hwsim.c
@@ -3236,6 +3236,7 @@ static int hwsim_new_radio_nl(struct sk_buff *msg, struct 
genl_info *info)
GENL_SET_ERR_MSG(info,"MAC is no valid source addr");
NL_SET_BAD_ATTR(info->extack,
info->attrs[HWSIM_ATTR_PERM_ADDR]);
+   kfree(hwname);
return -EINVAL;
}
 
-- 
2.7.0




[PATCH net-next] net: fib_rules: fix l3mdev netlink attr processing

2018-04-23 Thread Roopa Prabhu
From: Roopa Prabhu 

Fixes: b16fb418b1bf ("net: fib_rules: add extack support")
Signed-off-by: Roopa Prabhu 
---
Looks like I broke this when i split extack changes into a separate patch :(

 net/core/fib_rules.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index ebd9351..2271c80 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -541,8 +541,10 @@ static int fib_nl2rule(struct sk_buff *skb, struct 
nlmsghdr *nlh,
nlrule->l3mdev = nla_get_u8(tb[FRA_L3MDEV]);
if (nlrule->l3mdev != 1)
 #endif
+   {
NL_SET_ERR_MSG(extack, "Invalid l3mdev");
goto errout_free;
+   }
}
 
nlrule->action = frh->action;
-- 
2.1.4



[PATCH v2 7/8] ipconfig: Create /proc/net/ipconfig directory

2018-04-23 Thread Chris Novakovic
To allow ipconfig to report IP configuration details to user space
processes without cluttering /proc/net, create a new subdirectory
/proc/net/ipconfig. All files containing IP configuration details should
be written to this directory.

Signed-off-by: Chris Novakovic 
---
 net/ipv4/ipconfig.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index e11dfd29a929..9abf833f3a99 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -158,6 +158,9 @@ static u8 ic_domain[64];/* DNS (not NIS) domain 
name */
  * Private state.
  */
 
+/* proc_dir_entry for /proc/net/ipconfig */
+static struct proc_dir_entry *ipconfig_dir;
+
 /* Name of user-selected boot device */
 static char user_dev_name[IFNAMSIZ] __initdata = { 0, };
 
@@ -1301,6 +1304,16 @@ static const struct file_operations pnp_seq_fops = {
.llseek = seq_lseek,
.release= single_release,
 };
+
+/* Create the /proc/net/ipconfig directory */
+static int ipconfig_proc_net_init(void)
+{
+   ipconfig_dir = proc_net_mkdir(_net, "ipconfig", init_net.proc_net);
+   if (!ipconfig_dir)
+   return -ENOMEM;
+
+   return 0;
+}
 #endif /* CONFIG_PROC_FS */
 
 /*
@@ -1384,6 +1397,8 @@ static int __init ip_auto_config(void)
 
 #ifdef CONFIG_PROC_FS
proc_create("pnp", 0444, init_net.proc_net, _seq_fops);
+
+   ipconfig_proc_net_init();
 #endif /* CONFIG_PROC_FS */
 
if (!ic_enable)
-- 
2.14.1



[PATCH v2 6/8] ipconfig: Correctly initialise ic_nameservers

2018-04-23 Thread Chris Novakovic
ic_nameservers, which stores the list of name servers discovered by
ipconfig, is initialised (i.e. has all of its elements set to NONE, or
0x) by ic_nameservers_predef() in the following scenarios:

 - before the "ip=" and "nfsaddrs=" kernel command line parameters are
   parsed (in ip_auto_config_setup());
 - before autoconfiguring via DHCP or BOOTP (in ic_bootp_init()), in
   order to clear any values that may have been set after parsing "ip="
   or "nfsaddrs=" and are no longer needed.

This means that ic_nameservers_predef() is not called when neither "ip="
nor "nfsaddrs=" is specified on the kernel command line. In this
scenario, every element in ic_nameservers remains set to 0x,
which is indistinguishable from ANY and causes pnp_seq_show() to write
the following (bogus) information to /proc/net/pnp:

  #MANUAL
  nameserver 0.0.0.0
  nameserver 0.0.0.0
  nameserver 0.0.0.0

This is potentially problematic for systems that blindly link
/etc/resolv.conf to /proc/net/pnp.

Ensure that ic_nameservers is also initialised when neither "ip=" nor
"nfsaddrs=" are specified by calling ic_nameservers_predef() in
ip_auto_config(), but only when ip_auto_config_setup() was not called
earlier. This causes the following to be written to /proc/net/pnp, and
is consistent with what gets written when ipconfig is configured
manually but no name servers are specified on the kernel command line:

  #MANUAL

Signed-off-by: Chris Novakovic 
---
 net/ipv4/ipconfig.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 0f460d6d3cce..e11dfd29a929 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -750,6 +750,11 @@ static void __init ic_bootp_init_ext(u8 *e)
  */
 static inline void __init ic_bootp_init(void)
 {
+   /* Re-initialise all name servers to NONE, in case any were set via the
+* "ip=" or "nfsaddrs=" kernel command line parameters: any IP addresses
+* specified there will already have been decoded but are no longer
+* needed
+*/
ic_nameservers_predef();
 
dev_add_pack(_packet_type);
@@ -1370,6 +1375,13 @@ static int __init ip_auto_config(void)
int err;
unsigned int i;
 
+   /* Initialise all name servers to NONE (but only if the "ip=" or
+* "nfsaddrs=" kernel command line parameters weren't decoded, otherwise
+* we'll overwrite the IP addresses specified there)
+*/
+   if (ic_set_manually == 0)
+   ic_nameservers_predef();
+
 #ifdef CONFIG_PROC_FS
proc_create("pnp", 0444, init_net.proc_net, _seq_fops);
 #endif /* CONFIG_PROC_FS */
@@ -1593,6 +1605,7 @@ static int __init ip_auto_config_setup(char *addrs)
return 1;
}
 
+   /* Initialise all name servers to NONE */
ic_nameservers_predef();
 
/* Parse string for static IP assignment.  */
-- 
2.14.1



[PATCH v2 4/8] ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name servers

2018-04-23 Thread Chris Novakovic
When ipconfig is autoconfigured via BOOTP, the request packet
initialised by ic_bootp_init_ext() always allocates 8 bytes for the name
server option, limiting the BOOTP server to responding with at most 2
name servers even though ipconfig in fact supports an arbitrary number
of name servers (as defined by CONF_NAMESERVERS_MAX, which is currently
3).

Only request name servers in the request packet if CONF_NAMESERVERS_MAX
is positive (to comply with [1, §3.8]), and allocate enough space in the
packet for CONF_NAMESERVERS_MAX name servers to indicate the maximum
number we can accept in response.

[1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
https://tools.ietf.org/rfc/rfc2132.txt

Signed-off-by: Chris Novakovic 
---
 net/ipv4/ipconfig.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index bcf3c4f9882d..0f460d6d3cce 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -721,9 +721,11 @@ static void __init ic_bootp_init_ext(u8 *e)
*e++ = 3;   /* Default gateway request */
*e++ = 4;
e += 4;
+#if CONF_NAMESERVERS_MAX > 0
*e++ = 6;   /* (DNS) name server request */
-   *e++ = 8;
-   e += 8;
+   *e++ = 4 * CONF_NAMESERVERS_MAX;
+   e += 4 * CONF_NAMESERVERS_MAX;
+#endif
*e++ = 12;  /* Host name request */
*e++ = 32;
e += 32;
-- 
2.14.1



[PATCH v2 8/8] ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers

2018-04-23 Thread Chris Novakovic
Distributed filesystems are most effective when the server and client
clocks are synchronised. Embedded devices often use NFS for their
root filesystem but typically do not contain an RTC, so the clocks of
the NFS server and the embedded device will be out-of-sync when the root
filesystem is mounted (and may not be synchronised until late in the
boot process).

Extend ipconfig with the ability to export IP addresses of NTP servers
it discovers to /proc/net/ipconfig/ntp_servers. They can be supplied as
follows:

 - If ipconfig is configured manually via the "ip=" or "nfsaddrs="
   kernel command line parameters, one NTP server can be specified in
   the new "" parameter.
 - If ipconfig is autoconfigured via DHCP, request DHCP option 42 in
   the DHCPDISCOVER message, and record the IP addresses of up to three
   NTP servers sent by the responding DHCP server in the subsequent
   DHCPOFFER message.

ipconfig will only write the NTP server IP addresses it discovers to
/proc/net/ipconfig/ntp_servers, one per line (in the order received from
the DHCP server, if DHCP autoconfiguration is used); making use of these
NTP servers is the responsibility of a user space process (e.g. an
initrd/initram script that invokes an NTP client before mounting an NFS
root filesystem).

Signed-off-by: Chris Novakovic 
---
 Documentation/filesystems/nfs/nfsroot.txt |  35 +++--
 net/ipv4/ipconfig.c   | 118 +++---
 2 files changed, 136 insertions(+), 17 deletions(-)

diff --git a/Documentation/filesystems/nfs/nfsroot.txt 
b/Documentation/filesystems/nfs/nfsroot.txt
index a1030bea60d3..d2963123eb1c 100644
--- a/Documentation/filesystems/nfs/nfsroot.txt
+++ b/Documentation/filesystems/nfs/nfsroot.txt
@@ -5,6 +5,7 @@ Written 1996 by Gero Kuhlmann 
 Updated 1997 by Martin Mares 
 Updated 2006 by Nico Schottelius 
 Updated 2006 by Horms 
+Updated 2018 by Chris Novakovic 
 
 
 
@@ -79,7 +80,7 @@ nfsroot=[:][,]
 
 
 ip=:::
-   :
+   ::
 
   This parameter tells the kernel how to configure IP addresses of devices
   and also how to set up the IP routing table. It was originally called
@@ -178,9 +179,18 @@ 
ip=:::
   IP address of secondary nameserver.
See .
 
-  After configuration (whether manual or automatic) is complete, a file is
-  created at /proc/net/pnp in the following format; lines are omitted if
-  their respective value is empty following configuration.
+  IP address of a Network Time Protocol (NTP) server.
+   Value is exported to /proc/net/ipconfig/ntp_servers, but is
+   otherwise unused (see below).
+
+   Default: None if not using autoconfiguration; determined
+   automatically if using autoconfiguration.
+
+  After configuration (whether manual or automatic) is complete, two files
+  are created in the following format; lines are omitted if their respective
+  value is empty following configuration:
+
+  - /proc/net/pnp:
 
#PROTO: (depending on configuration 
method)
domain  (if autoconfigured, the DNS 
domain)
@@ -189,13 +199,26 @@ 
ip=:::
nameserver (tertiary name server IP)
bootserver   (NFS server IP)
 
-   and  are requested during autoconfiguration; they
-  cannot be specified as part of the "ip=" kernel command line parameter.
+  - /proc/net/ipconfig/ntp_servers:
+
+  (NTP server IP)
+  (NTP server IP)
+  (NTP server IP)
+
+   and  (in /proc/net/pnp) and  and 
+  (in /proc/net/ipconfig/ntp_servers) are requested during autoconfiguration;
+  they cannot be specified as part of the "ip=" kernel command line parameter.
 
   Because the "domain" and "nameserver" options are recognised by DNS
   resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems
   that use an NFS root filesystem.
 
+  Note that the kernel will not synchronise the system time with any NTP
+  servers it discovers; this is the responsibility of a user space process
+  (e.g. an initrd/initramfs script that passes the IP addresses listed in
+  /proc/net/ipconfig/ntp_servers to an NTP client before mounting the real
+  root filesystem if it is on NFS).
+
 
 nfsrootdebug
 
diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 9abf833f3a99..d839d74853fc 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -28,6 +28,9 @@
  *
  *  Multiple Nameservers in /proc/net/pnp
  *  --  Josef Siemes , Aug 2002
+ *
+ *  NTP servers in /proc/net/ipconfig/ntp_servers
+ *  --  Chris Novakovic , April 2018
  */
 
 #include 
@@ -93,6 +96,7 @@
 #define CONF_TIMEOUT_MAX   (HZ*30) 

[PATCH v2 1/8] ipconfig: Document setting of NIS domain name

2018-04-23 Thread Chris Novakovic
ic_do_bootp_ext() is responsible for parsing the "ip=" and "nfsaddrs="
kernel parameters. If a "." character is found in parameter 4 (the
client's hostname), everything before the first "." is used as the
hostname, and everything after it is used as the NIS domain name (but
not necessarily the DNS domain name).

Document this behaviour in Documentation/filesystems/nfs/nfsroot.txt,
as it is not made explicit.

Signed-off-by: Chris Novakovic 
---
 Documentation/filesystems/nfs/nfsroot.txt | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/nfs/nfsroot.txt 
b/Documentation/filesystems/nfs/nfsroot.txt
index 5efae00f6c7f..1513e5d663fd 100644
--- a/Documentation/filesystems/nfs/nfsroot.txt
+++ b/Documentation/filesystems/nfs/nfsroot.txt
@@ -123,10 +123,13 @@ 
ip=:::
 
Default:  Determined using autoconfiguration.
 
- Name of the client. May be supplied by autoconfiguration,
-   but its absence will not trigger autoconfiguration.
-   If specified and DHCP is used, the user provided hostname will
-   be carried in the DHCP request to hopefully update DNS record.
+ Name of the client. If a '.' character is present, anything
+   before the first '.' is used as the client's hostname, and 
anything
+   after it is used as its NIS domain name. May be supplied by
+   autoconfiguration, but its absence will not trigger 
autoconfiguration.
+   If specified and DHCP is used, the user-provided hostname (and 
NIS
+   domain name, if present) will be carried in the DHCP request; 
this
+   may cause a DNS record to be created or updated for the client.
 
Default: Client IP address is used in ASCII notation.
 
-- 
2.14.1



[PATCH v2 0/8] ipconfig: NTP server support, bug fixes, documentation improvements

2018-04-23 Thread Chris Novakovic
This series (against net-next) makes various improvements to ipconfig:

 - Patch #1 correctly documents the behaviour of parameter 4 in the
   "ip=" and "nfsaddrs=" command line parameter.
 - Patch #2 tidies up the printk()s for reporting configured name
   servers.
 - Patch #3 fixes a bug in autoconfiguration via BOOTP whereby the IP
   addresses of IEN-116 name servers are requested from the BOOTP
   server, rather than those of DNS name servers.
 - Patch #4 requests the number of DNS servers specified by
   CONF_NAMESERVERS_MAX when autoconfiguring via BOOTP, rather than
   hardcoding it to 2.
 - Patch #5 fully documents the contents and format of /proc/net/pnp in
   Documentation/filesystems/nfs/nfsroot.txt.
 - Patch #6 fixes a bug whereby bogus information is written to
   /proc/net/pnp when ipconfig is not used.
 - Patch #7 creates a new procfs directory for ipconfig-related
   configuration reports at /proc/net/ipconfig.
 - Patch #8 allows for NTP servers to be configured (manually on the
   kernel command line or automatically via DHCP), enabling systems with
   an NFS root filesystem to synchronise their clock before mounting
   their root filesystem. NTP server IP addresses are written to
   /proc/net/ipconfig/ntp_servers.

Changes from v1:

 - David requested that a new directory /proc/net/ipconfig be created to
   contain ipconfig-related configuration reports, which is implemented
   in the new patch #7. NTP server IPs are now written to this directory
   instead of /proc/net/ntp in the new patch #8.
 - Cong and David both requested that the modification to CREDITS be
   dropped. This patch has been removed from the series.

Chris Novakovic (8):
  ipconfig: Document setting of NIS domain name
  ipconfig: Tidy up reporting of name servers
  ipconfig: BOOTP: Don't request IEN-116 name servers
  ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name servers
  ipconfig: Document /proc/net/pnp
  ipconfig: Correctly initialise ic_nameservers
  ipconfig: Create /proc/net/ipconfig directory
  ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers

 Documentation/filesystems/nfs/nfsroot.txt |  70 --
 net/ipv4/ipconfig.c   | 151 +++---
 2 files changed, 200 insertions(+), 21 deletions(-)

-- 
2.14.1



[PATCH v2 2/8] ipconfig: Tidy up reporting of name servers

2018-04-23 Thread Chris Novakovic
Commit 5e953778a2aab04929a5e7b69f53dc26e39b079e ("ipconfig: add
nameserver IPs to kernel-parameter ip=") adds the IP addresses of
discovered name servers to the summary printed by ipconfig when
configuration is complete. It appears the intention in ip_auto_config()
was to print the name servers on a new line (especially given the
spacing and lack of comma before "nameserver0="), but they're actually
printed on the same line as the NFS root filesystem configuration
summary:

  [0.686186] IP-Config: Complete:
  [0.686226]  device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, 
mask=255.255.255.0, gw=10.0.0.1
  [0.686328]  host=test, domain=example.com, nis-domain=(none)
  [0.686386]  bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath= 
nameserver0=10.0.0.1

This makes it harder to read and parse ipconfig's output. Instead, print
the name servers on a separate line:

  [0.791250] IP-Config: Complete:
  [0.791289]  device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, 
mask=255.255.255.0, gw=10.0.0.1
  [0.791407]  host=test, domain=example.com, nis-domain=(none)
  [0.791475]  bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=
  [0.791476]  nameserver0=10.0.0.1

Signed-off-by: Chris Novakovic 
---
 net/ipv4/ipconfig.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 43f620feb1c4..d0ea0ecc9008 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -1481,16 +1481,19 @@ static int __init ip_auto_config(void)
_servaddr, _server_addr, root_server_path);
if (ic_dev_mtu)
pr_cont(", mtu=%d", ic_dev_mtu);
-   for (i = 0; i < CONF_NAMESERVERS_MAX; i++)
+   /* Name servers (if any): */
+   for (i = 0; i < CONF_NAMESERVERS_MAX; i++) {
if (ic_nameservers[i] != NONE) {
-   pr_cont(" nameserver%u=%pI4",
-   i, _nameservers[i]);
-   break;
+   if (i == 0)
+   pr_info(" nameserver%u=%pI4",
+   i, _nameservers[i]);
+   else
+   pr_cont(", nameserver%u=%pI4",
+   i, _nameservers[i]);
}
-   for (i++; i < CONF_NAMESERVERS_MAX; i++)
-   if (ic_nameservers[i] != NONE)
-   pr_cont(", nameserver%u=%pI4", i, _nameservers[i]);
-   pr_cont("\n");
+   if (i + 1 == CONF_NAMESERVERS_MAX)
+   pr_cont("\n");
+   }
 #endif /* !SILENT */
 
/*
-- 
2.14.1



[PATCH v2 5/8] ipconfig: Document /proc/net/pnp

2018-04-23 Thread Chris Novakovic
Fully document the format used by the /proc/net/pnp file written by
ipconfig, explain where its values originate from, and clarify that the
tertiary name server IP and DNS domain name are only written to the file
when autoconfiguration is used.

Signed-off-by: Chris Novakovic 
---
 Documentation/filesystems/nfs/nfsroot.txt | 34 ++-
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/nfs/nfsroot.txt 
b/Documentation/filesystems/nfs/nfsroot.txt
index 1513e5d663fd..a1030bea60d3 100644
--- a/Documentation/filesystems/nfs/nfsroot.txt
+++ b/Documentation/filesystems/nfs/nfsroot.txt
@@ -110,6 +110,9 @@ 
ip=:::
will not be triggered if it is missing and NFS root is not
in operation.
 
+   Value is exported to /proc/net/pnp with the prefix "bootserver "
+   (see below).
+
Default: Determined using autoconfiguration.
 The address of the autoconfiguration server is used.
 
@@ -165,12 +168,33 @@ 
ip=:::
 
 Default: any
 
-  IP address of first nameserver.
-   Value gets exported by /proc/net/pnp which is often linked
-   on embedded systems by /etc/resolv.conf.
+  IP address of primary nameserver.
+   Value is exported to /proc/net/pnp with the prefix "nameserver "
+   (see below).
+
+   Default: None if not using autoconfiguration; determined
+   automatically if using autoconfiguration.
+
+  IP address of secondary nameserver.
+   See .
+
+  After configuration (whether manual or automatic) is complete, a file is
+  created at /proc/net/pnp in the following format; lines are omitted if
+  their respective value is empty following configuration.
+
+   #PROTO: (depending on configuration 
method)
+   domain  (if autoconfigured, the DNS 
domain)
+   nameserver (primary name server IP)
+   nameserver (secondary name server IP)
+   nameserver (tertiary name server IP)
+   bootserver   (NFS server IP)
+
+   and  are requested during autoconfiguration; they
+  cannot be specified as part of the "ip=" kernel command line parameter.
 
-  IP address of second nameserver.
-   Same as above.
+  Because the "domain" and "nameserver" options are recognised by DNS
+  resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems
+  that use an NFS root filesystem.
 
 
 nfsrootdebug
-- 
2.14.1



[PATCH v2 3/8] ipconfig: BOOTP: Don't request IEN-116 name servers

2018-04-23 Thread Chris Novakovic
When ipconfig is autoconfigured via BOOTP, the request packet
initialised by ic_bootp_init_ext() allocates 8 bytes for tag 5 ("Name
Server" [1, §3.7]), but tag 5 in the response isn't processed by
ic_do_bootp_ext(). Instead, allocate the 8 bytes to tag 6 ("Domain Name
Server" [1, §3.8]), which is processed by ic_do_bootp_ext(), and appears
to have been the intended tag to request.

This won't cause any breakage for existing users, as tag 5 responses
provided by BOOTP servers weren't being processed anyway.

[1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
https://tools.ietf.org/rfc/rfc2132.txt

Signed-off-by: Chris Novakovic 
---
 net/ipv4/ipconfig.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index d0ea0ecc9008..bcf3c4f9882d 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -721,7 +721,7 @@ static void __init ic_bootp_init_ext(u8 *e)
*e++ = 3;   /* Default gateway request */
*e++ = 4;
e += 4;
-   *e++ = 5;   /* Name server request */
+   *e++ = 6;   /* (DNS) name server request */
*e++ = 8;
e += 8;
*e++ = 12;  /* Host name request */
-- 
2.14.1



Re: [PATCH iproute2-next v2 2/2] gre/gre6: allow clearing {,i,o}{key,seq,csum} flags

2018-04-23 Thread David Ahern
On 4/20/18 2:32 AM, Sabrina Dubroca wrote:
> Currently, iproute allows setting those flags, but it's impossible to
> clear them, since their current value is fetched from the kernel and
> then we OR in the additional flags passed on the command line.
> 
> Add no* variants to allow clearing them.
> 
> Signed-off-by: Sabrina Dubroca 
> ---
> v2: fixed up okey flag clearing
> also reset the actual value of the key, not just the flag

both applied to iproute2-next. Thanks,



Re: [PATCH iproute2 net-next] vxlan: add ttl auto in help message

2018-04-23 Thread David Ahern
On 4/23/18 8:40 PM, Hangbin Liu wrote:
> Signed-off-by: Hangbin Liu 
> ---
>  ip/iplink_vxlan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

applied to iproute2-next. Thanks,



Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

2018-04-23 Thread David Rientjes
On Mon, 23 Apr 2018, Mikulas Patocka wrote:

> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
> 
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
> 
> These bugs are hard to reproduce because kvmalloc falls back to vmalloc
> only if memory is fragmented.
> 
> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG
> is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer
> verify the addresses passed to it, and so the user will get a reliable
> stacktrace.
> 

Why not just do it unconditionally?  Sounds better than "50% of the time 
this will catch bugs".

> Some bugs (such as buffer overflows) are better detected
> with kmalloc code, so we must test the kmalloc path too.
> 
> Signed-off-by: Mikulas Patocka 
> 
> ---
>  mm/util.c |   10 ++
>  1 file changed, 10 insertions(+)
> 
> Index: linux-2.6/mm/util.c
> ===
> --- linux-2.6.orig/mm/util.c  2018-04-23 00:12:05.0 +0200
> +++ linux-2.6/mm/util.c   2018-04-23 17:57:02.0 +0200
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
>*/
>   WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
>  
> +#ifdef CONFIG_DEBUG_SG
> + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> + if (!(prandom_u32_max(2) & 1))
> + goto do_vmalloc;
> +#endif
> +
>   /*
>* We want to attempt a large physically contiguous block first because
>* it is less likely to fragment multiple larger blocks and therefore
> @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
>   if (ret || size <= PAGE_SIZE)
>   return ret;
>  
> +#ifdef CONFIG_DEBUG_SG
> +do_vmalloc:
> +#endif

You can just do

do_vmalloc: __maybe_unused

>   return __vmalloc_node_flags_caller(size, node, flags,
>   __builtin_return_address(0));
>  }
> 
> 


[PATCH iproute2 net-next] vxlan: add ttl auto in help message

2018-04-23 Thread Hangbin Liu
Signed-off-by: Hangbin Liu 
---
 ip/iplink_vxlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index 661eaa7..d89b68b 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -51,7 +51,7 @@ static void print_explain(FILE *f)
"Where: VNI   := 0-16777215\n"
"   ADDR  := { IP_ADDRESS | any }\n"
"   TOS   := { NUMBER | inherit }\n"
-   "   TTL   := { 1..255 | inherit }\n"
+   "   TTL   := { 1..255 | auto | inherit }\n"
"   LABEL := 0-1048575\n"
);
 }
-- 
2.5.5



Re: [PATCH bpf-next 00/15] Introducing AF_XDP support

2018-04-23 Thread Jason Wang



On 2018年04月23日 21:56, Björn Töpel wrote:

From: Björn Töpel 

This RFC introduces a new address family called AF_XDP that is
optimized for high performance packet processing and, in upcoming
patch sets, zero-copy semantics. In this v2 version, we have removed
all zero-copy related code in order to make it smaller, simpler and
hopefully more review friendly. This RFC only supports copy-mode for
the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
using the XDP_DRV path. Zero-copy support requires XDP and driver
changes that Jesper Dangaard Brouer is working on. Some of his work
has already been accepted. We will publish our zero-copy support for
RX and TX on top of his patch sets at a later point in time.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two queues: the RX queue and the
TX queue. A socket can receive packets on the RX queue and it can send
packets on the TX queue. These queues are registered and sized with
the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
mandatory to have at least one of these queues for each socket. In
contrast to AF_PACKET V2/V3 these descriptor queues are separated from
packet buffers. An RX or TX descriptor points to a data buffer in a
memory area called a UMEM. RX and TX can share the same UMEM so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, the
descriptor that points to that packet can be changed to point to
another and reused right away. This again avoids copying data.

This new dedicated packet buffer area is call a UMEM. It consists of a
number of equally size frames and each frame has a unique frame id. A
descriptor in one of the queues references a frame by referencing its
frame id. The user space allocates memory for this UMEM using whatever
means it feels is most appropriate (malloc, mmap, huge pages,
etc). This memory area is then registered with the kernel using the new
setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
and the COMPLETION queue. The fill queue is used by the application to
send down frame ids for the kernel to fill in with RX packet
data. References to these frames will then appear in the RX queue of
the XSK once they have been received. The completion queue, on the
other hand, contains frame ids that the kernel has transmitted
completely and can now be used again by user space, for either TX or
RX. Thus, the frame ids appearing in the completion queue are ids that
were previously transmitted using the TX queue. In summary, the RX and
FILL queues are used for the RX path and the TX and COMPLETION queues
are used for the TX path.

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow. Note that in this RFC, all
packet data is copied out to user-space.

A new feature in this RFC is that the UMEM can be shared between
processes, if desired. If a process wants to do this, it simply skips
the registration of the UMEM and its corresponding two queues, sets a
flag in the bind call and submits the XSK of the process it would like
to share UMEM with as well as its own newly created XSK socket. The
new process will then receive frame id references in its own RX queue
that point to this shared UMEM. Note that since the queue structures
are single-consumer / single-producer (for performance reasons), the
new process has to create its own socket with associated RX and TX
queues, since it cannot share this with the other process. This is
also the reason that there is only one set of FILL and COMPLETION
queues per UMEM. It is the responsibility of a single process to
handle the UMEM. If multiple-producer / multiple-consumer queues are
implemented in the future, this requirement could be relaxed.

How is then packets distributed between these two XSK? We have
introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
full). The user-space application can place an XSK at an arbitrary
place in this map. The XDP program can then redirect a packet to a
specific index in this map and at this point XDP validates that the
XSK in that map was indeed bound to that device and queue number. If
not, the packet is dropped. If the map is empty at that index, the
packet is also dropped. This also means that it is currently mandatory
to have an XDP program loaded (and one XSK in the XSKMAP) to be able
to get any traffic to user space through the XSK.

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed that uses SKBs
together with the generic XDP support and copies out the data to user
space. A fallback mode that works for any network device. On the other
hand, if the driver has support for 

Re: [RFC PATCH ghak32 V2 01/13] audit: add container id

2018-04-23 Thread Richard Guy Briggs
On 2018-04-23 19:15, Paul Moore wrote:
> On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs  wrote:
> > On 2018-04-18 19:47, Paul Moore wrote:
> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs  
> >> wrote:
> >> > Implement the proc fs write to set the audit container ID of a process,
> >> > emitting an AUDIT_CONTAINER record to document the event.
> >> >
> >> > This is a write from the container orchestrator task to a proc entry of
> >> > the form /proc/PID/containerid where PID is the process ID of the newly
> >> > created task that is to become the first task in a container, or an
> >> > additional task added to a container.
> >> >
> >> > The write expects up to a u64 value (unset: 18446744073709551615).
> >> >
> >> > This will produce a record such as this:
> >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 
> >> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 
> >> > tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 
> >> > res=0
> >> >
> >> > The "op" field indicates an initial set.  The "pid" to "ses" fields are
> >> > the orchestrator while the "opid" field is the object's PID, the process
> >> > being "contained".  Old and new container ID values are given in the
> >> > "contid" fields, while res indicates its success.
> >> >
> >> > It is not permitted to self-set, unset or re-set the container ID.  A
> >> > child inherits its parent's container ID, but then can be set only once
> >> > after.
> >> >
> >> > See: https://github.com/linux-audit/audit-kernel/issues/32
> >> >
> >> > Signed-off-by: Richard Guy Briggs 
> >> > ---
> >> >  fs/proc/base.c | 37 
> >> >  include/linux/audit.h  | 16 +
> >> >  include/linux/init_task.h  |  4 ++-
> >> >  include/linux/sched.h  |  1 +
> >> >  include/uapi/linux/audit.h |  2 ++
> >> >  kernel/auditsc.c   | 84 
> >> > ++
> >> >  6 files changed, 143 insertions(+), 1 deletion(-)
> 
> ...
> 
> >> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> > index d258826..1b82191 100644
> >> > --- a/include/linux/sched.h
> >> > +++ b/include/linux/sched.h
> >> > @@ -796,6 +796,7 @@ struct task_struct {
> >> >  #ifdef CONFIG_AUDITSYSCALL
> >> > kuid_t  loginuid;
> >> > unsigned intsessionid;
> >> > +   u64 containerid;
> >>
> >> This one line addition to the task_struct scares me the most of
> >> anything in this patchset.  Why?  It's a field named "containerid" in
> >> a perhaps one of the most widely used core kernel structures; the
> >> possibilities for abuse are endless, and it's foolish to think we
> >> would ever be able to adequately police this.
> >
> > Fair enough.
> >
> >> Unfortunately, we can't add the field to audit_context as things
> >> currently stand because we don't always allocate an audit_context,
> >> it's dependent on the system's configuration, and we need to track the
> >> audit container ID for a given process, regardless of the audit
> >> configuration.  Pretty much the same reason why loginuid and sessionid
> >> are located directly in task_struct now.  As I stressed during the
> >> design phase, I really want to keep this as an *audit* container ID
> >> and not a general purpose kernel wide container ID.  If the kernel
> >> ever grows a general purpose container ID token, I'll be the first in
> >> line to convert the audit code, but I don't want audit to be that
> >> general purpose mechanism ... audit is hated enough as-is ;)
> >
> > When would we need an audit container ID when audit is not enabled
> > enough to have an audit_context?
> 
> I'm thinking of the audit_alloc() case where audit_filter_task()
> returns AUDIT_DISABLED.

Ok, so a task could be marked as filtered but its children would still
be auditable and inheriting its parent containerid (as well at its
loginuid and sessionid)...

> I believe this is the same reason why loginuid and sessionid live
> directly in the task_struct and not in the audit_context; they need to
> persist for the lifetime of the task.

Yes, probably.

> > If it is only used for audit, and audit is the only consumer, and audit
> > can only use it when it is enabled, then we can just return success to
> > any write to the proc filehandle, or not even present it.  Nothing will
> > be able to know that value wasn't used.
> >
> > When are loginuid and sessionid used now when audit is not enabled (or
> > should I say, explicitly disabled)?
> 
> See above.  I think that should answer these questions.

Ok.

> >> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> >> > index 4e61a9e..921a71f 100644
> >> > --- a/include/uapi/linux/audit.h
> >> > +++ b/include/uapi/linux/audit.h
> >> > @@ -71,6 +71,7 @@
> >> >  #define AUDIT_TTY_SET  1017/* Set TTY auditing status 

Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue

2018-04-23 Thread Andy Lutomirski
On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet  wrote:
> Hi Andy
>
> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:

>> I would suggest that you rework the interface a bit.  First a user would 
>> call mmap() on a TCP socket, which would create an empty VMA.  (It would set 
>> vm_ops to point to tcp_vm_ops or similar so that the TCP code could 
>> recognize it, but it would have no effect whatsoever on the TCP state 
>> machine.  Reading the VMA would get SIGBUS.)  Then a user would call a new 
>> ioctl() or setsockopt() function and pass something like:
>
>
>>
>> struct tcp_zerocopy_receive {
>>   void *address;
>>   size_t length;
>> };
>>
>> The kernel would verify that [address, address+length) is entirely inside a 
>> single TCP VMA and then would do the vm_insert_range magic.
>
> I have no idea what is the proper API for that.
> Where the TCP VMA(s) would be stored ?
> In TCP socket, or MM layer ?

MM layer.  I haven't tested this at all, and the error handling is
totally wrong, but I think you'd do something like:

len = get_user(...);

down_read(>mm->mmap_sem);

vma = find_vma(mm, start);
if (!vma || vma->vm_start > start)
  return -EFAULT;

/* This is buggy.  You also need to check that the file is a socket.
This is probably trivial. */
if (vma->vm_file->private_data != sock)
  return -EINVAL;

if (len > vma->vm_end - start)
  return -EFAULT;  /* too big a request. */

and now you'd do the vm_insert_page() dance, except that you don't
have to abort the whole procedure if you discover that something isn't
aligned right.  Instead you'd just stop and tell the caller that you
didn't map the full requested size.  You might also need to add some
code to charge the caller for the pages that get pinned, but that's an
orthogonal issue.

You also need to provide some way for user programs to signal that
they're done with the page in question.  MADV_DONTNEED might be
sufficient.

In the mmap() helper, you might want to restrict the mapped size to
something reasonable.  And it might be nice to hook mremap() to
prevent user code from causing too much trouble.

With my x86-writer-of-TLB-code hat on, I expect the major performance
costs to be the generic costs of mmap() and munmap() (which only
happen once per socket instead of once per read if you like my idea),
the cost of a TLB miss when the data gets read (really not so bad on
modern hardware), and the cost of the TLB invalidation when user code
is done with the buffers.  The latter is awful, especially in
multithreaded programs.  In fact, it's so bad that it might be worth
mentioning in the documentation for this code that it just shouldn't
be used in multithreaded processes.  (Also, on non-PCID hardware,
there's an annoying situation in which a recently-migrated thread that
removes a mapping sends an IPI to the CPU that the thread used to be
on.  I thought I had a clever idea to get rid of that IPI once, but it
turned out to be wrong.)

Architectures like ARM that have superior TLB handling primitives will
not be hurt as badly if this is used my a multithreaded program.

>
>
> And I am not sure why the error handling would be better (point 4), unless we 
> can return smaller @length than requested maybe ?

Exactly.  If I request 10MB mapped and only the first 9MB are aligned
right, I still want the first 9 MB.

>
> Also how the VMA space would be accounted (point 3) when creating an empty 
> VMA (no pages in there yet)

There's nothing to account.  It's the same as mapping /dev/null or
similar -- the mm core should take care of it for you.


Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Tiwei Bie
On Tue, Apr 24, 2018 at 04:43:22AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 09:37:47AM +0800, Tiwei Bie wrote:
> > On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > > > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > > > > Hello everyone,
> > > > > > > > > 
> > > > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > > > > 
> > > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) 
> > > > > > > > > implemented
> > > > > > > > > by Jens at 
> > > > > > > > > http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > > > Minor changes are needed for the vhost code, e.g. to kick the 
> > > > > > > > > guest.
> > > > > > > > > 
> > > > > > > > > TODO:
> > > > > > > > > - Refinements and bug fixes;
> > > > > > > > > - Split into small patches;
> > > > > > > > > - Test indirect descriptor support;
> > > > > > > > > - Test/fix event suppression support;
> > > > > > > > > - Test devices other than net;
> > > > > > > > > 
> > > > > > > > > RFC v1 -> RFC v2:
> > > > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > > > - Split vring_unmap_one() for packed ring and split ring 
> > > > > > > > > (Jason);
> > > > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() 
> > > > > > > > > (Jason);
> > > > > > > > > - Some other refinements and bug fixes;
> > > > > > > > > 
> > > > > > > > > Thanks!
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Tiwei Bie 
> > > > > > > > > ---
> > > > > > > > >drivers/virtio/virtio_ring.c   | 1094 
> > > > > > > > > +---
> > > > > > > > >include/linux/virtio_ring.h|8 +-
> > > > > > > > >include/uapi/linux/virtio_config.h |   12 +-
> > > > > > > > >include/uapi/linux/virtio_ring.h   |   61 ++
> > > > > > > > >4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > > @@ -58,14 +58,15 @@
> > > > > > > > [...]
> > > > > > > > 
> > > > > > > > > +
> > > > > > > > > + if (vq->indirect) {
> > > > > > > > > + u32 len;
> > > > > > > > > +
> > > > > > > > > + desc = vq->desc_state[head].indir_desc;
> > > > > > > > > + /* Free the indirect table, if any, now that 
> > > > > > > > > it's unmapped. */
> > > > > > > > > + if (!desc)
> > > > > > > > > + goto out;
> > > > > > > > > +
> > > > > > > > > + len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > > > > +   
> > > > > > > > > vq->vring_packed.desc[head].len);
> > > > > > > > > +
> > > > > > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > > > > +  cpu_to_virtio16(vq->vq.vdev, 
> > > > > > > > > VRING_DESC_F_INDIRECT)));
> > > > > > > > It looks to me spec does not force to keep 
> > > > > > > > VRING_DESC_F_INDIRECT here. So we
> > > > > > > > can safely remove this BUG_ON() here.
> > > > > > > > 
> > > > > > > > > + BUG_ON(len == 0 || len % sizeof(struct 
> > > > > > > > > vring_packed_desc));
> > > > > > > > Len could be ignored for used descriptor according to the spec, 
> > > > > > > > so we need
> > > > > > > > remove this BUG_ON() too.
> > > > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > > > > And I think something related to this in the spec isn't very
> > > > > > > clear currently.
> > > > > > > 
> > > > > > > In the spec, there are below words:
> > > > > > > 
> > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > > > > """
> > > > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > > > > is reserved and is ignored by the device.
> > > > > > > """
> > > > > > > 
> > > > > > > So when device writes back an used descriptor in this case,
> > > > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > > > > is reserved and 

Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Michael S. Tsirkin
On Tue, Apr 24, 2018 at 09:37:47AM +0800, Tiwei Bie wrote:
> On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > > > 
> > > > > 
> > > > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > > > Hello everyone,
> > > > > > > > 
> > > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > > > 
> > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) 
> > > > > > > > implemented
> > > > > > > > by Jens at 
> > > > > > > > http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > > Minor changes are needed for the vhost code, e.g. to kick the 
> > > > > > > > guest.
> > > > > > > > 
> > > > > > > > TODO:
> > > > > > > > - Refinements and bug fixes;
> > > > > > > > - Split into small patches;
> > > > > > > > - Test indirect descriptor support;
> > > > > > > > - Test/fix event suppression support;
> > > > > > > > - Test devices other than net;
> > > > > > > > 
> > > > > > > > RFC v1 -> RFC v2:
> > > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > > - Split vring_unmap_one() for packed ring and split ring 
> > > > > > > > (Jason);
> > > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() 
> > > > > > > > (Jason);
> > > > > > > > - Some other refinements and bug fixes;
> > > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > 
> > > > > > > > Signed-off-by: Tiwei Bie 
> > > > > > > > ---
> > > > > > > >drivers/virtio/virtio_ring.c   | 1094 
> > > > > > > > +---
> > > > > > > >include/linux/virtio_ring.h|8 +-
> > > > > > > >include/uapi/linux/virtio_config.h |   12 +-
> > > > > > > >include/uapi/linux/virtio_ring.h   |   61 ++
> > > > > > > >4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > @@ -58,14 +58,15 @@
> > > > > > > [...]
> > > > > > > 
> > > > > > > > +
> > > > > > > > +   if (vq->indirect) {
> > > > > > > > +   u32 len;
> > > > > > > > +
> > > > > > > > +   desc = vq->desc_state[head].indir_desc;
> > > > > > > > +   /* Free the indirect table, if any, now that 
> > > > > > > > it's unmapped. */
> > > > > > > > +   if (!desc)
> > > > > > > > +   goto out;
> > > > > > > > +
> > > > > > > > +   len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > > > + 
> > > > > > > > vq->vring_packed.desc[head].len);
> > > > > > > > +
> > > > > > > > +   BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > > > +cpu_to_virtio16(vq->vq.vdev, 
> > > > > > > > VRING_DESC_F_INDIRECT)));
> > > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT 
> > > > > > > here. So we
> > > > > > > can safely remove this BUG_ON() here.
> > > > > > > 
> > > > > > > > +   BUG_ON(len == 0 || len % sizeof(struct 
> > > > > > > > vring_packed_desc));
> > > > > > > Len could be ignored for used descriptor according to the spec, 
> > > > > > > so we need
> > > > > > > remove this BUG_ON() too.
> > > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > > > And I think something related to this in the spec isn't very
> > > > > > clear currently.
> > > > > > 
> > > > > > In the spec, there are below words:
> > > > > > 
> > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > > > """
> > > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > > > is reserved and is ignored by the device.
> > > > > > """
> > > > > > 
> > > > > > So when device writes back an used descriptor in this case,
> > > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > > > is reserved and should be ignored.
> > > > > > 
> > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > > > """
> > > > > > Element Length is reserved for used descriptors without the
> > > > > > VIRTQ_DESC_F_WRITE flag, and 

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Michael S. Tsirkin
On Mon, Apr 23, 2018 at 06:25:03PM -0700, Stephen Hemminger wrote:
> On Mon, 23 Apr 2018 12:44:39 -0700
> Siwei Liu  wrote:
> 
> > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin  
> > wrote:
> > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:  
> > >> On Mon, 23 Apr 2018 20:24:56 +0300
> > >> "Michael S. Tsirkin"  wrote:
> > >>  
> > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:  
> > >> > > > >
> > >> > > > >I will NAK patches to change to common code for netvsc especially 
> > >> > > > >the
> > >> > > > >three device model.  MS worked hard with distro vendors to 
> > >> > > > >support transparent
> > >> > > > >mode, ans we really can't have a new model; or do backport.
> > >> > > > >
> > >> > > > >Plus, DPDK is now dependent on existing model.  
> > >> > > >
> > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities. 
> > >> > > >  
> > >> > >
> > >> > > The network device model is a userspace API, and DPDK is a userspace 
> > >> > > application.  
> > >> >
> > >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> > >> > AFAIK it's normally banging device registers directly.
> > >> >  
> > >> > > You can't go breaking userspace even if you don't like the 
> > >> > > application.  
> > >> >
> > >> > Could you please explain how is the proposed patchset breaking
> > >> > userspace? Ignoring DPDK for now, I don't think it changes the 
> > >> > userspace
> > >> > API at all.
> > >> >  
> > >>
> > >> The DPDK has a device driver vdev_netvsc which scans the Linux network 
> > >> devices
> > >> to look for Linux netvsc device and the paired VF device and setup the
> > >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) 
> > >> instance
> > >> and sets up TAP support over the Linux netvsc device as well as the 
> > >> Mellanox
> > >> VF device.
> > >>
> > >> So it depends on existing 2 device model. You can't go to a 3 device 
> > >> model
> > >> or start hiding devices from userspace.  
> > >
> > > Okay so how does the existing patch break that? IIUC does not go to
> > > a 3 device model since netvsc calls failover_register directly.
> > >  
> > >> Also, I am working on associating netvsc and VF device based on serial 
> > >> number
> > >> rather than MAC address. The serial number is how Windows works now, and 
> > >> it makes
> > >> sense for Linux and Windows to use the same mechanism if possible.  
> > >
> > > Maybe we should support same for virtio ...
> > > Which serial do you mean? From vpd?
> > >
> > > I guess you will want to keep supporting MAC for old hypervisors?
> 
> The serial number has always been in the hypervisor since original support of 
> SR-IOV
> in WS2008.  So no backward compatibility special cases would be needed.

Is that a serial from real hardware or a hypervisor thing?


-- 
MST


[PATCH 1/1] Revert "rds: ib: add error handle"

2018-04-23 Thread Zhu Yanjun
This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc.

After long time discussion and investigations, it seems that there
is no mem leak. So this patch is reverted.

Signed-off-by: Zhu Yanjun 
---
 net/rds/ib_cm.c | 47 +++
 1 file changed, 11 insertions(+), 36 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index eea1d86..d64bfaf 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -443,7 +443,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ic->i_send_cq = NULL;
ibdev_put_vector(rds_ibdev, ic->i_scq_vector);
rdsdebug("ib_create_cq send failed: %d\n", ret);
-   goto rds_ibdev_out;
+   goto out;
}
 
ic->i_rcq_vector = ibdev_get_unused_vector(rds_ibdev);
@@ -457,19 +457,19 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ic->i_recv_cq = NULL;
ibdev_put_vector(rds_ibdev, ic->i_rcq_vector);
rdsdebug("ib_create_cq recv failed: %d\n", ret);
-   goto send_cq_out;
+   goto out;
}
 
ret = ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
if (ret) {
rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
-   goto recv_cq_out;
+   goto out;
}
 
ret = ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
if (ret) {
rdsdebug("ib_req_notify_cq recv failed: %d\n", ret);
-   goto recv_cq_out;
+   goto out;
}
 
/* XXX negotiate max send/recv with remote? */
@@ -495,7 +495,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ret = rdma_create_qp(ic->i_cm_id, ic->i_pd, );
if (ret) {
rdsdebug("rdma_create_qp failed: %d\n", ret);
-   goto recv_cq_out;
+   goto out;
}
 
ic->i_send_hdrs = ib_dma_alloc_coherent(dev,
@@ -505,7 +505,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_send_hdrs) {
ret = -ENOMEM;
rdsdebug("ib_dma_alloc_coherent send failed\n");
-   goto qp_out;
+   goto out;
}
 
ic->i_recv_hdrs = ib_dma_alloc_coherent(dev,
@@ -515,7 +515,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_recv_hdrs) {
ret = -ENOMEM;
rdsdebug("ib_dma_alloc_coherent recv failed\n");
-   goto send_hdrs_dma_out;
+   goto out;
}
 
ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header),
@@ -523,7 +523,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_ack) {
ret = -ENOMEM;
rdsdebug("ib_dma_alloc_coherent ack failed\n");
-   goto recv_hdrs_dma_out;
+   goto out;
}
 
ic->i_sends = vzalloc_node(ic->i_send_ring.w_nr * sizeof(struct 
rds_ib_send_work),
@@ -531,7 +531,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_sends) {
ret = -ENOMEM;
rdsdebug("send allocation failed\n");
-   goto ack_dma_out;
+   goto out;
}
 
ic->i_recvs = vzalloc_node(ic->i_recv_ring.w_nr * sizeof(struct 
rds_ib_recv_work),
@@ -539,7 +539,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_recvs) {
ret = -ENOMEM;
rdsdebug("recv allocation failed\n");
-   goto sends_out;
+   goto out;
}
 
rds_ib_recv_init_ack(ic);
@@ -547,33 +547,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
rdsdebug("conn %p pd %p cq %p %p\n", conn, ic->i_pd,
 ic->i_send_cq, ic->i_recv_cq);
 
-   return ret;
-
-sends_out:
-   vfree(ic->i_sends);
-ack_dma_out:
-   ib_dma_free_coherent(dev, sizeof(struct rds_header),
-ic->i_ack, ic->i_ack_dma);
-recv_hdrs_dma_out:
-   ib_dma_free_coherent(dev, ic->i_recv_ring.w_nr *
-   sizeof(struct rds_header),
-   ic->i_recv_hdrs, ic->i_recv_hdrs_dma);
-send_hdrs_dma_out:
-   ib_dma_free_coherent(dev, ic->i_send_ring.w_nr *
-   sizeof(struct rds_header),
-   ic->i_send_hdrs, ic->i_send_hdrs_dma);
-qp_out:
-   rdma_destroy_qp(ic->i_cm_id);
-recv_cq_out:
-   if (!ib_destroy_cq(ic->i_recv_cq))
-   ic->i_recv_cq = NULL;
-send_cq_out:
-   if (!ib_destroy_cq(ic->i_send_cq))
-   ic->i_send_cq = NULL;
-rds_ibdev_out:
-   rds_ib_remove_conn(rds_ibdev, conn);
+out:
rds_ib_dev_put(rds_ibdev);
-
return ret;
 }
 
-- 
2.7.4



Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Tiwei Bie
On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > > 
> > > > 
> > > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > > Hello everyone,
> > > > > > > 
> > > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > > 
> > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) 
> > > > > > > implemented
> > > > > > > by Jens at 
> > > > > > > http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > > Minor changes are needed for the vhost code, e.g. to kick the 
> > > > > > > guest.
> > > > > > > 
> > > > > > > TODO:
> > > > > > > - Refinements and bug fixes;
> > > > > > > - Split into small patches;
> > > > > > > - Test indirect descriptor support;
> > > > > > > - Test/fix event suppression support;
> > > > > > > - Test devices other than net;
> > > > > > > 
> > > > > > > RFC v1 -> RFC v2:
> > > > > > > - Add indirect descriptor support - compile test only;
> > > > > > > - Add event suppression supprt - compile test only;
> > > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > > - Avoid using '%' operator (Jason);
> > > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > > - Some other refinements and bug fixes;
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > 
> > > > > > > Signed-off-by: Tiwei Bie 
> > > > > > > ---
> > > > > > >drivers/virtio/virtio_ring.c   | 1094 
> > > > > > > +---
> > > > > > >include/linux/virtio_ring.h|8 +-
> > > > > > >include/uapi/linux/virtio_config.h |   12 +-
> > > > > > >include/uapi/linux/virtio_ring.h   |   61 ++
> > > > > > >4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -58,14 +58,15 @@
> > > > > > [...]
> > > > > > 
> > > > > > > +
> > > > > > > + if (vq->indirect) {
> > > > > > > + u32 len;
> > > > > > > +
> > > > > > > + desc = vq->desc_state[head].indir_desc;
> > > > > > > + /* Free the indirect table, if any, now that it's 
> > > > > > > unmapped. */
> > > > > > > + if (!desc)
> > > > > > > + goto out;
> > > > > > > +
> > > > > > > + len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > > +   vq->vring_packed.desc[head].len);
> > > > > > > +
> > > > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > > +  cpu_to_virtio16(vq->vq.vdev, 
> > > > > > > VRING_DESC_F_INDIRECT)));
> > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT 
> > > > > > here. So we
> > > > > > can safely remove this BUG_ON() here.
> > > > > > 
> > > > > > > + BUG_ON(len == 0 || len % sizeof(struct 
> > > > > > > vring_packed_desc));
> > > > > > Len could be ignored for used descriptor according to the spec, so 
> > > > > > we need
> > > > > > remove this BUG_ON() too.
> > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > > And I think something related to this in the spec isn't very
> > > > > clear currently.
> > > > > 
> > > > > In the spec, there are below words:
> > > > > 
> > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > > """
> > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > > is reserved and is ignored by the device.
> > > > > """
> > > > > 
> > > > > So when device writes back an used descriptor in this case,
> > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > > is reserved and should be ignored.
> > > > > 
> > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > > """
> > > > > Element Length is reserved for used descriptors without the
> > > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > > > """
> > > > > 
> > > > > And this is the way how driver ignores the `len` in an used
> > > > > descriptor.
> > > > > 
> > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > > > """
> > > > > To increase ring capacity the driver can store a (read-only
> > > > > by the device) table of indirect descriptors 

Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Michael S. Tsirkin
On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote:
> On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > > 
> > > 
> > > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > > Hello everyone,
> > > > > > 
> > > > > > This RFC implements packed ring support for virtio driver.
> > > > > > 
> > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > > 
> > > > > > TODO:
> > > > > > - Refinements and bug fixes;
> > > > > > - Split into small patches;
> > > > > > - Test indirect descriptor support;
> > > > > > - Test/fix event suppression support;
> > > > > > - Test devices other than net;
> > > > > > 
> > > > > > RFC v1 -> RFC v2:
> > > > > > - Add indirect descriptor support - compile test only;
> > > > > > - Add event suppression supprt - compile test only;
> > > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > > - Avoid using '%' operator (Jason);
> > > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > > - Some other refinements and bug fixes;
> > > > > > 
> > > > > > Thanks!
> > > > > > 
> > > > > > Signed-off-by: Tiwei Bie 
> > > > > > ---
> > > > > >drivers/virtio/virtio_ring.c   | 1094 
> > > > > > +---
> > > > > >include/linux/virtio_ring.h|8 +-
> > > > > >include/uapi/linux/virtio_config.h |   12 +-
> > > > > >include/uapi/linux/virtio_ring.h   |   61 ++
> > > > > >4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -58,14 +58,15 @@
> > > > > [...]
> > > > > 
> > > > > > +
> > > > > > +   if (vq->indirect) {
> > > > > > +   u32 len;
> > > > > > +
> > > > > > +   desc = vq->desc_state[head].indir_desc;
> > > > > > +   /* Free the indirect table, if any, now that it's 
> > > > > > unmapped. */
> > > > > > +   if (!desc)
> > > > > > +   goto out;
> > > > > > +
> > > > > > +   len = virtio32_to_cpu(vq->vq.vdev,
> > > > > > + vq->vring_packed.desc[head].len);
> > > > > > +
> > > > > > +   BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > > +cpu_to_virtio16(vq->vq.vdev, 
> > > > > > VRING_DESC_F_INDIRECT)));
> > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT 
> > > > > here. So we
> > > > > can safely remove this BUG_ON() here.
> > > > > 
> > > > > > +   BUG_ON(len == 0 || len % sizeof(struct 
> > > > > > vring_packed_desc));
> > > > > Len could be ignored for used descriptor according to the spec, so we 
> > > > > need
> > > > > remove this BUG_ON() too.
> > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > > And I think something related to this in the spec isn't very
> > > > clear currently.
> > > > 
> > > > In the spec, there are below words:
> > > > 
> > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > > """
> > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > > is reserved and is ignored by the device.
> > > > """
> > > > 
> > > > So when device writes back an used descriptor in this case,
> > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > > is reserved and should be ignored.
> > > > 
> > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > > """
> > > > Element Length is reserved for used descriptors without the
> > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > > """
> > > > 
> > > > And this is the way how driver ignores the `len` in an used
> > > > descriptor.
> > > > 
> > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > > """
> > > > To increase ring capacity the driver can store a (read-only
> > > > by the device) table of indirect descriptors anywhere in memory,
> > > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > > containing this indirect descriptor table;
> > > > """
> > > > 
> > > > So the indirect descriptors in the table are read-only by
> > 

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Stephen Hemminger
On Mon, 23 Apr 2018 23:06:55 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
> > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin  
> > wrote:  
> > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:  
> > >> On Mon, 23 Apr 2018 20:24:56 +0300
> > >> "Michael S. Tsirkin"  wrote:
> > >>  
> > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:  
> > >> > > > >
> > >> > > > >I will NAK patches to change to common code for netvsc especially 
> > >> > > > >the
> > >> > > > >three device model.  MS worked hard with distro vendors to 
> > >> > > > >support transparent
> > >> > > > >mode, ans we really can't have a new model; or do backport.
> > >> > > > >
> > >> > > > >Plus, DPDK is now dependent on existing model.  
> > >> > > >
> > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities. 
> > >> > > >  
> > >> > >
> > >> > > The network device model is a userspace API, and DPDK is a userspace 
> > >> > > application.  
> > >> >
> > >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> > >> > AFAIK it's normally banging device registers directly.
> > >> >  
> > >> > > You can't go breaking userspace even if you don't like the 
> > >> > > application.  
> > >> >
> > >> > Could you please explain how is the proposed patchset breaking
> > >> > userspace? Ignoring DPDK for now, I don't think it changes the 
> > >> > userspace
> > >> > API at all.
> > >> >  
> > >>
> > >> The DPDK has a device driver vdev_netvsc which scans the Linux network 
> > >> devices
> > >> to look for Linux netvsc device and the paired VF device and setup the
> > >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) 
> > >> instance
> > >> and sets up TAP support over the Linux netvsc device as well as the 
> > >> Mellanox
> > >> VF device.
> > >>
> > >> So it depends on existing 2 device model. You can't go to a 3 device 
> > >> model
> > >> or start hiding devices from userspace.  
> > >
> > > Okay so how does the existing patch break that? IIUC does not go to
> > > a 3 device model since netvsc calls failover_register directly.
> > >  
> > >> Also, I am working on associating netvsc and VF device based on serial 
> > >> number
> > >> rather than MAC address. The serial number is how Windows works now, and 
> > >> it makes
> > >> sense for Linux and Windows to use the same mechanism if possible.  
> > >
> > > Maybe we should support same for virtio ...
> > > Which serial do you mean? From vpd?
> > >
> > > I guess you will want to keep supporting MAC for old hypervisors?
> > >
> > > It all seems like a reasonable thing to support in the generic core.  
> > 
> > That's the reason why I chose explicit identifier rather than rely on
> > MAC address to bind/pair a device. MAC address can change. Even if it
> > can't, malicious guest user can fake MAC address to skip binding.
> > 
> > -Siwei  
> 
> Address should be sampled at device creation to prevent this
> kind of hack. Not that it buys the malicious user much:
> if you can poke at MAC addresses you probably already can
> break networking.

On Hyper-V guest can't really change MAC address if SR-IOV is enabled.
Also, Linux has permanent ether address in netdev which is what should
be used to avoid user messing with device.


Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Stephen Hemminger
On Mon, 23 Apr 2018 12:44:39 -0700
Siwei Liu  wrote:

> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin  wrote:
> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:  
> >> On Mon, 23 Apr 2018 20:24:56 +0300
> >> "Michael S. Tsirkin"  wrote:
> >>  
> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:  
> >> > > > >
> >> > > > >I will NAK patches to change to common code for netvsc especially 
> >> > > > >the
> >> > > > >three device model.  MS worked hard with distro vendors to support 
> >> > > > >transparent
> >> > > > >mode, ans we really can't have a new model; or do backport.
> >> > > > >
> >> > > > >Plus, DPDK is now dependent on existing model.  
> >> > > >
> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.  
> >> > >
> >> > > The network device model is a userspace API, and DPDK is a userspace 
> >> > > application.  
> >> >
> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> >> > AFAIK it's normally banging device registers directly.
> >> >  
> >> > > You can't go breaking userspace even if you don't like the 
> >> > > application.  
> >> >
> >> > Could you please explain how is the proposed patchset breaking
> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> >> > API at all.
> >> >  
> >>
> >> The DPDK has a device driver vdev_netvsc which scans the Linux network 
> >> devices
> >> to look for Linux netvsc device and the paired VF device and setup the
> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> >> and sets up TAP support over the Linux netvsc device as well as the 
> >> Mellanox
> >> VF device.
> >>
> >> So it depends on existing 2 device model. You can't go to a 3 device model
> >> or start hiding devices from userspace.  
> >
> > Okay so how does the existing patch break that? IIUC does not go to
> > a 3 device model since netvsc calls failover_register directly.
> >  
> >> Also, I am working on associating netvsc and VF device based on serial 
> >> number
> >> rather than MAC address. The serial number is how Windows works now, and 
> >> it makes
> >> sense for Linux and Windows to use the same mechanism if possible.  
> >
> > Maybe we should support same for virtio ...
> > Which serial do you mean? From vpd?
> >
> > I guess you will want to keep supporting MAC for old hypervisors?

The serial number has always been in the hypervisor since original support of 
SR-IOV
in WS2008.  So no backward compatibility special cases would be needed.




Re: [PATCH net 0/3] amd-xgbe: AMD XGBE driver fixes 2018-04-23

2018-04-23 Thread David Miller
From: Tom Lendacky 
Date: Mon, 23 Apr 2018 11:42:58 -0500

> This patch series addresses some issues in the AMD XGBE driver.
> 
> The following fixes are included in this driver update series:
> 
> - Improve KR auto-negotiation and training (2 patches)
>   - Add pre and post auto-negotiation hooks
>   - Use the pre and post auto-negotiation hooks to disable CDR tracking
> during auto-negotiation page exchange in KR mode
> - Check for SFP tranceiver signal support and only use the signal if the
>   SFP indicates that it is supported
> 
> This patch series is based on net.
> 
> ---
> 
> Please queue this patch series up to stable releases 4.14 and above.

Series applied and queued up for -stable, thanks.


Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Tiwei Bie
On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> > 
> > 
> > On 2018年04月23日 17:29, Tiwei Bie wrote:
> > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > Hello everyone,
> > > > > 
> > > > > This RFC implements packed ring support for virtio driver.
> > > > > 
> > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > > 
> > > > > TODO:
> > > > > - Refinements and bug fixes;
> > > > > - Split into small patches;
> > > > > - Test indirect descriptor support;
> > > > > - Test/fix event suppression support;
> > > > > - Test devices other than net;
> > > > > 
> > > > > RFC v1 -> RFC v2:
> > > > > - Add indirect descriptor support - compile test only;
> > > > > - Add event suppression supprt - compile test only;
> > > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > > - Avoid using '%' operator (Jason);
> > > > > - Rename free_head -> next_avail_idx (Jason);
> > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > > - Some other refinements and bug fixes;
> > > > > 
> > > > > Thanks!
> > > > > 
> > > > > Signed-off-by: Tiwei Bie 
> > > > > ---
> > > > >drivers/virtio/virtio_ring.c   | 1094 
> > > > > +---
> > > > >include/linux/virtio_ring.h|8 +-
> > > > >include/uapi/linux/virtio_config.h |   12 +-
> > > > >include/uapi/linux/virtio_ring.h   |   61 ++
> > > > >4 files changed, 980 insertions(+), 195 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 71458f493cf8..0515dca34d77 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -58,14 +58,15 @@
> > > > [...]
> > > > 
> > > > > +
> > > > > + if (vq->indirect) {
> > > > > + u32 len;
> > > > > +
> > > > > + desc = vq->desc_state[head].indir_desc;
> > > > > + /* Free the indirect table, if any, now that it's 
> > > > > unmapped. */
> > > > > + if (!desc)
> > > > > + goto out;
> > > > > +
> > > > > + len = virtio32_to_cpu(vq->vq.vdev,
> > > > > +   vq->vring_packed.desc[head].len);
> > > > > +
> > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > > +  cpu_to_virtio16(vq->vq.vdev, 
> > > > > VRING_DESC_F_INDIRECT)));
> > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. 
> > > > So we
> > > > can safely remove this BUG_ON() here.
> > > > 
> > > > > + BUG_ON(len == 0 || len % sizeof(struct 
> > > > > vring_packed_desc));
> > > > Len could be ignored for used descriptor according to the spec, so we 
> > > > need
> > > > remove this BUG_ON() too.
> > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > > And I think something related to this in the spec isn't very
> > > clear currently.
> > > 
> > > In the spec, there are below words:
> > > 
> > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > > """
> > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > > is reserved and is ignored by the device.
> > > """
> > > 
> > > So when device writes back an used descriptor in this case,
> > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > > is reserved and should be ignored.
> > > 
> > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > > """
> > > Element Length is reserved for used descriptors without the
> > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > > """
> > > 
> > > And this is the way how driver ignores the `len` in an used
> > > descriptor.
> > > 
> > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > > """
> > > To increase ring capacity the driver can store a (read-only
> > > by the device) table of indirect descriptors anywhere in memory,
> > > and insert a descriptor in the main virtqueue (with \field{Flags}
> > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > > containing this indirect descriptor table;
> > > """
> > > 
> > > So the indirect descriptors in the table are read-only by
> > > the device. And the only descriptor which is writeable by
> > > the device is the descriptor in the main virtqueue (with
> > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > > `len` in this descriptor, we won't be able to get the
> > > length 

Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Jason Wang



On 2018年04月24日 09:05, Michael S. Tsirkin wrote:

+   if (vq->indirect) {
+   u32 len;
+
+   desc = vq->desc_state[head].indir_desc;
+   /* Free the indirect table, if any, now that it's unmapped. */
+   if (!desc)
+   goto out;
+
+   len = virtio32_to_cpu(vq->vq.vdev,
+ vq->vring_packed.desc[head].len);
+
+   BUG_ON(!(vq->vring_packed.desc[head].flags &
+cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));

It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
can safely remove this BUG_ON() here.


+   BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));

Len could be ignored for used descriptor according to the spec, so we need
remove this BUG_ON() too.

Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
And I think something related to this in the spec isn't very
clear currently.

In the spec, there are below words:

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
"""
In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
is reserved and is ignored by the device.
"""

So when device writes back an used descriptor in this case,
device may not set the VIRTQ_DESC_F_WRITE flag as the flag
is reserved and should be ignored.

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
"""
Element Length is reserved for used descriptors without the
VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
"""

And this is the way how driver ignores the `len` in an used
descriptor.

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
"""
To increase ring capacity the driver can store a (read-only
by the device) table of indirect descriptors anywhere in memory,
and insert a descriptor in the main virtqueue (with \field{Flags}
bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
containing this indirect descriptor table;
"""

So the indirect descriptors in the table are read-only by
the device. And the only descriptor which is writeable by
the device is the descriptor in the main virtqueue (with
Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
`len` in this descriptor, we won't be able to get the
length of the data written by the device.

So I think the `len` in this descriptor will carry the
length of the data written by the device (if the buffers
are writable to the device) even if the VIRTQ_DESC_F_WRITE
isn't set by the device. How do you think?

Yes I think so. But we'd better need clarification from Michael.

I think if you use a descriptor, and you want to supply len
to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor.
Spec also says you must not set VIRTQ_DESC_F_INDIRECT then.
If that's a problem we could look at relaxing that last requirement -
does driver want INDIRECT in used descriptor to match
the value in the avail descriptor for some reason?



Looks not, so what I get it:

- device and set VIRTQ_DESC_F_WRITE flag for used descriptor when needed
- there no need to keep INDIRECT flag in used descriptor

So for the above case, we can just have a used descriptor with _F_WRITE 
but without INDIRECT flag.


Thanks



Re: [PATCH net] pppoe: check sockaddr length in pppoe_connect()

2018-04-23 Thread David Miller
From: Guillaume Nault 
Date: Mon, 23 Apr 2018 16:38:27 +0200

> We must validate sockaddr_len, otherwise userspace can pass fewer data
> than we expect and we end up accessing invalid data.
> 
> Fixes: 224cf5ad14c0 ("ppp: Move the PPP drivers")
> Reported-by: syzbot+4f03bdf92fdf9ef5d...@syzkaller.appspotmail.com
> Signed-off-by: Guillaume Nault 

Applied and queued up for -stable, thank you.


Re: [PATCH net] l2tp: check sockaddr length in pppol2tp_connect()

2018-04-23 Thread David Miller
From: Guillaume Nault 
Date: Mon, 23 Apr 2018 16:15:14 +0200

> Check sockaddr_len before dereferencing sp->sa_protocol, to ensure that
> it actually points to valid data.
> 
> Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp 
> parts")
> Reported-by: syzbot+a70ac890b23b1bf29...@syzkaller.appspotmail.com
> Signed-off-by: Guillaume Nault 

Applied and queued up for -stable.

I guess you can completely remove the "bad socket address" -EINVAL else
clause later in the function as a cleanup in net-next.


Re: [PATCH] selftests: net: update .gitignore with missing test

2018-04-23 Thread David Miller
From: Anders Roxell 
Date: Mon, 23 Apr 2018 16:00:50 +0200

> Fixes: 192dc405f308 ("selftests: net: add tcp_mmap program")
> Signed-off-by: Anders Roxell 

Applied, thanks.


Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Michael S. Tsirkin
On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote:
> 
> 
> On 2018年04月23日 17:29, Tiwei Bie wrote:
> > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:
> > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > Hello everyone,
> > > > 
> > > > This RFC implements packed ring support for virtio driver.
> > > > 
> > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > > 
> > > > TODO:
> > > > - Refinements and bug fixes;
> > > > - Split into small patches;
> > > > - Test indirect descriptor support;
> > > > - Test/fix event suppression support;
> > > > - Test devices other than net;
> > > > 
> > > > RFC v1 -> RFC v2:
> > > > - Add indirect descriptor support - compile test only;
> > > > - Add event suppression supprt - compile test only;
> > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > - Avoid using '%' operator (Jason);
> > > > - Rename free_head -> next_avail_idx (Jason);
> > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > - Some other refinements and bug fixes;
> > > > 
> > > > Thanks!
> > > > 
> > > > Signed-off-by: Tiwei Bie 
> > > > ---
> > > >drivers/virtio/virtio_ring.c   | 1094 
> > > > +---
> > > >include/linux/virtio_ring.h|8 +-
> > > >include/uapi/linux/virtio_config.h |   12 +-
> > > >include/uapi/linux/virtio_ring.h   |   61 ++
> > > >4 files changed, 980 insertions(+), 195 deletions(-)
> > > > 
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 71458f493cf8..0515dca34d77 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -58,14 +58,15 @@
> > > [...]
> > > 
> > > > +
> > > > +   if (vq->indirect) {
> > > > +   u32 len;
> > > > +
> > > > +   desc = vq->desc_state[head].indir_desc;
> > > > +   /* Free the indirect table, if any, now that it's 
> > > > unmapped. */
> > > > +   if (!desc)
> > > > +   goto out;
> > > > +
> > > > +   len = virtio32_to_cpu(vq->vq.vdev,
> > > > + vq->vring_packed.desc[head].len);
> > > > +
> > > > +   BUG_ON(!(vq->vring_packed.desc[head].flags &
> > > > +cpu_to_virtio16(vq->vq.vdev, 
> > > > VRING_DESC_F_INDIRECT)));
> > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So 
> > > we
> > > can safely remove this BUG_ON() here.
> > > 
> > > > +   BUG_ON(len == 0 || len % sizeof(struct 
> > > > vring_packed_desc));
> > > Len could be ignored for used descriptor according to the spec, so we need
> > > remove this BUG_ON() too.
> > Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
> > And I think something related to this in the spec isn't very
> > clear currently.
> > 
> > In the spec, there are below words:
> > 
> > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
> > """
> > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
> > is reserved and is ignored by the device.
> > """
> > 
> > So when device writes back an used descriptor in this case,
> > device may not set the VIRTQ_DESC_F_WRITE flag as the flag
> > is reserved and should be ignored.
> > 
> > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
> > """
> > Element Length is reserved for used descriptors without the
> > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
> > """
> > 
> > And this is the way how driver ignores the `len` in an used
> > descriptor.
> > 
> > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
> > """
> > To increase ring capacity the driver can store a (read-only
> > by the device) table of indirect descriptors anywhere in memory,
> > and insert a descriptor in the main virtqueue (with \field{Flags}
> > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
> > containing this indirect descriptor table;
> > """
> > 
> > So the indirect descriptors in the table are read-only by
> > the device. And the only descriptor which is writeable by
> > the device is the descriptor in the main virtqueue (with
> > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
> > `len` in this descriptor, we won't be able to get the
> > length of the data written by the device.
> > 
> > So I think the `len` in this descriptor will carry the
> > length of the data written by the device (if the buffers
> > are writable to the device) even if the VIRTQ_DESC_F_WRITE
> > isn't set by the device. How do you think?
> 
> Yes I think so. But 

Re: [RFC V3 PATCH 0/8] Packed ring for vhost

2018-04-23 Thread Jason Wang



On 2018年04月24日 04:11, Konrad Rzeszutek Wilk wrote:

On Mon, Apr 23, 2018 at 10:59:43PM +0300, Michael S. Tsirkin wrote:

On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote:

On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote:

Hi all:

This RFC implement packed ring layout. The code were tested with
Tiwei's RFC V2 a thttps://lkml.org/lkml/2018/4/1/48. Some fixups and
tweaks were needed on top of Tiwei's code to make it run. TCP stream
and pktgen does not show obvious difference compared with split ring.

I have to ask then - what is the benefit of this?

You can use this with dpdk within guest.
The DPDK version did see a performance improvement so hopefully with

Is there a link to this performance improvement document?



You probably want to this:

https://www.mail-archive.com/dev@dpdk.org/msg97735.html

Thanks


Re: [PATCH] dca: make function dca_common_get_tag static

2018-04-23 Thread David Miller
From: Colin King 
Date: Mon, 23 Apr 2018 13:49:38 +0100

> From: Colin Ian King 
> 
> Function dca_common_get_tag is local to the source and does not need to be
> in global scope, so make it static.
> 
> Cleans up sparse warning:
> drivers/dca/dca-core.c:273:4: warning: symbol 'dca_common_get_tag' was
> not declared. Should it be static?
> 
> Signed-off-by: Colin Ian King 

Applied to net-next, thanks.


Re: [PATCH V6 net-next 08/15] net/tls: Support TLS device offload with IPv6

2018-04-23 Thread David Miller
From: Boris Pismenny 
Date: Sun, 22 Apr 2018 18:19:50 +0300

> @@ -97,13 +102,57 @@ static void tls_device_queue_ctx_destruction(struct 
> tls_context *ctx)
>   spin_unlock_irqrestore(_device_lock, flags);
>  }
>  
> +#if IS_ENABLED(CONFIG_IPV6)
> +static struct net_device *ipv6_get_netdev(struct sock *sk)
> +{
> + struct net_device *dev = NULL;
> + struct inet_sock *inet = inet_sk(sk);
> + struct ipv6_pinfo *np = inet6_sk(sk);
> + struct flowi6 _fl6, *fl6 = &_fl6;
> + struct dst_entry *dst;

Ugh, please use sk->sk_dst_cache->dev and avoid all of the unnecessary
work.

Thank you.


Re: [RFC v2] virtio: support packed ring

2018-04-23 Thread Jason Wang



On 2018年04月23日 17:29, Tiwei Bie wrote:

On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote:

On 2018年04月01日 22:12, Tiwei Bie wrote:

Hello everyone,

This RFC implements packed ring support for virtio driver.

The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
Minor changes are needed for the vhost code, e.g. to kick the guest.

TODO:
- Refinements and bug fixes;
- Split into small patches;
- Test indirect descriptor support;
- Test/fix event suppression support;
- Test devices other than net;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression supprt - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!

Signed-off-by: Tiwei Bie 
---
   drivers/virtio/virtio_ring.c   | 1094 
+---
   include/linux/virtio_ring.h|8 +-
   include/uapi/linux/virtio_config.h |   12 +-
   include/uapi/linux/virtio_ring.h   |   61 ++
   4 files changed, 980 insertions(+), 195 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..0515dca34d77 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,14 +58,15 @@

[...]


+
+   if (vq->indirect) {
+   u32 len;
+
+   desc = vq->desc_state[head].indir_desc;
+   /* Free the indirect table, if any, now that it's unmapped. */
+   if (!desc)
+   goto out;
+
+   len = virtio32_to_cpu(vq->vq.vdev,
+ vq->vring_packed.desc[head].len);
+
+   BUG_ON(!(vq->vring_packed.desc[head].flags &
+cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT)));

It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we
can safely remove this BUG_ON() here.


+   BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc));

Len could be ignored for used descriptor according to the spec, so we need
remove this BUG_ON() too.

Yeah, you're right! The BUG_ON() isn't right. I'll remove it.
And I think something related to this in the spec isn't very
clear currently.

In the spec, there are below words:

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272
"""
In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE
is reserved and is ignored by the device.
"""

So when device writes back an used descriptor in this case,
device may not set the VIRTQ_DESC_F_WRITE flag as the flag
is reserved and should be ignored.

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170
"""
Element Length is reserved for used descriptors without the
VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
"""

And this is the way how driver ignores the `len` in an used
descriptor.

https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241
"""
To increase ring capacity the driver can store a (read-only
by the device) table of indirect descriptors anywhere in memory,
and insert a descriptor in the main virtqueue (with \field{Flags}
bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element
containing this indirect descriptor table;
"""

So the indirect descriptors in the table are read-only by
the device. And the only descriptor which is writeable by
the device is the descriptor in the main virtqueue (with
Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the
`len` in this descriptor, we won't be able to get the
length of the data written by the device.

So I think the `len` in this descriptor will carry the
length of the data written by the device (if the buffers
are writable to the device) even if the VIRTQ_DESC_F_WRITE
isn't set by the device. How do you think?


Yes I think so. But we'd better need clarification from Michael.





The reason is we don't touch descriptor ring in the case of split, so
BUG_ON()s may help there.


+
+   for (j = 0; j < len / sizeof(struct vring_packed_desc); j++)
+   vring_unmap_one_packed(vq, [j]);
+
+   kfree(desc);
+   vq->desc_state[head].indir_desc = NULL;
+   } else if (ctx) {
+   *ctx = vq->desc_state[head].indir_desc;
+   }
+
+out:
+   return vq->desc_state[head].num;
+}
+
+static inline bool more_used_split(const struct vring_virtqueue *vq)
   {
return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, 
vq->vring.used->idx);
   }
+static inline bool more_used_packed(const struct vring_virtqueue *vq)
+{
+   u16 last_used, 

Re: [PATCH V6 net-next 07/15] net/tls: Add generic NIC offload infrastructure

2018-04-23 Thread David Miller
From: Boris Pismenny 
Date: Sun, 22 Apr 2018 18:19:49 +0300

> +/* We assume that the socket is already connected */
> +static struct net_device *get_netdev_for_sock(struct sock *sk)
> +{
> + struct inet_sock *inet = inet_sk(sk);
> + struct net_device *netdev = NULL;
> +
> + netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
> +
> + return netdev;
> +}

Please do this more directly by looking at sk->sk_dst_cache and taking
the device from dst->dev if non-NULL.

That avoids the dev_get_by_index() demux work, and also if
sk->sk_dst_cache is non-NULL then the socket is indeed connected.

Thanks.


Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-23 Thread Mikulas Patocka


On Mon, 23 Apr 2018, Michal Hocko wrote:

> On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
> 
> > > > He didn't want to fix vmalloc(GFP_NOIO)
> > > 
> > > I don't remember that conversation, so I don't know whether I agree with
> > > his reasoning or not.  But we are supposed to be moving away from GFP_NOIO
> > > towards marking regions with memalloc_noio_save() / restore.  If you do
> > > that, you won't need vmalloc(GFP_NOIO).
> > 
> > He said the same thing a year ago. And there was small progress. 6 out of 
> > 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in 
> > infiniband and 1 in btrfs. (the whole discussion is here 
> > http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )
> 
> Well this is not that easy. It requires a cooperation from maintainers.
> I can only do as much. I've posted patches in the past and actively
> bringing up this topic at LSFMM last two years...

You're right - but you have chosen the uneasy path. Fixing __vmalloc code 
is easy and it doesn't require cooperation with maintainers.

> > He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4 
> > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why 
> > does he have veto over this part of the code? I'd much rather argue with 
> > people who have constructive comments about fixing bugs than with him.
> 
> I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
> I would be much more willing to change my mind if you would back your
> patch by a real bug report. Hacks are acceptable when we have a real
> issue in hands. But if we want to fix potential issue then better make
> it properly.

Developers should fix bugs in advance, not to wait until a crash hapens, 
is analyzed and reported.

What's the problem with 15-line hack? Is the problem that kernel 
developers would feel depressed when looking the source code? Other than 
harming developers' feelings, I don't see what kind of damange could that 
piece of code do.

Mikulas


Re: [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address

2018-04-23 Thread Yonghong Song


Hi, Peter,

I have a question regarding to one of your comments below.

On 3/12/18 3:01 PM, Peter Zijlstra wrote:

On Mon, Mar 12, 2018 at 01:39:56PM -0700, Song Liu wrote:

+static void stack_map_get_build_id_offset(struct bpf_map *map,
+ struct stack_map_bucket *bucket,
+ u64 *ips, u32 trace_nr)
+{
+   int i;
+   struct vm_area_struct *vma;
+   struct bpf_stack_build_id *id_offs;
+
+   bucket->nr = trace_nr;
+   id_offs = (struct bpf_stack_build_id *)bucket->data;
+
+   if (!current || !current->mm ||
+   down_read_trylock(>mm->mmap_sem) == 0) {


You probably want an in_nmi() before the down_read_trylock(). Doing
up_read() is an absolute no-no from NMI context.


The below is the final code from Song:

/*
 * We cannot do up_read() in nmi context, so build_id lookup is
 * only supported for non-nmi events. If at some point, it is
 * possible to run find_vma() without taking the semaphore, we
 * would like to allow build_id lookup in nmi context.
 *
 * Same fallback is used for kernel stack (!user) on a stackmap
 * with build_id.
 */
if (!user || !current || !current->mm || in_nmi() ||
down_read_trylock(>mm->mmap_sem) == 0) {
/* cannot access current->mm, fall back to ips */
for (i = 0; i < trace_nr; i++) {
id_offs[i].status = BPF_STACK_BUILD_ID_IP;
id_offs[i].ip = ips[i];
}
return;
}





And IIUC its 'trivial' to use this stuff with hardware counters.


Here, you mentioned that it was 'trivial' to use buildid thing with 
hardware counters, if I interpreted correctly. However, the hardware 
counter overflow will trigger NMI. Based on the above logic,

it will default to old IP only behavior.

Could you explain a little more how to get buildid with hardware
counter overflow events?

Thanks!




+   /* cannot access current->mm, fall back to ips */
+   for (i = 0; i < trace_nr; i++) {
+   id_offs[i].status = BPF_STACK_BUILD_ID_IP;
+   id_offs[i].ip = ips[i];
+   }
+   return;
+   }
+
+   for (i = 0; i < trace_nr; i++) {
+   vma = find_vma(current->mm, ips[i]);
+   if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) {
+   /* per entry fall back to ips */
+   id_offs[i].status = BPF_STACK_BUILD_ID_IP;
+   id_offs[i].ip = ips[i];
+   continue;
+   }
+   id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i]
+   - vma->vm_start;
+   id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
+   }
+   up_read(>mm->mmap_sem);
+}




[PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

2018-04-23 Thread Mikulas Patocka
The kvmalloc function tries to use kmalloc and falls back to vmalloc if
kmalloc fails.

Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
uses DMA-API on the returned memory or frees it with kfree. Such bugs were
found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
code.

These bugs are hard to reproduce because kvmalloc falls back to vmalloc
only if memory is fragmented.

In order to detect these bugs reliably I submit this patch that changes
kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG
is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer
verify the addresses passed to it, and so the user will get a reliable
stacktrace.

Some bugs (such as buffer overflows) are better detected
with kmalloc code, so we must test the kmalloc path too.

Signed-off-by: Mikulas Patocka 

---
 mm/util.c |   10 ++
 1 file changed, 10 insertions(+)

Index: linux-2.6/mm/util.c
===
--- linux-2.6.orig/mm/util.c2018-04-23 00:12:05.0 +0200
+++ linux-2.6/mm/util.c 2018-04-23 17:57:02.0 +0200
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
 */
WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
 
+#ifdef CONFIG_DEBUG_SG
+   /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
+   if (!(prandom_u32_max(2) & 1))
+   goto do_vmalloc;
+#endif
+
/*
 * We want to attempt a large physically contiguous block first because
 * it is less likely to fragment multiple larger blocks and therefore
@@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
if (ret || size <= PAGE_SIZE)
return ret;
 
+#ifdef CONFIG_DEBUG_SG
+do_vmalloc:
+#endif
return __vmalloc_node_flags_caller(size, node, flags,
__builtin_return_address(0));
 }


Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap

2018-04-23 Thread Willem de Bruijn
On Mon, Apr 23, 2018 at 7:21 PM, Michael S. Tsirkin  wrote:
> On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
>> From: Magnus Karlsson 
>>
>> Here, we add another setsockopt for registered user memory (umem)
>> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
>> ask the kernel to allocate a queue (ring buffer) and also mmap it
>> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
>>
>> The queue is used to explicitly pass ownership of umem frames from the
>> user process to the kernel. These frames will in a later patch be
>> filled in with Rx packet data by the kernel.
>>
>> Signed-off-by: Magnus Karlsson 
>> ---
>>  include/uapi/linux/if_xdp.h | 15 +++
>>  net/xdp/Makefile|  2 +-
>>  net/xdp/xdp_umem.c  |  5 
>>  net/xdp/xdp_umem.h  |  2 ++
>>  net/xdp/xsk.c   | 62 
>> -
>>  net/xdp/xsk_queue.c | 58 ++
>>  net/xdp/xsk_queue.h | 38 +++
>>  7 files changed, 180 insertions(+), 2 deletions(-)
>>  create mode 100644 net/xdp/xsk_queue.c
>>  create mode 100644 net/xdp/xsk_queue.h
>>
>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>> index 41252135a0fe..975661e1baca 100644
>> --- a/include/uapi/linux/if_xdp.h
>> +++ b/include/uapi/linux/if_xdp.h
>> @@ -23,6 +23,7 @@
>>
>>  /* XDP socket options */
>>  #define XDP_UMEM_REG 3
>> +#define XDP_UMEM_FILL_RING   4
>>
>>  struct xdp_umem_reg {
>>   __u64 addr; /* Start of packet data area */
>> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>>   __u32 frame_headroom; /* Frame head room */
>>  };
>>
>> +/* Pgoff for mmaping the rings */
>> +#define XDP_UMEM_PGOFF_FILL_RING 0x1
>> +
>> +struct xdp_ring {
>> + __u32 producer __attribute__((aligned(64)));
>> + __u32 consumer __attribute__((aligned(64)));
>> +};
>
> Why 64? And do you still need these guys in uapi?

I was just about to ask the same. You mean cacheline_aligned?

>> +static int xsk_mmap(struct file *file, struct socket *sock,
>> + struct vm_area_struct *vma)
>> +{
>> + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
>> + unsigned long size = vma->vm_end - vma->vm_start;
>> + struct xdp_sock *xs = xdp_sk(sock->sk);
>> + struct xsk_queue *q;
>> + unsigned long pfn;
>> + struct page *qpg;
>> +
>> + if (!xs->umem)
>> + return -EINVAL;
>> +
>> + if (offset == XDP_UMEM_PGOFF_FILL_RING)
>> + q = xs->umem->fq;
>> + else
>> + return -EINVAL;
>> +
>> + qpg = virt_to_head_page(q->ring);

Is it assured that q is initialized with a call to setsockopt
XDP_UMEM_FILL_RING before the call the mmap?

In general, with such an extensive new API, it might be worthwhile to
run syzkaller locally on a kernel with these patches. It is pretty
easy to set up 
(https://github.com/google/syzkaller/blob/master/docs/linux/setup.md),
though it also needs to be taught about any new APIs.


Re: [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets

2018-04-23 Thread Michael S. Tsirkin
On Mon, Apr 23, 2018 at 03:56:19PM +0200, Björn Töpel wrote:
> From: Magnus Karlsson 
> 
> This is a sample application for AF_XDP sockets. The application
> supports three different modes of operation: rxdrop, txonly and l2fwd.
> 
> To show-case a simple round-robin load-balancing between a set of
> sockets in an xskmap, set the RR_LB compile time define option to 1 in
> "xdpsock.h".
> 
> Co-authored-by: Björn Töpel 
> Signed-off-by: Björn Töpel 
> Signed-off-by: Magnus Karlsson 
> ---
>  samples/bpf/Makefile   |   4 +
>  samples/bpf/xdpsock.h  |  11 +
>  samples/bpf/xdpsock_kern.c |  56 +++
>  samples/bpf/xdpsock_user.c | 947 
> +
>  4 files changed, 1018 insertions(+)
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index aa8c392e2e52..d0ddc1abf20d 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
>  hostprogs-y += syscall_tp
>  hostprogs-y += cpustat
>  hostprogs-y += xdp_adjust_tail
> +hostprogs-y += xdpsock
>  
>  # Libbpf dependencies
>  LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
> @@ -97,6 +98,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) 
> xdp_rxq_info_user.o
>  syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
>  cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
>  xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
> +xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
>  
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
> @@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
>  always += syscall_tp_kern.o
>  always += cpustat_kern.o
>  always += xdp_adjust_tail_kern.o
> +always += xdpsock_kern.o
>  
>  HOSTCFLAGS += -I$(objtree)/usr/include
>  HOSTCFLAGS += -I$(srctree)/tools/lib/
> @@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
>  HOSTLOADLIBES_syscall_tp += -lelf
>  HOSTLOADLIBES_cpustat += -lelf
>  HOSTLOADLIBES_xdp_adjust_tail += -lelf
> +HOSTLOADLIBES_xdpsock += -lelf -pthread
>  
>  # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
> cmdline:
>  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
> CLANG=~/git/llvm/build/bin/clang
> diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
> new file mode 100644
> index ..533ab81adfa1
> --- /dev/null
> +++ b/samples/bpf/xdpsock.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef XDPSOCK_H_
> +#define XDPSOCK_H_
> +
> +/* Power-of-2 number of sockets */
> +#define MAX_SOCKS 4
> +
> +/* Round-robin receive */
> +#define RR_LB 0
> +
> +#endif /* XDPSOCK_H_ */
> diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
> new file mode 100644
> index ..d8806c41362e
> --- /dev/null
> +++ b/samples/bpf/xdpsock_kern.c
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define KBUILD_MODNAME "foo"
> +#include 
> +#include "bpf_helpers.h"
> +
> +#include "xdpsock.h"
> +
> +struct bpf_map_def SEC("maps") qidconf_map = {
> + .type   = BPF_MAP_TYPE_ARRAY,
> + .key_size   = sizeof(int),
> + .value_size = sizeof(int),
> + .max_entries= 1,
> +};
> +
> +struct bpf_map_def SEC("maps") xsks_map = {
> + .type = BPF_MAP_TYPE_XSKMAP,
> + .key_size = sizeof(int),
> + .value_size = sizeof(int),
> + .max_entries = 4,
> +};
> +
> +struct bpf_map_def SEC("maps") rr_map = {
> + .type = BPF_MAP_TYPE_PERCPU_ARRAY,
> + .key_size = sizeof(int),
> + .value_size = sizeof(unsigned int),
> + .max_entries = 1,
> +};
> +
> +SEC("xdp_sock")
> +int xdp_sock_prog(struct xdp_md *ctx)
> +{
> + int *qidconf, key = 0, idx;
> + unsigned int *rr;
> +
> + qidconf = bpf_map_lookup_elem(_map, );
> + if (!qidconf)
> + return XDP_ABORTED;
> +
> + if (*qidconf != ctx->rx_queue_index)
> + return XDP_PASS;
> +
> +#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
> + rr = bpf_map_lookup_elem(_map, );
> + if (!rr)
> + return XDP_ABORTED;
> +
> + *rr = (*rr + 1) & (MAX_SOCKS - 1);
> + idx = *rr;
> +#else
> + idx = 0;
> +#endif
> +
> + return bpf_redirect_map(_map, idx, 0);
> +}
> +
> +char _license[] SEC("license") = "GPL";
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> new file mode 100644
> index ..690bac1a0ab7
> --- /dev/null
> +++ b/samples/bpf/xdpsock_user.c
> @@ -0,0 +1,947 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2017 - 2018 Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software 

Re: [PATCH bpf-next 00/15] Introducing AF_XDP support

2018-04-23 Thread Michael S. Tsirkin
On Mon, Apr 23, 2018 at 03:56:04PM +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work
> has already been accepted. We will publish our zero-copy support for
> RX and TX on top of his patch sets at a later point in time.
> 
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
> 
> This new dedicated packet buffer area is call a UMEM. It consists of a
> number of equally size frames and each frame has a unique frame id. A
> descriptor in one of the queues references a frame by referencing its
> frame id. The user space allocates memory for this UMEM using whatever
> means it feels is most appropriate (malloc, mmap, huge pages,
> etc). This memory area is then registered with the kernel using the new
> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
> and the COMPLETION queue. The fill queue is used by the application to
> send down frame ids for the kernel to fill in with RX packet
> data. References to these frames will then appear in the RX queue of
> the XSK once they have been received. The completion queue, on the
> other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
> 
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device, and it is not until bind is
> completed that traffic starts to flow. Note that in this RFC, all
> packet data is copied out to user-space.
> 
> A new feature in this RFC is that the UMEM can be shared between
> processes, if desired. If a process wants to do this, it simply skips
> the registration of the UMEM and its corresponding two queues, sets a
> flag in the bind call and submits the XSK of the process it would like
> to share UMEM with as well as its own newly created XSK socket. The
> new process will then receive frame id references in its own RX queue
> that point to this shared UMEM. Note that since the queue structures
> are single-consumer / single-producer (for performance reasons), the
> new process has to create its own socket with associated RX and TX
> queues, since it cannot share this with the other process. This is
> also the reason that there is only one set of FILL and COMPLETION
> queues per UMEM. It is the responsibility of a single process to
> handle the UMEM. If multiple-producer / multiple-consumer queues are
> implemented in the future, this requirement could be relaxed.
> 
> How is then packets distributed between these two XSK? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently mandatory
> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> to get any traffic to user space through the XSK.
> 
> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
> driver does not have support for XDP, or XDP_SKB is explicitly chosen
> when loading the XDP program, XDP_SKB mode is employed that uses SKBs
> 

Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap

2018-04-23 Thread Michael S. Tsirkin
On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
> From: Magnus Karlsson 
> 
> Here, we add another setsockopt for registered user memory (umem)
> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
> ask the kernel to allocate a queue (ring buffer) and also mmap it
> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
> 
> The queue is used to explicitly pass ownership of umem frames from the
> user process to the kernel. These frames will in a later patch be
> filled in with Rx packet data by the kernel.
> 
> Signed-off-by: Magnus Karlsson 
> ---
>  include/uapi/linux/if_xdp.h | 15 +++
>  net/xdp/Makefile|  2 +-
>  net/xdp/xdp_umem.c  |  5 
>  net/xdp/xdp_umem.h  |  2 ++
>  net/xdp/xsk.c   | 62 
> -
>  net/xdp/xsk_queue.c | 58 ++
>  net/xdp/xsk_queue.h | 38 +++
>  7 files changed, 180 insertions(+), 2 deletions(-)
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
> 
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 41252135a0fe..975661e1baca 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -23,6 +23,7 @@
>  
>  /* XDP socket options */
>  #define XDP_UMEM_REG 3
> +#define XDP_UMEM_FILL_RING   4
>  
>  struct xdp_umem_reg {
>   __u64 addr; /* Start of packet data area */
> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>   __u32 frame_headroom; /* Frame head room */
>  };
>  
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_RING 0x1
> +
> +struct xdp_ring {
> + __u32 producer __attribute__((aligned(64)));
> + __u32 consumer __attribute__((aligned(64)));
> +};

Why 64? And do you still need these guys in uapi?

> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_ring {
> + struct xdp_ring ptrs;
> + __u32 desc[0] __attribute__((aligned(64)));
> +};
> +
>  #endif /* _LINUX_IF_XDP_H */
> diff --git a/net/xdp/Makefile b/net/xdp/Makefile
> index a5d736640a0f..074fb2b2d51c 100644
> --- a/net/xdp/Makefile
> +++ b/net/xdp/Makefile
> @@ -1,2 +1,2 @@
> -obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
> +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
>  
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index bff058f5a769..6fc233e03f30 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -62,6 +62,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
>   struct mm_struct *mm;
>   unsigned long diff;
>  
> + if (umem->fq) {
> + xskq_destroy(umem->fq);
> + umem->fq = NULL;
> + }
> +
>   if (umem->pgs) {
>   xdp_umem_unpin_pages(umem);
>  
> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> index 58714f4f7f25..3086091aebdd 100644
> --- a/net/xdp/xdp_umem.h
> +++ b/net/xdp/xdp_umem.h
> @@ -18,9 +18,11 @@
>  #include 
>  #include 
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem_props.h"
>  
>  struct xdp_umem {
> + struct xsk_queue *fq;
>   struct page **pgs;
>   struct xdp_umem_props props;
>   u32 npgs;
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 19fc719cbe0d..bf6a1151df28 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem.h"
>  
>  struct xdp_sock {
> @@ -47,6 +48,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
>   return (struct xdp_sock *)sk;
>  }
>  
> +static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
> +{
> + struct xsk_queue *q;
> +
> + if (entries == 0 || *queue || !is_power_of_2(entries))
> + return -EINVAL;
> +
> + q = xskq_create(entries);
> + if (!q)
> + return -ENOMEM;
> +
> + *queue = q;
> + return 0;
> +}
> +
>  static int xsk_release(struct socket *sock)
>  {
>   struct sock *sk = sock->sk;
> @@ -109,6 +125,23 @@ static int xsk_setsockopt(struct socket *sock, int 
> level, int optname,
>   mutex_unlock(>mutex);
>   return 0;
>   }
> + case XDP_UMEM_FILL_RING:
> + {
> + struct xsk_queue **q;
> + int entries;
> +
> + if (!xs->umem)
> + return -EINVAL;
> +
> + if (copy_from_user(, optval, sizeof(entries)))
> + return -EFAULT;
> +
> + mutex_lock(>mutex);
> + q = >umem->fq;
> + err = xsk_init_queue(entries, q);
> + mutex_unlock(>mutex);
> + return err;
> + }
>   default:
>   break;
>   }
> @@ -116,6 +149,33 @@ static int xsk_setsockopt(struct socket *sock, int 
> level, int optname,
>   return -ENOPROTOOPT;
>  }
>  
> +static int xsk_mmap(struct file *file, 

Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-23 Thread Mikulas Patocka


On Mon, 23 Apr 2018, Michal Hocko wrote:

> On Mon 23-04-18 10:24:02, Mikulas Patocka wrote:
> 
> > > Really, we have a fault injection framework and this sounds like
> > > something to hook in there.
> > 
> > The testing people won't set it up. They install the "kernel-debug" 
> > package and run the tests in it.
> > 
> > If you introduce a hidden option that no one knows about, no one will use 
> > it.
> 
> then make sure people know about it. Fuzzers already do test fault
> injections.

I think that in the long term we can introduce a kernel parameter like 
"debug_level=1", "debug_level=2", "debug_level=3" that will turn on 
debugging features across all kernel subsystems and we can teach users to 
use it to diagnose problems.

But it won't work if every subsystem has different debug parameters. There 
are 192 distinct filenames in debugfs, if we add 193rd one, harly anyone 
notices it.

Mikulas


Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap

2018-04-23 Thread Michael S. Tsirkin
On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote:
> From: Magnus Karlsson 
> 
> Here, we add another setsockopt for registered user memory (umem)
> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
> ask the kernel to allocate a queue (ring buffer) and also mmap it
> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process.
> 
> The queue is used to explicitly pass ownership of umem frames from the
> user process to the kernel. These frames will in a later patch be
> filled in with Rx packet data by the kernel.
> 
> Signed-off-by: Magnus Karlsson 
> ---
>  include/uapi/linux/if_xdp.h | 15 +++
>  net/xdp/Makefile|  2 +-
>  net/xdp/xdp_umem.c  |  5 
>  net/xdp/xdp_umem.h  |  2 ++
>  net/xdp/xsk.c   | 62 
> -
>  net/xdp/xsk_queue.c | 58 ++
>  net/xdp/xsk_queue.h | 38 +++
>  7 files changed, 180 insertions(+), 2 deletions(-)
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
> 
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 41252135a0fe..975661e1baca 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -23,6 +23,7 @@
>  
>  /* XDP socket options */
>  #define XDP_UMEM_REG 3
> +#define XDP_UMEM_FILL_RING   4
>  
>  struct xdp_umem_reg {
>   __u64 addr; /* Start of packet data area */
> @@ -31,4 +32,18 @@ struct xdp_umem_reg {
>   __u32 frame_headroom; /* Frame head room */
>  };
>  
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_RING 0x1
> +
> +struct xdp_ring {
> + __u32 producer __attribute__((aligned(64)));
> + __u32 consumer __attribute__((aligned(64)));
> +};
> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_ring {
> + struct xdp_ring ptrs;
> + __u32 desc[0] __attribute__((aligned(64)));
> +};
> +
>  #endif /* _LINUX_IF_XDP_H */
> diff --git a/net/xdp/Makefile b/net/xdp/Makefile
> index a5d736640a0f..074fb2b2d51c 100644
> --- a/net/xdp/Makefile
> +++ b/net/xdp/Makefile
> @@ -1,2 +1,2 @@
> -obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
> +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
>  
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index bff058f5a769..6fc233e03f30 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -62,6 +62,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
>   struct mm_struct *mm;
>   unsigned long diff;
>  
> + if (umem->fq) {
> + xskq_destroy(umem->fq);
> + umem->fq = NULL;
> + }
> +
>   if (umem->pgs) {
>   xdp_umem_unpin_pages(umem);
>  
> diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
> index 58714f4f7f25..3086091aebdd 100644
> --- a/net/xdp/xdp_umem.h
> +++ b/net/xdp/xdp_umem.h
> @@ -18,9 +18,11 @@
>  #include 
>  #include 
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem_props.h"
>  
>  struct xdp_umem {
> + struct xsk_queue *fq;
>   struct page **pgs;
>   struct xdp_umem_props props;
>   u32 npgs;
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 19fc719cbe0d..bf6a1151df28 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  
> +#include "xsk_queue.h"
>  #include "xdp_umem.h"
>  
>  struct xdp_sock {
> @@ -47,6 +48,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
>   return (struct xdp_sock *)sk;
>  }
>  
> +static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
> +{
> + struct xsk_queue *q;
> +
> + if (entries == 0 || *queue || !is_power_of_2(entries))
> + return -EINVAL;
> +
> + q = xskq_create(entries);
> + if (!q)
> + return -ENOMEM;
> +
> + *queue = q;
> + return 0;
> +}
> +
>  static int xsk_release(struct socket *sock)
>  {
>   struct sock *sk = sock->sk;
> @@ -109,6 +125,23 @@ static int xsk_setsockopt(struct socket *sock, int 
> level, int optname,
>   mutex_unlock(>mutex);
>   return 0;
>   }
> + case XDP_UMEM_FILL_RING:
> + {
> + struct xsk_queue **q;
> + int entries;
> +
> + if (!xs->umem)
> + return -EINVAL;
> +
> + if (copy_from_user(, optval, sizeof(entries)))
> + return -EFAULT;
> +
> + mutex_lock(>mutex);
> + q = >umem->fq;
> + err = xsk_init_queue(entries, q);
> + mutex_unlock(>mutex);
> + return err;
> + }
>   default:
>   break;
>   }
> @@ -116,6 +149,33 @@ static int xsk_setsockopt(struct socket *sock, int 
> level, int optname,
>   return -ENOPROTOOPT;
>  }
>  
> +static int xsk_mmap(struct file *file, struct socket *sock,
> + struct 

Re: [RFC PATCH ghak32 V2 01/13] audit: add container id

2018-04-23 Thread Paul Moore
On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs  wrote:
> On 2018-04-18 19:47, Paul Moore wrote:
>> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs  wrote:
>> > Implement the proc fs write to set the audit container ID of a process,
>> > emitting an AUDIT_CONTAINER record to document the event.
>> >
>> > This is a write from the container orchestrator task to a proc entry of
>> > the form /proc/PID/containerid where PID is the process ID of the newly
>> > created task that is to become the first task in a container, or an
>> > additional task added to a container.
>> >
>> > The write expects up to a u64 value (unset: 18446744073709551615).
>> >
>> > This will produce a record such as this:
>> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 
>> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 
>> > ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
>> >
>> > The "op" field indicates an initial set.  The "pid" to "ses" fields are
>> > the orchestrator while the "opid" field is the object's PID, the process
>> > being "contained".  Old and new container ID values are given in the
>> > "contid" fields, while res indicates its success.
>> >
>> > It is not permitted to self-set, unset or re-set the container ID.  A
>> > child inherits its parent's container ID, but then can be set only once
>> > after.
>> >
>> > See: https://github.com/linux-audit/audit-kernel/issues/32
>> >
>> > Signed-off-by: Richard Guy Briggs 
>> > ---
>> >  fs/proc/base.c | 37 
>> >  include/linux/audit.h  | 16 +
>> >  include/linux/init_task.h  |  4 ++-
>> >  include/linux/sched.h  |  1 +
>> >  include/uapi/linux/audit.h |  2 ++
>> >  kernel/auditsc.c   | 84 
>> > ++
>> >  6 files changed, 143 insertions(+), 1 deletion(-)

...

>> > diff --git a/include/linux/sched.h b/include/linux/sched.h
>> > index d258826..1b82191 100644
>> > --- a/include/linux/sched.h
>> > +++ b/include/linux/sched.h
>> > @@ -796,6 +796,7 @@ struct task_struct {
>> >  #ifdef CONFIG_AUDITSYSCALL
>> > kuid_t  loginuid;
>> > unsigned intsessionid;
>> > +   u64 containerid;
>>
>> This one line addition to the task_struct scares me the most of
>> anything in this patchset.  Why?  It's a field named "containerid" in
>> a perhaps one of the most widely used core kernel structures; the
>> possibilities for abuse are endless, and it's foolish to think we
>> would ever be able to adequately police this.
>
> Fair enough.
>
>> Unfortunately, we can't add the field to audit_context as things
>> currently stand because we don't always allocate an audit_context,
>> it's dependent on the system's configuration, and we need to track the
>> audit container ID for a given process, regardless of the audit
>> configuration.  Pretty much the same reason why loginuid and sessionid
>> are located directly in task_struct now.  As I stressed during the
>> design phase, I really want to keep this as an *audit* container ID
>> and not a general purpose kernel wide container ID.  If the kernel
>> ever grows a general purpose container ID token, I'll be the first in
>> line to convert the audit code, but I don't want audit to be that
>> general purpose mechanism ... audit is hated enough as-is ;)
>
> When would we need an audit container ID when audit is not enabled
> enough to have an audit_context?

I'm thinking of the audit_alloc() case where audit_filter_task()
returns AUDIT_DISABLED.

I believe this is the same reason why loginuid and sessionid live
directly in the task_struct and not in the audit_context; they need to
persist for the lifetime of the task.

> If it is only used for audit, and audit is the only consumer, and audit
> can only use it when it is enabled, then we can just return success to
> any write to the proc filehandle, or not even present it.  Nothing will
> be able to know that value wasn't used.
>
> When are loginuid and sessionid used now when audit is not enabled (or
> should I say, explicitly disabled)?

See above.  I think that should answer these questions.

>> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
>> > index 4e61a9e..921a71f 100644
>> > --- a/include/uapi/linux/audit.h
>> > +++ b/include/uapi/linux/audit.h
>> > @@ -71,6 +71,7 @@
>> >  #define AUDIT_TTY_SET  1017/* Set TTY auditing status */
>> >  #define AUDIT_SET_FEATURE  1018/* Turn an audit feature on or off 
>> > */
>> >  #define AUDIT_GET_FEATURE  1019/* Get which features are enabled 
>> > */
>> > +#define AUDIT_CONTAINER1020/* Define the container id 
>> > and information */
>> >
>> >  #define AUDIT_FIRST_USER_MSG   1100/* Userspace messages mostly 
>> > uninteresting to kernel */
>> >  #define AUDIT_USER_AVC  

Re: dev_loopback_xmit parameters

2018-04-23 Thread Emanuele

Ok, clear now.

Even though I don't understand what to set to avoid triggering the
WARN_ON(!skb_dst(skb));
inside dev_loopback_xmit.
I just would like to send the skb in loopback, i.e. moving the packet 
from the sending to the receiving queue of a certain struct net_device.



On 24/04/2018 00:36, Eric Dumazet wrote:


On 04/23/2018 02:40 PM, Emanuele wrote:

Hello,

I don't know if this is the right place where to ask, but I was wondering why 
the dev_loopback_xmit function defined in /net/core/dev.c takes struct net * 
and struct sock *  as parameters. They are never used, so I believe passing 
only the struct  sk_buff * should be enough.


Look at net/ipv6/ip6_output.c where NF_HOOK() uses dev_loopback_xmit().


In addition, it would like to know where I can read what is and how to set a 
skb dst_entry, since I don't really understand it.

Thanks a lot,

Emanuele





Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt

2018-04-23 Thread Willem de Bruijn
On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> In this commit the base structure of the AF_XDP address family is set
> up. Further, we introduce the abilty register a window of user memory
> to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
> window is viewed by an AF_XDP socket as a set of equally large
> frames. After a user memory registration all frames are "owned" by the
> user application, and not the kernel.
>
> Co-authored-by: Magnus Karlsson 
> Signed-off-by: Magnus Karlsson 
> Signed-off-by: Björn Töpel 

> +static void xdp_umem_release(struct xdp_umem *umem)
> +{
> +   struct task_struct *task;
> +   struct mm_struct *mm;
> +   unsigned long diff;
> +
> +   if (umem->pgs) {
> +   xdp_umem_unpin_pages(umem);
> +
> +   task = get_pid_task(umem->pid, PIDTYPE_PID);
> +   put_pid(umem->pid);
> +   if (!task)
> +   goto out;
> +   mm = get_task_mm(task);
> +   put_task_struct(task);
> +   if (!mm)
> +   goto out;
> +
> +   diff = umem->size >> PAGE_SHIFT;

Need to round up or size must always be a multiple of PAGE_SIZE.

> +
> +   down_write(>mmap_sem);
> +   mm->pinned_vm -= diff;
> +   up_write(>mmap_sem);

When using user->locked_vm for resource limit checks, no need
to also update mm->pinned_vm?

> +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
> +{
> +   u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
> +   u64 addr = mr->addr, size = mr->len;
> +   unsigned int nframes;
> +   int size_chk, err;
> +
> +   if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> +   /* Strictly speaking we could support this, if:
> +* - huge pages, or*
> +* - using an IOMMU, or
> +* - making sure the memory area is consecutive
> +* but for now, we simply say "computer says no".
> +*/
> +   return -EINVAL;
> +   }

Ideally, AF_XDP subsumes all packet socket use cases. It does not
have packet v3's small packet optimizations of variable sized frames
and block signaling.

I don't suggest adding that now. But for the non-zerocopy case, it may
make sense to ensure that nothing is blocking a later addition of these
features. Especially for header-only (snaplen) workloads. So far, I don't
see any issues.

> +   if (!is_power_of_2(frame_size))
> +   return -EINVAL;
> +
> +   if (!PAGE_ALIGNED(addr)) {
> +   /* Memory area has to be page size aligned. For
> +* simplicity, this might change.
> +*/
> +   return -EINVAL;
> +   }
> +
> +   if ((addr + size) < addr)
> +   return -EINVAL;
> +
> +   nframes = size / frame_size;
> +   if (nframes == 0 || nframes > UINT_MAX)
> +   return -EINVAL;

You may also want a check here that nframes * frame_size is at least
PAGE_SIZE and probably a multiple of that.

> +   frame_headroom = ALIGN(frame_headroom, 64);
> +
> +   size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
> +   if (size_chk < 0)
> +   return -EINVAL;
> +
> +   umem->pid = get_task_pid(current, PIDTYPE_PID);
> +   umem->size = (size_t)size;
> +   umem->address = (unsigned long)addr;
> +   umem->props.frame_size = frame_size;
> +   umem->props.nframes = nframes;
> +   umem->frame_headroom = frame_headroom;
> +   umem->npgs = size / PAGE_SIZE;
> +   umem->pgs = NULL;
> +   umem->user = NULL;
> +
> +   umem->frame_size_log2 = ilog2(frame_size);
> +   umem->nfpp_mask = (PAGE_SIZE / frame_size) - 1;
> +   umem->nfpplog2 = ilog2(PAGE_SIZE / frame_size);
> +   atomic_set(>users, 1);
> +
> +   err = xdp_umem_account_pages(umem);
> +   if (err)
> +   goto out;
> +
> +   err = xdp_umem_pin_pages(umem);
> +   if (err)

need to call xdp_umem_unaccount_pages on error
> +   goto out;
> +   return 0;
> +
> +out:
> +   put_pid(umem->pid);
> +   return err;
> +}


[PATCH v2] selftests: bpf: update .gitignore with missing file

2018-04-23 Thread Anders Roxell
Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests")
Signed-off-by: Anders Roxell 
---
Rebased against bpf-next.

 tools/testing/selftests/bpf/.gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 9cf83f895d98..da19f0562bf8 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -12,3 +12,4 @@ test_tcpbpf_user
 test_verifier_log
 feature
 test_libbpf_open
+test_btf
-- 
2.17.0



Re: [bpf PATCH v3 0/3] BPF, a couple sockmap fixes

2018-04-23 Thread Daniel Borkmann
On 04/24/2018 12:39 AM, John Fastabend wrote:
> While testing sockmap with more programs (besides our test programs)
> I found a couple issues.
> 
> The attached series fixes an issue where pinned maps were not
> working correctly, blocking sockets returned zero, and an error
> path that when the sock hit an out of memory case resulted in a
> double page_put() while doing ingress redirects.
> 
> See individual patches for more details.
> 
> v2: Incorporated Daniel's feedback to use map ops for uref put op
> which also fixed the build error discovered in v1.
> v3: rename map_put_uref to map_release_uref

Applied to bpf tree, thanks John!


[PATCH net v2] net: ethtool: Add missing kernel doc for FEC parameters

2018-04-23 Thread Florian Fainelli
While adding support for ethtool::get_fecparam and set_fecparam, kernel
doc for these functions was missed, add those.

Fixes: 1a5f3da20bd9 ("net: ethtool: add support for forward error correction 
modes")
Signed-off-by: Florian Fainelli 
---
Changes in v2:

- corrected set_fecparam in commit message

 include/linux/ethtool.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index ebe41811ed34..b32cd2062f18 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -310,6 +310,8 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 
*legacy_u32,
  * fields should be ignored (use %__ETHTOOL_LINK_MODE_MASK_NBITS
  * instead of the latter), any change to them will be overwritten
  * by kernel. Returns a negative error code or zero.
+ * @get_fecparam: Get the network device Forward Error Correction parameters.
+ * @set_fecparam: Set the network device Forward Error Correction parameters.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
-- 
2.14.1



[bpf PATCH v3 3/3] bpf: sockmap, fix double page_put on ENOMEM error in redirect path

2018-04-23 Thread John Fastabend
In the case where the socket memory boundary is hit the redirect
path returns an ENOMEM error. However, before checking for this
condition the redirect scatterlist buffer is setup with a valid
page and length. This is never unwound so when the buffers are
released latter in the error path we do a put_page() and clear
the scatterlist fields. But, because the initial error happens
before completing the scatterlist buffer we end up with both the
original buffer and the redirect buffer pointing to the same page
resulting in duplicate put_page() calls.

To fix this simply move the initial configuration of the redirect
scatterlist buffer below the sock memory check.

Found this while running TCP_STREAM test with netperf using Cilium.

Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for 
BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index aaf50ec..634415c 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -524,8 +524,6 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes,
i = md->sg_start;
 
do {
-   r->sg_data[i] = md->sg_data[i];
-
size = (apply && apply_bytes < md->sg_data[i].length) ?
apply_bytes : md->sg_data[i].length;
 
@@ -536,6 +534,7 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes,
}
 
sk_mem_charge(sk, size);
+   r->sg_data[i] = md->sg_data[i];
r->sg_data[i].length = size;
md->sg_data[i].length -= size;
md->sg_data[i].offset += size;



[bpf PATCH v3 1/3] bpf: sockmap, map_release does not hold refcnt for pinned maps

2018-04-23 Thread John Fastabend
Relying on map_release hook to decrement the reference counts when a
map is removed only works if the map is not being pinned. In the
pinned case the ref is decremented immediately and the BPF programs
released. After this BPF programs may not be in-use which is not
what the user would expect.

This patch moves the release logic into bpf_map_put_uref() and brings
sockmap in-line with how a similar case is handled in prog array maps.

Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not 
detached progs")
Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0ac4340..a5782d0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -33,6 +33,7 @@ struct bpf_map_ops {
void (*map_release)(struct bpf_map *map, struct file *map_file);
void (*map_free)(struct bpf_map *map);
int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key);
+   void (*map_release_uref)(struct bpf_map *map);
 
/* funcs callable from userspace and from eBPF programs */
void *(*map_lookup_elem)(struct bpf_map *map, void *key);
@@ -448,7 +449,6 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, 
void *value,
 int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file,
 void *key, void *value, u64 map_flags);
 int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
-void bpf_fd_array_map_clear(struct bpf_map *map);
 int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file,
void *key, void *value, u64 map_flags);
 int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 02a1893..0fd8d8f 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -526,7 +526,7 @@ static u32 prog_fd_array_sys_lookup_elem(void *ptr)
 }
 
 /* decrement refcnt of all bpf_progs that are stored in this map */
-void bpf_fd_array_map_clear(struct bpf_map *map)
+static void bpf_fd_array_map_clear(struct bpf_map *map)
 {
struct bpf_array *array = container_of(map, struct bpf_array, map);
int i;
@@ -545,6 +545,7 @@ void bpf_fd_array_map_clear(struct bpf_map *map)
.map_fd_get_ptr = prog_fd_array_get_ptr,
.map_fd_put_ptr = prog_fd_array_put_ptr,
.map_fd_sys_lookup_elem = prog_fd_array_sys_lookup_elem,
+   .map_release_uref = bpf_fd_array_map_clear,
 };
 
 static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file,
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index a3b2138..a73d484 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1831,7 +1831,7 @@ static int sock_map_update_elem(struct bpf_map *map,
return err;
 }
 
-static void sock_map_release(struct bpf_map *map, struct file *map_file)
+static void sock_map_release(struct bpf_map *map)
 {
struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
struct bpf_prog *orig;
@@ -1855,7 +1855,7 @@ static void sock_map_release(struct bpf_map *map, struct 
file *map_file)
.map_get_next_key = sock_map_get_next_key,
.map_update_elem = sock_map_update_elem,
.map_delete_elem = sock_map_delete_elem,
-   .map_release = sock_map_release,
+   .map_release_uref = sock_map_release,
 };
 
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index fe23dc5a..a22c26e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -260,8 +260,8 @@ static void bpf_map_free_deferred(struct work_struct *work)
 static void bpf_map_put_uref(struct bpf_map *map)
 {
if (atomic_dec_and_test(>usercnt)) {
-   if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
-   bpf_fd_array_map_clear(map);
+   if (map->ops->map_release_uref)
+   map->ops->map_release_uref(map);
}
 }
 



[bpf PATCH v3 0/3] BPF, a couple sockmap fixes

2018-04-23 Thread John Fastabend
While testing sockmap with more programs (besides our test programs)
I found a couple issues.

The attached series fixes an issue where pinned maps were not
working correctly, blocking sockets returned zero, and an error
path that when the sock hit an out of memory case resulted in a
double page_put() while doing ingress redirects.

See individual patches for more details.

v2: Incorporated Daniel's feedback to use map ops for uref put op
which also fixed the build error discovered in v1.
v3: rename map_put_uref to map_release_uref

---

John Fastabend (3):
  bpf: sockmap, map_release does not hold refcnt for pinned maps
  bpf: sockmap, sk_wait_event needed to handle blocking cases
  bpf: sockmap, fix double page_put on ENOMEM error in redirect path


 0 files changed

--
Signature


[bpf PATCH v3 2/3] bpf: sockmap, sk_wait_event needed to handle blocking cases

2018-04-23 Thread John Fastabend
In the recvmsg handler we need to add a wait event to support the
blocking use cases. Without this we return zero and may confuse
user applications. In the wait event any data received on the
sk either via sk_receive_queue or the psock ingress list will
wake up the sock.

Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for 
BPF_SK_SKB_STREAM_VERDICT")
Signed-off-by: John Fastabend 
---
 0 files changed

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index a73d484..aaf50ec 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define SOCK_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
@@ -732,6 +733,26 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
return err;
 }
 
+static int bpf_wait_data(struct sock *sk,
+struct smap_psock *psk, int flags,
+long timeo, int *err)
+{
+   int rc;
+
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
+
+   add_wait_queue(sk_sleep(sk), );
+   sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+   rc = sk_wait_event(sk, ,
+  !list_empty(>ingress) ||
+  !skb_queue_empty(>sk_receive_queue),
+  );
+   sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+   remove_wait_queue(sk_sleep(sk), );
+
+   return rc;
+}
+
 static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
   int nonblock, int flags, int *addr_len)
 {
@@ -755,6 +776,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr 
*msg, size_t len,
return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
 
lock_sock(sk);
+bytes_ready:
while (copied != len) {
struct scatterlist *sg;
struct sk_msg_buff *md;
@@ -809,6 +831,28 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr 
*msg, size_t len,
}
}
 
+   if (!copied) {
+   long timeo;
+   int data;
+   int err = 0;
+
+   timeo = sock_rcvtimeo(sk, nonblock);
+   data = bpf_wait_data(sk, psock, flags, timeo, );
+
+   if (data) {
+   if (!skb_queue_empty(>sk_receive_queue)) {
+   release_sock(sk);
+   smap_release_sock(psock, sk);
+   copied = tcp_recvmsg(sk, msg, len, nonblock, 
flags, addr_len);
+   return copied;
+   }
+   goto bytes_ready;
+   }
+
+   if (err)
+   copied = err;
+   }
+
release_sock(sk);
smap_release_sock(psock, sk);
return copied;



Re: dev_loopback_xmit parameters

2018-04-23 Thread Eric Dumazet


On 04/23/2018 02:40 PM, Emanuele wrote:
> Hello,
> 
> I don't know if this is the right place where to ask, but I was wondering why 
> the dev_loopback_xmit function defined in /net/core/dev.c takes struct net * 
> and struct sock *  as parameters. They are never used, so I believe passing 
> only the struct  sk_buff * should be enough.
> 

Look at net/ipv6/ip6_output.c where NF_HOOK() uses dev_loopback_xmit().

> In addition, it would like to know where I can read what is and how to set a 
> skb dst_entry, since I don't really understand it.
> 
> Thanks a lot,
> 
> Emanuele
> 


Re: [PATCH] selftests: bpf: update .gitignore with missing file

2018-04-23 Thread Daniel Borkmann
On 04/24/2018 12:14 AM, Anders Roxell wrote:
> On 23 April 2018 at 23:34, Daniel Borkmann  wrote:
>> On 04/23/2018 03:50 PM, Anders Roxell wrote:
>>> Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests")
>>> Signed-off-by: Anders Roxell 
>>> ---
>>>  tools/testing/selftests/bpf/.gitignore | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/tools/testing/selftests/bpf/.gitignore 
>>> b/tools/testing/selftests/bpf/.gitignore
>>> index 5e1ab2f0eb79..3e3b3ced3f7c 100644
>>> --- a/tools/testing/selftests/bpf/.gitignore
>>> +++ b/tools/testing/selftests/bpf/.gitignore
>>> @@ -15,3 +15,4 @@ test_libbpf_open
>>>  test_sock
>>>  test_sock_addr
>>>  urandom_read
>>> +test_btf
>>
>> Against which tree is this? This doesn't apply to bpf-next. It would
>> apply against bpf tree, but c0fa1b6c3efc ("bpf: btf: Add BTF tests")
>> is part of bpf-next, so fits to neither of them.
> 
> I'm sorry,
> 
> I did it against this patch [1] that I thought was already applied to
> the bpf tree.

That was bpf tree since the original change already went to mainline; the
BTF is in bpf-next however, so you need to rebase your change against that.

Thanks,
Daniel


[PATCH 1/4] rhashtable: remove outdated comments about grow_decision etc

2018-04-23 Thread NeilBrown
grow_decision and shink_decision no longer exist, so remove
the remaining references to them.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |   33 ++---
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 1f8ad121eb43..87d443a5b11d 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -836,9 +836,8 @@ static inline void *__rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -866,9 +865,8 @@ static inline int rhashtable_insert_fast(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert_key(
struct rhltable *hlt, const void *key, struct rhlist_head *list,
@@ -890,9 +888,8 @@ static inline int rhltable_insert_key(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhltable_insert(
struct rhltable *hlt, struct rhlist_head *list,
@@ -922,9 +919,8 @@ static inline int rhltable_insert(
  *
  * It is safe to call this function from atomic context.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  */
 static inline int rhashtable_lookup_insert_fast(
struct rhashtable *ht, struct rhash_head *obj,
@@ -981,9 +977,8 @@ static inline void *rhashtable_lookup_get_insert_fast(
  *
  * Lookups may occur in parallel with hashtable mutations and resizing.
  *
- * Will trigger an automatic deferred table resizing if the size grows
- * beyond the watermark indicated by grow_decision() which can be passed
- * to rhashtable_init().
+ * Will trigger an automatic deferred table resizing if residency in the
+ * table grows beyond 70%.
  *
  * Returns zero on success.
  */
@@ -1134,8 +1129,8 @@ static inline int __rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%.
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */
@@ -1156,8 +1151,8 @@ static inline int rhashtable_remove_fast(
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the
- * shrink_decision function specified at rhashtable_init() returns true.
+ * Will automatically shrink the table if permitted when residency drops
+ * below 30%
  *
  * Returns zero on success, -ENOENT if the entry could not be found.
  */




[PATCH 0/4] A few rhashtables cleanups

2018-04-23 Thread NeilBrown
2 patches fixes documentation
1 fixes a bit in rhashtable_walk_start()
1 improves rhashtable_walk stability.

All reviewed and Acked.

Thanks,
NeilBrown


---

NeilBrown (4):
  rhashtable: remove outdated comments about grow_decision etc
  rhashtable: Revise incorrect comment on r{hl,hash}table_walk_enter()
  rhashtable: reset iter when rhashtable_walk_start sees new table
  rhashtable: improve rhashtable_walk stability when stop/start used.


 include/linux/rhashtable.h |   38 +++--
 lib/rhashtable.c   |   51 
 2 files changed, 63 insertions(+), 26 deletions(-)

--
Signature



[PATCH 4/4] rhashtable: improve rhashtable_walk stability when stop/start used.

2018-04-23 Thread NeilBrown
When a walk of an rhashtable is interrupted with rhastable_walk_stop()
and then rhashtable_walk_start(), the location to restart from is based
on a 'skip' count in the current hash chain, and this can be incorrect
if insertions or deletions have happened.  This does not happen when
the walk is not stopped and started as iter->p is a placeholder which
is safe to use while holding the RCU read lock.

In rhashtable_walk_start() we can revalidate that 'p' is still in the
same hash chain.  If it isn't then the current method is still used.

With this patch, if a rhashtable walker ensures that the current
object remains in the table over a stop/start period (possibly by
elevating the reference count if that is sufficient), it can be sure
that a walk will not miss objects that were in the hashtable for the
whole time of the walk.

rhashtable_walk_start() may not find the object even though it is
still in the hashtable if a rehash has moved it to a new table.  In
this case it will (eventually) get -EAGAIN and will need to proceed
through the whole table again to be sure to see everything at least
once.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |   44 +---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 81edf1ab38ab..9427b5766134 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -727,6 +727,7 @@ int rhashtable_walk_start_check(struct rhashtable_iter 
*iter)
__acquires(RCU)
 {
struct rhashtable *ht = iter->ht;
+   bool rhlist = ht->rhlist;
 
rcu_read_lock();
 
@@ -735,13 +736,52 @@ int rhashtable_walk_start_check(struct rhashtable_iter 
*iter)
list_del(>walker.list);
spin_unlock(>lock);
 
-   if (!iter->walker.tbl && !iter->end_of_table) {
+   if (iter->end_of_table)
+   return 0;
+   if (!iter->walker.tbl) {
iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
iter->slot = 0;
iter->skip = 0;
return -EAGAIN;
}
 
+   if (iter->p && !rhlist) {
+   /*
+* We need to validate that 'p' is still in the table, and
+* if so, update 'skip'
+*/
+   struct rhash_head *p;
+   int skip = 0;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+   skip++;
+   if (p == iter->p) {
+   iter->skip = skip;
+   goto found;
+   }
+   }
+   iter->p = NULL;
+   } else if (iter->p && rhlist) {
+   /* Need to validate that 'list' is still in the table, and
+* if so, update 'skip' and 'p'.
+*/
+   struct rhash_head *p;
+   struct rhlist_head *list;
+   int skip = 0;
+   rht_for_each_rcu(p, iter->walker.tbl, iter->slot) {
+   for (list = container_of(p, struct rhlist_head, rhead);
+list;
+list = rcu_dereference(list->next)) {
+   skip++;
+   if (list == iter->list) {
+   iter->p = p;
+   skip = skip;
+   goto found;
+   }
+   }
+   }
+   iter->p = NULL;
+   }
+found:
return 0;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_start_check);
@@ -917,8 +957,6 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter)
iter->walker.tbl = NULL;
spin_unlock(>lock);
 
-   iter->p = NULL;
-
 out:
rcu_read_unlock();
 }




[PATCH 3/4] rhashtable: reset iter when rhashtable_walk_start sees new table

2018-04-23 Thread NeilBrown
The documentation claims that when rhashtable_walk_start_check()
detects a resize event, it will rewind back to the beginning
of the table.  This is not true.  We need to set ->slot and
->skip to be zero for it to be true.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 lib/rhashtable.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 6d490f51174e..81edf1ab38ab 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -737,6 +737,8 @@ int rhashtable_walk_start_check(struct rhashtable_iter 
*iter)
 
if (!iter->walker.tbl && !iter->end_of_table) {
iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht);
+   iter->slot = 0;
+   iter->skip = 0;
return -EAGAIN;
}
 




[PATCH 2/4] rhashtable: Revise incorrect comment on r{hl, hash}table_walk_enter()

2018-04-23 Thread NeilBrown
Neither rhashtable_walk_enter() or rhltable_walk_enter() sleep, though
they do take a spinlock without irq protection.
So revise the comments to accurately state the contexts in which
these functions can be called.

Acked-by: Herbert Xu 
Signed-off-by: NeilBrown 
---
 include/linux/rhashtable.h |5 +++--
 lib/rhashtable.c   |5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 87d443a5b11d..4e1f535c2034 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -1268,8 +1268,9 @@ static inline int rhashtable_walk_init(struct rhashtable 
*ht,
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
  *
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
+ * This function may be called from any process context, including
+ * non-preemptable context, but cannot be called from softirq or
+ * hardirq context.
  *
  * You must call rhashtable_walk_exit after this function returns.
  */
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 2b2b79974b61..6d490f51174e 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -668,8 +668,9 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow);
  * For a completely stable walk you should construct your own data
  * structure outside the hash table.
  *
- * This function may sleep so you must not call it from interrupt
- * context or with spin locks held.
+ * This function may be called from any process context, including
+ * non-preemptable context, but cannot be called from softirq or
+ * hardirq context.
  *
  * You must call rhashtable_walk_exit after this function returns.
  */




Re: [PATCH] selftests: bpf: update .gitignore with missing file

2018-04-23 Thread Anders Roxell
On 23 April 2018 at 23:34, Daniel Borkmann  wrote:
> On 04/23/2018 03:50 PM, Anders Roxell wrote:
>> Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests")
>> Signed-off-by: Anders Roxell 
>> ---
>>  tools/testing/selftests/bpf/.gitignore | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/tools/testing/selftests/bpf/.gitignore 
>> b/tools/testing/selftests/bpf/.gitignore
>> index 5e1ab2f0eb79..3e3b3ced3f7c 100644
>> --- a/tools/testing/selftests/bpf/.gitignore
>> +++ b/tools/testing/selftests/bpf/.gitignore
>> @@ -15,3 +15,4 @@ test_libbpf_open
>>  test_sock
>>  test_sock_addr
>>  urandom_read
>> +test_btf
>
> Against which tree is this? This doesn't apply to bpf-next. It would
> apply against bpf tree, but c0fa1b6c3efc ("bpf: btf: Add BTF tests")
> is part of bpf-next, so fits to neither of them.

I'm sorry,

I did it against this patch [1] that I thought was already applied to
the bpf tree.

Cheers,
Anders
[1] https://patchwork.kernel.org/patch/10332907/


Re: [bpf PATCH v2 1/3] bpf: sockmap, map_release does not hold refcnt for pinned maps

2018-04-23 Thread Daniel Borkmann
On 04/23/2018 08:29 PM, John Fastabend wrote:
> Relying on map_release hook to decrement the reference counts when a
> map is removed only works if the map is not being pinned. In the
> pinned case the ref is decremented immediately and the BPF programs
> released. After this BPF programs may not be in-use which is not
> what the user would expect.
> 
> This patch moves the release logic into bpf_map_put_uref() and brings
> sockmap in-line with how a similar case is handled in prog array maps.
> 
> Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not 
> detached progs")
> Signed-off-by: John Fastabend 

Patches look good, but one trivial request below.

[...]
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 4ca46df..4b70439 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -257,8 +257,8 @@ static void bpf_map_free_deferred(struct work_struct 
> *work)
>  static void bpf_map_put_uref(struct bpf_map *map)
>  {
>   if (atomic_dec_and_test(>usercnt)) {
> - if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
> - bpf_fd_array_map_clear(map);
> + if (map->ops->map_put_uref)
> + map->ops->map_put_uref(map);

Could you change the callback name into something like 'map_release_uref'?
Naming it 'map_put_uref' is a bit confusing since this is only called when
the uref reference count already dropped to zero, and here we really release
the last reference point to user space. Given this is BPF core infra, would
be nice to still fix this up before applying.

Thanks,
Daniel


[PATCH v2] net: qrtr: Expose tunneling endpoint to user space

2018-04-23 Thread Bjorn Andersson
This implements a misc character device named "qrtr-tun" for the purpose
of allowing user space applications to implement endpoints in the qrtr
network.

This allows more advanced (and dynamic) testing of the qrtr code as well
as opens up the ability of tunneling qrtr over a network or USB link.

Signed-off-by: Bjorn Andersson 
---

Changes since v1:
- Dropped queue lock

 net/qrtr/Kconfig  |   7 +++
 net/qrtr/Makefile |   2 +
 net/qrtr/tun.c| 146 ++
 3 files changed, 155 insertions(+)
 create mode 100644 net/qrtr/tun.c

diff --git a/net/qrtr/Kconfig b/net/qrtr/Kconfig
index 326fd97444f5..1944834d225c 100644
--- a/net/qrtr/Kconfig
+++ b/net/qrtr/Kconfig
@@ -21,4 +21,11 @@ config QRTR_SMD
  Say Y here to support SMD based ipcrouter channels.  SMD is the
  most common transport for IPC Router.
 
+config QRTR_TUN
+   tristate "TUN device for Qualcomm IPC Router"
+   ---help---
+ Say Y here to expose a character device that allows user space to
+ implement endpoints of QRTR, for purpose of tunneling data to other
+ hosts or testing purposes.
+
 endif # QRTR
diff --git a/net/qrtr/Makefile b/net/qrtr/Makefile
index ab09e40f7c74..be012bfd3e52 100644
--- a/net/qrtr/Makefile
+++ b/net/qrtr/Makefile
@@ -2,3 +2,5 @@ obj-$(CONFIG_QRTR) := qrtr.o
 
 obj-$(CONFIG_QRTR_SMD) += qrtr-smd.o
 qrtr-smd-y := smd.o
+obj-$(CONFIG_QRTR_TUN) += qrtr-tun.o
+qrtr-tun-y := tun.o
diff --git a/net/qrtr/tun.c b/net/qrtr/tun.c
new file mode 100644
index ..48b2147c98f8
--- /dev/null
+++ b/net/qrtr/tun.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Linaro Ltd */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "qrtr.h"
+
+struct qrtr_tun {
+   struct qrtr_endpoint ep;
+
+   struct sk_buff_head queue;
+   wait_queue_head_t readq;
+};
+
+static int qrtr_tun_send(struct qrtr_endpoint *ep, struct sk_buff *skb)
+{
+   struct qrtr_tun *tun = container_of(ep, struct qrtr_tun, ep);
+
+   skb_queue_tail(>queue, skb);
+
+   /* wake up any blocking processes, waiting for new data */
+   wake_up_interruptible(>readq);
+
+   return 0;
+}
+
+static int qrtr_tun_open(struct inode *inode, struct file *filp)
+{
+   struct qrtr_tun *tun;
+
+   tun = kzalloc(sizeof(*tun), GFP_KERNEL);
+   if (!tun)
+   return -ENOMEM;
+
+   skb_queue_head_init(>queue);
+   init_waitqueue_head(>readq);
+
+   tun->ep.xmit = qrtr_tun_send;
+
+   filp->private_data = tun;
+
+   return qrtr_endpoint_register(>ep, QRTR_EP_NID_AUTO);
+}
+
+static ssize_t qrtr_tun_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+   struct file *filp = iocb->ki_filp;
+   struct qrtr_tun *tun = filp->private_data;
+   struct sk_buff *skb;
+   int count;
+
+   while (!(skb = skb_dequeue(>queue))) {
+   if (filp->f_flags & O_NONBLOCK)
+   return -EAGAIN;
+
+   /* Wait until we get data or the endpoint goes away */
+   if (wait_event_interruptible(tun->readq,
+!skb_queue_empty(>queue)))
+   return -ERESTARTSYS;
+   }
+
+   count = min_t(size_t, iov_iter_count(to), skb->len);
+   if (copy_to_iter(skb->data, count, to) != count)
+   count = -EFAULT;
+
+   kfree_skb(skb);
+
+   return count;
+}
+
+static ssize_t qrtr_tun_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+   struct file *filp = iocb->ki_filp;
+   struct qrtr_tun *tun = filp->private_data;
+   size_t len = iov_iter_count(from);
+   ssize_t ret;
+   void *kbuf;
+
+   kbuf = kzalloc(len, GFP_KERNEL);
+   if (!kbuf)
+   return -ENOMEM;
+
+   if (!copy_from_iter_full(kbuf, len, from))
+   return -EFAULT;
+
+   ret = qrtr_endpoint_post(>ep, kbuf, len);
+
+   return ret < 0 ? ret : len;
+}
+
+static int qrtr_tun_release(struct inode *inode, struct file *filp)
+{
+   struct qrtr_tun *tun = filp->private_data;
+   struct sk_buff *skb;
+
+   qrtr_endpoint_unregister(>ep);
+
+   /* Discard all SKBs */
+   while (!skb_queue_empty(>queue)) {
+   skb = skb_dequeue(>queue);
+   kfree_skb(skb);
+   }
+
+   kfree(tun);
+
+   return 0;
+}
+
+static const struct file_operations qrtr_tun_ops = {
+   .owner = THIS_MODULE,
+   .open = qrtr_tun_open,
+   .read_iter = qrtr_tun_read_iter,
+   .write_iter = qrtr_tun_write_iter,
+   .release = qrtr_tun_release,
+};
+
+static struct miscdevice qrtr_tun_miscdev = {
+   MISC_DYNAMIC_MINOR,
+   "qrtr-tun",
+   _tun_ops,
+};
+
+static int __init qrtr_tun_init(void)
+{
+   int ret;
+
+   ret = misc_register(_tun_miscdev);
+   if (ret)
+   pr_err("failed to register Qualcomm IPC Router tun device\n");
+

[PATCH net-next] tcp: md5: only call tp->af_specific->md5_lookup() for md5 sockets

2018-04-23 Thread Eric Dumazet
RETPOLINE made calls to tp->af_specific->md5_lookup() quite expensive,
given they have no result.
We can omit the calls for sockets that have no md5 keys.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_output.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
383cac0ff0ec059ca7dbc1a6304cc7f8183e008d..95feffb6d53f8a9eadfb15a2fffeec498d6e993a
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -585,14 +585,15 @@ static unsigned int tcp_syn_options(struct sock *sk, 
struct sk_buff *skb,
unsigned int remaining = MAX_TCP_OPTION_SPACE;
struct tcp_fastopen_request *fastopen = tp->fastopen_req;
 
+   *md5 = NULL;
 #ifdef CONFIG_TCP_MD5SIG
-   *md5 = tp->af_specific->md5_lookup(sk, sk);
-   if (*md5) {
-   opts->options |= OPTION_MD5;
-   remaining -= TCPOLEN_MD5SIG_ALIGNED;
+   if (unlikely(rcu_access_pointer(tp->md5sig_info))) {
+   *md5 = tp->af_specific->md5_lookup(sk, sk);
+   if (*md5) {
+   opts->options |= OPTION_MD5;
+   remaining -= TCPOLEN_MD5SIG_ALIGNED;
+   }
}
-#else
-   *md5 = NULL;
 #endif
 
/* We always get an MSS option.  The option bytes which will be seen in
@@ -720,14 +721,15 @@ static unsigned int tcp_established_options(struct sock 
*sk, struct sk_buff *skb
 
opts->options = 0;
 
+   *md5 = NULL;
 #ifdef CONFIG_TCP_MD5SIG
-   *md5 = tp->af_specific->md5_lookup(sk, sk);
-   if (unlikely(*md5)) {
-   opts->options |= OPTION_MD5;
-   size += TCPOLEN_MD5SIG_ALIGNED;
+   if (unlikely(rcu_access_pointer(tp->md5sig_info))) {
+   *md5 = tp->af_specific->md5_lookup(sk, sk);
+   if (*md5) {
+   opts->options |= OPTION_MD5;
+   size += TCPOLEN_MD5SIG_ALIGNED;
+   }
}
-#else
-   *md5 = NULL;
 #endif
 
if (likely(tp->rx_opt.tstamp_ok)) {
-- 
2.17.0.484.g0c8726318c-goog



Re: [bpf PATCH 1/2] bpf: Document sockmap '-target bpf' requirement for PROG_TYPE_SK_MSG

2018-04-23 Thread Daniel Borkmann
Applied both to bpf tree, thanks John!


dev_loopback_xmit parameters

2018-04-23 Thread Emanuele

Hello,

I don't know if this is the right place where to ask, but I was 
wondering why the dev_loopback_xmit function defined in /net/core/dev.c 
takes struct net * and struct sock *  as parameters. They are never 
used, so I believe passing only the struct  sk_buff * should be enough.


In addition, it would like to know where I can read what is and how to 
set a skb dst_entry, since I don't really understand it.


Thanks a lot,

Emanuele



Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue

2018-04-23 Thread Eric Dumazet
Hi Andy

On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
> On 04/20/2018 08:55 AM, Eric Dumazet wrote:
>> This patch series provide a new mmap_hook to fs willing to grab
>> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
>>
>> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
>> is held), and improve multi-threading scalability.
>>
> 
> I think that the right solution is to rework mmap() on TCP sockets a bit.  
> The current approach in net-next is very strange for a few reasons:
> 
> 1. It uses mmap() as an operation that has side effects besides just creating 
> a mapping.  If nothing else, it's surprising, since mmap() doesn't usually do 
> that.  But it's also causing problems like what you're seeing.
> 
> 2. The performance is worse than it needs to be.  mmap() is slow, and I doubt 
> you'll find many mm developers who consider this particular abuse of mmap() 
> to be a valid thing to optimize for.
> 
> 3. I'm not at all convinced the accounting is sane.  As far as I can tell, 
> you're allowing unprivileged users to increment the count on network-owned 
> pages, limited only by available virtual memory, without obviously charging 
> it to the socket buffer limits.  It looks like a program that simply forgot 
> to call munmap() would cause the system to run out of memory, and I see no 
> reason to expect the OOM killer to have any real chance of killing the right 
> task.

> 
> 4. Error handling sucks.  If I try to mmap() a large range (which is the 
> whole point -- using a small range will kill performance) and not quite all 
> of it can be mapped, then I waste a bunch of time in the kernel and get 
> *none* of the range mapped.
> 
> I would suggest that you rework the interface a bit.  First a user would call 
> mmap() on a TCP socket, which would create an empty VMA.  (It would set 
> vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize 
> it, but it would have no effect whatsoever on the TCP state machine.  Reading 
> the VMA would get SIGBUS.)  Then a user would call a new ioctl() or 
> setsockopt() function and pass something like:


> 
> struct tcp_zerocopy_receive {
>   void *address;
>   size_t length;
> };
> 
> The kernel would verify that [address, address+length) is entirely inside a 
> single TCP VMA and then would do the vm_insert_range magic.

I have no idea what is the proper API for that.
Where the TCP VMA(s) would be stored ?
In TCP socket, or MM layer ?


And I am not sure why the error handling would be better (point 4), unless we 
can return smaller @length than requested maybe ?

Also how the VMA space would be accounted (point 3) when creating an empty VMA 
(no pages in there yet)

  On success, length is changed to the length that actually got mapped.  The 
kernel could do this while holding mmap_sem for *read*, and it could get the 
lock ordering right.  If and when mm range locks ever get merged, it could 
switch to using a range lock.
> 
> Then you could use MADV_DONTNEED or another ioctl/setsockopt to zap the part 
> of the mapping that you're done with.
> 
> Does this seem reasonable?  It should involve very little code change, it 
> will run faster, it will scale better, and it is much less weird IMO.

Maybe, although I do not see the 'little code change' yet.

But at least, this seems pretty nice idea, especially if it could allow us to 
fill the mmap()ed area later when packets are received.



Re: WARNING in refcount_dec

2018-04-23 Thread Willem de Bruijn
On Thu, Apr 19, 2018 at 2:55 PM, Willem de Bruijn
 wrote:
> On Thu, Apr 19, 2018 at 2:32 AM, DaeRyong Jeong  wrote:
>> Hello.
>> We have analyzed the cause of the crash in v4.16-rc3, WARNING in 
>> refcount_dec,
>> which is found by RaceFuzzer (a modified version of Syzkaller).
>>
>> Since struct packet_sock's member variables, running, has_vnet_hdr, origdev
>> and auxdata are declared as bitfields, accessing these variables can race if
>> there is no synchronization mechanism.
>
> Great catch.
>
> These fields po->{running, auxdata, origdev, has_vnet_hdr} are
> accessed without a uniform locking strategy.
>
> po->running is always accessed with po->bind_lock held (with the
> exception of reading in packet_seq_show, but that is best effort).
>
> That is the only field written to outside setsockopt. If it is moved to
> a separate word, it will no longer interfere with the others.
>
> The other fields are read lockless in the various recv and send
> functions, but only set in setsockopt. We've had enough
> locking bugs around setsockopt that I suggest we wrap all of
> those in lock_sock, like the example I gave before for
> has_vnet_hdr.

Sent http://patchwork.ozlabs.org/patch/903190/


[net-next 2/2] ipv6: sr: Compute flowlabel of outer IPv6 header for seg6 encap mode

2018-04-23 Thread Ahmed Abdelsalam
ECMP (equal-cost multipath) hashes are typically computed on the
packets' 5-tuple(src IP, dst IP, src port, dst port, L4 proto).

For encapsulated packets, the L4 data is not readily available and
ECMP hashing will often revert to (src IP, dst IP). This will lead
to traffic polarization on a single ECMP path, causing congestion
and waste of network capacity.

In IPv6, the 20-bit flow label field is also used as part of the
ECMP hash. In the lack of L4 data, the hashing will be on (src IP,
dst IP, flow label).

Having a non-zero flow label is thus important for proper traffic
load balancing when L4 data is unavailable (i.e., when packets are
encapsulated)

Currently, the seg6_do_srh_encap() function extracts the original
packet's flow label and set it as the outer IPv6 flow label. There
are two issues with this behaviour:

a) There is no guarantee that the inner flow label will be set
by the source.

b) If the original packet is not IPv6, the flow label will be set
to zero (e.g., IPv4 or L2 encap).

This patch adds a function, named seg6_make_flowlabel(), that
computes a flow label from a given skb. It supports IPv6, IPv4
and L2 payloads, and leverages the per namespace "seg6_flowlabel"
sysctl value.

This patch has been tested for IPv6, IPv4, and L2 traffic.

Signed-off-by: Ahmed Abdelsalam 
---
 net/ipv6/seg6_iptunnel.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index 5fe1394..3d9cd86 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -91,6 +91,24 @@ static void set_tun_src(struct net *net, struct net_device 
*dev,
rcu_read_unlock();
 }
 
+/* Compute flowlabel for outer IPv6 header */
+__be32 seg6_make_flowlabel(struct net *net, struct sk_buff *skb,
+  struct ipv6hdr *inner_hdr)
+{
+   int do_flowlabel = net->ipv6.sysctl.seg6_flowlabel;
+   __be32 flowlabel = 0;
+   u32 hash;
+
+   if (do_flowlabel > 0) {
+   hash = skb_get_hash(skb);
+   rol32(hash, 16);
+   flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;
+   } else if (!do_flowlabel && skb->protocol == htons(ETH_P_IPV6)) {
+   flowlabel = ip6_flowlabel(inner_hdr);
+   }
+   return flowlabel;
+}
+
 /* encapsulate an IPv6 packet within an outer IPv6 header with a given SRH */
 int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 {
@@ -99,6 +117,7 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct 
ipv6_sr_hdr *osrh, int proto)
struct ipv6hdr *hdr, *inner_hdr;
struct ipv6_sr_hdr *isrh;
int hdrlen, tot_len, err;
+   __be32 flowlabel;
 
hdrlen = (osrh->hdrlen + 1) << 3;
tot_len = hdrlen + sizeof(*hdr);
@@ -119,12 +138,13 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct 
ipv6_sr_hdr *osrh, int proto)
 * decapsulation will overwrite inner hlim with outer hlim
 */
 
+   flowlabel = seg6_make_flowlabel(net, skb, inner_hdr);
if (skb->protocol == htons(ETH_P_IPV6)) {
ip6_flow_hdr(hdr, ip6_tclass(ip6_flowinfo(inner_hdr)),
-ip6_flowlabel(inner_hdr));
+flowlabel);
hdr->hop_limit = inner_hdr->hop_limit;
} else {
-   ip6_flow_hdr(hdr, 0, 0);
+   ip6_flow_hdr(hdr, 0, flowlabel);
hdr->hop_limit = ip6_dst_hoplimit(skb_dst(skb));
}
 
-- 
2.1.4



[net-next 1/2] ipv6: sr: add a per namespace sysctl to control seg6 flowlabel

2018-04-23 Thread Ahmed Abdelsalam
This patch adds a per namespace sysctl, named 'seg6_flowlabel', to be used
by seg6_do_srh_encap() to control the behaviour of setting the flowlabel
value of outer IPv6.

The currently support behaviours are as follows:
-1 set flowlabel to zero.
 0 copy flowlabel from Inner paceket in case of Inner IPv6 (0 for IPv4/L2)
 1 Compute the flowlabel using seg6_make_flowlabel()

Signed-off-by: Ahmed Abdelsalam 
---
 include/net/netns/ipv6.h   | 1 +
 net/ipv6/sysctl_net_ipv6.c | 8 
 2 files changed, 9 insertions(+)

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 97b3a54..c978a31 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -43,6 +43,7 @@ struct netns_sysctl_ipv6 {
int max_hbh_opts_cnt;
int max_dst_opts_len;
int max_hbh_opts_len;
+   int seg6_flowlabel;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 6fbdef6..e15cd37 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -152,6 +152,13 @@ static struct ctl_table ipv6_table_template[] = {
.extra1 = ,
.extra2 = ,
},
+   {
+   .procname   = "seg6_flowlabel",
+   .data   = _net.ipv6.sysctl.seg6_flowlabel,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
{ }
 };
 
@@ -217,6 +224,7 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
ipv6_table[12].data = >ipv6.sysctl.max_dst_opts_len;
ipv6_table[13].data = >ipv6.sysctl.max_hbh_opts_len;
ipv6_table[14].data = >ipv6.sysctl.multipath_hash_policy,
+   ipv6_table[15].data = >ipv6.sysctl.seg6_flowlabel;
 
ipv6_route_table = ipv6_route_sysctl_init(net);
if (!ipv6_route_table)
-- 
2.1.4



[PATCH net] packet: fix bitfield update race

2018-04-23 Thread Willem de Bruijn
From: Willem de Bruijn 

Updates to the bitfields in struct packet_sock are not atomic.
Serialize these read-modify-write cycles.

Move po->running into a separate variable. Its writes are protected by
po->bind_lock (except for one startup case at packet_create). Also
replace a textual precondition warning with lockdep annotation.

All others are set only in packet_setsockopt. Serialize these
updates by holding the socket lock. Analogous to other field updates,
also hold the lock when testing whether a ring is active (pg_vec).

Fixes: 8dc419447415 ("[PACKET]: Add optional checksum computation for recvmsg")
Reported-by: DaeRyong Jeong 
Reported-by: Byoungyoung Lee 
Signed-off-by: Willem de Bruijn 
---
 net/packet/af_packet.c | 60 +++---
 net/packet/internal.h  | 10 +++
 2 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index c31b0687396a..01f3515cada0 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -329,11 +329,11 @@ static void packet_pick_tx_queue(struct net_device *dev, 
struct sk_buff *skb)
skb_set_queue_mapping(skb, queue_index);
 }
 
-/* register_prot_hook must be invoked with the po->bind_lock held,
+/* __register_prot_hook must be invoked through register_prot_hook
  * or from a context in which asynchronous accesses to the packet
  * socket is not possible (packet_create()).
  */
-static void register_prot_hook(struct sock *sk)
+static void __register_prot_hook(struct sock *sk)
 {
struct packet_sock *po = pkt_sk(sk);
 
@@ -348,8 +348,13 @@ static void register_prot_hook(struct sock *sk)
}
 }
 
-/* {,__}unregister_prot_hook() must be invoked with the po->bind_lock
- * held.   If the sync parameter is true, we will temporarily drop
+static void register_prot_hook(struct sock *sk)
+{
+   lockdep_assert_held_once(_sk(sk)->bind_lock);
+   __register_prot_hook(sk);
+}
+
+/* If the sync parameter is true, we will temporarily drop
  * the po->bind_lock and do a synchronize_net to make sure no
  * asynchronous packet processing paths still refer to the elements
  * of po->prot_hook.  If the sync parameter is false, it is the
@@ -359,6 +364,8 @@ static void __unregister_prot_hook(struct sock *sk, bool 
sync)
 {
struct packet_sock *po = pkt_sk(sk);
 
+   lockdep_assert_held_once(>bind_lock);
+
po->running = 0;
 
if (po->fanout)
@@ -3252,7 +3259,7 @@ static int packet_create(struct net *net, struct socket 
*sock, int protocol,
 
if (proto) {
po->prot_hook.type = proto;
-   register_prot_hook(sk);
+   __register_prot_hook(sk);
}
 
mutex_lock(>packet.sklist_lock);
@@ -3732,12 +3739,18 @@ packet_setsockopt(struct socket *sock, int level, int 
optname, char __user *optv
 
if (optlen != sizeof(val))
return -EINVAL;
-   if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
-   return -EBUSY;
if (copy_from_user(, optval, sizeof(val)))
return -EFAULT;
-   po->tp_loss = !!val;
-   return 0;
+
+   lock_sock(sk);
+   if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) {
+   ret = -EBUSY;
+   } else {
+   po->tp_loss = !!val;
+   ret = 0;
+   }
+   release_sock(sk);
+   return ret;
}
case PACKET_AUXDATA:
{
@@ -3748,7 +3761,9 @@ packet_setsockopt(struct socket *sock, int level, int 
optname, char __user *optv
if (copy_from_user(, optval, sizeof(val)))
return -EFAULT;
 
+   lock_sock(sk);
po->auxdata = !!val;
+   release_sock(sk);
return 0;
}
case PACKET_ORIGDEV:
@@ -3760,7 +3775,9 @@ packet_setsockopt(struct socket *sock, int level, int 
optname, char __user *optv
if (copy_from_user(, optval, sizeof(val)))
return -EFAULT;
 
+   lock_sock(sk);
po->origdev = !!val;
+   release_sock(sk);
return 0;
}
case PACKET_VNET_HDR:
@@ -3769,15 +3786,20 @@ packet_setsockopt(struct socket *sock, int level, int 
optname, char __user *optv
 
if (sock->type != SOCK_RAW)
return -EINVAL;
-   if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
-   return -EBUSY;
if (optlen < sizeof(val))
return -EINVAL;
if (copy_from_user(, optval, sizeof(val)))
return -EFAULT;
 
-   po->has_vnet_hdr = !!val;
-   return 0;
+   lock_sock(sk);
+   if 

Re: [PATCH] selftests: bpf: update .gitignore with missing file

2018-04-23 Thread Daniel Borkmann
On 04/23/2018 03:50 PM, Anders Roxell wrote:
> Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests")
> Signed-off-by: Anders Roxell 
> ---
>  tools/testing/selftests/bpf/.gitignore | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/testing/selftests/bpf/.gitignore 
> b/tools/testing/selftests/bpf/.gitignore
> index 5e1ab2f0eb79..3e3b3ced3f7c 100644
> --- a/tools/testing/selftests/bpf/.gitignore
> +++ b/tools/testing/selftests/bpf/.gitignore
> @@ -15,3 +15,4 @@ test_libbpf_open
>  test_sock
>  test_sock_addr
>  urandom_read
> +test_btf

Against which tree is this? This doesn't apply to bpf-next. It would
apply against bpf tree, but c0fa1b6c3efc ("bpf: btf: Add BTF tests")
is part of bpf-next, so fits to neither of them.


Re: [PATCH net-next 0/2] net/sctp: Avoid allocating high order memory with kmalloc()

2018-04-23 Thread Marcelo Ricardo Leitner
Hi,

On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote:
> Each SCTP association can have up to 65535 input and output streams.
> For each stream type an array of sctp_stream_in or sctp_stream_out
> structures is allocated using kmalloc_array() function. This function
> allocates physically contiguous memory regions, so this can lead
> to allocation of memory regions of very high order, i.e.:
>
>   sizeof(struct sctp_stream_out) == 24,
>   ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
>   which means 9th memory order.
>
> This can lead to a memory allocation failures on the systems
> under a memory stress.

Did you do performance tests while actually using these 65k streams
and with 256 (so it gets 2 pages)?

This will introduce another deref on each access to an element, but
I'm not expecting any impact due to it.

  Marcelo


Re: [PATCH net-next 1/2] net/sctp: Make wrappers for accessing in/out streams

2018-04-23 Thread Marcelo Ricardo Leitner
On Mon, Apr 23, 2018 at 09:41:05PM +0300, Oleg Babin wrote:
> This patch introduces wrappers for accessing in/out streams indirectly.
> This will enable to replace physically contiguous memory arrays
> of streams with flexible arrays (or maybe any other appropriate
> mechanism) which do memory allocation on a per-page basis.
>
> Signed-off-by: Oleg Babin 
> ---
>  include/net/sctp/structs.h   |  30 +++-
>  net/sctp/chunk.c |   6 ++-
>  net/sctp/outqueue.c  |  11 +++--
>  net/sctp/socket.c|   4 +-
>  net/sctp/stream.c| 107 
> +--
>  net/sctp/stream_interleave.c |   2 +-
>  net/sctp/stream_sched.c  |  13 +++---
>  net/sctp/stream_sched_prio.c |  22 -
>  net/sctp/stream_sched_rr.c   |   8 ++--
>  9 files changed, 116 insertions(+), 87 deletions(-)
>
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index a0ec462..578bb40 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -394,37 +394,37 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 
> outcnt, __u16 incnt,
>
>  /* What is the current SSN number for this stream? */
>  #define sctp_ssn_peek(stream, type, sid) \
> - ((stream)->type[sid].ssn)
> + (sctp_stream_##type##_ptr((stream), (sid))->ssn)
>
>  /* Return the next SSN number for this stream.   */
>  #define sctp_ssn_next(stream, type, sid) \
> - ((stream)->type[sid].ssn++)
> + (sctp_stream_##type##_ptr((stream), (sid))->ssn++)
>
>  /* Skip over this ssn and all below. */
>  #define sctp_ssn_skip(stream, type, sid, ssn) \
> - ((stream)->type[sid].ssn = ssn + 1)
> + (sctp_stream_##type##_ptr((stream), (sid))->ssn = ssn + 1)
>
>  /* What is the current MID number for this stream? */
>  #define sctp_mid_peek(stream, type, sid) \
> - ((stream)->type[sid].mid)
> + (sctp_stream_##type##_ptr((stream), (sid))->mid)
>
>  /* Return the next MID number for this stream.  */
>  #define sctp_mid_next(stream, type, sid) \
> - ((stream)->type[sid].mid++)
> + (sctp_stream_##type##_ptr((stream), (sid))->mid++)
>
>  /* Skip over this mid and all below. */
>  #define sctp_mid_skip(stream, type, sid, mid) \
> - ((stream)->type[sid].mid = mid + 1)
> + (sctp_stream_##type##_ptr((stream), (sid))->mid = mid + 1)
>
> -#define sctp_stream_in(asoc, sid) (&(asoc)->stream.in[sid])
> +#define sctp_stream_in(asoc, sid) sctp_stream_in_ptr(&(asoc)->stream, (sid))

This will get confusing:
- sctp_stream_in(asoc, sid)
- sctp_stream_in_ptr(stream, sid)

Considering all usages of sctp_stream_in(), seems you can just update
them to do the ->stream deref and keep only the later implementation.
Which then don't need the _ptr suffix.

  Marcelo


[bpf-next PATCH 2/4] bpf: sockmap, add a set of tests to run by default

2018-04-23 Thread John Fastabend
If no options are passed to sockmap after this patch we run a set of
tests using various options and sendmsg/sendpage sizes. This replaces
the sockmap_test.sh script.

Signed-off-by: John Fastabend 
---
 samples/sockmap/sockmap_user.c |  563 
 1 file changed, 512 insertions(+), 51 deletions(-)

diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
index 649f06a..22595b4 100644
--- a/samples/sockmap/sockmap_user.c
+++ b/samples/sockmap/sockmap_user.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -53,9 +54,11 @@
 #define S2_PORT 10001
 
 #define BPF_FILENAME "sockmap_kern.o"
+#define BPF_CGRP "/mnt/cgroup2/"
 
 /* global sockets */
 int s1, s2, c1, c2, p1, p2;
+int test_cnt;
 
 int txmsg_pass;
 int txmsg_noisy;
@@ -109,7 +112,7 @@ static void usage(char *argv[])
printf("\n");
 }
 
-static int sockmap_init_sockets(void)
+static int sockmap_init_sockets(int verbose)
 {
int i, err, one = 1;
struct sockaddr_in addr;
@@ -209,9 +212,11 @@ static int sockmap_init_sockets(void)
return errno;
}
 
-   printf("connected sockets: c1 <-> p1, c2 <-> p2\n");
-   printf("cgroups binding: c1(%i) <-> s1(%i) - - - c2(%i) <-> s2(%i)\n",
-   c1, s1, c2, s2);
+   if (verbose) {
+   printf("connected sockets: c1 <-> p1, c2 <-> p2\n");
+   printf("cgroups binding: c1(%i) <-> s1(%i) - - - c2(%i) <-> 
s2(%i)\n",
+   c1, s1, c2, s2);
+   }
return 0;
 }
 
@@ -329,10 +334,12 @@ static int msg_loop(int fd, int iov_count, int 
iov_length, int cnt,
clock_gettime(CLOCK_MONOTONIC, >end);
} else {
int slct, recv, max_fd = fd;
+   int fd_flags = O_NONBLOCK;
struct timeval timeout;
float total_bytes;
fd_set w;
 
+   fcntl(fd, fd_flags);
total_bytes = (float)iov_count * (float)iov_length * (float)cnt;
err = clock_gettime(CLOCK_MONOTONIC, >start);
if (err < 0)
@@ -351,7 +358,8 @@ static int msg_loop(int fd, int iov_count, int iov_length, 
int cnt,
clock_gettime(CLOCK_MONOTONIC, >end);
goto out_errno;
} else if (!slct) {
-   fprintf(stderr, "unexpected timeout\n");
+   if (opt->verbose)
+   fprintf(stderr, "unexpected timeout\n");
errno = -EIO;
clock_gettime(CLOCK_MONOTONIC, >end);
goto out_errno;
@@ -440,7 +448,7 @@ static int sendmsg_test(struct sockmap_options *opt)
iov_count = 1;
err = msg_loop(rx_fd, iov_count, iov_buf,
   cnt, , false, opt);
-   if (err)
+   if (err && opt->verbose)
fprintf(stderr,
"msg_loop_rx: iov_count %i iov_buf %i cnt %i 
err %i\n",
iov_count, iov_buf, cnt, err);
@@ -450,10 +458,11 @@ static int sendmsg_test(struct sockmap_options *opt)
sent_Bps = sentBps(s);
recvd_Bps = recvdBps(s);
}
-   fprintf(stdout,
-   "rx_sendmsg: TX: %zuB %fB/s %fGB/s RX: %zuB %fB/s 
%fGB/s\n",
-   s.bytes_sent, sent_Bps, sent_Bps/giga,
-   s.bytes_recvd, recvd_Bps, recvd_Bps/giga);
+   if (opt->verbose)
+   fprintf(stdout,
+   "rx_sendmsg: TX: %zuB %fB/s %fGB/s RX: %zuB 
%fB/s %fGB/s\n",
+   s.bytes_sent, sent_Bps, sent_Bps/giga,
+   s.bytes_recvd, recvd_Bps, recvd_Bps/giga);
exit(1);
} else if (rxpid == -1) {
perror("msg_loop_rx: ");
@@ -477,10 +486,11 @@ static int sendmsg_test(struct sockmap_options *opt)
sent_Bps = sentBps(s);
recvd_Bps = recvdBps(s);
}
-   fprintf(stdout,
-   "tx_sendmsg: TX: %zuB %fB/s %f GB/s RX: %zuB %fB/s 
%fGB/s\n",
-   s.bytes_sent, sent_Bps, sent_Bps/giga,
-   s.bytes_recvd, recvd_Bps, recvd_Bps/giga);
+   if (opt->verbose)
+   fprintf(stdout,
+   "tx_sendmsg: TX: %zuB %fB/s %f GB/s RX: %zuB 
%fB/s %fGB/s\n",
+   s.bytes_sent, sent_Bps, sent_Bps/giga,
+   s.bytes_recvd, recvd_Bps, recvd_Bps/giga);
exit(1);
} else if (txpid == -1) {
perror("msg_loop_tx: ");
@@ -575,17 +585,9 @@ enum 

[bpf-next PATCH 3/4] bpf: sockmap, add selftests

2018-04-23 Thread John Fastabend
This adds a new test program test_sockmap which is the old sample
sockmap program. By moving the sample program here we can now run it
as part of the self tests suite. To support this a populate_progs()
routine is added to load programs and maps which was previously done
with load_bpf_file(). This is needed because self test libs do not
provide a similar routine. Also we now use the cgroup_helpers
routines to manage cgroup use instead of manually creating one and
supplying it to the CLI.

Notice we keep the CLI around though because it is useful for dbg
and specialized testing.

To run use ./test_sockmap and the result should be,

Summary 660 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: John Fastabend 
---
 tools/include/uapi/linux/bpf.h  |1 
 tools/include/uapi/linux/if_link.h  |   39 +
 tools/lib/bpf/libbpf.c  |4 
 tools/lib/bpf/libbpf.h  |2 
 tools/testing/selftests/bpf/Makefile|5 
 tools/testing/selftests/bpf/test_sockmap.c  | 1465 +++
 tools/testing/selftests/bpf/test_sockmap_kern.c |  340 +
 7 files changed, 1852 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockmap.c
 create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7f7fbb9..c8383a2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -884,6 +884,7 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
 #define BPF_F_DONT_FRAGMENT(1ULL << 2)
+#define BPF_F_SEQ_NUMBER   (1ULL << 3)
 
 /* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
  * BPF_FUNC_perf_event_read_value flags.
diff --git a/tools/include/uapi/linux/if_link.h 
b/tools/include/uapi/linux/if_link.h
index 6d94477..68699f6 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -941,4 +941,43 @@ enum {
IFLA_EVENT_BONDING_OPTIONS, /* change in bonding options */
 };
 
+/* tun section */
+
+enum {
+   IFLA_TUN_UNSPEC,
+   IFLA_TUN_OWNER,
+   IFLA_TUN_GROUP,
+   IFLA_TUN_TYPE,
+   IFLA_TUN_PI,
+   IFLA_TUN_VNET_HDR,
+   IFLA_TUN_PERSIST,
+   IFLA_TUN_MULTI_QUEUE,
+   IFLA_TUN_NUM_QUEUES,
+   IFLA_TUN_NUM_DISABLED_QUEUES,
+   __IFLA_TUN_MAX,
+};
+
+#define IFLA_TUN_MAX (__IFLA_TUN_MAX - 1)
+
+/* rmnet section */
+
+#define RMNET_FLAGS_INGRESS_DEAGGREGATION (1U << 0)
+#define RMNET_FLAGS_INGRESS_MAP_COMMANDS  (1U << 1)
+#define RMNET_FLAGS_INGRESS_MAP_CKSUMV4   (1U << 2)
+#define RMNET_FLAGS_EGRESS_MAP_CKSUMV4(1U << 3)
+
+enum {
+   IFLA_RMNET_UNSPEC,
+   IFLA_RMNET_MUX_ID,
+   IFLA_RMNET_FLAGS,
+   __IFLA_RMNET_MAX,
+};
+
+#define IFLA_RMNET_MAX (__IFLA_RMNET_MAX - 1)
+
+struct ifla_rmnet_flags {
+   __u32   flags;
+   __u32   mask;
+};
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 6513e0b..7bcdca1 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1961,8 +1961,8 @@ static bool bpf_program__is_type(struct bpf_program *prog,
 BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP);
 BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
 
-static void bpf_program__set_expected_attach_type(struct bpf_program *prog,
-enum bpf_attach_type type)
+void bpf_program__set_expected_attach_type(struct bpf_program *prog,
+  enum bpf_attach_type type)
 {
prog->expected_attach_type = type;
 }
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index d6ac4fa..197f9ce 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -193,6 +193,8 @@ int bpf_program__set_prep(struct bpf_program *prog, int 
nr_instance,
 int bpf_program__set_xdp(struct bpf_program *prog);
 int bpf_program__set_perf_event(struct bpf_program *prog);
 void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);
+void bpf_program__set_expected_attach_type(struct bpf_program *prog,
+  enum bpf_attach_type type);
 
 bool bpf_program__is_socket_filter(struct bpf_program *prog);
 bool bpf_program__is_tracepoint(struct bpf_program *prog);
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 0b72cc7..094d507 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -24,7 +24,7 @@ urandom_read: urandom_read.c
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map 
test_progs \
test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
-   test_sock test_sock_addr test_btf
+   

[bpf-next PATCH 1/4] bpf: sockmap, code sockmap_test in C

2018-04-23 Thread John Fastabend
By moving sockmap_test from shell script into C we can run it directly
from selftests, but we can also push the input/output around in proper
structures.

However, keep the CLI options around because they are useful for
debugging when a paticular pattern of msghdr or sockmap options
trips up the sockmap code path.

Signed-off-by: John Fastabend 
---
 samples/sockmap/sockmap_user.c |  209 ++--
 1 file changed, 113 insertions(+), 96 deletions(-)

diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c
index 6f23349..649f06a 100644
--- a/samples/sockmap/sockmap_user.c
+++ b/samples/sockmap/sockmap_user.c
@@ -52,6 +52,8 @@
 #define S1_PORT 1
 #define S2_PORT 10001
 
+#define BPF_FILENAME "sockmap_kern.o"
+
 /* global sockets */
 int s1, s2, c1, c2, p1, p2;
 
@@ -226,6 +228,9 @@ struct sockmap_options {
bool sendpage;
bool data_test;
bool drop_expected;
+   int iov_count;
+   int iov_length;
+   int rate;
 };
 
 static int msg_loop_sendpage(int fd, int iov_length, int cnt,
@@ -409,12 +414,14 @@ static inline float recvdBps(struct msg_stats s)
return s.bytes_recvd / (s.end.tv_sec - s.start.tv_sec);
 }
 
-static int sendmsg_test(int iov_count, int iov_buf, int cnt,
-   struct sockmap_options *opt)
+static int sendmsg_test(struct sockmap_options *opt)
 {
float sent_Bps = 0, recvd_Bps = 0;
int rx_fd, txpid, rxpid, err = 0;
struct msg_stats s = {0};
+   int iov_count = opt->iov_count;
+   int iov_buf = opt->iov_length;
+   int cnt = opt->rate;
int status;
 
errno = 0;
@@ -568,98 +575,13 @@ enum {
SENDPAGE,
 };
 
-int main(int argc, char **argv)
+static int run_options(struct sockmap_options options, int cg_fd,  int test)
 {
-   int iov_count = 1, length = 1024, rate = 1, tx_prog_fd;
-   struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
-   int opt, longindex, err, cg_fd = 0;
-   struct sockmap_options options = {0};
-   int test = PING_PONG;
+   char *bpf_file = BPF_FILENAME;
+   int err, tx_prog_fd;
char filename[256];
 
-   while ((opt = getopt_long(argc, argv, ":dhvc:r:i:l:t:",
- long_options, )) != -1) {
-   switch (opt) {
-   case 's':
-   txmsg_start = atoi(optarg);
-   break;
-   case 'e':
-   txmsg_end = atoi(optarg);
-   break;
-   case 'a':
-   txmsg_apply = atoi(optarg);
-   break;
-   case 'k':
-   txmsg_cork = atoi(optarg);
-   break;
-   case 'c':
-   cg_fd = open(optarg, O_DIRECTORY, O_RDONLY);
-   if (cg_fd < 0) {
-   fprintf(stderr,
-   "ERROR: (%i) open cg path failed: %s\n",
-   cg_fd, optarg);
-   return cg_fd;
-   }
-   break;
-   case 'r':
-   rate = atoi(optarg);
-   break;
-   case 'v':
-   options.verbose = 1;
-   break;
-   case 'i':
-   iov_count = atoi(optarg);
-   break;
-   case 'l':
-   length = atoi(optarg);
-   break;
-   case 'd':
-   options.data_test = true;
-   break;
-   case 't':
-   if (strcmp(optarg, "ping") == 0) {
-   test = PING_PONG;
-   } else if (strcmp(optarg, "sendmsg") == 0) {
-   test = SENDMSG;
-   } else if (strcmp(optarg, "base") == 0) {
-   test = BASE;
-   } else if (strcmp(optarg, "base_sendpage") == 0) {
-   test = BASE_SENDPAGE;
-   } else if (strcmp(optarg, "sendpage") == 0) {
-   test = SENDPAGE;
-   } else {
-   usage(argv);
-   return -1;
-   }
-   break;
-   case 0:
-   break;
-   case 'h':
-   default:
-   usage(argv);
-   return -1;
-   }
-   }
-
-   if (!cg_fd) {
-   fprintf(stderr, "%s requires cgroup option: --cgroup \n",
-   argv[0]);
-   return -1;
-   }
-
-   if (setrlimit(RLIMIT_MEMLOCK, )) {
-   perror("setrlimit(RLIMIT_MEMLOCK)");
-  

[bpf-next PATCH 4/4] bpf: sockmap, remove samples program

2018-04-23 Thread John Fastabend
The BPF sample sockmap is redundant now that equivelant tests exist
in the BPF selftests. Lets remove this sample and only keep the
selftest version that will be run as part of the selftest suite.

Signed-off-by: John Fastabend 
---
 samples/sockmap/Makefile|   75 --
 samples/sockmap/sockmap_kern.c  |  341 --
 samples/sockmap/sockmap_test.sh |  488 --
 samples/sockmap/sockmap_user.c  | 1372 ---
 4 files changed, 2276 deletions(-)
 delete mode 100644 samples/sockmap/Makefile
 delete mode 100644 samples/sockmap/sockmap_kern.c
 delete mode 100755 samples/sockmap/sockmap_test.sh
 delete mode 100644 samples/sockmap/sockmap_user.c

diff --git a/samples/sockmap/Makefile b/samples/sockmap/Makefile
deleted file mode 100644
index 9bf2881..000
--- a/samples/sockmap/Makefile
+++ /dev/null
@@ -1,75 +0,0 @@
-# List of programs to build
-hostprogs-y := sockmap
-
-# Libbpf dependencies
-LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
-
-HOSTCFLAGS += -I$(objtree)/usr/include
-HOSTCFLAGS += -I$(srctree)/tools/lib/
-HOSTCFLAGS += -I$(srctree)/tools/testing/selftests/bpf/
-HOSTCFLAGS += -I$(srctree)/tools/lib/ -I$(srctree)/tools/include
-HOSTCFLAGS += -I$(srctree)/tools/perf
-
-sockmap-objs := ../bpf/bpf_load.o $(LIBBPF) sockmap_user.o
-
-# Tell kbuild to always build the programs
-always := $(hostprogs-y)
-always += sockmap_kern.o
-
-HOSTLOADLIBES_sockmap += -lelf -lpthread
-
-# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
-#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
-LLC ?= llc
-CLANG ?= clang
-
-# Trick to allow make to be run from this directory
-all:
-   $(MAKE) -C ../../ $(CURDIR)/
-
-clean:
-   $(MAKE) -C ../../ M=$(CURDIR) clean
-   @rm -f *~
-
-$(obj)/syscall_nrs.s:  $(src)/syscall_nrs.c
-   $(call if_changed_dep,cc_s_c)
-
-$(obj)/syscall_nrs.h:  $(obj)/syscall_nrs.s FORCE
-   $(call filechk,offsets,__SYSCALL_NRS_H__)
-
-clean-files += syscall_nrs.h
-
-FORCE:
-
-
-# Verify LLVM compiler tools are available and bpf target is supported by llc
-.PHONY: verify_cmds verify_target_bpf $(CLANG) $(LLC)
-
-verify_cmds: $(CLANG) $(LLC)
-   @for TOOL in $^ ; do \
-   if ! (which -- "$${TOOL}" > /dev/null 2>&1); then \
-   echo "*** ERROR: Cannot find LLVM tool $${TOOL}" ;\
-   exit 1; \
-   else true; fi; \
-   done
-
-verify_target_bpf: verify_cmds
-   @if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
-   echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
-   echo "   NOTICE: LLVM version >= 3.7.1 required" ;\
-   exit 2; \
-   else true; fi
-
-$(src)/*.c: verify_target_bpf
-
-# asm/sysreg.h - inline assembly used by it is incompatible with llvm.
-# But, there is no easy way to fix it, so just exclude it since it is
-# useless for BPF samples.
-$(obj)/%.o: $(src)/%.c
-   $(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
-   -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value 
-Wno-pointer-sign \
-   -Wno-compare-distinct-pointer-types \
-   -Wno-gnu-variable-sized-type-not-at-end \
-   -Wno-address-of-packed-member -Wno-tautological-compare \
-   -Wno-unknown-warning-option \
-   -O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
diff --git a/samples/sockmap/sockmap_kern.c b/samples/sockmap/sockmap_kern.c
deleted file mode 100644
index 9ff8bc5..000
--- a/samples/sockmap/sockmap_kern.c
+++ /dev/null
@@ -1,341 +0,0 @@
-/* Copyright (c) 2017 Covalent IO, Inc. http://covalent.io
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of version 2 of the GNU General Public
- * License as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- */
-#include 
-#include 
-#include 
-#include 
-#include "../../tools/testing/selftests/bpf/bpf_helpers.h"
-#include "../../tools/testing/selftests/bpf/bpf_endian.h"
-
-/* Sockmap sample program connects a client and a backend together
- * using cgroups.
- *
- *client:X <---> frontend:80 client:X <---> backend:80
- *
- * For simplicity we hard code values here and bind 1:1. The hard
- * coded values are part of the setup in sockmap.sh script that
- * is associated with this BPF program.
- *
- * The bpf_printk is verbose and prints information as connections
- * are established and verdicts are decided.
- */
-
-#define bpf_printk(fmt, ...)   \
-({ \
-  

[bpf-next PATCH 0/4] selftests for BPF sockmap use cases

2018-04-23 Thread John Fastabend
This series moves ./samples/sockmap into BPF selftests. There are a
few good reasons to do this. First, by pushing this into selftests
the tests will be run automatically. Second, sockmap was not really
a sample of anything anymore, but rather a large set of tests.

Note: There are three recent fixes outstanding against bpf branch
that can be detected occosionally by the automated tests here.

https://patchwork.ozlabs.org/patch/903138/
https://patchwork.ozlabs.org/patch/903139/
https://patchwork.ozlabs.org/patch/903140/

---

John Fastabend (4):
  bpf: sockmap, code sockmap_test in C
  bpf: sockmap, add a set of tests to run by default
  bpf: sockmap, add selftests
  bpf: sockmap, remove samples program


 samples/sockmap/Makefile|   75 -
 samples/sockmap/sockmap_kern.c  |  341 -
 samples/sockmap/sockmap_test.sh |  488 
 samples/sockmap/sockmap_user.c  |  894 --
 tools/include/uapi/linux/bpf.h  |1 
 tools/include/uapi/linux/if_link.h  |   39 +
 tools/lib/bpf/libbpf.c  |4 
 tools/lib/bpf/libbpf.h  |2 
 tools/testing/selftests/bpf/Makefile|5 
 tools/testing/selftests/bpf/test_sockmap.c  | 1465 +++
 tools/testing/selftests/bpf/test_sockmap_kern.c |  340 +
 11 files changed, 1852 insertions(+), 1802 deletions(-)
 delete mode 100644 samples/sockmap/Makefile
 delete mode 100644 samples/sockmap/sockmap_kern.c
 delete mode 100755 samples/sockmap/sockmap_test.sh
 delete mode 100644 samples/sockmap/sockmap_user.c
 create mode 100644 tools/testing/selftests/bpf/test_sockmap.c
 create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.c

--
Signature


[PATCH bpf-next v6 06/10] tools/bpf: add bpf_get_stack helper to tools headers

2018-04-23 Thread Yonghong Song
The tools header file bpf.h is synced with kernel uapi bpf.h.
The new helper is also added to bpf_helpers.h.

Signed-off-by: Yonghong Song 
---
 tools/include/uapi/linux/bpf.h| 19 +--
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++-
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7f7fbb9..116eb5f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -529,6 +529,17 @@ union bpf_attr {
  * other bits - reserved
  * Return: >= 0 stackid on success or negative error
  *
+ * int bpf_get_stack(ctx, buf, size, flags)
+ * walk user or kernel stack and store the ips in buf
+ * @ctx: struct pt_regs*
+ * @buf: user buffer to fill stack
+ * @size: the buf size
+ * @flags: bits 0-7 - numer of stack frames to skip
+ * bit 8 - collect user stack instead of kernel
+ * bit 11 - get build-id as well if user stack
+ * other bits - reserved
+ * Return: >= 0 size copied on success or negative error
+ *
  * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
  * calculate csum diff
  * @from: raw from buffer
@@ -841,7 +852,8 @@ union bpf_attr {
FN(msg_cork_bytes), \
FN(msg_pull_data),  \
FN(bind),   \
-   FN(xdp_adjust_tail),
+   FN(xdp_adjust_tail),\
+   FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -875,11 +887,14 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6 (1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */
 #define BPF_F_SKIP_FIELD_MASK  0xffULL
 #define BPF_F_USER_STACK   (1ULL << 8)
+/* flags used by BPF_FUNC_get_stackid only. */
 #define BPF_F_FAST_STACK_CMP   (1ULL << 9)
 #define BPF_F_REUSE_STACKID(1ULL << 10)
+/* flags used by BPF_FUNC_get_stack only. */
+#define BPF_F_USER_BUILD_ID(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index 9271576..2d9d650 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -98,7 +98,8 @@ static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
 static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =
(void *) BPF_FUNC_xdp_adjust_tail;
-
+static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
+   (void *) BPF_FUNC_get_stack;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.5



Re: [PATCH bpf] bpf: disable and restore preemption in __BPF_PROG_RUN_ARRAY

2018-04-23 Thread Daniel Borkmann
On 04/23/2018 07:09 PM, Roman Gushchin wrote:
> Running bpf programs requires disabled preemption,
> however at least some* of the BPF_PROG_RUN_ARRAY users
> do not follow this rule.
> 
> To fix this bug, and also to make it not happen in the future,
> let's add explicit preemption disabling/re-enabling
> to the __BPF_PROG_RUN_ARRAY code.
> 
> * for example:
>  [   17.624472] RIP: 0010:__cgroup_bpf_run_filter_sk+0x1c4/0x1d0
>  ...
>  [   17.640890]  inet6_create+0x3eb/0x520
>  [   17.641405]  __sock_create+0x242/0x340
>  [   17.641939]  __sys_socket+0x57/0xe0
>  [   17.642370]  ? trace_hardirqs_off_thunk+0x1a/0x1c
>  [   17.642944]  SyS_socket+0xa/0x10
>  [   17.643357]  do_syscall_64+0x79/0x220
>  [   17.643879]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> Signed-off-by: Roman Gushchin 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 

Applied to bpf, thanks Roman.


[PATCH bpf-next v6 09/10] tools/bpf: add a test for bpf_get_stack with raw tracepoint prog

2018-04-23 Thread Yonghong Song
The test attached a raw_tracepoint program to sched/sched_switch.
It tested to get stack for user space, kernel space and user
space with build_id request. It also tested to get user
and kernel stack into the same buffer with back-to-back
bpf_get_stack helper calls.

Whenever the kernel stack is available, the user space
application will check to ensure that the kernel function
for raw_tracepoint ___bpf_prog_run is part of the stack.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/Makefile   |   3 +-
 tools/testing/selftests/bpf/test_get_stack_rawtp.c | 102 +
 tools/testing/selftests/bpf/test_progs.c   | 122 +
 3 files changed, 226 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_get_stack_rawtp.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 0b72cc7..54e9e74 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -32,7 +32,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o 
test_adjust_tail.o \
-   test_btf_haskv.o test_btf_nokv.o
+   test_btf_haskv.o test_btf_nokv.o test_get_stack_rawtp.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -56,6 +56,7 @@ $(TEST_GEN_PROGS_EXTENDED): $(OUTPUT)/libbpf.a
 $(OUTPUT)/test_dev_cgroup: cgroup_helpers.c
 $(OUTPUT)/test_sock: cgroup_helpers.c
 $(OUTPUT)/test_sock_addr: cgroup_helpers.c
+$(OUTPUT)/test_progs: trace_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/test_get_stack_rawtp.c 
b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
new file mode 100644
index 000..ba1dcf9
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_get_stack_rawtp.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include "bpf_helpers.h"
+
+/* Permit pretty deep stack traces */
+#define MAX_STACK_RAWTP 100
+struct stack_trace_t {
+   int pid;
+   int kern_stack_size;
+   int user_stack_size;
+   int user_stack_buildid_size;
+   __u64 kern_stack[MAX_STACK_RAWTP];
+   __u64 user_stack[MAX_STACK_RAWTP];
+   struct bpf_stack_build_id user_stack_buildid[MAX_STACK_RAWTP];
+};
+
+struct bpf_map_def SEC("maps") perfmap = {
+   .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(__u32),
+   .max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") stackdata_map = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(struct stack_trace_t),
+   .max_entries = 1,
+};
+
+/* Allocate per-cpu space twice the needed. For the code below
+ *   usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
+ *   if (usize < 0)
+ * return 0;
+ *   ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize, 0);
+ *
+ * If we have value_size = MAX_STACK_RAWTP * sizeof(__u64),
+ * verifier will complain that access "raw_data + usize"
+ * with size "max_len - usize" may be out of bound.
+ * The maximum "raw_data + usize" is "raw_data + max_len"
+ * and the maximum "max_len - usize" is "max_len", verifier
+ * concludes that the maximum buffer access range is
+ * "raw_data[0...max_len * 2 - 1]" and hence reject the program.
+ *
+ * Doubling the to-be-used max buffer size can fix this verifier
+ * issue and avoid complicated C programming massaging.
+ * This is an acceptable workaround since there is one entry here.
+ */
+struct bpf_map_def SEC("maps") rawdata_map = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = MAX_STACK_RAWTP * sizeof(__u64) * 2,
+   .max_entries = 1,
+};
+
+SEC("tracepoint/sched/sched_switch")
+int bpf_prog1(void *ctx)
+{
+   int max_len, max_buildid_len, usize, ksize, total_size;
+   struct stack_trace_t *data;
+   void *raw_data;
+   __u32 key = 0;
+
+   data = bpf_map_lookup_elem(_map, );
+   if (!data)
+   return 0;
+
+   max_len = MAX_STACK_RAWTP * sizeof(__u64);
+   max_buildid_len = MAX_STACK_RAWTP * sizeof(struct bpf_stack_build_id);
+   data->pid = bpf_get_current_pid_tgid();
+   data->kern_stack_size = bpf_get_stack(ctx, data->kern_stack,
+ max_len, 0);
+   data->user_stack_size = bpf_get_stack(ctx, data->user_stack, max_len,
+   BPF_F_USER_STACK);
+   data->user_stack_buildid_size = bpf_get_stack(
+   ctx, data->user_stack_buildid, max_buildid_len,
+   BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
+   bpf_perf_event_output(ctx, , 0, data, sizeof(*data));
+
+   /* write both 

  1   2   3   4   >