from:"\"Cong Wang\""

Re: [PATCH net-next v2] net/sched: act_police: add support for packet-per-second policing

2021-01-29 Thread Cong Wang

On Fri, Jan 29, 2021 at 2:29 AM Simon Horman  wrote:
> +/**
> + * psched_ratecfg_precompute__() - Pre-compute values for reciprocal division
> + * @rate:   Rate to compute reciprocal division values of
> + * @mult:   Multiplier for reciprocal division
> + * @shift:  Shift for reciprocal division
> + *
> + * The multiplier and shift for reciprocal division by rate are stored
> + * in mult and shift.
> + *
> + * The deal here is to replace a divide by a reciprocal one
> + * in fast path (a reciprocal divide is a multiply and a shift)
> + *
> + * Normal formula would be :
> + *  time_in_ns = (NSEC_PER_SEC * len) / rate_bps
> + *
> + * We compute mult/shift to use instead :
> + *  time_in_ns = (len * mult) >> shift;
> + *
> + * We try to get the highest possible mult value for accuracy,
> + * but have to make sure no overflows will ever happen.
> + *
> + * reciprocal_value() is not used here it doesn't handle 64-bit values.
> + */
> +static void psched_ratecfg_precompute__(u64 rate, u32 *mult, u8 *shift)

Am I the only one who thinks "foo__()" is an odd name? We usually use
"__foo()" for helper or unlocked version of "foo()".

And, you probably want to move the function doc to its exported variant
instead, right?

Thanks.

Re: [Patch bpf-next v5 1/3] bpf: introduce timeout hash map

2021-01-28 Thread Cong Wang

On Thu, Jan 28, 2021 at 6:54 PM Alexei Starovoitov
 wrote:
>
> I meant it would look like:
>
> noinline per_elem_callback(map, key, value, ...)
> {
>   if (value->foo > ...)
> bpf_delete_map_elem(map, key);
> }
>
> noinline timer_callback(timer, ctx)
> {
>   map = ctx->map;
>   bpf_for_each_map_elem(map, per_elem_callback, ...);
> }
>
> int main_bpf_prog(skb)
> {
>   bpf_timer_setup(my_timer, timer_callback, ...);
>   bpf_mod_timer(my_timer, HZ);
> }
>
> The bpf_for_each_map_elem() work is already in progress. Expect patches to hit
> mailing list soon.

We don't want a per-element timer, we want a per-map timer but that
requires a way to iterate the whole map. If you or other can provide
bpf_for_each_map_elem(), we can certainly build our timeout map
on top of it.

> If you can work on patches for bpf_timer_*() it would be awesome.

Yeah, I will work on this, not only for timeout map, but also possibly for
the ebpf qdisc I plan to add soon.

Thanks.

Re: [Patch net] net: fix dev_ifsioc_locked() race condition

2021-01-28 Thread Cong Wang

On Thu, Jan 28, 2021 at 9:21 PM Jakub Kicinski  wrote:
>
> On Thu, 28 Jan 2021 21:08:05 -0800 Cong Wang wrote:
> > On Thu, Jan 28, 2021 at 12:55 PM Jakub Kicinski  wrote:
> > >
> > > On Sat, 23 Jan 2021 17:30:49 -0800 Cong Wang wrote:
> > > > From: Cong Wang 
> > > >
> > > > dev_ifsioc_locked() is called with only RCU read lock, so when
> > > > there is a parallel writer changing the mac address, it could
> > > > get a partially updated mac address, as shown below:
> > > >
> > > > Thread 1  Thread 2
> > > > // eth_commit_mac_addr_change()
> > > > memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
> > > >   // dev_ifsioc_locked()
> > > >   memcpy(ifr->ifr_hwaddr.sa_data,
> > > >   dev->dev_addr,...);
> > > >
> > > > Close this race condition by guarding them with a RW semaphore,
> > > > like netdev_get_name(). The writers take RTNL anyway, so this
> > > > will not affect the slow path.
> > > >
> > > > Fixes: 3710becf8a58 ("net: RCU locking for simple ioctl()")
> > > > Reported-by: "Gong, Sishuai" 
> > > > Cc: Eric Dumazet 
> > > > Signed-off-by: Cong Wang 
> > >
> > > The addition of the write lock scares me a little for a fix, there's a
> > > lot of code which can potentially run under the callbacks and notifiers
> > > there.
> > >
> > > What about using a seqlock?
> >
> > Actually I did use seqlock in my initial version (not posted), it does not
> > allow blocking inside write_seqlock() protection, so I have to change
> > to rwsem.
>
> Argh, you're right. No way we can construct something that tries to
> read once and if it fails falls back to waiting for RTNL?

I don't think there is any way to tell whether the read fails, a partially
updated address can not be detected without additional flags etc..

And devnet_rename_sem is already there, pretty much similar to this
one.

Thanks.

Re: [PATCH net-next v4] net: psample: Introduce stubs to remove NIC driver dependency

2021-01-28 Thread Cong Wang

On Wed, Jan 27, 2021 at 5:48 PM Chris Mi  wrote:
>
> In order to send sampled packets to userspace, NIC driver calls
> psample api directly. But it creates a hard dependency on module
> psample. Introduce psample_ops to remove the hard dependency.
> It is initialized when psample module is loaded and set to NULL
> when the module is unloaded.

Is it too crazy to just make CONFIG_PSAMPLE a bool so that it
won't be a module?

Thanks.

Re: [Patch net] net: fix dev_ifsioc_locked() race condition

2021-01-28 Thread Cong Wang

On Thu, Jan 28, 2021 at 12:55 PM Jakub Kicinski  wrote:
>
> On Sat, 23 Jan 2021 17:30:49 -0800 Cong Wang wrote:
> > From: Cong Wang 
> >
> > dev_ifsioc_locked() is called with only RCU read lock, so when
> > there is a parallel writer changing the mac address, it could
> > get a partially updated mac address, as shown below:
> >
> > Thread 1  Thread 2
> > // eth_commit_mac_addr_change()
> > memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
> >   // dev_ifsioc_locked()
> >   memcpy(ifr->ifr_hwaddr.sa_data,
> >   dev->dev_addr,...);
> >
> > Close this race condition by guarding them with a RW semaphore,
> > like netdev_get_name(). The writers take RTNL anyway, so this
> > will not affect the slow path.
> >
> > Fixes: 3710becf8a58 ("net: RCU locking for simple ioctl()")
> > Reported-by: "Gong, Sishuai" 
> > Cc: Eric Dumazet 
> > Signed-off-by: Cong Wang 
>
> The addition of the write lock scares me a little for a fix, there's a
> lot of code which can potentially run under the callbacks and notifiers
> there.
>
> What about using a seqlock?

Actually I did use seqlock in my initial version (not posted), it does not
allow blocking inside write_seqlock() protection, so I have to change
to rwsem.

Thanks.

Re: [Patch bpf-next v5 1/3] bpf: introduce timeout hash map

2021-01-27 Thread Cong Wang

On Wed, Jan 27, 2021 at 10:00 AM Alexei Starovoitov
 wrote:
>
> On Tue, Jan 26, 2021 at 11:00 PM Cong Wang  wrote:
> > > >   ret = PTR_ERR(l_new);
> > > > + if (ret == -EAGAIN) {
> > > > + htab_unlock_bucket(htab, b, hash, flags);
> > > > + htab_gc_elem(htab, l_old);
> > > > + mod_delayed_work(system_unbound_wq, 
> > > > &htab->gc_work, 0);
> > > > + goto again;
> > >
> > > Also this one looks rather worrying, so the BPF prog is stalled here, 
> > > loop-waiting
> > > in (e.g. XDP) hot path for system_unbound_wq to kick in to make forward 
> > > progress?
> >
> > In this case, the old one is scheduled for removal in GC, we just wait for 
> > GC
> > to finally remove it. It won't stall unless GC itself or the worker 
> > scheduler is
> > wrong, both of which should be kernel bugs.
> >
> > If we don't do this, users would get a -E2BIG when it is not too big. I 
> > don't
> > know a better way to handle this sad situation, maybe returning -EBUSY
> > to users and let them call again?
>
> I think using wq for timers is a non-starter.
> Tying a hash/lru map with a timer is not a good idea either.

Both xt_hashlimit and nf_conntrack_core use delayed/deferrable
works, probably since their beginnings. They seem to have started
well. ;)

>
> I think timers have to be done as independent objects similar to
> how the kernel uses them.
> Then there will be no question whether lru or hash map needs it.

Yeah, this probably could make the code easier, but when we have
millions of entries in a map, millions of timers would certainly bring
a lot of CPU overhead (timer interrupt storm?).

> The bpf prog author will be able to use timers with either.
> The prog will be able to use timers without hash maps too.
>
> I'm proposing a timer map where each object will go through
> bpf_timer_setup(timer, callback, flags);
> where "callback" is a bpf subprogram.
> Corresponding bpf_del_timer and bpf_mod_timer would work the same way
> they are in the kernel.
> The tricky part is kernel style of using from_timer() to access the
> object with additional info.
> I think bpf timer map can model it the same way.
> At map creation time the value_size will specify the amount of extra
> bytes necessary.
> Another alternative is to pass an extra data argument to a callback.

Hmm, this idea is very interesting. I still think arming a timer,
whether a kernel timer or a bpf timer, for each entry is overkill,
but we can arm one for each map, something like:

bpf_timer_run(interval, bpf_prog, &any_map);

so we run 'bpf_prog' on any map every 'interval', but the 'bpf_prog'
would have to iterate the whole map during each interval to delete
the expired ones. This is probably doable: the timestamps can be
stored either as a part of key or value, and bpf_jiffies64() is already
available, users would have to discard expired ones after lookup
when they are faster than the timer GC.

Let me take a deeper look tomorrow.

Thanks.

Re: [PATCH] neighbour: Prevent a dead entry from updating gc_list

2021-01-27 Thread Cong Wang

On Wed, Jan 27, 2021 at 8:55 AM Chinmay Agarwal  wrote:
>
> Following race condition was detected:
>  - neigh_flush_dev() is under execution and calls
> neigh_mark_dead(n) marking the neighbour entry 'n' as dead.
>
>  - Executing: __netif_receive_skb() ->
> __netif_receive_skb_core() -> arp_rcv() -> arp_process().arp_process()
> calls __neigh_lookup() which takes a reference on neighbour entry 'n'.
>
>  - Moves further along neigh_flush_dev() and calls
> neigh_cleanup_and_release(n), but since reference count increased in t2,
> 'n' couldn't be destroyed.
>
>  - Moves further along, arp_process() and calls
> neigh_update()-> __neigh_update() -> neigh_update_gc_list(), which adds
> the neighbour entry back in gc_list(neigh_mark_dead(), removed it
> earlier in t0 from gc_list)
>
>  - arp_process() finally calls neigh_release(n), destroying
> the neighbour entry.
>
> This leads to 'n' still being part of gc_list, but the actual
> neighbour structure has been freed.
>
> The situation can be prevented from happening if we disallow a dead
> entry to have any possibility of updating gc_list. This is what the
> patch intends to achieve.
>
> Fixes: 9c29a2f55ec0 ("neighbor: Fix locking order for gc_list changes")
> Signed-off-by: Chinmay Agarwal 

Reviewed-by: Cong Wang 

Thanks.

Re: [Patch net] net: fix dev_ifsioc_locked() race condition

2021-01-27 Thread Cong Wang

On Mon, Jan 25, 2021 at 11:43 AM Gong, Sishuai  wrote:
>
> Hi,
>
> We also found another pair of writer and reader that may suffer the same 
> problem.
>
> A data race may also happen on the variable netdev->dev_addr when functions 
> e1000_set_mac() and packet_getname() run in parallel, eventually it could 
> return a partially updated MAC address to the user, as shown below:

Yeah, this one requires a separate patch, because at least
it uses an index instead of a name to lookup the devices.

Thanks.

Re: BUG: Incorrect MTU on GRE device if remote is unspecified

2021-01-27 Thread Cong Wang

On Wed, Jan 27, 2021 at 4:56 PM Jakub Kicinski  wrote:
>
> On Mon, 25 Jan 2021 22:10:10 +0200 Slava Bacherikov wrote:
> > Hi, I'd like to report a regression. Currently, if you create GRE
> > interface on the latest stable or LTS kernel (5.4 branch) with
> > unspecified remote destination it's MTU will be adjusted for header size
> > twice. For example:
> >
> > $ ip link add name test type gre local 127.0.0.32
> > $ ip link show test | grep mtu
> > 27: test@NONE:  mtu 1452 qdisc noop state DOWN mode DEFAULT group
> > default qlen 1000
> >
> > or with FOU
> >
> > $ ip link add name test2   type gre local 127.0.0.32 encap fou
> > encap-sport auto encap-dport 
> > $ ip link show test2 | grep mtu
> > 28: test2@NONE:  mtu 1436 qdisc noop state DOWN mode DEFAULT
> > group default qlen 1000
> >
> > The same happens with GUE too (MTU is 1428 instead of 1464).
> > As you can see that MTU in first case is 1452 (1500 - 24 - 24) and with
> > FOU it's 1436 (1500 - 32 - 32), GUE 1428 (1500 - 36 - 36). If remote
> > address is specified MTU is correct.
> >
> > This regression caused by fdafed459998e2be0e877e6189b24cb7a0183224 commit.
>
> Cong is this one on your radar?

Yeah, I guess ipgre_link_update() somehow gets called twice,
but I will need to look into it.

Thanks.

[Patch bpf-next] skmsg: make sk_psock_destroy() static

2021-01-27 Thread Cong Wang

From: Cong Wang 

sk_psock_destroy() is a RCU callback, I can't see any reason why
it could be used outside.

Cc: John Fastabend 
Cc: Daniel Borkmann 
Cc: Jakub Sitnicki 
Cc: Lorenz Bauer 
Signed-off-by: Cong Wang 
---
 include/linux/skmsg.h | 1 -
 net/core/skmsg.c  | 3 +--
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index fec0c5ac1c4f..8edbbf5f2f93 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -390,7 +390,6 @@ static inline struct sk_psock *sk_psock_get(struct sock *sk)
 }
 
 void sk_psock_stop(struct sock *sk, struct sk_psock *psock);
-void sk_psock_destroy(struct rcu_head *rcu);
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock);
 
 static inline void sk_psock_put(struct sock *sk, struct sk_psock *psock)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 25cdbb20f3a0..1261512d6807 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -669,14 +669,13 @@ static void sk_psock_destroy_deferred(struct work_struct 
*gc)
kfree(psock);
 }
 
-void sk_psock_destroy(struct rcu_head *rcu)
+static void sk_psock_destroy(struct rcu_head *rcu)
 {
struct sk_psock *psock = container_of(rcu, struct sk_psock, rcu);
 
INIT_WORK(&psock->gc, sk_psock_destroy_deferred);
schedule_work(&psock->gc);
 }
-EXPORT_SYMBOL_GPL(sk_psock_destroy);
 
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
-- 
2.25.1

Re: [Patch bpf-next v5 1/3] bpf: introduce timeout hash map

2021-01-26 Thread Cong Wang

On Tue, Jan 26, 2021 at 2:04 PM Daniel Borkmann  wrote:
>
> On 1/22/21 9:54 PM, Cong Wang wrote:
> > From: Cong Wang 
> >
> > This borrows the idea from conntrack and will be used for conntrack in
> > ebpf too. Each element in a timeout map has a user-specified timeout
> > in msecs, after it expires it will be automatically removed from the
> > map. Cilium already does the same thing, it uses a regular map or LRU
> > map to track connections and has its own GC in user-space. This does
> > not scale well when we have millions of connections, as each removal
> > needs a syscall. Even if we could batch the operations, it still needs
> > to copy a lot of data between kernel and user space.
> >
> > There are two cases to consider here:
> >
> > 1. When the timeout map is idle, i.e. no one updates or accesses it,
> > we rely on the delayed work to scan the whole hash table and remove
> > these expired. The delayed work is scheduled every 1 sec when idle,
> > which is also what conntrack uses. It is fine to scan the whole
> > table as we do not actually remove elements during this scan,
> > instead we simply queue them to the lockless list and defer all the
> > removals to the next schedule.
> >
> > 2. When the timeout map is actively accessed, we could reach expired
> > elements before the idle work automatically scans them, we can
> > simply skip them and schedule the delayed work immediately to do
> > the actual removals. We have to avoid taking locks on fast path.
> >
> > The timeout of an element can be set or updated via bpf_map_update_elem()
> > and we reuse the upper 32-bit of the 64-bit flag for the timeout value,
> > as there are only a few bits are used currently. Note, a zero timeout
> > means to expire immediately.
> >
> > To avoid adding memory overhead to regular map, we have to reuse some
> > field in struct htab_elem, that is, lru_node. Otherwise we would have
> > to rewrite a lot of code.
> >
> > For now, batch ops is not supported, we can add it later if needed.
>
> Back in earlier conversation [0], I mentioned also LRU map flavors and to look
> into adding a flag, so we wouldn't need new BPF_MAP_TYPE_TIMEOUT_HASH/*LRU 
> types
> that replicate existing types once again just with the timeout in addition, so
> UAPI wise new map type is not great.

Interestingly, Jamal also brought this flag up to me.

The reason why I don't use a flag is that I don't see any other maps need a
timeout. Take the LRU map you mentioned as an example, it evicts elements
based on size, not based on time, I can't think of a case where we need both
time and size based eviction. Another example is array, there is no way to
delete an element from an array, so we can't really expire an element.

Or do you have any use case for a non-regular hash map with timeout?

>
> Given you mention Cilium above, only for kernels where there is no LRU hash 
> map,
> that is < 4.10, we rely on plain hash, everything else LRU + prealloc to 
> mitigate
> ddos by refusing to add new entries when full whereas less active ones will be
> purged instead. Timeout /only/ for plain hash is less useful overall, did you

The difference between LRU and a timeout map is whether we should
drop the least recently used one too when it is full. For conntrack, this is not
a good idea, because when we have a burst of connections, the least recently
used might be just 1-s old, so evicting it out is not a good idea.
With a timeout
map, we just drop all new connection when it is full and wait for some connect
to timeout naturally, aligned with the definition of conntrack.

So it should be a good replacement for LRU map too in terms of conntrack.

> sketch a more generic approach in the meantime that would work for all the 
> htab/lru
> flavors (and ideally as non-delayed_work based)?

If you mean LRU maps may need timeout too, like I explained above, I can't think
of any such use cases.

>
>[0] 
> https://lore.kernel.org/bpf/20201214201118.148126-1-xiyou.wangc...@gmail.com/
>
> [...]
> > @@ -1012,6 +1081,8 @@ static int htab_map_update_elem(struct bpf_map *map, 
> > void *key, void *value,
> >   copy_map_value_locked(map,
> > l_old->key + round_up(key_size, 
> > 8),
> > value, false);
> > + if (timeout_map)
> > + l_old->expires = msecs_to_expire(timeout);
> >   return 0;
> >   }
> >   /* fall through, grab the bucket lock

Re: [PATCH] neighbour: Prevent a dead entry from updating gc_list

2021-01-25 Thread Cong Wang

On Mon, Jan 25, 2021 at 11:59 AM Chinmay Agarwal
 wrote:
>
> Following race condition was detected:
>  - neigh_flush_dev() is under execution and calls 
> neigh_mark_dead(n),
> marking the neighbour entry 'n' as dead.
>
>  - Executing: __netif_receive_skb() -> __netif_receive_skb_core()
> -> arp_rcv() -> arp_process().arp_process() calls __neigh_lookup() which takes
> a reference on neighbour entry 'n'.
>
>  - Moves further along neigh_flush_dev() and calls
> neigh_cleanup_and_release(n), but since reference count increased in t2,
> 'n' couldn't be destroyed.
>
>  - Moves further along, arp_process() and calls
> neigh_update()-> __neigh_update() -> neigh_update_gc_list(), which adds
> the neighbour entry back in gc_list(neigh_mark_dead(), removed it
> earlier in t0 from gc_list)
>
>  - arp_process() finally calls neigh_release(n), destroying
> the neighbour entry.
>
> This leads to 'n' still being part of gc_list, but the actual
> neighbour structure has been freed.
>
> The situation can be prevented from happening if we disallow a dead
> entry to have any possibility of updating gc_list. This is what the
> patch intends to achieve.
>
> Signed-off-by: Chinmay Agarwal 

Please add a Fixes tag for bug fixes, in this case it is probably:

Fixes: 9c29a2f55ec0 ("neighbor: Fix locking order for gc_list changes")

And, make sure you run checkpatch.pl before sending out. For your
patch, it will definitely complain about the missing spaces around the
assignment "new=old;".

Thanks.

Re: [PATCH net] team: protect features update by RCU to avoid deadlock

2021-01-25 Thread Cong Wang

On Sun, Jan 24, 2021 at 11:44 PM Ivan Vecera  wrote:
>
> Function __team_compute_features() is protected by team->lock
> mutex when it is called from team_compute_features() used when
> features of an underlying device is changed. This causes
> a deadlock when NETDEV_FEAT_CHANGE notifier for underlying device
> is fired due to change propagated from team driver (e.g. MTU
> change). It's because callbacks like team_change_mtu() or
> team_vlan_rx_{add,del}_vid() protect their port list traversal
> by team->lock mutex.
>
> Example (r8169 case where this driver disables TSO for certain MTU
> values):
> ...
> [ 6391.348202]  __mutex_lock.isra.6+0x2d0/0x4a0
> [ 6391.358602]  team_device_event+0x9d/0x160 [team]
> [ 6391.363756]  notifier_call_chain+0x47/0x70
> [ 6391.368329]  netdev_update_features+0x56/0x60
> [ 6391.373207]  rtl8169_change_mtu+0x14/0x50 [r8169]
> [ 6391.378457]  dev_set_mtu_ext+0xe1/0x1d0
> [ 6391.387022]  dev_set_mtu+0x52/0x90
> [ 6391.390820]  team_change_mtu+0x64/0xf0 [team]
> [ 6391.395683]  dev_set_mtu_ext+0xe1/0x1d0
> [ 6391.399963]  do_setlink+0x231/0xf50
> ...
>
> In fact team_compute_features() called from team_device_event()
> does not need to be protected by team->lock mutex and rcu_read_lock()
> is sufficient there for port list traversal.

Reviewed-by: Cong Wang 

In the future, please version your patch so that we can easily
find out which is the latest.

Thanks.

[Patch net] net: fix dev_ifsioc_locked() race condition

2021-01-23 Thread Cong Wang

From: Cong Wang 

dev_ifsioc_locked() is called with only RCU read lock, so when
there is a parallel writer changing the mac address, it could
get a partially updated mac address, as shown below:

Thread 1Thread 2
// eth_commit_mac_addr_change()
memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
// dev_ifsioc_locked()
memcpy(ifr->ifr_hwaddr.sa_data,
dev->dev_addr,...);

Close this race condition by guarding them with a RW semaphore,
like netdev_get_name(). The writers take RTNL anyway, so this
will not affect the slow path.

Fixes: 3710becf8a58 ("net: RCU locking for simple ioctl()")
Reported-by: "Gong, Sishuai" 
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c| 39 ---
 net/core/dev_ioctl.c  | 18 ++
 3 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 259be67644e3..7a871f2dcc03 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3918,6 +3918,7 @@ int dev_pre_changeaddr_notify(struct net_device *dev, 
const char *addr,
  struct netlink_ext_ack *extack);
 int dev_set_mac_address(struct net_device *dev, struct sockaddr *sa,
struct netlink_ext_ack *extack);
+int dev_get_mac_address(struct sockaddr *sa, struct net *net, char *dev_name);
 int dev_change_carrier(struct net_device *, bool new_carrier);
 int dev_get_phys_port_id(struct net_device *dev,
 struct netdev_phys_item_id *ppid);
diff --git a/net/core/dev.c b/net/core/dev.c
index a979b86dbacd..55c0db7704c7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8709,6 +8709,8 @@ int dev_pre_changeaddr_notify(struct net_device *dev, 
const char *addr,
 }
 EXPORT_SYMBOL(dev_pre_changeaddr_notify);
 
+static DECLARE_RWSEM(dev_addr_sem);
+
 /**
  * dev_set_mac_address - Change Media Access Control Address
  * @dev: device
@@ -8729,19 +8731,50 @@ int dev_set_mac_address(struct net_device *dev, struct 
sockaddr *sa,
return -EINVAL;
if (!netif_device_present(dev))
return -ENODEV;
+
+   down_write(&dev_addr_sem);
err = dev_pre_changeaddr_notify(dev, sa->sa_data, extack);
if (err)
-   return err;
+   goto out;
err = ops->ndo_set_mac_address(dev, sa);
if (err)
-   return err;
+   goto out;
dev->addr_assign_type = NET_ADDR_SET;
call_netdevice_notifiers(NETDEV_CHANGEADDR, dev);
add_device_randomness(dev->dev_addr, dev->addr_len);
-   return 0;
+out:
+   up_write(&dev_addr_sem);
+   return err;
 }
 EXPORT_SYMBOL(dev_set_mac_address);
 
+int dev_get_mac_address(struct sockaddr *sa, struct net *net, char *dev_name)
+{
+   size_t size = sizeof(sa->sa_data);
+   struct net_device *dev;
+   int ret = 0;
+
+   down_read(&dev_addr_sem);
+   rcu_read_lock();
+
+   dev = dev_get_by_name_rcu(net, dev_name);
+   if (!dev) {
+   ret = -ENODEV;
+   goto unlock;
+   }
+   if (!dev->addr_len)
+   memset(sa->sa_data, 0, size);
+   else
+   memcpy(sa->sa_data, dev->dev_addr,
+  min_t(size_t, size, dev->addr_len));
+   sa->sa_family = dev->type;
+
+unlock:
+   rcu_read_unlock();
+   up_read(&dev_addr_sem);
+   return ret;
+}
+
 /**
  * dev_change_carrier - Change device carrier
  * @dev: device
diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index db8a0ff86f36..bfa0dbd3d36f 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -123,17 +123,6 @@ static int dev_ifsioc_locked(struct net *net, struct ifreq 
*ifr, unsigned int cm
ifr->ifr_mtu = dev->mtu;
return 0;
 
-   case SIOCGIFHWADDR:
-   if (!dev->addr_len)
-   memset(ifr->ifr_hwaddr.sa_data, 0,
-  sizeof(ifr->ifr_hwaddr.sa_data));
-   else
-   memcpy(ifr->ifr_hwaddr.sa_data, dev->dev_addr,
-  min(sizeof(ifr->ifr_hwaddr.sa_data),
-  (size_t)dev->addr_len));
-   ifr->ifr_hwaddr.sa_family = dev->type;
-   return 0;
-
case SIOCGIFSLAVE:
err = -EINVAL;
break;
@@ -418,6 +407,12 @@ int dev_ioctl(struct net *net, unsigned int cmd, struct 
ifreq *ifr, bool *need_c
 */
 
switch (cmd) {
+   case SIOCGIFHWADDR:
+   dev_load(net, ifr->ifr_name);
+   ret = dev_get_mac_address(&ifr->ifr_hwaddr, net, ifr->ifr

Re: BPF: unbounded bpf_map_free_deferred problem

2021-01-22 Thread Cong Wang

On Fri, Jan 22, 2021 at 4:42 PM Tetsuo Handa
 wrote:
>
> Hello, BPF developers.
>
> Alexey Kardashevskiy is reporting that system_wq gets stuck due to flooding of
> unbounded bpf_map_free_deferred work. Use of WQ_MEM_RECLAIM | WQ_HIGHPRI | 
> WQ_UNBOUND
> workqueue did not solve this problem. Is it possible that a refcount leak 
> somewhere
> preventing bpf_map_free_deferred from completing? Please see
> https://lkml.kernel.org/r/CACT4Y+Z+kwPM=WUzJ-e359PWeLLqmF0w4Yxp1spzZ=+j0ek...@mail.gmail.com
>  .
>

Which map does the reproducer create? And where exactly do
those work block on?

Different map->ops->map_free() waits for different reasons,
for example, htab_map_free() waits for flying htab_elem_free_rcu().
I can't immediately see how they could wait for each other, if this
is what you meant above.

Thanks.

[Patch bpf-next v5 3/3] selftests/bpf: add timeout map check in map_ptr tests

2021-01-22 Thread Cong Wang

From: Cong Wang 

Similar to regular hashmap test.

Acked-by: Andrey Ignatov 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 .../selftests/bpf/progs/map_ptr_kern.c| 20 +++
 1 file changed, 20 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/map_ptr_kern.c 
b/tools/testing/selftests/bpf/progs/map_ptr_kern.c
index d8850bc6a9f1..424a9e76c93f 100644
--- a/tools/testing/selftests/bpf/progs/map_ptr_kern.c
+++ b/tools/testing/selftests/bpf/progs/map_ptr_kern.c
@@ -648,6 +648,25 @@ static inline int check_ringbuf(void)
return 1;
 }
 
+struct {
+   __uint(type, BPF_MAP_TYPE_TIMEOUT_HASH);
+   __uint(max_entries, MAX_ENTRIES);
+   __type(key, __u32);
+   __type(value, __u32);
+} m_timeout SEC(".maps");
+
+static inline int check_timeout_hash(void)
+{
+   struct bpf_htab *timeout_hash = (struct bpf_htab *)&m_timeout;
+   struct bpf_map *map = (struct bpf_map *)&m_timeout;
+
+   VERIFY(check_default(&timeout_hash->map, map));
+   VERIFY(timeout_hash->n_buckets == MAX_ENTRIES);
+   VERIFY(timeout_hash->elem_size == 64);
+
+   return 1;
+}
+
 SEC("cgroup_skb/egress")
 int cg_skb(void *ctx)
 {
@@ -679,6 +698,7 @@ int cg_skb(void *ctx)
VERIFY_TYPE(BPF_MAP_TYPE_SK_STORAGE, check_sk_storage);
VERIFY_TYPE(BPF_MAP_TYPE_DEVMAP_HASH, check_devmap_hash);
VERIFY_TYPE(BPF_MAP_TYPE_RINGBUF, check_ringbuf);
+   VERIFY_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, check_timeout_hash);
 
return 1;
 }
-- 
2.25.1

[Patch bpf-next v5 2/3] selftests/bpf: add test cases for bpf timeout map

2021-01-22 Thread Cong Wang

From: Cong Wang 

Add some test cases in test_maps.c for timeout map, which focus
on testing timeout. The parallel tests make sure to catch race
conditions and the large map test stresses GC.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 tools/testing/selftests/bpf/test_maps.c | 68 +
 1 file changed, 68 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 51adc42b2b40..58b4712a0f98 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -639,6 +639,52 @@ static void test_stackmap(unsigned int task, void *data)
close(fd);
 }
 
+static void test_timeout_map(unsigned int task, void *data)
+{
+   int val1 = 1, val2 = 2, val3 = 3;
+   int key1 = 1, key2 = 2, key3 = 3;
+   int fd;
+
+   fd = bpf_create_map(BPF_MAP_TYPE_TIMEOUT_HASH, sizeof(int), sizeof(int),
+   3, map_flags);
+   if (fd < 0) {
+   printf("Failed to create timeout map '%s'!\n", strerror(errno));
+   exit(1);
+   }
+
+   /* Timeout after 1 secs */
+   assert(bpf_map_update_elem(fd, &key1, &val1, (u64)1000<<32) == 0);
+   /* Timeout after 2 secs */
+   assert(bpf_map_update_elem(fd, &key2, &val2, (u64)2000<<32) == 0);
+   /* Timeout after 10 secs */
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)1<<32) == 0);
+
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+   val2 = 0;
+   assert(bpf_map_lookup_elem(fd, &key2, &val2) == 0 && val2 == 2);
+
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+   assert(bpf_map_lookup_elem(fd, &key2, &val2) != 0);
+
+   /* Modify timeout to expire it earlier */
+   val3 = 0;
+   assert(bpf_map_lookup_elem(fd, &key3, &val3) == 0 && val3 == 3);
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)1000<<32) == 0);
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key3, &val3) != 0);
+
+   /* Add one elem expired immediately and try to delete this expired */
+   assert(bpf_map_update_elem(fd, &key3, &val3, 0) == 0);
+   assert(bpf_map_delete_elem(fd, &key3) == -1 && errno == ENOENT);
+
+   /* Add one elem and let the map removal clean up */
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)1<<32) == 0);
+
+   close(fd);
+}
+
 #include 
 #include 
 #include 
@@ -1305,6 +1351,7 @@ static void test_map_stress(void)
 
run_parallel(100, test_arraymap, NULL);
run_parallel(100, test_arraymap_percpu, NULL);
+   run_parallel(100, test_timeout_map, NULL);
 }
 
 #define TASKS 1024
@@ -1759,6 +1806,25 @@ static void test_reuseport_array(void)
close(map_fd);
 }
 
+static void test_large_timeout_map(int nr_elems)
+{
+   int val, key;
+   int i, fd;
+
+   fd = bpf_create_map(BPF_MAP_TYPE_TIMEOUT_HASH, sizeof(int), sizeof(int),
+   nr_elems, map_flags);
+   if (fd < 0) {
+   printf("Failed to create a large timeout map '%s'!\n", 
strerror(errno));
+   exit(1);
+   }
+   for (i = 0; i < nr_elems; i++) {
+   key = val = i;
+   /* Timeout after 10 secs */
+   assert(bpf_map_update_elem(fd, &key, &val, (u64)1<<32) == 
0);
+   }
+   close(fd);
+}
+
 static void run_all_tests(void)
 {
test_hashmap(0, NULL);
@@ -1788,6 +1854,8 @@ static void run_all_tests(void)
test_stackmap(0, NULL);
 
test_map_in_map();
+
+   test_large_timeout_map(100);
 }
 
 #define DEFINE_TEST(name) extern void test_##name(void);
-- 
2.25.1

[Patch bpf-next v5 0/3] bpf: introduce timeout hash map

2021-01-22 Thread Cong Wang

From: Cong Wang 

This patchset introduces a new eBPF hash map whose elements have
timeouts. Patch 1 is the implementation of timeout map, patch 2 adds
some test cases for timeout map in test_maps, and patch 3 adds a test
case in map ptr test. This patchset has been tested with the provided
test cases for hours.

Please check each patch description for more details.

---
v5: add a lost piece of patch during rebase
fix the extra_elems corner case
add a stress test case

v4: merge gc_work into gc_idle_work to avoid a nasty race condition
fix a potential use-after-free
add one more test case
improve comments and update changelog

v3: move gc list from bucket to elem
reuse lru_node in struct htab_elem
drop patches which are no longer necessary
fix delete path
add a test case for delete path
add parallel test cases
change timeout to ms
drop batch ops

v2: fix hashmap ptr test
add a test case in map ptr test
factor out htab_timeout_map_alloc()

Cong Wang (3):
  bpf: introduce timeout hash map
  selftests/bpf: add test cases for bpf timeout map
  selftests/bpf: add timeout map check in map_ptr tests

 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |   5 +-
 kernel/bpf/hashtab.c  | 274 +-
 kernel/bpf/syscall.c  |   3 +-
 tools/include/uapi/linux/bpf.h|   1 +
 .../selftests/bpf/progs/map_ptr_kern.c|  20 ++
 tools/testing/selftests/bpf/test_maps.c   |  68 +
 7 files changed, 357 insertions(+), 15 deletions(-)

-- 
2.25.1

[Patch bpf-next v5 1/3] bpf: introduce timeout hash map

2021-01-22 Thread Cong Wang

From: Cong Wang 

This borrows the idea from conntrack and will be used for conntrack in
ebpf too. Each element in a timeout map has a user-specified timeout
in msecs, after it expires it will be automatically removed from the
map. Cilium already does the same thing, it uses a regular map or LRU
map to track connections and has its own GC in user-space. This does
not scale well when we have millions of connections, as each removal
needs a syscall. Even if we could batch the operations, it still needs
to copy a lot of data between kernel and user space.

There are two cases to consider here:

1. When the timeout map is idle, i.e. no one updates or accesses it,
   we rely on the delayed work to scan the whole hash table and remove
   these expired. The delayed work is scheduled every 1 sec when idle,
   which is also what conntrack uses. It is fine to scan the whole
   table as we do not actually remove elements during this scan,
   instead we simply queue them to the lockless list and defer all the
   removals to the next schedule.

2. When the timeout map is actively accessed, we could reach expired
   elements before the idle work automatically scans them, we can
   simply skip them and schedule the delayed work immediately to do
   the actual removals. We have to avoid taking locks on fast path.

The timeout of an element can be set or updated via bpf_map_update_elem()
and we reuse the upper 32-bit of the 64-bit flag for the timeout value,
as there are only a few bits are used currently. Note, a zero timeout
means to expire immediately.

To avoid adding memory overhead to regular map, we have to reuse some
field in struct htab_elem, that is, lru_node. Otherwise we would have
to rewrite a lot of code.

For now, batch ops is not supported, we can add it later if needed.

Cc: Andrii Nakryiko 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 include/linux/bpf_types.h  |   1 +
 include/uapi/linux/bpf.h   |   5 +-
 kernel/bpf/hashtab.c   | 274 +++--
 kernel/bpf/syscall.c   |   3 +-
 tools/include/uapi/linux/bpf.h |   1 +
 5 files changed, 269 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 99f7fd657d87..00a3b17b6af2 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, htab_timeout_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c001766adcbc..9c9d8c194b39 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -164,6 +164,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_RINGBUF,
BPF_MAP_TYPE_INODE_STORAGE,
BPF_MAP_TYPE_TASK_STORAGE,
+   BPF_MAP_TYPE_TIMEOUT_HASH,
 };
 
 /* Note that tracing related programs such as
@@ -399,7 +400,9 @@ enum bpf_link_type {
  */
 #define BPF_PSEUDO_CALL1
 
-/* flags for BPF_MAP_UPDATE_ELEM command */
+/* flags for BPF_MAP_UPDATE_ELEM command, upper 32 bits are timeout for
+ * BPF_MAP_TYPE_TIMEOUT_HASH (in milliseconds).
+ */
 enum {
BPF_ANY = 0, /* create new element or update existing */
BPF_NOEXIST = 1, /* create new element if it didn't exist */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index c1ac7f964bc9..90a899d1706f 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -8,6 +8,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include "percpu_freelist.h"
@@ -104,6 +106,9 @@ struct bpf_htab {
u32 hashrnd;
struct lock_class_key lockdep_key;
int __percpu *map_locked[HASHTAB_MAP_LOCK_COUNT];
+   struct llist_head gc_list;
+   struct delayed_work gc_work;
+   u64 next_gc; /* in jiffies64 */
 };
 
 /* each htab element is struct htab_elem + key + value */
@@ -122,6 +127,11 @@ struct htab_elem {
union {
struct rcu_head rcu;
struct bpf_lru_node lru_node;
+   struct {
+   u64 expires; /* in jiffies64 */
+   struct llist_node gc_node;
+   atomic_t pending;
+   };
};
u32 hash;
char key[] __aligned(8);
@@ -429,8 +439,34 @@ static int htab_map_alloc_check(union bpf_attr *attr)
return 0;
 }
 
+static bool htab_elem_expired(struct htab_elem *e)
+{
+   return time_after_eq64(get_jiffies_64(), e->expires);
+}
+
+/* Schedule GC to remove an expired element, unless it is already pending. */
+static void htab_gc_elem(struct bpf_htab *htab, struct htab_elem *e)
+{
+   if (atomic_xchg(&e->pending, 1

Re: [PATCH net] team: postpone features update to avoid deadlock

2021-01-20 Thread Cong Wang

On Wed, Jan 20, 2021 at 4:56 AM Ivan Vecera  wrote:
>
> To fix the problem __team_compute_features() needs to be postponed
> for these cases.

Is there any user-visible effect after deferring this feature change?

Thanks.

Re: [PATCH v2 net-next ] net/sched: cls_flower add CT_FLAGS_INVALID flag support

2021-01-20 Thread Cong Wang

On Tue, Jan 19, 2021 at 12:33 AM  wrote:
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index 2d70ded..c565c7a 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -237,9 +237,8 @@ void skb_flow_dissect_meta(const struct sk_buff *skb,
>  void
>  skb_flow_dissect_ct(const struct sk_buff *skb,
> struct flow_dissector *flow_dissector,
> -   void *target_container,
> -   u16 *ctinfo_map,
> -   size_t mapsize)
> +   void *target_container, u16 *ctinfo_map,
> +   size_t mapsize, bool post_ct)

Why do you pass this boolean as a parameter when you
can just read it from qdisc_skb_cb(skb)?

Thanks.

Re: [PATCH v2 net-next ] net/sched: cls_flower add CT_FLAGS_INVALID flag support

2021-01-20 Thread Cong Wang

On Wed, Jan 20, 2021 at 3:40 PM Marcelo Ricardo Leitner
 wrote:
>
> On Wed, Jan 20, 2021 at 02:18:41PM -0800, Cong Wang wrote:
> > On Tue, Jan 19, 2021 at 12:33 AM  wrote:
> > > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> > > index 2d70ded..c565c7a 100644
> > > --- a/net/core/flow_dissector.c
> > > +++ b/net/core/flow_dissector.c
> > > @@ -237,9 +237,8 @@ void skb_flow_dissect_meta(const struct sk_buff *skb,
> > >  void
> > >  skb_flow_dissect_ct(const struct sk_buff *skb,
> > > struct flow_dissector *flow_dissector,
> > > -   void *target_container,
> > > -   u16 *ctinfo_map,
> > > -   size_t mapsize)
> > > +   void *target_container, u16 *ctinfo_map,
> > > +   size_t mapsize, bool post_ct)
> >
> > Why do you pass this boolean as a parameter when you
> > can just read it from qdisc_skb_cb(skb)?
>
> In this case, yes, but this way skb_flow_dissect_ct() can/is able to
> not care about what the ->cb actually is. It could be called from
> somewhere else too.

This sounds reasonable, it is in net/core/ directory anyway,
so should be independent of tc even though cls_flower is its
only caller.

Thanks.

Re: [Patch bpf-next v4 1/3] bpf: introduce timeout hash map

2021-01-18 Thread Cong Wang

On Sat, Jan 16, 2021 at 8:22 PM Cong Wang  wrote:
> +static void htab_gc(struct work_struct *work)
> +{
> +   struct htab_elem *e, *tmp;
> +   struct llist_node *lhead;
> +   struct bpf_htab *htab;
> +   int i, count;
> +
> +   htab = container_of(work, struct bpf_htab, gc_work.work);
> +   lhead = llist_del_all(&htab->gc_list);
> +
> +   llist_for_each_entry_safe(e, tmp, lhead, gc_node) {
> +   unsigned long flags;
> +   struct bucket *b;
> +   u32 hash;
> +
> +   hash = e->hash;
> +   b = __select_bucket(htab, hash);
> +   if (htab_lock_bucket(htab, b, hash, &flags))
> +   continue;
> +   hlist_nulls_del_rcu(&e->hash_node);
> +   atomic_set(&e->pending, 0);
> +   free_htab_elem(htab, e);
> +   htab_unlock_bucket(htab, b, hash, flags);
> +
> +   cond_resched();
> +   }
> +
> +   for (count = 0, i = 0; i < htab->n_buckets; i++) {

I just realized a followup fix is not folded into this patch, I
actually added a timestamp check here to avoid scanning the whole
table more frequently than once per second. It is clearly my mistake
to miss it when formatting this patchset.

I will send v5 after waiting for other feedback.

Thanks!

[Patch bpf-next v4 2/3] selftests/bpf: add test cases for bpf timeout map

2021-01-16 Thread Cong Wang

From: Cong Wang 

Add some test cases in test_maps.c for timeout map, which focus
on testing timeout.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 tools/testing/selftests/bpf/test_maps.c | 47 +
 1 file changed, 47 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 51adc42b2b40..5e443d76512b 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -639,6 +639,52 @@ static void test_stackmap(unsigned int task, void *data)
close(fd);
 }
 
+static void test_timeout_map(unsigned int task, void *data)
+{
+   int val1 = 1, val2 = 2, val3 = 3;
+   int key1 = 1, key2 = 2, key3 = 3;
+   int fd;
+
+   fd = bpf_create_map(BPF_MAP_TYPE_TIMEOUT_HASH, sizeof(int), sizeof(int),
+   3, map_flags);
+   if (fd < 0) {
+   printf("Failed to create timeout map '%s'!\n", strerror(errno));
+   exit(1);
+   }
+
+   /* Timeout after 1 secs */
+   assert(bpf_map_update_elem(fd, &key1, &val1, (u64)1000<<32) == 0);
+   /* Timeout after 2 secs */
+   assert(bpf_map_update_elem(fd, &key2, &val2, (u64)2000<<32) == 0);
+   /* Timeout after 10 secs */
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)1<<32) == 0);
+
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+   val2 = 0;
+   assert(bpf_map_lookup_elem(fd, &key2, &val2) == 0 && val2 == 2);
+
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+   assert(bpf_map_lookup_elem(fd, &key2, &val2) != 0);
+
+   /* Modify timeout to expire it earlier */
+   val3 = 0;
+   assert(bpf_map_lookup_elem(fd, &key3, &val3) == 0 && val3 == 3);
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)1000<<32) == 0);
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key3, &val3) != 0);
+
+   /* Add one elem expired immediately and try to delete this expired */
+   assert(bpf_map_update_elem(fd, &key3, &val3, 0) == 0);
+   assert(bpf_map_delete_elem(fd, &key3) == -1 && errno == ENOENT);
+
+   /* Add one elem and let the map removal clean up */
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)1<<32) == 0);
+
+   close(fd);
+}
+
 #include 
 #include 
 #include 
@@ -1305,6 +1351,7 @@ static void test_map_stress(void)
 
run_parallel(100, test_arraymap, NULL);
run_parallel(100, test_arraymap_percpu, NULL);
+   run_parallel(100, test_timeout_map, NULL);
 }
 
 #define TASKS 1024
-- 
2.25.1

[Patch bpf-next v4 3/3] selftests/bpf: add timeout map check in map_ptr tests

2021-01-16 Thread Cong Wang

From: Cong Wang 

Similar to regular hashmap test.

Acked-by: Andrey Ignatov 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 .../selftests/bpf/progs/map_ptr_kern.c| 20 +++
 1 file changed, 20 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/map_ptr_kern.c 
b/tools/testing/selftests/bpf/progs/map_ptr_kern.c
index d8850bc6a9f1..424a9e76c93f 100644
--- a/tools/testing/selftests/bpf/progs/map_ptr_kern.c
+++ b/tools/testing/selftests/bpf/progs/map_ptr_kern.c
@@ -648,6 +648,25 @@ static inline int check_ringbuf(void)
return 1;
 }
 
+struct {
+   __uint(type, BPF_MAP_TYPE_TIMEOUT_HASH);
+   __uint(max_entries, MAX_ENTRIES);
+   __type(key, __u32);
+   __type(value, __u32);
+} m_timeout SEC(".maps");
+
+static inline int check_timeout_hash(void)
+{
+   struct bpf_htab *timeout_hash = (struct bpf_htab *)&m_timeout;
+   struct bpf_map *map = (struct bpf_map *)&m_timeout;
+
+   VERIFY(check_default(&timeout_hash->map, map));
+   VERIFY(timeout_hash->n_buckets == MAX_ENTRIES);
+   VERIFY(timeout_hash->elem_size == 64);
+
+   return 1;
+}
+
 SEC("cgroup_skb/egress")
 int cg_skb(void *ctx)
 {
@@ -679,6 +698,7 @@ int cg_skb(void *ctx)
VERIFY_TYPE(BPF_MAP_TYPE_SK_STORAGE, check_sk_storage);
VERIFY_TYPE(BPF_MAP_TYPE_DEVMAP_HASH, check_devmap_hash);
VERIFY_TYPE(BPF_MAP_TYPE_RINGBUF, check_ringbuf);
+   VERIFY_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, check_timeout_hash);
 
return 1;
 }
-- 
2.25.1

[Patch bpf-next v4 1/3] bpf: introduce timeout hash map

2021-01-16 Thread Cong Wang

From: Cong Wang 

This borrows the idea from conntrack and will be used for conntrack in
ebpf too. Each element in a timeout map has a user-specified timeout
in msecs, after it expires it will be automatically removed from the
map. Cilium already does the same thing, it uses a regular map or LRU
map to track connections and has its own GC in user-space. This does
not scale well when we have millions of connections, as each removal
needs a syscall. Even if we could batch the operations, it still needs
to copy a lot of data between kernel and user space.

There are two cases to consider here:

1. When the timeout map is idle, i.e. no one updates or accesses it,
   we rely on the delayed work to scan the whole hash table and remove
   these expired. The delayed work is scheduled every 1 sec when idle,
   which is also what conntrack uses. It is fine to scan the whole
   table as we do not actually remove elements during this scan,
   instead we simply queue them to the lockless list and defer all the
   removals to the next schedule.

2. When the timeout map is actively accessed, we could reach expired
   elements before the idle work automatically scans them, we can
   simply skip them and schedule the delayed work immediately to do
   the actual removals. We have to avoid taking locks on fast path.

The timeout of an element can be set or updated via bpf_map_update_elem()
and we reuse the upper 32-bit of the 64-bit flag for the timeout value,
as there are only a few bits are used currently. Note, a zero timeout
means to expire immediately.

To avoid adding memory overhead to regular map, we have to reuse some
field in struct htab_elem, that is, lru_node. Otherwise we would have
to rewrite a lot of code.

For now, batch ops is not supported, we can add it later if needed.

Cc: Andrii Nakryiko 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 include/linux/bpf_types.h  |   1 +
 include/uapi/linux/bpf.h   |   5 +-
 kernel/bpf/hashtab.c   | 239 -
 kernel/bpf/syscall.c   |   3 +-
 tools/include/uapi/linux/bpf.h |   1 +
 5 files changed, 240 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 99f7fd657d87..00a3b17b6af2 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, htab_timeout_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c001766adcbc..9c9d8c194b39 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -164,6 +164,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_RINGBUF,
BPF_MAP_TYPE_INODE_STORAGE,
BPF_MAP_TYPE_TASK_STORAGE,
+   BPF_MAP_TYPE_TIMEOUT_HASH,
 };
 
 /* Note that tracing related programs such as
@@ -399,7 +400,9 @@ enum bpf_link_type {
  */
 #define BPF_PSEUDO_CALL1
 
-/* flags for BPF_MAP_UPDATE_ELEM command */
+/* flags for BPF_MAP_UPDATE_ELEM command, upper 32 bits are timeout for
+ * BPF_MAP_TYPE_TIMEOUT_HASH (in milliseconds).
+ */
 enum {
BPF_ANY = 0, /* create new element or update existing */
BPF_NOEXIST = 1, /* create new element if it didn't exist */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index c1ac7f964bc9..1347d782eb1d 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -8,6 +8,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include "percpu_freelist.h"
@@ -104,6 +106,8 @@ struct bpf_htab {
u32 hashrnd;
struct lock_class_key lockdep_key;
int __percpu *map_locked[HASHTAB_MAP_LOCK_COUNT];
+   struct llist_head gc_list;
+   struct delayed_work gc_work;
 };
 
 /* each htab element is struct htab_elem + key + value */
@@ -122,6 +126,11 @@ struct htab_elem {
union {
struct rcu_head rcu;
struct bpf_lru_node lru_node;
+   struct {
+   u64 expires; /* in jiffies64 */
+   struct llist_node gc_node;
+   atomic_t pending;
+   };
};
u32 hash;
char key[] __aligned(8);
@@ -429,6 +438,31 @@ static int htab_map_alloc_check(union bpf_attr *attr)
return 0;
 }
 
+static bool htab_elem_expired(struct htab_elem *e)
+{
+   return time_after_eq64(get_jiffies_64(), e->expires);
+}
+
+/* Schedule GC to remove an expired element, unless it is already pending. */
+static void htab_gc_elem(struct bpf_htab *htab, struct htab_elem *e)
+{
+   if (atomic_xchg(&e->pending, 1))
+   return;
+   lli

[Patch bpf-next v4 0/3] bpf: introduce timeout hash map

2021-01-16 Thread Cong Wang

From: Cong Wang 

This patchset introduces a new eBPF hash map whose elements have
timeouts. Patch 1 is the implementation of timeout map, patch 2 adds
some test cases for timeout map in test_maps, and patch 3 adds a test
case in map ptr test.

Please check each patch description for more details.

---
v4: merge gc_work into gc_idle_work to avoid a nasty race condition
fix a potential use-after-free
add one more test case
improve comments and update changelog

v3: move gc list from bucket to elem
reuse lru_node in struct htab_elem
drop patches which are no longer necessary
fix delete path
add a test case for delete path
add parallel test cases
change timeout to ms
drop batch ops

v2: fix hashmap ptr test
add a test case in map ptr test
factor out htab_timeout_map_alloc()

Cong Wang (3):
  bpf: introduce timeout hash map
  selftests/bpf: add test cases for bpf timeout map
  selftests/bpf: add timeout map check in map_ptr tests

 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |   5 +-
 kernel/bpf/hashtab.c  | 239 +-
 kernel/bpf/syscall.c  |   3 +-
 tools/include/uapi/linux/bpf.h|   1 +
 .../selftests/bpf/progs/map_ptr_kern.c|  20 ++
 tools/testing/selftests/bpf/test_maps.c   |  47 
 7 files changed, 307 insertions(+), 9 deletions(-)

-- 
2.25.1

[Patch net-next] net_sched: fix RTNL deadlock again caused by request_module()

2021-01-16 Thread Cong Wang

From: Cong Wang 

tcf_action_init_1() loads tc action modules automatically with
request_module() after parsing the tc action names, and it drops RTNL
lock and re-holds it before and after request_module(). This causes a
lot of troubles, as discovered by syzbot, because we can be in the
middle of batch initializations when we create an array of tc actions.

One of the problem is deadlock:

CPU 0   CPU 1
rtnl_lock();
for (...) {
  tcf_action_init_1();
-> rtnl_unlock();
-> request_module();
rtnl_lock();
for (...) {
  tcf_action_init_1();
-> tcf_idr_check_alloc();
   // Insert one action into idr,
   // but it is not committed until
   // tcf_idr_insert_many(), then drop
   // the RTNL lock in the _next_
   // iteration
   -> rtnl_unlock();
-> rtnl_lock();
-> a_o->init();
  -> tcf_idr_check_alloc();
  // Now waiting for the same index
  // to be committed
-> request_module();
-> rtnl_lock()
// Now waiting for RTNL lock
}
rtnl_unlock();
}
rtnl_unlock();

This is not easy to solve, we can move the request_module() before
this loop and pre-load all the modules we need for this netlink
message and then do the rest initializations. So the loop breaks down
to two now:

for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
struct tc_action_ops *a_o;

a_o = tc_action_load_ops(name, tb[i]...);
ops[i - 1] = a_o;
}

for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
act = tcf_action_init_1(ops[i - 1]...);
}

Although this looks serious, it only has been reported by syzbot, so it
seems hard to trigger this by humans. And given the size of this patch,
I'd suggest to make it to net-next and not to backport to stable.

This patch has been tested by syzbot and tested with tdc.py by me.

Fixes: 0fedc63fadf ("net_sched: commit action insertions together")
Reported-and-tested-by: syzbot+82752bc5331601cf4...@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+b3b63b6bff456bd95...@syzkaller.appspotmail.com
Reported-by: syzbot+ba67b12b1ca729912...@syzkaller.appspotmail.com
Cc: Jamal Hadi Salim 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 include/net/act_api.h |   5 +-
 net/sched/act_api.c   | 104 +++---
 net/sched/cls_api.c   |  11 -
 3 files changed, 79 insertions(+), 41 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 55dab604861f..761c0e331915 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -186,10 +186,13 @@ int tcf_action_init(struct net *net, struct tcf_proto 
*tp, struct nlattr *nla,
struct nlattr *est, char *name, int ovr, int bind,
struct tc_action *actions[], size_t *attr_size,
bool rtnl_held, struct netlink_ext_ack *extack);
+struct tc_action_ops *tc_action_load_ops(char *name, struct nlattr *nla,
+bool rtnl_held,
+struct netlink_ext_ack *extack);
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
struct nlattr *nla, struct nlattr *est,
char *name, int ovr, int bind,
-   bool rtnl_held,
+   struct tc_action_ops *ops, bool rtnl_held,
struct netlink_ext_ack *extack);
 int tcf_action_dump(struct sk_buff *skb, struct tc_action *actions[], int bind,
int ref, bool terse);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 2e85b636b27b..4dd235ce9a07 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -928,19 +928,13 @@ static void tcf_idr_insert_many(struct tc_action 
*actions[])
}
 }
 
-struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
-   struct nlattr *nla, struct nlattr *est,
-   char *name, int ovr, int bind,
-   bool rtnl_held,
-   struct netlink_ext_ack *extack)
+struct tc_action_ops *tc_action_load_ops(char *name, struct nlattr *nla,
+bool rtnl_held,
+struct netlink_ext_ack *extack)
 {
-   struct nla_bitf

[Patch net v4] cls_flower: call nla_ok() before nla_next()

2021-01-15 Thread Cong Wang

From: Cong Wang 

fl_set_enc_opt() simply checks if there are still bytes left to parse,
but this is not sufficent as syzbot seems to be able to generate
malformatted netlink messages. nla_ok() is more strict so should be
used to validate the next nlattr here.

And nla_validate_nested_deprecated() has less strict check too, it is
probably too late to switch to the strict version, but we can just
call nla_ok() too after it.

Reported-and-tested-by: syzbot+2624e3778b18fc497...@syzkaller.appspotmail.com
Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
Cc: Jamal Hadi Salim 
Cc: Xin Long 
Cc: Jiri Pirko 
Cc: Jakub Kicinski 
Signed-off-by: Cong Wang 
---
 net/sched/cls_flower.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 1319986693fc..84f932532db7 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1272,6 +1272,10 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
 
nla_opt_msk = nla_data(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
msk_depth = nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
+   if (!nla_ok(nla_opt_msk, msk_depth)) {
+   NL_SET_ERR_MSG(extack, "Invalid nested attribute for 
masks");
+   return -EINVAL;
+   }
}
 
nla_for_each_attr(nla_opt_key, nla_enc_key,
@@ -1307,9 +1311,6 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
NL_SET_ERR_MSG(extack, "Key and mask miss 
aligned");
return -EINVAL;
}
-
-   if (msk_depth)
-   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
case TCA_FLOWER_KEY_ENC_OPTS_VXLAN:
if (key->enc_opts.dst_opt_type) {
@@ -1340,9 +1341,6 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
NL_SET_ERR_MSG(extack, "Key and mask miss 
aligned");
return -EINVAL;
}
-
-   if (msk_depth)
-   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
case TCA_FLOWER_KEY_ENC_OPTS_ERSPAN:
if (key->enc_opts.dst_opt_type) {
@@ -1373,14 +1371,20 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
NL_SET_ERR_MSG(extack, "Key and mask miss 
aligned");
return -EINVAL;
}
-
-   if (msk_depth)
-   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
default:
NL_SET_ERR_MSG(extack, "Unknown tunnel option type");
return -EINVAL;
}
+
+   if (!msk_depth)
+   continue;
+
+   if (!nla_ok(nla_opt_msk, msk_depth)) {
+   NL_SET_ERR_MSG(extack, "A mask attribute is invalid");
+   return -EINVAL;
+   }
+   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
}
 
return 0;
-- 
2.25.1

Re: [Patch net v3] cls_flower: call nla_ok() before nla_next()

2021-01-14 Thread Cong Wang

On Thu, Jan 14, 2021 at 2:30 PM Jakub Kicinski  wrote:
>
> On Thu, 14 Jan 2021 13:57:13 -0800 Cong Wang wrote:
> > On Thu, Jan 14, 2021 at 1:36 PM Jakub Kicinski  wrote:
> > >
> > > On Thu, 14 Jan 2021 13:07:49 -0800 Cong Wang wrote:
> > > > - if (msk_depth)
> > > > - nla_opt_msk = nla_next(nla_opt_msk, 
> > > > &msk_depth);
> > > >   break;
> > > >   default:
> > > >   NL_SET_ERR_MSG(extack, "Unknown tunnel option 
> > > > type");
> > > >   return -EINVAL;
> > > >   }
> > > > +
> > > > + if (!nla_opt_msk)
> > > > + continue;
> > >
> > > Why the switch from !msk_depth to !nla_opt_msk?
> >
> > It is the same, when nla_opt_msk is NULL, msk_depth is 0.
> > Checking nla_opt_msk is NULL is more readable to express that
> > mask is not provided.
> >
> > >
> > > Seems like previously providing masks for only subset of options
> > > would have worked.
> >
> > I don't think so, every type has this check:
> >
> > if (key->enc_opts.len != mask->enc_opts.len) {
> > NL_SET_ERR_MSG(extack, "Key and mask
> > miss aligned");
> > return -EINVAL;
> > }
> >
> > which guarantees the numbers are aligned.
> >
> > Thanks.
>
> static int fl_set_vxlan_opt(const struct nlattr *nla, struct fl_flow_key *key,
> int depth, int option_len,
> struct netlink_ext_ack *extack)
> {
> struct nlattr *tb[TCA_FLOWER_KEY_ENC_OPT_VXLAN_MAX + 1];
> struct vxlan_metadata *md;
> int err;
>
> md = (struct vxlan_metadata *)&key->enc_opts.data[key->enc_opts.len];
> memset(md, 0xff, sizeof(*md));
>
> if (!depth)
> return sizeof(*md);
> ^^^
>
> The mask is filled with all 1s if attribute is not provided.

Hmm, then what is the length comparison check for?

fl_set_vxlan_opt() either turns negative or sizeof(*md), and negitve
is already checked when it returns, so when we hit the length comparison
it is always equal. So it must be redundant.

(Note, I am only talking about the vxlan case you pick here, because the
geneve case is different, as it returns different sizes.)

Thanks.

Re: [Patch net v3] cls_flower: call nla_ok() before nla_next()

2021-01-14 Thread Cong Wang

On Thu, Jan 14, 2021 at 1:36 PM Jakub Kicinski  wrote:
>
> On Thu, 14 Jan 2021 13:07:49 -0800 Cong Wang wrote:
> > - if (msk_depth)
> > - nla_opt_msk = nla_next(nla_opt_msk, 
> > &msk_depth);
> >   break;
> >   default:
> >   NL_SET_ERR_MSG(extack, "Unknown tunnel option type");
> >   return -EINVAL;
> >   }
> > +
> > + if (!nla_opt_msk)
> > + continue;
>
> Why the switch from !msk_depth to !nla_opt_msk?

It is the same, when nla_opt_msk is NULL, msk_depth is 0.
Checking nla_opt_msk is NULL is more readable to express that
mask is not provided.

>
> Seems like previously providing masks for only subset of options
> would have worked.

I don't think so, every type has this check:

if (key->enc_opts.len != mask->enc_opts.len) {
NL_SET_ERR_MSG(extack, "Key and mask
miss aligned");
return -EINVAL;
}

which guarantees the numbers are aligned.

Thanks.

[Patch net v3] cls_flower: call nla_ok() before nla_next()

2021-01-14 Thread Cong Wang

From: Cong Wang 

fl_set_enc_opt() simply checks if there are still bytes left to parse,
but this is not sufficent as syzbot seems to be able to generate
malformatted netlink messages. nla_ok() is more strict so should be
used to validate the next nlattr here.

And nla_validate_nested_deprecated() has less strict check too, it is
probably too late to switch to the strict version, but we can just
call nla_ok() too after it.

Reported-and-tested-by: syzbot+2624e3778b18fc497...@syzkaller.appspotmail.com
Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
Cc: Jamal Hadi Salim 
Cc: Xin Long 
Cc: Jiri Pirko 
Cc: Jakub Kicinski 
Signed-off-by: Cong Wang 
---
 net/sched/cls_flower.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 1319986693fc..280eab0c833f 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1272,6 +1272,10 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
 
nla_opt_msk = nla_data(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
msk_depth = nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
+   if (!nla_ok(nla_opt_msk, msk_depth)) {
+   NL_SET_ERR_MSG(extack, "Invalid attribute for masks");
+   return -EINVAL;
+   }
}
 
nla_for_each_attr(nla_opt_key, nla_enc_key,
@@ -1307,9 +1311,6 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
NL_SET_ERR_MSG(extack, "Key and mask miss 
aligned");
return -EINVAL;
}
-
-   if (msk_depth)
-   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
case TCA_FLOWER_KEY_ENC_OPTS_VXLAN:
if (key->enc_opts.dst_opt_type) {
@@ -1340,9 +1341,6 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
NL_SET_ERR_MSG(extack, "Key and mask miss 
aligned");
return -EINVAL;
}
-
-   if (msk_depth)
-   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
case TCA_FLOWER_KEY_ENC_OPTS_ERSPAN:
if (key->enc_opts.dst_opt_type) {
@@ -1373,14 +1371,20 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
NL_SET_ERR_MSG(extack, "Key and mask miss 
aligned");
return -EINVAL;
}
-
-   if (msk_depth)
-   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
default:
NL_SET_ERR_MSG(extack, "Unknown tunnel option type");
return -EINVAL;
}
+
+   if (!nla_opt_msk)
+   continue;
+
+   if (!nla_ok(nla_opt_msk, msk_depth)) {
+   NL_SET_ERR_MSG(extack, "A mask attribute is invalid");
+   return -EINVAL;
+   }
+   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
}
 
return 0;
-- 
2.25.1

Re: [Patch net v2] cls_flower: call nla_ok() before nla_next()

2021-01-14 Thread Cong Wang

On Thu, Jan 14, 2021 at 12:16 PM Jakub Kicinski  wrote:
>
> On Thu, 14 Jan 2021 12:03:16 -0800 Cong Wang wrote:
> > On Thu, Jan 14, 2021 at 10:38 AM Jakub Kicinski  wrote:
> > >
> > > On Thu, 14 Jan 2021 08:38:22 -0800 Cong Wang wrote:
> > > > From: Cong Wang 
> > > >
> > > > fl_set_enc_opt() simply checks if there are still bytes left to parse,
> > > > but this is not sufficent as syzbot seems to be able to generate
> > > > malformatted netlink messages. nla_ok() is more strict so should be
> > > > used to validate the next nlattr here.
> > > >
> > > > And nla_validate_nested_deprecated() has less strict check too, it is
> > > > probably too late to switch to the strict version, but we can just
> > > > call nla_ok() too after it.
> > > >
> > > > Reported-and-tested-by: 
> > > > syzbot+2624e3778b18fc497...@syzkaller.appspotmail.com
> > > > Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
> > > > Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
> > >
> > > > @@ -1340,9 +1341,6 @@ static int fl_set_enc_opt(struct nlattr **tb, 
> > > > struct fl_flow_key *key,
> > > >   NL_SET_ERR_MSG(extack, "Key and mask miss 
> > > > aligned");
> > > >   return -EINVAL;
> > > >   }
> > > > -
> > > > - if (msk_depth)
> > > > - nla_opt_msk = nla_next(nla_opt_msk, 
> > > > &msk_depth);
> > > >   break;
> > > >   case TCA_FLOWER_KEY_ENC_OPTS_ERSPAN:
> > > >   if (key->enc_opts.dst_opt_type) {
> > > > @@ -1373,14 +1371,17 @@ static int fl_set_enc_opt(struct nlattr **tb, 
> > > > struct fl_flow_key *key,
> > > >   NL_SET_ERR_MSG(extack, "Key and mask miss 
> > > > aligned");
> > > >   return -EINVAL;
> > > >   }
> > > > -
> > > > - if (msk_depth)
> > > > - nla_opt_msk = nla_next(nla_opt_msk, 
> > > > &msk_depth);
> > > >   break;
> > > >   default:
> > > >   NL_SET_ERR_MSG(extack, "Unknown tunnel option 
> > > > type");
> > > >   return -EINVAL;
> > > >   }
> > > > +
> > > > + if (!nla_ok(nla_opt_msk, msk_depth)) {
> > > > + NL_SET_ERR_MSG(extack, "Mask attribute is 
> > > > invalid");
> > > > + return -EINVAL;
> > > > + }
> > > > + nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
> > >
> > > we lost the if (msk_depth) now, nla_opt_msk may be NULL -
> > > neither nla_ok() nor nla_next() take NULL
> >
> > How is "if (msk_depth)" lost when nla_ok() has a stricter one?
> >
> > 1156 static inline int nla_ok(const struct nlattr *nla, int remaining)
> > 1157 {
> > 1158 return remaining >= (int) sizeof(*nla) &&
> > 1159nla->nla_len >= sizeof(*nla) &&
> > 1160nla->nla_len <= remaining;
> > 1161 }
> >
> > Line 1156 assures msk_depth is not only non-zero but also larger
> > than the nla struct size, and clearly nla won't be dereferenced unless
> > this check is passed.
>
> Fair, depth will but 0 so first check already fails, but nla_next()
> would crash since it tries to access the length of the attribute
> unconditionally.

nla_next() is only called when nla_ok() returns true, which is not
the case for msk_depth==0, therefore NULL won't crash here.

The only problem is we become too strict to reject optionally missing
masks, we should not even call nla_ok() here, otherwise it would
break user-space. So,

+   if (!nla_opt_msk)
+   continue;

Thanks.

Re: [Patch net v2] cls_flower: call nla_ok() before nla_next()

2021-01-14 Thread Cong Wang

On Thu, Jan 14, 2021 at 10:38 AM Jakub Kicinski  wrote:
>
> On Thu, 14 Jan 2021 08:38:22 -0800 Cong Wang wrote:
> > From: Cong Wang 
> >
> > fl_set_enc_opt() simply checks if there are still bytes left to parse,
> > but this is not sufficent as syzbot seems to be able to generate
> > malformatted netlink messages. nla_ok() is more strict so should be
> > used to validate the next nlattr here.
> >
> > And nla_validate_nested_deprecated() has less strict check too, it is
> > probably too late to switch to the strict version, but we can just
> > call nla_ok() too after it.
> >
> > Reported-and-tested-by: 
> > syzbot+2624e3778b18fc497...@syzkaller.appspotmail.com
> > Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
> > Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
>
> > @@ -1340,9 +1341,6 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
> > fl_flow_key *key,
> >   NL_SET_ERR_MSG(extack, "Key and mask miss 
> > aligned");
> >   return -EINVAL;
> >   }
> > -
> > - if (msk_depth)
> > - nla_opt_msk = nla_next(nla_opt_msk, 
> > &msk_depth);
> >   break;
> >   case TCA_FLOWER_KEY_ENC_OPTS_ERSPAN:
> >   if (key->enc_opts.dst_opt_type) {
> > @@ -1373,14 +1371,17 @@ static int fl_set_enc_opt(struct nlattr **tb, 
> > struct fl_flow_key *key,
> >   NL_SET_ERR_MSG(extack, "Key and mask miss 
> > aligned");
> >   return -EINVAL;
> >   }
> > -
> > - if (msk_depth)
> > - nla_opt_msk = nla_next(nla_opt_msk, 
> > &msk_depth);
> >   break;
> >   default:
> >   NL_SET_ERR_MSG(extack, "Unknown tunnel option type");
> >   return -EINVAL;
> >   }
> > +
> > + if (!nla_ok(nla_opt_msk, msk_depth)) {
> > + NL_SET_ERR_MSG(extack, "Mask attribute is invalid");
> > + return -EINVAL;
> > + }
> > + nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
>
> we lost the if (msk_depth) now, nla_opt_msk may be NULL -
> neither nla_ok() nor nla_next() take NULL

How is "if (msk_depth)" lost when nla_ok() has a stricter one?

1156 static inline int nla_ok(const struct nlattr *nla, int remaining)
1157 {
1158 return remaining >= (int) sizeof(*nla) &&
1159nla->nla_len >= sizeof(*nla) &&
1160nla->nla_len <= remaining;
1161 }

Line 1156 assures msk_depth is not only non-zero but also larger
than the nla struct size, and clearly nla won't be dereferenced unless
this check is passed.

I guess you mean we should not error out for nla_opt_msk==NULL
case as masks are optional?

Thanks.

Re: [PATCH net] net_sched: reject silly cell_log in qdisc_get_rtab()

2021-01-14 Thread Cong Wang

On Thu, Jan 14, 2021 at 8:06 AM Eric Dumazet  wrote:
>
> From: Eric Dumazet 
>
> iproute2 probably never goes beyond 8 for the cell exponent,
> but stick to the max shift exponent for signed 32bit.
>
> UBSAN reported:
> UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
> shift exponent 130 is too large for 32-bit type 'int'
> CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x183/0x22e lib/dump_stack.c:120
>  ubsan_epilogue lib/ubsan.c:148 [inline]
>  __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
>  __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
>  qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
>  cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
>  qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
>  tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
>  rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
>  netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
>  netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
>  netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
>  netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
>  sock_sendmsg_nosec net/socket.c:652 [inline]
>  sock_sendmsg net/socket.c:672 [inline]
>  sys_sendmsg+0x5a2/0x900 net/socket.c:2345
>  ___sys_sendmsg net/socket.c:2399 [inline]
>  __sys_sendmsg+0x319/0x400 net/socket.c:2432
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 

Acked-by: Cong Wang 

Thanks.

[Patch net v2] cls_flower: call nla_ok() before nla_next()

2021-01-14 Thread Cong Wang

From: Cong Wang 

fl_set_enc_opt() simply checks if there are still bytes left to parse,
but this is not sufficent as syzbot seems to be able to generate
malformatted netlink messages. nla_ok() is more strict so should be
used to validate the next nlattr here.

And nla_validate_nested_deprecated() has less strict check too, it is
probably too late to switch to the strict version, but we can just
call nla_ok() too after it.

Reported-and-tested-by: syzbot+2624e3778b18fc497...@syzkaller.appspotmail.com
Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
Cc: Jamal Hadi Salim 
Cc: Xin Long 
Cc: Jiri Pirko 
Cc: Jakub Kicinski 
Signed-off-by: Cong Wang 
---
 net/sched/cls_flower.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 1319986693fc..740d9018e45f 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1272,6 +1272,10 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
 
nla_opt_msk = nla_data(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
msk_depth = nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
+   if (!nla_ok(nla_opt_msk, msk_depth)) {
+   NL_SET_ERR_MSG(extack, "Invalid attribute for masks");
+   return -EINVAL;
+   }
}
 
nla_for_each_attr(nla_opt_key, nla_enc_key,
@@ -1307,9 +1311,6 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
NL_SET_ERR_MSG(extack, "Key and mask miss 
aligned");
return -EINVAL;
}
-
-   if (msk_depth)
-   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
case TCA_FLOWER_KEY_ENC_OPTS_VXLAN:
if (key->enc_opts.dst_opt_type) {
@@ -1340,9 +1341,6 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
NL_SET_ERR_MSG(extack, "Key and mask miss 
aligned");
return -EINVAL;
}
-
-   if (msk_depth)
-   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
case TCA_FLOWER_KEY_ENC_OPTS_ERSPAN:
if (key->enc_opts.dst_opt_type) {
@@ -1373,14 +1371,17 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
NL_SET_ERR_MSG(extack, "Key and mask miss 
aligned");
return -EINVAL;
}
-
-   if (msk_depth)
-   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
default:
NL_SET_ERR_MSG(extack, "Unknown tunnel option type");
return -EINVAL;
}
+
+   if (!nla_ok(nla_opt_msk, msk_depth)) {
+   NL_SET_ERR_MSG(extack, "Mask attribute is invalid");
+   return -EINVAL;
+   }
+   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
}
 
return 0;
-- 
2.25.1

Re: [Patch net] cls_flower: call nla_ok() before nla_next()

2021-01-13 Thread Cong Wang

On Tue, Jan 12, 2021 at 6:22 PM Xin Long  wrote:
>
> On Wed, Jan 13, 2021 at 1:43 AM Cong Wang  wrote:
> >
> > On Tue, Jan 12, 2021 at 3:52 AM Xin Long  wrote:
> > >
> > > On Tue, Jan 12, 2021 at 10:56 AM Cong Wang  
> > > wrote:
> > > >
> > > > From: Cong Wang 
> > > >
> > > > fl_set_enc_opt() simply checks if there are still bytes left to parse,
> > > > but this is not sufficent as syzbot seems to be able to generate
> > > > malformatted netlink messages. nla_ok() is more strict so should be
> > > > used to validate the next nlattr here.
> > > >
> > > > And nla_validate_nested_deprecated() has less strict check too, it is
> > > > probably too late to switch to the strict version, but we can just
> > > > call nla_ok() too after it.
> > > >
> > > > Reported-and-tested-by: 
> > > > syzbot+2624e3778b18fc497...@syzkaller.appspotmail.com
> > > > Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
> > > > Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
> > > > Cc: Pieter Jansen van Vuuren 
> > > > Cc: Jamal Hadi Salim 
> > > > Cc: Xin Long 
> > > > Cc: Jiri Pirko 
> > > > Signed-off-by: Cong Wang 
> > > > ---
> > > >  net/sched/cls_flower.c | 8 +---
> > > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> > > > index 1319986693fc..e265c443536e 100644
> > > > --- a/net/sched/cls_flower.c
> > > > +++ b/net/sched/cls_flower.c
> > > > @@ -1272,6 +1272,8 @@ static int fl_set_enc_opt(struct nlattr **tb, 
> > > > struct fl_flow_key *key,
> > > >
> > > > nla_opt_msk = 
> > > > nla_data(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
> > > > msk_depth = nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
> > > > +   if (!nla_ok(nla_opt_msk, msk_depth))
> > > > +   return -EINVAL;
> > > > }
> > > I think it's better to also add  NL_SET_ERR_MSG(extack, );
> > > for this error return, like all the other places in this function.
> >
> > I think ext message is primarily for end users who usually can not
> > generate malformat netlink messages.
> >
> > On the other hand, the nla_validate_nested_deprecated() right above
> > the quoted code does not set ext message either.
> nla_validate_nested_deprecated(..., extack), it's already done inside
> when returns error, no?

Yeah, seems fair.

Thanks.

Re: [Patch net] cls_flower: call nla_ok() before nla_next()

2021-01-13 Thread Cong Wang

On Tue, Jan 12, 2021 at 5:38 PM Jakub Kicinski  wrote:
>
> On Mon, 11 Jan 2021 18:55:48 -0800 Cong Wang wrote:
> > From: Cong Wang 
> >
> > fl_set_enc_opt() simply checks if there are still bytes left to parse,
> > but this is not sufficent as syzbot seems to be able to generate
> > malformatted netlink messages. nla_ok() is more strict so should be
> > used to validate the next nlattr here.
> >
> > And nla_validate_nested_deprecated() has less strict check too, it is
> > probably too late to switch to the strict version, but we can just
> > call nla_ok() too after it.
> >
> > Reported-and-tested-by: 
> > syzbot+2624e3778b18fc497...@syzkaller.appspotmail.com
> > Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
> > Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
> > Cc: Pieter Jansen van Vuuren 
> > Cc: Jamal Hadi Salim 
> > Cc: Xin Long 
> > Cc: Jiri Pirko 
> > Signed-off-by: Cong Wang 
>
> Thanks for keeping up with the syzbot bugs!
>
> > diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> > index 1319986693fc..e265c443536e 100644
> > --- a/net/sched/cls_flower.c
> > +++ b/net/sched/cls_flower.c
> > @@ -1272,6 +1272,8 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
> > fl_flow_key *key,
> >
> >   nla_opt_msk = nla_data(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
> >   msk_depth = nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
> > + if (!nla_ok(nla_opt_msk, msk_depth))
> > + return -EINVAL;
>
> Can we just add another call to nla_validate_nested_deprecated()
> here instead of having to worry about each attr individually?

No, we can not parse the nested attr here because different key types
have different attributes.

> See below..
>
> >   }
> >
> >   nla_for_each_attr(nla_opt_key, nla_enc_key,
> > @@ -1308,7 +1310,7 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
> > fl_flow_key *key,
> >   return -EINVAL;
> >   }
> >
> > - if (msk_depth)
> > + if (nla_ok(nla_opt_msk, msk_depth))
> >   nla_opt_msk = nla_next(nla_opt_msk, 
> > &msk_depth);
>
> Should we not error otherwise? if msk_depth && !nla_ok() then the
> message is clearly misformatted. If we don't error out we'll keep
> reusing the same mask over and over, while the intention of this
> code was to have mask per key AFAICT.

Yeah, erroring out sounds better. Will change this in v2.

Thanks!

Re: [Patch net] cls_flower: call nla_ok() before nla_next()

2021-01-12 Thread Cong Wang

On Tue, Jan 12, 2021 at 3:52 AM Xin Long  wrote:
>
> On Tue, Jan 12, 2021 at 10:56 AM Cong Wang  wrote:
> >
> > From: Cong Wang 
> >
> > fl_set_enc_opt() simply checks if there are still bytes left to parse,
> > but this is not sufficent as syzbot seems to be able to generate
> > malformatted netlink messages. nla_ok() is more strict so should be
> > used to validate the next nlattr here.
> >
> > And nla_validate_nested_deprecated() has less strict check too, it is
> > probably too late to switch to the strict version, but we can just
> > call nla_ok() too after it.
> >
> > Reported-and-tested-by: 
> > syzbot+2624e3778b18fc497...@syzkaller.appspotmail.com
> > Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
> > Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
> > Cc: Pieter Jansen van Vuuren 
> > Cc: Jamal Hadi Salim 
> > Cc: Xin Long 
> > Cc: Jiri Pirko 
> > Signed-off-by: Cong Wang 
> > ---
> >  net/sched/cls_flower.c | 8 +---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> > index 1319986693fc..e265c443536e 100644
> > --- a/net/sched/cls_flower.c
> > +++ b/net/sched/cls_flower.c
> > @@ -1272,6 +1272,8 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
> > fl_flow_key *key,
> >
> > nla_opt_msk = nla_data(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
> > msk_depth = nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
> > +   if (!nla_ok(nla_opt_msk, msk_depth))
> > +   return -EINVAL;
> > }
> I think it's better to also add  NL_SET_ERR_MSG(extack, );
> for this error return, like all the other places in this function.

I think ext message is primarily for end users who usually can not
generate malformat netlink messages.

On the other hand, the nla_validate_nested_deprecated() right above
the quoted code does not set ext message either.

Thanks.

[Patch net] cls_flower: call nla_ok() before nla_next()

2021-01-11 Thread Cong Wang

From: Cong Wang 

fl_set_enc_opt() simply checks if there are still bytes left to parse,
but this is not sufficent as syzbot seems to be able to generate
malformatted netlink messages. nla_ok() is more strict so should be
used to validate the next nlattr here.

And nla_validate_nested_deprecated() has less strict check too, it is
probably too late to switch to the strict version, but we can just
call nla_ok() too after it.

Reported-and-tested-by: syzbot+2624e3778b18fc497...@syzkaller.appspotmail.com
Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
Cc: Pieter Jansen van Vuuren 
Cc: Jamal Hadi Salim 
Cc: Xin Long 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/cls_flower.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 1319986693fc..e265c443536e 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1272,6 +1272,8 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
 
nla_opt_msk = nla_data(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
msk_depth = nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
+   if (!nla_ok(nla_opt_msk, msk_depth))
+   return -EINVAL;
}
 
nla_for_each_attr(nla_opt_key, nla_enc_key,
@@ -1308,7 +1310,7 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
return -EINVAL;
}
 
-   if (msk_depth)
+   if (nla_ok(nla_opt_msk, msk_depth))
nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
case TCA_FLOWER_KEY_ENC_OPTS_VXLAN:
@@ -1341,7 +1343,7 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
return -EINVAL;
}
 
-   if (msk_depth)
+   if (nla_ok(nla_opt_msk, msk_depth))
nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
case TCA_FLOWER_KEY_ENC_OPTS_ERSPAN:
@@ -1374,7 +1376,7 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
fl_flow_key *key,
return -EINVAL;
}
 
-   if (msk_depth)
+   if (nla_ok(nla_opt_msk, msk_depth))
nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);
break;
default:
-- 
2.25.1

Re: KMSAN: kernel-infoleak in move_addr_to_user (4)

2021-01-11 Thread Cong Wang

On Mon, Jan 11, 2021 at 11:33 AM Jakub Kicinski  wrote:
>
> Looks like a AF_CAN socket:
>
> r0 = socket(0x1d, 0x2, 0x6)
> getsockname$packet(r0, &(0x7f000100)={0x11, 0x0, 0x0, 0x1, 0x0, 0x6, 
> @broadcast}, &(0x7f00)=0x14)
>

Right, it seems we need a memset(0) in isotp_getname().

Thanks.

Re: [PATCH net-next] net: bareudp: simplify error paths calling dellink

2021-01-11 Thread Cong Wang

On Sun, Jan 10, 2021 at 9:29 PM Jakub Kicinski  wrote:
>
> bareudp_dellink() only needs the device list to hand it to
> unregister_netdevice_queue(). We can pass NULL in, and
> unregister_netdevice_queue() will do the unregistering.
> There is no chance for batching on the error path, anyway.
>
> Suggested-by: Cong Wang 
> Signed-off-by: Jakub Kicinski 


Reviewed-by: Cong Wang 

Thanks for cleaning up!

Re: [PATCH net v2] net: bareudp: add missing error handling for bareudp_link_config()

2021-01-05 Thread Cong Wang

On Tue, Jan 5, 2021 at 2:39 PM Jakub Kicinski  wrote:
>
> On Tue, 5 Jan 2021 12:38:54 -0800 Cong Wang wrote:
> > On Tue, Jan 5, 2021 at 11:07 AM Jakub Kicinski  wrote:
> > > +static void bareudp_dellink(struct net_device *dev, struct list_head 
> > > *head)
> > > +{
> > > +   struct bareudp_dev *bareudp = netdev_priv(dev);
> > > +
> > > +   list_del(&bareudp->next);
> > > +   unregister_netdevice_queue(dev, head);
> > > +}
> > > +
> > >  static int bareudp_newlink(struct net *net, struct net_device *dev,
> > >struct nlattr *tb[], struct nlattr *data[],
> > >struct netlink_ext_ack *extack)
> > >  {
> > > struct bareudp_conf conf;
> > > +   LIST_HEAD(list_kill);
> > > int err;
> > >
> > > err = bareudp2info(data, &conf, extack);
> > > @@ -662,17 +671,14 @@ static int bareudp_newlink(struct net *net, struct 
> > > net_device *dev,
> > >
> > > err = bareudp_link_config(dev, tb);
> > > if (err)
> > > -   return err;
> > > +   goto err_unconfig;
> > >
> > > return 0;
> > > -}
> > > -
> > > -static void bareudp_dellink(struct net_device *dev, struct list_head 
> > > *head)
> > > -{
> > > -   struct bareudp_dev *bareudp = netdev_priv(dev);
> > >
> > > -   list_del(&bareudp->next);
> > > -   unregister_netdevice_queue(dev, head);
> > > +err_unconfig:
> > > +   bareudp_dellink(dev, &list_kill);
> > > +   unregister_netdevice_many(&list_kill);
> >
> > Why do we need unregister_netdevice_many() here? I think
> > bareudp_dellink(dev, NULL) is sufficient as we always have
> > one instance to unregister?
> >
> > (For the same reason, bareudp_dev_create() does not need it
> > either.)
>
> Ack, I'm following how bareudp_dev_create() is written.
>
> I can follow up in net-next and change both, sounds good?

Yes, either way is fine.

Thanks.

Re: [RFC PATCH bpf-next] ksnoop: kernel argument/return value tracing/display using BTF

2021-01-05 Thread Cong Wang

On Mon, Jan 4, 2021 at 7:29 AM Alan Maguire  wrote:
>
> BPF Type Format (BTF) provides a description of kernel data structures
> and of the types kernel functions utilize as arguments and return values.
>
> A helper was recently added - bpf_snprintf_btf() - that uses that
> description to create a string representation of the data provided,
> using the BTF id of its type.  For example to create a string
> representation of a "struct sk_buff", the pointer to the skb
> is provided along with the type id of "struct sk_buff".
>
> Here that functionality is utilized to support tracing kernel
> function entry and return using k[ret]probes.  The "struct pt_regs"
> context can be used to derive arguments and return values, and
> when the user supplies a function name we
>
> - look it up in /proc/kallsyms to find its address/module
> - look it up in the BTF kernel data to get types of arguments
>   and return value
> - store a map representation of the trace information, keyed by
>   instruction pointer
> - on function entry/return we look up the map to retrieve the BTF
>   ids of the arguments/return values and can call bpf_snprintf_btf()
>   with these argument/return values along with the type ids to store
>   a string representation in the map.
> - this is then sent via perf event to userspace where it can be
>   displayed.
>
> ksnoop can be used to show function signatures; for example:

This is definitely quite useful!

Is it possible to integrate this with bpftrace? That would save people
from learning yet another tool. ;)

Thanks.

Re: [PATCH net v2] net: bareudp: add missing error handling for bareudp_link_config()

2021-01-05 Thread Cong Wang

On Tue, Jan 5, 2021 at 11:07 AM Jakub Kicinski  wrote:
>
> .dellink does not get called after .newlink fails,
> bareudp_newlink() must undo what bareudp_configure()
> has done if bareudp_link_config() fails.
>
> v2: call bareudp_dellink(), like bareudp_dev_create() does

Thanks for the update. Just one question below.

>
> Fixes: 571912c69f0e ("net: UDP tunnel encapsulation module for tunnelling 
> different protocols like MPLS, IP, NSH etc.")
> Signed-off-by: Jakub Kicinski 
> ---
>  drivers/net/bareudp.c | 22 ++
>  1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/bareudp.c b/drivers/net/bareudp.c
> index 708171c0d628..85de5f96c02b 100644
> --- a/drivers/net/bareudp.c
> +++ b/drivers/net/bareudp.c
> @@ -645,11 +645,20 @@ static int bareudp_link_config(struct net_device *dev,
> return 0;
>  }
>
> +static void bareudp_dellink(struct net_device *dev, struct list_head *head)
> +{
> +   struct bareudp_dev *bareudp = netdev_priv(dev);
> +
> +   list_del(&bareudp->next);
> +   unregister_netdevice_queue(dev, head);
> +}
> +
>  static int bareudp_newlink(struct net *net, struct net_device *dev,
>struct nlattr *tb[], struct nlattr *data[],
>struct netlink_ext_ack *extack)
>  {
> struct bareudp_conf conf;
> +   LIST_HEAD(list_kill);
> int err;
>
> err = bareudp2info(data, &conf, extack);
> @@ -662,17 +671,14 @@ static int bareudp_newlink(struct net *net, struct 
> net_device *dev,
>
> err = bareudp_link_config(dev, tb);
> if (err)
> -   return err;
> +   goto err_unconfig;
>
> return 0;
> -}
> -
> -static void bareudp_dellink(struct net_device *dev, struct list_head *head)
> -{
> -   struct bareudp_dev *bareudp = netdev_priv(dev);
>
> -   list_del(&bareudp->next);
> -   unregister_netdevice_queue(dev, head);
> +err_unconfig:
> +   bareudp_dellink(dev, &list_kill);
> +   unregister_netdevice_many(&list_kill);

Why do we need unregister_netdevice_many() here? I think
bareudp_dellink(dev, NULL) is sufficient as we always have
one instance to unregister?

(For the same reason, bareudp_dev_create() does not need it
either.)

Thanks.

Re: [PATCH net] net: bareudp: add missing error handling for bareudp_link_config()

2021-01-04 Thread Cong Wang

On Mon, Jan 4, 2021 at 9:49 AM Jakub Kicinski  wrote:
>
> On Sat, 2 Jan 2021 15:49:54 -0800 Cong Wang wrote:
> > On Wed, Dec 30, 2020 at 7:46 PM Jakub Kicinski  wrote:
> > > @@ -661,9 +662,14 @@ static int bareudp_newlink(struct net *net, struct 
> > > net_device *dev,
> > >
> > > err = bareudp_link_config(dev, tb);
> > > if (err)
> > > -   return err;
> > > +   goto err_unconfig;
> > >
> > > return 0;
> > > +
> > > +err_unconfig:
> >
> > I think we can save this goto.
>
> I personally prefer more idiomatic code flow to saving a single LoC.
>
> > > +   list_del(&bareudp->next);
> > > +   unregister_netdevice(dev);
> >
> > Which is bareudp_dellink(dev, NULL). ;)
>
> I know, but calling full dellink when only parts of newlink fails felt
> weird. And it's not lower LoC, unless called with NULL as second arg,
> which again could be surprising to a person changing dellink.

I think calling a function with "bareudp_" prefix is more readable
than interpreting list_del()+unregister_netdevice(). I mean

if (bareudp_*())
  goto err;
...
err:
bareudp_*();

this looks cleaner, right?

Thanks.

Re: [PATCH net] net: bareudp: add missing error handling for bareudp_link_config()

2021-01-02 Thread Cong Wang

On Wed, Dec 30, 2020 at 7:46 PM Jakub Kicinski  wrote:
> @@ -661,9 +662,14 @@ static int bareudp_newlink(struct net *net, struct 
> net_device *dev,
>
> err = bareudp_link_config(dev, tb);
> if (err)
> -   return err;
> +   goto err_unconfig;
>
> return 0;
> +
> +err_unconfig:

I think we can save this goto.

> +   list_del(&bareudp->next);
> +   unregister_netdevice(dev);

Which is bareudp_dellink(dev, NULL). ;)

Thanks.

Re: Race Condition Observed in ARP Processing.

2020-12-31 Thread Cong Wang

On Tue, Dec 29, 2020 at 8:06 AM Chinmay Agarwal  wrote:
>
> Hi All,
>
> We found a crash while performing some automated stress tests on a 5.4 kernel 
> based device.
>
> We found out that it there is a freed neighbour address which was still part 
> of the gc_list and was leading to crash.
> Upon adding some debugs and checking neigh_put/neigh_hold/neigh_destroy calls 
> stacks, looks like there is a possibility of a Race condition happening in 
> the code.
[...]
> The crash may have been due to out of order ARP replies.
> As neighbour is marked dead should we go ahead with updating our ARP Tables?

I think you are probably right, we should just do unlock and return
in __neigh_update() when hitting if (neigh->dead) branch. Something
like below:

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 9500d28a43b0..0ce592f585c8 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1250,6 +1250,7 @@ static int __neigh_update(struct neighbour
*neigh, const u8 *lladdr,
goto out;
if (neigh->dead) {
NL_SET_ERR_MSG(extack, "Neighbor entry is now dead");
+   new = old;
goto out;
}

But given the old state probably contains NUD_PERMANENT, I guess
you hit the following branch instead:

if (!(flags & NEIGH_UPDATE_F_ADMIN) &&
(old & (NUD_NOARP | NUD_PERMANENT)))
goto out;

So we may have to check ->dead before this. Please double check.

This bug is probably introduced by commit 9c29a2f55ec05cc8b525ee.
Can you make a patch and send it out formally after testing?

Thanks!

Re: inconsistent lock state in ila_xlat_nl_cmd_add_mapping

2020-12-29 Thread Cong Wang

On Tue, Dec 29, 2020 at 5:39 PM Jakub Kicinski  wrote:
>
> On Mon, 13 Aug 2018 21:40:03 -0700 syzbot wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:78cbac647e61 Merge branch 'ip-faster-in-order-IP-fragments'
> > git tree:   net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=14df482840
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=9100338df26ab75
> > dashboard link: https://syzkaller.appspot.com/bug?extid=eaaf6c4a6a8cb1869d86
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> > syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=13069ad240
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+eaaf6c4a6a8cb1869...@syzkaller.appspotmail.com
>
> #syz invalid
>
> Hard to track down what fixed this, but the lockdep splat is mixing up
> locks from two different hashtables, so there was never a real issue
> here.

This one is probably fixed by:

commit ff93bca769925a2d8fd7f910cdf543d992e17f07
Author: Cong Wang 
Date:   Tue Aug 14 15:21:31 2018 -0700

ila: make lockdep happy again

given the time of last reproducing...

Thanks.

Re: [PATCH net] mld: fix panic in mld_newpack()

2020-12-27 Thread Cong Wang

On Sun, Dec 27, 2020 at 6:40 AM Taehee Yoo  wrote:
> But I'm so sorry I didn't understand some points.
>
> 1. you said "both side" and I understand these as follows:
> a) failure of allocation because of a high order and it is fixed
> by 72e09ad107e7
> b) kernel panic because of 72e09ad107e7
> Are these two issues right?

Yes, we can't fix one by reverting the fix for the other.

>
> 2. So, as far as I understand your mention, these timers are
> good to be changed to the delayed works And these timers are mca_timer,
> mc_gq_timer, mc_ifc_timer, mc_dad_timer.
> Do I understand your mention correctly?
> If so, what is the benefit of it?
> I, unfortunately, couldn't understand the relationship between changing
> timers to the delayed works and these issues.

Because a work has process context so we can use GFP_KERNEL
allocation rather than GFP_ATOMIC, which is what commit 72e09ad107e7
addresses.

Thanks.

[Patch net] af_key: relax availability checks for skb size calculation

2020-12-26 Thread Cong Wang

From: Cong Wang 

xfrm_probe_algs() probes kernel crypto modules and changes the
availability of struct xfrm_algo_desc. But there is a small window
where ealg->available and aalg->available get changed between
count_ah_combs()/count_esp_combs() and dump_ah_combs()/dump_esp_combs(),
in this case we may allocate a smaller skb but later put a larger
amount of data and trigger the panic in skb_put().

Fix this by relaxing the checks when counting the size, that is,
skipping the test of ->available. We may waste some memory for a few
of sizeof(struct sadb_comb), but it is still much better than a panic.

Reported-by: syzbot+b2bf2652983d23734...@syzkaller.appspotmail.com
Cc: Steffen Klassert 
Cc: Herbert Xu 
Signed-off-by: Cong Wang 
---
 net/key/af_key.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/key/af_key.c b/net/key/af_key.c
index c12dbc51ef5f..ef9b4ac03e7b 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -2902,7 +2902,7 @@ static int count_ah_combs(const struct xfrm_tmpl *t)
break;
if (!aalg->pfkey_supported)
continue;
-   if (aalg_tmpl_set(t, aalg) && aalg->available)
+   if (aalg_tmpl_set(t, aalg))
sz += sizeof(struct sadb_comb);
}
return sz + sizeof(struct sadb_prop);
@@ -2920,7 +2920,7 @@ static int count_esp_combs(const struct xfrm_tmpl *t)
if (!ealg->pfkey_supported)
continue;
 
-   if (!(ealg_tmpl_set(t, ealg) && ealg->available))
+   if (!(ealg_tmpl_set(t, ealg)))
continue;
 
for (k = 1; ; k++) {
@@ -2931,7 +2931,7 @@ static int count_esp_combs(const struct xfrm_tmpl *t)
if (!aalg->pfkey_supported)
continue;
 
-   if (aalg_tmpl_set(t, aalg) && aalg->available)
+   if (aalg_tmpl_set(t, aalg))
sz += sizeof(struct sadb_comb);
}
}
-- 
2.25.1

[Patch net] erspan: fix version 1 check in gre_parse_header()

2020-12-26 Thread Cong Wang

From: Cong Wang 

Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
have an erspan header. So the check in gre_parse_header() is wrong,
we have to distinguish version 1 from version 0.

We can just check the gre header length like is_erspan_type1().

Fixes: cb73ee40b1b3 ("net: ip_gre: use erspan key field for tunnel lookup")
Reported-by: syzbot+f583ce3d4ddf9836b...@syzkaller.appspotmail.com
Cc: William Tu 
Cc: Lorenzo Bianconi 
Signed-off-by: Cong Wang 
---
 net/ipv4/gre_demux.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/gre_demux.c b/net/ipv4/gre_demux.c
index 66fdbfe5447c..5d1e6fe9d838 100644
--- a/net/ipv4/gre_demux.c
+++ b/net/ipv4/gre_demux.c
@@ -128,7 +128,7 @@ int gre_parse_header(struct sk_buff *skb, struct 
tnl_ptk_info *tpi,
 * to 0 and sets the configured key in the
 * inner erspan header field
 */
-   if (greh->protocol == htons(ETH_P_ERSPAN) ||
+   if ((greh->protocol == htons(ETH_P_ERSPAN) && hdr_len != 4) ||
greh->protocol == htons(ETH_P_ERSPAN2)) {
struct erspan_base_hdr *ershdr;
 
-- 
2.25.1

Re: [PATCH net] mld: fix panic in mld_newpack()

2020-12-26 Thread Cong Wang

On Wed, Dec 23, 2020 at 8:55 AM Taehee Yoo  wrote:
>
> mld_newpack() doesn't allow to allocate high order page,
> just order-0 allocation is allowed.
> If headroom size is too large, a kernel panic could occur in skb_put().
...
> Allowing high order page allocation could fix this problem.
>
> Fixes: 72e09ad107e7 ("ipv6: avoid high order allocations")

So you just revert this commit which fixes another issue. ;)

How about changing timers to delayed works so that we can
make both sides happy? It is certainly much more work, but
looks worthy of it.

Thanks.

[Patch bpf-next v3 3/3] selftests/bpf: add timeout map check in map_ptr tests

2020-12-21 Thread Cong Wang

From: Cong Wang 

Similar to regular hashmap test.

Acked-by: Andrey Ignatov 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 .../selftests/bpf/progs/map_ptr_kern.c| 20 +++
 1 file changed, 20 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/map_ptr_kern.c 
b/tools/testing/selftests/bpf/progs/map_ptr_kern.c
index d8850bc6a9f1..424a9e76c93f 100644
--- a/tools/testing/selftests/bpf/progs/map_ptr_kern.c
+++ b/tools/testing/selftests/bpf/progs/map_ptr_kern.c
@@ -648,6 +648,25 @@ static inline int check_ringbuf(void)
return 1;
 }
 
+struct {
+   __uint(type, BPF_MAP_TYPE_TIMEOUT_HASH);
+   __uint(max_entries, MAX_ENTRIES);
+   __type(key, __u32);
+   __type(value, __u32);
+} m_timeout SEC(".maps");
+
+static inline int check_timeout_hash(void)
+{
+   struct bpf_htab *timeout_hash = (struct bpf_htab *)&m_timeout;
+   struct bpf_map *map = (struct bpf_map *)&m_timeout;
+
+   VERIFY(check_default(&timeout_hash->map, map));
+   VERIFY(timeout_hash->n_buckets == MAX_ENTRIES);
+   VERIFY(timeout_hash->elem_size == 64);
+
+   return 1;
+}
+
 SEC("cgroup_skb/egress")
 int cg_skb(void *ctx)
 {
@@ -679,6 +698,7 @@ int cg_skb(void *ctx)
VERIFY_TYPE(BPF_MAP_TYPE_SK_STORAGE, check_sk_storage);
VERIFY_TYPE(BPF_MAP_TYPE_DEVMAP_HASH, check_devmap_hash);
VERIFY_TYPE(BPF_MAP_TYPE_RINGBUF, check_ringbuf);
+   VERIFY_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, check_timeout_hash);
 
return 1;
 }
-- 
2.25.1

[Patch bpf-next v3 2/3] selftests/bpf: add test cases for bpf timeout map

2020-12-21 Thread Cong Wang

From: Cong Wang 

Add some test cases in test_maps.c for timeout map, which focus
on testing timeout.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 tools/testing/selftests/bpf/test_maps.c | 46 +
 1 file changed, 46 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 0ad3e6305ff0..19c7c0f64a5d 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -639,6 +639,49 @@ static void test_stackmap(unsigned int task, void *data)
close(fd);
 }
 
+static void test_timeout_map(unsigned int task, void *data)
+{
+   int val1 = 1, val2 = 2, val3 = 3;
+   int key1 = 1, key2 = 2, key3 = 3;
+   int fd;
+
+   fd = bpf_create_map(BPF_MAP_TYPE_TIMEOUT_HASH, sizeof(int), sizeof(int),
+   3, map_flags);
+   if (fd < 0) {
+   printf("Failed to create timeout map '%s'!\n", strerror(errno));
+   exit(1);
+   }
+
+   /* Timeout after 1 secs */
+   assert(bpf_map_update_elem(fd, &key1, &val1, (u64)1000<<32) == 0);
+   /* Timeout after 2 secs */
+   assert(bpf_map_update_elem(fd, &key2, &val2, (u64)2000<<32) == 0);
+   /* Timeout after 10 secs */
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)1<<32) == 0);
+
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+   val2 = 0;
+   assert(bpf_map_lookup_elem(fd, &key2, &val2) == 0 && val2 == 2);
+
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+   assert(bpf_map_lookup_elem(fd, &key2, &val2) != 0);
+
+   /* Modify timeout to expire it earlier */
+   val3 = 0;
+   assert(bpf_map_lookup_elem(fd, &key3, &val3) == 0 && val3 == 3);
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)1000<<32) == 0);
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key3, &val3) != 0);
+
+   /* Add one elem expired immediately and try to delete this expired */
+   assert(bpf_map_update_elem(fd, &key3, &val3, 0) == 0);
+   assert(bpf_map_delete_elem(fd, &key3) == -1 && errno == ENOENT);
+
+   close(fd);
+}
+
 #include 
 #include 
 #include 
@@ -1305,6 +1348,7 @@ static void test_map_stress(void)
 
run_parallel(100, test_arraymap, NULL);
run_parallel(100, test_arraymap_percpu, NULL);
+   run_parallel(100, test_timeout_map, NULL);
 }
 
 #define TASKS 1024
@@ -1752,6 +1796,8 @@ static void run_all_tests(void)
test_stackmap(0, NULL);
 
test_map_in_map();
+
+   test_timeout_map(0, NULL);
 }
 
 #define DEFINE_TEST(name) extern void test_##name(void);
-- 
2.25.1

[Patch bpf-next v3 1/3] bpf: introduce timeout map

2020-12-21 Thread Cong Wang

From: Cong Wang 

This borrows the idea from conntrack and will be used for conntrack in
ebpf too. Each element in a timeout map has a user-specified timeout
in msecs, after it expires it will be automatically removed from the
map. Cilium already does the same thing, it uses a regular map or LRU
map to track connections and has its own GC in user-space. This does
not scale well when we have millions of connections, as each removal
needs a syscall. Even if we could batch the operations, it still needs
to copy a lot of data between kernel and user space.

There are two cases to consider here:

1. When the timeout map is idle, i.e. no one updates or accesses it,
   we rely on the idle work to scan the whole hash table and remove
   these expired. The idle work is scheduled every 1 sec, which is also
   what conntrack uses.

2. When the timeout map is actively accessed, we could reach expired
   elements before the idle work kicks in, we can simply skip them and
   schedule another work to do the actual removal work. Thusly we avoid
   taking locks on fast path.

The timeout of each element can be set or updated via bpf_map_update_elem()
and we reuse the upper 32-bit of the 64-bit flag for the timeout value.
For now, batch ops is not supported, we can add it later iff needed.

To avoid adding memory overhead to regular map, we have to reuse some
field in struct htab_elem, that is, lru_node. Otherwise we would have to
rewrite a lot of code.

Cc: Andrii Nakryiko 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 include/linux/bpf_types.h  |   1 +
 include/uapi/linux/bpf.h   |   5 +-
 kernel/bpf/hashtab.c   | 252 -
 kernel/bpf/syscall.c   |   3 +-
 tools/include/uapi/linux/bpf.h |   1 +
 5 files changed, 253 insertions(+), 9 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 99f7fd657d87..00a3b17b6af2 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, htab_timeout_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 77d7c1bb2923..88507e6aaddd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -158,6 +158,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_RINGBUF,
BPF_MAP_TYPE_INODE_STORAGE,
BPF_MAP_TYPE_TASK_STORAGE,
+   BPF_MAP_TYPE_TIMEOUT_HASH,
 };
 
 /* Note that tracing related programs such as
@@ -393,7 +394,9 @@ enum bpf_link_type {
  */
 #define BPF_PSEUDO_CALL1
 
-/* flags for BPF_MAP_UPDATE_ELEM command */
+/* flags for BPF_MAP_UPDATE_ELEM command, upper 32 bits are timeout for
+ * BPF_MAP_TYPE_TIMEOUT_HASH (in milliseconds).
+ */
 enum {
BPF_ANY = 0, /* create new element or update existing */
BPF_NOEXIST = 1, /* create new element if it didn't exist */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 7e848200cd26..7f59b05e46b1 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -8,6 +8,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include "percpu_freelist.h"
@@ -104,6 +106,9 @@ struct bpf_htab {
u32 hashrnd;
struct lock_class_key lockdep_key;
int __percpu *map_locked[HASHTAB_MAP_LOCK_COUNT];
+   struct llist_head gc_list;
+   struct work_struct gc_work;
+   struct delayed_work gc_idle_work;
 };
 
 /* each htab element is struct htab_elem + key + value */
@@ -122,6 +127,11 @@ struct htab_elem {
union {
struct rcu_head rcu;
struct bpf_lru_node lru_node;
+   struct {
+   u64 expires; /* in jiffies64 */
+   struct llist_node gc_node;
+   atomic_t pending;
+   };
};
u32 hash;
char key[] __aligned(8);
@@ -428,6 +438,31 @@ static int htab_map_alloc_check(union bpf_attr *attr)
return 0;
 }
 
+static bool htab_elem_expired(struct htab_elem *e)
+{
+   return time_after_eq64(get_jiffies_64(), e->expires);
+}
+
+/* Simply put the element in gc list, unless it is already there. */
+static void htab_gc_elem(struct bpf_htab *htab, struct htab_elem *e)
+{
+   if (atomic_fetch_or(1, &e->pending))
+   return;
+   llist_add(&e->gc_node, &htab->gc_list);
+   queue_work(system_unbound_wq, &htab->gc_work);
+}
+
+/* GC an element if it has been expired, returns whether the element is expired
+ * or not.
+ */
+static bool htab_expire_elem(struct bpf_htab *htab, struct htab_elem *e)
+{
+   if (!htab

[Patch bpf-next v3 0/3] bpf: introduce timeout map

2020-12-21 Thread Cong Wang

From: Cong Wang 

This patchset introduces a new bpf hash map which has timeout.
Patch 1 is the implementation of timeout map, patch 2 adds some test
cases for timeout map in test_maps, and patch 3 adds a test case in
map ptr test.

Please check each patch description for more details.

---
v3: move gc list from bucket to elem
reuse lru_node in struct htab_elem
drop patches which are no longer necessary
fix delete path
add a test case for delete path
add parallel test cases
change timeout to ms
drop batch ops

v2: fix hashmap ptr test
add a test case in map ptr test
factor out htab_timeout_map_alloc()

Cong Wang (3):
  bpf: introduce timeout map
  selftests/bpf: add test cases for bpf timeout map
  selftests/bpf: add timeout map check in map_ptr tests

 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |   5 +-
 kernel/bpf/hashtab.c  | 252 +-
 kernel/bpf/syscall.c  |   3 +-
 tools/include/uapi/linux/bpf.h|   1 +
 .../selftests/bpf/progs/map_ptr_kern.c|  20 ++
 tools/testing/selftests/bpf/test_maps.c   |  46 
 7 files changed, 319 insertions(+), 9 deletions(-)

-- 
2.25.1

Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

2020-12-17 Thread Cong Wang

On Wed, Dec 16, 2020 at 10:29 PM Cong Wang  wrote:
>
> On Wed, Dec 16, 2020 at 10:35 AM Andrii Nakryiko
>  wrote:
> > Minimize duplication of the code, no one said copy/paste all the code.
> > But memory bloat is a real problem and should be justification enough
> > to at least consider other options.
>
> Sure, I have no problem with this. The question is how do we balance?
> Is rewriting 200 lines of code to save 8 bytes of each entry acceptable?
> What about rewriting 2000 lines of code? Do people prefer to review 200
> or 2000 (or whatever number) lines of code? Or people just want a
> minimal change for easier reviews?

No worry any more. I manage to find some way to reuse the existing
members, that is lru_node. So the end result is putting gc stuff into
the union with lru_node without increasing the size of htab_elem.
And of course, without duplicating/refactoring regular htab code.

Thanks.

Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

2020-12-16 Thread Cong Wang

On Wed, Dec 16, 2020 at 10:35 AM Andrii Nakryiko
 wrote:
>
> On Tue, Dec 15, 2020 at 4:15 PM Cong Wang  wrote:
> >
> > On Tue, Dec 15, 2020 at 2:08 PM Andrii Nakryiko
> >  wrote:
> > >
> > > On Tue, Dec 15, 2020 at 12:06 PM Cong Wang  
> > > wrote:
> > > >
> > > > On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
> > > >  wrote:
> > > > >
> > > > > On Mon, Dec 14, 2020 at 12:17 PM Cong Wang  
> > > > > wrote:
> > > > > >
> > > > > > From: Cong Wang 
> > > > > >
> > > > > > This borrows the idea from conntrack and will be used for conntrack 
> > > > > > in
> > > > > > bpf too. Each element in a timeout map has a user-specified timeout
> > > > > > in secs, after it expires it will be automatically removed from the 
> > > > > > map.
> > > > > >
> > > > > > There are two cases here:
> > > > > >
> > > > > > 1. When the timeout map is idle, that is, no one updates or 
> > > > > > accesses it,
> > > > > >we rely on the idle work to scan the whole hash table and remove
> > > > > >these expired. The idle work is scheduled every 1 sec.
> > > > >
> > > > > Would 1 second be a good period for a lot of cases? Probably would be
> > > > > good to expand on what went into this decision.
> > > >
> > > > Sure, because our granularity is 1 sec, I will add it into changelog.
> > > >
> > >
> > > Granularity of a timeout is not that coupled with the period of
> > > garbage collection. In this case, with 1 second period, you can have
> > > some items not garbage collected for up to 2 seconds due to timing and
> > > races. Just keep that in mind.
> >
> > Well, it is. Let's say we add entries every ms and kick gc every sec, we
> > could end up with thousands of expired entries in hash map, which could
> > be a problem under memory pressure.
>
> You can have the same thousands of entries expired with 1 second
> timeout granularity, so not sure what point you are making. Think

It is impossible to have expired entries within 1 sec when the granularity
is 1 sec and GC interval is 1 sec (which is my current code).

> about entries being added 1 every millisecond with 1 second timeout.
> So at time +1ms you have 1 message with timeout at +1001ms, at +2ms,
> you have 2 messages, one expiring at +1001ms and another at +1002ms.
> So when you 1 second period GC kicks in at, say, +1000ms, it discards
> nothing. By the time it kicks in second time at +2000ms, you are going
> to expire 1000items, but you could have expired 500 at +1500ms, if
> your period was 500ms, for example. With a 200ms period, you'd have at
> most 200 not expired entries.
>
> This is why I'm saying granularity of timeout units is not coupled
> with the GC period. I hope this makes it clearer. More granular
> timeout units give more flexibility and power to users without
> changing anything else.

The point is the smaller the granularity is, the more entries could be
expired within a GC schedule interval. This is an issue when we have
a burst within an interval and it would cause memory pressure during this
interval.

>
> But relying purely on GC period is bad, because with more frequent
> updates you can accumulate almost arbitrary number of expired entries
> between two GC passes. No matter the timeout granularity.

True, this is why xt_hashlimit simply lets users pick the gc interval. And
in fact, my initial implementation of timeout map exposed gc interval to
user too, I removed it when I learned the granularity can be just 1 sec
for conntrack use case (see Documentation/networking/nf_conntrack-sysctl.txt).

Anyway, it is not a simple task to just convert sec to ms here, the gc
interval matters more when the granularity becomes smaller.


> > >
> > > >
> > > > >
> > > > > >  enum {
> > > > > > BPF_ANY = 0, /* create new element or update 
> > > > > > existing */
> > > > > > BPF_NOEXIST = 1, /* create new element if it didn't 
> > > > > > exist */
> > > > > > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > > > > > index f0b7b54fa3a8..178cb376c397 100644
> > > > > > --- a/kernel/bpf/hashtab.c
> > > > > > +++ b/kernel/bpf/hashtab.c
> > > > > > @@ -8,6 +8,8 @@
> > > > > >

Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

2020-12-16 Thread Cong Wang

On Tue, Dec 15, 2020 at 6:35 PM Alexei Starovoitov
 wrote:
>
> On Tue, Dec 15, 2020 at 6:10 PM Cong Wang  wrote:
> >
> > Sure, people also implement CT on native hash map too and timeout
> > with user-space timers. ;)
>
> exactly. what's wrong with that?
> Perfectly fine way to do CT.

Seriously? When we have 8 millions of entries in a hash map, it is
definitely seriously wrong to purge entries one by one from user-space.

In case you don't believe me, take a look at what cilium CT GC does,
which is precisely expires entries one by one in user-space:

https://github.com/cilium/cilium/blob/0f57292c0037ee23ba1ca2f9abb113f36a664645/pkg/bpf/map_linux.go#L728
https://github.com/cilium/cilium/blob/master/pkg/maps/ctmap/ctmap.go#L398

and of course what people complained:

https://github.com/cilium/cilium/issues/5048

>
> > > Anything extra can be added on top from user space
> > > which can easily copy with 1 sec granularity.
> >
> > The problem is never about granularity, it is about how efficient we can
> > GC. User-space has to scan the whole table one by one, while the kernel
> > can just do this behind the scene with a much lower overhead.
> >
> > Let's say we arm a timer for each entry in user-space, it requires a syscall
> > and locking buckets each time for each entry. Kernel could do it without
> > any additional syscall and batching. Like I said above, we could have
> > millions of entries, so the overhead would be big in this scenario.
>
> and the user space can pick any other implementation instead
> of trivial entry by entry gc with timer.

Unless they don't have to, right? With timeout implementation in kernel,
user space does not need to invent any wheel.


>
> > > Say the kernel does GC and deletes htab entries.
> > > How user space will know that it's gone? There would need to be
> >
> > By a lookup.
> >
> > > an event sent to user space when entry is being deleted by the kernel.
> > > But then such event will be racy. Instead when timers and expirations
> > > are done by user space everything is in sync.
> >
> > Why there has to be an event?
>
> because when any production worthy implementation moves
> past the prototype stage there is something that user space needs to keep
> as well. Sometimes the bpf map in the kernel is alone.
> But a lot of times there is a user space mirror of the map in c++ or golang
> with the same key where user space keeps extra data.

So... what event does LRU map send when it deletes a different entry
when the map is full?

Thanks.

Re: [PATCH net-next v2 2/4] sch_htb: Hierarchical QoS hardware offload

2020-12-16 Thread Cong Wang

On Mon, Dec 14, 2020 at 12:30 PM Maxim Mikityanskiy  wrote:
>
> On 2020-12-14 21:35, Cong Wang wrote:
> > On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy  
> > wrote:
> >>
> >> On 2020-12-11 21:16, Cong Wang wrote:
> >>> On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy  
> >>> wrote:
> >>>>
> >>>> HTB doesn't scale well because of contention on a single lock, and it
> >>>> also consumes CPU. This patch adds support for offloading HTB to
> >>>> hardware that supports hierarchical rate limiting.
> >>>>
> >>>> This solution addresses two main problems of scaling HTB:
> >>>>
> >>>> 1. Contention by flow classification. Currently the filters are attached
> >>>> to the HTB instance as follows:
> >>>
> >>> I do not think this is the reason, tcf_classify() has been called with RCU
> >>> only on the ingress side for a rather long time. What contentions are you
> >>> talking about here?
> >>
> >> When one attaches filters to HTB, tcf_classify is called from
> >> htb_classify, which is called from htb_enqueue, which is called with the
> >> root spinlock of the qdisc taken.
> >
> > So it has nothing to do with tcf_classify() itself... :-/
> >
> > [...]
> >
> >>> And doesn't TBF already work with mq? I mean you can attach it as
> >>> a leaf to each mq so that the tree lock will not be shared either, but 
> >>> you'd
> >>> lose the benefits of a global rate limit too.
> >>
> >> Yes, I'd lose not only the global rate limit, but also multi-level
> >> hierarchical limits, which are all provided by this HTB offload - that's
> >> why TBF is not really a replacement for this feature.
> >
> > Interesting, please explain how your HTB offload still has a global rate
> > limit and borrowing across queues?
>
> Sure, I will explain that.
>
> > I simply can't see it, all I can see
> > is you offload HTB into each queue in ->attach(),
>
> In the non-offload mode, the same HTB instance would be attached to all
> queues. In the offload mode, HTB behaves like MQ: there is a root
> instance of HTB, but each queue gets a separate simple qdisc (pfifo).
> Only the root qdisc (HTB) gets offloaded, and when that happens, the NIC
> creates an object for the QoS root.

Please add this to your changelog.

And why is the offloaded root qdisc not visible to software? All you add to
root HTB are pointers of direct qdisc's and a boolean, this is what I meant
by "not reflected". I expect the hardware parameters/stats are exposed to
software too, but I can't find any.

>
> Then all configuration changes are sent to the driver, and it issues the
> corresponding firmware commands to replicate the whole hierarchy in the
> NIC. Leaf classes correspond to queue groups (in this implementation
> queue groups contain only one queue, but it can be extended), and inner
> classes correspond to entities called TSARs.
>
> The information about rate limits is stored inside TSARs and queue
> groups. Queues know what groups they belong to, and groups and TSARs
> know what TSAR is their parent. A queue is picked in ndo_select_queue by
> looking at the classification result of clsact. So, when a packet is put
> onto a queue, the NIC can track the whole hierarchy and do the HTB
> algorithm.

Glad to know hardware still keeps HTB as a hierarchy.

Please also add this either to source code as comments or in your
changelog, it is very important to understand what is done by hardware.

Thanks.

Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

2020-12-15 Thread Cong Wang

On Tue, Dec 15, 2020 at 5:14 PM Alexei Starovoitov
 wrote:
>
> On Tue, Dec 15, 2020 at 04:22:21PM -0800, Cong Wang wrote:
> > On Tue, Dec 15, 2020 at 3:23 PM Daniel Borkmann  
> > wrote:
> > >
> > > On 12/15/20 11:03 PM, Andrii Nakryiko wrote:
> > > > On Tue, Dec 15, 2020 at 12:06 PM Cong Wang  
> > > > wrote:
> > > >>
> > > >> On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
> > > >>  wrote:
> > > >>>
> > > >>> On Mon, Dec 14, 2020 at 12:17 PM Cong Wang  
> > > >>> wrote:
> > > >>>>
> > > >>>> From: Cong Wang 
> > > >>>>
> > > >>>> This borrows the idea from conntrack and will be used for conntrack 
> > > >>>> in
> > > >>>> bpf too. Each element in a timeout map has a user-specified timeout
> > > >>>> in secs, after it expires it will be automatically removed from the 
> > > >>>> map.
> > > [...]
> > > >>>>  char key[] __aligned(8);
> > > >>>>   };
> > > >>>>
> > > >>>> @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab 
> > > >>>> *htab)
> > > >>>>
> > > >>>>  for (i = 0; i < htab->n_buckets; i++) {
> > > >>>>  INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
> > > >>>> +   atomic_set(&htab->buckets[i].pending, 0);
> > > >>>>  if (htab_use_raw_lock(htab)) {
> > > >>>>  
> > > >>>> raw_spin_lock_init(&htab->buckets[i].raw_lock);
> > > >>>>  
> > > >>>> lockdep_set_class(&htab->buckets[i].raw_lock,
> > > >>>> @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr 
> > > >>>> *attr)
> > > >>>>  return 0;
> > > >>>>   }
> > > >>>>
> > > >>>> +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
> > > >>>> +{
> > > >>>> +   if (atomic_fetch_or(1, &b->pending))
> > > >>>> +   return;
> > > >>>> +   llist_add(&b->gc_node, &htab->gc_list);
> > > >>>> +   queue_work(system_unbound_wq, &htab->gc_work);
> > > >>>> +}
> > > >>>
> > > >>> I'm concerned about each bucket being scheduled individually... And
> > > >>> similarly concerned that each instance of TIMEOUT_HASH will do its own
> > > >>> scheduling independently. Can you think about the way to have a
> > > >>> "global" gc/purging logic, and just make sure that buckets that need
> > > >>> processing would be just internally chained together. So the purging
> > > >>> routing would iterate all the scheduled hashmaps, and within each it
> > > >>> will have a linked list of buckets that need processing? And all that
> > > >>> is done just once each GC period. Not N times for N maps or N*M times
> > > >>> for N maps with M buckets in each.
> > > >>
> > > >> Our internal discussion went to the opposite actually, people here 
> > > >> argued
> > > >> one work is not sufficient for a hashtable because there would be 
> > > >> millions
> > > >> of entries (max_entries, which is also number of buckets). ;)
> > > >
> > > > I was hoping that it's possible to expire elements without iterating
> > > > the entire hash table every single time, only items that need to be
> > > > processed. Hashed timing wheel is one way to do something like this,
> > > > kernel has to solve similar problems with timeouts as well, why not
> > > > taking inspiration there?
> > >
> > > Couldn't this map be coupled with LRU map for example through flag on map
> > > creation so that the different LRU map flavors can be used with it? For 
> > > BPF
> > > CT use case we do rely on LRU map to purge 'inactive' entries once full. I
> > > wonder if for that case you then still need to schedule a GC at all.. e.g.
> > > if you hit the condition time_after_eq64(now, ent

Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

2020-12-15 Thread Cong Wang

On Tue, Dec 15, 2020 at 3:23 PM Daniel Borkmann  wrote:
>
> On 12/15/20 11:03 PM, Andrii Nakryiko wrote:
> > On Tue, Dec 15, 2020 at 12:06 PM Cong Wang  wrote:
> >>
> >> On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
> >>  wrote:
> >>>
> >>> On Mon, Dec 14, 2020 at 12:17 PM Cong Wang  
> >>> wrote:
> >>>>
> >>>> From: Cong Wang 
> >>>>
> >>>> This borrows the idea from conntrack and will be used for conntrack in
> >>>> bpf too. Each element in a timeout map has a user-specified timeout
> >>>> in secs, after it expires it will be automatically removed from the map.
> [...]
> >>>>  char key[] __aligned(8);
> >>>>   };
> >>>>
> >>>> @@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
> >>>>
> >>>>  for (i = 0; i < htab->n_buckets; i++) {
> >>>>  INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
> >>>> +   atomic_set(&htab->buckets[i].pending, 0);
> >>>>  if (htab_use_raw_lock(htab)) {
> >>>>  raw_spin_lock_init(&htab->buckets[i].raw_lock);
> >>>>  lockdep_set_class(&htab->buckets[i].raw_lock,
> >>>> @@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr 
> >>>> *attr)
> >>>>  return 0;
> >>>>   }
> >>>>
> >>>> +static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
> >>>> +{
> >>>> +   if (atomic_fetch_or(1, &b->pending))
> >>>> +   return;
> >>>> +   llist_add(&b->gc_node, &htab->gc_list);
> >>>> +   queue_work(system_unbound_wq, &htab->gc_work);
> >>>> +}
> >>>
> >>> I'm concerned about each bucket being scheduled individually... And
> >>> similarly concerned that each instance of TIMEOUT_HASH will do its own
> >>> scheduling independently. Can you think about the way to have a
> >>> "global" gc/purging logic, and just make sure that buckets that need
> >>> processing would be just internally chained together. So the purging
> >>> routing would iterate all the scheduled hashmaps, and within each it
> >>> will have a linked list of buckets that need processing? And all that
> >>> is done just once each GC period. Not N times for N maps or N*M times
> >>> for N maps with M buckets in each.
> >>
> >> Our internal discussion went to the opposite actually, people here argued
> >> one work is not sufficient for a hashtable because there would be millions
> >> of entries (max_entries, which is also number of buckets). ;)
> >
> > I was hoping that it's possible to expire elements without iterating
> > the entire hash table every single time, only items that need to be
> > processed. Hashed timing wheel is one way to do something like this,
> > kernel has to solve similar problems with timeouts as well, why not
> > taking inspiration there?
>
> Couldn't this map be coupled with LRU map for example through flag on map
> creation so that the different LRU map flavors can be used with it? For BPF
> CT use case we do rely on LRU map to purge 'inactive' entries once full. I
> wonder if for that case you then still need to schedule a GC at all.. e.g.
> if you hit the condition time_after_eq64(now, entry->expires) you'd just
> re-link the expired element from the public htab to e.g. the LRU's local
> CPU's free/pending-list instead.

I doubt we can use size as a limit to kick off GC or LRU, it must be
time-based. And in case of idle, there has to be an async GC, right?

Thanks.

Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

2020-12-15 Thread Cong Wang

On Tue, Dec 15, 2020 at 2:08 PM Andrii Nakryiko
 wrote:
>
> On Tue, Dec 15, 2020 at 12:06 PM Cong Wang  wrote:
> >
> > On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
> >  wrote:
> > >
> > > On Mon, Dec 14, 2020 at 12:17 PM Cong Wang  
> > > wrote:
> > > >
> > > > From: Cong Wang 
> > > >
> > > > This borrows the idea from conntrack and will be used for conntrack in
> > > > bpf too. Each element in a timeout map has a user-specified timeout
> > > > in secs, after it expires it will be automatically removed from the map.
> > > >
> > > > There are two cases here:
> > > >
> > > > 1. When the timeout map is idle, that is, no one updates or accesses it,
> > > >we rely on the idle work to scan the whole hash table and remove
> > > >these expired. The idle work is scheduled every 1 sec.
> > >
> > > Would 1 second be a good period for a lot of cases? Probably would be
> > > good to expand on what went into this decision.
> >
> > Sure, because our granularity is 1 sec, I will add it into changelog.
> >
>
> Granularity of a timeout is not that coupled with the period of
> garbage collection. In this case, with 1 second period, you can have
> some items not garbage collected for up to 2 seconds due to timing and
> races. Just keep that in mind.

Well, it is. Let's say we add entries every ms and kick gc every sec, we
could end up with thousands of expired entries in hash map, which could
be a problem under memory pressure.

>
> > >
> > > >
> > > > 2. When the timeout map is actively accessed, we could reach expired
> > > >elements before the idle work kicks in, we can simply skip them and
> > > >schedule another work to do the actual removal work. We avoid taking
> > > >    locks on fast path.
> > > >
> > > > The timeout of each element can be set or updated via 
> > > > bpf_map_update_elem()
> > > > and we reuse the upper 32-bit of the 64-bit flag for the timeout value.
> > > >
> > > > Cc: Alexei Starovoitov 
> > > > Cc: Daniel Borkmann 
> > > > Cc: Dongdong Wang 
> > > > Signed-off-by: Cong Wang 
> > > > ---
> > > >  include/linux/bpf_types.h  |   1 +
> > > >  include/uapi/linux/bpf.h   |   3 +-
> > > >  kernel/bpf/hashtab.c   | 244 -
> > > >  kernel/bpf/syscall.c   |   3 +-
> > > >  tools/include/uapi/linux/bpf.h |   1 +
> > > >  5 files changed, 248 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > > > index 99f7fd657d87..00a3b17b6af2 100644
> > > > --- a/include/linux/bpf_types.h
> > > > +++ b/include/linux/bpf_types.h
> > > > @@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
> > > >  BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
> > > >  #endif
> > > >  BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
> > > > +BPF_MAP_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, htab_timeout_map_ops)
> > > >
> > > >  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
> > > >  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index 30b477a26482..dedb47bc3f52 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -158,6 +158,7 @@ enum bpf_map_type {
> > > > BPF_MAP_TYPE_RINGBUF,
> > > > BPF_MAP_TYPE_INODE_STORAGE,
> > > > BPF_MAP_TYPE_TASK_STORAGE,
> > > > +   BPF_MAP_TYPE_TIMEOUT_HASH,
> > > >  };
> > > >
> > > >  /* Note that tracing related programs such as
> > > > @@ -393,7 +394,7 @@ enum bpf_link_type {
> > > >   */
> > > >  #define BPF_PSEUDO_CALL1
> > > >
> > > > -/* flags for BPF_MAP_UPDATE_ELEM command */
> > > > +/* flags for BPF_MAP_UPDATE_ELEM command, upper 32 bits are timeout */
> > >
> > > timeout in what units of time?
> >
> > 1 sec, should I also add it in this comment (and everywhere else)?
>
> yes, please

Sure.

>
> >
> > >
> > > >  enum {
> > > > BPF_ANY = 0, /* create new element or update existing */
>

Re: [Patch bpf-next v2 0/5] bpf: introduce timeout map

2020-12-15 Thread Cong Wang

On Tue, Dec 15, 2020 at 11:29 AM Andrii Nakryiko
 wrote:
>
> On Mon, Dec 14, 2020 at 12:13 PM Cong Wang  wrote:
> >
> > From: Cong Wang 
> >
> > This patchset introduces a new bpf hash map which has timeout.
>
> It's a bit too short a cover letter for a pretty major new type of
> hash maps. Please expand on the problem it's trying to solve, how you
> tested and benchmarked it, etc.

I prefer to put everything in each patch description, because `git log` is easy
to find it after merging. (I know we can retain this cover letter but
it is not directly
shown in `git log kernel/bpf/hashtab.c`)

Please reply to patch 3 if you think anything is missing in its description.

Thanks!

Re: [Patch bpf-next v2 2/5] bpf: introduce timeout map

2020-12-15 Thread Cong Wang

On Tue, Dec 15, 2020 at 11:27 AM Andrii Nakryiko
 wrote:
>
> On Mon, Dec 14, 2020 at 12:17 PM Cong Wang  wrote:
> >
> > From: Cong Wang 
> >
> > This borrows the idea from conntrack and will be used for conntrack in
> > bpf too. Each element in a timeout map has a user-specified timeout
> > in secs, after it expires it will be automatically removed from the map.
> >
> > There are two cases here:
> >
> > 1. When the timeout map is idle, that is, no one updates or accesses it,
> >we rely on the idle work to scan the whole hash table and remove
> >these expired. The idle work is scheduled every 1 sec.
>
> Would 1 second be a good period for a lot of cases? Probably would be
> good to expand on what went into this decision.

Sure, because our granularity is 1 sec, I will add it into changelog.

>
> >
> > 2. When the timeout map is actively accessed, we could reach expired
> >elements before the idle work kicks in, we can simply skip them and
> >schedule another work to do the actual removal work. We avoid taking
> >locks on fast path.
> >
> > The timeout of each element can be set or updated via bpf_map_update_elem()
> > and we reuse the upper 32-bit of the 64-bit flag for the timeout value.
> >
> > Cc: Alexei Starovoitov 
> > Cc: Daniel Borkmann 
> > Cc: Dongdong Wang 
> > Signed-off-by: Cong Wang 
> > ---
> >  include/linux/bpf_types.h  |   1 +
> >  include/uapi/linux/bpf.h   |   3 +-
> >  kernel/bpf/hashtab.c   | 244 -
> >  kernel/bpf/syscall.c   |   3 +-
> >  tools/include/uapi/linux/bpf.h |   1 +
> >  5 files changed, 248 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index 99f7fd657d87..00a3b17b6af2 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
> >  BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
> >  #endif
> >  BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
> > +BPF_MAP_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, htab_timeout_map_ops)
> >
> >  BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
> >  BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 30b477a26482..dedb47bc3f52 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -158,6 +158,7 @@ enum bpf_map_type {
> > BPF_MAP_TYPE_RINGBUF,
> > BPF_MAP_TYPE_INODE_STORAGE,
> > BPF_MAP_TYPE_TASK_STORAGE,
> > +   BPF_MAP_TYPE_TIMEOUT_HASH,
> >  };
> >
> >  /* Note that tracing related programs such as
> > @@ -393,7 +394,7 @@ enum bpf_link_type {
> >   */
> >  #define BPF_PSEUDO_CALL1
> >
> > -/* flags for BPF_MAP_UPDATE_ELEM command */
> > +/* flags for BPF_MAP_UPDATE_ELEM command, upper 32 bits are timeout */
>
> timeout in what units of time?

1 sec, should I also add it in this comment (and everywhere else)?

>
> >  enum {
> > BPF_ANY = 0, /* create new element or update existing */
> > BPF_NOEXIST = 1, /* create new element if it didn't exist */
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index f0b7b54fa3a8..178cb376c397 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -8,6 +8,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >  #include 
> >  #include 
> >  #include "percpu_freelist.h"
> > @@ -84,6 +86,8 @@ struct bucket {
> > raw_spinlock_t raw_lock;
> > spinlock_t lock;
> > };
> > +   struct llist_node gc_node;
> > +   atomic_t pending;
>
> HASH is an extremely frequently used type of map, and oftentimes with
> a lot of entries/buckets. I don't think users of normal
> BPF_MAP_TYPE_HASH should pay the price of way more niche hashmap with
> timeouts. So I think it's not appropriate to increase the size of the
> struct bucket here.

I understand that, but what's a better way to do this? I can wrap it up
on top of struct bucket for sure, but it would need to change a lot of code.
So, basically code reuse vs. struct bucket size increase. ;)

>
> >  };
> >
> >  #define HASHTAB_MAP_LOCK_COUNT 8
> > @@ -104,6 +108,9 @@ struct bpf_htab {
> > u32 hashrnd;
> > struct lo

[Patch bpf-next v2 1/5] bpf: use index instead of hash for map_locked[]

2020-12-14 Thread Cong Wang

From: Cong Wang 

Commit 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked")
introduced a percpu counter map_locked to bail out NMI case. It uses
the hash of each bucket for indexing, which requires callers of
htab_lock_bucket()/htab_unlock_bucket() to pass it in. But hash value
is not always available, especially when we traverse the whole hash
table where we do not have keys to compute the hash. We can just
compute the index of each bucket with its address and use index
instead.

This is a prerequisite for the following timeout map patch.

Cc: Song Liu 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 kernel/bpf/hashtab.c | 57 +++-
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 7e848200cd26..f0b7b54fa3a8 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -156,16 +156,17 @@ static void htab_init_buckets(struct bpf_htab *htab)
 }
 
 static inline int htab_lock_bucket(const struct bpf_htab *htab,
-  struct bucket *b, u32 hash,
-  unsigned long *pflags)
+  struct bucket *b, unsigned long *pflags)
 {
unsigned long flags;
+   unsigned int index;
 
-   hash = hash & HASHTAB_MAP_LOCK_MASK;
+   index = b - htab->buckets;
+   index &= HASHTAB_MAP_LOCK_MASK;
 
migrate_disable();
-   if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
-   __this_cpu_dec(*(htab->map_locked[hash]));
+   if (unlikely(__this_cpu_inc_return(*(htab->map_locked[index])) != 1)) {
+   __this_cpu_dec(*(htab->map_locked[index]));
migrate_enable();
return -EBUSY;
}
@@ -180,15 +181,17 @@ static inline int htab_lock_bucket(const struct bpf_htab 
*htab,
 }
 
 static inline void htab_unlock_bucket(const struct bpf_htab *htab,
- struct bucket *b, u32 hash,
- unsigned long flags)
+ struct bucket *b, unsigned long flags)
 {
-   hash = hash & HASHTAB_MAP_LOCK_MASK;
+   unsigned int index;
+
+   index = b - htab->buckets;
+   index &= HASHTAB_MAP_LOCK_MASK;
if (htab_use_raw_lock(htab))
raw_spin_unlock_irqrestore(&b->raw_lock, flags);
else
spin_unlock_irqrestore(&b->lock, flags);
-   __this_cpu_dec(*(htab->map_locked[hash]));
+   __this_cpu_dec(*(htab->map_locked[index]));
migrate_enable();
 }
 
@@ -710,7 +713,7 @@ static bool htab_lru_map_delete_node(void *arg, struct 
bpf_lru_node *node)
b = __select_bucket(htab, tgt_l->hash);
head = &b->head;
 
-   ret = htab_lock_bucket(htab, b, tgt_l->hash, &flags);
+   ret = htab_lock_bucket(htab, b, &flags);
if (ret)
return false;
 
@@ -720,7 +723,7 @@ static bool htab_lru_map_delete_node(void *arg, struct 
bpf_lru_node *node)
break;
}
 
-   htab_unlock_bucket(htab, b, tgt_l->hash, flags);
+   htab_unlock_bucket(htab, b, flags);
 
return l == tgt_l;
 }
@@ -1019,7 +1022,7 @@ static int htab_map_update_elem(struct bpf_map *map, void 
*key, void *value,
 */
}
 
-   ret = htab_lock_bucket(htab, b, hash, &flags);
+   ret = htab_lock_bucket(htab, b, &flags);
if (ret)
return ret;
 
@@ -1062,7 +1065,7 @@ static int htab_map_update_elem(struct bpf_map *map, void 
*key, void *value,
}
ret = 0;
 err:
-   htab_unlock_bucket(htab, b, hash, flags);
+   htab_unlock_bucket(htab, b, flags);
return ret;
 }
 
@@ -1100,7 +1103,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, 
void *key, void *value,
return -ENOMEM;
memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size);
 
-   ret = htab_lock_bucket(htab, b, hash, &flags);
+   ret = htab_lock_bucket(htab, b, &flags);
if (ret)
return ret;
 
@@ -1121,7 +1124,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, 
void *key, void *value,
ret = 0;
 
 err:
-   htab_unlock_bucket(htab, b, hash, flags);
+   htab_unlock_bucket(htab, b, flags);
 
if (ret)
bpf_lru_push_free(&htab->lru, &l_new->lru_node);
@@ -1156,7 +1159,7 @@ static int __htab_percpu_map_update_elem(struct bpf_map 
*map, void *key,
b = __select_bucket(htab, hash);
head = &b->head;
 
-   ret = htab_lock_bucket(htab, b, hash, &flags);
+   ret = htab_lock_bucket(htab, b, &flags);
if (ret)
return ret;
 
@@ -1181,7 +1184,7 @@ static int _

[Patch bpf-next v2 0/5] bpf: introduce timeout map

2020-12-14 Thread Cong Wang

From: Cong Wang 

This patchset introduces a new bpf hash map which has timeout.
Patch 1 is a preparation, patch 2 is the implementation of timeout
map, patch 3 updates an existing hash map ptr test, patch 4 and
patch 5 contain two test cases for timeout map.

Please check each patch description for more details.

---
v2: fix hashmap ptr test
add a test case in map ptr test
factor out htab_timeout_map_alloc()

Cong Wang (5):
  bpf: use index instead of hash for map_locked[]
  bpf: introduce timeout map
  selftests/bpf: update elem_size check in map ptr test
  selftests/bpf: add a test case for bpf timeout map
  selftests/bpf: add timeout map check in map_ptr tests

 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |   3 +-
 kernel/bpf/hashtab.c  | 301 --
 kernel/bpf/syscall.c  |   3 +-
 tools/include/uapi/linux/bpf.h|   1 +
 .../selftests/bpf/progs/map_ptr_kern.c|  22 +-
 tools/testing/selftests/bpf/test_maps.c   |  41 +++
 7 files changed, 340 insertions(+), 32 deletions(-)

-- 
2.25.1

[Patch bpf-next v2 4/5] selftests/bpf: add a test case for bpf timeout map

2020-12-14 Thread Cong Wang

From: Cong Wang 

Add a test case in test_maps.c for timeout map, which focuses
on testing timeout.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 tools/testing/selftests/bpf/test_maps.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 0ad3e6305ff0..9fc7b5424fdf 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -1723,6 +1723,45 @@ static void test_reuseport_array(void)
close(map_fd);
 }
 
+static void test_timeout_map(void)
+{
+   int val1 = 1, val2 = 2, val3 = 3;
+   int key1 = 1, key2 = 2, key3 = 3;
+   int fd;
+
+   fd = bpf_create_map(BPF_MAP_TYPE_TIMEOUT_HASH, sizeof(int), sizeof(int),
+   3, map_flags);
+   if (fd < 0) {
+   printf("Failed to create timeout map '%s'!\n", strerror(errno));
+   exit(1);
+   }
+
+   /* Timeout after 1 secs */
+   assert(bpf_map_update_elem(fd, &key1, &val1, (u64)1<<32) == 0);
+   /* Timeout after 2 secs */
+   assert(bpf_map_update_elem(fd, &key2, &val2, (u64)2<<32) == 0);
+   /* Timeout after 10 secs */
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)10<<32) == 0);
+
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+   val2 = 0;
+   assert(bpf_map_lookup_elem(fd, &key2, &val2) == 0 && val2 == 2);
+
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+   assert(bpf_map_lookup_elem(fd, &key2, &val2) != 0);
+
+   /* Modify timeout to expire it earlier */
+   val3 = 0;
+   assert(bpf_map_lookup_elem(fd, &key3, &val3) == 0 && val3 == 3);
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)1<<32) == 0);
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key3, &val3) != 0);
+
+   close(fd);
+}
+
 static void run_all_tests(void)
 {
test_hashmap(0, NULL);
@@ -1752,6 +1791,8 @@ static void run_all_tests(void)
test_stackmap(0, NULL);
 
test_map_in_map();
+
+   test_timeout_map();
 }
 
 #define DEFINE_TEST(name) extern void test_##name(void);
-- 
2.25.1

[Patch bpf-next v2 5/5] selftests/bpf: add timeout map check in map_ptr tests

2020-12-14 Thread Cong Wang

From: Cong Wang 

Similar to regular hashmap test.

Cc: Andrey Ignatov 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 .../selftests/bpf/progs/map_ptr_kern.c| 20 +++
 1 file changed, 20 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/map_ptr_kern.c 
b/tools/testing/selftests/bpf/progs/map_ptr_kern.c
index 34f9880a1903..f158b4f7e6c8 100644
--- a/tools/testing/selftests/bpf/progs/map_ptr_kern.c
+++ b/tools/testing/selftests/bpf/progs/map_ptr_kern.c
@@ -648,6 +648,25 @@ static inline int check_ringbuf(void)
return 1;
 }
 
+struct {
+   __uint(type, BPF_MAP_TYPE_TIMEOUT_HASH);
+   __uint(max_entries, MAX_ENTRIES);
+   __type(key, __u32);
+   __type(value, __u32);
+} m_timeout SEC(".maps");
+
+static inline int check_timeout_hash(void)
+{
+   struct bpf_htab *timeout_hash = (struct bpf_htab *)&m_timeout;
+   struct bpf_map *map = (struct bpf_map *)&m_timeout;
+
+   VERIFY(check_default(&timeout_hash->map, map));
+   VERIFY(timeout_hash->n_buckets == MAX_ENTRIES);
+   VERIFY(timeout_hash->elem_size == 72);
+
+   return 1;
+}
+
 SEC("cgroup_skb/egress")
 int cg_skb(void *ctx)
 {
@@ -679,6 +698,7 @@ int cg_skb(void *ctx)
VERIFY_TYPE(BPF_MAP_TYPE_SK_STORAGE, check_sk_storage);
VERIFY_TYPE(BPF_MAP_TYPE_DEVMAP_HASH, check_devmap_hash);
VERIFY_TYPE(BPF_MAP_TYPE_RINGBUF, check_ringbuf);
+   VERIFY_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, check_timeout_hash);
 
return 1;
 }
-- 
2.25.1

[Patch bpf-next v2 2/5] bpf: introduce timeout map

2020-12-14 Thread Cong Wang

From: Cong Wang 

This borrows the idea from conntrack and will be used for conntrack in
bpf too. Each element in a timeout map has a user-specified timeout
in secs, after it expires it will be automatically removed from the map.

There are two cases here:

1. When the timeout map is idle, that is, no one updates or accesses it,
   we rely on the idle work to scan the whole hash table and remove
   these expired. The idle work is scheduled every 1 sec.

2. When the timeout map is actively accessed, we could reach expired
   elements before the idle work kicks in, we can simply skip them and
   schedule another work to do the actual removal work. We avoid taking
   locks on fast path.

The timeout of each element can be set or updated via bpf_map_update_elem()
and we reuse the upper 32-bit of the 64-bit flag for the timeout value.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 include/linux/bpf_types.h  |   1 +
 include/uapi/linux/bpf.h   |   3 +-
 kernel/bpf/hashtab.c   | 244 -
 kernel/bpf/syscall.c   |   3 +-
 tools/include/uapi/linux/bpf.h |   1 +
 5 files changed, 248 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 99f7fd657d87..00a3b17b6af2 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -125,6 +125,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, htab_timeout_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 30b477a26482..dedb47bc3f52 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -158,6 +158,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_RINGBUF,
BPF_MAP_TYPE_INODE_STORAGE,
BPF_MAP_TYPE_TASK_STORAGE,
+   BPF_MAP_TYPE_TIMEOUT_HASH,
 };
 
 /* Note that tracing related programs such as
@@ -393,7 +394,7 @@ enum bpf_link_type {
  */
 #define BPF_PSEUDO_CALL1
 
-/* flags for BPF_MAP_UPDATE_ELEM command */
+/* flags for BPF_MAP_UPDATE_ELEM command, upper 32 bits are timeout */
 enum {
BPF_ANY = 0, /* create new element or update existing */
BPF_NOEXIST = 1, /* create new element if it didn't exist */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index f0b7b54fa3a8..178cb376c397 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -8,6 +8,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include "percpu_freelist.h"
@@ -84,6 +86,8 @@ struct bucket {
raw_spinlock_t raw_lock;
spinlock_t lock;
};
+   struct llist_node gc_node;
+   atomic_t pending;
 };
 
 #define HASHTAB_MAP_LOCK_COUNT 8
@@ -104,6 +108,9 @@ struct bpf_htab {
u32 hashrnd;
struct lock_class_key lockdep_key;
int __percpu *map_locked[HASHTAB_MAP_LOCK_COUNT];
+   struct llist_head gc_list;
+   struct work_struct gc_work;
+   struct delayed_work gc_idle_work;
 };
 
 /* each htab element is struct htab_elem + key + value */
@@ -124,6 +131,7 @@ struct htab_elem {
struct bpf_lru_node lru_node;
};
u32 hash;
+   u64 expires;
char key[] __aligned(8);
 };
 
@@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
 
for (i = 0; i < htab->n_buckets; i++) {
INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
+   atomic_set(&htab->buckets[i].pending, 0);
if (htab_use_raw_lock(htab)) {
raw_spin_lock_init(&htab->buckets[i].raw_lock);
lockdep_set_class(&htab->buckets[i].raw_lock,
@@ -431,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
return 0;
 }
 
+static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
+{
+   if (atomic_fetch_or(1, &b->pending))
+   return;
+   llist_add(&b->gc_node, &htab->gc_list);
+   queue_work(system_unbound_wq, &htab->gc_work);
+}
+
 static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 {
bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
@@ -732,10 +749,13 @@ static bool htab_lru_map_delete_node(void *arg, struct 
bpf_lru_node *node)
 static int htab_map_get_next_key(struct bpf_map *map, void *key, void 
*next_key)
 {
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+   bool is_timeout = map->map_type == BPF_MAP_TYPE_TIMEOUT_HASH;
struct hlist_nulls_head *head;
struct htab_elem *l, *next_l;
u32 hash, key_size;
+   struct bucket *b;
int

[Patch bpf-next v2 3/5] selftests/bpf: update elem_size check in map ptr test

2020-12-14 Thread Cong Wang

From: Cong Wang 

In map ptr test, a hard-coded 64 is used to check hash element size.
Increase it to 72 as we increase the size of struct htab_elem. It
seems struct htab_elem is not visible here.

Cc: Andrey Ignatov 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 tools/testing/selftests/bpf/progs/map_ptr_kern.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/map_ptr_kern.c 
b/tools/testing/selftests/bpf/progs/map_ptr_kern.c
index d8850bc6a9f1..34f9880a1903 100644
--- a/tools/testing/selftests/bpf/progs/map_ptr_kern.c
+++ b/tools/testing/selftests/bpf/progs/map_ptr_kern.c
@@ -111,7 +111,7 @@ static inline int check_hash(void)
VERIFY(check_default_noinline(&hash->map, map));
 
VERIFY(hash->n_buckets == MAX_ENTRIES);
-   VERIFY(hash->elem_size == 64);
+   VERIFY(hash->elem_size == 72);
 
VERIFY(hash->count.counter == 0);
for (i = 0; i < HALF_ENTRIES; ++i) {
-- 
2.25.1

Re: [PATCH net-next v2 2/4] sch_htb: Hierarchical QoS hardware offload

2020-12-14 Thread Cong Wang

On Mon, Dec 14, 2020 at 7:13 AM Maxim Mikityanskiy  wrote:
>
> On 2020-12-11 21:16, Cong Wang wrote:
> > On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy  
> > wrote:
> >>
> >> HTB doesn't scale well because of contention on a single lock, and it
> >> also consumes CPU. This patch adds support for offloading HTB to
> >> hardware that supports hierarchical rate limiting.
> >>
> >> This solution addresses two main problems of scaling HTB:
> >>
> >> 1. Contention by flow classification. Currently the filters are attached
> >> to the HTB instance as follows:
> >
> > I do not think this is the reason, tcf_classify() has been called with RCU
> > only on the ingress side for a rather long time. What contentions are you
> > talking about here?
>
> When one attaches filters to HTB, tcf_classify is called from
> htb_classify, which is called from htb_enqueue, which is called with the
> root spinlock of the qdisc taken.

So it has nothing to do with tcf_classify() itself... :-/

[...]

> > And doesn't TBF already work with mq? I mean you can attach it as
> > a leaf to each mq so that the tree lock will not be shared either, but you'd
> > lose the benefits of a global rate limit too.
>
> Yes, I'd lose not only the global rate limit, but also multi-level
> hierarchical limits, which are all provided by this HTB offload - that's
> why TBF is not really a replacement for this feature.

Interesting, please explain how your HTB offload still has a global rate
limit and borrowing across queues? I simply can't see it, all I can see
is you offload HTB into each queue in ->attach(), where I assume the
hardware will do rate limit on each queue, if the hardware also has a
global control, why it is not reflected on the root qdisc?

Thanks!

Re: [Patch bpf-next 0/3] bpf: introduce timeout map

2020-12-14 Thread Cong Wang

On Sun, Dec 13, 2020 at 5:33 PM Andrey Ignatov  wrote:
>
> Cong Wang  [Sat, 2020-12-12 15:18 -0800]:
> > On Sat, Dec 12, 2020 at 2:25 PM Cong Wang  wrote:
> > >
> > > On Fri, Dec 11, 2020 at 11:55 AM Andrii Nakryiko
> > >  wrote:
> > > >
> > > > On Fri, Dec 11, 2020 at 2:28 AM Cong Wang  
> > > > wrote:
> > > > >
> > > > > From: Cong Wang 
> > > > >
> > > > > This patchset introduces a new bpf hash map which has timeout.
> > > > > Patch 1 is a preparation, patch 2 is the implementation of timeout
> > > > > map, patch 3 contains a test case for timeout map. Please check each
> > > > > patch description for more details.
> > > > >
> > > > > ---
> > > >
> > > > This patch set seems to be breaking existing selftests. Please take a
> > > > look ([0]).
> > >
> > > Interesting, looks unrelated to my patches but let me double check.
> >
> > Cc'ing Andrey...
> >
> > Looks like the failure is due to the addition of a new member to struct
> > htab_elem. Any reason why it is hard-coded as 64 in check_hash()?
> > And what's the point of verifying its size? htab_elem should be only
> > visible to the kernel itself.
> >
> > I can certainly change 64 to whatever its new size is, but I do wonder
> > why the test is there.
>
> Cong, the test is there to make sure that access to map pointers from
> BPF program works.
>
> Please see (41c48f3a9823 "bpf: Support access to bpf map fields") for
> more details on what "access to map pointer" means, but it's basically a
> way to access any field (e.g. max_entries) of common `struct bpf_map` or
> any type-specific struct like `struct bpf_htab` from BPF program, i.e.
> these structs are visible to not only kernel but also to BPF programs.

I see, I was not aware of this.

>
> The point of the test is to access a few fields from every map struct
> and make sure it works. Changing `struct htab_elem` indeed breaks the
> `VERIFY(hash->elem_size == 64);` check. But it can be easily updated
> (from 64 to whatever new size is) or replaced by some other field check.
> `htab->elem_size` was chosen semi-randomly since any bpf_htab-specific
> field would work for the test's purposes.

Good to know it is useful, I will have to change 64 to 72, as I tried to use
sizeof but struct htab_elem is not visible to that test.

>
> Hope it clarifies.
>
> Also since you add a new map type it would be great to cover it in
> tools/testing/selftests/bpf/progs/map_ptr_kern.c as well.

Yeah, will do.

Thanks.

Re: [Patch bpf-next 0/3] bpf: introduce timeout map

2020-12-12 Thread Cong Wang

On Sat, Dec 12, 2020 at 2:25 PM Cong Wang  wrote:
>
> On Fri, Dec 11, 2020 at 11:55 AM Andrii Nakryiko
>  wrote:
> >
> > On Fri, Dec 11, 2020 at 2:28 AM Cong Wang  wrote:
> > >
> > > From: Cong Wang 
> > >
> > > This patchset introduces a new bpf hash map which has timeout.
> > > Patch 1 is a preparation, patch 2 is the implementation of timeout
> > > map, patch 3 contains a test case for timeout map. Please check each
> > > patch description for more details.
> > >
> > > ---
> >
> > This patch set seems to be breaking existing selftests. Please take a
> > look ([0]).
>
> Interesting, looks unrelated to my patches but let me double check.

Cc'ing Andrey...

Looks like the failure is due to the addition of a new member to struct
htab_elem. Any reason why it is hard-coded as 64 in check_hash()?
And what's the point of verifying its size? htab_elem should be only
visible to the kernel itself.

I can certainly change 64 to whatever its new size is, but I do wonder
why the test is there.

Thanks.

Re: [Patch bpf-next 0/3] bpf: introduce timeout map

2020-12-12 Thread Cong Wang

On Fri, Dec 11, 2020 at 11:55 AM Andrii Nakryiko
 wrote:
>
> On Fri, Dec 11, 2020 at 2:28 AM Cong Wang  wrote:
> >
> > From: Cong Wang 
> >
> > This patchset introduces a new bpf hash map which has timeout.
> > Patch 1 is a preparation, patch 2 is the implementation of timeout
> > map, patch 3 contains a test case for timeout map. Please check each
> > patch description for more details.
> >
> > ---
>
> This patch set seems to be breaking existing selftests. Please take a
> look ([0]).

Interesting, looks unrelated to my patches but let me double check.

> Also patch #3 should have a commit message, even if pretty trivial one.

Yeah, I thought its subject is sufficient for a trivial patch.

Thanks.

Re: [PATCH net-next v2 2/4] sch_htb: Hierarchical QoS hardware offload

2020-12-11 Thread Cong Wang

On Fri, Dec 11, 2020 at 7:26 AM Maxim Mikityanskiy  wrote:
>
> HTB doesn't scale well because of contention on a single lock, and it
> also consumes CPU. This patch adds support for offloading HTB to
> hardware that supports hierarchical rate limiting.
>
> This solution addresses two main problems of scaling HTB:
>
> 1. Contention by flow classification. Currently the filters are attached
> to the HTB instance as follows:

I do not think this is the reason, tcf_classify() has been called with RCU
only on the ingress side for a rather long time. What contentions are you
talking about here?

>
> # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
> classid 1:10
>
> It's possible to move classification to clsact egress hook, which is
> thread-safe and lock-free:
>
> # tc filter add dev eth0 egress protocol ip flower dst_port 80
> action skbedit priority 1:10
>
> This way classification still happens in software, but the lock
> contention is eliminated, and it happens before selecting the TX queue,
> allowing the driver to translate the class to the corresponding hardware
> queue.

Sure, you can use clsact with HTB, or any combinations you like, but you
can't assume your HTB only works with clsact, can you?


>
> Note that this is already compatible with non-offloaded HTB and doesn't
> require changes to the kernel nor iproute2.
>
> 2. Contention by handling packets. HTB is not multi-queue, it attaches
> to a whole net device, and handling of all packets takes the same lock.
> When HTB is offloaded, its algorithm is done in hardware. HTB registers
> itself as a multi-queue qdisc, similarly to mq: HTB is attached to the
> netdev, and each queue has its own qdisc. The control flow is still done
> by HTB: it calls the driver via ndo_setup_tc to replicate the hierarchy
> of classes in the NIC. Leaf classes are presented by hardware queues.
> The data path works as follows: a packet is classified by clsact, the
> driver selects a hardware queue according to its class, and the packet
> is enqueued into this queue's qdisc.

I do _not_ read your code, from what you describe here, it sounds like
you just want a per-queue rate limit, instead of a global one. So why
bothering HTB whose goal is a global rate limit?

And doesn't TBF already work with mq? I mean you can attach it as
a leaf to each mq so that the tree lock will not be shared either, but you'd
lose the benefits of a global rate limit too. EDT does basically the same,
but it never claims to completely replace HTB. ;)

Thanks.

[Patch bpf-next 3/3] tools: add a test case for bpf timeout map

2020-12-10 Thread Cong Wang

From: Cong Wang 

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 tools/testing/selftests/bpf/test_maps.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 0ad3e6305ff0..9fc7b5424fdf 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -1723,6 +1723,45 @@ static void test_reuseport_array(void)
close(map_fd);
 }
 
+static void test_timeout_map(void)
+{
+   int val1 = 1, val2 = 2, val3 = 3;
+   int key1 = 1, key2 = 2, key3 = 3;
+   int fd;
+
+   fd = bpf_create_map(BPF_MAP_TYPE_TIMEOUT_HASH, sizeof(int), sizeof(int),
+   3, map_flags);
+   if (fd < 0) {
+   printf("Failed to create timeout map '%s'!\n", strerror(errno));
+   exit(1);
+   }
+
+   /* Timeout after 1 secs */
+   assert(bpf_map_update_elem(fd, &key1, &val1, (u64)1<<32) == 0);
+   /* Timeout after 2 secs */
+   assert(bpf_map_update_elem(fd, &key2, &val2, (u64)2<<32) == 0);
+   /* Timeout after 10 secs */
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)10<<32) == 0);
+
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+   val2 = 0;
+   assert(bpf_map_lookup_elem(fd, &key2, &val2) == 0 && val2 == 2);
+
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key1, &val1) != 0);
+   assert(bpf_map_lookup_elem(fd, &key2, &val2) != 0);
+
+   /* Modify timeout to expire it earlier */
+   val3 = 0;
+   assert(bpf_map_lookup_elem(fd, &key3, &val3) == 0 && val3 == 3);
+   assert(bpf_map_update_elem(fd, &key3, &val3, (u64)1<<32) == 0);
+   sleep(1);
+   assert(bpf_map_lookup_elem(fd, &key3, &val3) != 0);
+
+   close(fd);
+}
+
 static void run_all_tests(void)
 {
test_hashmap(0, NULL);
@@ -1752,6 +1791,8 @@ static void run_all_tests(void)
test_stackmap(0, NULL);
 
test_map_in_map();
+
+   test_timeout_map();
 }
 
 #define DEFINE_TEST(name) extern void test_##name(void);
-- 
2.25.1

[Patch bpf-next 1/3] bpf: use index instead of hash for map_locked[]

2020-12-10 Thread Cong Wang

From: Cong Wang 

Commit 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked")
introduced a percpu counter map_locked to bail out NMI case. It uses
the hash of each bucket for indexing, which requires callers of
htab_lock_bucket()/htab_unlock_bucket() to pass it in. But hash value
is not always available, especially when we traverse the whole hash
table where we do not have keys to compute the hash. We can just
compute the index of each bucket with its address and use index
instead.

This is a prerequisite for the following timeout map patch.

Cc: Song Liu 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 kernel/bpf/hashtab.c | 57 +++-
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 7e848200cd26..f0b7b54fa3a8 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -156,16 +156,17 @@ static void htab_init_buckets(struct bpf_htab *htab)
 }
 
 static inline int htab_lock_bucket(const struct bpf_htab *htab,
-  struct bucket *b, u32 hash,
-  unsigned long *pflags)
+  struct bucket *b, unsigned long *pflags)
 {
unsigned long flags;
+   unsigned int index;
 
-   hash = hash & HASHTAB_MAP_LOCK_MASK;
+   index = b - htab->buckets;
+   index &= HASHTAB_MAP_LOCK_MASK;
 
migrate_disable();
-   if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
-   __this_cpu_dec(*(htab->map_locked[hash]));
+   if (unlikely(__this_cpu_inc_return(*(htab->map_locked[index])) != 1)) {
+   __this_cpu_dec(*(htab->map_locked[index]));
migrate_enable();
return -EBUSY;
}
@@ -180,15 +181,17 @@ static inline int htab_lock_bucket(const struct bpf_htab 
*htab,
 }
 
 static inline void htab_unlock_bucket(const struct bpf_htab *htab,
- struct bucket *b, u32 hash,
- unsigned long flags)
+ struct bucket *b, unsigned long flags)
 {
-   hash = hash & HASHTAB_MAP_LOCK_MASK;
+   unsigned int index;
+
+   index = b - htab->buckets;
+   index &= HASHTAB_MAP_LOCK_MASK;
if (htab_use_raw_lock(htab))
raw_spin_unlock_irqrestore(&b->raw_lock, flags);
else
spin_unlock_irqrestore(&b->lock, flags);
-   __this_cpu_dec(*(htab->map_locked[hash]));
+   __this_cpu_dec(*(htab->map_locked[index]));
migrate_enable();
 }
 
@@ -710,7 +713,7 @@ static bool htab_lru_map_delete_node(void *arg, struct 
bpf_lru_node *node)
b = __select_bucket(htab, tgt_l->hash);
head = &b->head;
 
-   ret = htab_lock_bucket(htab, b, tgt_l->hash, &flags);
+   ret = htab_lock_bucket(htab, b, &flags);
if (ret)
return false;
 
@@ -720,7 +723,7 @@ static bool htab_lru_map_delete_node(void *arg, struct 
bpf_lru_node *node)
break;
}
 
-   htab_unlock_bucket(htab, b, tgt_l->hash, flags);
+   htab_unlock_bucket(htab, b, flags);
 
return l == tgt_l;
 }
@@ -1019,7 +1022,7 @@ static int htab_map_update_elem(struct bpf_map *map, void 
*key, void *value,
 */
}
 
-   ret = htab_lock_bucket(htab, b, hash, &flags);
+   ret = htab_lock_bucket(htab, b, &flags);
if (ret)
return ret;
 
@@ -1062,7 +1065,7 @@ static int htab_map_update_elem(struct bpf_map *map, void 
*key, void *value,
}
ret = 0;
 err:
-   htab_unlock_bucket(htab, b, hash, flags);
+   htab_unlock_bucket(htab, b, flags);
return ret;
 }
 
@@ -1100,7 +1103,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, 
void *key, void *value,
return -ENOMEM;
memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size);
 
-   ret = htab_lock_bucket(htab, b, hash, &flags);
+   ret = htab_lock_bucket(htab, b, &flags);
if (ret)
return ret;
 
@@ -1121,7 +1124,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, 
void *key, void *value,
ret = 0;
 
 err:
-   htab_unlock_bucket(htab, b, hash, flags);
+   htab_unlock_bucket(htab, b, flags);
 
if (ret)
bpf_lru_push_free(&htab->lru, &l_new->lru_node);
@@ -1156,7 +1159,7 @@ static int __htab_percpu_map_update_elem(struct bpf_map 
*map, void *key,
b = __select_bucket(htab, hash);
head = &b->head;
 
-   ret = htab_lock_bucket(htab, b, hash, &flags);
+   ret = htab_lock_bucket(htab, b, &flags);
if (ret)
return ret;
 
@@ -1181,7 +1184,7 @@ static int _

[Patch bpf-next 0/3] bpf: introduce timeout map

2020-12-10 Thread Cong Wang

From: Cong Wang 

This patchset introduces a new bpf hash map which has timeout.
Patch 1 is a preparation, patch 2 is the implementation of timeout
map, patch 3 contains a test case for timeout map. Please check each
patch description for more details.

---
Cong Wang (3):
  bpf: use index instead of hash for map_locked[]
  bpf: introduce timeout map
  tools: add a test case for bpf timeout map

 include/linux/bpf_types.h   |   1 +
 include/uapi/linux/bpf.h|   3 +-
 kernel/bpf/hashtab.c| 296 +---
 kernel/bpf/syscall.c|   3 +-
 tools/include/uapi/linux/bpf.h  |   1 +
 tools/testing/selftests/bpf/test_maps.c |  41 
 6 files changed, 314 insertions(+), 31 deletions(-)

-- 
2.25.1

[Patch bpf-next 2/3] bpf: introduce timeout map

2020-12-10 Thread Cong Wang

From: Cong Wang 

This borrows the idea from conntrack and will be used for conntrack in
bpf too. Each element in a timeout map has a user-specified timeout
in secs, after it expires it will be automatically removed from the map.

There are two cases here:

1. When the timeout map is idle, that is, no one updates or accesses it,
   we rely on the idle work to scan the whole hash table and remove
   these expired. The idle work is scheduled every 1 sec.

2. When the timeout map is actively accessed, we could reach expired
   elements before the idle work kicks in, we can simply skip them and
   schedule another work to do the actual removal work. We avoid taking
   locks on fast path.

The timeout of each element can be set or updated via bpf_map_update_elem()
and we reuse the upper 32-bit of the 64-bit flag for the timeout value.

Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Dongdong Wang 
Signed-off-by: Cong Wang 
---
 include/linux/bpf_types.h  |   1 +
 include/uapi/linux/bpf.h   |   3 +-
 kernel/bpf/hashtab.c   | 239 -
 kernel/bpf/syscall.c   |   3 +-
 tools/include/uapi/linux/bpf.h |   1 +
 5 files changed, 243 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 99f7fd657d87..901c4e01d726 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -90,6 +90,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, 
cgroup_storage_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_TIMEOUT_HASH, htab_timeout_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_HASH, htab_lru_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_PERCPU_HASH, htab_lru_percpu_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 30b477a26482..dedb47bc3f52 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -158,6 +158,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_RINGBUF,
BPF_MAP_TYPE_INODE_STORAGE,
BPF_MAP_TYPE_TASK_STORAGE,
+   BPF_MAP_TYPE_TIMEOUT_HASH,
 };
 
 /* Note that tracing related programs such as
@@ -393,7 +394,7 @@ enum bpf_link_type {
  */
 #define BPF_PSEUDO_CALL1
 
-/* flags for BPF_MAP_UPDATE_ELEM command */
+/* flags for BPF_MAP_UPDATE_ELEM command, upper 32 bits are timeout */
 enum {
BPF_ANY = 0, /* create new element or update existing */
BPF_NOEXIST = 1, /* create new element if it didn't exist */
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index f0b7b54fa3a8..1f2bad82d52b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -8,6 +8,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include "percpu_freelist.h"
@@ -84,6 +86,8 @@ struct bucket {
raw_spinlock_t raw_lock;
spinlock_t lock;
};
+   struct llist_node gc_node;
+   atomic_t pending;
 };
 
 #define HASHTAB_MAP_LOCK_COUNT 8
@@ -104,6 +108,9 @@ struct bpf_htab {
u32 hashrnd;
struct lock_class_key lockdep_key;
int __percpu *map_locked[HASHTAB_MAP_LOCK_COUNT];
+   struct llist_head gc_list;
+   struct work_struct gc_work;
+   struct delayed_work gc_idle_work;
 };
 
 /* each htab element is struct htab_elem + key + value */
@@ -124,6 +131,7 @@ struct htab_elem {
struct bpf_lru_node lru_node;
};
u32 hash;
+   u64 expires;
char key[] __aligned(8);
 };
 
@@ -143,6 +151,7 @@ static void htab_init_buckets(struct bpf_htab *htab)
 
for (i = 0; i < htab->n_buckets; i++) {
INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
+   atomic_set(&htab->buckets[i].pending, 0);
if (htab_use_raw_lock(htab)) {
raw_spin_lock_init(&htab->buckets[i].raw_lock);
lockdep_set_class(&htab->buckets[i].raw_lock,
@@ -431,12 +440,24 @@ static int htab_map_alloc_check(union bpf_attr *attr)
return 0;
 }
 
+static void htab_sched_gc(struct bpf_htab *htab, struct bucket *b)
+{
+   if (atomic_fetch_or(1, &b->pending))
+   return;
+   llist_add(&b->gc_node, &htab->gc_list);
+   queue_work(system_unbound_wq, &htab->gc_work);
+}
+
+static void htab_gc(struct work_struct *work);
+static void htab_gc_idle(struct work_struct *work);
+
 static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 {
bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
   attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
bool lru = (attr->map_type == BPF_MAP_TYPE_LRU_HASH ||
attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH);
+   bool timeout = attr->map_type == BPF_MAP_TY

[Patch net v2 2/2] lwt_bpf: replace preempt_disable() with migrate_disable()

2020-12-05 Thread Cong Wang

From: Cong Wang 

migrate_disable() is just a wrapper for preempt_disable() in
non-RT kernel. It is safe to replace it, and RT kernel will
benefit.

Note that it is introduced since Feb 2020.

Suggested-by: Alexei Starovoitov 
Signed-off-by: Cong Wang 
---
 net/core/lwt_bpf.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 4f3cb7c15ddf..2f7940bcf715 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -39,10 +39,10 @@ static int run_lwt_bpf(struct sk_buff *skb, struct 
bpf_lwt_prog *lwt,
 {
int ret;
 
-   /* Preempt disable and BH disable are needed to protect per-cpu
+   /* Migration disable and BH disable are needed to protect per-cpu
 * redirect_info between BPF prog and skb_do_redirect().
 */
-   preempt_disable();
+   migrate_disable();
local_bh_disable();
bpf_compute_data_pointers(skb);
ret = bpf_prog_run_save_cb(lwt->prog, skb);
@@ -78,7 +78,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct 
bpf_lwt_prog *lwt,
}
 
local_bh_enable();
-   preempt_enable();
+   migrate_enable();
 
return ret;
 }
-- 
2.25.1

[Patch net v2 1/2] lwt: disable BH too in run_lwt_bpf()

2020-12-05 Thread Cong Wang

From: Dongdong Wang 

The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
and BPF redirect helpers. Callers on RX path are all in BH context,
disabling preemption is not sufficient to prevent BH interruption.

In production, we observed strange packet drops because of the race
condition between LWT xmit and TC ingress, and we verified this issue
is fixed after we disable BH.

Although this bug was technically introduced from the beginning, that
is commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
So this patch may not work well before RCU flavor consolidation has been
completed around v5.0.

Update the comments above the code too, as call_rcu() is now BH friendly.

Cc: Thomas Graf 
Cc: Alexei Starovoitov 
Reviewed-by: Cong Wang 
Signed-off-by: Dongdong Wang 
---
 net/core/lwt_bpf.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 7d3438215f32..4f3cb7c15ddf 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -39,12 +39,11 @@ static int run_lwt_bpf(struct sk_buff *skb, struct 
bpf_lwt_prog *lwt,
 {
int ret;
 
-   /* Preempt disable is needed to protect per-cpu redirect_info between
-* BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
-* access to maps strictly require a rcu_read_lock() for protection,
-* mixing with BH RCU lock doesn't work.
+   /* Preempt disable and BH disable are needed to protect per-cpu
+* redirect_info between BPF prog and skb_do_redirect().
 */
preempt_disable();
+   local_bh_disable();
bpf_compute_data_pointers(skb);
ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -78,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct 
bpf_lwt_prog *lwt,
break;
}
 
+   local_bh_enable();
preempt_enable();
 
return ret;
-- 
2.25.1

Re: [PATCH net] net/sched: fq_pie: initialize timer earlier in fq_pie_init()

2020-12-04 Thread Cong Wang

On Thu, Dec 3, 2020 at 10:41 AM Davide Caratti  wrote:
>
> with the following tdc testcase:
>
>  83be: (qdisc, fq_pie) Create FQ-PIE with invalid number of flows
>
> as fq_pie_init() fails, fq_pie_destroy() is called to clean up. Since the
> timer is not yet initialized, it's possible to observe a splat like this:
...
> fix it moving timer_setup() before any failure, like it was done on 'red'
> with former commit 608b4adab178 ("net_sched: initialize timer earlier in
> red_init()").
>
> Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
> Signed-off-by: Davide Caratti 

Reviewed-by: Cong Wang

Re: [Patch net] lwt: disable BH too in run_lwt_bpf()

2020-12-03 Thread Cong Wang

On Thu, Dec 3, 2020 at 10:30 AM Alexei Starovoitov
 wrote:
>
> On Thu, Dec 3, 2020 at 10:28 AM Cong Wang  wrote:
> >
> > On Thu, Dec 3, 2020 at 10:22 AM Alexei Starovoitov
> >  wrote:
> > >
> > > I guess my previous comment could be misinterpreted.
> > > Cong,
> > > please respin with changing preempt_disable to migrate_disable
> > > and adding local_bh_disable.
> >
> > I have no objection, just want to point out migrate_disable() may
> > not exist if we backport this further to -stable, this helper was
> > introduced in Feb 2020.
>
> I see. Then please split it into two patches for ease of backporting.

You mean the first patch does the same as this patch and the second
patch replaces preempt_disable() with migrate_disable(). Right?

Thanks.

Re: [Patch net] lwt: disable BH too in run_lwt_bpf()

2020-12-03 Thread Cong Wang

On Thu, Dec 3, 2020 at 10:22 AM Alexei Starovoitov
 wrote:
>
> I guess my previous comment could be misinterpreted.
> Cong,
> please respin with changing preempt_disable to migrate_disable
> and adding local_bh_disable.

I have no objection, just want to point out migrate_disable() may
not exist if we backport this further to -stable, this helper was
introduced in Feb 2020.

Thanks.

Re: [Patch net] lwt: disable BH too in run_lwt_bpf()

2020-12-02 Thread Cong Wang

On Wed, Dec 2, 2020 at 5:10 PM Jakub Kicinski  wrote:
>
> On Tue,  1 Dec 2020 11:44:38 -0800 Cong Wang wrote:
> > From: Dongdong Wang 
> >
> > The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
> > and BPF redirect helpers. Callers on RX path are all in BH context,
> > disabling preemption is not sufficient to prevent BH interruption.
> >
> > In production, we observed strange packet drops because of the race
> > condition between LWT xmit and TC ingress, and we verified this issue
> > is fixed after we disable BH.
> >
> > Although this bug was technically introduced from the beginning, that
> > is commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
> > at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
> > So this patch may not work well before RCU flavor consolidation has been
> > completed around v5.0.
> >
> > Update the comments above the code too, as call_rcu() is now BH friendly.
> >
> > Cc: Thomas Graf 
> > Cc: b...@vger.kernel.org
> > Reviewed-by: Cong Wang 
> > Signed-off-by: Dongdong Wang 
> > ---
> >  net/core/lwt_bpf.c | 8 
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> > index 7d3438215f32..4f3cb7c15ddf 100644
> > --- a/net/core/lwt_bpf.c
> > +++ b/net/core/lwt_bpf.c
> > @@ -39,12 +39,11 @@ static int run_lwt_bpf(struct sk_buff *skb, struct 
> > bpf_lwt_prog *lwt,
> >  {
> >   int ret;
> >
> > - /* Preempt disable is needed to protect per-cpu redirect_info between
> > -  * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > -  * access to maps strictly require a rcu_read_lock() for protection,
> > -  * mixing with BH RCU lock doesn't work.
> > + /* Preempt disable and BH disable are needed to protect per-cpu
> > +  * redirect_info between BPF prog and skb_do_redirect().
> >*/
> >   preempt_disable();
> > + local_bh_disable();
>
> Why not remove the preempt_disable()? Disabling BH must also disable
> preemption AFAIK.

It seems RT kernel still needs preempt disable:
https://www.spinics.net/lists/kernel/msg3710124.html
but my RT knowledge is not sufficient to tell. So I just follow the
same pattern
in x86 FPU (as of today):

static inline void fpregs_lock(void)
{
preempt_disable();
local_bh_disable();
}

static inline void fpregs_unlock(void)
{
local_bh_enable();
preempt_enable();
}

There are other similar patterns in the current code base, so if this
needs a clean up, RT people can clean up them all together.

Thanks.

[Patch net] lwt: disable BH too in run_lwt_bpf()

2020-12-01 Thread Cong Wang

From: Dongdong Wang 

The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
and BPF redirect helpers. Callers on RX path are all in BH context,
disabling preemption is not sufficient to prevent BH interruption.

In production, we observed strange packet drops because of the race
condition between LWT xmit and TC ingress, and we verified this issue
is fixed after we disable BH.

Although this bug was technically introduced from the beginning, that
is commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
So this patch may not work well before RCU flavor consolidation has been
completed around v5.0.

Update the comments above the code too, as call_rcu() is now BH friendly.

Cc: Thomas Graf 
Cc: b...@vger.kernel.org
Reviewed-by: Cong Wang 
Signed-off-by: Dongdong Wang 
---
 net/core/lwt_bpf.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 7d3438215f32..4f3cb7c15ddf 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -39,12 +39,11 @@ static int run_lwt_bpf(struct sk_buff *skb, struct 
bpf_lwt_prog *lwt,
 {
int ret;
 
-   /* Preempt disable is needed to protect per-cpu redirect_info between
-* BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
-* access to maps strictly require a rcu_read_lock() for protection,
-* mixing with BH RCU lock doesn't work.
+   /* Preempt disable and BH disable are needed to protect per-cpu
+* redirect_info between BPF prog and skb_do_redirect().
 */
preempt_disable();
+   local_bh_disable();
bpf_compute_data_pointers(skb);
ret = bpf_prog_run_save_cb(lwt->prog, skb);
 
@@ -78,6 +77,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct 
bpf_lwt_prog *lwt,
break;
}
 
+   local_bh_enable();
preempt_enable();
 
return ret;
-- 
2.25.1

Re: [PATCH v4 net-next 3/3] net/sched: sch_frag: add generic packet fragment support.

2020-11-25 Thread Cong Wang

On Wed, Nov 25, 2020 at 12:04 PM Marcelo Ricardo Leitner
 wrote:
>
> On Wed, Nov 25, 2020 at 12:01:23PM +0800, we...@ucloud.cn wrote:
> > From: wenxu 
> >
> > Currently kernel tc subsystem can do conntrack in cat_ct. But when several
>typo ^^^
>
> > fragment packets go through the act_ct, function tcf_ct_handle_fragments
> > will defrag the packets to a big one. But the last action will redirect
> > mirred to a device which maybe lead the reassembly big packet over the mtu
> > of target device.
> >
> > This patch add support for a xmit hook to mirred, that gets executed before
> > xmiting the packet. Then, when act_ct gets loaded, it configs that hook.
> > The frag xmit hook maybe reused by other modules.
>
> (I'm back from PTO)
>
> This paragraph was kept from previous version and now although it can
> match the current implementation, it's a somewhat forced
> understanding. So what about:
> """
> This patch adds support for a xmit hook to mirred, that gets executed
> before xmiting the packet. Then, when act_ct gets loaded, it enables
> such hook.
> The hook may also be enabled by other modules.
> """
>
> Rationale is to not give room for the understanding that the hook is
> configurable (i.e., replaceable with something else), to cope with v4
> changes.
>
> >
> > Signed-off-by: wenxu 
> > ---
> > v2: make act_frag just buildin for tc core but not a module
> > return an error code from tcf_fragment
> > depends on INET for ip_do_fragment
> > v3: put the whole sch_frag.c under CONFIG_INET
>
> I was reading on past discussions that led to this and I miss one
> point of discussion. Cong had mentioned that as this is a must have
> for act_ct, that we should get rid of user visible Kconfigs for it
> (which makes sense).  v3 then removed the Kconfig entirely.
> My question then is: shouldn't it have an *invisible* Kconfig instead?
> As is, sch_frag will be always enabled, regardless of having act_ct
> enabled or not.
>
> I don't think it's worth tiying this to act_ct itself, as I think
> other actions can do defrag later on or so. So I'm thinking act_ct
> could select this other Kconfig, that depends on INET, and use it to
> enable/disable building this code.

Yeah, I think compiler is able to compile out unused code with LTO, so
probably not a big deal here.

Thanks.

Re: [PATCH v4 net-next 3/3] net/sched: sch_frag: add generic packet fragment support.

2020-11-25 Thread Cong Wang

On Wed, Nov 25, 2020 at 11:11 AM Jakub Kicinski  wrote:
>
> On Wed, 25 Nov 2020 12:01:23 +0800 we...@ucloud.cn wrote:
> > From: wenxu 
> >
> > Currently kernel tc subsystem can do conntrack in cat_ct. But when several
> > fragment packets go through the act_ct, function tcf_ct_handle_fragments
> > will defrag the packets to a big one. But the last action will redirect
> > mirred to a device which maybe lead the reassembly big packet over the mtu
> > of target device.
> >
> > This patch add support for a xmit hook to mirred, that gets executed before
> > xmiting the packet. Then, when act_ct gets loaded, it configs that hook.
> > The frag xmit hook maybe reused by other modules.
> >
> > Signed-off-by: wenxu 
>
> LGMT. Cong, Jamal still fine by you guys?

Yes, I do not look much into detail, but overall it definitely looks good.
This is targeting net-next, so it is fine to fix anything we miss later.

Acked-by: Cong Wang 

Thanks.

[Patch stable] netfilter: clear skb->next in NF_HOOK_LIST()

2020-11-20 Thread Cong Wang

From: Cong Wang 

NF_HOOK_LIST() uses list_del() to remove skb from the linked list,
however, it is not sufficient as skb->next still points to other
skb. We should just call skb_list_del_init() to clear skb->next,
like the rest places which using skb list.

This has been fixed in upstream by commit ca58fbe06c54
("netfilter: add and use nf_hook_slow_list()").

Fixes: 9f17dbf04ddf ("netfilter: fix use-after-free in NF_HOOK_LIST")
Reported-by: li...@knownsec.com
Tested-by: li...@knownsec.com
Cc: Florian Westphal 
Cc: Edward Cree 
Cc: sta...@vger.kernel.org # between 4.19 and 5.4
Cc: Greg Kroah-Hartman 
Signed-off-by: Cong Wang 
---
 include/linux/netfilter.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 77ebb61faf48..4c0e6539effd 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -316,7 +316,7 @@ NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net 
*net, struct sock *sk,
 
INIT_LIST_HEAD(&sublist);
list_for_each_entry_safe(skb, next, head, list) {
-   list_del(&skb->list);
+   skb_list_del_init(skb);
if (nf_hook(pf, hook, net, sk, skb, in, out, okfn) == 1)
list_add_tail(&skb->list, &sublist);
}
-- 
2.25.1

Re: [PATCH v3 net-next 0/3] net/sched: fix over mtu packet of defrag in

2020-11-20 Thread Cong Wang

On Thu, Nov 19, 2020 at 3:39 PM  wrote:
>
> From: wenxu 
>
> Currently kernel tc subsystem can do conntrack in act_ct. But when several
> fragment packets go through the act_ct, function tcf_ct_handle_fragments
> will defrag the packets to a big one. But the last action will redirect
> mirred to a device which maybe lead the reassembly big packet over the mtu
> of target device.
>
> The first patch fix miss init the qdisc_skb_cb->mru
> The send one refactor the hanle of xmit in act_mirred and prepare for the
> third one
> The last one add implict packet fragment support to fix the over mtu for
> defrag in act_ct.

Overall it looks much better to me now, so:

Acked-by: Cong Wang 

Thanks!

Re: [PATCH v2 net-next 3/3] net/sched: sch_frag: add generic packet fragment support.

2020-11-19 Thread Cong Wang

On Wed, Nov 18, 2020 at 12:04 AM wenxu  wrote:
>
>
> 在 2020/11/18 15:00, Cong Wang 写道:
> > On Tue, Nov 17, 2020 at 5:37 PM  wrote:
> >> From: wenxu 
> >>
> >> Currently kernel tc subsystem can do conntrack in cat_ct. But when several
> >> fragment packets go through the act_ct, function tcf_ct_handle_fragments
> >> will defrag the packets to a big one. But the last action will redirect
> >> mirred to a device which maybe lead the reassembly big packet over the mtu
> >> of target device.
> >>
> >> This patch add support for a xmit hook to mirred, that gets executed before
> >> xmiting the packet. Then, when act_ct gets loaded, it configs that hook.
> >> The frag xmit hook maybe reused by other modules.
> >>
> >> Signed-off-by: wenxu 
> >> ---
> >> v2: make act_frag just buildin for tc core but not a module
> >> return an error code from tcf_fragment
> >> depends on INET for ip_do_fragment
> > Much better now.
> >
> >
> >> +#ifdef CONFIG_INET
> >> +   ret = ip_do_fragment(net, skb->sk, skb, sch_frag_xmit);
> >> +#endif
> >
> > Doesn't the whole sch_frag need to be put under CONFIG_INET?
> > I don't think fragmentation could work without CONFIG_INET.
>
> I have already test with this. Without CONFIG_INET it is work.
>
> And only the ip_do_fragment depends on CONFIG_INET

Passing the compiler test is not what I meant. The whole ipv4/ directory
is under CONFIG_INET:

 obj-$(CONFIG_INET)  += ipv4/

Without it, what code does the fragmentation? I suggest you to just put
the sch_frag in this way:

obj-$(CONFIG_INET)  += sch_frag.o

(and remove the #ifdef CONFIG_INET within it.)

Thanks.

Re: [PATCH v2 net-next 3/3] net/sched: sch_frag: add generic packet fragment support.

2020-11-17 Thread Cong Wang

On Tue, Nov 17, 2020 at 5:37 PM  wrote:
>
> From: wenxu 
>
> Currently kernel tc subsystem can do conntrack in cat_ct. But when several
> fragment packets go through the act_ct, function tcf_ct_handle_fragments
> will defrag the packets to a big one. But the last action will redirect
> mirred to a device which maybe lead the reassembly big packet over the mtu
> of target device.
>
> This patch add support for a xmit hook to mirred, that gets executed before
> xmiting the packet. Then, when act_ct gets loaded, it configs that hook.
> The frag xmit hook maybe reused by other modules.
>
> Signed-off-by: wenxu 
> ---
> v2: make act_frag just buildin for tc core but not a module
> return an error code from tcf_fragment
> depends on INET for ip_do_fragment

Much better now.


> +#ifdef CONFIG_INET
> +   ret = ip_do_fragment(net, skb->sk, skb, sch_frag_xmit);
> +#endif


Doesn't the whole sch_frag need to be put under CONFIG_INET?
I don't think fragmentation could work without CONFIG_INET.

Thanks.

Re: [PATCH v10 net-next 3/3] net/sched: act_frag: add implict packet fragment support.

2020-11-17 Thread Cong Wang

On Mon, Nov 16, 2020 at 8:06 PM wenxu  wrote:
>
>
> On 11/17/2020 3:01 AM, Cong Wang wrote:
> > On Sun, Nov 15, 2020 at 5:06 AM wenxu  wrote:
> >>
> >> 在 2020/11/15 2:05, Cong Wang 写道:
> >>> On Wed, Nov 11, 2020 at 9:44 PM  wrote:
> >>>> diff --git a/net/sched/act_frag.c b/net/sched/act_frag.c
> >>>> new file mode 100644
> >>>> index 000..3a7ab92
> >>>> --- /dev/null
> >>>> +++ b/net/sched/act_frag.c
> >>> It is kinda confusing to see this is a module. It provides some
> >>> wrappers and hooks the dev_xmit_queue(), it belongs more to
> >>> the core tc code than any modularized code. How about putting
> >>> this into net/sched/sch_generic.c?
> >>>
> >>> Thanks.
> >> All the operations in the act_frag  are single L3 action.
> >>
> >> So we put in a single module. to keep it as isolated/contained as possible
> > Yeah, but you hook dev_queue_xmit() which is L2.
> >
> >> Maybe put this in a single file is better than a module? Buildin in the tc 
> >> core code or not.
> >>
> >> Enable this feature in Kconifg with NET_ACT_FRAG?
> > Sort of... If this is not an optional feature, that is a must-have
> > feature for act_ct,
> > we should just get rid of this Kconfig.
> >
> > Also, you need to depend on CONFIG_INET somewhere to use the IP
> > fragment, no?
> >
> > Thanks.
>
> Maybe the act_frag should rename to sch_frag and buildin kernel.

sch_frag still sounds like a module. ;) This is why I proposed putting
it into sch_generic.c.

>
> This fcuntion can be used for all tc subsystem. There is no need for
>
> CONFIG_INET. The sched system depends on NET.

CONFIG_INET is different from CONFIG_NET, right?

Thanks.

Re: [PATCH v10 net-next 3/3] net/sched: act_frag: add implict packet fragment support.

2020-11-16 Thread Cong Wang

On Sun, Nov 15, 2020 at 5:06 AM wenxu  wrote:
>
>
> 在 2020/11/15 2:05, Cong Wang 写道:
> > On Wed, Nov 11, 2020 at 9:44 PM  wrote:
> >> diff --git a/net/sched/act_frag.c b/net/sched/act_frag.c
> >> new file mode 100644
> >> index 000..3a7ab92
> >> --- /dev/null
> >> +++ b/net/sched/act_frag.c
> > It is kinda confusing to see this is a module. It provides some
> > wrappers and hooks the dev_xmit_queue(), it belongs more to
> > the core tc code than any modularized code. How about putting
> > this into net/sched/sch_generic.c?
> >
> > Thanks.
>
> All the operations in the act_frag  are single L3 action.
>
> So we put in a single module. to keep it as isolated/contained as possible

Yeah, but you hook dev_queue_xmit() which is L2.

>
> Maybe put this in a single file is better than a module? Buildin in the tc 
> core code or not.
>
> Enable this feature in Kconifg with NET_ACT_FRAG?

Sort of... If this is not an optional feature, that is a must-have
feature for act_ct,
we should just get rid of this Kconfig.

Also, you need to depend on CONFIG_INET somewhere to use the IP
fragment, no?

Thanks.

Re: [PATCH v10 net-next 3/3] net/sched: act_frag: add implict packet fragment support.

2020-11-16 Thread Cong Wang

On Sat, Nov 14, 2020 at 2:46 PM Marcelo Ricardo Leitner
 wrote:
> Davide had shared similar concerns with regards of the new module too.
> The main idea behind the new module was to keep it as
> isolated/contained as possible, and only so. So thumbs up from my
> side.
>
> To be clear, you're only talking about the module itself, right? It
> would still need to have the Kconfig to enable this feature, or not?

Both. The code itself doesn't look like a module, and it doesn't
look like an optional feature for act_ct either, does it? If not, there is
no need to have a user visible Kconfig, we just select it, or no Kconfig
at all.

Thanks.

Re: [PATCH v10 net-next 3/3] net/sched: act_frag: add implict packet fragment support.

2020-11-14 Thread Cong Wang

On Wed, Nov 11, 2020 at 9:44 PM  wrote:
> diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
> index 9c79fb9..dff3c40 100644
> --- a/net/sched/act_ct.c
> +++ b/net/sched/act_ct.c
> @@ -1541,8 +1541,14 @@ static int __init ct_init_module(void)
> if (err)
> goto err_register;
>
> +   err = tcf_set_xmit_hook(tcf_frag_xmit_hook);

Yeah, this approach is certainly much better than extending act_mirred.
Just one comment below.

> diff --git a/net/sched/act_frag.c b/net/sched/act_frag.c
> new file mode 100644
> index 000..3a7ab92
> --- /dev/null
> +++ b/net/sched/act_frag.c

It is kinda confusing to see this is a module. It provides some
wrappers and hooks the dev_xmit_queue(), it belongs more to
the core tc code than any modularized code. How about putting
this into net/sched/sch_generic.c?

Thanks.

Re: some question about "/sys/class/net//operstate"

2020-11-11 Thread Cong Wang

On Tue, Nov 10, 2020 at 8:32 PM 杜英杰  wrote:
>
> I want to use inotify to monitor /sys/class/net//operstate  to detect status 
> of a iface in real time.
> when I ifdown && ifup eth3, the content of operstate changed, but the 
> file's Modify time didn't change.
> I don't know the reason, is there any file which can be monitored by inotify 
> to get iface status in real time?
> Much appreciation for any advice!

You need to listen to netdev netlink messages for changes like
this. These messages are generated in real-time.

Thanks.

Re: [PATCH net-next v2] net: sched: implement action-specific terse dump

2020-11-04 Thread Cong Wang

On Wed, Nov 4, 2020 at 4:39 PM Jakub Kicinski  wrote:
>
> On Mon,  2 Nov 2020 22:12:43 +0200 Vlad Buslov wrote:
> > Allow user to request action terse dump with new flag value
> > TCA_FLAG_TERSE_DUMP. Only output essential action info in terse dump (kind,
> > stats, index and cookie, if set by the user when creating the action). This
> > is different from filter terse dump where index is excluded (filter can be
> > identified by its own handle).
> >
> > Move tcf_action_dump_terse() function to the beginning of source file in
> > order to call it from tcf_dump_walker().
> >
> > Signed-off-by: Vlad Buslov 
> > Suggested-by: Jamal Hadi Salim 
>
> Jiri, Cong, can I get an ack?
>
> The previous terse dump made sense because it fulfilled the need of
> an important user (OvS). IDK if this is as clear-cut, and I haven't
> followed the iproute2 thread closely enough, so please weigh in.

Like I said in the previous discussion, I am not a fan of terse dump,
but before we have a better solution here, using this flag is probably
the best we have on the table, so at least for a temporary solution:

Acked-by: Cong Wang 

Thanks.

< 1 2 3 4 5 6 7 8 9 10 >

301 - 400 of 2972 matches

Mail list logo