Re: [PATCHv8 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue

2021-04-20 Thread Martin KaFai Lau
On Tue, Apr 20, 2021 at 10:56:56PM +0200, Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau  writes:
> 
> > On Thu, Apr 15, 2021 at 09:53:17PM +0800, Hangbin Liu wrote:
> >> From: Jesper Dangaard Brouer 
> >> 
> >> This changes the devmap XDP program support to run the program when the
> >> bulk queue is flushed instead of before the frame is enqueued. This has
> >> a couple of benefits:
> >> 
> >> - It "sorts" the packets by destination devmap entry, and then runs the
> >>   same BPF program on all the packets in sequence. This ensures that we
> >>   keep the XDP program and destination device properties hot in I-cache.
> >> 
> >> - It makes the multicast implementation simpler because it can just
> >>   enqueue packets using bq_enqueue() without having to deal with the
> >>   devmap program at all.
> >> 
> >> The drawback is that if the devmap program drops the packet, the enqueue
> >> step is redundant. However, arguably this is mostly visible in a
> >> micro-benchmark, and with more mixed traffic the I-cache benefit should
> >> win out. The performance impact of just this patch is as follows:
> >> 
> >> When bq_xmit_all() is called from bq_enqueue(), another packet will
> >> always be enqueued immediately after, so clearing dev_rx, xdp_prog and
> >> flush_node in bq_xmit_all() is redundant. Move the clear to __dev_flush(),
> >> and only check them once in bq_enqueue() since they are all modified
> >> together.
> 
> (side note, while we're modifying the commit message, this paragraph
> should probably be moved to the end)
> 
> >> Using a 10Gb i40e NIC, doing XDP_DROP on the veth peer, with
> >> xdp_redirect_map in samples/bpf, send pkts via the pktgen cmd:
> >> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64
> >> 
> >> There is about +/- 0.1M of deviation in the native tests; the performance
> >> improved for the base case, but dropped back somewhat with an XDP devmap
> >> prog attached.
> >> 
> >> Version          | Test                          | Generic | Native | Native + 2nd xdp_prog
> >> 5.12 rc4         | xdp_redirect_map   i40e->i40e |    1.9M |   9.6M |  8.4M
> >> 5.12 rc4         | xdp_redirect_map   i40e->veth |    1.7M |  11.7M |  9.8M
> >> 5.12 rc4 + patch | xdp_redirect_map   i40e->i40e |    1.9M |   9.8M |  8.0M
> >> 5.12 rc4 + patch | xdp_redirect_map   i40e->veth |    1.7M |  12.0M |  9.4M
> > Based on the discussion in v7, a summary of what still needs to be
> > addressed will be useful.
> 
> That's fair. How about we add a paragraph like this (below the one I
> just suggested above that we move to the end):
> 
> This change also has the side effect of extending the lifetime of the
> RCU-protected xdp_prog that lives inside the devmap entries: Instead of
> just living for the duration of the XDP program invocation, the
> reference now lives all the way until the bq is flushed. This is safe
> because the bq flush happens at the end of the NAPI poll loop, so
> everything happens between a local_bh_disable()/local_bh_enable() pair.
> However, this is by no means obvious from looking at the call sites; in
> particular, some drivers have an additional rcu_read_lock() around only
> the XDP program invocation, which only confuses matters further.
> Clearing this up will be done in a separate patch series.
lgtm
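
To make the lifetime argument above concrete, here is a minimal sketch of
the pattern that paragraph describes. The driver helpers example_rx_pending()
and example_process_rx_frame() are hypothetical stand-ins; xdp_do_flush() is
the real flush entry point:

static int example_napi_poll(struct napi_struct *napi, int budget)
{
	int done = 0;

	/* NAPI poll runs in softirq context, i.e. between an implicit
	 * local_bh_disable()/local_bh_enable() pair, which also acts as
	 * an RCU read-side critical section.
	 */
	while (done < budget && example_rx_pending(napi)) {
		/* May run an XDP program and bq_enqueue() the frame into
		 * the per-CPU bulk queue, storing bq->xdp_prog.
		 */
		example_process_rx_frame(napi);
		done++;
	}

	/* Flush the bulk queues before leaving the poll loop: any
	 * bq->xdp_prog stored during enqueue is still RCU-protected at
	 * this point, which is what makes the deferred devmap program
	 * run safe.
	 */
	xdp_do_flush();

	return done;
}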


Re: [PATCHv8 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue

2021-04-20 Thread Martin KaFai Lau
On Thu, Apr 15, 2021 at 09:53:17PM +0800, Hangbin Liu wrote:
> From: Jesper Dangaard Brouer 
> 
> This changes the devmap XDP program support to run the program when the
> bulk queue is flushed instead of before the frame is enqueued. This has
> a couple of benefits:
> 
> - It "sorts" the packets by destination devmap entry, and then runs the
>   same BPF program on all the packets in sequence. This ensures that we
>   keep the XDP program and destination device properties hot in I-cache.
> 
> - It makes the multicast implementation simpler because it can just
>   enqueue packets using bq_enqueue() without having to deal with the
>   devmap program at all.
> 
> The drawback is that if the devmap program drops the packet, the enqueue
> step is redundant. However, arguably this is mostly visible in a
> micro-benchmark, and with more mixed traffic the I-cache benefit should
> win out. The performance impact of just this patch is as follows:
> 
> When bq_xmit_all() is called from bq_enqueue(), another packet will
> always be enqueued immediately after, so clearing dev_rx, xdp_prog and
> flush_node in bq_xmit_all() is redundant. Move the clear to __dev_flush(),
> and only check them once in bq_enqueue() since they are all modified
> together.
> 
> Using a 10Gb i40e NIC, doing XDP_DROP on the veth peer, with
> xdp_redirect_map in samples/bpf, send pkts via the pktgen cmd:
> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64
> 
> There is about +/- 0.1M of deviation in the native tests; the performance
> improved for the base case, but dropped back somewhat with an XDP devmap prog attached.
> 
> Version          | Test                          | Generic | Native | Native + 2nd xdp_prog
> 5.12 rc4         | xdp_redirect_map   i40e->i40e |    1.9M |   9.6M |  8.4M
> 5.12 rc4         | xdp_redirect_map   i40e->veth |    1.7M |  11.7M |  9.8M
> 5.12 rc4 + patch | xdp_redirect_map   i40e->i40e |    1.9M |   9.8M |  8.0M
> 5.12 rc4 + patch | xdp_redirect_map   i40e->veth |    1.7M |  12.0M |  9.4M
Based on the discussion in v7, a summary of what still needs to be
addressed will be useful.


Re: [PATCHv8 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support

2021-04-20 Thread Martin KaFai Lau
On Thu, Apr 15, 2021 at 09:53:18PM +0800, Hangbin Liu wrote:
> This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
> extend xdp_redirect_map for broadcast support.
> 
> With BPF_F_BROADCAST the packet will be broadcast to all the interfaces
> in the map. With BPF_F_EXCLUDE_INGRESS the ingress interface will be
> excluded when broadcasting.
> 
> When getting the devices in the dev hash map via dev_map_hash_get_next_key(),
> there is a possibility that we fall back to the first key when a device
> was removed. This will duplicate packets on some interfaces. So just walk
> the whole set of buckets to avoid this issue. For the dev array map, we also
> walk the whole map to find valid interfaces.
> 
> Function bpf_clear_redirect_map() was removed in
> commit ee75aef23afe ("bpf, xdp: Restructure redirect actions").
> Add it back as we need to use ri->map again.
> 
> Here is the performance result using a 10Gb i40e NIC, doing XDP_DROP on
> the veth peer, running xdp_redirect_{map, map_multi} in samples/bpf and
> sending pkts via the pktgen cmd:
> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64
> 
> There is some drop-off as we need to loop over the map and get each interface.
> 
> Version          | Test                                | Generic | Native
> 5.12 rc4         | redirect_map        i40e->i40e      |    1.9M |   9.6M
> 5.12 rc4         | redirect_map        i40e->veth      |    1.7M |  11.7M
> 5.12 rc4 + patch | redirect_map        i40e->i40e      |    1.9M |   9.3M
> 5.12 rc4 + patch | redirect_map        i40e->veth      |    1.7M |  11.4M
> 5.12 rc4 + patch | redirect_map multi  i40e->i40e      |    1.9M |   8.9M
> 5.12 rc4 + patch | redirect_map multi  i40e->veth      |    1.7M |  10.9M
> 5.12 rc4 + patch | redirect_map multi  i40e->mlx4+veth |    1.2M |   3.8M
> 
> Acked-by: Toke Høiland-Jørgensen 
> Signed-off-by: Hangbin Liu 
> 
> ---
> v8:
use hlist_for_each_entry_rcu() when looping over the devmap hash objects
Acked-by: Martin KaFai Lau 
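
For reference, a minimal sketch of an XDP program using the two new flags
(the DEVMAP name and sizing here are illustrative, not from the patch):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 32);
} forward_map SEC(".maps");

SEC("xdp")
int xdp_broadcast(struct xdp_md *ctx)
{
	/* Clone the frame to every interface in forward_map except the
	 * one it arrived on.  With BPF_F_BROADCAST set, the key (second
	 * argument) is ignored.
	 */
	return bpf_redirect_map(&forward_map, 0,
				BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
}

char _license[] SEC("license") = "GPL";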


Re: [PATCH bpf-next v4 0/3] add batched ops for percpu array

2021-04-16 Thread Martin KaFai Lau
On Thu, Apr 15, 2021 at 02:46:16PM -0300, Pedro Tammela wrote:
> This patchset introduces batched operations for the per-cpu variant of
> the array map.
> 
> It also removes the percpu macros from 'bpf_util.h'. This change was
> suggested by Andrii in an earlier iteration of this patchset.
> 
> The tests were updated to reflect all the new changes.
Acked-by: Martin KaFai Lau 
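
As a usage sketch of the new batched ops on a per-cpu array (map creation
elided; map_fd and the __s64 value type are assumptions for illustration),
the values buffer holds nr_cpus consecutive values per key:

#include <errno.h>
#include <stdlib.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

static int dump_percpu_array(int map_fd, __u32 max_entries)
{
	int nr_cpus = libbpf_num_possible_cpus();
	DECLARE_LIBBPF_OPTS(bpf_map_batch_opts, opts);
	__u32 count = max_entries;
	__u64 batch;
	__u32 *keys;
	__s64 *values;
	int err;

	if (nr_cpus < 0)
		return nr_cpus;

	keys = calloc(max_entries, sizeof(*keys));
	values = calloc((size_t)max_entries * nr_cpus, sizeof(*values));
	if (!keys || !values) {
		free(keys);
		free(values);
		return -ENOMEM;
	}

	/* in_batch == NULL starts from the beginning; batch is updated
	 * so a subsequent call could continue where this one stopped.
	 */
	err = bpf_map_lookup_batch(map_fd, NULL, &batch,
				   keys, values, &count, &opts);

	/* On success, count holds the number of entries returned and
	 * values[i * nr_cpus + cpu] is keys[i]'s value on that cpu.
	 */

	free(values);
	free(keys);
	return err;
}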


Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue

2021-04-16 Thread Martin KaFai Lau
On Fri, Apr 16, 2021 at 03:45:23PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 15 Apr 2021 17:39:13 -0700
> Martin KaFai Lau  wrote:
> 
> > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> > > Jesper Dangaard Brouer  writes:
> > >   
> > > > On Thu, 15 Apr 2021 10:35:51 -0700
> > > > Martin KaFai Lau  wrote:
> > > >  
> > > >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen 
> > > >> wrote:  
> > > >> > Hangbin Liu  writes:
> > > >> > 
> > > >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:  
> > > >> > >   
> > > >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 
> > > >> > >> > flags)
> > > >> > >> >  {
> > > >> > >> > struct net_device *dev = bq->dev;
> > > >> > >> > -   int sent = 0, err = 0;
> > > >> > >> > +   int sent = 0, drops = 0, err = 0;
> > > >> > >> > +   unsigned int cnt = bq->count;
> > > >> > >> > +   int to_send = cnt;
> > > >> > >> > int i;
> > > >> > >> >  
> > > >> > >> > -   if (unlikely(!bq->count))
> > > >> > >> > +   if (unlikely(!cnt))
> > > >> > >> > return;
> > > >> > >> >  
> > > >> > >> > -   for (i = 0; i < bq->count; i++) {
> > > >> > >> > +   for (i = 0; i < cnt; i++) {
> > > >> > >> > struct xdp_frame *xdpf = bq->q[i];
> > > >> > >> >  
> > > >> > >> > prefetch(xdpf);
> > > >> > >> > }
> > > >> > >> >  
> > > >> > >> > -   sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, 
> > > >> > >> > bq->q, flags);
> > > >> > >> > +   if (bq->xdp_prog) {
> > > >> > >> bq->xdp_prog is used here
> > > >> > >> 
> > > >> > >> > +   to_send = dev_map_bpf_prog_run(bq->xdp_prog, 
> > > >> > >> > bq->q, cnt, dev);
> > > >> > >> > +   if (!to_send)
> > > >> > >> > +   goto out;
> > > >> > >> > +
> > > >> > >> > +   drops = cnt - to_send;
> > > >> > >> > +   }
> > > >> > >> > +
> > > >> > >> 
> > > >> > >> [ ... ]
> > > >> > >> 
> > > >> > >> >  static void bq_enqueue(struct net_device *dev, struct 
> > > >> > >> > xdp_frame *xdpf,
> > > >> > >> > -  struct net_device *dev_rx)
> > > >> > >> > +  struct net_device *dev_rx, struct 
> > > >> > >> > bpf_prog *xdp_prog)
> > > >> > >> >  {
> > > >> > >> > struct list_head *flush_list = 
> > > >> > >> > this_cpu_ptr(&dev_flush_list);
> > > >> > >> > struct xdp_dev_bulk_queue *bq = 
> > > >> > >> > this_cpu_ptr(dev->xdp_bulkq);
> > > >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device 
> > > >> > >> > *dev, struct xdp_frame *xdpf,
> > > >> > >> > /* Ingress dev_rx will be the same for all xdp_frame's 
> > > >> > >> > in
> > > >> > >> >  * bulk_queue, because bq stored per-CPU and must be 
> > > >> > >> > flushed
> > > >> > >> >  * from net_device drivers NAPI func end.
> > > >> > >> > +*
> > > >> > >> > +* Do the same with xdp_prog and flush_list since these 
> > > >> > >> > fields
> > > >> > >> > +* are only ever modified together.
> > > >> > >> >  */
> > 

Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue

2021-04-16 Thread Martin KaFai Lau
On Fri, Apr 16, 2021 at 12:03:41PM +0200, Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau  writes:
> 
> > On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> >> Jesper Dangaard Brouer  writes:
> >> 
> >> > On Thu, 15 Apr 2021 10:35:51 -0700
> >> > Martin KaFai Lau  wrote:
> >> >
> >> >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:
> >> >> > Hangbin Liu  writes:
> >> >> >   
> >> >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:  
> >> >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 
> >> >> > >> > flags)
> >> >> > >> >  {
> >> >> > >> >  struct net_device *dev = bq->dev;
> >> >> > >> > -int sent = 0, err = 0;
> >> >> > >> > +int sent = 0, drops = 0, err = 0;
> >> >> > >> > +unsigned int cnt = bq->count;
> >> >> > >> > +int to_send = cnt;
> >> >> > >> >  int i;
> >> >> > >> >  
> >> >> > >> > -if (unlikely(!bq->count))
> >> >> > >> > +if (unlikely(!cnt))
> >> >> > >> >  return;
> >> >> > >> >  
> >> >> > >> > -for (i = 0; i < bq->count; i++) {
> >> >> > >> > +for (i = 0; i < cnt; i++) {
> >> >> > >> >  struct xdp_frame *xdpf = bq->q[i];
> >> >> > >> >  
> >> >> > >> >  prefetch(xdpf);
> >> >> > >> >  }
> >> >> > >> >  
> >> >> > >> > -sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, 
> >> >> > >> > bq->q, flags);
> >> >> > >> > +if (bq->xdp_prog) {  
> >> >> > >> bq->xdp_prog is used here
> >> >> > >>   
> >> >> > >> > +to_send = dev_map_bpf_prog_run(bq->xdp_prog, 
> >> >> > >> > bq->q, cnt, dev);
> >> >> > >> > +if (!to_send)
> >> >> > >> > +goto out;
> >> >> > >> > +
> >> >> > >> > +drops = cnt - to_send;
> >> >> > >> > +}
> >> >> > >> > +  
> >> >> > >> 
> >> >> > >> [ ... ]
> >> >> > >>   
> >> >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame 
> >> >> > >> > *xdpf,
> >> >> > >> > -   struct net_device *dev_rx)
> >> >> > >> > +   struct net_device *dev_rx, struct 
> >> >> > >> > bpf_prog *xdp_prog)
> >> >> > >> >  {
> >> >> > >> >  struct list_head *flush_list = 
> >> >> > >> > this_cpu_ptr(&dev_flush_list);
> >> >> > >> >  struct xdp_dev_bulk_queue *bq = 
> >> >> > >> > this_cpu_ptr(dev->xdp_bulkq);
> >> >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device 
> >> >> > >> > *dev, struct xdp_frame *xdpf,
> >> >> > >> >  /* Ingress dev_rx will be the same for all xdp_frame's 
> >> >> > >> > in
> >> >> > >> >   * bulk_queue, because bq stored per-CPU and must be 
> >> >> > >> > flushed
> >> >> > >> >   * from net_device drivers NAPI func end.
> >> >> > >> > + *
> >> >> > >> > + * Do the same with xdp_prog and flush_list since these 
> >> >> > >> > fields
> >> >> > >> > + * are only ever modified together.
> >> >> > >> >   */
> >> >> > >> > -if (!bq->dev_rx)
> >> >> > >> > +if (!bq->dev_rx) {

Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue

2021-04-15 Thread Martin KaFai Lau
On Thu, Apr 15, 2021 at 10:29:40PM +0200, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer  writes:
> 
> > On Thu, 15 Apr 2021 10:35:51 -0700
> > Martin KaFai Lau  wrote:
> >
> >> On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:
> >> > Hangbin Liu  writes:
> >> >   
> >> > > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:  
> >> > >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> > >> >  {
> >> > >> > struct net_device *dev = bq->dev;
> >> > >> > -   int sent = 0, err = 0;
> >> > >> > +   int sent = 0, drops = 0, err = 0;
> >> > >> > +   unsigned int cnt = bq->count;
> >> > >> > +   int to_send = cnt;
> >> > >> > int i;
> >> > >> >  
> >> > >> > -   if (unlikely(!bq->count))
> >> > >> > +   if (unlikely(!cnt))
> >> > >> > return;
> >> > >> >  
> >> > >> > -   for (i = 0; i < bq->count; i++) {
> >> > >> > +   for (i = 0; i < cnt; i++) {
> >> > >> > struct xdp_frame *xdpf = bq->q[i];
> >> > >> >  
> >> > >> > prefetch(xdpf);
> >> > >> > }
> >> > >> >  
> >> > >> > -   sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, 
> >> > >> > flags);
> >> > >> > +   if (bq->xdp_prog) {  
> >> > >> bq->xdp_prog is used here
> >> > >>   
> >> > >> > +   to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, 
> >> > >> > cnt, dev);
> >> > >> > +   if (!to_send)
> >> > >> > +   goto out;
> >> > >> > +
> >> > >> > +   drops = cnt - to_send;
> >> > >> > +   }
> >> > >> > +  
> >> > >> 
> >> > >> [ ... ]
> >> > >>   
> >> > >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame 
> >> > >> > *xdpf,
> >> > >> > -  struct net_device *dev_rx)
> >> > >> > +  struct net_device *dev_rx, struct bpf_prog 
> >> > >> > *xdp_prog)
> >> > >> >  {
> >> > >> > struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> > >> > struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> > >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device 
> >> > >> > *dev, struct xdp_frame *xdpf,
> >> > >> > /* Ingress dev_rx will be the same for all xdp_frame's in
> >> > >> >  * bulk_queue, because bq stored per-CPU and must be flushed
> >> > >> >  * from net_device drivers NAPI func end.
> >> > >> > +*
> >> > >> > +* Do the same with xdp_prog and flush_list since these fields
> >> > >> > +* are only ever modified together.
> >> > >> >  */
> >> > >> > -   if (!bq->dev_rx)
> >> > >> > +   if (!bq->dev_rx) {
> >> > >> > bq->dev_rx = dev_rx;
> >> > >> > +   bq->xdp_prog = xdp_prog;  
> >> > >> bp->xdp_prog is assigned here and could be used later in 
> >> > >> bq_xmit_all().
> >> > >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> > >> It is not very obvious after taking a quick look at 
> >> > >> xdp_do_flush[_map].
> >> > >> 
> >> > >> e.g. what if the devmap elem gets deleted.  
> >> > >
> >> > > Jesper knows better than me. From my view, based on the description of
> >> > > __dev_flush():
> >> > >
> >> > > On devmap tear down we ensure the flush list is empty before 
> >> > > completing to
> >> > > ensure all flush operations have completed. When drivers update the bpf
> >> > > program they may need to ensure any flush ops are also complete.  
> >>
> >> AFAICT, th

Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue

2021-04-15 Thread Martin KaFai Lau
On Thu, Apr 15, 2021 at 11:22:19AM +0200, Toke Høiland-Jørgensen wrote:
> Hangbin Liu  writes:
> 
> > On Wed, Apr 14, 2021 at 05:17:11PM -0700, Martin KaFai Lau wrote:
> >> >  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
> >> >  {
> >> >  struct net_device *dev = bq->dev;
> >> > -int sent = 0, err = 0;
> >> > +int sent = 0, drops = 0, err = 0;
> >> > +unsigned int cnt = bq->count;
> >> > +int to_send = cnt;
> >> >  int i;
> >> >  
> >> > -if (unlikely(!bq->count))
> >> > +if (unlikely(!cnt))
> >> >  return;
> >> >  
> >> > -for (i = 0; i < bq->count; i++) {
> >> > +for (i = 0; i < cnt; i++) {
> >> >  struct xdp_frame *xdpf = bq->q[i];
> >> >  
> >> >  prefetch(xdpf);
> >> >  }
> >> >  
> >> > -sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, 
> >> > flags);
> >> > +if (bq->xdp_prog) {
> >> bq->xdp_prog is used here
> >> 
> >> > +to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, 
> >> > cnt, dev);
> >> > +if (!to_send)
> >> > +goto out;
> >> > +
> >> > +drops = cnt - to_send;
> >> > +}
> >> > +
> >> 
> >> [ ... ]
> >> 
> >> >  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> >> > -   struct net_device *dev_rx)
> >> > +   struct net_device *dev_rx, struct bpf_prog 
> >> > *xdp_prog)
> >> >  {
> >> >  struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
> >> >  struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> >> > @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, 
> >> > struct xdp_frame *xdpf,
> >> >  /* Ingress dev_rx will be the same for all xdp_frame's in
> >> >   * bulk_queue, because bq stored per-CPU and must be flushed
> >> >   * from net_device drivers NAPI func end.
> >> > + *
> >> > + * Do the same with xdp_prog and flush_list since these fields
> >> > + * are only ever modified together.
> >> >   */
> >> > -if (!bq->dev_rx)
> >> > +if (!bq->dev_rx) {
> >> >  bq->dev_rx = dev_rx;
> >> > +bq->xdp_prog = xdp_prog;
> >> bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
> >> How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
> >> It is not very obvious after taking a quick look at xdp_do_flush[_map].
> >> 
> >> e.g. what if the devmap elem gets deleted.
> >
> > Jesper knows better than me. From my view, based on the description of
> > __dev_flush():
> >
> > On devmap tear down we ensure the flush list is empty before completing to
> > ensure all flush operations have completed. When drivers update the bpf
> > program they may need to ensure any flush ops are also complete.
AFAICT, the bq->xdp_prog is not from the dev. It is from a devmap's elem.

> 
> Yeah, drivers call xdp_do_flush() before exiting their NAPI poll loop,
> which also runs under one big rcu_read_lock(). So the storage in the
> bulk queue is quite temporary, it's just used for bulking to increase
> performance :)
I am missing the one big rcu_read_lock() part.  For example, in i40e_txrx.c,
i40e_run_xdp() has its own rcu_read_lock/unlock().  dst->xdp_prog used to run
in i40e_run_xdp() and that was fine.

In this patch, dst->xdp_prog is run outside of i40e_run_xdp(), where
rcu_read_unlock() has already been called.  It is now run in xdp_do_flush_map().
Or did I miss the big rcu_read_lock() in i40e_napi_poll()?

I do see the big rcu_read_lock() in mlx5e_napi_poll().
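
The concern, sketched schematically (the i40e call structure below is
assumed from this discussion, not quoted from the driver):

/*
 * i40e_napi_poll()
 *   i40e_clean_rx_irq()
 *     i40e_run_xdp()
 *       rcu_read_lock();
 *       bpf_prog_run_xdp(...);   // dst->xdp_prog used to run here
 *       rcu_read_unlock();
 *   ...
 *   xdp_do_flush_map();          // bq->xdp_prog now runs here, i.e.
 *                                // outside that rcu_read_lock/unlock
 */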


Re: [PATCH] tools/testing: Remove unused variable

2021-04-14 Thread Martin KaFai Lau
On Wed, Apr 14, 2021 at 10:16:39PM +0800, zuoqil...@163.com wrote:
> From: zuoqilin 
> 
> Remove unused variable "ret2".
Please tag the target tree in the future as described in
Documentation/bpf/bpf_devel_QA.rst.

This one belongs to bpf-next.

Acked-by: Martin KaFai Lau 


Re: [PATCHv7 bpf-next 2/4] xdp: extend xdp_redirect_map with broadcast support

2021-04-14 Thread Martin KaFai Lau
On Wed, Apr 14, 2021 at 08:26:08PM +0800, Hangbin Liu wrote:
[ ... ]

> +static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex,
> +   u64 flags, u64 flag_mask,
> void *lookup_elem(struct bpf_map *map, u32 key))
>  {
>   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>  
>   /* Lower bits of the flags are used as return code on lookup failure */
> - if (unlikely(flags > XDP_TX))
> + if (unlikely(flags & ~(BPF_F_ACTION_MASK | flag_mask)))
>   return XDP_ABORTED;
>  
>   ri->tgt_value = lookup_elem(map, ifindex);
> - if (unlikely(!ri->tgt_value)) {
> + if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) {
>   /* If the lookup fails we want to clear out the state in the
>* redirect_info struct completely, so that if an eBPF program
>* performs multiple lookups, the last one always takes
> @@ -1482,13 +1484,21 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
>*/
>   ri->map_id = INT_MAX; /* Valid map id idr range: [1,INT_MAX[ */
>   ri->map_type = BPF_MAP_TYPE_UNSPEC;
> - return flags;
> + return flags & BPF_F_ACTION_MASK;
>   }
>  
>   ri->tgt_index = ifindex;
>   ri->map_id = map->id;
>   ri->map_type = map->map_type;
>  
> + if (flags & BPF_F_BROADCAST) {
> + WRITE_ONCE(ri->map, map);
Why only WRITE_ONCE on ri->map?  Is it needed?

> + ri->flags = flags;
> + } else {
> + WRITE_ONCE(ri->map, NULL);
> + ri->flags = 0;
> + }
> +
>   return XDP_REDIRECT;
>  }
>  
[ ... ]

> +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
> +   struct bpf_map *map, bool exclude_ingress)
> +{
> + struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
> + int exclude_ifindex = exclude_ingress ? dev_rx->ifindex : 0;
> + struct bpf_dtab_netdev *dst, *last_dst = NULL;
> + struct hlist_head *head;
> + struct hlist_node *next;
> + struct xdp_frame *xdpf;
> + unsigned int i;
> + int err;
> +
> + xdpf = xdp_convert_buff_to_frame(xdp);
> + if (unlikely(!xdpf))
> + return -EOVERFLOW;
> +
> + if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
> + for (i = 0; i < map->max_entries; i++) {
> + dst = READ_ONCE(dtab->netdev_map[i]);
> + if (!is_valid_dst(dst, xdp, exclude_ifindex))
> + continue;
> +
> + /* we only need n-1 clones; last_dst enqueued below */
> + if (!last_dst) {
> + last_dst = dst;
> + continue;
> + }
> +
> + err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf);
> + if (err)
> + return err;
> +
> + last_dst = dst;
> + }
> + } else { /* BPF_MAP_TYPE_DEVMAP_HASH */
> + for (i = 0; i < dtab->n_buckets; i++) {
> + head = dev_map_index_hash(dtab, i);
> + hlist_for_each_entry_safe(dst, next, head, index_hlist) {
hmm should it be hlist_for_each_entry_rcu() instead?

> + if (!is_valid_dst(dst, xdp, exclude_ifindex))
> + continue;
> +
> + /* we only need n-1 clones; last_dst enqueued below */
> + if (!last_dst) {
> + last_dst = dst;
> + continue;
> + }
> +
> + err = dev_map_enqueue_clone(last_dst, dev_rx, xdpf);
> + if (err)
> + return err;
> +
> + last_dst = dst;
> + }
> + }
> + }
> +
> + /* consume the last copy of the frame */
> + if (last_dst)
> + bq_enqueue(last_dst->dev, xdpf, dev_rx, last_dst->xdp_prog);
> + else
> + xdp_return_frame_rx_napi(xdpf); /* dtab is empty */
> +
> + return 0;
> +}
> +
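
For reference, a sketch of how the loop looks with the RCU iterator (this
is what v8 switched to; the lockdep expression assumes updaters hold
dtab->index_lock, per devmap's locking scheme):

hlist_for_each_entry_rcu(dst, head, index_hlist,
			 lockdep_is_held(&dtab->index_lock)) {
	if (!is_valid_dst(dst, xdp, exclude_ifindex))
		continue;
	/* ... clone/enqueue as in the hunk above ... */
}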


Re: [PATCHv7 bpf-next 1/4] bpf: run devmap xdp_prog on flush instead of bulk enqueue

2021-04-14 Thread Martin KaFai Lau
On Wed, Apr 14, 2021 at 08:26:07PM +0800, Hangbin Liu wrote:
[ ... ]

> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index aa516472ce46..3980fb3bfb09 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -57,6 +57,7 @@ struct xdp_dev_bulk_queue {
>   struct list_head flush_node;
>   struct net_device *dev;
>   struct net_device *dev_rx;
> + struct bpf_prog *xdp_prog;
>   unsigned int count;
>  };
>  
> @@ -326,22 +327,71 @@ bool dev_map_can_have_prog(struct bpf_map *map)
>   return false;
>  }
>  
> +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
> + struct xdp_frame **frames, int n,
> + struct net_device *dev)
> +{
> + struct xdp_txq_info txq = { .dev = dev };
> + struct xdp_buff xdp;
> + int i, nframes = 0;
> +
> + for (i = 0; i < n; i++) {
> + struct xdp_frame *xdpf = frames[i];
> + u32 act;
> + int err;
> +
> + xdp_convert_frame_to_buff(xdpf, &xdp);
> + xdp.txq = &txq;
> +
> + act = bpf_prog_run_xdp(xdp_prog, &xdp);
> + switch (act) {
> + case XDP_PASS:
> + err = xdp_update_frame_from_buff(&xdp, xdpf);
> + if (unlikely(err < 0))
> + xdp_return_frame_rx_napi(xdpf);
> + else
> + frames[nframes++] = xdpf;
> + break;
> + default:
> + bpf_warn_invalid_xdp_action(act);
> + fallthrough;
> + case XDP_ABORTED:
> + trace_xdp_exception(dev, xdp_prog, act);
> + fallthrough;
> + case XDP_DROP:
> + xdp_return_frame_rx_napi(xdpf);
> + break;
> + }
> + }
> + return nframes; /* sent frames count */
> +}
> +
>  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>  {
>   struct net_device *dev = bq->dev;
> - int sent = 0, err = 0;
> + int sent = 0, drops = 0, err = 0;
> + unsigned int cnt = bq->count;
> + int to_send = cnt;
>   int i;
>  
> - if (unlikely(!bq->count))
> + if (unlikely(!cnt))
>   return;
>  
> - for (i = 0; i < bq->count; i++) {
> + for (i = 0; i < cnt; i++) {
>   struct xdp_frame *xdpf = bq->q[i];
>  
>   prefetch(xdpf);
>   }
>  
> - sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);
> + if (bq->xdp_prog) {
bq->xdp_prog is used here

> + to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
> + if (!to_send)
> + goto out;
> +
> + drops = cnt - to_send;
> + }
> +

[ ... ]

>  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
> -struct net_device *dev_rx)
> +struct net_device *dev_rx, struct bpf_prog *xdp_prog)
>  {
>   struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
>   struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
> @@ -412,18 +466,22 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>   /* Ingress dev_rx will be the same for all xdp_frame's in
>* bulk_queue, because bq stored per-CPU and must be flushed
>* from net_device drivers NAPI func end.
> +  *
> +  * Do the same with xdp_prog and flush_list since these fields
> +  * are only ever modified together.
>*/
> - if (!bq->dev_rx)
> + if (!bq->dev_rx) {
>   bq->dev_rx = dev_rx;
> + bq->xdp_prog = xdp_prog;
bp->xdp_prog is assigned here and could be used later in bq_xmit_all().
How is bq->xdp_prog protected? Are they all under one rcu_read_lock()?
It is not very obvious after taking a quick look at xdp_do_flush[_map].

e.g. what if the devmap elem gets deleted.

[ ... ]

>  static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
> -struct net_device *dev_rx)
> + struct net_device *dev_rx,
> + struct bpf_prog *xdp_prog)
>  {
>   struct xdp_frame *xdpf;
>   int err;
> @@ -439,42 +497,14 @@ static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
>   if (unlikely(!xdpf))
>   return -EOVERFLOW;
>  
> - bq_enqueue(dev, xdpf, dev_rx);
> + bq_enqueue(dev, xdpf, dev_rx, xdp_prog);
>   return 0;
>  }
>  
[ ... ]

> @@ -482,12 +512,7 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
>  {
>   struct net_device *dev = dst->dev;
>  
> - if (dst->xdp_prog) {
> - xdp = dev_map_run_prog(dev, xdp, dst->xdp_prog);
> - if (!xdp)
> - return 0;
> - }
> - return __xdp_enqueue(dev, xdp, dev_rx);
> + return __xdp_

Re: [PATCH bpf-next v3 3/3] bpf: selftests: update array map tests for per-cpu batched ops

2021-04-12 Thread Martin KaFai Lau
On Mon, Apr 12, 2021 at 04:40:01PM -0300, Pedro Tammela wrote:
> Follows the same logic as the hashtable tests.
> 
> Signed-off-by: Pedro Tammela 
> ---
>  .../bpf/map_tests/array_map_batch_ops.c   | 110 +-
>  1 file changed, 80 insertions(+), 30 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/map_tests/array_map_batch_ops.c b/tools/testing/selftests/bpf/map_tests/array_map_batch_ops.c
> index e42ea1195d18..707d17414dee 100644
> --- a/tools/testing/selftests/bpf/map_tests/array_map_batch_ops.c
> +++ b/tools/testing/selftests/bpf/map_tests/array_map_batch_ops.c
> @@ -10,32 +10,59 @@
>  #include 
>  
>  static void map_batch_update(int map_fd, __u32 max_entries, int *keys,
> -  int *values)
> +  __s64 *values, bool is_pcpu)
>  {
> - int i, err;
> + int nr_cpus = libbpf_num_possible_cpus();
Instead of getting it multiple times, how about moving it out to
a static global and initializing it in test_array_map_batch_ops().


> + int i, j, err;
> + int offset = 0;
>   DECLARE_LIBBPF_OPTS(bpf_map_batch_opts, opts,
>   .elem_flags = 0,
>   .flags = 0,
>   );
>  
> + CHECK(nr_cpus < 0, "nr_cpus checking",
> +   "error: get possible cpus failed");
> +
>   for (i = 0; i < max_entries; i++) {
>   keys[i] = i;
> - values[i] = i + 1;
> + if (is_pcpu)
> + for (j = 0; j < nr_cpus; j++)
> + (values + offset)[j] = i + 1 + j;
> + else
> + values[i] = i + 1;
> + offset += nr_cpus;
This "offset" update here is confusing to read because it is only
used in the is_pcpu case but it always gets updated regardless.
How about only defining and using offset in the "if (is_pcpu)" case and
renaming it to "cpu_offset": cpu_offset = i * nr_cpus.

The same goes for the other occurrences.

>   }
>  
>   err = bpf_map_update_batch(map_fd, keys, values, &max_entries, &opts);
>   CHECK(err, "bpf_map_update_batch()", "error:%s\n", strerror(errno));
>  }
>  
> -static void map_batch_verify(int *visited, __u32 max_entries,
> -  int *keys, int *values)
> +static void map_batch_verify(int *visited, __u32 max_entries, int *keys,
> +  __s64 *values, bool is_pcpu)
>  {
> - int i;
> + int nr_cpus = libbpf_num_possible_cpus();
> + int i, j;
> + int offset = 0;
> +
> + CHECK(nr_cpus < 0, "nr_cpus checking",
> +   "error: get possible cpus failed");
>  
>   memset(visited, 0, max_entries * sizeof(*visited));
>   for (i = 0; i < max_entries; i++) {
> - CHECK(keys[i] + 1 != values[i], "key/value checking",
> -   "error: i %d key %d value %d\n", i, keys[i], values[i]);
> + if (is_pcpu) {
> + for (j = 0; j < nr_cpus; j++) {
> + __s64 value = (values + offset)[j];
> + CHECK(keys[i] + j + 1 != value,
> +   "key/value checking",
> +   "error: i %d j %d key %d value %d\n", i,
> +   j, keys[i], value);
> + }
> + } else {
> + CHECK(keys[i] + 1 != values[i], "key/value checking",
> +   "error: i %d key %d value %d\n", i, keys[i],
> +   values[i]);
> + }
> + offset += nr_cpus;
>   visited[i] = 1;
>   }
>   for (i = 0; i < max_entries; i++) {
> @@ -44,45 +71,52 @@ static void map_batch_verify(int *visited, __u32 max_entries,
>   }
>  }
>  
> -void test_array_map_batch_ops(void)
> +void __test_map_lookup_and_update_batch(bool is_pcpu)
static

>  {
> + int nr_cpus = libbpf_num_possible_cpus();
>   struct bpf_create_map_attr xattr = {
>   .name = "array_map",
> - .map_type = BPF_MAP_TYPE_ARRAY,
> + .map_type = is_pcpu ? BPF_MAP_TYPE_PERCPU_ARRAY :
> +   BPF_MAP_TYPE_ARRAY,
>   .key_size = sizeof(int),
> - .value_size = sizeof(int),
> + .value_size = sizeof(__s64),
>   };
> - int map_fd, *keys, *values, *visited;
> + int map_fd, *keys, *visited;
>   __u32 count, total, total_success;
>   const __u32 max_entries = 10;
>   __u64 batch = 0;
> - int err, step;
> + int err, step, value_size;
> + void *values;
>   DECLARE_LIBBPF_OPTS(bpf_map_batch_opts, opts,
>   .elem_flags = 0,
>   .flags = 0,
>   );
>  
> + CHECK(nr_cpus < 0, "nr_cpus checking",
> +   "error: get possible cpus failed");
> +
>   xattr.max_entries = max_entries;
>   map_fd = bpf_create_map_xattr(&xattr);
>   CHECK(map_fd == -1,
> "bpf_create_map_xattr()", "error:%s\n", strerror(errno));
>  
>
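
A sketch of the suggested refactor (same function shape as the hunk above;
nr_cpus as a static global initialized once, cpu_offset computed per entry
only in the per-cpu branch):

static int nr_cpus;	/* set once in test_array_map_batch_ops() */

static void map_batch_update(int map_fd, __u32 max_entries, int *keys,
			     __s64 *values, bool is_pcpu)
{
	DECLARE_LIBBPF_OPTS(bpf_map_batch_opts, opts);
	int i, j, err;

	for (i = 0; i < max_entries; i++) {
		keys[i] = i;
		if (is_pcpu) {
			/* per-entry offset, used only for per-cpu maps */
			int cpu_offset = i * nr_cpus;

			for (j = 0; j < nr_cpus; j++)
				values[cpu_offset + j] = i + 1 + j;
		} else {
			values[i] = i + 1;
		}
	}

	err = bpf_map_update_batch(map_fd, keys, values, &max_entries, &opts);
	CHECK(err, "bpf_map_update_batch()", "error:%s\n", strerror(errno));
}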

[PATCH bpf-next] bpf: selftests: Specify CONFIG_DYNAMIC_FTRACE in the testing config

2021-04-02 Thread Martin KaFai Lau
The tracing test and the recent kfunc call test require
CONFIG_DYNAMIC_FTRACE.  This patch adds it to the config file.

Signed-off-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/config | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 37e1f303fc11..528af74e0c8f 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -44,3 +44,4 @@ CONFIG_SECURITYFS=y
 CONFIG_IMA_WRITE_POLICY=y
 CONFIG_IMA_READ_POLICY=y
 CONFIG_BLK_DEV_LOOP=y
+CONFIG_DYNAMIC_FTRACE=y
-- 
2.30.2



Re: [PATCH v2 bpf-next 00/14] bpf: Support calling kernel function

2021-04-01 Thread Martin KaFai Lau
On Tue, Mar 30, 2021 at 11:44:39PM -0700, Andrii Nakryiko wrote:
> On Mon, Mar 29, 2021 at 12:11 PM Martin KaFai Lau  wrote:
> >
> > On Mon, Mar 29, 2021 at 05:06:26PM +0100, Lorenz Bauer wrote:
> > > On Mon, 29 Mar 2021 at 02:25, Martin KaFai Lau  wrote:
> > > >
> > > > > > >
> > > > > > > # pahole --version
> > > > > > > v1.17
> > > > > >
> > > > > > That is the most likely reason.
> > > > > > In lib/Kconfig.debug
> > > > > > we have pahole >= 1.19 requirement for BTF in modules.
> > > > > > Though your config has CUBIC=y I suspect something odd goes on.
> > > > > > Could you please try the latest pahole 1.20 ?
> > > > >
> > > > > Sure, I will give it a try tomorrow, I am not in control of the CI I 
> > > > > ran.
> > > > Could you also check the CONFIG_DYNAMIC_FTRACE and also try 'y' if it
> > > > is not set?
> > >
> > > I hit the same problem on newer pahole:
> > >
> > > $ pahole --version
> > > v1.20
> > >
> > > CONFIG_DYNAMIC_FTRACE=y resolves the issue.
> > Thanks for checking.
> >
> > pahole only generates the btf_id for external function
> > and ftrace-able function.  Some functions in the bpf_tcp_ca_kfunc_ids list
> > are static (e.g. cubictcp_init), so it fails during resolve_btfids.
> >
> > I will post a patch to limit the bpf_tcp_ca_kfunc_ids list
> > to CONFIG_DYNAMIC_FTRACE.  I will address the pahole
> > generation in a followup and then remove this
> > CONFIG_DYNAMIC_FTRACE limitation.
> 
> We should still probably add CONFIG_DYNAMIC_FTRACE=y to selftests/bpf/config?
I thought the tracing tests already required this.  Together with
the new kfunc call test, it may be good to make it explicit in selftests/bpf/config.
I can post a diff.


Re: [External] Re: [PATCH v2 bpf-next 00/14] bpf: Support calling kernel function

2021-03-30 Thread Martin KaFai Lau
On Tue, Mar 30, 2021 at 08:28:34PM -0700, Jiang Wang . wrote:
> I am working with Cong to get Clang/LLVM-13 for testing. But I am confused
> about whether Clang/LLVM-13 has been released or not.
> 
> From https://apt.llvm.org/, I saw llvm-13 was released in Feb, but it
> does not have the diff you mentioned.
I haven't used the Debian/Ubuntu nightly packages, so don't know.

> 
> From the following links, I am not sure if LLVM-13 was really released
> or still in the process.
> https://llvm.org/docs/ReleaseNotes.html#external-open-source-projects-using-llvm-13
>  
> https://github.com/llvm/llvm-project/releases
AFAIK, it is not released, so please directly clone from the llvm-project
as also suggested earlier by Pedro.

Please refer to the "how do I build LLVM" in Documentation/bpf/bpf_devel_QA.rst.

[ Please do not top post.  Reply inline instead.
  It will be difficult for others to follow. ]

> On Tue, Mar 30, 2021 at 2:43 PM Martin KaFai Lau  wrote:
> >
> > On Tue, Mar 30, 2021 at 12:58:22PM -0700, Cong Wang wrote:
> > > On Tue, Mar 30, 2021 at 7:36 AM Alexei Starovoitov
> > >  wrote:
> > > >
> > > > On Tue, Mar 30, 2021 at 2:43 AM Lorenz Bauer  
> > > > wrote:
> > > > >
> > > > > On Thu, 25 Mar 2021 at 01:52, Martin KaFai Lau  wrote:
> > > > > >
> > > > > > This series adds support to allow bpf program calling kernel 
> > > > > > function.
> > > > >
> > > > > I think there are more build problems with this. Has anyone hit this 
> > > > > before?
> > > > >
> > > > > $ CLANG=clang-12 O=../kbuild/vm 
> > > > > ./tools/testing/selftests/bpf/vmtest.sh -j 7
> > > > >
> > > > >   GEN-SKEL [test_progs-no_alu32] bind6_prog.skel.h
> > > > > libbpf: elf: skipping unrecognized data section(5) .rodata.str1.1
> > > > >   GEN-SKEL [test_progs-no_alu32] bind_perm.skel.h
> > > > > libbpf: elf: skipping unrecognized data section(5) .rodata.str1.1
> > > > >   GEN-SKEL [test_progs-no_alu32] bpf_cubic.skel.h
> > > > >   GEN-SKEL [test_progs-no_alu32] bpf_dctcp.skel.h
> > > > >   GEN-SKEL [test_progs-no_alu32] bpf_flow.skel.h
> > > > > libbpf: failed to find BTF for extern 'tcp_cong_avoid_ai' [27] 
> > > > > section: -2
> > > > > Error: failed to open BPF object file: No such file or directory
> > > >
> > > > The doc update is on its way:
> > > > https://patchwork.kernel.org/project/netdevbpf/patch/20210330054156.2933804-1-ka...@fb.com/
> > >
> > > We just updated our clang to 13, and I still get the same error above.
> > Please check if the llvm/clang has this diff
> > https://reviews.llvm.org/D93563 


Re: [PATCH v2 bpf-next 00/14] bpf: Support calling kernel function

2021-03-30 Thread Martin KaFai Lau
On Tue, Mar 30, 2021 at 12:58:22PM -0700, Cong Wang wrote:
> On Tue, Mar 30, 2021 at 7:36 AM Alexei Starovoitov
>  wrote:
> >
> > On Tue, Mar 30, 2021 at 2:43 AM Lorenz Bauer  wrote:
> > >
> > > On Thu, 25 Mar 2021 at 01:52, Martin KaFai Lau  wrote:
> > > >
> > > > This series adds support to allow bpf program calling kernel function.
> > >
> > > I think there are more build problems with this. Has anyone hit this 
> > > before?
> > >
> > > $ CLANG=clang-12 O=../kbuild/vm ./tools/testing/selftests/bpf/vmtest.sh 
> > > -j 7
> > >
> > >   GEN-SKEL [test_progs-no_alu32] bind6_prog.skel.h
> > > libbpf: elf: skipping unrecognized data section(5) .rodata.str1.1
> > >   GEN-SKEL [test_progs-no_alu32] bind_perm.skel.h
> > > libbpf: elf: skipping unrecognized data section(5) .rodata.str1.1
> > >   GEN-SKEL [test_progs-no_alu32] bpf_cubic.skel.h
> > >   GEN-SKEL [test_progs-no_alu32] bpf_dctcp.skel.h
> > >   GEN-SKEL [test_progs-no_alu32] bpf_flow.skel.h
> > > libbpf: failed to find BTF for extern 'tcp_cong_avoid_ai' [27] section: -2
> > > Error: failed to open BPF object file: No such file or directory
> >
> > The doc update is on its way:
> > https://patchwork.kernel.org/project/netdevbpf/patch/20210330054156.2933804-1-ka...@fb.com/
> 
> We just updated our clang to 13, and I still get the same error above.
Please check if the llvm/clang has this diff
https://reviews.llvm.org/D93563


[PATCH bpf-next 2/2] bpf: selftests: Update clang requirement in README.rst for testing kfunc call

2021-03-29 Thread Martin KaFai Lau
This patch updates the README.rst to specify the clang requirement
for compiling the bpf selftests that call kernel functions.

Signed-off-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/README.rst | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/tools/testing/selftests/bpf/README.rst b/tools/testing/selftests/bpf/README.rst
index 3464161c8eea..65fe318d1e71 100644
--- a/tools/testing/selftests/bpf/README.rst
+++ b/tools/testing/selftests/bpf/README.rst
@@ -179,3 +179,17 @@ types, which was introduced in `Clang 13`__. The older Clang versions will
 either crash when compiling these tests, or generate an incorrect BTF.
 
 __  https://reviews.llvm.org/D83289
+
+Kernel function call test and Clang version
+===========================================
+
+Some selftests (e.g. kfunc_call and bpf_tcp_ca) require LLVM support
+to generate extern functions in BTF.  It was introduced in `Clang 13`__.
+
+Without it, the error from compiling bpf selftests looks like:
+
+.. code-block:: console
+
+  libbpf: failed to find BTF for extern 'tcp_slow_start' [25] section: -2
+
+__ https://reviews.llvm.org/D93563
-- 
2.30.2



[PATCH bpf-next 1/2] bpf: Update bpf_design_QA.rst to clarify the kfunc call is not ABI

2021-03-29 Thread Martin KaFai Lau
This patch updates bpf_design_QA.rst to clarify that the kernel
functions callable by bpf programs are not an ABI.

Signed-off-by: Martin KaFai Lau 
---
 Documentation/bpf/bpf_design_QA.rst | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst
index 0e15f9b05c9d..437de2a7a5de 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -258,3 +258,18 @@ Q: Can BPF functionality such as new program or map types, new
 helpers, etc be added out of kernel module code?
 
 A: NO.
+
+Q: Directly calling kernel function is an ABI?
+----------------------------------------------
+Q: Some kernel functions (e.g. tcp_slow_start) can be called
+by BPF programs.  Do these kernel functions become an ABI?
+
+A: NO.
+
+The kernel function prototypes will change, and the bpf programs calling
+them will be rejected by the verifier.  Also, for example, some of the
+bpf-callable kernel functions have already been used by other kernel tcp
+cc (congestion-control) implementations.  If any of these kernel
+functions changes, both the in-tree and out-of-tree kernel tcp cc
+implementations have to be changed.  The same goes for the bpf
+programs, which have to be adjusted accordingly.
-- 
2.30.2



[PATCH bpf-next 0/2] bpf: Update doc about calling kernel function

2021-03-29 Thread Martin KaFai Lau
This set updates the documentation about bpf programs calling kernel
functions.  In particular, the updates concern the clang requirement
in the selftests and the fact that a kfunc call is not an ABI.

Martin KaFai Lau (2):
  bpf: Update bpf_design_QA.rst to clarify the kfunc call is not ABI
  bpf: selftests: Update clang requirement in README.rst for testing
kfunc call

 Documentation/bpf/bpf_design_QA.rst| 15 +++
 tools/testing/selftests/bpf/README.rst | 14 ++
 2 files changed, 29 insertions(+)

-- 
2.30.2



[PATCH bpf-next] bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE

2021-03-29 Thread Martin KaFai Lau
pahole currently only generates the btf_id for external functions and
ftrace-able functions.  Some functions in the bpf_tcp_ca_kfunc_ids list
are static (e.g. cubictcp_init).  Thus, unless CONFIG_DYNAMIC_FTRACE
is set, btf_ids for those functions will not be generated and the
compilation fails during resolve_btfids.

This patch limits those functions to CONFIG_DYNAMIC_FTRACE.  I will
address the pahole generation in a followup and then remove the
CONFIG_DYNAMIC_FTRACE limitation.

Fixes: e78aea8b2170 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc")
Reported-by: Cong Wang 
Reported-by: Lorenz Bauer 
Signed-off-by: Martin KaFai Lau 
---
 net/ipv4/bpf_tcp_ca.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 6bb7b335ff9f..dff4f0eb96b0 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -185,6 +185,7 @@ BTF_ID(func, tcp_reno_cong_avoid)
 BTF_ID(func, tcp_reno_undo_cwnd)
 BTF_ID(func, tcp_slow_start)
 BTF_ID(func, tcp_cong_avoid_ai)
+#ifdef CONFIG_DYNAMIC_FTRACE
 #if IS_BUILTIN(CONFIG_TCP_CONG_CUBIC)
 BTF_ID(func, cubictcp_init)
 BTF_ID(func, cubictcp_recalc_ssthresh)
@@ -211,6 +212,7 @@ BTF_ID(func, bbr_ssthresh)
 BTF_ID(func, bbr_min_tso_segs)
 BTF_ID(func, bbr_set_state)
 #endif
+#endif  /* CONFIG_DYNAMIC_FTRACE */
 BTF_SET_END(bpf_tcp_ca_kfunc_ids)
 
 static bool bpf_tcp_ca_check_kfunc_call(u32 kfunc_btf_id)
-- 
2.30.2



Re: [PATCH v2 bpf-next 00/14] bpf: Support calling kernel function

2021-03-29 Thread Martin KaFai Lau
On Mon, Mar 29, 2021 at 05:06:26PM +0100, Lorenz Bauer wrote:
> On Mon, 29 Mar 2021 at 02:25, Martin KaFai Lau  wrote:
> >
> > > > >
> > > > > # pahole --version
> > > > > v1.17
> > > >
> > > > That is the most likely reason.
> > > > In lib/Kconfig.debug
> > > > we have pahole >= 1.19 requirement for BTF in modules.
> > > > Though your config has CUBIC=y I suspect something odd goes on.
> > > > Could you please try the latest pahole 1.20 ?
> > >
> > > Sure, I will give it a try tomorrow, I am not in control of the CI I ran.
> > Could you also check the CONFIG_DYNAMIC_FTRACE and also try 'y' if it
> > is not set?
> 
> I hit the same problem on newer pahole:
> 
> $ pahole --version
> v1.20
> 
> CONFIG_DYNAMIC_FTRACE=y resolves the issue.
Thanks for checking.

pahole only generates the btf_id for external functions
and ftrace-able functions.  Some functions in the bpf_tcp_ca_kfunc_ids list
are static (e.g. cubictcp_init), so it fails during resolve_btfids.

I will post a patch to limit the bpf_tcp_ca_kfunc_ids list
to CONFIG_DYNAMIC_FTRACE.  I will address the pahole
generation in a followup and then remove this
CONFIG_DYNAMIC_FTRACE limitation.


Re: [PATCH v2 bpf-next 00/14] bpf: Support calling kernel function

2021-03-28 Thread Martin KaFai Lau
On Sun, Mar 28, 2021 at 01:13:35PM -0700, Cong Wang wrote:
> On Sat, Mar 27, 2021 at 3:54 PM Alexei Starovoitov
>  wrote:
> >
> > On Sat, Mar 27, 2021 at 3:08 PM Cong Wang  wrote:
> > >   BTFIDS  vmlinux
> > > FAILED unresolved symbol cubictcp_state
> > > make: *** [Makefile:1199: vmlinux] Error 255
> > >
> > > I suspect it is related to the kernel config or linker version.
> > >
> > > # grep TCP_CONG .config
> > > CONFIG_TCP_CONG_ADVANCED=y
> > > CONFIG_TCP_CONG_BIC=m
> > > CONFIG_TCP_CONG_CUBIC=y
> > ..
> > >
> > > # pahole --version
> > > v1.17
> >
> > That is the most likely reason.
> > In lib/Kconfig.debug
> > we have pahole >= 1.19 requirement for BTF in modules.
> > Though your config has CUBIC=y I suspect something odd goes on.
> > Could you please try the latest pahole 1.20 ?
> 
> Sure, I will give it a try tomorrow, I am not in control of the CI I ran.
Could you also check the CONFIG_DYNAMIC_FTRACE and also try 'y' if it
is not set?


[PATCH bpf-next] bpf: tcp: Fix an error in the bpf_tcp_ca_kfunc_ids list

2021-03-28 Thread Martin KaFai Lau
There is a typo in one of the bbr function names, s/even/event/.
This patch fixes it.

Fixes: e78aea8b2170 ("bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc")
Signed-off-by: Martin KaFai Lau 
---
 net/ipv4/bpf_tcp_ca.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 12777d444d0f..6bb7b335ff9f 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -206,7 +206,7 @@ BTF_ID(func, bbr_init)
 BTF_ID(func, bbr_main)
 BTF_ID(func, bbr_sndbuf_expand)
 BTF_ID(func, bbr_undo_cwnd)
-BTF_ID(func, bbr_cwnd_even)
+BTF_ID(func, bbr_cwnd_event)
 BTF_ID(func, bbr_ssthresh)
 BTF_ID(func, bbr_min_tso_segs)
 BTF_ID(func, bbr_set_state)
-- 
2.30.2



Re: [PATCH v2 bpf-next 03/14] bpf: Support bpf program calling kernel function

2021-03-25 Thread Martin KaFai Lau
On Thu, Mar 25, 2021 at 11:02:23PM +0100, Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau  writes:
> 
> > This patch adds support to BPF verifier to allow bpf program calling
> > kernel function directly.
> 
> Hi Martin
> 
> This is exciting stuff! :)
> 
> Just one quick question about this:
> 
> > [ For the future calling function-in-kernel-module support, an array
> >   of module btf_fds can be passed at the load time and insn->off
> >   can be used to index into this array. ]
> 
> Is adding the support for extending this to modules also on your radar,
> or is this more of an "in case someone needs it" comment? :)
It is on my list.  I don't mind if someone beats me to it, though,
if they have an immediate use case. ;)


Re: [PATCH bpf v2 2/2] bpf/selftests: test that kernel rejects a TCP CC with an invalid license

2021-03-25 Thread Martin KaFai Lau
On Thu, Mar 25, 2021 at 10:11:22PM +0100, Toke Høiland-Jørgensen wrote:
> This adds a selftest to check that the verifier rejects a TCP CC struct_ops
> with a non-GPL license.
> 
> v2:
> - Use a minimal struct_ops BPF program instead of rewriting bpf_dctcp's
>   license in memory.
> - Check for the verifier reject message instead of just the return code.
> 
> Signed-off-by: Toke Høiland-Jørgensen 
> ---
>  .../selftests/bpf/prog_tests/bpf_tcp_ca.c | 44 +++
>  .../selftests/bpf/progs/bpf_nogpltcp.c| 19 
>  2 files changed, 63 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/bpf_nogpltcp.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> index 37c5494a0381..a09c716528e1 100644
> --- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include "bpf_dctcp.skel.h"
>  #include "bpf_cubic.skel.h"
> +#include "bpf_nogpltcp.skel.h"
>  
>  #define min(a, b) ((a) < (b) ? (a) : (b))
>  
> @@ -227,10 +228,53 @@ static void test_dctcp(void)
>   bpf_dctcp__destroy(dctcp_skel);
>  }
>  
> +static char *err_str = NULL;
> +static bool found = false;
Nit. These two inits are not needed.

> +
> +static int libbpf_debug_print(enum libbpf_print_level level,
> +   const char *format, va_list args)
> +{
> + char *log_buf;
> +
> + if (level != LIBBPF_WARN ||
> + strcmp(format, "libbpf: \n%s\n")) {
> + vprintf(format, args);
> + return 0;
> + }
> +
> + log_buf = va_arg(args, char *);
> + if (!log_buf)
> + goto out;
> + if (err_str && strstr(log_buf, err_str) != NULL)
> + found = true;
> +out:
> + printf(format, log_buf);
> + return 0;
> +}
> +
> +static void test_invalid_license(void)
> +{
> + libbpf_print_fn_t old_print_fn = NULL;
Nit. Same here.  Not need to init NULL.

Others lgtm.

Acked-by: Martin KaFai Lau 

> + struct bpf_nogpltcp *skel;
> +
> + err_str = "struct ops programs must have a GPL compatible license";
> + old_print_fn = libbpf_set_print(libbpf_debug_print);
> +
> + skel = bpf_nogpltcp__open_and_load();
> + if (CHECK(skel, "bpf_nogplgtcp__open_and_load()", "didn't fail\n"))
> + bpf_nogpltcp__destroy(skel);
> +
> + CHECK(!found, "errmsg check", "expected string '%s'", err_str);
> +
> + libbpf_set_print(old_print_fn);
> +}


Re: [PATCH bpf 2/2] bpf/selftests: test that kernel rejects a TCP CC with an invalid license

2021-03-25 Thread Martin KaFai Lau
On Thu, Mar 25, 2021 at 04:40:34PM +0100, Toke Høiland-Jørgensen wrote:
> This adds a selftest to check that the verifier rejects a TCP CC struct_ops
> with a non-GPL license. To save having to add a whole new BPF object just
> for this, reuse the dctcp CC, but rewrite the license field before loading.
> 
> Signed-off-by: Toke Høiland-Jørgensen 
> ---
>  .../selftests/bpf/prog_tests/bpf_tcp_ca.c | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> index 37c5494a0381..613cf8a00b22 100644
> --- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
> @@ -227,10 +227,41 @@ static void test_dctcp(void)
>   bpf_dctcp__destroy(dctcp_skel);
>  }
>  
> +static void test_invalid_license(void)
> +{
> + /* We want to check that the verifier refuses to load a non-GPL TCP CC.
> +  * Rather than create a whole new file+skeleton, just reuse an existing
> +  * object and rewrite the license in memory after loading. Since libbpf
> +  * doesn't expose this, we define a struct that includes the first couple
> +  * of internal fields for struct bpf_object so we can overwrite the right
> +  * bits. Yes, this is a bit of a hack, but it makes the test a lot simpler.
> +  */
> + struct bpf_object_fragment {
> + char name[BPF_OBJ_NAME_LEN];
> + char license[64];
> + } *obj;
It is fragile.  A new bpf_nogpltcp.c should be created and it does
not have to be a full tcp-cc.  A very minimal implementation with
only .init. Something like this (uncompiled code):

char _license[] SEC("license") = "X";

void BPF_STRUCT_OPS(nogpltcp_init, struct sock *sk)
{
}

SEC(".struct_ops")
struct tcp_congestion_ops bpf_nogpltcp = {
.init   = (void *)nogpltcp_init,
.name   = "bpf_nogpltcp",
};

libbpf_set_print() can also be used to look for the
verifier log "struct ops programs must have a GPL compatible license".


Re: [PATCH bpf 1/2] bpf: enforce that struct_ops programs be GPL-only

2021-03-25 Thread Martin KaFai Lau
On Thu, Mar 25, 2021 at 04:40:33PM +0100, Toke Høiland-Jørgensen wrote:
> With the introduction of the struct_ops program type, it became possible to
> implement kernel functionality in BPF, making it viable to use BPF in place
> of a regular kernel module for these particular operations.
> 
> Thus far, the only user of this mechanism is for implementing TCP
> congestion control algorithms. These are clearly marked as GPL-only when
> implemented as modules (as seen by the use of EXPORT_SYMBOL_GPL for
> tcp_register_congestion_control()), so it seems like an oversight that this
> was not carried over to BPF implementations. And since this is the only user
> of the struct_ops mechanism, just enforcing GPL-only for the struct_ops
> program type seems like the simplest way to fix this.
> 
> Fixes: 0baf26b0fcd7 ("bpf: tcp: Support tcp_congestion_ops in bpf")
> Signed-off-by: Toke Høiland-Jørgensen 
> ---
>  kernel/bpf/verifier.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 44e4ec1640f1..48dd0c0f087c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -12166,6 +12166,11 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env)
>   return -ENOTSUPP;
>   }
>  
> + if (!prog->gpl_compatible) {
> + verbose(env, "struct ops programs must have a GPL compatible license\n");
> + return -EINVAL;
> + }
> +
Thanks for the patch.

A nit.  Instead of sitting in between of the attach_btf_id check
and expected_attach_type check, how about moving it to the beginning
of this function.  Checking attach_btf_id and expected_attach_type
would make more sense to be done next to each other as in the current
code.

Acked-by: Martin KaFai Lau 


[PATCH v2 bpf-next 14/14] bpf: selftests: Add kfunc_call test

2021-03-24 Thread Martin KaFai Lau
This patch adds a few kernel functions bpf_kfunc_call_test*() for the
selftest's test_run purpose.  They will be allowed for the tc_cls prog.

The selftest calling the kernel function bpf_kfunc_call_test*()
is also added in this patch.

Signed-off-by: Martin KaFai Lau 
---
 include/linux/bpf.h   |  6 ++
 net/bpf/test_run.c| 28 +
 net/core/filter.c |  1 +
 .../selftests/bpf/prog_tests/kfunc_call.c | 59 +++
 .../selftests/bpf/progs/kfunc_call_test.c | 47 +++
 .../bpf/progs/kfunc_call_test_subprog.c   | 42 +
 6 files changed, 183 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/kfunc_call.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_subprog.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c6439e96fa4a..4b31b30c4961 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1501,6 +1501,7 @@ int bpf_prog_test_run_raw_tp(struct bpf_prog *prog,
 int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog,
const union bpf_attr *kattr,
union bpf_attr __user *uattr);
+bool bpf_prog_test_check_kfunc_call(u32 kfunc_id);
 bool btf_ctx_access(int off, int size, enum bpf_access_type type,
const struct bpf_prog *prog,
struct bpf_insn_access_aux *info);
@@ -1700,6 +1701,11 @@ static inline int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog,
return -ENOTSUPP;
 }
 
+static inline bool bpf_prog_test_check_kfunc_call(u32 kfunc_id)
+{
+   return false;
+}
+
 static inline void bpf_map_put(struct bpf_map *map)
 {
 }
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 0abdd67f44b1..7f3bce909b42 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2017 Facebook
  */
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -209,10 +210,37 @@ int noinline bpf_modify_return_test(int a, int *b)
*b += 1;
return a + *b;
 }
+
+u64 noinline bpf_kfunc_call_test1(struct sock *sk, u32 a, u64 b, u32 c, u64 d)
+{
+   return a + b + c + d;
+}
+
+int noinline bpf_kfunc_call_test2(struct sock *sk, u32 a, u32 b)
+{
+   return a + b;
+}
+
+struct sock * noinline bpf_kfunc_call_test3(struct sock *sk)
+{
+   return sk;
+}
+
 __diag_pop();
 
 ALLOW_ERROR_INJECTION(bpf_modify_return_test, ERRNO);
 
+BTF_SET_START(test_sk_kfunc_ids)
+BTF_ID(func, bpf_kfunc_call_test1)
+BTF_ID(func, bpf_kfunc_call_test2)
+BTF_ID(func, bpf_kfunc_call_test3)
+BTF_SET_END(test_sk_kfunc_ids)
+
+bool bpf_prog_test_check_kfunc_call(u32 kfunc_id)
+{
+   return btf_id_set_contains(&test_sk_kfunc_ids, kfunc_id);
+}
+
 static void *bpf_test_init(const union bpf_attr *kattr, u32 size,
   u32 headroom, u32 tailroom)
 {
diff --git a/net/core/filter.c b/net/core/filter.c
index 10dac9dd5086..8a7d23c75ee3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9805,6 +9805,7 @@ const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
.convert_ctx_access = tc_cls_act_convert_ctx_access,
.gen_prologue   = tc_cls_act_prologue,
.gen_ld_abs = bpf_gen_ld_abs,
+   .check_kfunc_call   = bpf_prog_test_check_kfunc_call,
 };
 
 const struct bpf_prog_ops tc_cls_act_prog_ops = {
diff --git a/tools/testing/selftests/bpf/prog_tests/kfunc_call.c 
b/tools/testing/selftests/bpf/prog_tests/kfunc_call.c
new file mode 100644
index ..7fc0951ee75f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/kfunc_call.c
@@ -0,0 +1,59 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include 
+#include 
+#include "kfunc_call_test.skel.h"
+#include "kfunc_call_test_subprog.skel.h"
+
+static void test_main(void)
+{
+   struct kfunc_call_test *skel;
+   int prog_fd, retval, err;
+
+   skel = kfunc_call_test__open_and_load();
+   if (!ASSERT_OK_PTR(skel, "skel"))
+   return;
+
+   prog_fd = bpf_program__fd(skel->progs.kfunc_call_test1);
+   err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+   NULL, NULL, (__u32 *)&retval, NULL);
+   ASSERT_OK(err, "bpf_prog_test_run(test1)");
+   ASSERT_EQ(retval, 12, "test1-retval");
+
+   prog_fd = bpf_program__fd(skel->progs.kfunc_call_test2);
+   err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+   NULL, NULL, (__u32 *)&retval, NULL);
+   ASSERT_OK(err, "bpf_prog_test_run(test2)");
+   ASSERT_EQ(retval, 3, "test2-retval");
+
+   kfunc_call_test__destroy(skel);
+}
+
+static void test_subprog(void)
+{
+   struct kfunc_call_test_subprog *skel;
+ 

[PATCH v2 bpf-next 13/14] bpf: selftests: bpf_cubic and bpf_dctcp calling kernel functions

2021-03-24 Thread Martin KaFai Lau
This patch removes the bpf implementation of tcp_slow_start()
and tcp_cong_avoid_ai().  Instead, it directly uses the kernel
implementation.

It also replaces the bpf_cubic_undo_cwnd implementation by directly
calling tcp_reno_undo_cwnd().  bpf_dctcp also directly calls
tcp_reno_cong_avoid() instead.

Acked-by: Andrii Nakryiko 
Signed-off-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/bpf_tcp_helpers.h | 29 ++-
 tools/testing/selftests/bpf/progs/bpf_cubic.c |  6 ++--
 tools/testing/selftests/bpf/progs/bpf_dctcp.c | 22 --
 3 files changed, 11 insertions(+), 46 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h 
b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
index 91f0fac632f4..029589c008c9 100644
--- a/tools/testing/selftests/bpf/bpf_tcp_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
@@ -187,16 +187,6 @@ struct tcp_congestion_ops {
typeof(y) __y = (y);\
__x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })
 
-static __always_inline __u32 tcp_slow_start(struct tcp_sock *tp, __u32 acked)
-{
-   __u32 cwnd = min(tp->snd_cwnd + acked, tp->snd_ssthresh);
-
-   acked -= cwnd - tp->snd_cwnd;
-   tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
-
-   return acked;
-}
-
 static __always_inline bool tcp_in_slow_start(const struct tcp_sock *tp)
 {
return tp->snd_cwnd < tp->snd_ssthresh;
@@ -213,22 +203,7 @@ static __always_inline bool tcp_is_cwnd_limited(const 
struct sock *sk)
return !!BPF_CORE_READ_BITFIELD(tp, is_cwnd_limited);
 }
 
-static __always_inline void tcp_cong_avoid_ai(struct tcp_sock *tp, __u32 w, 
__u32 acked)
-{
-   /* If credits accumulated at a higher w, apply them gently now. */
-   if (tp->snd_cwnd_cnt >= w) {
-   tp->snd_cwnd_cnt = 0;
-   tp->snd_cwnd++;
-   }
-
-   tp->snd_cwnd_cnt += acked;
-   if (tp->snd_cwnd_cnt >= w) {
-   __u32 delta = tp->snd_cwnd_cnt / w;
-
-   tp->snd_cwnd_cnt -= delta * w;
-   tp->snd_cwnd += delta;
-   }
-   tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_cwnd_clamp);
-}
+extern __u32 tcp_slow_start(struct tcp_sock *tp, __u32 acked) __ksym;
+extern void tcp_cong_avoid_ai(struct tcp_sock *tp, __u32 w, __u32 acked) 
__ksym;
 
 #endif
diff --git a/tools/testing/selftests/bpf/progs/bpf_cubic.c 
b/tools/testing/selftests/bpf/progs/bpf_cubic.c
index 33c4d2bded64..f62df4d023f9 100644
--- a/tools/testing/selftests/bpf/progs/bpf_cubic.c
+++ b/tools/testing/selftests/bpf/progs/bpf_cubic.c
@@ -525,11 +525,11 @@ void BPF_STRUCT_OPS(bpf_cubic_acked, struct sock *sk,
hystart_update(sk, delay);
 }
 
+extern __u32 tcp_reno_undo_cwnd(struct sock *sk) __ksym;
+
 __u32 BPF_STRUCT_OPS(bpf_cubic_undo_cwnd, struct sock *sk)
 {
-   const struct tcp_sock *tp = tcp_sk(sk);
-
-   return max(tp->snd_cwnd, tp->prior_cwnd);
+   return tcp_reno_undo_cwnd(sk);
 }
 
 SEC(".struct_ops")
diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp.c 
b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
index 4dc1a967776a..fd42247da8b4 100644
--- a/tools/testing/selftests/bpf/progs/bpf_dctcp.c
+++ b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
@@ -194,22 +194,12 @@ __u32 BPF_PROG(dctcp_cwnd_undo, struct sock *sk)
return max(tcp_sk(sk)->snd_cwnd, ca->loss_cwnd);
 }
 
-SEC("struct_ops/tcp_reno_cong_avoid")
-void BPF_PROG(tcp_reno_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
-{
-   struct tcp_sock *tp = tcp_sk(sk);
-
-   if (!tcp_is_cwnd_limited(sk))
-   return;
+extern void tcp_reno_cong_avoid(struct sock *sk, __u32 ack, __u32 acked) 
__ksym;
 
-   /* In "safe" area, increase. */
-   if (tcp_in_slow_start(tp)) {
-   acked = tcp_slow_start(tp, acked);
-   if (!acked)
-   return;
-   }
-   /* In dangerous area, increase slowly. */
-   tcp_cong_avoid_ai(tp, tp->snd_cwnd, acked);
+SEC("struct_ops/dctcp_reno_cong_avoid")
+void BPF_PROG(dctcp_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
+{
+   tcp_reno_cong_avoid(sk, ack, acked);
 }
 
 SEC(".struct_ops")
@@ -226,7 +216,7 @@ struct tcp_congestion_ops dctcp = {
.in_ack_event   = (void *)dctcp_update_alpha,
.cwnd_event = (void *)dctcp_cwnd_event,
.ssthresh   = (void *)dctcp_ssthresh,
-   .cong_avoid = (void *)tcp_reno_cong_avoid,
+   .cong_avoid = (void *)dctcp_cong_avoid,
.undo_cwnd  = (void *)dctcp_cwnd_undo,
.set_state  = (void *)dctcp_state,
.flags  = TCP_CONG_NEEDS_ECN,
-- 
2.30.2



[PATCH v2 bpf-next 12/14] bpf: selftests: Rename bictcp to bpf_cubic

2021-03-24 Thread Martin KaFai Lau
Following a similar change in the kernel, this patch gives the proper
name to the bpf cubic implementation.

Acked-by: Andrii Nakryiko 
Signed-off-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/progs/bpf_cubic.c | 30 +--
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/bpf_cubic.c 
b/tools/testing/selftests/bpf/progs/bpf_cubic.c
index 6939bfd8690f..33c4d2bded64 100644
--- a/tools/testing/selftests/bpf/progs/bpf_cubic.c
+++ b/tools/testing/selftests/bpf/progs/bpf_cubic.c
@@ -174,8 +174,8 @@ static __always_inline void bictcp_hystart_reset(struct 
sock *sk)
  * as long as it is used in one of the func ptr
  * under SEC(".struct_ops").
  */
-SEC("struct_ops/bictcp_init")
-void BPF_PROG(bictcp_init, struct sock *sk)
+SEC("struct_ops/bpf_cubic_init")
+void BPF_PROG(bpf_cubic_init, struct sock *sk)
 {
struct bictcp *ca = inet_csk_ca(sk);
 
@@ -192,7 +192,7 @@ void BPF_PROG(bictcp_init, struct sock *sk)
  * The remaining tcp-cubic functions have an easier way.
  */
 SEC("no-sec-prefix-bictcp_cwnd_event")
-void BPF_PROG(bictcp_cwnd_event, struct sock *sk, enum tcp_ca_event event)
+void BPF_PROG(bpf_cubic_cwnd_event, struct sock *sk, enum tcp_ca_event event)
 {
if (event == CA_EVENT_TX_START) {
struct bictcp *ca = inet_csk_ca(sk);
@@ -384,7 +384,7 @@ static __always_inline void bictcp_update(struct bictcp 
*ca, __u32 cwnd,
 }
 
 /* Or simply use the BPF_STRUCT_OPS to avoid the SEC boiler plate. */
-void BPF_STRUCT_OPS(bictcp_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
+void BPF_STRUCT_OPS(bpf_cubic_cong_avoid, struct sock *sk, __u32 ack, __u32 
acked)
 {
struct tcp_sock *tp = tcp_sk(sk);
struct bictcp *ca = inet_csk_ca(sk);
@@ -403,7 +403,7 @@ void BPF_STRUCT_OPS(bictcp_cong_avoid, struct sock *sk, 
__u32 ack, __u32 acked)
tcp_cong_avoid_ai(tp, ca->cnt, acked);
 }
 
-__u32 BPF_STRUCT_OPS(bictcp_recalc_ssthresh, struct sock *sk)
+__u32 BPF_STRUCT_OPS(bpf_cubic_recalc_ssthresh, struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
struct bictcp *ca = inet_csk_ca(sk);
@@ -420,7 +420,7 @@ __u32 BPF_STRUCT_OPS(bictcp_recalc_ssthresh, struct sock 
*sk)
return max((tp->snd_cwnd * beta) / BICTCP_BETA_SCALE, 2U);
 }
 
-void BPF_STRUCT_OPS(bictcp_state, struct sock *sk, __u8 new_state)
+void BPF_STRUCT_OPS(bpf_cubic_state, struct sock *sk, __u8 new_state)
 {
if (new_state == TCP_CA_Loss) {
bictcp_reset(inet_csk_ca(sk));
@@ -496,7 +496,7 @@ static __always_inline void hystart_update(struct sock *sk, 
__u32 delay)
}
 }
 
-void BPF_STRUCT_OPS(bictcp_acked, struct sock *sk,
+void BPF_STRUCT_OPS(bpf_cubic_acked, struct sock *sk,
const struct ack_sample *sample)
 {
const struct tcp_sock *tp = tcp_sk(sk);
@@ -525,7 +525,7 @@ void BPF_STRUCT_OPS(bictcp_acked, struct sock *sk,
hystart_update(sk, delay);
 }
 
-__u32 BPF_STRUCT_OPS(tcp_reno_undo_cwnd, struct sock *sk)
+__u32 BPF_STRUCT_OPS(bpf_cubic_undo_cwnd, struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
 
@@ -534,12 +534,12 @@ __u32 BPF_STRUCT_OPS(tcp_reno_undo_cwnd, struct sock *sk)
 
 SEC(".struct_ops")
 struct tcp_congestion_ops cubic = {
-   .init   = (void *)bictcp_init,
-   .ssthresh   = (void *)bictcp_recalc_ssthresh,
-   .cong_avoid = (void *)bictcp_cong_avoid,
-   .set_state  = (void *)bictcp_state,
-   .undo_cwnd  = (void *)tcp_reno_undo_cwnd,
-   .cwnd_event = (void *)bictcp_cwnd_event,
-   .pkts_acked = (void *)bictcp_acked,
+   .init   = (void *)bpf_cubic_init,
+   .ssthresh   = (void *)bpf_cubic_recalc_ssthresh,
+   .cong_avoid = (void *)bpf_cubic_cong_avoid,
+   .set_state  = (void *)bpf_cubic_state,
+   .undo_cwnd  = (void *)bpf_cubic_undo_cwnd,
+   .cwnd_event = (void *)bpf_cubic_cwnd_event,
+   .pkts_acked = (void *)bpf_cubic_acked,
.name   = "bpf_cubic",
 };
-- 
2.30.2



[PATCH v2 bpf-next 11/14] libbpf: Support extern kernel function

2021-03-24 Thread Martin KaFai Lau
This patch makes libbpf able to handle the following extern
kernel function declaration and do the needed relocations before
loading the bpf program to the kernel.

extern int foo(struct sock *) __attribute__((section(".ksyms")));

In the collect-extern phase, the needed changes are made to
bpf_object__collect_externs() and find_extern_btf_id() to collect
extern functions in the ".ksyms" section.  The func in the BTF datasec
also needs to be replaced by an int var.  The idea is similar to the
existing handling of extern vars.  In case the BTF does not have a var,
a dummy ksym var is added at the beginning of bpf_object__collect_externs()
if there is a func under the ksyms datasec.  It also changes the
func linkage from extern to global, which the kernel can support,
and assigns a param name if the func does not have one.

In the collect relo phase, it will record the kernel function
call as RELO_EXTERN_FUNC.

bpf_object__resolve_ksym_func_btf_id() is added to find the func
btf_id of the running kernel.

During actual relocation, it will patch the BPF_CALL instruction with
src_reg = BPF_PSEUDO_KFUNC_CALL and insn->imm set to the running
kernel func's btf_id.

The required LLVM patch: https://reviews.llvm.org/D93563
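
For illustration, a bpf-C prog using such an extern kfunc looks roughly
like this (minimal sketch modeled on the kfunc_call selftest added later
in this series; struct sock comes from the selftest's bpf_tcp_helpers.h):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "bpf_tcp_helpers.h"

extern int bpf_kfunc_call_test2(struct sock *sk, __u32 a, __u32 b) __ksym;

SEC("classifier")
int kfunc_call_test2(struct __sk_buff *skb)
{
	struct bpf_sock *sk = skb->sk;

	if (!sk)
		return -1;

	sk = bpf_sk_fullsock(sk);
	if (!sk)
		return -1;

	/* libbpf records this call as RELO_EXTERN_FUNC and patches the
	 * insn to BPF_PSEUDO_KFUNC_CALL + the running kernel's btf_id
	 */
	return bpf_kfunc_call_test2((struct sock *)sk, 1, 2);
}

char _license[] SEC("license") = "GPL";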

Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/libbpf.c | 174 ++---
 1 file changed, 162 insertions(+), 12 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 23148566ab3a..c65e56c581f2 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -186,6 +186,7 @@ enum reloc_type {
RELO_CALL,
RELO_DATA,
RELO_EXTERN_VAR,
+   RELO_EXTERN_FUNC,
RELO_SUBPROG_ADDR,
 };
 
@@ -1954,6 +1955,11 @@ static const char *btf_kind_str(const struct btf_type *t)
return __btf_kind_str(btf_kind(t));
 }
 
+static enum btf_func_linkage btf_func_linkage(const struct btf_type *t)
+{
+   return (enum btf_func_linkage)BTF_INFO_VLEN(t->info);
+}
+
 /*
  * Fetch integer attribute of BTF map definition. Such attributes are
  * represented using a pointer to an array, in which dimensionality of array
@@ -3018,7 +3024,7 @@ static bool sym_is_subprog(const GElf_Sym *sym, int 
text_shndx)
 static int find_extern_btf_id(const struct btf *btf, const char *ext_name)
 {
const struct btf_type *t;
-   const char *var_name;
+   const char *tname;
int i, n;
 
if (!btf)
@@ -3028,14 +3034,18 @@ static int find_extern_btf_id(const struct btf *btf, 
const char *ext_name)
for (i = 1; i <= n; i++) {
t = btf__type_by_id(btf, i);
 
-   if (!btf_is_var(t))
+   if (!btf_is_var(t) && !btf_is_func(t))
continue;
 
-   var_name = btf__name_by_offset(btf, t->name_off);
-   if (strcmp(var_name, ext_name))
+   tname = btf__name_by_offset(btf, t->name_off);
+   if (strcmp(tname, ext_name))
continue;
 
-   if (btf_var(t)->linkage != BTF_VAR_GLOBAL_EXTERN)
+   if (btf_is_var(t) &&
+   btf_var(t)->linkage != BTF_VAR_GLOBAL_EXTERN)
+   return -EINVAL;
+
+   if (btf_is_func(t) && btf_func_linkage(t) != BTF_FUNC_EXTERN)
return -EINVAL;
 
return i;
@@ -3148,12 +3158,48 @@ static int find_int_btf_id(const struct btf *btf)
return 0;
 }
 
+static int add_dummy_ksym_var(struct btf *btf)
+{
+   int i, int_btf_id, sec_btf_id, dummy_var_btf_id;
+   const struct btf_var_secinfo *vs;
+   const struct btf_type *sec;
+
+   sec_btf_id = btf__find_by_name_kind(btf, KSYMS_SEC,
+   BTF_KIND_DATASEC);
+   if (sec_btf_id < 0)
+   return 0;
+
+   sec = btf__type_by_id(btf, sec_btf_id);
+   vs = btf_var_secinfos(sec);
+   for (i = 0; i < btf_vlen(sec); i++, vs++) {
+   const struct btf_type *vt;
+
+   vt = btf__type_by_id(btf, vs->type);
+   if (btf_is_func(vt))
+   break;
+   }
+
+   /* No func in ksyms sec.  No need to add dummy var. */
+   if (i == btf_vlen(sec))
+   return 0;
+
+   int_btf_id = find_int_btf_id(btf);
+   dummy_var_btf_id = btf__add_var(btf,
+   "dummy_ksym",
+   BTF_VAR_GLOBAL_ALLOCATED,
+   int_btf_id);
+   if (dummy_var_btf_id < 0)
+   pr_warn("cannot create a dummy_ksym var\n");
+
+   return dummy_var_btf_id;
+}
+
 static int bpf_object__collect_externs(struct bpf_object *obj)
 {
struct btf_type *sec, *kcfg_sec = NULL, *ksym_sec = NULL;
const struct btf_type *t;
struct extern_desc *ext;
-   int i, n, off;
+   int i, n, off,

[PATCH v2 bpf-next 10/14] libbpf: Record extern sym relocation first

2021-03-24 Thread Martin KaFai Lau
This patch records the extern sym relocs before recording the
subprog relocs.  A later patch will add relocs for extern kernel
function calls, which also use BPF_JMP | BPF_CALL, and it will be
easier to handle the extern symbols first there.

An is_call_insn() helper is added.  The existing is_ldimm64() helper
is renamed to is_ldimm64_insn() for consistency.

Acked-by: Andrii Nakryiko 
Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/libbpf.c | 63 +++---
 1 file changed, 34 insertions(+), 29 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1a2dbde19b7e..23148566ab3a 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -573,14 +573,19 @@ static bool insn_is_subprog_call(const struct bpf_insn 
*insn)
   insn->off == 0;
 }
 
-static bool is_ldimm64(struct bpf_insn *insn)
+static bool is_ldimm64_insn(struct bpf_insn *insn)
 {
return insn->code == (BPF_LD | BPF_IMM | BPF_DW);
 }
 
+static bool is_call_insn(const struct bpf_insn *insn)
+{
+   return insn->code == (BPF_JMP | BPF_CALL);
+}
+
 static bool insn_is_pseudo_func(struct bpf_insn *insn)
 {
-   return is_ldimm64(insn) && insn->src_reg == BPF_PSEUDO_FUNC;
+   return is_ldimm64_insn(insn) && insn->src_reg == BPF_PSEUDO_FUNC;
 }
 
 static int
@@ -3407,31 +3412,7 @@ static int bpf_program__record_reloc(struct bpf_program 
*prog,
 
reloc_desc->processed = false;
 
-   /* sub-program call relocation */
-   if (insn->code == (BPF_JMP | BPF_CALL)) {
-   if (insn->src_reg != BPF_PSEUDO_CALL) {
-   pr_warn("prog '%s': incorrect bpf_call opcode\n", 
prog->name);
-   return -LIBBPF_ERRNO__RELOC;
-   }
-   /* text_shndx can be 0, if no default "main" program exists */
-   if (!shdr_idx || shdr_idx != obj->efile.text_shndx) {
-   sym_sec_name = elf_sec_name(obj, elf_sec_by_idx(obj, 
shdr_idx));
-   pr_warn("prog '%s': bad call relo against '%s' in 
section '%s'\n",
-   prog->name, sym_name, sym_sec_name);
-   return -LIBBPF_ERRNO__RELOC;
-   }
-   if (sym->st_value % BPF_INSN_SZ) {
-   pr_warn("prog '%s': bad call relo against '%s' at 
offset %zu\n",
-   prog->name, sym_name, (size_t)sym->st_value);
-   return -LIBBPF_ERRNO__RELOC;
-   }
-   reloc_desc->type = RELO_CALL;
-   reloc_desc->insn_idx = insn_idx;
-   reloc_desc->sym_off = sym->st_value;
-   return 0;
-   }
-
-   if (!is_ldimm64(insn)) {
+   if (!is_call_insn(insn) && !is_ldimm64_insn(insn)) {
pr_warn("prog '%s': invalid relo against '%s' for 
insns[%d].code 0x%x\n",
prog->name, sym_name, insn_idx, insn->code);
return -LIBBPF_ERRNO__RELOC;
@@ -3460,6 +3441,30 @@ static int bpf_program__record_reloc(struct bpf_program 
*prog,
return 0;
}
 
+   /* sub-program call relocation */
+   if (is_call_insn(insn)) {
+   if (insn->src_reg != BPF_PSEUDO_CALL) {
+   pr_warn("prog '%s': incorrect bpf_call opcode\n", 
prog->name);
+   return -LIBBPF_ERRNO__RELOC;
+   }
+   /* text_shndx can be 0, if no default "main" program exists */
+   if (!shdr_idx || shdr_idx != obj->efile.text_shndx) {
+   sym_sec_name = elf_sec_name(obj, elf_sec_by_idx(obj, 
shdr_idx));
+   pr_warn("prog '%s': bad call relo against '%s' in 
section '%s'\n",
+   prog->name, sym_name, sym_sec_name);
+   return -LIBBPF_ERRNO__RELOC;
+   }
+   if (sym->st_value % BPF_INSN_SZ) {
+   pr_warn("prog '%s': bad call relo against '%s' at 
offset %zu\n",
+   prog->name, sym_name, (size_t)sym->st_value);
+   return -LIBBPF_ERRNO__RELOC;
+   }
+   reloc_desc->type = RELO_CALL;
+   reloc_desc->insn_idx = insn_idx;
+   reloc_desc->sym_off = sym->st_value;
+   return 0;
+   }
+
if (!shdr_idx || shdr_idx >= SHN_LORESERVE) {
pr_warn("prog '%s': invalid relo against '%s' in special 
section 0x%x; forgot to initialize global var?..\n",
prog->name, sym_name, 

[PATCH v2 bpf-next 09/14] libbpf: Rename RELO_EXTERN to RELO_EXTERN_VAR

2021-03-24 Thread Martin KaFai Lau
This patch renames RELO_EXTERN to RELO_EXTERN_VAR.
It is to avoid the confusion with a later patch adding
RELO_EXTERN_FUNC.

Acked-by: Andrii Nakryiko 
Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/libbpf.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 5a0cae981784..1a2dbde19b7e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -185,7 +185,7 @@ enum reloc_type {
RELO_LD64,
RELO_CALL,
RELO_DATA,
-   RELO_EXTERN,
+   RELO_EXTERN_VAR,
RELO_SUBPROG_ADDR,
 };
 
@@ -3454,7 +3454,7 @@ static int bpf_program__record_reloc(struct bpf_program 
*prog,
}
pr_debug("prog '%s': found extern #%d '%s' (sym %d) for insn 
#%u\n",
 prog->name, i, ext->name, ext->sym_idx, insn_idx);
-   reloc_desc->type = RELO_EXTERN;
+   reloc_desc->type = RELO_EXTERN_VAR;
reloc_desc->insn_idx = insn_idx;
reloc_desc->sym_off = i; /* sym_off stores extern index */
return 0;
@@ -6217,7 +6217,7 @@ bpf_object__relocate_data(struct bpf_object *obj, struct 
bpf_program *prog)
insn[0].imm = obj->maps[relo->map_idx].fd;
relo->processed = true;
break;
-   case RELO_EXTERN:
+   case RELO_EXTERN_VAR:
ext = &obj->externs[relo->sym_off];
if (ext->type == EXT_KCFG) {
insn[0].src_reg = BPF_PSEUDO_MAP_VALUE;
-- 
2.30.2



[PATCH v2 bpf-next 08/14] libbpf: Refactor codes for finding btf id of a kernel symbol

2021-03-24 Thread Martin KaFai Lau
This patch refactors the code that finds a kernel btf_id by kind
and symbol name into a new function, find_ksym_btf_id().

It also adds a new helper __btf_kind_str() to return
a string for a numeric kind value.

Acked-by: Andrii Nakryiko 
Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/libbpf.c | 44 +++---
 1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 57123a2179b4..5a0cae981784 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1920,9 +1920,9 @@ resolve_func_ptr(const struct btf *btf, __u32 id, __u32 
*res_id)
return btf_is_func_proto(t) ? t : NULL;
 }
 
-static const char *btf_kind_str(const struct btf_type *t)
+static const char *__btf_kind_str(__u16 kind)
 {
-   switch (btf_kind(t)) {
+   switch (kind) {
case BTF_KIND_UNKN: return "void";
case BTF_KIND_INT: return "int";
case BTF_KIND_PTR: return "ptr";
@@ -1944,6 +1944,11 @@ static const char *btf_kind_str(const struct btf_type *t)
}
 }
 
+static const char *btf_kind_str(const struct btf_type *t)
+{
+   return __btf_kind_str(btf_kind(t));
+}
+
 /*
  * Fetch integer attribute of BTF map definition. Such attributes are
  * represented using a pointer to an array, in which dimensionality of array
@@ -7394,18 +7399,17 @@ static int bpf_object__read_kallsyms_file(struct 
bpf_object *obj)
return err;
 }
 
-static int bpf_object__resolve_ksym_var_btf_id(struct bpf_object *obj,
-  struct extern_desc *ext)
+static int find_ksym_btf_id(struct bpf_object *obj, const char *ksym_name,
+   __u16 kind, struct btf **res_btf,
+   int *res_btf_fd)
 {
-   const struct btf_type *targ_var, *targ_type;
-   __u32 targ_type_id, local_type_id;
-   const char *targ_var_name;
int i, id, btf_fd, err;
struct btf *btf;
 
btf = obj->btf_vmlinux;
btf_fd = 0;
-   id = btf__find_by_name_kind(btf, ext->name, BTF_KIND_VAR);
+   id = btf__find_by_name_kind(btf, ksym_name, kind);
+
if (id == -ENOENT) {
err = load_module_btfs(obj);
if (err)
@@ -7415,17 +7419,35 @@ static int bpf_object__resolve_ksym_var_btf_id(struct 
bpf_object *obj,
btf = obj->btf_modules[i].btf;
/* we assume module BTF FD is always >0 */
btf_fd = obj->btf_modules[i].fd;
-   id = btf__find_by_name_kind(btf, ext->name, 
BTF_KIND_VAR);
+   id = btf__find_by_name_kind(btf, ksym_name, kind);
if (id != -ENOENT)
break;
}
}
if (id <= 0) {
-   pr_warn("extern (var ksym) '%s': failed to find BTF ID in 
kernel BTF(s).\n",
-   ext->name);
+   pr_warn("extern (%s ksym) '%s': failed to find BTF ID in kernel 
BTF(s).\n",
+   __btf_kind_str(kind), ksym_name);
return -ESRCH;
}
 
+   *res_btf = btf;
+   *res_btf_fd = btf_fd;
+   return id;
+}
+
+static int bpf_object__resolve_ksym_var_btf_id(struct bpf_object *obj,
+  struct extern_desc *ext)
+{
+   const struct btf_type *targ_var, *targ_type;
+   __u32 targ_type_id, local_type_id;
+   const char *targ_var_name;
+   int id, btf_fd = 0, err;
+   struct btf *btf = NULL;
+
+   id = find_ksym_btf_id(obj, ext->name, BTF_KIND_VAR, &btf, &btf_fd);
+   if (id < 0)
+   return id;
+
/* find local type_id */
local_type_id = ext->ksym.type_id;
 
-- 
2.30.2



[PATCH v2 bpf-next 07/14] libbpf: Refactor bpf_object__resolve_ksyms_btf_id

2021-03-24 Thread Martin KaFai Lau
This patch refactors most of the logic from
bpf_object__resolve_ksyms_btf_id() into a new function
bpf_object__resolve_ksym_var_btf_id().
It is to get ready for a later patch adding
bpf_object__resolve_ksym_func_btf_id(), which resolves
a kernel function to the running kernel's btf_id.

Acked-by: Andrii Nakryiko 
Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/libbpf.c | 124 ++---
 1 file changed, 67 insertions(+), 57 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 058b643cbcb1..57123a2179b4 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -7394,75 +7394,85 @@ static int bpf_object__read_kallsyms_file(struct 
bpf_object *obj)
return err;
 }
 
-static int bpf_object__resolve_ksyms_btf_id(struct bpf_object *obj)
+static int bpf_object__resolve_ksym_var_btf_id(struct bpf_object *obj,
+  struct extern_desc *ext)
 {
-   struct extern_desc *ext;
+   const struct btf_type *targ_var, *targ_type;
+   __u32 targ_type_id, local_type_id;
+   const char *targ_var_name;
+   int i, id, btf_fd, err;
struct btf *btf;
-   int i, j, id, btf_fd, err;
 
-   for (i = 0; i < obj->nr_extern; i++) {
-   const struct btf_type *targ_var, *targ_type;
-   __u32 targ_type_id, local_type_id;
-   const char *targ_var_name;
-   int ret;
+   btf = obj->btf_vmlinux;
+   btf_fd = 0;
+   id = btf__find_by_name_kind(btf, ext->name, BTF_KIND_VAR);
+   if (id == -ENOENT) {
+   err = load_module_btfs(obj);
+   if (err)
+   return err;
 
-   ext = &obj->externs[i];
-   if (ext->type != EXT_KSYM || !ext->ksym.type_id)
-   continue;
+   for (i = 0; i < obj->btf_module_cnt; i++) {
+   btf = obj->btf_modules[i].btf;
+   /* we assume module BTF FD is always >0 */
+   btf_fd = obj->btf_modules[i].fd;
+   id = btf__find_by_name_kind(btf, ext->name, 
BTF_KIND_VAR);
+   if (id != -ENOENT)
+   break;
+   }
+   }
+   if (id <= 0) {
+   pr_warn("extern (var ksym) '%s': failed to find BTF ID in 
kernel BTF(s).\n",
+   ext->name);
+   return -ESRCH;
+   }
 
-   btf = obj->btf_vmlinux;
-   btf_fd = 0;
-   id = btf__find_by_name_kind(btf, ext->name, BTF_KIND_VAR);
-   if (id == -ENOENT) {
-   err = load_module_btfs(obj);
-   if (err)
-   return err;
+   /* find local type_id */
+   local_type_id = ext->ksym.type_id;
 
-   for (j = 0; j < obj->btf_module_cnt; j++) {
-   btf = obj->btf_modules[j].btf;
-   /* we assume module BTF FD is always >0 */
-   btf_fd = obj->btf_modules[j].fd;
-   id = btf__find_by_name_kind(btf, ext->name, 
BTF_KIND_VAR);
-   if (id != -ENOENT)
-   break;
-   }
-   }
-   if (id <= 0) {
-   pr_warn("extern (ksym) '%s': failed to find BTF ID in 
kernel BTF(s).\n",
-   ext->name);
-   return -ESRCH;
-   }
+   /* find target type_id */
+   targ_var = btf__type_by_id(btf, id);
+   targ_var_name = btf__name_by_offset(btf, targ_var->name_off);
+   targ_type = skip_mods_and_typedefs(btf, targ_var->type, &targ_type_id);
 
-   /* find local type_id */
-   local_type_id = ext->ksym.type_id;
+   err = bpf_core_types_are_compat(obj->btf, local_type_id,
+   btf, targ_type_id);
+   if (err <= 0) {
+   const struct btf_type *local_type;
+   const char *targ_name, *local_name;
 
-   /* find target type_id */
-   targ_var = btf__type_by_id(btf, id);
-   targ_var_name = btf__name_by_offset(btf, targ_var->name_off);
-   targ_type = skip_mods_and_typedefs(btf, targ_var->type, 
&targ_type_id);
+   local_type = btf__type_by_id(obj->btf, local_type_id);
+   local_name = btf__name_by_offset(obj->btf, 
local_type->name_off);
+   targ_name = btf__name_by_offset(btf, targ_type->name_off);
 
-   ret = bpf_core_types_are_compat(obj->btf, local_type_id,
-   btf, targ_type_id);
-   if (r

[PATCH v2 bpf-next 06/14] bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc

2021-03-24 Thread Martin KaFai Lau
This patch puts some tcp cong helper functions, tcp_slow_start()
and tcp_cong_avoid_ai(), into the allowlist for the bpf-tcp-cc
program.

A few tcp cc implementation functions are also put into the
allowlist.  A potential use case is that the bpf-tcp-cc implementation
may only want to override a subset of a tcp_congestion_ops.  For the
others, the bpf-tcp-cc can directly call the kernel counterparts instead
of re-implementing (or copy-and-pasting) them in the bpf program.

They will only be available to the bpf-tcp-cc typed program.
The allowlisted functions are not bound to a fixed ABI contract.
When any of them changes, the bpf-tcp-cc program has to be changed,
just like any in-tree/out-of-tree kernel tcp-cc implementation.
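
For example, a struct_ops bpf-tcp-cc prog can then reuse an allowlisted
helper directly (rough sketch; tcp_sk() and the struct definitions come
from the selftest's bpf_tcp_helpers.h, and this mirrors the bpf_dctcp
change later in this series):

extern void tcp_cong_avoid_ai(struct tcp_sock *tp, __u32 w, __u32 acked) __ksym;

SEC("struct_ops/my_cong_avoid")
void BPF_PROG(my_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* call the kernel helper instead of a bpf reimplementation */
	tcp_cong_avoid_ai(tp, tp->snd_cwnd, acked);
}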

Acked-by: Andrii Nakryiko 
Signed-off-by: Martin KaFai Lau 
---
 net/ipv4/bpf_tcp_ca.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index d520e61649c8..40520b77a307 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -178,10 +179,50 @@ bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
}
 }
 
+BTF_SET_START(bpf_tcp_ca_kfunc_ids)
+BTF_ID(func, tcp_reno_ssthresh)
+BTF_ID(func, tcp_reno_cong_avoid)
+BTF_ID(func, tcp_reno_undo_cwnd)
+BTF_ID(func, tcp_slow_start)
+BTF_ID(func, tcp_cong_avoid_ai)
+#if IS_BUILTIN(CONFIG_TCP_CONG_CUBIC)
+BTF_ID(func, cubictcp_init)
+BTF_ID(func, cubictcp_recalc_ssthresh)
+BTF_ID(func, cubictcp_cong_avoid)
+BTF_ID(func, cubictcp_state)
+BTF_ID(func, cubictcp_cwnd_event)
+BTF_ID(func, cubictcp_acked)
+#endif
+#if IS_BUILTIN(CONFIG_TCP_CONG_DCTCP)
+BTF_ID(func, dctcp_init)
+BTF_ID(func, dctcp_update_alpha)
+BTF_ID(func, dctcp_cwnd_event)
+BTF_ID(func, dctcp_ssthresh)
+BTF_ID(func, dctcp_cwnd_undo)
+BTF_ID(func, dctcp_state)
+#endif
+#if IS_BUILTIN(CONFIG_TCP_CONG_BBR)
+BTF_ID(func, bbr_init)
+BTF_ID(func, bbr_main)
+BTF_ID(func, bbr_sndbuf_expand)
+BTF_ID(func, bbr_undo_cwnd)
+BTF_ID(func, bbr_cwnd_event)
+BTF_ID(func, bbr_ssthresh)
+BTF_ID(func, bbr_min_tso_segs)
+BTF_ID(func, bbr_set_state)
+#endif
+BTF_SET_END(bpf_tcp_ca_kfunc_ids)
+
+static bool bpf_tcp_ca_check_kfunc_call(u32 kfunc_btf_id)
+{
+   return btf_id_set_contains(&bpf_tcp_ca_kfunc_ids, kfunc_btf_id);
+}
+
 static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
.get_func_proto = bpf_tcp_ca_get_func_proto,
.is_valid_access= bpf_tcp_ca_is_valid_access,
.btf_struct_access  = bpf_tcp_ca_btf_struct_access,
+   .check_kfunc_call   = bpf_tcp_ca_check_kfunc_call,
 };
 
 static int bpf_tcp_ca_init_member(const struct btf_type *t,
-- 
2.30.2



[PATCH v2 bpf-next 05/14] tcp: Rename bictcp function prefix to cubictcp

2021-03-24 Thread Martin KaFai Lau
The cubic functions in tcp_cubic.c are using the bictcp prefix as
in tcp_bic.c.  This patch gives them the proper cubictcp prefix
because a later patch will allow the bpf prog to directly
call the cubictcp implementation.  Renaming them avoids
a name collision when trying to find the intended
function to call during bpf prog load time.

Signed-off-by: Martin KaFai Lau 
---
 net/ipv4/tcp_cubic.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index ffcbe46dacdb..4a30deaa9a37 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -124,7 +124,7 @@ static inline void bictcp_hystart_reset(struct sock *sk)
ca->sample_cnt = 0;
 }
 
-static void bictcp_init(struct sock *sk)
+static void cubictcp_init(struct sock *sk)
 {
struct bictcp *ca = inet_csk_ca(sk);
 
@@ -137,7 +137,7 @@ static void bictcp_init(struct sock *sk)
tcp_sk(sk)->snd_ssthresh = initial_ssthresh;
 }
 
-static void bictcp_cwnd_event(struct sock *sk, enum tcp_ca_event event)
+static void cubictcp_cwnd_event(struct sock *sk, enum tcp_ca_event event)
 {
if (event == CA_EVENT_TX_START) {
struct bictcp *ca = inet_csk_ca(sk);
@@ -319,7 +319,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 
cwnd, u32 acked)
ca->cnt = max(ca->cnt, 2U);
 }
 
-static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
+static void cubictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
 {
struct tcp_sock *tp = tcp_sk(sk);
struct bictcp *ca = inet_csk_ca(sk);
@@ -338,7 +338,7 @@ static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 
acked)
tcp_cong_avoid_ai(tp, ca->cnt, acked);
 }
 
-static u32 bictcp_recalc_ssthresh(struct sock *sk)
+static u32 cubictcp_recalc_ssthresh(struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
struct bictcp *ca = inet_csk_ca(sk);
@@ -355,7 +355,7 @@ static u32 bictcp_recalc_ssthresh(struct sock *sk)
return max((tp->snd_cwnd * beta) / BICTCP_BETA_SCALE, 2U);
 }
 
-static void bictcp_state(struct sock *sk, u8 new_state)
+static void cubictcp_state(struct sock *sk, u8 new_state)
 {
if (new_state == TCP_CA_Loss) {
bictcp_reset(inet_csk_ca(sk));
@@ -442,7 +442,7 @@ static void hystart_update(struct sock *sk, u32 delay)
}
 }
 
-static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
+static void cubictcp_acked(struct sock *sk, const struct ack_sample *sample)
 {
const struct tcp_sock *tp = tcp_sk(sk);
struct bictcp *ca = inet_csk_ca(sk);
@@ -471,13 +471,13 @@ static void bictcp_acked(struct sock *sk, const struct 
ack_sample *sample)
 }
 
 static struct tcp_congestion_ops cubictcp __read_mostly = {
-   .init   = bictcp_init,
-   .ssthresh   = bictcp_recalc_ssthresh,
-   .cong_avoid = bictcp_cong_avoid,
-   .set_state  = bictcp_state,
+   .init   = cubictcp_init,
+   .ssthresh   = cubictcp_recalc_ssthresh,
+   .cong_avoid = cubictcp_cong_avoid,
+   .set_state  = cubictcp_state,
.undo_cwnd  = tcp_reno_undo_cwnd,
-   .cwnd_event = bictcp_cwnd_event,
-   .pkts_acked = bictcp_acked,
+   .cwnd_event = cubictcp_cwnd_event,
+   .pkts_acked = cubictcp_acked,
.owner  = THIS_MODULE,
.name   = "cubic",
 };
-- 
2.30.2



[PATCH v2 bpf-next 03/14] bpf: Support bpf program calling kernel function

2021-03-24 Thread Martin KaFai Lau
This patch adds support to the BPF verifier to allow a bpf program to
call a kernel function directly.

The use case included in this set is to allow bpf-tcp-cc to directly
call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
functions have already been used by some kernel tcp-cc implementations.

This set will also allow the bpf-tcp-cc program to directly call the
kernel tcp-cc implementation.  For example, a bpf_dctcp may only want to
implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly
from the kernel tcp_dctcp.c instead of reimplementing (or
copy-and-pasting) them.

The tcp-cc kernel functions mentioned above will be allowlisted
for the struct_ops bpf-tcp-cc programs to use in a later patch.
The allowlisted functions are not bound to a fixed ABI contract.
Those functions have already been used by the existing kernel tcp-cc.
If any of them changes, both in-tree and out-of-tree kernel tcp-cc
implementations have to be changed.  The same goes for the struct_ops
bpf-tcp-cc programs, which have to be adjusted accordingly.

This patch is to make the required changes in the bpf verifier.

The first change is in btf.c: it adds a case in "btf_check_func_arg_match()".
When the passed-in "btf->kernel_btf == true", it means matching the
verifier regs' states with a kernel function.  This will handle the
PTR_TO_BTF_ID reg.  It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET,
and PTR_TO_TCP_SOCK to its kernel's btf_id.

In the later libbpf patch, the insn calling a kernel function will
look like:

insn->code == (BPF_JMP | BPF_CALL)
insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */
insn->imm == func_btf_id /* btf_id of the running kernel */

[ For the future calling function-in-kernel-module support, an array
  of module btf_fds can be passed at the load time and insn->off
  can be used to index into this array. ]

At the early stage of verifier, the verifier will collect all kernel
function calls into "struct bpf_kfunc_desc".  Those
descriptors are stored in "prog->aux->kfunc_tab" and will
be available to the JIT.  Since this "add" operation is similar
to the current "add_subprog()" and looking for the same insn->code,
they are done together in the new "add_subprog_and_kfunc()".

In the "do_check()" stage, the new "check_kfunc_call()" is added
to verify the kernel function call instruction:
1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE.
   A new bpf_verifier_ops "check_kfunc_call" is added to do that.
   The bpf-tcp-cc struct_ops program will implement this function in
   a later patch.
2. Call "btf_check_kfunc_args_match()" to ensure the regs can be
   used as the args of a kernel function.
3. Mark the regs' type, subreg_def, and zext_dst.
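
In rough pseudo-C, the new check looks like this (simplified sketch;
the actual check_kfunc_call() in the patch handles more verifier state,
e.g. pointer return types and register sizes):

static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn)
{
	struct bpf_reg_state *regs = cur_regs(env);
	u32 func_id = insn->imm;
	int err;

	/* 1. prog-type specific allowlist */
	if (!env->ops->check_kfunc_call ||
	    !env->ops->check_kfunc_call(func_id)) {
		verbose(env, "calling kernel function is not allowed\n");
		return -EACCES;
	}

	/* 2. match the regs against the kernel func's BTF args */
	err = btf_check_kfunc_arg_match(env, btf_vmlinux, func_id, regs);
	if (err)
		return err;

	/* 3. caller-saved regs become unreadable; R0 is marked
	 * (type, subreg_def, zext_dst) from the func proto
	 */
	mark_reg_unknown(env, regs, BPF_REG_0);
	return 0;
}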

At the later do_misc_fixups() stage, the new fixup_kfunc_call()
will replace the insn->imm with the function address (relative
to __bpf_call_base).  If needed, the jit can find the btf_func_model
by calling the new bpf_jit_find_kfunc_model(prog, insn).
With the imm set to the function address, "bpftool prog dump xlated"
will be able to display the kernel function calls the same way as
it displays other bpf helper calls.
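
Roughly (sketch; the actual fixup_kfunc_call() in this patch may differ
in details such as error reporting):

static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn)
{
	const struct bpf_kfunc_desc *desc;

	/* insn->imm still holds the kernel func's btf_id here */
	desc = find_kfunc_desc(env->prog, insn->imm);
	if (!desc)
		return -EFAULT;

	/* desc->imm was precomputed as func_addr - __bpf_call_base
	 * when kfunc_tab was built
	 */
	insn->imm = desc->imm;
	return 0;
}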

gpl_compatible program is required to call kernel function.

This feature currently requires JIT.

The verifier selftests are adjusted because of the changes in
the verbose log in add_subprog_and_kfunc().

Signed-off-by: Martin KaFai Lau 
---
 arch/x86/net/bpf_jit_comp.c   |   5 +
 include/linux/bpf.h   |  24 ++
 include/linux/btf.h   |   1 +
 include/linux/filter.h|   1 +
 include/uapi/linux/bpf.h  |   4 +
 kernel/bpf/btf.c  |  65 +++-
 kernel/bpf/core.c |  18 +-
 kernel/bpf/disasm.c   |  13 +-
 kernel/bpf/syscall.c  |   1 +
 kernel/bpf/verifier.c | 368 --
 tools/include/uapi/linux/bpf.h|   4 +
 tools/testing/selftests/bpf/verifier/calls.c  |  12 +-
 .../selftests/bpf/verifier/dead_code.c|  10 +-
 13 files changed, 480 insertions(+), 46 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 6926d0ca6c71..bcb957234410 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -2327,3 +2327,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
*prog)
   tmp : orig_prog);
return prog;
 }
+
+bool bpf_jit_supports_kfunc_call(void)
+{
+   return true;
+}
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ebd044182f8d..c6439e96fa4a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -426,6 +426,7 @@ enum bpf_reg_type {
 

[PATCH v2 bpf-next 04/14] bpf: Support kernel function call in x86-32

2021-03-24 Thread Martin KaFai Lau
This patch adds kernel function call support to the x86-32 bpf jit.

Signed-off-by: Martin KaFai Lau 
---
 arch/x86/net/bpf_jit_comp32.c | 198 ++
 1 file changed, 198 insertions(+)

diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index d17b67c69f89..0a7a2870f111 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -1390,6 +1390,19 @@ static inline void emit_push_r64(const u8 src[], u8 
**pprog)
*pprog = prog;
 }
 
+static void emit_push_r32(const u8 src[], u8 **pprog)
+{
+   u8 *prog = *pprog;
+   int cnt = 0;
+
+   /* mov ecx,dword ptr [ebp+off] */
+   EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src_lo));
+   /* push ecx */
+   EMIT1(0x51);
+
+   *pprog = prog;
+}
+
 static u8 get_cond_jmp_opcode(const u8 op, bool is_cmp_lo)
 {
u8 jmp_cond;
@@ -1459,6 +1472,174 @@ static u8 get_cond_jmp_opcode(const u8 op, bool 
is_cmp_lo)
return jmp_cond;
 }
 
+/* i386 kernel compiles with "-mregparm=3".  From gcc document:
+ *
+ * ---- snippet ----
+ * regparm (number)
+ * On x86-32 targets, the regparm attribute causes the compiler
+ * to pass arguments number one to (number) if they are of integral
+ * type in registers EAX, EDX, and ECX instead of on the stack.
+ * Functions that take a variable number of arguments continue
+ * to be passed all of their arguments on the stack.
+ * ---- snippet ----
+ *
+ * The first three args of a function will be considered for
+ * putting into the 32bit register EAX, EDX, and ECX.
+ *
+ * Two 32bit registers are used to pass a 64bit arg.
+ *
+ * For example,
+ * void foo(u32 a, u32 b, u32 c, u32 d):
+ * u32 a: EAX
+ * u32 b: EDX
+ * u32 c: ECX
+ * u32 d: stack
+ *
+ * void foo(u64 a, u32 b, u32 c):
+ * u64 a: EAX (lo32) EDX (hi32)
+ * u32 b: ECX
+ * u32 c: stack
+ *
+ * void foo(u32 a, u64 b, u32 c):
+ * u32 a: EAX
+ * u64 b: EDX (lo32) ECX (hi32)
+ * u32 c: stack
+ *
+ * void foo(u32 a, u32 b, u64 c):
+ * u32 a: EAX
+ * u32 b: EDX
+ * u64 c: stack
+ *
+ * The return value will be stored in the EAX (and EDX for 64bit value).
+ *
+ * For example,
+ * u32 foo(u32 a, u32 b, u32 c):
+ * return value: EAX
+ *
+ * u64 foo(u32 a, u32 b, u32 c):
+ * return value: EAX (lo32) EDX (hi32)
+ *
+ * Notes:
+ * The verifier only accepts function having integer and pointers
+ * as its args and return value, so it does not have
+ * struct-by-value.
+ *
+ * emit_kfunc_call() finds out the btf_func_model by calling
+ * bpf_jit_find_kfunc_model().  A btf_func_model
+ * has the details about the number of args, size of each arg,
+ * and the size of the return value.
+ *
+ * It first decides how many args can be passed by EAX, EDX, and ECX.
+ * That will decide what args should be pushed to the stack:
+ * [first_stack_regno, last_stack_regno] are the bpf regnos
+ * that should be pushed to the stack.
+ *
+ * It will first push all args to the stack because the push
+ * will need to use ECX.  Then, it moves
+ * [BPF_REG_1, first_stack_regno) to EAX, EDX, and ECX.
+ *
+ * When emitting a call (0xE8), it needs to figure out
+ * the jmp_offset relative to the jit-insn address immediately
+ * following the call (0xE8) instruction.  At this point, it knows
+ * the end of the jit-insn address after completely translated the
+ * current (BPF_JMP | BPF_CALL) bpf-insn.  It is passed as "end_addr"
+ * to the emit_kfunc_call().  Thus, it can learn the "immediate-follow-call"
+ * address by figuring out how many jit-insn is generated between
+ * the call (0xE8) and the end_addr:
+ * - 0-1 jit-insn (3 bytes each) to restore the esp pointer if there
+ *   is arg pushed to the stack.
+ * - 0-2 jit-insns (3 bytes each) to handle the return value.
+ */
+static int emit_kfunc_call(const struct bpf_prog *bpf_prog, u8 *end_addr,
+  const struct bpf_insn *insn, u8 **pprog)
+{
+   const u8 arg_regs[] = { IA32_EAX, IA32_EDX, IA32_ECX };
+   int i, cnt = 0, first_stack_regno, last_stack_regno;
+   int free_arg_regs = ARRAY_SIZE(arg_regs);
+   const struct btf_func_model *fm;
+   int bytes_in_stack = 0;
+   const u8 *cur_arg_reg;
+   u8 *prog = *pprog;
+   s64 jmp_offset;
+
+   fm = bpf_jit_find_kfunc_model(bpf_prog, insn);
+   if (!fm)
+   return -EINVAL;
+
+   first_stack_regno = BPF_REG_1;
+   for (i = 0; i < fm->nr_args; i++) {
+   int regs_needed = fm->arg_size[i] > sizeof(u32) ? 2 : 1;
+
+   if (regs_needed > free_arg_regs)
+   break;
+
+   free_arg_regs -= regs_needed;
+   first_stack_regno++;
+   }
+
+   /* Push the args to the stack */
+   last_stack_regno = BPF_REG_0 + fm->nr_args;
+   for (i = last_stack_regno; i >= first_stack_regno; i--) {
+

[PATCH v2 bpf-next 01/14] bpf: Simplify freeing logic in linfo and jited_linfo

2021-03-24 Thread Martin KaFai Lau
This patch simplifies the linfo freeing logic by combining
"bpf_prog_free_jited_linfo()" and "bpf_prog_free_unused_jited_linfo()"
into the new "bpf_prog_jit_attempt_done()".
It is prep work for the kernel function call support.  In a later
patch, freeing the kernel function call descriptors will also
be done in "bpf_prog_jit_attempt_done()".

"bpf_prog_free_linfo()" is removed since it is only called by
"__bpf_prog_put_noref()".  kvfree() is now called directly
instead.

It also takes this chance to s/kcalloc/kvcalloc/ for the jited_linfo
allocation.

Signed-off-by: Martin KaFai Lau 
---
 include/linux/filter.h |  3 +--
 kernel/bpf/core.c  | 35 ---
 kernel/bpf/syscall.c   |  3 ++-
 kernel/bpf/verifier.c  |  4 ++--
 4 files changed, 17 insertions(+), 28 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index b2b85b2cad8e..0d9c710eb050 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -877,8 +877,7 @@ void bpf_prog_free_linfo(struct bpf_prog *prog);
 void bpf_prog_fill_jited_linfo(struct bpf_prog *prog,
   const u32 *insn_to_jit_off);
 int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog);
-void bpf_prog_free_jited_linfo(struct bpf_prog *prog);
-void bpf_prog_free_unused_jited_linfo(struct bpf_prog *prog);
+void bpf_prog_jit_attempt_done(struct bpf_prog *prog);
 
 struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags);
 struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t 
gfp_extra_flags);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3a283bf97f2f..4a6dd327446b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -143,25 +143,22 @@ int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog)
if (!prog->aux->nr_linfo || !prog->jit_requested)
return 0;
 
-   prog->aux->jited_linfo = kcalloc(prog->aux->nr_linfo,
-sizeof(*prog->aux->jited_linfo),
-GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
+   prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
+ sizeof(*prog->aux->jited_linfo),
+ GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!prog->aux->jited_linfo)
return -ENOMEM;
 
return 0;
 }
 
-void bpf_prog_free_jited_linfo(struct bpf_prog *prog)
+void bpf_prog_jit_attempt_done(struct bpf_prog *prog)
 {
-   kfree(prog->aux->jited_linfo);
-   prog->aux->jited_linfo = NULL;
-}
-
-void bpf_prog_free_unused_jited_linfo(struct bpf_prog *prog)
-{
-   if (prog->aux->jited_linfo && !prog->aux->jited_linfo[0])
-   bpf_prog_free_jited_linfo(prog);
+   if (prog->aux->jited_linfo &&
+   (!prog->jited || !prog->aux->jited_linfo[0])) {
+   kvfree(prog->aux->jited_linfo);
+   prog->aux->jited_linfo = NULL;
+   }
 }
 
 /* The jit engine is responsible to provide an array
@@ -217,12 +214,6 @@ void bpf_prog_fill_jited_linfo(struct bpf_prog *prog,
insn_to_jit_off[linfo[i].insn_off - insn_start - 1];
 }
 
-void bpf_prog_free_linfo(struct bpf_prog *prog)
-{
-   bpf_prog_free_jited_linfo(prog);
-   kvfree(prog->aux->linfo);
-}
-
 struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,
  gfp_t gfp_extra_flags)
 {
@@ -1866,15 +1857,13 @@ struct bpf_prog *bpf_prog_select_runtime(struct 
bpf_prog *fp, int *err)
return fp;
 
fp = bpf_int_jit_compile(fp);
-   if (!fp->jited) {
-   bpf_prog_free_jited_linfo(fp);
+   bpf_prog_jit_attempt_done(fp);
 #ifdef CONFIG_BPF_JIT_ALWAYS_ON
+   if (!fp->jited) {
*err = -ENOTSUPP;
return fp;
-#endif
-   } else {
-   bpf_prog_free_unused_jited_linfo(fp);
}
+#endif
} else {
*err = bpf_prog_offload_compile(fp);
if (*err)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c859bc46d06c..78a653e25df0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1689,7 +1689,8 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, 
bool deferred)
 {
bpf_prog_kallsyms_del_all(prog);
btf_put(prog->aux->btf);
-   bpf_prog_free_linfo(prog);
+   kvfree(prog->aux->jited_linfo);
+   kvfree(prog->aux->linfo);
if (prog->aux->attach_btf)
btf_put(prog->aux->attach_btf);
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e26c5170c953..0cfe39023fe5 100644
--- a/kernel/bpf/verifier.c
+++ b/ke

[PATCH v2 bpf-next 02/14] bpf: Refactor btf_check_func_arg_match

2021-03-24 Thread Martin KaFai Lau
This patch moves the subprog-specific logic from
btf_check_func_arg_match() to the new btf_check_subprog_arg_match().
The core logic is left in btf_check_func_arg_match(), which
will be reused later to check the kernel function call.

The "if (!btf_type_is_ptr(t))" is checked first to improve the
indentation which will be useful for a later patch.

Some of the "btf_kind_str[]" usages are replaced with the shortcut
"btf_type_str(t)".

Signed-off-by: Martin KaFai Lau 
---
 include/linux/bpf.h   |   4 +-
 include/linux/btf.h   |   5 ++
 kernel/bpf/btf.c  | 159 +++---
 kernel/bpf/verifier.c |   4 +-
 4 files changed, 95 insertions(+), 77 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a25730eaa148..ebd044182f8d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1514,8 +1514,8 @@ int btf_distill_func_proto(struct bpf_verifier_log *log,
   struct btf_func_model *m);
 
 struct bpf_reg_state;
-int btf_check_func_arg_match(struct bpf_verifier_env *env, int subprog,
-struct bpf_reg_state *regs);
+int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog,
+   struct bpf_reg_state *regs);
 int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog,
  struct bpf_reg_state *reg);
 int btf_check_type_match(struct bpf_verifier_log *log, const struct bpf_prog 
*prog,
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 9c1b52738bbe..8a05687a4ee2 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -141,6 +141,11 @@ static inline bool btf_type_is_enum(const struct btf_type 
*t)
return BTF_INFO_KIND(t->info) == BTF_KIND_ENUM;
 }
 
+static inline bool btf_type_is_scalar(const struct btf_type *t)
+{
+   return btf_type_is_int(t) || btf_type_is_enum(t);
+}
+
 static inline bool btf_type_is_typedef(const struct btf_type *t)
 {
return BTF_INFO_KIND(t->info) == BTF_KIND_TYPEDEF;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 369faeddf1df..3c489adacf3b 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -4377,7 +4377,7 @@ static u8 bpf_ctx_convert_map[] = {
 #undef BPF_LINK_TYPE
 
 static const struct btf_member *
-btf_get_prog_ctx_type(struct bpf_verifier_log *log, struct btf *btf,
+btf_get_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf,
  const struct btf_type *t, enum bpf_prog_type prog_type,
  int arg)
 {
@@ -5362,122 +5362,135 @@ int btf_check_type_match(struct bpf_verifier_log 
*log, const struct bpf_prog *pr
return btf_check_func_type_match(log, btf1, t1, btf2, t2);
 }
 
-/* Compare BTF of a function with given bpf_reg_state.
- * Returns:
- * EFAULT - there is a verifier bug. Abort verification.
- * EINVAL - there is a type mismatch or BTF is not available.
- * 0 - BTF matches with what bpf_reg_state expects.
- * Only PTR_TO_CTX and SCALAR_VALUE states are recognized.
- */
-int btf_check_func_arg_match(struct bpf_verifier_env *env, int subprog,
-struct bpf_reg_state *regs)
+static int btf_check_func_arg_match(struct bpf_verifier_env *env,
+   const struct btf *btf, u32 func_id,
+   struct bpf_reg_state *regs,
+   bool ptr_to_mem_ok)
 {
struct bpf_verifier_log *log = &env->log;
-   struct bpf_prog *prog = env->prog;
-   struct btf *btf = prog->aux->btf;
-   const struct btf_param *args;
+   const char *func_name, *ref_tname;
const struct btf_type *t, *ref_t;
-   u32 i, nargs, btf_id, type_size;
-   const char *tname;
-   bool is_global;
-
-   if (!prog->aux->func_info)
-   return -EINVAL;
-
-   btf_id = prog->aux->func_info[subprog].type_id;
-   if (!btf_id)
-   return -EFAULT;
-
-   if (prog->aux->func_info_aux[subprog].unreliable)
-   return -EINVAL;
+   const struct btf_param *args;
+   u32 i, nargs;
 
-   t = btf_type_by_id(btf, btf_id);
+   t = btf_type_by_id(btf, func_id);
if (!t || !btf_type_is_func(t)) {
/* These checks were already done by the verifier while loading
 * struct bpf_func_info
 */
-   bpf_log(log, "BTF of func#%d doesn't point to KIND_FUNC\n",
-   subprog);
+   bpf_log(log, "BTF of func_id %u doesn't point to KIND_FUNC\n",
+   func_id);
return -EFAULT;
}
-   tname = btf_name_by_offset(btf, t->name_off);
+   func_name = btf_name_by_offset(btf, t->name_off);
 
t = btf_type_by_id(btf, t->type);
if (!t || !btf_type_is_func_proto(t)) {
-   bpf_log(log, "Invalid BTF 

[PATCH v2 bpf-next 00/14] bpf: Support calling kernel function

2021-03-24 Thread Martin KaFai Lau
This series adds support to allow a bpf program to call kernel functions.

The use case included in this set is to allow bpf-tcp-cc to directly
call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
functions have already been used by some kernel tcp-cc implementations.

This set will also allow the bpf-tcp-cc program to directly call the
kernel tcp-cc implementation.  For example, a bpf_dctcp may only want to
implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly
from the kernel tcp_dctcp.c instead of reimplementing (or
copy-and-pasting) them.

The tcp-cc kernel functions mentioned above will be allowlisted
for the struct_ops bpf-tcp-cc programs to use in a later patch.
The allowlisted functions are not bound to a fixed ABI contract.
Those functions have already been used by the existing kernel tcp-cc.
If any of them changes, both in-tree and out-of-tree kernel tcp-cc
implementations have to be changed.  The same goes for the struct_ops
bpf-tcp-cc programs, which have to be adjusted accordingly.

Please see individual patch for details.

v2:
- Patch 2 in v1 is removed.  No need to support extern func in kernel.
  Changed libbpf to adjust the .ksyms datasec for extern func
  in patch 11. (Andrii)
- Name change: btf_check_func_arg_match() and btf_check_subprog_arg_match()
  in patch 2. (Andrii)
- Always set unreliable on any error in patch 2 since it does not
  matter. (Andrii)
- s/kern_func/kfunc/ and s/descriptor/desc/ in this set. (Andrii)
- Remove some unnecessary changes in disasm.h and disasm.c
  in patch 3.  In particular, no need to change the function
  signature in bpf_insn_revmap_call_t.  Also, removed the changes
  in print_bpf_insn().
- Fixed an issue in check_kfunc_call() when the calling kernel function
  returns a pointer in patch 3.  Added a selftest.
- Adjusted the verifier selftests due to the changes in the verifier log
  in patch 3.
- Fixed a comparison issue in kfunc_desc_cmp_by_imm() in patch 3. (Andrii)
- Name change: is_ldimm64_insn(),
  new helper: is_call_insn() in patch 10 (Andrii)
- Move btf_func_linkage() from btf.h to libbpf.c in patch 11. (Andrii)
- Fixed the linker error when CONFIG_BPF_SYSCALL is not defined.
  Moved the check_kfunc_call from filter.c to test_run.c in patch 14.
  (kernel test robot)

Martin KaFai Lau (14):
  bpf: Simplify freeing logic in linfo and jited_linfo
  bpf: Refactor btf_check_func_arg_match
  bpf: Support bpf program calling kernel function
  bpf: Support kernel function call in x86-32
  tcp: Rename bictcp function prefix to cubictcp
  bpf: tcp: Put some tcp cong functions in allowlist for bpf-tcp-cc
  libbpf: Refactor bpf_object__resolve_ksyms_btf_id
  libbpf: Refactor codes for finding btf id of a kernel symbol
  libbpf: Rename RELO_EXTERN to RELO_EXTERN_VAR
  libbpf: Record extern sym relocation first
  libbpf: Support extern kernel function
  bpf: selftests: Rename bictcp to bpf_cubic
  bpf: selftests: bpf_cubic and bpf_dctcp calling kernel functions
  bpf: selftests: Add kfunc_call test

 arch/x86/net/bpf_jit_comp.c   |   5 +
 arch/x86/net/bpf_jit_comp32.c | 198 +
 include/linux/bpf.h   |  34 +-
 include/linux/btf.h   |   6 +
 include/linux/filter.h|   4 +-
 include/uapi/linux/bpf.h  |   4 +
 kernel/bpf/btf.c  | 218 ++
 kernel/bpf/core.c |  47 +--
 kernel/bpf/disasm.c   |  13 +-
 kernel/bpf/syscall.c  |   4 +-
 kernel/bpf/verifier.c | 376 +++--
 net/bpf/test_run.c|  28 ++
 net/core/filter.c |   1 +
 net/ipv4/bpf_tcp_ca.c |  41 ++
 net/ipv4/tcp_cubic.c  |  24 +-
 tools/include/uapi/linux/bpf.h|   4 +
 tools/lib/bpf/libbpf.c| 389 +-
 tools/testing/selftests/bpf/bpf_tcp_helpers.h |  29 +-
 .../selftests/bpf/prog_tests/kfunc_call.c |  59 +++
 tools/testing/selftests/bpf/progs/bpf_cubic.c |  36 +-
 tools/testing/selftests/bpf/progs/bpf_dctcp.c |  22 +-
 .../selftests/bpf/progs/kfunc_call_test.c |  47 +++
 .../bpf/progs/kfunc_call_test_subprog.c   |  42 ++
 tools/testing/selftests/bpf/verifier/calls.c  |  12 +-
 .../selftests/bpf/verifier/dead_code.c|  10 +-
 25 files changed, 1334 insertions(+), 319 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/kfunc_call.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_subprog.c

-- 
2.30.2



Re: [PATCH bpf-next 02/15] bpf: btf: Support parsing extern func

2021-03-22 Thread Martin KaFai Lau
On Sat, Mar 20, 2021 at 10:18:36AM -0700, Andrii Nakryiko wrote:
> > From test_ksyms.c:
> > [22] DATASEC '.ksyms' size=0 vlen=5
> >  type_id=12 offset=0 size=1
> >  type_id=13 offset=0 size=1
> >
> > For extern, does it make sense for the libbpf to assign 0 to
> > both var offset and size since it does not matter?
> 
> That's how it is generated and yes, I think that's how it should be
> kept once kernel supports EXTERN VAR. libbpf is adjusting offsets and
> sizes in addition to marking VAR itself as GLOBAL_ALLOCATED. If kernel
> supports EXTERN VAR natively, none of that needs to happen (on newer
> kernels only, of course).
> 
> > In the kernel, it can ensure a datasec has either all externs or no externs.
> > array_map_check_btf() will ensure the datasec has no externs.
> 
> It certainly makes it less surprising from handling BTF, but it feels
> like an arbitrary policy, rather than technical restriction. You can
> mark allocated variables and externs with the same section name and
> Clang will probably happily generate a single DATASEC with a mix of
> externs and non-externs. Is that inherently a bad thing? I'm not sure.
> Basically, I don't know if the kernel should care and enforce or not.
I have thought a bit more on this.  The offset=0 of extern var
can be used in the verification but I think it will still have some
open ended questions like arraymap.

I will use your suggestion in libbpf and do something similar to what is
done for the extern VAR: replace the FUNC in datasec with INT
(btf__add_var() if needed).
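
A rough sketch of that idea, using libbpf's public btf API (the dummy
names, the vsi handling, and the call site are assumptions, not the
final code):

/* Replace an extern FUNC entry in the .ksyms DATASEC with a dummy
 * global VAR of a one-byte INT, so that kernels without native
 * extern-func support still accept the BTF.  Note: in real code the
 * vsi pointer must be re-fetched after btf__add_*(), which may
 * reallocate the type data.
 */
static int sanitize_ksym_func(struct btf *btf, struct btf_var_secinfo *vsi)
{
	int int_id, var_id;

	int_id = btf__add_int(btf, "dummy_ksym", 1, 0);
	if (int_id < 0)
		return int_id;

	var_id = btf__add_var(btf, "dummy_ksym_var",
			      BTF_VAR_GLOBAL_ALLOCATED, int_id);
	if (var_id < 0)
		return var_id;

	vsi->type = var_id;
	vsi->size = 1;
	return 0;
}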

> 
> >
> > > But of course to support older kernels libbpf will still have to
> > > do this. EXTERN vars won't reduce the amount of libbpf logic.


Re: Design for sk_lookup helper function in context of sk_lookup hook

2021-03-22 Thread Martin KaFai Lau
On Fri, Mar 19, 2021 at 06:05:20PM +0100, Shanti Lombard née Bouchez-Mongardé 
wrote:
> On 19/03/2021 at 17:55, Martin KaFai Lau wrote:
> > On Wed, Mar 17, 2021 at 10:04:18AM +0100, Shanti Lombard née 
> > Bouchez-Mongardé wrote:
> > > Q1: How do we prevent socket lookup from triggering BPF sk_lookup causing 
> > > a
> > > loop?
> > The bpf_sk_lookup_(tcp|udp) will be called from the BPF_PROG_TYPE_SK_LOOKUP 
> > program?
> 
> Yes, the idea is to allow the BPF program to redirect incoming connections
> for 0.0.0.0:1234 to a specific IP address such as 127.0.12.34:1234 or any
> other combinaison with a binding not done based on a predefined socket file
> descriptor but based on a listening IP address for a socket.
> 
> See linked discussion in the original message
> 
> > > - Solution A: We add a flag to the existing inet_lookup exported function
> > > (and similarly for inet6, udp4 and udp6). The 
> > > INET_LOOKUP_SKIP_BPF_SK_LOOKUP
> > > flag, when set, would prevent BPF sk_lookup from happening. It also 
> > > requires
> > > modifying every location making use of those functions.
> > > 
> > > - Solution B: We export a new symbol in inet_hashtables, a wrapper around
> > > static function inet_lhash2_lookup for inet4 and similar functions for 
> > > inet6
> > > and udp4/6. Looking up specific IP/port and if not found looking up for
> > > INADDR_ANY could be done in the helper function in net/core/filters.c or 
> > > in
> > > the BPF program.
For TCP, it is only for lhash lookup, right?
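
(For reference, the in-BPF fallback mentioned in Solution B could look
roughly like this; tuple layout per struct bpf_sock_tuple, and whether
these helpers are callable from the sk_lookup hook is exactly the open
question:)

	struct bpf_sock_tuple tuple = {};	/* filled from ctx */
	struct bpf_sock *sk;

	/* try the exact local address first ... */
	sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv4), -1, 0);
	if (!sk) {
		/* ... then fall back to the INADDR_ANY listener */
		tuple.ipv4.daddr = 0;
		sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv4), -1, 0);
	}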

> > > 
> > > Q2: Should we reuse the bpf_sk_lookup_tcp and bpf_sk_lookup_udp helper
> > > functions or create new ones?
> > If the args passing to the bpf_sk_lookup_(tcp|udp) is the same,
> > it makes sense to reuse the same BPF_FUNC_sk_lookup_*.
> > The actual helper implementation could be different though.
> > Look at bpf_xdp_sk_lookup_tcp_proto and bpf_sk_lookup_tcp_proto.
> 
> I was thinking that perhaps a different helper method, one taking an
> IPPROTO_TCP or IPPROTO_UDP parameter, would be better suited. It would avoid
> BPF code such as :
> 
>     switch(ctx->protocol){
>         case IPPROTO_TCP:
>             sk = bpf_sk_lookup_tcp(ctx, &tuple, tuple_size, -1, 0);
>             break;
>         case IPPROTO_UDP:
>             sk = bpf_sk_lookup_udp(ctx, &tuple, tuple_size, -1, 0);
>             break;
>         default:
>             return SK_PASS;
>     }
> 
> But then there is the limit of 5 arguments, isn't there, so perhaps the
> _tcp/_udp functions are not such a bad idea after all.
> 
> I already saw that the same helper functions could be given different
> implementations. And if there is no way to have more than 5 arguments then
> this is probably better to reuse the same helper function name and
> signature.


Re: [PATCH bpf-next 02/15] bpf: btf: Support parsing extern func

2021-03-19 Thread Martin KaFai Lau
On Fri, Mar 19, 2021 at 04:02:27PM -0700, Andrii Nakryiko wrote:
> On Fri, Mar 19, 2021 at 3:45 PM Martin KaFai Lau  wrote:
> >
> > On Fri, Mar 19, 2021 at 03:29:57PM -0700, Andrii Nakryiko wrote:
> > > On Fri, Mar 19, 2021 at 3:19 PM Martin KaFai Lau  wrote:
> > > >
> > > > On Fri, Mar 19, 2021 at 02:27:13PM -0700, Andrii Nakryiko wrote:
> > > > > On Thu, Mar 18, 2021 at 10:29 PM Martin KaFai Lau  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Mar 18, 2021 at 09:13:56PM -0700, Andrii Nakryiko wrote:
> > > > > > > On Thu, Mar 18, 2021 at 4:39 PM Martin KaFai Lau  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 18, 2021 at 03:53:38PM -0700, Andrii Nakryiko wrote:
> > > > > > > > > On Tue, Mar 16, 2021 at 12:01 AM Martin KaFai Lau 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > This patch makes the BTF verifier accept extern func. It is 
> > > > > > > > > > used for
> > > > > > > > > > allowing bpf program to call a limited set of kernel 
> > > > > > > > > > functions
> > > > > > > > > > in a later patch.
> > > > > > > > > >
> > > > > > > > > > When writing bpf prog, the extern kernel function needs
> > > > > > > > > > to be declared under a ELF section (".ksyms") which is
> > > > > > > > > > the same as the current extern kernel variables and that 
> > > > > > > > > > should
> > > > > > > > > > keep its usage consistent without requiring to remember 
> > > > > > > > > > another
> > > > > > > > > > section name.
> > > > > > > > > >
> > > > > > > > > > For example, in a bpf_prog.c:
> > > > > > > > > >
> > > > > > > > > > extern int foo(struct sock *) 
> > > > > > > > > > __attribute__((section(".ksyms")))
> > > > > > > > > >
> > > > > > > > > > [24] FUNC_PROTO '(anon)' ret_type_id=15 vlen=1
> > > > > > > > > > '(anon)' type_id=18
> > > > > > > > > > [25] FUNC 'foo' type_id=24 linkage=extern
> > > > > > > > > > [ ... ]
> > > > > > > > > > [33] DATASEC '.ksyms' size=0 vlen=1
> > > > > > > > > > type_id=25 offset=0 size=0
> > > > > > > > > >
> > > > > > > > > > LLVM will put the "func" type into the BTF datasec ".ksyms".
> > > > > > > > > > The current "btf_datasec_check_meta()" assumes everything 
> > > > > > > > > > under
> > > > > > > > > > it is a "var" and ensures it has non-zero size 
> > > > > > > > > > ("!vsi->size" test).
> > > > > > > > > > The non-zero size check is not true for "func".  This patch 
> > > > > > > > > > postpones the
> > > > > > > > > > "!vsi-size" test from "btf_datasec_check_meta()" to
> > > > > > > > > > "btf_datasec_resolve()" which has all types collected to 
> > > > > > > > > > decide
> > > > > > > > > > if a vsi is a "var" or a "func" and then enforce the 
> > > > > > > > > > "vsi->size"
> > > > > > > > > > differently.
> > > > > > > > > >
> > > > > > > > > > If the datasec only has "func", its "t->size" could be zero.
> > > > > > > > > > Thus, the current "!t->size" test is no longer valid.  The
> > > > > > > > > > invalid "t->size" will still be caught by the later
> > > > > > > > > > "last_vsi_end_off > t->size" check.   This patch also takes 
> > > > > > > 

Re: [PATCH bpf-next 02/15] bpf: btf: Support parsing extern func

2021-03-19 Thread Martin KaFai Lau
On Fri, Mar 19, 2021 at 03:29:57PM -0700, Andrii Nakryiko wrote:
> On Fri, Mar 19, 2021 at 3:19 PM Martin KaFai Lau  wrote:
> >
> > On Fri, Mar 19, 2021 at 02:27:13PM -0700, Andrii Nakryiko wrote:
> > > On Thu, Mar 18, 2021 at 10:29 PM Martin KaFai Lau  wrote:
> > > >
> > > > On Thu, Mar 18, 2021 at 09:13:56PM -0700, Andrii Nakryiko wrote:
> > > > > On Thu, Mar 18, 2021 at 4:39 PM Martin KaFai Lau  wrote:
> > > > > >
> > > > > > On Thu, Mar 18, 2021 at 03:53:38PM -0700, Andrii Nakryiko wrote:
> > > > > > > On Tue, Mar 16, 2021 at 12:01 AM Martin KaFai Lau  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > This patch makes the BTF verifier accept extern func. It is used 
> > > > > > > > for
> > > > > > > > allowing bpf program to call a limited set of kernel functions
> > > > > > > > in a later patch.
> > > > > > > >
> > > > > > > > When writing bpf prog, the extern kernel function needs
> > > > > > > > to be declared under a ELF section (".ksyms") which is
> > > > > > > > the same as the current extern kernel variables and that should
> > > > > > > > keep its usage consistent without requiring to remember another
> > > > > > > > section name.
> > > > > > > >
> > > > > > > > For example, in a bpf_prog.c:
> > > > > > > >
> > > > > > > > extern int foo(struct sock *) __attribute__((section(".ksyms")))
> > > > > > > >
> > > > > > > > [24] FUNC_PROTO '(anon)' ret_type_id=15 vlen=1
> > > > > > > > '(anon)' type_id=18
> > > > > > > > [25] FUNC 'foo' type_id=24 linkage=extern
> > > > > > > > [ ... ]
> > > > > > > > [33] DATASEC '.ksyms' size=0 vlen=1
> > > > > > > > type_id=25 offset=0 size=0
> > > > > > > >
> > > > > > > > LLVM will put the "func" type into the BTF datasec ".ksyms".
> > > > > > > > The current "btf_datasec_check_meta()" assumes everything under
> > > > > > > > it is a "var" and ensures it has non-zero size ("!vsi->size" 
> > > > > > > > test).
> > > > > > > > The non-zero size check is not true for "func".  This patch 
> > > > > > > > postpones the
> > > > > > > > "!vsi-size" test from "btf_datasec_check_meta()" to
> > > > > > > > "btf_datasec_resolve()" which has all types collected to decide
> > > > > > > > if a vsi is a "var" or a "func" and then enforce the "vsi->size"
> > > > > > > > differently.
> > > > > > > >
> > > > > > > > If the datasec only has "func", its "t->size" could be zero.
> > > > > > > > Thus, the current "!t->size" test is no longer valid.  The
> > > > > > > > invalid "t->size" will still be caught by the later
> > > > > > > > "last_vsi_end_off > t->size" check.   This patch also takes this
> > > > > > > > chance to consolidate other "t->size" tests ("vsi->offset >= 
> > > > > > > > t->size"
> > > > > > > > "vsi->size > t->size", and "t->size < sum") into the existing
> > > > > > > > "last_vsi_end_off > t->size" test.
> > > > > > > >
> > > > > > > > The LLVM will also put those extern kernel function as an extern
> > > > > > > > linkage func in the BTF:
> > > > > > > >
> > > > > > > > [24] FUNC_PROTO '(anon)' ret_type_id=15 vlen=1
> > > > > > > > '(anon)' type_id=18
> > > > > > > > [25] FUNC 'foo' type_id=24 linkage=extern
> > > > > > > >
> > > > > > > > This patch allows BTF_FUNC_EXTERN in btf_func_check_meta().
> > > > > >

Re: [PATCH bpf-next 02/15] bpf: btf: Support parsing extern func

2021-03-19 Thread Martin KaFai Lau
On Fri, Mar 19, 2021 at 02:27:13PM -0700, Andrii Nakryiko wrote:
> On Thu, Mar 18, 2021 at 10:29 PM Martin KaFai Lau  wrote:
> >
> > On Thu, Mar 18, 2021 at 09:13:56PM -0700, Andrii Nakryiko wrote:
> > > On Thu, Mar 18, 2021 at 4:39 PM Martin KaFai Lau  wrote:
> > > >
> > > > On Thu, Mar 18, 2021 at 03:53:38PM -0700, Andrii Nakryiko wrote:
> > > > > On Tue, Mar 16, 2021 at 12:01 AM Martin KaFai Lau  
> > > > > wrote:
> > > > > >
> > > > > > This patch makes the BTF verifier accept extern func. It is used for
> > > > > > allowing bpf program to call a limited set of kernel functions
> > > > > > in a later patch.
> > > > > >
> > > > > > When writing bpf prog, the extern kernel function needs
> > > > > > to be declared under a ELF section (".ksyms") which is
> > > > > > the same as the current extern kernel variables and that should
> > > > > > keep its usage consistent without requiring to remember another
> > > > > > section name.
> > > > > >
> > > > > > For example, in a bpf_prog.c:
> > > > > >
> > > > > > extern int foo(struct sock *) __attribute__((section(".ksyms")))
> > > > > >
> > > > > > [24] FUNC_PROTO '(anon)' ret_type_id=15 vlen=1
> > > > > > '(anon)' type_id=18
> > > > > > [25] FUNC 'foo' type_id=24 linkage=extern
> > > > > > [ ... ]
> > > > > > [33] DATASEC '.ksyms' size=0 vlen=1
> > > > > > type_id=25 offset=0 size=0
> > > > > >
> > > > > > LLVM will put the "func" type into the BTF datasec ".ksyms".
> > > > > > The current "btf_datasec_check_meta()" assumes everything under
> > > > > > it is a "var" and ensures it has non-zero size ("!vsi->size" test).
> > > > > > The non-zero size check is not true for "func".  This patch 
> > > > > > postpones the
> > > > > > "!vsi-size" test from "btf_datasec_check_meta()" to
> > > > > > "btf_datasec_resolve()" which has all types collected to decide
> > > > > > if a vsi is a "var" or a "func" and then enforce the "vsi->size"
> > > > > > differently.
> > > > > >
> > > > > > If the datasec only has "func", its "t->size" could be zero.
> > > > > > Thus, the current "!t->size" test is no longer valid.  The
> > > > > > invalid "t->size" will still be caught by the later
> > > > > > "last_vsi_end_off > t->size" check.   This patch also takes this
> > > > > > chance to consolidate other "t->size" tests ("vsi->offset >= 
> > > > > > t->size"
> > > > > > "vsi->size > t->size", and "t->size < sum") into the existing
> > > > > > "last_vsi_end_off > t->size" test.
> > > > > >
> > > > > > The LLVM will also put those extern kernel function as an extern
> > > > > > linkage func in the BTF:
> > > > > >
> > > > > > [24] FUNC_PROTO '(anon)' ret_type_id=15 vlen=1
> > > > > > '(anon)' type_id=18
> > > > > > [25] FUNC 'foo' type_id=24 linkage=extern
> > > > > >
> > > > > > This patch allows BTF_FUNC_EXTERN in btf_func_check_meta().
> > > > > > Also, an extern kernel function declaration does not
> > > > > > necessarily have arg names. Another change in btf_func_check() is
> > > > > > to allow an extern function to have no arg names.
> > > > > >
> > > > > > The btf selftest is adjusted accordingly.  New tests are also added.
> > > > > >
> > > > > > The required LLVM patch: https://reviews.llvm.org/D93563 
> > > > > >
> > > > > > Signed-off-by: Martin KaFai Lau 
> > > > > > ---
> > > > >
> > > > > High-level question about EXTERN functions in DATASEC. Does kernel
> > > > > need to see them under DATASEC? What if libbpf just removed all EXTERN
>

Re: [PATCH bpf-next 04/15] bpf: Support bpf program calling kernel function

2021-03-19 Thread Martin KaFai Lau
On Thu, Mar 18, 2021 at 06:03:49PM -0700, Andrii Nakryiko wrote:
> On Tue, Mar 16, 2021 at 12:01 AM Martin KaFai Lau  wrote:
> >
> > This patch adds support to BPF verifier to allow bpf program calling
> > kernel function directly.
> >
> > The use case included in this set is to allow bpf-tcp-cc to directly
> > call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
> > functions have already been used by some kernel tcp-cc implementations.
> >
> > This set will also allow the bpf-tcp-cc program to directly call the
> > kernel tcp-cc implementation,  For example, a bpf_dctcp may only want to
> > implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly
> > from the kernel tcp_dctcp.c instead of reimplementing (or
> > copy-and-pasting) them.
> >
> > The tcp-cc kernel functions mentioned above will be white listed
> > for the struct_ops bpf-tcp-cc programs to use in a later patch.
> > The white-listed functions are not bound to a fixed ABI contract.
> > Those functions have already been used by the existing kernel tcp-cc.
> > If any of them has changed, both in-tree and out-of-tree kernel tcp-cc
> > implementations have to be changed.  The same goes for the struct_ops
> > bpf-tcp-cc programs which have to be adjusted accordingly.
> >
> > This patch is to make the required changes in the bpf verifier.
> >
> > First change is in btf.c, it adds a case in "do_btf_check_func_arg_match()".
> > When the passed in "btf->kernel_btf == true", it means matching the
> > verifier regs' states with a kernel function.  This will handle the
> > PTR_TO_BTF_ID reg.  It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET,
> > and PTR_TO_TCP_SOCK to its kernel's btf_id.
> >
> > In the later libbpf patch, the insn calling a kernel function will
> > look like:
> >
> > insn->code == (BPF_JMP | BPF_CALL)
> > insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */
> > insn->imm == func_btf_id /* btf_id of the running kernel */
> >
> > [ For the future calling function-in-kernel-module support, an array
> >   of module btf_fds can be passed at the load time and insn->off
> >   can be used to index into this array. ]
> >
> > At the early stage of verifier, the verifier will collect all kernel
> > function calls into "struct bpf_kern_func_descriptor".  Those
> > descriptors are stored in "prog->aux->kfunc_tab" and will
> > be available to the JIT.  Since this "add" operation is similar
> > to the current "add_subprog()" and looking for the same insn->code,
> > they are done together in the new "add_subprog_and_kern_func()".
> >
> > In the "do_check()" stage, the new "check_kern_func_call()" is added
> > to verify the kernel function call instruction:
> > 1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE.
> >A new bpf_verifier_ops "check_kern_func_call" is added to do that.
> >The bpf-tcp-cc struct_ops program will implement this function in
> >a later patch.
> > 2. Call "btf_check_kern_func_args_match()" to ensure the regs can be
> >used as the args of a kernel function.
> > 3. Mark the regs' type, subreg_def, and zext_dst.
> >
> > At the later do_misc_fixups() stage, the new fixup_kern_func_call()
> > will replace the insn->imm with the function address (relative
> > to __bpf_call_base).  If needed, the jit can find the btf_func_model
> > by calling the new bpf_jit_find_kern_func_model(prog, insn->imm).
> > With the imm set to the function address, "bpftool prog dump xlated"
> > will be able to display the kernel function calls the same way as
> > it displays other bpf helper calls.
> >
> > gpl_compatible program is required to call kernel function.
> >
> > This feature currently requires JIT.
> >
> > Signed-off-by: Martin KaFai Lau 
> > ---
> 
> After the initial pass it all makes sense so far. I am a bit concerned
> about s32 and kernel function offset, though. See below.
> 
> Also "kern_func" and "descriptor" are quite mouthful, it seems to me
> that using kfunc consistently wouldn't hurt readability at all. You
> also already use desc in place of "descriptor" for variables, so I'd
> do that in type names as well.
The descriptor/desc naming follows the existing poke descriptor
and some of its helper naming.

Sure. both can be renamed in v2.
s/descriptor/desc/
s/kern_func/kfunc/
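
With that rename, the call-site encoding quoted above boils down to a
small predicate, e.g. (a sketch; the helper name is an assumption):

static bool bpf_pseudo_kfunc_call(const struct bpf_insn *insn)
{
	return insn->code == (BPF_JMP | BPF_CALL) &&
	       insn->src_reg == BPF_PSEUDO_KFUNC_CALL;
}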

> > +static int kern_func_desc_cmp_by_imm(const void *a, const void *b)
> > +{
> > +   const struct bpf_kern_func_descriptor *d0 = a;
> > +   const struct bpf_kern_func_descriptor *d1 = b;
> > +
> > +   return d0->imm - d1->imm;
> 
> this is not safe, assuming any possible s32 values, no?
Good catch. will fix.
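
The subtraction can wrap for arbitrary s32 imm values; an overflow-safe
comparator only compares, e.g. (a sketch, using the post-rename type name):

static int kfunc_desc_cmp_by_imm(const void *a, const void *b)
{
	const struct bpf_kfunc_desc *d0 = a;
	const struct bpf_kfunc_desc *d1 = b;

	/* avoid d0->imm - d1->imm, which can overflow */
	if (d0->imm != d1->imm)
		return d0->imm < d1->imm ? -1 : 1;
	return 0;
}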


Re: [PATCH bpf-next 03/15] bpf: Refactor btf_check_func_arg_match

2021-03-19 Thread Martin KaFai Lau
On Thu, Mar 18, 2021 at 04:32:47PM -0700, Andrii Nakryiko wrote:
> On Tue, Mar 16, 2021 at 12:01 AM Martin KaFai Lau  wrote:
> >
> > This patch refactors the core logic of "btf_check_func_arg_match()"
> > into a new function "do_btf_check_func_arg_match()".
> > "do_btf_check_func_arg_match()" will be reused later to check
> > the kernel function call.
> >
> > The "if (!btf_type_is_ptr(t))" is checked first to improve the indentation
> > which will be useful for a later patch.
> >
> > Some of the "btf_kind_str[]" usages is replaced with the shortcut
> > "btf_type_str(t)".
> >
> > Signed-off-by: Martin KaFai Lau 
> > ---
> >  include/linux/btf.h |   5 ++
> >  kernel/bpf/btf.c| 159 
> >  2 files changed, 91 insertions(+), 73 deletions(-)
> >
> > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > index 7fabf1428093..93bf2e5225f5 100644
> > --- a/include/linux/btf.h
> > +++ b/include/linux/btf.h
> > @@ -140,6 +140,11 @@ static inline bool btf_type_is_enum(const struct 
> > btf_type *t)
> > return BTF_INFO_KIND(t->info) == BTF_KIND_ENUM;
> >  }
> >
> > +static inline bool btf_type_is_scalar(const struct btf_type *t)
> > +{
> > +   return btf_type_is_int(t) || btf_type_is_enum(t);
> > +}
> > +
> >  static inline bool btf_type_is_typedef(const struct btf_type *t)
> >  {
> > return BTF_INFO_KIND(t->info) == BTF_KIND_TYPEDEF;
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 96cd24020a38..529b94b601c6 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -4381,7 +4381,7 @@ static u8 bpf_ctx_convert_map[] = {
> >  #undef BPF_LINK_TYPE
> >
> >  static const struct btf_member *
> > -btf_get_prog_ctx_type(struct bpf_verifier_log *log, struct btf *btf,
> > +btf_get_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf,
> >   const struct btf_type *t, enum bpf_prog_type 
> > prog_type,
> >   int arg)
> >  {
> > @@ -5366,122 +5366,135 @@ int btf_check_type_match(struct bpf_verifier_log 
> > *log, const struct bpf_prog *pr
> > return btf_check_func_type_match(log, btf1, t1, btf2, t2);
> >  }
> >
> > -/* Compare BTF of a function with given bpf_reg_state.
> > - * Returns:
> > - * EFAULT - there is a verifier bug. Abort verification.
> > - * EINVAL - there is a type mismatch or BTF is not available.
> > - * 0 - BTF matches with what bpf_reg_state expects.
> > - * Only PTR_TO_CTX and SCALAR_VALUE states are recognized.
> > - */
> > -int btf_check_func_arg_match(struct bpf_verifier_env *env, int subprog,
> > -struct bpf_reg_state *regs)
> > +static int do_btf_check_func_arg_match(struct bpf_verifier_env *env,
> 
> do_btf_check_func_arg_match vs btf_check_func_arg_match distinction is
> not clear at all. How about something like
> 
> btf_check_func_arg_match vs btf_check_subprog_arg_match (or btf_func
> vs bpf_subprog). I think that highlights the main distinction better,
> no?
will rename.

> 
> > +  const struct btf *btf, u32 func_id,
> > +  struct bpf_reg_state *regs,
> > +  bool ptr_to_mem_ok)
> >  {
> > struct bpf_verifier_log *log = &env->log;
> > -   struct bpf_prog *prog = env->prog;
> > -   struct btf *btf = prog->aux->btf;
> > -   const struct btf_param *args;
> > +   const char *func_name, *ref_tname;
> > const struct btf_type *t, *ref_t;
> > -   u32 i, nargs, btf_id, type_size;
> > -   const char *tname;
> > -   bool is_global;
> > -
> > -   if (!prog->aux->func_info)
> > -   return -EINVAL;
> > -
> > -   btf_id = prog->aux->func_info[subprog].type_id;
> > -   if (!btf_id)
> > -   return -EFAULT;
> > -
> > -   if (prog->aux->func_info_aux[subprog].unreliable)
> > -   return -EINVAL;
> > +   const struct btf_param *args;
> > +   u32 i, nargs;
> >
> > -   t = btf_type_by_id(btf, btf_id);
> > +   t = btf_type_by_id(btf, func_id);
> > if (!t || !btf_type_is_func(t)) {
> > /* These checks were already done by the verifier while 
> > loading
> >  * struct bpf_func_info
> >

Re: Design for sk_lookup helper function in context of sk_lookup hook

2021-03-19 Thread Martin KaFai Lau
On Wed, Mar 17, 2021 at 10:04:18AM +0100, Shanti Lombard née Bouchez-Mongardé 
wrote:
> Hello everyone,
> 
> Background discussion: 
> https://lore.kernel.org/bpf/CAADnVQJnX-+9u--px_VnhrMTPB=o9y0lh9t7rjbqzflchbu...@mail.gmail.com/
> 
> In a previous e-mail I was explaining that sk_lookup BPF programs had no way
> to lookup existing sockets in kernel space. The sockets have to be provided
> by userspace, and the userspace has to find a way to get them and update
> them, which in some circumstances can be difficult. I'm working on a patch
> to make socket lookup available to BPF programs in the context of the
> sk_lookup hook.
> 
> There are also two helper functions, bpf_sk_lookup_tcp and bpf_sk_lookup_udp,
> which exist but are not available in the context of sk_lookup hooks. Making
> them available in this context is not very difficult (just configure it in
> net/core/filter.c) but those helpers will call back BPF code as part of the
> socket lookup potentially causing an infinite loop. We need to find a way to
> perform socket lookup but disable the BPF hook while doing so.
> 
> Around all this, I have a few design questions :
> 
> Q1: How do we prevent socket lookup from triggering BPF sk_lookup causing a
> loop?
The bpf_sk_lookup_(tcp|udp) will be called from the BPF_PROG_TYPE_SK_LOOKUP 
program?

> 
> - Solution A: We add a flag to the existing inet_lookup exported function
> (and similarly for inet6, udp4 and udp6). The INET_LOOKUP_SKIP_BPF_SK_LOOKUP
> flag, when set, would prevent BPF sk_lookup from happening. It also requires
> modifying every location making use of those functions.
> 
> - Solution B: We export a new symbol in inet_hashtables, a wrapper around
> static function inet_lhash2_lookup for inet4 and similar functions for inet6
> and udp4/6. Looking up specific IP/port and if not found looking up for
> INADDR_ANY could be done in the helper function in net/core/filters.c or in
> the BPF program.
> 
> Q2: Should we reuse the bpf_sk_lookup_tcp and bpf_sk_lookup_udp helper
> functions or create new ones?
If the args passing to the bpf_sk_lookup_(tcp|udp) is the same,
it makes sense to reuse the same BPF_FUNC_sk_lookup_*.
The actual helper implementation could be different though.
Look at bpf_xdp_sk_lookup_tcp_proto and bpf_sk_lookup_tcp_proto.
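
A sketch of that pattern, i.e. the same BPF_FUNC_* id backed by a
prog-type-specific proto (simplified; a dedicated no-reentry variant
for the sk_lookup hook is an assumption, not existing code):

static const struct bpf_func_proto *
sk_lookup_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
	switch (func_id) {
	case BPF_FUNC_sk_lookup_tcp:
		/* could point at a variant that skips the BPF hook */
		return &bpf_sk_lookup_tcp_proto;
	case BPF_FUNC_sk_lookup_udp:
		return &bpf_sk_lookup_udp_proto;
	default:
		return bpf_sk_base_func_proto(func_id);
	}
}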

> 
> For solution A above, the helper functions can be reused almost identically,
> just adding a flag or boolean argument to tell if we are in a sk_lookup
> program or not. If solution B is preferred, then perhaps it would make sense
> to expose the new raw lookup function created, and the BPF program would be
> responsible for falling back to INADDR_ANY if the specific socket is not
> found. It adds more power to the BPF program in this case but requires
> creating a new helper function.
> 
> I was going with Solution A and identical function names, but as I am
> touching the code it seems that maybe solution B with a new helper function
> could be better. I'm open to ideas.
> 
> Thank you.
> 
> PS: please include me in replies if you are responding only to the netdev
> mailing list as I'm not part of it. I'm subscribed to bpf.
> 


Re: [PATCH bpf-next 15/15] bpf: selftest: Add kfunc_call test

2021-03-18 Thread Martin KaFai Lau
On Thu, Mar 18, 2021 at 09:21:08PM -0700, Andrii Nakryiko wrote:
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/kfunc_call.c
> > @@ -0,0 +1,61 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright (c) 2021 Facebook */
+#include <test_progs.h>
+#include <network_helpers.h>
> > +#include "kfunc_call_test.skel.h"
> > +#include "kfunc_call_test_subprog.skel.h"
> > +
> > +static __u32 duration;
> > +
> 
> you shouldn't need it, you don't use CHECK()s
It was for bpf_prog_test_run().
Just noticed it can take NULL.  will remove in v2.
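
i.e. the call can simply become (a sketch):

	err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
				NULL, NULL, (__u32 *)&retval,
				NULL /* duration */);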


Re: [PATCH bpf-next 02/15] bpf: btf: Support parsing extern func

2021-03-18 Thread Martin KaFai Lau
On Thu, Mar 18, 2021 at 09:13:56PM -0700, Andrii Nakryiko wrote:
> On Thu, Mar 18, 2021 at 4:39 PM Martin KaFai Lau  wrote:
> >
> > On Thu, Mar 18, 2021 at 03:53:38PM -0700, Andrii Nakryiko wrote:
> > > On Tue, Mar 16, 2021 at 12:01 AM Martin KaFai Lau  wrote:
> > > >
> > > > This patch makes the BTF verifier accept extern func. It is used for
> > > > allowing bpf program to call a limited set of kernel functions
> > > > in a later patch.
> > > >
> > > > When writing bpf prog, the extern kernel function needs
> > > > to be declared under a ELF section (".ksyms") which is
> > > > the same as the current extern kernel variables and that should
> > > > keep its usage consistent without requiring to remember another
> > > > section name.
> > > >
> > > > For example, in a bpf_prog.c:
> > > >
> > > > extern int foo(struct sock *) __attribute__((section(".ksyms")))
> > > >
> > > > [24] FUNC_PROTO '(anon)' ret_type_id=15 vlen=1
> > > > '(anon)' type_id=18
> > > > [25] FUNC 'foo' type_id=24 linkage=extern
> > > > [ ... ]
> > > > [33] DATASEC '.ksyms' size=0 vlen=1
> > > > type_id=25 offset=0 size=0
> > > >
> > > > LLVM will put the "func" type into the BTF datasec ".ksyms".
> > > > The current "btf_datasec_check_meta()" assumes everything under
> > > > it is a "var" and ensures it has non-zero size ("!vsi->size" test).
> > > > The non-zero size check is not true for "func".  This patch postpones 
> > > > the
> > > > "!vsi-size" test from "btf_datasec_check_meta()" to
> > > > "btf_datasec_resolve()" which has all types collected to decide
> > > > if a vsi is a "var" or a "func" and then enforce the "vsi->size"
> > > > differently.
> > > >
> > > > If the datasec only has "func", its "t->size" could be zero.
> > > > Thus, the current "!t->size" test is no longer valid.  The
> > > > invalid "t->size" will still be caught by the later
> > > > "last_vsi_end_off > t->size" check.   This patch also takes this
> > > > chance to consolidate other "t->size" tests ("vsi->offset >= t->size"
> > > > "vsi->size > t->size", and "t->size < sum") into the existing
> > > > "last_vsi_end_off > t->size" test.
> > > >
> > > > The LLVM will also put those extern kernel function as an extern
> > > > linkage func in the BTF:
> > > >
> > > > [24] FUNC_PROTO '(anon)' ret_type_id=15 vlen=1
> > > > '(anon)' type_id=18
> > > > [25] FUNC 'foo' type_id=24 linkage=extern
> > > >
> > > > This patch allows BTF_FUNC_EXTERN in btf_func_check_meta().
> > > > > > Also, an extern kernel function declaration does not
> > > > > > necessarily have arg names. Another change in btf_func_check() is
> > > > > > to allow an extern function to have no arg names.
> > > >
> > > > The btf selftest is adjusted accordingly.  New tests are also added.
> > > >
> > > > The required LLVM patch: https://reviews.llvm.org/D93563 
> > > >
> > > > Signed-off-by: Martin KaFai Lau 
> > > > ---
> > >
> > > High-level question about EXTERN functions in DATASEC. Does kernel
> > > need to see them under DATASEC? What if libbpf just removed all EXTERN
> > > funcs from under DATASEC and leave them as "free-floating" EXTERN
> > > FUNCs in BTF.
> > >
> > > We need to tag EXTERNs with DATASECs mainly for libbpf to know whether
> > > it's .kconfig or .ksym or other type of externs. Does kernel need to
> > > care?
> > Although the kernel does not need to know, since a legit llvm generates
> > it, I go with proper support in the kernel (e.g. bpftool btf dump can
> > better reflect what was there).
> 
> LLVM also generates extern VAR with BTF_VAR_EXTERN, yet libbpf is
> replacing it with fake INTs.
Yep. I noticed the loop in collect_extern() in libbpf.
It replaces the var->type with INT.

> We could do just that here as well.
What to replace in the FUNC case?

Regardless, supporting it properly in the kernel is a better way to go
instead of asking the userspace to work around it.  It is also not very
complicated to support in the kernel.

What is the concern of having the kernel to support it?

> If anyone would want to know all the kernel functions that some BPF
> program is using, they could do it from the instruction dump, with
> proper addresses and kernel function names nicely displayed there.
> That's way more useful, IMO.


Re: [PATCH bpf-next 12/15] libbpf: Support extern kernel function

2021-03-18 Thread Martin KaFai Lau
On Thu, Mar 18, 2021 at 09:11:39PM -0700, Andrii Nakryiko wrote:
> On Tue, Mar 16, 2021 at 12:02 AM Martin KaFai Lau  wrote:
> >
> > This patch is to make libbpf able to handle the following extern
> > kernel function declaration and do the needed relocations before
> > loading the bpf program to the kernel.
> >
> > extern int foo(struct sock *) __attribute__((section(".ksyms")))
> >
> > In the collect extern phase, the needed changes are made to
> > bpf_object__collect_externs() and find_extern_btf_id() to collect
> > functions.
> >
> > In the collect relo phase, it will record the kernel function
> > call as RELO_EXTERN_FUNC.
> >
> > bpf_object__resolve_ksym_func_btf_id() is added to find the func
> > btf_id of the running kernel.
> >
> > During actual relocation, it will patch the BPF_CALL instruction with
> > src_reg = BPF_PSEUDO_KFUNC_CALL and insn->imm set to the running
> > kernel func's btf_id.
> >
> > btf_fixup_datasec() is changed also because a datasec may
> > only have func and its size will be 0.  The "!size" test
> > is postponed till it is confirmed there are vars.
> > It also takes this chance to remove the
> > "if (... || (t->size && t->size != size)) { return -ENOENT; }" test
> > because t->size is zero at the point.
> >
> > The required LLVM patch: https://reviews.llvm.org/D93563 
> >
> > Signed-off-by: Martin KaFai Lau 
> > ---
> >  tools/lib/bpf/btf.c|  32 
> >  tools/lib/bpf/btf.h|   5 ++
> >  tools/lib/bpf/libbpf.c | 113 +
> >  3 files changed, 129 insertions(+), 21 deletions(-)
> >
> > diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
> > index 3aa58f2ac183..bb09b577c154 100644
> > --- a/tools/lib/bpf/btf.c
> > +++ b/tools/lib/bpf/btf.c
> > @@ -1108,7 +1108,7 @@ static int btf_fixup_datasec(struct bpf_object *obj, 
> > struct btf *btf,
> > const struct btf_type *t_var;
> > struct btf_var_secinfo *vsi;
> > const struct btf_var *var;
> > -   int ret;
> > +   int ret, nr_vars = 0;
> >
> > if (!name) {
> > pr_debug("No name found in string section for DATASEC 
> > kind.\n");
> > @@ -1117,27 +1117,27 @@ static int btf_fixup_datasec(struct bpf_object 
> > *obj, struct btf *btf,
> >
> > /* .extern datasec size and var offsets were set correctly during
> >  * extern collection step, so just skip straight to sorting 
> > variables
> > +* One exception is the datasec may only have extern funcs,
> > +* t->size is 0 in this case.  This will be handled
> > +* with !nr_vars later.
> >  */
> > if (t->size)
> > goto sort_vars;
> >
> > -   ret = bpf_object__section_size(obj, name, &size);
> > -   if (ret || !size || (t->size && t->size != size)) {
> > -   pr_debug("Invalid size for section %s: %u bytes\n", name, 
> > size);
> > -   return -ENOENT;
> > -   }
> > -
> > -   t->size = size;
> > +   bpf_object__section_size(obj, name, &size);
> 
> So it's not great that we just ignore any errors here. ".ksyms" is a
> special section, so it should be fine to just ignore it by name and
> leave the rest of error handling intact.
The ret < 0 case? In that case, size is 0.

or there are cases that a section has no vars but the size should not be 0?

> 
> >
> > for (i = 0, vsi = btf_var_secinfos(t); i < vars; i++, vsi++) {
> > t_var = btf__type_by_id(btf, vsi->type);
> > -   var = btf_var(t_var);
> >
> > -   if (!btf_is_var(t_var)) {
> > -   pr_debug("Non-VAR type seen in section %s\n", name);
> > +   if (btf_is_func(t_var)) {
> > +   continue;
> 
> just
> 
> if (btf_is_func(t_var))
> continue;
> 
> no need for "else if" below
> 
> > +   } else if (!btf_is_var(t_var)) {
> > +   pr_debug("Non-VAR and Non-FUNC type seen in section 
> > %s\n", name);
> 
> nit: Non-FUNC -> non-FUNC
> 
> > return -EINVAL;
> > }
> >
> > +   nr_vars++;
> > +   var = btf_var(t_var);
> > if (var->linkage == BTF_VAR_STATIC)
> > 

Re: [PATCH bpf-next 02/15] bpf: btf: Support parsing extern func

2021-03-18 Thread Martin KaFai Lau
On Thu, Mar 18, 2021 at 03:53:38PM -0700, Andrii Nakryiko wrote:
> On Tue, Mar 16, 2021 at 12:01 AM Martin KaFai Lau  wrote:
> >
> > This patch makes the BTF verifier accept extern func. It is used for
> > allowing bpf program to call a limited set of kernel functions
> > in a later patch.
> >
> > When writing bpf prog, the extern kernel function needs
> > to be declared under a ELF section (".ksyms") which is
> > the same as the current extern kernel variables and that should
> > keep its usage consistent without requiring to remember another
> > section name.
> >
> > For example, in a bpf_prog.c:
> >
> > extern int foo(struct sock *) __attribute__((section(".ksyms")))
> >
> > [24] FUNC_PROTO '(anon)' ret_type_id=15 vlen=1
> > '(anon)' type_id=18
> > [25] FUNC 'foo' type_id=24 linkage=extern
> > [ ... ]
> > [33] DATASEC '.ksyms' size=0 vlen=1
> > type_id=25 offset=0 size=0
> >
> > LLVM will put the "func" type into the BTF datasec ".ksyms".
> > The current "btf_datasec_check_meta()" assumes everything under
> > it is a "var" and ensures it has non-zero size ("!vsi->size" test).
> > The non-zero size check is not true for "func".  This patch postpones the
> > "!vsi-size" test from "btf_datasec_check_meta()" to
> > "btf_datasec_resolve()" which has all types collected to decide
> > if a vsi is a "var" or a "func" and then enforce the "vsi->size"
> > differently.
> >
> > If the datasec only has "func", its "t->size" could be zero.
> > Thus, the current "!t->size" test is no longer valid.  The
> > invalid "t->size" will still be caught by the later
> > "last_vsi_end_off > t->size" check.   This patch also takes this
> > chance to consolidate other "t->size" tests ("vsi->offset >= t->size"
> > "vsi->size > t->size", and "t->size < sum") into the existing
> > "last_vsi_end_off > t->size" test.
> >
> > The LLVM will also put those extern kernel function as an extern
> > linkage func in the BTF:
> >
> > [24] FUNC_PROTO '(anon)' ret_type_id=15 vlen=1
> > '(anon)' type_id=18
> > [25] FUNC 'foo' type_id=24 linkage=extern
> >
> > This patch allows BTF_FUNC_EXTERN in btf_func_check_meta().
> > Also, an extern kernel function declaration does not
> > necessarily have arg names. Another change in btf_func_check() is
> > to allow an extern function to have no arg names.
> >
> > The btf selftest is adjusted accordingly.  New tests are also added.
> >
> > The required LLVM patch: https://reviews.llvm.org/D93563 
> >
> > Signed-off-by: Martin KaFai Lau 
> > ---
> 
> High-level question about EXTERN functions in DATASEC. Does kernel
> need to see them under DATASEC? What if libbpf just removed all EXTERN
> funcs from under DATASEC and leave them as "free-floating" EXTERN
> FUNCs in BTF.
> 
> We need to tag EXTERNs with DATASECs mainly for libbpf to know whether
> it's .kconfig or .ksym or other type of externs. Does kernel need to
> care?
Although the kernel does not need to know, since a legit llvm generates it,
I go with proper support in the kernel (e.g. bpftool btf dump can better
reflect what was there).

> 
> >  kernel/bpf/btf.c |  52 ---
> >  tools/testing/selftests/bpf/prog_tests/btf.c | 154 ++-
> >  2 files changed, 178 insertions(+), 28 deletions(-)
> >
> 
> [...]
> 
> > @@ -3611,9 +3594,28 @@ static int btf_datasec_resolve(struct 
> > btf_verifier_env *env,
> > u32 var_type_id = vsi->type, type_id, type_size = 0;
> > const struct btf_type *var_type = btf_type_by_id(env->btf,
> >  
> > var_type_id);
> > -   if (!var_type || !btf_type_is_var(var_type)) {
> > +   if (!var_type) {
> > +   btf_verifier_log_vsi(env, v->t, vsi,
> > +"type not found");
> > +   return -EINVAL;
> > +   }
> > +
> > +   if (btf_type_is_func(var_type)) {
> > +   if (vsi->size || vsi->offset) {
> > +   btf_verifier_

[PATCH bpf-next 14/15] bpf: selftest: bpf_cubic and bpf_dctcp calling kernel functions

2021-03-15 Thread Martin KaFai Lau
This patch removes the bpf implementation of tcp_slow_start()
and tcp_cong_avoid_ai().  Instead, it directly uses the kernel
implementation.

It also replaces the bpf_cubic_undo_cwnd implementation by directly
calling tcp_reno_undo_cwnd().  bpf_dctcp also directly calls
tcp_reno_cong_avoid() instead.

Signed-off-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/bpf_tcp_helpers.h | 29 ++-
 tools/testing/selftests/bpf/progs/bpf_cubic.c |  6 ++--
 tools/testing/selftests/bpf/progs/bpf_dctcp.c | 22 --
 3 files changed, 11 insertions(+), 46 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h 
b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
index 91f0fac632f4..029589c008c9 100644
--- a/tools/testing/selftests/bpf/bpf_tcp_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
@@ -187,16 +187,6 @@ struct tcp_congestion_ops {
typeof(y) __y = (y);\
__x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })
 
-static __always_inline __u32 tcp_slow_start(struct tcp_sock *tp, __u32 acked)
-{
-   __u32 cwnd = min(tp->snd_cwnd + acked, tp->snd_ssthresh);
-
-   acked -= cwnd - tp->snd_cwnd;
-   tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);
-
-   return acked;
-}
-
 static __always_inline bool tcp_in_slow_start(const struct tcp_sock *tp)
 {
return tp->snd_cwnd < tp->snd_ssthresh;
@@ -213,22 +203,7 @@ static __always_inline bool tcp_is_cwnd_limited(const 
struct sock *sk)
return !!BPF_CORE_READ_BITFIELD(tp, is_cwnd_limited);
 }
 
-static __always_inline void tcp_cong_avoid_ai(struct tcp_sock *tp, __u32 w, 
__u32 acked)
-{
-   /* If credits accumulated at a higher w, apply them gently now. */
-   if (tp->snd_cwnd_cnt >= w) {
-   tp->snd_cwnd_cnt = 0;
-   tp->snd_cwnd++;
-   }
-
-   tp->snd_cwnd_cnt += acked;
-   if (tp->snd_cwnd_cnt >= w) {
-   __u32 delta = tp->snd_cwnd_cnt / w;
-
-   tp->snd_cwnd_cnt -= delta * w;
-   tp->snd_cwnd += delta;
-   }
-   tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_cwnd_clamp);
-}
+extern __u32 tcp_slow_start(struct tcp_sock *tp, __u32 acked) __ksym;
+extern void tcp_cong_avoid_ai(struct tcp_sock *tp, __u32 w, __u32 acked) 
__ksym;
 
 #endif
diff --git a/tools/testing/selftests/bpf/progs/bpf_cubic.c 
b/tools/testing/selftests/bpf/progs/bpf_cubic.c
index 33c4d2bded64..f62df4d023f9 100644
--- a/tools/testing/selftests/bpf/progs/bpf_cubic.c
+++ b/tools/testing/selftests/bpf/progs/bpf_cubic.c
@@ -525,11 +525,11 @@ void BPF_STRUCT_OPS(bpf_cubic_acked, struct sock *sk,
hystart_update(sk, delay);
 }
 
+extern __u32 tcp_reno_undo_cwnd(struct sock *sk) __ksym;
+
 __u32 BPF_STRUCT_OPS(bpf_cubic_undo_cwnd, struct sock *sk)
 {
-   const struct tcp_sock *tp = tcp_sk(sk);
-
-   return max(tp->snd_cwnd, tp->prior_cwnd);
+   return tcp_reno_undo_cwnd(sk);
 }
 
 SEC(".struct_ops")
diff --git a/tools/testing/selftests/bpf/progs/bpf_dctcp.c 
b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
index 4dc1a967776a..fd42247da8b4 100644
--- a/tools/testing/selftests/bpf/progs/bpf_dctcp.c
+++ b/tools/testing/selftests/bpf/progs/bpf_dctcp.c
@@ -194,22 +194,12 @@ __u32 BPF_PROG(dctcp_cwnd_undo, struct sock *sk)
return max(tcp_sk(sk)->snd_cwnd, ca->loss_cwnd);
 }
 
-SEC("struct_ops/tcp_reno_cong_avoid")
-void BPF_PROG(tcp_reno_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
-{
-   struct tcp_sock *tp = tcp_sk(sk);
-
-   if (!tcp_is_cwnd_limited(sk))
-   return;
+extern void tcp_reno_cong_avoid(struct sock *sk, __u32 ack, __u32 acked) 
__ksym;
 
-   /* In "safe" area, increase. */
-   if (tcp_in_slow_start(tp)) {
-   acked = tcp_slow_start(tp, acked);
-   if (!acked)
-   return;
-   }
-   /* In dangerous area, increase slowly. */
-   tcp_cong_avoid_ai(tp, tp->snd_cwnd, acked);
+SEC("struct_ops/dctcp_reno_cong_avoid")
+void BPF_PROG(dctcp_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
+{
+   tcp_reno_cong_avoid(sk, ack, acked);
 }
 
 SEC(".struct_ops")
@@ -226,7 +216,7 @@ struct tcp_congestion_ops dctcp = {
.in_ack_event   = (void *)dctcp_update_alpha,
.cwnd_event = (void *)dctcp_cwnd_event,
.ssthresh   = (void *)dctcp_ssthresh,
-   .cong_avoid = (void *)tcp_reno_cong_avoid,
+   .cong_avoid = (void *)dctcp_cong_avoid,
.undo_cwnd  = (void *)dctcp_cwnd_undo,
.set_state  = (void *)dctcp_state,
.flags  = TCP_CONG_NEEDS_ECN,
-- 
2.30.2



[PATCH bpf-next 15/15] bpf: selftest: Add kfunc_call test

2021-03-15 Thread Martin KaFai Lau
This patch adds two kernel functions, bpf_kfunc_call_test[12](), for the
selftest's test_run purpose.  They will be allowed for the tc_cls prog.

The selftest calling the kernel function bpf_kfunc_call_test[12]()
is also added in this patch.
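
For reference, a minimal sketch of the BPF caller side (the argument
values are an assumption, chosen so the sum comes back as the expected
retval of 12; the actual test program may differ):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "bpf_tcp_helpers.h"

extern __u64 bpf_kfunc_call_test1(struct sock *sk, __u32 a, __u64 b,
				  __u32 c, __u64 d) __ksym;

SEC("classifier")
int kfunc_call_test1(struct __sk_buff *skb)
{
	struct bpf_sock *sk = skb->sk;

	if (!sk)
		return -1;

	sk = bpf_sk_fullsock(sk);
	if (!sk)
		return -1;

	/* bpf_kfunc_call_test1() returns a + b + c + d */
	return bpf_kfunc_call_test1((struct sock *)sk, 1, 2, 4, 5);
}

char _license[] SEC("license") = "GPL";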

Signed-off-by: Martin KaFai Lau 
---
---
 net/bpf/test_run.c| 11 
 net/core/filter.c | 11 
 .../selftests/bpf/prog_tests/kfunc_call.c | 61 +++
 .../selftests/bpf/progs/kfunc_call_test.c | 48 +++
 .../bpf/progs/kfunc_call_test_subprog.c   | 31 ++
 5 files changed, 162 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/kfunc_call.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_subprog.c

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 0abdd67f44b1..c1baab0c7d96 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -209,6 +209,17 @@ int noinline bpf_modify_return_test(int a, int *b)
*b += 1;
return a + *b;
 }
+
+u64 noinline bpf_kfunc_call_test1(struct sock *sk, u32 a, u64 b, u32 c, u64 d)
+{
+   return a + b + c + d;
+}
+
+int noinline bpf_kfunc_call_test2(struct sock *sk, u32 a, u32 b)
+{
+   return a + b;
+}
+
 __diag_pop();
 
 ALLOW_ERROR_INJECTION(bpf_modify_return_test, ERRNO);
diff --git a/net/core/filter.c b/net/core/filter.c
index 10dac9dd5086..605fbbdd694b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9799,12 +9799,23 @@ const struct bpf_prog_ops sk_filter_prog_ops = {
.test_run   = bpf_prog_test_run_skb,
 };
 
+BTF_SET_START(bpf_tc_cls_kfunc_ids)
+BTF_ID(func, bpf_kfunc_call_test1)
+BTF_ID(func, bpf_kfunc_call_test2)
+BTF_SET_END(bpf_tc_cls_kfunc_ids)
+
+static bool tc_cls_check_kern_func_call(u32 kfunc_id)
+{
+   return btf_id_set_contains(&bpf_tc_cls_kfunc_ids, kfunc_id);
+}
+
 const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
.get_func_proto = tc_cls_act_func_proto,
.is_valid_access= tc_cls_act_is_valid_access,
.convert_ctx_access = tc_cls_act_convert_ctx_access,
.gen_prologue   = tc_cls_act_prologue,
.gen_ld_abs = bpf_gen_ld_abs,
+   .check_kern_func_call   = tc_cls_check_kern_func_call,
 };
 
 const struct bpf_prog_ops tc_cls_act_prog_ops = {
diff --git a/tools/testing/selftests/bpf/prog_tests/kfunc_call.c 
b/tools/testing/selftests/bpf/prog_tests/kfunc_call.c
new file mode 100644
index ..3850e6cc0a7d
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/kfunc_call.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "kfunc_call_test.skel.h"
+#include "kfunc_call_test_subprog.skel.h"
+
+static __u32 duration;
+
+static void test_main(void)
+{
+   struct kfunc_call_test *skel;
+   int prog_fd, retval, err;
+
+   skel = kfunc_call_test__open_and_load();
+   if (!ASSERT_OK_PTR(skel, "skel"))
+   return;
+
+   prog_fd = bpf_program__fd(skel->progs.kfunc_call_test1);
+   err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+   NULL, NULL, (__u32 *)&retval, &duration);
+
+   if (ASSERT_OK(err, "bpf_prog_test_run(test1)"))
+   ASSERT_EQ(retval, 12, "test1-retval");
+
+   prog_fd = bpf_program__fd(skel->progs.kfunc_call_test2);
+   err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+   NULL, NULL, (__u32 *)&retval, &duration);
+   if (ASSERT_OK(err, "bpf_prog_test_run(test2)"))
+   ASSERT_EQ(retval, 3, "test2-retval");
+
+   kfunc_call_test__destroy(skel);
+}
+
+static void test_subprog(void)
+{
+   struct kfunc_call_test_subprog *skel;
+   int prog_fd, retval, err;
+
+   skel = kfunc_call_test_subprog__open_and_load();
+   if (!ASSERT_OK_PTR(skel, "skel"))
+   return;
+
+   prog_fd = bpf_program__fd(skel->progs.kfunc_call_test1);
+   err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
+   NULL, NULL, (__u32 *)&retval, &duration);
+
+   if (ASSERT_OK(err, "bpf_prog_test_run(test1)"))
+   ASSERT_EQ(retval, 10, "test1-retval");
+
+   kfunc_call_test_subprog__destroy(skel);
+}
+
+void test_kfunc_call(void)
+{
+   if (test__start_subtest("main"))
+   test_main();
+
+   if (test__start_subtest("subprog"))
+   test_subprog();
+}
diff --git a/tools/testing/selftests/bpf/progs/kfunc_call_test.c 
b/tools/testing/selftests/bpf/progs/kfunc_call_test.c
new file mode 100644
index ..ea8c5266efd8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/

[PATCH bpf-next 13/15] bpf: selftests: Rename bictcp to bpf_cubic

2021-03-15 Thread Martin KaFai Lau
Following a similar change in the kernel, this patch gives the proper
names to the bpf cubic functions.

Signed-off-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/progs/bpf_cubic.c | 30 +--
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/bpf_cubic.c 
b/tools/testing/selftests/bpf/progs/bpf_cubic.c
index 6939bfd8690f..33c4d2bded64 100644
--- a/tools/testing/selftests/bpf/progs/bpf_cubic.c
+++ b/tools/testing/selftests/bpf/progs/bpf_cubic.c
@@ -174,8 +174,8 @@ static __always_inline void bictcp_hystart_reset(struct 
sock *sk)
  * as long as it is used in one of the func ptr
  * under SEC(".struct_ops").
  */
-SEC("struct_ops/bictcp_init")
-void BPF_PROG(bictcp_init, struct sock *sk)
+SEC("struct_ops/bpf_cubic_init")
+void BPF_PROG(bpf_cubic_init, struct sock *sk)
 {
struct bictcp *ca = inet_csk_ca(sk);
 
@@ -192,7 +192,7 @@ void BPF_PROG(bictcp_init, struct sock *sk)
  * The remaining tcp-cubic functions have an easier way.
  */
 SEC("no-sec-prefix-bictcp_cwnd_event")
-void BPF_PROG(bictcp_cwnd_event, struct sock *sk, enum tcp_ca_event event)
+void BPF_PROG(bpf_cubic_cwnd_event, struct sock *sk, enum tcp_ca_event event)
 {
if (event == CA_EVENT_TX_START) {
struct bictcp *ca = inet_csk_ca(sk);
@@ -384,7 +384,7 @@ static __always_inline void bictcp_update(struct bictcp 
*ca, __u32 cwnd,
 }
 
 /* Or simply use the BPF_STRUCT_OPS to avoid the SEC boiler plate. */
-void BPF_STRUCT_OPS(bictcp_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
+void BPF_STRUCT_OPS(bpf_cubic_cong_avoid, struct sock *sk, __u32 ack, __u32 
acked)
 {
struct tcp_sock *tp = tcp_sk(sk);
struct bictcp *ca = inet_csk_ca(sk);
@@ -403,7 +403,7 @@ void BPF_STRUCT_OPS(bictcp_cong_avoid, struct sock *sk, 
__u32 ack, __u32 acked)
tcp_cong_avoid_ai(tp, ca->cnt, acked);
 }
 
-__u32 BPF_STRUCT_OPS(bictcp_recalc_ssthresh, struct sock *sk)
+__u32 BPF_STRUCT_OPS(bpf_cubic_recalc_ssthresh, struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
struct bictcp *ca = inet_csk_ca(sk);
@@ -420,7 +420,7 @@ __u32 BPF_STRUCT_OPS(bictcp_recalc_ssthresh, struct sock 
*sk)
return max((tp->snd_cwnd * beta) / BICTCP_BETA_SCALE, 2U);
 }
 
-void BPF_STRUCT_OPS(bictcp_state, struct sock *sk, __u8 new_state)
+void BPF_STRUCT_OPS(bpf_cubic_state, struct sock *sk, __u8 new_state)
 {
if (new_state == TCP_CA_Loss) {
bictcp_reset(inet_csk_ca(sk));
@@ -496,7 +496,7 @@ static __always_inline void hystart_update(struct sock *sk, 
__u32 delay)
}
 }
 
-void BPF_STRUCT_OPS(bictcp_acked, struct sock *sk,
+void BPF_STRUCT_OPS(bpf_cubic_acked, struct sock *sk,
const struct ack_sample *sample)
 {
const struct tcp_sock *tp = tcp_sk(sk);
@@ -525,7 +525,7 @@ void BPF_STRUCT_OPS(bictcp_acked, struct sock *sk,
hystart_update(sk, delay);
 }
 
-__u32 BPF_STRUCT_OPS(tcp_reno_undo_cwnd, struct sock *sk)
+__u32 BPF_STRUCT_OPS(bpf_cubic_undo_cwnd, struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
 
@@ -534,12 +534,12 @@ __u32 BPF_STRUCT_OPS(tcp_reno_undo_cwnd, struct sock *sk)
 
 SEC(".struct_ops")
 struct tcp_congestion_ops cubic = {
-   .init   = (void *)bictcp_init,
-   .ssthresh   = (void *)bictcp_recalc_ssthresh,
-   .cong_avoid = (void *)bictcp_cong_avoid,
-   .set_state  = (void *)bictcp_state,
-   .undo_cwnd  = (void *)tcp_reno_undo_cwnd,
-   .cwnd_event = (void *)bictcp_cwnd_event,
-   .pkts_acked = (void *)bictcp_acked,
+   .init   = (void *)bpf_cubic_init,
+   .ssthresh   = (void *)bpf_cubic_recalc_ssthresh,
+   .cong_avoid = (void *)bpf_cubic_cong_avoid,
+   .set_state  = (void *)bpf_cubic_state,
+   .undo_cwnd  = (void *)bpf_cubic_undo_cwnd,
+   .cwnd_event = (void *)bpf_cubic_cwnd_event,
+   .pkts_acked = (void *)bpf_cubic_acked,
.name   = "bpf_cubic",
 };
-- 
2.30.2



[PATCH bpf-next 12/15] libbpf: Support extern kernel function

2021-03-15 Thread Martin KaFai Lau
This patch is to make libbpf able to handle the following extern
kernel function declaration and do the needed relocations before
loading the bpf program to the kernel.

extern int foo(struct sock *) __attribute__((section(".ksyms")))

In the collect extern phase, the needed changes are made to
bpf_object__collect_externs() and find_extern_btf_id() to collect
functions.

In the collect relo phase, it will record the kernel function
call as RELO_EXTERN_FUNC.

bpf_object__resolve_ksym_func_btf_id() is added to find the func
btf_id of the running kernel.

During actual relocation, it will patch the BPF_CALL instruction with
src_reg = BPF_PSEUDO_KFUNC_CALL and insn->imm set to the running
kernel func's btf_id.
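
I.e. the patched call insn ends up as (a sketch; the helper name is
illustrative, not the final code):

static void reloc_extern_kfunc(struct bpf_insn *insn, __u32 kernel_btf_id)
{
	/* mark the call as a kernel (kfunc) call and point it at the
	 * btf_id resolved from the running kernel's BTF
	 */
	insn->src_reg = BPF_PSEUDO_KFUNC_CALL;
	insn->imm = kernel_btf_id;
}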

btf_fixup_datasec() is changed also because a datasec may
only have func and its size will be 0.  The "!size" test
is postponed till it is confirmed there are vars.
It also takes this chance to remove the
"if (... || (t->size && t->size != size)) { return -ENOENT; }" test
because t->size is zero at the point.

The required LLVM patch: https://reviews.llvm.org/D93563

Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/btf.c|  32 
 tools/lib/bpf/btf.h|   5 ++
 tools/lib/bpf/libbpf.c | 113 +
 3 files changed, 129 insertions(+), 21 deletions(-)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 3aa58f2ac183..bb09b577c154 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -1108,7 +1108,7 @@ static int btf_fixup_datasec(struct bpf_object *obj, 
struct btf *btf,
const struct btf_type *t_var;
struct btf_var_secinfo *vsi;
const struct btf_var *var;
-   int ret;
+   int ret, nr_vars = 0;
 
if (!name) {
pr_debug("No name found in string section for DATASEC kind.\n");
@@ -1117,27 +1117,27 @@ static int btf_fixup_datasec(struct bpf_object *obj, 
struct btf *btf,
 
/* .extern datasec size and var offsets were set correctly during
 * extern collection step, so just skip straight to sorting variables
+* One exception is the datasec may only have extern funcs,
+* t->size is 0 in this case.  This will be handled
+* with !nr_vars later.
 */
if (t->size)
goto sort_vars;
 
-   ret = bpf_object__section_size(obj, name, &size);
-   if (ret || !size || (t->size && t->size != size)) {
-   pr_debug("Invalid size for section %s: %u bytes\n", name, size);
-   return -ENOENT;
-   }
-
-   t->size = size;
+   bpf_object__section_size(obj, name, &size);
 
for (i = 0, vsi = btf_var_secinfos(t); i < vars; i++, vsi++) {
t_var = btf__type_by_id(btf, vsi->type);
-   var = btf_var(t_var);
 
-   if (!btf_is_var(t_var)) {
-   pr_debug("Non-VAR type seen in section %s\n", name);
+   if (btf_is_func(t_var)) {
+   continue;
+   } else if (!btf_is_var(t_var)) {
+   pr_debug("Non-VAR and Non-FUNC type seen in section 
%s\n", name);
return -EINVAL;
}
 
+   nr_vars++;
+   var = btf_var(t_var);
if (var->linkage == BTF_VAR_STATIC)
continue;
 
@@ -1157,6 +1157,16 @@ static int btf_fixup_datasec(struct bpf_object *obj, 
struct btf *btf,
vsi->offset = off;
}
 
+   if (!nr_vars)
+   return 0;
+
+   if (!size) {
+   pr_debug("Invalid size for section %s: %u bytes\n", name, size);
+   return -ENOENT;
+   }
+
+   t->size = size;
+
 sort_vars:
qsort(btf_var_secinfos(t), vars, sizeof(*vsi), compare_vsi_off);
return 0;
diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
index 029a9cfc8c2d..07d508b70497 100644
--- a/tools/lib/bpf/btf.h
+++ b/tools/lib/bpf/btf.h
@@ -368,6 +368,11 @@ btf_var_secinfos(const struct btf_type *t)
return (struct btf_var_secinfo *)(t + 1);
 }
 
+static inline enum btf_func_linkage btf_func_linkage(const struct btf_type *t)
+{
+   return (enum btf_func_linkage)BTF_INFO_VLEN(t->info);
+}
+
 #ifdef __cplusplus
 } /* extern "C" */
 #endif
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 0a60fcb2fba2..49bda179bd93 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -190,6 +190,7 @@ enum reloc_type {
RELO_CALL,
RELO_DATA,
RELO_EXTERN_VAR,
+   RELO_EXTERN_FUNC,
RELO_SUBPROG_ADDR,
 };
 
@@ -384,6 +385,7 @@ struct extern_desc {
int btf_id;
int sec_btf_id;
const char *name;
+   const struct btf_type *btf_type;
bool is_set;
bool is_weak;
union {
@@ -3022,7 +3024,7 @@ static bool 

[PATCH bpf-next 08/15] libbpf: Refactor bpf_object__resolve_ksyms_btf_id

2021-03-15 Thread Martin KaFai Lau
This patch refactors most of the logic from
bpf_object__resolve_ksyms_btf_id() into a new function
bpf_object__resolve_ksym_var_btf_id().
It is to get ready for a later patch adding
bpf_object__resolve_ksym_func_btf_id() which resolves
a kernel function to the running kernel btf_id.
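
After the refactor, the top-level resolver reduces to a dispatch loop,
roughly (a sketch; the func branch arrives in the later patch):

static int bpf_object__resolve_ksyms_btf_id(struct bpf_object *obj)
{
	struct extern_desc *ext;
	int i, err;

	for (i = 0; i < obj->nr_extern; i++) {
		ext = &obj->externs[i];
		if (ext->type != EXT_KSYM || !ext->ksym.type_id)
			continue;
		err = bpf_object__resolve_ksym_var_btf_id(obj, ext);
		if (err)
			return err;
	}
	return 0;
}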

Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/libbpf.c | 125 ++---
 1 file changed, 68 insertions(+), 57 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2f351d3ad3e7..7d5f9b7877bc 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -7403,75 +7403,86 @@ static int bpf_object__read_kallsyms_file(struct 
bpf_object *obj)
return err;
 }
 
-static int bpf_object__resolve_ksyms_btf_id(struct bpf_object *obj)
+static int bpf_object__resolve_ksym_var_btf_id(struct bpf_object *obj,
+  struct extern_desc *ext)
 {
-   struct extern_desc *ext;
+   const struct btf_type *targ_var, *targ_type;
+   __u32 targ_type_id, local_type_id;
+   const char *targ_var_name;
+   int i, id, btf_fd, err;
struct btf *btf;
-   int i, j, id, btf_fd, err;
 
-   for (i = 0; i < obj->nr_extern; i++) {
-   const struct btf_type *targ_var, *targ_type;
-   __u32 targ_type_id, local_type_id;
-   const char *targ_var_name;
-   int ret;
+   btf = obj->btf_vmlinux;
+   btf_fd = 0;
+   id = btf__find_by_name_kind(btf, ext->name, BTF_KIND_VAR);
+   if (id == -ENOENT) {
+   err = load_module_btfs(obj);
+   if (err)
+   return err;
 
-   ext = &obj->externs[i];
-   if (ext->type != EXT_KSYM || !ext->ksym.type_id)
-   continue;
+   for (i = 0; i < obj->btf_module_cnt; i++) {
+   btf = obj->btf_modules[i].btf;
+   /* we assume module BTF FD is always >0 */
+   btf_fd = obj->btf_modules[i].fd;
+   id = btf__find_by_name_kind(btf, ext->name, 
BTF_KIND_VAR);
+   if (id != -ENOENT)
+   break;
+   }
+   }
+   if (id <= 0) {
+   pr_warn("extern (var ksym) '%s': failed to find BTF ID in 
kernel BTF(s).\n",
+   ext->name);
+   return -ESRCH;
+   }
 
-   btf = obj->btf_vmlinux;
-   btf_fd = 0;
-   id = btf__find_by_name_kind(btf, ext->name, BTF_KIND_VAR);
-   if (id == -ENOENT) {
-   err = load_module_btfs(obj);
-   if (err)
-   return err;
+   /* find local type_id */
+   local_type_id = ext->ksym.type_id;
 
-   for (j = 0; j < obj->btf_module_cnt; j++) {
-   btf = obj->btf_modules[j].btf;
-   /* we assume module BTF FD is always >0 */
-   btf_fd = obj->btf_modules[j].fd;
-   id = btf__find_by_name_kind(btf, ext->name, 
BTF_KIND_VAR);
-   if (id != -ENOENT)
-   break;
-   }
-   }
-   if (id <= 0) {
-   pr_warn("extern (ksym) '%s': failed to find BTF ID in 
kernel BTF(s).\n",
-   ext->name);
-   return -ESRCH;
-   }
+   /* find target type_id */
+   targ_var = btf__type_by_id(btf, id);
+   targ_var_name = btf__name_by_offset(btf, targ_var->name_off);
+   targ_type = skip_mods_and_typedefs(btf, targ_var->type, &targ_type_id);
 
-   /* find local type_id */
-   local_type_id = ext->ksym.type_id;
+   err = bpf_core_types_are_compat(obj->btf, local_type_id,
+   btf, targ_type_id);
+   if (err <= 0) {
+   const struct btf_type *local_type;
+   const char *targ_name, *local_name;
 
-   /* find target type_id */
-   targ_var = btf__type_by_id(btf, id);
-   targ_var_name = btf__name_by_offset(btf, targ_var->name_off);
-   targ_type = skip_mods_and_typedefs(btf, targ_var->type, 
&targ_type_id);
+   local_type = btf__type_by_id(obj->btf, local_type_id);
+   local_name = btf__name_by_offset(obj->btf, 
local_type->name_off);
+   targ_name = btf__name_by_offset(btf, targ_type->name_off);
 
-   ret = bpf_core_types_are_compat(obj->btf, local_type_id,
-   btf, targ_type_id);
-   if (ret <= 0) {
-

[PATCH bpf-next 07/15] bpf: tcp: White list some tcp cong functions to be called by bpf-tcp-cc

2021-03-15 Thread Martin KaFai Lau
This patch white list some tcp cong helper functions, tcp_slow_start()
and tcp_cong_avoid_ai().  They are allowed to be directly called by
the bpf-tcp-cc program.

A few tcp cc implementation functions are also white listed.
A potential use case is the bpf-tcp-cc implementation may only
want to override a subset of a tcp_congestion_ops.  For others,
the bpf-tcp-cc can directly call the kernel counterparts instead of
re-implementing (or copy-and-pasting) them in the bpf program.

They will only be available to bpf-tcp-cc typed programs.
The white listed functions are not bound to a fixed ABI contract.
When any of them changes, the bpf-tcp-cc program has to be changed
along with it, just like any in-tree/out-of-tree kernel tcp-cc
implementation.
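
As an illustration only (not part of this patch; the prog name is
hypothetical, and tcp_sk() is assumed from bpf_tcp_helpers.h), a
struct_ops bpf-tcp-cc could then call one of these helpers along
these lines:

	/* sketch: calling a white listed kernel helper from a
	 * struct_ops bpf-tcp-cc; __ksym marks the kernel symbol
	 */
	extern void tcp_cong_avoid_ai(struct tcp_sock *tp, __u32 w,
				      __u32 acked) __ksym;

	SEC("struct_ops/bpf_cc_cong_avoid")
	void BPF_PROG(bpf_cc_cong_avoid, struct sock *sk, __u32 ack,
		      __u32 acked)
	{
		struct tcp_sock *tp = tcp_sk(sk);

		/* direct call into the kernel implementation */
		tcp_cong_avoid_ai(tp, tp->snd_cwnd, acked);
	}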

Signed-off-by: Martin KaFai Lau 
---
 net/ipv4/bpf_tcp_ca.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index d520e61649c8..ed6e6b5b762b 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -178,10 +179,50 @@ bpf_tcp_ca_get_func_proto(enum bpf_func_id func_id,
}
 }
 
+BTF_SET_START(bpf_tcp_ca_kfunc_ids)
+BTF_ID(func, tcp_reno_ssthresh)
+BTF_ID(func, tcp_reno_cong_avoid)
+BTF_ID(func, tcp_reno_undo_cwnd)
+BTF_ID(func, tcp_slow_start)
+BTF_ID(func, tcp_cong_avoid_ai)
+#if IS_BUILTIN(CONFIG_TCP_CONG_CUBIC)
+BTF_ID(func, cubictcp_init)
+BTF_ID(func, cubictcp_recalc_ssthresh)
+BTF_ID(func, cubictcp_cong_avoid)
+BTF_ID(func, cubictcp_state)
+BTF_ID(func, cubictcp_cwnd_event)
+BTF_ID(func, cubictcp_acked)
+#endif
+#if IS_BUILTIN(CONFIG_TCP_CONG_DCTCP)
+BTF_ID(func, dctcp_init)
+BTF_ID(func, dctcp_update_alpha)
+BTF_ID(func, dctcp_cwnd_event)
+BTF_ID(func, dctcp_ssthresh)
+BTF_ID(func, dctcp_cwnd_undo)
+BTF_ID(func, dctcp_state)
+#endif
+#if IS_BUILTIN(CONFIG_TCP_CONG_BBR)
+BTF_ID(func, bbr_init)
+BTF_ID(func, bbr_main)
+BTF_ID(func, bbr_sndbuf_expand)
+BTF_ID(func, bbr_undo_cwnd)
+BTF_ID(func, bbr_cwnd_event)
+BTF_ID(func, bbr_ssthresh)
+BTF_ID(func, bbr_min_tso_segs)
+BTF_ID(func, bbr_set_state)
+#endif
+BTF_SET_END(bpf_tcp_ca_kfunc_ids)
+
+static bool bpf_tcp_ca_check_kern_func_call(u32 kfunc_btf_id)
+{
+   return btf_id_set_contains(&bpf_tcp_ca_kfunc_ids, kfunc_btf_id);
+}
+
 static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
.get_func_proto = bpf_tcp_ca_get_func_proto,
.is_valid_access= bpf_tcp_ca_is_valid_access,
.btf_struct_access  = bpf_tcp_ca_btf_struct_access,
+   .check_kern_func_call   = bpf_tcp_ca_check_kern_func_call,
 };
 
 static int bpf_tcp_ca_init_member(const struct btf_type *t,
-- 
2.30.2



[PATCH bpf-next 10/15] libbpf: Rename RELO_EXTERN to RELO_EXTERN_VAR

2021-03-15 Thread Martin KaFai Lau
This patch renames RELO_EXTERN to RELO_EXTERN_VAR.
This avoids confusion with RELO_EXTERN_FUNC, which a later
patch will add.

Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/libbpf.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 8355b786b3db..8f924aece736 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -189,7 +189,7 @@ enum reloc_type {
RELO_LD64,
RELO_CALL,
RELO_DATA,
-   RELO_EXTERN,
+   RELO_EXTERN_VAR,
RELO_SUBPROG_ADDR,
 };
 
@@ -3463,7 +3463,7 @@ static int bpf_program__record_reloc(struct bpf_program 
*prog,
}
pr_debug("prog '%s': found extern #%d '%s' (sym %d) for insn 
#%u\n",
 prog->name, i, ext->name, ext->sym_idx, insn_idx);
-   reloc_desc->type = RELO_EXTERN;
+   reloc_desc->type = RELO_EXTERN_VAR;
reloc_desc->insn_idx = insn_idx;
reloc_desc->sym_off = i; /* sym_off stores extern index */
return 0;
@@ -6226,7 +6226,7 @@ bpf_object__relocate_data(struct bpf_object *obj, struct 
bpf_program *prog)
insn[0].imm = obj->maps[relo->map_idx].fd;
relo->processed = true;
break;
-   case RELO_EXTERN:
+   case RELO_EXTERN_VAR:
ext = &obj->externs[relo->sym_off];
if (ext->type == EXT_KCFG) {
insn[0].src_reg = BPF_PSEUDO_MAP_VALUE;
-- 
2.30.2



[PATCH bpf-next 11/15] libbpf: Record extern sym relocation first

2021-03-15 Thread Martin KaFai Lau
This patch records the extern sym relocs before recording the
subprog relocs.  A later patch will add relocs for extern kernel
function calls, which also use BPF_JMP | BPF_CALL, and handling
the extern symbols first will make that patch simpler.

Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/libbpf.c | 50 +-
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 8f924aece736..0a60fcb2fba2 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -3416,31 +3416,7 @@ static int bpf_program__record_reloc(struct bpf_program 
*prog,
 
reloc_desc->processed = false;
 
-   /* sub-program call relocation */
-   if (insn->code == (BPF_JMP | BPF_CALL)) {
-   if (insn->src_reg != BPF_PSEUDO_CALL) {
-   pr_warn("prog '%s': incorrect bpf_call opcode\n", 
prog->name);
-   return -LIBBPF_ERRNO__RELOC;
-   }
-   /* text_shndx can be 0, if no default "main" program exists */
-   if (!shdr_idx || shdr_idx != obj->efile.text_shndx) {
-   sym_sec_name = elf_sec_name(obj, elf_sec_by_idx(obj, 
shdr_idx));
-   pr_warn("prog '%s': bad call relo against '%s' in 
section '%s'\n",
-   prog->name, sym_name, sym_sec_name);
-   return -LIBBPF_ERRNO__RELOC;
-   }
-   if (sym->st_value % BPF_INSN_SZ) {
-   pr_warn("prog '%s': bad call relo against '%s' at 
offset %zu\n",
-   prog->name, sym_name, (size_t)sym->st_value);
-   return -LIBBPF_ERRNO__RELOC;
-   }
-   reloc_desc->type = RELO_CALL;
-   reloc_desc->insn_idx = insn_idx;
-   reloc_desc->sym_off = sym->st_value;
-   return 0;
-   }
-
-   if (!is_ldimm64(insn)) {
+   if (insn->code != (BPF_JMP | BPF_CALL) && !is_ldimm64(insn)) {
pr_warn("prog '%s': invalid relo against '%s' for 
insns[%d].code 0x%x\n",
prog->name, sym_name, insn_idx, insn->code);
return -LIBBPF_ERRNO__RELOC;
@@ -3469,6 +3445,30 @@ static int bpf_program__record_reloc(struct bpf_program 
*prog,
return 0;
}
 
+   /* sub-program call relocation */
+   if (insn->code == (BPF_JMP | BPF_CALL)) {
+   if (insn->src_reg != BPF_PSEUDO_CALL) {
+   pr_warn("prog '%s': incorrect bpf_call opcode\n", 
prog->name);
+   return -LIBBPF_ERRNO__RELOC;
+   }
+   /* text_shndx can be 0, if no default "main" program exists */
+   if (!shdr_idx || shdr_idx != obj->efile.text_shndx) {
+   sym_sec_name = elf_sec_name(obj, elf_sec_by_idx(obj, 
shdr_idx));
+   pr_warn("prog '%s': bad call relo against '%s' in 
section '%s'\n",
+   prog->name, sym_name, sym_sec_name);
+   return -LIBBPF_ERRNO__RELOC;
+   }
+   if (sym->st_value % BPF_INSN_SZ) {
+   pr_warn("prog '%s': bad call relo against '%s' at 
offset %zu\n",
+   prog->name, sym_name, (size_t)sym->st_value);
+   return -LIBBPF_ERRNO__RELOC;
+   }
+   reloc_desc->type = RELO_CALL;
+   reloc_desc->insn_idx = insn_idx;
+   reloc_desc->sym_off = sym->st_value;
+   return 0;
+   }
+
if (!shdr_idx || shdr_idx >= SHN_LORESERVE) {
pr_warn("prog '%s': invalid relo against '%s' in special 
section 0x%x; forgot to initialize global var?..\n",
prog->name, sym_name, shdr_idx);
-- 
2.30.2



[PATCH bpf-next 09/15] libbpf: Refactor codes for finding btf id of a kernel symbol

2021-03-15 Thread Martin KaFai Lau
This patch refactors the code that finds a kernel btf_id by kind
and symbol name into a new function, find_ksym_btf_id().

It also adds a new helper, __btf_kind_str(), that returns
the string name for a numeric kind value.

Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/libbpf.c | 44 +++---
 1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 7d5f9b7877bc..8355b786b3db 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1929,9 +1929,9 @@ resolve_func_ptr(const struct btf *btf, __u32 id, __u32 
*res_id)
return btf_is_func_proto(t) ? t : NULL;
 }
 
-static const char *btf_kind_str(const struct btf_type *t)
+static const char *__btf_kind_str(__u16 kind)
 {
-   switch (btf_kind(t)) {
+   switch (kind) {
case BTF_KIND_UNKN: return "void";
case BTF_KIND_INT: return "int";
case BTF_KIND_PTR: return "ptr";
@@ -1953,6 +1953,11 @@ static const char *btf_kind_str(const struct btf_type *t)
}
 }
 
+static const char *btf_kind_str(const struct btf_type *t)
+{
+   return __btf_kind_str(btf_kind(t));
+}
+
 /*
  * Fetch integer attribute of BTF map definition. Such attributes are
  * represented using a pointer to an array, in which dimensionality of array
@@ -7403,18 +7408,17 @@ static int bpf_object__read_kallsyms_file(struct 
bpf_object *obj)
return err;
 }
 
-static int bpf_object__resolve_ksym_var_btf_id(struct bpf_object *obj,
-  struct extern_desc *ext)
+static int find_ksym_btf_id(struct bpf_object *obj, const char *ksym_name,
+   __u16 kind, struct btf **res_btf,
+   int *res_btf_fd)
 {
-   const struct btf_type *targ_var, *targ_type;
-   __u32 targ_type_id, local_type_id;
-   const char *targ_var_name;
int i, id, btf_fd, err;
struct btf *btf;
 
btf = obj->btf_vmlinux;
btf_fd = 0;
-   id = btf__find_by_name_kind(btf, ext->name, BTF_KIND_VAR);
+   id = btf__find_by_name_kind(btf, ksym_name, kind);
+
if (id == -ENOENT) {
err = load_module_btfs(obj);
if (err)
@@ -7424,17 +7428,35 @@ static int bpf_object__resolve_ksym_var_btf_id(struct 
bpf_object *obj,
btf = obj->btf_modules[i].btf;
/* we assume module BTF FD is always >0 */
btf_fd = obj->btf_modules[i].fd;
-   id = btf__find_by_name_kind(btf, ext->name, 
BTF_KIND_VAR);
+   id = btf__find_by_name_kind(btf, ksym_name, kind);
if (id != -ENOENT)
break;
}
}
if (id <= 0) {
-   pr_warn("extern (var ksym) '%s': failed to find BTF ID in 
kernel BTF(s).\n",
-   ext->name);
+   pr_warn("extern (%s ksym) '%s': failed to find BTF ID in kernel 
BTF(s).\n",
+   __btf_kind_str(kind), ksym_name);
return -ESRCH;
}
 
+   *res_btf = btf;
+   *res_btf_fd = btf_fd;
+   return id;
+}
+
+static int bpf_object__resolve_ksym_var_btf_id(struct bpf_object *obj,
+  struct extern_desc *ext)
+{
+   const struct btf_type *targ_var, *targ_type;
+   __u32 targ_type_id, local_type_id;
+   const char *targ_var_name;
+   int id, btf_fd = 0, err;
+   struct btf *btf = NULL;
+
+   id = find_ksym_btf_id(obj, ext->name, BTF_KIND_VAR, &btf, &btf_fd);
+   if (id < 0)
+   return id;
+
/* find local type_id */
local_type_id = ext->ksym.type_id;
 
-- 
2.30.2



[PATCH bpf-next 04/15] bpf: Support bpf program calling kernel function

2021-03-15 Thread Martin KaFai Lau
This patch adds support to the BPF verifier to allow a bpf program
to call kernel functions directly.

The use case included in this set is to allow bpf-tcp-cc to directly
call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
functions have already been used by some kernel tcp-cc implementations.

This set will also allow the bpf-tcp-cc program to directly call the
kernel tcp-cc implementation.  For example, a bpf_dctcp may only want to
implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly
from the kernel tcp_dctcp.c instead of reimplementing (or
copy-and-pasting) them.

The tcp-cc kernel functions mentioned above will be white listed
for the struct_ops bpf-tcp-cc programs to use in a later patch.
The white listed functions are not bound to a fixed ABI contract.
Those functions have already been used by the existing kernel tcp-cc.
If any of them changes, both in-tree and out-of-tree kernel tcp-cc
implementations have to be changed.  The same goes for the struct_ops
bpf-tcp-cc programs which have to be adjusted accordingly.

This patch is to make the required changes in the bpf verifier.

The first change is in btf.c: it adds a case in
"do_btf_check_func_arg_match()".  When the passed-in
"btf->kernel_btf == true", it means the verifier regs' states are
matched against a kernel function.  This handles the PTR_TO_BTF_ID
reg.  It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET,
and PTR_TO_TCP_SOCK to their kernel btf_ids.

In the later libbpf patch, the insn calling a kernel function will
look like:

insn->code == (BPF_JMP | BPF_CALL)
insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */
insn->imm == func_btf_id /* btf_id of the running kernel */

[ For the future calling function-in-kernel-module support, an array
  of module btf_fds can be passed at the load time and insn->off
  can be used to index into this array. ]
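
A minimal sketch of how such an insn can be recognized (the helper
name is illustrative, not from the patch):

	static bool insn_is_kfunc_call(const struct bpf_insn *insn)
	{
		return insn->code == (BPF_JMP | BPF_CALL) &&
		       insn->src_reg == BPF_PSEUDO_KFUNC_CALL;
	}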

At an early stage of verification, the verifier collects all kernel
function calls into "struct bpf_kern_func_descriptor".  Those
descriptors are stored in "prog->aux->kfunc_tab" and will
be available to the JIT.  Since this "add" operation is similar
to the current "add_subprog()" and looks for the same insn->code,
they are done together in the new "add_subprog_and_kern_func()".

In the "do_check()" stage, the new "check_kern_func_call()" is added
to verify the kernel function call instruction:
1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE.
   A new bpf_verifier_ops "check_kern_func_call" is added to do that.
   The bpf-tcp-cc struct_ops program will implement this function in
   a later patch.
2. Call "btf_check_kern_func_args_match()" to ensure the regs can be
   used as the args of a kernel function.
3. Mark the regs' type, subreg_def, and zext_dst.

At the later do_misc_fixups() stage, the new fixup_kern_func_call()
will replace the insn->imm with the function address (relative
to __bpf_call_base).  If needed, the jit can find the btf_func_model
by calling the new bpf_jit_find_kern_func_model(prog, insn->imm).
With the imm set to the function address, "bpftool prog dump xlated"
will be able to display the kernel function calls the same way as
it displays other bpf helper calls.
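
A conceptual sketch of that fixup (names are illustrative; the
actual patch code differs, and kfunc_addr is assumed to have been
resolved from the BTF id already):

	static void fixup_kfunc_imm(struct bpf_insn *insn,
				    unsigned long kfunc_addr)
	{
		/* the JIT recovers the address as __bpf_call_base + imm */
		insn->imm = (s32)(kfunc_addr - (unsigned long)__bpf_call_base);
	}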

A gpl_compatible program is required to call kernel functions.

This feature currently requires JIT.

Signed-off-by: Martin KaFai Lau 
---
 arch/x86/net/bpf_jit_comp.c   |   5 +
 include/linux/bpf.h   |  24 ++
 include/linux/btf.h   |   1 +
 include/linux/filter.h|   1 +
 include/uapi/linux/bpf.h  |   4 +
 kernel/bpf/btf.c  |  65 +-
 kernel/bpf/core.c |  18 +-
 kernel/bpf/disasm.c   |  32 +--
 kernel/bpf/disasm.h   |   3 +-
 kernel/bpf/syscall.c  |   1 +
 kernel/bpf/verifier.c | 376 --
 tools/bpf/bpftool/xlated_dumper.c |   3 +-
 tools/include/uapi/linux/bpf.h|   4 +
 13 files changed, 488 insertions(+), 49 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 6926d0ca6c71..bcb957234410 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -2327,3 +2327,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
*prog)
   tmp : orig_prog);
return prog;
 }
+
+bool bpf_jit_supports_kfunc_call(void)
+{
+   return true;
+}
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a25730eaa148..75ab8dc02df5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -426,6 +426,7 @@ enum bpf_reg_type {
PTR_TO_PERCPU_BTF_ID,/* reg points to a percpu kernel variable */
PTR_TO_FUNC, /* reg points to a bpf program function */
PTR_TO_MAP_KEY,  /

[PATCH bpf-next 06/15] tcp: Rename bictcp function prefix to cubictcp

2021-03-15 Thread Martin KaFai Lau
The cubic functions in tcp_cubic.c use the bictcp prefix, the same
as tcp_bic.c.  This patch renames them with the proper cubictcp
prefix because a later patch will allow a bpf prog to directly call
the cubictcp implementation.  The rename avoids a name collision
when looking up the intended function to call at bpf prog load time.

Signed-off-by: Martin KaFai Lau 
---
 net/ipv4/tcp_cubic.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
index ffcbe46dacdb..4a30deaa9a37 100644
--- a/net/ipv4/tcp_cubic.c
+++ b/net/ipv4/tcp_cubic.c
@@ -124,7 +124,7 @@ static inline void bictcp_hystart_reset(struct sock *sk)
ca->sample_cnt = 0;
 }
 
-static void bictcp_init(struct sock *sk)
+static void cubictcp_init(struct sock *sk)
 {
struct bictcp *ca = inet_csk_ca(sk);
 
@@ -137,7 +137,7 @@ static void bictcp_init(struct sock *sk)
tcp_sk(sk)->snd_ssthresh = initial_ssthresh;
 }
 
-static void bictcp_cwnd_event(struct sock *sk, enum tcp_ca_event event)
+static void cubictcp_cwnd_event(struct sock *sk, enum tcp_ca_event event)
 {
if (event == CA_EVENT_TX_START) {
struct bictcp *ca = inet_csk_ca(sk);
@@ -319,7 +319,7 @@ static inline void bictcp_update(struct bictcp *ca, u32 
cwnd, u32 acked)
ca->cnt = max(ca->cnt, 2U);
 }
 
-static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
+static void cubictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
 {
struct tcp_sock *tp = tcp_sk(sk);
struct bictcp *ca = inet_csk_ca(sk);
@@ -338,7 +338,7 @@ static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 
acked)
tcp_cong_avoid_ai(tp, ca->cnt, acked);
 }
 
-static u32 bictcp_recalc_ssthresh(struct sock *sk)
+static u32 cubictcp_recalc_ssthresh(struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
struct bictcp *ca = inet_csk_ca(sk);
@@ -355,7 +355,7 @@ static u32 bictcp_recalc_ssthresh(struct sock *sk)
return max((tp->snd_cwnd * beta) / BICTCP_BETA_SCALE, 2U);
 }
 
-static void bictcp_state(struct sock *sk, u8 new_state)
+static void cubictcp_state(struct sock *sk, u8 new_state)
 {
if (new_state == TCP_CA_Loss) {
bictcp_reset(inet_csk_ca(sk));
@@ -442,7 +442,7 @@ static void hystart_update(struct sock *sk, u32 delay)
}
 }
 
-static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
+static void cubictcp_acked(struct sock *sk, const struct ack_sample *sample)
 {
const struct tcp_sock *tp = tcp_sk(sk);
struct bictcp *ca = inet_csk_ca(sk);
@@ -471,13 +471,13 @@ static void bictcp_acked(struct sock *sk, const struct 
ack_sample *sample)
 }
 
 static struct tcp_congestion_ops cubictcp __read_mostly = {
-   .init   = bictcp_init,
-   .ssthresh   = bictcp_recalc_ssthresh,
-   .cong_avoid = bictcp_cong_avoid,
-   .set_state  = bictcp_state,
+   .init   = cubictcp_init,
+   .ssthresh   = cubictcp_recalc_ssthresh,
+   .cong_avoid = cubictcp_cong_avoid,
+   .set_state  = cubictcp_state,
.undo_cwnd  = tcp_reno_undo_cwnd,
-   .cwnd_event = bictcp_cwnd_event,
-   .pkts_acked = bictcp_acked,
+   .cwnd_event = cubictcp_cwnd_event,
+   .pkts_acked = cubictcp_acked,
.owner  = THIS_MODULE,
.name   = "cubic",
 };
-- 
2.30.2



[PATCH bpf-next 05/15] bpf: Support kernel function call in x86-32

2021-03-15 Thread Martin KaFai Lau
This patch adds kernel function call support to the x86-32 bpf jit.

Signed-off-by: Martin KaFai Lau 
---
 arch/x86/net/bpf_jit_comp32.c | 198 ++
 1 file changed, 198 insertions(+)

diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index d17b67c69f89..f2ac36cf08ac 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -1390,6 +1390,19 @@ static inline void emit_push_r64(const u8 src[], u8 
**pprog)
*pprog = prog;
 }
 
+static void emit_push_r32(const u8 src[], u8 **pprog)
+{
+   u8 *prog = *pprog;
+   int cnt = 0;
+
+   /* mov ecx,dword ptr [ebp+off] */
+   EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src_lo));
+   /* push ecx */
+   EMIT1(0x51);
+
+   *pprog = prog;
+}
+
 static u8 get_cond_jmp_opcode(const u8 op, bool is_cmp_lo)
 {
u8 jmp_cond;
@@ -1459,6 +1472,174 @@ static u8 get_cond_jmp_opcode(const u8 op, bool 
is_cmp_lo)
return jmp_cond;
 }
 
+/* i386 kernel compiles with "-mregparm=3".  From gcc document:
+ *
+ *  snippet 
+ * regparm (number)
+ * On x86-32 targets, the regparm attribute causes the compiler
+ * to pass arguments number one to (number) if they are of integral
+ * type in registers EAX, EDX, and ECX instead of on the stack.
+ * Functions that take a variable number of arguments continue
+ * to be passed all of their arguments on the stack.
+ *  snippet 
+ *
+ * The first three args of a function will be considered for
+ * putting into the 32bit register EAX, EDX, and ECX.
+ *
+ * Two 32bit registers are used to pass a 64bit arg.
+ *
+ * For example,
+ * void foo(u32 a, u32 b, u32 c, u32 d):
+ * u32 a: EAX
+ * u32 b: EDX
+ * u32 c: ECX
+ * u32 d: stack
+ *
+ * void foo(u64 a, u32 b, u32 c):
+ * u64 a: EAX (lo32) EDX (hi32)
+ * u32 b: ECX
+ * u32 c: stack
+ *
+ * void foo(u32 a, u64 b, u32 c):
+ * u32 a: EAX
+ * u64 b: EDX (lo32) ECX (hi32)
+ * u32 c: stack
+ *
+ * void foo(u32 a, u32 b, u64 c):
+ * u32 a: EAX
+ * u32 b: EDX
+ * u64 c: stack
+ *
+ * The return value will be stored in the EAX (and EDX for 64bit value).
+ *
+ * For example,
+ * u32 foo(u32 a, u32 b, u32 c):
+ * return value: EAX
+ *
+ * u64 foo(u32 a, u32 b, u32 c):
+ * return value: EAX (lo32) EDX (hi32)
+ *
+ * Notes:
+ * The verifier only accepts function having integer and pointers
+ * as its args and return value, so it does not have
+ * struct-by-value.
+ *
+ * emit_kfunc_call() finds out the btf_func_model by calling
+ * bpf_jit_find_kern_func_model().  A btf_func_model
+ * has the details about the number of args, size of each arg,
+ * and the size of the return value.
+ *
+ * It first decides how many args can be passed by EAX, EDX, and ECX.
+ * That will decide what args should be pushed to the stack:
+ * [first_stack_regno, last_stack_regno] are the bpf regnos
+ * that should be pushed to the stack.
+ *
+ * It will first push all args to the stack because the push
+ * will need to use ECX.  Then, it moves
+ * [BPF_REG_1, first_stack_regno) to EAX, EDX, and ECX.
+ *
+ * When emitting a call (0xE8), it needs to figure out
+ * the jmp_offset relative to the jit-insn address immediately
+ * following the call (0xE8) instruction.  At this point, it knows
+ * the end of the jit-insn address after completely translated the
+ * current (BPF_JMP | BPF_CALL) bpf-insn.  It is passed as "end_addr"
+ * to the emit_kfunc_call().  Thus, it can learn the "immediate-follow-call"
+ * address by figuring out how many jit-insn is generated between
+ * the call (0xE8) and the end_addr:
+ * - 0-1 jit-insn (3 bytes each) to restore the esp pointer if there
+ *   is arg pushed to the stack.
+ * - 0-2 jit-insns (3 bytes each) to handle the return value.
+ */
+static int emit_kfunc_call(const struct bpf_prog *bpf_prog, u8 *end_addr,
+  const struct bpf_insn *insn, u8 **pprog)
+{
+   const u8 arg_regs[] = { IA32_EAX, IA32_EDX, IA32_ECX };
+   int i, cnt = 0, first_stack_regno, last_stack_regno;
+   int free_arg_regs = ARRAY_SIZE(arg_regs);
+   const struct btf_func_model *fm;
+   int bytes_in_stack = 0;
+   const u8 *cur_arg_reg;
+   u8 *prog = *pprog;
+   s64 jmp_offset;
+
+   fm = bpf_jit_find_kern_func_model(bpf_prog, insn);
+   if (!fm)
+   return -EINVAL;
+
+   first_stack_regno = BPF_REG_1;
+   for (i = 0; i < fm->nr_args; i++) {
+   int regs_needed = fm->arg_size[i] > sizeof(u32) ? 2 : 1;
+
+   if (regs_needed > free_arg_regs)
+   break;
+
+   free_arg_regs -= regs_needed;
+   first_stack_regno++;
+   }
+
+   /* Push the args to the stack */
+   last_stack_regno = BPF_REG_0 + fm->nr_args;
+   for (i = last_stack_regno; i >= first_stack_regno; i--)

[PATCH bpf-next 02/15] bpf: btf: Support parsing extern func

2021-03-15 Thread Martin KaFai Lau
This patch makes the BTF verifier accept extern func.  It is used
to allow a bpf program to call a limited set of kernel functions
in a later patch.

When writing a bpf prog, the extern kernel function needs to be
declared under an ELF section (".ksyms").  This is the same section
used by the current extern kernel variables, which keeps usage
consistent and avoids having to remember another section name.

For example, in a bpf_prog.c:

extern int foo(struct sock *) __attribute__((section(".ksyms")))

[24] FUNC_PROTO '(anon)' ret_type_id=15 vlen=1
'(anon)' type_id=18
[25] FUNC 'foo' type_id=24 linkage=extern
[ ... ]
[33] DATASEC '.ksyms' size=0 vlen=1
type_id=25 offset=0 size=0

LLVM will put the "func" type into the BTF datasec ".ksyms".
The current "btf_datasec_check_meta()" assumes everything under
it is a "var" and ensures it has non-zero size ("!vsi->size" test).
The non-zero size check is not true for "func".  This patch postpones the
"!vsi-size" test from "btf_datasec_check_meta()" to
"btf_datasec_resolve()" which has all types collected to decide
if a vsi is a "var" or a "func" and then enforce the "vsi->size"
differently.

If the datasec only has "func", its "t->size" could be zero.
Thus, the current "!t->size" test is no longer valid.  The
invalid "t->size" will still be caught by the later
"last_vsi_end_off > t->size" check.   This patch also takes this
chance to consolidate other "t->size" tests ("vsi->offset >= t->size"
"vsi->size > t->size", and "t->size < sum") into the existing
"last_vsi_end_off > t->size" test.

LLVM will also emit those extern kernel functions as extern
linkage funcs in the BTF:

[24] FUNC_PROTO '(anon)' ret_type_id=15 vlen=1
'(anon)' type_id=18
[25] FUNC 'foo' type_id=24 linkage=extern

This patch allows BTF_FUNC_EXTERN in btf_func_check_meta().
Also, an extern kernel function declaration does not necessarily
have arg names, so another change in btf_func_check() is to allow
an extern function to have no arg names.

The btf selftest is adjusted accordingly.  New tests are also added.

The required LLVM patch: https://reviews.llvm.org/D93563
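
For reference, libbpf's bpf_helpers.h provides a __ksym macro for
this section attribute, so the declaration above can also be
written as (sketch, reusing the "foo" example):

	#define __ksym __attribute__((section(".ksyms")))

	extern int foo(struct sock *) __ksym;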

Signed-off-by: Martin KaFai Lau 
---
 kernel/bpf/btf.c |  52 ---
 tools/testing/selftests/bpf/prog_tests/btf.c | 154 ++-
 2 files changed, 178 insertions(+), 28 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 369faeddf1df..96cd24020a38 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3439,7 +3439,7 @@ static s32 btf_func_check_meta(struct btf_verifier_env 
*env,
return -EINVAL;
}
 
-   if (btf_type_vlen(t) > BTF_FUNC_GLOBAL) {
+   if (btf_type_vlen(t) > BTF_FUNC_EXTERN) {
btf_verifier_log_type(env, t, "Invalid func linkage");
return -EINVAL;
}
@@ -3532,7 +3532,7 @@ static s32 btf_datasec_check_meta(struct btf_verifier_env 
*env,
  u32 meta_left)
 {
const struct btf_var_secinfo *vsi;
-   u64 last_vsi_end_off = 0, sum = 0;
+   u64 last_vsi_end_off = 0;
u32 i, meta_needed;
 
meta_needed = btf_type_vlen(t) * sizeof(*vsi);
@@ -3543,11 +3543,6 @@ static s32 btf_datasec_check_meta(struct 
btf_verifier_env *env,
return -EINVAL;
}
 
-   if (!t->size) {
-   btf_verifier_log_type(env, t, "size == 0");
-   return -EINVAL;
-   }
-
if (btf_type_kflag(t)) {
btf_verifier_log_type(env, t, "Invalid btf_info kind_flag");
return -EINVAL;
@@ -3569,19 +3564,13 @@ static s32 btf_datasec_check_meta(struct 
btf_verifier_env *env,
return -EINVAL;
}
 
-   if (vsi->offset < last_vsi_end_off || vsi->offset >= t->size) {
+   if (vsi->offset < last_vsi_end_off) {
btf_verifier_log_vsi(env, t, vsi,
 "Invalid offset");
return -EINVAL;
}
 
-   if (!vsi->size || vsi->size > t->size) {
-   btf_verifier_log_vsi(env, t, vsi,
-"Invalid size");
-   return -EINVAL;
-   }
-
-   last_vsi_end_off = vsi->offset + vsi->size;
+   last_vsi_end_off = (u64)vsi->offset + vsi->size;
if (last_vsi_end_off > t->size) {
btf_verifier_log_vsi(env, t, vsi,

[PATCH bpf-next 03/15] bpf: Refactor btf_check_func_arg_match

2021-03-15 Thread Martin KaFai Lau
This patch refactors the core logic of "btf_check_func_arg_match()"
into a new function "do_btf_check_func_arg_match()".
"do_btf_check_func_arg_match()" will be reused later to check
the kernel function call.

The "if (!btf_type_is_ptr(t))" is checked first to improve the indentation
which will be useful for a later patch.

Some of the "btf_kind_str[]" usages is replaced with the shortcut
"btf_type_str(t)".

Signed-off-by: Martin KaFai Lau 
---
 include/linux/btf.h |   5 ++
 kernel/bpf/btf.c| 159 
 2 files changed, 91 insertions(+), 73 deletions(-)

diff --git a/include/linux/btf.h b/include/linux/btf.h
index 7fabf1428093..93bf2e5225f5 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -140,6 +140,11 @@ static inline bool btf_type_is_enum(const struct btf_type 
*t)
return BTF_INFO_KIND(t->info) == BTF_KIND_ENUM;
 }
 
+static inline bool btf_type_is_scalar(const struct btf_type *t)
+{
+   return btf_type_is_int(t) || btf_type_is_enum(t);
+}
+
 static inline bool btf_type_is_typedef(const struct btf_type *t)
 {
return BTF_INFO_KIND(t->info) == BTF_KIND_TYPEDEF;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 96cd24020a38..529b94b601c6 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -4381,7 +4381,7 @@ static u8 bpf_ctx_convert_map[] = {
 #undef BPF_LINK_TYPE
 
 static const struct btf_member *
-btf_get_prog_ctx_type(struct bpf_verifier_log *log, struct btf *btf,
+btf_get_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf,
  const struct btf_type *t, enum bpf_prog_type prog_type,
  int arg)
 {
@@ -5366,122 +5366,135 @@ int btf_check_type_match(struct bpf_verifier_log 
*log, const struct bpf_prog *pr
return btf_check_func_type_match(log, btf1, t1, btf2, t2);
 }
 
-/* Compare BTF of a function with given bpf_reg_state.
- * Returns:
- * EFAULT - there is a verifier bug. Abort verification.
- * EINVAL - there is a type mismatch or BTF is not available.
- * 0 - BTF matches with what bpf_reg_state expects.
- * Only PTR_TO_CTX and SCALAR_VALUE states are recognized.
- */
-int btf_check_func_arg_match(struct bpf_verifier_env *env, int subprog,
-struct bpf_reg_state *regs)
+static int do_btf_check_func_arg_match(struct bpf_verifier_env *env,
+  const struct btf *btf, u32 func_id,
+  struct bpf_reg_state *regs,
+  bool ptr_to_mem_ok)
 {
struct bpf_verifier_log *log = &env->log;
-   struct bpf_prog *prog = env->prog;
-   struct btf *btf = prog->aux->btf;
-   const struct btf_param *args;
+   const char *func_name, *ref_tname;
const struct btf_type *t, *ref_t;
-   u32 i, nargs, btf_id, type_size;
-   const char *tname;
-   bool is_global;
-
-   if (!prog->aux->func_info)
-   return -EINVAL;
-
-   btf_id = prog->aux->func_info[subprog].type_id;
-   if (!btf_id)
-   return -EFAULT;
-
-   if (prog->aux->func_info_aux[subprog].unreliable)
-   return -EINVAL;
+   const struct btf_param *args;
+   u32 i, nargs;
 
-   t = btf_type_by_id(btf, btf_id);
+   t = btf_type_by_id(btf, func_id);
if (!t || !btf_type_is_func(t)) {
/* These checks were already done by the verifier while loading
 * struct bpf_func_info
 */
-   bpf_log(log, "BTF of func#%d doesn't point to KIND_FUNC\n",
-   subprog);
+   bpf_log(log, "BTF of func_id %u doesn't point to KIND_FUNC\n",
+   func_id);
return -EFAULT;
}
-   tname = btf_name_by_offset(btf, t->name_off);
+   func_name = btf_name_by_offset(btf, t->name_off);
 
t = btf_type_by_id(btf, t->type);
if (!t || !btf_type_is_func_proto(t)) {
-   bpf_log(log, "Invalid BTF of func %s\n", tname);
+   bpf_log(log, "Invalid BTF of func %s\n", func_name);
return -EFAULT;
}
args = (const struct btf_param *)(t + 1);
nargs = btf_type_vlen(t);
if (nargs > MAX_BPF_FUNC_REG_ARGS) {
-   bpf_log(log, "Function %s has %d > %d args\n", tname, nargs,
+   bpf_log(log, "Function %s has %d > %d args\n", func_name, nargs,
MAX_BPF_FUNC_REG_ARGS);
-   goto out;
+   return -EINVAL;
}
 
-   is_global = prog->aux->func_info_aux[subprog].linkage == 
BTF_FUNC_GLOBAL;
/* check that BTF function arguments match actual types that the
 * verifier sees.
 */
for (i = 0; i < nargs; i++) {
- 

[PATCH bpf-next 01/15] bpf: Simplify freeing logic in linfo and jited_linfo

2021-03-15 Thread Martin KaFai Lau
This patch simplifies the linfo freeing logic by combining
"bpf_prog_free_jited_linfo()" and "bpf_prog_free_unused_jited_linfo()"
into the new "bpf_prog_jit_attempt_done()".
It is prep work for the kernel function call support.  In a later
patch, freeing the kernel function call descriptors will also
be done in the "bpf_prog_jit_attempt_done()".

"bpf_prog_free_linfo()" is removed since it is only called by
"__bpf_prog_put_noref()".  The kvfree() are directly called
instead.

It also takes this chance to s/kcalloc/kvcalloc/ for the jited_linfo
allocation.

Signed-off-by: Martin KaFai Lau 
---
 include/linux/filter.h |  3 +--
 kernel/bpf/core.c  | 35 ---
 kernel/bpf/syscall.c   |  3 ++-
 kernel/bpf/verifier.c  |  4 ++--
 4 files changed, 17 insertions(+), 28 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index b2b85b2cad8e..0d9c710eb050 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -877,8 +877,7 @@ void bpf_prog_free_linfo(struct bpf_prog *prog);
 void bpf_prog_fill_jited_linfo(struct bpf_prog *prog,
   const u32 *insn_to_jit_off);
 int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog);
-void bpf_prog_free_jited_linfo(struct bpf_prog *prog);
-void bpf_prog_free_unused_jited_linfo(struct bpf_prog *prog);
+void bpf_prog_jit_attempt_done(struct bpf_prog *prog);
 
 struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags);
 struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t 
gfp_extra_flags);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3a283bf97f2f..4a6dd327446b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -143,25 +143,22 @@ int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog)
if (!prog->aux->nr_linfo || !prog->jit_requested)
return 0;
 
-   prog->aux->jited_linfo = kcalloc(prog->aux->nr_linfo,
-sizeof(*prog->aux->jited_linfo),
-GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
+   prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
+ sizeof(*prog->aux->jited_linfo),
+ GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!prog->aux->jited_linfo)
return -ENOMEM;
 
return 0;
 }
 
-void bpf_prog_free_jited_linfo(struct bpf_prog *prog)
+void bpf_prog_jit_attempt_done(struct bpf_prog *prog)
 {
-   kfree(prog->aux->jited_linfo);
-   prog->aux->jited_linfo = NULL;
-}
-
-void bpf_prog_free_unused_jited_linfo(struct bpf_prog *prog)
-{
-   if (prog->aux->jited_linfo && !prog->aux->jited_linfo[0])
-   bpf_prog_free_jited_linfo(prog);
+   if (prog->aux->jited_linfo &&
+   (!prog->jited || !prog->aux->jited_linfo[0])) {
+   kvfree(prog->aux->jited_linfo);
+   prog->aux->jited_linfo = NULL;
+   }
 }
 
 /* The jit engine is responsible to provide an array
@@ -217,12 +214,6 @@ void bpf_prog_fill_jited_linfo(struct bpf_prog *prog,
insn_to_jit_off[linfo[i].insn_off - insn_start - 1];
 }
 
-void bpf_prog_free_linfo(struct bpf_prog *prog)
-{
-   bpf_prog_free_jited_linfo(prog);
-   kvfree(prog->aux->linfo);
-}
-
 struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size,
  gfp_t gfp_extra_flags)
 {
@@ -1866,15 +1857,13 @@ struct bpf_prog *bpf_prog_select_runtime(struct 
bpf_prog *fp, int *err)
return fp;
 
fp = bpf_int_jit_compile(fp);
-   if (!fp->jited) {
-   bpf_prog_free_jited_linfo(fp);
+   bpf_prog_jit_attempt_done(fp);
 #ifdef CONFIG_BPF_JIT_ALWAYS_ON
+   if (!fp->jited) {
*err = -ENOTSUPP;
return fp;
-#endif
-   } else {
-   bpf_prog_free_unused_jited_linfo(fp);
}
+#endif
} else {
*err = bpf_prog_offload_compile(fp);
if (*err)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c859bc46d06c..78a653e25df0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1689,7 +1689,8 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, 
bool deferred)
 {
bpf_prog_kallsyms_del_all(prog);
btf_put(prog->aux->btf);
-   bpf_prog_free_linfo(prog);
+   kvfree(prog->aux->jited_linfo);
+   kvfree(prog->aux->linfo);
if (prog->aux->attach_btf)
btf_put(prog->aux->attach_btf);
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f9096b049cd6..0647454a0c8e 100644
--- a/kernel/bpf/verifier.c
+++ b/ke

[PATCH bpf-next 00/15] Support calling kernel function

2021-03-15 Thread Martin KaFai Lau
This series adds support to allow a bpf program to call kernel functions.

The use case included in this set is to allow bpf-tcp-cc to directly
call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
functions have already been used by some kernel tcp-cc implementations.

This set will also allow the bpf-tcp-cc program to directly call the
kernel tcp-cc implementation.  For example, a bpf_dctcp may only want to
implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly
from the kernel tcp_dctcp.c instead of reimplementing (or
copy-and-pasting) them.

The tcp-cc kernel functions mentioned above will be white listed
for the struct_ops bpf-tcp-cc programs to use in a later patch.
The white listed functions are not bound to a fixed ABI contract.
Those functions have already been used by the existing kernel tcp-cc.
If any of them changes, both in-tree and out-of-tree kernel tcp-cc
implementations have to be changed.  The same goes for the struct_ops
bpf-tcp-cc programs which have to be adjusted accordingly.

Please see individual patch for details.

Martin KaFai Lau (15):
  bpf: Simplify freeing logic in linfo and jited_linfo
  bpf: btf: Support parsing extern func
  bpf: Refactor btf_check_func_arg_match
  bpf: Support bpf program calling kernel function
  bpf: Support kernel function call in x86-32
  tcp: Rename bictcp function prefix to cubictcp
  bpf: tcp: White list some tcp cong functions to be called by
bpf-tcp-cc
  libbpf: Refactor bpf_object__resolve_ksyms_btf_id
  libbpf: Refactor codes for finding btf id of a kernel symbol
  libbpf: Rename RELO_EXTERN to RELO_EXTERN_VAR
  libbpf: Record extern sym relocation first
  libbpf: Support extern kernel function
  bpf: selftests: Rename bictcp to bpf_cubic
  bpf: selftest: bpf_cubic and bpf_dctcp calling kernel functions
  bpf: selftest: Add kfunc_call test

 arch/x86/net/bpf_jit_comp.c   |   5 +
 arch/x86/net/bpf_jit_comp32.c | 198 +
 include/linux/bpf.h   |  24 ++
 include/linux/btf.h   |   6 +
 include/linux/filter.h|   4 +-
 include/uapi/linux/bpf.h  |   4 +
 kernel/bpf/btf.c  | 270 -
 kernel/bpf/core.c |  47 +--
 kernel/bpf/disasm.c   |  32 +-
 kernel/bpf/disasm.h   |   3 +-
 kernel/bpf/syscall.c  |   4 +-
 kernel/bpf/verifier.c | 380 --
 net/bpf/test_run.c|  11 +
 net/core/filter.c |  11 +
 net/ipv4/bpf_tcp_ca.c |  41 ++
 net/ipv4/tcp_cubic.c  |  24 +-
 tools/bpf/bpftool/xlated_dumper.c |   3 +-
 tools/include/uapi/linux/bpf.h|   4 +
 tools/lib/bpf/btf.c   |  32 +-
 tools/lib/bpf/btf.h   |   5 +
 tools/lib/bpf/libbpf.c| 316 ++-
 tools/testing/selftests/bpf/bpf_tcp_helpers.h |  29 +-
 tools/testing/selftests/bpf/prog_tests/btf.c  | 154 ++-
 .../selftests/bpf/prog_tests/kfunc_call.c |  61 +++
 tools/testing/selftests/bpf/progs/bpf_cubic.c |  36 +-
 tools/testing/selftests/bpf/progs/bpf_dctcp.c |  22 +-
 .../selftests/bpf/progs/kfunc_call_test.c |  48 +++
 .../bpf/progs/kfunc_call_test_subprog.c   |  31 ++
 28 files changed, 1454 insertions(+), 351 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/kfunc_call.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/kfunc_call_test_subprog.c

-- 
2.30.2



Re: [PATCH bpf-next] tools/runqslower: allow substituting custom vmlinux.h for the build

2021-03-02 Thread Martin KaFai Lau
On Tue, Mar 02, 2021 at 04:40:10PM -0800, Andrii Nakryiko wrote:
> Just like was done for bpftool and selftests in ec23eb705620 ("tools/bpftool:
> Allow substituting custom vmlinux.h for the build") and ca4db6389d61
> ("selftests/bpf: Allow substituting custom vmlinux.h for selftests build"),
> allow to provide pre-generated vmlinux.h for runqslower build.
Acked-by: Martin KaFai Lau 


Re: [PATCH v6 bpf-next 0/6] bpf: enable task local storage for tracing programs

2021-02-25 Thread Martin KaFai Lau
On Thu, Feb 25, 2021 at 03:43:13PM -0800, Song Liu wrote:
> This set enables task local storage for non-BPF_LSM programs.
> 
> It is common for tracing BPF program to access per-task data. Currently,
> these data are stored in hash tables with pid as the key. In
> bcc/libbpftools [1], 9 out of 23 tools use such hash tables. However,
> hash tables are not ideal for many use cases. Task local storage provides
> better usability and performance for BPF programs. Please refer to 6/6 for
> some performance comparison of task local storage vs. hash table.
Thanks for the patches.

Acked-by: Martin KaFai Lau 


Re: [PATCH v6 bpf-next 2/6] bpf: prevent deadlock from recursive bpf_task_storage_[get|delete]

2021-02-25 Thread Martin KaFai Lau
On Thu, Feb 25, 2021 at 03:43:15PM -0800, Song Liu wrote:
> BPF helpers bpf_task_storage_[get|delete] could hold two locks:
> bpf_local_storage_map_bucket->lock and bpf_local_storage->lock. Calling
> these helpers from fentry/fexit programs on functions in bpf_*_storage.c
> may cause deadlock on either locks.
> 
> Prevent such deadlock with a per cpu counter, bpf_task_storage_busy. We
> need this counter to be global, because the two locks here belong to two
> different objects: bpf_local_storage_map and bpf_local_storage. If we
> pick one of them as the owner of the counter, it is still possible to
> trigger deadlock on the other lock. For example, if bpf_local_storage_map
> owns the counters, it cannot prevent deadlock on bpf_local_storage->lock
> when two maps are used.
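
[ A simplified sketch of the guard described above (close to, but
  not exactly, the patch code):

	static DEFINE_PER_CPU(int, bpf_task_storage_busy);

	static bool bpf_task_storage_trylock(void)
	{
		migrate_disable();
		if (unlikely(__this_cpu_inc_return(bpf_task_storage_busy) != 1)) {
			this_cpu_dec(bpf_task_storage_busy);
			migrate_enable();
			return false;	/* recursion: the helper bails out */
		}
		return true;
	}
]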
Acked-by: Martin KaFai Lau 


Re: [PATCH v5 bpf-next 1/6] bpf: enable task local storage for tracing programs

2021-02-25 Thread Martin KaFai Lau
On Tue, Feb 23, 2021 at 02:28:40PM -0800, Song Liu wrote:
> To access per-task data, BPF programs usually create a hash table with
> pid as the key. This is not ideal because:
>  1. The user needs to estimate the proper size of the hash table, which may
> be inaccurate;
>  2. Big hash tables are slow;
>  3. To clean up the data properly during task terminations, the user needs
> to write extra logic.
> 
> Task local storage overcomes these issues and offers a better option for
> these per-task data. Task local storage is only available to BPF_LSM. Now
> enable it for tracing programs.
> 
> Unlike LSM programs, tracing programs can be called in IRQ contexts.
> Helpers that access task local storage are updated to use
> raw_spin_lock_irqsave() instead of raw_spin_lock_bh().
> 
> Tracing programs can attach to functions on the task free path, e.g.
> exit_creds(). To avoid allocating task local storage after
> bpf_task_storage_free(), bpf_task_storage_get() is updated to not allocate
> new storage when the task is not refcounted (task->usage == 0).
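
[ A minimal sketch of the usage pattern this enables for tracing
  progs; map and prog names are illustrative:

	struct {
		__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__type(key, int);
		__type(value, __u64);
	} start SEC(".maps");

	SEC("tp_btf/sched_wakeup")
	int BPF_PROG(handle__sched_wakeup, struct task_struct *p)
	{
		__u64 *ts;

		/* no sizing or explicit cleanup needed, unlike a hashtab */
		ts = bpf_task_storage_get(&start, p, 0,
					  BPF_LOCAL_STORAGE_GET_F_CREATE);
		if (ts)
			*ts = bpf_ktime_get_ns();
		return 0;
	}
]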
Acked-by: Martin KaFai Lau 


Re: [PATCH v4 bpf-next 6/6] bpf: runqslower: use task local storage

2021-02-23 Thread Martin KaFai Lau
On Mon, Feb 22, 2021 at 05:20:14PM -0800, Song Liu wrote:
> Replace hashtab with task local storage in runqslower. This improves the
> performance of these BPF programs. The following table summarizes average
> runtime of these programs, in nanoseconds:
> 
>   task-local   hash-prealloc   hash-no-prealloc
> handle__sched_wakeup 125 340   3124
> handle__sched_wakeup_new28121510   2998
> handle__sched_switch 151 208991
Nice!  The required code change is also minimal.


Re: [PATCH v4 bpf-next 5/6] bpf: runqslower: prefer using local vmlimux to generate vmlinux.h

2021-02-23 Thread Martin KaFai Lau
On Mon, Feb 22, 2021 at 05:20:13PM -0800, Song Liu wrote:
> Update the Makefile to prefer using $(O)/mvlinux, $(KBUILD_OUTPUT)/vmlinux
s/mvlinux/vmlinux/


Re: [PATCH v4 bpf-next 1/6] bpf: enable task local storage for tracing programs

2021-02-23 Thread Martin KaFai Lau
On Mon, Feb 22, 2021 at 05:20:09PM -0800, Song Liu wrote:
[ ... ]

> diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
> index e0da0258b732d..2034019966d44 100644
> --- a/kernel/bpf/bpf_task_storage.c
> +++ b/kernel/bpf/bpf_task_storage.c
> @@ -15,7 +15,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  
> @@ -24,12 +23,8 @@ DEFINE_BPF_STORAGE_CACHE(task_cache);
>  static struct bpf_local_storage __rcu **task_storage_ptr(void *owner)
>  {
>   struct task_struct *task = owner;
> - struct bpf_storage_blob *bsb;
>  
> - bsb = bpf_task(task);
> - if (!bsb)
> - return NULL;
task_storage_ptr() no longer returns NULL.  All "!task_storage_ptr(task)"
checks should be removed also, e.g. in bpf_task_storage_get
and bpf_pid_task_storage_update_elem.

> - return &bsb->storage;
> + return &task->bpf_storage;
>  }
>  


Re: [PATCH v4 bpf-next] Add CONFIG_DEBUG_INFO_BTF and CONFIG_DEBUG_INFO_BTF_MODULES check to bpftool feature command

2021-02-22 Thread Martin KaFai Lau
On Mon, Feb 22, 2021 at 07:58:46PM +, grantseltzer wrote:
> This adds both the CONFIG_DEBUG_INFO_BTF and CONFIG_DEBUG_INFO_BTF_MODULES
> kernel compile options to the output of the bpftool feature command.
> This is relevant for developers that want to account for data structure
> definition differences between kernels.
Acked-by: Martin KaFai Lau 

[ Acked-by and Reviewed-by can be carried over to
  the following revisions if the change is obvious.

  Also, it is useful to comment on what has
  changed between revisions.  There is no need
  to resend this patch just for this though. ]


Re: [PATCH v3 bpf-next] Add CONFIG_DEBUG_INFO_BTF check to bpftool feature command

2021-02-22 Thread Martin KaFai Lau
On Sat, Feb 20, 2021 at 05:13:07PM +, grantseltzer wrote:
> This adds the CONFIG_DEBUG_INFO_BTF kernel compile option to output of
> the bpftool feature command. This is relevant for developers that want
> to use libbpf to account for data structure definition differences
> between kernels.
Acked-by: Martin KaFai Lau 


[PATCH v2 bpf 2/2] bpf: selftests: Add non function pointer test to struct_ops

2021-02-11 Thread Martin KaFai Lau
This patch adds a "void *owner" member.  The existing
bpf_tcp_ca test will ensure the bpf_cubic.o and bpf_dctcp.o
can be loaded.

Acked-by: Andrii Nakryiko 
Signed-off-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/bpf_tcp_helpers.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h 
b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
index 6a9053162cf2..91f0fac632f4 100644
--- a/tools/testing/selftests/bpf/bpf_tcp_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
@@ -177,6 +177,7 @@ struct tcp_congestion_ops {
 * after all the ca_state processing. (optional)
 */
void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
+   void *owner;
 };
 
 #define min(a, b) ((a) < (b) ? (a) : (b))
-- 
2.24.1



[PATCH v2 bpf 1/2] libbpf: Ignore non function pointer member in struct_ops

2021-02-11 Thread Martin KaFai Lau
When libbpf initializes the kernel's struct_ops in
"bpf_map__init_kern_struct_ops()", it enforces all
pointer types must be a function pointer and rejects
others.  It turns out to be too strict.  For example,
when directly using "struct tcp_congestion_ops" from vmlinux.h,
it has a "struct module *owner" member and it is set to NULL
in a bpf_tcp_cc.o.

Instead, it only needs to ensure the member is a function
pointer if it has been set (relocated) to a bpf-prog.
This patch moves the "btf_is_func_proto(kern_mtype)" check
after the existing "if (!prog) { continue; }".  The original debug
message in "if (!prog) { continue; }" is also removed since it is
no longer valid.  Besides, there is a later debug message to tell
which function pointer is set.

The "btf_is_func_proto(mtype)" has already been guaranteed
in "bpf_object__collect_st_ops_relos()" which has been run
before "bpf_map__init_kern_struct_ops()".  Thus, this check
is removed.
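
For illustration (hypothetical prog and struct_ops names), the case
this fixes looks like:

	SEC(".struct_ops")
	struct tcp_congestion_ops bpf_cc = {
		.init	= (void *)bpf_cc_init,	/* func ptr: relocated to a bpf prog */
		.owner	= NULL,			/* non-func ptr: now simply skipped */
		.name	= "bpf_cc",
	};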

v2:
- Remove outdated debug message (Andrii)
  Remove because there is a later debug message to tell
  which function pointer is set.
- Following mtype->type is no longer needed. Remove:
  "skip_mods_and_typedefs(btf, mtype->type, &mtype_id)"
- Do "if (!prog)" test before skip_mods_and_typedefs.

Fixes: 590a00888250 ("bpf: libbpf: Add STRUCT_OPS support")
Acked-by: Andrii Nakryiko 
Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/libbpf.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 6ae748f6ea11..a0d4fc4de402 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -883,24 +883,24 @@ static int bpf_map__init_kern_struct_ops(struct bpf_map 
*map,
if (btf_is_ptr(mtype)) {
struct bpf_program *prog;
 
-   mtype = skip_mods_and_typedefs(btf, mtype->type, 
&mtype_id);
+   prog = st_ops->progs[i];
+   if (!prog)
+   continue;
+
kern_mtype = skip_mods_and_typedefs(kern_btf,
kern_mtype->type,
&kern_mtype_id);
-   if (!btf_is_func_proto(mtype) ||
-   !btf_is_func_proto(kern_mtype)) {
-   pr_warn("struct_ops init_kern %s: non func ptr 
%s is not supported\n",
+
+   /* mtype->type must be a func_proto which was
+* guaranteed in bpf_object__collect_st_ops_relos(),
+* so only check kern_mtype for func_proto here.
+*/
+   if (!btf_is_func_proto(kern_mtype)) {
+   pr_warn("struct_ops init_kern %s: kernel member 
%s is not a func ptr\n",
map->name, mname);
return -ENOTSUP;
}
 
-   prog = st_ops->progs[i];
-   if (!prog) {
-   pr_debug("struct_ops init_kern %s: func ptr %s 
is not set\n",
-map->name, mname);
-   continue;
-   }
-
prog->attach_btf_id = kern_type_id;
prog->expected_attach_type = kern_member_idx;
 
-- 
2.24.1



Re: [PATCH bpf 2/2] bpf: selftests: Add non function pointer test to struct_ops

2021-02-10 Thread Martin KaFai Lau
On Wed, Feb 10, 2021 at 06:07:04PM -0800, Andrii Nakryiko wrote:
> On Wed, Feb 10, 2021 at 5:55 PM Martin KaFai Lau  wrote:
> >
> > On Wed, Feb 10, 2021 at 02:54:40PM -0800, Andrii Nakryiko wrote:
> > > On Wed, Feb 10, 2021 at 1:17 PM Martin KaFai Lau  wrote:
> > > >
> > > > On Wed, Feb 10, 2021 at 12:27:38PM -0800, Andrii Nakryiko wrote:
> > > > > On Tue, Feb 9, 2021 at 12:11 PM Martin KaFai Lau  wrote:
> > > > > >
> > > > > > This patch adds a "void *owner" member.  The existing
> > > > > > bpf_tcp_ca test will ensure the bpf_cubic.o and bpf_dctcp.o
> > > > > > can be loaded.
> > > > > >
> > > > > > Signed-off-by: Martin KaFai Lau 
> > > > > > ---
> > > > >
> > > > > Acked-by: Andrii Nakryiko 
> > > > >
> > > > > What will happen if BPF code initializes such non-func ptr member?
> > > > > Will libbpf complain or just ignore those values? Ignoring initialized
> > > > > members isn't great.
> > > > The latter. libbpf will ignore non-func ptr member.  The non-func ptr
> > > > member stays zero when it is passed to the kernel.
> > > >
> > > > libbpf can be changed to copy this non-func ptr value.
> > > > The kernel will decide what to do with it.  It will
> > > > then be consistent with int/array member like ".name"
> > > > and ".flags" where the kernel will verify the value.
> > > > I can spin v2 to do that.
> > >
> > > I was thinking about erroring out on non-zero fields, but if you think
> > > it's useful to pass through values, it could be done, but will require
> > > more and careful code, probably. So, basically, don't feel obligated
> > > to do this in this patch set.
> > You meant it needs different handling in copying ptr value
> > than copying int/char[]?
> 
> Hm.. If we are talking about copying pointer values, then I don't see
> how you can provide a valid kernel pointer from the BPF program?...
I am thinking the kernel is already rejecting members that are supposed
to be zero (e.g. non func ptr here), so there is no need to add code
to libbpf to do this again.

> But if we are talking about copying field values in general, then
> you'll need to handle enums, struct/union, etc, no? If int/char[] is
> supported (I probably missed that it is), that might be the only
> things you'd need to support. So for non function pointers, I'd just
> enforce zeroes.
Sure, we can reject everything else that is non-zero in libbpf.
I think we can use a different patch set for that?


Re: [PATCH bpf 2/2] bpf: selftests: Add non function pointer test to struct_ops

2021-02-10 Thread Martin KaFai Lau
On Wed, Feb 10, 2021 at 02:54:40PM -0800, Andrii Nakryiko wrote:
> On Wed, Feb 10, 2021 at 1:17 PM Martin KaFai Lau  wrote:
> >
> > On Wed, Feb 10, 2021 at 12:27:38PM -0800, Andrii Nakryiko wrote:
> > > On Tue, Feb 9, 2021 at 12:11 PM Martin KaFai Lau  wrote:
> > > >
> > > > This patch adds a "void *owner" member.  The existing
> > > > bpf_tcp_ca test will ensure the bpf_cubic.o and bpf_dctcp.o
> > > > can be loaded.
> > > >
> > > > Signed-off-by: Martin KaFai Lau 
> > > > ---
> > >
> > > Acked-by: Andrii Nakryiko 
> > >
> > > What will happen if BPF code initializes such non-func ptr member?
> > > Will libbpf complain or just ignore those values? Ignoring initialized
> > > members isn't great.
> > The latter. libbpf will ignore non-func ptr member.  The non-func ptr
> > member stays zero when it is passed to the kernel.
> >
> > libbpf can be changed to copy this non-func ptr value.
> > The kernel will decide what to do with it.  It will
> > then be consistent with int/array member like ".name"
> > and ".flags" where the kernel will verify the value.
> > I can spin v2 to do that.
> 
> I was thinking about erroring out on non-zero fields, but if you think
> it's useful to pass through values, it could be done, but will require
> more and careful code, probably. So, basically, don't feel obligated
> to do this in this patch set.
You meant it needs different handling in copying ptr value
than copying int/char[]?


Re: [PATCH bpf 1/2] libbpf: Ignore non function pointer member in struct_ops

2021-02-10 Thread Martin KaFai Lau
On Wed, Feb 10, 2021 at 12:26:20PM -0800, Andrii Nakryiko wrote:
> On Tue, Feb 9, 2021 at 12:40 PM Martin KaFai Lau  wrote:
> >
> > When libbpf initializes the kernel's struct_ops in
> > "bpf_map__init_kern_struct_ops()", it enforces all
> > pointer types must be a function pointer and rejects
> > others.  It turns out to be too strict.  For example,
> > when directly using "struct tcp_congestion_ops" from vmlinux.h,
> > it has a "struct module *owner" member and it is set to NULL
> > in a bpf_tcp_cc.o.
> >
> > Instead, it only needs to ensure the member is a function
> > pointer if it has been set (relocated) to a bpf-prog.
> > This patch moves the "btf_is_func_proto(kern_mtype)" check
> > after the existing "if (!prog) { continue; }".
> >
> > The "btf_is_func_proto(mtype)" has already been guaranteed
> > in "bpf_object__collect_st_ops_relos()" which has been run
> > before "bpf_map__init_kern_struct_ops()".  Thus, this check
> > is removed.
> >
> > Fixes: 590a00888250 ("bpf: libbpf: Add STRUCT_OPS support")
> > Signed-off-by: Martin KaFai Lau 
> > ---
> 
> Looks good, see nit below.
> 
> Acked-by: Andrii Nakryiko 
> 
> >  tools/lib/bpf/libbpf.c | 12 ++--
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 6ae748f6ea11..b483608ea72a 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -887,12 +887,6 @@ static int bpf_map__init_kern_struct_ops(struct 
> > bpf_map *map,
> > kern_mtype = skip_mods_and_typedefs(kern_btf,
> > 
> > kern_mtype->type,
> > &kern_mtype_id);
> > -   if (!btf_is_func_proto(mtype) ||
> > -   !btf_is_func_proto(kern_mtype)) {
> > -   pr_warn("struct_ops init_kern %s: non func 
> > ptr %s is not supported\n",
> > -   map->name, mname);
> > -   return -ENOTSUP;
> > -   }
> >
> > prog = st_ops->progs[i];
> > if (!prog) {
> 
> debug message below this line is a bit misleading, it talks about
> "func ptr is not set", but it actually could be any kind of field. So
> it would be nice to just talk about "members" or "fields" there, no?
Good catch.  The debug message needs to change.


Re: [PATCH bpf 2/2] bpf: selftests: Add non function pointer test to struct_ops

2021-02-10 Thread Martin KaFai Lau
On Wed, Feb 10, 2021 at 12:27:38PM -0800, Andrii Nakryiko wrote:
> On Tue, Feb 9, 2021 at 12:11 PM Martin KaFai Lau  wrote:
> >
> > This patch adds a "void *owner" member.  The existing
> > bpf_tcp_ca test will ensure the bpf_cubic.o and bpf_dctcp.o
> > can be loaded.
> >
> > Signed-off-by: Martin KaFai Lau 
> > ---
> 
> Acked-by: Andrii Nakryiko 
> 
> What will happen if BPF code initializes such non-func ptr member?
> Will libbpf complain or just ignore those values? Ignoring initialized
> members isn't great.
The latter. libbpf will ignore non-func ptr member.  The non-func ptr
member stays zero when it is passed to the kernel.

libbpf can be changed to copy this non-func ptr value.
The kernel will decide what to do with it.  It will
then be consistent with int/array member like ".name"
and ".flags" where the kernel will verify the value.
I can spin v2 to do that.

> 
> >  tools/testing/selftests/bpf/bpf_tcp_helpers.h | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > index 6a9053162cf2..91f0fac632f4 100644
> > --- a/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > +++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
> > @@ -177,6 +177,7 @@ struct tcp_congestion_ops {
> >  * after all the ca_state processing. (optional)
> >  */
> > void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
> > +   void *owner;
> >  };
> >
> >  #define min(a, b) ((a) < (b) ? (a) : (b))
> > --
> > 2.24.1
> >


Re: [PATCH] bpf_lru_list: Read double-checked variable once without lock

2021-02-09 Thread Martin KaFai Lau
On Tue, Feb 09, 2021 at 12:27:01PM +0100, Marco Elver wrote:
> For double-checked locking in bpf_common_lru_push_free(), node->type is
> read outside the critical section and then re-checked under the lock.
> However, concurrent writes to node->type result in data races.
> 
> For example, the following concurrent access was observed by KCSAN:
> 
>   write to 0x88801521bc22 of 1 bytes by task 10038 on cpu 1:
>    __bpf_lru_node_move_in kernel/bpf/bpf_lru_list.c:91
>    __local_list_flush kernel/bpf/bpf_lru_list.c:298
>...
>   read to 0x88801521bc22 of 1 bytes by task 10043 on cpu 0:
>bpf_common_lru_push_free  kernel/bpf/bpf_lru_list.c:507
>bpf_lru_push_free kernel/bpf/bpf_lru_list.c:555
>...
> 
> Fix the data races where node->type is read outside the critical section
> (for double-checked locking) by marking the access with READ_ONCE() as
> well as ensuring the variable is only accessed once.
> 
> Reported-by: syzbot+3536db46dfa58c573...@syzkaller.appspotmail.com
> Reported-by: syzbot+516acdb03d3e27d91...@syzkaller.appspotmail.com
> Signed-off-by: Marco Elver 
> ---
> Detailed reports:
>   
> https://groups.google.com/g/syzkaller-upstream-moderation/c/PwsoQ7bfi8k/m/NH9Ni2WxAQAJ
>  
>   
> https://groups.google.com/g/syzkaller-upstream-moderation/c/-fXQO9ehxSM/m/RmQEcI2oAQAJ
>  
> ---
>  kernel/bpf/bpf_lru_list.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
> index 1b6b9349cb85..d99e89f113c4 100644
> --- a/kernel/bpf/bpf_lru_list.c
> +++ b/kernel/bpf/bpf_lru_list.c
> @@ -502,13 +502,14 @@ struct bpf_lru_node *bpf_lru_pop_free(struct bpf_lru *lru, u32 hash)
>  static void bpf_common_lru_push_free(struct bpf_lru *lru,
>struct bpf_lru_node *node)
>  {
> + u8 node_type = READ_ONCE(node->type);
>   unsigned long flags;
>  
> - if (WARN_ON_ONCE(node->type == BPF_LRU_LIST_T_FREE) ||
> - WARN_ON_ONCE(node->type == BPF_LRU_LOCAL_LIST_T_FREE))
> + if (WARN_ON_ONCE(node_type == BPF_LRU_LIST_T_FREE) ||
> + WARN_ON_ONCE(node_type == BPF_LRU_LOCAL_LIST_T_FREE))
>   return;
>  
> - if (node->type == BPF_LRU_LOCAL_LIST_T_PENDING) {
> + if (node_type == BPF_LRU_LOCAL_LIST_T_PENDING) {
I think this can be bpf-next.

Acked-by: Martin KaFai Lau 
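
(For reference, the general shape of the double-checked pattern being fixed is
sketched below; obj, state, the lock and the list names are illustrative, not
the actual bpf_lru_list code:)

	/* Lockless fast path: read the racy field exactly once with
	 * READ_ONCE() so the compiler cannot reload it, then re-check
	 * under the lock before acting on it.
	 */
	u8 state = READ_ONCE(obj->state);
	unsigned long flags;

	if (state != STATE_PENDING)
		return;				/* fast path, no lock taken */

	raw_spin_lock_irqsave(&obj->lock, flags);
	if (obj->state == STATE_PENDING) {	/* re-check under the lock */
		obj->state = STATE_FREE;	/* writes happen under the lock */
		list_move(&obj->node, &free_list);
	}
	raw_spin_unlock_irqrestore(&obj->lock, flags);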


[PATCH bpf 1/2] libbpf: Ignore non function pointer member in struct_ops

2021-02-09 Thread Martin KaFai Lau
When libbpf initializes the kernel's struct_ops in
"bpf_map__init_kern_struct_ops()", it enforces all
pointer types must be a function pointer and rejects
others.  It turns out to be too strict.  For example,
when directly using "struct tcp_congestion_ops" from vmlinux.h,
it has a "struct module *owner" member and it is set to NULL
in a bpf_tcp_cc.o.

Instead, it only needs to ensure the member is a function
pointer if it has been set (relocated) to a bpf-prog.
This patch moves the "btf_is_func_proto(kern_mtype)" check
after the existing "if (!prog) { continue; }".

The "btf_is_func_proto(mtype)" has already been guaranteed
in "bpf_object__collect_st_ops_relos()" which has been run
before "bpf_map__init_kern_struct_ops()".  Thus, this check
is removed.

Fixes: 590a00888250 ("bpf: libbpf: Add STRUCT_OPS support")
Signed-off-by: Martin KaFai Lau 
---
 tools/lib/bpf/libbpf.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 6ae748f6ea11..b483608ea72a 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -887,12 +887,6 @@ static int bpf_map__init_kern_struct_ops(struct bpf_map *map,
 		kern_mtype = skip_mods_and_typedefs(kern_btf,
 						    kern_mtype->type,
 						    &kern_mtype_id);
-		if (!btf_is_func_proto(mtype) ||
-		    !btf_is_func_proto(kern_mtype)) {
-			pr_warn("struct_ops init_kern %s: non func ptr %s is not supported\n",
-				map->name, mname);
-			return -ENOTSUP;
-		}
 
 		prog = st_ops->progs[i];
 		if (!prog) {
@@ -901,6 +895,12 @@ static int bpf_map__init_kern_struct_ops(struct bpf_map *map,
 			continue;
 		}
 
+		if (!btf_is_func_proto(kern_mtype)) {
+			pr_warn("struct_ops init_kern %s: kernel member %s is not a func ptr\n",
+				map->name, mname);
+			return -ENOTSUP;
+		}
+
 		prog->attach_btf_id = kern_type_id;
 		prog->expected_attach_type = kern_member_idx;
 
-- 
2.24.1



[PATCH bpf 2/2] bpf: selftests: Add non function pointer test to struct_ops

2021-02-09 Thread Martin KaFai Lau
This patch adds a "void *owner" member.  The existing
bpf_tcp_ca test will ensure the bpf_cubic.o and bpf_dctcp.o
can be loaded.

Signed-off-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/bpf_tcp_helpers.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
index 6a9053162cf2..91f0fac632f4 100644
--- a/tools/testing/selftests/bpf/bpf_tcp_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
@@ -177,6 +177,7 @@ struct tcp_congestion_ops {
 * after all the ca_state processing. (optional)
 */
void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
+   void *owner;
 };
 
 #define min(a, b) ((a) < (b) ? (a) : (b))
-- 
2.24.1



Re: [PATCH bpf-next v3 2/2] selftests/bpf: verify that rebinding to port < 1024 from BPF works

2021-01-27 Thread Martin KaFai Lau
On Tue, Jan 26, 2021 at 08:51:04AM -0800, Stanislav Fomichev wrote:
> Return 3 to indicate that permission check for port 111
> should be skipped.
> 

[ ... ]

> +void cap_net_bind_service(cap_flag_value_t flag)
> +{
> + const cap_value_t cap_net_bind_service = CAP_NET_BIND_SERVICE;
> + cap_t caps;
> +
> + caps = cap_get_proc();
> + if (CHECK(!caps, "cap_get_proc", "errno %d", errno))
> + goto free_caps;
> +
> + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> +flag),
> +   "cap_set_flag", "errno %d", errno))
> + goto free_caps;
> +
> + if (CHECK(cap_set_proc(caps), "cap_set_proc", "errno %d", errno))
> + goto free_caps;
> +
> +free_caps:
> + if (CHECK(cap_free(caps), "cap_free", "errno %d", errno))
> + goto free_caps;
As also mentioned in v2, there is a loop here (the goto jumps back to free_caps).

> +}
> +
> +void test_bind_perm(void)
> +{
> + struct bind_perm *skel;
> + int cgroup_fd;
> +
> + cgroup_fd = test__join_cgroup("/bind_perm");
> + if (CHECK(cgroup_fd < 0, "cg-join", "errno %d", errno))
> + return;
> +
> + skel = bind_perm__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "skel"))
> + goto close_cgroup_fd;
> +
> + skel->links.bind_v4_prog = bpf_program__attach_cgroup(skel->progs.bind_v4_prog, cgroup_fd);
> + if (!ASSERT_OK_PTR(skel, "bind_v4_prog"))
> + goto close_skeleton;
> +
> + cap_net_bind_service(CAP_CLEAR);
> + try_bind(110, EACCES);
> + try_bind(111, 0);
> + cap_net_bind_service(CAP_SET);
Instead of always doing CAP_SET at the end of the test,
it is better to use cap_get_flag() to save the original value
at the beginning of the test and restore it at the end,
as sketched below.
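
(For illustration, a save/restore version could look like this sketch, using
the libcap cap_get_flag() API; the helper name and error handling are
illustrative:)

	static cap_flag_value_t saved_bind_cap;

	/* Save the current CAP_NET_BIND_SERVICE effective flag so the
	 * test can restore the process to its original state at the end.
	 */
	static int save_net_bind_service(void)
	{
		const cap_value_t v = CAP_NET_BIND_SERVICE;
		cap_t caps = cap_get_proc();
		int err;

		if (!caps)
			return -1;
		err = cap_get_flag(caps, v, CAP_EFFECTIVE, &saved_bind_cap);
		cap_free(caps);
		return err;
	}

The test would then call save_net_bind_service() first and finish with
cap_net_bind_service(saved_bind_cap) instead of an unconditional CAP_SET.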


Re: [PATCH bpf-next v3 1/2] bpf: allow rewriting to ports under ip_unprivileged_port_start

2021-01-27 Thread Martin KaFai Lau
On Tue, Jan 26, 2021 at 08:51:03AM -0800, Stanislav Fomichev wrote:
> At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port
> to the privileged ones (< ip_unprivileged_port_start), but it will
> be rejected later on in the __inet_bind or __inet6_bind.
> 
> Let's add another return value to indicate that CAP_NET_BIND_SERVICE
> check should be ignored. Use the same idea as we currently use
> in cgroup/egress where bit #1 indicates CN. Instead, for
> cgroup/bind{4,6}, bit #1 indicates that CAP_NET_BIND_SERVICE should
> be bypassed.
> 
> v3:
> - Update description (Martin KaFai Lau)
> - Fix capability restore in selftest (Martin KaFai Lau)
> 
> v2:
> - Switch to explicit return code (Martin KaFai Lau)
> 

[ ... ]

> @@ -499,7 +501,8 @@ int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
>  
>   snum = ntohs(addr->sin_port);
>   err = -EACCES;
> - if (snum && inet_port_requires_bind_service(net, snum) &&
> + if (!(flags & BIND_NO_CAP_NET_BIND_SERVICE) &&
> + snum && inet_port_requires_bind_service(net, snum) &&
The same change needs to be done in __inet6_bind(),
and a test for IPv6 should also be added in patch 2.

>   !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
>   goto out;
>  
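
(I.e., the corresponding hunk in __inet6_bind() would presumably look
something like this sketch, mirroring the IPv4 change above; the exact
context lines may differ:)

	/* net/ipv6/af_inet6.c, __inet6_bind(): skip the privileged-port
	 * check when the bpf prog asked to bypass CAP_NET_BIND_SERVICE.
	 */
	snum = ntohs(addr->sin6_port);
	err = -EACCES;
	if (!(flags & BIND_NO_CAP_NET_BIND_SERVICE) &&
	    snum && inet_port_requires_bind_service(net, snum) &&
	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
		goto out;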


Re: [PATCH bpf-next v4 2/2] selftests/bpf: verify that rebinding to port < 1024 from BPF works

2021-01-26 Thread Martin KaFai Lau
On Tue, Jan 26, 2021 at 11:35:44AM -0800, Stanislav Fomichev wrote:
> Return 3 to indicate that permission check for port 111
> should be skipped.
Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next v4 1/2] bpf: allow rewriting to ports under ip_unprivileged_port_start

2021-01-26 Thread Martin KaFai Lau
On Tue, Jan 26, 2021 at 11:35:43AM -0800, Stanislav Fomichev wrote:
> At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port
> to the privileged ones (< ip_unprivileged_port_start), but it will
> be rejected later on in the __inet_bind or __inet6_bind.
> 
> Let's add another return value to indicate that CAP_NET_BIND_SERVICE
> check should be ignored. Use the same idea as we currently use
> in cgroup/egress where bit #1 indicates CN. Instead, for
> cgroup/bind{4,6}, bit #1 indicates that CAP_NET_BIND_SERVICE should
> be bypassed.
> 
> v4:
> - Add missing IPv6 support (Martin KaFai Lau)
> 
> v3:
> - Update description (Martin KaFai Lau)
> - Fix capability restore in selftest (Martin KaFai Lau)
> 
> v2:
> - Switch to explicit return code (Martin KaFai Lau)
Reviewed-by: Martin KaFai Lau 
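
For context, the caller side surfaces the extra return bit roughly like the
sketch below (bit #1 and the BIND_NO_CAP_NET_BIND_SERVICE name follow the
patch series; the exact macro plumbing around
__cgroup_bpf_run_filter_sock_addr() is abbreviated and partly hypothetical):

	/* Run the cgroup/bind4 programs, collect the high bits of the
	 * return value, and translate bit #1 into a bind flag that
	 * __inet_bind()/__inet6_bind() can check later.
	 */
	u32 prog_flags = 0;
	int err;

	err = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, BPF_CGROUP_INET4_BIND,
						NULL, &prog_flags);
	if (err)
		return err;
	if (prog_flags & (1 << 1))	/* bit #1: skip CAP_NET_BIND_SERVICE */
		bind_flags |= BIND_NO_CAP_NET_BIND_SERVICE;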


Re: [PATCH bpf-next v2 1/2] bpf: allow rewriting to ports under ip_unprivileged_port_start

2021-01-25 Thread Martin KaFai Lau
On Mon, Jan 25, 2021 at 03:32:53PM -0800, Stanislav Fomichev wrote:
> On Mon, Jan 25, 2021 at 3:25 PM Martin KaFai Lau  wrote:
> >
> > On Mon, Jan 25, 2021 at 09:26:40AM -0800, Stanislav Fomichev wrote:
> > > At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port
> > > to the privileged ones (< ip_unprivileged_port_start), but it will
> > > be rejected later on in the __inet_bind or __inet6_bind.
> > >
> > > Let's export 'port_changed' event from the BPF program and bypass
> > > ip_unprivileged_port_start range check when we've seen that
> > > the program explicitly overrode the port. This is accomplished
> > > by generating instructions to set ctx->port_changed along with
> > > updating ctx->user_port.
> > The description requires an update.
> Ah, sure, will update it.
> 
> > [ ... ]
> >
> > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > index da649f20d6b2..cdf3c7e611d9 100644
> > > --- a/kernel/bpf/cgroup.c
> > > +++ b/kernel/bpf/cgroup.c
> > > @@ -1055,6 +1055,8 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
> > >   * @uaddr: sockaddr struct provided by user
> > >   * @type: The type of program to be executed
> > >   * @t_ctx: Pointer to attach type specific context
> > > + * @flags: Pointer to u32 which contains higher bits of BPF program
> > > + * return value (OR'ed together).
> > >   *
> > >   * socket is expected to be of type INET or INET6.
> > >   *
> > > @@ -1064,7 +1066,8 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
> > >  int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
> > > struct sockaddr *uaddr,
> > > enum bpf_attach_type type,
> > > -   void *t_ctx)
> > > +   void *t_ctx,
> > > +   u32 *flags)
> > >  {
> > >   struct bpf_sock_addr_kern ctx = {
> > >   .sk = sk,
> > > @@ -1087,7 +1090,8 @@ int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
> > >   }
> > >
> > >   cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > - ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
> > > + ret = BPF_PROG_RUN_ARRAY_FLAGS(cgrp->bpf.effective[type], &ctx,
> > > +                                BPF_PROG_RUN, flags);
> > >
> > >   return ret == 1 ? 0 : -EPERM;
> > >  }
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index d0eae51b31e4..ef7c3ca53214 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -7986,6 +7986,11 @@ static int check_return_code(struct bpf_verifier_env *env)
> > >           env->prog->expected_attach_type == BPF_CGROUP_INET4_GETSOCKNAME ||
> > >           env->prog->expected_attach_type == BPF_CGROUP_INET6_GETSOCKNAME)
> > >                   range = tnum_range(1, 1);
> > > + if (env->prog->expected_attach_type == BPF_CGROUP_INET4_BIND ||
> > > +     env->prog->expected_attach_type == BPF_CGROUP_INET6_BIND) {
> > > +         range = tnum_range(0, 3);
> > > +         enforce_attach_type_range = tnum_range(0, 3);
> > It should be:
> > enforce_attach_type_range = tnum_range(2, 3);
> Hm, weren't we enforcing attach_type for bind progs from the beginning?
Ah, right.  Then there is no need to set enforce_attach_type_range at all.
"enforce_attach_type_range = tnum_range(0, 3);" can be removed.

> Also, looking at bpf_prog_attach_check_attach_type, it seems that we
> care only about BPF_PROG_TYPE_CGROUP_SKB for
> prog->enforce_expected_attach_type.
> Am I missing something?
It is because, from the very beginning, BPF_PROG_TYPE_CGROUP_SKB did not
enforce the attach_type in bpf_prog_attach_check_attach_type().
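
So the resulting check_return_code() hunk would presumably reduce to
something like this sketch, with the redundant enforce_attach_type_range
assignment dropped:

	/* bind hooks may return 0..3; no enforce_attach_type_range is
	 * needed since the attach_type is already enforced at attach
	 * time for these hooks.
	 */
	if (env->prog->expected_attach_type == BPF_CGROUP_INET4_BIND ||
	    env->prog->expected_attach_type == BPF_CGROUP_INET6_BIND)
		range = tnum_range(0, 3);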


Re: [PATCH bpf-next v2 2/2] selftests/bpf: verify that rebinding to port < 1024 from BPF works

2021-01-25 Thread Martin KaFai Lau
On Mon, Jan 25, 2021 at 09:26:41AM -0800, Stanislav Fomichev wrote:
> BPF rewrites from 111 to 111, but it still should mark the port as
> "changed".
> We also verify that if port isn't touched by BPF, it's still prohibited.
The description requires an update.

> 
> Cc: Andrey Ignatov 
> Cc: Martin KaFai Lau 
> Signed-off-by: Stanislav Fomichev 
> ---
>  .../selftests/bpf/prog_tests/bind_perm.c  | 85 +++
>  tools/testing/selftests/bpf/progs/bind_perm.c | 36 
>  2 files changed, 121 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/bind_perm.c
>  create mode 100644 tools/testing/selftests/bpf/progs/bind_perm.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/bind_perm.c 
> b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> new file mode 100644
> index ..61307d4494bf
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> @@ -0,0 +1,85 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include 
> +#include "bind_perm.skel.h"
> +
> +#include 
> +#include 
> +#include 
> +
> +static int duration;
> +
> +void try_bind(int port, int expected_errno)
> +{
> + struct sockaddr_in sin = {};
> + int fd = -1;
> +
> + fd = socket(AF_INET, SOCK_STREAM, 0);
> + if (CHECK(fd < 0, "fd", "errno %d", errno))
> + goto close_socket;
> +
> + sin.sin_family = AF_INET;
> + sin.sin_port = htons(port);
> +
> + errno = 0;
> + bind(fd, (struct sockaddr *)&sin, sizeof(sin));
> + ASSERT_EQ(errno, expected_errno, "bind");
> +
> +close_socket:
> + if (fd >= 0)
> + close(fd);
> +}
> +
> +void cap_net_bind_service(cap_flag_value_t flag)
> +{
> + const cap_value_t cap_net_bind_service = CAP_NET_BIND_SERVICE;
> + cap_t caps;
> +
> + caps = cap_get_proc();
> + if (CHECK(!caps, "cap_get_proc", "errno %d", errno))
> + goto free_caps;
> +
> + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> +CAP_CLEAR),
> +   "cap_set_flag", "errno %d", errno))
> + goto free_caps;
> +
> + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> +CAP_CLEAR),
> +   "cap_set_flag", "errno %d", errno))
These two back-to-back cap_set_flag() calls look incorrect.
Also, the "cap_flag_value_t flag" argument is unused.

> + goto free_caps;
> +
> + if (CHECK(cap_set_proc(caps), "cap_set_proc", "errno %d", errno))
> + goto free_caps;
> +
> +free_caps:
> + if (CHECK(cap_free(caps), "cap_free", "errno %d", errno))
> + goto free_caps;
There is a loop here: on cap_free() failure this jumps back to free_caps
and never terminates. (A non-looping variant is sketched below.)
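
(For illustration, a cleanup label that cannot loop might look like this
sketch; the error reporting details are illustrative:)

	free_caps:
		/* Report the failure but fall through instead of jumping
		 * back to free_caps, which would spin forever whenever
		 * cap_free() keeps failing.
		 */
		CHECK(cap_free(caps), "cap_free", "errno %d", errno);
	}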

> +}
> +
> +void test_bind_perm(void)
> +{
> + struct bind_perm *skel;
> + int cgroup_fd;
> +
> + cgroup_fd = test__join_cgroup("/bind_perm");
> + if (CHECK(cgroup_fd < 0, "cg-join", "errno %d", errno))
> + return;
> +
> + skel = bind_perm__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "skel"))
> + goto close_cgroup_fd;
> +
> + skel->links.bind_v4_prog = bpf_program__attach_cgroup(skel->progs.bind_v4_prog, cgroup_fd);
> + if (!ASSERT_OK_PTR(skel, "bind_v4_prog"))
> + goto close_skeleton;
> +
> + cap_net_bind_service(CAP_CLEAR);
> + try_bind(110, EACCES);
> + try_bind(111, 0);
> + cap_net_bind_service(CAP_SET);
> +
> +close_skeleton:
> + bind_perm__destroy(skel);
> +close_cgroup_fd:
> + close(cgroup_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/bind_perm.c 
> b/tools/testing/selftests/bpf/progs/bind_perm.c
> new file mode 100644
> index ..31ae8d599796
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bind_perm.c
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +SEC("cgroup/bind4")
> +int bind_v4_prog(struct bpf_sock_addr *ctx)
> +{
> + struct bpf_sock *sk;
> + __u32 user_ip4;
> + __u16 user_port;
> +
> + sk = ctx->sk;
> + if (!sk)
> + return 0;
> +
> + if (sk->family != AF_INET)
> + return 0;
> +
> + if (ctx->type != SOCK_STREAM)
> + return 0;
> +
> + /* Rewriting to the same value should still cause
> +  * permission check to be bypassed.
> +  */
This comment is also outdated.

> + if (ctx->user_port == bpf_htons(111))
> + return 3;
> +
> + return 1;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> -- 
> 2.30.0.280.ga3ce27912f-goog
> 


Re: [PATCH bpf-next v2 1/2] bpf: allow rewriting to ports under ip_unprivileged_port_start

2021-01-25 Thread Martin KaFai Lau
On Mon, Jan 25, 2021 at 09:26:40AM -0800, Stanislav Fomichev wrote:
> At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port
> to the privileged ones (< ip_unprivileged_port_start), but it will
> be rejected later on in the __inet_bind or __inet6_bind.
> 
> Let's export 'port_changed' event from the BPF program and bypass
> ip_unprivileged_port_start range check when we've seen that
> the program explicitly overrode the port. This is accomplished
> by generating instructions to set ctx->port_changed along with
> updating ctx->user_port.
The description requires an update.

[ ... ]

> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index da649f20d6b2..cdf3c7e611d9 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -1055,6 +1055,8 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
>   * @uaddr: sockaddr struct provided by user
>   * @type: The type of program to be executed
>   * @t_ctx: Pointer to attach type specific context
> + * @flags: Pointer to u32 which contains higher bits of BPF program
> + * return value (OR'ed together).
>   *
>   * socket is expected to be of type INET or INET6.
>   *
> @@ -1064,7 +1066,8 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
>  int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
> struct sockaddr *uaddr,
> enum bpf_attach_type type,
> -   void *t_ctx)
> +   void *t_ctx,
> +   u32 *flags)
>  {
>   struct bpf_sock_addr_kern ctx = {
>   .sk = sk,
> @@ -1087,7 +1090,8 @@ int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
>   }
>  
>   cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> - ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
> + ret = BPF_PROG_RUN_ARRAY_FLAGS(cgrp->bpf.effective[type], &ctx,
> +BPF_PROG_RUN, flags);
>  
>   return ret == 1 ? 0 : -EPERM;
>  }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d0eae51b31e4..ef7c3ca53214 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7986,6 +7986,11 @@ static int check_return_code(struct bpf_verifier_env *env)
>           env->prog->expected_attach_type == BPF_CGROUP_INET4_GETSOCKNAME ||
>           env->prog->expected_attach_type == BPF_CGROUP_INET6_GETSOCKNAME)
>                   range = tnum_range(1, 1);
> + if (env->prog->expected_attach_type == BPF_CGROUP_INET4_BIND ||
> + env->prog->expected_attach_type == BPF_CGROUP_INET6_BIND) {
> + range = tnum_range(0, 3);
> + enforce_attach_type_range = tnum_range(0, 3);
It should be:
enforce_attach_type_range = tnum_range(2, 3);


Re: [PATCH bpf-next 2/2] selftests/bpf: verify that rebinding to port < 1024 from BPF works

2021-01-22 Thread Martin KaFai Lau
On Fri, Jan 22, 2021 at 08:16:40AM -0800, s...@google.com wrote:
> On 01/21, Martin KaFai Lau wrote:
> > On Thu, Jan 21, 2021 at 04:30:08PM -0800, s...@google.com wrote:
> > > On 01/21, Martin KaFai Lau wrote:
> > > > On Thu, Jan 21, 2021 at 02:57:44PM -0800, s...@google.com wrote:
> > > > > On 01/21, Martin KaFai Lau wrote:
> > > > > > On Wed, Jan 20, 2021 at 05:22:41PM -0800, Stanislav Fomichev wrote:
> > > > > > > BPF rewrites from 111 to 111, but it still should mark the port as
> > > > > > > "changed".
> > > > > > > We also verify that if port isn't touched by BPF, it's still prohibited.
> > > > > > >
> > > > > > > Signed-off-by: Stanislav Fomichev 
> > > > > > > ---
> > > > > > >  .../selftests/bpf/prog_tests/bind_perm.c  | 88 +++
> > > > > > >  tools/testing/selftests/bpf/progs/bind_perm.c | 36 
> > > > > > >  2 files changed, 124 insertions(+)
> > > > > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/bind_perm.c
> > > > > > >
> > > > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/bind_perm.c b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > > > > new file mode 100644
> > > > > > > index ..840a04ac9042
> > > > > > > --- /dev/null
> > > > > > > +++ b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > > > > @@ -0,0 +1,88 @@
> > > > > > > +// SPDX-License-Identifier: GPL-2.0
> > > > > > > +#include 
> > > > > > > +#include "bind_perm.skel.h"
> > > > > > > +
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +
> > > > > > > +static int duration;
> > > > > > > +
> > > > > > > +void try_bind(int port, int expected_errno)
> > > > > > > +{
> > > > > > > + struct sockaddr_in sin = {};
> > > > > > > + int fd = -1;
> > > > > > > +
> > > > > > > + fd = socket(AF_INET, SOCK_STREAM, 0);
> > > > > > > + if (CHECK(fd < 0, "fd", "errno %d", errno))
> > > > > > > + goto close_socket;
> > > > > > > +
> > > > > > > + sin.sin_family = AF_INET;
> > > > > > > + sin.sin_port = htons(port);
> > > > > > > +
> > > > > > > + errno = 0;
> > > > > > > + bind(fd, (struct sockaddr *)&sin, sizeof(sin));
> > > > > > > + CHECK(errno != expected_errno, "bind", "errno %d, expected %d",
> > > > > > > +   errno, expected_errno);
> > > > > > > +
> > > > > > > +close_socket:
> > > > > > > + if (fd >= 0)
> > > > > > > + close(fd);
> > > > > > > +}
> > > > > > > +
> > > > > > > +void cap_net_bind_service(cap_flag_value_t flag)
> > > > > > > +{
> > > > > > > + const cap_value_t cap_net_bind_service = CAP_NET_BIND_SERVICE;
> > > > > > > + cap_t caps;
> > > > > > > +
> > > > > > > + caps = cap_get_proc();
> > > > > > > + if (CHECK(!caps, "cap_get_proc", "errno %d", errno))
> > > > > > > + goto free_caps;
> > > > > > > +
> > > > > > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> > > > > > > +CAP_CLEAR),
> > > > > > > +   "cap_set_flag", "errno %d", errno))
> > > > > > > + goto free_caps;
> > > > > > > +
> > > > > > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> > > > > > > +CAP_CLEAR),

Re: [PATCH bpf-next 2/2] selftests/bpf: verify that rebinding to port < 1024 from BPF works

2021-01-21 Thread Martin KaFai Lau
On Thu, Jan 21, 2021 at 04:30:08PM -0800, s...@google.com wrote:
> On 01/21, Martin KaFai Lau wrote:
> > On Thu, Jan 21, 2021 at 02:57:44PM -0800, s...@google.com wrote:
> > > On 01/21, Martin KaFai Lau wrote:
> > > > On Wed, Jan 20, 2021 at 05:22:41PM -0800, Stanislav Fomichev wrote:
> > > > > BPF rewrites from 111 to 111, but it still should mark the port as
> > > > > "changed".
> > > > > We also verify that if port isn't touched by BPF, it's still prohibited.
> > > > >
> > > > > Signed-off-by: Stanislav Fomichev 
> > > > > ---
> > > > >  .../selftests/bpf/prog_tests/bind_perm.c  | 88 +++
> > > > >  tools/testing/selftests/bpf/progs/bind_perm.c | 36 
> > > > >  2 files changed, 124 insertions(+)
> > > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > >  create mode 100644 tools/testing/selftests/bpf/progs/bind_perm.c
> > > > >
> > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/bind_perm.c b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > > new file mode 100644
> > > > > index ..840a04ac9042
> > > > > --- /dev/null
> > > > > +++ b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > > > @@ -0,0 +1,88 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0
> > > > > +#include 
> > > > > +#include "bind_perm.skel.h"
> > > > > +
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +
> > > > > +static int duration;
> > > > > +
> > > > > +void try_bind(int port, int expected_errno)
> > > > > +{
> > > > > + struct sockaddr_in sin = {};
> > > > > + int fd = -1;
> > > > > +
> > > > > + fd = socket(AF_INET, SOCK_STREAM, 0);
> > > > > + if (CHECK(fd < 0, "fd", "errno %d", errno))
> > > > > + goto close_socket;
> > > > > +
> > > > > + sin.sin_family = AF_INET;
> > > > > + sin.sin_port = htons(port);
> > > > > +
> > > > > + errno = 0;
> > > > > + bind(fd, (struct sockaddr *)&sin, sizeof(sin));
> > > > > + CHECK(errno != expected_errno, "bind", "errno %d, expected %d",
> > > > > +   errno, expected_errno);
> > > > > +
> > > > > +close_socket:
> > > > > + if (fd >= 0)
> > > > > + close(fd);
> > > > > +}
> > > > > +
> > > > > +void cap_net_bind_service(cap_flag_value_t flag)
> > > > > +{
> > > > > + const cap_value_t cap_net_bind_service = CAP_NET_BIND_SERVICE;
> > > > > + cap_t caps;
> > > > > +
> > > > > + caps = cap_get_proc();
> > > > > + if (CHECK(!caps, "cap_get_proc", "errno %d", errno))
> > > > > + goto free_caps;
> > > > > +
> > > > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> > > > > +CAP_CLEAR),
> > > > > +   "cap_set_flag", "errno %d", errno))
> > > > > + goto free_caps;
> > > > > +
> > > > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> > > > > +CAP_CLEAR),
> > > > > +   "cap_set_flag", "errno %d", errno))
> > > > > + goto free_caps;
> > > > > +
> > > > > + if (CHECK(cap_set_proc(caps), "cap_set_proc", "errno %d", errno))
> > > > > + goto free_caps;
> > > > > +
> > > > > +free_caps:
> > > > > + if (CHECK(cap_free(caps), "cap_free", "errno %d", errno))
> > > > > + goto free_caps;
> > > > > +}
> > > > > +
> > > > > +void test_bind_perm(void)
> > > > > +{
> > > > > + struct bind_perm *skel;
>

Re: [PATCH bpf-next 2/2] selftests/bpf: verify that rebinding to port < 1024 from BPF works

2021-01-21 Thread Martin KaFai Lau
On Thu, Jan 21, 2021 at 02:57:44PM -0800, s...@google.com wrote:
> On 01/21, Martin KaFai Lau wrote:
> > On Wed, Jan 20, 2021 at 05:22:41PM -0800, Stanislav Fomichev wrote:
> > > BPF rewrites from 111 to 111, but it still should mark the port as
> > > "changed".
> > > We also verify that if port isn't touched by BPF, it's still prohibited.
> > >
> > > Signed-off-by: Stanislav Fomichev 
> > > ---
> > >  .../selftests/bpf/prog_tests/bind_perm.c  | 88 +++
> > >  tools/testing/selftests/bpf/progs/bind_perm.c | 36 
> > >  2 files changed, 124 insertions(+)
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > >  create mode 100644 tools/testing/selftests/bpf/progs/bind_perm.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/bind_perm.c b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > new file mode 100644
> > > index ..840a04ac9042
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> > > @@ -0,0 +1,88 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include 
> > > +#include "bind_perm.skel.h"
> > > +
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +static int duration;
> > > +
> > > +void try_bind(int port, int expected_errno)
> > > +{
> > > + struct sockaddr_in sin = {};
> > > + int fd = -1;
> > > +
> > > + fd = socket(AF_INET, SOCK_STREAM, 0);
> > > + if (CHECK(fd < 0, "fd", "errno %d", errno))
> > > + goto close_socket;
> > > +
> > > + sin.sin_family = AF_INET;
> > > + sin.sin_port = htons(port);
> > > +
> > > + errno = 0;
> > > + bind(fd, (struct sockaddr *)&sin, sizeof(sin));
> > > + CHECK(errno != expected_errno, "bind", "errno %d, expected %d",
> > > +   errno, expected_errno);
> > > +
> > > +close_socket:
> > > + if (fd >= 0)
> > > + close(fd);
> > > +}
> > > +
> > > +void cap_net_bind_service(cap_flag_value_t flag)
> > > +{
> > > + const cap_value_t cap_net_bind_service = CAP_NET_BIND_SERVICE;
> > > + cap_t caps;
> > > +
> > > + caps = cap_get_proc();
> > > + if (CHECK(!caps, "cap_get_proc", "errno %d", errno))
> > > + goto free_caps;
> > > +
> > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> > > +CAP_CLEAR),
> > > +   "cap_set_flag", "errno %d", errno))
> > > + goto free_caps;
> > > +
> > > + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> > > +CAP_CLEAR),
> > > +   "cap_set_flag", "errno %d", errno))
> > > + goto free_caps;
> > > +
> > > + if (CHECK(cap_set_proc(caps), "cap_set_proc", "errno %d", errno))
> > > + goto free_caps;
> > > +
> > > +free_caps:
> > > + if (CHECK(cap_free(caps), "cap_free", "errno %d", errno))
> > > + goto free_caps;
> > > +}
> > > +
> > > +void test_bind_perm(void)
> > > +{
> > > + struct bind_perm *skel;
> > > + int cgroup_fd;
> > > +
> > > + cgroup_fd = test__join_cgroup("/bind_perm");
> > > + if (CHECK(cgroup_fd < 0, "cg-join", "errno %d", errno))
> > > + return;
> > > +
> > > + skel = bind_perm__open_and_load();
> > > + if (CHECK(!skel, "skel-load", "errno %d", errno))
> > > + goto close_cgroup_fd;
> > > +
> > > + skel->links.bind_v4_prog = bpf_program__attach_cgroup(skel->progs.bind_v4_prog, cgroup_fd);
> > > + if (CHECK(IS_ERR(skel->links.bind_v4_prog),
> > > +   "cg-attach", "bind4 %ld",
> > > +   PTR_ERR(skel->links.bind_v4_prog)))
> > > + goto close_skeleton;
> > > +
> > > + cap_net_bind_service(CAP_CLEAR);
> > > + try_bind(110, EACCES);
> > > + try_bind(111, 0);
> > > + cap_net_bind_service(CAP_SET);
> > > +
> > > +close_skeleton:
> > > + bind_perm__destroy(skel);
> > > +close_cgroup_fd:
>

Re: [PATCH bpf-next 2/2] selftests/bpf: verify that rebinding to port < 1024 from BPF works

2021-01-21 Thread Martin KaFai Lau
On Wed, Jan 20, 2021 at 05:22:41PM -0800, Stanislav Fomichev wrote:
> BPF rewrites from 111 to 111, but it still should mark the port as
> "changed".
> We also verify that if port isn't touched by BPF, it's still prohibited.
> 
> Signed-off-by: Stanislav Fomichev 
> ---
>  .../selftests/bpf/prog_tests/bind_perm.c  | 88 +++
>  tools/testing/selftests/bpf/progs/bind_perm.c | 36 
>  2 files changed, 124 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/bind_perm.c
>  create mode 100644 tools/testing/selftests/bpf/progs/bind_perm.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/bind_perm.c b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> new file mode 100644
> index ..840a04ac9042
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/bind_perm.c
> @@ -0,0 +1,88 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include 
> +#include "bind_perm.skel.h"
> +
> +#include 
> +#include 
> +#include 
> +
> +static int duration;
> +
> +void try_bind(int port, int expected_errno)
> +{
> + struct sockaddr_in sin = {};
> + int fd = -1;
> +
> + fd = socket(AF_INET, SOCK_STREAM, 0);
> + if (CHECK(fd < 0, "fd", "errno %d", errno))
> + goto close_socket;
> +
> + sin.sin_family = AF_INET;
> + sin.sin_port = htons(port);
> +
> + errno = 0;
> + bind(fd, (struct sockaddr *)&sin, sizeof(sin));
> + CHECK(errno != expected_errno, "bind", "errno %d, expected %d",
> +   errno, expected_errno);
> +
> +close_socket:
> + if (fd >= 0)
> + close(fd);
> +}
> +
> +void cap_net_bind_service(cap_flag_value_t flag)
> +{
> + const cap_value_t cap_net_bind_service = CAP_NET_BIND_SERVICE;
> + cap_t caps;
> +
> + caps = cap_get_proc();
> + if (CHECK(!caps, "cap_get_proc", "errno %d", errno))
> + goto free_caps;
> +
> + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> +CAP_CLEAR),
> +   "cap_set_flag", "errno %d", errno))
> + goto free_caps;
> +
> + if (CHECK(cap_set_flag(caps, CAP_EFFECTIVE, 1, &cap_net_bind_service,
> +CAP_CLEAR),
> +   "cap_set_flag", "errno %d", errno))
> + goto free_caps;
> +
> + if (CHECK(cap_set_proc(caps), "cap_set_proc", "errno %d", errno))
> + goto free_caps;
> +
> +free_caps:
> + if (CHECK(cap_free(caps), "cap_free", "errno %d", errno))
> + goto free_caps;
> +}
> +
> +void test_bind_perm(void)
> +{
> + struct bind_perm *skel;
> + int cgroup_fd;
> +
> + cgroup_fd = test__join_cgroup("/bind_perm");
> + if (CHECK(cgroup_fd < 0, "cg-join", "errno %d", errno))
> + return;
> +
> + skel = bind_perm__open_and_load();
> + if (CHECK(!skel, "skel-load", "errno %d", errno))
> + goto close_cgroup_fd;
> +
> + skel->links.bind_v4_prog = bpf_program__attach_cgroup(skel->progs.bind_v4_prog, cgroup_fd);
> + if (CHECK(IS_ERR(skel->links.bind_v4_prog),
> +   "cg-attach", "bind4 %ld",
> +   PTR_ERR(skel->links.bind_v4_prog)))
> + goto close_skeleton;
> +
> + cap_net_bind_service(CAP_CLEAR);
> + try_bind(110, EACCES);
> + try_bind(111, 0);
> + cap_net_bind_service(CAP_SET);
> +
> +close_skeleton:
> + bind_perm__destroy(skel);
> +close_cgroup_fd:
> + close(cgroup_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/bind_perm.c b/tools/testing/selftests/bpf/progs/bind_perm.c
> new file mode 100644
> index ..2194587ec806
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bind_perm.c
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +SEC("cgroup/bind4")
> +int bind_v4_prog(struct bpf_sock_addr *ctx)
> +{
> + struct bpf_sock *sk;
> + __u32 user_ip4;
> + __u16 user_port;
> +
> + sk = ctx->sk;
> + if (!sk)
> + return 0;
> +
> + if (sk->family != AF_INET)
> + return 0;
> +
> + if (ctx->type != SOCK_STREAM)
> + return 0;
> +
> + /* Rewriting to the same value should still cause
> +  * permission check to be bypassed.
> +  */
> + if (ctx->user_port == bpf_htons(111))
> + ctx->user_port = bpf_htons(111);
iiuc, this overwrite is essentially the way to ensure the bind
will succeed (override CAP_NET_BIND_SERVICE in this particular case?).

It seems to be okay if we consider that most of the use cases are rewriting
to a different port.

However, it is quite unintuitive for the bpf prog to overwrite user_port
with the same value just to ensure this port can be bound successfully
later.

Is user_port the only case? How about other fields in bpf_sock_addr?

> +
> + return 1;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> -- 
> 2.30.0.284.gd98b1dd5eaa7-goog
> 


Re: More flexible BPF socket inet_lookup hooking after listening sockets are dispatched

2021-01-21 Thread Martin KaFai Lau
On Thu, Jan 21, 2021 at 09:40:19PM +0100, Shanti Lombard wrote:
> On 2021-01-21 12:14, Jakub Sitnicki wrote:
> > On Wed, Jan 20, 2021 at 10:06 PM CET, Alexei Starovoitov wrote:
> > 
> > There is also documentation in the kernel:
> > 
> > https://www.kernel.org/doc/html/latest/bpf/prog_sk_lookup.html
> > 
> 
> Thank you, I saw it, it's well written and very much explains it all.
> 
> > 
> > Existing hook is placed before regular listening/unconnected socket
> > lookup to prevent port hijacking on the unprivileged range.
> > 
> 
> Yes, from the point of view of the BPF program. However from the point of
> view of a legitimate service listening on a port that might be blocked by
> the BPF program, BPF is actually hijacking a port bind.
> 
> That being said, if you install the BPF filter, you should know what you are
> doing.
> 
> > > > The suggestion above would work for my use case, but there is another
> > > > possibility to make the same use cases possible : implement in
> > > > BPF (or
> > > > allow BPF to call) the C and E steps above so the BPF program can
> > > > supplant the kernel behavior. I find this solution less elegant
> > > > and it
> > > > might not work well in case there are multiple inet_lookup BPF
> > > > programs
> > > > installed.
> > 
> > Having a BPF helper available to BPF sk_lookup programs that looks up a
> > socket by packet 4-tuple and netns ID in tcp/udp hashtables sounds
> > reasonable to me. You gain the flexibility that you describe without
> > adding code on the hot path.
Agreed that a helper to look up the inet_hash is probably a better way.
There are some existing lookup helper examples as you also pointed out.

I would avoid adding new hooks doing the same thing.
Otherwise, the same bpf prog would be called multiple times, and the bpf
running ctx would have to be initialized multiple times, etc.

> 
> True, if you consider that the hot path should not be slowed down, it makes
> sense. However, it seems to me that the implementation would be more difficult.
> 
> Looking at existing BPF helpers, I found bpf_sk_lookup_tcp and
> bpf_sk_lookup_udp, which should yield a socket from a matching tuple and
> netns. If that's true and usable from within BPF sk_lookup then it's just a
> matter of implementing it and the kernel is already ready for such use cases.
> 
> Shanti

