from:"Ilias Apalodimas"

Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-20 Thread Ilias Apalodimas

Hi Matthew,

[...]
> 
> And the contents of this page already came from that device ... if it
> wanted to write bad data, it could already have done so.
> 
> > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > > zerocopy will bump the refcnt, which means the page_pool will not
> > > > recycle the page when it see the elevated refcnt (it will instead
> > > > release its DMA-mapping).  
> > > 
> > > Yes this is right but the userspace might have already consumed and
> > > unmapped the page before the driver considers to recycle the page.
> > 
> > That is a good point.  So, there is a race window where it is possible
> > to gain recycling.
> > 
> > It seems my page_pool co-maintainer Ilias is interested in taking up the
> > challenge to get this working with TCP RX zerocopy.  So, lets see how
> > this is doable.
> 
> You could also check page_ref_count() - page_mapcount() instead of
> just checking page_ref_count().  Assuming mapping/unmapping can't
> race with recycling?
> 

That's not a bad idea.  As I explained on my last reply to Shakeel, I don't
think the current patch will blow up anywhere.  If the page is unmapped prior
to kfree_skb() it will be recycled.  If it's done in a reverse order, we'll
just free the page entirely and will have to re-allocate it.
The only thing I need to test is potential races (assuming those can even
happen?).

Trying to recycle the page outside of kfree_skb() means we'd have to 'steal'
the page, during put_page() (or some function that's outside the networking
scope).  I think this is going to have a measurable performance penalty though
not in networking, but in general.

In any case, that should be orthogonal to the current patchset.  So unless
someone feels strongly about it, I'd prefer keeping the current code and
trying to enable recycling in the skb zc case, when we have enough users of
the API.


Thanks
/Ilias

Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-19 Thread Ilias Apalodimas

On Mon, Apr 19, 2021 at 09:21:55AM -0700, Shakeel Butt wrote:
> On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas
>  wrote:
> >
> [...]
> > > Pages mapped into the userspace have their refcnt elevated, so the
> > > page_ref_count() check by the drivers indicates to not reuse such
> > > pages.
> > >
> >
> > When tcp_zerocopy_receive() is invoked it will call 
> > tcp_zerocopy_vm_insert_batch()
> > which will end up doing a get_page().
> > What you are saying is that once the zerocopy is done though, 
> > skb_release_data()
> > won't be called, but instead put_page() will be? If that's the case then we 
> > are
> > indeed leaking DMA mappings and memory. That sounds weird though, since the
> > refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so 
> > who
> > eventually frees the page?
> > If kfree_skb() (or any wrapper that calls skb_release_data()) is called
> > eventually, we'll end up properly recycling the page into our pool.
> >
> 
> From what I understand (Eric, please correct me if I'm wrong) for
> simple cases there are 3 page references taken. One by the driver,
> second by skb and third by page table.
> 
> In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one
> page ref through insert_page_into_pte_locked(). However before
> returning from tcp_zerocopy_receive(), the skb references are dropped
> through tcp_recv_skb(). So, whenever the user unmaps the page and
> drops the page ref only then that page can be reused by the driver.
> 
> In my understanding, for zerocopy rx the skb_release_data() is called
> on the pages while they are still mapped into the userspace. So,
> skb_release_data() might not be the right place to recycle the page
> for zerocopy. The email chain at [1] has some discussion on how to
> bundle the recycling of pages with their lifetime.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.k...@gmail.com/

Ah right, you mentioned the same email before and I completely forgot about
it! In the past we had thoughts of 'stealing' the page on put_page instead of 
skb_release_data().  We were afraid that this would cause a measurable 
performance hit, so we tried to limit it within the skb lifecycle.

However I don't think this will be a problem.  Assuming we are right here and 
skb_release_data() is called while the userspace holds an extra reference from
the mapping here's what will happen:

skb_release_data() -> skb_free_head() -> page_pool_return_skb_page() ->
set_page_private() -> xdp_return_skb_frame() -> __xdp_return() -> 
page_pool_put_full_page() -> page_pool_put_page() -> __page_pool_put_page()

When we call __page_pool_put_page(), the refcnt will be != 1 (because a user
mapping is still active), so we won't try to recycle it. Instead we'll remove 
the DMA mappings and decrease the refcnt.

So although the recycling won't 'work', nothing bad will happen (famous last
words).

In any case, I'll double check with the test you pointed out before v4.

Thanks!
/Ilias

Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-19 Thread Ilias Apalodimas

Hi Shakeel,
On Mon, Apr 19, 2021 at 07:57:03AM -0700, Shakeel Butt wrote:
> On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas
>  wrote:
> >
> > On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> > >  wrote:
> > > >
> > > [...]
> > > > > >
> > > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > > can not be used.
> > > > >
> > > > > Yes it can, since it's going to be used as your default allocator for
> > > > > payloads, which might end up on an SKB.
> > > >
> > > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > > zerocopy.
> > > > For several reasons.
> > > >
> > > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > > means using page->mapping and page->index members (right?).
> > > >
> > >
> > > No, only page->_mapcount is used.
> > >
> >
> > I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
> > adopt the recycling mechanism we should try preserving the current
> > functionality of the network stack.
> >
> > The question is how does it work with the current drivers that already have 
> > an
> > internal page recycling mechanism.
> >
> 
> I think the current drivers check page_ref_count(page) to decide to
> reuse (or not) the already allocated pages.
> 
> Some examples from the drivers:
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page()
> drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page()
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get()
> 

Yes, that's how internal recycling is done in drivers. As Jesper mentioned the
refcnt of the page is 1 for the page_pool owned pages and that's how we decide 
what to do with the page.

> > > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > > device) and also map this page into userspace.
> > > >
> > >
> > > I think this is already the case i.e pages still DMA-mapped and also
> > > mapped into userspace.
> > >
> > > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > > zerocopy will bump the refcnt, which means the page_pool will not
> > > > recycle the page when it see the elevated refcnt (it will instead
> > > > release its DMA-mapping).
> > >
> > > Yes this is right but the userspace might have already consumed and
> > > unmapped the page before the driver considers to recycle the page.
> >
> > Same question here. I'll have a closer look in a few days and make sure we 
> > are
> > not breaking anything wrt zerocopy.
> >
> 
> Pages mapped into the userspace have their refcnt elevated, so the
> page_ref_count() check by the drivers indicates to not reuse such
> pages.
> 

When tcp_zerocopy_receive() is invoked it will call 
tcp_zerocopy_vm_insert_batch() 
which will end up doing a get_page().
What you are saying is that once the zerocopy is done though, 
skb_release_data() 
won't be called, but instead put_page() will be? If that's the case then we are 
indeed leaking DMA mappings and memory. That sounds weird though, since the
refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so who
eventually frees the page? 
If kfree_skb() (or any wrapper that calls skb_release_data()) is called 
eventually, we'll end up properly recycling the page into our pool.

> > >
> > > >
> > > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > > > page->private for tricks.  And our patch [3/5] use page->private for
> > > > storing xdp_mem_info.
> > > >
> > > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > > > call page_pool_release_page() to release its DMA-mapping.
> > > >
> > >
> > > I will let TCP RX zerocopy experts respond to this but from my high
> > > level code inspection, I didn't see page->private usage.
> >
> > Shakeel are you aware of any 'easy' way I can have rx zerocopy running?
> >
> 
> I would recommend tools/testing/selftests/net/tcp_mmap.c.

Ok, thanks I'll have a look.

Cheers
/Ilias

Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-19 Thread Ilias Apalodimas

Hi Christoph,

On Mon, Apr 19, 2021 at 08:34:41AM +0200, Christoph Hellwig wrote:
> On Fri, Apr 16, 2021 at 04:27:55PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> > > See below patch.  Where I swap32 the dma address to satisfy
> > > page->compound having bit zero cleared. (It is the simplest fix I could
> > > come up with).
> > 
> > I think this is slightly simpler, and as a bonus code that assumes the
> > old layout won't compile.
> 
> So, why do we even do this crappy overlay of a dma address?  This just
> all seems like a giant hack.  Random subsystems should not just steal
> a few struct page fields as that just turns into the desasters like the
> one we've seen here or probably something worse next time.

The page pool API was using page->private in the past to store these kind of
info. That caused a problem to begin with, since it would fail  on 32-bit
systems with 64bit DMA.  We had a similar discussion on the past but decided
struct page is the right place to store that [1].

Another advantage is that we can now use the information from the networking 
subsystem and enable recycling of SKBs and SKB fragments, by using the stored 
metadata of struct page [2].

[1] 
https://lore.kernel.org/netdev/20190207.132519.1698007650891404763.da...@davemloft.net/
[2] 
https://lore.kernel.org/netdev/20210409223801.104657-1-mcr...@linux.microsoft.com/

Cheers
/Ilias

Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-18 Thread Ilias Apalodimas

On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
>  wrote:
> >
> [...]
> > > >
> > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > can not be used.
> > >
> > > Yes it can, since it's going to be used as your default allocator for
> > > payloads, which might end up on an SKB.
> >
> > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > zerocopy.
> > For several reasons.
> >
> > (1) This implies mapping these pages page to userspace, which AFAIK
> > means using page->mapping and page->index members (right?).
> >
> 
> No, only page->_mapcount is used.
> 

I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
adopt the recycling mechanism we should try preserving the current
functionality of the network stack.

The question is how does it work with the current drivers that already have an
internal page recycling mechanism.

> > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > device) and also map this page into userspace.
> >
> 
> I think this is already the case i.e pages still DMA-mapped and also
> mapped into userspace.
> 
> > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > zerocopy will bump the refcnt, which means the page_pool will not
> > recycle the page when it see the elevated refcnt (it will instead
> > release its DMA-mapping).
> 
> Yes this is right but the userspace might have already consumed and
> unmapped the page before the driver considers to recycle the page.

Same question here. I'll have a closer look in a few days and make sure we are
not breaking anything wrt zerocopy.

> 
> >
> > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > page->private for tricks.  And our patch [3/5] use page->private for
> > storing xdp_mem_info.
> >
> > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > call page_pool_release_page() to release its DMA-mapping.
> >
> 
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.

Shakeel are you aware of any 'easy' way I can have rx zerocopy running?


Thanks!
/Ilias

Re: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-18 Thread Ilias Apalodimas

On Sat, Apr 17, 2021 at 09:22:40PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 17, 2021 at 09:32:06PM +0300, Ilias Apalodimas wrote:
> > > +static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t 
> > > addr)
> > > +{
> > > + page->dma_addr[0] = addr;
> > > + if (sizeof(dma_addr_t) > sizeof(unsigned long))
> > > + page->dma_addr[1] = addr >> 16 >> 16;
> > 
> > The 'error' that was reported will never trigger right?
> > I assume this was compiled with dma_addr_t as 32bits (so it triggered the
> > compilation error), but the if check will never allow this codepath to run.
> > If so can we add a comment explaining this, since none of us will remember 
> > why
> > in 6 months from now?
> 
> That's right.  I compiled it all three ways -- 32-bit, 64-bit dma, 32-bit long
> and 64-bit.  The 32/64 bit case turn into:
> 
>   if (0)
>   page->dma_addr[1] = addr >> 16 >> 16;
> 
> which gets elided.  So the only case that has to work is 64-bit dma and
> 32-bit long.
> 
> I can replace this with upper_32_bits().
> 

Ok up to you, I don't mind either way and thanks for solving this!

Acked-by: Ilias Apalodimas

Re: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-17 Thread Ilias Apalodimas

Hi Matthew,

On Sat, Apr 17, 2021 at 03:45:22AM +0100, Matthew Wilcox wrote:
> 
> Replacement patch to fix compiler warning.
> 
> From: "Matthew Wilcox (Oracle)" 
> Date: Fri, 16 Apr 2021 16:34:55 -0400
> Subject: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems
> To: bro...@redhat.com
> Cc: linux-ker...@vger.kernel.org,
> linux...@kvack.org,
> netdev@vger.kernel.org,
> linuxppc-...@lists.ozlabs.org,
> linux-arm-ker...@lists.infradead.org,
> linux-m...@vger.kernel.org,
> ilias.apalodi...@linaro.org,
> mcr...@linux.microsoft.com,
> grygorii.stras...@ti.com,
> a...@kernel.org,
> h...@lst.de,
> linux-snps-...@lists.infradead.org,
> mho...@kernel.org,
> mgor...@suse.de
> 
> 32-bit architectures which expect 8-byte alignment for 8-byte integers
> and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
> page inadvertently expanded in 2019.  When the dma_addr_t was added,
> it forced the alignment of the union to 8 bytes, which inserted a 4 byte
> gap between 'flags' and the union.
> 
> Fix this by storing the dma_addr_t in one or two adjacent unsigned longs.
> This restores the alignment to that of an unsigned long, and also fixes a
> potential problem where (on a big endian platform), the bit used to denote
> PageTail could inadvertently get set, and a racing get_user_pages_fast()
> could dereference a bogus compound_head().
> 
> Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  include/linux/mm_types.h |  4 ++--
>  include/net/page_pool.h  | 12 +++-
>  net/core/page_pool.c | 12 +++-
>  3 files changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6613b26a8894..5aacc1c10a45 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -97,10 +97,10 @@ struct page {
>   };
>   struct {/* page_pool used by netstack */
>   /**
> -  * @dma_addr: might require a 64-bit value even on
> +  * @dma_addr: might require a 64-bit value on
>* 32-bit architectures.
>*/
> - dma_addr_t dma_addr;
> + unsigned long dma_addr[2];
>   };
>   struct {/* slab, slob and slub */
>   union {
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index b5b195305346..ad6154dc206c 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -198,7 +198,17 @@ static inline void page_pool_recycle_direct(struct 
> page_pool *pool,
>  
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
> - return page->dma_addr;
> + dma_addr_t ret = page->dma_addr[0];
> + if (sizeof(dma_addr_t) > sizeof(unsigned long))
> + ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
> + return ret;
> +}
> +
> +static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> +{
> + page->dma_addr[0] = addr;
> + if (sizeof(dma_addr_t) > sizeof(unsigned long))
> + page->dma_addr[1] = addr >> 16 >> 16;


The 'error' that was reported will never trigger right?
I assume this was compiled with dma_addr_t as 32bits (so it triggered the
compilation error), but the if check will never allow this codepath to run.
If so can we add a comment explaining this, since none of us will remember why
in 6 months from now?

>  }
>  
>  static inline bool is_page_pool_compiled_in(void)
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ad8b0707af04..f014fd8c19a6 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -174,8 +174,10 @@ static void page_pool_dma_sync_for_device(struct 
> page_pool *pool,
> struct page *page,
> unsigned int dma_sync_size)
>  {
> + dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> +
>   dma_sync_size = min(dma_sync_size, pool->p.max_len);
> - dma_sync_single_range_for_device(pool->p.dev, page->dma_addr,
> + dma_sync_single_range_for_device(pool->p.dev, dma_addr,
>pool->p.offset, dma_sync_size,
>pool->p.dma_dir);
>  }
> @@ -226,7 +228,7 @@ static struct page *__page_pool_alloc_pages_slow(struct 
> page_pool *pool,
>   put_page(page);
>   return NULL;
>   }
> - page->dma_addr = dma;
> + page_pool_set_dma_addr(page, dma);
>  
>   if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>   page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> @@ -294,13 +296,13 @@ void page_pool_release_page(struct page_pool *pool, 
> struct page *page)
>*/
>   goto skip_dma_unmap;
>  
> - dma = page->dma_addr;
> + dma = page_po

Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread Ilias Apalodimas

On Wed, Apr 14, 2021 at 12:50:52PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 14, 2021 at 10:10:44AM +0200, Jesper Dangaard Brouer wrote:
> > Yes, indeed! - And very frustrating.  It's keeping me up at night.
> > I'm dreaming about 32 vs 64 bit data structures. My fitbit stats tell
> > me that I don't sleep well with these kind of dreams ;-)
> 
> Then you're going to love this ... even with the latest patch, there's
> still a problem.  Because dma_addr_t is still 64-bit aligned _as a type_,
> that forces the union to be 64-bit aligned (as we already knew and worked
> around), but what I'd forgotten is that forces the entirety of struct
> page to be 64-bit aligned.  Which means ...
> 
> /* size: 40, cachelines: 1, members: 4 */
> /* padding: 4 */
> /* forced alignments: 1 */
> /* last cacheline: 40 bytes */
> } __attribute__((__aligned__(8)));
> 
> .. that we still have a hole!  It's just moved from being at offset 4
> to being at offset 36.
> 
> > That said, I think we need to have a quicker fix for the immediate
> > issue with 64-bit bit dma_addr on 32-bit arch and the misalignment hole
> > it leaves[3] in struct page.  In[3] you mention ppc32, does it only
> > happens on certain 32-bit archs?
> 
> AFAICT it happens on mips32, ppc32, arm32 and arc.  It doesn't happen
> on x86-32 because dma_addr_t is 32-bit aligned.
> 
> Doing this fixes it:
> 
> +++ b/include/linux/types.h
> @@ -140,7 +140,7 @@ typedef u64 blkcnt_t;
>   * so they don't care about the size of the actual bus addresses.
>   */
>  #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> -typedef u64 dma_addr_t;
> +typedef u64 __attribute__((aligned(sizeof(void * dma_addr_t;
>  #else
>  typedef u32 dma_addr_t;
>  #endif
> 
> > I'm seriously considering removing page_pool's support for doing/keeping
> > DMA-mappings on 32-bit arch's.  AFAIK only a single driver use this.
> 
> ... if you're going to do that, then we don't need to do this.

FWIW I already proposed that to Matthew in private a few days ago...
II am not even sure the AM572x has that support.  I'd much prefer getting rid
of it as well, instead of overcomplicating the struct for a device noone is
going to need.

Cheers
/Ilias

Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-10 Thread Ilias Apalodimas

Hi Shakeel, 

On Sat, Apr 10, 2021 at 10:42:30AM -0700, Shakeel Butt wrote:
> On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas
>  wrote:
> >
> > Hi Matthew
> >
> > On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> > > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > > > This is needed by the page_pool to avoid recycling a page not allocated
> > > > via page_pool.
> > >
> > > Is the PageType mechanism more appropriate to your needs?  It wouldn't
> > > be if you use page->_mapcount (ie mapping it to userspace).
> >
> > Interesting!
> > Please keep in mind this was written ~2018 and was stale on my branches for
> > quite some time.  So back then I did try to use PageType, but had not free
> > bits.  Looking at it again though, it's cleaned up.  So yes I think this can
> > be much much cleaner.  Should we go and define a new PG_pagepool?
> >
> >
> 
> Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> can not be used.

Yes it can, since it's going to be used as your default allocator for
payloads, which might end up on an SKB.
So we have to keep the extra added field on struct page for our mark.
Matthew had an intersting idea.  He suggested keeping it, but changing the 
magic number, so it can't be a kernel address, but I'll let him follow 
up on the details.

> 
> There is a recent discussion [1] on memcg accounting of TCP RX
> zerocopy and I am wondering if this work can somehow help in that
> regard. I will take a look at the series.
> 

I'll try having a look on this as well. The idea behind the patchset is to
allow lower speed NICs that use the API already, gain recycling 'easily'.  
Using page_pool for the driver comes with a penalty to begin with.
Allocating pages instead of SKBs has a measurable difference. By enabling them
to recycle they'll get better performance, since you skip the
reallocation/remapping and only care for syncing the buffers correctly.

> [1] 
> https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.k...@gmail.com/

Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-10 Thread Ilias Apalodimas

Hi Matthew 

On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > This is needed by the page_pool to avoid recycling a page not allocated
> > via page_pool.
> 
> Is the PageType mechanism more appropriate to your needs?  It wouldn't
> be if you use page->_mapcount (ie mapping it to userspace).

Interesting!
Please keep in mind this was written ~2018 and was stale on my branches for
quite some time.  So back then I did try to use PageType, but had not free
bits.  Looking at it again though, it's cleaned up.  So yes I think this can
be much much cleaner.  Should we go and define a new PG_pagepool?


Thanks!
/Ilias
> 
> > Signed-off-by: Matteo Croce 
> > ---
> >  include/linux/mm_types.h | 1 +
> >  include/net/page_pool.h  | 2 ++
> >  net/core/page_pool.c | 4 
> >  3 files changed, 7 insertions(+)
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 6613b26a8894..ef2d0d5f62e4 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -101,6 +101,7 @@ struct page {
> >  * 32-bit architectures.
> >  */
> > dma_addr_t dma_addr;
> > +   unsigned long signature;
> > };
> > struct {/* slab, slob and slub */
> > union {
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index b5b195305346..b30405e84b5e 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -63,6 +63,8 @@
> >   */
> >  #define PP_ALLOC_CACHE_SIZE128
> >  #define PP_ALLOC_CACHE_REFILL  64
> > +#define PP_SIGNATURE   0x20210303
> > +
> >  struct pp_alloc_cache {
> > u32 count;
> > void *cache[PP_ALLOC_CACHE_SIZE];
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index ad8b0707af04..2ae9b554ef98 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -232,6 +232,8 @@ static struct page *__page_pool_alloc_pages_slow(struct 
> > page_pool *pool,
> > page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> >  
> >  skip_dma_map:
> > +   page->signature = PP_SIGNATURE;
> > +
> > /* Track how many pages are held 'in-flight' */
> > pool->pages_state_hold_cnt++;
> >  
> > @@ -302,6 +304,8 @@ void page_pool_release_page(struct page_pool *pool, 
> > struct page *page)
> >  DMA_ATTR_SKIP_CPU_SYNC);
> > page->dma_addr = 0;
> >  skip_dma_unmap:
> > +   page->signature = 0;
> > +
> > /* This may be the last page returned, releasing the pool, so
> >  * it is not safe to reference pool afterwards.
> >  */
> > -- 
> > 2.30.2
> >

Re: Bogus struct page layout on 32-bit

2021-04-10 Thread Ilias Apalodimas

+CC Grygorii for the cpsw part as Ivan's email is not valid anymore

Thanks for catching this. Interesting indeed...

On Sat, 10 Apr 2021 at 09:22, Jesper Dangaard Brouer  wrote:
>
> On Sat, 10 Apr 2021 03:43:13 +0100
> Matthew Wilcox  wrote:
>
> > On Sat, Apr 10, 2021 at 06:45:35AM +0800, kernel test robot wrote:
> > > >> include/linux/mm_types.h:274:1: error: static_assert failed due to 
> > > >> requirement '__builtin_offsetof(struct page, lru) == 
> > > >> __builtin_offsetof(struct folio, lru)' "offsetof(struct page, lru) == 
> > > >> offsetof(struct folio, lru)"
> > >FOLIO_MATCH(lru, lru);
> > >include/linux/mm_types.h:272:2: note: expanded from macro 'FOLIO_MATCH'
> > >static_assert(offsetof(struct page, pg) == offsetof(struct 
> > > folio, fl))
> >
> > Well, this is interesting.  pahole reports:
> >
> > struct page {
> > long unsigned int  flags;/* 0 4 */
> > /* XXX 4 bytes hole, try to pack */
> > union {
> > struct {
> > struct list_head lru;/* 8 8 */
> > ...
> > struct folio {
> > union {
> > struct {
> > long unsigned int flags; /* 0 4 */
> > struct list_head lru;/* 4 8 */
> >
> > so this assert has absolutely done its job.
> >
> > But why has this assert triggered?  Why is struct page layout not what
> > we thought it was?  Turns out it's the dma_addr added in 2019 by commit
> > c25fff7171be ("mm: add dma_addr_t to struct page").  On this particular
> > config, it's 64-bit, and ppc32 requires alignment to 64-bit.  So
> > the whole union gets moved out by 4 bytes.
>
> Argh, good that you are catching this!
>
> > Unfortunately, we can't just fix this by putting an 'unsigned long pad'
> > in front of it.  It still aligns the entire union to 8 bytes, and then
> > it skips another 4 bytes after the pad.
> >
> > We can fix it like this ...
> >
> > +++ b/include/linux/mm_types.h
> > @@ -96,11 +96,12 @@ struct page {
> > unsigned long private;
> > };
> > struct {/* page_pool used by netstack */
> > +   unsigned long _page_pool_pad;
>
> I'm fine with this pad.  Matteo is currently proposing[1] to add a 32-bit
> value after @dma_addr, and he could use this area instead.
>
> [1] 
> https://lore.kernel.org/netdev/20210409223801.104657-3-mcr...@linux.microsoft.com/
>
> When adding/changing this, we need to make sure that it doesn't overlap
> member @index, because network stack use/check page_is_pfmemalloc().
> As far as my calculations this is safe to add.  I always try to keep an
> eye out for this, but I wonder if we could have a build check like yours.
>
>
> > /**
> >  * @dma_addr: might require a 64-bit value even on
> >  * 32-bit architectures.
> >  */
> > -   dma_addr_t dma_addr;
> > +   dma_addr_t dma_addr __packed;
> > };
> > struct {/* slab, slob and slub */
> > union {
> >
> > but I don't know if GCC is smart enough to realise that dma_addr is now
> > on an 8 byte boundary and it can use a normal instruction to access it,
> > or whether it'll do something daft like use byte loads to access it.
> >
> > We could also do:
> >
> > +   dma_addr_t dma_addr __packed __aligned(sizeof(void 
> > *));
> >
> > and I see pahole, at least sees this correctly:
> >
> > struct {
> > long unsigned int _page_pool_pad; /* 4 4 */
> > dma_addr_t dma_addr 
> > __attribute__((__aligned__(4))); /* 8 8 */
> > } __attribute__((__packed__)) 
> > __attribute__((__aligned__(4)));
> >
> > This presumably affects any 32-bit architecture with a 64-bit phys_addr_t
> > / dma_addr_t.  Advice, please?
>
> I'm not sure that the 32-bit behavior is with 64-bit (dma) addrs.
>
> I don't have any 32-bit boards with 64-bit DMA.  Cc. Ivan, wasn't your
> board (572x ?) 32-bit with driver 'cpsw' this case (where Ivan added
> XDP+page_pool) ?
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>

Re: [PATCH net-next v3 3/5] page_pool: Allow drivers to hint on SKB recycling

2021-04-09 Thread Ilias Apalodimas

Hi Matteo, 

[...]
> +bool page_pool_return_skb_page(void *data);
> +
>  struct page_pool *page_pool_create(const struct page_pool_params *params);
>  
>  #ifdef CONFIG_PAGE_POOL
> @@ -243,4 +247,13 @@ static inline void page_pool_ring_unlock(struct 
> page_pool *pool)
>   spin_unlock_bh(&pool->ring.producer_lock);
>  }
>  
> +/* Store mem_info on struct page and use it while recycling skb frags */
> +static inline
> +void page_pool_store_mem_info(struct page *page, struct xdp_mem_info *mem)
> +{
> + u32 *xmi = (u32 *)mem;
> +

I just noticed this changed from the original patchset I was carrying. 
On the original, I had a union containing a u32 member to explicitly avoid
this casting. Let's wait for comments on the rest of the series, but i'd like 
to change that back in a v4. Aplogies, I completely missed this on the
previous postings ...

Thanks
/Ilias

Re: [PATCH net-next v2 3/5] page_pool: Allow drivers to hint on SKB recycling

2021-04-09 Thread Ilias Apalodimas

On Fri, Apr 09, 2021 at 12:29:29PM -0700, Jakub Kicinski wrote:
> On Fri, 9 Apr 2021 22:01:51 +0300 Ilias Apalodimas wrote:
> > On Fri, Apr 09, 2021 at 11:56:48AM -0700, Jakub Kicinski wrote:
> > > On Fri,  2 Apr 2021 20:17:31 +0200 Matteo Croce wrote:  
> > > > Co-developed-by: Jesper Dangaard Brouer 
> > > > Co-developed-by: Matteo Croce 
> > > > Signed-off-by: Ilias Apalodimas   
> > > 
> > > Checkpatch says we need sign-offs from all authors.
> > > Especially you since you're posting.  
> > 
> > Yes it does, we forgot that.  Let me take a chance on this one. 
> > The patch is changing the default skb return path and while we've done 
> > enough
> > testing, I would really prefer this going in on a future -rc1 (assuming we 
> > even
> > consider merging it), allowing enough time to have wider tests.
> 
> Up to you guys. FWIW if you decide to try for 5.13 the missing signoffs
> can be posted in replies, no need to repost.
Thanks! but...
I think I prefer another repost, including mm-people on the list as well (and
fixing SoB's).
I just noticed noone is cc'ed and patch [2/5] adds a line in mm_types.h

/Ilias

Re: [PATCH net-next v2 3/5] page_pool: Allow drivers to hint on SKB recycling

2021-04-09 Thread Ilias Apalodimas

On Fri, Apr 09, 2021 at 11:56:48AM -0700, Jakub Kicinski wrote:
> On Fri,  2 Apr 2021 20:17:31 +0200 Matteo Croce wrote:
> > Co-developed-by: Jesper Dangaard Brouer 
> > Co-developed-by: Matteo Croce 
> > Signed-off-by: Ilias Apalodimas 
> 
> Checkpatch says we need sign-offs from all authors.
> Especially you since you're posting.

Yes it does, we forgot that.  Let me take a chance on this one. 
The patch is changing the default skb return path and while we've done enough
testing, I would really prefer this going in on a future -rc1 (assuming we even
consider merging it), allowing enough time to have wider tests.

Regards
/Ilias

Re: [PATCH net-next 6/6] mvneta: recycle buffers

2021-03-24 Thread Ilias Apalodimas

On Wed, Mar 24, 2021 at 10:28:35AM +0100, Lorenzo Bianconi wrote:
> [...]
> > > diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> > > b/drivers/net/ethernet/marvell/mvneta.c
> > > index a635cf84608a..8b3250394703 100644
> > > --- a/drivers/net/ethernet/marvell/mvneta.c
> > > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > > @@ -2332,7 +2332,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, 
> > > struct mvneta_rx_queue *rxq,
> > >   if (!skb)
> > >   return ERR_PTR(-ENOMEM);
> > >  
> > > - page_pool_release_page(rxq->page_pool, virt_to_page(xdp->data));
> > > + skb_mark_for_recycle(skb, virt_to_page(xdp->data), &xdp->rxq->mem);
> > >  
> > >   skb_reserve(skb, xdp->data - xdp->data_hard_start);
> > >   skb_put(skb, xdp->data_end - xdp->data);
> > > @@ -2344,7 +2344,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, 
> > > struct mvneta_rx_queue *rxq,
> > >   skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> > >   skb_frag_page(frag), skb_frag_off(frag),
> > >   skb_frag_size(frag), PAGE_SIZE);
> > > - page_pool_release_page(rxq->page_pool, skb_frag_page(frag));
> > > + skb_mark_for_recycle(skb, skb_frag_page(frag), &xdp->rxq->mem);
> > >   }
> > >  
> > >   return skb;
> > 
> > This cause skb_mark_for_recycle() to set 'skb->pp_recycle=1' multiple
> > times, for the same SKB.  (copy-pasted function below signature to help
> > reviewers).
> > 
> > This makes me question if we need an API for setting this per page
> > fragment?
> > Or if the API skb_mark_for_recycle() need to walk the page fragments in
> > the SKB and set the info stored in the page for each?
> 
> Considering just performances, I guess it is better open-code here since the
> driver already performs a loop over fragments to build the skb, but I guess
> this approach is quite risky and I would prefer to have a single utility
> routine to take care of linear area + fragments. What do you think?
> 

The mark_for_recycle does two things as you noticed, 
set the pp_recyle bit on the skb head and update the struct page information we
need to trigger the recycling.
We could split those and be more explicit, but isn't the current approach a
bit simpler for the driver writer to get it right?
I don't think setting a single value to 1 will have any noticeable performance
impact, but we can always test it.

> Regards,
> Lorenzo
> 
> > 
> > 
> > -- 
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> >

Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-24 Thread Ilias Apalodimas

Hi Alexander,

On Tue, Mar 23, 2021 at 08:03:46PM +, Alexander Lobakin wrote:
> From: Ilias Apalodimas 
> Date: Tue, 23 Mar 2021 19:01:52 +0200
> 
> > On Tue, Mar 23, 2021 at 04:55:31PM +, Alexander Lobakin wrote:
> > > > > > > >
> >
> > [...]
> >
> > > > > > >
> > > > > > > Thanks for the testing!
> > > > > > > Any chance you can get a perf measurement on this?
> > > > > >
> > > > > > I guess you mean perf-report (--stdio) output, right?
> > > > > >
> > > > >
> > > > > Yea,
> > > > > As hinted below, I am just trying to figure out if on Alexander's 
> > > > > platform the
> > > > > cost of syncing, is bigger that free-allocate. I remember one armv7 
> > > > > were that
> > > > > was the case.
> > > > >
> > > > > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > > > > >
> > > > > > (+1 this is an important question)
> > >
> > > Sure, I'll drop perf tools to my test env and share the results,
> > > maybe tomorrow or in a few days.
> 
> Oh we-e-e-ell...
> Looks like I've been fooled by I-cache misses or smth like that.
> That happens sometimes, not only on my machines, and not only on
> MIPS if I'm not mistaken.
> Sorry for confusing you guys.
> 
> I got drastically different numbers after I enabled CONFIG_KALLSYMS +
> CONFIG_PERF_EVENTS for perf tools.
> The only difference in code is that I rebased onto Mel's
> mm-bulk-rebase-v6r4.
> 
> (lunar is my WIP NIC driver)
> 
> 1. 5.12-rc3 baseline:
> 
> TCP: 566 Mbps
> UDP: 615 Mbps
> 
> perf top:
>  4.44%  [lunar]  [k] lunar_rx_poll_page_pool
>  3.56%  [kernel] [k] r4k_wait_irqoff
>  2.89%  [kernel] [k] free_unref_page
>  2.57%  [kernel] [k] dma_map_page_attrs
>  2.32%  [kernel] [k] get_page_from_freelist
>  2.28%  [lunar]  [k] lunar_start_xmit
>  1.82%  [kernel] [k] __copy_user
>  1.75%  [kernel] [k] dev_gro_receive
>  1.52%  [kernel] [k] cpuidle_enter_state_coupled
>  1.46%  [kernel] [k] tcp_gro_receive
>  1.35%  [kernel] [k] __rmemcpy
>  1.33%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
>  1.30%  [kernel] [k] __dev_queue_xmit
>  1.22%  [kernel] [k] pfifo_fast_dequeue
>  1.17%  [kernel] [k] skb_release_data
>  1.17%  [kernel] [k] skb_segment
> 
> free_unref_page() and get_page_from_freelist() consume a lot.
> 
> 2. 5.12-rc3 + Page Pool recycling by Matteo:
> TCP: 589 Mbps
> UDP: 633 Mbps
> 
> perf top:
>  4.27%  [lunar]  [k] lunar_rx_poll_page_pool
>  2.68%  [lunar]  [k] lunar_start_xmit
>  2.41%  [kernel] [k] dma_map_page_attrs
>  1.92%  [kernel] [k] r4k_wait_irqoff
>  1.89%  [kernel] [k] __copy_user
>  1.62%  [kernel] [k] dev_gro_receive
>  1.51%  [kernel] [k] cpuidle_enter_state_coupled
>  1.44%  [kernel] [k] tcp_gro_receive
>  1.40%  [kernel] [k] __rmemcpy
>  1.38%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
>  1.37%  [kernel] [k] free_unref_page
>  1.35%  [kernel] [k] __dev_queue_xmit
>  1.30%  [kernel] [k] skb_segment
>  1.28%  [kernel] [k] get_page_from_freelist
>  1.27%  [kernel] [k] r4k_dma_cache_inv
> 
> +20 Mbps increase on both TCP and UDP. free_unref_page() and
> get_page_from_freelist() dropped down the list significantly.
> 
> 3. 5.12-rc3 + Page Pool recycling + PP bulk allocator (Mel & Jesper):
> TCP: 596
> UDP: 641
> 
> perf top:
>  4.38%  [lunar]  [k] lunar_rx_poll_page_pool
>  3.34%  [kernel] [k] r4k_wait_irqoff
>  3.14%  [kernel] [k] dma_map_page_attrs
>  2.49%  [lunar]  [k] lunar_start_xmit
>  1.85%  [kernel] [k] dev_gro_receive
>  1.76%  [kernel] [k] free_unref_page
>  1.76%  [kernel] [k] __copy_user
>  1.65%  [kernel] [k] inet_gro_receive
>  1.57%  [kernel] [k] tcp_gro_receive
>  1.48%  [kernel] [k] cpuidle_enter_state_coupled
>  1.43%  [nf_conntrack]   [k] nf_conntrack_tcp_packet
>  1.42%  [kernel] [k] __rmemcpy
>

Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Ilias Apalodimas

On Tue, Mar 23, 2021 at 04:55:31PM +, Alexander Lobakin wrote:
> > > > > >

[...]

> > > > >
> > > > > Thanks for the testing!
> > > > > Any chance you can get a perf measurement on this?
> > > >
> > > > I guess you mean perf-report (--stdio) output, right?
> > > >
> > >
> > > Yea,
> > > As hinted below, I am just trying to figure out if on Alexander's 
> > > platform the
> > > cost of syncing, is bigger that free-allocate. I remember one armv7 were 
> > > that
> > > was the case.
> > >
> > > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > > >
> > > > (+1 this is an important question)
> 
> Sure, I'll drop perf tools to my test env and share the results,
> maybe tomorrow or in a few days.
> From what I know for sure about MIPS and my platform,
> post-Rx synching (dma_sync_single_for_cpu()) is a no-op, and
> pre-Rx (dma_sync_single_for_device() etc.) is a bit expensive.
> I always have sane page_pool->pp.max_len value (smth about 1668
> for MTU of 1500) to minimize the overhead.
> 
> By the word, IIRC, all machines shipped with mvpp2 have hardware
> cache coherency units and don't suffer from sync routines at all.
> That may be the reason why mvpp2 wins the most from this series.

Yep exactly. It's also the reason why you explicitly have to opt-in using the
recycling (by marking the skb for it), instead of hiding the feature in the
page pool internals 

Cheers
/Ilias

> 
> > > > > >
> > > > > > [0] 
> > > > > > https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me
> > > > > >
> > > >
> >
> > That would be the same as for mvneta:
> >
> > Overhead  Shared Object Symbol
> >   24.10%  [kernel]  [k] __pi___inval_dcache_area
> >   23.02%  [mvneta]  [k] mvneta_rx_swbm
> >7.19%  [kernel]  [k] kmem_cache_alloc
> >
> > Anyway, I tried to use the recycling *and* napi_build_skb on mvpp2,
> > and I get lower packet rate than recycling alone.
> > I don't know why, we should investigate it.
> 
> mvpp2 driver doesn't use napi_consume_skb() on its Tx completion path.
> As a result, NAPI percpu caches get refilled only through
> kmem_cache_alloc_bulk(), and most of skbuff_head recycling
> doesn't work.
> 
> > Regards,
> > --
> > per aspera ad upstream
> 
> Oh, I love that one!
> 
> Al
>

Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Ilias Apalodimas

On Tue, Mar 23, 2021 at 05:04:47PM +0100, Jesper Dangaard Brouer wrote:
> On Tue, 23 Mar 2021 17:47:46 +0200
> Ilias Apalodimas  wrote:
> 
> > On Tue, Mar 23, 2021 at 03:41:23PM +, Alexander Lobakin wrote:
> > > From: Matteo Croce 
> > > Date: Mon, 22 Mar 2021 18:02:55 +0100
> > >   
> > > > From: Matteo Croce 
> > > >
> > > > This series enables recycling of the buffers allocated with the 
> > > > page_pool API.
> > > > The first two patches are just prerequisite to save space in a struct 
> > > > and
> > > > avoid recycling pages allocated with other API.
> > > > Patch 2 was based on a previous idea from Jonathan Lemon.
> > > >
> > > > The third one is the real recycling, 4 fixes the compilation of 
> > > > __skb_frag_unref
> > > > users, and 5,6 enable the recycling on two drivers.
> > > >
> > > > In the last two patches I reported the improvement I have with the 
> > > > series.
> > > >
> > > > The recycling as is can't be used with drivers like mlx5 which do page 
> > > > split,
> > > > but this is documented in a comment.
> > > > In the future, a refcount can be used so to support mlx5 with no 
> > > > changes.
> > > >
> > > > Ilias Apalodimas (2):
> > > >   page_pool: DMA handling and allow to recycles frames via SKB
> > > >   net: change users of __skb_frag_unref() and add an extra argument
> > > >
> > > > Jesper Dangaard Brouer (1):
> > > >   xdp: reduce size of struct xdp_mem_info
> > > >
> > > > Matteo Croce (3):
> > > >   mm: add a signature in struct page
> > > >   mvpp2: recycle buffers
> > > >   mvneta: recycle buffers
> > > >
> > > >  .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
> > > >  drivers/net/ethernet/marvell/mvneta.c |  4 +-
> > > >  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++
> > > >  drivers/net/ethernet/marvell/sky2.c   |  2 +-
> > > >  drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
> > > >  include/linux/mm_types.h  |  1 +
> > > >  include/linux/skbuff.h| 33 +++--
> > > >  include/net/page_pool.h   | 15 ++
> > > >  include/net/xdp.h |  5 +-
> > > >  net/core/page_pool.c  | 47 +++
> > > >  net/core/skbuff.c | 20 +++-
> > > >  net/core/xdp.c| 14 --
> > > >  net/tls/tls_device.c  |  2 +-
> > > >  13 files changed, 138 insertions(+), 26 deletions(-)  
> > > 
> > > Just for the reference, I've performed some tests on 1G SoC NIC with
> > > this patchset on, here's direct link: [0]
> > >   
> > 
> > Thanks for the testing!
> > Any chance you can get a perf measurement on this?
> 
> I guess you mean perf-report (--stdio) output, right?
> 

Yea, 
As hinted below, I am just trying to figure out if on Alexander's platform the
cost of syncing, is bigger that free-allocate. I remember one armv7 were that
was the case. 

> > Is DMA syncing taking a substantial amount of your cpu usage?
> 
> (+1 this is an important question)
>  
> > > 
> > > [0] https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me
> > > 
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>

Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Ilias Apalodimas

On Tue, Mar 23, 2021 at 03:41:23PM +, Alexander Lobakin wrote:
> From: Matteo Croce 
> Date: Mon, 22 Mar 2021 18:02:55 +0100
> 
> > From: Matteo Croce 
> >
> > This series enables recycling of the buffers allocated with the page_pool 
> > API.
> > The first two patches are just prerequisite to save space in a struct and
> > avoid recycling pages allocated with other API.
> > Patch 2 was based on a previous idea from Jonathan Lemon.
> >
> > The third one is the real recycling, 4 fixes the compilation of 
> > __skb_frag_unref
> > users, and 5,6 enable the recycling on two drivers.
> >
> > In the last two patches I reported the improvement I have with the series.
> >
> > The recycling as is can't be used with drivers like mlx5 which do page 
> > split,
> > but this is documented in a comment.
> > In the future, a refcount can be used so to support mlx5 with no changes.
> >
> > Ilias Apalodimas (2):
> >   page_pool: DMA handling and allow to recycles frames via SKB
> >   net: change users of __skb_frag_unref() and add an extra argument
> >
> > Jesper Dangaard Brouer (1):
> >   xdp: reduce size of struct xdp_mem_info
> >
> > Matteo Croce (3):
> >   mm: add a signature in struct page
> >   mvpp2: recycle buffers
> >   mvneta: recycle buffers
> >
> >  .../chelsio/inline_crypto/ch_ktls/chcr_ktls.c |  2 +-
> >  drivers/net/ethernet/marvell/mvneta.c |  4 +-
> >  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 17 +++
> >  drivers/net/ethernet/marvell/sky2.c   |  2 +-
> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
> >  include/linux/mm_types.h  |  1 +
> >  include/linux/skbuff.h| 33 +++--
> >  include/net/page_pool.h   | 15 ++
> >  include/net/xdp.h |  5 +-
> >  net/core/page_pool.c  | 47 +++
> >  net/core/skbuff.c | 20 +++-
> >  net/core/xdp.c| 14 --
> >  net/tls/tls_device.c  |  2 +-
> >  13 files changed, 138 insertions(+), 26 deletions(-)
> 
> Just for the reference, I've performed some tests on 1G SoC NIC with
> this patchset on, here's direct link: [0]
> 

Thanks for the testing!
Any chance you can get a perf measurement on this?
Is DMA syncing taking a substantial amount of your cpu usage?

Thanks
/Ilias

> > --
> > 2.30.2
> 
> [0] https://lore.kernel.org/netdev/20210323153550.130385-1-aloba...@pm.me
> 
> Thanks,
> Al
>

Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread Ilias Apalodimas

Hi David, 

On Tue, Mar 23, 2021 at 08:57:57AM -0600, David Ahern wrote:
> On 3/22/21 11:02 AM, Matteo Croce wrote:
> > From: Matteo Croce 
> > 
> > This series enables recycling of the buffers allocated with the page_pool 
> > API.
> > The first two patches are just prerequisite to save space in a struct and
> > avoid recycling pages allocated with other API.
> > Patch 2 was based on a previous idea from Jonathan Lemon.
> > 
> > The third one is the real recycling, 4 fixes the compilation of 
> > __skb_frag_unref
> > users, and 5,6 enable the recycling on two drivers.
> 
> patch 4 should be folded into 3; each patch should build without errors.
> 

Yes 

> > 
> > In the last two patches I reported the improvement I have with the series.
> > 
> > The recycling as is can't be used with drivers like mlx5 which do page 
> > split,
> > but this is documented in a comment.
> > In the future, a refcount can be used so to support mlx5 with no changes.
> 
> Is the end goal of the page_pool changes to remove driver private caches?
> 
> 

Yes. The patchset doesn't currently support that , because all the >10gbit
interfaces split the page and we don't account for that. We should be able to 
extend it though and account for that.  I don't have any hardware
(Intel/mlx) available, but I'll be happy to talk to anyone that does and
figure out a way to support those cards properly.


Cheers
/Ilias

Re: [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path

2021-03-12 Thread Ilias Apalodimas

[...]
> 6. return last_page
> 
> > +   /* Remaining pages store in alloc.cache */
> > +   list_for_each_entry_safe(page, next, &page_list, lru) {
> > +   list_del(&page->lru);
> > +   if ((pp_flags & PP_FLAG_DMA_MAP) &&
> > +   unlikely(!page_pool_dma_map(pool, page))) {
> > +   put_page(page);
> > +   continue;
> > +   }
> 
> So if you added a last_page pointer what you could do is check for it
> here and assign it to the alloc cache. If last_page is not set the
> block would be skipped.
> 
> > +   if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
> > +   pool->alloc.cache[pool->alloc.count++] = page;
> > +   pool->pages_state_hold_cnt++;
> > +   trace_page_pool_state_hold(pool, page,
> > +  
> > pool->pages_state_hold_cnt);
> > +   } else {
> > +   put_page(page);
> 
> If you are just calling put_page here aren't you leaking DMA mappings?
> Wouldn't you need to potentially unmap the page before you call
> put_page on it?

Oops, I completely missed that. Alexander is right here.

> 
> > +   }
> > +   }
> > +out:
> > if ((pp_flags & PP_FLAG_DMA_MAP) &&
> > -   unlikely(!page_pool_dma_map(pool, page))) {
> > -   put_page(page);
> > +   unlikely(!page_pool_dma_map(pool, first_page))) {
> > +   put_page(first_page);
> 
> I would probably move this block up and make it a part of the pp_order
> block above. Also since you are doing this in 2 spots it might make
> sense to look at possibly making this an inline function.
> 
> > return NULL;
> > }
> >
> > /* Track how many pages are held 'in-flight' */
> > pool->pages_state_hold_cnt++;
> > -   trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
> > +   trace_page_pool_state_hold(pool, first_page, 
> > pool->pages_state_hold_cnt);
> >
> > /* When page just alloc'ed is should/must have refcnt 1. */
> > -   return page;
> > +   return first_page;
> >  }
> >
> >  /* For using page_pool replace: alloc_pages() API calls, but provide
> > --
> > 2.26.2
> >

Cheers
/Ilias

Re: [PATCH 5/5] net: page_pool: use alloc_pages_bulk in refill code path

2021-03-02 Thread Ilias Apalodimas

On Mon, Mar 01, 2021 at 04:12:00PM +, Mel Gorman wrote:
> From: Jesper Dangaard Brouer 
> 
> There are cases where the page_pool need to refill with pages from the
> page allocator. Some workloads cause the page_pool to release pages
> instead of recycling these pages.
> 
> For these workload it can improve performance to bulk alloc pages from
> the page-allocator to refill the alloc cache.
> 
> For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> redirecting xdp_frame packets into a veth, that does XDP_PASS to create
> an SKB from the xdp_frame, which then cannot return the page to the
> page_pool. In this case, we saw[1] an improvement of 18.8% from using
> the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
> 
> [1] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
> 
> Signed-off-by: Jesper Dangaard Brouer 
> Signed-off-by: Mel Gorman 
> ---
>  net/core/page_pool.c | 63 
>  1 file changed, 40 insertions(+), 23 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a26f2ceb6a87..567680bd91c4 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -208,44 +208,61 @@ noinline
>  static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>gfp_t _gfp)
>  {
> + const int bulk = PP_ALLOC_CACHE_REFILL;
> + struct page *page, *next, *first_page;
>   unsigned int pp_flags = pool->p.flags;
> - struct page *page;
> + unsigned int pp_order = pool->p.order;
> + int pp_nid = pool->p.nid;
> + LIST_HEAD(page_list);
>   gfp_t gfp = _gfp;
>  
> - /* We could always set __GFP_COMP, and avoid this branch, as
> -  * prep_new_page() can handle order-0 with __GFP_COMP.
> -  */
> - if (pool->p.order)
> + /* Don't support bulk alloc for high-order pages */
> + if (unlikely(pp_order)) {
>   gfp |= __GFP_COMP;
> + first_page = alloc_pages_node(pp_nid, gfp, pp_order);
> + if (unlikely(!first_page))
> + return NULL;
> + goto out;
> + }
>  
> - /* FUTURE development:
> -  *
> -  * Current slow-path essentially falls back to single page
> -  * allocations, which doesn't improve performance.  This code
> -  * need bulk allocation support from the page allocator code.
> -  */
> -
> - /* Cache was empty, do real allocation */
> -#ifdef CONFIG_NUMA
> - page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
> -#else
> - page = alloc_pages(gfp, pool->p.order);
> -#endif
> - if (!page)
> + if (unlikely(!__alloc_pages_bulk_nodemask(gfp, pp_nid, NULL,
> +   bulk, &page_list)))
>   return NULL;
>  
> + /* First page is extracted and returned to caller */
> + first_page = list_first_entry(&page_list, struct page, lru);
> + list_del(&first_page->lru);
> +
> + /* Remaining pages store in alloc.cache */
> + list_for_each_entry_safe(page, next, &page_list, lru) {
> + list_del(&page->lru);
> + if (pp_flags & PP_FLAG_DMA_MAP &&
> + unlikely(!page_pool_dma_map(pool, page))) {
> + put_page(page);
> + continue;
> + }
> + if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
> + pool->alloc.cache[pool->alloc.count++] = page;
> + pool->pages_state_hold_cnt++;
> + trace_page_pool_state_hold(pool, page,
> +pool->pages_state_hold_cnt);
> + } else {
> + put_page(page);
> + }
> + }
> +out:
>   if (pp_flags & PP_FLAG_DMA_MAP &&
> - unlikely(!page_pool_dma_map(pool, page))) {
> - put_page(page);
> + unlikely(!page_pool_dma_map(pool, first_page))) {
> + put_page(first_page);
>   return NULL;
>   }
>  
>   /* Track how many pages are held 'in-flight' */
>   pool->pages_state_hold_cnt++;
> - trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
> + trace_page_pool_state_hold(pool, first_page, 
> pool->pages_state_hold_cnt);
>  
>   /* When page just alloc'ed is should/must have refcnt 1. */
> - return page;
> + return first_page;
>  }
>  
>  /* For using page_pool replace: alloc_pages() API calls, but provide
> -- 
> 2.26.2
> 

Reviewed-by: Ilias Apalodimas

Re: [PATCH 4/5] net: page_pool: refactor dma_map into own function page_pool_dma_map

2021-03-02 Thread Ilias Apalodimas

Hi Mel,

Can you please CC me in future revisions. I almost missed that!

On Mon, Mar 01, 2021 at 04:11:59PM +, Mel Gorman wrote:
> From: Jesper Dangaard Brouer 
> 
> In preparation for next patch, move the dma mapping into its own
> function, as this will make it easier to follow the changes.
> 
> V2: make page_pool_dma_map return boolean (Ilias)
> 

[...]

>  static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>gfp_t _gfp)
>  {
> + unsigned int pp_flags = pool->p.flags;
>   struct page *page;
>   gfp_t gfp = _gfp;
> - dma_addr_t dma;
>  
>   /* We could always set __GFP_COMP, and avoid this branch, as
>* prep_new_page() can handle order-0 with __GFP_COMP.
> @@ -211,30 +234,14 @@ static struct page *__page_pool_alloc_pages_slow(struct 
> page_pool *pool,
>   if (!page)
>   return NULL;
>  
> - if (!(pool->p.flags & PP_FLAG_DMA_MAP))
> - goto skip_dma_map;
> -
> - /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> -  * since dma_addr_t can be either 32 or 64 bits and does not always fit
> -  * into page private data (i.e 32bit cpu with 64bit DMA caps)
> -  * This mapping is kept for lifetime of page, until leaving pool.
> -  */
> - dma = dma_map_page_attrs(pool->p.dev, page, 0,
> -  (PAGE_SIZE << pool->p.order),
> -  pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> - if (dma_mapping_error(pool->p.dev, dma)) {
> + if (pp_flags & PP_FLAG_DMA_MAP &&

Nit pick but can we have if ((pp_flags & PP_FLAG_DMA_MAP) && ...

> + unlikely(!page_pool_dma_map(pool, page))) {
>   put_page(page);
>   return NULL;
>   }
> - page->dma_addr = dma;
>  
> - if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> - page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> -
> -skip_dma_map:
>   /* Track how many pages are held 'in-flight' */
>   pool->pages_state_hold_cnt++;
> -
>   trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
>  
>   /* When page just alloc'ed is should/must have refcnt 1. */
> -- 
> 2.26.2
> 

Otherwise 
Reviewed-by: Ilias Apalodimas

Re: [PATCH v2 bpf-next] bpf: devmap: move drop error path to devmap for XDP_REDIRECT

2021-03-01 Thread Ilias Apalodimas

Hi Lorenzo

for the netsec driver
Reviewed-by: Ilias Apalodimas 


On Sat, Feb 27, 2021 at 12:04:13PM +0100, Lorenzo Bianconi wrote:
> We want to change the current ndo_xdp_xmit drop semantics because
> it will allow us to implement better queue overflow handling.
> This is working towards the larger goal of a XDP TX queue-hook.
> Move XDP_REDIRECT error path handling from each XDP ethernet driver to
> devmap code. According to the new APIs, the driver running the
> ndo_xdp_xmit pointer, will break tx loop whenever the hw reports a tx
> error and it will just return to devmap caller the number of successfully
> transmitted frames. It will be devmap responsability to free dropped
> frames.
> Move each XDP ndo_xdp_xmit capable driver to the new APIs:
> - veth
> - virtio-net
> - mvneta
> - mvpp2
> - socionext
> - amazon ena
> - bnxt
> - freescale (dpaa2, dpaa)
> - xen-frontend
> - qede
> - ice
> - igb
> - ixgbe
> - i40e
> - mlx5
> - ti (cpsw, cpsw-new)
> - tun
> - sfc
> 
> Acked-by: Edward Cree 
> Signed-off-by: Lorenzo Bianconi 
> ---
> More details about the new ndo_xdp_xmit design can be found here:
> https://github.com/xdp-project/xdp-project/blob/master/areas/core/redesign01_ndo_xdp_xmit.org
> 
> Changes since v1:
> - rebase on top of bpf-next
> - add driver maintainers in cc
> - add Edward's ack
> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c  | 18 ++---
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 20 ++
>  .../net/ethernet/freescale/dpaa/dpaa_eth.c| 12 -
>  .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  2 --
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 15 +--
>  drivers/net/ethernet/intel/ice/ice_txrx.c | 15 +--
>  drivers/net/ethernet/intel/igb/igb_main.c | 11 
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 11 
>  drivers/net/ethernet/marvell/mvneta.c | 13 +
>  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 13 +
>  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 15 +--
>  drivers/net/ethernet/qlogic/qede/qede_fp.c| 19 +
>  drivers/net/ethernet/sfc/tx.c | 15 +--
>  drivers/net/ethernet/socionext/netsec.c   | 16 +--
>  drivers/net/ethernet/ti/cpsw.c| 14 +-
>  drivers/net/ethernet/ti/cpsw_new.c| 14 +-
>  drivers/net/ethernet/ti/cpsw_priv.c   | 11 +++-
>  drivers/net/tun.c | 15 ++-
>  drivers/net/veth.c| 27 ++-
>  drivers/net/virtio_net.c  | 25 -
>  drivers/net/xen-netfront.c| 18 ++---
>  kernel/bpf/devmap.c   | 27 +--
>  22 files changed, 153 insertions(+), 193 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
> b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 102f2c91fdb8..7ad0557dedbd 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -300,7 +300,7 @@ static int ena_xdp_xmit_frame(struct ena_ring *xdp_ring,
>  
>   rc = ena_xdp_tx_map_frame(xdp_ring, tx_info, xdpf, &push_hdr, 
> &push_len);
>   if (unlikely(rc))
> - goto error_drop_packet;
> + return rc;
>  
>   ena_tx_ctx.ena_bufs = tx_info->bufs;
>   ena_tx_ctx.push_header = push_hdr;
> @@ -330,8 +330,6 @@ static int ena_xdp_xmit_frame(struct ena_ring *xdp_ring,
>  error_unmap_dma:
>   ena_unmap_tx_buff(xdp_ring, tx_info);
>   tx_info->xdpf = NULL;
> -error_drop_packet:
> - xdp_return_frame(xdpf);
>   return rc;
>  }
>  
> @@ -339,8 +337,8 @@ static int ena_xdp_xmit(struct net_device *dev, int n,
>   struct xdp_frame **frames, u32 flags)
>  {
>   struct ena_adapter *adapter = netdev_priv(dev);
> - int qid, i, err, drops = 0;
>   struct ena_ring *xdp_ring;
> + int qid, i, nxmit = 0;
>  
>   if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>   return -EINVAL;
> @@ -360,12 +358,12 @@ static int ena_xdp_xmit(struct net_device *dev, int n,
>   spin_lock(&xdp_ring->xdp_tx_lock);
>  
>   for (i = 0; i < n; i++) {
> - err = ena_xdp_xmit_frame(xdp_ring, dev, frames[i], 0);
>   /* The descriptor is freed by ena_xdp_xmit_frame in case
>* of an error.
>*/
> - if (err)
> - drops++;
> + if (ena_xdp_xmit_frame(xdp_ring, dev, frames[i], 0))
> + break;
> +

Re: [PATCH RFC net-next 2/3] net: page_pool: use alloc_pages_bulk in refill code path

2021-02-24 Thread Ilias Apalodimas

Hi Jesper, 

On Wed, Feb 24, 2021 at 07:56:46PM +0100, Jesper Dangaard Brouer wrote:
> There are cases where the page_pool need to refill with pages from the
> page allocator. Some workloads cause the page_pool to release pages
> instead of recycling these pages.
> 
> For these workload it can improve performance to bulk alloc pages from
> the page-allocator to refill the alloc cache.
> 
> For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> redirecting xdp_frame packets into a veth, that does XDP_PASS to create
> an SKB from the xdp_frame, which then cannot return the page to the
> page_pool. In this case, we saw[1] an improvement of 18.8% from using
> the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
> 
> [1] 
> https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
> 
> Signed-off-by: Jesper Dangaard Brouer 

[...]

> + /* Remaining pages store in alloc.cache */
> + list_for_each_entry_safe(page, next, &page_list, lru) {
> + list_del(&page->lru);
> + if (pp_flags & PP_FLAG_DMA_MAP) {
> + page = page_pool_dma_map(pool, page);
> + if (!page)

As I commented on the previous patch, i'd prefer the put_page() here to be
explicitly called, instead of hiding in the page_pool_dma_map()

> + continue;
> + }
> + if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
> + pool->alloc.cache[pool->alloc.count++] = page;
> + pool->pages_state_hold_cnt++;
> + trace_page_pool_state_hold(pool, page,
> +pool->pages_state_hold_cnt);
> + } else {
> + put_page(page);
> + }
> + }
> +out:
>   if (pool->p.flags & PP_FLAG_DMA_MAP) {
> - page = page_pool_dma_map(pool, page);
> - if (!page)
> + first_page = page_pool_dma_map(pool, first_page);
> + if (!first_page)
>   return NULL;
>   }
>  
[...]

Cheers
/Ilias

Re: [PATCH RFC net-next 1/3] net: page_pool: refactor dma_map into own function page_pool_dma_map

2021-02-24 Thread Ilias Apalodimas

On Wed, Feb 24, 2021 at 07:56:41PM +0100, Jesper Dangaard Brouer wrote:
> In preparation for next patch, move the dma mapping into its own
> function, as this will make it easier to follow the changes.
> 
> Signed-off-by: Jesper Dangaard Brouer 
> ---
>  net/core/page_pool.c |   49 +
>  1 file changed, 29 insertions(+), 20 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ad8b0707af04..50d52aa6fbeb 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -180,6 +180,31 @@ static void page_pool_dma_sync_for_device(struct 
> page_pool *pool,
>pool->p.dma_dir);
>  }
>  
> +static struct page * page_pool_dma_map(struct page_pool *pool,
> +struct page *page)
> +{

Why return a struct page* ?
boolean maybe?

> + dma_addr_t dma;
> +
> + /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> +  * since dma_addr_t can be either 32 or 64 bits and does not always fit
> +  * into page private data (i.e 32bit cpu with 64bit DMA caps)
> +  * This mapping is kept for lifetime of page, until leaving pool.
> +  */
> + dma = dma_map_page_attrs(pool->p.dev, page, 0,
> +  (PAGE_SIZE << pool->p.order),
> +  pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> + if (dma_mapping_error(pool->p.dev, dma)) {
> + put_page(page);

This is a bit confusing when reading it. 
The name of the function should try to map the page and report a yes/no,
instead of trying to call put_page as well.
Can't we explicitly ask the user to call put_page() if the mapping failed?

A clear example is on patch 2/3, when on the first read I was convinced there
was a memory leak.

> + return NULL;
> + }
> + page->dma_addr = dma;
> +
> + if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> + page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> +
> + return page;
> +}
> +
>  /* slow path */
>  noinline
>  static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> @@ -187,7 +212,6 @@ static struct page *__page_pool_alloc_pages_slow(struct 
> page_pool *pool,
>  {
>   struct page *page;
>   gfp_t gfp = _gfp;
> - dma_addr_t dma;
>  
>   /* We could always set __GFP_COMP, and avoid this branch, as
>* prep_new_page() can handle order-0 with __GFP_COMP.
> @@ -211,27 +235,12 @@ static struct page *__page_pool_alloc_pages_slow(struct 
> page_pool *pool,
>   if (!page)
>   return NULL;
>  
> - if (!(pool->p.flags & PP_FLAG_DMA_MAP))
> - goto skip_dma_map;
> -
> - /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> -  * since dma_addr_t can be either 32 or 64 bits and does not always fit
> -  * into page private data (i.e 32bit cpu with 64bit DMA caps)
> -  * This mapping is kept for lifetime of page, until leaving pool.
> -  */
> - dma = dma_map_page_attrs(pool->p.dev, page, 0,
> -  (PAGE_SIZE << pool->p.order),
> -  pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> - if (dma_mapping_error(pool->p.dev, dma)) {
> - put_page(page);
> - return NULL;
> + if (pool->p.flags & PP_FLAG_DMA_MAP) {
> + page = page_pool_dma_map(pool, page);
> + if (!page)
> + return NULL;
>   }
> - page->dma_addr = dma;
> -
> - if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> - page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>  
> -skip_dma_map:
>   /* Track how many pages are held 'in-flight' */
>   pool->pages_state_hold_cnt++;
>  
> 
> 

Thanks!
/Ilias

Re: [PATCH net-next 3/3] net: page_pool: simplify page recycling condition tests

2021-01-26 Thread Ilias Apalodimas

On Mon, Jan 25, 2021 at 04:47:20PM +, Alexander Lobakin wrote:
> pool_page_reusable() is a leftover from pre-NUMA-aware times. For now,
> this function is just a redundant wrapper over page_is_pfmemalloc(),
> so Inline it into its sole call site.
> 
> Signed-off-by: Alexander Lobakin 
> ---
>  net/core/page_pool.c | 14 --
>  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index f3c690b8c8e3..ad8b0707af04 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -350,14 +350,6 @@ static bool page_pool_recycle_in_cache(struct page *page,
>   return true;
>  }
>  
> -/* page is NOT reusable when:
> - * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
> - */
> -static bool pool_page_reusable(struct page_pool *pool, struct page *page)
> -{
> - return !page_is_pfmemalloc(page);
> -}
> -
>  /* If the page refcnt == 1, this will try to recycle the page.
>   * if PP_FLAG_DMA_SYNC_DEV is set, we'll try to sync the DMA area for
>   * the configured size min(dma_sync_size, pool->max_len).
> @@ -373,9 +365,11 @@ __page_pool_put_page(struct page_pool *pool, struct page 
> *page,
>* regular page allocator APIs.
>*
>* refcnt == 1 means page_pool owns page, and can recycle it.
> +  *
> +  * page is NOT reusable when allocated when system is under
> +  * some pressure. (page_is_pfmemalloc)
>*/
> - if (likely(page_ref_count(page) == 1 &&
> -pool_page_reusable(pool, page))) {
> + if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
>   /* Read barrier done in page_ref_count / READ_ONCE */
>  
>   if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> -- 
> 2.30.0
> 
> 

Reviewed-by: Ilias Apalodimas

Re: [PATCH net-next] net: netsec: add xdp tx return bulking support

2020-12-04 Thread Ilias Apalodimas

Hi Jakub, 

On Fri, Nov 20, 2020 at 10:14:34AM -0800, Jakub Kicinski wrote:
> On Fri, 20 Nov 2020 20:07:13 +0200 Ilias Apalodimas wrote:
> > On Fri, Nov 20, 2020 at 10:00:07AM -0800, Jakub Kicinski wrote:
> > > On Tue, 17 Nov 2020 10:35:28 +0100 Lorenzo Bianconi wrote:  
> > > > Convert netsec driver to xdp_return_frame_bulk APIs.
> > > > Rely on xdp_return_frame_rx_napi for XDP_TX in order to try to recycle
> > > > the page in the "in-irq" page_pool cache.
> > > > 
> > > > Co-developed-by: Jesper Dangaard Brouer 
> > > > Signed-off-by: Jesper Dangaard Brouer 
> > > > Signed-off-by: Lorenzo Bianconi 
> > > > ---
> > > > This patch is just compile tested, I have not carried out any run test  
> > > 
> > > Doesn't look like anyone will test this so applied, thanks!  
> > 
> > I had everything applied trying to test, but there was an issue with the 
> > PHY the
> > socionext board uses [1].
> 
> FWIW feel free to send a note saying you need more time.
> 
> > In any case the patch looks correct, so you can keep it and I'll report any 
> > problems once I short the box out.
> 
> Cool, fingers crossed :)

FWIW I did eventually test this. 
I can't see anything wrong with it.

Cheers
/Ilias

Re: [PATCH net-next] net: page_pool: Add page_pool_put_page_bulk() to page_pool.rst

2020-11-23 Thread Ilias Apalodimas

On Fri, Nov 20, 2020 at 11:19:34PM +0100, Lorenzo Bianconi wrote:
> Introduce page_pool_put_page_bulk() entry into the API section of
> page_pool.rst
> 
> Signed-off-by: Lorenzo Bianconi 
> ---
>  Documentation/networking/page_pool.rst | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/Documentation/networking/page_pool.rst 
> b/Documentation/networking/page_pool.rst
> index 43088ddf95e4..e848f5b995b8 100644
> --- a/Documentation/networking/page_pool.rst
> +++ b/Documentation/networking/page_pool.rst
> @@ -97,6 +97,14 @@ a page will cause no race conditions is enough.
>  
>  * page_pool_get_dma_dir(): Retrieve the stored DMA direction.
>  
> +* page_pool_put_page_bulk(): It tries to refill a bulk of count pages into 
> the

Tries to refill a number of pages sounds better?

> +  ptr_ring cache holding ptr_ring producer lock. If the ptr_ring is full,
> +  page_pool_put_page_bulk() will release leftover pages to the page 
> allocator.
> +  page_pool_put_page_bulk() is suitable to be run inside the driver NAPI tx
> +  completion loop for the XDP_REDIRECT use case.
> +  Please consider the caller must not use data area after running

s/consider/note/

> +  page_pool_put_page_bulk(), as this function overwrites it.
> +
>  Coding examples
>  ===
>  
> -- 
> 2.28.0
> 


Other than that 
Acked-by: Ilias Apalodimas

Re: [PATCH net-next] net: netsec: add xdp tx return bulking support

2020-11-20 Thread Ilias Apalodimas

Hi Jakub, 

On Fri, Nov 20, 2020 at 10:00:07AM -0800, Jakub Kicinski wrote:
> On Tue, 17 Nov 2020 10:35:28 +0100 Lorenzo Bianconi wrote:
> > Convert netsec driver to xdp_return_frame_bulk APIs.
> > Rely on xdp_return_frame_rx_napi for XDP_TX in order to try to recycle
> > the page in the "in-irq" page_pool cache.
> > 
> > Co-developed-by: Jesper Dangaard Brouer 
> > Signed-off-by: Jesper Dangaard Brouer 
> > Signed-off-by: Lorenzo Bianconi 
> > ---
> > This patch is just compile tested, I have not carried out any run test
> 
> Doesn't look like anyone will test this so applied, thanks!

I had everything applied trying to test, but there was an issue with the PHY the
socionext board uses [1].

In any case the patch looks correct, so you can keep it and I'll report any 
problems once I short the box out.

[1] https://lore.kernel.org/netdev/20201017151132.gk456...@lunn.ch/T/

Cheers
/Ilias

Re: [PATCH net-next] MAINTAINERS: Update page pool entry

2020-11-20 Thread Ilias Apalodimas

On Fri, Nov 20, 2020 at 10:36:19AM +0100, Jesper Dangaard Brouer wrote:
> Add some file F: matches that is related to page_pool.
> 
> Signed-off-by: Jesper Dangaard Brouer 
> ---
>  MAINTAINERS |2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f827f504251b..efcdc68a03b1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -13179,6 +13179,8 @@ L:netdev@vger.kernel.org
>  S:   Supported
>  F:   include/net/page_pool.h
>  F:   net/core/page_pool.c
> +F:   include/trace/events/page_pool.h
> +F:   Documentation/networking/page_pool.rst
>  
>  PANASONIC LAPTOP ACPI EXTRAS DRIVER
>  M:   Harald Welte 
> 
> 
Acked-by: Ilias Apalodimas

Re: [PATCH bpf-next 6/9] xsk: propagate napi_id to XDP socket Rx path

2020-11-14 Thread Ilias Apalodimas

On Thu, Nov 12, 2020 at 12:40:38PM +0100, Björn Töpel wrote:
> From: Björn Töpel 
> 
> Add napi_id to the xdp_rxq_info structure, and make sure the XDP
> socket pick up the napi_id in the Rx path. The napi_id is used to find
> the corresponding NAPI structure for socket busy polling.
> 
> Signed-off-by: Björn Töpel 
> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c  |  2 +-
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c |  2 +-
>  .../ethernet/cavium/thunder/nicvf_queues.c|  2 +-
>  .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |  2 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  2 +-
>  drivers/net/ethernet/intel/ice/ice_base.c |  4 ++--
>  drivers/net/ethernet/intel/ice/ice_txrx.c |  2 +-
>  drivers/net/ethernet/intel/igb/igb_main.c |  2 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 +-
>  .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 +-
>  drivers/net/ethernet/marvell/mvneta.c |  2 +-
>  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |  4 ++--
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c|  2 +-
>  .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
>  .../ethernet/netronome/nfp/nfp_net_common.c   |  2 +-
>  drivers/net/ethernet/qlogic/qede/qede_main.c  |  2 +-
>  drivers/net/ethernet/sfc/rx_common.c  |  2 +-
>  drivers/net/ethernet/socionext/netsec.c   |  2 +-
>  drivers/net/ethernet/ti/cpsw_priv.c   |  2 +-
>  drivers/net/hyperv/netvsc.c   |  2 +-
>  drivers/net/tun.c |  2 +-
>  drivers/net/veth.c|  2 +-
>  drivers/net/virtio_net.c  |  2 +-
>  drivers/net/xen-netfront.c|  2 +-
>  include/net/busy_poll.h   | 19 +++
>  include/net/xdp.h |  3 ++-
>  net/core/dev.c|  2 +-
>  net/core/xdp.c|  3 ++-
>  net/xdp/xsk.c     |  1 +
>  29 files changed, 47 insertions(+), 33 deletions(-)
> 
 
For the socionext driver

Acked-by: Ilias Apalodimas

Re: [PATCH v6 net-nex 2/5] net: page_pool: add bulk support for ptr_ring

2020-11-13 Thread Ilias Apalodimas

ge(page);
> +
> + return NULL;
> +}
> +
> +void page_pool_put_page(struct page_pool *pool, struct page *page,
> + unsigned int dma_sync_size, bool allow_direct)
> +{
> + page = __page_pool_put_page(pool, page, dma_sync_size, allow_direct);
> + if (page && !page_pool_recycle_in_ring(pool, page)) {
> + /* Cache full, fallback to free pages */
> + page_pool_return_page(pool, page);
> + }
>  }
>  EXPORT_SYMBOL(page_pool_put_page);
>  
> +/* Caller must not use data area after call, as this function overwrites it 
> */
> +void page_pool_put_page_bulk(struct page_pool *pool, void **data,
> +  int count)
> +{
> + int i, bulk_len = 0;
> +
> + for (i = 0; i < count; i++) {
> + struct page *page = virt_to_head_page(data[i]);
> +
> + page = __page_pool_put_page(pool, page, -1, false);
> + /* Approved for bulk recycling in ptr_ring cache */
> + if (page)
> + data[bulk_len++] = page;
> + }
> +
> + if (unlikely(!bulk_len))
> + return;
> +
> + /* Bulk producer into ptr_ring page_pool cache */
> + page_pool_ring_lock(pool);
> + for (i = 0; i < bulk_len; i++) {
> + if (__ptr_ring_produce(&pool->ring, data[i]))
> + break; /* ring full */
> + }
> + page_pool_ring_unlock(pool);
> +
> + /* Hopefully all pages was return into ptr_ring */
> + if (likely(i == bulk_len))
> + return;
> +
> + /* ptr_ring cache full, free remaining pages outside producer lock
> +  * since put_page() with refcnt == 1 can be an expensive operation
> +  */
> + for (; i < bulk_len; i++)
> + page_pool_return_page(pool, data[i]);
> +}
> +EXPORT_SYMBOL(page_pool_put_page_bulk);
> +
>  static void page_pool_empty_ring(struct page_pool *pool)
>  {
>   struct page *page;
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index bbaee7fdd44f..3d330ebda893 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -393,16 +393,11 @@ EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);
>  void xdp_flush_frame_bulk(struct xdp_frame_bulk *bq)
>  {
>   struct xdp_mem_allocator *xa = bq->xa;
> - int i;
>  
> - if (unlikely(!xa))
> + if (unlikely(!xa || !bq->count))
>   return;
>  
> - for (i = 0; i < bq->count; i++) {
> - struct page *page = virt_to_head_page(bq->q[i]);
> -
> - page_pool_put_full_page(xa->page_pool, page, false);
> - }
> + page_pool_put_page_bulk(xa->page_pool, bq->q, bq->count);
>   /* bq->xa is not cleared to save lookup, if mem.id same in next bulk */
>   bq->count = 0;
>  }
> -- 
> 2.26.2
> 
Acked-by: Ilias Apalodimas

Re: [PATCH v6 net-nex 1/5] net: xdp: introduce bulking for xdp tx return path

2020-11-13 Thread Ilias Apalodimas

kely(mem->id != xa->mem.id)) {
> +     xdp_flush_frame_bulk(bq);
> + bq->xa = rhashtable_lookup(mem_id_ht, &mem->id, 
> mem_id_rht_params);
> + }
> +
> + bq->q[bq->count++] = xdpf->data;
> +}
> +EXPORT_SYMBOL_GPL(xdp_return_frame_bulk);
> +
>  void xdp_return_buff(struct xdp_buff *xdp)
>  {
>   __xdp_return(xdp->data, &xdp->rxq->mem, true);
> -- 
> 2.26.2
> 

Could you add the changes in the Documentation as well (which can do in later)

Acked-by: Ilias Apalodimas

Re: [PATCH v2 net-next 1/4] net: xdp: introduce bulking for xdp tx return path

2020-10-29 Thread Ilias Apalodimas

Hi Lorenzo, 

On Thu, Oct 29, 2020 at 08:28:44PM +0100, Lorenzo Bianconi wrote:
> XDP bulk APIs introduce a defer/flush mechanism to return
> pages belonging to the same xdp_mem_allocator object
> (identified via the mem.id field) in bulk to optimize
> I-cache and D-cache since xdp_return_frame is usually run
> inside the driver NAPI tx completion loop.
> The bulk queue size is set to 16 to be aligned to how
> XDP_REDIRECT bulking works. The bulk is flushed when
> it is full or when mem.id changes.
> xdp_frame_bulk is usually stored/allocated on the function
> call-stack to avoid locking penalties.
> Current implementation considers only page_pool memory model.
> Convert mvneta driver to xdp_return_frame_bulk APIs.
> 
> Suggested-by: Jesper Dangaard Brouer 
> Signed-off-by: Lorenzo Bianconi 
> ---
>  drivers/net/ethernet/marvell/mvneta.c |  5 ++-
>  include/net/xdp.h |  9 
>  net/core/xdp.c| 61 +++
>  3 files changed, 74 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> b/drivers/net/ethernet/marvell/mvneta.c
> index 54b0bf574c05..43ab8a73900e 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -1834,8 +1834,10 @@ static void mvneta_txq_bufs_free(struct mvneta_port 
> *pp,
>struct netdev_queue *nq, bool napi)
>  {
>   unsigned int bytes_compl = 0, pkts_compl = 0;
> + struct xdp_frame_bulk bq;
>   int i;
>  
> + bq.xa = NULL;
>   for (i = 0; i < num; i++) {
>   struct mvneta_tx_buf *buf = &txq->buf[txq->txq_get_index];
>   struct mvneta_tx_desc *tx_desc = txq->descs +
> @@ -1857,9 +1859,10 @@ static void mvneta_txq_bufs_free(struct mvneta_port 
> *pp,
>   if (napi && buf->type == MVNETA_TYPE_XDP_TX)
>   xdp_return_frame_rx_napi(buf->xdpf);
>   else
> - xdp_return_frame(buf->xdpf);
> + xdp_return_frame_bulk(buf->xdpf, &bq);
>   }
>   }
> + xdp_flush_frame_bulk(&bq);
>  
>   netdev_tx_completed_queue(nq, pkts_compl, bytes_compl);
>  }

Sorry I completely forgot to mention this on the v1 review.
I think this belongs to a patch of it's own similar to mellanox and mvpp2 
drivers

> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 3814fb631d52..a1f48a73e6df 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -104,6 +104,12 @@ struct xdp_frame {
>   struct net_device *dev_rx; /* used by cpumap */
>  };
>  
> +#define XDP_BULK_QUEUE_SIZE  16
> +struct xdp_frame_bulk {
> + int count;
> + void *xa;
> + void *q[XDP_BULK_QUEUE_SIZE];
> +};
>  
>  static inline struct skb_shared_info *
>  xdp_get_shared_info_from_frame(struct xdp_frame *frame)
> @@ -194,6 +200,9 @@ struct xdp_frame *xdp_convert_buff_to_frame(struct 
> xdp_buff *xdp)
>  void xdp_return_frame(struct xdp_frame *xdpf);
>  void xdp_return_frame_rx_napi(struct xdp_frame *xdpf);
>  void xdp_return_buff(struct xdp_buff *xdp);
> +void xdp_flush_frame_bulk(struct xdp_frame_bulk *bq);
> +void xdp_return_frame_bulk(struct xdp_frame *xdpf,
> +struct xdp_frame_bulk *bq);
>  
>  /* When sending xdp_frame into the network stack, then there is no
>   * return point callback, which is needed to release e.g. DMA-mapping
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 48aba933a5a8..66ac275a0360 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -380,6 +380,67 @@ void xdp_return_frame_rx_napi(struct xdp_frame *xdpf)
>  }
>  EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);
>  
> +/* XDP bulk APIs introduce a defer/flush mechanism to return
> + * pages belonging to the same xdp_mem_allocator object
> + * (identified via the mem.id field) in bulk to optimize
> + * I-cache and D-cache.
> + * The bulk queue size is set to 16 to be aligned to how
> + * XDP_REDIRECT bulking works. The bulk is flushed when
> + * it is full or when mem.id changes.
> + * xdp_frame_bulk is usually stored/allocated on the function
> + * call-stack to avoid locking penalties.
> + */
> +void xdp_flush_frame_bulk(struct xdp_frame_bulk *bq)
> +{
> + struct xdp_mem_allocator *xa = bq->xa;
> + int i;
> +
> + if (unlikely(!xa))
> + return;
> +
> + for (i = 0; i < bq->count; i++) {
> + struct page *page = virt_to_head_page(bq->q[i]);
> +
> + page_pool_put_full_page(xa->page_pool, page, false);
> + }
> + bq->count = 0;
> +}
> +EXPORT_SYMBOL_GPL(xdp_flush_frame_bulk);
> +
> +void xdp_return_frame_bulk(struct xdp_frame *xdpf,
> +struct xdp_frame_bulk *bq)
> +{
> + struct xdp_mem_info *mem = &xdpf->mem;
> + struct xdp_mem_allocator *xa;
> +
> + if (mem->type != MEM_TYPE_PAGE_POOL) {
> + __xdp_return(xdpf->data, &xdpf->mem, false);
> + return;
> +

Re: realtek PHY commit bbc4d71d63549 causes regression

2020-10-29 Thread Ilias Apalodimas

On Thu, Oct 29, 2020 at 03:39:34PM +0100, Andrew Lunn wrote:
> > What about reverting the realtek PHY commit from stable?
> > As Ard said it doesn't really fix anything (usage wise) and causes a bunch 
> > of
> > problems.
> > 
> > If I understand correctly we have 3 options:
> > 1. 'Hack' the  drivers in stable to fix it (and most of those hacks will 
> > take 
> >a long time to remove)
> > 2. Update DTE of all affected devices, backport it to stable and force 
> > users to
> > update
> > 3. Revert the PHY commit
> > 
> > imho [3] is the least painful solution.
> 
> The PHY commit is correct, in that it fixes a bug. So i don't want to
> remove it.

Yea I meant revert the commit from were ever it was backported, not on current 
upstream. I agree it's correct from a coding point of view, but it never 
actually fixes anything functionality wise of the affected platforms. 
On the contrary, it breaks platforms without warning.

> 
> Backporting it to stable is what is causing most of the issues today,
> combined with a number of broken DT descriptions. So i would be happy
> for stable to get a patch which looks at the strapping, sees ID is
> enabled via strapping, warns the DT blob is FUBAR, and then ignores
> the requested PHY-mode. That gives developers time to fix their broken
> DT.

(Ard replied on this while I was typing)

> 
> Andrew

Cheers
/Ilias

Re: realtek PHY commit bbc4d71d63549 causes regression

2020-10-29 Thread Ilias Apalodimas

Hi Andrew

On Sun, Oct 25, 2020 at 03:42:58PM +0100, Andrew Lunn wrote:
> On Sun, Oct 25, 2020 at 03:34:06PM +0100, Ard Biesheuvel wrote:
> > On Sun, 25 Oct 2020 at 15:29, Andrew Lunn  wrote:
> > >
> > > On Sun, Oct 25, 2020 at 03:16:36PM +0100, Ard Biesheuvel wrote:
> > > > On Sun, 18 Oct 2020 at 17:45, Andrew Lunn  wrote:
> > > > >
> > > > > > However, that leaves the question why bbc4d71d63549bcd was 
> > > > > > backported,
> > > > > > although I understand why the discussion is a bit trickier there. 
> > > > > > But
> > > > > > if it did not fix a regression, only broken code that never worked 
> > > > > > in
> > > > > > the first place, I am not convinced it belongs in -stable.
> > > > >
> > > > > Please ask Serge Semin what platform he tested on. I kind of expect it
> > > > > worked for him, in some limited way, enough that it passed his
> > > > > testing.
> > > > >
> > > >
> > > > I'll make a note here that a rather large number of platforms got
> > > > broken by the same fix for the Realtek PHY driver:
> > > >
> > > > https://lore.kernel.org/lkml/?q=bbc4d71d6354
> > > >
> > > > I seriously doubt whether disabling TX/RX delay when it is enabled by
> > > > h/w straps is the right thing to do here.
> > >
> > > The device tree is explicitly asking for rgmii. If it wanted the
> > > hardware left alone, it should of used PHY_INTERFACE_MODE_NA.
> > >
> > 
> > Would you suggest that these DTs remove the phy-mode instead? As I
> > don't see anyone proposing that.
> 
> What is also O.K, for most MAC drivers. Some might enforce it is
> present, in which case, you can set it to "", which will get parsed as
> PHY_INTERFACE_MODE_NA. But a few MAC drivers might configure there MII
> bus depending on the PHY mode, RGMII vs GMII.
> 
> Andrew

What about reverting the realtek PHY commit from stable?
As Ard said it doesn't really fix anything (usage wise) and causes a bunch of
problems.

If I understand correctly we have 3 options:
1. 'Hack' the  drivers in stable to fix it (and most of those hacks will take 
   a long time to remove)
2. Update DTE of all affected devices, backport it to stable and force users to
update
3. Revert the PHY commit

imho [3] is the least painful solution.


Thanks
/Ilias

Re: [PATCH net-next 1/4] net: xdp: introduce bulking for xdp tx return path

2020-10-29 Thread Ilias Apalodimas

On Thu, Oct 29, 2020 at 03:02:16PM +0100, Lorenzo Bianconi wrote:
> > On Tue, 27 Oct 2020 20:04:07 +0100
> > Lorenzo Bianconi  wrote:
> > 
> > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > index 48aba933a5a8..93eabd789246 100644
> > > --- a/net/core/xdp.c
> > > +++ b/net/core/xdp.c
> > > @@ -380,6 +380,57 @@ void xdp_return_frame_rx_napi(struct xdp_frame *xdpf)
> > >  }
> > >  EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);
> > >  
> > > +void xdp_flush_frame_bulk(struct xdp_frame_bulk *bq)
> > > +{
> > > + struct xdp_mem_allocator *xa = bq->xa;
> > > + int i;
> > > +
> > > + if (unlikely(!xa))
> > > + return;
> > > +
> > > + for (i = 0; i < bq->count; i++) {
> > > + struct page *page = virt_to_head_page(bq->q[i]);
> > > +
> > > + page_pool_put_full_page(xa->page_pool, page, false);
> > > + }
> > > + bq->count = 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(xdp_flush_frame_bulk);
> > > +
> > > +void xdp_return_frame_bulk(struct xdp_frame *xdpf,
> > > +struct xdp_frame_bulk *bq)
> > > +{
> > > + struct xdp_mem_info *mem = &xdpf->mem;
> > > + struct xdp_mem_allocator *xa, *nxa;
> > > +
> > > + if (mem->type != MEM_TYPE_PAGE_POOL) {
> > > + __xdp_return(xdpf->data, &xdpf->mem, false);
> > > + return;
> > > + }
> > > +
> > > + rcu_read_lock();
> > > +
> > > + xa = bq->xa;
> > > + if (unlikely(!xa || mem->id != xa->mem.id)) {
> > > + nxa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> > > + if (unlikely(!xa)) {
> > > + bq->count = 0;
> > > + bq->xa = nxa;
> > > + xa = nxa;
> > > + }
> > > + }
> > > +
> > > + if (mem->id != xa->mem.id || bq->count == XDP_BULK_QUEUE_SIZE)
> > > + xdp_flush_frame_bulk(bq);
> > > +
> > > + bq->q[bq->count++] = xdpf->data;
> > > + if (mem->id != xa->mem.id)
> > > + bq->xa = nxa;
> > > +
> > > + rcu_read_unlock();
> > > +}
> > > +EXPORT_SYMBOL_GPL(xdp_return_frame_bulk);
> > 
> > We (Ilias my co-maintainer and I) think above code is hard to read and
> > understand (as a reader you need to keep too many cases in your head).
> > 
> > I think we both have proposals to improve this, here is mine:
> > 
> > /* Defers return when frame belongs to same mem.id as previous frame */
> > void xdp_return_frame_bulk(struct xdp_frame *xdpf,
> >struct xdp_frame_bulk *bq)
> > {
> > struct xdp_mem_info *mem = &xdpf->mem;
> > struct xdp_mem_allocator *xa;
> > 
> > if (mem->type != MEM_TYPE_PAGE_POOL) {
> > __xdp_return(xdpf->data, &xdpf->mem, false);
> > return;
> > }
> > 
> > rcu_read_lock();
> > 
> > xa = bq->xa;
> > if (unlikely(!xa)) {
> > xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> > bq->count = 0;
> > bq->xa = xa;
> > }
> > 
> > if (bq->count == XDP_BULK_QUEUE_SIZE)
> > xdp_flush_frame_bulk(bq);
> > 
> > if (mem->id != xa->mem.id) {
> > xdp_flush_frame_bulk(bq);
> > bq->xa = rhashtable_lookup(mem_id_ht, &mem->id, 
> > mem_id_rht_params);
> > }
> > 
> > bq->q[bq->count++] = xdpf->data;
> > 
> > rcu_read_unlock();
> > }
> > 
> > Please review for correctness, and also for readability.
> 
> the code seems fine to me (and even easier to read :)).
> I will update v2 using this approach. Thx.
+1 this is close to what we discussed this morning and it detangles 1 more 
'weird' 
if case 


Thanks
/Ilias
> 
> Regards,
> Lorenzo
> 
> > 
> > -- 
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> >

Re: [PATCH net-next 2/4] net: page_pool: add bulk support for ptr_ring

2020-10-29 Thread Ilias Apalodimas

Hi Lorenzo, 

On Tue, Oct 27, 2020 at 08:04:08PM +0100, Lorenzo Bianconi wrote:
> Introduce the capability to batch page_pool ptr_ring refill since it is
> usually run inside the driver NAPI tx completion loop.
> 
> Suggested-by: Jesper Dangaard Brouer 
> Signed-off-by: Lorenzo Bianconi 
> ---
>  include/net/page_pool.h | 26 ++
>  net/core/page_pool.c| 33 +
>  net/core/xdp.c  |  9 ++---
>  3 files changed, 61 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 81d7773f96cd..b5b195305346 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -152,6 +152,8 @@ struct page_pool *page_pool_create(const struct 
> page_pool_params *params);
>  void page_pool_destroy(struct page_pool *pool);
>  void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void 
> *));
>  void page_pool_release_page(struct page_pool *pool, struct page *page);
> +void page_pool_put_page_bulk(struct page_pool *pool, void **data,
> +  int count);
>  #else
>  static inline void page_pool_destroy(struct page_pool *pool)
>  {
> @@ -165,6 +167,11 @@ static inline void page_pool_release_page(struct 
> page_pool *pool,
> struct page *page)
>  {
>  }
> +
> +static inline void page_pool_put_page_bulk(struct page_pool *pool, void 
> **data,
> +int count)
> +{
> +}
>  #endif
>  
>  void page_pool_put_page(struct page_pool *pool, struct page *page,
> @@ -215,4 +222,23 @@ static inline void page_pool_nid_changed(struct 
> page_pool *pool, int new_nid)
>   if (unlikely(pool->p.nid != new_nid))
>   page_pool_update_nid(pool, new_nid);
>  }
> +
> +static inline void page_pool_ring_lock(struct page_pool *pool)
> + __acquires(&pool->ring.producer_lock)
> +{
> + if (in_serving_softirq())
> + spin_lock(&pool->ring.producer_lock);
> + else
> + spin_lock_bh(&pool->ring.producer_lock);
> +}
> +
> +static inline void page_pool_ring_unlock(struct page_pool *pool)
> + __releases(&pool->ring.producer_lock)
> +{
> + if (in_serving_softirq())
> + spin_unlock(&pool->ring.producer_lock);
> + else
> + spin_unlock_bh(&pool->ring.producer_lock);
> +}
> +
>  #endif /* _NET_PAGE_POOL_H */
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ef98372facf6..84fb21f8865e 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -11,6 +11,8 @@
>  #include 
>  
>  #include 
> +#include 
> +
>  #include 
>  #include 
>  #include 
> @@ -408,6 +410,37 @@ void page_pool_put_page(struct page_pool *pool, struct 
> page *page,
>  }
>  EXPORT_SYMBOL(page_pool_put_page);
>  
> +void page_pool_put_page_bulk(struct page_pool *pool, void **data,
> +  int count)
> +{
> + struct page *page_ring[XDP_BULK_QUEUE_SIZE];
> + int i, len = 0;
> +
> + for (i = 0; i < count; i++) {
> + struct page *page = virt_to_head_page(data[i]);
> +
> + if (unlikely(page_ref_count(page) != 1 ||
> +  !pool_page_reusable(pool, page))) {
> + page_pool_release_page(pool, page);

Mind switching this similarly to how page_pool_put_page() is using it?
unlikely -> likely and remove the !

> + put_page(page);
> + continue;
> + }
> +
> + if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> + page_pool_dma_sync_for_device(pool, page, -1);
> +
> + page_ring[len++] = page;
> + }
> +
> + page_pool_ring_lock(pool);
> + for (i = 0; i < len; i++) {
> + if (__ptr_ring_produce(&pool->ring, page_ring[i]))
> + page_pool_return_page(pool, page_ring[i]);
> + }
> + page_pool_ring_unlock(pool);
> +}
> +EXPORT_SYMBOL(page_pool_put_page_bulk);
> +
>  static void page_pool_empty_ring(struct page_pool *pool)
>  {
>   struct page *page;
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 93eabd789246..9f9a8d14df38 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -383,16 +383,11 @@ EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);
>  void xdp_flush_frame_bulk(struct xdp_frame_bulk *bq)
>  {
>   struct xdp_mem_allocator *xa = bq->xa;
> - int i;
>  
> - if (unlikely(!xa))
> + if (unlikely(!xa || !bq->count))
>   return;
>  
> - for (i = 0; i < bq->count; i++) {
> - struct page *page = virt_to_head_page(bq->q[i]);
> -
> - page_pool_put_full_page(xa->page_pool, page, false);
> - }
> + page_pool_put_page_bulk(xa->page_pool, bq->q, bq->count);
>   bq->count = 0;
>  }
>  EXPORT_SYMBOL_GPL(xdp_flush_frame_bulk);
> -- 
> 2.26.2
> 

Cheers
/Ilias

Re: [PATCH net-next 2/4] net: page_pool: add bulk support for ptr_ring

2020-10-29 Thread Ilias Apalodimas

On Tue, Oct 27, 2020 at 08:04:08PM +0100, Lorenzo Bianconi wrote:
> Introduce the capability to batch page_pool ptr_ring refill since it is
> usually run inside the driver NAPI tx completion loop.
> 
> Suggested-by: Jesper Dangaard Brouer 
> Signed-off-by: Lorenzo Bianconi 
> ---
>  include/net/page_pool.h | 26 ++
>  net/core/page_pool.c| 33 +
>  net/core/xdp.c  |  9 ++---
>  3 files changed, 61 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 81d7773f96cd..b5b195305346 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -152,6 +152,8 @@ struct page_pool *page_pool_create(const struct 
> page_pool_params *params);
>  void page_pool_destroy(struct page_pool *pool);
>  void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void 
> *));
>  void page_pool_release_page(struct page_pool *pool, struct page *page);
> +void page_pool_put_page_bulk(struct page_pool *pool, void **data,
> +  int count);
>  #else
>  static inline void page_pool_destroy(struct page_pool *pool)
>  {
> @@ -165,6 +167,11 @@ static inline void page_pool_release_page(struct 
> page_pool *pool,
> struct page *page)
>  {
>  }
> +
> +static inline void page_pool_put_page_bulk(struct page_pool *pool, void 
> **data,
> +int count)
> +{
> +}
>  #endif
>  
>  void page_pool_put_page(struct page_pool *pool, struct page *page,
> @@ -215,4 +222,23 @@ static inline void page_pool_nid_changed(struct 
> page_pool *pool, int new_nid)
>   if (unlikely(pool->p.nid != new_nid))
>   page_pool_update_nid(pool, new_nid);
>  }
> +
> +static inline void page_pool_ring_lock(struct page_pool *pool)
> + __acquires(&pool->ring.producer_lock)
> +{
> + if (in_serving_softirq())
> + spin_lock(&pool->ring.producer_lock);
> + else
> + spin_lock_bh(&pool->ring.producer_lock);
> +}
> +
> +static inline void page_pool_ring_unlock(struct page_pool *pool)
> + __releases(&pool->ring.producer_lock)
> +{
> + if (in_serving_softirq())
> + spin_unlock(&pool->ring.producer_lock);
> + else
> + spin_unlock_bh(&pool->ring.producer_lock);
> +}
> +
>  #endif /* _NET_PAGE_POOL_H */
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ef98372facf6..84fb21f8865e 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -11,6 +11,8 @@
>  #include 
>  
>  #include 
> +#include 
> +
>  #include 
>  #include 
>  #include 
> @@ -408,6 +410,37 @@ void page_pool_put_page(struct page_pool *pool, struct 
> page *page,
>  }
>  EXPORT_SYMBOL(page_pool_put_page);
>  
> +void page_pool_put_page_bulk(struct page_pool *pool, void **data,
> +  int count)
> +{
> + struct page *page_ring[XDP_BULK_QUEUE_SIZE];
> + int i, len = 0;
> +
> + for (i = 0; i < count; i++) {
> + struct page *page = virt_to_head_page(data[i]);
> +
> + if (unlikely(page_ref_count(page) != 1 ||
> +  !pool_page_reusable(pool, page))) {
> + page_pool_release_page(pool, page);
> + put_page(page);
> + continue;
> + }
> +
> + if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> + page_pool_dma_sync_for_device(pool, page, -1);
> +
> + page_ring[len++] = page;
> + }
> +
> + page_pool_ring_lock(pool);
> + for (i = 0; i < len; i++) {
> + if (__ptr_ring_produce(&pool->ring, page_ring[i]))
> + page_pool_return_page(pool, page_ring[i]);

Can we add a comment here on why the explicit spinlock needs to protect 
page_pool_return_page() as well instead of just using ptr_ring_produce()?

> + }
> + page_pool_ring_unlock(pool);
> +}
> +EXPORT_SYMBOL(page_pool_put_page_bulk);
> +
>  static void page_pool_empty_ring(struct page_pool *pool)
>  {
>   struct page *page;
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 93eabd789246..9f9a8d14df38 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -383,16 +383,11 @@ EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);
>  void xdp_flush_frame_bulk(struct xdp_frame_bulk *bq)
>  {
>   struct xdp_mem_allocator *xa = bq->xa;
> - int i;
>  
> - if (unlikely(!xa))
> + if (unlikely(!xa || !bq->count))
>   return;
>  
> - for (i = 0; i < bq->count; i++) {
> - struct page *page = virt_to_head_page(bq->q[i]);
> -
> - page_pool_put_full_page(xa->page_pool, page, false);
> - }
> + page_pool_put_page_bulk(xa->page_pool, bq->q, bq->count);
>   bq->count = 0;
>  }
>  EXPORT_SYMBOL_GPL(xdp_flush_frame_bulk);
> -- 
> 2.26.2
> 

Thanks
/Ilias

Re: [PATCH net-next 1/4] net: xdp: introduce bulking for xdp tx return path

2020-10-28 Thread Ilias Apalodimas

On Wed, Oct 28, 2020 at 11:23:04AM +0100, Lorenzo Bianconi wrote:
> > Hi Lorenzo,
> 
> Hi Ilias,
> 
> thx for the review.
> 
> > 
> > On Tue, Oct 27, 2020 at 08:04:07PM +0100, Lorenzo Bianconi wrote:
> 
> [...]
> 
> > > +void xdp_return_frame_bulk(struct xdp_frame *xdpf,
> > > +struct xdp_frame_bulk *bq)
> > > +{
> > > + struct xdp_mem_info *mem = &xdpf->mem;
> > > + struct xdp_mem_allocator *xa, *nxa;
> > > +
> > > + if (mem->type != MEM_TYPE_PAGE_POOL) {
> > > + __xdp_return(xdpf->data, &xdpf->mem, false);
> > > + return;
> > > + }
> > > +
> > > + rcu_read_lock();
> > > +
> > > + xa = bq->xa;
> > > + if (unlikely(!xa || mem->id != xa->mem.id)) {
> > 
> > Why is this marked as unlikely? The driver passes it as NULL. Should 
> > unlikely be
> > checked on both xa and the comparison?
> 
> xa is NULL only for the first xdp_frame in the burst while it is set for
> subsequent ones. Do you think it is better to remove it?

Ah correct, missed the general context of the driver this runs in.

> 
> > 
> > > + nxa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> > 
> > Is there a chance nxa can be NULL?
> 
> I do not think so since the page_pool is not destroyed while there are
> in-flight pages, right?

I think so but I am not 100% sure. I'll apply the patch and have a closer look

Cheers
/Ilias

Re: [PATCH net-next 1/4] net: xdp: introduce bulking for xdp tx return path

2020-10-28 Thread Ilias Apalodimas

Hi Lorenzo,

On Tue, Oct 27, 2020 at 08:04:07PM +0100, Lorenzo Bianconi wrote:
> Introduce bulking capability in xdp tx return path (XDP_REDIRECT).
> xdp_return_frame is usually run inside the driver NAPI tx completion
> loop so it is possible batch it.
> Current implementation considers only page_pool memory model.
> Convert mvneta driver to xdp_return_frame_bulk APIs.
> 
> Suggested-by: Jesper Dangaard Brouer 
> Signed-off-by: Lorenzo Bianconi 
> ---
>  drivers/net/ethernet/marvell/mvneta.c |  5 ++-
>  include/net/xdp.h |  9 +
>  net/core/xdp.c| 51 +++
>  3 files changed, 64 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> b/drivers/net/ethernet/marvell/mvneta.c
> index 54b0bf574c05..43ab8a73900e 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -1834,8 +1834,10 @@ static void mvneta_txq_bufs_free(struct mvneta_port 
> *pp,
>struct netdev_queue *nq, bool napi)
>  {
>   unsigned int bytes_compl = 0, pkts_compl = 0;
> + struct xdp_frame_bulk bq;
>   int i;
>  
> + bq.xa = NULL;
>   for (i = 0; i < num; i++) {
>   struct mvneta_tx_buf *buf = &txq->buf[txq->txq_get_index];
>   struct mvneta_tx_desc *tx_desc = txq->descs +
> @@ -1857,9 +1859,10 @@ static void mvneta_txq_bufs_free(struct mvneta_port 
> *pp,
>   if (napi && buf->type == MVNETA_TYPE_XDP_TX)
>   xdp_return_frame_rx_napi(buf->xdpf);
>   else
> - xdp_return_frame(buf->xdpf);
> + xdp_return_frame_bulk(buf->xdpf, &bq);
>   }
>   }
> + xdp_flush_frame_bulk(&bq);
>  
>   netdev_tx_completed_queue(nq, pkts_compl, bytes_compl);
>  }
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 3814fb631d52..9567110845ef 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -104,6 +104,12 @@ struct xdp_frame {
>   struct net_device *dev_rx; /* used by cpumap */
>  };
>  
> +#define XDP_BULK_QUEUE_SIZE  16
> +struct xdp_frame_bulk {
> + void *q[XDP_BULK_QUEUE_SIZE];
> + int count;
> + void *xa;
> +};
>  
>  static inline struct skb_shared_info *
>  xdp_get_shared_info_from_frame(struct xdp_frame *frame)
> @@ -194,6 +200,9 @@ struct xdp_frame *xdp_convert_buff_to_frame(struct 
> xdp_buff *xdp)
>  void xdp_return_frame(struct xdp_frame *xdpf);
>  void xdp_return_frame_rx_napi(struct xdp_frame *xdpf);
>  void xdp_return_buff(struct xdp_buff *xdp);
> +void xdp_flush_frame_bulk(struct xdp_frame_bulk *bq);
> +void xdp_return_frame_bulk(struct xdp_frame *xdpf,
> +struct xdp_frame_bulk *bq);
>  
>  /* When sending xdp_frame into the network stack, then there is no
>   * return point callback, which is needed to release e.g. DMA-mapping
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 48aba933a5a8..93eabd789246 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -380,6 +380,57 @@ void xdp_return_frame_rx_napi(struct xdp_frame *xdpf)
>  }
>  EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);
>  
> +void xdp_flush_frame_bulk(struct xdp_frame_bulk *bq)
> +{
> + struct xdp_mem_allocator *xa = bq->xa;
> + int i;
> +
> + if (unlikely(!xa))
> + return;
> +
> + for (i = 0; i < bq->count; i++) {
> + struct page *page = virt_to_head_page(bq->q[i]);
> +
> + page_pool_put_full_page(xa->page_pool, page, false);
> + }
> + bq->count = 0;
> +}
> +EXPORT_SYMBOL_GPL(xdp_flush_frame_bulk);
> +
> +void xdp_return_frame_bulk(struct xdp_frame *xdpf,
> +struct xdp_frame_bulk *bq)
> +{
> + struct xdp_mem_info *mem = &xdpf->mem;
> + struct xdp_mem_allocator *xa, *nxa;
> +
> + if (mem->type != MEM_TYPE_PAGE_POOL) {
> + __xdp_return(xdpf->data, &xdpf->mem, false);
> + return;
> + }
> +
> + rcu_read_lock();
> +
> + xa = bq->xa;
> + if (unlikely(!xa || mem->id != xa->mem.id)) {

Why is this marked as unlikely? The driver passes it as NULL. Should unlikely be
checked on both xa and the comparison?

> + nxa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);

Is there a chance nxa can be NULL?

> + if (unlikely(!xa)) {

Same here, driver passes it as NULL

> + bq->count = 0;
> + bq->xa = nxa;
> + xa = nxa;
> + }
> + }
> +
> + if (mem->id != xa->mem.id || bq->count == XDP_BULK_QUEUE_SIZE)
> + xdp_flush_frame_bulk(bq);
> +
> + bq->q[bq->count++] = xdpf->data;
> + if (mem->id != xa->mem.id)
> + bq->xa = nxa;
> +
> + rcu_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(xdp_return_frame_bulk);
> +
>  void xdp_return_buff(struct xdp_buff *xdp)
>  {
>   __xdp_return(xdp->data, &xdp->rxq->mem

Re: [PATCH net] netsec: ignore 'phy-mode' device property on ACPI systems

2020-10-20 Thread Ilias Apalodimas

Hi Ard, 

On Mon, Oct 19, 2020 at 08:30:45AM +0200, Ard Biesheuvel wrote:
> On Sun, 18 Oct 2020 at 22:32, Ilias Apalodimas
>  wrote:
> >
> > On Sun, Oct 18, 2020 at 07:52:18PM +0200, Andrew Lunn wrote:
> > > > --- a/Documentation/devicetree/bindings/net/socionext-netsec.txt
> > > > +++ b/Documentation/devicetree/bindings/net/socionext-netsec.txt
> > > > @@ -30,7 +30,9 @@ Optional properties: (See ethernet.txt file in the 
> > > > same directory)
> > > >  - max-frame-size: See ethernet.txt in the same directory.
> > > >
> > > >  The MAC address will be determined using the optional properties
> > > > -defined in ethernet.txt.
> > > > +defined in ethernet.txt. The 'phy-mode' property is required, but may
> > > > +be set to the empty string if the PHY configuration is programmed by
> > > > +the firmware or set by hardware straps, and needs to be preserved.
> > >
> > > In general, phy-mode is not mandatory. of_get_phy_mode() does the
> > > right thing if it is not found, it sets &priv->phy_interface to
> > > PHY_INTERFACE_MODE_NA, but returns -ENODEV. Also, it does not break
> > > backwards compatibility to convert a mandatory property to
> > > optional. So you could just do
> > >
> > >   of_get_phy_mode(pdev->dev.of_node, &priv->phy_interface);
> > >
> > > skip all the error checking, and document it as optional.
> >
> > Why ?
> > The patch as is will not affect systems built on any firmware 
> > implementations
> > that use ACPI and somehow configure the hardware.
> > Although the only firmware implementations I am aware of on upsteream are 
> > based
> > on EDK2, I prefer the explicit error as is now, in case a firmware does on
> > initialize the PHY properly (and is using a DT).
> >
> 
> We will also lose the ability to report bogus values for phy-mode this
> way, so I think we should stick with the check.

I hope Andrew is fine with the current changes

Reviewed-by: Ilias Apalodimas

Re: [PATCH net] netsec: ignore 'phy-mode' device property on ACPI systems

2020-10-18 Thread Ilias Apalodimas

On Sun, Oct 18, 2020 at 07:52:18PM +0200, Andrew Lunn wrote:
> > --- a/Documentation/devicetree/bindings/net/socionext-netsec.txt
> > +++ b/Documentation/devicetree/bindings/net/socionext-netsec.txt
> > @@ -30,7 +30,9 @@ Optional properties: (See ethernet.txt file in the same 
> > directory)
> >  - max-frame-size: See ethernet.txt in the same directory.
> >  
> >  The MAC address will be determined using the optional properties
> > -defined in ethernet.txt.
> > +defined in ethernet.txt. The 'phy-mode' property is required, but may
> > +be set to the empty string if the PHY configuration is programmed by
> > +the firmware or set by hardware straps, and needs to be preserved.
> 
> In general, phy-mode is not mandatory. of_get_phy_mode() does the
> right thing if it is not found, it sets &priv->phy_interface to
> PHY_INTERFACE_MODE_NA, but returns -ENODEV. Also, it does not break
> backwards compatibility to convert a mandatory property to
> optional. So you could just do
> 
>   of_get_phy_mode(pdev->dev.of_node, &priv->phy_interface);
> 
> skip all the error checking, and document it as optional.

Why ?
The patch as is will not affect systems built on any firmware implementations 
that use ACPI and somehow configure the hardware. 
Although the only firmware implementations I am aware of on upsteream are based
on EDK2, I prefer the explicit error as is now, in case a firmware does on 
initialize the PHY properly (and is using a DT).

Cheers
/Ilias

> 
>  Andrew

Re: realtek PHY commit bbc4d71d63549 causes regression

2020-10-17 Thread Ilias Apalodimas

Hi Ard,

[...]
> > > > You can also use '' as the phy-mode, which results in
> > > > PHY_INTERFACE_MODE_NA, which effectively means, don't touch the PHY
> > > > mode, something else has already set it up. This might actually be the
> > > > correct way to go for ACPI. In the DT world, we tend to assume the
> > > > bootloader has done the absolute minimum and Linux should configure
> > > > everything. The ACPI takes the opposite view, the firmware will do the
> > > > basic hardware configuration, and Linux should not touch it, or ask
> > > > ACPI to modify it.
> > > >
> > >
> > > Indeed, the firmware should have set this up.
> >
> > Would EDK2 take care of the RGMII Rx/Tx delays even when configured to
> > use a DT instead of ACPI?
> >
>
> Yes. The network driver has no awareness whatsoever which h/w
> description is being provided to the OS.
>
>
> > > This would mean we could > do this in the driver: it currently uses
> > >
> > > priv->phy_interface = device_get_phy_mode(&pdev->dev);
> > >
> > > Can we just assign that to PHY_INTERFACE_MODE_NA instead?
> >
>
> I have tried this, and it seems to fix the issue. I will send out a
> patch against the netsec driver.

Great thanks!

Cheers
/Ilias

Re: realtek PHY commit bbc4d71d63549 causes regression

2020-10-17 Thread Ilias Apalodimas

Hi Ard, 

On Sat, Oct 17, 2020 at 05:18:16PM +0200, Ard Biesheuvel wrote:
> On Sat, 17 Oct 2020 at 17:11, Andrew Lunn  wrote:
> >
> > On Sat, Oct 17, 2020 at 04:46:23PM +0200, Ard Biesheuvel wrote:
> > > On Sat, 17 Oct 2020 at 16:44, Andrew Lunn  wrote:
> > > >
> > > > On Sat, Oct 17, 2020 at 04:20:36PM +0200, Ard Biesheuvel wrote:
> > > > > Hello all,
> > > > >
> > > > > I just upgraded my arm64 SynQuacer box to 5.8.16 and lost all network
> > > > > connectivity.
> > > >
> > > > Hi Ard
> > > >
> > > > Please could you point me at the DT files.
> > > >
> > > > > This box has a on-SoC socionext 'netsec' network controller wired to
> > > > > a Realtek 80211e PHY, and this was working without problems until
> > > > > the following commit was merged
> > > >
> > > > It could be this fix has uncovered a bug in the DT file. Before this
> > > > fix, if there is an phy-mode property in DT, it could of been ignored.
> > > > Now the phy-handle property is correctly implemented. So it could be
> > > > the DT has the wrong value, e.g. it has rgmii-rxid when maybe it
> > > > should have rgmii-id.
> > > >
> > >
> > > This is an ACPI system. The phy-mode device property is set to 'rgmii'
> >
> > Hi Ard
> >
> > Please try rgmii-id.
> >
> > Also, do you have the schematic? Can you see if there are any
> > strapping resistors? It could be, there are strapping resistors to put
> > it into rgmii-id. Now that the phy-mode properties is respected, the
> > reset defaults are being over-written to rgmii, which breaks the link.
> > Or the bootloader has already set the PHY mode to rgmii-id.
> >
> > You can also use '' as the phy-mode, which results in
> > PHY_INTERFACE_MODE_NA, which effectively means, don't touch the PHY
> > mode, something else has already set it up. This might actually be the
> > correct way to go for ACPI. In the DT world, we tend to assume the
> > bootloader has done the absolute minimum and Linux should configure
> > everything. The ACPI takes the opposite view, the firmware will do the
> > basic hardware configuration, and Linux should not touch it, or ask
> > ACPI to modify it.
> >
> 
> Indeed, the firmware should have set this up.

Would EDK2 take care of the RGMII Rx/Tx delays even when configured to 
use a DT instead of ACPI?

> This would mean we could > do this in the driver: it currently uses
> 
> priv->phy_interface = device_get_phy_mode(&pdev->dev);
> 
> Can we just assign that to PHY_INTERFACE_MODE_NA instead?

Thanks
/Ilias

[PATCH v3] arm64: bpf: Fix branch offset in JIT

2020-09-17 Thread Ilias Apalodimas

Running the eBPF test_verifier leads to random errors looking like this:

[ 6525.735488] Unexpected kernel BRK exception at EL1
[ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
[ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 dns_resolver 
fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk crypto_simd cryptd 
aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce sha256_arm64 sha1_ce evdev 
efivars efivarfs ip_tables x_tables autofs4 btrfs blake2b_generic xor xor_neon 
zstd_compress raid6_pq libcrc32c crc32c_generic ahci xhci_pci libahci xhci_hcd 
igb libata i2c_algo_bit nvme realtek usbcore nvme_core scsi_mod t10_pi netsec 
mdio_devres of_mdio gpio_keys fixed_phy libphy gpio_mb86s7x
[ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: GW 
5.9.0-rc1+ #47
[ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS 
build #1 Jun  6 2020
[ 6525.804812] pstate: 2005 (nzCv daif -PAN -UAO BTYPE=--)
[ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.820832] sp : 8000130cbb80
[ 6525.824141] x29: 8000130cbbb0 x28: 
[ 6525.829451] x27: 05ef6fcbf39b x26: 
[ 6525.834759] x25: 8000130cbb80 x24: 800011dc7038
[ 6525.840067] x23: 8000130cbd00 x22: 0008f624d080
[ 6525.845375] x21: 0001 x20: 800011dc7000
[ 6525.850682] x19:  x18: 
[ 6525.855990] x17:  x16: 
[ 6525.861298] x15:  x14: 
[ 6525.866606] x13:  x12: 
[ 6525.871913] x11: 0001 x10: 800a660c
[ 6525.877220] x9 : 800010951810 x8 : 8000130cbc38
[ 6525.882528] x7 :  x6 : 009864cfa881
[ 6525.887836] x5 : 00ff x4 : 002880ba1a0b3e9f
[ 6525.893144] x3 : 0018 x2 : 800a4374
[ 6525.898452] x1 : 000a x0 : 0009
[ 6525.903760] Call trace:
[ 6525.906202]  bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.911076]  bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.915957]  bpf_dispatcher_xdp_func+0x14/0x20
[ 6525.920398]  bpf_test_run+0x70/0x1b0
[ 6525.923969]  bpf_prog_test_run_xdp+0xec/0x190
[ 6525.928326]  __do_sys_bpf+0xc88/0x1b28
[ 6525.932072]  __arm64_sys_bpf+0x24/0x30
[ 6525.935820]  el0_svc_common.constprop.0+0x70/0x168
[ 6525.940607]  do_el0_svc+0x28/0x88
[ 6525.943920]  el0_sync_handler+0x88/0x190
[ 6525.947838]  el0_sync+0x140/0x180
[ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---

The reason is the offset[] creation and later usage, while building
the eBPF body. The code currently omits the first instruction, since
build_insn() will increase our ctx->idx before saving it.
That was fine up until bounded eBPF loops were introduced. After that
introduction, offset[0] must be the offset of the end of prologue which
is the start of the 1st insn while, offset[n] holds the
offset of the end of n-th insn.

When "taken loop with back jump to 1st insn" test runs, it will
eventually call bpf2a64_offset(-1, 2, ctx). Since negative indexing is
permitted, the current outcome depends on the value stored in
ctx->offset[-1], which has nothing to do with our array.
If the value happens to be 0 the tests will work. If not this error
triggers.

commit 7c2e988f400e ("bpf: fix x64 JIT code generation for jmp to 1st insn")
fixed an indentical bug on x86 when eBPF bounded loops were introduced.

So let's fix it by creating the ctx->offset[] differently. Track the
beginning of instruction and account for the extra instruction while
calculating the arm instruction offsets.

Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
Reported-by: Naresh Kamboju 
Reported-by: Jiri Olsa 
Co-developed-by: Jean-Philippe Brucker 
Signed-off-by: Jean-Philippe Brucker 
Co-developed-by: Yauheni Kaliuta 
Signed-off-by: Yauheni Kaliuta 
Signed-off-by: Ilias Apalodimas 
---
Changes since v1: 
 - Added Co-developed-by, Reported-by and Fixes tags correctly
 - Describe the expected context of ctx->offset[] in comments
Changes since v2:
 - Drop the change of behavior for 16-byte eBPF instructions. This won't
 currently cause any problems and can go in on a different patch
 - simplify bpf2a64_offset()

 arch/arm64/net/bpf_jit_comp.c | 43 +--
 1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index f8912e45be7a..ef9f1d5e989d 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -143,14 +143,17 @@ static inline void emit_addr_mov_i64(const int reg, const 
u64 val,
}
 }
 
-static inline int bpf2a64_offset(int bpf_to, int bpf_from,
+static inline int bpf2a64_offset(int bpf_insn, int off,

Re: [PATCH v2] arm64: bpf: Fix branch offset in JIT

2020-09-16 Thread Ilias Apalodimas

Hi Will, 

On Tue, Sep 15, 2020 at 02:11:03PM +0100, Will Deacon wrote:
[...]
> > continue;
> > }
> > -   if (ctx->image == NULL)
> > -   ctx->offset[i] = ctx->idx;
> > if (ret)
> > return ret;
> > }
> > +   if (ctx->image == NULL)
> > +   ctx->offset[i] = ctx->idx;
> 
> I think it would be cleared to set ctx->offset[0] before the for loop (with
> a comment about what it is) and then change the for loop to iterate from 1
> all the way to prog->len.

On a second thought while trying to code this, I'd prefer leaving it as is. 
First of all we'll have to increase ctx->idx while adding ctx->offset[0] and 
more importantly, I don't think that's a 'special' case. 
It's still the same thing i.e the start of the 1st instruction (which happens 
to be the end of prologue), the next one will be the start of the second 
instruction etc etc. 

I don't mind changing if you feel strongly about it, but I think it makese sense
as-is.

Thanks
/Ilias
> 
> Will

Re: [PATCH v2] arm64: bpf: Fix branch offset in JIT

2020-09-15 Thread Ilias Apalodimas

On Tue, Sep 15, 2020 at 02:11:03PM +0100, Will Deacon wrote:
> Hi Ilias,
> 
> On Mon, Sep 14, 2020 at 07:03:55PM +0300, Ilias Apalodimas wrote:
> > Running the eBPF test_verifier leads to random errors looking like this:
> > 
> > [ 6525.735488] Unexpected kernel BRK exception at EL1
> > [ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
> 
> Does this happen because we poison the BPF memory with BRK instructions?
> Maybe we should look at using a special immediate so we can detect this,
> rather than end up in the ptrace handler.

As discussed offline this is what aarch64_insn_gen_branch_imm() will return for
offsets > 128M and yes replacing the handler with a more suitable message would 
be good.

> 
> > diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> > index f8912e45be7a..0974e58c 100644
> > --- a/arch/arm64/net/bpf_jit_comp.c
> > +++ b/arch/arm64/net/bpf_jit_comp.c
> > @@ -143,9 +143,13 @@ static inline void emit_addr_mov_i64(const int reg, 
> > const u64 val,
> > }
> >  }
> >  
> > -static inline int bpf2a64_offset(int bpf_to, int bpf_from,
> > +static inline int bpf2a64_offset(int bpf_insn, int off,
> >  const struct jit_ctx *ctx)
> >  {
> > +   /* arm64 offset is relative to the branch instruction */
> > +   int bpf_from = bpf_insn + 1;
> > +   /* BPF JMP offset is relative to the next instruction */
> > +   int bpf_to = bpf_insn + off + 1;
> > int to = ctx->offset[bpf_to];
> > /* -1 to account for the Branch instruction */
> > int from = ctx->offset[bpf_from] - 1;
> 
> I think this is a bit confusing with all the variables. How about just
> doing:
> 
>   /* BPF JMP offset is relative to the next BPF instruction */
>   bpf_insn++;
> 
>   /*
>* Whereas arm64 branch instructions encode the offset from the
>* branch itself, so we must subtract 1 from the instruction offset.
>*/
>   return ctx->offset[bpf_insn + off] - ctx->offset[bpf_insn] - 1;
> 

Sure

> > @@ -642,7 +646,7 @@ static int build_insn(const struct bpf_insn *insn, 
> > struct jit_ctx *ctx,
> >  
> > /* JUMP off */
> > case BPF_JMP | BPF_JA:
> > -   jmp_offset = bpf2a64_offset(i + off, i, ctx);
> > +   jmp_offset = bpf2a64_offset(i, off, ctx);
> > check_imm26(jmp_offset);
> > emit(A64_B(jmp_offset), ctx);
> > break;
> > @@ -669,7 +673,7 @@ static int build_insn(const struct bpf_insn *insn, 
> > struct jit_ctx *ctx,
> > case BPF_JMP32 | BPF_JSLE | BPF_X:
> > emit(A64_CMP(is64, dst, src), ctx);
> >  emit_cond_jmp:
> > -   jmp_offset = bpf2a64_offset(i + off, i, ctx);
> > +   jmp_offset = bpf2a64_offset(i, off, ctx);
> > check_imm19(jmp_offset);
> > switch (BPF_OP(code)) {
> > case BPF_JEQ:
> > @@ -912,18 +916,26 @@ static int build_body(struct jit_ctx *ctx, bool 
> > extra_pass)
> > const struct bpf_insn *insn = &prog->insnsi[i];
> > int ret;
> >  
> > +   /*
> > +* offset[0] offset of the end of prologue, start of the
> > +* first insn.
> > +* offset[x] - offset of the end of x insn.
> 
> So does offset[1] point at the last arm64 instruction for the first BPF
> instruction, or does it point to the first arm64 instruction for the second
> BPF instruction?
> 

Right this isn't exactly a good comment. 
I'll change it to something like:

offset[0] - offset of the end of prologue, start of the 1st insn.
offset[1] - offset of the end of 1st insn.

> > +*/
> > +   if (ctx->image == NULL)
> > +   ctx->offset[i] = ctx->idx;
> > +
> > ret = build_insn(insn, ctx, extra_pass);
> > if (ret > 0) {
> > i++;
> > if (ctx->image == NULL)
> > -   ctx->offset[i] = ctx->idx;
> > +   ctx->offset[i] = ctx->offset[i - 1];
> 
> Does it matter that we set the offset for both halves of a 16-byte BPF
> instruction? I think that's a change in behaviour here.

Yes it is, but from reading around that's what I understood.
for 16-byte eBPF instructions both should point to the start of 
the corresponding jited arm64 instruction.
If I am horribly wrong about this, please shout.

> 
> > continue;
> > }
> > -   if (ctx->image == NULL)
> > -   ctx->offset[i] = ctx->idx;
> > if (ret)
> > return ret;
> > }
> > +   if (ctx->image == NULL)
> > +   ctx->offset[i] = ctx->idx;
> 
> I think it would be cleared to set ctx->offset[0] before the for loop (with
> a comment about what it is) and then change the for loop to iterate from 1
> all the way to prog->len.
> 

Sure

> Will

Thanks
/Ilias

Re: [PATCH v2] arm64: bpf: Fix branch offset in JIT

2020-09-15 Thread Ilias Apalodimas

Hi Will, 

On Tue, Sep 15, 2020 at 03:17:08PM +0100, Will Deacon wrote:
> On Tue, Sep 15, 2020 at 04:53:44PM +0300, Ilias Apalodimas wrote:
> > On Tue, Sep 15, 2020 at 02:11:03PM +0100, Will Deacon wrote:
> > > Hi Ilias,
> > > 
> > > On Mon, Sep 14, 2020 at 07:03:55PM +0300, Ilias Apalodimas wrote:
> > > > Running the eBPF test_verifier leads to random errors looking like this:
> > > > 
> > > > [ 6525.735488] Unexpected kernel BRK exception at EL1
> > > > [ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
> > > 
> > > Does this happen because we poison the BPF memory with BRK instructions?
> > > Maybe we should look at using a special immediate so we can detect this,
> > > rather than end up in the ptrace handler.
> > 
> > As discussed offline this is what aarch64_insn_gen_branch_imm() will return 
> > for
> > offsets > 128M and yes replacing the handler with a more suitable message 
> > would 
> > be good.
> 
> Can you give the diff below a shot, please? Hopefully printing a more useful
> message will mean these things get triaged/debugged better in future.

[...]

The error print is going to be helpful imho. At least it will help
people notice something is wrong a lot faster than the previous one.


[  575.273203] BPF JIT generated an invalid instruction at 
bpf_prog_64e6f4ba80861823_F+0x2e4/0x9a4!
[  575.281996] Unexpected kernel BRK exception at EL1
[  575.286786] Internal error: BRK handler: f2000100 [#5] PREEMPT SMP
[  575.292965] Modules linked in: crct10dif_ce drm ip_tables x_tables ipv6 
btrfs blake2b_generic libcrc32c xor xor_neon zstd_compress raid6_pq nvme 
nvme_core realtek
[  575.307516] CPU: 21 PID: 11760 Comm: test_verifier Tainted: G  D W   
  5.9.0-rc3-01410-ged6d9b022813-dirty #1
[  575.318125] Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS 
build #1 Jun  6 2020
[  575.326825] pstate: 2005 (nzCv daif -PAN -UAO BTYPE=--)
[  575.332396] pc : bpf_prog_64e6f4ba80861823_F+0x2e4/0x9a4
[  575.337705] lr : bpf_prog_d3e125b76c96daac+0x40/0xdec
[  575.342752] sp : 8000144e3ba0
[  575.346061] x29: 8000144e3bd0 x28: 
[  575.351371] x27: 0085f19dc08d x26: 
[  575.356681] x25: 8000144e3ba0 x24: 800011fdf038
[  575.361991] x23: 8000144e3d20 x22: 0001
[  575.367301] x21: 800011fdf000 x20: 0009609d4740
[  575.372611] x19:  x18: 
[  575.377921] x17:  x16: 
[  575.383231] x15:  x14: 
[  575.388540] x13:  x12: 
[  575.393850] x11:  x10: 800bc65c
[  575.399160] x9 :  x8 : 8000144e3c58
[  575.404469] x7 :  x6 : 000dd7ae967a
[  575.409779] x5 : 00ff x4 : 0007fabd6992cf96
[  575.415088] x3 : 0018 x2 : 800ba214
[  575.420398] x1 : 000a x0 : 0009
[  575.425708] Call trace:
[  575.428152]  bpf_prog_64e6f4ba80861823_F+0x2e4/0x9a4
[  575.433114]  bpf_prog_d3e125b76c96daac+0x40/0xdec
[  575.437822]  bpf_dispatcher_xdp_func+0x10/0x1c
[  575.442265]  bpf_test_run+0x80/0x240
[  575.445838]  bpf_prog_test_run_xdp+0xe8/0x190
[  575.450196]  __do_sys_bpf+0x8e8/0x1b00
[  575.453943]  __arm64_sys_bpf+0x24/0x510
[  575.457780]  el0_svc_common.constprop.0+0x6c/0x170
[  575.462570]  do_el0_svc+0x24/0x90
[  575.465883]  el0_sync_handler+0x90/0x19c
[  575.469802]  el0_sync+0x158/0x180
[  575.473118] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[  575.479211] ---[ end trace 8cd54c7d5c0ffda4 ]---

Cheers
/Ilias

Re: [PATCH] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Ilias Apalodimas

On Mon, Sep 14, 2020 at 11:52:16AM -0700, Xi Wang wrote:
> On Mon, Sep 14, 2020 at 11:28 AM Ilias Apalodimas
>  wrote:
> > Even if that's true, is any reason at all why we should skip the first 
> > element
> > of the array, that's now needed since 7c2e988f400 to jump back to the first
> > instruction?
> > Introducing 2 extra if conditions and hotfix the array on the fly (and for
> > every future invocation of that), seems better to you?
> 
> My point was that there's no inherently correct/wrong way to construct
> offsets.  As Luke explained in his email, 1) there are two different
> strategies used by the JITs and 2) there are likely similar bugs
> beyond arm64.
> 
> Each strategy has pros and cons, and I'm fine with either.  I like the
> strategy used in your patch because it's more intuitive (offset[i] is
> the start of the emitted instructions for BPF instruction i, rather
> than the end), though the changes to the construction process are
> trickier.
> 

Well the arm64 was literally a 'save the idx before building the instruction',
and add another element on the array.  So it's not that trickier, especially
if we document it properly.

I haven't checked the rest of the architectures tbh (apart from x86). 
I assumed the tracking used in arm64 at that point, was a result of how 
eBPF worked before bounded loops were introduced. Maybe I was wrong.
It felt a bit more natural to track the beginning of the emitted 
instructions rather than the end.

> If we decide to patch the arm64 JIT the way you proposed, we should
> consider whether to change other JITs consistently.

I think this is a good idea. Following the code is not exactly a stroll in the
park, so we can at least make it consistent across architectures.

Thanks
/Ilias

Re: [PATCH] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Ilias Apalodimas

Hi Luke, 

On Mon, Sep 14, 2020 at 11:21:58AM -0700, Luke Nelson wrote:
> On Mon, Sep 14, 2020 at 11:08 AM Xi Wang  wrote:
> > I don't think there's some consistent semantics of "offsets" across
> > the JITs of different architectures (maybe it's good to clean that
> > up).  RV64 and RV32 JITs are doing something similar to arm64 with
> > respect to offsets.  CCing Björn and Luke.
> 
> As I understand it, there are two strategies JITs use to keep track of
> the ctx->offset table.
> 
> Some JITs (RV32, RV64, arm32, arm64 currently, x86-32) track the end
> of each instruction (e.g., ctx->offset[i] marks the beginning of
> instruction i + 1).
> This requires care to handle jumps to the first instruction to avoid
> using ctx->offset[-1]. The RV32 and RV64 JITs have special handling
> for this case,
> while the arm32, arm64, and x86-32 JITs appear not to. The arm32 and
> x32 probably need to be fixed for the same reason arm64 does.
> 
> The other strategy is for ctx->offset[i] to track the beginning of
> instruction i. The x86-64 JIT currently works this way.
> This can be easier to use (no need to special case -1) but looks to be
> trickier to construct. This patch changes the arm64 JIT to work this
> way.
> 
> I don't think either strategy is inherently better, both can be
> "correct" as long as the JIT uses ctx->offset in the right way.
> This might be a good opportunity to change the JITs to be consistent
> about this (especially if the arm32, arm64, and x32 JITs all need to
> be fixed anyways).
> Having all JITs agree on the meaning of ctx->offset could help future
> readers debug / understand the code, and could help to someday verify
> the
> ctx->offset construction.
> 
> Any thoughts?

The common strategy does make a lot of sense and yes, both patches will  works 
assuming the ctx->offset ends up being what the JIT engine expects it to be. 
As I mentioned earlier we did consider both, but ended up using the later, 
since as you said, removes the need for handling the special (-1) case.

Cheers
/Ilias

> 
> - Luke

Re: [PATCH] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Ilias Apalodimas

Hi Xi, 

On Mon, Sep 14, 2020 at 11:08:13AM -0700, Xi Wang wrote:
> On Mon, Sep 14, 2020 at 10:55 AM Ilias Apalodimas
>  wrote:
> > We've briefly discussed this approach with Yauheni while coming up with the
> > posted patch.
> > I think that contructing the array correctly in the first place is better.
> > Right now it might only be used in bpf2a64_offset() and 
> > bpf_prog_fill_jited_linfo()
> > but if we fixup the values on the fly in there, everyone that intends to 
> > use the
> > offset for any reason will have to account for the missing instruction.
> 
> I don't understand what you mean by "correctly."  What's your correctness 
> spec?

> 
> I don't think there's some consistent semantics of "offsets" across
> the JITs of different architectures (maybe it's good to clean that
> up).  RV64 and RV32 JITs are doing something similar to arm64 with
> respect to offsets.  CCing Björn and Luke.

Even if that's true, is any reason at all why we should skip the first element 
of the array, that's now needed since 7c2e988f400 to jump back to the first
instruction?
Introducing 2 extra if conditions and hotfix the array on the fly (and for 
every future invocation of that), seems better to you?

Cheers
/Ilias

Re: [PATCH] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Ilias Apalodimas

On Mon, Sep 14, 2020 at 10:47:33AM -0700, Xi Wang wrote:
> On Mon, Sep 14, 2020 at 10:03 AM Ilias Apalodimas
>  wrote:
> > Naresh from Linaro reported it during his tests on 5.8-rc1 as well [1].
> > I've included both Jiri and him on the v2 as reporters.
> >
> > [1] https://lkml.org/lkml/2020/8/11/58
> 
> I'm curious what you think of Luke's earlier patch to this bug:

We've briefly discussed this approach with Yauheni while coming up with the 
posted patch.
I think that contructing the array correctly in the first place is better. 
Right now it might only be used in bpf2a64_offset() and 
bpf_prog_fill_jited_linfo()
but if we fixup the values on the fly in there, everyone that intends to use the
offset for any reason will have to account for the missing instruction.

Cheers
/Ilias
> 
> https://lore.kernel.org/bpf/canowswkaj1hysw3bxbmg9_nd48fm0mxm5egdtmhu6ysec_g...@mail.gmail.com/T/#m4335b4005da0d60059ba96920fcaaecf2637042a

Re: [PATCH] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Ilias Apalodimas

On Mon, Sep 14, 2020 at 06:12:34PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Mon, 14 Sep 2020 15:01:15 +0100 Will Deacon  wrote:
> 
> > Hi Ilias,
> > 
> > On Mon, Sep 14, 2020 at 04:23:50PM +0300, Ilias Apalodimas wrote:
> > > On Mon, Sep 14, 2020 at 03:35:04PM +0300, Ilias Apalodimas wrote:  
> > > > On Mon, Sep 14, 2020 at 01:20:43PM +0100, Will Deacon wrote:  
> > > > > On Mon, Sep 14, 2020 at 11:36:21AM +0300, Ilias Apalodimas wrote:  
> > > > > > Running the eBPF test_verifier leads to random errors looking like 
> > > > > > this:  
> > 
> > [...]
> > > >   
> > > Any suggestion on any Fixes I should apply? The original code was 
> > > 'correct' and
> > > broke only when bounded loops and their self-tests were introduced.  
> > 
> > Ouch, that's pretty bad as it means nobody is regression testing BPF on
> > arm64 with mainline. Damn.
> 
> Yes, it unfortunately seems that upstream is lacking BPF regression
> testing for ARM64 :-(
> 
> This bug surfaced when Red Hat QA tested our kernel backports, on
> different archs.

Naresh from Linaro reported it during his tests on 5.8-rc1 as well [1].
I've included both Jiri and him on the v2 as reporters.

[1] https://lkml.org/lkml/2020/8/11/58
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>

Re: [PATCH] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Ilias Apalodimas

Hi Will,

On Mon, Sep 14, 2020 at 03:01:15PM +0100, Will Deacon wrote:
> Hi Ilias,
> 

[...]

> > > > 
> > > > No Fixes: tag?
> > > 
> > > I'll re-spin and apply one 
> > > 
> > Any suggestion on any Fixes I should apply? The original code was 'correct' 
> > and
> > broke only when bounded loops and their self-tests were introduced.
> 
> Ouch, that's pretty bad as it means nobody is regression testing BPF on
> arm64 with mainline. Damn.

That might not be entirely true. Since offset is a pointer, there's a chance
(and a pretty high one according to my reproducer) that the offset[-1] value 
happens to be 0. In that case the tests will pass fine. I can reproduce the bug
approximately 1 every 6-7 passes here.

I'll send a v2 shortly fixing the tags and adding a few comments on the code,
which will hopefully make future reading easier.

Cheers
/Ilias

[PATCH v2] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Ilias Apalodimas

Running the eBPF test_verifier leads to random errors looking like this:

[ 6525.735488] Unexpected kernel BRK exception at EL1
[ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
[ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 dns_resolver 
fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk crypto_simd cryptd 
aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce sha256_arm64 sha1_ce evdev 
efivars efivarfs ip_tables x_tables autofs4 btrfs blake2b_generic xor xor_neon 
zstd_compress raid6_pq libcrc32c crc32c_generic ahci xhci_pci libahci xhci_hcd 
igb libata i2c_algo_bit nvme realtek usbcore nvme_core scsi_mod t10_pi netsec 
mdio_devres of_mdio gpio_keys fixed_phy libphy gpio_mb86s7x
[ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: GW 
5.9.0-rc1+ #47
[ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS 
build #1 Jun  6 2020
[ 6525.804812] pstate: 2005 (nzCv daif -PAN -UAO BTYPE=--)
[ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.820832] sp : 8000130cbb80
[ 6525.824141] x29: 8000130cbbb0 x28: 
[ 6525.829451] x27: 05ef6fcbf39b x26: 
[ 6525.834759] x25: 8000130cbb80 x24: 800011dc7038
[ 6525.840067] x23: 8000130cbd00 x22: 0008f624d080
[ 6525.845375] x21: 0001 x20: 800011dc7000
[ 6525.850682] x19:  x18: 
[ 6525.855990] x17:  x16: 
[ 6525.861298] x15:  x14: 
[ 6525.866606] x13:  x12: 
[ 6525.871913] x11: 0001 x10: 800a660c
[ 6525.877220] x9 : 800010951810 x8 : 8000130cbc38
[ 6525.882528] x7 :  x6 : 009864cfa881
[ 6525.887836] x5 : 00ff x4 : 002880ba1a0b3e9f
[ 6525.893144] x3 : 0018 x2 : 800a4374
[ 6525.898452] x1 : 000a x0 : 0009
[ 6525.903760] Call trace:
[ 6525.906202]  bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.911076]  bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.915957]  bpf_dispatcher_xdp_func+0x14/0x20
[ 6525.920398]  bpf_test_run+0x70/0x1b0
[ 6525.923969]  bpf_prog_test_run_xdp+0xec/0x190
[ 6525.928326]  __do_sys_bpf+0xc88/0x1b28
[ 6525.932072]  __arm64_sys_bpf+0x24/0x30
[ 6525.935820]  el0_svc_common.constprop.0+0x70/0x168
[ 6525.940607]  do_el0_svc+0x28/0x88
[ 6525.943920]  el0_sync_handler+0x88/0x190
[ 6525.947838]  el0_sync+0x140/0x180
[ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---

The reason is the offset[] creation and later usage while building
the eBPF body. The code currently omits the first instruction, since
build_insn() will increase our ctx->idx before saving it.
That was fine up until bounded eBPF loops were introduced. After that
introduction, offset[0] must be the offset of the end of prologue which
is the start of the 1st insn while, offset[n] holds the
offset of the end of n-th insn.

When "taken loop with back jump to 1st insn" test runs, it will
eventually call bpf2a64_offset(-1, 2, ctx). Since negative indexing is
permitted, the current outcome depends on the value stored in
ctx->offset[-1], which has nothing to do with our array.
If the value happens to be 0 the tests will work. If not this error
triggers.

7c2e988f400e ("bpf: fix x64 JIT code generation for jmp to 1st insn")
fixed an indentical bug on x86 when eBPF bounded loops were introduced.

So let's fix it by creating the ctx->offset[] correctly in the first
place and account for the first instruction while calculating the arm
instruction offsets.

Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
Reported-by: Naresh Kamboju 
Reported-by: Jiri Olsa 
Co-developed-by: Jean-Philippe Brucker 
Signed-off-by: Jean-Philippe Brucker 
Co-developed-by: Yauheni Kaliuta 
Signed-off-by: Yauheni Kaliuta 
Signed-off-by: Ilias Apalodimas 
---
Changes since v1: 
 - Added Co-developed-by, Reported-by and Fixes tags correctly
 - Describe the expected context of ctx->offset[] in comments

 arch/arm64/net/bpf_jit_comp.c | 28 
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index f8912e45be7a..0974e58c 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -143,9 +143,13 @@ static inline void emit_addr_mov_i64(const int reg, const 
u64 val,
}
 }
 
-static inline int bpf2a64_offset(int bpf_to, int bpf_from,
+static inline int bpf2a64_offset(int bpf_insn, int off,
 const struct jit_ctx *ctx)
 {
+   /* arm64 offset is relative to the branch instruction */
+   int bpf_from = bpf_insn + 1;
+   /* BPF JMP offset is relative to the next instruction */
+   int bp

Re: [PATCH] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Ilias Apalodimas

Hi Will,

On Mon, Sep 14, 2020 at 03:35:04PM +0300, Ilias Apalodimas wrote:
> On Mon, Sep 14, 2020 at 01:20:43PM +0100, Will Deacon wrote:
> > On Mon, Sep 14, 2020 at 11:36:21AM +0300, Ilias Apalodimas wrote:
> > > Running the eBPF test_verifier leads to random errors looking like this:
> > > 
> > > [ 6525.735488] Unexpected kernel BRK exception at EL1
> > > [ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
> > > [ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 
> > > dns_resolver fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk 
> > > crypto_simd cryptd aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce 
> > > sha256_arm64 sha1_ce evdev efivars efivarfs ip_tables x_tables autofs4 
> > > btrfs blake2b_generic xor xor_neon zstd_compress raid6_pq libcrc32c 
> > > crc32c_generic ahci xhci_pci libahci xhci_hcd igb libata i2c_algo_bit 
> > > nvme realtek usbcore nvme_core scsi_mod t10_pi netsec mdio_devres of_mdio 
> > > gpio_keys fixed_phy libphy gpio_mb86s7x
> > > [ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: GW   
> > >   5.9.0-rc1+ #47
> > > [ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, 
> > > BIOS build #1 Jun  6 2020
> > > [ 6525.804812] pstate: 2005 (nzCv daif -PAN -UAO BTYPE=--)
> > > [ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
> > > [ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
> > > [ 6525.820832] sp : 8000130cbb80
> > > [ 6525.824141] x29: 8000130cbbb0 x28: 
> > > [ 6525.829451] x27: 05ef6fcbf39b x26: 
> > > [ 6525.834759] x25: 8000130cbb80 x24: 800011dc7038
> > > [ 6525.840067] x23: 8000130cbd00 x22: 0008f624d080
> > > [ 6525.845375] x21: 0001 x20: 800011dc7000
> > > [ 6525.850682] x19:  x18: 
> > > [ 6525.855990] x17:  x16: 
> > > [ 6525.861298] x15:  x14: 
> > > [ 6525.866606] x13:  x12: 
> > > [ 6525.871913] x11: 0001 x10: 800a660c
> > > [ 6525.877220] x9 : 800010951810 x8 : 8000130cbc38
> > > [ 6525.882528] x7 :  x6 : 009864cfa881
> > > [ 6525.887836] x5 : 00ff x4 : 002880ba1a0b3e9f
> > > [ 6525.893144] x3 : 0018 x2 : 800a4374
> > > [ 6525.898452] x1 : 000a x0 : 0009
> > > [ 6525.903760] Call trace:
> > > [ 6525.906202]  bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
> > > [ 6525.911076]  bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
> > > [ 6525.915957]  bpf_dispatcher_xdp_func+0x14/0x20
> > > [ 6525.920398]  bpf_test_run+0x70/0x1b0
> > > [ 6525.923969]  bpf_prog_test_run_xdp+0xec/0x190
> > > [ 6525.928326]  __do_sys_bpf+0xc88/0x1b28
> > > [ 6525.932072]  __arm64_sys_bpf+0x24/0x30
> > > [ 6525.935820]  el0_svc_common.constprop.0+0x70/0x168
> > > [ 6525.940607]  do_el0_svc+0x28/0x88
> > > [ 6525.943920]  el0_sync_handler+0x88/0x190
> > > [ 6525.947838]  el0_sync+0x140/0x180
> > > [ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
> > > [ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---
> > > 
> > > The reason seems to be the offset[] creation and usage ctx->offset[]
> > 
> > "seems to be"? Are you unsure?
> 
> Reading the history and other ports of the JIT implementation, I couldn't 
> tell if the decision on skipping the 1st entry was deliberate or not on 
> Aarch64. Reading through the mailist list didn't help either [1].
> Skipping the 1st entry seems indeed to cause the problem.
> I did run the patch though the BPF tests and showed no regressions + fixing 
> the error.

I'll correct myself here.
Looking into 7c2e988f400e ("bpf: fix x64 JIT code generation for jmp to 1st 
insn")
explains things a bit better.
Jumping back to the 1st insn wasn't allowed until eBPF bounded loops were 
introduced. That's what the 1st instruction was not saved in the original code.

> > 
> > No Fixes: tag?
> 
> I'll re-spin and apply one 
> 
Any suggestion on any Fixes I should apply? The original code was 'correct' and
broke only when bounded loops and their self-tests were introduced.

Thanks
/Ilias

Re: [PATCH] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Ilias Apalodimas

On Mon, Sep 14, 2020 at 01:20:43PM +0100, Will Deacon wrote:
> On Mon, Sep 14, 2020 at 11:36:21AM +0300, Ilias Apalodimas wrote:
> > Running the eBPF test_verifier leads to random errors looking like this:
> > 
> > [ 6525.735488] Unexpected kernel BRK exception at EL1
> > [ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
> > [ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 dns_resolver 
> > fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk crypto_simd 
> > cryptd aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce sha256_arm64 
> > sha1_ce evdev efivars efivarfs ip_tables x_tables autofs4 btrfs 
> > blake2b_generic xor xor_neon zstd_compress raid6_pq libcrc32c 
> > crc32c_generic ahci xhci_pci libahci xhci_hcd igb libata i2c_algo_bit nvme 
> > realtek usbcore nvme_core scsi_mod t10_pi netsec mdio_devres of_mdio 
> > gpio_keys fixed_phy libphy gpio_mb86s7x
> > [ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: GW 
> > 5.9.0-rc1+ #47
> > [ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, 
> > BIOS build #1 Jun  6 2020
> > [ 6525.804812] pstate: 2005 (nzCv daif -PAN -UAO BTYPE=--)
> > [ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
> > [ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
> > [ 6525.820832] sp : 8000130cbb80
> > [ 6525.824141] x29: 8000130cbbb0 x28: 
> > [ 6525.829451] x27: 05ef6fcbf39b x26: 
> > [ 6525.834759] x25: 8000130cbb80 x24: 800011dc7038
> > [ 6525.840067] x23: 8000130cbd00 x22: 0008f624d080
> > [ 6525.845375] x21: 0001 x20: 800011dc7000
> > [ 6525.850682] x19:  x18: 
> > [ 6525.855990] x17:  x16: 
> > [ 6525.861298] x15:  x14: 
> > [ 6525.866606] x13:  x12: 
> > [ 6525.871913] x11: 0001 x10: 800a660c
> > [ 6525.877220] x9 : 800010951810 x8 : 8000130cbc38
> > [ 6525.882528] x7 :  x6 : 009864cfa881
> > [ 6525.887836] x5 : 00ff x4 : 002880ba1a0b3e9f
> > [ 6525.893144] x3 : 0018 x2 : 800a4374
> > [ 6525.898452] x1 : 000a x0 : 0009
> > [ 6525.903760] Call trace:
> > [ 6525.906202]  bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
> > [ 6525.911076]  bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
> > [ 6525.915957]  bpf_dispatcher_xdp_func+0x14/0x20
> > [ 6525.920398]  bpf_test_run+0x70/0x1b0
> > [ 6525.923969]  bpf_prog_test_run_xdp+0xec/0x190
> > [ 6525.928326]  __do_sys_bpf+0xc88/0x1b28
> > [ 6525.932072]  __arm64_sys_bpf+0x24/0x30
> > [ 6525.935820]  el0_svc_common.constprop.0+0x70/0x168
> > [ 6525.940607]  do_el0_svc+0x28/0x88
> > [ 6525.943920]  el0_sync_handler+0x88/0x190
> > [ 6525.947838]  el0_sync+0x140/0x180
> > [ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
> > [ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---
> > 
> > The reason seems to be the offset[] creation and usage ctx->offset[]
> 
> "seems to be"? Are you unsure?

Reading the history and other ports of the JIT implementation, I couldn't 
tell if the decision on skipping the 1st entry was deliberate or not on 
Aarch64. Reading through the mailist list didn't help either [1].
Skipping the 1st entry seems indeed to cause the problem.
I did run the patch though the BPF tests and showed no regressions + fixing 
the error.

> 
> > while building the eBPF body.  The code currently omits the first 
> > instruction, since build_insn() will increase our ctx->idx before saving 
> > it.  When "taken loop with back jump to 1st insn" test runs it will
> > eventually call bpf2a64_offset(-1, 2, ctx). Since negative indexing is
> > permitted, the current outcome depends on the value stored in
> > ctx->offset[-1], which has nothing to do with our array.
> > If the value happens to be 0 the tests will work. If not this error
> > triggers.
> > 
> > So let's fix it by creating the ctx->offset[] correctly in the first
> > place and account for the extra instruction while calculating the arm
> > instruction offsets.
> 
> No Fixes: tag?

I'll re-spin and apply one 

> 
> > Signed-off-by: Ilias Apalodimas 
> > Signed-off-by: Jean-Philippe Brucker 
> > Signed-off-by: Yauheni Kaliuta 
> 
> Non-author signoffs here. What's going on?

My bad here, I'll add a Co-developed-by on v2 for the rest of the people and 
move my Signed-off last

[1] 
https://lore.kernel.org/bpf/canowswkaj1hysw3bxbmg9_nd48fm0mxm5egdtmhu6ysec_g...@mail.gmail.com/T/#u

Thanks
/Ilias
> 
> Will

[PATCH] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Ilias Apalodimas

Running the eBPF test_verifier leads to random errors looking like this:

[ 6525.735488] Unexpected kernel BRK exception at EL1
[ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
[ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 dns_resolver 
fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk crypto_simd cryptd 
aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce sha256_arm64 sha1_ce evdev 
efivars efivarfs ip_tables x_tables autofs4 btrfs blake2b_generic xor xor_neon 
zstd_compress raid6_pq libcrc32c crc32c_generic ahci xhci_pci libahci xhci_hcd 
igb libata i2c_algo_bit nvme realtek usbcore nvme_core scsi_mod t10_pi netsec 
mdio_devres of_mdio gpio_keys fixed_phy libphy gpio_mb86s7x
[ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: GW 
5.9.0-rc1+ #47
[ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS 
build #1 Jun  6 2020
[ 6525.804812] pstate: 2005 (nzCv daif -PAN -UAO BTYPE=--)
[ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.820832] sp : 8000130cbb80
[ 6525.824141] x29: 8000130cbbb0 x28: 
[ 6525.829451] x27: 05ef6fcbf39b x26: 
[ 6525.834759] x25: 8000130cbb80 x24: 800011dc7038
[ 6525.840067] x23: 8000130cbd00 x22: 0008f624d080
[ 6525.845375] x21: 0001 x20: 800011dc7000
[ 6525.850682] x19:  x18: 
[ 6525.855990] x17:  x16: 
[ 6525.861298] x15:  x14: 
[ 6525.866606] x13:  x12: 
[ 6525.871913] x11: 0001 x10: 800a660c
[ 6525.877220] x9 : 800010951810 x8 : 8000130cbc38
[ 6525.882528] x7 :  x6 : 009864cfa881
[ 6525.887836] x5 : 00ff x4 : 002880ba1a0b3e9f
[ 6525.893144] x3 : 0018 x2 : 800a4374
[ 6525.898452] x1 : 000a x0 : 0009
[ 6525.903760] Call trace:
[ 6525.906202]  bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
[ 6525.911076]  bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
[ 6525.915957]  bpf_dispatcher_xdp_func+0x14/0x20
[ 6525.920398]  bpf_test_run+0x70/0x1b0
[ 6525.923969]  bpf_prog_test_run_xdp+0xec/0x190
[ 6525.928326]  __do_sys_bpf+0xc88/0x1b28
[ 6525.932072]  __arm64_sys_bpf+0x24/0x30
[ 6525.935820]  el0_svc_common.constprop.0+0x70/0x168
[ 6525.940607]  do_el0_svc+0x28/0x88
[ 6525.943920]  el0_sync_handler+0x88/0x190
[ 6525.947838]  el0_sync+0x140/0x180
[ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---

The reason seems to be the offset[] creation and usage ctx->offset[]
while building the eBPF body.  The code currently omits the first 
instruction, since build_insn() will increase our ctx->idx before saving 
it.  When "taken loop with back jump to 1st insn" test runs it will
eventually call bpf2a64_offset(-1, 2, ctx). Since negative indexing is
permitted, the current outcome depends on the value stored in
ctx->offset[-1], which has nothing to do with our array.
If the value happens to be 0 the tests will work. If not this error
triggers.

So let's fix it by creating the ctx->offset[] correctly in the first
place and account for the extra instruction while calculating the arm
instruction offsets.

Signed-off-by: Ilias Apalodimas 
Signed-off-by: Jean-Philippe Brucker 
Signed-off-by: Yauheni Kaliuta 
---
 arch/arm64/net/bpf_jit_comp.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index f8912e45be7a..5891733a9f39 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -143,9 +143,13 @@ static inline void emit_addr_mov_i64(const int reg, const 
u64 val,
}
 }
 
-static inline int bpf2a64_offset(int bpf_to, int bpf_from,
+static inline int bpf2a64_offset(int bpf_insn, int off,
 const struct jit_ctx *ctx)
 {
+   /* arm64 offset is relative to the branch instruction */
+   int bpf_from = bpf_insn + 1;
+   /* BPF JMP offset is relative to the next instruction */
+   int bpf_to = bpf_insn + off + 1;
int to = ctx->offset[bpf_to];
/* -1 to account for the Branch instruction */
int from = ctx->offset[bpf_from] - 1;
@@ -642,7 +646,7 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx,
 
/* JUMP off */
case BPF_JMP | BPF_JA:
-   jmp_offset = bpf2a64_offset(i + off, i, ctx);
+   jmp_offset = bpf2a64_offset(i, off, ctx);
check_imm26(jmp_offset);
emit(A64_B(jmp_offset), ctx);
break;
@@ -669,7 +673,7 @@ static int build_insn(const struct bpf_insn *insn, struct 
jit_ctx *ctx,
case BPF_JMP32 | BPF_JSLE | BPF_X:

Re: [PATCH net-next 3/4] mvpp2: add basic XDP support

2020-07-02 Thread ilias . apalodimas

On Tue, Jun 30, 2020 at 08:09:29PM +0200, Matteo Croce wrote:
> From: Matteo Croce 
> 
> Add XDP native support.
> By now only XDP_DROP, XDP_PASS and XDP_REDIRECT
> verdicts are supported.
> 
> Co-developed-by: Sven Auhagen 
> Signed-off-by: Sven Auhagen 
> Signed-off-by: Matteo Croce 
> ---

[...]

>  }
>  
> +static int
> +mvpp2_run_xdp(struct mvpp2_port *port, struct mvpp2_rx_queue *rxq,
> +   struct bpf_prog *prog, struct xdp_buff *xdp,
> +   struct page_pool *pp)
> +{
> + unsigned int len, sync, err;
> + struct page *page;
> + u32 ret, act;
> +
> + len = xdp->data_end - xdp->data_hard_start - MVPP2_SKB_HEADROOM;
> + act = bpf_prog_run_xdp(prog, xdp);
> +
> + /* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
> + sync = xdp->data_end - xdp->data_hard_start - MVPP2_SKB_HEADROOM;
> + sync = max(sync, len);
> +
> + switch (act) {
> + case XDP_PASS:
> + ret = MVPP2_XDP_PASS;
> + break;
> + case XDP_REDIRECT:
> + err = xdp_do_redirect(port->dev, xdp, prog);
> + if (unlikely(err)) {
> + ret = MVPP2_XDP_DROPPED;
> + page = virt_to_head_page(xdp->data);
> + page_pool_put_page(pp, page, sync, true);
> + } else {
> + ret = MVPP2_XDP_REDIR;
> + }
> + break;
> + default:
> + bpf_warn_invalid_xdp_action(act);
> + fallthrough;
> + case XDP_ABORTED:
> + trace_xdp_exception(port->dev, prog, act);
> + fallthrough;
> + case XDP_DROP:
> + page = virt_to_head_page(xdp->data);
> + page_pool_put_page(pp, page, sync, true);
> + ret = MVPP2_XDP_DROPPED;
> + break;
> + }
> +
> + return ret;
> +}
> +
>  /* Main rx processing */
>  static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi,
>   int rx_todo, struct mvpp2_rx_queue *rxq)
>  {
>   struct net_device *dev = port->dev;
> + struct bpf_prog *xdp_prog;
> + struct xdp_buff xdp;
>   int rx_received;
>   int rx_done = 0;
> + u32 xdp_ret = 0;
>   u32 rcvd_pkts = 0;
>   u32 rcvd_bytes = 0;
>  
> + rcu_read_lock();
> +
> + xdp_prog = READ_ONCE(port->xdp_prog);
> +
>   /* Get number of received packets and clamp the to-do */
>   rx_received = mvpp2_rxq_received(port, rxq->id);
>   if (rx_todo > rx_received)
> @@ -3060,7 +3115,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct 
> napi_struct *napi,
>   dma_addr_t dma_addr;
>   phys_addr_t phys_addr;
>   u32 rx_status;
> - int pool, rx_bytes, err;
> + int pool, rx_bytes, err, ret;
>   void *data;
>  
>   rx_done++;
> @@ -3096,6 +3151,33 @@ static int mvpp2_rx(struct mvpp2_port *port, struct 
> napi_struct *napi,
>   else
>   frag_size = bm_pool->frag_size;
>  
> + if (xdp_prog) {
> + xdp.data_hard_start = data;
> + xdp.data = data + MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM;
> + xdp.data_end = xdp.data + rx_bytes;
> + xdp.frame_sz = PAGE_SIZE;
> +
> + if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE)
> + xdp.rxq = &rxq->xdp_rxq_short;
> + else
> + xdp.rxq = &rxq->xdp_rxq_long;
> +
> + xdp_set_data_meta_invalid(&xdp);
> +
> + ret = mvpp2_run_xdp(port, rxq, xdp_prog, &xdp, pp);
> +
> + if (ret) {
> + xdp_ret |= ret;
> + err = mvpp2_rx_refill(port, bm_pool, pp, pool);
> + if (err) {
> + netdev_err(port->dev, "failed to refill 
> BM pools\n");
> + goto err_drop_frame;
> + }
> +
> + continue;
> + }
> + }
> +
>   skb = build_skb(data, frag_size);
>   if (!skb) {
>   netdev_warn(port->dev, "skb build failed\n");
> @@ -3118,7 +3200,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct 
> napi_struct *napi,
>   rcvd_pkts++;
>   rcvd_bytes += rx_bytes;
>  
> - skb_reserve(skb, MVPP2_MH_SIZE + NET_SKB_PAD);
> + skb_reserve(skb, MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM);
>   skb_put(skb, rx_bytes);
>   skb->protocol = eth_type_trans(skb, dev);
>   mvpp2_rx_csum(port, rx_status, skb);
> @@ -3133,6 +3215,8 @@ static int mvpp2_rx(struct mvpp2_port *port, struct 
> napi_struct *napi,
>   mvpp2_bm_pool_put(port, pool, dma_addr, phys_addr);
>   }
>  
> + rcu_read_unlock();
> +
>

Re: [PATCH net-next 2/4] mvpp2: use page_pool allocator

2020-07-02 Thread ilias . apalodimas

Hi Matteo, 

Thanks for working on this!

On Tue, Jun 30, 2020 at 08:09:28PM +0200, Matteo Croce wrote:
> From: Matteo Croce 
> 
> Use the page_pool API for memory management. This is a prerequisite for
> native XDP support.
> 
> Tested-by: Sven Auhagen 
> Signed-off-by: Matteo Croce 
> ---
>  drivers/net/ethernet/marvell/Kconfig  |   1 +
>  drivers/net/ethernet/marvell/mvpp2/mvpp2.h|   8 +
>  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   | 155 +++---
>  3 files changed, 139 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/Kconfig 
> b/drivers/net/ethernet/marvell/Kconfig
> index cd8ddd1ef6f2..ef4f35ba077d 100644
> --- a/drivers/net/ethernet/marvell/Kconfig
> +++ b/drivers/net/ethernet/marvell/Kconfig
> @@ -87,6 +87,7 @@ config MVPP2
>   depends on ARCH_MVEBU || COMPILE_TEST
>   select MVMDIO
>   select PHYLINK
> + select PAGE_POOL
>   help
> This driver supports the network interface units in the
> Marvell ARMADA 375, 7K and 8K SoCs.
> diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h 
> b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
> index 543a310ec102..4c16c9e9c1e5 100644
> --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
> +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* Fifo Registers */
>  #define MVPP2_RX_DATA_FIFO_SIZE_REG(port)(0x00 + 4 * (port))
> @@ -820,6 +821,9 @@ struct mvpp2 {
>  
>   /* RSS Indirection tables */
>   struct mvpp2_rss_table *rss_tables[MVPP22_N_RSS_TABLES];
> +
> + /* page_pool allocator */
> + struct page_pool *page_pool[MVPP2_PORT_MAX_RXQ];
>  };
>  
>  struct mvpp2_pcpu_stats {
> @@ -1161,6 +1165,10 @@ struct mvpp2_rx_queue {
>  
>   /* Port's logic RXQ number to which physical RXQ is mapped */
>   int logic_rxq;
> +
> + /* XDP memory accounting */
> + struct xdp_rxq_info xdp_rxq_short;
> + struct xdp_rxq_info xdp_rxq_long;
>  };
>  
>  struct mvpp2_bm_pool {
> diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c 
> b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
> index 027de7291f92..9e2e8fb0a0b8 100644
> --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
> +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
> @@ -95,6 +95,22 @@ static inline u32 mvpp2_cpu_to_thread(struct mvpp2 *priv, 
> int cpu)
>   return cpu % priv->nthreads;
>  }
>  
> +static struct page_pool *
> +mvpp2_create_page_pool(struct device *dev, int num, int len)
> +{
> + struct page_pool_params pp_params = {
> + /* internal DMA mapping in page_pool */
> + .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
> + .pool_size = num,
> + .nid = NUMA_NO_NODE,
> + .dev = dev,
> + .dma_dir = DMA_FROM_DEVICE,
> + .max_len = len,
> + };
> +
> + return page_pool_create(&pp_params);
> +}
> +
>  /* These accessors should be used to access:
>   *
>   * - per-thread registers, where each thread has its own copy of the
> @@ -327,17 +343,26 @@ static inline int mvpp2_txq_phys(int port, int txq)
>   return (MVPP2_MAX_TCONT + port) * MVPP2_MAX_TXQ + txq;
>  }
>  
> -static void *mvpp2_frag_alloc(const struct mvpp2_bm_pool *pool)
> +/* Returns a struct page if page_pool is set, otherwise a buffer */
> +static void *mvpp2_frag_alloc(const struct mvpp2_bm_pool *pool,
> +   struct page_pool *page_pool)
>  {
> + if (page_pool)
> + return page_pool_alloc_pages(page_pool,
> +  GFP_ATOMIC | __GFP_NOWARN);

page_pool_dev_alloc_pages() can set these flags for you, instead of explicitly
calling them

> +
>   if (likely(pool->frag_size <= PAGE_SIZE))
>   return netdev_alloc_frag(pool->frag_size);
> - else
> - return kmalloc(pool->frag_size, GFP_ATOMIC);
> +
> + return kmalloc(pool->frag_size, GFP_ATOMIC);
>  }
>  
> -static void mvpp2_frag_free(const struct mvpp2_bm_pool *pool, void *data)
> +static void mvpp2_frag_free(const struct mvpp2_bm_pool *pool,
> + struct page_pool *page_pool, void *data)
>  {
> - if (likely(pool->frag_size <= PAGE_SIZE))
> + if (page_pool)
> + page_pool_put_full_page(page_pool, virt_to_head_page(data), 
> false);
> + else if (likely(pool->frag_size <= PAGE_SIZE))
>   skb_free_frag(data);
>   else
>   kfree(data);
> @@ -442,6 +467,7 @@ static void mvpp2_bm_bufs_get_addrs(struct device *dev, 
> struct mvpp2 *priv,
>  static void mvpp2_bm_bufs_free(struct device *dev, struct mvpp2 *priv,
>  struct mvpp2_bm_pool *bm_pool, int buf_num)
>  {
> + struct page_pool *pp = NULL;
>   int i;
>  
>   if (buf_num > bm_pool->buf_num) {
> @@ -450,6 +476,9 @@ static void mvpp2_bm_bufs_free(struct device *dev, struct 
> mvpp2 *priv,
>   buf_num = bm_pool->buf_num

Re: [PATCH net-next v6 1/2] xen networking: add basic XDP support for xen-netfront

2020-05-01 Thread Ilias Apalodimas

On Fri, May 01, 2020 at 01:12:17PM +0300, Denis Kirjanov wrote:
> The patch adds a basic XDP processing to xen-netfront driver.
> 
> We ran an XDP program for an RX response received from netback
> driver. Also we request xen-netback to adjust data offset for
> bpf_xdp_adjust_head() header space for custom headers.
> 
> synchronization between frontend and backend parts is done
> by using xenbus state switching:
> Reconfiguring -> Reconfigured- > Connected
> 
> UDP packets drop rate using xdp program is around 310 kpps
> using ./pktgen_sample04_many_flows.sh and 160 kpps without the patch.
> 
> v6:
> - added the missing SOB line
> - fixed subject
> 
> v5:
> - split netfront/netback changes
> - added a sync point between backend/frontend on switching to XDP
> - added pagepool API
> 
> v4:
> - added verbose patch descriprion
> - don't expose the XDP headroom offset to the domU guest
> - add a modparam to netback to toggle XDP offset
> - don't process jumbo frames for now
> 
> v3:
> - added XDP_TX support (tested with xdping echoserver)
> - added XDP_REDIRECT support (tested with modified xdp_redirect_kern)
> - moved xdp negotiation to xen-netback
> 
> v2:
> - avoid data copying while passing to XDP
> - tell xen-netback that we need the headroom space
> 
> Signed-off-by: Denis Kirjanov 
> ---
>  drivers/net/xen-netfront.c | 302 
> -
>  1 file changed, 298 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 482c6c8..e7e2c11 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -44,6 +44,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  
>  #include 
>  #include 
> @@ -102,6 +105,8 @@ struct netfront_queue {
>   char name[QUEUE_NAME_SIZE]; /* DEVNAME-qN */
>   struct netfront_info *info;
>  
> + struct bpf_prog __rcu *xdp_prog;
> +
>   struct napi_struct napi;
>  
>   /* Split event channels support, tx_* == rx_* when using
> @@ -144,6 +149,9 @@ struct netfront_queue {
>   struct sk_buff *rx_skbs[NET_RX_RING_SIZE];
>   grant_ref_t gref_rx_head;
>   grant_ref_t grant_rx_ref[NET_RX_RING_SIZE];
> +
> + struct page_pool *page_pool;
> + struct xdp_rxq_info xdp_rxq;
>  };
>  
>  struct netfront_info {
> @@ -159,6 +167,8 @@ struct netfront_info {
>   struct netfront_stats __percpu *rx_stats;
>   struct netfront_stats __percpu *tx_stats;
>  
> + bool netback_has_xdp_headroom;
> +
>   atomic_t rx_gso_checksum_fixup;
>  };
>  
> @@ -167,6 +177,9 @@ struct netfront_rx_info {
>   struct xen_netif_extra_info extras[XEN_NETIF_EXTRA_TYPE_MAX - 1];
>  };
>  
> +static int xennet_xdp_xmit(struct net_device *dev, int n,
> +struct xdp_frame **frames, u32 flags);
> +
>  static void skb_entry_set_link(union skb_entry *list, unsigned short id)
>  {
>   list->link = id;
> @@ -265,8 +278,9 @@ static struct sk_buff *xennet_alloc_one_rx_buffer(struct 
> netfront_queue *queue)
>   if (unlikely(!skb))
>   return NULL;
>  
> - page = alloc_page(GFP_ATOMIC | __GFP_NOWARN);
> - if (!page) {
> + page = page_pool_alloc_pages(queue->page_pool,
> +  GFP_ATOMIC | __GFP_NOWARN);

You can use page_pool_dev_alloc_pages(), which is called with the exact same
arguments.

> + if (unlikely(!page)) {
>   kfree_skb(skb);
>   return NULL;
>   }
> @@ -778,6 +792,53 @@ static int xennet_get_extras(struct netfront_queue 
> *queue,
>   return err;
>  }
>  
> +u32 xennet_run_xdp(struct netfront_queue *queue, struct page *pdata,
> +struct xen_netif_rx_response *rx, struct bpf_prog *prog,
> +struct xdp_buff *xdp)
> +{
> + struct xdp_frame *xdpf;
> + u32 len = rx->status;
> + u32 act = XDP_PASS;
> + int err;
> +
> + xdp->data_hard_start = page_address(pdata);
> + xdp->data = xdp->data_hard_start + XDP_PACKET_HEADROOM;
> + xdp_set_data_meta_invalid(xdp);
> + xdp->data_end = xdp->data + len;
> + xdp->rxq = &queue->xdp_rxq;
> + xdp->handle = 0;
> +
> + act = bpf_prog_run_xdp(prog, xdp);
> + switch (act) {
> + case XDP_TX:
> + get_page(pdata);
> + xdpf = convert_to_xdp_frame(xdp);
> + err = xennet_xdp_xmit(queue->info->netdev, 1, &xdpf, 0);
> + if (unlikely(err < 0))
> + trace_xdp_exception(queue->info->netdev, prog, act);
> + break;
> + case XDP_REDIRECT:
> + get_page(pdata);
> + err = xdp_do_redirect(queue->info->netdev, xdp, prog);
> + if (unlikely(err))
> + trace_xdp_exception(queue->info->netdev, prog, act);
> + xdp_do_flush();
> + break;
> + case XDP_PASS:
> + case XDP_DROP:
> + break;
> +
> + case XDP_ABORTED:
> + trace_xdp_exception(que

Re: [PATCH net-next 3/4] page_pool: Restructure __page_pool_put_page()

2019-10-23 Thread Ilias Apalodimas

On Tue, Oct 22, 2019 at 04:44:24AM +, Saeed Mahameed wrote:
> From: Jonathan Lemon 
> 
> 1) Rename functions to reflect what they are actually doing.
> 
> 2) Unify the condition to keep a page.
> 
> 3) When page can't be kept in cache, fallback to releasing page to page
> allocator in one place, instead of calling it from multiple conditions,
> and reuse __page_pool_return_page().
> 
> Signed-off-by: Jonathan Lemon 
> Signed-off-by: Saeed Mahameed 
> ---
>  net/core/page_pool.c | 38 +++---
>  1 file changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 8120aec999ce..65680aaa0818 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -258,6 +258,7 @@ static bool __page_pool_recycle_into_ring(struct 
> page_pool *pool,
>  struct page *page)
>  {
>   int ret;
> +
>   /* BH protection not needed if current is serving softirq */
>   if (in_serving_softirq())
>   ret = ptr_ring_produce(&pool->ring, page);
> @@ -272,8 +273,8 @@ static bool __page_pool_recycle_into_ring(struct 
> page_pool *pool,
>   *
>   * Caller must provide appropriate safe context.
>   */
> -static bool __page_pool_recycle_direct(struct page *page,
> -struct page_pool *pool)
> +static bool __page_pool_recycle_into_cache(struct page *page,
> +struct page_pool *pool)
>  {
>   if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
>   return false;
> @@ -283,15 +284,18 @@ static bool __page_pool_recycle_direct(struct page 
> *page,
>   return true;
>  }
>  
> -/* page is NOT reusable when:
> - * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
> - * 2) belongs to a different NUMA node than pool->p.nid.
> +/* Keep page in caches only if page:
> + * 1) wasn't allocated when system is under some pressure 
> (page_is_pfmemalloc).
> + * 2) belongs to pool's numa node (pool->p.nid).
> + * 3) refcount is 1 (owned by page pool).
>   *
>   * To update pool->p.nid users must call page_pool_update_nid.
>   */
> -static bool pool_page_reusable(struct page_pool *pool, struct page *page)
> +static bool page_pool_keep_page(struct page_pool *pool, struct page *page)
>  {
> - return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
> + return !page_is_pfmemalloc(page) &&
> +page_to_nid(page) == pool->p.nid &&
> +page_ref_count(page) == 1;
>  }
>  
>  void __page_pool_put_page(struct page_pool *pool,
> @@ -300,22 +304,19 @@ void __page_pool_put_page(struct page_pool *pool,
>   /* This allocator is optimized for the XDP mode that uses
>* one-frame-per-page, but have fallbacks that act like the
>* regular page allocator APIs.
> -  *
> -  * refcnt == 1 means page_pool owns page, and can recycle it.
>*/
> - if (likely(page_ref_count(page) == 1 &&
> -pool_page_reusable(pool, page))) {
> +
> + if (likely(page_pool_keep_page(pool, page))) {
>   /* Read barrier done in page_ref_count / READ_ONCE */
>  
>   if (allow_direct && in_serving_softirq())
> - if (__page_pool_recycle_direct(page, pool))
> + if (__page_pool_recycle_into_cache(page, pool))
>   return;
>  
> - if (!__page_pool_recycle_into_ring(pool, page)) {
> - /* Cache full, fallback to free pages */
> - __page_pool_return_page(pool, page);
> - }
> - return;
> + if (__page_pool_recycle_into_ring(pool, page))
> + return;
> +
> + /* Cache full, fallback to return pages */
>   }
>   /* Fallback/non-XDP mode: API user have elevated refcnt.
>*
> @@ -330,8 +331,7 @@ void __page_pool_put_page(struct page_pool *pool,
>* doing refcnt based recycle tricks, meaning another process
>* will be invoking put_page.
>*/
> - __page_pool_clean_page(pool, page);
> - put_page(page);
> + __page_pool_return_page(pool, page);

I think Jesper had a reason for calling them separately instead of 
__page_pool_return_page + put_page() (which in fact does the same thing). 

In the future he was planning on removing the __page_pool_clean_page call from
there, since someone might call __page_pool_put_page() after someone has called
__page_pool_clean_page()
Can we leave the calls there as-is?

>  }
>  EXPORT_SYMBOL(__page_pool_put_page);
>  
> -- 
> 2.21.0
> 

Thanks
/Ilias

Re: [PATCH v2 net-next] net: socionext: netsec: fix xdp stats accounting

2019-10-17 Thread Ilias Apalodimas

Thanks Lorenzo,

On Thu, Oct 17, 2019 at 02:28:32PM +0200, Lorenzo Bianconi wrote:
> Increment netdev rx counters even for XDP_DROP verdict. Report even
> tx bytes for xdp buffers (TYPE_NETSEC_XDP_TX or TYPE_NETSEC_XDP_NDO).
> Moreover account pending buffer length in netsec_xdp_queue_one as it is
> done for skb counterpart
> 
> Tested-by: Ilias Apalodimas 
> Signed-off-by: Lorenzo Bianconi 
> ---
> Changes since v1:
> - fix BQL accounting
> - target the patch to next-next
> ---
>  drivers/net/ethernet/socionext/netsec.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> index f9e6744d8fd6..c40294470bfa 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -252,7 +252,6 @@
>  #define NETSEC_XDP_CONSUMED  BIT(0)
>  #define NETSEC_XDP_TXBIT(1)
>  #define NETSEC_XDP_REDIR BIT(2)
> -#define NETSEC_XDP_RX_OK (NETSEC_XDP_PASS | NETSEC_XDP_TX | NETSEC_XDP_REDIR)
>  
>  enum ring_id {
>   NETSEC_RING_TX = 0,
> @@ -661,6 +660,7 @@ static bool netsec_clean_tx_dring(struct netsec_priv 
> *priv)
>   bytes += desc->skb->len;
>   dev_kfree_skb(desc->skb);
>   } else {
> + bytes += desc->xdpf->len;
>   xdp_return_frame(desc->xdpf);
>   }
>  next:
> @@ -858,6 +858,7 @@ static u32 netsec_xdp_queue_one(struct netsec_priv *priv,
>   tx_desc.addr = xdpf->data;
>   tx_desc.len = xdpf->len;
>  
> + netdev_sent_queue(priv->ndev, xdpf->len);
>   netsec_set_tx_de(priv, tx_ring, &tx_ctrl, &tx_desc, xdpf);
>  
>   return NETSEC_XDP_TX;
> @@ -1030,7 +1031,7 @@ static int netsec_process_rx(struct netsec_priv *priv, 
> int budget)
>  
>  next:
>   if ((skb && napi_gro_receive(&priv->napi, skb) != GRO_DROP) ||
> - xdp_result & NETSEC_XDP_RX_OK) {
> +     xdp_result) {
>   ndev->stats.rx_packets++;
>   ndev->stats.rx_bytes += xdp.data_end - xdp.data;
>   }
> -- 
> 2.21.0
> 

Reviewed-by: Ilias Apalodimas

Re: [PATCH 04/10 net-next] page_pool: Add API to update numa node and flush page caches

2019-10-17 Thread Ilias Apalodimas

Hi Saeed,

On Wed, Oct 16, 2019 at 03:50:22PM -0700, Jonathan Lemon wrote:
> From: Saeed Mahameed 
> 
> Add page_pool_update_nid() to be called from drivers when they detect
> numa node changes.
> 
> It will do:
> 1) Flush the pool's page cache and ptr_ring.
> 2) Update page pool nid value to start allocating from the new numa
> node.
> 
> Signed-off-by: Saeed Mahameed 
> Signed-off-by: Jonathan Lemon 
> ---
>  include/net/page_pool.h | 10 ++
>  net/core/page_pool.c| 16 +++-
>  2 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 2cbcdbdec254..fb13cf6055ff 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -226,4 +226,14 @@ static inline bool page_pool_put(struct page_pool *pool)
>   return refcount_dec_and_test(&pool->user_cnt);
>  }
>  
> +/* Only safe from napi context or when user guarantees it is thread safe */
> +void __page_pool_flush(struct page_pool *pool);

This should be called per packet right? Any noticeable impact on performance?

> +static inline void page_pool_update_nid(struct page_pool *pool, int new_nid)
> +{
> + if (unlikely(pool->p.nid != new_nid)) {
> + /* TODO: Add statistics/trace */
> + __page_pool_flush(pool);
> + pool->p.nid = new_nid;
> + }
> +}
>  #endif /* _NET_PAGE_POOL_H */
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 5bc65587f1c4..678cf85f273a 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -373,16 +373,13 @@ void __page_pool_free(struct page_pool *pool)
>  }
>  EXPORT_SYMBOL(__page_pool_free);
>  
> -/* Request to shutdown: release pages cached by page_pool, and check
> - * for in-flight pages
> - */
> -bool __page_pool_request_shutdown(struct page_pool *pool)
> +void __page_pool_flush(struct page_pool *pool)
>  {
>   struct page *page;
>  
>   /* Empty alloc cache, assume caller made sure this is
>* no-longer in use, and page_pool_alloc_pages() cannot be
> -  * call concurrently.
> +  * called concurrently.
>*/
>   while (pool->alloc.count) {
>   page = pool->alloc.cache[--pool->alloc.count];
> @@ -393,6 +390,15 @@ bool __page_pool_request_shutdown(struct page_pool *pool)
>* be in-flight.
>*/
>   __page_pool_empty_ring(pool);
> +}
> +EXPORT_SYMBOL(__page_pool_flush);

A later patch removes this, do we actually need it here?

> +
> +/* Request to shutdown: release pages cached by page_pool, and check
> + * for in-flight pages
> + */
> +bool __page_pool_request_shutdown(struct page_pool *pool)
> +{
> + __page_pool_flush(pool);
>  
>   return __page_pool_safe_to_destroy(pool);
>  }
> -- 
> 2.17.1
> 


Thanks
/Ilias

Re: [PATCH] net: netsec: Correct dma sync for XDP_TX frames

2019-10-16 Thread Ilias Apalodimas

Hi Jakub, 

On Wed, Oct 16, 2019 at 05:14:01PM -0700, Jakub Kicinski wrote:
> On Wed, 16 Oct 2019 14:40:32 +0300, Ilias Apalodimas wrote:
> > bpf_xdp_adjust_head() can change the frame boundaries. Account for the
> > potential shift properly by calculating the new offset before
> > syncing the buffer to the device for XDP_TX
> > 
> > Fixes: ba2b232108d3 ("net: netsec: add XDP support")
> > Signed-off-by: Ilias Apalodimas 
> 
> Reviewed-by: Jakub Kicinski 
> 
> You should target this to the bpf or net tree (appropriate [PATCH xyz]
> marking). Although I must admit it's unclear to me as well whether the
> driver changes should be picked up by bpf maintainers or Dave :S

My bad i forgot to add the net-next tag. I'd prefer Dave picking that up, since
he picked all the XDP-related patches for this driver before. 
Dave shall i re-send with the proper tag?

> 
> > diff --git a/drivers/net/ethernet/socionext/netsec.c 
> > b/drivers/net/ethernet/socionext/netsec.c
> > index f9e6744d8fd6..41ddd8fff2a7 100644
> > --- a/drivers/net/ethernet/socionext/netsec.c
> > +++ b/drivers/net/ethernet/socionext/netsec.c
> > @@ -847,8 +847,8 @@ static u32 netsec_xdp_queue_one(struct netsec_priv 
> > *priv,
> > enum dma_data_direction dma_dir =
> > page_pool_get_dma_dir(rx_ring->page_pool);
> >  
> > -   dma_handle = page_pool_get_dma_addr(page) +
> > -   NETSEC_RXBUF_HEADROOM;
> > +   dma_handle = page_pool_get_dma_addr(page) + xdpf->headroom +
> > +   sizeof(*xdpf);
> 
> very nitpick: I'd personally write addr + sizeof(*xdpf) + xdpf->headroom
> since that's the order in which they appear in memory
> 
> But likely not worth reposting for just that :)

Isn't sizeof static anyway? If Dave needs a v2 with the proper tag i'll change
this as well 

> 
> > dma_sync_single_for_device(priv->dev, dma_handle, xdpf->len,
> >dma_dir);
> > tx_desc.buf_type = TYPE_NETSEC_XDP_TX;
> 

Thanks
/Ilias

[PATCH] net: netsec: Correct dma sync for XDP_TX frames

2019-10-16 Thread Ilias Apalodimas

bpf_xdp_adjust_head() can change the frame boundaries. Account for the
potential shift properly by calculating the new offset before
syncing the buffer to the device for XDP_TX

Fixes: ba2b232108d3 ("net: netsec: add XDP support")
Signed-off-by: Ilias Apalodimas 
---
 drivers/net/ethernet/socionext/netsec.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index f9e6744d8fd6..41ddd8fff2a7 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -847,8 +847,8 @@ static u32 netsec_xdp_queue_one(struct netsec_priv *priv,
enum dma_data_direction dma_dir =
page_pool_get_dma_dir(rx_ring->page_pool);
 
-   dma_handle = page_pool_get_dma_addr(page) +
-   NETSEC_RXBUF_HEADROOM;
+   dma_handle = page_pool_get_dma_addr(page) + xdpf->headroom +
+   sizeof(*xdpf);
dma_sync_single_for_device(priv->dev, dma_handle, xdpf->len,
   dma_dir);
tx_desc.buf_type = TYPE_NETSEC_XDP_TX;
-- 
2.23.0

Re: [PATCH v3 net-next 8/8] net: mvneta: add XDP_TX support

2019-10-16 Thread Ilias Apalodimas

On Wed, Oct 16, 2019 at 12:09:00PM +0200, Lorenzo Bianconi wrote:
> > On Mon, 14 Oct 2019 12:49:55 +0200, Lorenzo Bianconi wrote:
> > > Implement XDP_TX verdict and ndo_xdp_xmit net_device_ops function
> > > pointer
> > > 
> > > Signed-off-by: Lorenzo Bianconi 
> > 
> > > @@ -1972,6 +1975,109 @@ int mvneta_rx_refill_queue(struct mvneta_port 
> > > *pp, struct mvneta_rx_queue *rxq)
> > >   return i;
> > >  }
> > >  
> > > +static int
> > > +mvneta_xdp_submit_frame(struct mvneta_port *pp, struct mvneta_tx_queue 
> > > *txq,
> > > + struct xdp_frame *xdpf, bool dma_map)
> > > +{
> > > + struct mvneta_tx_desc *tx_desc;
> > > + struct mvneta_tx_buf *buf;
> > > + dma_addr_t dma_addr;
> > > +
> > > + if (txq->count >= txq->tx_stop_threshold)
> > > + return MVNETA_XDP_CONSUMED;
> > > +
> > > + tx_desc = mvneta_txq_next_desc_get(txq);
> > > +
> > > + buf = &txq->buf[txq->txq_put_index];
> > > + if (dma_map) {
> > > + /* ndo_xdp_xmit */
> > > + dma_addr = dma_map_single(pp->dev->dev.parent, xdpf->data,
> > > +   xdpf->len, DMA_TO_DEVICE);
> > > + if (dma_mapping_error(pp->dev->dev.parent, dma_addr)) {
> > > + mvneta_txq_desc_put(txq);
> > > + return MVNETA_XDP_CONSUMED;
> > > + }
> > > + buf->type = MVNETA_TYPE_XDP_NDO;
> > > + } else {
> > > + struct page *page = virt_to_page(xdpf->data);
> > > +
> > > + dma_addr = page_pool_get_dma_addr(page) +
> > > +pp->rx_offset_correction + MVNETA_MH_SIZE;
> > > + dma_sync_single_for_device(pp->dev->dev.parent, dma_addr,
> > > +xdpf->len, DMA_BIDIRECTIONAL);
> > 
> > This looks a little suspicious, XDP could have moved the start of frame
> > with adjust_head, right? You should also use xdpf->data to find where
> > the frame starts, no?
> 
> uhm, right..we need to update the dma_addr doing something like:
> 
> dma_addr = page_pool_get_dma_addr(page) + xdpf->data - xdpf;

Can we do  page_pool_get_dma_addr(page) + xdpf->headroom as well right?

> 
> and then use xdpf->len for dma-sync
> 
> > 
> > > + buf->type = MVNETA_TYPE_XDP_TX;
> > > + }
> > > + buf->xdpf = xdpf;
> > > +
> > > + tx_desc->command = MVNETA_TXD_FLZ_DESC;
> > > + tx_desc->buf_phys_addr = dma_addr;
> > > + tx_desc->data_size = xdpf->len;
> > > +
> > > + mvneta_update_stats(pp, 1, xdpf->len, true);
> > > + mvneta_txq_inc_put(txq);
> > > + txq->pending++;
> > > + txq->count++;
> > > +
> > > + return MVNETA_XDP_TX;
> > > +}
> > > +
> > > +static int
> > > +mvneta_xdp_xmit_back(struct mvneta_port *pp, struct xdp_buff *xdp)
> > > +{
> > > + struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
> > > + int cpu = smp_processor_id();
> > > + struct mvneta_tx_queue *txq;
> > > + struct netdev_queue *nq;
> > > + u32 ret;
> > > +
> > > + if (unlikely(!xdpf))
> > > + return MVNETA_XDP_CONSUMED;
> > 
> > Personally I'd prefer you haven't called a function which return code
> > has to be error checked in local variable init.
> 
> do you mean moving cpu = smp_processor_id(); after the if condition?
> 
> Regards,
> Lorenzo
> 
> > 
> > > +
> > > + txq = &pp->txqs[cpu % txq_number];
> > > + nq = netdev_get_tx_queue(pp->dev, txq->id);
> > > +
> > > + __netif_tx_lock(nq, cpu);
> > > + ret = mvneta_xdp_submit_frame(pp, txq, xdpf, false);
> > > + if (ret == MVNETA_XDP_TX)
> > > + mvneta_txq_pend_desc_add(pp, txq, 0);
> > > + __netif_tx_unlock(nq);
> > > +
> > > + return ret;
> > > +}

Thanks
/Ilias

Re: [PATCH v3 net-next 7/8] net: mvneta: make tx buffer array agnostic

2019-10-16 Thread Ilias Apalodimas

Hi Jakub,

On Tue, Oct 15, 2019 at 05:03:53PM -0700, Jakub Kicinski wrote:
> On Mon, 14 Oct 2019 12:49:54 +0200, Lorenzo Bianconi wrote:
> > Allow tx buffer array to contain both skb and xdp buffers in order to
> > enable xdp frame recycling adding XDP_TX verdict support
> > 
> > Signed-off-by: Lorenzo Bianconi 
> > ---
> >  drivers/net/ethernet/marvell/mvneta.c | 66 +--
> >  1 file changed, 43 insertions(+), 23 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> > b/drivers/net/ethernet/marvell/mvneta.c
> > index a79d81c9be7a..477ae6592fa3 100644
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -561,6 +561,20 @@ struct mvneta_rx_desc {
> >  };
> >  #endif
> >  
> > +enum mvneta_tx_buf_type {
> > +   MVNETA_TYPE_SKB,
> > +   MVNETA_TYPE_XDP_TX,
> > +   MVNETA_TYPE_XDP_NDO,
> > +};
> > +
> > +struct mvneta_tx_buf {
> > +   enum mvneta_tx_buf_type type;
> 
> I'd be tempted to try to encode type on the low bits of the pointer,
> otherwise you're increasing the cache pressure here. I'm not 100% sure
> it's worth the hassle, perhaps could be a future optimization.
> 

Since this is already offering a performance boost (since buffers are not
unmapped, but recycled and synced), we'll consider adding the buffer tracking
capability to the page_pool API. I don't think you'll see any performance
benefits on this device specifically (or any 1gbit interface), but your idea is
nice, if we add it on the page_pool API we'll try implementing it like that.

> > +   union {
> > +   struct xdp_frame *xdpf;
> > +   struct sk_buff *skb;
> > +   };
> > +};
> 


Thanks
/Ilias

Re: [PATCH net] net: socionext: netsec: fix xdp stats accounting

2019-10-11 Thread Ilias Apalodimas

On Fri, Oct 11, 2019 at 05:15:03PM +0300, Ilias Apalodimas wrote:
> Hi Lorenzo, 
> 
> On Fri, Oct 11, 2019 at 03:45:38PM +0200, Lorenzo Bianconi wrote:
> > Increment netdev rx counters even for XDP_DROP verdict. Moreover report
> > even tx bytes for xdp buffers (TYPE_NETSEC_XDP_TX or
> > TYPE_NETSEC_XDP_NDO)
> 
> The RX counters work fine. The TX change is causing a panic though and i am
> looking into it since your patch seems harmless. In any case please don't 
> merge
> this yet
> 

Ok i think i know what's going on. 
Our clean TX routine has a netdev_completed_queue(). This is properly accounted
for on netsec_netdev_start_xmit() which calls netdev_sent_queue().

Since the XDP never had support for that you need to account for the extra bytes
in netsec_xdp_queue_one(). That's what triggering the BUG_ON 
(lib/dynamic_queue_limits.c line 27) 

Since netdev_completed_queue() enforces barrier() and in some cases smp_mb() i
think i'd prefer it per function, although it looks uglier. 
Can you send a patch with this call in netsec_xdp_queue_one()? If we cant
measure any performance difference i am fine with adding it in that only.

Thanks
/Ilias

> Thanks
> /Ilias
> 
> > Fixes: ba2b232108d3 ("net: netsec: add XDP support")
> > Signed-off-by: Lorenzo Bianconi 
> > ---
> > just compiled not tested on a real device
> > ---
> >  drivers/net/ethernet/socionext/netsec.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/socionext/netsec.c 
> > b/drivers/net/ethernet/socionext/netsec.c
> > index f9e6744d8fd6..b1c2a79899b3 100644
> > --- a/drivers/net/ethernet/socionext/netsec.c
> > +++ b/drivers/net/ethernet/socionext/netsec.c
> > @@ -252,7 +252,6 @@
> >  #define NETSEC_XDP_CONSUMED  BIT(0)
> >  #define NETSEC_XDP_TXBIT(1)
> >  #define NETSEC_XDP_REDIR BIT(2)
> > -#define NETSEC_XDP_RX_OK (NETSEC_XDP_PASS | NETSEC_XDP_TX | 
> > NETSEC_XDP_REDIR)
> >  
> >  enum ring_id {
> > NETSEC_RING_TX = 0,
> > @@ -661,6 +660,7 @@ static bool netsec_clean_tx_dring(struct netsec_priv 
> > *priv)
> > bytes += desc->skb->len;
> > dev_kfree_skb(desc->skb);
> > } else {
> > +   bytes += desc->xdpf->len;
> > xdp_return_frame(desc->xdpf);
> > }
> >  next:
> > @@ -1030,7 +1030,7 @@ static int netsec_process_rx(struct netsec_priv 
> > *priv, int budget)
> >  
> >  next:
> > if ((skb && napi_gro_receive(&priv->napi, skb) != GRO_DROP) ||
> > -   xdp_result & NETSEC_XDP_RX_OK) {
> > +   xdp_result) {
> > ndev->stats.rx_packets++;
> > ndev->stats.rx_bytes += xdp.data_end - xdp.data;
> > }
> > -- 
> > 2.21.0
> >

Re: [PATCH net] net: socionext: netsec: fix xdp stats accounting

2019-10-11 Thread Ilias Apalodimas

Hi Lorenzo, 

On Fri, Oct 11, 2019 at 03:45:38PM +0200, Lorenzo Bianconi wrote:
> Increment netdev rx counters even for XDP_DROP verdict. Moreover report
> even tx bytes for xdp buffers (TYPE_NETSEC_XDP_TX or
> TYPE_NETSEC_XDP_NDO)

The RX counters work fine. The TX change is causing a panic though and i am
looking into it since your patch seems harmless. In any case please don't merge
this yet

Thanks
/Ilias

> Fixes: ba2b232108d3 ("net: netsec: add XDP support")
> Signed-off-by: Lorenzo Bianconi 
> ---
> just compiled not tested on a real device
> ---
>  drivers/net/ethernet/socionext/netsec.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> index f9e6744d8fd6..b1c2a79899b3 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -252,7 +252,6 @@
>  #define NETSEC_XDP_CONSUMED  BIT(0)
>  #define NETSEC_XDP_TXBIT(1)
>  #define NETSEC_XDP_REDIR BIT(2)
> -#define NETSEC_XDP_RX_OK (NETSEC_XDP_PASS | NETSEC_XDP_TX | NETSEC_XDP_REDIR)
>  
>  enum ring_id {
>   NETSEC_RING_TX = 0,
> @@ -661,6 +660,7 @@ static bool netsec_clean_tx_dring(struct netsec_priv 
> *priv)
>   bytes += desc->skb->len;
>   dev_kfree_skb(desc->skb);
>   } else {
> + bytes += desc->xdpf->len;
>   xdp_return_frame(desc->xdpf);
>   }
>  next:
> @@ -1030,7 +1030,7 @@ static int netsec_process_rx(struct netsec_priv *priv, 
> int budget)
>  
>  next:
>   if ((skb && napi_gro_receive(&priv->napi, skb) != GRO_DROP) ||
> - xdp_result & NETSEC_XDP_RX_OK) {
> + xdp_result) {
>   ndev->stats.rx_packets++;
>   ndev->stats.rx_bytes += xdp.data_end - xdp.data;
>   }
> -- 
> 2.21.0
>

Re: [PATCH v2 net-next 4/8] net: mvneta: sync dma buffers before refilling hw queues

2019-10-10 Thread Ilias Apalodimas

Hi Lorenzo, Jesper,

On Thu, Oct 10, 2019 at 09:08:31AM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 10 Oct 2019 01:18:34 +0200
> Lorenzo Bianconi  wrote:
> 
> > mvneta driver can run on not cache coherent devices so it is
> > necessary to sync dma buffers before sending them to the device
> > in order to avoid memory corruption. This patch introduce a performance
> > penalty and it is necessary to introduce a more sophisticated logic
> > in order to avoid dma sync as much as we can
> 
> Report with benchmarks here:
>  
> https://github.com/xdp-project/xdp-project/blob/master/areas/arm64/board_espressobin08_bench_xdp.org
> 
> We are testing this on an Espressobin board, and do see a huge
> performance cost associated with this DMA-sync.   Regardless we still
> want to get this patch merged, to move forward with XDP support for
> this driver. 
> 
> We promised each-other (on IRC freenode #xdp) that we will follow-up
> with a solution/mitigation, after this patchset is merged.  There are
> several ideas, that likely should get separate upstream review.

I think mentioning that the patch *introduces* a performance penalty is a bit
misleading. 
The dma sync does have a performance penalty but it was always there. 
The initial driver was mapping the DMA with DMA_FROM_DEVICE, which implies
syncing as well. In page_pool we do not explicitly sync buffers on allocation
and leave it up the driver writer (and allow him some tricks to avoid that),
thus this patch is needed.

In any case what Jesper mentions is correct, we do have a plan :)

> 
> > Signed-off-by: Lorenzo Bianconi 
> 
> Signed-off-by: Jesper Dangaard Brouer 
> 
> > ---
> >  drivers/net/ethernet/marvell/mvneta.c | 4 
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> > b/drivers/net/ethernet/marvell/mvneta.c
> > index 79a6bac0192b..ba4aa9bbc798 100644
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -1821,6 +1821,7 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
> > struct mvneta_rx_queue *rxq,
> > gfp_t gfp_mask)
> >  {
> > +   enum dma_data_direction dma_dir;
> > dma_addr_t phys_addr;
> > struct page *page;
> >  
> > @@ -1830,6 +1831,9 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
> > return -ENOMEM;
> >  
> > phys_addr = page_pool_get_dma_addr(page) + pp->rx_offset_correction;
> > +   dma_dir = page_pool_get_dma_dir(rxq->page_pool);
> > +   dma_sync_single_for_device(pp->dev->dev.parent, phys_addr,
> > +  MVNETA_MAX_RX_BUF_SIZE, dma_dir);
> > mvneta_rx_desc_fill(rx_desc, phys_addr, page, rxq);
> >  
> > return 0;
> 
> 
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Thanks!
/Ilias

Re: [PATCH v2 net-next 3/8] net: mvneta: rely on build_skb in mvneta_rx_swbm poll routine

2019-10-10 Thread Ilias Apalodimas

Hi Lorenzo, 

On Thu, Oct 10, 2019 at 01:18:33AM +0200, Lorenzo Bianconi wrote:
> Refactor mvneta_rx_swbm code introducing mvneta_swbm_rx_frame and
> mvneta_swbm_add_rx_fragment routines. Rely on build_skb in oreder to
> allocate skb since the previous patch introduced buffer recycling using
> the page_pool API.
> This patch fixes even an issue in the original driver where dma buffers
> are accessed before dma sync
> 
> Signed-off-by: Ilias Apalodimas 
> Signed-off-by: Jesper Dangaard Brouer 
> Signed-off-by: Lorenzo Bianconi 
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 198 ++
>  1 file changed, 104 insertions(+), 94 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> b/drivers/net/ethernet/marvell/mvneta.c
> index 31cecc1ed848..79a6bac0192b 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -323,6 +323,11 @@
> ETH_HLEN + ETH_FCS_LEN,\
> cache_line_size())
>  
> +#define MVNETA_SKB_PAD   (SKB_DATA_ALIGN(sizeof(struct skb_shared_info) 
> + \
> +  NET_SKB_PAD))
> +#define MVNETA_SKB_SIZE(len) (SKB_DATA_ALIGN(len) + MVNETA_SKB_PAD)
> +#define MVNETA_MAX_RX_BUF_SIZE   (PAGE_SIZE - MVNETA_SKB_PAD)
> +
>  #define IS_TSO_HEADER(txq, addr) \
>   ((addr >= txq->tso_hdrs_phys) && \
>(addr < txq->tso_hdrs_phys + txq->size * TSO_HEADER_SIZE))
> @@ -646,7 +651,6 @@ static int txq_number = 8;
>  static int rxq_def;
>  
>  static int rx_copybreak __read_mostly = 256;
> -static int rx_header_size __read_mostly = 128;
>  
>  /* HW BM need that each port be identify by a unique ID */
>  static int global_port_id;
> + if (rxq->left_size > MVNETA_MAX_RX_BUF_SIZE) {

[...]

> + len = MVNETA_MAX_RX_BUF_SIZE;
> + data_len = len;
> + } else {
> + len = rxq->left_size;
> + data_len = len - ETH_FCS_LEN;
> + }
> + dma_dir = page_pool_get_dma_dir(rxq->page_pool);
> + dma_sync_single_range_for_cpu(dev->dev.parent,
> +   rx_desc->buf_phys_addr, 0,
> +   len, dma_dir);
> + if (data_len > 0) {
> + /* refill descriptor with new buffer later */
> + skb_add_rx_frag(rxq->skb,
> + skb_shinfo(rxq->skb)->nr_frags,
> + page, NET_SKB_PAD, data_len,
> + PAGE_SIZE);
> +
> + page_pool_release_page(rxq->page_pool, page);
> + rx_desc->buf_phys_addr = 0;

Shouldn't we unmap and set the buf_phys_addr to 0 regardless of the data_len?

> + }
> + rxq->left_size -= len;
> +}
> +
>   mvneta_rxq_buf_size_set(pp, rxq, PAGE_SIZE < SZ_64K ?

[...]

> - PAGE_SIZE :
> + MVNETA_MAX_RX_BUF_SIZE :
>   MVNETA_RX_BUF_SIZE(pp->pkt_size));
>   mvneta_rxq_bm_disable(pp, rxq);
>   mvneta_rxq_fill(pp, rxq, rxq->size);
> @@ -4656,7 +4666,7 @@ static int mvneta_probe(struct platform_device *pdev)
>   SET_NETDEV_DEV(dev, &pdev->dev);
>  
>   pp->id = global_port_id++;
> - pp->rx_offset_correction = 0; /* not relevant for SW BM */
> + pp->rx_offset_correction = NET_SKB_PAD;
>  
>   /* Obtain access to BM resources if enabled and already initialized */
>   bm_node = of_parse_phandle(dn, "buffer-manager", 0);
> -- 
> 2.21.0
> 

Regards
/Ilias

Re: [PATCH 3/7] net: mvneta: rely on build_skb in mvneta_rx_swbm poll routine

2019-10-07 Thread Ilias Apalodimas

Hi Lorenzo,

On Sat, Oct 05, 2019 at 10:44:36PM +0200, Lorenzo Bianconi wrote:
> Refactor mvneta_rx_swbm code introducing mvneta_swbm_rx_frame and
> mvneta_swbm_add_rx_fragment routines. Rely on build_skb in order to
> allocate skb since the previous patch introduced buffer recycling using
> the page_pool API
> 
> Tested-by: Ilias Apalodimas 
> Signed-off-by: Ilias Apalodimas 
> Signed-off-by: Jesper Dangaard Brouer 
> Signed-off-by: Lorenzo Bianconi 
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 198 ++
>  1 file changed, 104 insertions(+), 94 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> b/drivers/net/ethernet/marvell/mvneta.c
> index 8beae0e1eda7..d775fcae9353 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -323,6 +323,11 @@
> ETH_HLEN + ETH_FCS_LEN,\
> cache_line_size())
>  
> +#define MVNETA_SKB_PAD   (SKB_DATA_ALIGN(sizeof(struct skb_shared_info) 
> + \
> +  NET_SKB_PAD))
> +#define MVNETA_SKB_SIZE(len) (SKB_DATA_ALIGN(len) + MVNETA_SKB_PAD)
> +#define MVNETA_MAX_RX_BUF_SIZE   (PAGE_SIZE - MVNETA_SKB_PAD)
> +
>  #define IS_TSO_HEADER(txq, addr) \
>   ((addr >= txq->tso_hdrs_phys) && \
>(addr < txq->tso_hdrs_phys + txq->size * TSO_HEADER_SIZE))
> @@ -646,7 +651,6 @@ static int txq_number = 8;
>  static int rxq_def;
>  
>  static int rx_copybreak __read_mostly = 256;
> -static int rx_header_size __read_mostly = 128;
>  
>  /* HW BM need that each port be identify by a unique ID */
>  static int global_port_id;
> @@ -1941,30 +1945,102 @@ int mvneta_rx_refill_queue(struct mvneta_port *pp, 
> struct mvneta_rx_queue *rxq)
>   return i;
>  }
>  
> +static int
> +mvneta_swbm_rx_frame(struct mvneta_port *pp,
> +  struct mvneta_rx_desc *rx_desc,
> +  struct mvneta_rx_queue *rxq,
> +  struct page *page)
> +{
> + unsigned char *data = page_address(page);
> + int data_len = -MVNETA_MH_SIZE, len;
> + struct net_device *dev = pp->dev;
> + enum dma_data_direction dma_dir;
> +
> + if (MVNETA_SKB_SIZE(rx_desc->data_size) > PAGE_SIZE) {
> + len = MVNETA_MAX_RX_BUF_SIZE;
> + data_len += len;
> + } else {
> + len = rx_desc->data_size;
> + data_len += len - ETH_FCS_LEN;
> + }
> +
> + dma_dir = page_pool_get_dma_dir(rxq->page_pool);
> + dma_sync_single_range_for_cpu(dev->dev.parent,
> +   rx_desc->buf_phys_addr, 0,
> +   len, dma_dir);
> +
> + rxq->skb = build_skb(data, PAGE_SIZE);
> + if (unlikely(!rxq->skb)) {
> + netdev_err(dev,
> +"Can't allocate skb on queue %d\n",
> +rxq->id);
> + dev->stats.rx_dropped++;
> + rxq->skb_alloc_err++;
> + return -ENOMEM;
> + }
> + page_pool_release_page(rxq->page_pool, page);
> +
> + skb_reserve(rxq->skb, MVNETA_MH_SIZE + NET_SKB_PAD);
> + skb_put(rxq->skb, data_len);
> + mvneta_rx_csum(pp, rx_desc->status, rxq->skb);
> +
> + rxq->left_size = rx_desc->data_size - len;
> + rx_desc->buf_phys_addr = 0;
> +
> + return 0;
> +}
> +
> +static void
> +mvneta_swbm_add_rx_fragment(struct mvneta_port *pp,
> + struct mvneta_rx_desc *rx_desc,
> + struct mvneta_rx_queue *rxq,
> + struct page *page)
> +{
> + struct net_device *dev = pp->dev;
> + enum dma_data_direction dma_dir;
> + int data_len, len;
> +
> + if (rxq->left_size > MVNETA_MAX_RX_BUF_SIZE) {
> + len = MVNETA_MAX_RX_BUF_SIZE;
> + data_len = len;
> + } else {
> + len = rxq->left_size;
> + data_len = len - ETH_FCS_LEN;
> + }
> + dma_dir = page_pool_get_dma_dir(rxq->page_pool);
> + dma_sync_single_range_for_cpu(dev->dev.parent,
> +   rx_desc->buf_phys_addr, 0,
> +   len, dma_dir);
> + if (data_len > 0) {
> + /* refill descriptor with new buffer later */
> + skb_add_rx_frag(rxq->skb,
> + skb_shinfo(rxq->skb)->nr_frags,
> + page, NET_SKB_PAD, data_len,
> + PAGE_SIZE);
> +
> + page_pool_release_page(rxq->p

Re: [PATCH 2/7] net: mvneta: introduce page pool API for sw buffer manager

2019-10-05 Thread Ilias Apalodimas

Hi Lorenzo, 

On Sat, Oct 05, 2019 at 10:44:35PM +0200, Lorenzo Bianconi wrote:
> Use the page_pool api for allocations and DMA handling instead of
> __dev_alloc_page()/dma_map_page() and free_page()/dma_unmap_page().
> Pages are unmapped using page_pool_release_page before packets
> go into the network stack.
> 
> The page_pool API offers buffer recycling capabilities for XDP but
> allocates one page per packet, unless the driver splits and manages
> the allocated page.
> This is a preliminary patch to add XDP support to mvneta driver
> 
> Tested-by: Ilias Apalodimas 
> Signed-off-by: Ilias Apalodimas 
> Signed-off-by: Jesper Dangaard Brouer 
> Signed-off-by: Lorenzo Bianconi 
> ---
>  drivers/net/ethernet/marvell/Kconfig  |  1 +
>  drivers/net/ethernet/marvell/mvneta.c | 76 ---
>  2 files changed, 58 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/Kconfig 
> b/drivers/net/ethernet/marvell/Kconfig
> index fb942167ee54..3d5caea096fb 100644
> --- a/drivers/net/ethernet/marvell/Kconfig
> +++ b/drivers/net/ethernet/marvell/Kconfig
> @@ -61,6 +61,7 @@ config MVNETA
>   depends on ARCH_MVEBU || COMPILE_TEST
>   select MVMDIO
>   select PHYLINK
> + select PAGE_POOL
>   ---help---
> This driver supports the network interface units in the
> Marvell ARMADA XP, ARMADA 370, ARMADA 38x and
> diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> b/drivers/net/ethernet/marvell/mvneta.c
> index 128b9fded959..8beae0e1eda7 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -37,6 +37,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* Registers */
>  #define MVNETA_RXQ_CONFIG_REG(q)(0x1400 + ((q) << 2))
> @@ -603,6 +604,10 @@ struct mvneta_rx_queue {
>   u32 pkts_coal;
>   u32 time_coal;
>  
> + /* page_pool */
> + struct page_pool *page_pool;
> + struct xdp_rxq_info xdp_rxq;
> +
>   /* Virtual address of the RX buffer */
>   void  **buf_virt_addr;
>  
> @@ -1815,19 +1820,12 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
>   dma_addr_t phys_addr;
>   struct page *page;
>  
> - page = __dev_alloc_page(gfp_mask);
> + page = page_pool_alloc_pages(rxq->page_pool,
> +  gfp_mask | __GFP_NOWARN);
>   if (!page)
>   return -ENOMEM;

Is the driver syncing the buffer somewhere else? (for_device)
If not you'll have to do this here. 

On a non-cache coherent machine (and i think this one is) you may get dirty
cache lines handed to the device. Those dirty cache lines might get written back
*after* the device has DMA'ed it's data. You need to flush those first to avoid
any data corruption

>  
> - /* map page for use */
> - phys_addr = dma_map_page(pp->dev->dev.parent, page, 0, PAGE_SIZE,
> -  DMA_FROM_DEVICE);
> - if (unlikely(dma_mapping_error(pp->dev->dev.parent, phys_addr))) {
> - __free_page(page);
> - return -ENOMEM;
> - }
> -
> - phys_addr += pp->rx_offset_correction;
> + phys_addr = page_pool_get_dma_addr(page) + pp->rx_offset_correction;
>   mvneta_rx_desc_fill(rx_desc, phys_addr, page, rxq);
>   return 0;
>  }
> @@ -1894,10 +1892,11 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port 
> *pp,
>   if (!data || !(rx_desc->buf_phys_addr))
>   continue;
>  
> - dma_unmap_page(pp->dev->dev.parent, rx_desc->buf_phys_addr,
> -PAGE_SIZE, DMA_FROM_DEVICE);
> - __free_page(data);
> + page_pool_put_page(rxq->page_pool, data, false);
>   }
> + if (xdp_rxq_info_is_reg(&rxq->xdp_rxq))
> + xdp_rxq_info_unreg(&rxq->xdp_rxq);
> + page_pool_destroy(rxq->page_pool);
>  }
>  
>  static void
> @@ -2029,8 +2028,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
>   skb_add_rx_frag(rxq->skb, frag_num, page,
>   frag_offset, frag_size,
>   PAGE_SIZE);
> - dma_unmap_page(dev->dev.parent, phys_addr,
> -PAGE_SIZE, DMA_FROM_DEVICE);
> + page_pool_release_page(rxq->page_pool, page);
>   rxq->left_size -= frag_size;
>   }
>   } else {
> @@ -2060,9 +2058,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,

Re: [RFC 3/4] net: mvneta: add basic XDP support

2019-10-01 Thread Ilias Apalodimas

On Tue, Oct 01, 2019 at 11:24:43AM +0200, Lorenzo Bianconi wrote:
> Add basic XDP support to mvneta driver for devices that rely on software
> buffer management. Currently supported verdicts are:
> - XDP_DROP
> - XDP_PASS
> - XDP_REDIRECT
> 
> Signed-off-by: Lorenzo Bianconi 
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 145 --
>  1 file changed, 136 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> b/drivers/net/ethernet/marvell/mvneta.c
> index e842c744e4f3..f2d12556efa8 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
[...]
>   .pool_size = size,
>   .nid = cpu_to_node(0),
>   .dev = pp->dev->dev.parent,
> - .dma_dir = DMA_FROM_DEVICE,
> + .dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE,
>   };
> + int err;
>  
>   rxq->page_pool = page_pool_create(&pp_params);
>   if (IS_ERR(rxq->page_pool)) {
> @@ -2851,7 +2920,22 @@ static int mvneta_create_page_pool(struct mvneta_port 
> *pp,
>   return PTR_ERR(rxq->page_pool);
>   }
>  
> + err = xdp_rxq_info_reg(&rxq->xdp_rxq, pp->dev, 0);
> + if (err < 0)
> + goto err_free_pp;
> +
> + err = xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL,
> +  rxq->page_pool);
> + if (err)
> + goto err_unregister_pp;

I think this should be part of patch [1/4], adding page pol support. 
Jesper introduced the changes to track down inflight packets [1], so you need
those changes in place when implementing page_pool

> +
>   return 0;
> +
> +err_unregister_pp:
> + xdp_rxq_info_unreg(&rxq->xdp_rxq);
> +err_free_pp:
> + page_pool_destroy(rxq->page_pool);
> + return err;
>  }
>  
>  /* Handle rxq fill: allocates rxq skbs; called when initializing a port */
> @@ -3291,6 +3375,11 @@ static int mvneta_change_mtu(struct net_device *dev, 
> int mtu)
>   mtu = ALIGN(MVNETA_RX_PKT_SIZE(mtu), 8);
>   }
>  
> + if (pp->xdp_prog && mtu > MVNETA_MAX_RX_BUF_SIZE) {
> + netdev_info(dev, "Illegal MTU value %d for XDP mode\n", mtu);
> + return -EINVAL;
> + }
> +
>   dev->mtu = mtu;
>  
>   if (!netif_running(dev)) {
> @@ -3960,6 +4049,43 @@ static int mvneta_ioctl(struct net_device *dev, struct 
> ifreq *ifr, int cmd)
>   return phylink_mii_ioctl(pp->phylink, ifr, cmd);
>  }
>  
> +static int mvneta_xdp_setup(struct net_device *dev, struct bpf_prog *prog,
> + struct netlink_ext_ack *extack)
> +{
> + struct mvneta_port *pp = netdev_priv(dev);
> + struct bpf_prog *old_prog;
> +
> + if (prog && dev->mtu > MVNETA_MAX_RX_BUF_SIZE) {
> + NL_SET_ERR_MSG_MOD(extack, "Jumbo frames not supported on XDP");
> + return -EOPNOTSUPP;
> + }
> +
> + mvneta_stop(dev);
> +
> + old_prog = xchg(&pp->xdp_prog, prog);
> + if (old_prog)
> + bpf_prog_put(old_prog);
> +
> + mvneta_open(dev);
> +
> + return 0;
> +}
> +
> +static int mvneta_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> +{
> + struct mvneta_port *pp = netdev_priv(dev);
> +
> + switch (xdp->command) {
> + case XDP_SETUP_PROG:
> + return mvneta_xdp_setup(dev, xdp->prog, xdp->extack);
> + case XDP_QUERY_PROG:
> + xdp->prog_id = pp->xdp_prog ? pp->xdp_prog->aux->id : 0;
> + return 0;
> + default:
> + return -EINVAL;
> + }
> +}
> +
>  /* Ethtool methods */
>  
>  /* Set link ksettings (phy address, speed) for ethtools */
> @@ -4356,6 +4482,7 @@ static const struct net_device_ops mvneta_netdev_ops = {
>   .ndo_fix_features= mvneta_fix_features,
>   .ndo_get_stats64 = mvneta_get_stats64,
>   .ndo_do_ioctl= mvneta_ioctl,
> + .ndo_bpf = mvneta_xdp,
>  };
>  
>  static const struct ethtool_ops mvneta_eth_tool_ops = {
> @@ -4646,7 +4773,7 @@ static int mvneta_probe(struct platform_device *pdev)
>   SET_NETDEV_DEV(dev, &pdev->dev);
>  
>   pp->id = global_port_id++;
> - pp->rx_offset_correction = NET_SKB_PAD;
> + pp->rx_offset_correction = MVNETA_SKB_HEADROOM;
>  
>   /* Obtain access to BM resources if enabled and already initialized */
>   bm_node = of_parse_phandle(dn, "buffer-manager", 0);
> -- 
> 2.21.0
> 

[1] 
https://lore.kernel.org/netdev/156086304827.27760.11339786046465638081.stgit@firesoul/


Regards
/Ilias

Re: [PATCH net] net: socionext: netsec: always grab descriptor lock

2019-10-01 Thread Ilias Apalodimas

On Tue, Oct 01, 2019 at 10:33:51AM +0200, Lorenzo Bianconi wrote:
> Always acquire tx descriptor spinlock even if a xdp program is not loaded
> on the netsec device since ndo_xdp_xmit can run concurrently with
> netsec_netdev_start_xmit and netsec_clean_tx_dring. This can happen
> loading a xdp program on a different device (e.g virtio-net) and
> xdp_do_redirect_map/xdp_do_redirect_slow can redirect to netsec even if
> we do not have a xdp program on it.
> 
> Fixes: ba2b232108d3 ("net: netsec: add XDP support")
> Tested-by: Ilias Apalodimas 
> Signed-off-by: Lorenzo Bianconi 
> ---
>  drivers/net/ethernet/socionext/netsec.c | 30 ++---
>  1 file changed, 7 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> index 55db7fbd43cc..f9e6744d8fd6 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -282,7 +282,6 @@ struct netsec_desc_ring {
>   void *vaddr;
>   u16 head, tail;
>   u16 xdp_xmit; /* netsec_xdp_xmit packets */
> - bool is_xdp;
>   struct page_pool *page_pool;
>   struct xdp_rxq_info xdp_rxq;
>   spinlock_t lock; /* XDP tx queue locking */
> @@ -634,8 +633,7 @@ static bool netsec_clean_tx_dring(struct netsec_priv 
> *priv)
>   unsigned int bytes;
>   int cnt = 0;
>  
> - if (dring->is_xdp)
> - spin_lock(&dring->lock);
> + spin_lock(&dring->lock);
>  
>   bytes = 0;
>   entry = dring->vaddr + DESC_SZ * tail;
> @@ -682,8 +680,8 @@ static bool netsec_clean_tx_dring(struct netsec_priv 
> *priv)
>   entry = dring->vaddr + DESC_SZ * tail;
>   cnt++;
>   }
> - if (dring->is_xdp)
> - spin_unlock(&dring->lock);
> +
> + spin_unlock(&dring->lock);
>  
>   if (!cnt)
>   return false;
> @@ -799,9 +797,6 @@ static void netsec_set_tx_de(struct netsec_priv *priv,
>   de->data_buf_addr_lw = lower_32_bits(desc->dma_addr);
>   de->buf_len_info = (tx_ctrl->tcp_seg_len << 16) | desc->len;
>   de->attr = attr;
> - /* under spin_lock if using XDP */
> - if (!dring->is_xdp)
> - dma_wmb();
>  
>   dring->desc[idx] = *desc;
>   if (desc->buf_type == TYPE_NETSEC_SKB)
> @@ -1123,12 +1118,10 @@ static netdev_tx_t netsec_netdev_start_xmit(struct 
> sk_buff *skb,
>   u16 tso_seg_len = 0;
>   int filled;
>  
> - if (dring->is_xdp)
> - spin_lock_bh(&dring->lock);
> + spin_lock_bh(&dring->lock);
>   filled = netsec_desc_used(dring);
>   if (netsec_check_stop_tx(priv, filled)) {
> - if (dring->is_xdp)
> - spin_unlock_bh(&dring->lock);
> + spin_unlock_bh(&dring->lock);
>   net_warn_ratelimited("%s %s Tx queue full\n",
>dev_name(priv->dev), ndev->name);
>   return NETDEV_TX_BUSY;
> @@ -1161,8 +1154,7 @@ static netdev_tx_t netsec_netdev_start_xmit(struct 
> sk_buff *skb,
>   tx_desc.dma_addr = dma_map_single(priv->dev, skb->data,
> skb_headlen(skb), DMA_TO_DEVICE);
>   if (dma_mapping_error(priv->dev, tx_desc.dma_addr)) {
> - if (dring->is_xdp)
> - spin_unlock_bh(&dring->lock);
> + spin_unlock_bh(&dring->lock);
>   netif_err(priv, drv, priv->ndev,
> "%s: DMA mapping failed\n", __func__);
>   ndev->stats.tx_dropped++;
> @@ -1177,8 +1169,7 @@ static netdev_tx_t netsec_netdev_start_xmit(struct 
> sk_buff *skb,
>   netdev_sent_queue(priv->ndev, skb->len);
>  
>   netsec_set_tx_de(priv, dring, &tx_ctrl, &tx_desc, skb);
> - if (dring->is_xdp)
> - spin_unlock_bh(&dring->lock);
> + spin_unlock_bh(&dring->lock);
>   netsec_write(priv, NETSEC_REG_NRM_TX_PKTCNT, 1); /* submit another tx */
>  
>   return NETDEV_TX_OK;
> @@ -1262,7 +1253,6 @@ static int netsec_alloc_dring(struct netsec_priv *priv, 
> enum ring_id id)
>  static void netsec_setup_tx_dring(struct netsec_priv *priv)
>  {
>   struct netsec_desc_ring *dring = &priv->desc_ring[NETSEC_RING_TX];
> -     struct bpf_prog *xdp_prog = READ_ONCE(priv->xdp_prog);
>   int i;
>  
>   for (i = 0; i < DESC_NUM; i++) {
> @@ -1275,12 +1265,6 @@ static void netsec_setup_tx_dring(struct netsec_priv 
> *priv)
>*/
>   de->attr = 1U << NETSEC_TX_SHIFT_OWN_FIELD;
>   }
> -
> - if (xdp_prog)
> - dring->is_xdp = true;
> - else
> - dring->is_xdp = false;
> -
>  }
>  
>  static int netsec_setup_rx_dring(struct netsec_priv *priv)
> -- 
> 2.21.0
> 

Reviewed-by: Ilias Apalodimas

Re: [PATCH v4 net-next 2/6] net: dsa: Pass ndo_setup_tc slave callback to drivers

2019-09-16 Thread Ilias Apalodimas

On Sun, Sep 15, 2019 at 04:59:59AM +0300, Vladimir Oltean wrote:
> DSA currently handles shared block filters (for the classifier-action
> qdisc) in the core due to what I believe are simply pragmatic reasons -
> hiding the complexity from drivers and offerring a simple API for port
> mirroring.
> 
> Extend the dsa_slave_setup_tc function by passing all other qdisc
> offloads to the driver layer, where the driver may choose what it
> implements and how. DSA is simply a pass-through in this case.
> 
> Signed-off-by: Vladimir Oltean 
> Acked-by: Kurt Kanzenbach 
> Reviewed-by: Florian Fainelli 
> ---
> Changes since v2:
> - Added Florian Fainelli's Reviewed-by.
> 
> Changes since v1:
> - Added Kurt Kanzenbach's Acked-by.
> 
> Changes since RFC:
> - Removed the unused declaration of struct tc_taprio_qopt_offload.
> 
>  include/net/dsa.h |  2 ++
>  net/dsa/slave.c   | 12 
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index 96acb14ec1a8..541fb514e31d 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -515,6 +515,8 @@ struct dsa_switch_ops {
>  bool ingress);
>   void(*port_mirror_del)(struct dsa_switch *ds, int port,
>  struct dsa_mall_mirror_tc_entry *mirror);
> + int (*port_setup_tc)(struct dsa_switch *ds, int port,
> +  enum tc_setup_type type, void *type_data);
>  
>   /*
>* Cross-chip operations
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 9a88035517a6..75d58229a4bd 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -1035,12 +1035,16 @@ static int dsa_slave_setup_tc_block(struct net_device 
> *dev,
>  static int dsa_slave_setup_tc(struct net_device *dev, enum tc_setup_type 
> type,
> void *type_data)
>  {
> - switch (type) {
> - case TC_SETUP_BLOCK:
> + struct dsa_port *dp = dsa_slave_to_port(dev);
> + struct dsa_switch *ds = dp->ds;
> +
> + if (type == TC_SETUP_BLOCK)
>   return dsa_slave_setup_tc_block(dev, type_data);
> - default:
> +
> + if (!ds->ops->port_setup_tc)
>   return -EOPNOTSUPP;
> - }
> +
> + return ds->ops->port_setup_tc(ds, dp->index, type, type_data);
>  }
>  
>  static void dsa_slave_get_stats64(struct net_device *dev,
> -- 
> 2.17.1
> 

Acked-by: Ilias Apalodimas

Re: [PATCH v2 net-next 2/7] net: dsa: Pass ndo_setup_tc slave callback to drivers

2019-09-16 Thread Ilias Apalodimas

Hi Vladimir,

Yes fixes my request on the initial RFC. Sorry for the delayed response.

On Sat, Sep 14, 2019 at 04:17:57AM +0300, Vladimir Oltean wrote:
> DSA currently handles shared block filters (for the classifier-action
> qdisc) in the core due to what I believe are simply pragmatic reasons -
> hiding the complexity from drivers and offerring a simple API for port
> mirroring.
> 
> Extend the dsa_slave_setup_tc function by passing all other qdisc
> offloads to the driver layer, where the driver may choose what it
> implements and how. DSA is simply a pass-through in this case.
> 
> Signed-off-by: Vladimir Oltean 
> Acked-by: Kurt Kanzenbach 
> ---
> Changes since v1:
> - Added Kurt Kanzenbach's Acked-by.
> 
> Changes since RFC:
> - Removed the unused declaration of struct tc_taprio_qopt_offload.
> 
>  include/net/dsa.h |  2 ++
>  net/dsa/slave.c   | 12 
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index 96acb14ec1a8..541fb514e31d 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -515,6 +515,8 @@ struct dsa_switch_ops {
>  bool ingress);
>   void(*port_mirror_del)(struct dsa_switch *ds, int port,
>  struct dsa_mall_mirror_tc_entry *mirror);
> + int (*port_setup_tc)(struct dsa_switch *ds, int port,
> +  enum tc_setup_type type, void *type_data);
>  
>   /*
>* Cross-chip operations
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 9a88035517a6..75d58229a4bd 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -1035,12 +1035,16 @@ static int dsa_slave_setup_tc_block(struct net_device 
> *dev,
>  static int dsa_slave_setup_tc(struct net_device *dev, enum tc_setup_type 
> type,
> void *type_data)
>  {
> - switch (type) {
> - case TC_SETUP_BLOCK:
> + struct dsa_port *dp = dsa_slave_to_port(dev);
> + struct dsa_switch *ds = dp->ds;
> +
> + if (type == TC_SETUP_BLOCK)
>   return dsa_slave_setup_tc_block(dev, type_data);
> - default:
> +
> + if (!ds->ops->port_setup_tc)
>   return -EOPNOTSUPP;
> -     }
> +
> + return ds->ops->port_setup_tc(ds, dp->index, type, type_data);
>  }
>  
>  static void dsa_slave_get_stats64(struct net_device *dev,
> -- 
> 2.17.1
> 

Acked-by: Ilias Apalodimas

Re: [PATCH net-next] page_pool: fix logic in __page_pool_get_cached

2019-08-14 Thread Ilias Apalodimas

Hi Jonathan,

Thanks!

On Tue, Aug 13, 2019 at 10:45:09AM -0700, Jonathan Lemon wrote:
> __page_pool_get_cached() will return NULL when the ring is
> empty, even if there are pages present in the lookaside cache.
> 
> It is also possible to refill the cache, and then return a
> NULL page.
> 
> Restructure the logic so eliminate both cases.

Acked-by: Ilias Apalodimas 

> 
> Signed-off-by: Jonathan Lemon 
> ---
>  net/core/page_pool.c | 39 ---
>  1 file changed, 16 insertions(+), 23 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 68510eb869ea..de09a74a39a4 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -82,12 +82,9 @@ EXPORT_SYMBOL(page_pool_create);
>  static struct page *__page_pool_get_cached(struct page_pool *pool)
>  {
>   struct ptr_ring *r = &pool->ring;
> + bool refill = false;
>   struct page *page;
>  
> - /* Quicker fallback, avoid locks when ring is empty */
> - if (__ptr_ring_empty(r))
> - return NULL;
> -
>   /* Test for safe-context, caller should provide this guarantee */
>   if (likely(in_serving_softirq())) {
>   if (likely(pool->alloc.count)) {
> @@ -95,27 +92,23 @@ static struct page *__page_pool_get_cached(struct 
> page_pool *pool)
>   page = pool->alloc.cache[--pool->alloc.count];
>   return page;
>   }
> - /* Slower-path: Alloc array empty, time to refill
> -  *
> -  * Open-coded bulk ptr_ring consumer.
> -  *
> -  * Discussion: the ring consumer lock is not really
> -  * needed due to the softirq/NAPI protection, but
> -  * later need the ability to reclaim pages on the
> -  * ring. Thus, keeping the locks.
> -  */
> - spin_lock(&r->consumer_lock);
> - while ((page = __ptr_ring_consume(r))) {
> - if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
> - break;
> - pool->alloc.cache[pool->alloc.count++] = page;
> - }
> - spin_unlock(&r->consumer_lock);
> - return page;
> + refill = true;
>   }
>  
> - /* Slow-path: Get page from locked ring queue */
> - page = ptr_ring_consume(&pool->ring);
> + /* Quicker fallback, avoid locks when ring is empty */
> + if (__ptr_ring_empty(r))
> + return NULL;
> +
> + /* Slow-path: Get page from locked ring queue,
> +  * refill alloc array if requested.
> +  */
> + spin_lock(&r->consumer_lock);
> + page = __ptr_ring_consume(r);
> + if (refill)
> + pool->alloc.count = __ptr_ring_consume_batched(r,
> + pool->alloc.cache,
> + PP_ALLOC_CACHE_REFILL);
> + spin_unlock(&r->consumer_lock);
>   return page;
>  }
>  
> -- 
> 2.17.1
>

[PATCH] MAINTAINERS: update netsec driver

2019-07-18 Thread Ilias Apalodimas

Add myself to maintainers since i provided the XDP and page_pool
implementation

Signed-off-by: Ilias Apalodimas 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 211ea3a199bd..64f659d8346c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14789,6 +14789,7 @@ F:  
Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
 
 SOCIONEXT (SNI) NETSEC NETWORK DRIVER
 M: Jassi Brar 
+M: Ilias Apalodimas 
 L: netdev@vger.kernel.org
 S: Maintained
 F: drivers/net/ethernet/socionext/netsec.c
-- 
2.20.1

[PATCH 2/2] net: netsec: remove static declaration for netsec_set_tx_de()

2019-07-09 Thread Ilias Apalodimas

On commit ba2b232108d3 ("net: netsec: add XDP support") a static
declaration for netsec_set_tx_de() was added to make the diff easier
to read.  Now that the patch is merged let's move the functions around
and get rid of that

Signed-off-by: Ilias Apalodimas 
---
 drivers/net/ethernet/socionext/netsec.c | 86 -
 1 file changed, 41 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index 7f9280f1fb28..1502fe8b0456 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -328,11 +328,6 @@ struct netsec_rx_pkt_info {
bool err_flag;
 };
 
-static void netsec_set_tx_de(struct netsec_priv *priv,
-struct netsec_desc_ring *dring,
-const struct netsec_tx_pkt_ctrl *tx_ctrl,
-const struct netsec_desc *desc, void *buf);
-
 static void netsec_write(struct netsec_priv *priv, u32 reg_addr, u32 val)
 {
writel(val, priv->ioaddr + reg_addr);
@@ -778,6 +773,47 @@ static void netsec_finalize_xdp_rx(struct netsec_priv 
*priv, u32 xdp_res,
netsec_xdp_ring_tx_db(priv, pkts);
 }
 
+static void netsec_set_tx_de(struct netsec_priv *priv,
+struct netsec_desc_ring *dring,
+const struct netsec_tx_pkt_ctrl *tx_ctrl,
+const struct netsec_desc *desc, void *buf)
+{
+   int idx = dring->head;
+   struct netsec_de *de;
+   u32 attr;
+
+   de = dring->vaddr + (DESC_SZ * idx);
+
+   attr = (1 << NETSEC_TX_SHIFT_OWN_FIELD) |
+  (1 << NETSEC_TX_SHIFT_PT_FIELD) |
+  (NETSEC_RING_GMAC << NETSEC_TX_SHIFT_TDRID_FIELD) |
+  (1 << NETSEC_TX_SHIFT_FS_FIELD) |
+  (1 << NETSEC_TX_LAST) |
+  (tx_ctrl->cksum_offload_flag << NETSEC_TX_SHIFT_CO) |
+  (tx_ctrl->tcp_seg_offload_flag << NETSEC_TX_SHIFT_SO) |
+  (1 << NETSEC_TX_SHIFT_TRS_FIELD);
+   if (idx == DESC_NUM - 1)
+   attr |= (1 << NETSEC_TX_SHIFT_LD_FIELD);
+
+   de->data_buf_addr_up = upper_32_bits(desc->dma_addr);
+   de->data_buf_addr_lw = lower_32_bits(desc->dma_addr);
+   de->buf_len_info = (tx_ctrl->tcp_seg_len << 16) | desc->len;
+   de->attr = attr;
+   /* under spin_lock if using XDP */
+   if (!dring->is_xdp)
+   dma_wmb();
+
+   dring->desc[idx] = *desc;
+   if (desc->buf_type == TYPE_NETSEC_SKB)
+   dring->desc[idx].skb = buf;
+   else if (desc->buf_type == TYPE_NETSEC_XDP_TX ||
+desc->buf_type == TYPE_NETSEC_XDP_NDO)
+   dring->desc[idx].xdpf = buf;
+
+   /* move head ahead */
+   dring->head = (dring->head + 1) % DESC_NUM;
+}
+
 /* The current driver only supports 1 Txq, this should run under spin_lock() */
 static u32 netsec_xdp_queue_one(struct netsec_priv *priv,
struct xdp_frame *xdpf, bool is_ndo)
@@ -1041,46 +1077,6 @@ static int netsec_napi_poll(struct napi_struct *napi, 
int budget)
return done;
 }
 
-static void netsec_set_tx_de(struct netsec_priv *priv,
-struct netsec_desc_ring *dring,
-const struct netsec_tx_pkt_ctrl *tx_ctrl,
-const struct netsec_desc *desc, void *buf)
-{
-   int idx = dring->head;
-   struct netsec_de *de;
-   u32 attr;
-
-   de = dring->vaddr + (DESC_SZ * idx);
-
-   attr = (1 << NETSEC_TX_SHIFT_OWN_FIELD) |
-  (1 << NETSEC_TX_SHIFT_PT_FIELD) |
-  (NETSEC_RING_GMAC << NETSEC_TX_SHIFT_TDRID_FIELD) |
-  (1 << NETSEC_TX_SHIFT_FS_FIELD) |
-  (1 << NETSEC_TX_LAST) |
-  (tx_ctrl->cksum_offload_flag << NETSEC_TX_SHIFT_CO) |
-  (tx_ctrl->tcp_seg_offload_flag << NETSEC_TX_SHIFT_SO) |
-  (1 << NETSEC_TX_SHIFT_TRS_FIELD);
-   if (idx == DESC_NUM - 1)
-   attr |= (1 << NETSEC_TX_SHIFT_LD_FIELD);
-
-   de->data_buf_addr_up = upper_32_bits(desc->dma_addr);
-   de->data_buf_addr_lw = lower_32_bits(desc->dma_addr);
-   de->buf_len_info = (tx_ctrl->tcp_seg_len << 16) | desc->len;
-   de->attr = attr;
-   /* under spin_lock if using XDP */
-   if (!dring->is_xdp)
-   dma_wmb();
-
-   dring->desc[idx] = *desc;
-   if (desc->buf_type == TYPE_NETSEC_SKB)
-   dring->desc[idx].skb = buf;
-   else if (desc->buf_type == TYPE_NETSEC_XDP_TX ||
-desc->buf_type == TYPE_NETSEC_XDP_NDO)
-   dring->desc[idx].xdpf = buf;
-
-   /* move head ahead */
-   dring->head = (dring->head + 1) % DESC_NUM;
-}
 
 static int netsec_desc_used(struct netsec_desc_ring *dring)
 {
-- 
2.20.1

[PATCH 1/2] net: netsec: remove superfluous if statement

2019-07-09 Thread Ilias Apalodimas

While freeing tx buffers the memory has to be unmapped if the packet was
an skb or was used for .ndo_xdp_xmit using the same arguments. Get rid
of the unneeded extra 'else if' statement

Signed-off-by: Ilias Apalodimas 
---
 drivers/net/ethernet/socionext/netsec.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index c3a4f86f56ee..7f9280f1fb28 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -654,12 +654,12 @@ static bool netsec_clean_tx_dring(struct netsec_priv 
*priv)
eop = (entry->attr >> NETSEC_TX_LAST) & 1;
dma_rmb();
 
-   if (desc->buf_type == TYPE_NETSEC_SKB)
+   /* if buf_type is either TYPE_NETSEC_SKB or
+* TYPE_NETSEC_XDP_NDO we mapped it
+*/
+   if (desc->buf_type != TYPE_NETSEC_XDP_TX)
dma_unmap_single(priv->dev, desc->dma_addr, desc->len,
 DMA_TO_DEVICE);
-   else if (desc->buf_type == TYPE_NETSEC_XDP_NDO)
-   dma_unmap_single(priv->dev, desc->dma_addr,
-desc->len, DMA_TO_DEVICE);
 
if (!eop)
goto next;
-- 
2.20.1

Re: [PATCH net-next] bnxt_en: Add page_pool_destroy() during RX ring cleanup.

2019-07-09 Thread Ilias Apalodimas

On Tue, Jul 09, 2019 at 12:31:54PM -0400, Andy Gospodarek wrote:
> On Tue, Jul 09, 2019 at 06:20:57PM +0300, Ilias Apalodimas wrote:
> > Hi,
> > 
> > > > Add page_pool_destroy() in bnxt_free_rx_rings() during normal RX ring
> > > > cleanup, as Ilias has informed us that the following commit has been
> > > > merged:
> > > > 
> > > > 1da4bbeffe41 ("net: core: page_pool: add user refcnt and reintroduce 
> > > > page_pool_destroy")
> > > > 
> > > > The special error handling code to call page_pool_free() can now be
> > > > removed.  bnxt_free_rx_rings() will always be called during normal
> > > > shutdown or any error paths.
> > > > 
> > > > Fixes: 322b87ca55f2 ("bnxt_en: add page_pool support")
> > > > Cc: Ilias Apalodimas 
> > > > Cc: Andy Gospodarek 
> > > > Signed-off-by: Michael Chan 
> > > > ---
> > > >  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 ++--
> > > >  1 file changed, 2 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
> > > > b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > > index e9d3bd8..2b5b0ab 100644
> > > > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > > @@ -2500,6 +2500,7 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
> > > > if (xdp_rxq_info_is_reg(&rxr->xdp_rxq))
> > > > xdp_rxq_info_unreg(&rxr->xdp_rxq);
> > > >  
> > > > +   page_pool_destroy(rxr->page_pool);
> > > > rxr->page_pool = NULL;
> > > >  
> > > > kfree(rxr->rx_tpa);
> > > > @@ -2560,19 +2561,14 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
> > > > return rc;
> > > >  
> > > > rc = xdp_rxq_info_reg(&rxr->xdp_rxq, bp->dev, i);
> > > > -   if (rc < 0) {
> > > > -   page_pool_free(rxr->page_pool);
> > > > -   rxr->page_pool = NULL;
> > > > +   if (rc < 0)
> > > > return rc;
> > > > -   }
> > > >  
> > > > rc = xdp_rxq_info_reg_mem_model(&rxr->xdp_rxq,
> > > > MEM_TYPE_PAGE_POOL,
> > > > rxr->page_pool);
> > > > if (rc) {
> > > > xdp_rxq_info_unreg(&rxr->xdp_rxq);
> > > > -   page_pool_free(rxr->page_pool);
> > > > -   rxr->page_pool = NULL;
> > > 
> > > Rather than deleting these lines it would also be acceptable to do:
> > > 
> > > if (rc) {
> > > xdp_rxq_info_unreg(&rxr->xdp_rxq);
> > > -   page_pool_free(rxr->page_pool);
> > > +   page_pool_destroy(rxr->page_pool);
> > > rxr->page_pool = NULL;
> > > return rc;
> > > }
> > > 
> > > but anytime there is a failure to bnxt_alloc_rx_rings the driver will
> > > immediately follow it up with a call to bnxt_free_rx_rings, so
> > > page_pool_destroy will be called.
> > > 
> > > Thanks for pushing this out so quickly!
> > > 
> > 
> > I also can't find page_pool_release_page() or page_pool_put_page() called 
> > when
> > destroying the pool. Can you try to insmod -> do some traffic -> rmmod ?
> > If there's stale buffers that haven't been unmapped properly you'll get a
> > WARN_ON for them.
> 
> I did that test a few times with a few different bpf progs but I do not
> see any WARN messages.  Of course this does not mean that the code we
> have is 100% correct.
> 

I'll try to have a closer look as well

> Presumably you are talking about one of these messages, right?
> 
> 215 /* The distance should not be able to become negative */
> 216 WARN(inflight < 0, "Negative(%d) inflight packet-pages", 
> inflight);
> 
> or
> 
> 356 /* Drivers should fix this, but only problematic when DMA is used 
> */
> 357 WARN(1, "Still in-flight pages:%d hold:%u released:%u",
> 358  distance, hold_cnt, release_cnt);
> 

Yea particularly the second one. There's a counter we increase everytime you
alloc a fresh page which needs to be decresed before freeing the whole pool.
page_pool_release_page will do that for example

> 
> > This part was added later on in the API when Jesper fixed in-flight packet
> > handling

Thanks
/Ilias

Re: [PATCH net-next] bnxt_en: Add page_pool_destroy() during RX ring cleanup.

2019-07-09 Thread Ilias Apalodimas

Hi,

> > Add page_pool_destroy() in bnxt_free_rx_rings() during normal RX ring
> > cleanup, as Ilias has informed us that the following commit has been
> > merged:
> > 
> > 1da4bbeffe41 ("net: core: page_pool: add user refcnt and reintroduce 
> > page_pool_destroy")
> > 
> > The special error handling code to call page_pool_free() can now be
> > removed.  bnxt_free_rx_rings() will always be called during normal
> > shutdown or any error paths.
> > 
> > Fixes: 322b87ca55f2 ("bnxt_en: add page_pool support")
> > Cc: Ilias Apalodimas 
> > Cc: Andy Gospodarek 
> > Signed-off-by: Michael Chan 
> > ---
> >  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 ++--
> >  1 file changed, 2 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
> > b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index e9d3bd8..2b5b0ab 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > @@ -2500,6 +2500,7 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
> > if (xdp_rxq_info_is_reg(&rxr->xdp_rxq))
> > xdp_rxq_info_unreg(&rxr->xdp_rxq);
> >  
> > +   page_pool_destroy(rxr->page_pool);
> > rxr->page_pool = NULL;
> >  
> > kfree(rxr->rx_tpa);
> > @@ -2560,19 +2561,14 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
> > return rc;
> >  
> > rc = xdp_rxq_info_reg(&rxr->xdp_rxq, bp->dev, i);
> > -   if (rc < 0) {
> > -   page_pool_free(rxr->page_pool);
> > -   rxr->page_pool = NULL;
> > +   if (rc < 0)
> > return rc;
> > -   }
> >  
> > rc = xdp_rxq_info_reg_mem_model(&rxr->xdp_rxq,
> > MEM_TYPE_PAGE_POOL,
> > rxr->page_pool);
> > if (rc) {
> > xdp_rxq_info_unreg(&rxr->xdp_rxq);
> > -   page_pool_free(rxr->page_pool);
> > -   rxr->page_pool = NULL;
> 
> Rather than deleting these lines it would also be acceptable to do:
> 
> if (rc) {
> xdp_rxq_info_unreg(&rxr->xdp_rxq);
> -   page_pool_free(rxr->page_pool);
> +   page_pool_destroy(rxr->page_pool);
> rxr->page_pool = NULL;
> return rc;
> }
> 
> but anytime there is a failure to bnxt_alloc_rx_rings the driver will
> immediately follow it up with a call to bnxt_free_rx_rings, so
> page_pool_destroy will be called.
> 
> Thanks for pushing this out so quickly!
> 

I also can't find page_pool_release_page() or page_pool_put_page() called when
destroying the pool. Can you try to insmod -> do some traffic -> rmmod ?
If there's stale buffers that haven't been unmapped properly you'll get a
WARN_ON for them.
This part was added later on in the API when Jesper fixed in-flight packet
handling

> Acked-by: Andy Gospodarek  
> 

Thanks
/Ilias

[PATCH] net: netsec: start using buffers if page_pool registration succeeded

2019-07-09 Thread Ilias Apalodimas

The current driver starts using page_pool buffers before calling
xdp_rxq_info_reg_mem_model(). Start using the buffers after the
registration succeeded, so we won't have to call
page_pool_request_shutdown() in case of failure

Fixes: 5c67bf0ec4d0 ("net: netsec: Use page_pool API")
Signed-off-by: Ilias Apalodimas 
---
 drivers/net/ethernet/socionext/netsec.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index d7307ab90d74..c3a4f86f56ee 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -1309,6 +1309,15 @@ static int netsec_setup_rx_dring(struct netsec_priv 
*priv)
goto err_out;
}
 
+   err = xdp_rxq_info_reg(&dring->xdp_rxq, priv->ndev, 0);
+   if (err)
+   goto err_out;
+
+   err = xdp_rxq_info_reg_mem_model(&dring->xdp_rxq, MEM_TYPE_PAGE_POOL,
+dring->page_pool);
+   if (err)
+   goto err_out;
+
for (i = 0; i < DESC_NUM; i++) {
struct netsec_desc *desc = &dring->desc[i];
dma_addr_t dma_handle;
@@ -1327,14 +1336,6 @@ static int netsec_setup_rx_dring(struct netsec_priv 
*priv)
}
 
netsec_rx_fill(priv, 0, DESC_NUM);
-   err = xdp_rxq_info_reg(&dring->xdp_rxq, priv->ndev, 0);
-   if (err)
-   goto err_out;
-
-   err = xdp_rxq_info_reg_mem_model(&dring->xdp_rxq, MEM_TYPE_PAGE_POOL,
-dring->page_pool);
-   if (err)
-   goto err_out;
 
return 0;
 
-- 
2.20.1

Re: [PATCH net-next v2 0/4] bnxt_en: Add XDP_REDIRECT support.

2019-07-08 Thread Ilias Apalodimas

Hi David, 

On Mon, Jul 08, 2019 at 03:20:20PM -0700, David Miller wrote:
> From: Michael Chan 
> Date: Mon,  8 Jul 2019 17:53:00 -0400
> 
> > This patch series adds XDP_REDIRECT support by Andy Gospodarek.
> 
> Series applied, thanks everyone.

We need a fix on this after merging Ivans patch
commit 1da4bbeffe41ba318812d7590955faee8636668b

page_pool_destroy needs to be explicitely called when shutting down the
interface as it's not automatically called from xdp_rxq_info_unreg()

Thanks
/Ilias

Re: [PATCH net-next v2 4/4] bnxt_en: add page_pool support

2019-07-08 Thread Ilias Apalodimas

Hi Andy, Michael,

On Mon, Jul 08, 2019 at 05:53:04PM -0400, Michael Chan wrote:
> From: Andy Gospodarek 
> 
> This removes contention over page allocation for XDP_REDIRECT actions by
> adding page_pool support per queue for the driver.  The performance for
> XDP_REDIRECT actions scales linearly with the number of cores performing
> redirect actions when using the page pools instead of the standard page
> allocator.
> 
> v2: Fix up the error path from XDP registration, noted by Ilias Apalodimas.
> 
> Signed-off-by: Andy Gospodarek 
> Signed-off-by: Michael Chan 
> ---
>  drivers/net/ethernet/broadcom/Kconfig |  1 +
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 47 
> +++
>  drivers/net/ethernet/broadcom/bnxt/bnxt.h |  3 ++
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |  3 +-
>  4 files changed, 47 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/Kconfig 
> b/drivers/net/ethernet/broadcom/Kconfig
> index 2e4a8c7..e9017ca 100644
> --- a/drivers/net/ethernet/broadcom/Kconfig
> +++ b/drivers/net/ethernet/broadcom/Kconfig
> @@ -199,6 +199,7 @@ config BNXT
>   select FW_LOADER
>   select LIBCRC32C
>   select NET_DEVLINK
> + select PAGE_POOL
>   ---help---
> This driver supports Broadcom NetXtreme-C/E 10/25/40/50 gigabit
> Ethernet cards.  To compile this driver as a module, choose M here:
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
> b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index d8f0846..d25bb38 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -54,6 +54,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "bnxt_hsi.h"
>  #include "bnxt.h"
> @@ -668,19 +669,20 @@ static void bnxt_tx_int(struct bnxt *bp, struct 
> bnxt_napi *bnapi, int nr_pkts)
>  }
>  
>  static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t 
> *mapping,
> +  struct bnxt_rx_ring_info *rxr,
>gfp_t gfp)
>  {
>   struct device *dev = &bp->pdev->dev;
>   struct page *page;
>  
> - page = alloc_page(gfp);
> + page = page_pool_dev_alloc_pages(rxr->page_pool);
>   if (!page)
>   return NULL;
>  
>   *mapping = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, bp->rx_dir,
> DMA_ATTR_WEAK_ORDERING);
>   if (dma_mapping_error(dev, *mapping)) {
> - __free_page(page);
> + page_pool_recycle_direct(rxr->page_pool, page);
>   return NULL;
>   }
>   *mapping += bp->rx_dma_offset;
> @@ -716,7 +718,8 @@ int bnxt_alloc_rx_data(struct bnxt *bp, struct 
> bnxt_rx_ring_info *rxr,
>   dma_addr_t mapping;
>  
>   if (BNXT_RX_PAGE_MODE(bp)) {
> - struct page *page = __bnxt_alloc_rx_page(bp, &mapping, gfp);
> + struct page *page =
> + __bnxt_alloc_rx_page(bp, &mapping, rxr, gfp);
>  
>   if (!page)
>   return -ENOMEM;
> @@ -2360,7 +2363,7 @@ static void bnxt_free_rx_skbs(struct bnxt *bp)
>   dma_unmap_page_attrs(&pdev->dev, mapping,
>PAGE_SIZE, bp->rx_dir,
>DMA_ATTR_WEAK_ORDERING);
> - __free_page(data);
> + page_pool_recycle_direct(rxr->page_pool, data);
>   } else {
>   dma_unmap_single_attrs(&pdev->dev, mapping,
>  bp->rx_buf_use_size,
> @@ -2497,6 +2500,8 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
>   if (xdp_rxq_info_is_reg(&rxr->xdp_rxq))
>   xdp_rxq_info_unreg(&rxr->xdp_rxq);
>  
> + rxr->page_pool = NULL;
> +
>   kfree(rxr->rx_tpa);
>   rxr->rx_tpa = NULL;
>  
> @@ -2511,6 +2516,26 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
>   }
>  }
>  
> +static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> +struct bnxt_rx_ring_info *rxr)
> +{
> + struct page_pool_params pp = { 0 };
> +
> + pp.pool_size = bp->rx_ring_size;
> + pp.nid = dev_to_node(&bp->pdev->dev);
> + pp.dev = &bp->pdev->dev;
> + pp.dma_dir = DMA_BIDIRECTIONAL;
> +
> + rxr->page_pool = page_pool_create(&pp);
> +

Re: [PATCH net-next 3/4] bnxt_en: optimized XDP_REDIRECT support

2019-07-08 Thread Ilias Apalodimas

Hi Andy, 

> On Mon, Jul 08, 2019 at 11:28:03AM +0300, Ilias Apalodimas wrote:
> > Thanks Andy, Michael
> > 
> > > + if (event & BNXT_REDIRECT_EVENT)
> > > + xdp_do_flush_map();
> > > +
> > >   if (event & BNXT_TX_EVENT) {
> > >   struct bnxt_tx_ring_info *txr = bnapi->tx_ring;
> > >   u16 prod = txr->tx_prod;
> > > @@ -2254,9 +2257,23 @@ static void bnxt_free_tx_skbs(struct bnxt *bp)
> > >  
> > >   for (j = 0; j < max_idx;) {
> > >   struct bnxt_sw_tx_bd *tx_buf = &txr->tx_buf_ring[j];
> > > - struct sk_buff *skb = tx_buf->skb;
> > > + struct sk_buff *skb;
> > >   int k, last;
> > >  
> > > + if (i < bp->tx_nr_rings_xdp &&
> > > + tx_buf->action == XDP_REDIRECT) {
> > > + dma_unmap_single(&pdev->dev,
> > > + dma_unmap_addr(tx_buf, mapping),
> > > + dma_unmap_len(tx_buf, len),
> > > + PCI_DMA_TODEVICE);
> > > + xdp_return_frame(tx_buf->xdpf);
> > > + tx_buf->action = 0;
> > > + tx_buf->xdpf = NULL;
> > > + j++;
> > > + continue;
> > > + }
> > > +
> > 
> > Can't see the whole file here and maybe i am missing something, but since 
> > you
> > optimize for that and start using page_pool, XDP_TX will be a re-synced (and
> > not remapped)  buffer that can be returned to the pool and resynced for 
> > device usage. 
> > Is that happening later on the tx clean function?
> 
> Take a look at the way we treat the buffers in bnxt_rx_xdp() when we
> receive them and then in bnxt_tx_int_xdp() when the transmits have
> completed (for XDP_TX and XDP_REDIRECT).  I think we are doing what is
> proper with respect to mapping vs sync for both cases, but I would be
> fine to be corrected.
> 

Yea seems to be doing the right thing, 
XDP_TX syncs correctly and reuses with bnxt_reuse_rx_data() right?

This might be a bit confusing for someone reading the driver on the first time,
probably because you'll end up with 2 ways of recycling buffers. 

Once a buffers get freed on the XDP path it's either fed back to the pool, so
the next requested buffer get served from the pools cache (ndo_xdp_xmit case in
the patch). If the buffer is used for XDP_TX is's synced correctly but recycled
via bnxt_reuse_rx_data() right? Since you are moving to page pool please
consider having a common approach towards the recycling path. I understand that
means tracking buffers types and make sure you do the right thing on 'tx clean'.
I've done something similar on the netsec driver and i do think this might be a
good thing to add on page_pool API

Again this isn't a blocker at least for me but you already have the buffer type
(via tx_buf->action)

> > 
> > > + skb = tx_buf->skb;
> > >   if (!skb) {
> > >   j++;
> > >   continue;
> > > @@ -2517,6 +2534,13 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
> > >   if (rc < 0)
> > >   return rc;
> > >  
> > > + rc = xdp_rxq_info_reg_mem_model(&rxr->xdp_rxq,
> > > + MEM_TYPE_PAGE_SHARED, NULL);
> > > + if (rc) {
> > > + xdp_rxq_info_unreg(&rxr->xdp_rxq);
> > 
> > I think you can use page_pool_free directly here (and pge_pool_destroy once
> > Ivan's patchset gets nerged), that's what mlx5 does iirc. Can we keep that
> > common please?
> 
> That's an easy change, I can do that.
> 
> > 
> > If Ivan's patch get merged please note you'll have to explicitly
> > page_pool_destroy, after calling xdp_rxq_info_unreg() in the general 
> > unregister
> > case (not the error habdling here). Sorry for the confusion this might 
> > bring!
> 
> Funny enough the driver was basically doing that until page_pool_destroy
> was removed (these patches are not new).  I saw last week there was
> discussion to add it back, but I did not want to wait to get this on the
> list before that was resolved.

Fair enough

> 
> This path works as expected with the code in the tree today so it seemed
> like the correct approach to post something that is working, right?  :-)

Yes.

It will continue to work even if you dont change the call in the future. 
This is more a 'let's not spread the code' attempt, but removing and re-adding
page_pool_destroy() was/is our mess. We might as well live with the
consequences!

> 
> > 
> > > + return rc;
> > > + }
> > > +
> > >   rc = bnxt_alloc_ring(bp, &ring->ring_mem);
> > >   if (rc)
> > >   return rc;
> > > @@ -10233,6 +10257,7 @@ static const struct net_device_ops 
> > > bnxt_netdev_ops = {
> > [...]
> > 

Thanks!
/Ilias

Re: [RFC PATCH net-next 3/6] net: dsa: Pass tc-taprio offload to drivers

2019-07-08 Thread Ilias Apalodimas

Hi Vladimir,

> tc-taprio is a qdisc based on the enhancements for scheduled traffic
> specified in IEEE 802.1Qbv (later merged in 802.1Q).  This qdisc has
> a software implementation and an optional offload through which
> compatible Ethernet ports may configure their egress 802.1Qbv
> schedulers.
> 
> Signed-off-by: Vladimir Oltean 
> ---
>  include/net/dsa.h |  3 +++
>  net/dsa/slave.c   | 14 ++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index 1e8650fa8acc..e7ee6ac8ce6b 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -152,6 +152,7 @@ struct dsa_mall_tc_entry {
>   };
>  };
>  
> +struct tc_taprio_qopt_offload;
>  
>  struct dsa_port {
>   /* A CPU port is physically connected to a master device.
> @@ -516,6 +517,8 @@ struct dsa_switch_ops {
>  bool ingress);
>   void(*port_mirror_del)(struct dsa_switch *ds, int port,
>  struct dsa_mall_mirror_tc_entry *mirror);
> + int (*port_setup_taprio)(struct dsa_switch *ds, int port,
> +  struct tc_taprio_qopt_offload *qopt);

Is there any way to make this more generic? 802.1Qbv are not the only hardware
schedulers. CBS and ETF are examples that first come to mind. Maybe having
something more generic than tc_taprio_qopt_offload as an option could host
future schedulers?

>  
>   /*
>* Cross-chip operations
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 99673f6b07f6..2bae33788708 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -965,12 +965,26 @@ static int dsa_slave_setup_tc_block(struct net_device 
> *dev,
>   }
>  }
>  
> +static int dsa_slave_setup_tc_taprio(struct net_device *dev,
> +  struct tc_taprio_qopt_offload *f)
> +{
> + struct dsa_port *dp = dsa_slave_to_port(dev);
> + struct dsa_switch *ds = dp->ds;
> +
> + if (!ds->ops->port_setup_taprio)
> + return -EOPNOTSUPP;
> +
> + return ds->ops->port_setup_taprio(ds, dp->index, f);
> +}
> +
>  static int dsa_slave_setup_tc(struct net_device *dev, enum tc_setup_type 
> type,
> void *type_data)
>  {
>   switch (type) {
>   case TC_SETUP_BLOCK:
>   return dsa_slave_setup_tc_block(dev, type_data);
> + case TC_SETUP_QDISC_TAPRIO:
> + return dsa_slave_setup_tc_taprio(dev, type_data);
>   default:
>   return -EOPNOTSUPP;
>   }
> -- 
> 2.17.1
> 
Thanks
/Ilias

Re: [PATCH net-next V2] MAINTAINERS: Add page_pool maintainer entry

2019-07-08 Thread Ilias Apalodimas

On Fri, Jul 05, 2019 at 02:57:55PM +0200, Jesper Dangaard Brouer wrote:
> In this release cycle the number of NIC drivers using page_pool
> will likely reach 4 drivers.  It is about time to add a maintainer
> entry.  Add myself and Ilias.
> 
> Signed-off-by: Jesper Dangaard Brouer 
> ---
> V2: Ilias also volunteered to co-maintain over IRC

Would be glad to serve as one

> 
>  MAINTAINERS |8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 449e7cdb3303..22655aa84a46 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11902,6 +11902,14 @@ F:   kernel/padata.c
>  F:   include/linux/padata.h
>  F:   Documentation/padata.txt
>  
> +PAGE POOL
> +M:   Jesper Dangaard Brouer 
> +M:   Ilias Apalodimas 
> +L:   netdev@vger.kernel.org
> +S:   Supported
> +F:   net/core/page_pool.c
> +F:   include/net/page_pool.h
> +
>  PANASONIC LAPTOP ACPI EXTRAS DRIVER
>  M:   Harald Welte 
>  L:   platform-driver-...@vger.kernel.org
> 

Acked-by: Ilias Apalodimas

Re: [PATCH net-next 3/4] bnxt_en: optimized XDP_REDIRECT support

2019-07-08 Thread Ilias Apalodimas

Thanks Andy, Michael

> + if (event & BNXT_REDIRECT_EVENT)
> + xdp_do_flush_map();
> +
>   if (event & BNXT_TX_EVENT) {
>   struct bnxt_tx_ring_info *txr = bnapi->tx_ring;
>   u16 prod = txr->tx_prod;
> @@ -2254,9 +2257,23 @@ static void bnxt_free_tx_skbs(struct bnxt *bp)
>  
>   for (j = 0; j < max_idx;) {
>   struct bnxt_sw_tx_bd *tx_buf = &txr->tx_buf_ring[j];
> - struct sk_buff *skb = tx_buf->skb;
> + struct sk_buff *skb;
>   int k, last;
>  
> + if (i < bp->tx_nr_rings_xdp &&
> + tx_buf->action == XDP_REDIRECT) {
> + dma_unmap_single(&pdev->dev,
> + dma_unmap_addr(tx_buf, mapping),
> + dma_unmap_len(tx_buf, len),
> + PCI_DMA_TODEVICE);
> + xdp_return_frame(tx_buf->xdpf);
> + tx_buf->action = 0;
> + tx_buf->xdpf = NULL;
> + j++;
> + continue;
> + }
> +

Can't see the whole file here and maybe i am missing something, but since you
optimize for that and start using page_pool, XDP_TX will be a re-synced (and
not remapped)  buffer that can be returned to the pool and resynced for 
device usage. 
Is that happening later on the tx clean function?

> + skb = tx_buf->skb;
>   if (!skb) {
>   j++;
>   continue;
> @@ -2517,6 +2534,13 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
>   if (rc < 0)
>   return rc;
>  
> + rc = xdp_rxq_info_reg_mem_model(&rxr->xdp_rxq,
> + MEM_TYPE_PAGE_SHARED, NULL);
> + if (rc) {
> + xdp_rxq_info_unreg(&rxr->xdp_rxq);

I think you can use page_pool_free directly here (and pge_pool_destroy once
Ivan's patchset gets nerged), that's what mlx5 does iirc. Can we keep that
common please?

If Ivan's patch get merged please note you'll have to explicitly
page_pool_destroy, after calling xdp_rxq_info_unreg() in the general unregister
case (not the error habdling here). Sorry for the confusion this might bring!

> + return rc;
> + }
> +
>   rc = bnxt_alloc_ring(bp, &ring->ring_mem);
>   if (rc)
>   return rc;
> @@ -10233,6 +10257,7 @@ static const struct net_device_ops bnxt_netdev_ops = {
[...]

Thanks!
/Ilias

[PATCH] net: netsec: Sync dma for device on buffer allocation

2019-07-08 Thread Ilias Apalodimas

cd1973a9215a ("net: netsec: Sync dma for device on buffer allocation")
was merged on it's v1 instead of the v3.
Merge the proper patch version

Signed-off-by: Ilias Apalodimas 
---
 drivers/net/ethernet/socionext/netsec.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index f6e261c6a059..460777449cd9 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -743,9 +743,7 @@ static void *netsec_alloc_rx_data(struct netsec_priv *priv,
 */
*desc_len = PAGE_SIZE - NETSEC_RX_BUF_NON_DATA;
dma_dir = page_pool_get_dma_dir(dring->page_pool);
-   dma_sync_single_for_device(priv->dev,
-  *dma_handle - NETSEC_RXBUF_HEADROOM,
-  PAGE_SIZE, dma_dir);
+   dma_sync_single_for_device(priv->dev, *dma_handle, *desc_len, dma_dir);
 
return page_address(page);
 }
-- 
2.20.1

[net-next, PATCH, v3] net: netsec: Sync dma for device on buffer allocation

2019-07-05 Thread Ilias Apalodimas

Quoting Arnd,
We have to do a sync_single_for_device /somewhere/ before the
buffer is given to the device. On a non-cache-coherent machine with
a write-back cache, there may be dirty cache lines that get written back
after the device DMA's data into it (e.g. from a previous memset
from before the buffer got freed), so you absolutely need to flush any
dirty cache lines on it first.

Since the coherency is configurable in this device make sure we cover
all configurations by explicitly syncing the allocated buffer for the
device before refilling it's descriptors

Signed-off-by: Ilias Apalodimas 
---
Changes since v2:
- Only sync for the portion of the packet owned by the NIC as suggested by 
  Jesper

 drivers/net/ethernet/socionext/netsec.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index 5544a722543f..6b954ad88842 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -727,6 +727,7 @@ static void *netsec_alloc_rx_data(struct netsec_priv *priv,
 {
 
struct netsec_desc_ring *dring = &priv->desc_ring[NETSEC_RING_RX];
+   enum dma_data_direction dma_dir;
struct page *page;
 
page = page_pool_dev_alloc_pages(dring->page_pool);
@@ -742,6 +743,8 @@ static void *netsec_alloc_rx_data(struct netsec_priv *priv,
 * cases and reserve enough space for headroom + skb_shared_info
 */
*desc_len = PAGE_SIZE - NETSEC_RX_BUF_NON_DATA;
+   dma_dir = page_pool_get_dma_dir(dring->page_pool);
+   dma_sync_single_for_device(priv->dev, *dma_handle, *desc_len, dma_dir);
 
return page_address(page);
 }
-- 
2.20.1

Re: [net-next, PATCH, v2] net: netsec: Sync dma for device on buffer allocation

2019-07-04 Thread Ilias Apalodimas

On Thu, Jul 04, 2019 at 08:52:50PM +0300, Ilias Apalodimas wrote:
> On Thu, Jul 04, 2019 at 07:39:44PM +0200, Jesper Dangaard Brouer wrote:
> > On Thu,  4 Jul 2019 17:46:09 +0300
> > Ilias Apalodimas  wrote:
> > 
> > > Quoting Arnd,
> > > 
> > > We have to do a sync_single_for_device /somewhere/ before the
> > > buffer is given to the device. On a non-cache-coherent machine with
> > > a write-back cache, there may be dirty cache lines that get written back
> > > after the device DMA's data into it (e.g. from a previous memset
> > > from before the buffer got freed), so you absolutely need to flush any
> > > dirty cache lines on it first.
> > > 
> > > Since the coherency is configurable in this device make sure we cover
> > > all configurations by explicitly syncing the allocated buffer for the
> > > device before refilling it's descriptors
> > > 
> > > Signed-off-by: Ilias Apalodimas 
> > > ---
> > > 
> > > Changes since V1: 
> > > - Make the code more readable
> > >  
> > >  drivers/net/ethernet/socionext/netsec.c | 7 ++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/net/ethernet/socionext/netsec.c 
> > > b/drivers/net/ethernet/socionext/netsec.c
> > > index 5544a722543f..ada7626bf3a2 100644
> > > --- a/drivers/net/ethernet/socionext/netsec.c
> > > +++ b/drivers/net/ethernet/socionext/netsec.c
> > > @@ -727,21 +727,26 @@ static void *netsec_alloc_rx_data(struct 
> > > netsec_priv *priv,
> > >  {
> > >  
> > >   struct netsec_desc_ring *dring = &priv->desc_ring[NETSEC_RING_RX];
> > > + enum dma_data_direction dma_dir;
> > > + dma_addr_t dma_start;
> > >   struct page *page;
> > >  
> > >   page = page_pool_dev_alloc_pages(dring->page_pool);
> > >   if (!page)
> > >   return NULL;
> > >  
> > > + dma_start = page_pool_get_dma_addr(page);
> > >   /* We allocate the same buffer length for XDP and non-XDP cases.
> > >* page_pool API will map the whole page, skip what's needed for
> > >* network payloads and/or XDP
> > >*/
> > > - *dma_handle = page_pool_get_dma_addr(page) + NETSEC_RXBUF_HEADROOM;
> > > + *dma_handle = dma_start + NETSEC_RXBUF_HEADROOM;
> > >   /* Make sure the incoming payload fits in the page for XDP and non-XDP
> > >* cases and reserve enough space for headroom + skb_shared_info
> > >*/
> > >   *desc_len = PAGE_SIZE - NETSEC_RX_BUF_NON_DATA;
> > > + dma_dir = page_pool_get_dma_dir(dring->page_pool);
> > > + dma_sync_single_for_device(priv->dev, dma_start, PAGE_SIZE, dma_dir);
> > 
> > It's it costly to sync_for_device the entire page size?
> > 
> > E.g. we already know that the head-room is not touched by device.  And
> > we actually want this head-room cache-hot for e.g. xdp_frame, thus it
> > would be unfortunate if the head-room is explicitly evicted from the
> > cache here.
> > 
> > Even smarter, the driver could do the sync for_device, when it
> > release/recycle page, as it likely know the exact length that was used
> > by the packet.
> It does sync for device when recycling takes place in XDP_TX with the correct
> size. 
> I guess i can explicitly sync on the xdp_return_buff cases, and 
> netsec_setup_rx_dring() instead of the generic buffer allocation
> 
> I'll send a V3

On a second thought i think this is going to look a bit complicated for no
apparent reason.
If i do this i'll have to track the buffers that got recycled vs buffers 
that are freshly allocated (and sync in this case). I currently have no 
way of cwtelling if the buffer is new or recycled, so i'll just sync the 
payload for now as you suggested.

Maybe this information can be added on page_pool_dev_alloc_pages() ?

Thanks
/Ilias

Re: [net-next, PATCH, v2] net: netsec: Sync dma for device on buffer allocation

2019-07-04 Thread Ilias Apalodimas

On Thu, Jul 04, 2019 at 07:39:44PM +0200, Jesper Dangaard Brouer wrote:
> On Thu,  4 Jul 2019 17:46:09 +0300
> Ilias Apalodimas  wrote:
> 
> > Quoting Arnd,
> > 
> > We have to do a sync_single_for_device /somewhere/ before the
> > buffer is given to the device. On a non-cache-coherent machine with
> > a write-back cache, there may be dirty cache lines that get written back
> > after the device DMA's data into it (e.g. from a previous memset
> > from before the buffer got freed), so you absolutely need to flush any
> > dirty cache lines on it first.
> > 
> > Since the coherency is configurable in this device make sure we cover
> > all configurations by explicitly syncing the allocated buffer for the
> > device before refilling it's descriptors
> > 
> > Signed-off-by: Ilias Apalodimas 
> > ---
> > 
> > Changes since V1: 
> > - Make the code more readable
> >  
> >  drivers/net/ethernet/socionext/netsec.c | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/ethernet/socionext/netsec.c 
> > b/drivers/net/ethernet/socionext/netsec.c
> > index 5544a722543f..ada7626bf3a2 100644
> > --- a/drivers/net/ethernet/socionext/netsec.c
> > +++ b/drivers/net/ethernet/socionext/netsec.c
> > @@ -727,21 +727,26 @@ static void *netsec_alloc_rx_data(struct netsec_priv 
> > *priv,
> >  {
> >  
> > struct netsec_desc_ring *dring = &priv->desc_ring[NETSEC_RING_RX];
> > +   enum dma_data_direction dma_dir;
> > +   dma_addr_t dma_start;
> > struct page *page;
> >  
> > page = page_pool_dev_alloc_pages(dring->page_pool);
> > if (!page)
> > return NULL;
> >  
> > +   dma_start = page_pool_get_dma_addr(page);
> > /* We allocate the same buffer length for XDP and non-XDP cases.
> >  * page_pool API will map the whole page, skip what's needed for
> >  * network payloads and/or XDP
> >  */
> > -   *dma_handle = page_pool_get_dma_addr(page) + NETSEC_RXBUF_HEADROOM;
> > +   *dma_handle = dma_start + NETSEC_RXBUF_HEADROOM;
> > /* Make sure the incoming payload fits in the page for XDP and non-XDP
> >  * cases and reserve enough space for headroom + skb_shared_info
> >  */
> > *desc_len = PAGE_SIZE - NETSEC_RX_BUF_NON_DATA;
> > +   dma_dir = page_pool_get_dma_dir(dring->page_pool);
> > +   dma_sync_single_for_device(priv->dev, dma_start, PAGE_SIZE, dma_dir);
> 
> It's it costly to sync_for_device the entire page size?
> 
> E.g. we already know that the head-room is not touched by device.  And
> we actually want this head-room cache-hot for e.g. xdp_frame, thus it
> would be unfortunate if the head-room is explicitly evicted from the
> cache here.
> 
> Even smarter, the driver could do the sync for_device, when it
> release/recycle page, as it likely know the exact length that was used
> by the packet.
It does sync for device when recycling takes place in XDP_TX with the correct
size. 
I guess i can explicitly sync on the xdp_return_buff cases, and 
netsec_setup_rx_dring() instead of the generic buffer allocation

I'll send a V3

Thanks!
/Ilias

[net-next, PATCH, v2] net: netsec: Sync dma for device on buffer allocation

2019-07-04 Thread Ilias Apalodimas

Quoting Arnd,

We have to do a sync_single_for_device /somewhere/ before the
buffer is given to the device. On a non-cache-coherent machine with
a write-back cache, there may be dirty cache lines that get written back
after the device DMA's data into it (e.g. from a previous memset
from before the buffer got freed), so you absolutely need to flush any
dirty cache lines on it first.

Since the coherency is configurable in this device make sure we cover
all configurations by explicitly syncing the allocated buffer for the
device before refilling it's descriptors

Signed-off-by: Ilias Apalodimas 
---

Changes since V1: 
- Make the code more readable
 
 drivers/net/ethernet/socionext/netsec.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index 5544a722543f..ada7626bf3a2 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -727,21 +727,26 @@ static void *netsec_alloc_rx_data(struct netsec_priv 
*priv,
 {
 
struct netsec_desc_ring *dring = &priv->desc_ring[NETSEC_RING_RX];
+   enum dma_data_direction dma_dir;
+   dma_addr_t dma_start;
struct page *page;
 
page = page_pool_dev_alloc_pages(dring->page_pool);
if (!page)
return NULL;
 
+   dma_start = page_pool_get_dma_addr(page);
/* We allocate the same buffer length for XDP and non-XDP cases.
 * page_pool API will map the whole page, skip what's needed for
 * network payloads and/or XDP
 */
-   *dma_handle = page_pool_get_dma_addr(page) + NETSEC_RXBUF_HEADROOM;
+   *dma_handle = dma_start + NETSEC_RXBUF_HEADROOM;
/* Make sure the incoming payload fits in the page for XDP and non-XDP
 * cases and reserve enough space for headroom + skb_shared_info
 */
*desc_len = PAGE_SIZE - NETSEC_RX_BUF_NON_DATA;
+   dma_dir = page_pool_get_dma_dir(dring->page_pool);
+   dma_sync_single_for_device(priv->dev, dma_start, PAGE_SIZE, dma_dir);
 
return page_address(page);
 }
-- 
2.20.1

[PATCH] net: netsec: Sync dma for device on buffer allocation

2019-07-04 Thread Ilias Apalodimas

Quoting Arnd,

We have to do a sync_single_for_device /somewhere/ before the
buffer is given to the device. On a non-cache-coherent machine with
a write-back cache, there may be dirty cache lines that get written back
after the device DMA's data into it (e.g. from a previous memset
from before the buffer got freed), so you absolutely need to flush any
dirty cache lines on it first.

Since the coherency is configurable in this device make sure we cover
all configurations by explicitly syncing the allocated buffer for the
device before refilling it's descriptors

Signed-off-by: Ilias Apalodimas 
---
 drivers/net/ethernet/socionext/netsec.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index 5544a722543f..e05a7191336d 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -727,6 +727,7 @@ static void *netsec_alloc_rx_data(struct netsec_priv *priv,
 {
 
struct netsec_desc_ring *dring = &priv->desc_ring[NETSEC_RING_RX];
+   enum dma_data_direction dma_dir;
struct page *page;
 
page = page_pool_dev_alloc_pages(dring->page_pool);
@@ -742,6 +743,10 @@ static void *netsec_alloc_rx_data(struct netsec_priv *priv,
 * cases and reserve enough space for headroom + skb_shared_info
 */
*desc_len = PAGE_SIZE - NETSEC_RX_BUF_NON_DATA;
+   dma_dir = page_pool_get_dma_dir(dring->page_pool);
+   dma_sync_single_for_device(priv->dev,
+  *dma_handle - NETSEC_RXBUF_HEADROOM,
+  PAGE_SIZE, dma_dir);
 
return page_address(page);
 }
-- 
2.20.1

Re: [PATCH net-next] net: socionext: remove set but not used variable 'pkts'

2019-07-02 Thread Ilias Apalodimas

On Wed, Jul 03, 2019 at 02:42:13AM +, YueHaibing wrote:
> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/ethernet/socionext/netsec.c: In function 'netsec_clean_tx_dring':
> drivers/net/ethernet/socionext/netsec.c:637:15: warning:
>  variable 'pkts' set but not used [-Wunused-but-set-variable]
> 
> It is not used since commit ba2b232108d3 ("net: netsec: add XDP support")
> 
> Signed-off-by: YueHaibing 
> ---
>  drivers/net/ethernet/socionext/netsec.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> index 5544a722543f..015d1ec5436a 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -634,7 +634,7 @@ static void netsec_set_rx_de(struct netsec_priv *priv,
>  static bool netsec_clean_tx_dring(struct netsec_priv *priv)
>  {
>   struct netsec_desc_ring *dring = &priv->desc_ring[NETSEC_RING_TX];
> - unsigned int pkts, bytes;
> + unsigned int bytes;
>   struct netsec_de *entry;
>   int tail = dring->tail;
>   int cnt = 0;
> @@ -642,7 +642,6 @@ static bool netsec_clean_tx_dring(struct netsec_priv 
> *priv)
>   if (dring->is_xdp)
>       spin_lock(&dring->lock);
>  
> - pkts = 0;
>   bytes = 0;
>   entry = dring->vaddr + DESC_SZ * tail;
> 
> 
> 

Acked-by: Ilias Apalodimas

1 2 3 >

1 - 100 of 229 matches

Mail list logo