Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator

2024-06-07 Thread Niklas Schnelle
On Tue, 2024-06-04 at 20:27 -0400, Steven Rostedt wrote:
> On Wed, 5 Jun 2024 01:44:37 +0200
> Andrew Lunn  wrote:
> 
> > > Interesting, as I sped up the ftrace ring buffer by a substantial amount 
> > > by
> > > adding strategic __always_inline, noinline, likely() and unlikely()
> > > throughout the code. It had to do with what was considered the fast path
> > > and slow path, and not actually the size of the function. gcc got it
> > > horribly wrong.  
> > 
> > And what did the compiler people say when you reported gcc was getting
> > it wrong?
> > 
> > Our assumption is, the compiler is better than a human at deciding
> > this. Or at least, a human who does not spend a long time profiling
> > and tuning. If this assumption is not true, we probably should be
> > trying to figure out why, and improving the compiler when
> > possible. That will benefit everybody.
> > 
> 
> How is the compiler going to know which path is going to be taken the most?
> There's two main paths in the ring buffer logic. One when an event stays on
> the sub-buffer, the other when the event crosses over to a new sub buffer.
> As there's 100s of events that happen on the same sub-buffer for every one
> time there's a cross over, I optimized the paths that stayed on the
> sub-buffer, which caused the time for those events to go from 250ns down to
> 150 ns!. That's a 40% speed up.
> 
> I added the unlikely/likely and 'always_inline' and 'noinline' paths to
> make sure the "staying on the buffer" path was always the hot path, and
> keeping it tight in cache.
> 
> How is a compiler going to know that?
> 
> -- Steve
> 

Isn't this basically a perfect example of something where profile
guided optimization should work?

Thanks,
Niklas


Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator

2024-06-05 Thread Steven Rostedt
On Wed, 5 Jun 2024 02:52:29 +0200
Andrew Lunn  wrote:

> > How is a compiler going to know that?  
> 
> It might have some heuristics to try to guess unlikely/likely, but
> that is not what we are talking about here.
> 
> How much difference did 'always_inline' and 'noinline' make? Hopefully
> the likely is enough of a clue it should prefer to inline whatever is
> in that branch, where as for the unlikely case it can do a function
> call.

Perhaps, but one of the issues was that I have lots of small functions that
are used all over the place, and gcc tends to change them to function
calls, instead of duplicating them. I did this analysis back in 2016, so
maybe it became better.

> 
> But compilers is not my thing, which is why i would reach out to the
> compiler people and ask them, is it expected to get this wrong, could
> it be made better?

Well, I actually do work with the compiler folks, and we are actually
trying to get a session at GNU Cauldron where Linux kernel folks can talk
with the gcc compiler folks.

I've stared at so many objdump outputs, that I can now pretty much see the
assembly that my C code makes ;-)

-- Steve


Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator

2024-06-04 Thread Andrew Lunn
> How is the compiler going to know which path is going to be taken the most?
> There's two main paths in the ring buffer logic. One when an event stays on
> the sub-buffer, the other when the event crosses over to a new sub buffer.
> As there's 100s of events that happen on the same sub-buffer for every one
> time there's a cross over, I optimized the paths that stayed on the
> sub-buffer, which caused the time for those events to go from 250ns down to
> 150 ns!. That's a 40% speed up.
> 
> I added the unlikely/likely and 'always_inline' and 'noinline' paths to
> make sure the "staying on the buffer" path was always the hot path, and
> keeping it tight in cache.
> 
> How is a compiler going to know that?

It might have some heuristics to try to guess unlikely/likely, but
that is not what we are talking about here.

How much difference did 'always_inline' and 'noinline' make? Hopefully
the likely is enough of a clue it should prefer to inline whatever is
in that branch, where as for the unlikely case it can do a function
call.

But compilers is not my thing, which is why i would reach out to the
compiler people and ask them, is it expected to get this wrong, could
it be made better?

   Andrew


Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator

2024-06-04 Thread Steven Rostedt
On Wed, 5 Jun 2024 01:44:37 +0200
Andrew Lunn  wrote:

> > Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> > adding strategic __always_inline, noinline, likely() and unlikely()
> > throughout the code. It had to do with what was considered the fast path
> > and slow path, and not actually the size of the function. gcc got it
> > horribly wrong.  
> 
> And what did the compiler people say when you reported gcc was getting
> it wrong?
> 
> Our assumption is, the compiler is better than a human at deciding
> this. Or at least, a human who does not spend a long time profiling
> and tuning. If this assumption is not true, we probably should be
> trying to figure out why, and improving the compiler when
> possible. That will benefit everybody.
> 

How is the compiler going to know which path is going to be taken the most?
There's two main paths in the ring buffer logic. One when an event stays on
the sub-buffer, the other when the event crosses over to a new sub buffer.
As there's 100s of events that happen on the same sub-buffer for every one
time there's a cross over, I optimized the paths that stayed on the
sub-buffer, which caused the time for those events to go from 250ns down to
150 ns!. That's a 40% speed up.

I added the unlikely/likely and 'always_inline' and 'noinline' paths to
make sure the "staying on the buffer" path was always the hot path, and
keeping it tight in cache.

How is a compiler going to know that?

-- Steve


Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator

2024-06-04 Thread Andrew Lunn
> Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> adding strategic __always_inline, noinline, likely() and unlikely()
> throughout the code. It had to do with what was considered the fast path
> and slow path, and not actually the size of the function. gcc got it
> horribly wrong.

And what did the compiler people say when you reported gcc was getting
it wrong?

Our assumption is, the compiler is better than a human at deciding
this. Or at least, a human who does not spend a long time profiling
and tuning. If this assumption is not true, we probably should be
trying to figure out why, and improving the compiler when
possible. That will benefit everybody.

   Andrew



Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator

2024-06-04 Thread Steven Rostedt
On Tue, 4 Jun 2024 13:31:58 -0300
Jason Gunthorpe  wrote:

> On Tue, Jun 04, 2024 at 12:15:51PM -0400, Steven Rostedt wrote:
> > On Tue, 04 Jun 2024 12:13:15 +0200
> > Paolo Abeni  wrote:
> >   
> > > On Thu, 2024-05-30 at 20:16 +, Mina Almasry wrote:  
> > > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > > index d82f92d7cf9ce..d5fac8edf621d 100644
> > > > --- a/net/core/devmem.c
> > > > +++ b/net/core/devmem.c
> > > > @@ -32,6 +32,14 @@ static void 
> > > > net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> > > > kfree(owner);
> > > >  }
> > > >  
> > > > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov 
> > > > *niov)
> > > 
> > > Minor nit: please no 'inline' keyword in c files.  
> > 
> > I'm curious. Is this a networking rule? I use 'inline' in my C code all the
> > time.  
> 
> It mostly comes from Documentation/process/coding-style.rst:
> 
> 15) The inline disease
> --
> 
> There appears to be a common misperception that gcc has a magic "make me
> faster" speedup option called ``inline``. While the use of inlines can be
> appropriate (for example as a means of replacing macros, see Chapter 12), it
> very often is not. Abundant use of the inline keyword leads to a much bigger
> kernel, which in turn slows the system as a whole down, due to a bigger
> icache footprint for the CPU and simply because there is less memory
> available for the pagecache. Just think about it; a pagecache miss causes a
> disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
> that can go into these 5 milliseconds.
> 
> A reasonable rule of thumb is to not put inline at functions that have more
> than 3 lines of code in them. An exception to this rule are the cases where
> a parameter is known to be a compiletime constant, and as a result of this
> constantness you *know* the compiler will be able to optimize most of your
> function away at compile time. For a good example of this later case, see
> the kmalloc() inline function.
> 
> Often people argue that adding inline to functions that are static and used
> only once is always a win since there is no space tradeoff. While this is
> technically correct, gcc is capable of inlining these automatically without
> help, and the maintenance issue of removing the inline when a second user
> appears outweighs the potential value of the hint that tells gcc to do
> something it would have done anyway.
> 

Interesting, as I sped up the ftrace ring buffer by a substantial amount by
adding strategic __always_inline, noinline, likely() and unlikely()
throughout the code. It had to do with what was considered the fast path
and slow path, and not actually the size of the function. gcc got it
horribly wrong.

-- Steve


Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator

2024-06-04 Thread Jason Gunthorpe
On Tue, Jun 04, 2024 at 12:15:51PM -0400, Steven Rostedt wrote:
> On Tue, 04 Jun 2024 12:13:15 +0200
> Paolo Abeni  wrote:
> 
> > On Thu, 2024-05-30 at 20:16 +, Mina Almasry wrote:
> > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > index d82f92d7cf9ce..d5fac8edf621d 100644
> > > --- a/net/core/devmem.c
> > > +++ b/net/core/devmem.c
> > > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct 
> > > gen_pool *genpool,
> > >   kfree(owner);
> > >  }
> > >  
> > > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov 
> > > *niov)  
> > 
> > Minor nit: please no 'inline' keyword in c files.
> 
> I'm curious. Is this a networking rule? I use 'inline' in my C code all the
> time.

It mostly comes from Documentation/process/coding-style.rst:

15) The inline disease
--

There appears to be a common misperception that gcc has a magic "make me
faster" speedup option called ``inline``. While the use of inlines can be
appropriate (for example as a means of replacing macros, see Chapter 12), it
very often is not. Abundant use of the inline keyword leads to a much bigger
kernel, which in turn slows the system as a whole down, due to a bigger
icache footprint for the CPU and simply because there is less memory
available for the pagecache. Just think about it; a pagecache miss causes a
disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
that can go into these 5 milliseconds.

A reasonable rule of thumb is to not put inline at functions that have more
than 3 lines of code in them. An exception to this rule are the cases where
a parameter is known to be a compiletime constant, and as a result of this
constantness you *know* the compiler will be able to optimize most of your
function away at compile time. For a good example of this later case, see
the kmalloc() inline function.

Often people argue that adding inline to functions that are static and used
only once is always a win since there is no space tradeoff. While this is
technically correct, gcc is capable of inlining these automatically without
help, and the maintenance issue of removing the inline when a second user
appears outweighs the potential value of the hint that tells gcc to do
something it would have done anyway.

Jason


Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator

2024-06-04 Thread Steven Rostedt
On Tue, 04 Jun 2024 12:13:15 +0200
Paolo Abeni  wrote:

> On Thu, 2024-05-30 at 20:16 +, Mina Almasry wrote:
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index d82f92d7cf9ce..d5fac8edf621d 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct 
> > gen_pool *genpool,
> > kfree(owner);
> >  }
> >  
> > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov 
> > *niov)  
> 
> Minor nit: please no 'inline' keyword in c files.

I'm curious. Is this a networking rule? I use 'inline' in my C code all the
time.

-- Steve


Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator

2024-06-04 Thread Paolo Abeni
On Thu, 2024-05-30 at 20:16 +, Mina Almasry wrote:
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index d82f92d7cf9ce..d5fac8edf621d 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct 
> gen_pool *genpool,
>   kfree(owner);
>  }
>  
> +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)

Minor nit: please no 'inline' keyword in c files.

Thanks,

Paolo