Re: [PATCH net-next v24 08/13] net: add support for skbs with unreadable frags

2024-09-04 Thread Eric Dumazet
On Wed, Sep 4, 2024 at 5:18 PM Mina Almasry  wrote:
>
> On Tue, Sep 3, 2024 at 2:40 PM Jakub Kicinski  wrote:
> >
> > On Sat, 31 Aug 2024 00:43:08 + Mina Almasry wrote:
> > >  static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb)
> > >  {
> > > - return likely(!TCP_SKB_CB(skb)->eor);
> > > + return likely(!TCP_SKB_CB(skb)->eor && skb_frags_readable(skb));
> >
> > Do you remember why this is here?
>
> Yeah, to be honest, when I first implemented some of these checks I
> erred on the side of caution, and added checks around anything that
> looked concerning, some of these unnecessary checks got removed, but
> looks like this one didn't.
>
> > Both for Rx and Tx what should matter
> > is whether the "readability" matches, right? We can merge two unreadable
> > messages.
>
> Yes, you're right, only 'readability matches' should be the criteria
> here. `tcp_skb_can_collapse` already checks readability is matching
> correctly, so no issue there. The `tcp_skb_can_collapse_to` check
> you're commenting on here looks unnecessary. I will remove it and run
> that through some testing.
>
> As an aside, it looks to me like that tcp_skb_can_collapse_to
> callsites don't seem to be doing any collapsing. Unless I misread the
> code. It looks like tcp_skb_can_collapse_to is used as an eor check. I
> can rename the function to tcp_skb_is_eor() or something if that makes
> sense (in a separate patch). I think the name of the function confused
> me slightly and made me think I need to do a readability check.
>

Please do not use tcp_skb_is_eor()

tcp_skb_can_collapse_to() could be renamed to tcp_skb_can_aggregate_to()
or tcp_skb_can_append_to(), but EOR would not help at all.



Re: [PATCH net-next v9 0/7] Implement reset reason mechanism to detect

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:13 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> In production, there are so many cases about why the RST skb is sent but
> we don't have a very convenient/fast method to detect the exact underlying
> reasons.
>
> RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
> and active kind (like tcp_send_active_reset()). The former can be traced
> carefully 1) in TCP, with the help of drop reasons, which is based on
> Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
> RFC 8684. The latter is relatively independent, which should be
> implemented on our own, such as active reset reasons which can not be
> replace by skb drop reason or something like this.
>
> In this series, I focus on the fundamental implement mostly about how
> the rstreason mechanism works and give the detailed passive part as an
> example, not including the active reset part. In future, we can go
> further and refine those NOT_SPECIFIED reasons.
>
> Here are some examples when tracing:
> -0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
> skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
> -0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
> skaddr=x src=x dest=x state=x reason=NO_SOCKET
>
> [1]
> Link: 
> https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

Reviewed-by: Eric Dumazet 

Thanks !



Re: [PATCH net-next v9 7/7] rstreason: make it work in trace world

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:14 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> At last, we should let it work by introducing this reset reason in
> trace world.
>
> One of the possible expected outputs is:
> ... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
> state=TCP_ESTABLISHED reason=NOT_SPECIFIED
>
> Signed-off-by: Jason Xing 
> Reviewed-by: Steven Rostedt (Google) 
> ---

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v9 6/7] mptcp: introducing a helper into active reset logic

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:14 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Since we have mapped every mptcp reset reason definition in enum
> sk_rst_reason, introducing a new helper can cover some missing places
> where we have already set the subflow->reset_reason.
>
> Note: using SK_RST_REASON_NOT_SPECIFIED is the same as
> SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert
> it directly.
>
> Suggested-by: Paolo Abeni 
> Signed-off-by: Jason Xing 
> Reviewed-by: Matthieu Baerts (NGI0) 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v9 5/7] mptcp: support rstreason for passive reset

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:14 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> It relies on what reset options in the skb are as rfc8684 says. Reusing
> this logic can save us much energy. This patch replaces most of the prior
> NOT_SPECIFIED reasons.
>
> Signed-off-by: Jason Xing 
> Reviewed-by: Matthieu Baerts (NGI0) 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v9 3/7] rstreason: prepare for active reset

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:14 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Like what we did to passive reset:
> only passing possible reset reason in each active reset path.
>
> No functional changes.
>
> Signed-off-by: Jason Xing 
> Acked-by: Matthieu Baerts (NGI0) 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v9 2/7] rstreason: prepare for passive reset

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:14 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Adjust the parameter and support passing reason of reset which
> is for now NOT_SPECIFIED. No functional changes.
>
> Signed-off-by: Jason Xing 
> Acked-by: Matthieu Baerts (NGI0) 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v9 1/7] net: introduce rstreason to detect why the RST is sent

2024-04-26 Thread Eric Dumazet
On Thu, Apr 25, 2024 at 5:13 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Add a new standalone file for the easy future extension to support
> both active reset and passive reset in the TCP/DCCP/MPTCP protocols.
>
> This patch only does the preparations for reset reason mechanism,
> nothing else changes.
>
> The reset reasons are divided into three parts:
> 1) reuse drop reasons for passive reset in TCP
> 2) our own independent reasons which aren't relying on other reasons at all
> 3) reuse MP_TCPRST option for MPTCP
>
> The benefits of a standalone reset reason are listed here:
> 1) it can cover more than one case, such as reset reasons in MPTCP,
> active reset reasons.
> 2) people can easily/fastly understand and maintain this mechanism.
> 3) we get unified format of output with prefix stripped.
> 4) more new reset reasons are on the way
> ...
>
> I will implement the basic codes of active/passive reset reason in
> those three protocols, which are not complete for this moment. For
> passive reset part in TCP, I only introduce the NO_SOCKET common case
> which could be set as an example.
>
> After this series applied, it will have the ability to open a new
> gate to let other people contribute more reasons into it :)
>
> Signed-off-by: Jason Xing 
> Acked-by: Matthieu Baerts (NGI0) 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-19 Thread Eric Dumazet
On Fri, Apr 19, 2024 at 9:29 AM Jason Xing  wrote:
>
> On Fri, Apr 19, 2024 at 3:02 PM Eric Dumazet  wrote:
> >
> > On Fri, Apr 19, 2024 at 4:31 AM Jason Xing  
> > wrote:
> > >
> > > On Fri, Apr 19, 2024 at 7:26 AM Jason Xing  
> > > wrote:
> > > >
> > > > > When I said "If you feel the need to put them in a special group, this
> > > > > is fine by me.",
> > > > > this was really about partitioning the existing enum into groups, if
> > > > > you prefer having a group of 'RES reasons'
> > > >
> > > > Are you suggesting copying what we need from enum skb_drop_reason{} to
> > > > enum sk_rst_reason{}? Why not reusing them directly. I have no idea
> > > > what the side effect of cast conversion itself is?
> > >
> > > Sorry that I'm writing this email. I'm worried my statement is not
> > > that clear, so I write one simple snippet which can help me explain
> > > well :)
> > >
> > > Allow me give NO_SOCKET as an example:
> > > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> > > index e63a3bf99617..2c9f7364de45 100644
> > > --- a/net/ipv4/icmp.c
> > > +++ b/net/ipv4/icmp.c
> > > @@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
> > > int code, __be32 info,
> > > if (!fl4.saddr)
> > > fl4.saddr = htonl(INADDR_DUMMY);
> > >
> > > +   trace_icmp_send(skb_in, type, code);
> > > icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
> > >  ende:
> > > ip_rt_put(rt);
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 1e650ec71d2f..d5963831280f 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -2160,6 +2160,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > >  {
> > > struct net *net = dev_net(skb->dev);
> > > enum skb_drop_reason drop_reason;
> > > +   enum sk_rst_reason rst_reason;
> > > int sdif = inet_sdif(skb);
> > > int dif = inet_iif(skb);
> > > const struct iphdr *iph;
> > > @@ -2355,7 +2356,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > >  bad_packet:
> > > __TCP_INC_STATS(net, TCP_MIB_INERRS);
> > > } else {
> > > -   tcp_v4_send_reset(NULL, skb);
> > > +   rst_reason = RST_REASON_NO_SOCKET;
> > > +   tcp_v4_send_reset(NULL, skb, rst_reason);
> > > }
> > >
> > >  discard_it:
> > >
> > > As you can see, we need to add a new 'rst_reason' variable which
> > > actually is the same as drop reason. They are the same except for the
> > > enum type... Such rst_reasons/drop_reasons are all over the place.
> > >
> > > Eric, if you have a strong preference, I can do it as you said.
> > >
> > > Well, how about explicitly casting them like this based on the current
> > > series. It looks better and clearer and more helpful to people who is
> > > reading codes to understand:
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 461b4d2b7cfe..eb125163d819 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -1936,7 +1936,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff 
> > > *skb)
> > > return 0;
> > >
> > >  reset:
> > > -   tcp_v4_send_reset(rsk, skb, (u32)reason);
> > > +   tcp_v4_send_reset(rsk, skb, (enum sk_rst_reason)reason);
> > >  discard:
> > > kfree_skb_reason(skb, reason);
> > > /* Be careful here. If this function gets more complicated and
> >
> > It makes no sense to declare an enum sk_rst_reason and then convert it
> > to drop_reason
> > or vice versa.
> >
> > Next thing you know, compiler guys will add a new -Woption that will
> > forbid such conversions.
> >
> > Please add to tcp_v4_send_reset() an skb_drop_reason, not a new enum.
>
> Ah... It looks like I didn't make myself clear again. Sorry...
> Actually I wrote this part many times. My conclusion is that It's not
> feasible to do this.
>
> REASONS:
> If we __only__ need to deal with this passive reset in TCP, it's fine.
> We pass a skb_drop_reason which should also be converted to u32 type
> in tcp_v4_send_reset() as you said, it can work. People who use the
&g

Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-19 Thread Eric Dumazet
On Fri, Apr 19, 2024 at 4:31 AM Jason Xing  wrote:
>
> On Fri, Apr 19, 2024 at 7:26 AM Jason Xing  wrote:
> >
> > > When I said "If you feel the need to put them in a special group, this
> > > is fine by me.",
> > > this was really about partitioning the existing enum into groups, if
> > > you prefer having a group of 'RES reasons'
> >
> > Are you suggesting copying what we need from enum skb_drop_reason{} to
> > enum sk_rst_reason{}? Why not reusing them directly. I have no idea
> > what the side effect of cast conversion itself is?
>
> Sorry that I'm writing this email. I'm worried my statement is not
> that clear, so I write one simple snippet which can help me explain
> well :)
>
> Allow me give NO_SOCKET as an example:
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index e63a3bf99617..2c9f7364de45 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -767,6 +767,7 @@ void __icmp_send(struct sk_buff *skb_in, int type,
> int code, __be32 info,
> if (!fl4.saddr)
> fl4.saddr = htonl(INADDR_DUMMY);
>
> +   trace_icmp_send(skb_in, type, code);
> icmp_push_reply(sk, &icmp_param, &fl4, &ipc, &rt);
>  ende:
> ip_rt_put(rt);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 1e650ec71d2f..d5963831280f 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2160,6 +2160,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
>  {
> struct net *net = dev_net(skb->dev);
> enum skb_drop_reason drop_reason;
> +   enum sk_rst_reason rst_reason;
> int sdif = inet_sdif(skb);
> int dif = inet_iif(skb);
> const struct iphdr *iph;
> @@ -2355,7 +2356,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
>  bad_packet:
> __TCP_INC_STATS(net, TCP_MIB_INERRS);
> } else {
> -   tcp_v4_send_reset(NULL, skb);
> +   rst_reason = RST_REASON_NO_SOCKET;
> +   tcp_v4_send_reset(NULL, skb, rst_reason);
> }
>
>  discard_it:
>
> As you can see, we need to add a new 'rst_reason' variable which
> actually is the same as drop reason. They are the same except for the
> enum type... Such rst_reasons/drop_reasons are all over the place.
>
> Eric, if you have a strong preference, I can do it as you said.
>
> Well, how about explicitly casting them like this based on the current
> series. It looks better and clearer and more helpful to people who is
> reading codes to understand:
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 461b4d2b7cfe..eb125163d819 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1936,7 +1936,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
> return 0;
>
>  reset:
> -   tcp_v4_send_reset(rsk, skb, (u32)reason);
> +   tcp_v4_send_reset(rsk, skb, (enum sk_rst_reason)reason);
>  discard:
> kfree_skb_reason(skb, reason);
> /* Be careful here. If this function gets more complicated and

It makes no sense to declare an enum sk_rst_reason and then convert it
to drop_reason
or vice versa.

Next thing you know, compiler guys will add a new -Woption that will
forbid such conversions.

Please add to tcp_v4_send_reset() an skb_drop_reason, not a new enum.

skb_drop_reason are simply values that are later converted to strings...

So : Do not declare a new enum.



Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-18 Thread Eric Dumazet
On Thu, Apr 18, 2024 at 6:24 PM Jason Xing  wrote:
>
> On Thu, Apr 18, 2024 at 11:46 PM Jakub Kicinski  wrote:
> >
> > On Thu, 18 Apr 2024 11:30:02 +0800 Jason Xing wrote:
> > > I'm not sure why the patch series has been changed to 'Changes
> > > Requested', until now I don't think I need to change something.
> > >
> > > Should I repost this series (keeping the v6 tag) and then wait for
> > > more comments?
> >
> > If Eric doesn't like it - it's not getting merged.
>
> I'm not a English native speaker. If I understand correctly, it seems
> that Eric doesn't object to the patch series. Here is the quotation
> [1]:
> "If you feel the need to put them in a special group, this is fine by me."
>
> This rst reason mechanism can cover all the possible reasons for both
> TCP and MPTCP. We don't need to reinvent some definitions of reset
> reasons which are totally the same as drop reasons. Also, we don't
> need to reinvent something to cover MPTCP. If people are willing to
> contribute more rst reasons, they can find a good place.
>
> Reset is one big complicated 'issue' in production. I spent a lot of
> time handling all kinds of reset reasons daily. I'm apparently not the
> only one. I just want to make admins' lives easier, including me. This
> special/separate reason group is important because we can extend it in
> the future, which will not get confused.
>
> I hope it can have a chance to get merged :) Thank you.
>
> [1]: 
> https://lore.kernel.org/all/cann89i+alo_agyc8dr8dkfyi+6wpzcgrogysvgr8frfrvaa...@mail.gmail.com/
>
> Thanks,
> Jason

My objection was these casts between enums. Especially if hiding with (u32)

I see no reason for adding these casts in TCP stack.

When I said "If you feel the need to put them in a special group, this
is fine by me.",
this was really about partitioning the existing enum into groups, if
you prefer having a group of 'RES reasons'



Re: [PATCH net-next v6 1/7] net: introduce rstreason to detect why the RST is sent

2024-04-17 Thread Eric Dumazet
On Wed, Apr 17, 2024 at 10:51 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Add a new standalone file for the easy future extension to support
> both active reset and passive reset in the TCP/DCCP/MPTCP protocols.
>
> This patch only does the preparations for reset reason mechanism,
> nothing else changes.
>
> The reset reasons are divided into three parts:
> 1) reuse drop reasons for passive reset in TCP
> 2) reuse MP_TCPRST option for MPTCP
> 3) our own reasons
>
> I will implement the basic codes of active/passive reset reason in
> those three protocols, which is not complete for this moment. But
> it provides a new chance to let other people add more reasons into
> it:)
>
> Signed-off-by: Jason Xing 

My original suggestion was to use normal values in  'enum
skb_drop_reason', even if there was not necessarily a 'drop'
in the common sense.

https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

This would avoid these ugly casts later, even casting an enum to other
ones is not very logical.
Going through an u32 pivot is quite a hack.

If you feel the need to put them in a special group, this is fine by me.



Re: [PATCH net-next v3 2/3] trace: tcp: fully support trace_tcp_send_reset

2024-03-29 Thread Eric Dumazet
On Fri, Mar 29, 2024 at 11:23 AM Jason Xing  wrote:
>
> On Fri, Mar 29, 2024 at 5:07 PM Eric Dumazet  wrote:
> >
> > On Fri, Mar 29, 2024 at 4:43 AM Jason Xing  
> > wrote:
> > >
> > > From: Jason Xing 
> > >
> > > Prior to this patch, what we can see by enabling trace_tcp_send is
> > > only happening under two circumstances:
> > > 1) active rst mode
> > > 2) non-active rst mode and based on the full socket
> > >
> > > That means the inconsistency occurs if we use tcpdump and trace
> > > simultaneously to see how rst happens.
> > >
> > > It's necessary that we should take into other cases into considerations,
> > > say:
> > > 1) time-wait socket
> > > 2) no socket
> > > ...
> > >
> > > By parsing the incoming skb and reversing its 4-tuple can
> > > we know the exact 'flow' which might not exist.
> > >
> > > Samples after applied this patch:
> > > 1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port
> > > state=TCP_ESTABLISHED
> > > 2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port
> > > state=UNKNOWN
> > > Note:
> > > 1) UNKNOWN means we cannot extract the right information from skb.
> > > 2) skbaddr/skaddr could be 0
> > >
> > > Signed-off-by: Jason Xing 
> > > ---
> > >  include/trace/events/tcp.h | 39 --
> > >  net/ipv4/tcp_ipv4.c|  4 ++--
> > >  net/ipv6/tcp_ipv6.c|  3 ++-
> > >  3 files changed, 41 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> > > index 194425f69642..289438c54227 100644
> > > --- a/include/trace/events/tcp.h
> > > +++ b/include/trace/events/tcp.h
> > > @@ -78,11 +78,46 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
> > >   * skb of trace_tcp_send_reset is the skb that caused RST. In case of
> > >   * active reset, skb should be NULL
> > >   */
> > > -DEFINE_EVENT(tcp_event_sk_skb, tcp_send_reset,
> > > +TRACE_EVENT(tcp_send_reset,
> > >
> > > TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
> > >
> > > -   TP_ARGS(sk, skb)
> > > +   TP_ARGS(sk, skb),
> > > +
> > > +   TP_STRUCT__entry(
> > > +   __field(const void *, skbaddr)
> > > +   __field(const void *, skaddr)
> > > +   __field(int, state)
> > > +   __array(__u8, saddr, sizeof(struct sockaddr_in6))
> > > +   __array(__u8, daddr, sizeof(struct sockaddr_in6))
> > > +   ),
> > > +
> > > +   TP_fast_assign(
> > > +   __entry->skbaddr = skb;
> > > +   __entry->skaddr = sk;
> > > +   /* Zero means unknown state. */
> > > +   __entry->state = sk ? sk->sk_state : 0;
> > > +
> > > +   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
> > > +   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
> > > +
> > > +   if (sk && sk_fullsock(sk)) {
> > > +   const struct inet_sock *inet = inet_sk(sk);
> > > +
> > > +   TP_STORE_ADDR_PORTS(__entry, inet, sk);
> > > +   } else {
> >
> > To be on the safe side, I would test if (skb) here.
> > We have one caller with skb == NULL, we might have more in the future.
>
> Thanks for the review.
>
> How about changing '} else {' to '} else if (skb) {', then if we go
> into this else-if branch, we will print nothing, right? I'll test it
> in this case.

Right, the fields are cleared before this else

+   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));

>
> >
> > > +   /*
> > > +* We should reverse the 4-tuple of skb, so later
> > > +* it can print the right flow direction of rst.
> > > +*/
> > > +   TP_STORE_ADDR_PORTS_SKB(skb, entry->daddr, 
> > > entry->saddr);
> > > +   }
> > > +   ),
> > > +
> > > +   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
> > > + __entry->skbaddr, __entry->skaddr,
> > > + __entry->saddr, __entry->daddr,
> > > + __entry->state ? show_tcp_state_name(__entry->state) : 
> > > "UNKNOWN")
> > >  );
> > >
> > >  /*
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index a22ee5838751..d5c4a969c066 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -868,10 +868,10 @@ static void tcp_v4_send_reset(const struct sock 
> > > *sk, struct sk_buff *skb)
> > >  */
> > > if (sk) {
> > > arg.bound_dev_if = sk->sk_bound_dev_if;
> > > -   if (sk_fullsock(sk))
> > > -   trace_tcp_send_reset(sk, skb);
> > > }
> >
> > Remove the { } ?
>
> Yes, I forgot to remove them.

No problem.



Re: [PATCH net-next v3 3/3] tcp: add location into reset trace process

2024-03-29 Thread Eric Dumazet
On Fri, Mar 29, 2024 at 4:43 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> In addition to knowing the 4-tuple of the flow which generates RST,
> the reason why it does so is very important because we have some
> cases where the RST should be sent and have no clue which one
> exactly.
>
> Adding location of reset process can help us more, like what
> trace_kfree_skb does.

Well, I would prefer a drop_reason here, even if there is no 'dropped' packet.

This would be more stable than something based on function names that
could be changed.

tracepoints do not have to get ugly, we can easily get stack traces if needed.

perf record -a -g  -e tcp:tcp_send_reset ...



Re: [PATCH net-next v3 2/3] trace: tcp: fully support trace_tcp_send_reset

2024-03-29 Thread Eric Dumazet
On Fri, Mar 29, 2024 at 4:43 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Prior to this patch, what we can see by enabling trace_tcp_send is
> only happening under two circumstances:
> 1) active rst mode
> 2) non-active rst mode and based on the full socket
>
> That means the inconsistency occurs if we use tcpdump and trace
> simultaneously to see how rst happens.
>
> It's necessary that we should take into other cases into considerations,
> say:
> 1) time-wait socket
> 2) no socket
> ...
>
> By parsing the incoming skb and reversing its 4-tuple can
> we know the exact 'flow' which might not exist.
>
> Samples after applied this patch:
> 1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port
> state=TCP_ESTABLISHED
> 2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port
> state=UNKNOWN
> Note:
> 1) UNKNOWN means we cannot extract the right information from skb.
> 2) skbaddr/skaddr could be 0
>
> Signed-off-by: Jason Xing 
> ---
>  include/trace/events/tcp.h | 39 --
>  net/ipv4/tcp_ipv4.c|  4 ++--
>  net/ipv6/tcp_ipv6.c|  3 ++-
>  3 files changed, 41 insertions(+), 5 deletions(-)
>
> diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> index 194425f69642..289438c54227 100644
> --- a/include/trace/events/tcp.h
> +++ b/include/trace/events/tcp.h
> @@ -78,11 +78,46 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
>   * skb of trace_tcp_send_reset is the skb that caused RST. In case of
>   * active reset, skb should be NULL
>   */
> -DEFINE_EVENT(tcp_event_sk_skb, tcp_send_reset,
> +TRACE_EVENT(tcp_send_reset,
>
> TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
>
> -   TP_ARGS(sk, skb)
> +   TP_ARGS(sk, skb),
> +
> +   TP_STRUCT__entry(
> +   __field(const void *, skbaddr)
> +   __field(const void *, skaddr)
> +   __field(int, state)
> +   __array(__u8, saddr, sizeof(struct sockaddr_in6))
> +   __array(__u8, daddr, sizeof(struct sockaddr_in6))
> +   ),
> +
> +   TP_fast_assign(
> +   __entry->skbaddr = skb;
> +   __entry->skaddr = sk;
> +   /* Zero means unknown state. */
> +   __entry->state = sk ? sk->sk_state : 0;
> +
> +   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
> +   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
> +
> +   if (sk && sk_fullsock(sk)) {
> +   const struct inet_sock *inet = inet_sk(sk);
> +
> +   TP_STORE_ADDR_PORTS(__entry, inet, sk);
> +   } else {

To be on the safe side, I would test if (skb) here.
We have one caller with skb == NULL, we might have more in the future.

> +   /*
> +* We should reverse the 4-tuple of skb, so later
> +* it can print the right flow direction of rst.
> +*/
> +   TP_STORE_ADDR_PORTS_SKB(skb, entry->daddr, 
> entry->saddr);
> +   }
> +   ),
> +
> +   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
> + __entry->skbaddr, __entry->skaddr,
> + __entry->saddr, __entry->daddr,
> + __entry->state ? show_tcp_state_name(__entry->state) : 
> "UNKNOWN")
>  );
>
>  /*
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index a22ee5838751..d5c4a969c066 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -868,10 +868,10 @@ static void tcp_v4_send_reset(const struct sock *sk, 
> struct sk_buff *skb)
>  */
> if (sk) {
> arg.bound_dev_if = sk->sk_bound_dev_if;
> -   if (sk_fullsock(sk))
> -   trace_tcp_send_reset(sk, skb);
> }

Remove the { } ?


>
> +   trace_tcp_send_reset(sk, skb);
> +
> BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
>  offsetof(struct inet_timewait_sock, tw_bound_dev_if));
>
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 3f4cba49e9ee..8e9c59b6c00c 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1113,7 +1113,6 @@ static void tcp_v6_send_reset(const struct sock *sk, 
> struct sk_buff *skb)
> if (sk) {
> oif = sk->sk_bound_dev_if;
> if (sk_fullsock(sk)) {
> -   trace_tcp_send_reset(sk, skb);
> if (inet6_test_bit(REPFLOW, sk))
> label = ip6_flowlabel(ipv6h);
> priority = READ_ONCE(sk->sk_priority);
> @@ -1129,6 +1128,8 @@ static void tcp_v6_send_reset(const struct sock *sk, 
> struct sk_buff *skb)
> label = ip6_flowlabel(ipv6h);
> }
>
> +   trace_tcp_send_reset(sk, skb);
> +
> tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
>

Re: [PATCH net-next v3 1/3] trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters

2024-03-29 Thread Eric Dumazet
On Fri, Mar 29, 2024 at 4:43 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Introducing entry_saddr and entry_daddr parameters in this macro
> for later use can help us record the reverse 4-tuple by analyzing
> the 4-tuple of the incoming skb when receiving.
>
> Signed-off-by: Jason Xing 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-26 Thread Eric Dumazet
On Tue, Mar 26, 2024 at 11:44 AM Jason Xing  wrote:

> Well, it's a pity that it seems that we are about to abandon this
> method but it's not that friendly to the users who are unable to
> deploy BPF...

It is a pity these tracepoint patches are consuming a lot of reviewer
time, just because
some people 'can not deploy BPF'

Well, I came up with more ideas about how to improve the
> trace function in recent days. The motivation of doing this is that I
> encountered some issues which could be traced/diagnosed by using trace
> effortlessly without writing some bpftrace codes again and again. The
> status of trace seems not active but many people are still using it, I
> believe.

'Writing bpftrace codes again and again' is not a good reason to add
maintenance costs
to linux networking stack.



Re: [PATCH net-next v2 2/2] tcp: add tracing of skbaddr in tcp_event_skb class

2024-03-07 Thread Eric Dumazet
On Mon, Mar 4, 2024 at 10:29 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Use the existing parameter and print the address of skbaddr
> as other trace functions do.
>
> Signed-off-by: Jason Xing 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v2 1/2] tcp: add tracing of skb/skaddr in tcp_event_sk_skb class

2024-03-07 Thread Eric Dumazet
On Mon, Mar 4, 2024 at 10:29 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Printing the addresses can help us identify the exact skb/sk
> for those system in which it's not that easy to run BPF program.
> As we can see, it already fetches those, then use it directly
> and it will print like below:
>
> ...tcp_retransmit_skb: skbaddr=XXX skaddr=XXX family=AF_INET...
>
> Signed-off-by: Jason Xing 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next 1/2] tcp: add tracing of skb/skaddr in tcp_event_sk_skb class

2024-03-04 Thread Eric Dumazet
On Thu, Feb 29, 2024 at 6:10 PM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Prio to this patch, the trace function doesn't print addresses
> which might be forgotten. As we can see, it already fetches
> those, use it directly and it will print like below:
>
> ...tcp_retransmit_skb: skbaddr=XXX skaddr=XXX family=AF_INET...

It was not forgotten, but a recommendation from Alexei

https://yhbt.net/lore/all/20171010173821.6djxyvrggvaiv...@ast-mbp.dhcp.thefacebook.com/



Re: [PATCH net-next 0/2] add two missing addresses when using trace

2024-03-04 Thread Eric Dumazet
On Mon, Mar 4, 2024 at 8:01 AM Jason Xing  wrote:
>
> On Fri, Mar 1, 2024 at 1:10 AM Jason Xing  wrote:
> >
> > From: Jason Xing 
> >
> > When I reviewed other people's patch [1], I noticed that similar thing
> > also happens in tcp_event_skb class and tcp_event_sk_skb class. They
> > don't print those two addrs of skb/sk which already exist.
> >
> > They are probably forgotten by the original authors, so this time I
> > finish the work. Also, adding more trace about the socket/skb addr can
> > help us sometime even though the chance is minor.
> >
> > I don't consider it is a bug, thus I chose to target net-next tree.
>
> Gentle ping...No rush. Just in case this simple patchset was lost for
> some reason.

This was a conscious choice I think.

https://yhbt.net/lore/all/20171010173821.6djxyvrggvaiv...@ast-mbp.dhcp.thefacebook.com/



Re: [PATCH net-next] tcp: add tracepoints for data send/recv/acked

2023-12-05 Thread Eric Dumazet
On Tue, Dec 5, 2023 at 3:11 AM Xuan Zhuo  wrote:
>
> On Mon, 4 Dec 2023 13:28:21 +0100, Eric Dumazet  wrote:
> > On Mon, Dec 4, 2023 at 12:43 PM Philo Lu  wrote:
> > >
> > > Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> > > which will be called every time a tcp data packet is sent, received, and
> > > acked.
> > > tcp_data_send: called after a data packet is sent.
> > > tcp_data_recv: called after a data packet is receviced.
> > > tcp_data_acked: called after a valid ack packet is processed (some sent
> > > data are ackknowledged).
> > >
> > > We use these callbacks for fine-grained tcp monitoring, which collects
> > > and analyses every tcp request/response event information. The whole
> > > system has been described in SIGMOD'18 (see
> > > https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> > > achieve this with bpf, we require hooks for data events that call bpf
> > > prog (1) when any data packet is sent/received/acked, and (2) after
> > > critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> > > rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> > > Besides, these tracepoints help to debug tcp when data send/recv/acked.
> >
> > This I do not understand.
> >
> > >
> > > Though kretprobe/fexit can also be used to collect these information,
> > > they will not work if the kernel functions get inlined. Considering the
> > > stability, we prefer tracepoint as the solution.
> >
> > I dunno, this seems quite weak to me. I see many patches coming to add
> > tracing in the stack, but no patches fixing any issues.
>
>
> We have implemented a mechanism to split the request and response from the TCP
> connection using these "hookers", which can handle various protocols such as
> HTTP, HTTPS, Redis, and MySQL. This mechanism allows us to record important
> information about each request and response, including the amount of data
> uploaded, the time taken by the server to handle the request, and the time 
> taken
> for the client to receive the response. This mechanism has been running
> internally for many years and has proven to be very useful.
>
> One of the main benefits of this mechanism is that it helps in locating the
> source of any issues or problems that may arise. For example, if there is a
> problem with the network, the application, or the machine, we can use this
> mechanism to identify and isolate the issue.
>
> TCP has long been a challenge when it comes to tracking the transmission of 
> data
> on the network. The application can only confirm that it has sent a certain
> amount of data to the kernel, but it has limited visibility into whether the
> client has actually received this data. Our mechanism addresses this issue by
> providing insights into the amount of data received by the client and the time
> it was received. Furthermore, we can also detect any packet loss or delays
> caused by the server.
>
> https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7912288961/9732df025beny.svg
>
> So, we do not want to add some tracepoint to do some unknow debug.
> We have a clear goal. debugging is just an incidental capability.
>

We have powerful mechanisms in the stack already that ordinary (no
privilege requested) applications can readily use.

We have been using them for a while.

If existing mechanisms are missing something you need, please expand them.

For reference, start looking at tcp_get_timestamping_opt_stats() history.

Sender side can for instance get precise timestamps.

Combinations of these timestamps reveal different parts of the overall
network latency,

T0: sendmsg() enters TCP
T1: first byte enters qdisc
T2: first byte sent to the NIC
T3: first byte ACKed in TCP
T4: last byte sent to the NIC
T5: last byte ACKed
T1 - T0: how long the first byte was blocked in the TCP layer ("Head
of Line Blocking" latency).
T2 - T1: how long the first byte was blocked in the Linux traffic
shaping layer (known as QDisc).
T3 - T2: the network ‘distance’ (propagation delay + current queuing
delay along the network path and at the receiver).
T5 - T2: how fast the sent chunk was delivered.
Message Size / (T5 - T0): goodput (from application’s perspective)



Re: [PATCH net-next] tcp: add tracepoints for data send/recv/acked

2023-12-04 Thread Eric Dumazet
On Mon, Dec 4, 2023 at 12:43 PM Philo Lu  wrote:
>
> Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> which will be called every time a tcp data packet is sent, received, and
> acked.
> tcp_data_send: called after a data packet is sent.
> tcp_data_recv: called after a data packet is receviced.
> tcp_data_acked: called after a valid ack packet is processed (some sent
> data are ackknowledged).
>
> We use these callbacks for fine-grained tcp monitoring, which collects
> and analyses every tcp request/response event information. The whole
> system has been described in SIGMOD'18 (see
> https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> achieve this with bpf, we require hooks for data events that call bpf
> prog (1) when any data packet is sent/received/acked, and (2) after
> critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> Besides, these tracepoints help to debug tcp when data send/recv/acked.

This I do not understand.

>
> Though kretprobe/fexit can also be used to collect these information,
> they will not work if the kernel functions get inlined. Considering the
> stability, we prefer tracepoint as the solution.

I dunno, this seems quite weak to me. I see many patches coming to add
tracing in the stack, but no patches fixing any issues.

It really looks like : We do not know how TCP stack works, we do not
know if there is any issue,
let us add trace points to help us to make forward progress in our analysis.

These tracepoints will not tell how many segments/bytes were
sent/acked/received, I really do not see
how we will avoid adding in the future more stuff, forcing the
compiler to save more state
just in case the tracepoint needs the info.

The argument of "add minimal info", so that we can silently add more
stuff in the future "for free" is not something I buy.

I very much prefer that you make sure the stuff you need is not
inlined, so that standard kprobe/kretprobe facility can be used.