Re: [PATCH net-next] tcp: add tracepoints for data send/recv/acked

2023-12-04 Thread Xuan Zhuo
On Mon, 4 Dec 2023 13:28:21 +0100, Eric Dumazet  wrote:
> On Mon, Dec 4, 2023 at 12:43 PM Philo Lu  wrote:
> >
> > Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> > which will be called every time a tcp data packet is sent, received, or
> > acked.
> > tcp_data_send: called after a data packet is sent.
> > tcp_data_recv: called after a data packet is received.
> > tcp_data_acked: called after a valid ack packet is processed (some sent
> > data are acknowledged).
> >
> > We use these callbacks for fine-grained tcp monitoring, which collects
> > and analyses information about every tcp request/response event. The
> > whole system has been described in SIGMOD'18 (see
> > https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> > achieve this with bpf, we require hooks for data events that call a bpf
> > prog (1) when any data packet is sent/received/acked, and (2) after
> > critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> > rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> > Besides, these tracepoints help to debug tcp when data is
> > sent/received/acked.
>
> This I do not understand.
>
> >
> > Though kretprobe/fexit can also be used to collect this information,
> > they will not work if the kernel functions get inlined. Considering
> > stability, we prefer tracepoints as the solution.
>
> I dunno, this seems quite weak to me. I see many patches coming to add
> tracing in the stack, but no patches fixing any issues.


We have implemented a mechanism that splits requests and responses out of a
TCP connection using these hooks, and it can handle various protocols such as
HTTP, HTTPS, Redis, and MySQL. This mechanism allows us to record important
information about each request and response, including the amount of data
uploaded, the time taken by the server to handle the request, and the time
taken for the client to receive the response. This mechanism has been running
internally for many years and has proven to be very useful.

One of the main benefits of this mechanism is that it helps locate the source
of any issues that arise: whether the problem lies in the network, the
application, or the machine, we can use it to identify and isolate the fault.

TCP has long been a challenge when it comes to tracking the transmission of data
on the network. The application can only confirm that it has sent a certain
amount of data to the kernel, but it has limited visibility into whether the
client has actually received this data. Our mechanism addresses this issue by
providing insights into the amount of data received by the client and the time
it was received. Furthermore, we can also detect any packet loss or delays
caused by the server.

https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7912288961/9732df025beny.svg

So we are not adding tracepoints for some vague, open-ended debugging.
We have a clear goal; debugging is just an incidental capability.

Thanks.


>
> It really looks like: we do not know how the TCP stack works, we do not
> know if there is any issue, so let us add tracepoints to help us make
> forward progress in our analysis.
>
> These tracepoints will not tell how many segments/bytes were
> sent/acked/received, and I really do not see how we will avoid adding
> more stuff in the future, forcing the compiler to save more state just
> in case a tracepoint needs the info.
>
> The argument of "add minimal info" so that we can silently add more
> stuff in the future "for free" is not something I buy.
>
> I very much prefer that you make sure the stuff you need is not
> inlined, so that standard kprobe/kretprobe facility can be used.



[PATCH v2] 9p: prevent read overrun in protocol dump tracepoint

2023-12-04 Thread JP Kobryn
An out of bounds read can occur within the tracepoint 9p_protocol_dump. In
the fast assign, there is a memcpy that uses a constant size of 32 (macro
named P9_PROTO_DUMP_SZ). When the copy is invoked, the source buffer is not
guaranteed to match this size. It was found that in some cases the source
buffer size is less than 32, resulting in a read that overruns.

The size of the source buffer appears to be known at the time the tracepoint
is invoked. The allocation happens within p9_fcall_init(), where the capacity
field is set to the allocated size of the payload buffer. This patch fixes
the overrun by changing the fixed array to a dynamically sized array whose
length is the minimum of the capacity value and P9_PROTO_DUMP_SZ. The trace
log statement is adjusted to account for this. Note that the trace log no
longer splits the payload on the first 16 bytes; the full payload is now
logged on a single line.
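
To see the effect of the clamp in isolation, here is a small userspace
model (illustrative only; the buffer names and sizes are made up):

#include <stdio.h>
#include <string.h>

#define P9_PROTO_DUMP_SZ 32

int main(void)
{
	unsigned char src[16] = "short payload";	/* capacity < 32 */
	unsigned char line[P9_PROTO_DUMP_SZ];
	size_t capacity = sizeof(src);
	/* mirrors min_t(size_t, pdu->capacity, P9_PROTO_DUMP_SZ): never
	 * read past the end of the source buffer */
	size_t len = capacity < P9_PROTO_DUMP_SZ ? capacity : P9_PROTO_DUMP_SZ;

	memcpy(line, src, len);
	printf("copied %zu of up to %d bytes\n", len, P9_PROTO_DUMP_SZ);
	return 0;
}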

To reproduce the original problem, operations on a Plan 9 managed resource
can be used. The simplest approach is probably mounting a shared filesystem
(between host and guest VM) using the Plan 9 protocol while the tracepoint
is enabled.

mount -t 9p -o trans=virtio <mount_tag> <mount_point>

The bpftrace program below can be used to show the out of bounds read.
Note that a recent version of bpftrace is needed for the raw tracepoint
support. The script was tested using v0.19.0.

/* from include/net/9p/9p.h */
struct p9_fcall {
	u32 size;
	u8 id;
	u16 tag;
	size_t offset;
	size_t capacity;
	struct kmem_cache *cache;
	u8 *sdata;
	bool zc;
};

tracepoint:9p:9p_protocol_dump
{
	/* out of bounds read can happen when this tracepoint is enabled */
}

rawtracepoint:9p_protocol_dump
{
	$pdu = (struct p9_fcall *)arg1;
	$dump_sz = (uint64)32;

	if ($dump_sz > $pdu->capacity) {
		printf("reading %zu bytes from src buffer of %zu bytes\n",
		    $dump_sz, $pdu->capacity);
	}
}

Signed-off-by: JP Kobryn 
---
 include/trace/events/9p.h | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/9p.h b/include/trace/events/9p.h
index 4dfa6d7f83ba..cd104a1343e2 100644
--- a/include/trace/events/9p.h
+++ b/include/trace/events/9p.h
@@ -178,18 +178,21 @@ TRACE_EVENT(9p_protocol_dump,
	__field(void *, clnt)
	__field(__u8, type)
	__field(__u16, tag)
-	__array(unsigned char, line, P9_PROTO_DUMP_SZ)
+   __dynamic_array(unsigned char, line,
+   min_t(size_t, pdu->capacity, P9_PROTO_DUMP_SZ))
),
 
TP_fast_assign(
__entry->clnt   =  clnt;
__entry->type   =  pdu->id;
__entry->tag=  pdu->tag;
-   memcpy(__entry->line, pdu->sdata, P9_PROTO_DUMP_SZ);
+   memcpy(__get_dynamic_array(line), pdu->sdata,
+   __get_dynamic_array_len(line));
),
-   TP_printk("clnt %lu %s(tag = %d)\n%.3x: %16ph\n%.3x: %16ph\n",
+   TP_printk("clnt %lu %s(tag = %d)\n%*ph\n",
  (unsigned long)__entry->clnt, show_9p_op(__entry->type),
- __entry->tag, 0, __entry->line, 16, __entry->line + 16)
+ __entry->tag, __get_dynamic_array_len(line),
+ __get_dynamic_array(line))
  );
 
 
-- 
2.43.0




Re: [PATCH] 9p: prevent read overrun in protocol dump tracepoint

2023-12-04 Thread JP Kobryn
On Sun, Dec 03, 2023 at 02:32:15PM +0900, Dominique Martinet wrote:
> Steven Rostedt wrote on Sat, Dec 02, 2023 at 11:15:24PM -0500:
> > > Also, for custom tracepoints e.g. bpftrace the program needs to know how
> > > many bytes can be read safely even if it's just for dumping -- unless
> > > dynamic_array is a "fat pointer" that conveys its own size?
> > > (Sorry didn't take the time to check)
> > 
> > Yes, there's also a __get_dynamic_array_len(line) that will return the
> > allocated length of the line. Is that what you need?
> 
> Yes, thanks! So the lower two bytes of the field are its position in
> the entry and the higher two bytes its size; ok.
> It doesn't look like bpftrace has any helper for it but that can
> probably be sorted out if someone wants to dump data there.
> 
> 
> Let's update the event to use a dynamic array and have the printk format
> use %*ph with that length.
> 
> JP Kobryn, does that sound good to you? I'm not sure what you were
> trying to do in the first place.
> Do you want to send a v2 or shall I?

Sounds good. I'll send out a v2. Thanks Steve for recommending the
dynamic array macros.
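
For anyone consuming the field from a custom program in the meantime, the
__data_loc word Dominique describes can be unpacked by hand (a minimal
userspace sketch; the helper name is ours):

#include <stdint.h>
#include <stdio.h>

/* a __data_loc field is a single u32: the low 16 bits hold the offset
 * of the payload within the trace entry, the high 16 bits its length */
static void decode_data_loc(uint32_t loc, uint16_t *off, uint16_t *len)
{
	*off = loc & 0xffff;
	*len = loc >> 16;
}

int main(void)
{
	uint16_t off, len;

	decode_data_loc(0x00200018u, &off, &len);
	printf("payload at offset %u, length %u\n", off, len); /* 24, 32 */
	return 0;
}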

JP
> 
> -- 
> Dominique Martinet | Asmadeus



Re: [PATCH net-next] tcp: add tracepoints for data send/recv/acked

2023-12-04 Thread Eric Dumazet
On Mon, Dec 4, 2023 at 12:43 PM Philo Lu  wrote:
>
> Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> which will be called every time a tcp data packet is sent, received, or
> acked.
> tcp_data_send: called after a data packet is sent.
> tcp_data_recv: called after a data packet is received.
> tcp_data_acked: called after a valid ack packet is processed (some sent
> data are acknowledged).
>
> We use these callbacks for fine-grained tcp monitoring, which collects
> and analyses information about every tcp request/response event. The
> whole system has been described in SIGMOD'18 (see
> https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> achieve this with bpf, we require hooks for data events that call a bpf
> prog (1) when any data packet is sent/received/acked, and (2) after
> critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> Besides, these tracepoints help to debug tcp when data is
> sent/received/acked.

This I do not understand.

>
> Though kretprobe/fexit can also be used to collect this information,
> they will not work if the kernel functions get inlined. Considering
> stability, we prefer tracepoints as the solution.

I dunno, this seems quite weak to me. I see many patches coming to add
tracing in the stack, but no patches fixing any issues.

It really looks like: we do not know how the TCP stack works, we do not
know if there is any issue, so let us add tracepoints to help us make
forward progress in our analysis.

These tracepoints will not tell how many segments/bytes were
sent/acked/received, and I really do not see how we will avoid adding
more stuff in the future, forcing the compiler to save more state just
in case a tracepoint needs the info.

The argument of "add minimal info" so that we can silently add more
stuff in the future "for free" is not something I buy.

I very much prefer that you make sure the stuff you need is not
inlined, so that standard kprobe/kretprobe facility can be used.
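
For illustration, the non-inlining approach boils down to keeping a
distinct symbol for the probe to attach to. A minimal userspace sketch of
the attribute (the kernel spells it with the noinline macro; the function
here is only a stand-in):

#include <stdio.h>

/* out-of-line functions keep a symbol that kprobe/kretprobe can
 * attach to, even at high optimization levels */
__attribute__((noinline)) static int handle_ack(int acked_bytes)
{
	return acked_bytes > 0;
}

int main(void)
{
	printf("valid ack: %d\n", handle_ack(1460));
	return 0;
}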



[PATCH net-next] tcp: add tracepoints for data send/recv/acked

2023-12-04 Thread Philo Lu
Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
which will be called every time a tcp data packet is sent, received, or
acked.
tcp_data_send: called after a data packet is sent.
tcp_data_recv: called after a data packet is received.
tcp_data_acked: called after a valid ack packet is processed (some sent
data are acknowledged).

We use these callbacks for fine-grained tcp monitoring, which collects
and analyses information about every tcp request/response event. The
whole system has been described in SIGMOD'18 (see
https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
achieve this with bpf, we require hooks for data events that call a bpf
prog (1) when any data packet is sent/received/acked, and (2) after
critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
rcv_nxt). However, existing bpf hooks cannot meet our requirements.
Besides, these tracepoints help to debug tcp when data is
sent/received/acked.

Though kretprobe/fexit can also be used to collect this information,
they will not work if the kernel functions get inlined. Considering
stability, we prefer tracepoints as the solution.
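
As a sketch of the intended consumer (assuming this patch is applied; the
handler and counter names below are ours), a libbpf-style program can
attach to one of the new tracepoints like so:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

__u64 data_send_events;

SEC("tracepoint/tcp/tcp_data_send")
int handle_tcp_data_send(void *ctx)
{
	/* count every data transmission reported by the tracepoint */
	__sync_fetch_and_add(&data_send_events, 1);
	return 0;
}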

Signed-off-by: Philo Lu 
Signed-off-by: Xuan Zhuo 
---
 include/trace/events/tcp.h | 21 +++++++++++++++++++++
 net/ipv4/tcp_input.c       |  4 ++++
 net/ipv4/tcp_output.c      |  2 ++
 3 files changed, 27 insertions(+)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 7b1ddffa3dfc..1423f7cb73f9 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -113,6 +113,13 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_send_reset,
TP_ARGS(sk, skb)
 );
 
+DEFINE_EVENT(tcp_event_sk_skb, tcp_data_recv,
+
+   TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
+
+   TP_ARGS(sk, skb)
+);
+
 /*
  * tcp event with arguments sk
  *
@@ -187,6 +194,20 @@ DEFINE_EVENT(tcp_event_sk, tcp_rcv_space_adjust,
TP_ARGS(sk)
 );
 
+DEFINE_EVENT(tcp_event_sk, tcp_data_send,
+
+   TP_PROTO(struct sock *sk),
+
+   TP_ARGS(sk)
+);
+
+DEFINE_EVENT(tcp_event_sk, tcp_data_acked,
+
+   TP_PROTO(struct sock *sk),
+
+   TP_ARGS(sk)
+);
+
 TRACE_EVENT(tcp_retransmit_synack,
 
TP_PROTO(const struct sock *sk, const struct request_sock *req),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bcb55d98004c..edb1e24a3423 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -824,6 +824,8 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 
now = tcp_jiffies32;
 
+   trace_tcp_data_recv(sk, skb);
+
if (!icsk->icsk_ack.ato) {
/* The _first_ data packet received, initialize
 * delayed ACK engine.
@@ -3486,6 +3488,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, const struct sk_buff *ack_skb,
}
}
 #endif
+
+   trace_tcp_data_acked(sk);
return flag;
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index eb13a55d660c..cb6f2af55ce2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2821,6 +2821,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
/* Send one loss probe per tail loss episode. */
if (push_one != 2)
tcp_schedule_loss_probe(sk, false);
+
+   trace_tcp_data_send(sk);
return false;
}
return !tp->packets_out && !tcp_write_queue_empty(sk);
-- 
2.32.0.3.g01195cf9f