Re: UDP wierdness around skb_copy_and_csum_datagram_msg()
Trying to revive this thread. To review: skb_copy_and_csum_datagram_msg() pretty clearly doesn't do the right thing since it started using an iov_iter to copy into the user's iovec. In particular, if it encounters a datagram that fails the checksum, the iov_iter continues to point to the end of the failed datagram's data, and that data makes it out to user-space. I'd previously sent a test program that consistenly reproduced the UDP(v4) symptoms of this problem [0]. I've now also taken Christian's netem suggestion and written a quick program that sends known data over loopback TCP from one thread and reads it from another. It optionally sets up a netem qdisc that corrupts 1% of packets. As expected, even with corruption, tcp delivers correct data to the user on pre-3.19 kernels; on 3.19 and later, long transfers usually see corruptions. (Source for the TCP test program below.) I've also tried both test programs with the following patch: diff --git a/net/core/datagram.c b/net/core/datagram.c index b7de71f..61da091 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -730,6 +730,7 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, { __wsum csum; int chunk = skb->len - hlen; + struct iov_iter saved_iter; if (!chunk) return 0; @@ -741,11 +742,16 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, goto fault; } else { csum = csum_partial(skb->data, hlen, skb->csum); + + /* save msg_iter state, so we can revert if csum fails. */ + saved_iter = msg->msg_iter; if (skb_copy_and_csum_datagram(skb, hlen, &msg->msg_iter, chunk, &csum)) goto fault; - if (csum_fold(csum)) + if (csum_fold(csum)) { + msg->msg_iter = saved_iter; goto csum_error; + } if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE)) netdev_rx_csum_fault(skb->dev); } (This is essentially the same fix Alan previously sent [1], except that it uses struct assignment rather than memcpy'ing the struct iov_iter.) As expected, both UDP and TCP tests succeed under this fix. So I've missed whatever conversations happened off-list after Alan's original report. But it appears to me that this patch completely resolves the csum/iov_iter problem for both TCP and UDP; I'm not sure I see what further problem we'd want to hold the fix off for? [0] https://www.spinics.net/lists/netdev/msg398026.html [1] https://patchwork.kernel.org/patch/9260733/ New TCP test program: #include #include #include #include #include #include #include #include #include #include #include int bytes = 0; struct sockaddr_in addr; socklen_t addrLen = sizeof(addr); void *sender(void *ignore) { int send = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); if (send < 0) { fprintf(stderr, "failed to create sending socket (err=%d: %s)\n", errno, strerror(errno)); exit(1); } int ret = connect(send, (struct sockaddr *) &addr, addrLen); if (ret != 0) { fprintf(stderr, "failed to connect sending socket (err=%d: %s)\n", errno, strerror(errno)); exit(1); } int i = 0; while (i < bytes) { #define OUT_MESSAGE "" int w = write(send, OUT_MESSAGE, strlen(OUT_MESSAGE)); if (w < 0) { fprintf(stdout, "failed to write byte %d\n", i); exit(1); } i += w; } close(send); } /* set a qdisc on lo with corruption_probability 1% (or remove if if on==0) */ void setCorrupt(int on) { struct nl_sock *sock; struct nl_cache *cache; struct rtnl_qdisc *q; struct rtnl_link *link; int if_index; sock = nl_socket_alloc(); nl_connect(sock, NETLINK_ROUTE); rtnl_link_alloc_cache(sock, AF_UNSPEC, &cache); link = rtnl_link_get_by_name(cache, "lo"); if_index = rtnl_link_get_ifindex(link); q = rtnl_qdisc_alloc(); rtnl_tc_set_ifindex(TC_CAST(q), if_index); rtnl_tc_set_parent(TC_CAST(q), TC_H_ROOT); rtnl_tc_set_handle(TC_CAST(q), TC_HANDLE(1, 0)); rtnl_tc_set_kind(TC_CAST(q), "netem"); rtnl_netem_set_corruption_probability(q, 0x / 100); if (on) { int ret = rtnl_qdisc_add(sock, q, NLM_F_CREATE); if (ret < 0) { fprintf(stderr, "rtnl_qdisc_add error: %s\n", nl_geterror(ret)); exit(1); } } else { int ret = rtnl_qdisc_delete(sock, q); if (ret < 0) { fprintf(stderr, "rtnl_qdisc_del error: %s\n", nl_geterror(ret)); exit(1); } } rtnl_qdisc_put(q); nl_socket_free(sock); rtnl_link_put(link); nl_cache_put(cache); } int main(int argc, char **argv) { if (argc < 2) { fprintf(stderr, "Usage: tcpcsum [corruption-rate]"); exit(1); } bytes = atoi(argv[1]); int cor
Re: UDP wierdness around skb_copy_and_csum_datagram_msg()
Christian Lamparter writes: > On Wednesday, September 28, 2016 7:20:39 PM CEST Jay Smith wrote: >> Actually, on a little more searching of this list's archives, I think >> that this discussion: https://patchwork.kernel.org/patch/9260733/ is >> about exactly the same issue I've found, except from the TCP side. I'm >> cc'ing a few of the participants from that discussion. >> >> So is the patch proposed there (copying and restoring the entire >> iov_iter in skb_copy_and_csum_datagram_msg()) being considered as a >> fix? > > From Alan's post: > > "My ugly patch fixes this in the most obvious way: make a local copy of > msg->msg_iter before the call to skb_copy_and_csum_datagram(), and copy > it back if the checksum is bad, just before goto csum_error;" > > IMHO this meant that the patch is a proof of concept for his problem. It's also the simplest thing that fixes all of the relevant cases (udp4, udp6, tcp4). Basically, the state of the iov_iter (which, if I'm reading correctly, consists of three elements -- iov_offset, count, and nr_segs all change values as the iterator moves through the vectors) needs to be backed-up and restored at exactly the points in datagram.c where Alan's patch does so. Whether that should be done with memcpy, as Alan does, or by exposing some more abstract backup/restore functions from iov_iter.c is a matter of taste. I'm happy to accept the call of someone more maintainer-ish on that. > Al Viro identified more inconsistencies within the error-paths that deal > with EFAULT in the whole area (in and around skb_copy_and_csum_datagram()). Was this in some other thread? The only other discussion I see of that function in the "PROBLEM: network data corruption..." thread is around this patch https://patchwork.kernel.org/patch/9245023/ , which as Al says was just a diagnostic patch -- it intentionally doesn't handle the multiple-vector case. It seems like the EFAULT case in skb_copy_and_csum_datagram() would indicate that the iov_iter code ran out of room to copy the current message, even though it's checked for that room at datagram.c:738. Which I guess is possible -- there could be some non-obvious counting error in the iov_inter.c macros. But, at least in the UDP cases, it wouldn't trigger the same problem as a checksum failure -- the EFAULT gets returned to the caller in that case, and the buffer isn't meant to be valid. It's only in the checksum case that we retry underneath the udp_recvmsg() covers, and end up returning the supposedly-rejected data. > > As for fixing the issue: I'm happy to test and review patches. > The trouble is that nobody seem to be able to produce them... Sorry -- is the trouble you're talking about here that no-one's produced a patch, or that we don't have a reproduction of the problem? I don't think either is true. The test program I'd attached to my first mail reliably reproduces the UDP version of the problem. It's pretty simple: listen on a UDP port (using loopback, so that there's no hardware csum offload), use a raw socket to send a datagram with a bad UDP checksum, then send a good datagram, and then finally read from the socket. On post-3.19 kernels, you always get the contents of the bad packet at the start of the user buffer: # bin/csumtestn 69 listening on port 47193 recvmsg returned 9 bytes: BAD DATABAD DATABAD DATABAD DATABAD DATABAD DATABAD DATABAD DATABAD DGood data After Alan's patch, the good packet's contents are at the start of the buffer, where they belong: # bin/csumtestn 69 listening on port 54620 recvmsg returned 9 bytes: Good dataAD DATABAD DATABAD DATABAD DATABAD DATABAD DATABAD DATABAD D So functionally, I believe that Alan's patch does the trick. I haven't actually tested it on UDP6, but a similar test should work there. Inserting the bad packets deterministically into a TCP connection is trickier, but I thought in the previous thread that you and Alan both had wireless hardware configurations that frequently generated checksum errors, and that Alan's claim was that his patch gave him good TCP data even in the presence of those checksum errors. Or do I misunderstand? (Just to be clear, though, if there is a need for a new patch, for whatever reason, I'm happy to generate one.)
Re: UDP wierdness around skb_copy_and_csum_datagram_msg()
Actually, on a little more searching of this list's archives, I think that this discussion: https://patchwork.kernel.org/patch/9260733/ is about exactly the same issue I've found, except from the TCP side. I'm cc'ing a few of the participants from that discussion. So is the patch proposed there (copying and restoring the entire iov_iter in skb_copy_and_csum_datagram_msg()) being considered as a fix? If not, would an alternate one that concealed the save-and-restore logic inside iov_iter.c be more acceptable? I'd be happy to produce whatever's needed, or yield to someone with stronger feelings about where it should go... On Wed, Sep 28, 2016 at 6:24 PM, Eric Dumazet wrote: > On Wed, 2016-09-28 at 17:18 -0700, Jay Smith wrote: >> I've spent the last week or so trying to track down a recurring >> problem I'm seeing with UDP datagram handling. I'm new to the >> internals of the Linux network stack, but it appears to me that >> there's a substantial error in recent kernels' handling of UDP >> checksum errors. >> >> The behavior I'm seeing: I have a UDP server that receives lots of >> datagrams from many devices on a single port. A small number of those >> devices occasionally send packets with bad UDP checksums. After I >> receive one of these bad packets, the next recvmsg() made on the >> socket returns data from the bad-checksum packet, but the >> source-address and length of the next (good) packet that arrived at >> the port. >> >> I believe this issue was introduced between linux 3.18 and 3.19, by a >> set of changes to net/core/datagram.c that made >> skb_copy_and_csum_datagram_msg() and related functions use the >> iov_iter structure to copy data to user buffers. In the case where >> those functions copy a datagram and then conclude that the checksum is >> invalid, they don't remove the already-copied data from the user's >> iovec; when udp_recvmsg() calls skb_copy_and_csum_datagram_msg() for a >> second time, looking at the next datagram on the queue, that second >> datagram's data is appended to the first datagram's. So when >> recvmsg() returns to the user, the return value and msg_name reflect >> the second datagram, but the first bytes in the user's iovec come from >> the first (bad) datagram. >> >> (I've attached a test program that demonstrates the problem. Note >> that it sees correct behavior unless the bad-checksum packet has > 68 >> bytes of UDP data; otherwise, the packet doesn't make it past the >> CHECKSUM_BREAK test, and never enters >> skp_copy_and_csum_datagram_msg().) >> >> The fix for UDP seems pretty simple; the iov_iter's iov_offset member >> just needs to be set back to zero on a checksum failure. But it looks >> like skb_copy_and_csum_datagram_msg() is also called from tcp_input.c, >> where I assume that multiple sk_buffs can be copied-and-csum'd into >> the same iov -- if that's right, it seems like iov_iter needs some >> additional state to support rolling-back the most recent copy without >> losing previous ones. >> >> Any thoughts? > > Nice catch ! > > What about clearing iov_offset from UDP (v4 & v6) only ? > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index > 7d96dc2d3d08fa909f247dfbcbd0fc1eeb59862b..928da2fbb3caa6de4d0e1d889c237590f71607ea > 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -1342,6 +1342,7 @@ csum_copy_err: > /* starting over for a new packet, but check if we need to yield */ > cond_resched(); > msg->msg_flags &= ~MSG_TRUNC; > + msg->msg_iter.iov_offset = 0; > goto try_again; > } > > > (similar for ipv6) > >
UDP wierdness around skb_copy_and_csum_datagram_msg()
I've spent the last week or so trying to track down a recurring problem I'm seeing with UDP datagram handling. I'm new to the internals of the Linux network stack, but it appears to me that there's a substantial error in recent kernels' handling of UDP checksum errors. The behavior I'm seeing: I have a UDP server that receives lots of datagrams from many devices on a single port. A small number of those devices occasionally send packets with bad UDP checksums. After I receive one of these bad packets, the next recvmsg() made on the socket returns data from the bad-checksum packet, but the source-address and length of the next (good) packet that arrived at the port. I believe this issue was introduced between linux 3.18 and 3.19, by a set of changes to net/core/datagram.c that made skb_copy_and_csum_datagram_msg() and related functions use the iov_iter structure to copy data to user buffers. In the case where those functions copy a datagram and then conclude that the checksum is invalid, they don't remove the already-copied data from the user's iovec; when udp_recvmsg() calls skb_copy_and_csum_datagram_msg() for a second time, looking at the next datagram on the queue, that second datagram's data is appended to the first datagram's. So when recvmsg() returns to the user, the return value and msg_name reflect the second datagram, but the first bytes in the user's iovec come from the first (bad) datagram. (I've attached a test program that demonstrates the problem. Note that it sees correct behavior unless the bad-checksum packet has > 68 bytes of UDP data; otherwise, the packet doesn't make it past the CHECKSUM_BREAK test, and never enters skp_copy_and_csum_datagram_msg().) The fix for UDP seems pretty simple; the iov_iter's iov_offset member just needs to be set back to zero on a checksum failure. But it looks like skb_copy_and_csum_datagram_msg() is also called from tcp_input.c, where I assume that multiple sk_buffs can be copied-and-csum'd into the same iov -- if that's right, it seems like iov_iter needs some additional state to support rolling-back the most recent copy without losing previous ones. Any thoughts? #include #include #include #include #include #include #include #include #include #include #include int main(int argc, char **argv) { int ret; if (argc < 2) { fprintf(stderr, "Usage: csumtest \n"); exit(1); } struct sockaddr_in addr; socklen_t addrLen = sizeof(addr); addr.sin_family = AF_INET; addr.sin_addr.s_addr = inet_addr("127.0.0.1"); addr.sin_port = htons(0); int listen = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP); if (listen < 0) { fprintf(stderr, "failed to create listening socket (err=%d: %s)\n", errno, strerror(errno)); exit(1); } ret = bind(listen, (struct sockaddr *) &addr, addrLen); if (ret != 0) { fprintf(stderr, "failed to bind listening socket socket (err=%d: %s)\n", errno, strerror(errno)); exit(1); } ret = getsockname(listen, (struct sockaddr *) &addr, &addrLen); if (ret != 0) { fprintf(stderr, "getsockname failed for listening socket socket (err=%d: %s)\n", errno, strerror(errno)); exit(1); } else { printf("listening on port %d\n", ntohs(addr.sin_port)); } // Send a packet with deliberately bad checksum char rawBuf[4096]; memset(rawBuf, 0, 4096); struct udphdr *udph = (struct udphdr *) rawBuf; // Send as UDP data the string "BAD DATA", repeated as necessary // to fill the number of bytes requested on the command-line char *data = rawBuf + sizeof(struct udphdr); int badBytes = atoi(argv[1]); int i; for (i = 0; i < badBytes; i += 8) { strncpy(data + i, "BAD DATA", 8); } // UDP headers contain a bogus checksum, but are otherwise reasonable udph->check = htons(0xbadd); udph->source = htons(18); udph->dest = addr.sin_port; udph->len = htons(sizeof(struct udphdr) + badBytes); int raw = socket(AF_INET, SOCK_RAW, IPPROTO_UDP); ret = sendto(raw, rawBuf, sizeof(struct udphdr) + badBytes, 0, (struct sockaddr *) &addr, addrLen); if (ret < 0) { fprintf(stderr, "raw send failed (err=%d: %s)\n", errno, strerror(errno)); exit(1); } // Send a second, legitimate UDP packet int cooked = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP); char *second = "Good data"; ret = sendto(cooked, second, strlen(second), 0, (struct sockaddr *) &addr, addrLen); if (ret < 0) { fprintf(stderr, "second cooked send failed (err=%d: %s)\n", errno, strerror(errno)); exit(1); } // Read what's queued up for us. while (1) { struct msghdr msg; struct iovec iov; char readBuf[4096]; ssize_t bytes; memset(readBuf, 0, 4096); iov.iov_base = readBuf; iov.iov_len = sizeof(readBuf); msg.msg_name = NULL; msg.msg_namelen = 0; msg.msg_iov = &iov; msg.msg_iovlen = 1; msg.msg_control = NULL; msg.msg_controllen = 0; msg.msg_flags = 0; b