On 03/18/2016 12:45 AM, Wei Xu wrote: > On 2016年03月17日 16:42, Jason Wang wrote: > >> >> On 03/15/2016 05:17 PM, w...@redhat.com wrote: >>> From: Wei Xu <w...@redhat.com> >>> >>> All the data packets in a tcp connection will be cached to a big buffer >>> in every receive interval, and will be sent out via a timer, the >>> 'virtio_net_rsc_timeout' controls the interval, the value will >>> influent the >>> performance and response of tcp connection extremely, 50000(50us) is a >>> experience value to gain a performance improvement, since the whql test >>> sends packets every 100us, so '300000(300us)' can pass the test case, >>> this is also the default value, it's gonna to be tunable. >>> >>> The timer will only be triggered if the packets pool is not empty, >>> and it'll drain off all the cached packets >>> >>> 'NetRscChain' is used to save the segments of different protocols in a >>> VirtIONet device. >>> >>> The main handler of TCP includes TCP window update, duplicated ACK >>> check >>> and the real data coalescing if the new segment passed sanity check >>> and is identified as an 'wanted' one. >>> >>> An 'wanted' segment means: >>> 1. Segment is within current window and the sequence is the expected >>> one. >>> 2. ACK of the segment is in the valid window. >>> 3. If the ACK in the segment is a duplicated one, then it must less >>> than 2, >>> this is to notify upper layer TCP starting retransmission due to >>> the spec. >>> >>> Sanity check includes: >>> 1. Incorrect version in IP header >>> 2. IP options & IP fragment >>> 3. Not a TCP packets >>> 4. Sanity size check to prevent buffer overflow attack. >>> >>> There maybe more cases should be considered such as ip >>> identification other >>> flags, while it broke the test because windows set it to the same >>> even it's >>> not a fragment. >>> >>> Normally it includes 2 typical ways to handle a TCP control flag, >>> 'bypass' >>> and 'finalize', 'bypass' means should be sent out directly, and >>> 'finalize' >>> means the packets should also be bypassed, and this should be done >>> after searching for the same connection packets in the pool and sending >>> all of them out, this is to avoid out of data. >>> >>> All the 'SYN' packets will be bypassed since this always begin a new' >>> connection, other flags such 'FIN/RST' will trigger a finalization, >>> because >>> this normally happens upon a connection is going to be closed, an >>> 'URG' packet >>> also finalize current coalescing unit while there maybe protocol >>> difference to >>> different OS. >> But URG packet should be sent as quickly as possible regardless of >> ordering, no? > > Yes, you right, URG will terminate the current 'SCU', i'll amend the > commit log. > >> >>> Statistics can be used to monitor the basic coalescing status, the >>> 'out of order' >>> and 'out of window' means how many retransmitting packets, thus >>> describe the >>> performance intuitively. >>> >>> Signed-off-by: Wei Xu <w...@redhat.com> >>> --- >>> hw/net/virtio-net.c | 486 >>> ++++++++++++++++++++++++++++++++++++++++- >>> include/hw/virtio/virtio-net.h | 1 + >>> include/hw/virtio/virtio.h | 72 ++++++ >>> 3 files changed, 558 insertions(+), 1 deletion(-)
[...] >>> + } else { >>> + /* Coalesce window update */ >>> + o_tcp->th_win = n_tcp->th_win; >>> + chain->stat.win_update++; >>> + return RSC_COALESCE; >>> + } >>> + } else { >>> + /* pure ack, update ack */ >>> + o_tcp->th_ack = n_tcp->th_ack; >>> + chain->stat.pure_ack++; >>> + return RSC_COALESCE; >> Looks like there're something I missed. The spec said: >> >> "In other words, any pure ACK that is not a duplicate ACK or a window >> update triggers an exception and must not be coalesced. All such pure >> ACKs must be indicated as individual segments." >> >> Does it mean we *should not* coalesce windows update and pure ack? >> (Since it can wakeup transmission)? > > It's also a little bit inexplicit and flexible due to the spec, please > see the flowchart I on the same page. > > Comments about the flowchart: > ------------------------------------------------------------------------ > The first of the following two flowcharts describes the rules for > coalescing segments and updating the TCP headers. > This flowchart refers to mechanisms for distinguishing valid duplicate > ACKs and window updates. The second flowchart describes these mechanisms. > ------------------------------------------------------------------------ > As show in the flowchart, only status 'C' will break current scu and > get finalized, both 'A' and 'B' can be coalesced afaik. > Interesting, looks like you're right. >> >>> + } >>> +} >>> + >>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, >>> NetRscSeg *seg, >>> + const uint8_t *buf, NetRscUnit >>> *n_unit) >>> +{ >>> + void *data; >>> + uint16_t o_ip_len; >>> + uint32_t nseq, oseq; >>> + NetRscUnit *o_unit; >>> + >>> + o_unit = &seg->unit; >>> + o_ip_len = htons(*o_unit->ip_plen); >>> + nseq = htonl(n_unit->tcp->th_seq); >>> + oseq = htonl(o_unit->tcp->th_seq); >>> + >>> + if (n_unit->tcp_hdrlen > TCP_HDR_SZ) { >>> + /* Log this only for debugging observation */ >>> + chain->stat.tcp_option++; >>> + } >>> + >>> + /* Ignore packet with more/larger tcp options */ >>> + if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) { >> What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ? > do you mean '<'? that also means some option changed, maybe i should > include it. Yes. >> >>> + chain->stat.tcp_larger_option++; >>> + return RSC_FINAL; >>> + } >>> + >>> + /* out of order or retransmitted. */ >>> + if ((nseq - oseq) > MAX_TCP_PAYLOAD) { >>> + chain->stat.data_out_of_win++; >>> + return RSC_FINAL; >>> + } >>> + >>> + data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen; >>> + if (nseq == oseq) { >>> + if ((0 == o_unit->payload) && n_unit->payload) { >>> + /* From no payload to payload, normal case, not a dup >>> ack or etc */ >>> + chain->stat.data_after_pure_ack++; >>> + goto coalesce; >>> + } else { >>> + return virtio_net_rsc_handle_ack(chain, seg, buf, >>> + n_unit->tcp, >>> o_unit->tcp); >>> + } >>> + } else if ((nseq - oseq) != o_unit->payload) { >>> + /* Not a consistent packet, out of order */ >>> + chain->stat.data_out_of_order++; >>> + return RSC_FINAL; >>> + } else { >>> +coalesce: >>> + if ((o_ip_len + n_unit->payload) > chain->max_payload) { >>> + chain->stat.over_size++; >>> + return RSC_FINAL; >>> + } >>> + >>> + /* Here comes the right data, the payload lengh in v4/v6 is >>> different, >>> + so use the field value to update and record the new data >>> len */ >>> + o_unit->payload += n_unit->payload; /* update new data len */ >>> + >>> + /* update field in ip header */ >>> + *o_unit->ip_plen = htons(o_ip_len + n_unit->payload); >>> + >>> + /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be >>> coalesced >>> + for windows guest, while this may change the behavior >>> for linux >>> + guest. */ >> This needs more thought, 'can' probably means don't. (Linux GRO won't >> merge PUSH packet). > Yes, since it's mainly for win guest, how about take this as this > first and do more test and see how to handle it? Right, this is not an issue probably but just an optimization for latency. [...] >> >>> + return NULL; >>> + } >>> + >>> + QTAILQ_FOREACH(chain, &n->rsc_chains, next) { >>> + if (chain->proto == proto) { >>> + return chain; >>> + } >>> + } >>> + >>> + chain = g_malloc(sizeof(*chain)); >>> + chain->hdr_size = n->guest_hdr_len; >> Why introduce a specified field for instead of just use >> n->guest_hdr_len? > this is to reduce invoking times to find VirtIONet by 'nc', there are > a few places will use it. Okay, then I think you'd better keep a pointer to VirtIONet structure instead. (Consider you may want to refer more data from it, we don't want to duplicate fields in two places). >> >>> + chain->proto = proto; >>> + chain->max_payload = MAX_IP4_PAYLOAD; >>> + chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, >>> + virtio_net_rsc_purge, chain); >>> + memset(&chain->stat, 0, sizeof(chain->stat)); >>> + >>> + QTAILQ_INIT(&chain->buffers); >>> + QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next); >>> + >>> + return chain; >>> +} >>> + >>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc, >>> + const uint8_t *buf, size_t size) >>> +{ >>> + uint16_t proto; >>> + NetRscChain *chain; >>> + struct eth_header *eth; >>> + VirtIONet *n; >>> + >>> + n = qemu_get_nic_opaque(nc); >>> + if (size < (n->guest_hdr_len + ETH_HDR_SZ)) { >>> + return virtio_net_do_receive(nc, buf, size); >>> + } >>> + >>> + eth = (struct eth_header *)(buf + n->guest_hdr_len); >>> + proto = htons(eth->h_proto); >>> + >>> + chain = virtio_net_rsc_lookup_chain(n, nc, proto); >>> + if (!chain) { >>> + return virtio_net_do_receive(nc, buf, size); >>> + } else { >>> + chain->stat.received++; >>> + return virtio_net_rsc_receive4(chain, nc, buf, size); >>> + } >>> +} >>> + >>> +static ssize_t virtio_net_receive(NetClientState *nc, >>> + const uint8_t *buf, size_t size) >>> +{ >>> + if (virtio_net_rsc_bypass) { >>> + return virtio_net_do_receive(nc, buf, size); >> You need a feature bit for this and compat it for older machine types. >> And also need some work on virtio spec I think. > yes, not sure which way is good to support this, hmp/qmp/ethtool, this > is gonna to support win guest, > so need a well-compatible interface, any comments? I think this should be implemented through feature bits/negotiation instead of something like ethtool. [...]