Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO, PLB,
and spurious-retransmission events, but the new hash is not propagated
into the IPv6 ECMP path selection logic. The cached route is reused and
fib6_select_path() is never re-invoked, so the connection keeps using
the same local ECMP path.

This series adds the two missing pieces:

1. __sk_dst_reset() alongside sk_rethink_txhash() so the cached dst is
   invalidated and the next transmit triggers a fresh route lookup.

2. fl6->mp_hash set from sk_txhash before each route lookup so
   fib6_select_path() picks a path based on the (potentially
   re-rolled) hash. This is conditioned on
   fib_multipath_hash_policy == 0 (L3) because policies 1-3 compute a
   deterministic hash from the flow keys which must not be overridden.

Patch 1 is the kernel change; patch 2 adds selftests covering SYN
rehash, SYN/ACK rehash, midstream RTO rehash, midstream ACK rehash
(spurious retransmission), PLB rehash, a policy 1 negative test, a
flowlabel leak regression test, two dst rebuild consistency tests
(normal and syncookie) verifying that natural route invalidation does
not cause unintended path changes, and a syncookie server path
consistency test verifying that the SYN-ACK and post-cookie ACKs use
the same ECMP nexthop.

I'd like guidance on whether to use the ISN as txhash when using
syncookies; it keeps the SYN-ACK and subsequent data path consistent,
but one could argue that this consistency doesn't matter because no
reordering is possible.

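For reviewers who want the hash handling from item 2 in one place, here
is a minimal userspace sketch of the intended behaviour. It is not the
patch code: model_mp_hash() and model_route_hash() are hypothetical
names, and flow_key_hash merely stands in for whatever
rt6_multipath_hash() would compute from the flow keys.

```c
#include <stdint.h>

/*
 * Illustrative model only; these helpers do not exist in the kernel.
 * Derive a 31-bit, non-zero multipath hash from a socket txhash.
 * Zero is avoided because a zero mp_hash means "not set" and makes
 * the route lookup fall back to the flow-key hash.
 */
uint32_t model_mp_hash(uint32_t txhash)
{
	uint32_t h = txhash >> 1;	/* keep the value within 31 bits */

	return h ? h : 1;		/* same effect as (txhash >> 1) ?: 1 */
}

/*
 * Pick the hash handed to path selection. Only policy 0 uses the
 * txhash-derived value; any other policy keeps the deterministic
 * flow-key hash (passed in here as an assumption).
 */
uint32_t model_route_hash(int hash_policy, uint32_t txhash,
			  uint32_t flow_key_hash)
{
	if (hash_policy != 0)
		return flow_key_hash;

	return model_mp_hash(txhash);
}
```

In this model a re-rolled txhash changes the value returned under
policy 0, while policies 1-3 see the same flow-key hash before and
after a rehash, which is the property the policy 1 negative test
checks.
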
Changes since v8:
  https://lore.kernel.org/netdev/[email protected]/

Patch 1:
- Fix REPFLOW flowlabel reflection for syncookie SYN-ACKs: pass 0 as
  tw_isn to route_req() so tcp_v6_init_req() saves ireq->pktopts

Patch 2:
- Give midstream and ACK rehash attempt helpers distinct failure
  messages (no TX activity vs no data on alternate path vs counter not
  incrementing) instead of a single generic error
- Drop unused ns_server parameter from ecmp_dst_rebuild_check()
- Clean up server socat before break on setup failure in the dst
  rebuild loop

Changes since v7:
  https://lore.kernel.org/netdev/[email protected]/

Patch 1:
- Remove #if IS_ENABLED(CONFIG_IPV6) guards around __sk_dst_reset() in
  tcp_plb.c and tcp_timer.c (Eric Dumazet)
- Guard mp_hash in inet6_csk_route_socket() on
  sk_protocol == IPPROTO_TCP instead of txhash != 0, since non-TCP
  callers like L2TP set sk_txhash in __ip6_datagram_connect() and
  should retain flow-key-based ECMP
- Use the syncookie (ISN) as txhash for both the SYN-ACK route lookup
  and cookie_v6_check() socket creation, so the server's ECMP
  selection is consistent across the stateless SYN-ACK and the
  subsequent full socket. Move cookie_init_sequence() before
  route_req() in tcp_conn_request() so the SYN-ACK dst is computed
  with the cookie-derived txhash; derive txhash from snt_isn in
  cookie_tcp_reqsk_init() to match

Patch 2:
- Invalidate dst via dummy route add/del instead of route replace to
  avoid a transient single-nexthop state during multipath replacement
- Add syncookie server path consistency test verifying the SYN-ACK and
  post-cookie ACKs use the same ECMP path
- Strengthen policy 1 negative test to wait for multiple rehash
  attempts and verify SYNs landed on exactly one interface

Changes since v6:
  https://lore.kernel.org/netdev/[email protected]/

- Guard mp_hash assignment so that non-TCP callers of
  inet6_csk_route_socket() fall through to rt6_multipath_hash()
  (superseded in v8 by sk_protocol == IPPROTO_TCP guard)
- Initialize txhash in bpf_sk_assign_tcp_reqsk() to avoid reading
  uninitialized slab memory in inet6_csk_route_req()
- Check post-rebuild busywait return status to avoid silent false pass

Changes since v5:
  https://lore.kernel.org/netdev/[email protected]/

- Improve selftest reliability: suppress __dst_negative_advice() via
  tcp_retries1=255 in dst rebuild tests so a real RTO cannot trigger
  an unintended rehash; add internal retry to midstream and ACK rehash
  tests to tolerate probabilistic ECMP path selection; fix midstream
  baseline capture to account for packets that bypass tc filters
  during the prio qdisc's TCQ_F_CAN_BYPASS window
- Increase ECMP_REBUILD_ROUNDS default to 10 for reliable regression
  detection with 2-way ECMP; replace sleep with busywait
- Use tcp_allowed_congestion_control instead of changing the host's
  default congestion control for PLB test
- Use (txhash >> 1) ?: 1 to guarantee non-zero mp_hash, since zero
  falls back to rt6_multipath_hash()

Changes since v4:
  https://lore.kernel.org/netdev/[email protected]/

- Condition fl6->mp_hash on fib_multipath_hash_policy == 0 to preserve
  deterministic hash policies 1-3 (e.g., symmetric 5-tuple for
  policy 1)
- Set fl6->mp_hash in tcp_v6_connect() and cookie_v6_check() for
  initial route lookup consistency; move sk_set_txhash() earlier
  (Jakub Kicinski)
- Add policy 1 negative test; improve sysctl save/restore
- Add flowlabel leak test confirming mp_hash does not alter the
  on-wire IPv6 flow label
- Add dst rebuild consistency tests (normal and syncookie) verifying
  that route table changes do not cause unintended ECMP path changes

Changes since v3:
  https://lore.kernel.org/netdev/[email protected]/

- Use __sk_dst_reset() instead of sk_dst_reset() since the socket lock
  is held in all three call sites (Eric Dumazet)
- Guard __sk_dst_reset() with sk->sk_family == AF_INET6 since IPv4
  ECMP does not use sk_txhash for path selection
- Guard __sk_dst_reset() in tcp_plb_check_rehash() with the return
  value of sk_rethink_txhash()
- Move tcp_rsk(req)->txhash initialization before route_req() in
  tcp_conn_request() to avoid reading uninitialized memory
- Add CONFIG_TCP_CONG_DCTCP=m to selftests/net/config for PLB test
- Skip PLB test gracefully if DCTCP is not available
- Save and restore original congestion control algorithm in PLB test
- Default get_netstat_counter() to 0 when counter is not found
- Skip all tests if tcp_syn_linear_timeouts is not available
- Replace bash/pipe data sources with socat OPEN:/dev/zero for cleaner
  process cleanup
- Fix shellcheck warnings

Changes since v2:
  https://lore.kernel.org/netdev/[email protected]/

- Retitle "ECMP" to "local ECMP" to distinguish from remote ECMP
  (Neal Cardwell)
- Add fl6->mp_hash propagation in inet6_sk_rebuild_header()
  (af_inet6.c), covering the dst rebuild path used on established
  sockets
- Remove incorrect ir_iif update from tcp_check_req() in
  tcp_minisocks.c; the SYN/ACK rehash is already handled by
  tcp_rtx_synack() re-rolling txhash which feeds into
  inet6_csk_route_req()'s mp_hash (Eric Dumazet)
- Add ACK rehash and PLB rehash selftests
- Improve selftest reliability

Changes since v1:
  https://lore.kernel.org/netdev/[email protected]/

- Use tcp_rsk(req)->txhash instead of
  jhash_1word(req->num_retrans, ...) for ECMP path selection in
  inet6_csk_route_req(), making the request socket path consistent
  with the established socket path (Eric Dumazet)
- Add comments explaining the >> 1 shift for 31-bit mp_hash range
- Use socat -u (unidirectional) in selftest to avoid SIGPIPE race
- Increase tcp_syn_retries and tcp_syn_linear_timeouts to 25 for
  better rehash coverage

Neil Spring (2):
  tcp: rehash onto different local ECMP path on retransmit timeout
  selftests: net: add local ECMP rehash test

 net/core/filter.c                          |    1 +
 net/ipv4/syncookies.c                      |    8 +-
 net/ipv4/tcp_input.c                       |   20 +-
 net/ipv4/tcp_plb.c                         |    5 +-
 net/ipv4/tcp_timer.c                       |    2 +
 net/ipv6/af_inet6.c                        |    3 +
 net/ipv6/inet6_connection_sock.c           |    8 +
 net/ipv6/syncookies.c                      |    4 +
 net/ipv6/tcp_ipv6.c                        |   13 +-
 tools/testing/selftests/net/Makefile       |    1 +
 tools/testing/selftests/net/config         |    1 +
 tools/testing/selftests/net/ecmp_rehash.sh | 1050 ++++++++++++++++++++
 12 files changed, 1104 insertions(+), 12 deletions(-)
 create mode 100755 tools/testing/selftests/net/ecmp_rehash.sh

-- 
2.53.0-Meta
