On 14.08.2013 12:21, Luigi Rizzo wrote:
On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart wrote:
I think (check the driver code in question as I'm not sure) that if you
"ifconfig <if> lro" and the driver has hardware support or has been made
aware of our software implementation, it should DTRT.

The "lower throughput than linux" that julian was seeing is either
because of a slow (CPU-bound) sender or slow receiver. Given that
the FreeBSD tx path is quite expensive (redoing route and arp lookups
on every packet, etc.) I highly suspect the sender side is at fault.
>
Then the problem remains that we should keep a copy of route and
arp information in the socket instead of redoing the lookups on
every single transmission, as they consume some 25% of the time of
a sendto(), and probably even more when it comes to large tcp
segments, sendfile() and the like.

It's the locking and ref-counting overhead in the routing table and
ARP table causing a lot of cache thrashing and bus lock cycles.

The fix is rather simple.  The routing table gets protected by an rm_lock
(read-mostly lock) instead of a normal lock.  Individual routes no longer
have their own lock, and there is no more ref-counting.  Holding pointers
to routes or into the routing table is prohibited.  Upon lookup the sought
information (ifp, ifaddr, nexthop) is copied out without retaining any
reference to the routing entry.  Ditto for the ARP table.  Because changes to the
routing and ARP tables are very infrequent compared to the number of
lookups performed on them, this exhibits very good cache behavior
across multiple cores and cpus.  No shared routing table memory is
dirtied during lookup.

Approaches that do NOT work (well):
 - flow caching, where a separate entry is generated for every active
   connection containing direct pointers to the rtentry, arp entry and
   interface.  Besides the pointer validity and refcounting issues, it
   scales very poorly for a large number of "flows", because the lookup
   overhead grows with the flow table.  By contrast, the routing table
   (default and interface routes) and the ARP table (a few hosts) stay
   at the same size and have a "constant" lookup time.
 - per-CPU copies of the routing and ARP tables increase memory consumption
   and have synchronization issues on updates, especially with high core
   counts.
 - storing the rtentry and arp entry pointers in the inpcb has similar
   issues to the flow table approach, and additionally has to check
   periodically whether the route or arp entry changed.

The rm_lock is the fastest, cheapest and most SMP-scalable approach shown
so far.  I have patches against a roughly 12-month-old current lying around
if someone wants to brush them up and work out the final kinks.  The speedup
and reduction in overhead are significant.

--
Andre

_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"