Some of my colleagues have been doing a bit of scale testing in the likes of OpenStack. They get to a certain number of instances launched and then they start seeing messages of the form:

Sep 25 08:17:37 overcloud-novacompute7-hqauwkkp6cnb kernel: [176969.125887] net_ratelimit: 62 callbacks suppressed
Sep 25 08:17:42 overcloud-novacompute7-hqauwkkp6cnb kernel: [176974.193613] net_ratelimit: 56 callbacks suppressed
Sep 25 08:17:48 overcloud-novacompute7-hqauwkkp6cnb kernel: [176980.231333] net_ratelimit: 105 callbacks suppressed
Sep 25 08:17:54 overcloud-novacompute7-hqauwkkp6cnb kernel: [176986.270903] net_ratelimit: 51 callbacks suppressed
Sep 25 08:18:00 overcloud-novacompute7-hqauwkkp6cnb kernel: [176991.804470] net_ratelimit: 81 callbacks suppressed
Sep 25 08:18:05 overcloud-novacompute7-hqauwkkp6cnb kernel: [176997.352440] net_ratelimit: 93 callbacks suppressed
Sep 25 08:18:10 overcloud-novacompute7-hqauwkkp6cnb kernel: [177002.408610] net_ratelimit: 80 callbacks suppressed
Sep 25 08:18:16 overcloud-novacompute7-hqauwkkp6cnb kernel: [177008.083106] net_ratelimit: 38 callbacks suppressed
Sep 25 08:18:21 overcloud-novacompute7-hqauwkkp6cnb kernel: [177013.125626] net_ratelimit: 99 callbacks suppressed

at which point things start to go downhill...

(Don't mind the date)

And through a combination of things they tracked it down to blowing past the default limits on the ARP cache (the neighbour table gc_thresh settings). Make those larger and, poof, the messages go away.

I've been playing with reproducing that outside of OpenStack. When I knock the table size down to, say, 16 and then try to ping 254 distinct IPs at the same time, I see similar messages. The thing is, there are only the "callbacks suppressed" messages; there is never any "base" message being suppressed.

Poking around suggests it is coming from a net_dbg_ratelimited() call in ip_finish_output2(). I'm guessing that since the message level is "debug" the underlying message never actually gets printed, but that decision is made after the rate limiting, so all we ever see are the "callbacks suppressed" messages.
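
For reference, as I read include/linux/net.h, net_dbg_ratelimited() expands to roughly the following (paraphrased from the tree I happen to have handy, so take it as a sketch rather than gospel):

#define net_ratelimited_function(function, ...)                 \
do {                                                            \
        if (net_ratelimit())                                    \
                function(__VA_ARGS__);                          \
} while (0)

#define net_dbg_ratelimited(fmt, ...)                           \
        net_ratelimited_function(pr_debug, fmt, ##__VA_ARGS__)

So net_ratelimit() does its accounting (and emits the "callbacks suppressed" notice) before pr_debug() ever gets to decide that nothing will actually be printed, which would explain what I'm seeing.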

That leaves things rather confusing: it is very difficult to know which callbacks/messages were being suppressed. I suppose their presence in the likes of dmesg output says that *something* was hitting the rate limit, and so one might start looking, but depending on circumstances couldn't there be an unrelated preceding message which might reasonably look like it was the one being suppressed?

Or, put another way, should the rate limiting be looking "down" (somehow) to see whether the message level would even generate output before doing all the rate-limiting bookkeeping? The only downside is that those "callbacks suppressed" messages then wouldn't be there at all.
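
Just to make the hand-waving slightly more concrete, one crude, compile-time way of "looking down" might be something like the following. This is purely an illustration of the idea, the _sketch name is mine, and I have not thought through how it should interact with CONFIG_DYNAMIC_DEBUG, where the gating of pr_debug() is more involved:

#if defined(DEBUG)
/* Debug messages can actually be emitted, so rate limit as today. */
#define net_dbg_ratelimited_sketch(fmt, ...)                    \
        net_ratelimited_function(pr_debug, fmt, ##__VA_ARGS__)
#else
/* The message could never be printed, so skip net_ratelimit()
 * entirely and generate no "callbacks suppressed" noise either.
 */
#define net_dbg_ratelimited_sketch(fmt, ...)                    \
        no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)
#endif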

But I suppose that is a slight digression. To make this specific situation more explicit, I was thinking of adding the following (my mail client may fubar the formatting):

~/net-next$ git diff
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 84195da..917f760 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -274,8 +274,11 @@ static struct neighbour *neigh_alloc(struct neigh_table *tbl, struct net_device
            (entries >= tbl->gc_thresh2 &&
             time_after(now, tbl->last_flush + 5 * HZ))) {
                if (!neigh_forced_gc(tbl) &&
-                   entries >= tbl->gc_thresh3)
+                   entries >= tbl->gc_thresh3) {
+                       net_info_ratelimited("%s: neighbor table overflow!\n",
+                                            tbl->id);
                        goto out_entries;
+               }
        }

        n = kzalloc(tbl->entry_size + dev->neigh_priv_len, GFP_ATOMIC);


When I run my little ping-based test with that in place I see the likes of:

root@qu-stbaz1-perf0000:~# dmesg | tail
[ 1533.892669] neighbour: arp_cache: neighbor table overflow!
[ 1533.932944] neighbour: arp_cache: neighbor table overflow!
[ 1533.932950] neighbour: arp_cache: neighbor table overflow!
[ 1534.023093] neighbour: arp_cache: neighbor table overflow!
[ 1538.973979] net_ratelimit: 142 callbacks suppressed
[ 1538.973981] neighbour: arp_cache: neighbor table overflow!
[ 1538.973985] neighbour: arp_cache: neighbor table overflow!
[ 1539.063320] neighbour: arp_cache: neighbor table overflow!
[ 1539.063326] neighbour: arp_cache: neighbor table overflow!
[ 1539.099621] neighbour: arp_cache: neighbor table overflow!

which seems rather more to the point. I'm still trying to decide whether adding a table_overflow statistic would be indicated as well. The forced_gc_runs stat doesn't indicate success or failure of the garbage collection, so in and of itself it doesn't tell us we actually failed to add an entry to the table.
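
For the sake of discussion, if such a stat were indicated, I'd expect it to take roughly this shape. Untested, and the table_overflows name is just a placeholder:

/* include/net/neighbour.h: new counter in struct neigh_statistics */
        unsigned long table_overflows;  /* table full, forced GC did not help */

/* net/core/neighbour.c: bump it in neigh_alloc(), next to the new message */
                        NEIGH_CACHE_STAT_INC(tbl, table_overflows);

/* plus a matching column in neigh_stat_seq_show() so it shows up in
 * /proc/net/stat/arp_cache and friends
 */

That way a non-zero count would say unambiguously that we failed to add entries to the table, rather than having to infer it from forced_gc_runs.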

Thoughts/comments?

happy benchmarking,

rick jones