> On Fri, Feb 3, 2012 at 8:47 AM, Bart Van Assche <[email protected]> wrote:
>> It is known that backwards and forwards adjustments of the system clock can
>> cause snmpd and libsnmp to behave different than documented. The patch below
>> addresses this. Because of the impact of this patch (it changes the ABI
>> somewhat by storing monotonic clock values instead of wall clock times in
>> some struct members) I'm posting it here as an RFC.

I didn't really understand _exactly_ what this patch was meant to
accomplish, but I'd like to share what I believe to be a _related_,
major SNMPv3 issue.  I think that this patch _may_ solve it.

Two weeks ago, one of my customers had an issue with polling their
SNMPv3 devices; we have a single-process, multi-threaded poller that
uses net-snmp 5.4.3 (5.7.1 was also tested for the particular issue
that I will describe).  Their problem was that some of their SNMPv3
devices would _randomly_ stop polling for long periods of time,
sometimes indefinitely.

Now, when this happens, the _first_ thing to check is whether or not
all target agents have a _unique_ engine ID; if _any_ two agents share
an engine ID, then a single process (no matter how many threads) will
become confused because both agents will hash to the same entry in the
global v3 hash, and only one of their sets of information will be
recorded.  Based on their various boot counts and uptimes, they may
even flip-flop in the hash.
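
To illustrate the collision, here is a minimal sketch of a per-process cache keyed only by engine ID.  The struct, the fixed-size table, and the function names are all hypothetical and far simpler than net-snmp's real hash; the only point is that a second agent reporting the same engine ID overwrites what was recorded for the first.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ENGINES 8

/* Hypothetical cache record: one slot per engine ID, shared by all threads. */
struct engine_entry {
    char         engine_id[64];
    unsigned int boots;
    unsigned int etime;
    int          used;
};

static struct engine_entry engine_table[MAX_ENGINES];

/* Store (or overwrite) what we believe about engine_id.
 * Returns the slot index, or -1 if the table is full. */
int engine_store(const char *engine_id, unsigned int boots, unsigned int etime)
{
    int i, free_slot = -1;

    assert(engine_id != NULL);
    for (i = 0; i < MAX_ENGINES; i++) {
        if (engine_table[i].used &&
            strcmp(engine_table[i].engine_id, engine_id) == 0) {
            /* Same key: the previous agent's values are silently replaced. */
            engine_table[i].boots = boots;
            engine_table[i].etime = etime;
            return i;
        }
        if (!engine_table[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;

    snprintf(engine_table[free_slot].engine_id,
             sizeof engine_table[free_slot].engine_id, "%s", engine_id);
    engine_table[free_slot].boots = boots;
    engine_table[free_slot].etime = etime;
    engine_table[free_slot].used  = 1;
    return free_slot;
}

/* Boots value currently cached in a slot. */
unsigned int engine_boots_at(int slot)
{
    assert(slot >= 0 && slot < MAX_ENGINES);
    return engine_table[slot].boots;
}
```

Storing two agents under the same (made-up) engine ID returns the same slot both times, and only the second agent's boots and time survive.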

However, after careful analysis with tshark, we deemed that _every_
agent had a _unique_ engine ID.

So, what was the problem?

After weeks of testing, we concluded this:
1. It had something to do with net-snmp; the memory was not getting
clobbered by anything else.
2. It had something to do with the fact that we would send out a
msgAuthoritativeEngineTime with values like "43,033,091" from time to
time (plus or minus a million, say).

This faulty EngineTime was being sent out in the following manner.
1. [hours of communication with many agents]
2. Ask Agent #n for its EngineID, EngineBoots, and EngineTime (the
normal SNMPv3 discovery operation).
3. Agent replies with EngineID, EngineBoots, and EngineTime.
4. Issue an SNMPGET to Agent #n with the correct EngineID and
EngineBoots, but a 100% bogus EngineTime (around 43 million).
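
For context on why a bogus EngineTime would stop polling: as I read RFC 3414 (USM), an authoritative engine treats an authenticated request as outside its time window when the boots values differ or the engine time is off by more than 150 seconds.  The sketch below is an illustration of that rule only; the function name and signature are made up, and this is not code from net-snmp or from any agent.

```c
#include <stdint.h>

#define TIME_WINDOW_SECS 150         /* window size from RFC 3414 */
#define ENGINE_BOOTS_MAX 2147483647  /* boots counter has latched at its maximum */

/* Returns 1 if a request falls inside the agent's time window, else 0.
 * local_*: the agent's own values; msg_*: the values carried in the request. */
int in_time_window(uint32_t local_boots, uint32_t local_time,
                   uint32_t msg_boots, uint32_t msg_time)
{
    int64_t diff;

    if (local_boots == ENGINE_BOOTS_MAX)
        return 0;
    if (msg_boots != local_boots)
        return 0;

    /* Widen before subtracting so the difference cannot wrap. */
    diff = (int64_t)msg_time - (int64_t)local_time;
    if (diff < 0)
        diff = -diff;
    return diff <= TIME_WINDOW_SECS;
}
```

With an agent that has been up for 1000 seconds, a request carrying an engine time of 43,033,091 fails this check even though its EngineID and EngineBoots are correct.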

43 million is a very interesting number--it is only 100 times smaller
than 4.3 _billion_ (2^32), which is the 32-bit integer cutoff.
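
The arithmetic can be checked with a few lines of C.  This is a sketch under the assumption that a 32-bit unsigned counter is divided by a rate of 100 per second; the function name is invented, and whether net-snmp performs exactly this division is not something the sketch shows.

```c
#include <assert.h>
#include <stdint.h>

/* Largest whole number of seconds representable when a 32-bit unsigned
 * counter advances ticks_per_second times per second. */
uint32_t seconds_until_wrap(uint32_t ticks_per_second)
{
    assert(ticks_per_second > 0);
    return UINT32_MAX / ticks_per_second;
}
```

seconds_until_wrap(100) evaluates to 42949672, which is about 43 million and within a hundred thousand of the 43,033,091 seen on the wire.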

After some more research through the net-snmp source code, I finally
found where "our" belief about the agent is updated:
get_enginetime_ex, which uses snmpv3_local_snmpEngineTime.  Since
get_enginetime_ex looked fairly simple (and all of the integer types
seemed okay), I looked into snmpv3_local_snmpEngineTime.  _There_ is
where I found the major problem.

At first, I thought that everything _must_ be fine, since there was
_no way_ that net-snmp would use "times" instead of "gettimeofday" if
it didn't have to.  However, I threw a #error directive where
SNMP_USE_TIMES was defined as "1" and found that net-snmp chose to use
"times"!  This function has been nothing but trouble in my years of
software, and its manpage even says to use "gettimeofday" for
computing real time differences.

On my system, "times" counts in clock ticks at sysconf(_SC_CLK_TCK)
== 100 per second, which makes my suspicion about 100 x 43 million
that much stronger.  At this point, I suspected that the problem had
to do with how much processing time my polling process had used since
beginning, which I roughly confirmed.
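
A small sketch of how the tick rate and a "times" sample relate to seconds.  The helper names are hypothetical.  Per POSIX, the return value of times() is elapsed real time in clock ticks since an arbitrary point in the past, while the struct tms fields hold CPU time; the sketch only looks at the return value.

```c
#include <assert.h>
#include <sys/times.h>
#include <unistd.h>

/* Ticks per second used by times(); 100 on the system described above. */
long clock_ticks_per_second(void)
{
    return sysconf(_SC_CLK_TCK);
}

/* Raw tick count as returned by times(); (clock_t)-1 signals an error. */
clock_t sample_ticks(void)
{
    struct tms unused;
    return times(&unused);
}

/* Convert a tick count into whole seconds. */
long long ticks_to_seconds(long long ticks, long ticks_per_second)
{
    assert(ticks_per_second > 0);
    return ticks / ticks_per_second;
}
```

At 100 ticks per second, a count of 4,303,309,100 ticks converts to 43,033,091 seconds, and that tick count no longer fits in 32 bits.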

Here's the #if macro for when to use it (snmpv3.c, line 103):
/* this is probably an over-kill ifdef, but why not */
#if defined(HAVE_SYS_TIMES_H) && defined(HAVE_UNISTD_H) && \
    defined(HAVE_TIMES) && defined(_SC_CLK_TCK) && defined(HAVE_SYSCONF) \
    && defined(UINT_MAX)

As far as I can tell, any Linux installation will try its hardest to
be compatible with any other (especially _older_) ones.  _Of course_
my system has these things; some software needs them.  But it also has
"gettimeofday".  A much better check would be for the existence of
that function; if it doesn't exist, then you could use "times".
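
The preference order suggested here could look roughly like the sketch below.  HAVE_GETTIMEOFDAY and HAVE_TIMES are assumed configure-style macros, defined by hand so the example compiles on its own; the enum and function names are invented for illustration and are not net-snmp identifiers.

```c
#include <time.h>

/* Stand-ins for configure results (assumed names); a real build would
 * take these from its generated config header. */
#define HAVE_GETTIMEOFDAY 1
#define HAVE_TIMES 1

enum clock_source { SRC_GETTIMEOFDAY, SRC_TIMES, SRC_TIME };

#if defined(HAVE_GETTIMEOFDAY)
#include <sys/time.h>
#define ENGINE_CLOCK_SOURCE SRC_GETTIMEOFDAY
/* Preferred: seconds from gettimeofday(); -1 on failure. */
long now_seconds(void)
{
    struct timeval tv;
    if (gettimeofday(&tv, NULL) != 0)
        return -1;
    return (long)tv.tv_sec;
}
#elif defined(HAVE_TIMES)
#include <sys/times.h>
#include <unistd.h>
#define ENGINE_CLOCK_SOURCE SRC_TIMES
/* Fallback: ticks from times() divided by the tick rate; -1 on failure. */
long now_seconds(void)
{
    struct tms unused;
    long hz = sysconf(_SC_CLK_TCK);
    clock_t ticks = times(&unused);
    if (hz <= 0 || ticks == (clock_t)-1)
        return -1;
    return (long)(ticks / hz);
}
#else
#define ENGINE_CLOCK_SOURCE SRC_TIME
/* Last resort: one-second resolution from time(). */
long now_seconds(void)
{
    return (long)time(NULL);
}
#endif

/* Reports which branch was compiled in. */
enum clock_source engine_clock_source(void)
{
    return ENGINE_CLOCK_SOURCE;
}
```

With both macros defined, the gettimeofday branch wins; "times" is only compiled in when gettimeofday is unavailable.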

Naturally, the "gettimeofday" manpage says to use "clock_gettime" instead.
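
For comparison, a minimal sketch of measuring elapsed seconds with clock_gettime and CLOCK_MONOTONIC, which is unaffected by adjustments of the system clock.  The function names are hypothetical, and this is not taken from the RFC patch quoted at the top.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime() and CLOCK_MONOTONIC */
#endif
#include <assert.h>
#include <time.h>

/* Seconds on the monotonic clock, or -1 on error. */
long monotonic_seconds(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1;
    return (long)ts.tv_sec;
}

/* Seconds elapsed since an earlier monotonic_seconds() sample,
 * or -1 if the clock cannot be read. */
long seconds_since(long start)
{
    long now = monotonic_seconds();

    assert(start >= 0);
    if (now < 0)
        return -1;
    return now - start;
}
```

Taking one sample at startup and calling seconds_since() later gives an elapsed time that should never run backwards, which is the property an engine time needs.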

In any case, I patched my net-snmp to always leave SNMP_USE_TIMES
undefined, and I am 99% certain that my change will fix the crazy
issues that I've been seeing.  I'd be interested to see how this patch
performs in my situation.

_______________________________________________
Net-snmp-coders mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/net-snmp-coders
