OK. I think I have an approach. The SiteMonitor plus all its expansion
units is not the "device".
The "device" is the SiteMonitor plus the index of the expansion unit.
For example:
* SiteMonitor, index 0 is the SiteMonitor device
* SiteMonitor, index 1 is the 4-port POE device
* SiteMonitor, index 2 is the SyncInjector (first instance)
* SiteMonitor, index 3 is the SyncInjector (second instance)
and so on.
So when you add a SiteMonitor, you just add the SiteMonitor. If you add
another Packetflux expansion unit, you have to add it knowing which
index (AKA "slot") it is. Put the device in a different position, and
you need to update the index.
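
A minimal Python sketch of that idea, treating the (SiteMonitor, slot index) pair as the thing being monitored. The hostname, the labels, and the `reslot` helper are all made up for illustration; this is not Cacti's or Packetflux's actual data model.

```python
from typing import List, NamedTuple


class MonitoredDevice(NamedTuple):
    host: str   # address of the SiteMonitor base unit (hypothetical)
    slot: int   # position on the expansion bus; 0 is the base unit itself
    label: str  # free-form description


def reslot(devices: List[MonitoredDevice], host: str,
           old_slot: int, new_slot: int) -> List[MonitoredDevice]:
    """Return a new list in which one device has moved to a different slot."""
    if not any(d.host == host and d.slot == old_slot for d in devices):
        raise KeyError(f"no device at {host} slot {old_slot}")
    if any(d.host == host and d.slot == new_slot for d in devices):
        raise ValueError(f"{host} slot {new_slot} is already in use")
    return [
        d._replace(slot=new_slot)
        if d.host == host and d.slot == old_slot else d
        for d in devices
    ]


# Hypothetical inventory mirroring the example list above.
inventory = [
    MonitoredDevice("sitemonitor.example", 0, "SiteMonitor"),
    MonitoredDevice("sitemonitor.example", 1, "4-port POE"),
    MonitoredDevice("sitemonitor.example", 2, "SyncInjector (first instance)"),
    MonitoredDevice("sitemonitor.example", 3, "SyncInjector (second instance)"),
]
```

Moving a unit on the bus then becomes one explicit `reslot(...)` call, which is the "update the index" step.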
bp
On 10/25/2014 10:52 AM, Bill Prince via Af wrote:
Yah. Except that the index moves around, depending on what's in front
of it (e.g. 4-port POE versus an 8-port POE). So I can't depend on
what index number I'll be using at any given installation. The index
name will have to stay static if I ever hope to find it. Then again,
if I install two of anything, there will be more than one index with
the same description.
Hmmm. How to do this. Maybe I do have to give each device a unique
description, and then teach cacti to index on the unique description?
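
The lookup-by-description idea might look something like this in Python. The dictionary stands in for an SNMP walk of whatever column holds the descriptions; the index numbers and description strings are hypothetical, and `index_for_description` is an illustrative helper, not a Cacti function.

```python
from typing import Dict


def index_for_description(walk: Dict[int, str], description: str) -> int:
    """Return the single index whose description matches exactly.

    Raises LookupError if the description is missing or not unique,
    since either case means the index cannot be trusted.
    """
    matches = [idx for idx, desc in walk.items() if desc == description]
    if len(matches) != 1:
        raise LookupError(
            f"{description!r} matched {len(matches)} entries, expected exactly 1"
        )
    return matches[0]


# Hypothetical walk results from two sites with different hardware in front.
site_a = {1: "4-port POE", 2: "SyncInjector north", 3: "SyncInjector south"}
site_b = {1: "8-port POE A", 2: "8-port POE B", 3: "SyncInjector north"}
```

The same description resolves to index 2 at one site and index 3 at the other, and a duplicated description fails loudly instead of silently picking one.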
bp
On 10/25/2014 10:16 AM, Forrest Christian (List Account) via Af wrote:
They should be offset by a fixed amount, i.e., subtract 4.
On Oct 25, 2014 10:58 AM, "Bill Prince via Af" <af@afmug.com> wrote:
I think that may be it. The OID I was using is no longer valid.
So the SNMP response that came back had numbers in it, but it
also looks like the checksum was broken.
Not clear to me why I thought I could do this without doing the
index thing.
I hate doing the index thing.
bp
On 10/24/2014 10:32 PM, Forrest Christian (List Account) via Af
wrote:
A power cycle and a reboot should be identical in almost every
case. The reboot actually triggers a hardware reset internally
in the processor, which should clear everything out. Of course
as soon as I say that it is identical, someone will find an
example where it is not.
I'm not where I can look at the trace you sent, but I'm
surprised it contains errors. I do know that the unit will
return a response which may look like this if the oid is invalid.
Did you adjust your oids in cacti after the removal of the
mystery expansion unit from the table? If not, this is likely
the problem.
In regards to the unit being there from the factory: my guess
is if you had this unit listed in there from the get go, then it
probably was the expansion unit we use to test the expansion bus
here. It's supposed to be factory reset before shipping but it
would not shock me if it wasn't. We actually had a short
period when a largish percentage went out not factory reset due
to a tester software issue. Not really a problem but we hate
to have them go out in any other state.
On Oct 24, 2014 5:08 PM, "Bill Prince via Af" <af@afmug.com> wrote:
You mean from the web GUI? Sure.
I presume a power cycle does something different from a reboot?
I was always curious about this particular SiteMonitor, as
it came up with the extra device on the expansion bus from
the get-go. I'd never worried about it, and then I saw
the discussion about getting rid of old devices with the
zeroed-serial trick.
Don't go there! It's a trap!
bp
On 10/24/2014 2:52 PM, George Skorup (Cyber Broadcasting)
via Af wrote:
Can you post a screenshot of your expansion, binary and
analog tabs?
Also, I bet if you power-cycle it, it will be fine again. I
was working with Forrest on a bug where the SyncInjector
and some other newer modules would mysteriously disappear
from the bus. He was able to reproduce and get a fixed up
firmware load for the modules. Something about one thing
booting up faster than another, or something like that.
On 10/24/2014 4:41 PM, Bill Prince via Af wrote:
Gotcha!
I removed all the Data Sources except one (PWR1).
Suddenly that data was making it into cacti.
Then I added back in all the Data Sources coming _JUST_
from the SiteMonitor itself. That also worked.
Then I added in one of the Data Sources from the
SyncInjector (sync events), which happens to be the only
unit on the expansion bus past where I removed the
non-existent unit. This broke it again.
So I have apparently uncovered a bug where removing a unit
from the expansion bus (by zeroing the serial number)
causes the SiteMonitor to break SNMP responses. I think
it's probably just a bad checksum, but I will leave that
up to him. I forwarded the pcap trace to him.
I will probably also swap out the SiteMonitor that has the
problem.
Thanks guys!
bp
On 10/24/2014 1:57 PM, Bill Prince via Af wrote:
Then again....
Not sure why I didn't notice this the first (or second)
time. Wireshark is telling me I have a malformed
packet; either a broken header or bad checksum. So
even though the SNMP response is coming in with the
expected data, it's getting dropped before it gets into
cacti because of the malformed packet.
This would explain why removing a unit on the expansion
bus changed things...
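
One way to sanity-check a captured response outside Wireshark is a rough Python sketch like the one below. It relies only on the fact that an SNMP message is a single BER-encoded SEQUENCE (tag 0x30) whose declared length should account for the rest of the UDP payload. The function name is made up, and it checks only the outer header, not the contents, so it would catch a wrong length but not every kind of malformed packet.

```python
def outer_sequence_length_ok(payload: bytes) -> bool:
    """Check that payload is one BER SEQUENCE whose length field is consistent.

    Returns False if the tag is not 0x30, the length field is truncated,
    or the declared length disagrees with the number of bytes present.
    """
    if len(payload) < 2 or payload[0] != 0x30:
        return False
    first = payload[1]
    if first < 0x80:
        # Short form: the byte itself is the content length.
        declared, header_len = first, 2
    else:
        # Long form: low 7 bits give the number of length bytes that follow.
        n = first & 0x7F
        if n == 0 or len(payload) < 2 + n:
            return False
        declared = int.from_bytes(payload[2:2 + n], "big")
        header_len = 2 + n
    return len(payload) == header_len + declared
```

Feeding it the UDP payload bytes exported from the pcap would show whether the outer length is what got broken.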
bp
On 10/24/2014 1:32 PM, Bill Prince via Af wrote:
OK. Confirmed. The SiteMonitor is getting the SNMP
requests, and it is responding with the expected values.
I ran a pcap trace both at the SiteMonitor as well as at
the ethernet port on the cacti server. SNMP
requests/responses are going both ways (and at both
ends). In fact, spine appears to be doing 3 retries.
One thing I didn't expect is that just before the SNMP
requests, there are two attempts to open a telnet on the
SiteMonitor. Not sure where that is coming from,
except perhaps for the Manage plugin (which I
de-installed several weeks ago).
So something is broken inside cacti. How/why this was
caused by zeroing a serial number from a non-existent
expansion unit is completely baffling to me.
I also have no clue how to fix it, because cacti
"thinks" there was no response.
bp
On 10/24/2014 11:16 AM, George Skorup (Cyber
Broadcasting) via Af wrote:
I am thoroughly confused. Is your community string
correct? Can you increase the device SNMP timeout, like
1000ms instead of 250ms. What's your device down
detection set to? Is it showing down in the device list?
I have seen some base units go kinda screwy and respond
slower and a reboot doesn't fix it, they needed a
power-cycle.
On 10/24/2014 11:25 AM, Bill Prince via Af wrote:
Now thrice.
No joy in Mudville.
bp
On 10/24/2014 8:07 AM, Bill Prince via Af wrote:
Yah. Twice now.
bp
On 10/23/2014 11:06 PM, George Skorup (Cyber
Broadcasting) via Af wrote:
Gotta be the poller cache. Did you try a rebuild?
On 10/23/2014 11:03 PM, Bill Prince via Af wrote:
Getting closer. When I look in the SNMP cache,
there is no entry for the device.
Looking in the log (without debug), I get:
10/23/2014 08:34:25 PM - SPINE: Poller[0] Host[797
<http://10.13.112.20/host.php?action=edit&id=797>]
TH[1] DS[12316
<http://10.13.112.20/data_sources.php?action=ds_edit&id=12316>]
WARNING: SNMP timeout detected [250 ms], ignoring
host '10.13.114.254'
So there is something causing the SNMP request to
barf inside cacti. When I do an snmpget from the
CLI, it all looks fine. Likewise, the realtime
plugin is working fine too.
So when realtime is doing the SNMP queries outside
the poller, they are fine. They fail only when spine is
doing the SNMP requests.
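
To compare the CLI against the 250 ms the poller allows, a small Python wrapper around net-snmp's `snmpget` could time the query with the same timeout. This is only a sketch: SNMP v2c, the `public` community string, and the retry count are assumptions, and the function names are made up. The `-t` (timeout in seconds, fractions allowed) and `-r` (retries) options are standard net-snmp flags.

```python
import subprocess
import time
from typing import List, Tuple


def snmpget_argv(host: str, oid: str, community: str = "public",
                 timeout_s: float = 0.25, retries: int = 3) -> List[str]:
    """Build an snmpget command line using a poller-like timeout."""
    return [
        "snmpget", "-v", "2c", "-c", community,  # version/community are assumptions
        "-t", str(timeout_s),                    # per-attempt timeout, in seconds
        "-r", str(retries),                      # number of retries
        host, oid,
    ]


def timed_snmpget(host: str, oid: str, **kwargs) -> Tuple[bool, float]:
    """Run snmpget and return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    result = subprocess.run(snmpget_argv(host, oid, **kwargs),
                            capture_output=True, text=True)
    return result.returncode == 0, time.monotonic() - start
```

For example, `timed_snmpget("10.13.114.254", "1.3.6.1.2.1.1.3.0")` would query sysUpTime on the host from the log line above. If that succeeds well inside the timeout, the delay is probably not on the device side.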
bp
On 10/23/2014 4:12 PM, George Skorup (Cyber
Broadcasting) via Af wrote:
You divided by zero, didn't you?
Are you sure your modules are in the same order as
before?
On 10/23/2014 1:29 PM, Bill Prince via Af wrote:
I noticed an "Expansion Unit" on one of my
SiteMonitors this morning. It said something
about "Device Removed" or something like that.
Remembering the discussion the other day on this
topic, I put a "0" in the Serial # for the
non-existent unit, rescanned, & rebooted.
Now, none of the OIDs work in Cacti. If I do a
simple snmpget on any of the OIDs that I use, the
correct information comes back. Several of the
OIDs are on the base unit anyway, so they would
not have moved, and further, the OIDs don't
reference the serial number.
So... what did I do, and how do I fix it?