Hi all, I have been running into some strange SNMP walk timeout issues with 
snmp exporter against Citrix NetScaler appliances.

I'm running the latest snmp exporter (0.18.0) as a Docker container.

If I try to walk the "vServer" or other similar metrics which have a time 
series for each vserver (as opposed to e.g. netscaler appliance CPU 
metrics), the walks fail due to timeouts in a bizarrely periodic way. We 
currently have around 420 vservers on each load balancer.

*Behaviour*

The snmp exporter will fail to walk the netscaler at approximately 15 
minutes past the hour, every hour, and will not walk correctly again for 
15-20 minutes. I am walking 2 netscalers, and the scrapes fail on both at 
the same time. One resumes walking after about 15 minutes, while the other 
takes about 25 minutes to resume. The image shows 
"snmp_scrape_duration_seconds" for the netscaler module from the Prometheus 
interface.
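
For context, my Prometheus scrape config follows the standard snmp_exporter 
relabeling pattern; a sketch is below (the job name, hostnames, and 
exporter address are placeholders, not my exact values):

```yaml
scrape_configs:
  - job_name: snmp_netscaler
    metrics_path: /snmp
    params:
      module: [citrix_adc]
    static_configs:
      - targets:
          - netscaler1.example.com  # placeholder hostnames
          - netscaler2.example.com
    relabel_configs:
      # standard snmp_exporter pattern: pass the device as ?target=,
      # keep it as the instance label, and point __address__ at the exporter
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116  # placeholder exporter address
```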

[image: snmp_timeout.PNG]

The problem is not with Prometheus, as the timeouts can also be observed 
when targeting the netscaler directly from the SNMP exporter web interface, 
which reports the following error:

An error has occurred while serving metrics:

error collecting metric Desc{fqName: "snmp_error", help: "Error scraping 
target", constLabels: {}, variableLabels: []}: error walking target 
example.com: Request timeout (after 3 retries)


The logs for the snmp exporter container show this error:

level=info ts=2020-06-07T23:28:20.946Z caller=collector.go:224 
module=citrix_adc 
target=example.com msg="Error scraping target" err="scrape canceled 
(possible timeout) walking target example.com"

A few days ago I was using snmp exporter version 0.17.0 and the error was 
more along the lines of `context canceled`. I realise some updates were 
made to timeout handling in the latest release, but that doesn't seem to 
help in this situation (see more about my timeout settings further below).

No noticeable problems are occurring from the netscaler's perspective; 
these are production appliances and everything is running fine.

I am not sure if this is an snmp exporter related problem or a netscaler 
related problem.

I have done testing from the command line to confirm the netscaler is still 
responding to SNMP. The command takes longer than during the 'non-timeout' 
period, but it does not time out or fail. The fact that I can run 
`snmpbulkwalk` on the entire `vserver` table from my command line with no 
timeout error during the same period makes me think it's snmp exporter 
related, whereas the fact that it happens on a regular periodic cycle makes 
me think it could be something happening on the netscaler.

If I generate a new minimal snmp.yml during the 'timeout period' with the 
vserver related OIDs removed, leaving just e.g. the netscaler CPU stats, 
the walks resume straight away.
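
For clarity, the minimal config is generated from a module along these 
lines (this is a sketch, not my exact generator.yml; the walk entry and 
community string are placeholders):

```yaml
modules:
  citrix_adc_minimal:
    version: 2
    auth:
      community: public  # placeholder
    walk:
      # appliance-level OIDs only (exact NS-ROOT-MIB names elided);
      # every per-vserver table OID removed
      - sysUpTime
```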

When I time running `snmpbulkwalk` on the vserver table (using the Linux 
`time` command), it normally takes about 3 seconds. During the strange 
hourly 'timeout' period it takes about 6 seconds.

Changing my `timeout` or `max_repetitions` does not seem to have any 
effect: I have tried setting the timeout to values above 30s, and both 
increasing and decreasing `max_repetitions`, and it still fails. The snmp 
exporter fails to walk one column of a table, while I can walk the entire 
table with no failure from the command line.
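
For reference, these are the generator.yml knobs I have been varying; a 
sketch is below (OID list elided, community string is a placeholder, 
values shown are in the range I tried):

```yaml
modules:
  citrix_adc:
    walk:
      # ... vserver and appliance OIDs ...
    version: 2
    auth:
      community: public  # placeholder
    timeout: 30s         # also tried larger values
    retries: 3
    max_repetitions: 25  # tried both higher and lower values
```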

I cannot find any reference to SNMP timeout settings or rate limiting on 
the netscaler.

Can anyone help me narrow down if this is an snmp exporter issue or a 
netscaler issue?

Thanks.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2740b34d-8ae3-4733-9946-740a8f0f9288o%40googlegroups.com.
