Re: [prometheus-users] snmp exporter periodic timeouts when walking citrix netscaler

Justin Teare Mon, 08 Jun 2020 15:18:10 -0700

Thanks Ben,

That's good info to know. Looks like the scrape timeout is not set on that
scrape job config. However, as I said, the walk also fails on the snmp
exporter when I query the target directly via the snmp exporter web
interface, so it's timing out based on its own timeout settings. I
verified this by timing a failed walk: it goes through the default 3
retries, and the time it takes to fail changes depending on what timeout
I set in the snmp generator config.
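
To make the interaction between the two timeouts concrete, here is a rough model of it. This is a sketch, not the exporter's actual code: it assumes one initial attempt plus `retries` further attempts per SNMP request, each bounded by `timeout`, and it takes Ben's remark (quoted below) to mean that whichever limit is shorter ends the walk. The function names are made up for illustration.

```python
def worst_case_request_seconds(timeout_s: float, retries: int) -> float:
    """Upper bound on one unanswered SNMP request.

    Assumption: the request is tried once and then retried `retries` times,
    and each attempt waits the full per-request timeout.
    """
    return timeout_s * (retries + 1)


def effective_deadline_seconds(
    scrape_timeout_s: float, timeout_s: float, retries: int
) -> float:
    """Whichever limit is reached first ends the walk (simplified model)."""
    return min(scrape_timeout_s, worst_case_request_seconds(timeout_s, retries))


# Values taken from the thread: 10s Prometheus scrape timeout,
# 3 retries, and a 30s per-request timeout.
print(worst_case_request_seconds(30, 3))
print(effective_deadline_seconds(10, 30, 3))
```

Under these assumptions a 30s timeout with 3 retries gives up after 120s when nothing else cuts it short, while a 10s scrape timeout ends the same walk at 10s. That matches the observation that the time to fail via the web interface scales with the configured timeout.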


On Monday, June 8, 2020 at 5:15:01 PM UTC+12, Ben Kochie wrote:
>
> What is your scrape interval and scrape timeout on the Prometheus side? 
> Prometheus sends a default scrape timeout of 10s to the exporter. The 
> exporter timeout is only used if the timeout from the Prometheus server is 
> longer.
>
> On Mon, Jun 8, 2020 at 1:39 AM Justin Teare wrote:
>
>> Hi all, I have been running into some strange snmp walk timeout issues 
>> with snmp exporter against citrix netscaler appliances.
>>
>> Running latest (0.18.0) snmp exporter as a docker container.
>>
>> If I try to walk the "vServer" or other similar metrics which have a time
>> series for each vserver (as opposed to e.g. netscaler appliance cpu
>> metrics), the walks fail due to timeouts in a bizarrely periodic
>> way. We currently have around 420 vservers on each load balancer.
>>
>> *Behaviour*
>>
>> The snmp exporter will fail to walk the netscaler at approx 15 mins past 
>> the hour every hour, and will not walk again correctly for 15-20 mins. I am 
>> walking 2 netscalers, and the scrapes fail on both netscalers at the same 
>> time. One resumes walking after about 15 mins, while the other takes about 
>> 25 min to resume walking. Image shows "snmp_scrape_duration_seconds" for 
>> the netscaler module from the Prometheus interface. 
>>
>> [image: snmp_timeout.PNG]
>>
>> The problem is not with Prometheus as you can observe the timeouts when 
>> targeting the netscaler from the SNMP exporter web interface which reports 
>> the following error:
>>
>> An error has occurred while serving metrics:
>>
>> error collecting metric Desc{fqName: "snmp_error", help: "Error scraping 
>> target", constLabels: {}, variableLabels: []}: error walking target 
>> example.com: Request timeout (after 3 retries)
>>
>>
>> The logs for the snmp exporter container show this error:
>>
>> level=info ts=2020-06-07T23:28:20.946Z caller=collector.go:224 
>> module=citrix_adc 
>> target=example.com msg="Error scraping target" err="scrape canceled 
>> (possible timeout) walking target example.com"
>>
>> A few days ago I was using snmp exporter version 0.17.0 and the error was 
>> more along the lines of `context canceled`. I realise there were some 
>> updates to timeouts made in the latest update but that doesn't seem to be 
>> helping in this situation (see more info about my timeout settings further 
>> below).
>>
>> No noticeable problems are happening from the netscaler's perspective;
>> these are production appliances and everything is running fine.
>>
>> I am not sure if this is an snmp exporter related problem or a netscaler 
>> related problem.
>>
>> I have done testing from the command line to confirm the netscaler is
>> still responding to snmp. The command takes longer than during the
>> 'non-timeout' period, but it does not time out or fail. The fact that I can
>> run `snmpbulkwalk` on the entire `vserver` table from my command line and
>> get no timeout error during the same period makes me think it's snmp
>> exporter related, whereas the fact that it happens on a regular periodic
>> cycle makes me think it could be something that's happening on the
>> netscaler.
>>
>> If I generate a new minimal snmp.conf during the 'timeout period' with
>> the vserver related OIDs removed and just leave e.g. netscaler cpu stats,
>> the walks will resume straight away.
>>
>> When I time `snmpbulkwalk` on the vserver table (using the linux
>> "time" command) from the command line, it normally takes about 3s
>> to run. During the weird hourly 'timeout' period it takes about 6 seconds.
>>
>> Changing my `timeout` or `max_repetitions` does not seem to have any
>> effect: I have tried setting a timeout value > 30s, and both increasing
>> and decreasing `max_repetitions`, and it still fails. The snmp
>> exporter fails to walk one column of a table, while I can walk the entire
>> table with no failure from the command line.
>>
>> I cannot see any reference to setting of snmp timeouts or rate limiting 
>> on the netscaler.
>>
>> Can anyone help me narrow down if this is an snmp exporter issue or a 
>> netscaler issue?
>>
>> Thanks.
>
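
The manual `time snmpbulkwalk` comparison in the quoted message (about 3s normally, about 6s during the bad period) could be repeated on a schedule so the slowdown can be lined up against the hourly failures. The following is only a sketch under assumptions: `time_command` and `sample` are hypothetical helper names, and the command to run is whatever `snmpbulkwalk` invocation is already being used by hand.

```python
import subprocess
import time
from datetime import datetime, timezone


def time_command(argv, timeout_s=60.0):
    """Run argv once and return (utc_start, elapsed_seconds, ok)."""
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    t0 = time.monotonic()
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        ok = proc.returncode == 0
    except subprocess.TimeoutExpired:
        # The command ran past timeout_s and was killed.
        ok = False
    return started, time.monotonic() - t0, ok


def sample(argv, count, interval_s):
    """Time the same command `count` times, sleeping interval_s between runs."""
    rows = []
    for i in range(count):
        rows.append(time_command(argv))
        if i + 1 < count:
            time.sleep(interval_s)
    return rows
```

For example, calling `sample(argv, count=90, interval_s=60)` with `argv` set to the existing walk command as a list of strings would cover an hour and a half at one-minute resolution. Printing each row gives timestamps to compare against `snmp_scrape_duration_seconds` in Prometheus.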

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2dba9be2-2ebd-441e-a292-4035fb35484bo%40googlegroups.com.
