Thanks Ben, that's good to know. It looks like the scrape timeout is not set in that scrape job's config. However, as I said, the walk also fails when I query the target directly through the snmp exporter web interface, so it is timing out based on its own timeout settings. I verified this by timing how long it takes to go through the default 3 retries, which changes depending on the timeout I set in the snmp generator config.
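
As a rough sketch of the arithmetic behind that check (this is not the exporter's actual code): assuming each SNMP request waits `timeout` seconds and is retried `retries` times after the first attempt, one unanswered request gives up after roughly `timeout * (retries + 1)` seconds. Taking Ben's description at face value, the shorter of the exporter timeout and the Prometheus scrape timeout is the one that applies. The function names and the example values below are made up for illustration.

```python
def seconds_until_giving_up(request_timeout_s: float, retries: int = 3) -> float:
    """Approximate time before one unanswered SNMP request is abandoned.

    Assumption: one initial attempt plus `retries` retries, each waiting
    the full per-request timeout.
    """
    return request_timeout_s * (retries + 1)


def effective_timeout(exporter_timeout_s: float, scrape_timeout_s: float = 10.0) -> float:
    """The shorter of the two timeouts wins (as described in Ben's reply).

    Assumption: 10s is the Prometheus default scrape timeout when the
    scrape job does not set one.
    """
    return min(exporter_timeout_s, scrape_timeout_s)


if __name__ == "__main__":
    # Hypothetical per-request timeouts, in seconds.
    for timeout_s in (5.0, 30.0):
        print(
            f"timeout={timeout_s}s: gives up after ~{seconds_until_giving_up(timeout_s)}s, "
            f"effective timeout with a 10s scrape timeout = {effective_timeout(timeout_s)}s"
        )
```

If the time to failure measured in the web interface scales with the configured timeout in this way, that is consistent with the exporter's own settings, rather than the Prometheus scrape timeout, ending the walk.
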
On Monday, June 8, 2020 at 5:15:01 PM UTC+12, Ben Kochie wrote:
>
> What is your scrape interval and scrape timeout on the Prometheus side?
> Prometheus sends a default scrape timeout of 10s to the exporter. The
> exporter timeout is only used if the timeout from the Prometheus server is
> longer.
>
> On Mon, Jun 8, 2020 at 1:39 AM Justin Teare <[email protected]> wrote:
>
>> Hi all, I have been running into some strange snmp walk timeout issues
>> with snmp exporter against citrix netscaler appliances.
>>
>> Running latest (0.18.0) snmp exporter as a docker container.
>>
>> If I try to walk the "vServer" or other similar metrics which have a time
>> series for each vserver (as opposed to e.g. netscaler appliance cpu
>> metrics), the walks are failing due to timeouts in a bizarrely periodic
>> way. We currently have around 420 vservers on each load balancer.
>>
>> *Behaviour*
>>
>> The snmp exporter will fail to walk the netscaler at approx 15 mins past
>> the hour every hour, and will not walk again correctly for 15-20 mins. I am
>> walking 2 netscalers, and the scrapes fail on both netscalers at the same
>> time. One resumes walking after about 15 mins, while the other takes about
>> 25 mins to resume walking. Image shows "snmp_scrape_duration_seconds" for
>> the netscaler module from the Prometheus interface.
>>
>> [image: snmp_timeout.PNG]
>>
>> The problem is not with Prometheus, as you can observe the timeouts when
>> targeting the netscaler from the SNMP exporter web interface, which reports
>> the following error:
>>
>> An error has occurred while serving metrics:
>>
>> error collecting metric Desc{fqName: "snmp_error", help: "Error scraping
>> target", constLabels: {}, variableLabels: []}: error walking target
>> example.com: Request timeout (after 3 retries)
>>
>>
>> The logs for the snmp generator container show this error:
>>
>> level=info ts=2020-06-07T23:28:20.946Z caller=collector.go:224
>> module=citrix_adc
>> target=example.com msg="Error scraping target" err="scrape canceled
>> (possible timeout) walking target example.com"
>>
>> A few days ago I was using snmp exporter version 0.17.0 and the error was
>> more along the lines of `context canceled`. I realise there were some
>> updates to timeouts made in the latest update, but that doesn't seem to be
>> helping in this situation (see more info about my timeout settings further
>> below).
>>
>> No noticeable problems are happening from the netscaler's perspective;
>> these are production appliances and everything is running fine.
>>
>> I am not sure if this is an snmp exporter related problem or a netscaler
>> related problem.
>>
>> I have done testing from the command line to confirm that snmp on the
>> netscaler is still responding. This command takes longer than during the
>> 'non-timeout' period, but it does not time out or fail. The fact that I can
>> run `snmpbulkwalk` on the entire `vserver` table from my command line and
>> get no timeout error during the same period makes me think it's snmp
>> exporter related, whereas the fact that it happens on a regular periodic
>> cycle makes me think it could be something that's happening on the
>> netscaler.
>>
>> If I generate a new minimal snmp.conf during the 'timeout period' with
>> the vserver related OIDs removed and just leave e.g.
>> netscaler cpu stats, the walks will resume straight away.
>>
>> When I time `snmpbulkwalk` on the vserver table (using the linux `time`
>> command) from the command line, it normally takes about 3s to run.
>> During the weird hourly 'timeout' period it takes about 6 seconds.
>>
>> Changing my `timeout` or `max_repetitions` does not seem to have any
>> effect: I have tried setting the timeout value > 30s, and both increasing
>> and decreasing `max_repetitions`, and it still fails. The snmp
>> exporter fails to walk one column of a table, while I can walk the entire
>> table with no failure from the command line.
>>
>> I cannot see any reference to setting of snmp timeouts or rate limiting
>> on the netscaler.
>>
>> Can anyone help me narrow down if this is an snmp exporter issue or a
>> netscaler issue?
>>
>> Thanks.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/prometheus-users/2740b34d-8ae3-4733-9946-740a8f0f9288o%40googlegroups.com
>> .
>>
>

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/2dba9be2-2ebd-441e-a292-4035fb35484bo%40googlegroups.com.

