> We have a process that makes some (light) queries to batch.opensrs.net
> every hour, and this is a common situation. It's down (usually with the
> "Invalid private key" error) at least 5% of the time, and has been like
> that for at least a year.
>
> Perhaps it's just heavily loaded at the top of the hour, when people's
> automated scripts might be triggered from cron jobs? I keep meaning to put
> in a few minute delay to see if it improves things, but haven't gotten
> around to it.

Our nightly audit process starts up at a specific time, with the step that
connects to opensrs occurring about halfway through the process (it's
dependent on how long the prior steps take to run).  This morning the
opensrs piece started at 5:50 AM EST and about halfway through it could no
longer connect to opensrs.  The opensrs piece runs for about 20 minutes,
so I guesstimated the problem occurred around 6am.  I think I will add
timestamps to the error messages recorded in our logs so I can see exactly
when they occur.
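
A minimal sketch of the timestamp idea, in Python using the standard
logging module.  The function name make_audit_logger and the logger name
"audit" are made up for illustration; this is not our actual audit code.

```python
import logging


def make_audit_logger(name="audit", stream=None):
    """Return a logger whose every line starts with a timestamp.

    `name` and this helper are hypothetical; `stream` defaults to stderr.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate lines if called twice
    logger.propagate = False     # keep output on our own handler only

    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(
            "%(asctime)s %(levelname)s %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_audit_logger()
    log.error("invalid private key")
```

With something like this in place, each recorded error carries the local
date and time it was logged, so the start of an outage can be read straight
from the log instead of estimated.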

Our audit system checks the login and subsequent command requests for
errors like "no response from server", "number of requests exceeded" and
"invalid private key".  If it detects any of these situations, it will
retry the login a set number of times before giving up (to avoid an
infinite loop when the opensrs system is really down).  Every once in a
while, we get one of the no response or invalid key errors and are able to
successfully reconnect using our retry logic.  We only get the "number of
requests exceeded" when opensrs decides to change the number of requests
they will allow per login.  It's currently 100, so we automatically
reconnect after 100 to avoid that error.
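
The retry and reconnect logic described above can be sketched roughly as
follows, in Python.  Everything here is an assumption for illustration:
the class AuditSession, the login callable, the send() method on the
connection and the idea that the client raises an exception carrying the
error text.  None of it is the real opensrs client API.

```python
import time

# Error texts that are worth a fresh login (from the description above).
RETRYABLE_ERRORS = (
    "no response from server",
    "number of requests exceeded",
    "invalid private key",
)


def is_retryable(message):
    """True if the error text matches one of the known transient errors."""
    text = message.lower()
    return any(snippet in text for snippet in RETRYABLE_ERRORS)


class AuditSession:
    """Wraps a hypothetical login() callable that returns an object with
    a send(command) method and raises an exception on failure."""

    def __init__(self, login, max_login_attempts=3,
                 requests_per_login=100, retry_delay=5.0):
        self._login = login
        self.max_login_attempts = max_login_attempts
        self.requests_per_login = requests_per_login
        self.retry_delay = retry_delay
        self._conn = None
        self._sent = 0

    def _connect(self):
        """Log in, retrying a bounded number of times, then give up."""
        last_error = None
        for attempt in range(1, self.max_login_attempts + 1):
            try:
                self._conn = self._login()
                self._sent = 0
                return
            except Exception as exc:
                if not is_retryable(str(exc)):
                    raise
                last_error = exc
                if attempt < self.max_login_attempts:
                    time.sleep(self.retry_delay)
        raise RuntimeError(
            f"giving up after {self.max_login_attempts} login attempts"
        ) from last_error

    def send(self, command):
        """Send one command, reconnecting before the per-login limit."""
        if self._conn is None or self._sent >= self.requests_per_login:
            self._connect()
        try:
            result = self._conn.send(command)
        except Exception as exc:
            if not is_retryable(str(exc)):
                raise
            self._connect()                      # fresh login, then
            result = self._conn.send(command)    # retry the command once
        self._sent += 1
        return result
```

The bounded loop in _connect is what prevents the infinite loop when the
server is really down, and the requests_per_login counter forces a new
login before the limit (100 at the moment) is hit.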

Once they got the batch system back up and running today, I just started
our script and it picked up where it left off...

- Bill
