> We have a process that makes some (light) queries to batch.opensrs.net
> every hour, and this is a common situation. It's down (usually with the
> "Invalid private key" error) at least 5% of the time, and has been like
> that for at least a year.
>
> Perhaps it's just heavily loaded at the top of the hour, when people's
> automated scripts might be triggered from cron jobs? I keep meaning to put
> in a few minute delay to see if it improves things, but haven't gotten
> around to it.

Our nightly audit process starts up at a specific time, with the step that connects to opensrs occurring about halfway through the process (exactly when depends on how long the prior steps take to run). This morning the opensrs piece started at 5:50 AM EST, and about halfway through it could no longer connect to opensrs. The opensrs piece runs for about 20 minutes, so I guesstimated the problem occurred around 6 AM. I think I will add timestamps to the error messages recorded in our logs so I can see exactly when they occur.

Our audit system checks the login and subsequent command requests for errors like "no response from server", "number of requests exceeded" and "invalid private key". If it detects any of these, it retries the login a set number of times before giving up (to avoid an infinite loop when the opensrs system is really down). Every once in a while, we get one of the no-response or invalid-key errors and are able to reconnect successfully using our retry logic. We only get the "number of requests exceeded" error when opensrs changes the number of requests they allow per login. It's currently 100, so we automatically reconnect after 100 requests to avoid that error.

Once they got the batch system back up and running today, I just started our script and it picked up where it left off...

- Bill
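
For anyone who wants to try something similar, the retry-and-reconnect approach described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the actual audit script: the `login` and `send` callables, the `BatchSession` and `ServiceDown` names, and the attempt count of 5 are all hypothetical placeholders, and the error strings are simply the three quoted in this message. The logging format adds a timestamp to each recorded error.

```python
import logging
import time

# asctime puts a timestamp on every logged error message.
logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("audit")

# Error strings named in this message, matched as lowercase substrings.
RETRYABLE_ERRORS = (
    "no response from server",
    "number of requests exceeded",
    "invalid private key",
)
MAX_REQUESTS_PER_LOGIN = 100  # limit quoted in this message; may change
MAX_ATTEMPTS = 5              # assumed; the message only says "a set number"


class ServiceDown(Exception):
    """Raised when retries are exhausted (hypothetical name)."""


def is_retryable(reply):
    text = reply.lower()
    return any(err in text for err in RETRYABLE_ERRORS)


class BatchSession:
    def __init__(self, login, send, delay=0.0):
        self._login = login  # hypothetical callable: () -> reply string
        self._send = send    # hypothetical callable: (command) -> reply string
        self._delay = delay  # seconds to wait between attempts
        self._sent = 0       # requests made on the current login
        self._connect()

    def _connect(self):
        """Log in, retrying a bounded number of times."""
        for attempt in range(1, MAX_ATTEMPTS + 1):
            reply = self._login()
            if not is_retryable(reply):
                self._sent = 0
                return
            log.warning("login attempt %d failed: %s", attempt, reply)
            time.sleep(self._delay)
        raise ServiceDown("login failed %d times" % MAX_ATTEMPTS)

    def request(self, command):
        """Send one command, logging in again when needed."""
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if self._sent >= MAX_REQUESTS_PER_LOGIN:
                self._connect()  # stay under the per-login limit
            reply = self._send(command)
            self._sent += 1
            if not is_retryable(reply):
                return reply
            log.warning("command attempt %d failed: %s", attempt, reply)
            time.sleep(self._delay)
            self._connect()
        raise ServiceDown("command failed %d times" % MAX_ATTEMPTS)
```

Bounding both loops is what prevents the infinite loop when the service is really down. A caller could catch `ServiceDown`, record the last item it completed, and resume from there on the next run.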
