Quite correct, Marc. I found a correlation between very high internal network congestion and both failure modes: systems stuck in Pending and systems stuck in Initializing.
For the stuck Pending systems, if I ssh into the system and manually upgrade the agent then restart, it proceeds to the next step. This is an acceptable workaround for my purposes.

For the stuck Initializing systems, once the messages have been marked Failed in the Scalr Internal Messaging panel, I can hit Re-send message and the messages will succeed.

Is there any way to increase the time between retries, or the number of retries, or both?

On Wednesday, October 19, 2016 at 2:02:14 PM UTC-5, Marc O'Brien wrote:
>
> Hello Slopshid,
>
> In case you did not have the link, our Agent change log is available here
> <http://goog_330914497>. We have not had similar reports of
> high-occurrence intermittent "Pending" or "Initializing" state issues as
> you have described. A full copy of your agent logs from one of these
> instances using the latest agent version would be helpful to understand
> what is happening. Likewise, due to the intermittent nature of the issue
> you are describing it would be useful to determine if there are any factors
> common to the failing instances, such as Cloud Platform as you noted, OS,
> Role, time of day, network or server load, etc.
>
> Many thanks,
> Wm. Marc O'Brien
> Scalr Technical Support
>
>
> On Wednesday, October 19, 2016 at 12:29:17 PM UTC-6, [email protected]
> wrote:
>>
>> I've been having issues with the scalarizr agent pretty much since I
>> started using Scalr, and it seems to have gotten worse with each new agent
>> version.
>>
>> The environment:
>> Scalr 5.11.22 Community Edition
>> Scalarizr stable 4.6.6 through 4.10.0
>> -or-
>> Scalarizr latest 4.9.3 through 4.11.10
>> CentOS 7.2 instances
>> AWS and OpenStack clouds (although it happens 10x more on OpenStack than
>> in AWS)
>>
>> The issues:
>> 1) A small percentage of systems (though the percentage has increased
>> with later scalarizr agent versions) will get stuck in Pending state.
>> Investigating these systems, the scalarizr agent appears to have completed
>> the upgrade task and then crashed.
>> We're currently hosting 4.6.6 in a custom repo because that agent has the
>> lowest rate of failure (~3% of all launches) - the 4.10.0 version
>> increased the failure rate to 20-25%!
>>
>> 2) A smaller percentage of systems will get stuck in Initializing state,
>> with 2-3 failed message deliveries in the Scalr Internal Messaging panel.
>> Once I realize the systems are stuck, I can resend the messages and the
>> systems will come up normally. I'm not sure if the rate of this type of
>> failure is higher with the later version of the agent, since the failure
>> rate on the first issue was so unacceptably high.
>>

--
You received this message because you are subscribed to the Google Groups "scalr-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
