> Even so, if there is a script collecting this metric out there somewhere, and just checking if port 22 is reachable, it will just see this potential extra delay.
If we were to get a bug along these lines, I would close it as invalid because the alternative is for the instance to not boot at all.

> Users waiting for such an instance will just see that it is never ready, and not understand why. cloud-init will be logging this for sure, but where would users see what is going on if the instance does not finish boot? Is this logging visible via the cloud's API somewhere in this scenario?

Users would never be able to SSH in the first place. If one is connected to the serial console, they will see the warning logs getting emitted once per second (or once per retry period). Previously, they would get no log on that console and be unable to log in: even if they provided a password to allow login via the serial console, without a working IMDS, cloud-init has no way to retrieve and apply that password.

So on any real cloud, the difference is that with this change we get a useful log in the serial console. Without it, if the IMDS only returns 503s, we still cannot SSH, we cannot log in via the serial console, and the serial console shows us nothing useful.

> systems without metadata services...Now we will wait forever on that, and I say forever because that "cloud" really wasn't supposed to provide a metadata service, and all was working fine because we were ignoring the error. Until now.

Could this theoretical scenario exist? Theoretically yes. But that means:
1. Cloud-init was told to fetch data from a server that we don't want/need data from
2. That server is continuously emitting 503s

Both of these point to something going very wrong and to the entire environment being misconfigured. Are we promising to not break something that is completely misconfigured? Would we block an SRU because of https://xkcd.com/1172/ ?

503 isn't an accidental or generic HTTP response. Nobody should be returning that accidentally. If you have a generic server error, you return 500, and if an endpoint isn't available you return 404.
The whole point of the 503 is to signal "this should be a valid request, but we're having a temporary issue, please try again later."
https://datatracker.ietf.org/doc/html/rfc7231#section-6.6.4

> e) Do all clouds recommend the retry strategy when the metadata service returns a 503? And by "clouds" I really mean anywhere cloud-init is used to configure things during boot (lxd, maas, local VMs, real clouds, etc).

I think this is less about cloud recommendations and more about HTTP standard practice. I don't need to know a cloud-specific strategy to deal with a 401... I just know that I need to log in. I think the same thing applies to the 503 here.

Is there something we can do to reduce your concern?

-- 
https://bugs.launchpad.net/bugs/2094858

Title:
  Cloud-init fails on AWS if IMDSv2 returns a 503 error.
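
For reference, the 503 handling argued for in this thread can be sketched roughly as follows. This is only an illustration and not cloud-init's actual code: `fetch_with_retry`, `retry_delay`, the `(status, headers, body)` response shape, and the one-second default retry period are all assumptions made for the example. A 503 is retried with a warning on every attempt, while any other error status fails immediately.

```python
import logging
import time
from typing import Callable, Mapping, Optional, Tuple

LOG = logging.getLogger("imds_sketch")

# (status code, response headers, body); a hypothetical shape for this sketch
Response = Tuple[int, Mapping[str, str], bytes]


def retry_delay(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Return the delay before the next attempt.

    Honors a Retry-After header given in whole seconds; otherwise falls
    back to the default retry period. HTTP-date values are not parsed,
    and the header lookup is case-sensitive in this sketch.
    """
    value = headers.get("Retry-After", "").strip()
    return float(value) if value.isdigit() else default


def fetch_with_retry(
    fetch: Callable[[], Response],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: Optional[int] = None,
) -> bytes:
    """Call fetch() until it stops returning 503.

    503 is treated as temporary and retried, with a warning per attempt.
    Any other non-2xx status is treated as a real failure and raised at once.
    max_attempts=None means retry indefinitely.
    """
    attempt = 0
    while True:
        attempt += 1
        status, headers, body = fetch()
        if 200 <= status < 300:
            return body
        if status != 503:
            raise RuntimeError(f"metadata request failed with HTTP {status}")
        if max_attempts is not None and attempt >= max_attempts:
            raise TimeoutError(f"still HTTP 503 after {attempt} attempts")
        delay = retry_delay(headers)
        LOG.warning("HTTP 503 on attempt %d; retrying in %.0fs", attempt, delay)
        sleep(delay)
```

Passing `sleep` and `fetch` as callables keeps the sketch free of real network access; in this form the repeated warning is the only signal an operator would see on a console while the service keeps answering 503.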
