> Even so, if there is a script collecting this metric out there
somewhere, and just checking if port 22 is reachable, it will just see
this potential extra delay.

If we were to get a bug along these lines, I would close it as invalid
because the alternative is for the instance to not boot at all.

> Users waiting for such an instance will just see that it is never
ready, and not understand why. cloud-init will be logging this for sure,
but where would users see what is going on if the instance does not
finish boot? Is this logging visible via the cloud's API somewhere in
this scenario?

Users would never have been able to SSH in to begin with. Anyone
connected to the serial console will see the warning logs emitted once
per second (or once per retry period). Previously, they would get no
log on that console and would be unable to log in: even if they
provided a password for serial-console login, without a working IMDS
cloud-init has no way to retrieve and apply that password. So on any
real cloud, the difference is that with this change we get a useful log
in the serial console. Without it, if the IMDS only returns 503s, we
still cannot SSH in, we cannot log in via the serial console, and the
serial console shows us nothing useful.
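
To make the "one warning per retry period" behaviour concrete, here is
a minimal sketch of such a loop. It is not cloud-init's actual code:
the names wait_for_metadata, fetch, retry_period and max_attempts are
hypothetical, and fetch stands in for whatever performs the HTTP
request and returns a (status, body) pair.

```python
import logging
import time
from typing import Callable, Optional, Tuple

LOG = logging.getLogger("imds_sketch")  # hypothetical logger name


def wait_for_metadata(
    fetch: Callable[[], Tuple[int, bytes]],
    retry_period: float = 1.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it stops answering 503, warning on each retry.

    max_attempts=None means "keep trying", which mirrors the
    wait-until-ready behaviour discussed above.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        status, body = fetch()
        if status == 200:
            return body
        if status != 503:
            # Anything other than 503 is not a "try again later" signal.
            raise RuntimeError(f"unexpected metadata status {status}")
        # This is the line a user watching the serial console would see.
        LOG.warning(
            "metadata service returned 503 (attempt %d); retrying in %ss",
            attempt,
            retry_period,
        )
        sleep(retry_period)
    raise TimeoutError(f"metadata service still returning 503 after {attempt} attempts")
```

Passing sleep as a parameter is only there so the loop can be exercised
without real delays.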

> systems without metadata services...Now we will wait forever on that,
and I say forever because that "cloud" really wasn't supposed to provide
a metadata service, and all was working fine because we were ignoring
the error. Until now. Could this theoretical scenario exist?

Theoretically yes. But that means:
1. Cloud-init was told to fetch data from a server that we don't
   want/need data from
2. That server is continuously emitting 503s

Both of these point to something going very, very wrong and to the
entire environment being misconfigured. Are we promising not to break
something that is completely misconfigured? Would we block an SRU
because of https://xkcd.com/1172/ ?

503 isn't an accidental or generic HTTP response. Nobody should be
returning that accidentally. If you have a generic server error, you
return 500, and if an endpoint isn't available you return 404. The
whole point of the 503 is to signal "this should be a valid request,
but we're having a temporary issue, please try again later."
https://datatracker.ietf.org/doc/html/rfc7231#section-6.6.4
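
As a rough illustration of that distinction, a client can decide what
to do from the status code alone. The sketch below is an assumption
for discussion, not anything cloud-init ships: classify and
retry_delay are hypothetical helpers. The one fact it leans on is that
a 503 response may carry a Retry-After header, which is either a
number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def classify(status: int) -> str:
    """Map an HTTP status code to the action a generic client would take."""
    if 200 <= status < 300:
        return "ok"
    if status == 503:
        return "retry"         # temporary condition, try again later
    if status == 404:
        return "missing"       # endpoint is not there
    if 500 <= status < 600:
        return "server-error"  # generic failure, no promise of recovery
    return "other"


def retry_delay(
    retry_after: Optional[str],
    default: float = 1.0,
    now: Optional[datetime] = None,
) -> float:
    """Return seconds to wait, honouring a Retry-After value if usable."""
    if retry_after is None:
        return default
    value = retry_after.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```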

> e) Do all clouds recommend the retry strategy when the metadata
service returns a 503? And by "clouds" I really mean anywhere cloud-init
is used to configure things during boot (lxd, maas, local VMs, real
clouds, etc).

I think this is less about cloud recommendations and more about HTTP
standard practice. I don't need to know a cloud-specific strategy to
deal with a 401; I just know that I need to log in. I think the same
thing applies to the 503 here.

Is there something we can do to reduce your concern?

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2094858

Title:
  Cloud-init fails on AWS if IMDSv2 returns a 503 error.