Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Jason Hobbs Tue, 06 Feb 2018 13:51:06 -0800

Blake, that's great.  Do you have before and after numbers showing the
improvement this change made?


Do you have any data or logs that led you to believe this was the
culprit in the slow responses I saw on my cluster?

On Tue, Feb 6, 2018 at 3:12 PM, Blake Rouse <blake.ro...@canonical.com> wrote:
> Actually, caching does make a difference. That method is not just caching
> the reading of a file; it caches the search for the file based on the
> purpose, the reading of that file from disk (sure, that can be in the
> kernel cache), and the parsing of the template by tempita.
>
> All of that is redundant work that is being done on every single request.
> Searching the filesystem and reading the file are all syscalls, even if
> the data comes from the kernel cache. Since MAAS is async based, that
> means the coroutine will be placed on hold while we wait for the result
> to be loaded from the kernel into the memory of the process. That gives
> other coroutines time to do other things, which means that coroutine
> doesn't get to execute until the others are done or blocked by their own
> async request.
>
> Caching this information can greatly improve that by not requiring the
> coroutine to be pushed back into the eventloop while it is waiting for
> data from the kernel. Without this change, when the data comes back it
> still has to be processed by tempita, which takes time and blocks the
> eventloop from completing other work.
>
> So it's not simply that we should use the kernel to cache reads from the
> disk; there is a lot more involved here. We have noticed improvements
> with this change on systems that are being run with a large number of
> VMs, because of the reduction of IO.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1743249
>
> Title:
>   Failed Deployment after timeout trying to retrieve grub cfg
>
> Status in MAAS:
>   New
> Status in grub2 package in Ubuntu:
>   Fix Released
>
> Bug description:
>   A node failed to deploy after it failed to retrieve a grub.cfg from
>   MAAS due to a timeout.  In the logs, it's clear that the server tried
>   to retrieve the grub cfg many times, over about 30 seconds:
>
>   http://paste.ubuntu.com/26387256/
>
>   We see the same thing for other hosts around the same time:
>
>   http://paste.ubuntu.com/26387262/
>
>   It seems like MAAS is taking way too long to respond to these
>   requests.
>
>   This is very similar to bug 1724677, which was happening
>   pre-Meltdown/Spectre. The only difference is we don't see "[critical]
>   TFTP back-end failed" in the logs anymore.
>
>   I connected to the console on this system and it had errors about
>   timing out retrieving the grub-cfg, then it had an error message along
>   the lines of "error not an ip" and then "double free".  After I
>   connected, but before I could get a screenshot, the system rebooted and
>   was directed by MAAS to power off, which it did successfully after
>   booting to Linux.
>
>   Full logs are available here:
>   https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
>
>   This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions

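For readers following the thread, the caching Blake describes in the quoted message (the search by purpose, the file read, and the template parse, all done once instead of on every request) can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual MAAS change: `string.Template` stands in for tempita so the sketch runs with the standard library alone, and the file names, `find_template_path`, `get_template`, `render`, and `demo` are all hypothetical.

```python
import functools
import string
import tempfile
from pathlib import Path

# Counts how often the expensive path (search + read + parse) really runs.
PARSE_CALLS = 0


def find_template_path(template_dir, purpose):
    """Search for a purpose-specific template, falling back to a default.

    The file naming scheme here is an assumption made for illustration.
    """
    for name in ("config.%s.template" % purpose, "config.template"):
        candidate = template_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError("no template for purpose %r" % purpose)


@functools.lru_cache(maxsize=None)
def get_template(template_dir, purpose):
    """Do the search, the read, and the parse once per (dir, purpose)."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    path = find_template_path(template_dir, purpose)
    # string.Template is a stand-in for the tempita parse step.
    return string.Template(path.read_text())


def render(template_dir, purpose, **params):
    """Per-request work shrinks to a substitution on a cached object."""
    return get_template(template_dir, purpose).substitute(**params)


def demo():
    """Render the same template three times and report the parse count."""
    global PARSE_CALLS
    PARSE_CALLS = 0
    get_template.cache_clear()
    with tempfile.TemporaryDirectory() as tmp:
        template_dir = Path(tmp)
        (template_dir / "config.template").write_text("linux $kernel\n")
        output = ""
        for _ in range(3):
            output = render(template_dir, "install", kernel="vmlinuz")
    return output, PARSE_CALLS
```

Calling `demo()` renders three times but goes through the search, read, and parse path only once. The trade-off of a cache like this is that an edited template file is not picked up until `get_template.cache_clear()` is called or the process restarts. It also says nothing by itself about the before and after numbers asked for at the top of this message; those would still have to be measured on a real cluster.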

