Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Andres Rodriguez Tue, 06 Feb 2018 16:01:43 -0800

That's what we did to test the difference. For the wider audience: this
patch was tested on a 4-core NUC with an SSD, deploying 6 VMs while 4 other
nodes were PXE booting from MAAS.


Before the fix we saw:
1. The client would make multiple requests for the same file.
2. MAAS would run up to 3 DB requests for the node object used to render
the config.
3. We inspected why we had 3 DB requests for the same config.

From this behavior, we determined that the rack queries the region, obtains
the object, and then takes a while to generate the config and return it to
the client. Before the rack returns it, the client makes another request,
and that causes another DB query. This confirmed that the collapsing works
as expected, but only in the region/rack communication: once the rack has
already received a response, it treats a new request as a new DB query.

With the fix we saw:
1. The client would make multiple requests for the same file.
2. MAAS would always perform 1 DB request for the node object to render
the config.

With this, we were able to identify that the rack was taking too long to
answer the client, so that if a new request came in, it was treated as a
new request that was served by the region. With the changes, the rack
responds faster: MAAS collapses the multiple requests and responds in a
timely fashion, before a further request can cause another query to the DB.
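
For readers unfamiliar with request collapsing, here is a minimal sketch of
the idea in Python. It is not the MAAS code (MAAS is built on Twisted and
its implementation differs); RequestCollapser, slow_fetch and the 0.2 s
delay are hypothetical names and values chosen only for illustration.
Concurrent requests for the same key share one backend call, but a request
that arrives after that call has completed starts a new one.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class RequestCollapser:
    """Share one in-flight backend call among concurrent callers of a key."""

    def __init__(self, fetch):
        self._fetch = fetch        # hypothetical backend call, e.g. a DB query
        self._lock = threading.Lock()
        self._in_flight = {}       # key -> Future of the call in progress

    def get(self, key):
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                # First caller for this key: it will perform the real call.
                future = Future()
                self._in_flight[key] = future
        if leader:
            try:
                future.set_result(self._fetch(key))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                # Once the call is done, later requests are no longer collapsed.
                with self._lock:
                    del self._in_flight[key]
        # Followers block here until the leader has stored a result.
        return future.result()


def demo():
    calls = []

    def slow_fetch(key):
        # Stand-in for a slow query on an IO-starved host (assumed delay).
        calls.append(key)
        time.sleep(0.2)
        return "config for " + key

    collapser = RequestCollapser(slow_fetch)
    # Four clients ask for the same file while the first call is in flight.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(collapser.get, ["grub.cfg"] * 4))
    burst_calls = len(calls)
    # A retry that arrives after the response was sent is a new backend call.
    collapser.get("grub.cfg")
    return burst_calls, len(calls)


if __name__ == "__main__":
    print(demo())
```

The shorter the window between the backend call finishing and the response
reaching the client, the less room there is for a retry to land outside the
collapsed group, which is the effect described above.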

So the fix does improve things for sure, and we believe this is one of the
reasons why this happened while there was IO starvation. That said, it is
not the only thing to improve: there are other sections that need
improvement and, as I said earlier, those involve improving the DB as
well.

On Tue, Feb 6, 2018 at 6:10 PM, Jason Hobbs <jason.ho...@canonical.com>
wrote:

> BTW to be clear here I'm saying I don't think the path forward on
> improving this issue is thinking about how MAAS works and throwing out
> patches that might improve performance here and there.  The path
> forward is to instrument MAAS on a system with slow i/o and to figure
> out exactly where it's getting hung up.
>
> Jason
>
> On Tue, Feb 6, 2018 at 5:09 PM, Jason Hobbs <jason.ho...@canonical.com>
> wrote:
> > dm-delay looks very interesting along those lines.
> >
> > https://www.enodev.fr/posts/emulate-a-slow-block-device-with-dm-delay.html
> >
> > https://www.kernel.org/doc/Documentation/device-mapper/delay.txt
> >
> >> On Tue, Feb 6, 2018 at 5:06 PM, Jason Hobbs <jason.ho...@canonical.com>
> >> wrote:
> >> On Tue, Feb 6, 2018 at 4:50 PM, Andres Rodriguez
> >> <andres...@ubuntu-pe.org> wrote:
> >>> I don't have logs anymore as I have since rebuilt my environment, but I
> >>> can confirm seeing improvements on a MAAS server running with high IO
> >>> (note it was a single region/rack).
> >>>
> >>> See inline:
> >>>
> >>>
> >>> On Tue, Feb 6, 2018 at 5:17 PM, Jason Hobbs <jason.ho...@canonical.com>
> >>> wrote:
> >>>
> >>>> Andres, it was a single test in both cases, and in both cases there
> >>>> was almost no delay from MAAS.  It's not significant enough to call it
> >>>> positive results.
> >>>>
> >>>>
> >>> Comment #93 shows there are /some/ improvements when comparing those
> >>> two samples only, but as I have already said, we need data over time in
> >>> both scenarios to properly compare and determine whether the changes
> >>> make any material performance improvement under the current conditions
> >>> of the samples (both samples are with a fixed IO starvation on the
> >>> environment).
> >>>
> >>>
> >>>> Since neither of you answered yes, I'll assume the answer was no to my
> >>>> question of whether there was anything in my logs or data that showed
> >>>> reading the template from disk on the rack controller was the culprit,
> >>>> and that this fix just represents a guess at what might be causing the
> >>>> delay.
> >>>>
> >>>
> >>> To be fair, your logs do not provide anything concrete to determine
> >>> what the culprit of the issue is on the MAAS side. They provide a lot
> >>> of clues, and we have since determined that those issues were a result
> >>> of IO starvation (from the VMs writing to disk). As such, the only way
> >>> we can *really* see if the patch brings any significant performance
> >>> improvement is to run tests in the environment where you were seeing
> >>> the issues in the first place.
> >>
> >> I didn't think my logs provided anything concrete!  That's because the
> >> logging built into MAAS is not sufficient to do so.
> >>
> >> I can't break that environment to test anymore - we got it working
> >> thanks to you guys' help and it's a production environment that needs
> >> to keep running other tests.
> >>
> >> It might be possible to recreate this on another MAAS server, using
> >> 'stress' or a similar tool to cause disk contention.
> >>
> >> Jason
> >>
> >>> As such, if you are willing to test whether these make any material
> >>> difference, I would unfix your environment and do two runs (one without
> >>> the fix, and one with the fix). That's the only way we can really
> >>> compare and be certain in *your* environment.
> >>>
> >>>>
> >>>> --
> >>>> You received this bug notification because you are subscribed to MAAS.
> >>>> https://bugs.launchpad.net/bugs/1743249
> >>>>
> >>>> Title:
> >>>>   Failed Deployment after timeout trying to retrieve grub cfg
> >>>>
> >>>> To manage notifications about this bug go to:
> >>>> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
> >>>>
> >>>> Launchpad-Notification-Type: bug
> >>>> Launchpad-Bug: product=maas; milestone=2.4.x; status=New;
> >>>> importance=Undecided; assignee=None;
> >>>> Launchpad-Bug: distribution=ubuntu; sourcepackage=grub2; component=main;
> >>>> status=Fix Released; importance=Medium; assignee=mathieu...@gmail.com;
> >>>> Launchpad-Bug-Tags: cdo-qa cdo-qa-blocker foundations-engine patch
> >>>> Launchpad-Bug-Information-Type: Public
> >>>> Launchpad-Bug-Private: no
> >>>> Launchpad-Bug-Security-Vulnerability: no
> >>>> Launchpad-Bug-Commenters: andreserl blake-rouse cgregan janitor
> >>>> jason-hobbs mpontillo vorlon
> >>>> Launchpad-Bug-Reporter: Jason Hobbs (jason-hobbs)
> >>>> Launchpad-Bug-Modifier: Jason Hobbs (jason-hobbs)
> >>>> Launchpad-Message-Rationale: Subscriber (MAAS)
> >>>> Launchpad-Message-For: andreserl
> >>>>
> >>>
> >>>
> >>> --
> >>> Andres Rodriguez (RoAkSoAx)
> >>> Ubuntu Server Developer
> >>> MSc. Telecom & Networking
> >>> Systems Engineer
> >>>
> >>> --
> >>> You received this bug notification because you are subscribed to the
> >>> bug report.
> >>> https://bugs.launchpad.net/bugs/1743249
> >>>
> >>> Title:
> >>>   Failed Deployment after timeout trying to retrieve grub cfg
> >>>
> >>> Status in MAAS:
> >>>   New
> >>> Status in grub2 package in Ubuntu:
> >>>   Fix Released
> >>>
> >>> Bug description:
> >>>   A node failed to deploy after it failed to retrieve a grub.cfg from
> >>>   MAAS due to a timeout.  In the logs, it's clear that the server tried
> >>>   to retrieve the grub cfg many times, over about 30 seconds:
> >>>
> >>>   http://paste.ubuntu.com/26387256/
> >>>
> >>>   We see the same thing for other hosts around the same time:
> >>>
> >>>   http://paste.ubuntu.com/26387262/
> >>>
> >>>   It seems like MAAS is taking way too long to respond to these
> >>>   requests.
> >>>
> >>>   This is very similar to bug 1724677, which was happening
> >>>   pre-meltdown/spectre. The only difference is we don't see "[critical]
> >>>   TFTP back-end failed" in the logs anymore.
> >>>
> >>>   I connected to the console on this system and it had errors about
> >>>   timing out retrieving the grub-cfg, then it had an error message along
> >>>   the lines of "error not an ip" and then "double free".  After I
> >>>   connected but before I could get a screenshot the system rebooted and
> >>>   was directed by maas to power off, which it did successfully after
> >>>   booting to linux.
> >>>
> >>>   Full logs are available here:
> >>>   https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
> >>>
> >>>   This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
> >>>
> >>> To manage notifications about this bug go to:
> >>> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
>


-- 
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1743249

Title:
  Failed Deployment after timeout trying to retrieve grub cfg

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
