On Mon, Feb 5, 2018 at 3:45 PM, Andres Rodriguez
<andres...@ubuntu-pe.org> wrote:
> @Jason,
>
> On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs <jason.ho...@canonical.com>
> wrote:
>
>> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
>> <andres...@ubuntu-pe.org> wrote:
>> > No new data was provided to mark this New in MAAS:
>> >
>> > 1. Changes to the storage seem to have improved things
>>
>> Yes, it has.  That doesn't change whether or not there is a bug in
>> MAAS.  Can you please address the critical log errors that I mentioned
>> in comment #36?  This seems like enough to establish something is
>> going wrong in MAAS.
>>
>
> The tftp issue shows no evidence this is causing any booting failures. We
> have seen this issue before and confirmed that it doesn't cause boot
> issues. See [1]. If you want to try it, it is available in
> ppa:maas/proposed.
>
> [1].https://bugs.launchpad.net/maas/+bug/1376483

It's been "Fixed" multiple times before, in your link above and also
in bug 1724677, but we still see these errors, suspiciously close to
the times of the failures.  I'm not convinced they are actually
understood.  Do you have a specific commit, or some idea of what
change addresses these in 2.4?

> As far as the postgresql logs with "maas@maasdb ERROR: could not serialize
> access due to concurrent update", that is *not* a bug in MAAS or an issue.
> Those are perfectly normal messages given the isolation level the MAAS DB is
> running with. It basically means one transaction tried to update the
> db while another was updating it, and MAAS already handles this by
> retrying.

That is just one type of db error in the log.  There are many more.
Here's one that says a deadlock was detected. That's not normal,
expected behavior, is it?

http://paste.ubuntu.com/26527181/
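
(For reference, the retry handling you describe for serialization
failures usually looks something like the sketch below -- rough
psycopg2-style Python, not MAAS's actual code.  Note that "deadlock
detected" is SQLSTATE 40P01, a different error from "could not
serialize access" at 40001, even though both sit in the
transaction-rollback class.)

    import time
    from psycopg2.extensions import TransactionRollbackError

    def run_with_retries(conn, work, attempts=5):
        """Retry a transaction aborted by a serialization failure or a
        deadlock (SQLSTATE class 40).  Sketch only, not MAAS code."""
        for attempt in range(attempts):
            try:
                with conn:                      # commit / rollback
                    with conn.cursor() as cur:
                        return work(cur)
            except TransactionRollbackError:
                # covers 40001 (serialization) and 40P01 (deadlock)
                time.sleep(0.1 * (attempt + 1))
        raise RuntimeError("gave up after %d attempts" % attempts)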

>> > 2. No tests have been run with fixed grub that have caused boot
>> failures.
>>
>> The comments from #56 were testing with the fixed grub - sorry if that
>> wasn't clear.
>>
>> > 3. AFAIK, the VM config has not changed to use less CPU to compare
>> results and whether this config change causes the bugs in question.
>>
>> The CPU load data from comments #48 and #50 shows that CPU load is not
>> the problem.  The max load average was under 12 on a 20 thread system.
>> That means there was lots of free CPU time, and that this workload is
>> not CPU bound.
>>
>
> CPU load is not CPU utilization. We know that at the time there are 6 other
> VMs with 150%+ CPU usage writing to the disk because they are being
> deployed and/or configured (e.g. software installation).  Correct me if I'm
> wrong, but this can cause whatever is writing to disk to be prioritized
> over anything else, like the MAAS processes' access to resources.

The 150%+ number you are seeing means that process is using all of 1
core (hyperthread) and 50% of another (it's a multithreaded process).
This does not mean the process is using 150% of the entire CPU
capacity.

We don't just have load average - we also have a breakdown of CPU
utilization from top, every 5 seconds:

%Cpu(s): 23.5 us,  6.3 sy,  0.0 ni, 62.9 id,  7.2 wa,  0.0 hi,  0.1 si,  0.0 st

The top man page has more to say about this line, but have a look at
the 'id' number. It's the % of cpu time spent in the idle process
(nothing to do) in the sample period (5 seconds in the above logs).

The lowest that number ever goes in the logs I posted is 52%, meaning
that over any 5-second period we never use more than half of the
available CPU capacity.
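
(Quick way to sanity-check that from the top output -- a throwaway
snippet, assuming top's default '%Cpu(s):' field order:)

    import re

    def busy_pct(cpu_line):
        """Total non-idle CPU % from a top '%Cpu(s):' line."""
        idle = float(re.search(r"([\d.]+) id", cpu_line).group(1))
        return round(100.0 - idle, 1)

    sample = ("%Cpu(s): 23.5 us,  6.3 sy,  0.0 ni, 62.9 id,  "
              "7.2 wa,  0.0 hi,  0.1 si,  0.0 st")
    print(busy_pct(sample))   # 37.1 -- well under half the machine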

> That being said, because CPU load doesn't look high we are making the
> *assumption* that it is not impacting MAAS, but again, this is an
> assumption. Making the requested change of having at least 4 CPUs (ideally
> 6) would allow us to determine what the effects are, see whether
> there's any difference in behavior, and help identify what other
> issues there are.
>
> Without that comparison, we are making it more difficult to
> isolate the problem.

To improve performance, the typical pattern is: 1) identify the
bottleneck, 2) eliminate it as the bottleneck, 3) repeat.

We have not identified CPU as a bottleneck.  The top data says it is
not!

In the absence of data showing the CPU as being the bottleneck,
reducing CPU usage doesn't help identify the performance blocker,
because it may just move the bottleneck.  For example, it may cause
the processes that are doing disk I/O to not get scheduled to run as
much, which may then reduce the amount of disk I/O they can do, which
may alleviate the issue, but not because MAAS was CPU starved before
and now isn't.  Better to reduce the storage contention in the first
place, if the data shows that storage contention is the bottleneck.

In this case we had data from iotop that indicated storage contention
as the bottleneck, and reducing it seems to have alleviated the
problem, as we haven't hit the failure since then. We're going to take
further steps soon to reduce storage contention, by making sure
MAAS/postgres are the only things using that bcache set.

We still don't know why storage contention bottlenecks MAAS, and
that's where instrumenting MAAS to show where it's getting hung up
would help.
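
Something as simple as the sketch below, wrapped around the suspect
TFTP and database paths, would tell us which calls stall when the
disks are busy.  (Hypothetical helper, not existing MAAS code.)

    import logging
    import time
    from contextlib import contextmanager

    log = logging.getLogger("maas.timing")

    @contextmanager
    def report_if_slow(label, threshold=1.0):
        # Log any wrapped operation slower than `threshold` seconds.
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed = time.monotonic() - start
            if elapsed > threshold:
                log.warning("%s took %.2fs", label, elapsed)

    # e.g. around the code that answers a grub.cfg request:
    # with report_if_slow("generate grub cfg"):
    #     ...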

Jason

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1743249

Title:
  Failed Deployment after timeout trying to retrieve grub cfg
