* Michael R. Hines (mrhi...@linux.vnet.ibm.com) wrote:
> On 02/21/2014 12:32 AM, Dr. David Alan Gilbert wrote:
> >
> >I'm happy to use more memory to get FT; all I'm trying to do is see
> >if it's possible to put a lower bound than 2x on it while still
> >maintaining full FT, at the expense of performance in the case where
> >it uses a lot of memory.
> >
> >>The bottom line is: if you put a *hard* constraint on memory usage,
> >>what will happen to the guest when that garbage collection you
> >>mentioned shows up later and runs for several minutes? How about an
> >>hour? Are we just going to block the guest from being allowed to
> >>start a checkpoint until the memory usage goes down, just for the
> >>sake of avoiding the 2x memory usage?
> >
> >Yes, or move to the next checkpoint sooner than the N milliseconds
> >when we see the buffer is getting full.
>
> OK, I see there is definitely some common ground there. So to be more
> specific, what we really need is two things (I've learned that the
> reviewers are very cautious about adding too much policy into QEMU
> itself, but let's iron this out anyway):
>
> 1. First, we need to throttle down the guest (QEMU can already do
>    this using the recently introduced "auto-converge" feature). This
>    means that the guest is still making forward progress, albeit slow
>    progress.
>
> 2. Then we would need some kind of policy, or better yet, a trigger
>    that does something to the effect of "we're about to use a whole
>    lot of checkpoint memory soon - can we afford this much memory
>    usage?". Such a trigger would be conditional on the current policy
>    of the administrator or management software: we would have a QMP
>    command with a boolean flag that says "Yes" or "No" - whether or
>    not it's tolerable to use that much memory in the next checkpoint.
>
>    If the answer is "Yes", then nothing changes.
>    If the answer is "No", then we should either:
>      a) throttle down the guest
>      b) adjust the checkpoint frequency
>      c) or pause it altogether while we migrate some other VMs off
>         the host such that we can complete the next checkpoint in its
>         entirety.

Yes I think so, although what I was thinking was mainly (b), possibly
to the point of not starting the next checkpoint.

> It's not clear to me how much (if any) of this control loop should be
> in QEMU or in the management software, but I would definitely agree
> that at a minimum the ability to detect the situation and remedy the
> situation should be in QEMU. I'm not entirely convinced that the
> ability to *decide* to remedy the situation should be in QEMU,
> though.

Management software access is low-frequency and high-latency; it
should be setting general parameters (max memory allowed, desired
checkpoint frequency etc.), but I don't see that we can use it to do
anything on a timescale shorter than a few seconds. So yes, it can
monitor things and tweak the knobs if it sees the host as a whole is
getting tight on RAM etc. - but we can't rely on it to throw on the
brakes if this guest suddenly decides to take bucketloads of RAM;
something has to react quickly in relation to previously set limits,
along the lines of the sketch below.
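Purely as an illustration - none of these structures or names exist in
QEMU or in the MC patches, and the thresholds are invented - the check
I have in mind is something cheap enough to run each time data is
staged into the checkpoint buffer:

  /* Illustrative sketch only: hypothetical limits consulted whenever
   * data is added to the checkpoint buffer. */

  #include <stdint.h>

  struct mc_limits {
      uint64_t hard_buffer_bytes;   /* cap set by management software */
      uint64_t soft_buffer_bytes;   /* e.g. 75% of the cap            */
  };

  enum mc_action {
      MC_CONTINUE,          /* keep filling the current epoch         */
      MC_CHECKPOINT_NOW,    /* cut the epoch short, send what we have */
      MC_THROTTLE_GUEST,    /* auto-converge-style guest slowdown     */
  };

  static enum mc_action mc_check_limits(uint64_t buffered_bytes,
                                        const struct mc_limits *lim)
  {
      if (buffered_bytes >= lim->hard_buffer_bytes) {
          /* At the hard limit: stop growing the buffer and slow the
           * guest until the outstanding checkpoint has been acked. */
          return MC_THROTTLE_GUEST;
      }
      if (buffered_bytes >= lim->soft_buffer_bytes) {
          /* Getting full: start the next checkpoint earlier than the
           * normal N-millisecond epoch would. */
          return MC_CHECKPOINT_NOW;
      }
      return MC_CONTINUE;
  }

The management layer would only touch the two limits occasionally (and
could change them at runtime); the per-update check is the thing that
reacts within an epoch.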
> >>If you block the guest from being checkpointed, then what happens
> >>if there is a failure during that extended period? We will have
> >>saved memory at the expense of availability.
> >
> >If the active machine fails during this time then the secondary
> >carries on from its last good snapshot, in the knowledge that the
> >active never finished the new snapshot and so never uncorked its
> >previous packets.
> >
> >If the secondary machine fails during this time then the active
> >drops its nascent snapshot and carries on.
>
> Yes, that makes sense. Where would that policy go, though, continuing
> the above concern?

I think there has to be some input from the management layer for
failover, because (as per my split-brain concerns) something has to
make the decision about which of the source/destination is to take
over, and I don't believe the individual instances have that
information.

> >However, what you have made me realise is that I don't have an
> >answer for the memory usage on the secondary; while the primary can
> >pause its guest until the secondary acks the checkpoint, the
> >secondary has to rely on the primary not to send it huge
> >checkpoints.
>
> Good question: there are a lot of ideas out there in the academic
> community - compress the secondary, push the secondary to a
> flash-based device, or de-duplicate the secondary. I'm sure any of
> them would put a dent in the problem, but I'm not seeing a
> smoking-gun solution that would absolutely save all that memory
> completely.

Ah, I was thinking that flash would be a good solution for the
secondary; it would be a nice demo.

> (Personally, I don't believe in swap. I wouldn't even consider swap
> or any kind of traditional disk-based remedy to be a viable
> solution).

Well, it certainly exists - I've seen it! Swap works well in limited
circumstances; but as soon as you've got multiple VMs fighting over
something with tens of milliseconds of latency you're doomed.

> >>The customer that is expecting 100% fault tolerance and the
> >>provider who is supporting it need to have an understanding that
> >>fault tolerance is not free and that constraining memory usage will
> >>adversely affect the VM's ability to be protected.
> >>
> >>Do I understand your expectations correctly? Is fault tolerance
> >>something you're willing to sacrifice?
> >
> >As above, no, I'm willing to sacrifice performance but not fault
> >tolerance. (It is entirely possible that others would want the other
> >trade-off, i.e. below some minimum performance it's worse than
> >useless, so if we can't maintain that performance then dropping FT
> >leaves us in a more-working position).
>
> Agreed - I think a "proactive" failover in this case would solve the
> problem. If we observed that availability/fault tolerance was going
> to be at risk soon (which is relatively easy to detect), we could
> just *force* a failover to the secondary host and restart the
> protection from scratch.
>
> >>Well, that's simple: if there is a failure of the source, the
> >>destination will simply revert to the previous checkpoint using the
> >>same mode of operation. The lost ACKs that you're curious about
> >>only apply to the checkpoint that is in progress. Just because a
> >>checkpoint is in progress does not mean that the previous
> >>checkpoint is thrown away - it is already loaded into the
> >>destination's memory and ready to be activated.
> >
> >I still don't see why, if the link between them fails, the
> >destination doesn't fall back to its previous checkpoint, AND the
> >source carries on running - I don't see how they can differentiate
> >which of them has failed.
>
> I think you're forgetting that the source I/O is buffered - it
> doesn't matter that the source VM is still running. As long as its
> output is buffered, it cannot have any non-fault-tolerant effect on
> the outside world.
>
> In the future, if a technician accesses the machine or the network is
> restored, the management software can terminate the stale source
> virtual machine.

Going with my comment above: I'm working on the basis that it's just
as likely for the destination to fail as it is for the source to fail,
and a destination failure shouldn't kill the source; so in the case of
a destination failure the source is going to have to let its buffered
I/Os start going again.

> >>We have a script architecture (not on github) which runs MC in a
> >>tight loop hundreds of times, kills the source QEMU and timestamps
> >>how quickly the destination QEMU loses the TCP socket connection
> >>and receives an error code from the kernel - every single time, the
> >>destination resumes nearly instantaneously. I've not empirically
> >>seen a case where the socket just hangs or doesn't change state.
> >>
> >>I'm not very familiar with the internal linux TCP/IP stack
> >>implementation itself, but I have not had a problem with the
> >>dependability of the linux socket layer shutting the socket down as
> >>soon as possible.
> >
> >OK, that only covers a very small range of normal failures.
> >When you kill the destination QEMU the host OS knows that QEMU is
> >dead and sends a packet back closing the socket, hence the source
> >knows the destination is dead very quickly.
> >If:
> >  a) the destination machine were to lose power or hang
> >  b) or a network link were to fail (other than the one attached to
> >     the source, possibly)
> >
> >the source would have to do a full TCP timeout.
> >
> >To test (a) and (b) I'd use an iptables rule somewhere to cause the
> >packets to be dropped (not rejected). Stopping the qemu in gdb might
> >be good enough.
>
> Very good idea - I'll add that to the "todo" list of things to do in
> my test infrastructure. It may indeed turn out to be necessary to add
> a formal keepalive between the source and destination.
>
> - Michael
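For (a) and (b), one cheap first step - just a sketch, this isn't in
the MC patches and the numbers are made up - would be to turn on TCP
keepalive on the checkpoint socket, so a dead peer is noticed after a
few seconds of idle rather than after the full retransmission timeout:

  /* Sketch only: enable aggressive TCP keepalive on a connected
   * socket so the peer's death is detected after roughly
   * idle + cnt * intvl seconds instead of the default timeouts. */

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  static int mc_enable_fast_keepalive(int fd)
  {
      int on = 1, idle = 5, intvl = 1, cnt = 3;   /* invented values */

      if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
          return -1;
      }
      /* Linux-specific knobs; other platforms spell these differently. */
      if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
          setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
          setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0) {
          return -1;
      }
      return 0;
  }

Keepalive probes only fire when the connection is idle, though; for the
case where the source is mid-stream, something like TCP_USER_TIMEOUT
(or an application-level ping in the MC protocol itself) would be
needed as well.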
Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK