* Michael R. Hines (mrhi...@linux.vnet.ibm.com) wrote:
> On 02/21/2014 12:32 AM, Dr. David Alan Gilbert wrote:
> >
> >I'm happy to use more memory to get FT; all I'm trying to do is see
> >if it's possible to put a lower bound than 2x on it while still
> >maintaining full FT, at the expense of performance in the case where
> >it uses a lot of memory.
> >
> >>The bottom line is: if you put a *hard* constraint on memory usage,
> >>what will happen to the guest when that garbage collection you
> >>mentioned shows up later and runs for several minutes? How about an
> >>hour? Are we just going to block the guest from being allowed to
> >>start a checkpoint until the memory usage goes down, just for the
> >>sake of avoiding the 2x memory usage?
> >
> >Yes, or move to the next checkpoint sooner than the N milliseconds
> >when we see the buffer is getting full.
>
> OK, I see there is definitely some common ground there. So to be more
> specific, what we really need is two things (I've learned that the
> reviewers are very cautious about adding too much policy into QEMU
> itself, but let's iron this out anyway):
>
> 1. First, we need to throttle down the guest (QEMU can already do
>    this using the recently introduced "auto-converge" feature). This
>    means that the guest is still making forward progress, albeit slow
>    progress.
>
> 2. Then we would need some kind of policy, or better yet, a trigger
>    that does something to the effect of "we're about to use a whole
>    lot of checkpoint memory soon - can we afford this much memory
>    usage?". Such a trigger would be conditional on the current policy
>    of the administrator or management software: we would have a QMP
>    command with a boolean flag that says "Yes" or "No" - whether or
>    not it's tolerable to use that much memory in the next checkpoint.
>
>    If the answer is "Yes", then nothing changes.
>    If the answer is "No", then we should either:
>      a) throttle down the guest
>      b) adjust the checkpoint frequency
>      c) or pause it altogether while we migrate some other VMs off
>         the host such that we can complete the next checkpoint in its
>         entirety.

Yes I think so, although what I was thinking was mainly (b), possibly
to the point of not starting the next checkpoint.

> It's not clear to me how much (if any) of this control loop should be
> in QEMU or in the management software, but I would definitely agree
> that at a minimum the ability to detect the situation and remedy the
> situation should be in QEMU. I'm not entirely convinced that the
> ability to *decide* to remedy the situation should be in QEMU,
> though.

Management software access is low-frequency and high-latency; it
should be setting general parameters (max memory allowed, desired
checkpoint frequency etc.), but I don't see that we can use it to do
anything on a timescale shorter than a few seconds. So yes, it can
monitor things and tweak the knobs if it sees the host as a whole is
getting tight on RAM etc. - but we can't rely on it to throw on the
brakes if this guest suddenly decides to take bucketloads of RAM;
something has to react quickly in relation to previously set limits,
along the lines of the sketch below.
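Purely as an illustration - none of these structures or names exist in
QEMU or in the MC patches, and the thresholds are invented - the check
I have in mind is something cheap enough to run each time data is
staged into the checkpoint buffer:

  /* Illustrative sketch only: hypothetical limits consulted whenever
   * data is added to the checkpoint buffer. */

  #include <stdint.h>

  struct mc_limits {
      uint64_t hard_buffer_bytes;   /* cap set by management software */
      uint64_t soft_buffer_bytes;   /* e.g. 75% of the cap            */
  };

  enum mc_action {
      MC_CONTINUE,          /* keep filling the current epoch         */
      MC_CHECKPOINT_NOW,    /* cut the epoch short, send what we have */
      MC_THROTTLE_GUEST,    /* auto-converge-style guest slowdown     */
  };

  static enum mc_action mc_check_limits(uint64_t buffered_bytes,
                                        const struct mc_limits *lim)
  {
      if (buffered_bytes >= lim->hard_buffer_bytes) {
          /* At the hard limit: stop growing the buffer and slow the
           * guest until the outstanding checkpoint has been acked. */
          return MC_THROTTLE_GUEST;
      }
      if (buffered_bytes >= lim->soft_buffer_bytes) {
          /* Getting full: start the next checkpoint earlier than the
           * normal N-millisecond epoch would. */
          return MC_CHECKPOINT_NOW;
      }
      return MC_CONTINUE;
  }

The management layer would only touch the two limits occasionally (and
could change them at runtime); the per-update check is the thing that
reacts within an epoch.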
> >>If you block the guest from being checkpointed, then what happens
> >>if there is a failure during that extended period? We will have
> >>saved memory at the expense of availability.
> >
> >If the active machine fails during this time then the secondary
> >carries on from its last good snapshot, in the knowledge that the
> >active never finished the new snapshot and so never uncorked its
> >previous packets.
> >
> >If the secondary machine fails during this time then the active
> >drops its nascent snapshot and carries on.
>
> Yes, that makes sense. Where would that policy go, though, continuing
> the above concern?

I think there has to be some input from the management layer for
failover, because (as per my split-brain concerns) something has to
make the decision about which of the source/destination is to take
over, and I don't believe the individual instances have that
information.

> >However, what you have made me realise is that I don't have an
> >answer for the memory usage on the secondary; while the primary can
> >pause its guest until the secondary acks the checkpoint, the
> >secondary has to rely on the primary not to send it huge
> >checkpoints.
>
> Good question: there are a lot of ideas out there in the academic
> community - compress the secondary, push the secondary to a
> flash-based device, or de-duplicate the secondary. I'm sure any of
> them would put a dent in the problem, but I'm not seeing a
> smoking-gun solution that would absolutely save all that memory
> completely.

Ah, I was thinking that flash would be a good solution for the
secondary; it would be a nice demo.

> (Personally, I don't believe in swap. I wouldn't even consider swap
> or any kind of traditional disk-based remedy to be a viable
> solution).

Well, it certainly exists - I've seen it! Swap works well in limited
circumstances; but as soon as you've got multiple VMs fighting over
something with tens of milliseconds of latency you're doomed.

> >>The customer that is expecting 100% fault tolerance and the
> >>provider who is supporting it need to have an understanding that
> >>fault tolerance is not free and that constraining memory usage will
> >>adversely affect the VM's ability to be protected.
> >>
> >>Do I understand your expectations correctly? Is fault tolerance
> >>something you're willing to sacrifice?
> >
> >As above, no, I'm willing to sacrifice performance but not fault
> >tolerance. (It is entirely possible that others would want the other
> >trade-off, i.e. below some minimum performance it's worse than
> >useless, so if we can't maintain that performance then dropping FT
> >leaves us in a more-working position).
>
> Agreed - I think a "proactive" failover in this case would solve the
> problem. If we observed that availability/fault tolerance was going
> to be at risk soon (which is relatively easy to detect), we could
> just *force* a failover to the secondary host and restart the
> protection from scratch.
>
> >>Well, that's simple: if there is a failure of the source, the
> >>destination will simply revert to the previous checkpoint using the
> >>same mode of operation. The lost ACKs that you're curious about
> >>only apply to the checkpoint that is in progress. Just because a
> >>checkpoint is in progress does not mean that the previous
> >>checkpoint is thrown away - it is already loaded into the
> >>destination's memory and ready to be activated.
> >
> >I still don't see why, if the link between them fails, the
> >destination doesn't fall back to its previous checkpoint, AND the
> >source carries on running - I don't see how they can differentiate
> >which of them has failed.
>
> I think you're forgetting that the source I/O is buffered - it
> doesn't matter that the source VM is still running. As long as its
> output is buffered, it cannot have any non-fault-tolerant effect on
> the outside world.
>
> In the future, if a technician accesses the machine or the network is
> restored, the management software can terminate the stale source
> virtual machine.

Going with my comment above: I'm working on the basis that it's just
as likely for the destination to fail as it is for the source to fail,
and a destination failure shouldn't kill the source; so in the case of
a destination failure the source is going to have to let its buffered
I/Os start going again.

> >>We have a script architecture (not on github) which runs MC in a
> >>tight loop hundreds of times, kills the source QEMU and timestamps
> >>how quickly the destination QEMU loses the TCP socket connection
> >>and receives an error code from the kernel - every single time, the
> >>destination resumes nearly instantaneously. I've not empirically
> >>seen a case where the socket just hangs or doesn't change state.
> >>
> >>I'm not very familiar with the internal linux TCP/IP stack
> >>implementation itself, but I have not had a problem with the
> >>dependability of the linux socket layer shutting the socket down as
> >>soon as possible.
> >
> >OK, that only covers a very small range of normal failures.
> >When you kill the destination QEMU the host OS knows that QEMU is
> >dead and sends a packet back closing the socket, hence the source
> >knows the destination is dead very quickly.
> >If:
> >  a) the destination machine were to lose power or hang
> >  b) or a network link were to fail (other than the one attached to
> >     the source, possibly)
> >
> >the source would have to do a full TCP timeout.
> >
> >To test (a) and (b) I'd use an iptables rule somewhere to cause the
> >packets to be dropped (not rejected). Stopping the qemu in gdb might
> >be good enough.
>
> Very good idea - I'll add that to the "todo" list of things to do in
> my test infrastructure. It may indeed turn out to be necessary to add
> a formal keepalive between the source and destination.
>
> - Michael
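For (a) and (b), one cheap first step - just a sketch, this isn't in
the MC patches and the numbers are made up - would be to turn on TCP
keepalive on the checkpoint socket, so a dead peer is noticed after a
few seconds of idle rather than after the full retransmission timeout:

  /* Sketch only: enable aggressive TCP keepalive on a connected
   * socket so the peer's death is detected after roughly
   * idle + cnt * intvl seconds instead of the default timeouts. */

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  static int mc_enable_fast_keepalive(int fd)
  {
      int on = 1, idle = 5, intvl = 1, cnt = 3;   /* invented values */

      if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
          return -1;
      }
      /* Linux-specific knobs; other platforms spell these differently. */
      if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
          setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
          setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0) {
          return -1;
      }
      return 0;
  }

Keepalive probes only fire when the connection is idle, though; for the
case where the source is mid-stream, something like TCP_USER_TIMEOUT
(or an application-level ping in the MC protocol itself) would be
needed as well.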
Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK