I'm spawning a new thread on this topic -- Does anyone have first-hand experience with VMware's FT technology, good or bad? And if so, would you be willing/able to share details or at least whether this was vSphere 4 or 5?
Mike

On Sat, Sep 8, 2012 at 12:09 AM, <[email protected]> wrote:
> On Fri, 7 Sep 2012, Michael Ryder wrote:
>
>> David, I don't dispute your information, but I'm curious -- are these
>> your actual experiences with VMware FT? Have you seen vLockstep fail
>> in any way?
>
> No, this is not experience; this is reasoning through the technology's
> limitations. It takes time to get the data from one system to the
> other, and during that time they will be out of sync.
>
> If you were willing to accept the horrible slowdown of sending all the
> data to the other side and getting confirmation that it had the data
> safely before passing it to the local VM (think two-phase commit on
> every piece of I/O, every network packet, every clock tick, etc.), you
> would come very close to what they are claiming, but the overhead of
> doing that would cripple your system performance, even over a fast,
> local network.
>
> Sending the data over and replaying it as it arrives, without waiting
> for the confirmation, will still add a significant amount of load to
> the system.
>
> Even so, you had better hope you don't do too much disk I/O, since it
> all needs to be sent over the network as well, and local disks are
> faster than networks.
>
> I had a system at work a few years ago that had multiple motherboards
> in it, with their clocks hardware-synced together to provide "ultimate"
> reliability. The thing cost a fortune (well over 20x what a single box
> with the same spec cost), and the list of disclaimers about corner
> cases that almost, or usually, worked was impressive. That system had
> the boards just a few inches apart, with very specialized hardware to
> tie them together.
>
> VMware is claiming better results with commodity hardware over normal
> Ethernet networks (not even requiring super-low-latency network
> hardware like InfiniBand or better).
>
> I've read too much about the problems that OS developers have in
> defining 'now' and making operations or changes 'simultaneous' even
> across different CPUs in a single system to believe that perfect
> replication across different systems is possible. Modern OSes go to
> great effort to try to relax any such 'simultaneous effect'
> requirements even within different cores on the same CPU die because
> of the performance implications.
>
> FT is the right answer for some problems, and it's a very attractive
> answer for many problems. But it can't be perfect. I don't mind
> imperfect solutions (they are all imperfect), but I want to know where
> the edges are.
>
> David Lang
>
>> FT goes beyond the capabilities of Microsoft clustering, for example,
>> which will *completely* lose all I/O and memory state being handled
>> by the primary node; an interruption of the application servicing is
>> guaranteed. In VMware FT, all non-deterministic I/O is logged and
>> shipped to the secondary. All I/O outputs are duplicated by the
>> secondary (though ignored by the secondary's hypervisor), and
>> continued functioning of the application is guaranteed. In a properly
>> configured VMware cluster, there is no data loss whatsoever.
>>
>> VMware FT is dependent on low latency -- less than 1ms is pretty much
>> required, though maybe a few ms wouldn't concern most applications --
>> that would affect how quickly the client computers are able to
>> reconnect to the FT-protected VM. Going cross-country with VMware FT
>> would definitely be a no-no, and I would be concerned about anyone
>> saying that such a configuration is viable.
>> (If memory serves, typical latency from NJ to CA is ~50-100ms on a
>> 1Gb segment.)
>>
>> There are definitely use cases where VMware FT is not going to work --
>> if your app requires multiple cores, for example.
>>
>> Mike
>>
>> On Fri, Sep 7, 2012 at 9:03 PM, <[email protected]> wrote:
>>
>>> Executing the exact same x86 instruction sequences will give you
>>> different random numbers.
>>>
>>> Timing differences will result in different results when there are
>>> race conditions, etc.
>>>
>>> Non-deterministic events (input, I/O, etc.) are copied from one
>>> machine to the other and replayed. There is a window here where the
>>> primary is in a different state than the backup; if the failure
>>> happens to hit at this time, you can be in trouble.
>>>
>>> This is still far better than trying to mirror the state of the
>>> system.
>>>
>>> I'm also not disputing that there are cases where this is your only
>>> (or even best) choice due to software not doing clustering well.
>>>
>>> I'm just claiming that it's not going to be perfect the way that the
>>> marketing material makes it sound.
>>>
>>> This may not be a problem for you. It all depends on what the impact
>>> will be if it's not perfect.
>>>
>>> For a webserver, the worst case could be that some connections get
>>> broken in the transition; not a big deal.
>>>
>>> But I had an application that was responsible for changing passwords
>>> on production systems that was claiming that this sort of thing
>>> provided perfect reliability (even between a primary and backup
>>> system running on the east and west coasts of the US). In that
>>> situation the worst case would have been having the root password
>>> changed to something that nothing and nobody knows, resulting in an
>>> unmaintainable system. In that case the cost was not acceptable
>>> (especially since competing systems did real clustering and were not
>>> dependent on this sort of VM syncing).
>>>
>>> David Lang
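
P.S. To put rough numbers on David's "two-phase commit on every piece
of I/O" point, here is a back-of-the-envelope sketch in Python. The RTT
figures are illustrative assumptions, not measurements of any real FT
deployment:

    # Ceiling on confirmed events/sec for one serial stream when every
    # non-deterministic event must be acknowledged by the peer before
    # the local VM may proceed. RTTs below are assumed for illustration.

    def max_serial_events_per_sec(rtt_ms: float) -> float:
        """Each event costs one round trip before the VM continues."""
        return 1000.0 / rtt_ms

    for label, rtt_ms in [("same-rack LAN", 0.2),
                          ("fast campus network", 1.0),
                          ("NJ to CA, ~80ms", 80.0)]:
        ceiling = max_serial_events_per_sec(rtt_ms)
        print(f"{label}: ~{ceiling:,.0f} confirmed events/sec")

Even on the same-rack number, a VM generating tens of thousands of
loggable events per second would stall constantly, which is David's
"cripple your system performance" case; cross-country, the ceiling
collapses to roughly a dozen events per second.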
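For the record/replay approach David and Mike both describe (ship the
non-deterministic inputs and replay them as they arrive), here is a toy
sketch of the general technique. This is the generic idea only, not
VMware's actual vLockstep protocol; the queue stands in for the logging
channel, and anything still sitting in it when the primary dies is
exactly the divergence window David describes:

    import queue
    import random
    import time

    log = queue.Queue()  # stands in for the link to the secondary

    def primary_read_clock() -> float:
        """Primary consumes a non-deterministic input and logs it."""
        value = time.time()
        log.put(("clock", value))
        return value

    def primary_random() -> int:
        value = random.randint(0, 2**31)
        log.put(("rand", value))
        return value

    def secondary_replay() -> None:
        """Secondary never consults its own clock or RNG; it replays
        the primary's log in order, so both sides compute identically."""
        while not log.empty():
            kind, value = log.get()
            print(f"replayed {kind}: {value}")

    primary_read_clock()
    primary_random()
    # If the primary died right here, whatever had not yet drained from
    # `log` would be state the secondary never sees.
    secondary_replay()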
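Finally, the "local disks are faster than networks" point in numbers --
again, illustrative throughput assumptions rather than benchmarks:

    # If every disk write must also traverse the FT logging link, the
    # link becomes the throughput ceiling. Figures are assumptions.

    gigabit_link_mb_s = 1000 / 8   # ~125 MB/s, best case for 1GbE
    disks = [("single SATA disk", 150),
             ("modest local RAID stripe", 600)]

    for name, disk_mb_s in disks:
        ratio = disk_mb_s / gigabit_link_mb_s
        print(f"{name}: {disk_mb_s} MB/s vs {gigabit_link_mb_s:.0f} "
              f"MB/s link ({ratio:.1f}x the link)")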
