Re: Let's Consider Having Three Separated Environments in Infra

2016-07-20 Thread Anton Marchukov
On Wed, Jul 20, 2016 at 9:43 PM, Yedidyah Bar David  wrote:
>
> Not sure it's my business, but whatever:
>

It is. I think it is up to everybody who is interested in making oVirt
infra more reliable and easier to maintain.


> 1. Do you also intend to have separate data centers, such that if one (or
> two) of them loses connectivity/power/etc., we are still up?
>

I would like to, but so far we have only one physical data center. The only
mitigation available to us now is offsite backups and offsite mirrors. We
have some of those now and are working on improving the rest.


>
> 2. If so, does it mean that recreating it means copying from one of the
> others many GBs of data? And if so, is that also the plan for recovering
> from bad tests?
>

If they are completely removed, then yes, it does. But that should not be a
problem unless new data arrives faster than the sync can catch up on the old
data. This is not the case for our infra, so eventually it will sync up.

> 3. If so, it probably means we'll not do that very happily, because
> "undo" will take a lot of time and bandwidth.
>

The good thing about having 3 instances is that you can give one instance
even days to sync up its data if needed, while the whole setup stays in a
reliable state. So I am not sure about "happily", but with such a
configuration I would call it pretty stress-free. Also, the only way to get
good at something is, well, to do it. So if it is not a happy process yet,
we should make it one.


> 4. If we still want to go that route, perhaps consider having per-site
> backup, which allows syncing from the others the changes done on them
> since X (where X is "death of power/connectivity", "Start of test", etc).
> Some time ago I looked at backup tools, and found out that while there are
> several "field tested" tools, such as bacula and amanda, they are considered
> old-fashioned, but there are several different contenders for the future
> "perfect" tool. For an overview of some of them see [1]. For my own uses
> I chose 'bup', which isn't perfect, but seemed good and stable enough.
>

We are considering both onsite and offsite backups. The thing is that backups
are a separate concern, because any replicating system will happily replicate
all the errors you have to all the instances, and good systems will do it
very fast. So you essentially need both.
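
To make the "replication copies errors too" point concrete, here is a toy
sketch. The class, its method names and the "repo-index" key are all
hypothetical and only model the idea; this is not how any of our services
actually store data.

```python
import copy


class ReplicatedStore:
    """Toy store: every write is copied to all replicas immediately."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.backups = []  # point-in-time copies, kept apart from the replicas

    def write(self, key, value):
        # Replication does not judge the data: a bad value spreads everywhere.
        for replica in self.replicas:
            replica[key] = value

    def take_backup(self):
        # A backup freezes the state as it was at this moment.
        self.backups.append(copy.deepcopy(self.replicas[0]))

    def restore_latest_backup(self):
        # Reset every replica to the most recent backup.
        snapshot = self.backups[-1]
        self.replicas = [copy.deepcopy(snapshot) for _ in self.replicas]


store = ReplicatedStore()
store.write("repo-index", "good")
store.take_backup()
store.write("repo-index", "corrupted")  # the mistake reaches all 3 replicas
print([replica["repo-index"] for replica in store.replicas])
store.restore_latest_backup()  # only the backup still holds the good value
print([replica["repo-index"] for replica in store.replicas])
```

Three replicas protect against losing an instance, while only the backup
protects against a bad write.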

Also, my proposal is based on reliability at the service level. E.g.
some things like "resources.ovirt.org" are quite easy to make reliable, at
least for reads. You just start several instances, and the only problem is
that every mutation has to be applied to all of them. There are multiple ways
to do that, but I doubt we can find one solution for all the services we have.
All of them, though, will need the underlying infra to be ready. If we store
all copies on one storage domain and it goes down, all copies obviously go
down with it, which is less reliable than keeping the copies on separate
storage.
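
A rough back-of-the-envelope sketch of the storage domain point. The
availability figures are invented for illustration (not measurements of our
infra), failures are assumed to be independent, and the function names are
hypothetical.

```python
def availability_separate_storage(host_avail, storage_avail, copies):
    """Each copy has its own storage; the service is up if any copy is up."""
    copy_avail = host_avail * storage_avail
    return 1.0 - (1.0 - copy_avail) ** copies


def availability_shared_storage(host_avail, storage_avail, copies):
    """All copies share one storage domain, so that domain must be up too."""
    any_host_up = 1.0 - (1.0 - host_avail) ** copies
    return storage_avail * any_host_up


if __name__ == "__main__":
    # Hypothetical numbers: 99% available hosts and 99% available storage.
    print(availability_separate_storage(0.99, 0.99, 3))
    print(availability_shared_storage(0.99, 0.99, 3))
```

With shared storage the result can never exceed the availability of the
storage domain itself, no matter how many copies are started.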


> 5. This way we still can, perhaps need to, sync over the Internet many
> GBs of data if the local-site backup died too, but if it didn't, and we
> did everything right, we only need to sync diffs, which will hopefully be
> much smaller.
>

This is indeed what should happen in a properly designed service, although
I doubt it is possible for all the ones we use. But if we choose a per-service
approach, then it can be decided individually for each service.
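
For services that are just trees of files, "sync only the diffs" could look
roughly like the sketch below. It is a minimal illustration, assuming each
site can produce a manifest of content hashes; the function names are
hypothetical, and a real tool (rsync, bup, etc.) does this far more
efficiently.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    manifest = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def files_to_sync(source_manifest, target_manifest):
    """Return paths that are missing on the target or differ from the source."""
    return sorted(
        path
        for path, digest in source_manifest.items()
        if target_manifest.get(path) != digest
    )
```

Only the paths returned by files_to_sync() need to cross the Internet; if the
local backup is gone, the target manifest is empty and everything is listed.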

Anton.

-- 
Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat
___
Infra mailing list
Infra@ovirt.org
http://lists.ovirt.org/mailman/listinfo/infra


Re: Let's Consider Having Three Separated Environments in Infra

2016-07-20 Thread Yedidyah Bar David
On Wed, Jul 20, 2016 at 7:45 PM, Anton Marchukov  wrote:
> Hello All.
>
> This is a follow-up to the meeting minutes, just to record my thoughts for
> consideration during the actual design.
>
> To put it plainly: I think we need to target the creation of 3 (yes, three)
> independent and essentially identical setups with as few shared parts as
> possible.
>
> If we choose to go with reliability at the service level, then we do need 3
> because:
>
> 1. If we mess up one environment (e.g. storage is completely dead there),
> we will still have 2 working, which still gives us reliability because one
> of them can fail. So it will move us out of crunch mode into the regular
> work mode.
>
> 2. Consensus-based algorithms generally require at least 2N+1 instances to
> tolerate N failures, unless they use some special mode. The lowest is N=1,
> that is 3, and it would make sense to distribute them across different
> environments.
>
> I know the concern about having even 2 envs was that we will spend more
> effort maintaining them. But I think the opposite is true. Having 3 is
> actually less effort to maintain if we make them similar, because:
>
> 1. We can do gradual canary updates, same as with failures. You can test an
> update on 1 instance while the other 2 keep running and still provide
> reliability. So an upgrade is safe and no longer time-constrained.
>
> 2. If the environments are similar, then once we establish the correct
> playbook for one, we can just apply it to the second and later to the third.
> So the overhead is not in fact tripled, and if automated, then it is no
> additional effort at all.
>
> 3. We are freer to test and play with one. We can even destroy it,
> recreate it from scratch, etc. Indirectly this will reduce our effort.
>
> I think the only real problem with it is the initial step, when we have to
> design an ideal hardware and network layout for it. But once that is done,
> it will be easier to go with 3 environments. It may also be possible to
> design the plan so that we start with just one and later convert it into
> three.

Not sure it's my business, but whatever:

1. Do you also intend to have separate data centers, such that if one (or
two) of them loses connectivity/power/etc., we are still up?

2. If so, does it mean that recreating it means copying from one of the
others many GBs of data? And if so, is that also the plan for recovering
from bad tests?

3. If so, it probably means we'll not do that very happily, because
"undo" will take a lot of time and bandwidth.

4. If we still want to go that route, perhaps consider having per-site
backup, which allows syncing from the others the changes done on them
since X (where X is "death of power/connectivity", "Start of test", etc).
Some time ago I looked at backup tools, and found out that while there are
several "field tested" tools, such as bacula and amanda, they are considered
old-fashioned, but there are several different contenders for the future
"perfect" tool. For an overview of some of them see [1]. For my own uses
I chose 'bup', which isn't perfect, but seemed good and stable enough.

5. This way we still can, perhaps need to, sync over the Internet many
GBs of data if the local-site backup died too, but if it didn't, and we
did everything right, we only need to sync diffs, which will hopefully be
much smaller.

[1] http://changelog.complete.org/archives/9353-roundup-of-remote-encrypted-deduplicated-backups-in-linux

Best,




-- 
Didi


Let's Consider Having Three Separated Environments in Infra

2016-07-20 Thread Anton Marchukov
Hello All.

This is a follow-up to the meeting minutes, just to record my thoughts for
consideration during the actual design.

To put it plainly: I think we need to target the creation of 3 (yes, three)
independent and essentially identical setups with as few shared parts as
possible.

If we choose to go with reliability at the service level, then we do need 3
because:

1. If we mess up one environment (e.g. storage is completely dead there),
we will still have 2 working, which still gives us reliability because one
of them can fail. So it will move us out of crunch mode into the regular
work mode.

2. Consensus-based algorithms generally require at least 2N+1 instances to
tolerate N failures, unless they use some special mode. The lowest is N=1,
that is 3, and it would make sense to distribute them across different
environments.
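
The 2N+1 arithmetic can be written down as a small sketch. It assumes a
majority-quorum protocol in the style of Raft or Paxos that only has to
survive crashed or unreachable instances; the function names are made up for
illustration.

```python
def instances_needed(tolerated_failures):
    """Instances a majority-quorum protocol needs to survive N failures: 2N + 1."""
    if tolerated_failures < 0:
        raise ValueError("tolerated_failures must not be negative")
    return 2 * tolerated_failures + 1


def majority(instances):
    """Smallest number of instances that forms a strict majority."""
    return instances // 2 + 1


def has_quorum(instances, healthy):
    """True if the healthy instances can still reach a majority decision."""
    return healthy >= majority(instances)


if __name__ == "__main__":
    total = instances_needed(1)
    print(total, majority(total), has_quorum(total, total - 1))
```

Note that 2 instances are no better than 1 in this respect: the majority of 2
is 2, so losing either one already loses the quorum.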

I know the concern about having even 2 envs was that we will spend more
effort maintaining them. But I think the opposite is true. Having 3 is
actually less effort to maintain if we make them similar, because:

1. We can do gradual canary updates, same as with failures. You can test an
update on 1 instance while the other 2 keep running and still provide
reliability. So an upgrade is safe and no longer time-constrained.

2. If the environments are similar, then once we establish the correct
playbook for one, we can just apply it to the second and later to the third.
So the overhead is not in fact tripled, and if automated, then it is no
additional effort at all.

3. We are freer to test and play with one. We can even destroy it,
recreate it from scratch, etc. Indirectly this will reduce our effort.
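
The canary idea could be automated along these lines. This is only a sketch:
apply_update and is_healthy stand in for whatever playbook run and health
check we end up with, and the environment names in the usage note are
placeholders.

```python
def canary_rollout(environments, apply_update, is_healthy):
    """Update environments one at a time, stopping at the first unhealthy one.

    Returns the environments that were updated and passed the health check.
    Environments after a failed one are left untouched on the old version.
    """
    updated = []
    for env in environments:
        apply_update(env)
        if not is_healthy(env):
            # Stop here: the remaining environments keep serving as before.
            break
        updated.append(env)
    return updated
```

Called as canary_rollout(["env1", "env2", "env3"], run_playbook, check), where
run_playbook and check are hypothetical callables, a failure on the first
environment leaves the other two running the old version, so there is no
time pressure to fix it.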

I think the only real problem with it is the initial step, when we have to
design an ideal hardware and network layout for it. But once that is done,
it will be easier to go with 3 environments. It may also be possible to
design the plan so that we start with just one and later convert it into
three.

Anton.

-- 
Anton Marchukov
Senior Software Engineer - RHEV CI - Red Hat