Hi.

We have to use additional states for clouds under patching/rollback, mostly 
because of:




If something goes wrong (patching interrupted or failed, orchestrator 'killed 
in action', some nodes went offline), the orchestrator eventually has to be told 
that the cloud is in a *dirty* state - i.e. half-patched and barely operational. 
Additional business logic, if any, could also track this *dirty* state in 
order to help cloud operators perform recovery or rollback actions.


If something goes wrong completely (recovery or rollback failed), the 
orchestrator at least has to be told that it should *never* retry patching or 
rollback actions for the affected cloud due to its *broken* state - just to 
prevent things from getting even worse. Otherwise, the results would be 
unpredictable. That is what the *fatal* state could be useful for. That should 
be done later; it is a separate question, and it might be up to the operator 
alone to decide. Perhaps only a manual fallback to backups, with partial data 
loss, could help here.
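As a rough illustration only (the state names and transitions below are my sketch of the proposal, not an existing Fuel or Nailgun API), the *dirty* and *fatal* states could be modeled like this:

```python
from enum import Enum

class CloudState(Enum):
    """Hypothetical cloud states for tracking patching/rollback."""
    OPERATIONAL = "operational"
    PATCHING = "patching"
    DIRTY = "dirty"    # patching interrupted/failed; half-patched
    FATAL = "fatal"    # recovery/rollback also failed; never retry

# Allowed transitions (sketch): patching may fail into DIRTY;
# a failed recovery from DIRTY leads to FATAL, which is terminal.
TRANSITIONS = {
    CloudState.OPERATIONAL: {CloudState.PATCHING},
    CloudState.PATCHING: {CloudState.OPERATIONAL, CloudState.DIRTY},
    CloudState.DIRTY: {CloudState.PATCHING, CloudState.FATAL},
    CloudState.FATAL: set(),  # only manual operator intervention
}

def can_start_patching(state: CloudState) -> bool:
    # The orchestrator must refuse to patch a FATAL cloud.
    return CloudState.PATCHING in TRANSITIONS[state]
```

The key property is that FATAL has no outgoing transitions, so the orchestrator can never be talked into retrying on a broken cloud.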




Regards,

Bogdan Dobrelya.






From: Evgeniy L
Sent: Thursday, September 11, 2014 2:02 PM
To: Mike Scherbakov
Cc: Igor Kalnitsky, fuel-dev





Hi,



>> Also, let's think and work on possible failures. What if the Fuel Master node 
>> goes off during patching? What is going to be affected? How can we complete 
>> patching when the Fuel Master comes back online?




The question can be summarised as "What if you kill the orchestrator during the 
deployment?"

In this case the user will get a hung progress bar in the UI until he removes 
the task from Nailgun.

And I'm not sure if, after that, he will be able to continue the deployment 
without additional changes in the db.

Actually, the same question applies not only to patching, but to every task 
which we run under the orchestrator.

The reason for this is our architecture: the orchestrator was designed as a 
worker without persistent state.

But you need to keep the state somewhere in order to complete a task after a 
failure.

As far as I understand, Mistral can help us with this issue.
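A minimal sketch of the idea (all names here are assumptions, not the actual Astute/Nailgun API): persist each task's progress outside the worker, so a restarted orchestrator can resume instead of leaving the task hung:

```python
import json
import os

# Hypothetical location for persisted task progress; a real
# implementation would use the Nailgun db or a similar store.
STATE_FILE = "/tmp/task_state.json"

def save_state(task_id: str, done_nodes: list) -> None:
    # Persist progress so a restarted orchestrator can resume.
    with open(STATE_FILE, "w") as f:
        json.dump({"task_id": task_id, "done": done_nodes}, f)

def load_state() -> dict:
    # Return previous progress, or an empty state on a clean start.
    if not os.path.exists(STATE_FILE):
        return {"task_id": None, "done": []}
    with open(STATE_FILE) as f:
        return json.load(f)
```

With something like this, the worker could skip the nodes listed in `done` after a crash rather than requiring manual db surgery.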




>> Or a compute node under patching breaks for some reason (e.g. disk or 
>> memory issues); how would it affect the patching process? How can we safely 
>> continue patching of the other nodes?




Here is how it works now (Vladimir Sharshov, correct me if I'm wrong):

we use the same strategy as for deployment.




Error during primary-controller patching: fail the whole patching process.

Error during patching of other roles: continue the patching process.




And I'm not sure whether the current strategy is right or wrong.

On the one hand, we shouldn't leave the user's env in a half-patched state.

On the other hand, we can break the user's whole cluster because we ignore the

fact that several computes died during the patching procedure.
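For illustration only (the function and return values are hypothetical; the policy mirrors the deployment strategy described above), the per-role error handling amounts to:

```python
def on_node_patch_error(role: str) -> str:
    """Sketch of the deployment-style error strategy: a failure on the
    primary controller aborts patching entirely, while failures on any
    other role let the process continue on the remaining nodes."""
    if role == "primary-controller":
        return "fail_whole_patching"
    return "continue_patching"
```

The open question in the thread is exactly this branch: continuing past dead computes keeps the env moving, but silently accumulates half-patched nodes.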





Thanks,






On Tue, Sep 9, 2014 at 12:15 PM, Mike Scherbakov <mscherba...@mirantis.com> 
wrote:


Folks,
I was the one who initially requested this. I thought it was going to be pretty 
similar to Stop Deployment. It becomes obvious that it is not.




I'm fine if we have it in the API. Though I think what is much more important 
here is the ability for the user to choose a few hosts for patching first, in 
order to check how patching would work on a very small part of the cluster. 
Ideally we would even move workloads to other nodes before patching. We should 
certainly disable workload scheduling for these experimental hosts.

Then the user can run patching against these nodes and see how it goes. If all 
goes fine, patching can be applied to the rest of the environment. I do not 
think, though, that we should do all, let's say, 100 nodes at once. This sounds 
dangerous to me. I think we would need to come up with some less dangerous 
scenario.




Also, let's think and work on possible failures. What if the Fuel Master node 
goes off during patching? What is going to be affected? How can we complete 
patching when the Fuel Master comes back online?




Or a compute node under patching breaks for some reason (e.g. disk or memory 
issues); how would it affect the patching process? How can we safely continue 
patching of the other nodes?




Thanks,





On Tue, Sep 9, 2014 at 12:08 PM, Vladimir Kuklin <vkuk...@mirantis.com> wrote:


Sorry again. Please look two messages below.


On 9 Sept 2014 at 12:06, "Vladimir Kuklin" <vkuk...@mirantis.com> wrote:


Sorry, hit reply instead of reply-all.

On 9 Sept 2014 at 12:05, "Vladimir Kuklin" <vkuk...@mirantis.com> wrote:


+1

Also, I think we should add stop patching at least to the API, in order to 
allow advanced users and the service team to do what they want.

On 9 Sept 2014 at 12:02, "Igor Kalnitsky" <ikalnit...@mirantis.com> wrote:



What should we do with nodes in case of interrupted patching? I think
we need to mark them for re-deployment, since the nodes' state may be
broken.

Any opinion?

- Igor

On Mon, Sep 8, 2014 at 3:28 PM, Evgeniy L <e...@mirantis.com> wrote:
> Hi,
>
> We were working on implementation of experimental feature
> where user could interrupt openstack patching procedure [1].
>
> It's not as easy to implement as we thought it would be.
> Current stop deployment mechanism [2] stops puppet, erases
> nodes and reboots them into bootstrap. It's ok for stop
> deployment, but it's not ok for patching, because user
> can lose his data. We can rewrite some logic in nailgun
> and in orchestrator to stop puppet and not to erase nodes.
> But I'm not sure if it works correctly because such use
> case wasn't tested. And I can see the problems like
> yum/apt-get locks cleaning after puppet interruption.
>
> As result I have several questions:
> 1. should we try to make it work for the current release?
> 2. if we shouldn't, will we need this feature for the future
>     releases? Definitely additional design and research is
>     required.
>
> [1] https://bugs.launchpad.net/fuel/+bug/1364907
> [2]
> https://github.com/stackforge/fuel-astute/blob/b622d9b36dbdd1e03b282b9ee5b7435ba649e711/lib/astute/server/dispatcher.rb#L163-L164
>
>
> --
> Mailing list: https://launchpad.net/~fuel-dev
> Post to     : fuel-dev@lists.launchpad.net
> Unsubscribe : https://launchpad.net/~fuel-dev
> More help   : https://help.launchpad.net/ListHelp
>









-- 

Mike Scherbakov
#mihgen



