Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-02 Thread Dmitry Mescheryakov
2015-12-02 16:52 GMT+03:00 Jordan Pittier :

>
> On Wed, Dec 2, 2015 at 1:05 PM, Dmitry Mescheryakov <
> dmescherya...@mirantis.com> wrote:
>
>>
>>
>> My point is simple - lets increase our architecture scalability by 2-3
>> times by _maybe_ causing more errors for users during failover. The
>> failover time itself should not get worse (to be tested by me) and errors
>> should be correctly handler by services anyway.
>>
>
> Scalability is great, but what about correctness ?
>

Jordan, users will encounter problems only when some of RabbitMQ nodes go
down. Under normal circumstances it will not cause any additional errors.
And when RabbitMQ goes down and oslo.messaging fails over to alive hosts,
we anyway have couple minutes messaging downtime at the moment, which
disrupts almost all RPC calls. On the other side, disabling mirroring
greatly reduces chances a RabbitMQ node goes down due to high load.


> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-02 Thread Sheena Gregson
This seems like a totally reasonable solution, and would enable us to more
thoroughly test the performance implications of this change between 8.0
and 9.0 release.

+1

-Original Message-
From: Davanum Srinivas [mailto:dava...@gmail.com]
Sent: Wednesday, December 02, 2015 9:32 AM
To: OpenStack Development Mailing List (not for usage questions)

Subject: Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in
RabbitMQ

Vova, Folks,

+1 to "set this option to false as an experimental feature"

Thanks,
Dims

On Wed, Dec 2, 2015 at 10:08 AM, Vladimir Kuklin 
wrote:
> Dmitry
>
> Although, I am a big fan of disabling replication for RPC, I think it
> is too late to introduce it so late by default. I would suggest that
> we control this part of OCF script with a specific parameter 'e.g.
> enable RPC replication' and set it to 'true' by default. Then we can
> set this option to false as an experimental feature, run some tests
> and decide whether it should be enabled by default or not. In this
> case, users who are interested in this, will be able to enable it when
> they need it, while we still stick to our old and tested approach.
>
> On Wed, Dec 2, 2015 at 5:52 PM, Konstantin Kalin 
> wrote:
>>
>> I would add on top of that Dmirty said that HA queues also increases
>> probability to have messages duplications under certain scenarios
>> (besides of that they are ~10x slower). Would Openstack services
>> tolerate if RPC request will be duplicated? What I've already learned
>> - No. Also if cluster_partition_handling=autoheal (what we currently
>> have) the messages may be lost as well during the failover scenarios
like non-HA queues.
>> Honestly I believe there is no difference between HA queues and non
>> HA-queues in RPC layer fail-tolerance in the way how we use RabbitMQ.
>>
>> Thank you,
>> Konstantin.
>>
>> On Dec 2, 2015, at 4:05 AM, Dmitry Mescheryakov
>>  wrote:
>>
>>
>>
>> 2015-12-02 12:48 GMT+03:00 Sergii Golovatiuk
:
>>>
>>> Hi,
>>>
>>>
>>> On Tue, Dec 1, 2015 at 11:34 PM, Peter Lemenkov 
>>> wrote:
>>>>
>>>> Hello All!
>>>>
>>>> Well, side-effects (or any other effects) are quite obvious and
>>>> predictable - this will decrease availability of RPC queues a bit.
>>>> That's for sure.
>>>
>>>
>>> Imagine the case when user creates VM instance, and some nova
>>> messages are lost. I am not sure we want half-created instances. Who
>>> is going to clean up them? Since we do not have results of
>>> destructive tests, I vote -2 for FFE for this feature.
>>
>>
>> Sergii, actually messaging layer can not provide any guarantee that
>> it will not happen even if all messages are preserved. Assume the
>> following
>> scenario:
>>
>>  * nova-scheduler (or conductor?) sends request to nova-compute to
>> spawn a VM
>>  * nova-compute receives the message and spawned the VM
>>  * due to some reason (rabbitmq unavailable, nova-compute lagged)
>> nova-compute did not respond within timeout (1 minute, I think)
>>  * nova-scheduler does not get response within 1 minute and marks the
>> VM with Error status.
>>
>> In that scenario no message was lost, but still we have a VM half
>> spawned and it is up to Nova to handle the error and do the cleanup in
that case.
>>
>> Such issue already happens here and there when something glitches.
>> For instance our favorite MessagingTimeout exception could be caused
>> by such scenario. Specifically, in that example when nova-scheduler
>> times out waiting for reply, it will throw exactly that exception.
>>
>> My point is simple - lets increase our architecture scalability by
>> 2-3 times by _maybe_ causing more errors for users during failover.
>> The failover time itself should not get worse (to be tested by me)
>> and errors should be correctly handler by services anyway.
>>
>>>>
>>>> However, Dmitry's guess is that the overall messaging backplane
>>>> stability increase (RabitMQ won't fail too often in some cases)
>>>> would compensate for this change. This issue is very much real -
>>>> speaking of me I've seen an awful cluster's performance degradation
>>>> when a failing RabbitMQ node was killed by some watchdog
>>>> application (or even worse wasn't killed at all). One of these
>>>> issues was quite recently, and I'd love to see them less frequently.
>>>>
>>>> That said I'm 

Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-02 Thread Davanum Srinivas
Vova, Folks,

+1 to "set this option to false as an experimental feature"

Thanks,
Dims

On Wed, Dec 2, 2015 at 10:08 AM, Vladimir Kuklin  wrote:
> Dmitry
>
> Although, I am a big fan of disabling replication for RPC, I think it is too
> late to introduce it so late by default. I would suggest that we control
> this part of OCF script with a specific parameter 'e.g. enable RPC
> replication' and set it to 'true' by default. Then we can set this option to
> false as an experimental feature, run some tests and decide whether it
> should be enabled by default or not. In this case, users who are interested
> in this, will be able to enable it when they need it, while we still stick
> to our old and tested approach.
>
> On Wed, Dec 2, 2015 at 5:52 PM, Konstantin Kalin 
> wrote:
>>
>> I would add on top of that Dmirty said that HA queues also increases
>> probability to have messages duplications under certain scenarios (besides
>> of that they are ~10x slower). Would Openstack services tolerate if RPC
>> request will be duplicated? What I've already learned - No. Also if
>> cluster_partition_handling=autoheal (what we currently have) the messages
>> may be lost as well during the failover scenarios like non-HA queues.
>> Honestly I believe there is no difference between HA queues and non
>> HA-queues in RPC layer fail-tolerance in the way how we use RabbitMQ.
>>
>> Thank you,
>> Konstantin.
>>
>> On Dec 2, 2015, at 4:05 AM, Dmitry Mescheryakov
>>  wrote:
>>
>>
>>
>> 2015-12-02 12:48 GMT+03:00 Sergii Golovatiuk :
>>>
>>> Hi,
>>>
>>>
>>> On Tue, Dec 1, 2015 at 11:34 PM, Peter Lemenkov 
>>> wrote:

 Hello All!

 Well, side-effects (or any other effects) are quite obvious and
 predictable - this will decrease availability of RPC queues a bit.
 That's for sure.
>>>
>>>
>>> Imagine the case when user creates VM instance, and some nova messages
>>> are lost. I am not sure we want half-created instances. Who is going to
>>> clean up them? Since we do not have results of destructive tests, I vote -2
>>> for FFE for this feature.
>>
>>
>> Sergii, actually messaging layer can not provide any guarantee that it
>> will not happen even if all messages are preserved. Assume the following
>> scenario:
>>
>>  * nova-scheduler (or conductor?) sends request to nova-compute to spawn a
>> VM
>>  * nova-compute receives the message and spawned the VM
>>  * due to some reason (rabbitmq unavailable, nova-compute lagged)
>> nova-compute did not respond within timeout (1 minute, I think)
>>  * nova-scheduler does not get response within 1 minute and marks the VM
>> with Error status.
>>
>> In that scenario no message was lost, but still we have a VM half spawned
>> and it is up to Nova to handle the error and do the cleanup in that case.
>>
>> Such issue already happens here and there when something glitches. For
>> instance our favorite MessagingTimeout exception could be caused by such
>> scenario. Specifically, in that example when nova-scheduler times out
>> waiting for reply, it will throw exactly that exception.
>>
>> My point is simple - lets increase our architecture scalability by 2-3
>> times by _maybe_ causing more errors for users during failover. The failover
>> time itself should not get worse (to be tested by me) and errors should be
>> correctly handler by services anyway.
>>

 However, Dmitry's guess is that the overall messaging backplane
 stability increase (RabitMQ won't fail too often in some cases) would
 compensate for this change. This issue is very much real - speaking of
 me I've seen an awful cluster's performance degradation when a failing
 RabbitMQ node was killed by some watchdog application (or even worse
 wasn't killed at all). One of these issues was quite recently, and I'd
 love to see them less frequently.

 That said I'm uncertain about the stability impact of this change, yet
 I see a reasoning worth discussing behind it.

 2015-12-01 20:53 GMT+01:00 Sergii Golovatiuk :
 > Hi,
 >
 > -1 for FFE for disabling HA for RPC queue as we do not know all side
 > effects
 > in HA scenarios.
 >
 > On Tue, Dec 1, 2015 at 7:34 PM, Dmitry Mescheryakov
 >  wrote:
 >>
 >> Folks,
 >>
 >> I would like to request feature freeze exception for disabling HA for
 >> RPC
 >> queues in RabbitMQ [1].
 >>
 >> As I already wrote in another thread [2], I've conducted tests which
 >> clearly show benefit we will get from that change. The change itself
 >> is a
 >> very small patch [3]. The only thing which I want to do before
 >> proposing to
 >> merge this change is to conduct destructive tests against it in order
 >> to
 >> make sure that we do not have a regression here. That should take
 >> just
 >> several days, so if there will be no other objections, we will be
 >> able to
 >> merge the change in a week or two timeframe.
 >>
 >> Thank

Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-02 Thread Vladimir Kuklin
Dmitry

Although, I am a big fan of disabling replication for RPC, I think it is
too late to introduce it so late by default. I would suggest that we
control this part of OCF script with a specific parameter 'e.g. enable RPC
replication' and set it to 'true' by default. Then we can set this option
to false as an experimental feature, run some tests and decide whether it
should be enabled by default or not. In this case, users who are interested
in this, will be able to enable it when they need it, while we still stick
to our old and tested approach.

On Wed, Dec 2, 2015 at 5:52 PM, Konstantin Kalin 
wrote:

> I would add on top of that Dmirty said that HA queues also increases
> probability to have messages duplications under certain scenarios (besides
> of that they are ~10x slower). Would Openstack services tolerate if RPC
> request will be duplicated? What I've already learned - No. Also if
> cluster_partition_handling=autoheal (what we currently have) the messages
> may be lost as well during the failover scenarios like non-HA
> queues. Honestly I believe there is no difference between HA queues and non
> HA-queues in RPC layer fail-tolerance in the way how we use RabbitMQ.
>
> Thank you,
> Konstantin.
>
> On Dec 2, 2015, at 4:05 AM, Dmitry Mescheryakov <
> dmescherya...@mirantis.com> wrote:
>
>
>
> 2015-12-02 12:48 GMT+03:00 Sergii Golovatiuk :
>
>> Hi,
>>
>>
>> On Tue, Dec 1, 2015 at 11:34 PM, Peter Lemenkov 
>> wrote:
>>
>>> Hello All!
>>>
>>> Well, side-effects (or any other effects) are quite obvious and
>>> predictable - this will decrease availability of RPC queues a bit.
>>> That's for sure.
>>>
>>
>> Imagine the case when user creates VM instance, and some nova messages
>> are lost. I am not sure we want half-created instances. Who is going to
>> clean up them? Since we do not have results of destructive tests, I vote -2
>> for FFE for this feature.
>>
>
> Sergii, actually messaging layer can not provide any guarantee that it
> will not happen even if all messages are preserved. Assume the following
> scenario:
>
>  * nova-scheduler (or conductor?) sends request to nova-compute to spawn a
> VM
>  * nova-compute receives the message and spawned the VM
>  * due to some reason (rabbitmq unavailable, nova-compute lagged)
> nova-compute did not respond within timeout (1 minute, I think)
>  * nova-scheduler does not get response within 1 minute and marks the VM
> with Error status.
>
> In that scenario no message was lost, but still we have a VM half spawned
> and it is up to Nova to handle the error and do the cleanup in that case.
>
> Such issue already happens here and there when something glitches. For
> instance our favorite MessagingTimeout exception could be caused by such
> scenario. Specifically, in that example when nova-scheduler times out
> waiting for reply, it will throw exactly that exception.
>
> My point is simple - lets increase our architecture scalability by 2-3
> times by _maybe_ causing more errors for users during failover. The
> failover time itself should not get worse (to be tested by me) and errors
> should be correctly handler by services anyway.
>
>
>>> However, Dmitry's guess is that the overall messaging backplane
>>> stability increase (RabitMQ won't fail too often in some cases) would
>>> compensate for this change. This issue is very much real - speaking of
>>> me I've seen an awful cluster's performance degradation when a failing
>>> RabbitMQ node was killed by some watchdog application (or even worse
>>> wasn't killed at all). One of these issues was quite recently, and I'd
>>> love to see them less frequently.
>>>
>>> That said I'm uncertain about the stability impact of this change, yet
>>> I see a reasoning worth discussing behind it.
>>>
>>> 2015-12-01 20:53 GMT+01:00 Sergii Golovatiuk :
>>> > Hi,
>>> >
>>> > -1 for FFE for disabling HA for RPC queue as we do not know all side
>>> effects
>>> > in HA scenarios.
>>> >
>>> > On Tue, Dec 1, 2015 at 7:34 PM, Dmitry Mescheryakov
>>> >  wrote:
>>> >>
>>> >> Folks,
>>> >>
>>> >> I would like to request feature freeze exception for disabling HA for
>>> RPC
>>> >> queues in RabbitMQ [1].
>>> >>
>>> >> As I already wrote in another thread [2], I've conducted tests which
>>> >> clearly show benefit we will get from that change. The change itself
>>> is a
>>> >> very small patch [3]. The only thing which I want to do before
>>> proposing to
>>> >> merge this change is to conduct destructive tests against it in order
>>> to
>>> >> make sure that we do not have a regression here. That should take just
>>> >> several days, so if there will be no other objections, we will be
>>> able to
>>> >> merge the change in a week or two timeframe.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Dmitry
>>> >>
>>> >> [1] https://review.openstack.org/247517
>>> >> [2]
>>> >>
>>> http://lists.openstack.org/pipermail/openstack-dev/2015-December/081006.html
>>> >> [3] https://review.openstack.org/249180
>>> >>
>>> >>
>>> 

Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-02 Thread Konstantin Kalin
I would add on top of that Dmirty said that HA queues also increases 
probability to have messages duplications under certain scenarios (besides of 
that they are ~10x slower). Would Openstack services tolerate if RPC request 
will be duplicated? What I've already learned - No. Also if 
cluster_partition_handling=autoheal (what we currently have) the messages may 
be lost as well during the failover scenarios like non-HA queues. Honestly I 
believe there is no difference between HA queues and non HA-queues in RPC layer 
fail-tolerance in the way how we use RabbitMQ. 

Thank you,
Konstantin. 

> On Dec 2, 2015, at 4:05 AM, Dmitry Mescheryakov  
> wrote:
> 
> 
> 
> 2015-12-02 12:48 GMT+03:00 Sergii Golovatiuk  >:
> Hi,
> 
> 
> On Tue, Dec 1, 2015 at 11:34 PM, Peter Lemenkov  > wrote:
> Hello All!
> 
> Well, side-effects (or any other effects) are quite obvious and
> predictable - this will decrease availability of RPC queues a bit.
> That's for sure.
> 
> Imagine the case when user creates VM instance, and some nova messages are 
> lost. I am not sure we want half-created instances. Who is going to clean up 
> them? Since we do not have results of destructive tests, I vote -2 for FFE 
> for this feature.
> 
> Sergii, actually messaging layer can not provide any guarantee that it will 
> not happen even if all messages are preserved. Assume the following scenario:
> 
>  * nova-scheduler (or conductor?) sends request to nova-compute to spawn a VM
>  * nova-compute receives the message and spawned the VM
>  * due to some reason (rabbitmq unavailable, nova-compute lagged) 
> nova-compute did not respond within timeout (1 minute, I think)
>  * nova-scheduler does not get response within 1 minute and marks the VM with 
> Error status.
> 
> In that scenario no message was lost, but still we have a VM half spawned and 
> it is up to Nova to handle the error and do the cleanup in that case.
> 
> Such issue already happens here and there when something glitches. For 
> instance our favorite MessagingTimeout exception could be caused by such 
> scenario. Specifically, in that example when nova-scheduler times out waiting 
> for reply, it will throw exactly that exception. 
> 
> My point is simple - lets increase our architecture scalability by 2-3 times 
> by _maybe_ causing more errors for users during failover. The failover time 
> itself should not get worse (to be tested by me) and errors should be 
> correctly handler by services anyway.
> 
> 
> However, Dmitry's guess is that the overall messaging backplane
> stability increase (RabitMQ won't fail too often in some cases) would
> compensate for this change. This issue is very much real - speaking of
> me I've seen an awful cluster's performance degradation when a failing
> RabbitMQ node was killed by some watchdog application (or even worse
> wasn't killed at all). One of these issues was quite recently, and I'd
> love to see them less frequently.
> 
> That said I'm uncertain about the stability impact of this change, yet
> I see a reasoning worth discussing behind it.
> 
> 2015-12-01 20:53 GMT+01:00 Sergii Golovatiuk  >:
> > Hi,
> >
> > -1 for FFE for disabling HA for RPC queue as we do not know all side effects
> > in HA scenarios.
> >
> > On Tue, Dec 1, 2015 at 7:34 PM, Dmitry Mescheryakov
> > mailto:dmescherya...@mirantis.com>> wrote:
> >>
> >> Folks,
> >>
> >> I would like to request feature freeze exception for disabling HA for RPC
> >> queues in RabbitMQ [1].
> >>
> >> As I already wrote in another thread [2], I've conducted tests which
> >> clearly show benefit we will get from that change. The change itself is a
> >> very small patch [3]. The only thing which I want to do before proposing to
> >> merge this change is to conduct destructive tests against it in order to
> >> make sure that we do not have a regression here. That should take just
> >> several days, so if there will be no other objections, we will be able to
> >> merge the change in a week or two timeframe.
> >>
> >> Thanks,
> >>
> >> Dmitry
> >>
> >> [1] https://review.openstack.org/247517 
> >> 
> >> [2]
> >> http://lists.openstack.org/pipermail/openstack-dev/2015-December/081006.html
> >>  
> >> 
> >> [3] https://review.openstack.org/249180 
> >> 
> >>
> >> __
> >> OpenStack Development Mailing List (not for usage questions)
> >> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe 
> >> 
> >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev 
> >> 
> >>
> >
> >
> > 

Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-02 Thread Dmitry Mescheryakov
2015-12-02 13:11 GMT+03:00 Bogdan Dobrelya :

> On 01.12.2015 23:34, Peter Lemenkov wrote:
> > Hello All!
> >
> > Well, side-effects (or any other effects) are quite obvious and
> > predictable - this will decrease availability of RPC queues a bit.
> > That's for sure.
>
> And consistency. Without messages and queues being synced between all of
> the rabbit_hosts, how exactly dispatching rpc calls would work then
> workers connected to different AMQP urls?
>

There will be no problem with consistency here. Since we will disable HA,
queues will not be synced across the cluster and there will be exactly one
node hosting messages for a queue.


> Perhaps that change would only raise the partitions tolerance to the
> very high degree? But this should be clearly shown by load tests - under
> network partitions with mirroring against network partitions w/o
> mirroring. Rally could help here a lot.


Nope, the change will not increase partitioning tolerance at all. What I
expect is that it will not get worse. Regarding tests, sure we are going to
perform destructive testing to verify that there is no regression in
recovery time.


>
> >
> > However, Dmitry's guess is that the overall messaging backplane
> > stability increase (RabitMQ won't fail too often in some cases) would
> > compensate for this change. This issue is very much real - speaking of
>
> Agree, that should be proven by (rally) tests for the specific case I
> described in the spec [0]. Please correct it as I may understand things
> wrong, but here it is:
> - client 1 submits RPC call request R to the server 1 connected to the
> AMQP host X
> - worker A listens for jobs topic to the AMQP host X
> - worker B listens for jobs topic to the AMQP host Y
> - a job by the R was dispatched to the worker B
> Q: would the B never receive its job message because it just cannot see
> messages at the X?
> Q: timeout failure as the result.
>
> And things may go even much more weird for more complex scenarios.
>

Yes, in the described scenario B will receive the job. Node Y will proxy B
listening to node X. So, we will not experience timeout. Also, I have
replied in the review.


>
> [0] https://review.openstack.org/247517
>
> > me I've seen an awful cluster's performance degradation when a failing
> > RabbitMQ node was killed by some watchdog application (or even worse
> > wasn't killed at all). One of these issues was quite recently, and I'd
> > love to see them less frequently.
> >
> > That said I'm uncertain about the stability impact of this change, yet
> > I see a reasoning worth discussing behind it.
>
> I would support this to the 8.0 if only proven by the load tests within
> scenario I described plus standard destructive tests


As I said in my initial email, I've run boot_and_delete_server_with_secgroups
Rally scenario to verify my change. I think I should provide more details:

Scale team considers this test to be the worst case we have for RabbitMQ.
I've ran the test on 200 nodes lab and what I saw is that when I disable
HA, test time becomes 2 times smaller. That clearly shows that there is a
test where our current messaging system is bottleneck and just tuning it
considerably improves performance of OpenStack as a whole. Also while there
was small fail rate for HA mode (around 1-2%), in non-HA mode all tests
always completed successfully.

Overall, I think current results are already enough to consider the change
useful. What is left is to confirm that it does not make our failover worse.


> >
> > 2015-12-01 20:53 GMT+01:00 Sergii Golovatiuk :
> >> Hi,
> >>
> >> -1 for FFE for disabling HA for RPC queue as we do not know all side
> effects
> >> in HA scenarios.
> >>
> >> On Tue, Dec 1, 2015 at 7:34 PM, Dmitry Mescheryakov
> >>  wrote:
> >>>
> >>> Folks,
> >>>
> >>> I would like to request feature freeze exception for disabling HA for
> RPC
> >>> queues in RabbitMQ [1].
> >>>
> >>> As I already wrote in another thread [2], I've conducted tests which
> >>> clearly show benefit we will get from that change. The change itself
> is a
> >>> very small patch [3]. The only thing which I want to do before
> proposing to
> >>> merge this change is to conduct destructive tests against it in order
> to
> >>> make sure that we do not have a regression here. That should take just
> >>> several days, so if there will be no other objections, we will be able
> to
> >>> merge the change in a week or two timeframe.
> >>>
> >>> Thanks,
> >>>
> >>> Dmitry
> >>>
> >>> [1] https://review.openstack.org/247517
> >>> [2]
> >>>
> http://lists.openstack.org/pipermail/openstack-dev/2015-December/081006.html
> >>> [3] https://review.openstack.org/249180
> >>>
> >>>
> __
> >>> OpenStack Development Mailing List (not for usage questions)
> >>> Unsubscribe:
> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> >>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >>>
> >>
> >>
> >>

Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-02 Thread Jordan Pittier
On Wed, Dec 2, 2015 at 1:05 PM, Dmitry Mescheryakov <
dmescherya...@mirantis.com> wrote:

>
>
> My point is simple - lets increase our architecture scalability by 2-3
> times by _maybe_ causing more errors for users during failover. The
> failover time itself should not get worse (to be tested by me) and errors
> should be correctly handler by services anyway.
>

Scalability is great, but what about correctness ?
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-02 Thread Dmitry Mescheryakov
2015-12-02 12:48 GMT+03:00 Sergii Golovatiuk :

> Hi,
>
>
> On Tue, Dec 1, 2015 at 11:34 PM, Peter Lemenkov 
> wrote:
>
>> Hello All!
>>
>> Well, side-effects (or any other effects) are quite obvious and
>> predictable - this will decrease availability of RPC queues a bit.
>> That's for sure.
>>
>
> Imagine the case when user creates VM instance, and some nova messages are
> lost. I am not sure we want half-created instances. Who is going to clean
> up them? Since we do not have results of destructive tests, I vote -2 for
> FFE for this feature.
>

Sergii, actually messaging layer can not provide any guarantee that it will
not happen even if all messages are preserved. Assume the following
scenario:

 * nova-scheduler (or conductor?) sends request to nova-compute to spawn a
VM
 * nova-compute receives the message and spawned the VM
 * due to some reason (rabbitmq unavailable, nova-compute lagged)
nova-compute did not respond within timeout (1 minute, I think)
 * nova-scheduler does not get response within 1 minute and marks the VM
with Error status.

In that scenario no message was lost, but still we have a VM half spawned
and it is up to Nova to handle the error and do the cleanup in that case.

Such issue already happens here and there when something glitches. For
instance our favorite MessagingTimeout exception could be caused by such
scenario. Specifically, in that example when nova-scheduler times out
waiting for reply, it will throw exactly that exception.

My point is simple - lets increase our architecture scalability by 2-3
times by _maybe_ causing more errors for users during failover. The
failover time itself should not get worse (to be tested by me) and errors
should be correctly handler by services anyway.


>> However, Dmitry's guess is that the overall messaging backplane
>> stability increase (RabitMQ won't fail too often in some cases) would
>> compensate for this change. This issue is very much real - speaking of
>> me I've seen an awful cluster's performance degradation when a failing
>> RabbitMQ node was killed by some watchdog application (or even worse
>> wasn't killed at all). One of these issues was quite recently, and I'd
>> love to see them less frequently.
>>
>> That said I'm uncertain about the stability impact of this change, yet
>> I see a reasoning worth discussing behind it.
>>
>> 2015-12-01 20:53 GMT+01:00 Sergii Golovatiuk :
>> > Hi,
>> >
>> > -1 for FFE for disabling HA for RPC queue as we do not know all side
>> effects
>> > in HA scenarios.
>> >
>> > On Tue, Dec 1, 2015 at 7:34 PM, Dmitry Mescheryakov
>> >  wrote:
>> >>
>> >> Folks,
>> >>
>> >> I would like to request feature freeze exception for disabling HA for
>> RPC
>> >> queues in RabbitMQ [1].
>> >>
>> >> As I already wrote in another thread [2], I've conducted tests which
>> >> clearly show benefit we will get from that change. The change itself
>> is a
>> >> very small patch [3]. The only thing which I want to do before
>> proposing to
>> >> merge this change is to conduct destructive tests against it in order
>> to
>> >> make sure that we do not have a regression here. That should take just
>> >> several days, so if there will be no other objections, we will be able
>> to
>> >> merge the change in a week or two timeframe.
>> >>
>> >> Thanks,
>> >>
>> >> Dmitry
>> >>
>> >> [1] https://review.openstack.org/247517
>> >> [2]
>> >>
>> http://lists.openstack.org/pipermail/openstack-dev/2015-December/081006.html
>> >> [3] https://review.openstack.org/249180
>> >>
>> >>
>> __
>> >> OpenStack Development Mailing List (not for usage questions)
>> >> Unsubscribe:
>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>> >>
>> >
>> >
>> >
>> __
>> > OpenStack Development Mailing List (not for usage questions)
>> > Unsubscribe:
>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>> >
>>
>>
>>
>> --
>> With best regards, Peter Lemenkov.
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe:
>> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscri

Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-02 Thread Bogdan Dobrelya
On 01.12.2015 23:34, Peter Lemenkov wrote:
> Hello All!
> 
> Well, side-effects (or any other effects) are quite obvious and
> predictable - this will decrease availability of RPC queues a bit.
> That's for sure.

And consistency. Without messages and queues being synced between all of
the rabbit_hosts, how exactly dispatching rpc calls would work then
workers connected to different AMQP urls?

Perhaps that change would only raise the partitions tolerance to the
very high degree? But this should be clearly shown by load tests - under
network partitions with mirroring against network partitions w/o
mirroring. Rally could help here a lot.

> 
> However, Dmitry's guess is that the overall messaging backplane
> stability increase (RabitMQ won't fail too often in some cases) would
> compensate for this change. This issue is very much real - speaking of

Agree, that should be proven by (rally) tests for the specific case I
described in the spec [0]. Please correct it as I may understand things
wrong, but here it is:
- client 1 submits RPC call request R to the server 1 connected to the
AMQP host X
- worker A listens for jobs topic to the AMQP host X
- worker B listens for jobs topic to the AMQP host Y
- a job by the R was dispatched to the worker B
Q: would the B never receive its job message because it just cannot see
messages at the X?
Q: timeout failure as the result.

And things may go even much more weird for more complex scenarios.


[0] https://review.openstack.org/247517

> me I've seen an awful cluster's performance degradation when a failing
> RabbitMQ node was killed by some watchdog application (or even worse
> wasn't killed at all). One of these issues was quite recently, and I'd
> love to see them less frequently.
> 
> That said I'm uncertain about the stability impact of this change, yet
> I see a reasoning worth discussing behind it.

I would support this to the 8.0 if only proven by the load tests within
scenario I described plus standard destructive tests

> 
> 2015-12-01 20:53 GMT+01:00 Sergii Golovatiuk :
>> Hi,
>>
>> -1 for FFE for disabling HA for RPC queue as we do not know all side effects
>> in HA scenarios.
>>
>> On Tue, Dec 1, 2015 at 7:34 PM, Dmitry Mescheryakov
>>  wrote:
>>>
>>> Folks,
>>>
>>> I would like to request feature freeze exception for disabling HA for RPC
>>> queues in RabbitMQ [1].
>>>
>>> As I already wrote in another thread [2], I've conducted tests which
>>> clearly show benefit we will get from that change. The change itself is a
>>> very small patch [3]. The only thing which I want to do before proposing to
>>> merge this change is to conduct destructive tests against it in order to
>>> make sure that we do not have a regression here. That should take just
>>> several days, so if there will be no other objections, we will be able to
>>> merge the change in a week or two timeframe.
>>>
>>> Thanks,
>>>
>>> Dmitry
>>>
>>> [1] https://review.openstack.org/247517
>>> [2]
>>> http://lists.openstack.org/pipermail/openstack-dev/2015-December/081006.html
>>> [3] https://review.openstack.org/249180
>>>
>>> __
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
> 
> 
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-02 Thread Sergii Golovatiuk
Hi,


On Tue, Dec 1, 2015 at 11:34 PM, Peter Lemenkov  wrote:

> Hello All!
>
> Well, side-effects (or any other effects) are quite obvious and
> predictable - this will decrease availability of RPC queues a bit.
> That's for sure.
>

Imagine the case when user creates VM instance, and some nova messages are
lost. I am not sure we want half-created instances. Who is going to clean
up them? Since we do not have results of destructive tests, I vote -2 for
FFE for this feature.


>
> However, Dmitry's guess is that the overall messaging backplane
> stability increase (RabitMQ won't fail too often in some cases) would
> compensate for this change. This issue is very much real - speaking of
> me I've seen an awful cluster's performance degradation when a failing
> RabbitMQ node was killed by some watchdog application (or even worse
> wasn't killed at all). One of these issues was quite recently, and I'd
> love to see them less frequently.
>
> That said I'm uncertain about the stability impact of this change, yet
> I see a reasoning worth discussing behind it.
>
> 2015-12-01 20:53 GMT+01:00 Sergii Golovatiuk :
> > Hi,
> >
> > -1 for FFE for disabling HA for RPC queue as we do not know all side
> effects
> > in HA scenarios.
> >
> > On Tue, Dec 1, 2015 at 7:34 PM, Dmitry Mescheryakov
> >  wrote:
> >>
> >> Folks,
> >>
> >> I would like to request feature freeze exception for disabling HA for
> RPC
> >> queues in RabbitMQ [1].
> >>
> >> As I already wrote in another thread [2], I've conducted tests which
> >> clearly show benefit we will get from that change. The change itself is
> a
> >> very small patch [3]. The only thing which I want to do before
> proposing to
> >> merge this change is to conduct destructive tests against it in order to
> >> make sure that we do not have a regression here. That should take just
> >> several days, so if there will be no other objections, we will be able
> to
> >> merge the change in a week or two timeframe.
> >>
> >> Thanks,
> >>
> >> Dmitry
> >>
> >> [1] https://review.openstack.org/247517
> >> [2]
> >>
> http://lists.openstack.org/pipermail/openstack-dev/2015-December/081006.html
> >> [3] https://review.openstack.org/249180
> >>
> >>
> __
> >> OpenStack Development Mailing List (not for usage questions)
> >> Unsubscribe:
> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >>
> >
> >
> >
> __
> > OpenStack Development Mailing List (not for usage questions)
> > Unsubscribe:
> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
>
>
>
> --
> With best regards, Peter Lemenkov.
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-01 Thread Peter Lemenkov
Hello All!

Well, side-effects (or any other effects) are quite obvious and
predictable - this will decrease availability of RPC queues a bit.
That's for sure.

However, Dmitry's guess is that the overall messaging backplane
stability increase (RabitMQ won't fail too often in some cases) would
compensate for this change. This issue is very much real - speaking of
me I've seen an awful cluster's performance degradation when a failing
RabbitMQ node was killed by some watchdog application (or even worse
wasn't killed at all). One of these issues was quite recently, and I'd
love to see them less frequently.

That said I'm uncertain about the stability impact of this change, yet
I see a reasoning worth discussing behind it.

2015-12-01 20:53 GMT+01:00 Sergii Golovatiuk :
> Hi,
>
> -1 for FFE for disabling HA for RPC queue as we do not know all side effects
> in HA scenarios.
>
> On Tue, Dec 1, 2015 at 7:34 PM, Dmitry Mescheryakov
>  wrote:
>>
>> Folks,
>>
>> I would like to request feature freeze exception for disabling HA for RPC
>> queues in RabbitMQ [1].
>>
>> As I already wrote in another thread [2], I've conducted tests which
>> clearly show benefit we will get from that change. The change itself is a
>> very small patch [3]. The only thing which I want to do before proposing to
>> merge this change is to conduct destructive tests against it in order to
>> make sure that we do not have a regression here. That should take just
>> several days, so if there will be no other objections, we will be able to
>> merge the change in a week or two timeframe.
>>
>> Thanks,
>>
>> Dmitry
>>
>> [1] https://review.openstack.org/247517
>> [2]
>> http://lists.openstack.org/pipermail/openstack-dev/2015-December/081006.html
>> [3] https://review.openstack.org/249180
>>
>> __
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
With best regards, Peter Lemenkov.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

2015-12-01 Thread Sergii Golovatiuk
Hi,

-1 for FFE for disabling HA for RPC queue as we do not know all side
effects in HA scenarios.

On Tue, Dec 1, 2015 at 7:34 PM, Dmitry Mescheryakov <
dmescherya...@mirantis.com> wrote:

> Folks,
>
> I would like to request feature freeze exception for disabling HA for RPC
> queues in RabbitMQ [1].
>
> As I already wrote in another thread [2], I've conducted tests which
> clearly show benefit we will get from that change. The change itself is a
> very small patch [3]. The only thing which I want to do before proposing to
> merge this change is to conduct destructive tests against it in order to
> make sure that we do not have a regression here. That should take just
> several days, so if there will be no other objections, we will be able to
> merge the change in a week or two timeframe.
>
> Thanks,
>
> Dmitry
>
> [1] https://review.openstack.org/247517
> [2]
> http://lists.openstack.org/pipermail/openstack-dev/2015-December/081006.html
> [3] https://review.openstack.org/249180
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev