Hi,

I think there may be some misunderstanding of the PoC design. (The proxy node 
only listens for the RPC targeted at the compute node / cinder-volume / L2 / L3 agents…)


1)      The cascading layer, including the proxy nodes, is assumed to run in 
VMs rather than on physical servers (although you can do that too). Even in the CJK 
(China, Japan, Korea) intercloud, the cascading layer, including the API, message bus, 
DB and proxy nodes, runs in VMs.



2)      For proxy nodes running in VMs, it is perfectly normal for multiple proxy 
nodes to run on one physical server. If the load on one proxy node 
increases, it is easy to move its VM from one physical server to another; this is 
mature technology that is easy to monitor and to manage. Most 
virtualization platforms also support hot scale-up of a single virtual machine.



3)      In some scenarios ZooKeeper is already used to manage proxy node roles and 
membership, and a backup node takes over the responsibilities of a failed node 
(a sketch of this pattern follows below).
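
A minimal sketch of that pattern, assuming kazoo as the ZooKeeper client (the 
hosts, path and node identifier below are made up for illustration): one proxy 
node per bottom site holds the leadership, and when it fails its ZooKeeper 
session expires and a standby node wins the next election and takes over.

from kazoo.client import KazooClient
from kazoo.recipe.election import Election

def serve_bottom_site(site):
    # Runs only while this node is the active proxy for `site`.
    print("became active proxy for", site)
    # ... forward RPC traffic to the bottom OpenStack here ...

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# All candidate proxy nodes for the same bottom site contend on the same path.
election = Election(zk, "/proxy/site-a", identifier="proxy-node-1")

# Blocks until this node wins, then runs the callback; if the current leader
# dies, one of the waiting standby nodes is elected and takes over.
election.run(serve_bottom_site, "site-a")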


So I do not see that the “fake node” mode brings any extra benefit. On the other 
hand, the “fake node” adds additional complexity:

1) The complexity of the code in the cascade service, which has to implement both 
the RPC to the scheduler and the RPC to the compute node / cinder-volume.

2) How to judge the load of a “fake node”. If all “fake nodes” run 
flatly (no dedicated process or thread, just a symbol) in the same process, how can 
you judge the load of one “fake node”? By message count? A message count does not 
imply load. Load is usually measured through CPU utilization and memory consumption, 
so how do you calculate the load of each “fake node” and then decide which nodes to 
move to another physical server? And how do you manage such “fake nodes” in 
ZooKeeper-like cluster ware? You may want to run each fake node in its own process 
or thread space, but then you need to manage the fake-node to process/thread 
relationship (the sketch below illustrates the measurement gap).
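
To make the measurement gap concrete, here is a minimal sketch (assuming the 
fake nodes are plain objects sharing one Python process and that psutil is 
available; all names are hypothetical): CPU and memory can be read per process, 
but there is no per-fake-node breakdown on which to base a migration decision.

import os
import psutil

class FakeNode:
    # Hypothetical in-process stand-in for one bottom OpenStack site.
    def __init__(self, site_name):
        self.site_name = site_name
        self.handled_messages = 0

    def handle(self, message):
        self.handled_messages += 1   # message count is easy to track ...
        # ... but a count says nothing about the CPU or memory it cost

proc = psutil.Process(os.getpid())
print("process CPU %:", proc.cpu_percent(interval=1.0))
print("process RSS MB:", proc.memory_info().rss / 2**20)
# The numbers above cover the whole process; there is no per-FakeNode
# breakdown, yet deciding which fake node to move needs exactly that.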

I admit that proposal 3 is much more complex to make work for flexible load 
balancing. We have to record a relative stamp for each message in the queue, pick 
the message from the message bus, put it into a per-site task queue in the DB, and 
then execute these tasks in order (see the sketch below).
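
A minimal sketch of that flow (all names are hypothetical, and the per-site 
queues are kept in memory here instead of in the DB): every message gets a 
relative stamp on receipt, is appended to its site's task queue, and a per-site 
executor drains the queue strictly in stamp order.

import heapq
import itertools
from collections import defaultdict

_stamp = itertools.count()        # relative stamp attached when a message arrives
site_queues = defaultdict(list)   # stand-in for the per-site task tables in the DB

def enqueue(site, task):
    # Dispatcher: stamp the message and persist it to the site's task queue.
    heapq.heappush(site_queues[site], (next(_stamp), task))

def run_site(site):
    # Executor for one site: pop and run its tasks in stamp order.
    while site_queues[site]:
        stamp, task = heapq.heappop(site_queues[site])
        task()                    # e.g. forward one request to the bottom site

enqueue("site-a", lambda: print("create security group 111"))
enqueue("site-a", lambda: print("update 111: allow *"))
enqueue("site-a", lambda: print("update 111: drop *"))
run_site("site-a")                # executes the three tasks in the order received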

As described above, proposal 2 does not bring any extra benefit, so if we don't 
want to pursue the third direction, we had better fall back to proposal 1.

Best Regards
Chaoyi Huang ( Joe Huang )

From: e...@gampel.co.il [mailto:e...@gampel.co.il] On Behalf Of Eran Gampel
Sent: Thursday, August 27, 2015 7:07 PM
To: joehuang; Irena Berezovsky; Eshed Gal-Or; Ayal Baron; OpenStack Development 
Mailing List (not for usage questions); caizhiyuan (A); Saggi Mizrahi; Orran 
Krieger; Gal Sagie; Orran Krieger; Zhipeng Huang
Subject: Re: [openstack-dev][tricircle] multiple cascade services

Hi,
Please see my comments inline
BR,
Eran

Hello,

As we discussed in yesterday's meeting, the point of contention is how to scale 
out the cascade services.


1)      In the PoC, one proxy node only forwards to one bottom OpenStack. The 
proxy node is added to the corresponding AZ, and multiple proxy nodes for one 
bottom OpenStack are feasible by adding more proxy nodes to this AZ; the 
proxy nodes are then scheduled as usual.



Is this perfect? No. Because a VM's host attribute is bound to a specific 
proxy node, these multiple proxy nodes can't work in cluster mode, 
and each proxy node has to be backed up by one slave node.



[Eran] I agree with this point - In the PoC you had the limitation of a single 
active proxy per bottom site.  In addition, each proxy could only support a 
single bottom site by design.



2)      The fake node introduced in the cascade service.

Because a fanout RPC call is assumed for the Neutron API, multiple fake nodes 
for one bottom OpenStack are not allowed.



[Eran] In fact, this is not a limitation of the current design.  We could have 
multiple "fake nodes" to handle the same bottom site, but only one that is 
active.  If this active node becomes unavailable, one of the other "passive" 
nodes can take over with leader election or any other known design pattern 
(it's an implementation decision).

And because the traffic to one bottom OpenStack is unpredictable, and moving 
these fake nodes dynamically among cascade services is very complicated, 
we can't deploy multiple fake nodes in one cascade service.



[Eran] I'm not sure I follow you on this point... as we see it, there are 3 
places where load is an issue (and a potential bottleneck):

1. API + message queue + database

2. Cascading Service itself (dependency builder, communication service, DAL)

3. Task execution



I think you were concerned about #2, which in our design must be 
single-active per bottom site (to maintain the order of task execution).

In our opinion, the heaviest part is actually #3 (task execution), which is 
delegated to a separate execution path (Mistral workflow or otherwise).

If we have one Cascading Service handling multiple Bottom sites and at 
some point in time want it to handle just one Bottom site and move the rest of 
them to a different Cascading Service instance, that is possible.

The way we see it, we have multiple Fake Nodes running in multiple Cascading 
Services, in active-passive mode.  That way, when one Cascading Service instance 
becomes overloaded, it can give up its "leadership" of active fake nodes, and 
some of the other Cascading Services will take over (via leader election or 
otherwise).  This is a very common design pattern; we don't see anything 
special or complicated here (a sketch of this handover follows below).
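
A minimal sketch of that handover, assuming kazoo for the leader election and a 
system CPU reading as the load signal (paths, identifiers and the threshold are 
made up): while a Cascading Service instance leads a fake node it watches its 
own load, and when overloaded it returns from the leadership callback, so a 
passive peer wins the next election and takes over that fake node.

import time
import psutil
from kazoo.client import KazooClient

CPU_HIGH_WATERMARK = 85.0         # percent; hypothetical threshold

def lead_fake_node(site):
    print("active for fake node", site)
    while True:
        # ... drive the tasks for `site` here ...
        if psutil.cpu_percent(interval=5.0) > CPU_HIGH_WATERMARK:
            print("overloaded, releasing leadership for", site)
            return                # returning relinquishes leadership

zk = KazooClient(hosts="zk1:2181")
zk.start()
while True:
    # Compete for the fake node; the call blocks until this instance wins.
    zk.Election("/cascading/site-a", "cascading-svc-2").run(lead_fake_node, "site-a")
    time.sleep(10)                # back off before competing for leadership again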



In the end, we have to deploy one fake node per cascade service.

And with one cascade service per bottom OpenStack, the burst traffic that one 
cascaded (bottom) OpenStack can receive is limited.

And you have to back up the cascade service.



[Eran] This is correct.  In the worst case of traffic burst to a single bottom 
site, a single Cascading Service will handle a single Fake Node exclusively, 
and it is not possible to handle a single Bottom Site with more than a single 
Fake Node at any given time.

Having said that, we don't see a scenario where the Fake Node / Cascading 
Service will become a bottleneck.  We think that #3 (task execution) and #1 
(message queue, API and database) will choke before, probably because the 
OpenStack components in the Top and Bottom sites will not be able to handle the 
burst (which is a completely different story).



3)      From the beginning, I have preferred to run multiple cascade services in 
parallel, all of them working in load-balanced cluster mode.

[Eran] I believe we already discussed this before - It is actually not possible.
If you did that, you would have race conditions and mis-ordering of actions, 
and an inconsistent state in the Bottom sites.
For example, if the Top user did:
#1 create security group "111"
#2 update security group "111" with "Allow *"
#3 update security group "111" with "Drop *"

If you have more than a single Cascading service that is responsible for site 
"A", you don't know what will be the order of actions.
In the example I gave, you may end up with site "A" having security group "111" 
with "Allow *" or with "Drop *".
The APIs of (Nova, Cinder, Neutron, …) call the cascade service through RPC, and the 
RPC call will be forwarded to only one of the cascade services (the RPC is simply 
put on a message bus queue; once one of the cascade services picks up the 
message, the message is removed from the queue and will not be consumed by any 
other cascade service). When a cascade service receives a message, it 
starts a task to execute the request. If multiple bottom OpenStacks are 
involved, for example for networking, then the networking request is 
forwarded to the corresponding bottom OpenStacks where the resources (VMs, 
floating IPs) reside.
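
A minimal sketch of that dispatch (generic queue/worker code standing in for 
the message bus, not the oslo.messaging API; all names are hypothetical): each 
request is put on a shared queue, and whichever cascade service worker takes it 
removes it, so no other worker consumes the same message.

import queue
import threading

rpc_queue = queue.Queue()             # stands in for the message-bus topic queue

def cascade_service(name):
    while True:
        request = rpc_queue.get()     # exactly one worker receives each message
        print(name, "starts a task for", request["resource"])
        # ... forward to the bottom OpenStacks where the resources reside ...
        rpc_queue.task_done()

for i in range(3):                    # three cascade services share the load
    threading.Thread(target=cascade_service, args=("cascade-%d" % i,),
                     daemon=True).start()

rpc_queue.put({"resource": "network net-1"})
rpc_queue.put({"resource": "volume vol-7"})
rpc_queue.join()                      # wait until both requests are handled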

To keep the correct order of operations, all tasks store the necessary data in 
the DB to prevent an operation from being broken for a single site (if a VM is 
still creating, a reboot is not allowed; such use cases are already handled on 
the API side of Nova, Cinder, Neutron, …).
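
A minimal sketch of that kind of state guard (the transition table and names 
are hypothetical; the real checks live on the Nova/Cinder/Neutron API side): 
the resource state recorded in the DB is consulted before a conflicting 
operation is accepted.

ALLOWED_ACTIONS = {
    "creating":  set(),                       # nothing allowed while still building
    "active":    {"reboot", "resize", "delete"},
    "rebooting": set(),
}

vm_state_db = {"vm-1": "creating"}            # stand-in for the task/state table

def request_action(vm_id, action):
    state = vm_state_db[vm_id]
    if action not in ALLOWED_ACTIONS[state]:
        raise RuntimeError("%s not allowed while %s is %s" % (action, vm_id, state))
    # ... record the new task in the DB and hand it to the executor ...

request_action("vm-1", "reboot")              # rejected: the VM is still creating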

[Eran] This will not enforce order - it only keeps state between non-racing 
actions.  It will not guarantee consistency in the common scenario of multiple 
updates to a specific resource within a short period, as in the security group 
example I just gave.
Maybe it will work for a few predictable use cases, but there will always be 
something else that you did not plan for.
It is ultimately an unsafe design.
If you propose to make the database the coordinator of this process (which I 
don't see how), you will end up with an even worse bottleneck - in the database.



In this way, we can dynamically add cascade service nodes and balance the 
traffic dynamically.


Best Regards
Chaoyi Huang ( Joe Huang )

