Hi, I have confirmed that the issue related to local queueing of resource requests, which I highlighted at the design summit, still exists. I have also confirmed that the issue is solved in oslo.messaging version 5.0.0.
The issue is with oslo.messaging library versions below 5.0.0. The messages carrying RPC call/cast requests are drained from the messaging server (RabbitMQ) and submitted to the thread pool executor (GreenThreadPoolExecutor from the futurist library). Before a message is submitted to the executor, it is acknowledged, which means it is deleted from the messaging server. The thread pool executor then queues the message locally when there is no eventlet available to process it. This is bad: the messages pile up locally, and if the process goes down they are lost and very difficult to recover, since they are no longer available in the messaging server. The mail thread http://lists.openstack.org/pipermail/openstack-dev/2015-July/068742.html gives more context.

In convergence, the heat engine casts requests to process the resources, and we don't want heat engine failures to result in the loss of those resource requests, as there is no easy way to recover them.

The issue is fixed by https://review.openstack.org/#/c/297988 . I installed and tested version 5.0.0, which is the latest version of oslo.messaging and has the fix. In the new version, a message is acknowledged only after it gets an eventlet. This is not ideal, in the sense that it doesn't give the service/client the freedom to acknowledge whenever it wants to, but it is better than the older versions. So if the engine process cannot get an eventlet/thread to process a message, the message is not acknowledged and it remains in the messaging server.

I tested with two engine processes with the executor thread pool size set to 2. This means at most 4 resources should be processed at a time and the rest should remain in the messaging server. I created a stack of 8 test resources, each with 20 seconds of waiting time, and saw that 4 messages were available in the messaging server while the other 4 were being processed. I restarted the engine processes and the remaining messages were again picked up for processing.

I am glad that the issue is fixed in the new version, and we should move to it before enabling convergence by default.

-- Anant
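PS: to make the pre-5.0.0 behaviour concrete, here is a minimal, illustrative sketch of how a futurist executor queues work in-process once all its workers are busy. It uses futurist.ThreadPoolExecutor for simplicity (the GreenThreadPoolExecutor that oslo.messaging uses behaves the same way, but needs eventlet monkey-patching); the pool size, task count and sleep time mirror my test above and are otherwise arbitrary, not Heat defaults.

    import time

    import futurist  # pip install futurist

    def handle_message(i):
        # Stand-in for processing one RPC message (one resource request).
        time.sleep(20)
        return i

    # Pool of 2 workers, like the executor thread pool size in my test.
    executor = futurist.ThreadPoolExecutor(max_workers=2)

    # Submit 8 "messages"; only 2 run at once, the other 6 sit in the
    # executor's in-process queue. With oslo.messaging < 5.0.0 all 8
    # would already have been acked on the RabbitMQ side, so a crash at
    # this point would lose the 6 locally queued ones.
    futures = [executor.submit(handle_message, i) for i in range(8)]
    time.sleep(1)  # give the pool a moment to start its workers
    waiting = sum(1 for f in futures if not f.running() and not f.done())
    print("queued locally, not yet running:", waiting)
    executor.shutdown()

For completeness: in my test the pool size of 2 was set through the engine's executor thread pool size option (executor_thread_pool_size, if I have the option name right); the snippet above just hard-codes the same value for the demonstration.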