Hi devs, During the testing of the Python Cartridge Agent in the past few weeks we were able to identify a few points where performance and functionality could be improved.
*1 - Agent's thread utilization * It was observed that when the agent is kept running for long periods of time, the thread count spawned would go up and reach the thread limit giving the following error. Exception in thread Thread-85: Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner self.run() File "/mnt/apache-stratos-python-cartridge-agent-4.1.4/modules/util/asyncscheduledtask.py", line 65, in run task_thread.start() File "/usr/lib/python2.7/threading.py", line 745, in start _start_new_thread(self.__bootstrap, ()) error: can't start new thread Although I couldn't pinpoint the exact use case which results in this sudden spike in the thread count, it's most likely caused by the MessageBrokerHeartBeatChecker class inside subscriber.py file. This is a periodic job which checks for the liveliness of the connected message broker. We can get rid of this by implementing a callback method of Paho Mqtt Client's on_disconnect method [1]. Furthermore, the ScheduledExecutor class in the asyncscheduledtask.py file, for each invocation it spawns a new thread. while not self.terminated: time.sleep(self.delay) task_thread = Thread(target=self.task.execute_task) task_thread.start() This is not a good practice. It should rather submit the task to a thread pool. Imesh has an example code of the thread pool usage using the pythonfutures library[2] for Python 2.7. *2. Decoupling the messaging implementation* When MQTT was chosen initially as the message broker protocol, the PCA was designed to only use MQTT, using the Paho MQTT library[3]. However after coming across several issues in MQTT, it was decided to default back to AMQP as the message broker protocol for Stratos Components. Except for the PCA, since at the time, there was no AMQP 1.0 supported Python library. However, it seems that Apache QPid Proton library[4] provides AMQP 1.0 compliant clients for message broker communication. We can provide a common interface for messaging and provide implementation for MQTT and AMQP (if Apache QPid Proton proves to be helpful) protocols. This way we can provide the selection of the protocol to the configuration, similar to what we have in the Java based components such as AS, CC etc. Furthermore, this can greatly help if messaging has to be customized to cater for custom protocols such as Apache Kafka[5]. It will only be another implementation of the common interface. *3. Remove dependency on the Git Binary* Currently the PCA manipulates the Git based artifact repository through the Git binary file. This is inefficient for several reasons. 1. Every Git command is executed as a separate process, created using os.fork(). This duplicates memory and results in needless marshalling and unmarshalling of input and output among the processes. 2. If several commands are executed upon the Git binary at the same time (ex: of a multi-tenant application), it can be a performance bottleneck. 3. This is not platform independent. Therefore it will greatly help if we can go for a Git library for Python which doesn't depend on the Git binary. Dulwich [6] was considered in the past, but then releases had more features TBD. However recentl releases seems to have fixed a lot of bug reports and features. It also has plans to be re licensed as Apache v2.0 which also would help us by making it shippable. *4. Use of the maintenance mode* Sajith recently started a discussion on a patching strategy for the PCA (Thread - [Discuss] Suggesting a patching model for PCA). Patching for the PCA involves two different scenarios, offline and online. If instances that are already online needs to be patched the current PCA or Stratos design doesn't allow such a window. Maintenance Mode can be used to signal the Autoscaler that the member has gone to maintenance and scaling decisions should not be taken on that member. While in the maintenance mode, the running PCA is gracefully shutdown and the patched PCA comes up. When it publishes InstanceActivated event, the member will again be involved in the scaling decision process. There are few places that needs further development for this scenario to work. 1. It seems that Stratos would mark a member as Pending Termination if a maintenance mode event is received. 2. The PCA doesn't support graceful shutdown right now. It doesn't take any inputs from outside, and the threads spawned by it are not daemon threads, which results in a process that needs to be killed because the threads do not terminate when the main thread goes down. It can be designed that an input to a periodically checked external file or somehow OS Signals can set flags on the running process to terminate. 3. The capability of a PCA to start on an already activated instance isn't verified. Based on the status of the member that it resides, it should/should not publish certain events. *5. Update topology model based on the message broker events* Currently the topology is repopulated everytime the complete topology event is received. This was done as a quick fix to update the topology model. However for large complete topology events, building event objects and updating contexts can be costly, and on the design front, that is not the intended use of the complete topology event. The events should dynamically update the topology status of the agent. For this to happen another task should first be completed. *6. Verify message broker event classes* Currently all the message broker events are deserialized in to event objects as per to the classes defined in modules/event package. However this doesn't depend or track any changes done to the Stratos Messaging Component, and therefore can quickly be outdated without a hint on the build or the tests. The events have to be verified against their counterparts in the messaging component. *7. Decouple log file publishing protocol* The PCA has a log publishing feature where a specified list of log files will be monitored and entries published to a thrift data receiver. However there can be situations where the server logs (ex: PCA's own log) has to be monitored from outside (ex: monitoring the agent.log of a Docker container inside Kubernetes using FluentD [7], or directly publishing agent.log to ELK). For those situations the log publishing feature is not flexible enough. If we can introduce a pluggable design for the log event publisher, it can solve most of these situations with low effort. *8. Verify Python 3 compatibility* Python 2.7 is said to be supported until 2020 [8], so this is not a major concern. However since the opinions between Python 2.7 and Python 3 can be political, it might be good to verify and adjust for Python 3 compatibility. The PCA was originally written with Python 2.7 in mind, since it is still the most distributed version by default. These changes might not be immediately solved, however they can be critical in tuning some rough spots in the PCA implementation. WDYT? Ideas? [1] - http://www.hivemq.com/blog/mqtt-client-library-paho-python [2] - https://github.com/agronholm/pythonfutures/ [3] - https://eclipse.org/paho/clients/python/ [4] - https://qpid.apache.org/releases/qpid-proton-0.11.0/proton/python/book/tutorial.html [5] - http://kafka.apache.org/documentation.html#gettingStarted [6] - https://www.dulwich.io/ [7] - https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/fluentd-elasticsearch/fluentd-es-image [8] - https://hg.python.org/peps/rev/76d43e52d978 Regards, Chamila de Alwis Committer and PMC Member - Apache Stratos Software Engineer | WSO2 | +94772207163 Blog: code.chamiladealwis.com