IMO the threading issue and moving Git command execution to a Python library are high priority; we might need to address them immediately. The others can be considered low priority.
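For the threading fix, a rough sketch of what submitting the periodic task to a pool could look like, using the pythonfutures backport [2]. This assumes the existing ScheduledExecutor keeps its delay/task/terminated fields; the pool size here is an arbitrary choice:

from concurrent.futures import ThreadPoolExecutor
import time

class ScheduledExecutor(object):
    """Runs a task periodically on a fixed-size pool instead of
    spawning a new thread per invocation."""

    def __init__(self, delay, task, pool_size=5):
        self.delay = delay
        self.task = task
        self.terminated = False
        # The pool is created once; its worker threads are reused
        # across invocations, so the thread count stays bounded.
        self.pool = ThreadPoolExecutor(max_workers=pool_size)

    def run(self):
        while not self.terminated:
            time.sleep(self.delay)
            # submit() queues the task on an existing worker thread
            self.pool.submit(self.task.execute_task)
        self.pool.shutdown(wait=False)

With the backport installed (pip install futures), the concurrent.futures import works on Python 2.7 as well.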
Writing the agent in Go is a good thought, but we might need to evaluate the effort, advantages, and disadvantages first.

Thanks

On Sun, Dec 13, 2015 at 6:55 PM, Gayan Gunarathne <gay...@wso2.com> wrote:

> Hi Chamila,
>
> On Sat, Dec 12, 2015 at 1:34 AM, Chamila De Alwis <chami...@wso2.com> wrote:
>
>> Hi devs,
>>
>> During the testing of the Python Cartridge Agent in the past few weeks we
>> were able to identify a few points where performance and functionality
>> could be improved.
>>
>> *1. Agent's thread utilization*
>>
>> It was observed that when the agent is kept running for long periods of
>> time, the number of spawned threads would climb until it reached the
>> thread limit, giving the following error:
>>
>> Exception in thread Thread-85:
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
>>     self.run()
>>   File "/mnt/apache-stratos-python-cartridge-agent-4.1.4/modules/util/asyncscheduledtask.py", line 65, in run
>>     task_thread.start()
>>   File "/usr/lib/python2.7/threading.py", line 745, in start
>>     _start_new_thread(self.__bootstrap, ())
>> error: can't start new thread
>>
>> Although I couldn't pinpoint the exact use case which results in this
>> sudden spike in the thread count, it is most likely caused by the
>> MessageBrokerHeartBeatChecker class inside the subscriber.py file. This is
>> a periodic job which checks the liveness of the connected message broker.
>>
>> We can get rid of this by implementing a callback for the Paho MQTT
>> client's on_disconnect event [1].
>>
>> Furthermore, the ScheduledExecutor class in the asyncscheduledtask.py
>> file spawns a new thread for each invocation:
>>
>> while not self.terminated:
>>     time.sleep(self.delay)
>>     task_thread = Thread(target=self.task.execute_task)
>>     task_thread.start()
>>
>> This is not good practice. It should rather submit the task to a thread
>> pool. Imesh has example code of thread pool usage with the pythonfutures
>> library [2] for Python 2.7.
>
> +1
>
>> *2. Decoupling the messaging implementation*
>>
>> When MQTT was initially chosen as the message broker protocol, the PCA
>> was designed to use only MQTT, through the Paho MQTT library [3]. However,
>> after coming across several issues in MQTT, it was decided to default back
>> to AMQP as the message broker protocol for the Stratos components, except
>> for the PCA, since at the time there was no Python library that supported
>> AMQP 1.0. However, it seems that the Apache Qpid Proton library [4]
>> provides AMQP 1.0 compliant clients for message broker communication.
>
> It is good if we can introduce AMQP as a messaging protocol for the PCA as
> well. This will give the end user the flexibility to choose the messaging
> protocol they prefer. In the case of Apache Qpid, we need to recheck the
> platform independence of that library.
>
>> We can provide a common interface for messaging, with implementations for
>> the MQTT and AMQP (if Apache Qpid Proton proves to be helpful) protocols.
>> This way we can move the protocol selection into the configuration,
>> similar to what we have in the Java based components such as AS, CC, etc.
>>
>> Furthermore, this can greatly help if messaging has to be customized to
>> cater for custom protocols such as Apache Kafka [5]; it would only be
>> another implementation of the common interface.
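A rough sketch of what such a common interface could look like; all class and function names here are illustrative, not existing PCA code:

# Illustrative only -- these names are hypothetical, not existing PCA modules.
class EventSubscriber(object):
    """Protocol-agnostic subscriber interface."""

    def connect(self, broker_url):
        raise NotImplementedError

    def subscribe(self, topic, on_message):
        raise NotImplementedError


class MQTTEventSubscriber(EventSubscriber):
    """Would wrap paho.mqtt.client.Client [3]."""
    def connect(self, broker_url):
        pass  # paho connect/loop calls would go here

    def subscribe(self, topic, on_message):
        pass  # paho subscribe + on_message callback wiring


class AMQPEventSubscriber(EventSubscriber):
    """Would wrap a qpid-proton based client [4]."""
    def connect(self, broker_url):
        pass

    def subscribe(self, topic, on_message):
        pass


def create_subscriber(protocol):
    # "protocol" would be read from the agent configuration file
    implementations = {"mqtt": MQTTEventSubscriber, "amqp": AMQPEventSubscriber}
    return implementations[protocol]()

A Kafka implementation would then just be another entry in that map.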
>> *3. Remove the dependency on the Git binary*
>>
>> Currently the PCA manipulates the Git based artifact repository through
>> the Git binary. This is inefficient for several reasons:
>>
>> 1. Every Git command is executed as a separate process, created using
>> os.fork(). This duplicates memory and results in needless marshalling
>> and unmarshalling of input and output between the processes.
>> 2. If several commands are executed against the Git binary at the same
>> time (e.g. in a multi-tenant application), it can become a performance
>> bottleneck.
>> 3. It is not platform independent.
>>
>> Therefore it would greatly help if we could move to a Git library for
>> Python which doesn't depend on the Git binary. Dulwich [6] was considered
>> in the past, but at the time its releases still had many features marked
>> TBD. However, recent releases seem to have fixed a lot of bug reports and
>> missing features. There are also plans to relicense it under Apache v2.0,
>> which would help us by making it shippable.
>
> +1. Maybe we can do some research with the Dulwich library and introduce
> it for the PCA Git operations.
>
>> *4. Use of the maintenance mode*
>>
>> Sajith recently started a discussion on a patching strategy for the PCA
>> (thread: [Discuss] Suggesting a patching model for PCA). Patching the PCA
>> involves two different scenarios: offline and online. If instances that
>> are already online need to be patched, the current PCA and Stratos design
>> doesn't allow such a window.
>>
>> Maintenance mode can be used to signal the Autoscaler that the member has
>> gone into maintenance and that scaling decisions should not be taken on
>> that member. While in maintenance mode, the running PCA is gracefully shut
>> down and the patched PCA comes up. When it publishes the InstanceActivated
>> event, the member will again be involved in the scaling decision process.
>>
>> There are a few places that need further development for this scenario to
>> work:
>>
>> 1. It seems that Stratos would mark a member as Pending Termination if a
>> maintenance mode event is received.
>> 2. The PCA doesn't support graceful shutdown right now. It doesn't take
>> any input from outside, and the threads it spawns are not daemon threads,
>> which results in a process that has to be killed because the threads do
>> not terminate when the main thread goes down. It could be designed so that
>> a periodically checked external file, or OS signals, can set flags on the
>> running process to terminate (see the sketch after this list).
>> 3. The capability of the PCA to start on an already activated instance
>> isn't verified. Based on the status of the member it resides on, it should
>> or should not publish certain events.
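On point 2 above, a minimal sketch of the signal-based approach, assuming the agent's loops are adapted to check a shared flag (the worker body here is only a placeholder):

import signal
import threading
import time

terminate = threading.Event()

def handle_sigterm(signum, frame):
    # Set a flag instead of killing the process outright; the loops
    # below notice it and wind down cleanly.
    terminate.set()

def worker():
    while not terminate.is_set():
        time.sleep(1)  # placeholder for periodic agent work

signal.signal(signal.SIGTERM, handle_sigterm)

worker_thread = threading.Thread(target=worker)
worker_thread.daemon = True  # daemon threads cannot keep the process alive
worker_thread.start()

while not terminate.is_set():
    terminate.wait(1)
# graceful cleanup / event publishing would happen here before exit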
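And going back to 3, a rough sketch of the same repository operations through Dulwich's porcelain API [6]. The repository URL and paths are just examples, and the exact function signatures should be verified against the Dulwich release we target:

from dulwich import porcelain

# clone runs in-process -- no fork/exec of a git binary
repo = porcelain.clone("https://host/artifact-repo.git", "/tmp/app-artifacts")

# subsequent artifact updates: fetch + merge in the same process
porcelain.pull(repo, "https://host/artifact-repo.git")

# committing and pushing local changes back
porcelain.add(repo, ["/tmp/app-artifacts/updated.file"])
porcelain.commit(repo, message="agent: commit local artifact changes")
porcelain.push(repo, "https://host/artifact-repo.git", "master")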
>> *5. Update the topology model based on message broker events*
>>
>> Currently the topology is repopulated every time the complete topology
>> event is received. This was done as a quick fix to update the topology
>> model. However, for large complete topology events, building event objects
>> and updating contexts can be costly, and on the design front, that is not
>> the intended use of the complete topology event. The events should
>> dynamically update the topology state held by the agent. For this to
>> happen, another task should be completed first.
>
> I am not clear about this point. Can you explain further? With the
> introduction of the on-demand complete topology publishing feature, we
> don't need to listen for the complete topology event always. Isn't that so?
>
>> *6. Verify the message broker event classes*
>>
>> Currently all the message broker events are deserialized into event
>> objects as per the classes defined in the modules/event package. However,
>> this doesn't depend on or track any changes done to the Stratos Messaging
>> component, and can therefore quickly become outdated without any hint at
>> build or test time. The events have to be verified against their
>> counterparts in the messaging component.
>
> In the case of the instance life cycle, I think it is important to check
> all the events received on the CA side.
>
>> *7. Decouple the log file publishing protocol*
>>
>> The PCA has a log publishing feature where a specified list of log files
>> is monitored and entries are published to a Thrift data receiver. However,
>> there can be situations where the server logs (e.g. the PCA's own log)
>> have to be monitored from outside (e.g. monitoring the agent.log of a
>> Docker container inside Kubernetes using Fluentd [7], or directly
>> publishing agent.log to ELK). For those situations the log publishing
>> feature is not flexible enough. If we can introduce a pluggable design for
>> the log event publisher, it can solve most of these situations with little
>> effort.
>
> In the long run, I think we may need to think about supporting some
> central logging feature with the PCA.
>
>> *8. Verify Python 3 compatibility*
>>
>> Python 2.7 is said to be supported until 2020 [8], so this is not a major
>> concern. However, since the opinions between Python 2.7 and Python 3 can
>> be political, it might be good to verify and adjust for Python 3
>> compatibility. The PCA was originally written with Python 2.7 in mind,
>> since it is still the most widely distributed version by default.
>>
>> These changes might not be solved immediately; however, they can be
>> critical in tuning some rough spots in the PCA implementation. WDYT?
>> Ideas?
>>
>> [1] - http://www.hivemq.com/blog/mqtt-client-library-paho-python
>> [2] - https://github.com/agronholm/pythonfutures/
>> [3] - https://eclipse.org/paho/clients/python/
>> [4] - https://qpid.apache.org/releases/qpid-proton-0.11.0/proton/python/book/tutorial.html
>> [5] - http://kafka.apache.org/documentation.html#gettingStarted
>> [6] - https://www.dulwich.io/
>> [7] - https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/fluentd-elasticsearch/fluentd-es-image
>> [8] - https://hg.python.org/peps/rev/76d43e52d978
>>
>> Regards,
>> Chamila de Alwis
>> Committer and PMC Member - Apache Stratos
>> Software Engineer | WSO2 | +94772207163
>> Blog: code.chamiladealwis.com
>
> --
> Gayan Gunarathne
> Technical Lead, WSO2 Inc. (http://wso2.com)
> Committer & PMC Member, Apache Stratos
> email : gay...@wso2.com | mobile : +94 775030545

--
Imesh Gunaratne
Senior Technical Lead, WSO2
Committer & PMC Member, Apache Stratos