Hi devs,

During the testing of the Python Cartridge Agent in the past few weeks we
were able to identify a few points where performance and functionality
could be improved.

*1 - Agent's thread utilization  *
It was observed that when the agent is kept running for long periods of
time, the thread count spawned would go up and reach the thread limit
giving the following error.

Exception in thread Thread-85:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File
"/mnt/apache-stratos-python-cartridge-agent-4.1.4/modules/util/asyncscheduledtask.py",
line 65, in run
    task_thread.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
error: can't start new thread

Although I couldn't pinpoint the exact use case which results in this
sudden spike in the thread count, it's most likely caused by the
MessageBrokerHeartBeatChecker class inside subscriber.py file. This is a
periodic job which checks for the liveliness of the connected message
broker.

We can get rid of this by implementing a callback method of Paho Mqtt
Client's on_disconnect method [1].

Furthermore, the ScheduledExecutor class in the asyncscheduledtask.py file,
for each invocation it spawns a new thread.

while not self.terminated:
    time.sleep(self.delay)
    task_thread = Thread(target=self.task.execute_task)
    task_thread.start()


This is not a good practice. It should rather submit the task to a thread
pool. Imesh has an example code of the thread pool usage using the
pythonfutures library[2] for Python 2.7.

*2. Decoupling the messaging implementation*

When MQTT was chosen initially as the message broker protocol, the PCA was
designed to only use MQTT, using the Paho MQTT library[3]. However after
coming across several issues in MQTT, it was decided to default back to
AMQP as the message broker protocol for Stratos Components. Except for the
PCA, since at the time, there was no AMQP 1.0 supported Python library.
However, it seems that Apache QPid Proton library[4] provides AMQP 1.0
compliant clients for message broker communication.

We can provide a common interface for messaging and provide implementation
for MQTT and AMQP (if Apache QPid Proton proves to be helpful) protocols.
This way we can provide the selection of the protocol to the configuration,
similar to what we have in the Java based components such as AS, CC etc.

Furthermore, this can greatly help if messaging has to be customized to
cater for custom protocols such as Apache Kafka[5]. It will only be another
implementation of the common interface.

*3. Remove dependency on the Git Binary*

Currently the PCA manipulates the Git based artifact repository through the
Git binary file. This is inefficient for several reasons.

   1. Every Git command is executed as a separate process, created using
   os.fork(). This duplicates memory and results in needless marshalling
   and unmarshalling of input and output among the processes.
   2. If several commands are executed upon the Git binary at the same time
   (ex: of a multi-tenant application), it can be a performance bottleneck.
   3. This is not platform independent.

Therefore it will greatly help if we can go for a Git library for Python
which doesn't depend on the Git binary. Dulwich [6] was considered in the
past, but then releases had more features TBD. However recentl releases
seems  to have fixed a lot of bug reports and features. It also has plans
to be re licensed as Apache v2.0 which also would help us by making it
shippable.

*4. Use of the maintenance mode*

Sajith recently started a discussion on a patching strategy for the PCA
(Thread - [Discuss] Suggesting a patching model for PCA). Patching for the
PCA involves two different scenarios, offline and online. If instances that
are already online needs to be patched the current PCA or Stratos design
doesn't allow such a window.

Maintenance Mode can be used to signal the Autoscaler that the member has
gone to maintenance and scaling decisions should not be taken on that
member. While in the maintenance mode, the running PCA is gracefully
shutdown and the patched PCA comes up. When it publishes InstanceActivated
event, the member will again be involved in the scaling decision process.

There are few places that needs further development for this scenario to
work.

   1. It seems that Stratos would mark a member as Pending Termination if a
   maintenance mode event is received.
   2. The PCA doesn't support graceful shutdown right now. It doesn't take
   any inputs from outside, and the threads spawned by it are not daemon
   threads, which results in a process that needs to be killed because the
   threads do not terminate when the main thread goes down. It can be designed
   that an input to a periodically checked external file or somehow OS Signals
   can set flags on the running process to terminate.
   3. The capability of a PCA to start on an already activated instance
   isn't verified. Based on the status of the member that it resides, it
   should/should not publish certain events.


*5. Update topology model based on the message broker events*

Currently the topology is repopulated everytime the complete topology event
is received. This was done as a quick fix to update the topology model.
However for large complete topology events, building event objects and
updating contexts can be costly, and on the design front, that is not the
intended use of the complete topology event. The events should dynamically
update the topology status of the agent. For this to happen another task
should first be completed.

*6. Verify message broker event classes*

Currently all the message broker events are deserialized in to event
objects as per to the classes defined in modules/event package. However
this doesn't depend or track any changes done to the Stratos Messaging
Component, and therefore can quickly be outdated without a hint on the
build or the tests. The events have to be verified against their
counterparts in the messaging component.

*7. Decouple log file publishing protocol*

The PCA has a log publishing feature where a specified list of log files
will be monitored and entries published to a thrift data receiver. However
there can be situations where the server logs (ex: PCA's own log) has to be
monitored from outside (ex: monitoring the agent.log of a Docker container
inside Kubernetes using FluentD [7], or directly publishing agent.log to
ELK). For those situations the log publishing feature is not flexible
enough. If we can introduce a pluggable design for the log event publisher,
it can solve most of these situations with low effort.

*8. Verify Python 3 compatibility*

Python 2.7 is said to be supported until 2020 [8], so this is not a major
concern. However since the opinions between Python 2.7 and Python 3 can be
political, it might be good to verify and adjust for Python 3
compatibility. The PCA was originally written with Python 2.7 in mind,
since it is still the most distributed version by default.

These changes might not be immediately solved, however they can be critical
in tuning some rough spots in the PCA implementation. WDYT? Ideas?


[1] - http://www.hivemq.com/blog/mqtt-client-library-paho-python
[2] - https://github.com/agronholm/pythonfutures/
[3] - https://eclipse.org/paho/clients/python/
[4] -
https://qpid.apache.org/releases/qpid-proton-0.11.0/proton/python/book/tutorial.html
[5] - http://kafka.apache.org/documentation.html#gettingStarted
[6] - https://www.dulwich.io/
[7] -
https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/fluentd-elasticsearch/fluentd-es-image
[8] - https://hg.python.org/peps/rev/76d43e52d978

Regards,
Chamila de Alwis
Committer and PMC Member - Apache Stratos
Software Engineer | WSO2 | +94772207163
Blog: code.chamiladealwis.com

Reply via email to