Rongtong Jin created COMDEV-522:
-----------------------------------
Summary: RocketMQ DLedger Controller Performance Optimization
Key: COMDEV-522
URL: https://issues.apache.org/jira/browse/COMDEV-522
Project: Community Development
Issue Type: Task
Components: Comdev, GSoC/Mentoring ideas
Reporter: Rongtong Jin
*Apache RocketMQ*
Apache RocketMQ is a distributed messaging and streaming platform with low
latency, high performance and reliability, trillion-level capacity, and
flexible scalability.
Page: [https://rocketmq.apache.org/]
Repo: [https://github.com/apache/rocketmq]
*Background*
RocketMQ 5.0 introduced a new component, the Controller, which manages the
high-availability master-slave switchover in multi-replica scenarios. It uses
the DLedger Raft library as the consensus-replicated state machine for
metadata. As a fully independent component it runs well in many scenarios, but
large-scale clusters must maintain a large number of broker groups, which
strains operational capabilities and wastes resources. To handle many broker
groups, we need to optimize performance in large-scale scenarios by leveraging
DLedger's own high-performance writes and by optimizing the current Controller
architecture.
*Task*
1. Polish the usage of DLedger
Currently, the Controller uses a single-threaded task queue for all reads and
writes to DLedger, so only one read/write request can be processed at a time.
DLedger itself implements many optimizations for concurrent multi-client reads
and writes and can guarantee linearizable reads, yet all request processing is
funneled through a single logical DLedger client, which becomes a serious
performance bottleneck in large-scale scenarios.
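The contrast between the current single-threaded task queue and a concurrent
dispatcher can be sketched as follows. This is a minimal illustration, not the
actual Controller code; the class and method names (`ConcurrentDispatchSketch`,
`submit`, `handleAll`) are hypothetical, and the `submit` body stands in for an
asynchronous DLedger append/read call:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: instead of funneling every read/write through one task-queue
// thread, submit each request as an independent CompletableFuture so the
// consensus layer's multi-client pipelining can actually be exploited.
// All names here are hypothetical illustrations, not RocketMQ APIs.
public class ConcurrentDispatchSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final AtomicInteger processed = new AtomicInteger();

    // Stand-in for an asynchronous DLedger append/read call.
    CompletableFuture<Integer> submit(int request) {
        return CompletableFuture.supplyAsync(() -> {
            processed.incrementAndGet();
            return request * 2; // pretend result from the state machine
        }, pool);
    }

    public int handleAll(int n) {
        CompletableFuture<?>[] futures = new CompletableFuture<?>[n];
        for (int i = 0; i < n; i++) {
            futures[i] = submit(i); // requests overlap instead of serializing
        }
        CompletableFuture.allOf(futures).join();
        pool.shutdown();
        return processed.get();
    }
}
```

With a single-threaded queue, request latency adds up serially; with the
dispatcher above, independent requests proceed in parallel up to the pool size.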
2. Optimization of DLedger features usage
DLedger can support further optimizations, such as ReadIndex reads and
follower reads, which would let us fully exploit read performance. Currently,
all broker nodes communicate only with the leader node of the Controller. In
large-scale scenarios this concentrates the requests of each Controller group
on the leader, and the follower nodes share none of the leader's request
processing, creating a single-point performance bottleneck on the leader.
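The ReadIndex idea behind linearizable follower reads can be sketched in a few
lines: the replica obtains the leader's current commit index, waits until its
own applied index catches up to that point, and only then serves the read from
local state. This is a simplified single-process model of the protocol, not
DLedger's real API; all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of a ReadIndex / follower read: a read is only
// served after the local apply progress reaches the commit index the
// leader reported, which preserves linearizability.
public class ReadIndexSketch {
    long leaderCommitIndex;                       // maintained by the leader
    long followerAppliedIndex;                    // this replica's apply progress
    final Map<String, String> stateMachine = new HashMap<>();

    // Apply a committed entry to the local state machine.
    void apply(long index, String key, String value) {
        stateMachine.put(key, value);
        followerAppliedIndex = index;
        leaderCommitIndex = Math.max(leaderCommitIndex, index);
    }

    // ReadIndex read: wait until appliedIndex >= readIndex, then read locally.
    String linearizableRead(String key) {
        long readIndex = leaderCommitIndex;       // fetched from the leader in real systems
        while (followerAppliedIndex < readIndex) {
            Thread.onSpinWait();                  // wait for the apply loop to catch up
        }
        return stateMachine.get(key);
    }
}
```

Once followers can answer such reads, broker requests no longer have to be
concentrated on the Controller group's leader.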
3. Full asynchronous + parallel processing
Currently, DLedger itself is fully asynchronous, but the Controller issues all
of its DLedger requests synchronously, and many Controller-side operations,
such as heartbeat checks and other timed tasks, run synchronously on a single
thread. In large-scale scenarios these single-threaded synchronous operations
block a large number of broker-side requests, so asynchronous and parallel
processing can be applied as an optimization.
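As one concrete example of the above, a heartbeat scan over many broker groups
can be run as parallel futures instead of one synchronous loop, so a single
slow group no longer blocks the rest. The shapes here (`checkAlive`,
`countAlive`, the liveness rule) are hypothetical stand-ins, not Controller
code:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Sketch: fan a periodic heartbeat check out across a thread pool
// rather than scanning every broker group on one thread.
public class ParallelHeartbeatSketch {
    // Stand-in liveness rule so the sketch is deterministic:
    // pretend even-numbered broker groups responded to their heartbeat.
    static boolean checkAlive(int brokerGroup) {
        return brokerGroup % 2 == 0;
    }

    static long countAlive(int groups) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<Boolean>> checks = IntStream.range(0, groups)
                .mapToObj(g -> CompletableFuture.supplyAsync(
                    () -> checkAlive(g), pool))           // checks run concurrently
                .collect(Collectors.toList());
            return checks.stream().filter(CompletableFuture::join).count();
        } finally {
            pool.shutdown();
        }
    }
}
```

The same pattern applies to other timed tasks: each independent check becomes a
future, and only the aggregation step waits on all of them.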
4. Correctness testing and performance testing
After completing the above optimizations, the new version must be tested for
correctness, using distributed chaos-testing frameworks such as OpenChaos to
verify correct operation under fault scenarios such as network partitions and
random crashes.
Once correctness testing is complete, a detailed performance testing report
can be produced by comparing the new and old versions.
*Skills Required*
- Strong interest in message middleware and distributed storage systems
- Proficient in Java development
- In-depth understanding of distributed consensus algorithms
- In-depth understanding of the high-availability module of RocketMQ and the
DLedger library
- Understanding of distributed chaos testing and performance testing
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]