ruojieranyishen commented on issue #1969:
URL:
https://github.com/apache/incubator-pegasus/issues/1969#issuecomment-2102081635
`max_copy_rate_megabytes_per_disk` causes a large number of threads in the
`replica.default` thread pool to sleep. `nfs_service_impl::on_copy`, which
runs on `replica.default`, can also cause this issue.
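To give a feel for why a low limit can park threads for a long time, here is a minimal sketch of the arithmetic. It assumes the limiter behaves like a simple token bucket that blocks the calling thread until enough bytes are available. The function name, its parameters, and the convention that a non-positive rate means "no wait" are hypothetical and are not taken from the Pegasus code.

```cpp
#include <cstdint>

// Hypothetical model, not Pegasus code: estimate how long a worker thread
// would sleep before a copy request of `request_bytes` can proceed.
// Assumptions: the bucket refills at `rate_mb_per_sec` (1 MB is treated as
// 1024 * 1024 bytes here) and currently holds `available_bytes`.
double throttled_sleep_seconds(std::uint64_t request_bytes,
                               double rate_mb_per_sec,
                               std::uint64_t available_bytes)
{
    // Convention of this sketch only: a non-positive rate means no throttling.
    if (rate_mb_per_sec <= 0.0) {
        return 0.0;
    }
    // Enough tokens are already available, so the thread does not sleep.
    if (request_bytes <= available_bytes) {
        return 0.0;
    }
    // The thread sleeps until the refill covers the missing bytes.
    const double deficit = static_cast<double>(request_bytes - available_bytes);
    return deficit / (rate_mb_per_sec * 1024.0 * 1024.0);
}
```

Under these assumptions, a 4 MiB request against an empty bucket limited to 1 MB/s holds its worker thread for about 4 seconds, and that thread cannot serve any other task in the pool during that time.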
The sleeping threads in this pool have the following effects:
1. Remote commands are not processed in time. For example:
```
>>> server_info -r
COMMAND: server-info
CALL [meta-server] [xxxx:45001] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:10:53
CALL [meta-server] [xxxx:45001] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:11:01
CALL [replica-server] [xxxx:45101] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:11:15
CALL [replica-server] [xxxx:45101] failed: ERR_TIMEOUT
CALL [replica-server] [xxxx:45101] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-04-15 17:50:28
CALL [replica-server] [xxxx:45101] failed: ERR_TIMEOUT
CALL [replica-server] [xxxx:45101] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:11:22
Succeed count: 5
Failed count: 2
```
2. Some RPCs are delayed before they are sent. For example:
```
RPC_CM_CONFIG_SYNC
RPC_CM_DUPLICATION_SYNC
RPC_NFS_COPY
```
May be related: https://github.com/apache/incubator-pegasus/issues/1840
3. Group checks in `replica.replica` indirectly fail with ERR_TIMEOUT, which
causes the ballot to increase.
```
[general]
app_name : lpf_test
app_id : 46
partition_count : 100
max_replica_count : 3
[replicas]
pidx ballot replica_count primary secondaries
0 6 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
1 7 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
2 5 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
3 9 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
4 5 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
5 6 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
6 7 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
```
# Solution
1. If `max_copy_rate_megabytes_per_disk` is not set too low and is not
adjusted frequently, the problem is unlikely to be triggered.
2. Replacing the default RPC thread pool used by NFS-related operations in
`nfs_server_impl.cpp` can resolve this issue. For example:
```
-DEFINE_TASK_CODE_RPC(RPC_NFS_COPY, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_RPC(RPC_NFS_COPY, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
 DEFINE_TASK_CODE_RPC(RPC_NFS_GET_FILE_SIZE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
 // test timer task code
 DEFINE_TASK_CODE(LPC_NFS_REQUEST_TIMER, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
-DEFINE_TASK_CODE_AIO(LPC_NFS_READ, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_AIO(LPC_NFS_READ, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
-DEFINE_TASK_CODE_AIO(LPC_NFS_WRITE, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_AIO(LPC_NFS_WRITE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
-DEFINE_TASK_CODE_AIO(LPC_NFS_COPY_FILE, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_AIO(LPC_NFS_COPY_FILE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
```
**Changing the thread pool solves the above problem, but it is unclear
whether it introduces other risks.**
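The effect of moving the throttled work to another pool can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_pool` and `quick_task_latency_ms` are hypothetical names, the pool is a plain FIFO worker pool rather than the rDSN implementation, and a sleeping lambda stands in for a throttled `on_copy` call. It measures how long a quick task (think of a remote command) waits when the sleeping tasks share its pool versus when they run in a separate pool.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size FIFO worker pool, a stand-in for a pool such as
// replica.default. It is not the rDSN thread pool.
class toy_pool
{
public:
    explicit toy_pool(std::size_t workers)
    {
        for (std::size_t i = 0; i < workers; ++i) {
            _threads.emplace_back([this] { run(); });
        }
    }

    // Drains the queue, then joins all workers.
    ~toy_pool()
    {
        {
            std::lock_guard<std::mutex> guard(_lock);
            _stopping = true;
        }
        _cond.notify_all();
        for (auto &t : _threads) {
            t.join();
        }
    }

    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> guard(_lock);
            _tasks.push(std::move(task));
        }
        _cond.notify_one();
    }

private:
    void run()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> guard(_lock);
                _cond.wait(guard, [this] { return _stopping || !_tasks.empty(); });
                if (_tasks.empty()) {
                    return; // stopping and nothing left to run
                }
                task = std::move(_tasks.front());
                _tasks.pop();
            }
            task();
        }
    }

    std::mutex _lock;
    std::condition_variable _cond;
    std::queue<std::function<void()>> _tasks;
    std::vector<std::thread> _threads;
    bool _stopping = false;
};

// Queues one sleeping "copy" task per worker, then one quick task on the
// default pool, and returns the quick task's completion time in milliseconds.
// With isolate == true the sleeping tasks go to a separate pool instead.
long long quick_task_latency_ms(std::size_t workers, int sleep_ms, bool isolate)
{
    std::promise<void> done; // declared first so it outlives both pools
    std::future<void> done_future = done.get_future();
    toy_pool default_pool(workers);
    toy_pool learn_pool(workers);
    toy_pool &copy_pool = isolate ? learn_pool : default_pool;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < workers; ++i) {
        copy_pool.submit(
            [sleep_ms] { std::this_thread::sleep_for(std::chrono::milliseconds(sleep_ms)); });
    }
    default_pool.submit([&done] { done.set_value(); });
    done_future.wait();

    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

In this model, `quick_task_latency_ms(2, 200, false)` cannot return before one of the sleeping tasks has finished, while `quick_task_latency_ms(2, 200, true)` should return almost immediately. The model does not address the open question above: whatever already runs on the target pool (`THREAD_POOL_LEARN` in the diff) would then share its threads with the sleeping copy tasks.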
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]