Re: [I] Bug(Learn): backup or dup a table with per-disk throttling on a backup-duplication cluster, some nodes coredump [incubator-pegasus]

via GitHub Thu, 09 May 2024 00:07:33 -0700


ruojieranyishen commented on issue #1969:
URL: 
https://github.com/apache/incubator-pegasus/issues/1969#issuecomment-2102081635


   Setting `max_copy_rate_megabytes_per_disk` causes a large number of threads 
in the `replica.default` thread pool to sleep. `nfs_service_impl::on_copy`, 
which also runs on `replica.default`, can cause the same problem.
   
   Sleeping threads in this thread pool have the following effects:
   
   1. Remote commands (`remote_command`) are not processed in time. For example:
   
      ```
      >>> server_info -r
      COMMAND: server-info
      
      CALL [meta-server] [xxxx:45001] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:10:53
      CALL [meta-server] [xxxx:45001] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:11:01
      CALL [replica-server] [xxxx:45101] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:11:15
      CALL [replica-server] [xxxx:45101] failed: ERR_TIMEOUT
      CALL [replica-server] [xxxx:45101] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-04-15 17:50:28
      CALL [replica-server] [xxxx:45101] failed: ERR_TIMEOUT
      CALL [replica-server] [xxxx:45101] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:11:22
      
      Succeed count: 5
      Failed count: 2
      ```
   
   2. Some RPCs are delayed. For example:
   
      ```
      RPC_CM_CONFIG_SYNC
      RPC_CM_DUPLICATION_SYNC
      RPC_NFS_COPY
      ```
   
      Possibly related: https://github.com/apache/incubator-pegasus/issues/1840
   
   3. Indirectly, group checks in `replica.replica` fail with ERR_TIMEOUT, 
which causes ballots to increase. For example:
   
   ```
   [general]
   app_name           : lpf_test
   app_id             : 46      
   partition_count    : 100     
   max_replica_count  : 3       
   [replicas]
   pidx  ballot  replica_count  primary        secondaries
   0     6       3/3            x.x.x.x:55101  [x.x.x.x:55101,x.x.x.x:55101]
   1     7       3/3            x.x.x.x:55101  [x.x.x.x:55101,x.x.x.x:55101]
   2     5       3/3            x.x.x.x:55101  [x.x.x.x:55101,x.x.x.x:55101]
   3     9       3/3            x.x.x.x:55101  [x.x.x.x:55101,x.x.x.x:55101]
   4     5       3/3            x.x.x.x:55101  [x.x.x.x:55101,x.x.x.x:55101]
   5     6       3/3            x.x.x.x:55101  [x.x.x.x:55101,x.x.x.x:55101]
   6     7       3/3            x.x.x.x:55101  [x.x.x.x:55101,x.x.x.x:55101]
   ```
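   
   The queueing effect behind the three symptoms above can be illustrated with 
a small model. This is a hedged sketch, not Pegasus code: it assumes a FIFO 
pool with a fixed number of workers, that all tasks are enqueued at time 0, and 
that a throttled copy holds its worker for the whole time it sleeps. The 
function names, worker count, and sleep duration are made up for illustration.
   
   ```cpp
   #include <algorithm>
   #include <cassert>
   #include <cstdint>
   #include <vector>

   // Hypothetical model of a FIFO thread pool with `workers` threads.
   // Task i holds its worker for hold_ms[i] milliseconds; all tasks are
   // enqueued at time 0. Returns the time at which each task starts running.
   std::vector<int64_t> start_times_ms(int workers, const std::vector<int64_t> &hold_ms)
   {
       assert(workers > 0);
       std::vector<int64_t> free_at(workers, 0); // when each worker becomes idle
       std::vector<int64_t> starts;
       starts.reserve(hold_ms.size());
       for (int64_t hold : hold_ms) {
           // The next queued task goes to the worker that becomes idle first.
           auto w = std::min_element(free_at.begin(), free_at.end());
           starts.push_back(*w);
           *w += hold;
       }
       return starts;
   }

   // Queueing delay of a short remote command that is enqueued behind
   // `copies` throttled copy tasks, each sleeping for `sleep_ms`.
   int64_t command_delay_ms(int workers, int copies, int64_t sleep_ms)
   {
       std::vector<int64_t> hold(copies, sleep_ms);
       hold.push_back(1); // the remote command itself (assumed to take 1 ms)
       return start_times_ms(workers, hold).back();
   }
   ```
   
   In this model, with 4 workers and 500 ms sleeps, `command_delay_ms(4, 3, 500)` 
is 0 because one worker is still idle, while `command_delay_ms(4, 8, 500)` is 
1000: the command waits behind two full rounds of sleeping tasks. A delay of 
that kind, if it exceeds the caller's timeout, would show up as ERR_TIMEOUT.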
   
   # Solution
   
   1. If `max_copy_rate_megabytes_per_disk` is not set too low and is not 
adjusted frequently, the problem is unlikely to be triggered.
   
   2. Moving NFS-related operations off the default thread pool in 
`nfs_server_impl.cpp` can resolve this issue. For example:
   
   ```
   -DEFINE_TASK_CODE_RPC(RPC_NFS_COPY, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
   +DEFINE_TASK_CODE_RPC(RPC_NFS_COPY, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
    DEFINE_TASK_CODE_RPC(RPC_NFS_GET_FILE_SIZE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
    // test timer task code
    DEFINE_TASK_CODE(LPC_NFS_REQUEST_TIMER, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
    
   -DEFINE_TASK_CODE_AIO(LPC_NFS_READ, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
   +DEFINE_TASK_CODE_AIO(LPC_NFS_READ, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
    
   -DEFINE_TASK_CODE_AIO(LPC_NFS_WRITE, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
   +DEFINE_TASK_CODE_AIO(LPC_NFS_WRITE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
    
   -DEFINE_TASK_CODE_AIO(LPC_NFS_COPY_FILE, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
   +DEFINE_TASK_CODE_AIO(LPC_NFS_COPY_FILE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
   
   ```
   
   **Changing the thread pool solves the problem above, but it is unclear 
whether it introduces other risks.**
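   
   One way to reason about the open question is a toy model of the task-code 
to thread-pool mapping. This is only a sketch under stated assumptions, not 
Pegasus code: `Pool` stands in for the real pool codes, `REMOTE_COMMAND` and 
`LEARN_TASK` are hypothetical names for a non-NFS task in each pool, and each 
pool is assumed to be FIFO with identical sleeping tasks queued ahead of the 
victim task.
   
   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <map>
   #include <string>

   // Illustrative stand-ins for THREAD_POOL_DEFAULT and THREAD_POOL_LEARN.
   enum class Pool { Default, Learn };

   // Task code -> pool, mirroring what the DEFINE_TASK_CODE_* lines declare.
   using TaskTable = std::map<std::string, Pool>;

   TaskTable table_before()
   {
       return {{"RPC_NFS_COPY", Pool::Default},
               {"LPC_NFS_READ", Pool::Default},
               {"LPC_NFS_WRITE", Pool::Default},
               {"LPC_NFS_COPY_FILE", Pool::Default},
               {"REMOTE_COMMAND", Pool::Default}, // hypothetical non-NFS task
               {"LEARN_TASK", Pool::Learn}};      // hypothetical learn-pool task
   }

   // The same table after applying the diff above.
   TaskTable table_after()
   {
       TaskTable t = table_before();
       for (const char *code :
            {"RPC_NFS_COPY", "LPC_NFS_READ", "LPC_NFS_WRITE", "LPC_NFS_COPY_FILE"})
           t[code] = Pool::Learn;
       return t;
   }

   // Delay seen by `victim` when `sleepers` tasks of code `throttled`, each
   // sleeping for sleep_ms, are queued ahead of it. Tasks in a different pool
   // do not delay the victim in this model.
   int64_t delay_ms(const TaskTable &t, const std::string &throttled,
                    const std::string &victim, int sleepers, int workers,
                    int64_t sleep_ms)
   {
       assert(workers > 0);
       if (t.at(throttled) != t.at(victim))
           return 0;
       return (sleepers / workers) * sleep_ms; // full rounds of sleepers ahead
   }
   ```
   
   In this model the change removes the delay for `REMOTE_COMMAND` (1000 ms 
before, 0 after, with 8 sleepers, 4 workers, and 500 ms sleeps) but moves the 
same delay onto `LEARN_TASK`. Whether that matters in practice depends on what 
else runs in the learn pool, which the model does not capture.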


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

