Xinyu Tan created IOTDB-5840:
--------------------------------

             Summary: Avoid the problem that the insertRecords interface may 
cause the number of threads to balloon when there are too many data regions
                 Key: IOTDB-5840
                 URL: https://issues.apache.org/jira/browse/IOTDB-5840
             Project: Apache IoTDB
          Issue Type: Improvement
            Reporter: Xinyu Tan
            Assignee: Xinyu Tan


On a machine with sufficient CPU resources (for example, 32 cores), if the 
number of Dataregions is too small, the write pressure in the cluster is 
concentrated on the locks of these regions. As a result, the write latency is 
high and the throughput cannot be increased. When the number of DataRegion is 
large, for an InsertRecords request with a large batchSize such as 10000, its 
write request may involve many DataRegion. Once the concurrency is high, It 
takes hundreds of internalServiceClient to dispatch the planNode. Under the 
current threading model of BIO, this would also increase the number of 
InternalServiceRPC threads in the cluster to hundreds or thousands.

For example, in a user test environment, coreSize of the clientManager is set 
to 600 and maxSize is set to 1000 to prevent concurrent write requests from 
blocking each other while obtaining internalServiceClient. The result is that 
each node has nearly 1000 InternalServiceRPC threads. If the client increases 
concurrency further, a "connection reset by peer" error is reported. This error 
should be caused by the default parameters of the linux kernel not supporting 
so many connections.

The current mpp framework splits Plannodes by region only. Therefore, the 
number of RPCS to be sent per write request is closely related to the number of 
dataregion involved in the request rather than the number of Datanodes.

The solution to this problem is to aggregate RPC requests sent to the same 
datanode. This reduces the pressure on the clientManager and reduces the number 
of InternalServiceRPC threads. Avoid sending the connection reset by peer error 
to the client again.

After the optimization, the number of RPC service threads was reduced from 1000 
to 200. The connection reset by peer error was cleared. And we can increase the 
number of regions to make full use of cluster cpu resources



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to