leixm opened a new issue, #309: URL: https://github.com/apache/incubator-uniffle/issues/309
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.

### Describe the feature

When the ShuffleServer load is high, the existing metrics do not tell us directly whether client reads and writes have been significantly affected.

### Motivation

Accurately determine whether the current service load is causing significant delays for client reads and writes.

### Describe the solution

Delay monitoring is divided into two parts.

The first part is the delay of the ShuffleServer processing logic itself. We can add metrics for this directly.

The second part is the delay before the ShuffleServer processing logic runs, which includes network delay and RPC queue waiting time. For this part, the client could record a timestamp just before it initiates a read or write request and include that timestamp in the request. When the ShuffleServer receives the request, it can compute how long the request was delayed and record the value in its own metrics. gRPC may also offer built-in support for something similar.

We can then measure the processing delay of the current ShuffleServer through monitoring indicators such as p95 and p99.

### Additional context

No

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!
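
The two-part measurement proposed in this issue could look roughly like the sketch below. It is a minimal Python illustration, not Uniffle code: `ShuffleRequest`, `LatencyMetrics`, `ShuffleServerSketch`, and the `send_time_ms` field are all hypothetical names, and a real implementation would more likely carry the timestamp in a message field or request metadata and report to the server's existing metrics system. It also assumes client and server clocks are reasonably synchronized (for example via NTP), since the pre-processing delay is the difference between two wall clocks.

```python
import math
import time
from dataclasses import dataclass, field


@dataclass
class ShuffleRequest:
    """Hypothetical request that carries the client-side send timestamp."""
    payload: bytes
    send_time_ms: int  # wall-clock time stamped by the client just before sending


@dataclass
class LatencyMetrics:
    """Hypothetical in-memory recorder; stands in for a real histogram metric."""
    samples_ms: list = field(default_factory=list)

    def record(self, value_ms):
        self.samples_ms.append(value_ms)

    def percentile(self, p):
        """Nearest-rank percentile for p in (0, 100], e.g. 95 or 99."""
        if not self.samples_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


class ShuffleServerSketch:
    """Records the two delays described in the issue for each request."""

    def __init__(self, clock_ms=None):
        # Injectable wall clock so the sketch can be tested deterministically.
        self.clock_ms = clock_ms or (lambda: int(time.time() * 1000))
        self.pre_processing_delay = LatencyMetrics()  # network + RPC queue wait
        self.processing_delay = LatencyMetrics()      # server-side handling time

    def handle(self, request, process):
        # Second part: time between the client stamping the request and the
        # handler starting to run. Clamped at 0 to tolerate small clock skew.
        started_wall_ms = self.clock_ms()
        self.pre_processing_delay.record(
            max(0, started_wall_ms - request.send_time_ms))

        # First part: time spent in the processing logic itself, measured
        # with a monotonic clock because both readings are on the server.
        started = time.monotonic()
        result = process(request.payload)
        self.processing_delay.record((time.monotonic() - started) * 1000)
        return result


def make_request(payload):
    """Client side: stamp the request immediately before it is sent."""
    return ShuffleRequest(payload=payload, send_time_ms=int(time.time() * 1000))
```

After a number of requests, `server.pre_processing_delay.percentile(95)` and `server.processing_delay.percentile(99)` would give the kind of p95/p99 indicators mentioned above. Keeping the two delays in separate metrics makes it possible to tell whether latency comes from queueing and the network or from the processing logic.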
