leixm opened a new issue, #309:
URL: https://github.com/apache/incubator-uniffle/issues/309

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### Describe the feature
   
   When ShuffleServer load is high, the current metrics do not let us directly tell whether client reads and writes are significantly affected.
   
   
   ### Motivation
   
   Accurately determine whether the current server load is causing significant latency for client reads and writes.
   
   ### Describe the solution
   
   Latency monitoring can be divided into two parts. The first is the latency of ShuffleServer's own processing logic, which we can measure directly by adding metrics. The second is everything that happens before ShuffleServer's processing logic runs, including network latency and time spent waiting in the RPC queue.
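   As a minimal sketch of the first part, assuming plain JDK classes rather than whatever metrics library ShuffleServer actually uses, a handler body can be wrapped in a timer that records its duration even when it throws. `ProcessingLatencyMetric` and `ProcessingLatencyDemo` are hypothetical names, not Uniffle APIs:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

class ProcessingLatencyDemo {
  public static void main(String[] args) {
    ProcessingLatencyMetric metric = new ProcessingLatencyMetric();
    // Hypothetical stand-in for the body of a read or write handler.
    String result = metric.time(() -> "ok");
    System.out.println(result + " count=" + metric.count()); // prints "ok count=1"
  }
}

/** Hypothetical recorder for time spent inside ShuffleServer's own processing logic. */
class ProcessingLatencyMetric {
  private final LongAdder count = new LongAdder();
  private final LongAdder totalNanos = new LongAdder();

  /** Runs the handler and records its duration, even if it throws. */
  <T> T time(Supplier<T> handler) {
    long start = System.nanoTime(); // monotonic clock, suited to measuring elapsed time
    try {
      return handler.get();
    } finally {
      totalNanos.add(System.nanoTime() - start);
      count.increment();
    }
  }

  long count() {
    return count.sum();
  }

  /** Mean processing time in milliseconds, or 0 if nothing has been recorded. */
  double averageMillis() {
    long n = count.sum();
    return n == 0 ? 0.0 : (double) totalNanos.sum() / n / TimeUnit.MILLISECONDS.toNanos(1);
  }
}
```

   In a real server the recorded durations would more usefully feed a histogram, kept separately for reads and writes, so that percentiles can be derived instead of only an average.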
   For the second part, the client could record a timestamp just before it sends a read or write request and include that timestamp in the request. When ShuffleServer receives the request, it can compute how long the request was delayed and record that in its metrics. gRPC may also support something similar.
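   The timestamp idea for the second part could look roughly like the following sketch. `TimedRequest` and its `clientSendTimeMs` field are hypothetical stand-ins for an extra field in the real read/write request message, and no particular gRPC mechanism is assumed. Because the two timestamps come from wall clocks on different hosts, the result is only as accurate as their clock synchronization (for example via NTP); the sketch clamps negative values caused by skew to zero:

```java
class PreProcessingLatencyDemo {
  public static void main(String[] args) {
    // Client side: stamp the request immediately before sending it.
    TimedRequest request = new TimedRequest("shuffle-data", System.currentTimeMillis());
    // Server side: take a timestamp when the handler starts processing the request.
    long waitedMs = PreProcessingLatency.waitMillis(request, System.currentTimeMillis());
    System.out.println("network + queue latency (ms): " + waitedMs);
  }
}

/** Hypothetical request that carries the client's send time in epoch milliseconds. */
final class TimedRequest {
  final String payload;
  final long clientSendTimeMs;

  TimedRequest(String payload, long clientSendTimeMs) {
    this.payload = payload;
    this.clientSendTimeMs = clientSendTimeMs;
  }
}

final class PreProcessingLatency {
  private PreProcessingLatency() {}

  /**
   * Time from the client stamping the request to the server starting to process it,
   * i.e. network transfer plus RPC queue wait. Both values are wall-clock times from
   * different hosts, so negative results caused by clock skew are clamped to zero.
   */
  static long waitMillis(TimedRequest request, long serverStartTimeMs) {
    return Math.max(0L, serverStartTimeMs - request.clientSendTimeMs);
  }
}
```

   Clamping hides skew rather than correcting it, so a persistent offset between client and server clocks would still bias this metric.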
   We can then measure the current ShuffleServer's processing latency through percentile indicators such as p95 and p99.
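   A nearest-rank percentile over a window of recorded samples illustrates why p99 is worth tracking alongside p95. `LatencyPercentiles` is a hypothetical helper, not Uniffle code; a production server would more likely use a bucketed histogram or a sliding-window reservoir than sort raw samples:

```java
import java.util.Arrays;

class PercentileDemo {
  public static void main(String[] args) {
    long[] samplesMs = new long[100];
    Arrays.fill(samplesMs, 10L); // 98 fast requests...
    samplesMs[3] = 500L;         // ...plus two slow outliers
    samplesMs[57] = 500L;
    System.out.println("p95=" + LatencyPercentiles.percentile(samplesMs, 95)
        + " p99=" + LatencyPercentiles.percentile(samplesMs, 99)); // prints "p95=10 p99=500"
  }
}

/** Hypothetical helper: nearest-rank percentile over a window of latency samples. */
final class LatencyPercentiles {
  private LatencyPercentiles() {}

  static long percentile(long[] samplesMs, double p) {
    if (samplesMs.length == 0) {
      throw new IllegalArgumentException("no samples");
    }
    if (p <= 0 || p > 100) {
      throw new IllegalArgumentException("p must be in (0, 100]");
    }
    long[] sorted = samplesMs.clone(); // leave the caller's array untouched
    Arrays.sort(sorted);
    // Nearest rank: smallest value with at least p% of samples at or below it.
    int rank = (int) Math.ceil(p * sorted.length / 100.0);
    return sorted[rank - 1];
  }
}
```

   With 98 fast requests and 2 slow ones, p95 still reports the fast value while p99 exposes the slow tail.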
   
   ### Additional context
   
   No
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!

