jerqi commented on issue #234: URL: https://github.com/apache/incubator-uniffle/issues/234#issuecomment-1254479362
> Got your thought.
>
> > How does the YARN ResourceManager handle this problem?
>
> With HA ResourceManagers there is no such problem, because the ZooKeeper-based mechanism fails over to the standby RM. Let's consider a single ResourceManager or the Hadoop NameNode instead. As far as I know, the NameNode enters safe mode on startup and does not exit until it has received enough block reports from the DataNodes. Refer to: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
>
> > I suggest that we should pend the requests instead of rejecting them when we start the coordinator.
>
> Pending will slow down the apps. I think we should make the request fall back to another coordinator. Maybe waiting one heartbeat interval on startup is a good tradeoff; it would serve as the indicator for when the coordinator can exit safe mode.

It means that we shouldn't restart the two coordinators within a short time of each other. It's a little difficult for the K8S controller to select a proper interval between the restarts.
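
To make the tradeoff concrete, here is a rough sketch of the idea as I understand it, not the coordinator's actual code. `SafeModeGate`, `tryAccept`, and `firstAccepting` are hypothetical names, and the clock is injected only to keep the example testable. A coordinator stays in safe mode for one heartbeat interval after it starts and rejects requests during that window, so the client can fall back to another coordinator.

```java
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical sketch; none of these names exist in Uniffle. */
public class SafeModeGate {
    private final long startMillis;
    private final long heartbeatIntervalMillis;
    private final LongSupplier clock;

    public SafeModeGate(long heartbeatIntervalMillis, LongSupplier clock) {
        this.heartbeatIntervalMillis = heartbeatIntervalMillis;
        this.clock = clock;
        this.startMillis = clock.getAsLong(); // treated as the startup time
    }

    /** True until one full heartbeat interval has elapsed since startup. */
    public boolean inSafeMode() {
        return clock.getAsLong() - startMillis < heartbeatIntervalMillis;
    }

    /** Rejects instead of pending, so the caller can try another coordinator. */
    public boolean tryAccept() {
        return !inSafeMode();
    }

    /** Client-side fallback: index of the first coordinator that accepts, or -1. */
    public static int firstAccepting(List<SafeModeGate> coordinators) {
        for (int i = 0; i < coordinators.size(); i++) {
            if (coordinators.get(i).tryAccept()) {
                return i;
            }
        }
        return -1; // every coordinator is still in safe mode
    }
}
```

If both coordinators are restarted within one heartbeat interval of each other, `firstAccepting` returns -1 for a while, which is exactly the case the restart interval has to avoid.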
