[ https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jinglun updated HADOOP-16403:
-----------------------------
    Attachment: HADOOP-16403.007.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16403
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16403
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Jinglun
>            Assignee: Jinglun
>            Priority: Major
>         Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf,
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch,
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, HADOOP-16403.006.patch,
> HADOOP-16403.007.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite
> large, so after the active NameNode dies it takes the standby more than 40s
> to become active. Many requests (TCP connect requests and RPC requests) from
> DataNodes, clients and zkfc time out and start retrying. The sudden flood of
> requests lasts for the next 2 minutes, until every request has either been
> handled or has exhausted its retries.
> Tuning the RPC-related settings might give the NameNode enough capacity to
> solve this problem; the key is finding the bottleneck. The RPC server can be
> described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got a
> ConnectTimeoutException. It is caused by a TCP connect request that goes
> unanswered for 20s. I suspect the reader queue is full and blocks the
> Listener from accepting new connections. Both slow Handlers and slow Readers
> can stall the whole pipeline, and I need to know which one it is. I think *a
> queue that computes the QPS, writes a log entry when it is full, and can be
> replaced easily* will help.
> HADOOP-10302 already implements a runtime-swappable queue. Using it for the
> Reader's queue makes that queue runtime-swappable as well. The QPS can be
> computed by a subclass of LinkedBlockingQueue that updates its counters
> whenever put/take/... is called. The QPS data will be exposed through JMX.
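
To illustrate the idea in the description, here is a minimal sketch of a LinkedBlockingQueue subclass that counts insertions and removals, notes when the queue is full, and derives a rate from the counters. It is not the MetricLinkedBlockingQueue from the attached patches: the class name `CountingBlockingQueue`, its getters, and `sampleAddQps()` are hypothetical, only `put`, `offer(E)`, `take` and `poll()` are instrumented, and it writes to stderr where a real Hadoop class would presumably use the project's logger and publish the numbers through JMX.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch; not the class shipped in the HADOOP-16403 patches. */
class CountingBlockingQueue<E> extends LinkedBlockingQueue<E> {
  private final AtomicLong added = new AtomicLong();      // successful inserts
  private final AtomicLong removed = new AtomicLong();    // successful removals
  private final AtomicLong fullEvents = new AtomicLong(); // inserts that found the queue full

  // State for the rate sample, guarded by "this".
  private long lastCount = 0;
  private long lastNanos = System.nanoTime();

  CountingBlockingQueue(int capacity) {
    super(capacity);
  }

  @Override
  public void put(E e) throws InterruptedException {
    if (remainingCapacity() == 0) {
      // Best-effort check: the producer is about to block.
      fullEvents.incrementAndGet();
      System.err.println("queue is full, producer will block");
    }
    super.put(e);
    added.incrementAndGet();
  }

  @Override
  public boolean offer(E e) {
    boolean accepted = super.offer(e);
    if (accepted) {
      added.incrementAndGet();
    } else {
      fullEvents.incrementAndGet();
      System.err.println("queue is full, element rejected");
    }
    return accepted;
  }

  @Override
  public E take() throws InterruptedException {
    E e = super.take();
    removed.incrementAndGet();
    return e;
  }

  @Override
  public E poll() {
    E e = super.poll();
    if (e != null) {
      removed.incrementAndGet();
    }
    return e;
  }

  long getAdded() { return added.get(); }
  long getRemoved() { return removed.get(); }
  long getFullEvents() { return fullEvents.get(); }

  /** Average inserts per second since the previous call (or since construction). */
  synchronized double sampleAddQps() {
    long now = System.nanoTime();
    long count = added.get();
    double qps = (count - lastCount) * 1e9 / Math.max(1L, now - lastNanos);
    lastCount = count;
    lastNanos = now;
    return qps;
  }
}
```

Comparing the insert and removal counters, or watching how often the full-queue message appears, would show whether the Readers or the Handlers are the slow stage. The timed `offer`/`poll` variants and bulk operations such as `drainTo` are left uninstrumented to keep the sketch short.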