[ https://issues.apache.org/jira/browse/HDFS-9723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaoyu Yao updated HDFS-9723:
-----------------------------
    Description:
HDFS namenode handles RPC requests from DFS clients and internal processing from datanodes. It has been a recurring pain that some bad jobs overwhelm the namenode and bring the whole cluster down. FCQ (Fair Call Queue) by HADOOP-9640 is one of the existing efforts, added since Hadoop 2.4, to address this issue.

In the current FCQ implementation, incoming RPC calls are scheduled based on the number of recent RPC calls from each user, tracked by a time-decayed scheduler. This works well when there is a clear mapping between users and their RPC calls from different jobs. However, it may not work effectively when it is hard to trace calls back to a specific caller in a chain of operations from a workflow (e.g. Oozie -> Hive -> Yarn). It is not feasible for operators/administrators to throttle all the Hive jobs because of one “bad” query.

This JIRA proposes to leverage RPC caller context information (such as callerType: caller Id from TEZ-2851) available with HDFS-9184 as an alternative to the existing UGI-based (or user-name-based, when a delegation token is not available) Identity Provider, to improve the effectiveness of the Hadoop RPC Fair Call Queue (HADOOP-9640) for better namenode throttling in multi-tenancy cluster deployments.

  was:
HDFS namenode handles RPC requests from DFS clients and internal processing from datanodes. It has been a recurring pain that some bad jobs overwhelm the namenode and bring the whole cluster down. FCQ (Fair Call Queue) by HADOOP-9640 is one of the existing efforts, added since Hadoop 2.4, to address this issue.

In the current FCQ implementation, incoming RPC calls are scheduled based on the number of recent RPC calls (1000) from each user, tracked by a time-decayed scheduler. This works well when there is a clear mapping between users and their RPC calls from different jobs. However, it may not work effectively when it is hard to trace calls back to a specific caller in a chain of operations from a workflow (e.g. Oozie -> Hive -> Yarn). It is not feasible for operators/administrators to throttle all the Hive jobs because of one “bad” query.

This JIRA proposes to leverage RPC caller context information (such as callerType: caller Id from TEZ-2851) available with HDFS-9184 as an alternative to the existing UGI-based (or user-name-based, when a delegation token is not available) Identity Provider, to improve the effectiveness of the Hadoop RPC Fair Call Queue (HADOOP-9640) for better namenode throttling in multi-tenancy cluster deployments.


> Improve Namenode Throttling Against Bad Jobs with FCQ and CallerContext
> -----------------------------------------------------------------------
>
>                 Key: HDFS-9723
>                 URL: https://issues.apache.org/jira/browse/HDFS-9723
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Xiaoyu Yao
>            Assignee: Xiaoyu Yao
>
> HDFS namenode handles RPC requests from DFS clients and internal processing
> from datanodes. It has been a recurring pain that some bad jobs overwhelm the
> namenode and bring the whole cluster down. FCQ (Fair Call Queue) by
> HADOOP-9640 is one of the existing efforts, added since Hadoop 2.4, to
> address this issue.
> In the current FCQ implementation, incoming RPC calls are scheduled based on
> the number of recent RPC calls from each user, tracked by a time-decayed
> scheduler. This works well when there is a clear mapping between users and
> their RPC calls from different jobs. However, it may not work effectively
> when it is hard to trace calls back to a specific caller in a chain of
> operations from a workflow (e.g. Oozie -> Hive -> Yarn). It is not feasible
> for operators/administrators to throttle all the Hive jobs because of one
> “bad” query.
> This JIRA proposes to leverage RPC caller context information (such as
> callerType: caller Id from TEZ-2851) available with HDFS-9184 as an
> alternative to the existing UGI-based (or user-name-based, when a delegation
> token is not available) Identity Provider, to improve the effectiveness of
> the Hadoop RPC Fair Call Queue (HADOOP-9640) for better namenode throttling
> in multi-tenancy cluster deployments.
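
For illustration only, the difference between a user-based identity and a caller-context-based identity can be sketched in a few lines of Python. This is not Hadoop code and not the patch for this JIRA: the `Call` record, the `DecayingCallCounter` class, the `HIVE_QUERY_ID:...` context strings, the decay factor, and the priority thresholds are all assumptions made up for the example. It only shows how the choice of identity changes which calls are counted together.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Call:
    """Hypothetical stand-in for an incoming RPC call."""
    user: str
    caller_context: Optional[str] = None  # illustrative format, e.g. "HIVE_QUERY_ID:q1"


def user_identity(call: Call) -> str:
    """Identity as described for the existing scheduler: the calling user."""
    return call.user


def caller_context_identity(call: Call) -> str:
    """Proposed alternative: key on the caller context, falling back to the user."""
    return call.caller_context or call.user


class DecayingCallCounter:
    """Toy time-decayed counter; thresholds and decay factor are assumptions."""

    def __init__(self, identity: Callable[[Call], str],
                 decay_factor: float = 0.5, num_levels: int = 4):
        self.identity = identity
        self.decay_factor = decay_factor
        self.num_levels = num_levels
        self.counts = defaultdict(float)

    def record(self, call: Call) -> None:
        """Count one call against the identity it maps to."""
        self.counts[self.identity(call)] += 1.0

    def decay(self) -> None:
        """Age all counts; intended to be called once per decay period."""
        for key in list(self.counts):
            self.counts[key] *= self.decay_factor

    def priority(self, call: Call) -> int:
        """Return a queue level: 0 is served first, larger numbers are throttled more."""
        total = sum(self.counts.values())
        if total == 0:
            return 0
        share = self.counts.get(self.identity(call), 0.0) / total
        # With 4 levels: share >= 1/2 -> 3, >= 1/4 -> 2, >= 1/8 -> 1, else 0.
        for level in range(self.num_levels - 1, 0, -1):
            if share >= 1 / 2 ** (self.num_levels - level):
                return level
        return 0


# One heavy query and one light query, both submitted as the same user.
bad = Call("hive", "HIVE_QUERY_ID:bad")
good = Call("hive", "HIVE_QUERY_ID:good")

by_user = DecayingCallCounter(user_identity)
by_context = DecayingCallCounter(caller_context_identity)
for counter in (by_user, by_context):
    for _ in range(90):
        counter.record(bad)
    for _ in range(10):
        counter.record(good)

print("user identity:   ", by_user.priority(bad), by_user.priority(good))
print("context identity:", by_context.priority(bad), by_context.priority(good))
```

With the user-based identity both queries land in the same bucket and are throttled together. With the caller-context identity only the heavy query is pushed to the lowest-priority level, which is the behaviour the description asks for.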