    [ http://issues.apache.org/jira/browse/HADOOP-785?page=comments#action_12459492 ]

Milind Bhandarkar commented on HADOOP-785:
------------------------------------------
Request for Comments
--------------------

Separating Server and Client Configuration
------------------------------------------

The current mechanism of configuring Hadoop daemons and specifying job-specific details through a single Configuration object is confusing and error-prone. The overall goal of this proposal is to make configuration more intuitive and thus less prone to errors.

Detailed Goals:
---------------

1. Separate configuration variables according to the contexts in which they are used. There are currently two such contexts: the server context, used by the Hadoop daemons (NameNode, DataNodes, JobTracker and TaskTrackers), and the client context, used by running jobs (either jobs that use the MapReduce framework or standalone jobs that are DFSClients).

2. Allow job-specific configuration as a way to pass job-wide parameters from the JobClient to the individual tasks that belong to the job. This also covers frameworks built on top of the MapReduce framework, such as Hadoop Streaming.

3. Provide documentation for all parameters used in both server and client contexts in the default configuration resources.

4. Examine the need for each configuration parameter used in the Hadoop code, and eliminate unnecessary parameters whose default values never need to be overridden.

5. Provide mechanisms to detect configuration errors as early as possible.

Configuration Parameters Used In Hadoop
---------------------------------------

Configuration parameters used in the Hadoop codebase are used either in the server context (dfs.name.dir, mapred.local.dir), in the client context (dfs.replication, mapred.map.tasks), or in both (fs.default.name). All configuration parameters should have default values specified in the default configuration files. In addition, we need to enforce that server-context parameters cannot be overridden from the client context, and vice versa.

Client configurations take effect for the lifetime of the client and for the artifacts it creates. For example, the replication factor configured in the HDFS client would remain the default only for that client and for the files that the client creates during its lifetime. Similarly, the configuration of the JobClient would remain effective for the jobs that the JobClient creates during its lifetime.

Apart from the configuration parameters used in Hadoop itself, individual jobs or frameworks built on top of Hadoop may use their own configuration parameters as a means of communication from the job client to the job. We need to make sure that these parameters do not conflict with the parameters used in Hadoop.

For common parameters such as dfs.replication, which are used in the server context but can be overridden per file in the client context, we need to make sure that the values stay within the upper and lower bounds specified in the server configuration.

Class Hierarchy
---------------

In order to implement the requirements outlined above, we propose the following class hierarchy, along with the default and final resources that each class loads.

Configuration (common-defaults.xml, common-final.xml)
  |
  +---ServerConfiguration (common-defaults.xml, server-defaults.xml, server-final.xml, common-final.xml)
  |
  +---ClientConfiguration (common-defaults.xml, client-defaults.xml, common-final.xml)
  |
  +---AppConfiguration (common-defaults.xml, client-defaults.xml, common-final.xml)

New configuration parameters and default-overrides are specified between the default resources and the final resources. If a parameter already exists in a final resource, it cannot be overridden; thus, server-final and common-final correspond to the current hadoop-site.xml.
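As a rough sketch (not part of the proposal text itself), each proposed class could simply declare its resources in load order. The sketch below assumes that the addDefaultResource()/addFinalResource() mechanism of the current Configuration class carries over unchanged, and that the base class no longer loads hadoop-default.xml and hadoop-site.xml; AppConfiguration would follow the same pattern as ClientConfiguration.

  import org.apache.hadoop.conf.Configuration;

  // Sketch only: class and resource names are taken from the tree above.
  class ServerConfiguration extends Configuration {
    public ServerConfiguration() {
      // Default resources are read in order; later ones override earlier ones.
      addDefaultResource("common-defaults.xml");
      addDefaultResource("server-defaults.xml");
      // Final resources are read last; values defined there cannot be overridden.
      addFinalResource("server-final.xml");
      addFinalResource("common-final.xml");
    }
  }

  class ClientConfiguration extends Configuration {
    public ClientConfiguration() {
      addDefaultResource("common-defaults.xml");
      addDefaultResource("client-defaults.xml");
      addFinalResource("common-final.xml");
    }
  }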
common-defaults.xml should contain parameters that are used in both server and client contexts, such as the ipc.*, io.*, fs.*, and user.* parameters. common-final.xml overrides selected parameters in common-defaults.xml. The generated job.xml file would contain only those parameters not specified in the *-defaults.xml resources.

Other Proposals
---------------

In order to ensure that all configuration parameters used in the Hadoop codebase are documented in the configuration files, the default values currently passed to the Configuration.get* methods should be eliminated. This ensures that *ALL* configuration parameters have exactly one default value, specified in the configuration files. If a given parameter is somehow not defined in any of the configuration resources, these methods would throw a ConfigurationException.

Direct use of the Configuration.get* and Configuration.set* methods should be allowed only from classes that derive from Configuration; that is, these methods should be protected. To access or modify a Configuration, one should use static methods such as JobConf.setNumMapTasks(ClientConfiguration conf, int num); or HdfsClient.setReplication(ClientConfiguration conf, int num);. This allows us to change the parameter names used in Hadoop without changing application code.

The AppConfiguration class is the only configuration class that allows direct use of the get* and set* methods. However, the ClientConfiguration class is the only way to communicate from the JobClient to the application. We would provide a static method, JobConf.setAppConfiguration(ClientConfiguration, AppConfiguration);, to merge the application (or framework) configuration into the JobConf. This allows us to check that the application or framework configuration does not try to reuse the same configuration parameters for different purposes.
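To make the shape of these static accessors concrete, here is a sketch. It assumes the proposed ClientConfiguration and ConfigurationException classes exist, reuses today's parameter name mapred.map.tasks, and assumes JobConf is granted access to the protected get()/set() methods (for example by living in the same package as ClientConfiguration, or through a small protected bridge method).

  // Sketch only: illustrates the proposed static accessor style for JobConf.
  public class JobConf {
    public static void setNumMapTasks(ClientConfiguration conf, int num) {
      // Applications never see the raw parameter name, so Hadoop can rename
      // it later without breaking application code.
      conf.set("mapred.map.tasks", Integer.toString(num));
    }

    public static int getNumMapTasks(ClientConfiguration conf)
        throws ConfigurationException {
      // No default value is passed here: if the parameter is missing from
      // every configuration resource, get() throws ConfigurationException.
      return Integer.parseInt(conf.get("mapred.map.tasks"));
    }
  }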
> Divide the server and client configurations
> -------------------------------------------
>
>                 Key: HADOOP-785
>                 URL: http://issues.apache.org/jira/browse/HADOOP-785
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: conf
>    Affects Versions: 0.9.0
>            Reporter: Owen O'Malley
>         Assigned To: Milind Bhandarkar
>             Fix For: 0.10.0
>
> The configuration system is easy to misconfigure and I think we need to strongly divide the server from the client configs.
> An example of the problem was a configuration where the task tracker had a hadoop-site.xml that set mapred.reduce.tasks to 1. Therefore, the job tracker had the right number of reduces, but the map task thought there was a single reduce. This led to a hard-to-diagnose failure.
> Therefore, I propose separating out the configuration types as:
>
> class Configuration;
>   // reads site-default.xml, hadoop-default.xml
> class ServerConf extends Configuration;
>   // reads hadoop-server.xml, $super
> class DfsServerConf extends ServerConf;
>   // reads dfs-server.xml, $super
> class MapRedServerConf extends ServerConf;
>   // reads mapred-server.xml, $super
> class ClientConf extends Configuration;
>   // reads hadoop-client.xml, $super
> class JobConf extends ClientConf;
>   // reads job.xml, $super
>
> Note in particular that nothing corresponds to hadoop-site.xml, which overrides both client and server configs. Furthermore, the properties from the *-default.xml files should never be saved into the job.xml.
