[ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allen Wittenauer resolved HADOOP-3999.
--------------------------------------
    Resolution: Incomplete

Closing this as stale. Much of this functionality has since been added to YARN and HDFS. Holes are slowly being closed!

> Dynamic host configuration system (via node side plugins)
> ----------------------------------------------------------
>
>                 Key: HADOOP-3999
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3999
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: benchmarks, conf, metrics
>         Environment: Any
>            Reporter: Kai Mosebach
>         Attachments: cloud_divide.jpg
>
>
> The MapReduce paradigm is limited to running MapReduce jobs against the lowest common denominator of all nodes in the cluster.
> On the one hand this is intended (cloud computing: throw simple jobs in, never mind who runs them).
> On the other hand it limits the possibilities quite a lot; for instance, if you had data that could or must be fed to a third-party interface such as MATLAB, R, or Bioconductor, you could solve many more jobs via Hadoop.
> Furthermore, it could be interesting to know the OS, the architecture, and the performance of a node in relation to the rest of the cluster (performance ranking).
> I.e., if I knew about a sub-cluster of nodes with very strong compute performance, or a sub-cluster of nodes with very fast disk I/O, the JobTracker could select these nodes according to a so-called job profile (e.g. "my job is a heavy computing job / heavy disk I/O job"), which a developer can usually estimate beforehand.
> To achieve this, node capabilities could be introduced and stored in the DFS, giving you
> a1.) basic information about each node (OS, architecture)
> a2.) more sophisticated information (additional software, path to software, version)
> a3.) KPIs collected about the node (disk I/O, CPU power, memory)
> a4.) network throughput to neighboring hosts, which might allow generating a network performance map of the cluster
> This would allow you to
> b1.) generate jobs that have a profile (computing-intensive, disk-I/O-intensive, network-I/O-intensive)
> b2.) generate jobs that have software dependencies (run on Linux only, run on nodes with MATLAB only)
> b3.) generate a performance map of the cluster (sub-clusters of fast-disk nodes, sub-clusters of fast-CPU nodes, a network-speed-relation map between nodes)
> From step b3) you could then even derive statistical information which could again be fed into the DFS NameNode to see whether data could be stored on fast-disk sub-clusters only (though that might need to be a tool outside of Hadoop core).
-- This message was sent by Atlassian JIRA (v6.2#6252)
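
For illustration only, here is a minimal, hypothetical sketch of how the per-node capability record (a1-a3) and a job-profile matcher (b1/b2) proposed above might fit together. None of these classes exist in Hadoop; all class, field, and method names are invented for this sketch.

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the node-capability idea from this issue:
// each node publishes a small descriptor (OS, arch, installed software,
// measured disk/CPU scores) and a scheduler-side matcher picks the nodes
// whose capabilities satisfy a job "profile".
public class NodeCapabilityDemo {

    // Per-node capability record (a1-a3 in the issue description).
    static class NodeCapability {
        final String host;
        final String os;                      // e.g. "Linux"
        final String arch;                    // e.g. "x86_64"
        final Map<String, String> software;   // software name -> version or path
        final double diskIoScore;             // relative benchmark score
        final double cpuScore;                // relative benchmark score

        NodeCapability(String host, String os, String arch,
                       Map<String, String> software,
                       double diskIoScore, double cpuScore) {
            this.host = host;
            this.os = os;
            this.arch = arch;
            this.software = software;
            this.diskIoScore = diskIoScore;
            this.cpuScore = cpuScore;
        }
    }

    // Job-side requirements (b1/b2 in the issue description).
    static class JobProfile {
        final String requiredOs;              // null = any OS
        final List<String> requiredSoftware;  // e.g. ["R"]
        final double minDiskIoScore;
        final double minCpuScore;

        JobProfile(String requiredOs, List<String> requiredSoftware,
                   double minDiskIoScore, double minCpuScore) {
            this.requiredOs = requiredOs;
            this.requiredSoftware = requiredSoftware;
            this.minDiskIoScore = minDiskIoScore;
            this.minCpuScore = minCpuScore;
        }

        boolean matches(NodeCapability n) {
            if (requiredOs != null && !requiredOs.equals(n.os)) return false;
            for (String s : requiredSoftware) {
                if (!n.software.containsKey(s)) return false;
            }
            return n.diskIoScore >= minDiskIoScore && n.cpuScore >= minCpuScore;
        }
    }

    // Select the subset of nodes a job with this profile could run on.
    static List<NodeCapability> eligibleNodes(List<NodeCapability> cluster,
                                              JobProfile profile) {
        return cluster.stream().filter(profile::matches).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> withR = new HashMap<>();
        withR.put("R", "3.6.3");

        List<NodeCapability> cluster = List.of(
            new NodeCapability("node1", "Linux", "x86_64", withR, 0.9, 0.4),
            new NodeCapability("node2", "Linux", "x86_64", new HashMap<>(), 0.3, 0.95));

        // A "heavy computing" job that also depends on R being installed (b1 + b2).
        JobProfile profile = new JobProfile("Linux", List.of("R"), 0.0, 0.3);

        eligibleNodes(cluster, profile)
            .forEach(n -> System.out.println("candidate: " + n.host));
    }
}
{code}

In this sketch the capability records would be gathered by node-side plugins and published somewhere central (the issue suggests the DFS); the matcher is the piece a scheduler such as the JobTracker would consult when assigning tasks.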