[ https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer resolved HADOOP-3999.
--------------------------------------

    Resolution: Incomplete

Closing this as stale.  Much of this functionality has since been added to YARN 
and HDFS.  Holes are slowly being closed! 

> Dynamic host configuration system (via node side plugins)
> ---------------------------------------------------------
>
>                 Key: HADOOP-3999
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3999
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: benchmarks, conf, metrics
>         Environment: Any
>            Reporter: Kai Mosebach
>         Attachments: cloud_divide.jpg
>
>
> The MapReduce paradigm is limited to running MapReduce jobs against the lowest 
> common denominator of all nodes in the cluster.
> On the one hand this is intended (cloud computing: throw simple jobs in, 
> never mind which node runs them).
> On the other hand this limits the possibilities quite a lot; for instance, if 
> you had data which could or needs to be fed to a third-party interface such as 
> MATLAB, R, or BioConductor, you could solve many more jobs via Hadoop.
> Furthermore, it could be useful to know the OS, the architecture, and the 
> performance of a node relative to the rest of the cluster (performance 
> ranking).
> For example, if a sub-cluster of computationally powerful nodes or a 
> sub-cluster of very fast disk-I/O nodes were known, the JobTracker could 
> select these nodes according to a so-called job profile (e.g. my job is 
> compute-heavy / disk-I/O-heavy), which a developer can usually estimate in 
> advance.
> To achieve this, node capabilities could be introduced and stored in the DFS, 
> giving you
> a1.) basic information about each node (OS, ARCH)
> a2.) more sophisticated information (additional software, path to software, 
> version)
> a3.) KPIs collected about the node (disk I/O, CPU power, memory)
> a4.) network throughput to neighboring hosts, which might allow generating a 
> network performance map of the cluster
> This would allow you to
> b1.) generate jobs that have a profile (compute-intensive, disk-I/O 
> intensive, network-I/O intensive)
> b2.) generate jobs that have software dependencies (run on Linux only, run on 
> nodes with MATLAB only)
> b3.) generate a performance map of the cluster (sub-clusters of fast-disk 
> nodes, sub-clusters of fast-CPU nodes, a network-speed relation map between 
> nodes)
> From step b3) you could then even acquire statistical information which could 
> in turn be fed into the DFS NameNode to decide whether data should be stored 
> on fast-disk sub-clusters only (that might need to be a tool outside of 
> Hadoop core, though).



--
This message was sent by Atlassian JIRA
(v6.2#6252)
