Regards, blah.
You can use these links:
MAPREDUCE-279: https://issues.apache.org/jira/browse/MAPREDUCE-279
MapReduce Next Gen:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen
You can also read Cloudera's blog posts about YARN:
http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/
http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/
There is also a great document about the architecture of YARN, written
by Arun, Owen and others, but I don't have it at hand right now.
Best wishes
On 02/04/2013 04:35 AM, blah blah wrote:
Thank you very much for answering my question. Are there any publicly
available Hadoop-MR-Yarn UML diagrams (class, activity, etc.), or any
more in-depth documentation beyond what is on the official site? I am
interested in the implementation details and documentation of the MR AM
and the MR containers (the old TaskTracker).
regards
blah
2013/2/1 Vinod Kumar Vavilapalli <[email protected]>
You got that mostly right, and it doesn't differ much in Hadoop
1.* either. With the MR AM doing the work that was earlier done in the
JobTracker, the JobClient and the task side don't change much.
FileInputFormat.getSplits() is called by the client itself, so you
should look for logs on the client machine.
Each filesystem overrides getFileBlockLocations() and provides the
correct locations. DFS, for example, internally uses the
getBlockLocations() API on the NameNode. What you are seeing is the
default implementation for the local FS.
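The override behaviour described above can be illustrated with a toy model. This is only a sketch, not Hadoop code: ToyFileSystem, ToyDistributedFileSystem, the simplified method signature and the host names are all hypothetical stand-ins (the real method takes a FileStatus and returns BlockLocation objects).

```java
import java.util.Arrays;

// Hypothetical stand-in for the base filesystem class: with no knowledge of
// where the data lives, the default answer is a single placeholder host.
class ToyFileSystem {
    String[] getFileBlockLocations(String path, long start, long len) {
        return new String[] {"localhost"};
    }
}

// Hypothetical stand-in for a distributed filesystem: it overrides the
// default and returns hosts that, in a real cluster, would come from the
// NameNode. Here they are simply passed in by the caller.
class ToyDistributedFileSystem extends ToyFileSystem {
    private final String[] hostsFromNameNode;

    ToyDistributedFileSystem(String... hostsFromNameNode) {
        this.hostsFromNameNode = hostsFromNameNode.clone();
    }

    @Override
    String[] getFileBlockLocations(String path, long start, long len) {
        return hostsFromNameNode.clone();
    }
}

class BlockLocationDemo {
    public static void main(String[] args) {
        ToyFileSystem local = new ToyFileSystem();
        ToyFileSystem dfs = new ToyDistributedFileSystem("node1", "node2", "node3");
        // The caller uses the same method on both; only the override differs.
        System.out.println(Arrays.toString(local.getFileBlockLocations("/in", 0, 10)));
        System.out.println(Arrays.toString(dfs.getFileBlockLocations("/in", 0, 10)));
    }
}
```

Code that computes splits only ever calls the base-class method, so it picks up real host names as soon as the job's input path lives on a filesystem that overrides it.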
HTH,
+Vinod
On Fri, Feb 1, 2013 at 6:24 AM, blah blah <[email protected]> wrote:
Hi
(I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)
I have a question regarding my assumptions on the Yarn-MR
design, especially the InputSplit processing. Can someone
confirm my MR-Yarn design assumptions or point out my mistakes?
These are my assumptions regarding the design.
1. JobClient submits Job
Create AppMaster etc.
2. Get number of splits // MR-AM, especially their hosts, so
that a Task can be started on the same node, using
*InputFormat.getSplits() { ...;
FileSystem.getFileBlockLocations(); ...; }
3. Start N tasks // MR-AM
4. Each Task processes its (single) split (unless splitsNr >>
tasksNr) with the use of InputFormat/RecordReader // MR-Task,
from here on InputFormat operates only on a single Split
5. Start RecordReader and process Split // MR-Task
6. MAP() // MR-Task
7. Do the rest of MR // MR-Task
8. Dump to HDFS or other storage. // MR-Task
9. Report FINISH, free resources // MR-AM
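Step 2 of the list above can be sketched as a toy model: cut a file into block-sized splits and record the hosts of each block, so that a task can later be requested on a node that holds the data. This is a simplified illustration under stated assumptions, not the FileInputFormat implementation; ToySplit, ToySplitter and the host names are hypothetical, and the block-to-host table is passed in rather than fetched from a filesystem.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for an input split: a byte range plus
// the hosts that store that range.
class ToySplit {
    final long start;
    final long length;
    final String[] hosts;

    ToySplit(long start, long length, String[] hosts) {
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }
}

class ToySplitter {
    // Cuts a file of fileLength bytes into one split per block.
    // blockHosts[i] lists the hosts holding block i (assumed input; it must
    // have one entry per block of the file).
    static List<ToySplit> getSplits(long fileLength, long blockSize, String[][] blockHosts) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<ToySplit> splits = new ArrayList<>();
        int block = 0;
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last split may be shorter than a full block.
            long length = Math.min(blockSize, fileLength - offset);
            splits.add(new ToySplit(offset, length, blockHosts[block]));
            block++;
        }
        return splits;
    }
}

class SplitDemo {
    public static void main(String[] args) {
        String[][] blockHosts = {{"node1", "node2"}, {"node2", "node3"}, {"node3", "node1"}};
        for (ToySplit s : ToySplitter.getSplits(300, 128, blockHosts)) {
            System.out.println(s.start + "+" + s.length + " on " + String.join(",", s.hosts));
        }
    }
}
```

In this model the number of splits determines how many tasks step 3 would start, and the hosts of each split are the preferred nodes for the corresponding task.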
Two quick bonus questions:
I have added an additional log entry in
FileInputFormat.getSplits(), but I cannot see it in the log
files. I am using the WordCount example and the INFO level. What might
be the problem?
In FileSystem.getFileBlockLocations() the hostname is
hard-coded as "localhost". Where is this mapped to the actual
host name, so that the AM knows which nodes to request?
Thanks for your reply
--
+Vinod
Hortonworks Inc.
http://hortonworks.com/
--
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>