Re: Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.

Marcos Ortiz Mon, 04 Feb 2013 05:37:54 -0800

Regards, blah.
You can use these links:
MAPREDUCE-279: https://issues.apache.org/jira/browse/MAPREDUCE-279

MapReduce Next Gen: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen

You can also read Cloudera's blog posts about YARN:
http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/
http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/

There is a great document written by Arun, Owen, and others about the architecture of YARN, but I don't have it here right now.

Best wishes
On 02/04/2013 04:35 AM, blah blah wrote:

Thank you very much for answering my question. Are there any publicly available Hadoop-MR-Yarn UML diagrams (class, activity, etc.), or more in-depth documentation beyond what is on the official site? I am interested in implementation details/documentation of the MR AM and MR containers (the old TaskTracker).


regards
blah

2013/2/1 Vinod Kumar Vavilapalli <[email protected]>


    You got that mostly right, and it doesn't differ much in Hadoop
    1.* either. With the MR AM doing the work that was earlier done in
    the JobTracker, the JobClient and the task side don't change much.

    FileInputFormat.getSplits() is called by the client itself, so you
    should look for your log entry on the client machine.

    Each filesystem overrides getFileBlockLocations() and provides the
    correct locations. For example, DFS internally uses the
    getBlockLocations() API on the NameNode. What you are seeing is the
    default implementation, which is used for the local FS.

    HTH,
    +Vinod
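
The point about getFileBlockLocations() can be illustrated with a minimal sketch. It uses plain Java only and does not touch the Hadoop API: BaseFs, ClusterFs, and the node names are hypothetical stand-ins, and the signature is simplified to return host names as strings. It only shows the pattern described above, where a default implementation reports "localhost" and a concrete filesystem overrides it with real hosts.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BlockLocationSketch {
    /** Hypothetical base filesystem: the default knows nothing about the cluster. */
    static class BaseFs {
        String[] getFileBlockLocations(String path, long start, long len) {
            // Placeholder answer, reasonable only for a single-machine (local) filesystem.
            return new String[] { "localhost" };
        }
    }

    /** Hypothetical distributed filesystem: overrides the default with real hosts. */
    static class ClusterFs extends BaseFs {
        private final Map<String, String[]> hostsByPath;

        ClusterFs(Map<String, String[]> hostsByPath) {
            this.hostsByPath = hostsByPath;
        }

        @Override
        String[] getFileBlockLocations(String path, long start, long len) {
            // A real DFS would ask its metadata service here; this sketch uses a map.
            return hostsByPath.getOrDefault(path, new String[0]);
        }
    }

    public static void main(String[] args) {
        Map<String, String[]> hosts = new HashMap<>();
        hosts.put("/in/a.txt", new String[] { "node1", "node2" });

        BaseFs local = new BaseFs();
        BaseFs cluster = new ClusterFs(hosts);

        // Same call, different answers depending on which filesystem is in use.
        System.out.println(Arrays.toString(local.getFileBlockLocations("/in/a.txt", 0, 10)));
        System.out.println(Arrays.toString(cluster.getFileBlockLocations("/in/a.txt", 0, 10)));
    }
}
```

Under these assumptions, code that only reads the base class will always see "localhost", even though the filesystem actually in use at runtime may return something else.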



    On Fri, Feb 1, 2013 at 6:24 AM, blah blah <[email protected]> wrote:

        Hi

        (I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)

        I have a question regarding my assumptions about the Yarn-MR
        design, especially the InputSplit processing. Can someone
        confirm or point out my mistakes in my MR-Yarn design assumptions?

        These are my assumptions regarding the design.
        1. JobClient submits Job
        Create AppMaster etc.
        2. Get number of splits // MR-AM, especially their hosts, so
        that a Task can be started on the same node, use
        *InputFormat.getSplits() { ...;
        FileSystem.getFileBlockLocations(); ...;}
        3. Start N tasks // MR-AM
        4. Each Task processes its (single) split (unless splitsNr >>
        tasksNr) with the use of InputFormat/RecordReader // MR-Task,
        from HERE InputFormat operates only on a single Split
        5. Start RecordReader and process Split // MR-Task
        6. MAP() // MR-Task
        7. Do rest MR // MR-Task
        8. Dump to HDFS/or other storage. // MR-Task
        9. Report FINISH, free resources // MR-AM
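
The flow in the steps above can be sketched in plain Java, without the Hadoop API. Everything here is a hypothetical simplification: Split, getSplits(), countNewlines(), and the node names are illustration only, one split is cut per block, and the "task" just counts newline bytes in its own byte range. The real FileInputFormat applies further rules (such as minimum and maximum split sizes) that are omitted. Note that, as the reply in this thread points out, the split computation runs on the client rather than in the MR AM.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitFlowSketch {
    /** Hypothetical, simplified split: a byte range plus the hosts holding it. */
    static final class Split {
        final long start;
        final long length;
        final String[] hosts;

        Split(long start, long length, String[] hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Split computation: one split per block, tagged with that block's hosts. */
    static List<Split> getSplits(long fileLen, long blockSize, String[][] hostsPerBlock) {
        List<Split> splits = new ArrayList<>();
        for (long off = 0; off < fileLen; off += blockSize) {
            long len = Math.min(blockSize, fileLen - off);
            splits.add(new Split(off, len, hostsPerBlock[(int) (off / blockSize)]));
        }
        return splits;
    }

    /** Task side: a task reads only the bytes that belong to its own split. */
    static int countNewlines(byte[] data, Split split) {
        int count = 0;
        for (long i = split.start; i < split.start + split.length; i++) {
            if (data[(int) i] == '\n') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "a\nbb\nccc\ndddd\n".getBytes(StandardCharsets.US_ASCII);
        String[][] hosts = { { "node1" }, { "node2" }, { "node3" } };
        List<Split> splits = getSplits(data.length, 5, hosts);

        int total = 0;
        for (Split s : splits) {
            // A scheduler would try to place this task on one of s.hosts.
            int n = countNewlines(data, s);
            total += n;
            System.out.println("split@" + s.start + " len=" + s.length
                    + " host=" + s.hosts[0] + " newlines=" + n);
        }
        System.out.println("total newlines=" + total);
    }
}
```

The per-split counts add up to the count for the whole input, which is the property that lets each task work on a single split independently.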

        2 quick bonus questions

        I have added an additional log entry in
        FileInputFormat.getSplits(), but I cannot see it in the log
        files. I am using the WordCount example and INFO level. What might
        be the problem?
        In FileSystem.getFileBlockLocations() the hostname is
        hard-coded as "localhost". Where is this mapped to the actual
        host name, so that the AM knows which nodes to request?

        Thanks for your reply

--+Vinod

    Hortonworks Inc.
    http://hortonworks.com/


--
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
