Re: Basic questions about Giraph internals

Sertuğ Kaya Fri, 07 Feb 2014 07:21:29 -0800

Hi all;

Thanks for this resourceful Q&A. I will also definitely try this one-mapper, multiple-thread setting per node.

But Claudio, in which configuration do you set multiple threads?
Thanks
Sertug


On 06-02-2014 16:04, Alexander Frolov wrote:


Claudio,
thank you very much for your help.

On Thu, Feb 6, 2014 at 4:06 PM, Claudio Martella <claudio.marte...@gmail.com> wrote:





    On Thu, Feb 6, 2014 at 12:15 PM, Alexander Frolov
    <alexndr.fro...@gmail.com> wrote:




        On Thu, Feb 6, 2014 at 3:00 PM, Claudio Martella
        <claudio.marte...@gmail.com> wrote:




            On Thu, Feb 6, 2014 at 11:56 AM, Alexander Frolov
            <alexndr.fro...@gmail.com> wrote:

                Hi Claudio,

                thank you.

                If I understood correctly, a mapper and a mapper task are
                the same thing.


            More or less. A mapper is a functional element of the
            programming model, while the mapper task is the task that
            executes the mapper function on the records.


        Ok, I see. Then mapred.tasktracker.map.tasks.maximum is the
        maximum number of Workers [or Workers + Master] which will be
        created on the same node.

        That is, if I have an 8-node cluster
        with mapred.tasktracker.map.tasks.maximum=4, then I can run up
        to 31 Workers + 1 Master.

        Is it correct?
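
The slot arithmetic in this question can be checked with a short sketch. It assumes that every map slot hosts exactly one Giraph task and that the master occupies one slot of its own; the function name max_giraph_workers is hypothetical and not part of Giraph or Hadoop.

```python
def max_giraph_workers(nodes, map_slots_per_node, master_uses_slot=True):
    """Upper bound on Giraph workers when each map slot hosts one task."""
    # Total map slots in the cluster:
    # nodes * mapred.tasktracker.map.tasks.maximum
    total_slots = nodes * map_slots_per_node
    # Assumption: the master runs as one additional mapper and takes a slot.
    return total_slots - 1 if master_uses_slot else total_slots


# 8 nodes with 4 map slots each: 32 slots, one of them used by the master.
print(max_giraph_workers(8, 4))
```

With the numbers from the question this gives 31 workers, leaving the remaining slot for the master.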


    That is correct. However, if you have total control over your
    cluster, you may want to run one worker per node (hence setting
    the max number of map tasks per machine to 1), and use multiple
    threads (input, compute, output).
    This is going to make better use of resources.

Should I explicitly force Giraph to use multiple threads for input, compute, output? Only three threads, I suppose? But I have 12 cores available in each node (24 if HT is enabled).
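
A hedged sketch of how the thread counts might be passed on the command line. The option names giraph.numInputThreads, giraph.numComputeThreads and giraph.numOutputThreads, and GiraphRunner's -ca (custom argument) flag, come from Giraph's configuration constants as I recall them, not from this thread, so check them against your Giraph build. The helper giraph_thread_args is hypothetical. Each option is a thread count, so the setup is not limited to three threads in total.

```python
def giraph_thread_args(compute_threads, input_threads=1, output_threads=1):
    """Build GiraphRunner '-ca key=value' arguments for per-worker threads."""
    # Assumed option names; verify them against your Giraph version.
    options = [
        ("giraph.numComputeThreads", compute_threads),
        ("giraph.numInputThreads", input_threads),
        ("giraph.numOutputThreads", output_threads),
    ]
    args = []
    for key, value in options:
        # Each custom argument is passed as a separate "-ca key=value" pair.
        args += ["-ca", "{}={}".format(key, value)]
    return args


# One worker per node, with one compute thread per physical core (12 here).
print(" ".join(giraph_thread_args(compute_threads=12)))
```

The printed string would be appended to the usual GiraphRunner invocation; the value 12 simply matches the core count mentioned in the question and is not a tuned recommendation.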





                On Thu, Feb 6, 2014 at 2:28 PM, Claudio Martella
                <claudio.marte...@gmail.com> wrote:

                    Hi Alex,

                    answers are inline.


                    On Thu, Feb 6, 2014 at 11:22 AM, Alexander Frolov
                    <alexndr.fro...@gmail.com> wrote:

                        Hi, folks!

                        I have started a small research project on the
                        Giraph framework, and I do not have much
                        experience with Giraph and Hadoop :-(.

                        I would like to ask several questions about
                        how things work in Giraph that are not
                        straightforward to me. I am trying to use the
                        sources but sometimes it is not too easy ;-)

                        So here they are:

                        1) How are Workers assigned to TaskTrackers?


                    Each worker is a mapper, and mapper tasks are
                    assigned to tasktrackers by the jobtracker.


                That is, each Worker is created at the beginning of a
                superstep and then dies. In the next superstep all
                Workers are created again. Is that correct?


            Nope. The workers are created at the beginning of the
            computation, and destroyed at the end of the computation.
            A worker is persistent throughout the computation.

                    There's no control by Giraph there, and because
                    Giraph doesn't need data-locality like Mapreduce
                    does, basically nothing is done.


                This is important for me. So a Giraph Worker (a.k.a.
                Hadoop mapper) fetches the vertex with the corresponding
                index from HDFS and performs the computation. What does
                it do next with it? As I understood, Giraph is a fully
                in-memory framework, and in the next superstep this
                vertex should be fetched from memory by the same
                Worker. Where are the vertices stored between
                supersteps? In HDFS or in memory?


            As I said, the workers are persistent (in-memory) between
            supersteps, so they keep everything in memory.


        Ok.

        Is there any means to see assignment of Workers to
        TaskTrackers during or after the computation?


    The jobtracker HTTP interface will show you the mappers running,
    so I'd check there.


        And is there any means to see assignment of vertices to
        Workers (as distribution function, histogram etc.)?


    You can check the worker logs, I think the information should be
    there.






                        2) How are vertices assigned to Workers? Does
                        it depend on the distribution of the input file
                        across DataNodes? Is there any choice of
                        distribution policies?


                    In the default scheme, vertices are assigned
                    through modulo hash partitioning. Given k workers,
                    vertex v is assigned to worker i according to
                    hash(v) % k = i.
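
The formula above can be sketched as follows. This is a minimal illustration of modulo hash partitioning only, not Giraph's actual partitioner code: Python's built-in hash() stands in for the vertex ID's hash, integer vertex IDs are assumed, and the names worker_for_vertex and assignment_histogram are hypothetical.

```python
from collections import Counter


def worker_for_vertex(vertex_id, num_workers):
    """Return the worker index i such that hash(v) % k == i."""
    # Python's % always yields a value in [0, num_workers) for a positive
    # divisor, so negative hashes need no extra handling here.
    return hash(vertex_id) % num_workers


def assignment_histogram(vertex_ids, num_workers):
    """Count how many vertices each worker receives."""
    return Counter(worker_for_vertex(v, num_workers) for v in vertex_ids)


# Ten vertices with integer IDs 0..9 spread over k = 4 workers.
histogram = assignment_histogram(range(10), 4)
for worker in sorted(histogram):
    print(worker, histogram[worker])
```

A histogram like this is one way to look at the vertex-to-worker distribution asked about earlier in the thread, under the stated assumptions about the hash function.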


                        3) How are Workers and Map tasks related to
                        each other? (1:1)? (n:1)? (1:n)?


                    It's 1:1. Each worker is implemented by a mapper
                    task. The master is usually (but does not need to
                    be) implemented by an additional mapper.


                        4) Can Workers migrate from one TaskTracker to
                        the other?


                    Workers do not migrate. A Giraph computation is
                    not dynamic with respect to the assignment and size
                    of the tasks.


                        5) What is the best way to monitor Giraph app
                        execution (progress, worker assignment, load
                        balancing etc.)?


                    Just like you would for a standard Mapreduce job.
                    Go to the job page on the jobtracker http page.


                        I think this is all for the moment. Thank you.

                        Testbed description:
                        Hardware: 8 node dual-CPU cluster with IB FDR.
                        Giraph: release-1.0.0-RC2-152-g585511f
                        Hadoop: hadoop-0.20.203.0, hadoop-rdma-0.9.8

                        Best,
                           Alex

--Claudio Martella
