Claudio, thank you very much for your help.

On Thu, Feb 6, 2014 at 4:06 PM, Claudio Martella <claudio.marte...@gmail.com> wrote:
> On Thu, Feb 6, 2014 at 12:15 PM, Alexander Frolov <alexndr.fro...@gmail.com> wrote:
>
>> On Thu, Feb 6, 2014 at 3:00 PM, Claudio Martella <claudio.marte...@gmail.com> wrote:
>>
>>> On Thu, Feb 6, 2014 at 11:56 AM, Alexander Frolov <alexndr.fro...@gmail.com> wrote:
>>>
>>>> Hi Claudio,
>>>>
>>>> thank you.
>>>>
>>>> If I understood correctly, a mapper and a mapper task are the same thing.
>>>
>>> More or less. A mapper is a functional element of the programming model,
>>> while a mapper task is the task that executes the mapper function on the
>>> records.
>>
>> Ok, I see. Then mapred.tasktracker.map.tasks.maximum is the maximum number
>> of Workers (or Workers + Master) that will be created on the same node.
>>
>> That is, if I have an 8-node cluster with
>> mapred.tasktracker.map.tasks.maximum=4, then I can run up to 31 Workers
>> + 1 Master.
>>
>> Is that correct?
>
> That is correct. However, if you have total control over your cluster, you
> may want to run one worker per node (hence setting the max number of map
> tasks per machine to 1) and use multiple threads (input, compute, output).
> This makes better use of the resources.

Should I explicitly force Giraph to use multiple threads for input, compute,
and output? Only three threads, I suppose? But I have 12 cores available on
each node (24 if HT is enabled).

>>>> On Thu, Feb 6, 2014 at 2:28 PM, Claudio Martella <claudio.marte...@gmail.com> wrote:
>>>>
>>>>> Hi Alex,
>>>>>
>>>>> answers are inline.
>>>>>
>>>>> On Thu, Feb 6, 2014 at 11:22 AM, Alexander Frolov <alexndr.fro...@gmail.com> wrote:
>>>>>
>>>>>> Hi, folks!
>>>>>>
>>>>>> I have started a small study of the Giraph framework, and I do not
>>>>>> have much experience with Giraph and Hadoop :-(.
>>>>>>
>>>>>> I would like to ask several questions about how things work in Giraph
>>>>>> that are not straightforward for me.
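[On the one-worker-per-node suggestion above: a hedged sketch of how the thread counts might be passed to GiraphRunner. The giraph.num*Threads option names are taken from GiraphConstants in recent Giraph versions and should be checked against your build; the jar name, computation class, and HDFS paths are placeholders, not from this thread.]

```shell
# Limit each TaskTracker to a single map slot (mapred-site.xml):
#   mapred.tasktracker.map.tasks.maximum = 1
# On an 8-node cluster one slot then goes to the master, leaving 7 workers.
#
# Give each worker multiple threads, e.g. on a 12-core node:
hadoop jar giraph-examples-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/alex/input/graph \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/alex/output \
  -w 7 \
  -ca giraph.numInputThreads=4 \
  -ca giraph.numComputeThreads=10 \
  -ca giraph.numOutputThreads=4
```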
>>>>>> I am trying to use the sources, but sometimes it is not too easy ;-)
>>>>>>
>>>>>> So here they are:
>>>>>>
>>>>>> 1) How are Workers assigned to TaskTrackers?
>>>>>
>>>>> Each worker is a mapper, and mapper tasks are assigned to tasktrackers
>>>>> by the jobtracker.
>>>>
>>>> That is, each Worker is created at the beginning of a superstep and then
>>>> dies, and in the next superstep all Workers are created again. Is that
>>>> correct?
>>>
>>> Nope. The workers are created at the beginning of the computation and
>>> destroyed at the end of the computation. A worker is persistent
>>> throughout the computation.
>>>
>>>>> There's no control by Giraph there, and because Giraph doesn't need
>>>>> data-locality the way Mapreduce does, basically nothing is done.
>>>>
>>>> This is important for me. So a Giraph Worker (a.k.a. Hadoop mapper)
>>>> fetches the vertex with the corresponding index from HDFS and performs
>>>> the computation. What does it do with it next? As I understood it,
>>>> Giraph is a fully in-memory framework, and in the next superstep this
>>>> vertex should be fetched from memory by the same Worker. Where are the
>>>> vertices stored between supersteps? In HDFS or in memory?
>>>
>>> As I said, the workers are persistent (in-memory) between supersteps, so
>>> they keep everything in memory.
>>
>> Ok.
>>
>> Is there any means to see the assignment of Workers to TaskTrackers
>> during or after the computation?
>
> The jobtracker http interface will show you the mappers running, hence I'd
> check there.
>
>> And is there any means to see the assignment of vertices to Workers (as a
>> distribution function, histogram, etc.)?
>
> You can check the worker logs; I think the information should be there.
>
>>>>>> 2) How are vertices assigned to Workers? Does it depend on the
>>>>>> distribution of the input file across DataNodes? Is there any choice
>>>>>> of distribution policy?
>>>>>
>>>>> In the default scheme, vertices are assigned through modulo hash
>>>>> partitioning. Given k workers, vertex v is assigned to worker i
>>>>> according to hash(v) % k = i.
>>>>>
>>>>>> 3) How are Workers and Map tasks related to each other? (1:1)? (n:1)?
>>>>>> (1:n)?
>>>>>
>>>>> It's 1:1. Each worker is implemented by a mapper task. The master is
>>>>> usually (but does not need to be) implemented by an additional mapper.
>>>>>
>>>>>> 4) Can Workers migrate from one TaskTracker to another?
>>>>>
>>>>> Workers do not migrate. A Giraph computation is not dynamic with
>>>>> respect to the assignment and size of the tasks.
>>>>>
>>>>>> 5) What is the best way to monitor Giraph app execution (progress,
>>>>>> worker assignment, load balancing, etc.)?
>>>>>
>>>>> Just like you would for a standard Mapreduce job: go to the job page
>>>>> on the jobtracker http interface.
>>>>>
>>>>>> I think this is all for the moment. Thank you.
>>>>>>
>>>>>> Testbed description:
>>>>>> Hardware: 8-node dual-CPU cluster with IB FDR.
>>>>>> Giraph: release-1.0.0-RC2-152-g585511f
>>>>>> Hadoop: hadoop-0.20.203.0, hadoop-rdma-0.9.8
>>>>>>
>>>>>> Best,
>>>>>> Alex
>>>>>
>>>>> --
>>>>> Claudio Martella
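[The modulo hash partitioning described in the thread (worker i gets vertex v iff hash(v) % k = i) can be sketched in a few lines of plain Java. This is an illustrative model under that description, not Giraph's actual partitioner class; the class and method names are made up for the sketch.]

```java
// Minimal sketch of modulo hash partitioning across k workers.
public class HashPartitionSketch {

    /** Index of the worker that owns the given vertex id, with k workers. */
    public static int workerFor(long vertexId, int numWorkers) {
        int h = Long.hashCode(vertexId);   // stand-in for hash(v)
        return Math.abs(h % numWorkers);   // abs() guards against negative hashes
    }

    public static void main(String[] args) {
        int k = 8; // e.g. one worker per node on the 8-node testbed above
        int[] counts = new int[k];
        for (long v = 0; v < 10_000; v++) {
            counts[workerFor(v, k)]++;
        }
        // Sequential ids spread evenly: each worker gets exactly 1250 here.
        for (int i = 0; i < k; i++) {
            System.out.println("worker " + i + ": " + counts[i] + " vertices");
        }
    }
}
```

Note that the assignment is static: as Claudio points out, workers do not migrate, so this mapping stays fixed for the whole computation.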