Da, holding objects in serialized form as bytes in byte arrays consumes much less memory than holding them as Java objects (which carry a large per-object overhead). I think that is the other main reason for serialization.
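To illustrate the point, here is a minimal, hypothetical sketch (not Giraph's actual code) of packing vertex records into a single byte array. The class name and record layout are made up for the example; the idea is that the GC then tracks one large array instead of one object per vertex, and the per-object header overhead disappears.

```java
import java.io.*;

// Hypothetical sketch: pack (id, value) vertex records into one byte[]
// instead of allocating one object per vertex. A single large array is
// one object for the GC, no matter how many vertices it encodes.
public class PackedVertices {

    // Each record is 8 bytes (long id) + 8 bytes (double value) = 16 bytes.
    public static byte[] pack(long[] ids, double[] values) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (int i = 0; i < ids.length; i++) {
            out.writeLong(ids[i]);      // vertex id
            out.writeDouble(values[i]); // vertex value
        }
        return bos.toByteArray();
    }

    // Deserialize a single vertex value on demand, by record index.
    public static double valueAt(byte[] packed, int index) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed));
        in.skipBytes(index * 16); // skip preceding 16-byte records
        in.readLong();            // skip the id of this record
        return in.readDouble();
    }

    public static void main(String[] args) throws IOException {
        byte[] p = pack(new long[]{1L, 2L, 3L}, new double[]{0.15, 0.3, 0.55});
        System.out.println(p.length);      // 48 bytes for 3 vertices
        System.out.println(valueAt(p, 1)); // 0.3
    }
}
```

The trade-off is CPU for memory: reading a value means deserializing it, but the heap holds far fewer objects for the collector to trace.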
--sebastian

On 18.10.2013 19:28, YAN Da wrote:
> Dear Claudio Martella,
>
> According to https://reviews.apache.org/r/7990/diff/?page=2, Giraph
> currently organizes vertices as byte streams, probably as pages.
>
> In the URL: "This also significantly reduces GC time, as there are less
> objects to GC."
>
> Why is "also" there? I mean, is reducing GC time the only reason for
> doing serialization?
>
> Regards,
> Da
>
>> Dear Claudio Martella,
>>
>> I don't quite get what you mean. Our cluster has 15 servers, each with
>> 24 cores, so ideally there can be 15*24 threads/partitions working in
>> parallel, right? (Perhaps deduct one for ZooKeeper.)
>>
>> However, when we set the "-Dgiraph.numComputeThreads" option, we find
>> that we cannot have even 20 threads, and when it is set to 10, the CPU
>> usage is only about double that of the default setting, not anything
>> close to 100*numComputeThreads%.
>>
>> How can we set it up on our servers to utilize all the processors?
>>
>> Regards,
>> Da Yan
>>
>>> It actually depends on the setup of your cluster.
>>>
>>> Ideally, with 15 nodes (tasktrackers) you'd want 1 mapper slot per node
>>> (ideally to run Giraph), so that you would have 14 workers, one per
>>> computing node, plus one for master+ZooKeeper. Once that is reached,
>>> you would have a number of compute threads equal to the number of
>>> threads that you can run on each node (24 in your case).
>>>
>>> Does this make sense to you?
>>>
>>>
>>> On Thu, Oct 17, 2013 at 5:04 PM, Yi Lu <luyi0...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a computer cluster consisting of 15 slave machines and 1 master
>>>> machine.
>>>>
>>>> On each slave machine, there are two Xeon E5-2620 CPUs. With the help
>>>> of HT, there are 24 hardware threads.
>>>>
>>>> I am wondering how to specify parameters in order to run a Giraph job
>>>> in parallel on my cluster.
>>>>
>>>> I am using the following parameters to run a PageRank algorithm.
>>>>
>>>> hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner
>>>> SimplePageRank -vif PageRankInputFormat -vip /input -vof
>>>> PageRankOutputFormat -op /pagerank -w 1 -mc
>>>> SimplePageRank\$SimplePageRankMasterCompute -wc
>>>> SimplePageRank\$SimplePageRankWorkerContext
>>>>
>>>> In particular,
>>>>
>>>> 1) I know I can use "-w" to specify the number of workers. In my
>>>> opinion, the number of workers equals the number of mappers in Hadoop,
>>>> excluding ZooKeeper. Therefore, in my case (15 slave machines), which
>>>> number should be chosen? Is 15 a good choice? I find that if I enter a
>>>> large number, e.g. 100, the mappers hang.
>>>>
>>>> 2) I know I can use "-Dgiraph.numComputeThreads=1" to specify the
>>>> number of vertex-compute threads. However, if I set it to 10, the
>>>> total runtime is much longer than with the default. I think the
>>>> default is 1, which I found in the source code. If I want to use this
>>>> parameter, which number should be chosen?
>>>>
>>>> 3) When the Giraph job is running, I use the "top" command to monitor
>>>> CPU usage on the slave machines. I find that the Java process can use
>>>> 200%-300% CPU. However, if I change the number of vertex-compute
>>>> threads to 10, the Java process can use 800% CPU. This is not a linear
>>>> relation, and I want to know why.
>>>>
>>>> Thanks for your help.
>>>>
>>>> Best,
>>>>
>>>> -Yi
>>>
>>>
>>> --
>>> Claudio Martella
>>> claudio.marte...@gmail.com
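Putting Claudio's sizing advice together for this cluster, the invocation might look like the following sketch. The jar path, class names, and input/output paths are carried over from Yi's example and are assumptions about the local setup; only the `-w` and `-Dgiraph.numComputeThreads` values reflect the advice in the thread.

```shell
# Sketch, assuming 15 tasktracker nodes with 1 map slot each:
# 14 workers (one per compute node, one slot left for master+ZooKeeper)
# and one compute thread per hardware thread (24 on these nodes).
hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner \
    -Dgiraph.numComputeThreads=24 \
    SimplePageRank \
    -vif PageRankInputFormat -vip /input \
    -vof PageRankOutputFormat -op /pagerank \
    -w 14 \
    -mc SimplePageRank\$SimplePageRankMasterCompute \
    -wc SimplePageRank\$SimplePageRankWorkerContext
```

Note that compute threads only help if each worker has enough partitions to keep them busy; with one worker per node, the per-node thread count rather than the worker count is what should approach the core count.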