It seems that you are reading the data from a single file stored on a local
machine with multiple threads. Multiple threads accessing the disk at once
cause IO interference, which in turn reduces IO performance. If you are
reading a single file on a local machine with 8 threads, the results you got
are to be expected. In that case, you are better off using a single thread
to read from the disk. You can also try two threads, so that you may get
some benefit from overlapping the disk reads with deserializing the input.
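
For instance, here is a minimal sketch of the two-thread run, assuming the
same jar, classes and paths as in your command quoted below and changing only
giraph.numInputThreads. The INPUT_THREADS and CUSTOM_ARGS variable names are
mine, not part of Giraph, and the script only prints the command so you can
check it before submitting.

```shell
#!/bin/sh
# Number of input threads to try: 1 first, then 2 (values from the advice above).
INPUT_THREADS=2

# Custom arguments from the original command; only numInputThreads is changed.
CUSTOM_ARGS="giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.numInputThreads=${INPUT_THREADS}"

# Print the full command for review; remove the leading "echo" to submit the job.
echo /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar \
  ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote \
  -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat \
  -vip /user/hduser/input/grafo-wikipedia.txt \
  -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat \
  -op /user/hduser/output/caminosNavegacionales \
  -w 1 -yh 120000 \
  -ca "$CUSTOM_ARGS"
```

Comparing the "input superstep: Took ..." line in the master log for
INPUT_THREADS=1 and INPUT_THREADS=2 should show whether the overlap helps on
your setup.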

On Thu, Aug 25, 2016 at 5:36 PM, José Luis Larroque <larroques...@gmail.com>
wrote:

> The cluster used for this was one master and one slave, both r3.8xlarge
> EC2 instances on AWS.
>
> 2016-08-25 19:26 GMT-03:00 José Luis Larroque <larroques...@gmail.com>:
>
>> I'm doing BFS search through the Wikipedia (spanish edition) site. I
>> converted the [dump][1] into a file that could be read with Giraph.
>>
>> Using 1 worker, a file of 1 GB took 492 seconds. I executed Giraph with
>> this command:
>>
>>     /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>> -vif 
>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>> -vip /user/hduser/input/grafo-wikipedia.txt -vof
>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>> -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca
>> giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true
>>
>> Container logs:
>>
>>     16/08/24 21:17:02 INFO master.BspServiceMaster:
>> generateVertexInputSplits: Got 8 input splits for 1 input threads
>>     16/08/24 21:17:02 INFO master.BspServiceMaster:
>> createVertexInputSplits: Starting to write input split data to zookeeper
>> with 1 threads
>>     16/08/24 21:17:02 INFO master.BspServiceMaster:
>> createVertexInputSplits: Done writing input split data to zookeeper
>>     16/08/24 21:17:02 INFO yarn.GiraphYarnTask: [STATUS: task-0]
>> MASTER_ZOOKEEPER_ONLY checkWorkers: Done - Found 1 responses of 1 needed to
>> start superstep -1
>>     16/08/24 21:17:02 INFO netty.NettyClient: Using Netty without
>> authentication.
>>     16/08/24 21:17:02 INFO netty.NettyClient: connectAllAddresses:
>> Successfully added 1 connections, (1 total connected) 0 failed, 0 failures
>> total.
>>     16/08/24 21:17:02 INFO partition.PartitionUtils:
>> computePartitionCount: Creating 1, default would have been 1 partitions.
>>     ...
>>     16/08/24 21:25:40 INFO netty.NettyClient: stop: Halting netty client
>>     16/08/24 21:25:40 INFO netty.NettyClient: stop: reached wait
>> threshold, 1 connections closed, releasing resources now.
>>     16/08/24 21:25:43 INFO netty.NettyClient: stop: Netty client halted
>>     16/08/24 21:25:43 INFO netty.NettyServer: stop: Halting netty server
>>     16/08/24 21:25:43 INFO netty.NettyServer: stop: Start releasing
>> resources
>>     16/08/24 21:25:44 INFO bsp.BspService: process:
>> cleanedUpChildrenChanged signaled
>>     16/08/24 21:25:47 INFO netty.NettyServer: stop: Netty server halted
>>     16/08/24 21:25:47 INFO bsp.BspService: process:
>> masterElectionChildrenChanged signaled
>>     16/08/24 21:25:47 INFO master.MasterThread: setup: Took 0.898 seconds.
>>     16/08/24 21:25:47 INFO master.MasterThread: input superstep: Took
>> 452.531 seconds.
>>     16/08/24 21:25:47 INFO master.MasterThread: superstep 0: Took 64.376
>> seconds.
>>     16/08/24 21:25:47 INFO master.MasterThread: superstep 1: Took 1.591
>> seconds.
>>     16/08/24 21:25:47 INFO master.MasterThread: shutdown: Took 6.609
>> seconds.
>>     16/08/24 21:25:47 INFO master.MasterThread: total: Took 526.006
>> seconds.
>>
>> As you can see, the first line tells us that the input superstep is
>> executing with only **one** thread, and the input superstep took 492
>> seconds to finish.
>>
>> I did another test, using giraph.numInputThreads=8, trying to do the input
>> superstep with 8 threads:
>>
>>     /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>> -vif 
>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>> -vip /user/hduser/input/grafo-wikipedia.txt -vof
>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>> -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca
>> giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,
>> giraph.numInputThreads=8
>>
>> The result was the following one:
>>
>>     16/08/24 21:54:00 INFO master.BspServiceMaster:
>> generateVertexInputSplits: Got 8 input splits for 8 input threads
>>     16/08/24 21:54:00 INFO master.BspServiceMaster:
>> createVertexInputSplits: Starting to write input split data to zookeeper
>> with 1 threads
>>     16/08/24 21:54:00 INFO master.BspServiceMaster:
>> createVertexInputSplits: Done writing input split data to zookeeper
>>     ...
>>
>>     16/08/24 22:10:07 INFO master.MasterThread: setup: Took 0.093 seconds.
>>     16/08/24 22:10:07 INFO master.MasterThread: input superstep: Took
>> 891.339 seconds.
>>     16/08/24 22:10:07 INFO master.MasterThread: superstep 0: Took 66.635
>> seconds.
>>     16/08/24 22:10:07 INFO master.MasterThread: superstep 1: Took 1.837
>> seconds.
>>     16/08/24 22:10:07 INFO master.MasterThread: shutdown: Took 6.605
>> seconds.
>>     16/08/24 22:10:07 INFO master.MasterThread: total: Took 966.512
>> seconds.
>>
>>
>> So, my question is: how is it possible that Giraph takes 492 seconds
>> with a single input thread and 891 seconds with 8 of them? It should be
>> exactly the opposite, right?
>>
>>
>>   [1]: https://dumps.wikimedia.org/eswiki/20160601/ "dump"
>>
>
>
