Hi Hassan, thanks for your answer. I don't know if is clear (because i'm using my own algorithm instead of using one of the Giraph examples) but the input file ( /user/hduser/input/grafo-wikipedia.txt) is loaded in HDFS. i was thinking that with 8 input splits, giraph was partitioning the input file for paralell proccesing, but i wasnt' sure.
I will try with only two threads in same worker. Maybe using giraph.userPartitionCount will help? Or those partitions are destinated to computation steps only? Bye! Jose 2016-08-26 3:11 GMT-03:00 Hassan Eslami <hsn.esl...@gmail.com>: > It seems that you are reading the data from a single file stored on a > local machine with multiple threads. Having multiple threads accessing the > disk causes IO interference which in turn reduces the IO performance. If > you are reading from a single file on a local machine with 8 threads, the > results you've got is kind of expected. In such case, you are better off > using single thread in reading from the disk. You can also try to do it > with two threads, so that you may be able to get some overlapping benefit > of reading from disk and deserializing the input. > > On Thu, Aug 25, 2016 at 5:36 PM, José Luis Larroque < > larroques...@gmail.com> wrote: > >> he cluster used for this was 1 master and one slave, both of a r3.8xlarge >> EC2 instance on AWS. >> >> 2016-08-25 19:26 GMT-03:00 José Luis Larroque <larroques...@gmail.com>: >> >>> I'm doing BFS search through the Wikipedia (spanish edition) site. I >>> converted the [dump][1] into a file that could be read with Giraph. >>> >>> Using 1 worker, a file of 1 GB took 492 seconds. I executed Giraph with >>> this command: >>> >>> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar >>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote >>> -vif >>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat >>> -vip /user/hduser/input/grafo-wikipedia.txt -vof >>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat >>> -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca >>> giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true >>> >>> Container logs: >>> >>> 16/08/24 21:17:02 INFO master.BspServiceMaster: >>> generateVertexInputSplits: Got 8 input splits for 1 input threads >>> 16/08/24 21:17:02 INFO master.BspServiceMaster: >>> createVertexInputSplits: Starting to write input split data to zookeeper >>> with 1 threads >>> 16/08/24 21:17:02 INFO master.BspServiceMaster: >>> createVertexInputSplits: Done writing input split data to zookeeper >>> 16/08/24 21:17:02 INFO yarn.GiraphYarnTask: [STATUS: task-0] >>> MASTER_ZOOKEEPER_ONLY checkWorkers: Done - Found 1 responses of 1 needed to >>> start superstep -1 >>> 16/08/24 21:17:02 INFO netty.NettyClient: Using Netty without >>> authentication. >>> 16/08/24 21:17:02 INFO netty.NettyClient: connectAllAddresses: >>> Successfully added 1 connections, (1 total connected) 0 failed, 0 failures >>> total. >>> 16/08/24 21:17:02 INFO partition.PartitionUtils: >>> computePartitionCount: Creating 1, default would have been 1 partitions. >>> ... >>> 16/08/24 21:25:40 INFO netty.NettyClient: stop: Halting netty client >>> 16/08/24 21:25:40 INFO netty.NettyClient: stop: reached wait >>> threshold, 1 connections closed, releasing resources now. >>> 16/08/24 21:25:43 INFO netty.NettyClient: stop: Netty client halted >>> 16/08/24 21:25:43 INFO netty.NettyServer: stop: Halting netty server >>> 16/08/24 21:25:43 INFO netty.NettyServer: stop: Start releasing >>> resources >>> 16/08/24 21:25:44 INFO bsp.BspService: process: >>> cleanedUpChildrenChanged signaled >>> 16/08/24 21:25:47 INFO netty.NettyServer: stop: Netty server halted >>> 16/08/24 21:25:47 INFO bsp.BspService: process: >>> masterElectionChildrenChanged signaled >>> 16/08/24 21:25:47 INFO master.MasterThread: setup: Took 0.898 >>> seconds. >>> 16/08/24 21:25:47 INFO master.MasterThread: input superstep: Took >>> 452.531 seconds. >>> 16/08/24 21:25:47 INFO master.MasterThread: superstep 0: Took 64.376 >>> seconds. >>> 16/08/24 21:25:47 INFO master.MasterThread: superstep 1: Took 1.591 >>> seconds. >>> 16/08/24 21:25:47 INFO master.MasterThread: shutdown: Took 6.609 >>> seconds. >>> 16/08/24 21:25:47 INFO master.MasterThread: total: Took 526.006 >>> seconds. >>> >>> As you guys can see, the first line tell us that input superstep is >>> executing with only **one** thread. And took 492 second in finish Input >>> Superstep. >>> >>> I did another test, using giraph.numInputThreads=8, tryng to do the >>> input superstep with 8 threads: >>> >>> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar >>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote >>> -vif >>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat >>> -vip /user/hduser/input/grafo-wikipedia.txt -vof >>> ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat >>> -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca >>> giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true, >>> giraph.numInputThreads=8 >>> >>> The result was the following one: >>> >>> 16/08/24 21:54:00 INFO master.BspServiceMaster: >>> generateVertexInputSplits: Got 8 input splits for 8 input threads >>> 16/08/24 21:54:00 INFO master.BspServiceMaster: >>> createVertexInputSplits: Starting to write input split data to zookeeper >>> with 1 threads >>> 16/08/24 21:54:00 INFO master.BspServiceMaster: >>> createVertexInputSplits: Done writing input split data to zookeeper >>> ... >>> >>> 16/08/24 22:10:07 INFO master.MasterThread: setup: Took 0.093 >>> seconds. >>> 16/08/24 22:10:07 INFO master.MasterThread: input superstep: Took >>> 891.339 seconds. >>> 16/08/24 22:10:07 INFO master.MasterThread: superstep 0: Took 66.635 >>> seconds. >>> 16/08/24 22:10:07 INFO master.MasterThread: superstep 1: Took 1.837 >>> seconds. >>> 16/08/24 22:10:07 INFO master.MasterThread: shutdown: Took 6.605 >>> seconds. >>> 16/08/24 22:10:07 INFO master.MasterThread: total: Took 966.512 >>> seconds. >>> >>> >>> So, my question is, how can be possible that Giraph is using 492 seconds >>> without input threads and 891 seconds with them? Should be exacly the >>> opposite, right? >>> >>> >>> [1]: https://dumps.wikimedia.org/eswiki/20160601/ "dump" >>> >> >> >