I'm doing BFS search through the Wikipedia (spanish edition) site. I converted the [dump][1] into a file that could be read with Giraph.
Using 1 worker, a file of 1 GB took 492 seconds. I executed Giraph with this command: /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true Container logs: 16/08/24 21:17:02 INFO master.BspServiceMaster: generateVertexInputSplits: Got 8 input splits for 1 input threads 16/08/24 21:17:02 INFO master.BspServiceMaster: createVertexInputSplits: Starting to write input split data to zookeeper with 1 threads 16/08/24 21:17:02 INFO master.BspServiceMaster: createVertexInputSplits: Done writing input split data to zookeeper 16/08/24 21:17:02 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ZOOKEEPER_ONLY checkWorkers: Done - Found 1 responses of 1 needed to start superstep -1 16/08/24 21:17:02 INFO netty.NettyClient: Using Netty without authentication. 16/08/24 21:17:02 INFO netty.NettyClient: connectAllAddresses: Successfully added 1 connections, (1 total connected) 0 failed, 0 failures total. 16/08/24 21:17:02 INFO partition.PartitionUtils: computePartitionCount: Creating 1, default would have been 1 partitions. ... 16/08/24 21:25:40 INFO netty.NettyClient: stop: Halting netty client 16/08/24 21:25:40 INFO netty.NettyClient: stop: reached wait threshold, 1 connections closed, releasing resources now. 16/08/24 21:25:43 INFO netty.NettyClient: stop: Netty client halted 16/08/24 21:25:43 INFO netty.NettyServer: stop: Halting netty server 16/08/24 21:25:43 INFO netty.NettyServer: stop: Start releasing resources 16/08/24 21:25:44 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled 16/08/24 21:25:47 INFO netty.NettyServer: stop: Netty server halted 16/08/24 21:25:47 INFO bsp.BspService: process: masterElectionChildrenChanged signaled 16/08/24 21:25:47 INFO master.MasterThread: setup: Took 0.898 seconds. 16/08/24 21:25:47 INFO master.MasterThread: input superstep: Took 452.531 seconds. 16/08/24 21:25:47 INFO master.MasterThread: superstep 0: Took 64.376 seconds. 16/08/24 21:25:47 INFO master.MasterThread: superstep 1: Took 1.591 seconds. 16/08/24 21:25:47 INFO master.MasterThread: shutdown: Took 6.609 seconds. 16/08/24 21:25:47 INFO master.MasterThread: total: Took 526.006 seconds. As you guys can see, the first line tell us that input superstep is executing with only **one** thread. And took 492 second in finish Input Superstep. I did another test, using giraph.numInputThreads=8, tryng to do the input superstep with 8 threads: /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.numInputThreads=8 The result was the following one: 16/08/24 21:54:00 INFO master.BspServiceMaster: generateVertexInputSplits: Got 8 input splits for 8 input threads 16/08/24 21:54:00 INFO master.BspServiceMaster: createVertexInputSplits: Starting to write input split data to zookeeper with 1 threads 16/08/24 21:54:00 INFO master.BspServiceMaster: createVertexInputSplits: Done writing input split data to zookeeper ... 16/08/24 22:10:07 INFO master.MasterThread: setup: Took 0.093 seconds. 16/08/24 22:10:07 INFO master.MasterThread: input superstep: Took 891.339 seconds. 16/08/24 22:10:07 INFO master.MasterThread: superstep 0: Took 66.635 seconds. 16/08/24 22:10:07 INFO master.MasterThread: superstep 1: Took 1.837 seconds. 16/08/24 22:10:07 INFO master.MasterThread: shutdown: Took 6.605 seconds. 16/08/24 22:10:07 INFO master.MasterThread: total: Took 966.512 seconds. So, my question is, how can be possible that Giraph is using 492 seconds without input threads and 891 seconds with them? Should be exacly the opposite, right? [1]: https://dumps.wikimedia.org/eswiki/20160601/ "dump"