The logs appear to show that you get two identical input splits:

14/01/28 11:02:41 INFO worker.InputSplitsCallable: getInputSplit: Reserved /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/0 from ZooKeeper and got input split 'hdfs://arcus1.silverdale.dev/tmp/louvain-giraph-example/1390935731/input/small:0+172'
14/01/28 11:02:42 INFO worker.InputSplitsCallable: getInputSplit: Reserved /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/1 from ZooKeeper and got input split 'hdfs://arcus1.silverdale.dev/tmp/louvain-giraph-example/1390935731/input/small:0+172'

Have you by any chance accidentally passed in the input file twice?

Rob

From: Eric Kimbrel <lekimb...@gmail.com>
Reply-To: <user@giraph.apache.org>
Date: Wednesday, 29 January 2014 09:08
To: <user@giraph.apache.org>
Subject: duplicate edges created with TextVertexInputFormat

> I am reading in an adjacency list using an input format which extends
> TextVertexInputFormat. My code doesn't do anything to address input splits,
> but leaves that to the underlying giraph implementation. However it appears
> that as the data is being read 2 identical input splits are created and read
> in, resulting in edges for each vertex being created twice.
>
> My input format is a simple adjacency list, where each node is represented by
> a single line of text which lists the node id, and all of its neighbors.
> I read the edges into an edge list and then create the vertex via:
>> Vertex<Text, LouvainNodeState, LongWritable> vertex =
>> this.getConf().createVertex();
>> vertex.initialize(id, state, edgesList);
>
> Logs below show the edges being read in twice (as part of two different input
> splits in the input stage) and then being represented twice per node in the
> computation phase.
> This example is using 1 compute thread and 1 worker.
>
> If I am creating the vertex incorrectly or doing something else wrong please
> let me know. Thanks.
>
> Log snippet of vertex input process.
>
> 14/01/28 11:02:41 INFO worker.BspServiceWorker: loadInputSplits: Using 1 thread(s), originally 1 threads(s) for 2 total splits.
> 14/01/28 11:02:41 INFO worker.InputSplitsHandler: reserveInputSplit: Reserved input split path /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/0, overall roughly 0.0% input splits reserved
> 14/01/28 11:02:41 INFO worker.InputSplitsCallable: getInputSplit: Reserved /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/0 from ZooKeeper and got input split 'hdfs://arcus1.silverdale.dev/tmp/louvain-giraph-example/1390935731/input/small:0+172'
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 2:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 3:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 4:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 5:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 6:1
>
> other nodes processed
>
> 14/01/28 11:02:42 INFO worker.InputSplitsCallable: loadFromInputSplit: Finished loading /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/0 (v=9, e=34)
> 14/01/28 11:02:42 INFO worker.InputSplitsHandler: reserveInputSplit: Reserved input split path /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/1, overall roughly 50.0% input splits reserved
> 14/01/28 11:02:42 INFO worker.InputSplitsCallable: getInputSplit: Reserved /_hadoopBsp/giraph_yarn_application_1390861968364_0029/_vertexInputSplitDir/1 from ZooKeeper and got input split 'hdfs://arcus1.silverdale.dev/tmp/louvain-giraph-example/1390935731/input/small:0+172'
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 2:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 3:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 4:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 5:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexInputFormat: Node 1 added edge 6:1
>
> other nodes processed again
>
> Logs from the compute phase show that edges really are added twice (format
> below shows edge #:target:weight)
> While each node should only have one edge to each other, it instead has two.
>
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: NODE: 1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 1: 2:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 2: 3:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 3: 4:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 4: 5:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 5: 6:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 6: 2:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 7: 3:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 8: 4:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 9: 5:1
> 14/01/28 11:02:42 INFO giraph.LouvainVertexComputation: EDGE 10: 6:1
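
One quick way to confirm the duplicate-split diagnosis from a worker log is to count how often each split description appears. What follows is only a minimal sketch in plain Java and is not part of Giraph: the class name DuplicateSplitFinder and its methods are hypothetical, the sample lines in main are shortened stand-ins rather than real log output, and it assumes each log record sits on a single line containing the text "got input split '...'" as in the getInputSplit lines quoted in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Giraph class: scans worker log lines for
// input split descriptions that were handed out more than once.
public class DuplicateSplitFinder {
    // Assumes the log wording "got input split '<path>:<start>+<length>'".
    private static final Pattern SPLIT =
        Pattern.compile("got input split '([^']+)'");

    /** Counts occurrences of each split description, in first-seen order. */
    public static Map<String, Integer> countSplits(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = SPLIT.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the split descriptions that appear more than once. */
    public static List<String> duplicates(List<String> logLines) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countSplits(logLines).entrySet()) {
            if (e.getValue() > 1) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened, made-up lines in the same shape as the real log records.
        List<String> sample = Arrays.asList(
            "INFO worker.InputSplitsCallable: getInputSplit: Reserved /dir/0 "
                + "from ZooKeeper and got input split 'hdfs://example/input/small:0+172'",
            "INFO worker.InputSplitsCallable: getInputSplit: Reserved /dir/1 "
                + "from ZooKeeper and got input split 'hdfs://example/input/small:0+172'");
        for (String split : duplicates(sample)) {
            System.out.println("Duplicate split: " + split);
        }
    }
}
```

If the same path and byte range is reported twice, as it is here, the likely next thing to check is how the input path is handed to the job, since listing the same file twice would presumably produce one split per listing.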