I am seeing something strange with outerJoinVertices (and triangle counting, which relies on this API):

Here is what I am doing:

1) Created a graph with multiple partitions, i.e. built it with GraphLoader.edgeListFile using minEdgePartitions >= 1, then called partitionBy(PartitionStrategy.RandomVertexCut) on the generated graph. Note: the vertex attribute type is Int in this case.
2) Next, built neighborhood ids by calling collectNeighborIds, so the returned vertex attribute type is Array[VertexId], i.e. a VertexRDD[Array[VertexId]].
3) Joined the vertex ids from step 2 back onto the graph from step 1 via outerJoinVertices.
4) Created a subgraph of the joined graph from step 3, keeping only the edges with ed.srcAttr != -1 && ed.dstAttr != -1, i.e. filtering out vertices with null attributes.
5) Finally, checked the number of edges left in the subgraph from step 4.

I ran this program in a loop, changing minEdgePartitions in each iteration. When minEdgePartitions == 1 I see the correct number of edges. When minEdgePartitions == 2 the result is ~1/2 the number of edges; when minEdgePartitions == 3 the result is ~1/3 the number of edges, and so on. It seems that outerJoinVertices is returning srcAttr (and dstAttr) = null for many attributes, and from the numbers it looks like it might be returning null for vertices residing on other partitions?

Environment: I am using RC5 and 22 executors.

BUT I get the correct number of edges in each iteration when I repeat the experiment keeping the vertex attribute type Int in step 2 (i.e. just the number of neighbors instead of the array of neighbor ids), which is the same as the vertex attribute type of the graph before the join.

Is this a known bug that was fixed recently? Or are we supposed to set some flags when updating the vertex attribute type?

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-0-outerJoinVertices-seems-to-return-null-for-vertex-attributes-when-input-was-partitioned-and-tp6799.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
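For reference, the steps above can be sketched roughly as follows. This is a minimal reconstruction, not the original program: the edge file path, the use of EdgeDirection.Either, and the Array(-1L) sentinel returned by the outerJoinVertices map function are my assumptions about how the -1 check in step 4 is produced.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

object OuterJoinRepro {
  // Returns the edge count of the filtered subgraph for a given partition count.
  def run(sc: SparkContext, minEdgePartitions: Int): Long = {
    // Step 1: load the graph with the requested number of edge partitions
    // and repartition it; vertex attribute type is Int here.
    val graph = GraphLoader
      .edgeListFile(sc, "edges.txt", minEdgePartitions = minEdgePartitions)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    // Step 2: collect neighbor ids; vertex attribute type becomes Array[VertexId].
    val nbrs: VertexRDD[Array[VertexId]] =
      graph.collectNeighborIds(EdgeDirection.Either)

    // Step 3: join the neighbor arrays back onto the graph. Vertices that
    // receive no value from `nbrs` get a sentinel (assumed: Array(-1L)).
    val joined: Graph[Array[VertexId], Int] =
      graph.outerJoinVertices(nbrs) { (_, _, opt) => opt.getOrElse(Array(-1L)) }

    // Step 4: keep only edges whose endpoints received real neighbor arrays.
    val sub = joined.subgraph(epred = ed =>
      !ed.srcAttr.sameElements(Array(-1L)) && !ed.dstAttr.sameElements(Array(-1L)))

    // Step 5: count the surviving edges.
    sub.edges.count()
  }
}
```

With minEdgePartitions == 1 this returns the expected count; with larger values the count drops roughly in proportion, as described above.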