Re: GraphX: .edges.distinct().count() is 10?
This is caused by https://issues.apache.org/jira/browse/SPARK-1188. I think the fix will be in the next release. Until then, use:

    g.edges.map(_.copy()).distinct.count

On Wed, Apr 23, 2014 at 2:26 AM, Ryan Compton compton.r...@gmail.com wrote:
> Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt
>
> This code:
>
>     println(g.numEdges)
>     println(g.numVertices)
>     println(g.edges.distinct().count())
>
> gave me
>
>     1
>     9294
>     2
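To illustrate why copying each edge before calling distinct can change the count, here is a minimal sketch that does not use Spark at all. It assumes the underlying problem is an iterator that hands out one reused, mutable object, which is what the copy() workaround suggests. MutableEdge, reusingIterator and EdgeReuseDemo are hypothetical names made up for this sketch and are not part of GraphX.

```scala
// Hypothetical stand-in for an edge object; this is NOT the GraphX Edge class.
final case class MutableEdge(var srcId: Long, var dstId: Long)

object EdgeReuseDemo {
  // Returns an iterator that yields the SAME MutableEdge instance every time,
  // overwriting its fields in place (the assumed object-reuse behaviour).
  def reusingIterator(pairs: Seq[(Long, Long)]): Iterator[MutableEdge] = {
    val shared = MutableEdge(0L, 0L)
    pairs.iterator.map { case (s, d) =>
      shared.srcId = s
      shared.dstId = d
      shared
    }
  }

  val pairs: Seq[(Long, Long)] = Seq((1L, 2L), (3L, 4L), (5L, 6L))

  // Buffering without copying stores three references to one object,
  // so all buffered elements compare equal.
  val withoutCopy: Int = reusingIterator(pairs).toVector.distinct.size

  // Copying each element as it is produced snapshots its current values.
  val withCopy: Int = reusingIterator(pairs).map(_.copy()).toVector.distinct.size

  def main(args: Array[String]): Unit = {
    println(s"distinct without copy: $withoutCopy")
    println(s"distinct with copy:    $withCopy")
  }
}
```

In this sketch the uncopied version collapses to a single distinct element, while the copied version keeps all three. Whether GraphX behaves exactly this way internally is an assumption here; the JIRA issue is the authoritative description.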
GraphX: .edges.distinct().count() is 10?
I am trying to read an edge list into a Graph. My data looks like

    394365859 -- 136153151
    589404147 -- 1361045425

I read it into a Graph via:

    val edgeFullStrRDD: RDD[String] = sc.textFile(unidirFName)
    val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
      .map(x => (x(0).toLong, x(2).toLong))
    val g = Graph.fromEdgeTuples(edgeTupRDD, defaultValue = 123,
      uniqueEdges = Option(CanonicalRandomVertexCut))

Now, edgeTupRDD.distinct().count() tells me I have 240086 distinct lines in the file, g.numEdges tells me they combined into 240096 weighted edges (which is really weird, since that's more edges than there are lines in the RDD), but g.edges.distinct().count() tells me I have 10. Why?
Re: GraphX: .edges.distinct().count() is 10?
I wasn't able to reproduce this with a small test file, but I did change the file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to take the third column rather than the second? If so, would you mind posting a larger sample of the file, or even the whole file if possible?

Here's the test that succeeded:

    test("graph.edges.distinct.count") {
      withSpark { sc =>
        val edgeFullStrRDD: RDD[String] = sc.parallelize(List(
          "394365859\t136153151", "589404147\t1361045425"))
        val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
          .map(x => (x(0).toLong, x(1).toLong))
        val g = Graph.fromEdgeTuples(edgeTupRDD, defaultValue = 123,
          uniqueEdges = Option(CanonicalRandomVertexCut))
        assert(edgeTupRDD.distinct.count() === 2)
        assert(g.numEdges === 2)
        assert(g.edges.distinct.count() === 2)
      }
    }

Ankur
http://www.ankurdave.com/
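The column question above can be made concrete with a small, Spark-free sketch. It assumes the original file is tab-delimited with a literal "--" as its middle column, which is only a guess from the sample lines in the question. EdgeLineParsing and parseEdge are hypothetical names used for illustration.

```scala
object EdgeLineParsing {
  // ASSUMED layout of the original file: source id, "--", target id (tab-delimited).
  val threeColumn: String = "394365859\t--\t136153151"
  // Layout used in the small test: source id and target id only.
  val twoColumn: String = "394365859\t136153151"

  // Split a line on tabs and parse the chosen columns as vertex ids.
  def parseEdge(line: String, srcCol: Int, dstCol: Int): (Long, Long) = {
    val cols = line.split("\t")
    (cols(srcCol).toLong, cols(dstCol).toLong)
  }

  def main(args: Array[String]): Unit = {
    println(parseEdge(threeColumn, 0, 2)) // third column holds the target id
    println(parseEdge(twoColumn, 0, 1))   // second column holds the target id
  }
}
```

Under that assumption, x(2) is the right index for the three-column file and x(1) for the two-column test data. Using index 1 on the three-column line would fail with a NumberFormatException, because "--" is not a number.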
Re: GraphX: .edges.distinct().count() is 10?
Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt

This code:

    println(g.numEdges)
    println(g.numVertices)
    println(g.edges.distinct().count())

gave me

    1
    9294
    2

On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave ankurd...@gmail.com wrote:
> I wasn't able to reproduce this with a small test file, but I did change the file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to take the third column rather than the second? If so, would you mind posting a larger sample of the file, or even the whole file if possible?