Re: GraphX: .edges.distinct().count() is 10?

2014-04-23 Thread Daniel Darabos
This is caused by https://issues.apache.org/jira/browse/SPARK-1188. I think
the fix will be in the next release. Until then, copy each edge before calling
distinct:

g.edges.map(_.copy()).distinct.count
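
To illustrate why the copy helps, here is a minimal sketch that needs no Spark at all. It assumes, as I understand SPARK-1188, that the edge iterator hands out one reused mutable object instead of a fresh Edge per element. ReuseDemo, SketchEdge and reusingIterator are made-up names for this example, not GraphX API.

```scala
object ReuseDemo {
  // Hypothetical stand-in for GraphX's Edge (a case class with mutable fields).
  case class SketchEdge(var srcId: Long, var dstId: Long, var attr: Int)

  // Assumed behaviour: the iterator mutates and returns the SAME object
  // on every call to next(), instead of allocating a new one.
  def reusingIterator(pairs: Seq[(Long, Long)]): Iterator[SketchEdge] = {
    val shared = SketchEdge(0L, 0L, 0)
    pairs.iterator.map { case (src, dst) =>
      shared.srcId = src
      shared.dstId = dst
      shared.attr = 1
      shared
    }
  }

  val pairs: Seq[(Long, Long)] =
    Seq((394365859L, 136153151L), (589404147L, 1361045425L))

  // Buffering the raw references stores the same object twice, so both
  // entries show the last edge and distinct collapses them into one.
  val naiveDistinct: Int = reusingIterator(pairs).toList.distinct.size

  // Copying each element first takes an independent snapshot per edge.
  val copiedDistinct: Int =
    reusingIterator(pairs).map(_.copy()).toList.distinct.size

  def main(args: Array[String]): Unit = {
    println(naiveDistinct)  // prints 1
    println(copiedDistinct) // prints 2
  }
}
```

Anything that buffers elements (a shuffle, as in distinct, or a cache) can hit the same effect, which is why the copy has to come before distinct.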



On Wed, Apr 23, 2014 at 2:26 AM, Ryan Compton compton.r...@gmail.com wrote:

 Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt

 This code:

 println(g.numEdges)
 println(g.numVertices)
 println(g.edges.distinct().count())

 gave me

 1
 9294
 2



 On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave ankurd...@gmail.com wrote:
  I wasn't able to reproduce this with a small test file, but I did change the
  file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to take
  the third column rather than the second?
 
  If so, would you mind posting a larger sample of the file, or even the whole
  file if possible?
 
  Here's the test that succeeded:
 
  test("graph.edges.distinct.count") {
    withSpark { sc =>
      val edgeFullStrRDD: RDD[String] = sc.parallelize(List(
        "394365859\t136153151", "589404147\t1361045425"))
      val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
        .map(x => (x(0).toLong, x(1).toLong))
      val g = Graph.fromEdgeTuples(edgeTupRDD, defaultValue = 123,
        uniqueEdges = Option(CanonicalRandomVertexCut))
      assert(edgeTupRDD.distinct.count() === 2)
      assert(g.numEdges === 2)
      assert(g.edges.distinct.count() === 2)
    }
  }
 
  Ankur



GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
I am trying to read an edge list into a Graph. My data looks like

394365859 -- 136153151
589404147 -- 1361045425

I read it into a Graph via:

val edgeFullStrRDD: RDD[String] = sc.textFile(unidirFName)
val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
  .map(x => (x(0).toLong, x(2).toLong))
val g = Graph.fromEdgeTuples(edgeTupRDD, defaultValue = 123,
  uniqueEdges = Option(CanonicalRandomVertexCut))

Now, edgeTupRDD.distinct().count() tells me I have 240086 distinct
lines in the file, and g.numEdges tells me they were combined into 240096
weighted edges (which is really weird, since that is more edges than there
are distinct lines in the RDD), but g.edges.distinct().count() tells me I
have 10. Why?


Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ankur Dave
I wasn't able to reproduce this with a small test file, but I did change
the file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to
take the third column rather than the second?
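
For illustration only (this is not part of the test below): a small sketch of how the column index interacts with the number of tab-separated fields. The three-column sample line is a guess, since the real file layout is not shown in this thread, and ColumnIndexDemo and parseEdge are hypothetical names.

```scala
import scala.util.Try

object ColumnIndexDemo {
  // Hypothetical sample lines; the actual file may look different.
  val twoColumns = "394365859\t136153151"
  val threeColumns = "394365859\t--\t136153151"

  // Take the source from column 0 and the destination from column dstCol.
  // Returns None if the column is missing or is not a valid Long.
  def parseEdge(line: String, dstCol: Int): Option[(Long, Long)] = {
    val cols = line.split("\t")
    if (dstCol < cols.length) Try((cols(0).toLong, cols(dstCol).toLong)).toOption
    else None
  }

  def main(args: Array[String]): Unit = {
    println(parseEdge(twoColumns, 1))   // Some((394365859,136153151))
    println(parseEdge(twoColumns, 2))   // None: there is no third column
    println(parseEdge(threeColumns, 2)) // Some((394365859,136153151))
    println(parseEdge(threeColumns, 1)) // None: "--" is not a number
  }
}
```

With the unguarded x(2).toLong from the original snippet, a two-column line would throw an ArrayIndexOutOfBoundsException instead of returning None.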

If so, would you mind posting a larger sample of the file, or even the
whole file if possible?

Here's the test that succeeded:

  test("graph.edges.distinct.count") {
    withSpark { sc =>
      val edgeFullStrRDD: RDD[String] = sc.parallelize(List(
        "394365859\t136153151", "589404147\t1361045425"))
      val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
        .map(x => (x(0).toLong, x(1).toLong))
      val g = Graph.fromEdgeTuples(edgeTupRDD, defaultValue = 123,
        uniqueEdges = Option(CanonicalRandomVertexCut))
      assert(edgeTupRDD.distinct.count() === 2)
      assert(g.numEdges === 2)
      assert(g.edges.distinct.count() === 2)
    }
  }

Ankur http://www.ankurdave.com/


Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt

This code:

println(g.numEdges)
println(g.numVertices)
println(g.edges.distinct().count())

gave me

1
9294
2



On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave ankurd...@gmail.com wrote:
 I wasn't able to reproduce this with a small test file, but I did change the
 file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to take
 the third column rather than the second?

 If so, would you mind posting a larger sample of the file, or even the whole
 file if possible?

 Here's the test that succeeded:

   test("graph.edges.distinct.count") {
     withSpark { sc =>
       val edgeFullStrRDD: RDD[String] = sc.parallelize(List(
         "394365859\t136153151", "589404147\t1361045425"))
       val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
         .map(x => (x(0).toLong, x(1).toLong))
       val g = Graph.fromEdgeTuples(edgeTupRDD, defaultValue = 123,
         uniqueEdges = Option(CanonicalRandomVertexCut))
       assert(edgeTupRDD.distinct.count() === 2)
       assert(g.numEdges === 2)
       assert(g.edges.distinct.count() === 2)
     }
   }

 Ankur