Re: how to create a Graph in GraphX?
You should be able to construct the edges in a single map() call without using collect(). Note that GraphX vertex IDs must be Longs, so the first two fields need to be parsed:

    val edges: RDD[Edge[String]] = sc.textFile(...).map { line =>
      val row = line.split(",")
      Edge(row(0).toLong, row(1).toLong, row(2))
    }
    val graph: Graph[Int, String] = Graph.fromEdges(edges, defaultValue = 1)

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-create-a-Graph-in-GraphX-tp18635p18646.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: counting degrees graphx
Sorry, I missed vertex 6 in that example. It should be [{1}, {1}, {1}, {1}, {1, 6}, {6}, {7}, {7}].
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/counting-degrees-graphx-tp6370p6378.html
Re: Caching in graphX
Unfortunately, it's very difficult to get uncaching right in GraphX because of the complicated internal dependency structure it creates. You need to know exactly which operations you're performing on the graph in order to unpersist correctly (i.e., in a way that avoids recomputation). I have a pull request (https://github.com/apache/spark/pull/497) that may make this a bit easier, but your best option is to use the Pregel API for iterative algorithms if possible. If that's not possible, leaving things cached has actually not been very costly in my experience, at least as long as VD and ED (the vertex and edge attribute types) are primitive types, which reduces the load on the garbage collector.
Ankur
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Caching-in-graphX-tp5482p5514.html
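For reference, here is a minimal sketch of the Pregel API computing single-source shortest paths, since Pregel manages caching and uncaching of the intermediate graphs internally. The input graph (with Double edge weights) and sourceId are assumptions for illustration:

```scala
import org.apache.spark.graphx._

// Hypothetical input: a graph whose edge attributes are Double weights.
val sourceId: VertexId = 1L
// Initialize distances: 0 for the source, infinity everywhere else.
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  // Vertex program: keep the smaller of the current and incoming distance.
  (id, dist, newDist) => math.min(dist, newDist),
  // Send message: propagate a shorter path to the destination if one exists.
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty,
  // Merge messages: take the minimum of competing candidate distances.
  (a, b) => math.min(a, b)
)
```

Because Pregel controls the iteration loop, it can unpersist each superstep's intermediate graph once the next one is materialized, which is exactly the bookkeeping that's hard to do by hand.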
Re: Variables outside of mapPartitions scope
In general, you can find out exactly what's not serializable by adding -Dsun.io.serialization.extendedDebugInfo=true to SPARK_JAVA_OPTS. Since a reference to the enclosing class (via this) is often what's causing the problem, a general workaround is to move the mapPartitions call to a static method where there is no this reference. This transforms this:

    class A {
      def f() = rdd.mapPartitions(iter => ...)
    }

into this:

    class A {
      def f() = A.helper(rdd)
    }
    object A {
      def helper(rdd: RDD[...]) = rdd.mapPartitions(iter => ...)
    }

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Variables-outside-of-mapPartitions-scope-tp5517p5527.html
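One way to set that flag is through the environment before launching the job; this assumes SPARK_JAVA_OPTS is picked up by your launch scripts, as described above:

```shell
# Enable detailed serialization traces for the Spark JVMs.
export SPARK_JAVA_OPTS="-Dsun.io.serialization.extendedDebugInfo=true"
# Confirm the option is set before launching the job.
echo "$SPARK_JAVA_OPTS"
```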
Re: Is there any problem on the spark mailing list?
I haven't been getting mail either. This was the last message I received: http://apache-spark-user-list.1001560.n3.nabble.com/master-attempted-to-re-register-the-worker-and-then-took-all-workers-as-unregistered-tp553p5491.html
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-problem-on-the-spark-mailing-list-tp5509p5515.html
Re: There is an error in Graphx
This problem occurs because graph.triplets generates an iterator that reuses the same EdgeTriplet object for every triplet in the partition. The workaround is to force a copy using graph.triplets.map(_.copy()). The solution in the AMPCamp tutorial is mistaken -- I'm not sure if that ever worked.
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/There-is-an-error-in-Graphx-tp1575p2836.html
Re: There is an error in Graphx
> The workaround is to force a copy using graph.triplets.map(_.copy()).

Sorry, this actually won't copy the entire triplet, only the attributes defined in Edge. The right workaround is to copy the EdgeTriplet explicitly:

    graph.triplets.map { et =>
      val et2 = new EdgeTriplet[VD, ED] // Replace VD and ED with the correct types
      et2.srcId = et.srcId
      et2.dstId = et.dstId
      et2.attr = et.attr
      et2.srcAttr = et.srcAttr
      et2.dstAttr = et.dstAttr
      et2
    }

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/There-is-an-error-in-Graphx-tp1575p2837.html
Re: sample data for pagerank?
The examples in graphx/data are meant to show the input data format, but if you want to play around with larger and more interesting datasets, we've been using the following ones, among others:
- SNAP's web-Google dataset (5M edges): https://snap.stanford.edu/data/web-Google.html
- SNAP's soc-LiveJournal1 dataset (69M edges): https://snap.stanford.edu/data/soc-LiveJournal1.html
These come in edge list format and, after decompression, can be loaded directly using GraphLoader.
Ankur
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sample-data-for-pagerank-tp2655p2839.html
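For example, here is a minimal sketch of loading one of these edge lists and running PageRank on it. The file path is a placeholder and assumes the dataset has already been downloaded and decompressed:

```scala
import org.apache.spark.graphx.GraphLoader

// Load the decompressed SNAP edge list (path is hypothetical).
// GraphLoader.edgeListFile skips the '#' comment lines at the top of SNAP files.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/web-Google.txt")

// Run PageRank until the per-vertex change drops below the tolerance.
val ranks = graph.pageRank(0.001).vertices

// Print the five highest-ranked vertex IDs with their scores.
ranks.sortBy(-_._2).take(5).foreach(println)
```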