Hi,

Vertices are simply hash-partitioned by spark.HashPartitioner, so you can
easily calculate the partition ids yourself.
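For example, here is a minimal sketch of computing the partition id for a
given vertex id, assuming a graph already built as `graph` and the default
hash partitioning described above:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.graphx._

    // Recreate the partitioner the vertex RDD uses by default and ask it
    // where a particular vertex id lands.
    val numPartitions = graph.vertices.partitions.size
    val partitioner = new HashPartitioner(numPartitions)
    val vid: VertexId = 1L
    println(s"vertex $vid -> partition ${partitioner.getPartition(vid)}")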
Also, you can run the following lines to check which ids are in each partition:

import org.apache.spark.graphx._

// List the vertex ids held by each partition
graph.vertices.mapPartitionsWithIndex { (pid, iter) =>
  val vids = Array.newBuilder[VertexId]
  for (d <- iter) vids += d._1
  Iterator((pid, vids.result))
}
.map(d => s"PID:${d._1} IDs:${d._2.toSeq.toString}")
.collect
.foreach(println)

On Thu, Feb 19, 2015 at 12:31 AM, Matthew Bucci <mrbucci...@gmail.com> wrote:

> Thanks for all the responses so far! I have started to understand the
> system more, but I just had another question while I was going along. Is
> there a way to check the individual partitions of an RDD? For example, if I
> had a graph with vertices a, b, c, d and it was split into 2 partitions,
> could I check which vertices belonged in partition 1 and partition 2?
>
> Thank You,
> Matthew Bucci
>
> On Fri, Feb 13, 2015 at 10:58 PM, Ankur Dave <ankurd...@gmail.com> wrote:
>
>> At 2015-02-13 12:19:46 -0800, Matthew Bucci <mrbucci...@gmail.com> wrote:
>> > 1) How do you actually run programs in GraphX? At the moment I've been
>> > doing everything live through the shell, but I'd obviously like to be
>> > able to work on it by writing and running scripts.
>>
>> You can create your own projects that build against Spark and GraphX
>> through a Maven dependency [1], then run those applications using the
>> bin/spark-submit script included with Spark [2].
>>
>> These guides assume you already know how to do this using your preferred
>> build tool (SBT or Maven). In short, here's how to do it with SBT:
>>
>> 1. Install SBT locally (`brew install sbt` on OS X).
>>
>> 2. Inside your project directory, create a build.sbt file listing Spark
>> and GraphX as dependencies, as in [3].
>>
>> 3. Run `sbt package` in a shell.
>>
>> 4. Pass the JAR in your_project_dir/target/scala-2.10/ to
>> bin/spark-submit.
>>
>> [1] http://spark.apache.org/docs/latest/programming-guide.html#linking-with-spark
>> [2] http://spark.apache.org/docs/latest/submitting-applications.html
>> [3] https://gist.github.com/ankurdave/1fb7234d8affb3a2e4f4
>>
>> > 2) Is there a way to check the status of the partitions of a graph? For
>> > example, I want to determine for starters if the number of partitions
>> > requested are always made, like if I ask for 8 partitions but only have
>> > 4 cores what happens?
>>
>> You can look at `graph.vertices` and `graph.edges`, which are both RDDs,
>> so you can do, for example: graph.vertices.partitions
>>
>> > 3) Would I be able to partition by vertex instead of edges, even if I
>> > had to write it myself? I know partitioning by edges is favored in a
>> > majority of the cases, but for the sake of research I'd like to be able
>> > to do both.
>>
>> If you pass PartitionStrategy.EdgePartition1D, this will partition edges
>> by their source vertices, so all edges with the same source will be
>> co-partitioned, and the communication pattern will be similar to
>> vertex-partitioned (edge-cut) systems like Giraph.
>>
>> > 4) Is there a better way to time processes outside of using built-in
>> > unix timing through the logs or something?
>>
>> I think the options are Unix timing, log file timestamp parsing, looking
>> at the web UI, or writing timing code within your program
>> (System.currentTimeMillis and System.nanoTime).
>>
>> Ankur
>

--
---
Takeshi Yamamuro
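
To illustrate step 2 of the SBT instructions quoted above, a build.sbt along
these lines should work; the project name and the Spark/Scala versions below
are placeholders rather than values taken from the thread:

    // build.sbt -- minimal sketch; adjust the Spark version to your cluster
    name := "graphx-example"
    version := "0.1"
    scalaVersion := "2.10.4"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"   % "1.2.1" % "provided",
      "org.apache.spark" %% "spark-graphx" % "1.2.1" % "provided"
    )

With that in place, `sbt package` writes the JAR under target/scala-2.10/,
which you can then pass to bin/spark-submit.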
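
On question 3, a minimal sketch of applying EdgePartition1D, again assuming a
graph already loaded as `graph`:

    import org.apache.spark.graphx._

    // Repartition edges so that all edges sharing a source vertex are
    // co-located, approximating an edge-cut (vertex-partitioned) layout.
    val bySource = graph.partitionBy(PartitionStrategy.EdgePartition1D)
    println(s"edge partitions: ${bySource.edges.partitions.size}")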
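
On question 4, in-program timing can be as simple as a small helper like the
one below; `time` here is just an illustrative function, not part of the
Spark API:

    // Measure wall-clock time of a block with System.nanoTime.
    def time[A](label: String)(body: => A): A = {
      val start = System.nanoTime()
      val result = body
      println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    // Example: time("vertex count") { graph.vertices.count() }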