Thanks for all the responses so far! I am starting to understand the system better, but another question came up along the way. Is there a way to inspect the contents of the individual partitions of an RDD? For example, if I had a graph with vertices a, b, c, d that was split into 2 partitions, could I check which vertices belong to partition 1 and which to partition 2?
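
To make the question concrete, here is a rough sketch of the kind of check meant above. It assumes that calling `mapPartitionsWithIndex` on `graph.vertices` is an acceptable way to see what each partition holds; that is an assumption, not a confirmed recommendation. The names `PartitionInspector` and `verticesByPartition`, the local master setting, and the toy graph (a, b, c, d encoded as vertex IDs 1 to 4) are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object PartitionInspector {
  // Hypothetical helper: maps each partition index of graph.vertices
  // to the list of vertex IDs stored in that partition.
  def verticesByPartition[VD, ED](graph: Graph[VD, ED]): Map[Int, List[VertexId]] =
    graph.vertices
      .mapPartitionsWithIndex { (idx, iter) =>
        // Emit exactly one (index, ids) pair per partition, even if it is empty.
        Iterator((idx, iter.map(_._1).toList))
      }
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PartitionInspector").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy graph (assumption): vertices a, b, c, d as IDs 1..4, in 2 partitions.
      val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d")), 2)
      val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)), 2)
      val graph = Graph(vertices, edges)

      verticesByPartition(graph).toSeq.sortBy(_._1).foreach { case (idx, ids) =>
        println(s"partition $idx: ${ids.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Since `graph.edges` is also an RDD, the same pattern should apply there, with `srcId` and `dstId` read from each `Edge` instead of the vertex ID. Note that `collect()` pulls everything to the driver, so this is only sensible for small graphs.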
Thank You,
Matthew Bucci

On Fri, Feb 13, 2015 at 10:58 PM, Ankur Dave <ankurd...@gmail.com> wrote:

> At 2015-02-13 12:19:46 -0800, Matthew Bucci <mrbucci...@gmail.com> wrote:
> > 1) How do you actually run programs in GraphX? At the moment I've been doing
> > everything live through the shell, but I'd obviously like to be able to work
> > on it by writing and running scripts.
>
> You can create your own projects that build against Spark and GraphX
> through a Maven dependency [1], then run those applications using the
> bin/spark-submit script included with Spark [2].
>
> These guides assume you already know how to do this using your preferred
> build tool (SBT or Maven). In short, here's how to do it with SBT:
>
> 1. Install SBT locally (`brew install sbt` on OS X).
>
> 2. Inside your project directory, create a build.sbt file listing Spark
>    and GraphX as a dependency, as in [3].
>
> 3. Run `sbt package` in a shell.
>
> 4. Pass the JAR in your_project_dir/target/scala-2.10/ to bin/spark-submit.
>
> [1] http://spark.apache.org/docs/latest/programming-guide.html#linking-with-spark
> [2] http://spark.apache.org/docs/latest/submitting-applications.html
> [3] https://gist.github.com/ankurdave/1fb7234d8affb3a2e4f4
>
> > 2) Is there a way to check the status of the partitions of a graph? For
> > example, I want to determine for starters if the number of partitions
> > requested are always made, like if I ask for 8 partitions but only have 4
> > cores what happens?
>
> You can look at `graph.vertices` and `graph.edges`, which are both RDDs,
> so you can do for example: graph.vertices.partitions
>
> > 3) Would I be able to partition by vertex instead of edges, even if I had to
> > write it myself? I know partitioning by edges is favored in a majority of
> > the cases, but for the sake of research I'd like to be able to do both.
>
> If you pass PartitionStrategy.EdgePartition1D, this will partition edges
> by their source vertices, so all edges with the same source will be
> co-partitioned, and the communication pattern will be similar to
> vertex-partitioned (edge-cut) systems like Giraph.
>
> > 4) Is there a better way to time processes outside of using built-in unix
> > timing through the logs or something?
>
> I think the options are Unix timing, log file timestamp parsing, looking
> at the web UI, or writing timing code within your program
> (System.currentTimeMillis and System.nanoTime).
>
> Ankur
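
On point 4, the in-program timing option can be sketched roughly as follows. This is a minimal illustration built only on `System.nanoTime`; the object name `Timing` and the helper `timed` are hypothetical and not part of Spark or GraphX.

```scala
object Timing {
  // Hypothetical helper: runs `body` once and returns its result together
  // with the elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in a Spark program this would wrap an action.
    val (sum, ms) = timed { (1L to 1000000L).sum }
    println(f"sum=$sum took $ms%.2f ms")
  }
}
```

Because Spark transformations are lazy, the timed block needs to contain an action (for example `count()`) for the measurement to include the actual computation. A single measurement also includes JVM warm-up, so repeated runs are likely to give more stable numbers.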