Hi,

Vertices are simply hash-partitioned by Spark's HashPartitioner, so
you can easily calculate the partition IDs yourself.
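
For instance, a minimal sketch of computing a partition ID by hand
(assuming a `graph` already built in the shell; the vertex ID 1L is just
a placeholder):

import org.apache.spark.HashPartitioner
import org.apache.spark.graphx._

// HashPartitioner maps a key to a non-negative key.hashCode % numPartitions
val numPartitions = graph.vertices.partitions.length
val partitioner = new HashPartitioner(numPartitions)
val vid: VertexId = 1L
println(s"Vertex $vid maps to partition ${partitioner.getPartition(vid)}")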

Also, you can run the following lines (e.g., in the spark-shell) to
check the IDs:

import org.apache.spark.graphx._

graph.vertices.mapPartitionsWithIndex { (pid, iter) =>
  // collect the vertex IDs held by this partition
  val vids = Array.newBuilder[VertexId]
  for ((vid, _) <- iter) vids += vid
  Iterator((pid, vids.result()))
}
.map { case (pid, vids) => s"PID:$pid IDs:${vids.mkString(",")}" }
.collect()
.foreach(println)

On Thu, Feb 19, 2015 at 12:31 AM, Matthew Bucci <mrbucci...@gmail.com>
wrote:

> Thanks for all the responses so far! I have started to understand the
> system more, but I just had another question while I was going along. Is
> there a way to check the individual partitions of an RDD? For example, if I
> had a graph with vertices a, b, c, d and it was split into 2 partitions,
> could I check which vertices belonged to partition 1 and partition 2?
>
> Thank You,
> Matthew Bucci
>
> On Fri, Feb 13, 2015 at 10:58 PM, Ankur Dave <ankurd...@gmail.com> wrote:
>
>> At 2015-02-13 12:19:46 -0800, Matthew Bucci <mrbucci...@gmail.com> wrote:
>> > 1) How do you actually run programs in GraphX? At the moment I've been
>> > doing everything live through the shell, but I'd obviously like to be
>> > able to work on it by writing and running scripts.
>>
>> You can create your own projects that build against Spark and GraphX
>> through a Maven dependency [1], then run those applications using the
>> bin/spark-submit script included with Spark [2].
>>
>> These guides assume you already know how to do this using your preferred
>> build tool (SBT or Maven). In short, here's how to do it with SBT:
>>
>> 1. Install SBT locally (`brew install sbt` on OS X).
>>
>> 2. Inside your project directory, create a build.sbt file listing Spark
>> and GraphX as dependencies, as in [3] (see also the sketch below).
>>
>> 3. Run `sbt package` in a shell.
>>
>> 4. Pass the JAR in your_project_dir/target/scala-2.10/ to
>> bin/spark-submit.
>>
>> [1]
>> http://spark.apache.org/docs/latest/programming-guide.html#linking-with-spark
>> [2] http://spark.apache.org/docs/latest/submitting-applications.html
>> [3] https://gist.github.com/ankurdave/1fb7234d8affb3a2e4f4
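>>
>> For step 2, a minimal build.sbt might look like this (a sketch only;
>> the version numbers are illustrative, so match them to your Spark
>> release, and [3] has a concrete example):
>>
>> name := "my-graphx-app"
>>
>> version := "0.1"
>>
>> scalaVersion := "2.10.4"
>>
>> libraryDependencies ++= Seq(
>>   "org.apache.spark" %% "spark-core"   % "1.2.1" % "provided",
>>   "org.apache.spark" %% "spark-graphx" % "1.2.1" % "provided"
>> )
>>
>> And for step 4 (the class and JAR names here are placeholders):
>>
>> bin/spark-submit --class com.example.Main \
>>   target/scala-2.10/my-graphx-app_2.10-0.1.jar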
>>
>> > 2) Is there a way to check the status of the partitions of a graph? For
>> > example, I want to determine for starters if the number of partitions
>> > requested are always made, like if I ask for 8 partitions but only have
>> > 4 cores what happens?
>>
>> You can look at `graph.vertices` and `graph.edges`, which are both RDDs,
>> so you can do, for example: graph.vertices.partitions
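>>
>> For instance (assuming `graph` is in scope in the shell):
>>
>> graph.vertices.partitions.length  // vertex partitions actually created
>> graph.edges.partitions.length     // edge partitions actually created
>>
>> Note that the number of partitions is independent of the number of
>> cores: with 8 partitions and 4 cores, all 8 partitions still exist, and
>> the scheduler just runs their tasks 4 at a time.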
>>
>> > 3) Would I be able to partition by vertex instead of edges, even if I
>> > had to write it myself? I know partitioning by edges is favored in a
>> > majority of the cases, but for the sake of research I'd like to be able
>> > to do both.
>>
>> If you pass PartitionStrategy.EdgePartition1D, this will partition edges
>> by their source vertices, so all edges with the same source will be
>> co-partitioned, and the communication pattern will be similar to that of
>> vertex-partitioned (edge-cut) systems like Giraph.
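>>
>> For example (again assuming `graph` is in scope):
>>
>> // re-partition edges by source vertex ID
>> val bySource = graph.partitionBy(PartitionStrategy.EdgePartition1D)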
>>
>> > 4) Is there a better way to time processes outside of using built-in
>> > unix timing through the logs or something?
>>
>> I think the options are Unix timing, log file timestamp parsing, looking
>> at the web UI, or writing timing code within your program
>> (System.currentTimeMillis and System.nanoTime).
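>>
>> A minimal in-program sketch (connectedComponents is just an example
>> job; any action works):
>>
>> val start = System.nanoTime()
>> graph.connectedComponents().vertices.count()  // action forces the job
>> val elapsedMs = (System.nanoTime() - start) / 1e6
>> println(s"Took $elapsedMs ms")
>>
>> Since transformations are lazy, make sure an action (count, collect,
>> etc.) falls inside the timed region.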
>>
>> Ankur
>>
>
>


-- 
---
Takeshi Yamamuro
