The latest version of IndexedRDD supports any key type with a defined serializer
(https://github.com/amplab/spark-indexedrdd/blob/master/src/main/scala/edu/berkeley/cs/amplab/spark/indexedrdd/KeySerializer.scala),
including Strings. It's not released yet, but you can use it from the
master branch if
If you know the partition IDs, you can launch a job that runs tasks on only
those partitions by calling sc.runJob
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1686).
For example, we do this in IndexedRDD
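A minimal sketch of the idea (the RDD element type and partition ids here are assumptions for illustration, and the exact runJob signature varies slightly across Spark versions):

val targetPartitions = Seq(0, 3)  // partitions known to hold the keys of interest
val results: Array[Array[(Long, String)]] = sc.runJob(
  rdd,
  (iter: Iterator[(Long, String)]) => iter.toArray,  // runs only on targetPartitions
  targetPartitions)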
I'm the primary author of IndexedRDD. To answer your questions:
1. Operations on an IndexedRDD partition can only be performed from a task
operating on that partition, since doing otherwise would require
decentralized coordination between workers, which is difficult in Spark. If
you want to
Actually, GraphX doesn't need to scan all the edges, because it
maintains a clustered index on the source vertex id (that is, it sorts
the edges by source vertex id and stores the offsets in a hash table).
If the activeDirection is appropriately set, it can then jump only to
the clusters with
We thought it would be better to simplify the interface, since the
active set is a performance optimization but the result is identical
to calling subgraph before aggregateMessages.
The active set option is still there in the package-private method
aggregateMessagesWithActiveSet. You can actually
This might be because partitions are getting dropped from memory and
needing to be recomputed. How much memory is in the cluster, and how large
are the partitions? This information should be in the Executors and Storage
pages in the web UI.
Ankur http://www.ankurdave.com/
On Tue, Mar 24, 2015 at
At 2015-02-13 12:19:46 -0800, Matthew Bucci mrbucci...@gmail.com wrote:
1) How do you actually run programs in GraphX? At the moment I've been doing
everything live through the shell, but I'd obviously like to be able to work
on it by writing and running scripts.
You can create your own
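A minimal sketch of a standalone GraphX program (the object name and argument convention are mine, not from the thread); compile it into a jar and run it with spark-submit:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object MyGraphApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MyGraphApp"))
    val graph = GraphLoader.edgeListFile(sc, args(0))  // path to an edge list file
    println(graph.vertices.count())
    sc.stop()
  }
}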
Thanks for the reminder. I just created a PR:
https://github.com/apache/spark/pull/4273
Ankur
On Thu, Jan 29, 2015 at 7:25 AM, Jay Hutfles jayhutf...@gmail.com wrote:
Just curious, is this set to be merged at some point?
You can do this using leftJoin, as collectNeighbors [1] does:
graph.vertices.leftJoin(graph.inDegrees) {
(vid, attr, inDegOpt) => inDegOpt.getOrElse(0)
}
[1]
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala#L145
Ankur
On Sun, Jan 25,
At 2015-01-22 02:06:37 -0800, NicolasC nicolas.ch...@inria.fr wrote:
I am trying to execute a simple program that runs the ShortestPaths algorithm
(org.apache.spark.graphx.lib.ShortestPaths) on a small grid graph.
I use Spark 1.2.0 downloaded from spark.apache.org.
This program runs more than 2
[-dev]
What size of graph are you hoping to run this on? For small graphs where
materializing the all-pairs shortest path is an option, you could simply
find the APSP using https://github.com/apache/spark/pull/3619 and then take
the average distance (apsp.map(_._2.toDouble).mean).
Ankur
At 2014-12-08 12:12:16 -0800, spr s...@yarcdata.com wrote:
OK, I have waded into implementing this and have gotten pretty far, but am now
hitting something I don't understand: a NoSuchMethodError.
[...]
The (short) traceback looks like
Exception in thread main java.lang.NoSuchMethodError:
At 2014-12-05 02:26:52 -0800, Yifan LI iamyifa...@gmail.com wrote:
I have a graph where each vertex keeps several messages for some faraway
neighbours (I mean, not only immediate neighbours; at most k hops away, e.g.
k = 5).
Now, I propose to distribute these messages to their
At 2014-12-04 02:08:45 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote:
I have a graph and I want to create RDDs equal in number to the nodes in
the graph. How can I do that?
If I have 10 nodes then I want to create 10 rdds. Is that possible in
GraphX?
This is possible: you can collect
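A sketch of what this could look like, assuming the graph is small enough to collect its vertex ids on the driver (one RDD per vertex is rarely a good idea at scale):

val ids = graph.vertices.map(_._1).collect()       // vertex ids on the driver
val rdds = ids.map(id => sc.parallelize(Seq(id)))  // one tiny RDD per vertex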
There's no built-in support for doing this, so the best option is to copy and
modify Pregel to check the accumulator at the end of each iteration. This is
robust and shouldn't be too hard, since the Pregel code is short and only uses
public GraphX APIs.
Ankur
At 2014-12-03 09:37:01 -0800, Jay
At 2014-12-04 16:26:50 -0800, spr s...@yarcdata.com wrote:
I'm also looking at how to represent literals as vertex properties. It seems
one way to do this is via positional convention in an Array/Tuple/List that is
the VD; i.e., to represent height, weight, and eyeColor, the VD could be a
At 2014-12-03 02:13:49 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote:
We cannot do sc.parallelize(List(VertexRDD)), can we?
There's no need to do this, because every VertexRDD is also a pair RDD:
class VertexRDD[VD] extends RDD[(VertexId, VD)]
You can simply use graph.vertices in
At 2014-12-02 22:01:20 -0800, Deep Pradhan pradhandeep1...@gmail.com wrote:
I have a graph which returns the following on doing graph.vertices
(1, 1.0)
(2, 1.0)
(3, 2.0)
(4, 2.0)
(5, 0.0)
I want to group all the vertices with the same attribute together, like into
one RDD or something. I
To get that function in scope you have to import
org.apache.spark.SparkContext._
Ankur
On Wednesday, December 3, 2014, Deep Pradhan pradhandeep1...@gmail.com
wrote:
But groupByKey() gives me the error saying that it is not a member of
org.apache.spark.rdd.RDD[(Double,
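A sketch of the fix, grouping vertex ids by attribute value (the import is only needed on pre-1.3 Spark, where the pair-RDD implicits live in SparkContext):

import org.apache.spark.SparkContext._
val grouped = graph.vertices.map { case (id, attr) => (attr, id) }.groupByKey()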
At 2014-11-26 05:25:10 -0800, Hlib Mykhailenko hlib.mykhaile...@inria.fr
wrote:
I work with GraphX. When I call graph.partitionBy(..) nothing happens because,
as I understand it, all transformations are lazy and partitionBy is built using
transformations.
Is there a way to force Spark
At 2014-11-11 01:51:43 +, Buttler, David buttl...@llnl.gov wrote:
I am building a graph from a large CSV file. Each record contains a couple
of nodes and about 10 edges. When I try to load a large portion of the
graph, using multiple partitions, I get inconsistent results in the number
At 2014-11-13 21:28:52 +, Ommen, Jurgen omme0...@stthomas.edu wrote:
I'm using GraphX and playing around with its PageRank algorithm. However, I
can't see from the documentation how to use edge weights when running PageRank.
Is it possible to take edge weights into account, and how would I do it?
At 2014-11-15 18:01:22 -0700, tom85 tom.manha...@gmail.com wrote:
This line: val newPR = oldPR + (1.0 - resetProb) * msgSum
makes no sense to me. Should it not be:
val newPR = resetProb/graph.vertices.count() + (1.0 - resetProb) * msgSum
?
This is an unusual version of PageRank where the
At 2014-11-17 14:47:50 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
I was going through the graphx section in the Spark API in
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$
Here, I find the word landmark. Can anyone explain to me
At 2014-11-18 12:02:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
I just ran the PageRank code in GraphX with some sample data. What I am
seeing is that the total rank changes drastically if I change the number of
iterations from 10 to 100. Why is that so?
As far as I understand, the
At 2014-11-18 14:59:20 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
So landmark can contain just one vertex right?
Right.
Which algorithm has been used to compute the shortest path?
It's distributed Bellman-Ford.
Ankur
At 2014-11-18 14:51:54 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
I am using Spark-1.0.0. There are two GraphX directories that I can see here
1. spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/graphx
which contains LiveJournalPageRank.scala
2.
At 2014-11-18 15:29:08 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
Does Bellman-Ford give the best solution?
It gives the same solution as any other algorithm, since there's only one
correct solution for shortest paths and it's guaranteed to find it eventually.
There are probably
At 2014-11-18 15:35:13 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
Now, how do I run the LiveJournalPageRank.scala that is there in 1?
I think it should work to use
MASTER=local[*] $SPARK_HOME/bin/run-example graphx.LiveJournalPageRank
/edge-list-file.txt --numEPart=8 --numIter=10
At 2014-11-18 15:44:31 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
I meant to ask whether it gives the solution faster than other algorithms.
No, it's just that it's much simpler and easier to implement than the others.
Section 5.2 of the Pregel paper [1] justifies using it for a graph
At 2014-11-18 15:51:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
Yes, the above command works, but there is this problem: most of the time,
the total rank is NaN (Not a Number). Why is that?
I've also seen this, but I'm not sure why it happens. If you could find out
which vertices
At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh
riteshoneinamill...@gmail.com wrote:
Tasks are now getting submitted, but many tasks don't happen.
Like, after opening the spark-shell, I load a text file from disk and try
printing its contents as:
sc.textFile("/path/to/file").foreach(println)
The NullPointerException seems to be because edge.dstAttr is null, which
might be due to SPARK-3936
https://issues.apache.org/jira/browse/SPARK-3936. Until that's fixed, I
edited the Gist with a workaround. Does that fix the problem?
Ankur http://www.ankurdave.com/
On Mon, Nov 3, 2014 at 12:23
How large is your graph, and how much memory does your cluster have?
We don't have a good way to determine the *optimal* number of partitions
aside from trial and error, but to get the job to at least run to
completion, it might help to use the MEMORY_AND_DISK storage level and a
large number of
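A sketch of that configuration (the path and partition count are illustrative; the edgeStorageLevel/vertexStorageLevel parameters exist on GraphLoader.edgeListFile from Spark 1.1 on):

import org.apache.spark.storage.StorageLevel
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt",
  minEdgePartitions = 256,
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)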
At 2014-10-25 08:56:34 +0530, Arpit Kumar arp8...@gmail.com wrote:
GraphLoader1.scala:49: error: class EdgePartitionBuilder in package impl
cannot be accessed in package org.apache.spark.graphx.impl
[INFO] val builder = new EdgePartitionBuilder[Int, Int]
Here's a workaround:
1. Copy and
At 2014-10-28 16:27:20 +0300, Zuhair Khayyat zuhair.khay...@gmail.com wrote:
I am using the connected components function of GraphX (on Spark 1.0.2) on some
graph. However, for some reason it fails with StackOverflowError. The graph
is not too big; it contains 1 vertices and 50 edges.
At 2014-10-23 09:48:55 +0530, Arpit Kumar arp8...@gmail.com wrote:
error: value partitionBy is not a member of
org.apache.spark.rdd.RDD[(org.apache.spark.graphx.PartitionID,
org.apache.spark.graphx.Edge[ED])]
Since partitionBy is a member of PairRDDFunctions, it sounds like the implicit
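If that's the cause, the likely fix (an assumption on my part, based on where the pair-RDD implicits lived before Spark 1.3) is:

import org.apache.spark.SparkContext._  // brings the RDD-to-PairRDDFunctions implicit into scope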
On Tue, Oct 14, 2014 at 12:36 PM, ll duy.huynh@gmail.com wrote:
hi again. just want to check in again to see if anyone could advise on how
to implement a mutable, growing graph with graphx?
we're building a graph that is growing over time. it adds more vertices and
edges every iteration of
On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh duy.huynh@gmail.com wrote:
a related question, what is the best way to update the values of existing
vertices and edges?
Many of the Graph methods deal with updating the existing values in bulk,
including mapVertices, mapEdges, mapTriplets,
At 2014-10-13 18:22:44 -0400, Soumitra Siddharth Johri
soumitra.siddha...@gmail.com wrote:
I have a flat tab separated file like below:
[...]
where n1,n2,n3,n4 are the nodes of the graph and R1,P2,P3 are the
properties which should form the edges between the nodes.
How can I construct a
At 2014-10-13 21:08:15 -0400, Soumitra Johri soumitra.siddha...@gmail.com
wrote:
There is no 'long' field in my file, so when I form the edge I get a type
mismatch error. Is it mandatory for GraphX that every vertex should have a
distinct id?
In my case n1,n2,n3,n4 are all strings.
(+user
At 2014-09-25 06:52:46 -0700, Cheuk Lam chl...@hotmail.com wrote:
This is a question on using the Pregel function in GraphX. Does a message
get serialized and then de-serialized in the scenario where both the source
and the destination vertices are in the same compute node/machine?
Yes,
At 2014-09-17 11:39:19 -0700, spr s...@yarcdata.com wrote:
I'm trying to implement label propagation in GraphX. The core step of that
algorithm is
- for each vertex, find the most frequent label among its neighbors and set
its label to that.
[...]
It seems on the broken line above, I
At 2014-09-16 10:55:37 +0200, Yifan LI iamyifa...@gmail.com wrote:
- from [1], and my understanding, the existing inactive feature in the GraphX
Pregel API is "if there are no in-edges from an active vertex to this vertex,
then we say this one is inactive", right?
Well, that's true when
At 2014-09-16 12:23:10 +0200, Yifan LI iamyifa...@gmail.com wrote:
but I am wondering if there is a message (none?) sent to the target vertex
(whose rank change is less than the tolerance) in the dynamic PageRank implementation below:
def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
At 2014-09-15 16:25:04 +0200, Yifan LI iamyifa...@gmail.com wrote:
I am wondering whether the vertex active/inactive feature (corresponding to the
change of its value between two supersteps) is present in the Pregel API of GraphX?
Vertex activeness in Pregel is controlled by messages: if a vertex did
At 2014-09-05 12:13:18 +0200, Yifan LI iamyifa...@gmail.com wrote:
But how do I assign the storage level to a new vertex RDD that is mapped from
an existing vertex RDD?
e.g.
val newVertexRDD =
graph.collectNeighborIds(EdgeDirection.Out).map { case (id: VertexId,
a: Array[VertexId]) => (id,
At 2014-09-05 21:40:51 +0800, marylucy qaz163wsx_...@hotmail.com wrote:
But when running GraphX edgeListFile, some tasks failed with
error: requested array size exceeds VM limit
Try passing a higher value for minEdgePartitions when calling
GraphLoader.edgeListFile.
Ankur
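For example (the path and partition count here are illustrative, not from the thread):

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt", minEdgePartitions = 128)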
At 2014-09-03 17:58:09 +0200, Yifan LI iamyifa...@gmail.com wrote:
val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions =
numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK)
Error: java.lang.UnsupportedOperationException: Cannot
At 2014-08-26 01:20:09 -0700, BertrandR bertrand.rondepierre...@gmail.com
wrote:
I actually tried without unpersisting, but given the performance I tried to
add these in order to free the memory. After your answer I tried to remove
them again, but without any change in the execution time...
At 2014-08-25 06:41:36 -0700, BertrandR bertrand.rondepierre...@gmail.com
wrote:
Unfortunately, this works well for extremely small graphs, but it becomes
exponentially slow with the size of the graph and the number of iterations
(doesn't finish 20 iterations with graphs having 48000 edges).
At 2014-08-25 11:23:37 -0700, Sunita Arvind sunitarv...@gmail.com wrote:
Does this "We introduce GraphX, which combines the advantages of both
data-parallel and graph-parallel systems by efficiently expressing graph
computation within the Spark data-parallel framework. We leverage new ideas
in
At 2014-08-23 08:33:48 -0700, Denis RP qq378789...@gmail.com wrote:
The bottleneck seems to be I/O; the CPU usage ranges 10%-15% most of the time per VM.
The caching is maintained by pregel, should be reliable. Storage level is
MEMORY_AND_DISK_SER.
I'd suggest trying the DISK_ONLY storage level and
At 2014-08-20 10:57:57 -0700, Mohit Singh mohit1...@gmail.com wrote:
I was wondering if the Personalized PageRank algorithm is implemented in GraphX.
If the talks and presentations are to be believed
(https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf)
it
At 2014-08-20 10:34:50 -0700, Cesar Arevalo ce...@zephyrhealthinc.com wrote:
I would like to get the type B vertices that are connected through type A
vertices where the edges have a score greater than 5. So, from the example
above I would like to get V1 and V4.
It sounds like you're trying to
(+user)
On Tue, Aug 19, 2014 at 12:05 PM, spr s...@yarcdata.com wrote:
I want to assign each vertex to a community with the name of the vertex.
As I understand it, you want to set the vertex attributes of a graph to the
corresponding vertex ids. You can do this using Graph#mapVertices [1] as
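A one-line sketch of what that call looks like:

val withIds = graph.mapVertices { (id, _) => id }  // each vertex attribute becomes its own id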
At 2014-08-19 12:47:16 -0700, spr s...@yarcdata.com wrote:
One follow-up question. If I just wanted to get those values into a vanilla
variable (not a VertexRDD or Graph or ...) so I could easily look at them in
the REPL, what would I do? Are the aggregate data structures inside the
On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI iamyifa...@gmail.com wrote:
I am testing our application (similar to personalized PageRank using Pregel;
note that each vertex property will need considerably more space to store
after each new iteration)
[...]
But when we ran it on a larger graph (e.g.
On Wed, Aug 6, 2014 at 11:37 AM, AlexanderRiggers
alexander.rigg...@gmail.com wrote:
To perform the page rank I have to create a graph object, adding the edges
by setting sourceID=id and distID=brand. In GraphLab there is a function: g =
SGraph().add_edges(data, src_field='id',
At 2014-08-04 20:52:26 +0800, Bin wubin_phi...@126.com wrote:
I wonder how Spark parameters, e.g., the level of parallelism, affect Pregel
performance? Specifically sendMessage, mergeMessage, and vertexProgram?
I have tried label propagation on a 300,000-edge graph, and I found that no
At 2014-08-02 21:29:33 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote:
How should I run graphx codes?
At the moment it's a little more complicated to run the GraphX algorithms than
the Spark examples due to SPARK-1986 [1]. There is a driver program in
org.apache.spark.graphx.lib.Analytics
At 2014-08-02 19:04:22 +0200, Yifan LI iamyifa...@gmail.com wrote:
But I am wondering whether I can compute only some selected vertices (hubs),
rather than updating every vertex…
Is it possible to do this using the Pregel API?
The Pregel API already only runs vprog on vertices that received messages
At 2014-08-01 11:23:49 +0800, Bin wubin_phi...@126.com wrote:
I am wondering what is the best way to construct a graph?
Say I have some attributes for each user, and a specific weight for each user
pair. The way I am currently doing it is to first read the user information
and edge triples into two
Attempting to build Spark from source on EC2 using sbt gives the error
sbt.ResolveException: unresolved dependency:
org.scala-lang#scala-library;2.10.2: not found. This only seems to happen on
EC2, not on my local machine.
To reproduce, launch a cluster using spark-ec2, clone the Spark
At 2014-08-01 14:50:22 -0600, Philip Ogren philip.og...@oracle.com wrote:
It seems that I could do this with mapPartition so that each element in a
partition gets added to an index for that partition.
[...]
Would it then be possible to take a string and query each partition's index
with it?
ShreyanshB shreyanshpbh...@gmail.com writes:
The version with in-memory shuffle is here:
https://github.com/amplab/graphx2/commits/vldb.
It'd be great if you can tell me how to configure and invoke this spark
version.
Sorry for the delay on this. Assuming you're planning to launch an EC2
Yifan LI iamyifa...@gmail.com writes:
Maybe you could get the vertex whose id is, for instance, 80, by using:
graph.vertices.filter { case (id, _) => id == 80 }.collect
but I am not sure this is the most efficient way. (will it scan the whole
table? or can it benefit from the index of
Denis RP qq378789...@gmail.com writes:
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to
stage failure: Task 6.0:4 failed 4 times, most recent failure: Exception
failure in TID 598 on host worker6.local: java.lang.NullPointerException
[error]
On Mon, Jul 28, 2014 at 4:29 AM, Larry Xiao xia...@sjtu.edu.cn wrote:
On 7/28/14, 3:41 PM, shijiaxin wrote:
There is a VertexPartition in the EdgePartition, which is created by
EdgePartitionBuilder.toEdgePartition, and there is also a
ShippableVertexPartition in the VertexRDD.
These two
On Mon, Jul 28, 2014 at 12:53 PM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
But when done processing, one would still have to pull out the wrapped
object, knowing what it was, and I don't see how to do that.
It's pretty tricky to get the level of type safety you're looking for. I
On Thu, Jul 24, 2014 at 9:52 AM, Arun Kumar toga...@gmail.com wrote:
While using the Pregel API for iterations, how do I figure out which superstep
the iteration is currently in?
The Pregel API doesn't currently expose this, but it's very straightforward
to modify Pregel.scala
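A rough sketch of such a modification (adapted loosely from the structure of Pregel.scala; the real implementation also threads an active set through mapReduceTriplets and unpersists intermediate RDDs, both omitted here):

import scala.reflect.ClassTag
import org.apache.spark.graphx._

object PregelWithSuperstep {
  def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
      (graph: Graph[VD, ED], initialMsg: A, maxIterations: Int = Int.MaxValue)
      (vprog: (Int, VertexId, VD, A) => VD,  // vprog now also receives the superstep
       sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
       mergeMsg: (A, A) => A): Graph[VD, ED] = {
    // Superstep 0: deliver the initial message to every vertex
    var g = graph.mapVertices((vid, vd) => vprog(0, vid, vd, initialMsg)).cache()
    var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
    var i = 1
    while (messages.count() > 0 && i <= maxIterations) {
      // As in Pregel.scala, only vertices that received a message run vprog
      g = g.outerJoinVertices(messages) { (vid, vd, msgOpt) =>
        msgOpt.map(msg => vprog(i, vid, vd, msg)).getOrElse(vd)
      }.cache()
      messages = g.mapReduceTriplets(sendMsg, mergeMsg)
      i += 1
    }
    g
  }
}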
We removed the PowerGraph abstraction layer when merging GraphX into Spark
(https://github.com/apache/spark/commit/732333d78e46ee23025d81ca9fbe6d1e13e9f253)
to reduce the maintenance costs. You can still read the code
On Tue, Jul 22, 2014 at 7:08 AM, Yifan LI iamyifa...@gmail.com wrote:
1) What is the difference between Duration (Stages -> Completed Stages)
and Task Time (Executors)?
Stages are composed of tasks that run on executors. Tasks within a stage
may run concurrently, since there are multiple
On Mon, Jul 21, 2014 at 8:05 PM, Bin WU bw...@connect.ust.hk wrote:
I am not sure how to specify different initial values for each node in the
graph. Moreover, I am wondering why the initial message is necessary. Couldn't
we instead initialize the graph and then pass it into the Pregel interface?
On Fri, Jul 18, 2014 at 9:07 PM, ShreyanshB shreyanshpbh...@gmail.com
wrote:
Does the suggested version with in-memory shuffle affect performance too
much?
We've observed a 2-3x speedup from it, at least on larger graphs like
twitter-2010 http://law.di.unimi.it/webdata/twitter-2010/ and
On Fri, Jul 18, 2014 at 9:13 AM, Yifan LI iamyifa...@gmail.com wrote:
Yes, is it possible to define a custom partition strategy?
Yes, you just need to create a subclass of PartitionStrategy as follows:
import org.apache.spark.graphx._
object MyPartitionStrategy extends PartitionStrategy {
  // illustrative body (my assumption): co-locate edges by source vertex
  override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID =
    (math.abs(src) % numParts).toInt
}
Sorry, I didn't read your vertex replication example carefully, so my
previous answer is wrong. Here's the correct one:
On Fri, Jul 18, 2014 at 9:13 AM, Yifan LI iamyifa...@gmail.com wrote:
I don't understand. For instance, we have 3 edge partition tables (EA: a ->
b, a -> c; EB: a -> d, a -> e;
Thanks for your interest. I should point out that the numbers in the arXiv
paper are from GraphX running on top of a custom version of Spark with an
experimental in-memory shuffle prototype. As a result, if you benchmark
GraphX at the current master, it's expected that it will be 2-3x slower
than
If your sendMsg function needs to know the incoming messages as well as the
vertex value, you could define VD to be a tuple of the vertex value and the
last received message. The vprog function would then store the incoming
messages into the tuple, allowing sendMsg to access them.
For example, if
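Filling in the idea with a hedged sketch (assumes a graph: Graph[Double, Int]; the update and message rules are placeholders, not from the thread):

val init: Graph[(Double, Double), Int] = graph.mapVertices((_, v) => (v, 0.0))  // (value, lastMsg)
val result = Pregel(init, initialMsg = 0.0, maxIterations = 10)(
  (id, vd, msg) => (vd._1 + msg, msg),           // vprog stashes the incoming message
  t => Iterator((t.dstId, t.srcAttr._2 * 0.5)),  // sendMsg can now read the last message
  _ + _)                                         // mergeMsg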
On Jul 15, 2014, at 12:06 PM, Yifan LI iamyifa...@gmail.com wrote:
Btw, is there any possibility to customise the partition strategy as we
expect?
I'm not sure I understand. Are you asking about defining a custom
On Fri, Jul 11, 2014 at 2:23 PM, ShreyanshB shreyanshpbh...@gmail.com
wrote:
-- Is this the correct way to load a file to get the best performance?
Yes, edgeListFile should be efficient at loading the edges.
-- What should the number of partitions be? Equal to the number of computing nodes, or of cores?
In general it should be a
I don't think it should affect performance very much, because GraphX
doesn't serialize ShippableVertexPartition in the fast path of
mapReduceTriplets execution (instead it calls
ShippableVertexPartition.shipVertexAttributes and serializes the result). I
think it should only get serialized for
Spark just opens up inter-slave TCP connections for message passing during
shuffles (I think the relevant code is in ConnectionManager). Since TCP
automatically determines the optimal sending rate
(http://en.wikipedia.org/wiki/TCP_congestion-avoidance_algorithm),
Spark doesn't need any
On Thu, Jul 10, 2014 at 8:20 AM, Yifan LI iamyifa...@gmail.com wrote:
- how to build the latest version of Spark from the master branch, which
contains a fix?
Instead of downloading a prebuilt Spark release from
http://spark.apache.org/downloads.html, follow the instructions under
Development
I think tiers/priorities for caching are a very good idea and I'd be
interested to see what others think. In addition to letting libraries cache
RDDs liberally, it could also unify memory management across other parts of
Spark. For example, small shuffles benefit from explicitly keeping the
Well, the alternative is to do a deep equality check on the index arrays,
which would be somewhat expensive since these are pretty large arrays (one
element per vertex in the graph). But, in case the reference equality check
fails, it actually might be a good idea to do the deep check before
Interesting problem! My understanding is that you want to (1) find paths
matching a particular pattern, and (2) add edges between the start and end
vertices of the matched paths.
For (1), I implemented a pattern matcher for GraphX
When joining two VertexRDDs with identical indexes, GraphX can use a fast
code path (a zip join without any hash lookups). However, the check for
identical indexes is performed using reference equality.
Without caching, two copies of the index are created. Although the two
indexes are
A common reason for the "Joining ... is slow" message is that you're
joining VertexRDDs without having cached them first. This will cause Spark
to recompute unnecessarily, and as a side effect, the same index will get
created twice and GraphX won't be able to do an efficient zip join.
For example,
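A sketch of the cached-join pattern (the derived ranks VertexRDD is just for illustration):

val verts = graph.vertices.cache()
val ranks = verts.mapValues(_ => 1.0).cache()
val joined = verts.innerJoin(ranks) { (id, attr, rank) => (attr, rank) }  // fast zip join on the shared index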
Yes, the subgraph operator takes a vertex predicate and keeps only the
edges where both vertices satisfy the predicate, so it will work as long as
you can express the sublist in terms of a vertex predicate.
If that's not possible, you can still obtain the same effect, but you'll
have to use
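For the predicate case, a sketch (keepIds is mine, for illustration):

val keepIds: Set[VertexId] = Set(1L, 2L, 3L)
val sub = graph.subgraph(vpred = (id, attr) => keepIds.contains(id))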
Many merge operations can be broken up to work incrementally. For example,
if the merge operation is to sum *n* rank updates, then you can set mergeMsg
= (a, b) => a + b and this function will be applied to all *n* updates in
arbitrary order to yield a final sum. Addition, multiplication, min, max,
Master/worker assignment and barrier synchronization are handled by Spark,
so Pregel will automatically coordinate the computation, communication, and
synchronization needed to run for the specified number of supersteps (or
until there are no messages sent in a particular superstep).
Ankur
Vertices can have arbitrary properties associated with them:
http://spark.apache.org/docs/latest/graphx-programming-guide.html#the-property-graph
Ankur http://www.ankurdave.com/
I've been trying to reproduce this but I haven't succeeded so far. For
example, on the web-Google graph
(https://snap.stanford.edu/data/web-Google.html), I get the
expected results both on v0.9.1-handle-empty-partitions
and on master:
// Load web-Google and run connected components
import
I think what's desired here is for input to be unpersisted automatically as
soon as result is materialized. I don't think there's currently a way to do
this, but the usual workaround is to force result to be materialized
immediately and then unpersist input:
input.cache()
val count = result.count() // force materialization
input.unpersist()
Oh, looks like the Scala Map isn't serializable. I switched the code to use
java.util.HashMap, which should work.
Ankur http://www.ankurdave.com/
On Mon, May 26, 2014 at 3:21 PM, daze5112 david.zeel...@ato.gov.au wrote:
Excellent, thanks Ankur, looks like what I'm looking for. Only one problem
On closer inspection it looks like Map normally is serializable, and it's
just a bug in mapValues, so I changed to using the .map(identity)
workaround described in https://issues.scala-lang.org/browse/SI-7005.
Ankur http://www.ankurdave.com/
Once the graph is built, edges are stored in parallel primitive arrays, so
each edge should only take 20 bytes to store (srcId: Long, dstId: Long,
attr: Int). Unfortunately, the current implementation in
EdgePartitionBuilder uses an array of Edge objects as an intermediate
representation for
I'm not sure I understand what you're looking for. Could you provide some
more examples to clarify?
One interpretation is that you want to tag the source vertices in a graph
(those with zero indegree) and find for each vertex the set of sources that
lead to that vertex. For vertices 1-8 in the
Do you mean you want to obtain a list of adjacent edges for every vertex? A
mapReduceTriplets followed by a join is the right way to do this. The join
will be cheap because the original and derived vertices will share indices.
There's a built-in function to do this for neighboring vertex