Re: Some questions in using Graphx

2014-04-22 Thread Ankur Dave
These are excellent questions. Answers below:

On Tue, Apr 22, 2014 at 8:20 AM, wu zeming zemin...@gmail.com wrote:

 1. Why do some transformations like partitionBy and mapVertices cache the
 new graph, while others like outerJoinVertices do not?


In general, we cache RDDs that are used more than once to avoid
recomputation. In mapVertices, the newVerts RDD is used to construct both
changedVerts and the new graph, so we have to cache it. Good catch that
outerJoinVertices doesn't follow this convention! To be safe, we should be
caching newVerts there as well. Also, partitionBy doesn't need to cache
newEdges, so we should remove the call to cache().
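
As a rough illustration of that convention (plain RDDs and hypothetical
names, not the actual GraphX internals):

    import org.apache.spark.SparkContext

    def cachingConvention(sc: SparkContext): Unit = {
      val base = sc.parallelize(1 to 1000)
      // newVerts feeds two downstream computations, so cache it;
      // without the cache, each action below would recompute the map.
      val newVerts = base.map(_ * 2).cache()
      val changed = newVerts.filter(_ % 3 == 0).count() // first use
      val total = newVerts.count()                      // second use, from cache
      println(s"changed=$changed, total=$total")
    }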

2. I use the Pregel API and only use edgeTriplet.srcAttr in sendMsg. After
 that I get a new Graph and call graph.mapReduceTriplets, using both
 edgeTriplet.srcAttr and edgeTriplet.dstAttr in sendMsg. I found that with
 the implementation of ReplicatedVertexView, Spark recomputes parts of the
 graph that should already have been computed. Can anyone explain the
 implementation here?


This is a known problem with the current implementation of join
elimination. If you do an operation that only needs one side of the triplet
(say, the source attribute) followed by an operation that uses both, the
second operation will re-replicate the source attributes unnecessarily. I
wrote a PR that solves this problem by only moving the necessary vertex
attributes: https://github.com/amplab/graphx/pull/137 (the relevant code is
at
https://github.com/ankurdave/graphx/blob/74349e9c81fa626949172d1905dd7713ab6160ac/graphx/src/main/scala/org/apache/spark/graphx/impl/ReplicatedVertexView.scala#L52).
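
To make the scenario concrete, here is a minimal sketch using the
mapReduceTriplets API of that era (the graph and message types are made up):

    import org.apache.spark.graphx._

    def replicationScenario(graph: Graph[Int, Int]): Unit = {
      // Pass 1: sendMsg reads only srcAttr, so only source attributes
      // need to be shipped to the edge partitions.
      val srcOnly: VertexRDD[Int] = graph.mapReduceTriplets[Int](
        t => Iterator((t.dstId, t.srcAttr)),
        (a, b) => a + b)

      // Pass 2: sendMsg reads both srcAttr and dstAttr. Without the PR
      // above, this re-ships the source attributes even though they
      // were already replicated for pass 1.
      val both: VertexRDD[Int] = graph.mapReduceTriplets[Int](
        t => Iterator((t.dstId, t.srcAttr + t.dstAttr)),
        (a, b) => math.max(a, b))
    }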

3. Why doesn't VertexPartition extend Serializable? It's used by RDDs.


Though we do create RDD[VertexPartition]s, VertexPartition itself should
never need to be serialized. Instead, we move vertex attributes using
VertexAttributeBlock.

Some other classes in GraphX (GraphImpl, for example) are marked
serializable for a different reason: to make it harmless if closures
accidentally capture them. These classes have all their fields marked
@transient so nothing really gets serialized.
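
A quick sketch of that pattern (the class and field names are made up):

    // Marking the field @transient means an accidental closure capture
    // serializes only an empty shell; the array is dropped during
    // serialization and reads back as null on the remote side.
    class HeavyHolder(@transient val localState: Array[Long])
      extends Serializable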

4. Can you provide a spark.default.cache.useDisk option for using DISK_ONLY
 in cache() by default?


This isn't officially supported yet, but I have a patch
(https://github.com/ankurdave/graphx/commit/aa78d782e606a6c0c0f5b8253ae4f408531fd015)
that will let you do it in a hacky way by passing the desired storage level
everywhere.
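
In the meantime, a hedged sketch using the public persist API (this only
works if nothing has been materialized yet, since an RDD's storage level
cannot be changed once assigned):

    import org.apache.spark.graphx._
    import org.apache.spark.storage.StorageLevel

    // Request DISK_ONLY up front, before any action caches the graph's
    // RDDs at the default level.
    def diskBacked(graph: Graph[Int, Int]): Graph[Int, Int] =
      graph.persist(StorageLevel.DISK_ONLY)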

Ankur http://www.ankurdave.com/


Re: Some questions in using Graphx

2014-04-22 Thread Wu Zeming
Hi Ankur, 
Thanks for your reply! These answers are very useful to me. I hope these
issues can be fixed in Spark 0.9.2 or Spark 1.0.

Another thing I forgot: I think the persist API for Graph, VertexRDD, and
EdgeRDD should not be public for now, because it leads to an
UnsupportedOperationException when I change the StorageLevel.
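
For reference, a minimal sketch of that failure mode with plain RDDs (the
exception message is paraphrased):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    def changeLevel(sc: SparkContext): Unit = {
      val rdd = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_ONLY)
      // Throws UnsupportedOperationException: an RDD's storage level
      // cannot be changed after one has been assigned.
      rdd.persist(StorageLevel.DISK_ONLY)
    }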



-
Wu Zeming