Actually, GraphX doesn't need to scan all the edges, because it maintains a clustered index on the source vertex id: it sorts the edges by source vertex id and stores each source's starting offset in a hash table. If the activeDirection is set appropriately, it can then scan only the clusters whose source vertices are active.
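To illustrate the idea (this is a simplified sketch, not GraphX's actual EdgePartition code; the names Edge, buildIndex, and edgesFromActive are made up for this example), the clustered index amounts to sorting the edge array by srcId and recording each source's first offset, so a lookup jumps straight to a cluster and scans until the srcId changes:

```scala
// Hypothetical sketch of a clustered index on source vertex id.
// GraphX's real implementation lives in EdgePartition; this only
// demonstrates the sort-plus-offset-table technique.
object ClusteredIndexSketch {
  case class Edge(srcId: Long, dstId: Long)

  // Sort edges by source id and record the offset of each source's
  // first edge in a hash map.
  def buildIndex(edges: Array[Edge]): (Array[Edge], Map[Long, Int]) = {
    val sorted = edges.sortBy(_.srcId)
    val index = sorted.zipWithIndex
      .groupBy { case (e, _) => e.srcId }
      .map { case (src, grp) => src -> grp.map(_._2).min }
    (sorted, index)
  }

  // Visit only the clusters whose source vertex is active: jump to the
  // recorded offset and scan until the source id changes.
  def edgesFromActive(sorted: Array[Edge],
                      index: Map[Long, Int],
                      active: Set[Long]): Seq[Edge] =
    active.toSeq.sorted.flatMap { src =>
      index.get(src).toSeq.flatMap { start =>
        sorted.iterator.drop(start).takeWhile(_.srcId == src)
      }
    }
}
```

With a small active set, the work per call is proportional to the edges incident to active sources rather than to the whole edge array, which is exactly why setting activeDirection matters.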
See the EdgePartition#index field [1], which stores the offsets, and the logic in GraphImpl#aggregateMessagesWithActiveSet [2], which decides whether to do a full scan or use the index.

[1] https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartition.scala#L60
[2] https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L237-266

Ankur

On Thu, Apr 9, 2015 at 3:21 AM, James <alcaid1...@gmail.com> wrote:
> In aggregateMessagesWithActiveSet, Spark still has to read all the edges,
> which means a fixed cost that scales with graph size is unavoidable in
> each Pregel-like iteration.
>
> But what if I have to run nearly 100 iterations, and in the last 50
> iterations fewer than 0.1% of the nodes need to be updated? That fixed
> cost makes the program finish in an unacceptable amount of time.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org