Hi,

I have a question about running PageRank with the LiveJournal data, as
suggested by the example at

org.apache.spark.examples.graphx.LiveJournalPageRank


I ran with the following options

bin/run-example org.apache.spark.examples.graphx.LiveJournalPageRank
data/graphx/soc-LiveJournal1.txt --numEPart=1
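
In case it helps with reproducing, here is roughly the standalone
equivalent of what I am running. I am assuming that --numEPart maps to
the numEdgePartitions argument of GraphLoader.edgeListFile, and the
tolerance is just a value I picked, so treat this as a sketch rather
than exactly what the example driver does:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object LiveJournalPageRankRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PageRankRepro"))

    // Load the edge list with a single edge partition, as with --numEPart=1
    val graph = GraphLoader.edgeListFile(
      sc,
      "data/graphx/soc-LiveJournal1.txt",
      numEdgePartitions = 1).cache()

    // Dynamic PageRank: iterates until ranks move by less than the tolerance
    val ranks = graph.pageRank(tol = 0.0001).vertices
    println(ranks.count())

    sc.stop()
  }
}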


From the Spark UI, the shuffle read size for

mapPartitions at GraphImpl.scala:235

is steadily increasing, all the way to 2.1 GB, on a single-node
machine. I would expect the shuffle read size to decrease as the
number of messages decreases. With 4 partitions, the shuffle read for
the mapPartitions stage does decrease as the program progresses, so I
am not sure why it is increasing with a single partition.

This really hurts the performance of the single-partition run, even
though the single partition spends much less time in the reduce phase
than the 4-partition configuration on the same single node.


Thanks
