I am facing a somewhat confusing problem:

My Spark app reads data from a database, calculates certain values, and then
runs a shortest-path Pregel operation on them. If I save the RDD to disk and
read it back in, the app runs 30-50% faster than when I keep everything in
memory. On top of that, the in-memory version crashes with "GC overhead limit
exceeded" every other run and logs
"TaskSetManager: Stage 12 contains a task of very large size (890 KB). The
maximum recommended task size is 100 KB"
warnings, which the disk version does not.
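
In case it helps narrow down the GC crashes: a minimal sketch of turning on
GC logging via standard Spark/JVM options (the app name is made up, nothing
here is specific to my app):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shortest-path")  // made-up name
  // print per-collection GC details in the executor logs, to see
  // whether the in-memory run is thrashing the heap
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
val sc = new SparkContext(conf)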

Here is the relevant part:

Slower:

val dailyRate: RDD[OneCalc] = getDaily(originalData)
val distance: RDD[Edge[Int]] =
  dailyRate.map(x => Edge(x.fromEdge, x.toEdge, x.avg))
val graph = Graph(datapoints, distance, MySpark.defaultRoute,
  StorageLevel.MEMORY_AND_DISK_SER, StorageLevel.MEMORY_AND_DISK_SER)
=> Pregel
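
One thing I am not sure about in this path is whether `distance` gets
re-evaluated (database read and getDaily included) while Pregel iterates, and
whether the storage levels passed to Graph() already cover that. A minimal,
untested sketch of persisting the edges explicitly before building the graph,
using the same identifiers as above:

import org.apache.spark.storage.StorageLevel

// keep the computed edges around instead of re-deriving them
distance.persist(StorageLevel.MEMORY_AND_DISK_SER)
distance.count()  // force one evaluation so the upstream work runs only once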

Faster:

// collect() returns an Array, not an RDD
val dailyRate: Array[OneCalc] = getDaily(originalData).collect()
sc.parallelize(dailyRate.map(x => s"${x.from},${x.to},${x.avg}"))
  .repartition(1).saveAsTextFile("file:///tmp/spark")
val distance: RDD[Edge[Int]] = sc.textFile("file:///tmp/spark/part-00000")
  .flatMap { x =>
    val line = x.split(",")
    OneCalc(line)  // assumed to return an Option/Seq, as flatMap requires
  }
  .map(x => Edge(x.fromEdge, x.toEdge, x.avg))
val graph = Graph(datapoints, distance, MySpark.defaultRoute,
  StorageLevel.MEMORY_AND_DISK_SER, StorageLevel.MEMORY_AND_DISK_SER)
=> Pregel
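
If the file round-trip helps because it truncates the RDD lineage,
checkpointing should give a similar effect without the manual
saveAsTextFile/textFile detour. A sketch under that assumption (the
checkpoint path is made up):

sc.setCheckpointDir("file:///tmp/spark-checkpoint")  // made-up path

val distance: RDD[Edge[Int]] =
  getDaily(originalData).map(x => Edge(x.fromEdge, x.toEdge, x.avg))
distance.checkpoint()  // written to stable storage, lineage dropped
distance.count()       // checkpointing happens on the first action
// then build the Graph and run Pregel exactly as in the slower path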


Any idea why this is happening?



