I am facing a somewhat confusing problem: my Spark app reads data from a database, computes some values, and then runs a shortest-path Pregel operation on them. If I save the RDD to disk and read it back in, the app runs 30-50% faster than when I keep everything in memory. On top of that, the in-memory version crashes with GC overhead limit exceeded errors every other run, and it also produces warnings like "TaskSetManager: Stage 12 contains a task of very large size (890 KB). The maximum recommended task size is 100 KB".
Here is the relevant part.

Slower (in memory):

    val dailyRate: RDD[OneCalc] = getDaily(originalData)
    val distance: RDD[Edge[Int]] = dailyRate.map(x => Edge(x.fromEdge, x.toEdge, x.avg))
    val graph = Graph(datapoints, distance, MySpark.defaultRoute,
      StorageLevel.MEMORY_AND_DISK_SER, StorageLevel.MEMORY_AND_DISK_SER)
    // => Pregel

Faster (via file round-trip):

    val dailyRate: Array[OneCalc] = getDaily(originalData).collect()
    sc.parallelize(dailyRate.map(x => s"${x.from},${x.to},${x.avg}"))
      .repartition(1).saveAsTextFile("file:///tmp/spark")
    val distance = sc.textFile("file:///tmp/spark/part-00000").flatMap(x => {
      val line = x.split(",")
      OneCalc(line)
    }).map(x => Edge(x.fromEdge, x.toEdge, x.avg))
    val graph = Graph(datapoints, distance, MySpark.defaultRoute,
      StorageLevel.MEMORY_AND_DISK_SER, StorageLevel.MEMORY_AND_DISK_SER)
    // => Pregel

Any idea why this is happening?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Writing-and-reading-file-faster-than-memory-option-tp20588.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
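For context, here is a sketch (untested) of the in-memory variant I would expect to behave like the file version: truncating the lineage with checkpoint() instead of writing to and re-reading a text file. The checkpoint directory is arbitrary; sc, getDaily, originalData, datapoints, and MySpark.defaultRoute are the same as above.

    import org.apache.spark.storage.StorageLevel

    // Checkpoint data goes here instead of the manual /tmp/spark file
    sc.setCheckpointDir("file:///tmp/spark-checkpoints")

    val dailyRate = getDaily(originalData)
    val distance = dailyRate
      .map(x => Edge(x.fromEdge, x.toEdge, x.avg))
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    distance.checkpoint() // cut the lineage, like the file round-trip does
    distance.count()      // force materialization before building the graph

    val graph = Graph(datapoints, distance, MySpark.defaultRoute,
      StorageLevel.MEMORY_AND_DISK_SER, StorageLevel.MEMORY_AND_DISK_SER)
    // => Pregel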