Hi, I'm new to Spark. I have a question to check whether my basic understanding is correct when running Spark on a Tachyon / HDFS setup.
Would it be fair to say that unless there are several potentially costly transformations on a given RDD, there is no need to call .cache() as in the caching example here: http://spark.incubator.apache.org/docs/latest/quick-start.html

My reasoning is that when Tachyon sits in front of HDFS, the first call to load the data set results in a read from disk, but Tachyon then stores the data in its in-memory cache, so subsequent loads are served from Tachyon's in-memory layer (with a potential network trip). In this thread, Haoyuan's answer near the top implies that Spark, when running with Tachyon, will make a best-effort attempt to route tasks to where the data is cached in Tachyon: http://mail-archives.apache.org/mod_mbox/spark-dev/201309.mbox/%3CCAG2iju2OqjkHaZZtq=10htBf0=zoxxwlcn41_rcsxsb4-4r...@mail.gmail.com%3E

If that is the case, is Spark's cluster-wide caching a redundant feature when running under Tachyon? Or does it provide some other feature(s) that I am missing?

Thanks for any information.

Adam
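To make the comparison concrete, here is a minimal sketch of the two mechanisms side by side. The master hostname, ports, and file path are placeholders, and I'm assuming a Tachyon-backed path is read via a tachyon:// URI; the point is only to show that Tachyon would cache the raw input blocks, while Spark's .cache() memoizes a derived RDD after transformations have run:

```scala
import org.apache.spark.SparkContext

object CacheVsTachyon {
  def main(args: Array[String]): Unit = {
    // Placeholder master URL and app name.
    val sc = new SparkContext("spark://master:7077", "CacheVsTachyon")

    // Reading through Tachyon: the first read pulls blocks from HDFS;
    // afterwards Tachyon keeps them in its in-memory tier, so re-reading
    // the *raw input* avoids disk (possibly at the cost of a network hop).
    val lines = sc.textFile("tachyon://master:19998/user/adam/input.txt")

    // Spark's .cache() applies to this *derived* RDD (post-filter), so
    // the filter transformation itself is not recomputed on reuse.
    val errors = lines.filter(_.contains("ERROR")).cache()

    println(errors.count())  // first action materializes and caches the RDD
    println(errors.first())  // served from Spark's cache; filter not re-run

    sc.stop()
  }
}
```

If this sketch is right, the two caches cover different things (input blocks vs. computed results), which is what my question turns on.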
