Hi,

New to Spark here. I had a question to see if my basic understanding is
correct when running Spark with a Tachyon / HDFS setup.

Would it be fair to say that unless there are several potentially costly
transformations on a given RDD, there is no need to call .cache(), as in
the Caching example here:
http://spark.incubator.apache.org/docs/latest/quick-start.html
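For reference, the caching pattern I mean looks roughly like this (a minimal Scala sketch; the HDFS path and the filter predicates are placeholders, and it assumes an existing SparkContext `sc` on a running cluster):

```scala
// Placeholder path; `sc` is an already-constructed SparkContext
val lines = sc.textFile("hdfs://namenode:9000/data/logs.txt")

// Mark the RDD for Spark's in-memory caching; it is only
// materialized when the first action runs
val errors = lines.filter(_.contains("ERROR")).cache()

// First action reads from storage and populates Spark's cache
errors.count()

// Subsequent actions reuse the cached partitions instead of re-reading
errors.filter(_.contains("timeout")).count()
```

My question is whether the .cache() call above buys much when Tachyon is already caching the underlying file in memory.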

The reasoning being that with Tachyon in front of HDFS, the first call to
load the data set reads from disk, but Tachyon then stores it in its
in-memory cache, so subsequent load calls hit Tachyon's in-memory layer
(with a potential network trip)?

In this post, Haoyuan's answer near the top implies that Spark, when running
with Tachyon, will make a best-effort attempt to route tasks to where the
data is cached in Tachyon.
http://mail-archives.apache.org/mod_mbox/spark-dev/201309.mbox/%3CCAG2iju2OqjkHaZZtq=10htBf0=zoxxwlcn41_rcsxsb4-4r...@mail.gmail.com%3E

If that is the case, is Spark's cluster-wide caching a redundant feature
when running on Tachyon? Or does it provide some other feature(s) that I
am missing?

Thanks for any information.

Adam
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Tachyon-tp1463.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
