On Mon, May 12, 2014 at 12:14 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> That API is something the HDFS administrator uses outside of any application
> to tell HDFS to cache certain files or directories. But once you've done
> that, any existing HDFS client accesses them directly from the cache.
Ah, yeah, sure. What I meant is that Spark itself will not, AFAIK, use
that facility for adding files to the cache or anything like that. But
yes, it does benefit from things already cached.

> On May 12, 2014, at 11:10 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
>
>> Is that true? I believe that API Chanwit is talking about requires
>> explicitly asking for files to be cached in HDFS.
>>
>> Spark automatically benefits from the kernel's page cache (i.e. if
>> some block is in the kernel's page cache, it will be read more
>> quickly). But the explicit HDFS cache is a different thing; Spark
>> applications that want to use it would have to explicitly call the
>> respective HDFS APIs.
>>
>> On Sun, May 11, 2014 at 11:04 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>> Yes, Spark goes through the standard HDFS client and will automatically
>>> benefit from this.
>>>
>>> Matei
>>>
>>> On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
>>>> sc.textFile() and other HDFS-related APIs?
>>>>
>>>> http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
>>>>
>>>> Best regards,
>>>>
>>>> -chanwit
>>>>
>>>> --
>>>> Chanwit Kaewkasi
>>>> linkedin.com/in/chanwit
>>>
>>
>> --
>> Marcelo

--
Marcelo
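For anyone following along: the "outside of any application" step Matei describes is done with the `hdfs cacheadmin` CLI. A minimal sketch, assuming a running HDFS 2.3+ cluster with caching enabled; the pool name "spark-hot" and path "/data/hot" here are placeholders, not anything from the thread:

```shell
# Create a cache pool, then pin a directory into the DataNodes'
# off-heap cache by adding a cache directive to that pool.
hdfs cacheadmin -addPool spark-hot
hdfs cacheadmin -addDirective -path /data/hot -pool spark-hot -replication 2

# Check which directives exist and how many bytes are actually cached.
hdfs cacheadmin -listDirectives -pool spark-hot
```

Once a directive is in place, any standard HDFS client reading those paths, including a Spark job doing sc.textFile("/data/hot/..."), benefits automatically, with no Spark-side API calls, which matches what Matei and Marcelo conclude above.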