It's worth mentioning that HDFS caching doesn't work smoothly with Spark out of the box right now. By default, a cached file in HDFS has three on-disk replicas, and only one of them is also held in memory. When scheduling, Spark treats all three replica locations as equally preferred, so even when resources aren't contended, on average only about 1/3 of the data will be read from memory. SPARK-1767 <https://issues.apache.org/jira/browse/SPARK-1767> aims to fix this.
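
To make the 1/3 figure concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not Spark's scheduler: it assumes every block has the same number of replicas, that a fixed number of them are cached in memory, and that the reader lands on a replica uniformly at random. The function names (expected_memory_fraction, simulate) are made up for illustration.

```python
import random
from fractions import Fraction


def expected_memory_fraction(replicas: int = 3, cached: int = 1) -> Fraction:
    """Expected share of reads served from memory when every replica
    location is equally likely to be chosen (toy model)."""
    if replicas < 1 or not 0 <= cached <= replicas:
        raise ValueError("need replicas >= 1 and 0 <= cached <= replicas")
    return Fraction(cached, replicas)


def simulate(num_blocks: int, replicas: int = 3, cached: int = 1,
             prefer_cached: bool = False, seed: int = 0) -> float:
    """Simulate reading num_blocks blocks and return the fraction of
    reads that hit an in-memory replica.

    Replicas 0 .. cached-1 are treated as the in-memory ones. With
    prefer_cached=False the replica is picked uniformly at random; with
    prefer_cached=True a cached replica is always picked when one exists.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_blocks):
        if prefer_cached and cached > 0:
            chosen = 0  # always go to an in-memory replica
        else:
            chosen = rng.randrange(replicas)  # all replicas equally preferred
        if chosen < cached:
            hits += 1
    return hits / num_blocks


if __name__ == "__main__":
    print("expected fraction from memory:", expected_memory_fraction())
    print("simulated, uniform choice:    ", simulate(30000))
    print("simulated, cached preferred:  ", simulate(30000, prefer_cached=True))
```

Under these assumptions the uniform case hovers around one third, while a scheduler that knew which replica was cached could serve every read from memory in the uncontended case.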

On Tue, May 13, 2014 at 6:26 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:

> Great to know that! Thank you, Matei.
>
> Best regards,
>
> -chanwit
>
> --
> Chanwit Kaewkasi
> linkedin.com/in/chanwit
>
>
> On Tue, May 13, 2014 at 2:14 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > That API is something the HDFS administrator uses outside of any
> > application to tell HDFS to cache certain files or directories. But once
> > you've done that, any existing HDFS client accesses them directly from
> > the cache.
> >
> > Matei
> >
> > On May 12, 2014, at 11:10 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
> >
> >> Is that true? I believe that API Chanwit is talking about requires
> >> explicitly asking for files to be cached in HDFS.
> >>
> >> Spark automatically benefits from the kernel's page cache (i.e. if
> >> some block is in the kernel's page cache, it will be read more
> >> quickly). But the explicit HDFS cache is a different thing; Spark
> >> applications that want to use it would have to explicitly call the
> >> respective HDFS APIs.
> >>
> >> On Sun, May 11, 2014 at 11:04 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>> Yes, Spark goes through the standard HDFS client and will
> >>> automatically benefit from this.
> >>>
> >>> Matei
> >>>
> >>> On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
> >>>> sc.textFile() and other HDFS-related APIs?
> >>>>
> >>>> http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
> >>>>
> >>>> Best regards,
> >>>>
> >>>> -chanwit
> >>>>
> >>>> --
> >>>> Chanwit Kaewkasi
> >>>> linkedin.com/in/chanwit
> >>
> >> --
> >> Marcelo