It's worth mentioning that leveraging HDFS caching in Spark doesn't work
smoothly out of the box right now. By default, a cached file in HDFS has
3 on-disk replicas, and only one of these is also an in-memory replica.
In its scheduling, Spark prefers all three replica locations equally,
meaning that, even when resources aren't contended, on average only 1/3
of the data will be read from memory. SPARK-1767
<https://issues.apache.org/jira/browse/SPARK-1767> aims to fix this.
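
To make the 1/3 figure concrete, here is a rough back-of-the-envelope sketch. It is not Spark's scheduler code, only a toy model that assumes each block read lands on one of the replicas uniformly at random; the function names and parameters are made up for illustration.

```python
import random
from fractions import Fraction


def expected_memory_fraction(replicas: int = 3, cached: int = 1) -> Fraction:
    """Expected share of reads served from memory when every replica
    location is equally likely to be picked (toy model, not Spark code)."""
    return Fraction(cached, replicas)


def simulate(num_blocks: int = 100_000, replicas: int = 3,
             cached: int = 1, seed: int = 0) -> float:
    """Pick a replica uniformly at random for each block and count how
    often the pick is one of the in-memory replicas."""
    rng = random.Random(seed)
    # Replicas 0..cached-1 stand in for the in-memory copies.
    hits = sum(1 for _ in range(num_blocks) if rng.randrange(replicas) < cached)
    return hits / num_blocks


if __name__ == "__main__":
    print("expected:", expected_memory_fraction())
    print("simulated:", simulate())
```

With the defaults described above (3 replicas, 1 cached) the expected value is exactly 1/3, and the simulated value should come out close to it. Under this model, raising the cache replication toward the on-disk replication would push the fraction toward 1.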



On Tue, May 13, 2014 at 6:26 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:

> Great to know that! Thank you, Matei.
>
> Best regards,
>
> -chanwit
>
> --
> Chanwit Kaewkasi
> linkedin.com/in/chanwit
>
>
> On Tue, May 13, 2014 at 2:14 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> > That API is something the HDFS administrator uses outside of any
> > application to tell HDFS to cache certain files or directories. But once
> > you've done that, any existing HDFS client accesses them directly from the
> > cache.
> >
> > Matei
> >
> > On May 12, 2014, at 11:10 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
> >
> >> Is that true? I believe that API Chanwit is talking about requires
> >> explicitly asking for files to be cached in HDFS.
> >>
> >> Spark automatically benefits from the kernel's page cache (i.e. if
> >> some block is in the kernel's page cache, it will be read more
> >> quickly). But the explicit HDFS cache is a different thing; Spark
> >> applications that want to use it would have to explicitly call the
> >> respective HDFS APIs.
> >>
> >> On Sun, May 11, 2014 at 11:04 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>> Yes, Spark goes through the standard HDFS client and will
> >>> automatically benefit from this.
> >>>
> >>> Matei
> >>>
> >>> On May 8, 2014, at 4:43 AM, Chanwit Kaewkasi <chan...@gmail.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via
> >>>> sc.textFile() and other HDFS-related APIs?
> >>>>
> >>>>
> >>>> http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
> >>>>
> >>>> Best regards,
> >>>>
> >>>> -chanwit
> >>>>
> >>>> --
> >>>> Chanwit Kaewkasi
> >>>> linkedin.com/in/chanwit
> >>>
> >>
> >>
> >>
> >> --
> >> Marcelo
> >
>
