Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-28 Thread qingyang li
Hi Haoyuan, thanks for replying.



Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-21 Thread Haoyuan Li
Qingyang,

Aha. Got it.

800 MB of data is pretty small. Loading from Tachyon does have a bit of extra
overhead, but the benefit grows as the data size gets larger. Also, if you
store the table in Tachyon, you can have different Shark servers query the
data at the same time. For more on the trade-offs, please refer to this
page: http://tachyon-project.org/Running-Shark-on-Tachyon.html

Best,

Haoyuan



Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-16 Thread qingyang li
Let me describe my setup:
--
I have 8 machines (24 cores and 16 GB of memory per machine) running both a
Spark cluster and a Tachyon cluster. On Tachyon, I created one table that
contains 800 MB of data. When I run a SQL query on it through Shark, it takes
2.43s, but when I create the same table in Spark memory and run the same SQL,
it takes 1.56s. So the query on data in Tachyon takes longer than on data in
Spark memory. Both runs have 150 map tasks, with 16-20 map tasks per node.
I think the reason is that when the data is in Tachyon, Shark has each Spark
slave load the data from the Tachyon slave on the same node.
I have tried several configuration settings to tune Shark and Tachyon, but I
still cannot get the Tachyon case faster than 2.43s.
Does anyone have any ideas?
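
For reference, the figures above work out as follows. This is only arithmetic on the numbers quoted in this message (800 MB, 150 map tasks, 2.43s and 1.56s); reading "800M" as 800 MB and spreading the data evenly across the tasks are assumptions.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the message.
# Assumptions: "800M" means 800 MB, and data is spread evenly over the tasks.
data_mb = 800
map_tasks = 150
tachyon_s = 2.43   # query time with the table in Tachyon
memory_s = 1.56    # query time with the table in Spark memory

extra_s = tachyon_s - memory_s
slowdown_pct = 100 * extra_s / memory_s
mb_per_task = data_mb / map_tasks

print(f"extra time: {extra_s:.2f}s ({slowdown_pct:.0f}% slower)")
print(f"data per map task: {mb_per_task:.1f} MB")
```

With only about 5 MB per map task, any fixed per-task cost of reading from Tachyon would plausibly be a visible share of a query that takes under 3 seconds in total.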

By the way, my Tachyon block size is 1 GB now and I want to change it.
Will setting tachyon.user.default.block.size.byte=8M work? If not, what
does tachyon.user.default.block.size.byte mean?
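
To make the block-size question concrete, here is the arithmetic for the two sizes mentioned. Whether tachyon.user.default.block.size.byte accepts a suffixed value such as 8M or needs a plain byte count is not settled in this thread; the sketch only converts the sizes to bytes and counts blocks, assuming binary units (1 MB = 1024 * 1024 bytes) and, hypothetically, that the 800 MB table is stored as a single file.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB

def blocks_needed(file_bytes, block_bytes):
    """Number of fixed-size blocks needed to hold file_bytes."""
    return math.ceil(file_bytes / block_bytes)

table_bytes = 800 * MB  # assumption: the whole table in one file

print(8 * MB)                               # 8 MB expressed in bytes
print(blocks_needed(table_bytes, 1 * GB))   # blocks at the current 1 GB size
print(blocks_needed(table_bytes, 8 * MB))   # blocks at the proposed 8 MB size
```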



Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-13 Thread qingyang li
Shark. Thanks for replying.
Let me clarify my question.
--
I create a table using "create table xxx1
tblproperties("shark.cache"="tachyon") as select * from xxx2".
When executing SQL (for example, select * from xxx1) with Shark, Shark reads
the data from Tachyon's memory into Shark's memory.
I think that if Shark loads the data from Tachyon every time we execute SQL,
it is less efficient.
Could we use a cache policy (such as CacheAllPolicy, FIFOCachePolicy, or
LRUCachePolicy) to cache the data and avoid reading it from Tachyon for each
SQL query?
--
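
As an illustration of what an LRU-style policy does in general, here is a minimal sketch. It is not Shark's LRUCachePolicy; the class name, the loader function and the capacity are all hypothetical. The point is only that a cache hit skips the reload while a miss pays for it.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny least-recently-used cache (illustrative only, not Shark's code)."""

    def __init__(self, capacity, load):
        self.capacity = capacity
        self.load = load            # called on a miss, e.g. a read from storage
        self.entries = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.load(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        return value

# Hypothetical loader standing in for "read a partition from Tachyon".
cache = LRUCache(capacity=2, load=lambda key: f"data for {key}")
for key in ["p0", "p1", "p0", "p2", "p1"]:
    cache.get(key)
print(cache.misses)
```

Note that a policy like this only helps when the working set fits in the cache; with a capacity of 2 and three partitions in rotation, most lookups above still miss.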





Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-13 Thread Haoyuan Li
Qingyang,

Are you asking about Spark or Shark? (The first email said "Shark", the last
one said "Spark".)

Best,

Haoyuan





-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-09 Thread qingyang li
Could I set a cache policy so that Spark loads the data from Tachyon only
once for all SQL queries, for example by using CacheAllPolicy,
FIFOCachePolicy, or LRUCachePolicy? I have tried those three policies, but
they did not help.
I think that if Spark always loads the data for each SQL query, it will hurt
query speed: it will take more time than when the data is managed by
Spark itself.






Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-08 Thread Haoyuan Li
Yes. For Shark, the two modes, "shark.cache=tachyon" and "shark.cache=memory",
have the same ser/de overhead. In Tachyon mode, Shark loads data from outside
of the process, which has the following benefits:


   - In-memory data sharing across multiple Shark instances (i.e. stronger
   isolation)
   - Instant recovery of in-memory tables
   - Reduced heap size => faster GC in Shark
   - If the table is larger than the memory size, only the hot columns will
   be cached in memory

(from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html and
https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon)

Haoyuan







Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-08 Thread Aaron Davidson
Shark's in-memory format is already serialized (it's compressed and
column-based).
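
A rough picture of what "serialized, compressed and column-based" means: each column is packed into its own byte buffer, and a query decodes only the columns it touches. The sketch below uses struct and zlib purely for illustration; Shark's actual encodings are not described in this thread, and the table and column names are made up.

```python
import struct
import zlib

def encode_int_column(values):
    """Pack a list of ints into compressed bytes (one buffer per column)."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return zlib.compress(raw)

def decode_int_column(buf):
    """Decode a buffer produced by encode_int_column back into a list."""
    raw = zlib.decompress(buf)
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))

# Hypothetical two-column table, stored column by column.
table = {
    "id": encode_int_column(list(range(1000))),
    "clicks": encode_int_column([7] * 1000),
}

# A query on "clicks" decodes only that column's buffer. The decode step is
# the same wherever the bytes are held, in-process or in an external store.
total = sum(decode_int_column(table["clicks"]))
print(total)
```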




Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-08 Thread Mridul Muralidharan
You are ignoring serde costs :-)

- Mridul



Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-08 Thread Aaron Davidson
Tachyon should only be marginally less performant than memory_only, because
we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer
the data over a pipe from Tachyon; we can directly read from the buffers in
the same way that Shark reads from its in-memory columnar format.
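
The difference between mapping a file and streaming it can be sketched with Python's mmap module. This is only an analogy for the approach described above: an ordinary temporary file stands in for a file on Tachyon's ramdisk, and nothing here is Shark or Tachyon code.

```python
import mmap
import os
import struct
import tempfile

# Write 1000 little-endian 64-bit ints to a temp file (stand-in for a ramdisk file).
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000q", *range(1000)))

with open(path, "rb") as f:
    # Map the file into this process's address space instead of streaming it.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # Decode one value straight from the mapped buffer at a byte offset.
        value_at_500 = struct.unpack_from("<q", mapped, 500 * 8)[0]
        # Decode the whole column directly from the same mapped buffer.
        total = sum(struct.unpack_from("<1000q", mapped, 0))

print(value_at_500, total)
```

The reader addresses the mapped bytes by offset, much as it would an in-process buffer, which is why the extra cost compared with memory_only is expected to be small.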



On Tue, Jul 8, 2014 at 1:18 AM, qingyang li 
wrote:

> Hi, when I create a table, I can specify the cache strategy using
> shark.cache.
> I think "shark.cache=memory_only" means the data is managed by Spark and
> lives in the same JVM as the executor, while "shark.cache=tachyon" means
> the data is managed by Tachyon, which is off-heap, so the data is not in
> the same JVM as the executor and Spark will load it from Tachyon for each
> SQL query. So, is Tachyon less efficient than the memory_only cache
> strategy?
> If yes, can we have Spark load all the data from Tachyon once for all SQL
> queries, if I want to use the Tachyon cache strategy because Tachyon is
> more HA than memory_only?
>