Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Hi Haoyuan, thanks for replying.

2014-07-21 16:29 GMT+08:00 Haoyuan Li:

> Qingyang,
>
> Aha. Got it.
>
> 800MB of data is pretty small. Loading from Tachyon does have a bit of extra
> overhead, but it will bring more benefit when the data size is larger. Also,
> if you store the table in Tachyon, different Shark servers can query the
> data at the same time. For more on the trade-offs, please refer to this page:
> http://tachyon-project.org/Running-Shark-on-Tachyon.html
>
> Best,
>
> Haoyuan
Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Qingyang,

Aha. Got it.

800MB of data is pretty small. Loading from Tachyon does have a bit of extra overhead, but it will bring more benefit when the data size is larger. Also, if you store the table in Tachyon, different Shark servers can query the data at the same time. For more on the trade-offs, please refer to this page: http://tachyon-project.org/Running-Shark-on-Tachyon.html

Best,

Haoyuan

On Wed, Jul 16, 2014 at 12:06 AM, qingyang li wrote:

> Let me describe my scenario:
> --
> I have a Spark cluster and a Tachyon cluster on 8 machines (24 cores and
> 16G memory per machine). On Tachyon, I created one table that contains 800M
> of data. When I run a SQL query on Shark, it takes 2.43s, but when I create
> the same table in Spark memory and run the same SQL, it takes 1.56s. Data on
> Tachyon takes more time than data in Spark memory. Both runs have 150 map
> processes, 16-20 per node.
> I think the reason is that when the data is on Tachyon, Shark lets each
> Spark slave load data from the Tachyon slave on the same node.
> I have tried several configuration settings to tune Shark and Tachyon, but
> still cannot make the former faster than 2.43s.
> Does anyone have any ideas?
>
> By the way, my Tachyon block size is 1GB now and I want to reset it. Will
> setting tachyon.user.default.block.size.byte=8M work? If not, what does
> tachyon.user.default.block.size.byte mean?
Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Let me describe my scenario:
--
I have a Spark cluster and a Tachyon cluster on 8 machines (24 cores and 16G memory per machine). On Tachyon, I created one table that contains 800M of data. When I run a SQL query on Shark, it takes 2.43s, but when I create the same table in Spark memory and run the same SQL, it takes 1.56s. Data on Tachyon takes more time than data in Spark memory. Both runs have 150 map processes, 16-20 per node.
I think the reason is that when the data is on Tachyon, Shark lets each Spark slave load data from the Tachyon slave on the same node.
I have tried several configuration settings to tune Shark and Tachyon, but still cannot make the former faster than 2.43s.
Does anyone have any ideas?

By the way, my Tachyon block size is 1GB now and I want to reset it. Will setting tachyon.user.default.block.size.byte=8M work? If not, what does tachyon.user.default.block.size.byte mean?

2014-07-14 13:13 GMT+08:00 qingyang li:

> Shark. Thanks for replying.
> Let me clarify my question again.
> --
> I create a table using "create table xxx1
> tblproperties("shark.cache"="tachyon") as select * from xxx2".
> When executing some SQL (for example, select * from xxx1) using Shark,
> Shark reads the data into Shark's memory from Tachyon's memory.
> I think that if Shark always loads data from Tachyon each time we execute
> SQL, it is less efficient.
> Could we use some cache policy (such as CacheAllPolicy, FIFOCachePolicy, or
> LRUCachePolicy) to cache the data and avoid reading it from Tachyon for
> each SQL query?
> --
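For reference, the figures in this message work out as follows. This is only arithmetic on the numbers quoted in the thread. Two things are assumptions, not facts from the thread: that tachyon.user.default.block.size.byte expects a plain byte count (a guess from the ".byte" suffix), and that the 800M table is treated as one contiguous file when counting blocks.

```python
# Query times reported in the thread, in seconds.
tachyon_s = 2.43
memory_s = 1.56

# Absolute and relative overhead of the Tachyon-backed table.
extra_s = tachyon_s - memory_s
slowdown_pct = extra_s / memory_s * 100
print(f"extra time: {extra_s:.2f}s ({slowdown_pct:.1f}% slower)")

# "8M" written out as a byte count (assumption: the property wants bytes).
MB = 1024 * 1024
block_size_bytes = 8 * MB
print(block_size_bytes)  # 8388608

# Number of 8 MB blocks needed for 800 MB, if it were a single file
# (ceiling division; the real table may be split into many files).
table_bytes = 800 * MB
blocks = -(-table_bytes // block_size_bytes)
print(blocks)  # 100
```

The relative overhead looks large mainly because the whole query takes under three seconds; the absolute difference is under one second.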
Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Shark. Thanks for replying.
Let me clarify my question again.
--
I create a table using "create table xxx1 tblproperties("shark.cache"="tachyon") as select * from xxx2".
When executing some SQL (for example, select * from xxx1) using Shark, Shark reads the data into Shark's memory from Tachyon's memory.
I think that if Shark always loads data from Tachyon each time we execute SQL, it is less efficient.
Could we use some cache policy (such as CacheAllPolicy, FIFOCachePolicy, or LRUCachePolicy) to cache the data and avoid reading it from Tachyon for each SQL query?
--

2014-07-14 2:47 GMT+08:00 Haoyuan Li:

> Qingyang,
>
> Are you asking about Spark or Shark? (The first email said "Shark", the
> last email said "Spark".)
>
> Best,
>
> Haoyuan
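As a rough illustration of what a least-recently-used policy (such as the LRUCachePolicy named in this message) does in general, here is a minimal Python sketch. It is not Shark's implementation: the class LRUCache, its methods, and the partition keys are hypothetical and only demonstrate the eviction behavior.

```python
from collections import OrderedDict


class LRUCache:
    """Keep at most `capacity` entries, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None  # cache miss: the caller would reload the data
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the oldest entry

    def keys(self):
        return list(self._entries)


cache = LRUCache(capacity=2)
cache.put("partition-0", b"a")
cache.put("partition-1", b"b")
cache.get("partition-0")        # partition-0 becomes the most recently used
cache.put("partition-2", b"c")  # over capacity: partition-1 is evicted
print(cache.keys())
```

Whether such a policy helps here depends on where Shark applies it, which this thread does not settle.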
Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Qingyang,

Are you asking about Spark or Shark? (The first email said "Shark", the last email said "Spark".)

Best,

Haoyuan

On Wed, Jul 9, 2014 at 7:40 PM, qingyang li wrote:

> Could I set some cache policy to let Spark load data from Tachyon only once
> for all SQL queries, for example by using CacheAllPolicy, FIFOCachePolicy,
> or LRUCachePolicy? I have tried those three policies, but they did not help.
> I think that if Spark always loads the data for each SQL query, it will hurt
> query speed; it will take more time than when the data is managed by Spark
> itself.

--
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Could I set some cache policy to let Spark load data from Tachyon only once for all SQL queries, for example by using CacheAllPolicy, FIFOCachePolicy, or LRUCachePolicy? I have tried those three policies, but they did not help.
I think that if Spark always loads the data for each SQL query, it will hurt query speed; it will take more time than when the data is managed by Spark itself.

2014-07-09 1:19 GMT+08:00 Haoyuan Li:

> Yes. For Shark, the two modes, "shark.cache=tachyon" and
> "shark.cache=memory", have the same ser/de overhead. In Tachyon mode, Shark
> loads data from outside of the process, with the following benefits:
>
> - In-memory data sharing across multiple Shark instances (i.e. stronger
>   isolation)
> - Instant recovery of in-memory tables
> - Reduced heap size => faster GC in Shark
> - If the table is larger than the memory size, only the hot columns will
>   be cached in memory
>
> from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html and
> https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon
>
> Haoyuan
Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Yes. For Shark, the two modes, "shark.cache=tachyon" and "shark.cache=memory", have the same ser/de overhead. In Tachyon mode, Shark loads data from outside of the process, with the following benefits:

- In-memory data sharing across multiple Shark instances (i.e. stronger isolation)
- Instant recovery of in-memory tables
- Reduced heap size => faster GC in Shark
- If the table is larger than the memory size, only the hot columns will be cached in memory

from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html and https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon

Haoyuan

On Tue, Jul 8, 2014 at 9:58 AM, Aaron Davidson wrote:

> Shark's in-memory format is already serialized (it's compressed and
> column-based).

--
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Shark's in-memory format is already serialized (it's compressed and column-based).

On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan wrote:

> You are ignoring serde costs :-)
>
> - Mridul
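To make "serialized, compressed and column-based" concrete, here is a small Python sketch of the general idea: each column is packed into its own byte buffer and compressed, so a query that needs one column only decodes that column. This is not Shark's actual format. The helpers encode_int_column and decode_int_column, the choice of zlib, and the little-endian 64-bit integer encoding are all assumptions made for the illustration.

```python
import struct
import zlib


def encode_int_column(values):
    """Pack integers as little-endian 64-bit values, then compress them."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return zlib.compress(raw)


def decode_int_column(blob):
    """Reverse encode_int_column: decompress, then unpack the integers."""
    raw = zlib.decompress(blob)
    count = len(raw) // 8  # each value occupies 8 bytes
    return list(struct.unpack(f"<{count}q", raw))


rows = [(1, 10), (2, 10), (3, 10), (4, 20)]

# Pivot rows into columns, then serialize each column on its own.
ids, groups = zip(*rows)
table = {"id": encode_int_column(ids), "group": encode_int_column(groups)}

# A query that touches only "group" decodes only that column.
print(decode_int_column(table["group"]))
```

In this picture the decode step is paid on every read regardless of where the bytes live, which is consistent with the point that both cache modes carry the same ser/de overhead.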
Re: on shark, is tachyon less efficient than memory_only cache strategy ?
You are ignoring serde costs :-)

- Mridul

On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson wrote:

> Tachyon should only be marginally less performant than memory_only, because
> we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer
> the data over a pipe from Tachyon; we can directly read from the buffers in
> the same way that Shark reads from its in-memory columnar format.
Re: on shark, is tachyon less efficient than memory_only cache strategy ?
Tachyon should only be marginally less performant than memory_only, because we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer the data over a pipe from Tachyon; we can directly read from the buffers in the same way that Shark reads from its in-memory columnar format.

On Tue, Jul 8, 2014 at 1:18 AM, qingyang li wrote:

> Hi, when I create a table, I can specify the cache strategy using
> shark.cache.
> I think "shark.cache=memory_only" means the data is managed by Spark and
> lives in the same JVM as the executor, while "shark.cache=tachyon" means
> the data is managed by Tachyon, which is off heap, and is not in the same
> JVM as the executor, so Spark will load the data from Tachyon for each SQL
> query. So, is Tachyon less efficient than the memory_only cache strategy?
> If yes, can we let Spark load all the data from Tachyon once for all SQL
> queries, if I want to use the Tachyon cache strategy because Tachyon is
> more HA than memory_only?
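The mmap point can be illustrated with a minimal sketch. It is not Shark or Tachyon code: it only shows, in Python, how a process can map a file (standing in for a file on Tachyon's ramdisk) and read a slice of it directly, without the whole file first being sent through a pipe or socket. The file name, the sample data, and the helper read_slice_mmap are made up for the illustration.

```python
import mmap
import os
import tempfile


def read_slice_mmap(path, offset, length):
    """Map the file read-only and return `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are brought in lazily on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


# Hypothetical stand-in for a column file kept on a ramdisk.
tmp_dir = tempfile.mkdtemp()
column_file = os.path.join(tmp_dir, "column.bin")
with open(column_file, "wb") as f:
    f.write(bytes(range(256)) * 4)  # 1024 bytes of sample data

# Read 4 bytes from the middle of the file without reading the rest.
chunk = read_slice_mmap(column_file, 256, 4)
print(chunk)
```

Only the pages that are actually touched get read, which is why mapping a ramdisk file can stay close to reading from the process's own memory.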