Let me describe my setup:
----------------------
I have 8 machines (24 cores and 16 GB of memory each) running both a Spark cluster and a Tachyon cluster. On Tachyon I created a table containing 800M of data. When I run a SQL query against it through Shark, it takes 2.43s. When I create the same table in Spark memory and run the same SQL, it takes 1.56s. So querying the data on Tachyon takes longer than querying the data in Spark memory. Both runs have 150 map tasks, with 16-20 map tasks per node.

I think the reason is that when the data is on Tachyon, Shark makes each Spark slave load the data from the Tachyon slave on the same node. I have tried setting some configuration options to tune Shark and Tachyon, but I still cannot get the Tachyon case below 2.43s. Does anyone have any ideas?
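To put the numbers above side by side, here is a small Python sketch that works out the slowdown and the task distribution. It only does arithmetic on the figures quoted in this message. The one assumption is that "800M" means 800 MB of table data (it could also mean 800 million rows), so the per-task size is illustrative only. The names `summarize` and `ASSUMED_TABLE_MB` are made up for this sketch.

```python
# Figures quoted in the message above.
TACHYON_SECONDS = 2.43   # query time with the table cached in Tachyon
MEMORY_SECONDS = 1.56    # query time with the table cached in Spark memory
MACHINES = 8
MAP_TASKS = 150

# Assumption (not stated in the message): "800M" is read as 800 MB.
ASSUMED_TABLE_MB = 800


def summarize():
    """Return simple derived figures for the two runs."""
    return {
        # How many times slower the Tachyon-backed query is.
        "slowdown": TACHYON_SECONDS / MEMORY_SECONDS,
        # Absolute extra time per query, in seconds.
        "extra_seconds": TACHYON_SECONDS - MEMORY_SECONDS,
        # Average map tasks per machine if tasks are spread evenly.
        "tasks_per_machine": MAP_TASKS / MACHINES,
        # Average data per map task, under the 800 MB assumption.
        "mb_per_task": ASSUMED_TABLE_MB / MAP_TASKS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

The average of 18.75 tasks per machine is consistent with the 16-20 map tasks per node observed, so the tasks appear to be spread roughly evenly in both runs.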
By the way, my Tachyon block size is 1GB now and I want to change it. Will setting tachyon.user.default.block.size.byte=8M work? If not, what does tachyon.user.default.block.size.byte mean?

2014-07-14 13:13 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:

> Shark, thanks for replying.
> Let me clarify my question again.
> ----------------------------------------------
> I create a table using "create table xxx1
> tblproperties("shark.cache"="tachyon") as select * from xxx2".
> When executing some SQL (for example, select * from xxx1) using Shark,
> Shark will read the data into Shark's memory from Tachyon's memory.
> I think that if Shark always loads the data from Tachyon each time we
> execute SQL, it is less efficient.
> Could we use some cache policy (such as CacheAllPolicy, FIFOCachePolicy,
> LRUCachePolicy) to cache the data and avoid reading it from Tachyon for
> each SQL query?
> ----------------------------------------------
>
> 2014-07-14 2:47 GMT+08:00 Haoyuan Li <haoyuan...@gmail.com>:
>
>> Qingyang,
>>
>> Are you asking about Spark or Shark? (The first email was "Shark", the
>> last email was "Spark".)
>>
>> Best,
>>
>> Haoyuan
>>
>>
>> On Wed, Jul 9, 2014 at 7:40 PM, qingyang li <liqingyang1...@gmail.com>
>> wrote:
>>
>> > Could I set some cache policy to let Spark load data from Tachyon only
>> > one time for all SQL queries? For example, by using CacheAllPolicy,
>> > FIFOCachePolicy or LRUCachePolicy. I have tried those three policies,
>> > but they did not help.
>> > I think that if Spark always loads the data for each SQL query, it will
>> > hurt query speed; it will take more time than the case where the data
>> > is managed by Spark itself.
>> >
>> >
>> > 2014-07-09 1:19 GMT+08:00 Haoyuan Li <haoyuan...@gmail.com>:
>> >
>> > > Yes. For Shark, two modes, "shark.cache=tachyon" and
>> > > "shark.cache=memory", have the same ser/de overhead. Shark loads data
>> > > from outside of the process in Tachyon mode with the following
>> > > benefits:
>> > >
>> > >    - In-memory data sharing across multiple Shark instances (i.e.
>> > >    stronger isolation)
>> > >    - Instant recovery of in-memory tables
>> > >    - Reduce heap size => faster GC in shark
>> > >    - If the table is larger than the memory size, only the hot columns
>> > >    will be cached in memory
>> > >
>> > > from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html
>> > > and https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon
>> > >
>> > > Haoyuan
>> > >
>> > >
>> > > On Tue, Jul 8, 2014 at 9:58 AM, Aaron Davidson <ilike...@gmail.com>
>> > > wrote:
>> > >
>> > > > Shark's in-memory format is already serialized (it's compressed and
>> > > > column-based).
>> > > >
>> > > >
>> > > > On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan
>> > > > <mri...@gmail.com> wrote:
>> > > >
>> > > > > You are ignoring serde costs :-)
>> > > > >
>> > > > > - Mridul
>> > > > >
>> > > > > On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson
>> > > > > <ilike...@gmail.com> wrote:
>> > > > > > Tachyon should only be marginally less performant than
>> > > > > > memory_only, because we mmap the data from Tachyon's ramdisk.
>> > > > > > We do not have to, say, transfer the data over a pipe from
>> > > > > > Tachyon; we can directly read from the buffers in the same way
>> > > > > > that Shark reads from its in-memory columnar format.
>> > > > > >
>> > > > > >
>> > > > > > On Tue, Jul 8, 2014 at 1:18 AM, qingyang li
>> > > > > > <liqingyang1...@gmail.com> wrote:
>> > > > > >
>> > > > > >> Hi, when I create a table, I can specify the cache strategy
>> > > > > >> using shark.cache.
>> > > > > >> I think "shark.cache=memory_only" means the data is managed by
>> > > > > >> Spark and lives in the same JVM as the executor, while
>> > > > > >> "shark.cache=tachyon" means the data is managed by Tachyon,
>> > > > > >> which is off heap, and the data is not in the same JVM as the
>> > > > > >> executor, so Spark will load the data from Tachyon for each
>> > > > > >> SQL query. So, is Tachyon less efficient than the memory_only
>> > > > > >> cache strategy?
>> > > > > >> If yes, can we let Spark load all the data once from Tachyon
>> > > > > >> for all SQL queries, if I want to use the Tachyon cache
>> > > > > >> strategy because Tachyon is more HA than memory_only?
>> > > > > >>
>> > >
>> > >
>> > > --
>> > > Haoyuan Li
>> > > AMPLab, EECS, UC Berkeley
>> > > http://www.cs.berkeley.edu/~haoyuan/
>> > >
>>
>>
>> --
>> Haoyuan Li
>> AMPLab, EECS, UC Berkeley
>> http://www.cs.berkeley.edu/~haoyuan/
>>