Hi Tomas,

Have you considered using something like https://www.alluxio.org/ for your cache? It seems like a possible solution for what you're trying to do.
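As a rough sketch of what that could look like: if the hot day were written out as parquet on Alluxio, Spark can read it straight from the alluxio:// URI and expose it to the thrift server as a view. The master host/port and path below are placeholders, and the Alluxio client jar would need to be on Spark's classpath:

    // Sketch only; host, port and path are hypothetical.
    val events = spark.read.parquet("alluxio://alluxio-master:19998/events/jan/02")
    // Make it queryable from SQL clients under a table name.
    events.createOrReplaceTempView("event_jan_01")
    spark.sql("select count(*) from event_jan_01 where day_registered = 20190102").show()
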
-Todd

On Tue, Jan 15, 2019 at 11:24 PM 大啊 <belie...@163.com> wrote:

> Hi, Tomas.
> Thanks for your question; it prompted some thoughts. But a cache is
> usually best suited to storing smaller data sets.
> I think caching large data will consume too much memory or disk space.
> Spilling the cached data in parquet format might be a good improvement.
>
> At 2019-01-16 02:20:56, "Tomas Bartalos" <tomas.barta...@gmail.com> wrote:
>
> Hello,
>
> I'm using the spark-thrift server and I'm looking for the best-performing
> way to query a hot set of data. I'm processing records with a nested
> structure, containing subtypes and arrays. One record takes up several KB.
>
> I tried to make some improvement with CACHE TABLE:
>
> cache table event_jan_01 as select * from events where day_registered =
> 20190102;
>
> If I understood correctly, the data should be stored in *in-memory
> columnar* format with storage level MEMORY_AND_DISK, so data which doesn't
> fit in memory will be spilled to disk (I assume also in columnar format?).
> I cached 1 day of data (1 M records) and, according to the Spark UI
> storage tab, none of the data was cached in memory and everything was
> spilled to disk. The size of the data was *5.7 GB*.
> Typical queries took ~20 sec.
>
> Then I tried to store the data in parquet format:
>
> CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02"
> as
> select * from event_jan_01;
>
> The whole parquet table took up only *178 MB*,
> and typical queries took 5-10 sec.
>
> Is it possible to tune Spark to spill the cached data in parquet format?
> Why was the whole cached table spilled to disk, with nothing staying in
> memory?
>
> Spark version: 2.4.0
>
> Best regards,
> Tomas
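For anyone following the thread: as far as I know, Spark's table cache uses its own in-memory columnar batches (tunable via the spark.sql.inMemoryColumnarStorage.* settings), and spilled blocks are not written as parquet, so there is no switch for parquet spilling; what can be chosen is the storage level. A minimal sketch of doing that explicitly from the Scala API instead of the SQL CACHE TABLE statement, where the table name and the chosen level are just examples:

    import org.apache.spark.storage.StorageLevel

    // Sketch only (spark-shell); assumes an "events" table exists.
    val day = spark.table("events").where("day_registered = 20190102")
    day.createOrReplaceTempView("event_jan_01")

    // Cache the view with an explicit storage level instead of the
    // default MEMORY_AND_DISK (Catalog.cacheTable overload, Spark 2.3+).
    spark.catalog.cacheTable("event_jan_01", StorageLevel.MEMORY_ONLY)

    // Force materialization, then check the Storage tab for the actual size.
    spark.table("event_jan_01").count()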