Hello, I'm using Spark Thrift Server and I'm looking for the best-performing way to query a hot set of data. I'm processing records with a nested structure, containing subtypes and arrays; one record takes up several KB.
I tried to make some improvement with CACHE TABLE:

    cache table event_jan_01 as select * from events where day_registered = 20190102;

If I understood correctly, the data should be stored in *in-memory columnar* format with storage level MEMORY_AND_DISK, so data which doesn't fit into memory is spilled to disk (I assume also in columnar format?).

I cached one day of data (1 M records), and according to the Spark UI storage tab none of it stayed in memory; everything was spilled to disk. The size of the cached data was *5.7 GB*, and typical queries took ~20 sec.

Then I tried to store the same data in Parquet format:

    CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02" as select * from event_jan_01;

The whole Parquet table took up only *178 MB*, and typical queries took 5-10 sec.

Is it possible to tune Spark to spill the cached data in Parquet format?
Why was the whole cached table spilled to disk, with nothing staying in memory?

Spark version: 2.4.0

Best regards,
Tomas
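P.S. For reference, the only cache-related settings I have found so far are the in-memory columnar ones from the SQL performance tuning guide. A minimal sketch of what I would run through beeline against the Thrift Server (the values shown are just the documented defaults, not a recommendation):

    -- in-memory columnar cache settings from the Spark SQL tuning guide
    SET spark.sql.inMemoryColumnarStorage.compressed=true;   -- compress cached column batches
    SET spark.sql.inMemoryColumnarStorage.batchSize=10000;   -- rows per column batch (larger batches may compress better but risk OOM)
    cache table event_jan_01 as select * from events where day_registered = 20190102;

I don't know whether any of these would change the spilling behaviour I'm seeing, which is why I'm asking here.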