Hello,

I'm using the Spark Thrift Server and I'm looking for the best-performing
solution for querying a hot set of data. I'm processing records with a nested
structure, containing subtypes and arrays; one record takes up several KB.

I tried to improve things with CACHE TABLE:

CACHE TABLE event_jan_01 AS SELECT * FROM events WHERE day_registered =
20190102;
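
(For what it's worth, I used the eager form above; as far as I know CACHE
TABLE also accepts a LAZY keyword that defers materialization to the first
query. A sketch, with an illustrative table name and date:

CACHE LAZY TABLE event_jan_02 AS SELECT * FROM events WHERE day_registered =
20190103;
)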


If I understood correctly, the data should be stored in an *in-memory
columnar* format with storage level MEMORY_AND_DISK, so data which doesn't
fit in memory will be spilled to disk (I assume also in columnar format?).
I cached 1 day of data (1 M records), and according to the Spark UI storage
tab none of the data was cached in memory; everything was spilled to disk.
The size of the cached data was *5.7 GB*.
Typical queries took ~20 sec.
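
For context, if I read the docs right, the storage pool that backs the cache
is bounded by the executor heap: roughly (heap - 300 MB reserved) *
spark.memory.fraction, with spark.memory.storageFraction of that protected
from eviction (0.6 and 0.5 are the documented defaults). Over JDBC a bare SET
just echoes the current value:

SET spark.memory.fraction;
SET spark.memory.storageFraction;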

Then I tried storing the data in Parquet format:

CREATE TABLE event_jan_01_par USING parquet LOCATION "/tmp/events/jan/02" AS
SELECT * FROM event_jan_01;


The whole Parquet table took up only *178 MB*,
and typical queries took 5-10 sec.
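
For completeness, the only knobs I'm aware of for the cached columnar format
are these (the values shown are, as far as I know, already the 2.4 defaults,
and the per-batch compression clearly doesn't get anywhere near Parquet's
encodings):

SET spark.sql.inMemoryColumnarStorage.compressed=true;
SET spark.sql.inMemoryColumnarStorage.batchSize=10000;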

Is it possible to tune Spark to spill the cached data in Parquet format?
Why was the whole cached table spilled to disk while nothing stayed in
memory?
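
The closest knob I've come across is a storage-level option on CACHE TABLE;
I'm not sure it's available in 2.4, and it would still use the columnar cache
format rather than Parquet, but I mean something along these lines:

CACHE TABLE event_jan_01 OPTIONS ('storageLevel' 'MEMORY_ONLY') AS
SELECT * FROM events WHERE day_registered = 20190102;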

Spark version: 2.4.0

Best regards,
Tomas
