Hi Tomas,

Have you considered using something like https://www.alluxio.org/ for you
cache?  Seems like a possible solution for what your trying to do.

-Todd

On Tue, Jan 15, 2019 at 11:24 PM 大啊 <belie...@163.com> wrote:

> Hi ,Tomas.
> Thanks for your question give me some prompt.But the best way use cache
> usually stores smaller data.
> I think cache large data will consume memory or disk space too much.
> Spill the cached data in parquet format maybe a good improvement.
>
> At 2019-01-16 02:20:56, "Tomas Bartalos" <tomas.barta...@gmail.com> wrote:
>
> Hello,
>
> I'm using spark-thrift server and I'm searching for best performing
> solution to query hot set of data. I'm processing records with nested
> structure, containing subtypes and arrays. 1 record takes up several KB.
>
> I tried to make some improvement with cache table:
>
> cache table event_jan_01 as select * from events where day_registered =
> 20190102;
>
>
> If I understood correctly, the data should be stored in *in-memory
> columnar* format with storage level MEMORY_AND_DISK. So data which
> doesn't fit to memory will be spille to disk (I assume also in columnar
> format (?))
> I cached 1 day of data (1 M records) and according to spark UI storage tab
> none of the data was cached to memory and everything was spilled to disk.
> The size of the data was *5.7 GB.*
> Typical queries took ~ 20 sec.
>
> Then I tried to store the data to parquet format:
>
> CREATE TABLE event_jan_01_par USING parquet location "/tmp/events/jan/02"
> as
>
> select * from event_jan_01;
>
>
> The whole parquet took up only *178MB.*
> And typical queries took 5-10 sec.
>
> Is it possible to tune spark to spill the cached data in parquet format ?
> Why the whole cached table was spilled to disk and nothing stayed in
> memory ?
>
> Spark version: 2.4.0
>
> Best regards,
> Tomas
>
>
>
>
>

Reply via email to