Oh, thanks. Make sense to me. 

Best,
Sun.



fightf...@163.com
 
From: Takeshi Yamamuro
Date: 2016-02-04 16:01
To: fightf...@163.com
CC: user
Subject: Re: Re: About cache table performance in spark sql
Hi,

Parquet data are column-wise and highly compressed, so the size of deserialized 
rows in spark
could be bigger than that of parquet data on disk.
That is, I think that  24.59GB of parquet data becomes (18.1GB + 23.6GB) data 
in spark.

Yes, you know cached data in spark also are compressed by default though, 
spark uses simpler compression algorithms than parquet does and
ISTM the compression ratios are typically worse than those of parquet.


On Thu, Feb 4, 2016 at 3:16 PM, fightf...@163.com <fightf...@163.com> wrote:
Hi, 
Thanks a lot for your explaination. I know that the slow process mainly caused 
by GC pressure and I had understand this difference 
just from your advice. 

I had each executor memory with 6GB and try to cache table. 
I had 3 executors and finally I can see some info from the spark job ui 
storage, like the following: 
 

RDD Name Storage Level Cached Partitions Fraction Cached Size in Memory Size in 
ExternalBlockStore Size on Disk
In-memory table video1203 Memory Deserialized 1x Replicated 251 100% 18.1 GB 
0.0 B 23.6 GB

I can see that spark sql try to cache data into memory. And when I ran the 
following queries over this table video1203, I can get
fast response. Another thing that confused me is that the above data size (in 
memory and on Disk). I can see that the in memory
data size is 18.1GB, which almost equals sum of my executor memory. But why the 
Disk size if 23.6GB? From impala I get the overall
parquet file size if about 24.59GB. Would be good to had some correction on 
this. 

Best,
Sun.



fightf...@163.com
 
From: Prabhu Joseph
Date: 2016-02-04 14:35
To: fightf...@163.com
CC: user
Subject: Re: About cache table performance in spark sql
Sun,

   When Executor don't have enough memory and if it tries to cache the data, it 
spends lot of time on GC and hence the job will be slow. Either,

     1. We should allocate enough memory to cache all RDD and hence the job 
will complete fast
Or 2. Don't use cache when there is not enough Executor memory.

  To check the GC time, use  --conf 
"spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" 
while submitting the job and SPARK_WORKER_DIR will have sysout with GC.
The sysout will show many "Full GC" happening when cache is used and executor 
does not have enough heap.


Thanks,
Prabhu Joseph

On Thu, Feb 4, 2016 at 11:25 AM, fightf...@163.com <fightf...@163.com> wrote:
Hi, 

I want to make sure that the cache table indeed would accelerate sql queries. 
Here is one of my use case : 
  impala table size : 24.59GB,  no partitions, with about 1 billion+ rows.
I use sqlContext.sql to run queries over this table and try to do cache and 
uncache command to see if there
is any performance disparity. I ran the following query : select * from 
video1203 where id > 10 and id < 20 and added_year != 1989
I can see the following results : 

1  If I did not run cache table and just ran sqlContext.sql(), I can see the 
above query run about 25 seconds.
2  If I firstly run sqlContext.cacheTable("video1203"), the query runs super 
slow and would cause driver OOM exception, but I can 
get final results with about running 9 minuts. 

Would any expert can explain this for me ? I can see that cacheTable cause OOM 
just because the in-memory columnar storage 
cannot hold the 24.59GB+ table size into memory. But why the performance is so 
different and even so bad ? 

Best,
Sun.



fightf...@163.com




-- 
---
Takeshi Yamamuro

Reply via email to