Hi,

Parquet data are stored column-wise and highly compressed, so the size of the deserialized rows in Spark can be larger than that of the Parquet data on disk. That is, I think the 24.59 GB of Parquet data becomes the (18.1 GB + 23.6 GB) of data you see in Spark.
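In case it helps, a minimal sketch of one way to shrink that footprint (Spark 1.x API; the serialized storage level is only a suggestion, not what was used in this thread):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.storage.StorageLevel

    // Assumes an existing SparkContext `sc` and a registered table "video1203",
    // as in this thread.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.table("video1203")

    // MEMORY_AND_DISK_SER stores the cached batches in serialized form, which is
    // usually more compact than the deserialized level that cacheTable() reports
    // on the Storage page, at the cost of some extra CPU.
    df.persist(StorageLevel.MEMORY_AND_DISK_SER)
    df.count()  // materialize the cache

    // The resulting sizes then show up under the "Storage" tab of the Spark UI.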
Yes, cached data in Spark are also compressed by default, though Spark uses simpler compression algorithms than Parquet does, and ISTM the compression ratios are typically worse than those of Parquet.

On Thu, Feb 4, 2016 at 3:16 PM, fightf...@163.com <fightf...@163.com> wrote:

> Hi,
> Thanks a lot for your explanation. I know that the slow processing was mainly
> caused by GC pressure, and I now understand this difference thanks to your
> advice.
>
> I gave each executor 6 GB of memory and tried to cache the table.
> I had 3 executors, and I can see the following info on the Spark job UI
> storage page:
>
> RDD Name: In-memory table video1203
> Storage Level: Memory Deserialized 1x Replicated
> Cached Partitions: 251
> Fraction Cached: 100%
> Size in Memory: 18.1 GB
> Size in ExternalBlockStore: 0.0 B
> Size on Disk: 23.6 GB
>
> I can see that Spark SQL tries to cache the data in memory, and when I ran the
> following queries over this table video1203, I got fast responses. Another
> thing that confused me is the data sizes above (in memory and on disk). I can
> see that the in-memory data size is 18.1 GB, which almost equals the sum of my
> executor memory. But why is the size on disk 23.6 GB? From Impala I get an
> overall Parquet file size of about 24.59 GB. It would be good to have some
> clarification on this.
>
> Best,
> Sun.
>
> ------------------------------
> fightf...@163.com
>
>
> *From:* Prabhu Joseph <prabhujose.ga...@gmail.com>
> *Date:* 2016-02-04 14:35
> *To:* fightf...@163.com
> *CC:* user <user@spark.apache.org>
> *Subject:* Re: About cache table performance in spark sql
> Sun,
>
> When an executor doesn't have enough memory and it tries to cache the data,
> it spends a lot of time on GC and hence the job will be slow. Either:
>
> 1. allocate enough memory to cache all the RDDs, so the job completes fast,
> or 2. don't use cache when there is not enough executor memory.
>
> To check the GC time, use --conf
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps" while submitting the job, and SPARK_WORKER_DIR will
> contain the sysout with the GC log.
> The sysout will show many "Full GC" events when cache is used and the
> executor does not have enough heap.
>
>
> Thanks,
> Prabhu Joseph
>
> On Thu, Feb 4, 2016 at 11:25 AM, fightf...@163.com <fightf...@163.com>
> wrote:
>
>> Hi,
>>
>> I want to make sure that caching a table would indeed accelerate SQL
>> queries. Here is one of my use cases:
>> Impala table size: 24.59 GB, no partitions, with about 1 billion+ rows.
>> I use sqlContext.sql to run queries over this table and tried the cache
>> and uncache commands to see if there is any performance disparity. I ran
>> the following query:
>> select * from video1203 where id > 10 and id < 20 and added_year != 1989
>> I saw the following results:
>>
>> 1. If I did not cache the table and just ran sqlContext.sql(), the above
>> query ran in about 25 seconds.
>> 2. If I first ran sqlContext.cacheTable("video1203"), the query ran super
>> slowly and caused a driver OOM exception, but I got the final results
>> after about 9 minutes.
>>
>> Could any expert explain this for me? I can see that cacheTable causes
>> the OOM simply because the in-memory columnar storage cannot hold the
>> 24.59 GB+ table in memory. But why is the performance so different, and
>> even so bad?
>>
>> Best,
>> Sun.
>>
>> ------------------------------
>> fightf...@163.com
>>

-- 
---
Takeshi Yamamuro
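For reference, the GC-logging options that Prabhu suggests in the quoted thread can also be set programmatically instead of on the spark-submit command line. A minimal sketch; only the two JVM flags and the 6 GB executor memory come from the thread, the rest is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Only the two -XX flags and the 6g value mirror the thread; the app name
    // is a placeholder.
    val conf = new SparkConf()
      .setAppName("cache-table-gc-check")
      .set("spark.executor.memory", "6g")
      .set("spark.executor.extraJavaOptions",
           "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

    val sc = new SparkContext(conf)
    // Executor stdout (under SPARK_WORKER_DIR on each worker) then contains the
    // GC log; frequent "Full GC" entries while caching indicate the heap is too
    // small to hold the cached table.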