Hi Michael,

The storage tab shows the RDD resides fully in memory (10 partitions) with
zero disk usage. Tasks for a subsequent select on the cached table show
minimal overhead (GC, queueing, shuffle write, etc.), so overhead is not
the issue. However, the query is still twice as slow as reading the
uncached table.

I have set spark.rdd.compress = true, spark.sql.inMemoryColumnarStorage.compressed
= true, and spark.serializer = org.apache.spark.serializer.KryoSerializer.
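
In case the setup matters, here is roughly how the context is configured (a
minimal sketch; in practice these are set in spark-defaults.conf):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  // Same settings as above, applied programmatically
  val conf = new SparkConf()
    .set("spark.rdd.compress", "true")
    .set("spark.sql.inMemoryColumnarStorage.compressed", "true")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)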

Something that may be relevant ...

The underlying table is Parquet, 10 partitions totaling ~350 MB. The
mapPartitions phase of the query on the uncached table shows an input size
of 351 MB. However, after the table is cached, the storage tab shows the
cache size as 12 GB. So the in-memory representation appears to be roughly
35x larger than the on-disk one, even with the compression options turned
on. Any thoughts on this?
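
In case it helps, I can cross-check the 12 GB figure programmatically rather
than from the UI; this is a sketch using the developer API getRDDStorageInfo
and the same table name as in the queries below:

  // Materialize the cache, then dump per-RDD memory/disk usage
  sqlContext.cacheTable("table")
  sqlContext.sql("select a, sum(b), sum(c) from table").collect()

  sc.getRDDStorageInfo.foreach { info =>
    println(s"${info.name}: mem=${info.memSize} B, disk=${info.diskSize} B, " +
      s"cached ${info.numCachedPartitions}/${info.numPartitions} partitions")
  }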

The mapPartitions phase of the same query on the cached table shows an input
size of 12 GB (the full size of the cached table) and takes twice as long as
the mapPartitions phase of the uncached query.
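
For completeness, the timings come from something like this (a sketch; just a
wall-clock timer around collect()):

  // Crude wall-clock timing of the uncached vs. cached scans
  def time[T](label: String)(f: => T): T = {
    val start = System.nanoTime()
    val result = f
    println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
    result
  }

  sqlContext.uncacheTable("table")
  time("uncached")   { sqlContext.sql("select a, sum(b), sum(c) from table").collect() }
  sqlContext.cacheTable("table")
  time("seed cache") { sqlContext.sql("select a, sum(b), sum(c) from table").collect() }
  time("cached")     { sqlContext.sql("select a, sum(b), sum(c) from table").collect() }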

Thanks,

On Fri, Feb 6, 2015 at 6:47 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Check the storage tab.  Does the table actually fit in memory? Otherwise
> you are rebuilding column buffers in addition to reading the data off of
> the disk.
>
> On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel <manojsamelt...@gmail.com>
> wrote:
>
>> Spark 1.2
>>
>> Data stored in a Parquet table (large number of rows)
>>
>> Test 1
>>
>> select a, sum(b), sum(c) from table
>>
>> Test 2
>>
>> sqlContext.cacheTable("table")
>> select a, sum(b), sum(c) from table  - "seed cache"; first time is slow
>> since it is loading the cache?
>> select a, sum(b), sum(c) from table  - second time it should be faster, as
>> it should read from the cache, not HDFS. But it is slower than Test 1.
>>
>> Any thoughts? Should a different query be used to seed the cache?
>>
>> Thanks,
>>
>>
>
