Re: SQL group by on Parquet table slower when table cached

Michael Armbrust Mon, 09 Feb 2015 14:38:54 -0800

You'll probably only get good compression for strings when dictionary
encoding works.  We don't optimize decimals in the in-memory columnar
storage, so you are likely paying for expensive serialization there.
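
To illustrate the dictionary-encoding point, here is a minimal Python sketch.
It is not Spark's actual implementation: the byte layout (4-byte length prefix
plus UTF-8 bytes per string), the 2-byte code width, and the sample column
values are all assumptions made up for the example. It only shows why a string
column with few distinct values shrinks a lot, while one with mostly unique
values does not shrink at all.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with an integer index into a list of distinct values."""
    index_of: Dict[str, int] = {}
    dictionary: List[str] = []
    codes: List[int] = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


def plain_size(values: List[str]) -> int:
    """Bytes for an unencoded column: assumed 4-byte length prefix + UTF-8 bytes."""
    return sum(4 + len(value.encode("utf-8")) for value in values)


def encoded_size(dictionary: List[str], codes: List[int], code_width: int = 2) -> int:
    """Bytes for the dictionary itself plus one fixed-width code per row (assumed)."""
    return plain_size(dictionary) + code_width * len(codes)


# Hypothetical columns: 5 distinct values vs. 10,000 distinct values.
low_cardinality = ["region-%d" % (i % 5) for i in range(10000)]
high_cardinality = ["customer-%d" % i for i in range(10000)]

if __name__ == "__main__":
    for name, column in [("low", low_cardinality), ("high", high_cardinality)]:
        dictionary, codes = dictionary_encode(column)
        print(name, plain_size(column), encoded_size(dictionary, codes))
```

For the high-cardinality column the encoded form is larger than the plain one,
because the dictionary holds every value and the codes come on top of it.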


On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel <manojsamelt...@gmail.com>
wrote:

> Flat data of types String, Int, and a couple of decimal(14,4) columns.
>
> On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Is this nested data or flat data?
>>
>> On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel <manojsamelt...@gmail.com>
>> wrote:
>>
>>> Hi Michael,
>>>
>>> The storage tab shows the RDD resides fully in memory (10 partitions)
>>> with zero disk usage. Tasks for a subsequent select on this cached table
>>> show minimal overhead (GC, queueing, shuffle write, etc.), so overhead
>>> is not the issue. However, it is still twice as slow as reading the
>>> uncached table.
>>>
>>> I have
>>>   spark.rdd.compress = true
>>>   spark.sql.inMemoryColumnarStorage.compressed = true
>>>   spark.serializer = org.apache.spark.serializer.KryoSerializer
>>>
>>> Something that may be of relevance ...
>>>
>>> The underlying table is Parquet, 10 partitions totaling ~350 MB. The
>>> mapPartition phase of the query on the uncached table shows an input size
>>> of 351 MB. However, after the table is cached, the storage tab shows the
>>> cache size as 12 GB. So the in-memory representation seems much bigger
>>> than the on-disk one, even with the compression options turned on. Any
>>> thoughts on this?
>>>
>>> The mapPartition phase of the same query on the cached table shows an
>>> input size of 12 GB (the full size of the cached table) and takes twice as
>>> long as the mapPartition phase of the uncached query.
>>>
>>> Thanks,
>>>
>>> On Fri, Feb 6, 2015 at 6:47 PM, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>>> Check the storage tab.  Does the table actually fit in memory?
>>>> Otherwise you are rebuilding column buffers in addition to reading the data
>>>> off of the disk.
>>>>
>>>> On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel <manojsamelt...@gmail.com>
>>>> wrote:
>>>>
>>>>> Spark 1.2
>>>>>
>>>>> Data stored in parquet table (large number of rows)
>>>>>
>>>>> Test 1
>>>>>
>>>>> select a, sum(b), sum(c) from table group by a
>>>>>
>>>>> Test 2
>>>>>
>>>>> sqlContext.cacheTable("table")
>>>>> select a, sum(b), sum(c) from table group by a  - "seed cache". The first
>>>>> run is slow since it is loading the cache?
>>>>> select a, sum(b), sum(c) from table group by a  - The second run should be
>>>>> faster, as it should be reading from the cache, not HDFS. But it is slower
>>>>> than Test 1.
>>>>>
>>>>> Any thoughts? Should a different query be used to seed the cache?
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>
>>>
>>
>
