You will probably only get good compression on the string columns, and only where dictionary encoding works. We don't optimize decimals in the in-memory columnar storage, so you are likely paying for expensive serialization there.
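
To make the dictionary encoding point concrete, here is a toy sketch in plain Python. It is not Spark's actual column builder: the function names, the sample values, and the fixed 2-byte code width are all assumptions chosen for illustration. It only shows why a string column with few distinct values shrinks while a column of unique strings does not.

```python
def dictionary_encode(values):
    """Replace each value with an integer code into a table of distinct values."""
    dictionary = {}
    codes = []
    for v in values:
        # First occurrence gets the next free code; repeats reuse it.
        codes.append(dictionary.setdefault(v, len(dictionary)))
    # Dict keys keep insertion order, so position in the list equals the code.
    return list(dictionary), codes


def approx_sizes(values, code_bytes=2):
    """Rough byte counts: raw UTF-8 strings vs. dictionary plus fixed-width codes.

    code_bytes=2 is an assumed code width, not a Spark setting.
    """
    raw = sum(len(v.encode("utf-8")) for v in values)
    distinct, codes = dictionary_encode(values)
    encoded = sum(len(v.encode("utf-8")) for v in distinct) + code_bytes * len(codes)
    return raw, encoded


# Hypothetical columns: 10,000 rows each.
low_cardinality = ["United States", "Germany", "France", "United States"] * 2500
high_cardinality = [f"user-{i:06d}" for i in range(10000)]

low_raw, low_enc = approx_sizes(low_cardinality)
high_raw, high_enc = approx_sizes(high_cardinality)

print(low_raw, low_enc)    # 97500 20026
print(high_raw, high_enc)  # 110000 130000
```

With only three distinct values the encoded column is about a fifth of the raw size. When every value is unique, the dictionary is as large as the data and the codes are pure overhead.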

On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:

> Flat data of types String, Int and couple of decimal(14,4)
>
> On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Is this nested data or flat data?
>>
>> On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel <manojsamelt...@gmail.com>
>> wrote:
>>
>>> Hi Michael,
>>>
>>> The storage tab shows the RDD resides fully in memory (10 partitions)
>>> with zero disk usage. Tasks for subsequent select on this table in cache
>>> shows minimal overheads (GC, queueing, shuffle write etc. etc.), so
>>> overhead is not issue. However, it is still twice as slow as reading
>>> uncached table.
>>>
>>> I have spark.rdd.compress = true,
>>> spark.sql.inMemoryColumnarStorage.compressed = true,
>>> spark.serializer = org.apache.spark.serializer.KryoSerializer
>>>
>>> Something that may be of relevance ...
>>>
>>> The underlying table is Parquet, 10 partitions totaling ~350 MB. For
>>> mapPartition phase of query on uncached table shows input size of 351 MB.
>>> However, after the table is cached, the storage shows the cache size as
>>> 12GB. So the in-memory representation seems much bigger than on-disk, even
>>> with the compression options turned on. Any thoughts on this ?
>>>
>>> mapPartition phase same query for cache table shows input size of 12GB
>>> (full size of cache table) and takes twice the time as mapPartition for
>>> uncached query.
>>>
>>> Thanks,
>>>
>>> On Fri, Feb 6, 2015 at 6:47 PM, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>>> Check the storage tab. Does the table actually fit in memory?
>>>> Otherwise you are rebuilding column buffers in addition to reading the data
>>>> off of the disk.
>>>>
>>>> On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel <manojsamelt...@gmail.com>
>>>> wrote:
>>>>
>>>>> Spark 1.2
>>>>>
>>>>> Data stored in parquet table (large number of rows)
>>>>>
>>>>> Test 1
>>>>>
>>>>> select a, sum(b), sum(c) from table
>>>>>
>>>>> Test
>>>>>
>>>>> sqlContext.cacheTable()
>>>>> select a, sum(b), sum(c) from table - "seed cache" First time slow
>>>>> since loading cache ?
>>>>> select a, sum(b), sum(c) from table - Second time it should be faster
>>>>> as it should be reading from cache, not HDFS. But it is slower than test1
>>>>>
>>>>> Any thoughts? Should a different query be used to seed cache ?
>>>>>
>>>>> Thanks,