Thanks Michael for the pointer, and sorry for the delayed reply. Taking a quick inventory of the scope of the change: is the new column type for Decimal caching needed only in the caching layer (the four files in org.apache.spark.sql.columnar: ColumnAccessor.scala, ColumnBuilder.scala, ColumnStats.scala, ColumnType.scala), or do other SQL components also need to be touched?
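To make the question concrete, here is the shape of the "long with decoration" encoding I had in mind earlier in the thread, as a self-contained sketch. This is plain JVM code with illustrative names, not the actual ColumnType API; for the real change I would mirror how the native types are implemented in ColumnType.scala:

    import java.math.BigDecimal
    import java.nio.ByteBuffer

    // Illustrative encoding for decimal(14,4): store the unscaled value as a
    // Long (8 bytes per value) instead of a serialized object. 14 digits
    // always fit in a Long, so this is lossless for this precision/scale.
    object FixedDecimalCodec {
      val Scale = 4

      def append(v: BigDecimal, buf: ByteBuffer): Unit =
        // longValueExact needs Java 8; longValue() also works here since
        // any 14-digit unscaled value fits in a Long.
        buf.putLong(v.setScale(Scale).unscaledValue().longValueExact())

      def extract(buf: ByteBuffer): BigDecimal =
        BigDecimal.valueOf(buf.getLong(), Scale) // re-attach the scale
    }

    // Round-trip of one value:
    //   val buf = ByteBuffer.allocate(8)
    //   FixedDecimalCodec.append(new BigDecimal("12345.6789"), buf)
    //   buf.flip()
    //   FixedDecimalCodec.extract(buf)  // => 12345.6789

If that encoding is sound, my guess is the four columnar files cover the read/write path, but I am not sure what else assumes the current fixed set of column types.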
Hoping for quick feedback off the top of your head ...

Thanks,

On Mon, Feb 9, 2015 at 3:16 PM, Michael Armbrust <mich...@databricks.com> wrote:

> You could add a new ColumnType
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala>.
>
> PRs welcome :)
>
> On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>
>> Hi Michael,
>>
>> As a test, I have the same data loaded as another parquet table, except
>> with the two decimal(14,4) columns replaced by double. With this, the
>> on-disk size is ~345 MB, the in-memory size is 2 GB (vs. 12 GB), and the
>> cached query runs in half the time of the uncached query.
>>
>> Would it be possible for Spark to store the in-memory decimal in some
>> form of long with decoration?
>>
>> For the immediate future, is there any hook we can use to provide custom
>> caching / processing for the decimal type in the RDD so that other
>> semantics do not change?
>>
>> Thanks,
>>
>> On Mon, Feb 9, 2015 at 2:41 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>
>>> Could you share which data types are optimized in the in-memory storage
>>> and how they are optimized?
>>>
>>> On Mon, Feb 9, 2015 at 2:33 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>>> You'll probably only get good compression for strings when dictionary
>>>> encoding works. We don't optimize decimals in the in-memory columnar
>>>> storage, so you are likely paying for expensive serialization there.
>>>>
>>>> On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>>>
>>>>> Flat data of types String, Int, and a couple of decimal(14,4).
>>>>>
>>>>> On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>
>>>>>> Is this nested data or flat data?
>>>>>>
>>>>>> On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Michael,
>>>>>>>
>>>>>>> The storage tab shows the RDD resides fully in memory (10
>>>>>>> partitions) with zero disk usage. Tasks for a subsequent select on
>>>>>>> this cached table show minimal overheads (GC, queueing, shuffle
>>>>>>> write, etc.), so overhead is not the issue. However, it is still
>>>>>>> twice as slow as reading the uncached table.
>>>>>>>
>>>>>>> I have spark.rdd.compress = true,
>>>>>>> spark.sql.inMemoryColumnarStorage.compressed = true, and
>>>>>>> spark.serializer = org.apache.spark.serializer.KryoSerializer.
>>>>>>>
>>>>>>> Something that may be of relevance ...
>>>>>>>
>>>>>>> The underlying table is Parquet, 10 partitions totaling ~350 MB.
>>>>>>> The mapPartitions phase of the query on the uncached table shows an
>>>>>>> input size of 351 MB. However, after the table is cached, the
>>>>>>> storage tab shows the cache size as 12 GB. So the in-memory
>>>>>>> representation seems much bigger than on-disk, even with the
>>>>>>> compression options turned on. Any thoughts on this?
>>>>>>>
>>>>>>> The mapPartitions phase of the same query on the cached table shows
>>>>>>> an input size of 12 GB (the full size of the cached table) and
>>>>>>> takes twice the time of the mapPartitions phase for the uncached
>>>>>>> query.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> On Fri, Feb 6, 2015 at 6:47 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>>>
>>>>>>>> Check the storage tab. Does the table actually fit in memory?
>>>>>>>> Otherwise you are rebuilding column buffers in addition to reading
>>>>>>>> the data off of the disk.
>>>>>>>>
>>>>>>>> On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Spark 1.2
>>>>>>>>>
>>>>>>>>> Data stored in a parquet table (large number of rows).
>>>>>>>>>
>>>>>>>>> Test 1:
>>>>>>>>>
>>>>>>>>> select a, sum(b), sum(c) from table
>>>>>>>>>
>>>>>>>>> Test 2:
>>>>>>>>>
>>>>>>>>> sqlContext.cacheTable()
>>>>>>>>> select a, sum(b), sum(c) from table  -- "seed cache"; slow the
>>>>>>>>> first time since it is loading the cache?
>>>>>>>>> select a, sum(b), sum(c) from table  -- the second time it should
>>>>>>>>> be faster, reading from cache instead of HDFS, but it is slower
>>>>>>>>> than Test 1
>>>>>>>>>
>>>>>>>>> Any thoughts? Should a different query be used to seed the cache?
>>>>>>>>>
>>>>>>>>> Thanks,
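P.S. For the archives, the stopgap I am planning to try in the meantime: scale the two decimal(14,4) columns into longs before caching and divide back at query time, since longs get a native column type in the in-memory store. A rough sketch only; the path and scaled column names are made up, and I am assuming a HiveContext (with spark-shell's sc in scope) so that CAST ... AS BIGINT parses:

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)
    sqlContext.parquetFile("/data/mytable.parquet").registerTempTable("t_raw")

    // Shift decimal(14,4) into plain longs: 12345.6789 -> 123456789
    sqlContext.sql("""
      SELECT a,
             CAST(b * 10000 AS BIGINT) AS b_scaled,
             CAST(c * 10000 AS BIGINT) AS c_scaled
      FROM t_raw
    """).registerTempTable("t")

    sqlContext.cacheTable("t")

    // Divide back at query time to recover the decimal values
    sqlContext.sql(
      "SELECT a, SUM(b_scaled) / 10000.0, SUM(c_scaled) / 10000.0 FROM t GROUP BY a")

One caveat: SUM over the scaled longs can overflow for very large aggregates, so this trades some safety for the smaller cache.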