Hi Michael,
The storage tab shows the RDD resides fully in memory (10 partitions) with
zero disk usage. Tasks for a subsequent select on this cached table show
minimal overheads (GC, queueing, shuffle write, etc.), so overhead is not
the issue. However, it is still twice as slow as reading the uncached Parquet data.
You'll probably only get good compression for strings when dictionary
encoding works. We don't optimize decimals in the in-memory columnar
storage, so you are likely paying an expensive serialization cost there.
On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Could you share which data types are optimized in the in-memory storage, and
how they are optimized?
On Mon, Feb 9, 2015 at 2:33 PM, Michael Armbrust mich...@databricks.com
wrote:
You could add a new ColumnType:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
PRs welcome :)
On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi Michael,
As a test, I have the same data loaded as another Parquet table, except with
the two decimal(14,4) columns replaced by double. With this, the on-disk size
is ~345 MB, the in-memory size is 2 GB (vs. 12 GB), and the cached query runs
in half the time of the uncached query.
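
(For reference, a minimal sketch of that idea against the Spark 1.2 API, casting the
decimals in-flight rather than rewriting the Parquet file. It assumes an existing
sqlContext; the table name "records" and columns a, b, c are placeholders, and
GROUP BY a is added so the aggregate is valid SQL.)

// Cast the two decimal(14,4) columns to double before caching.
val asDouble = sqlContext.sql(
  "SELECT a, CAST(b AS DOUBLE) AS b, CAST(c AS DOUBLE) AS c FROM records")

asDouble.registerTempTable("records_double")   // expose the casted copy as a table
sqlContext.cacheTable("records_double")        // cache it in the in-memory columnar store

// Same aggregate, now over double columns in the columnar cache.
sqlContext.sql("SELECT a, sum(b), sum(c) FROM records_double GROUP BY a").collect()
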
Would it be possible for Spark to
Spark 1.2
Data stored in a Parquet table (large number of rows)
Test 1
select a, sum(b), sum(c) from table
Test 2
sqlContext.cacheTable()
select a, sum(b), sum(c) from table -- seeds the cache; first run is slow since it is loading the cache?
select a, sum(b), sum(c) from table -- second run should be faster
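
(A minimal sketch of those steps on the Spark 1.2 API. The Parquet path and the
table name "records" are placeholders, and GROUP BY a is added so the aggregate
is valid SQL.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("cache-test"))
val sqlContext = new SQLContext(sc)

// Load the Parquet data and expose it as a table.
sqlContext.parquetFile("/path/to/data.parquet").registerTempTable("records")

// Test 1: aggregate straight off Parquet.
sqlContext.sql("SELECT a, sum(b), sum(c) FROM records GROUP BY a").collect()

// Test 2: cache the table, then run the same query twice.
sqlContext.cacheTable("records")
sqlContext.sql("SELECT a, sum(b), sum(c) FROM records GROUP BY a").collect() // first run seeds the cache (slow)
sqlContext.sql("SELECT a, sum(b), sum(c) FROM records GROUP BY a").collect() // second run should hit the in-memory columnar cache
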
Check the storage tab. Does the table actually fit in memory? Otherwise
you are rebuilding column buffers in addition to reading the data off disk.
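
(A tiny sketch of that check, using the placeholder table name from above; the
Storage tab in the web UI shows the same information.)

sqlContext.cacheTable("records")
sqlContext.sql("SELECT count(*) FROM records").collect()  // forces the cache to materialize
println(sqlContext.isCached("records"))                   // true once the table is registered as cached
// Then confirm in the Storage tab that the table is fully in memory with zero disk usage.
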
On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel manojsamelt...@gmail.com
wrote: