There was a discussion on this earlier; let me re-post it for you.
For the following code:

    val df = sqlContext.parquetFile(path)

*df* remains columnar (actually it just reads from the columnar Parquet file on disk).

For the following code:

    val cdf = df.cache()

*cdf* is also columnar, but in a different format from Parquet. When a DataFrame is cached, Spark SQL turns it into a private in-memory columnar format.

Some more details about the in-memory columnar structure: it's columnar, but much simpler than the one Parquet uses. The columnar byte arrays are split into batches with a fixed row count (configured by "spark.sql.inMemoryColumnarStorage.batchSize"). Also, each column is compressed with a compression scheme chosen according to the data type and statistics information of that column. Supported compression schemes include RunLengthEncoding, IntDelta, LongDelta, BooleanBitSet, and DictionaryEncoding. You may find the implementation here:

https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar

This was originally written by Cheng.

Thanks
Best Regards

On Sun, Jul 12, 2015 at 11:37 PM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:

> Hi Akhil,
>
> It's interesting whether RDDs are stored internally in a columnar format as
> well, or whether it is only when an RDD is cached in a SQL context that it
> is converted to columnar format. What about data frames?
>
> Thanks!
>
> --
> Ruslan Dautkhanov
>
> On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar <vinodsachin...@gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> Can anyone please share how to use the caching feature of Spark via
>>> Spark SQL queries?
>>>
>>> -Vinod
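
To make the discussion above concrete, here is a minimal sketch of the read-then-cache flow, assuming Spark 1.x APIs as used in the thread, an existing SparkContext named sc, and a placeholder path:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Optionally tune how many rows go into each in-memory column batch
    // (10000 is the documented default in Spark 1.x).
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

    val df = sqlContext.parquetFile(path) // columnar on disk (Parquet format)
    val cdf = df.cache()                  // marks df for the in-memory columnar format
    cdf.count()                           // caching is lazy; an action materializes it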
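
And for Vinod's original question, caching can also be driven entirely from SQL statements; a sketch, where the table name "people" is purely illustrative:

    // Assumption: df is the DataFrame from the sketch above; "people" is a made-up name.
    df.registerTempTable("people")
    sqlContext.sql("CACHE TABLE people")                 // eager since Spark 1.2; or sqlContext.cacheTable("people")
    sqlContext.sql("SELECT COUNT(*) FROM people").show() // served from the in-memory columnar cache
    sqlContext.sql("UNCACHE TABLE people")               // or sqlContext.uncacheTable("people")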