There was a discussion on this earlier; let me re-post it for you.
For the following code:

    val df = sqlContext.parquetFile(path)

*df* remains columnar (actually it just reads from the columnar Parquet file on disk).

For the following code:

    val cdf = df.cache()

*cdf* is also columnar, but in a different format from Parquet. When a DataFrame is cached, Spark SQL turns it into a private in-memory columnar format.

Some more details about the in-memory columnar structure: it's columnar, but much simpler than the one Parquet uses. The columnar byte arrays are split into batches with a fixed row count (configured by "spark.sql.inMemoryColumnarStorage.batchSize"). Also, each column is compressed with a compression scheme chosen according to the data type and statistics information of that column. Supported compression schemes include RunLengthEncoding, IntDelta, LongDelta, BooleanBitSet, and DictionaryEncoding. You may find the implementation here:

https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar

This was originally written by Cheng.

Thanks
Best Regards

On Sun, Jul 12, 2015 at 11:37 PM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:

> Hi Akhil,
>
> It's interesting whether RDDs are stored internally in a columnar format as
> well, or whether it is only when an RDD is cached in a SQL context that it
> is converted to columnar format. What about data frames?
>
> Thanks!
>
> --
> Ruslan Dautkhanov
>
> On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar <vinodsachin...@gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> Can anyone please share how to use the caching feature of Spark via
>>> Spark SQL queries?
>>>
>>> -Vinod
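
To make the discussion above concrete, here is a minimal sketch of the read-then-cache flow, assuming Spark 1.x APIs as used in the thread, an existing SparkContext named sc, and a placeholder path:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Optionally tune how many rows go into each in-memory column batch
    // (10000 is the documented default in Spark 1.x).
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

    val df = sqlContext.parquetFile(path) // columnar on disk (Parquet format)
    val cdf = df.cache()                  // marks df for the in-memory columnar format
    cdf.count()                           // caching is lazy; an action materializes it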
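
And for Vinod's original question, caching can also be driven entirely from SQL statements; a sketch, where the table name "people" is purely illustrative:

    // Assumption: df is the DataFrame from the sketch above; "people" is a made-up name.
    df.registerTempTable("people")
    sqlContext.sql("CACHE TABLE people")                 // eager since Spark 1.2; or sqlContext.cacheTable("people")
    sqlContext.sql("SELECT COUNT(*) FROM people").show() // served from the in-memory columnar cache
    sqlContext.sql("UNCACHE TABLE people")               // or sqlContext.uncacheTable("people")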