Interesting, the same question was just posted on another thread :) My
answer there is quoted below:
> For the following code:
>
> val df = sqlContext.parquetFile(path)
>
> `df` remains columnar (actually it just reads from the columnar
> Parquet file on disk). For the following code:
>
> val cdf = df.cache()
>
> `cdf` is also columnar, but that's different from Parquet. When a
> DataFrame is cached, Spark SQL turns it into a private in-memory
> columnar format.
>
> So for your last question, the answer is: yes.
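To make the two steps concrete, here is a minimal sketch (Spark 1.x API,
matching the snippets above; the path is just a placeholder):

```scala
// Read a Parquet file; `df` stays backed by the columnar file on disk.
val df = sqlContext.parquetFile("hdfs:///data/events.parquet")

// Optionally tune the in-memory columnar batch size before caching
// (rows per column batch; see the property discussed below).
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

// Caching converts the DataFrame into Spark SQL's private in-memory
// columnar format; later actions read from those cached column batches.
val cdf = df.cache()
cdf.count()  // an action like count() materializes the cache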
Some more details about the in-memory columnar structure: it's columnar,
but much simpler than the format Parquet uses. The columnar byte arrays
are split into batches with a fixed row count (configured by
"spark.sql.inMemoryColumnarStorage.batchSize"). Also, each column is
compressed with a compression scheme chosen according to the data type
and statistics of that column. Supported compression schemes include
RLE, DeltaInt, DeltaLong, BooleanBitSet, and DictionaryEncoding. You can
find the implementation here:
https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar
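As a rough illustration of one of those schemes, run-length encoding
collapses consecutive repeated values into (value, run length) pairs.
This is a toy sketch in plain Scala, not Spark's actual implementation,
which works on typed byte buffers:

```scala
// Toy run-length encoding over a column of values, for illustration only.
def rle[T](column: Seq[T]): Seq[(T, Int)] =
  column.foldLeft(List.empty[(T, Int)]) {
    // Same value as the current run: extend the run.
    case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
    // New value: start a fresh run of length 1.
    case (acc, x)                      => (x, 1) :: acc
  }.reverse

// A low-cardinality column compresses well:
rle(Seq("US", "US", "US", "UK", "UK", "US"))
// → List(("US", 3), ("UK", 2), ("US", 1))
```

This is why the scheme is picked per column from its statistics: RLE pays
off on columns with long runs of repeated values, while high-cardinality
columns are better served by, say, dictionary encoding.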
Cheng
On 6/3/15 10:40 PM, kiran lonikar wrote:
When Spark reads Parquet files (sqlContext.parquetFile), it creates a
DataFrame RDD. I would like to know whether the resulting DataFrame has
a columnar structure (many rows of a column coalesced together in
memory) or the row-wise structure of a regular Spark RDD. The section
Spark SQL and DataFrames
<http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory> says
you need to call sqlContext.cacheTable("tableName") or df.cache() to
make it columnar. What exactly is this columnar structure?
To be precise: what does `row` represent in the expression
df.cache().map { row => ... }?
Is it a logical row which maintains an array of columns, where each
column in turn is an array of values for batchSize rows?
-Kiran