When Spark reads Parquet files (sqlContext.parquetFile), it creates a
DataFrame. I would like to know whether the resulting DataFrame has a
columnar structure (many values of a column coalesced together in memory)
or the row-wise structure of an ordinary Spark RDD. The section Caching
Data in Memory of the Spark SQL and DataFrames guide
<http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory>
says you need to call sqlContext.cacheTable("tableName") or df.cache() to
make it columnar. What exactly is this columnar structure?
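For concreteness, here is a minimal sketch of the pattern I mean (the
Parquet path is made up for illustration):

    val df = sqlContext.parquetFile("/data/events.parquet") // hypothetical path
    df.cache()  // per the docs, this should use the in-memory columnar format
    df.count()  // force materialization of the cache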

To be precise: what does row represent in the expression
df.cache().map { row => ... }?

Is it a logical row that maintains an array of columns, where each column
is in turn an array of values for batchSize rows?
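In other words, in a snippet like the following (assuming, purely for
illustration, that column 0 is a string), what is row backed by once the
table is cached in the columnar format?

    import org.apache.spark.sql.Row

    val firstCol = df.cache().map { row: Row =>
      row.getString(0) // positional getter on the Row interface
    }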

-Kiran
