When spark reads parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has columnar structure (many rows of a column coalesced together in memory) or its a row wise structure that a spark RDD has. The section Spark SQL and DataFrames <http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory> says you need to call sqlContext.cacheTable("tableName") or df.cache() to make it columnar. What exactly is this columnar structure?
To be precise: What does the row represent in the expression df.cache().map{row => ...}? Is it a logical row which maintains an array of columns and each column in turn is an array of values for batchSize rows? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Apache-Spark-maintain-a-columnar-structure-when-creating-RDDs-from-Parquet-or-ORC-files-tp23139.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org