Thanks for the comments guys. Parquet is awesome. My question with using Parquet for on disk storage - how should I load that into memory as a spark RDD and cache it and keep it in a columnar format?
I know I can use Spark SQL with parquet which is awesome. But as soon as I step out of SQL we have problems as it kinda gets converted back to a row oriented format. @Koert - that looks really exciting. Do you have any statistics on memory and scan performance? On Saturday, February 14, 2015, Koert Kuipers <ko...@tresata.com> wrote: > i wrote a proof of concept to automatically store any RDD of tuples or > case classes in columar format using arrays (and strongly typed, so you get > the benefit of primitive arrays). see: > https://github.com/tresata/spark-columnar > > On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust <mich...@databricks.com > <javascript:_e(%7B%7D,'cvml','mich...@databricks.com');>> wrote: > >> Shark's in-memory code was ported to Spark SQL and is used by default >> when you run .cache on a SchemaRDD or CACHE TABLE. >> >> I'd also look at parquet which is more efficient and handles nested data >> better. >> >> On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf <nightwolf...@gmail.com >> <javascript:_e(%7B%7D,'cvml','nightwolf...@gmail.com');>> wrote: >> >>> Hi all, >>> >>> I'd like to build/use column oriented RDDs in some of my Spark code. A >>> normal Spark RDD is stored as row oriented object if I understand >>> correctly. >>> >>> I'd like to leverage some of the advantages of a columnar memory format. >>> Shark (used to) and SparkSQL uses a columnar storage format using primitive >>> arrays for each column. >>> >>> I'd be interested to know more about this approach and how I could build >>> my own custom columnar-oriented RDD which I can use outside of Spark SQL. >>> >>> Could anyone give me some pointers on where to look to do something like >>> this, either from scratch or using whats there in the SparkSQL libs or >>> elsewhere. I know Evan Chan in a presentation made mention of building a >>> custom RDD of column-oriented blocks of data. >>> >>> Cheers, >>> ~N >>> >> >> >