Hi all,

I'd like to build/use column-oriented RDDs in some of my Spark code. A normal Spark RDD is stored as row-oriented objects, if I understand correctly.
I'd like to leverage some of the advantages of a columnar in-memory format. Shark used a columnar storage format, and Spark SQL still does, with a primitive array for each column. I'd be interested to know more about this approach and how I could build my own custom column-oriented RDD that I can use outside of Spark SQL. Could anyone give me some pointers on where to look to do something like this, either from scratch or using what's there in the Spark SQL libs or elsewhere? I know Evan Chan mentioned building a custom RDD of column-oriented blocks of data in one of his presentations.

To make the question concrete, here's a rough sketch of the kind of thing I'm imagining (ColumnarBlock and toColumnar are just names I've made up):
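    import org.apache.spark.rdd.RDD

    // One block covers a batch of rows, with one primitive array per column.
    // (Hypothetical two-column schema: an Int id and a Double value.)
    case class ColumnarBlock(ids: Array[Int], values: Array[Double])

    // Pivot a row-oriented RDD into an RDD of column-oriented blocks,
    // one block per partition.
    def toColumnar(rows: RDD[(Int, Double)]): RDD[ColumnarBlock] =
      rows.mapPartitions { iter =>
        val rowArr = iter.toArray
        Iterator.single(ColumnarBlock(rowArr.map(_._1), rowArr.map(_._2)))
      }

    // e.g. in the shell:
    // val blocks = toColumnar(sc.parallelize(Seq((1, 2.0), (2, 3.5))))

Is something along these lines a reasonable starting point, or is there machinery in the Spark SQL columnar code I should be reusing instead?

Cheers,
~N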