Hi all, I'm currently evaluating Spark as a possible solution at work, and I've recently been working out a rough correlation between our input data size and the RAM needed to cache an RDD that will be reused multiple times in a job.
As part of this I've been trialling different methods of representing the data, and I came across a result that surprised me, so I just wanted to check what I'm seeing.

My data set consists of CSV records with approximately 17 fields. When I load my sample data set locally, split each line on the comma, and cache it as an RDD[Array[String]], the Spark UI shows that 8% of the RDD fits in the available RAM. When I instead cache it as an RDD of a case class, 11% of the RDD fits, so the case class representation is actually taking up less space than the array of strings.

Is that because a case class stores numbers as actual numeric types, whereas the string array keeps everything as strings?

Cheers,

Liam Clarke
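P.S. In case it helps frame the question, here's a minimal non-Spark sketch of the intuition I'm asking about (the Record schema and field names are made up, not our real data):

```scala
// Hypothetical two-field row; our real records have ~17 fields.
case class Record(id: Long, amount: Double)

object SizeIntuition extends App {
  val raw = Array("1234567890", "42.5")            // one parsed CSV line
  val rec = Record(raw(0).toLong, raw(1).toDouble) // typed representation

  // A JVM String stores each digit as a 2-byte UTF-16 char (plus object
  // and array headers), while the parsed Long is a fixed 8 bytes.
  println(raw(0).length * 2)    // char payload of the string id: 20 bytes
  println(java.lang.Long.BYTES) // primitive id: 8 bytes
}
```

So even before any header or serialization overhead, a 10-digit numeric field costs 20 bytes of character data as a String but only 8 bytes as a Long, which is roughly the effect I think I'm seeing in the UI.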