Hi all,

I'm currently playing with Spark as a possible solution at work, and I've
recently been working out a rough correlation between our input data size
and the RAM needed to cache an RDD that will be used multiple times in a job.

As part of this I've been trialling different methods of representing the
data, and I came across a result that surprised me, so I just wanted to
check what I was seeing.

So my data set is CSV with approximately 17 fields per record. When I load
my sample data set locally, split each line on commas, and cache the result
as an RDD[Array[String]], the Spark UI shows that 8% of the RDD fits in the
available RAM.
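
For reference, here's a rough sketch of the Array[String] version (the file
path is just a placeholder, and I'm running this in the spark-shell, so sc
is already defined):

    import org.apache.spark.rdd.RDD

    // Load the CSV, split each line on commas, and cache the resulting RDD
    val lines = sc.textFile("/path/to/sample.csv")
    val asArrays: RDD[Array[String]] = lines.map(_.split(","))
    asArrays.cache()
    asArrays.count()  // force evaluation so the Storage tab reports the cached fraction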

When I instead cache it as an RDD of a case class, 11% of the RDD fits, so
the case class representation is actually taking up less cached space than
the array of strings.
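
The case class version looks roughly like this (the field names and types
are placeholders standing in for my real 17-field record):

    // A couple of the fields are numeric, so they're typed as such in the case class
    case class Record(id: Long, name: String, amount: Double /* , ... remaining fields */)

    val asRecords = sc.textFile("/path/to/sample.csv").map { line =>
      val f = line.split(",")
      Record(f(0).toLong, f(1), f(2).toDouble)
    }
    asRecords.cache()
    asRecords.count()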

Is this because the case class represents numeric fields as actual numbers,
whereas the string array keeps everything as strings?

Cheers,

Liam Clarke
