Hi everyone, I'm very new to Spark, so as a learning exercise I've set up a small cluster of 4 EC2 m1.large instances (1 master, 3 slaves), which I'm hoping to use to calculate ngram frequencies from text files of various sizes. (I'm not doing anything with the ngrams; I just thought this would be slightly more interesting than the usual "word count" example.) Currently I'm trying to work with a 1GB text file, but I'm running into memory issues. I'm wondering what parameters I should be setting (in spark-env.sh) in order to properly utilize the cluster. Right now I'd be happy just to have the process complete successfully with the 1GB file, so I'd really appreciate any suggestions you all might have.
Here's a summary of the code I'm running through the Spark shell on the master:

```scala
def ngrams(s: String, n: Int = 3): Seq[String] =
  s.split("\\s+").sliding(n)
    .filter(_.length == n)
    .map(_.mkString(" "))
    .map(_.trim)
    .toList

val text = sc.textFile("s3n://my-bucket/my-1gb-text-file")
val mapped = text.filter(_.trim.length > 0).flatMap(ngrams(_, 3))
```

So far so good; the problems come during the reduce phase. With small files, I was able to issue the following to calculate the most frequently occurring trigram:

```scala
val topNgram = mapped.countByValue
  .reduce((a: (String, Long), b: (String, Long)) => if (a._2 > b._2) a else b)
```

With the 1GB file, though, I've been running into OutOfMemory errors, so I decided to split the reduction into several steps, starting with simply calling countByValue on my "mapped" RDD, but I have yet to get even that to complete successfully.

SPARK_MEM is currently set to 6154m. I also bumped the spark.akka.frameSize setting up to 500 (though at this point I was grasping at straws; I'm not sure what a "proper" value would be).

What properties should I be setting for a job of this size on a cluster of 3 m1.large slaves? (The cluster was initially configured using the spark-ec2 scripts.) Also, what should I be doing differently programmatically? For example, should I be setting the minimum number of splits when reading the text file? If so, what would be a good default?

I apologize for what I'm sure are very naive questions. I think Spark is a fantastic project and have enjoyed working with it, but I'm still very much a newbie and would appreciate any help you all can provide (as well as any rules of thumb or best practices I should be following).

Thanks,
Tim Perrigo
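P.S. In case it helps clarify what I'm after: one thing I've been wondering about (untested, and the names below are just mine) is whether I should replace countByValue, which as I understand it pulls the entire histogram back to the driver, with a reduceByKey-style aggregation so that only the single winning (ngram, count) pair ever comes back. Here's a small sanity check of that shape in plain Scala (no Spark), using groupBy as a stand-in for reduceByKey:

```scala
// Plain-Scala sanity check of the trigram logic, restructured the way I
// *think* a distributed reduceByKey would shape it: pair each trigram
// with 1, sum per key, then take the max.
def ngrams(s: String, n: Int = 3): Seq[String] =
  s.split("\\s+").sliding(n).filter(_.length == n).map(_.mkString(" ")).toList

// Tiny stand-in for the lines of the 1GB file.
val lines = Seq("the quick brown fox", "the quick brown dog", "")
val trigrams = lines.filter(_.trim.nonEmpty).flatMap(ngrams(_, 3))

// Analogous to mapped.map((_, 1L)).reduceByKey(_ + _) on an RDD.
val counts = trigrams.map((_, 1L))
  .groupBy(_._1)
  .map { case (k, vs) => (k, vs.map(_._2).sum) }

// Only this one pair would need to reach the driver.
val topNgram = counts.maxBy(_._2)
println(topNgram) // (the quick brown,2)
```

Would that be the right direction, or is there a better idiom for this?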