Hi everyone,
I am very new to Spark, so as a learning exercise I've set up a small
cluster consisting of 4 EC2 m1.large instances (1 master, 3 slaves), which
I'm hoping to use to calculate ngram frequencies from text files of various
sizes (I'm not doing anything with them; I just thought this would be
slightly more interesting than the usual 'word count' example).  Currently,
I'm trying to work with a 1 GB text file, but am running into memory issues.
 I'm wondering what parameters I should be setting (in spark-env.sh) in
order to properly utilize the cluster.  Right now, I'd be happy just to
have the process complete successfully with the 1 gig file, so I'd really
appreciate any suggestions you all might have.

Here's a summary of the code I'm running through the spark shell on the
master:

// Split on whitespace, take sliding windows of n tokens, and join each
// window back into a single space-separated string.
def ngrams(s: String, n: Int = 3): Seq[String] = {
  s.split("\\s+").sliding(n).filter(_.length == n)
    .map(_.mkString(" ")).map(_.trim).toList
}
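
Just to illustrate, here's what it produces on a small input in the
shell:

scala> ngrams("the quick brown fox jumps")
res0: Seq[String] = List(the quick brown, quick brown fox, brown fox jumps)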

val text = sc.textFile("s3n://my-bucket/my-1gb-text-file")

val mapped = text.filter(_.trim.length > 0).flatMap(ngrams(_, 3))

So far so good; the problems come during the reduce phase.  With small
files, I was able to issue the following to calculate the most frequently
occurring trigram:

val topNgram = mapped.countByValue.reduce(
  (a: (String, Long), b: (String, Long)) => if (a._2 > b._2) a else b)

With the 1 GB file, though, I've been running into OutOfMemory errors,
so I decided to split the reduction into several steps, starting with
simply calling countByValue on my "mapped" RDD, but I have yet to get
even that to complete successfully.
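
One alternative I've been considering (I'm not sure yet whether it
actually sidesteps the problem) is to skip countByValue, which I
believe collects the entire ngram-to-count map on the driver, and
instead keep the counting distributed with reduceByKey, something like:

val counts = mapped.map(ngram => (ngram, 1L)).reduceByKey(_ + _)
val topNgram = counts.reduce((a, b) => if (a._2 > b._2) a else b)

Does that seem like the right direction?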

SPARK_MEM is currently set to 6154m.  I also bumped the
spark.akka.frameSize setting up to 500 (though at that point I was
grasping at straws; I'm not sure what a "proper" value would be).  What
properties should I be setting for a job of this size on a cluster of 3
m1.large slaves?  (The cluster was initially configured using the
spark-ec2 scripts.)  Also, programmatically, what should I be doing
differently?  For example, should I be setting the minimum number of
splits when reading the text file, and if so, what would be a good
default?
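
For instance, would something like the following be reasonable?  (The
24 is purely a guess on my part, aiming for a few splits per core; the
3 m1.large slaves have 2 cores each.)

val text = sc.textFile("s3n://my-bucket/my-1gb-text-file", 24)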

I apologize for what I'm sure are very naive questions.  I think Spark is a
fantastic project and have enjoyed working with it, but I'm still very much
a newbie and would appreciate any help you all can provide (as well as any
'rules-of-thumb' or best practices I should be following).

Thanks,
Tim Perrigo
