Hello, I am writing a Spark application to use speech recognition to
transcribe a very large number of recordings.

I need some help configuring Spark.

My app is basically a transformation with no side effects: recording URL
--> transcript.  The input is a huge file with one URL per line, and the
output is a huge file of transcripts.

The speech recognizer is written in Java (Sphinx4), so it can be packaged
as a JAR.

The recognizer is very processor intensive, so you can't run too many on
one machine -- perhaps one recognizer per core.  The recognizer is also
big -- maybe 1 GB.  But most of that is immutable acoustic and language
models that can be shared with other instances of the recognizer.

So I want to run about one recognizer per core on each machine in my
cluster.  I want all recognizers on a machine to run within the same JVM
and share the same models.
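To make the sharing concrete, here is a sketch of the pattern I have in
mind inside each executor JVM.  The class names (SharedModels, Recognizer)
are placeholders, not real Sphinx4 classes -- the point is just that the
models are loaded lazily once per JVM and every recognizer instance (one
per core) references the same object:

```java
// Placeholder for the ~1 GB of immutable acoustic + language models.
// Lazily initialized once per JVM via double-checked locking, so every
// recognizer in the same JVM shares a single copy.
final class SharedModels {
    private static volatile SharedModels instance;

    private SharedModels() {
        // In the real app, load the Sphinx4 acoustic and language
        // models here -- expensive, but done only once per machine.
    }

    static SharedModels get() {
        if (instance == null) {
            synchronized (SharedModels.class) {
                if (instance == null) {
                    instance = new SharedModels();
                }
            }
        }
        return instance;
    }
}

// One of these per core; each holds a reference to the shared models.
final class Recognizer {
    private final SharedModels models = SharedModels.get();

    SharedModels models() {
        return models;
    }

    String transcribe(String url) {
        // The real implementation would run Sphinx4 on the audio at
        // this URL; placeholder output for the sketch.
        return "transcript-of:" + url;
    }
}
```

If Spark runs one executor JVM per machine with one task per core, each
task can construct its own Recognizer and the static holder guarantees the
models load only once.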

How does one configure Spark for this sort of application?  How does one
control how Spark deploys the stages of the process?  Can someone point me
to an appropriate doc, or keywords I should Google?
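For reference, here is my current guess at the submit-time settings.  The
flag names are real spark-submit options, but the class name, master URL,
paths, and sizes are all placeholders for my setup:

```shell
# Hypothetical spark-submit invocation.  Goal: one executor JVM per
# machine, one concurrent task (recognizer) per core, and enough heap
# for the shared ~1 GB models plus per-task working memory.
spark-submit \
  --class com.example.Transcribe \
  --master spark://master:7077 \
  --executor-memory 6g \
  --executor-cores 8 \
  path/to/transcriber.jar urls.txt transcripts/
```

I am not sure whether these flags alone guarantee one executor per machine
-- that is part of what I am asking.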

Thanks
Peter
