Hello, I am writing a Spark application that uses speech recognition to transcribe a very large number of recordings.
I need some help configuring Spark. My app is basically a transformation with no side effects: recording URL --> transcript. The input is a huge file with one URL per line, and the output is a huge file of transcripts.

The speech recognizer is written in Java (Sphinx4), so it can be packaged as a JAR. The recognizer is very processor-intensive, so you can't run too many on one machine; perhaps one recognizer per core. The recognizer is also big, maybe 1 GB. However, most of that is immutable acoustic and language models that can be shared with other instances of the recognizer.

So I want to run about one recognizer per core on each machine in my cluster, and I want all recognizers on one machine to run within the same JVM and share the same models.

How does one configure Spark for this sort of application? How does one control how Spark deploys the stages of the process? Can someone point me to an appropriate doc, or to keywords I should Google?

Thanks,
Peter
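
P.S. To make the sharing requirement concrete, here is a minimal sketch in plain Java (no Spark involved) of the pattern I have in mind. The class names Models and Recognizer are hypothetical stand-ins, not Sphinx4 API, and the thread pool only imitates the worker threads of one JVM. My assumption, which I would like confirmed, is that Spark tasks belonging to one executor run as threads inside the same JVM, so that with one executor per machine and spark.executor.cores set to the number of cores, a static holder like this one would load the models once per machine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class SharedModelSketch {
    // Counts how many times the stand-in models were loaded in this JVM.
    private static final AtomicInteger LOADS = new AtomicInteger();

    /** Hypothetical stand-in for the large, immutable acoustic and language models. */
    static final class Models {
        Models() {
            LOADS.incrementAndGet(); // real code would load roughly 1 GB here
        }
    }

    // Initialization-on-demand holder: the JVM initializes this class once,
    // on first use, even when many threads ask for the models concurrently.
    private static final class Holder {
        static final Models INSTANCE = new Models();
    }

    static Models sharedModels() {
        return Holder.INSTANCE;
    }

    static int modelLoadCount() {
        return LOADS.get();
    }

    /** Hypothetical lightweight recognizer: cheap to create, reuses the shared models. */
    static final class Recognizer {
        private final Models models;

        Recognizer(Models models) {
            this.models = models;
        }

        String transcribe(String url) {
            return "transcript of " + url; // placeholder for the real decoding
        }
    }

    /** Transcribes all URLs using `cores` worker threads inside this one JVM. */
    static List<String> transcribeAll(List<String> urls, int cores) {
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                // Each task builds its own recognizer around the shared models.
                futures.add(pool.submit(() -> new Recognizer(sharedModels()).transcribe(url)));
            }
            List<String> transcripts = new ArrayList<>();
            for (Future<String> f : futures) {
                transcripts.add(f.get()); // results come back in input order
            }
            return transcripts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("transcription failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://example.com/a.wav", "http://example.com/b.wav");
        transcribeAll(urls, 2).forEach(System.out::println);
        System.out.println("model loads: " + modelLoadCount());
    }
}
```

In a Spark job I imagine the body of the submitted task would move into something like a mapPartitions function, with sharedModels() called from inside it so the models are loaded on the worker rather than serialized from the driver. Whether that is the recommended approach is exactly what I am asking.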