You might also want to consider broadcasting the models, so that you get one 
instance shared across all the cores on each machine. Otherwise the model 
will be serialised out with every task and you'll end up with a copy per 
task (so roughly one per core in this instance).
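
Something like this, where loadModels, recognize and the paths are just 
placeholders for your own Sphinx4 setup:

// Build the models once on the driver, then broadcast them so each
// executor JVM keeps a single read-only copy shared by all its tasks.
val models = loadModels()              // the ~1 GB acoustic + language models
val modelsBc = sc.broadcast(models)

sc.textFile("urls.txt")
  .mapPartitions { urls =>
    val m = modelsBc.value             // the one shared copy in this JVM
    urls.map(url => recognize(m, url))
  }
  .saveAsTextFile("transcripts")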

Simon 


> On 30 Jul 2015, at 10:14, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> 
> Like this?
> 
> sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls => 
>   speechRecognizer(urls))
> 
> Here, 24 should be the total number of cores you have across all the 
> workers, so each core ends up with one partition to process.
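> 
> If you actually want the transcripts back out, rather than just a side 
> effect, mapPartitions works the same way (speechRecognizer here is your own 
> function taking and returning an iterator, and the output path is just an 
> example):
> 
> sc.textFile("/sigmoid/audio/data/", 24)
>   .mapPartitions(urls => speechRecognizer(urls))
>   .saveAsTextFile("/sigmoid/audio/transcripts/")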
> 
> Thanks
> Best Regards
> 
>> On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf <opus...@gmail.com> wrote:
>> Hello, I am writing a Spark application to use speech recognition to 
>> transcribe a very large number of recordings.
>> 
>> I need some help configuring Spark.
>> 
>> My app is basically a transformation with no side effects: recording URL --> 
>> transcript.  The input is a huge file with one URL per line, and the output 
>> is a huge file of transcripts.  
>> 
>> The speech recognizer is written in Java (Sphinx4), so it can be packaged as 
>> a JAR.
>> 
>> The recognizer is very processor intensive, so you can't run too many on 
>> one machine: perhaps one recognizer per core.  The recognizer is also big, 
>> maybe 1 GB.  But most of that size is immutable acoustic and language 
>> models that can be shared with other instances of the recognizer.
>> 
>> So I want to run about one recognizer per core on each machine in my 
>> cluster.  I want all the recognizers on one machine to run within the same 
>> JVM and share the same models.
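>> 
>> In pseudocode, what I have in mind is roughly this (all names made up):
>> 
>> object Recognizer {
>>   // Initialised at most once per JVM (i.e. once per executor)
>>   // and shared by every task running in that executor.
>>   lazy val models = loadAcousticAndLanguageModels()  // ~1 GB
>> }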
>> 
>> How does one configure Spark for this sort of application?  How does one 
>> control how Spark deploys the stages of the process?  Can someone point me 
>> to an appropriate doc, or keywords I should Google?
>> 
>> Thanks
>> Peter 
> 
