Oh... That was embarrassingly easy! Thank you, that was exactly the understanding of partitions that I needed.
P

On Thu, Jul 30, 2015 at 6:35 AM, Simon Elliston Ball <si...@simonellistonball.com> wrote:

> You might also want to consider broadcasting the models to ensure you get
> one instance shared across cores in each machine; otherwise the model will
> be serialised to each task and you'll get a copy per executor (roughly one
> per core in this instance).
>
> Simon
>
> Sent from my iPhone
>
> On 30 Jul 2015, at 10:14, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
> Like this?
>
> sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls => speechRecognizer(urls))
>
> Let 24 be the total number of cores that you have across all the workers.
>
> Thanks
> Best Regards
>
> On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf <opus...@gmail.com> wrote:
>
>> Hello, I am writing a Spark application that uses speech recognition to
>> transcribe a very large number of recordings.
>>
>> I need some help configuring Spark.
>>
>> My app is basically a transformation with no side effects: recording URL
>> --> transcript. The input is a huge file with one URL per line, and the
>> output is a huge file of transcripts.
>>
>> The speech recognizer is written in Java (Sphinx4), so it can be packaged
>> as a JAR.
>>
>> The recognizer is very processor-intensive, so you can't run too many on
>> one machine -- perhaps one recognizer per core. The recognizer is also
>> big -- maybe 1 GB. But most of the recognizer is an immutable acoustic and
>> language model that can be shared with other instances of the recognizer.
>>
>> So I want to run about one recognizer per core on each machine in my
>> cluster. I want all recognizers on one machine to run within the same JVM
>> and share the same models.
>>
>> How does one configure Spark for this sort of application? How does one
>> control how Spark deploys the stages of the process? Can someone point me
>> to an appropriate doc, or keywords I should Google?
>>
>> Thanks
>> Peter
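For anyone finding this thread later: the pattern discussed above -- one heavyweight, immutable model shared by all tasks running on an executor's cores -- is commonly implemented with a lazily initialized singleton on the executor side, since Scala `object` initialization happens at most once per JVM and is thread-safe. A minimal sketch of how the pieces fit together (the `SpeechModel` object, `transcribe` function, and paths are hypothetical placeholders, not part of the original thread; the RDD calls are the standard Spark API):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Initialized at most once per executor JVM, the first time a task on that
// executor touches it. All tasks on that machine's cores then share it,
// which is how the ~1 GB acoustic/language models avoid being duplicated.
object SpeechModel {
  // Placeholder for the expensive Sphinx4 model load (hypothetical).
  lazy val transcribe: String => String = { url =>
    s"transcript-of-$url"
  }
}

object TranscribeJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transcribe"))

    sc.textFile("/sigmoid/audio/data/", 24)            // 24 partitions, as in the thread
      .mapPartitions(urls => urls.map(SpeechModel.transcribe))
      .saveAsTextFile("/sigmoid/audio/transcripts/")   // hypothetical output path

    sc.stop()
  }
}
```

Note the use of `mapPartitions` rather than `foreachPartition`: since the goal here is a pure transformation (URL --> transcript written out as a file), a transformation that returns the transcripts fits better than `foreachPartition`, which is an action returning `Unit` and would need its own side-effecting writes. Broadcasting, as Simon suggests, is the other standard route; it also yields one deserialized copy per executor.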