Hello all,

I am new to Spark, so please bear with me if this has been discussed
earlier.

I am trying to run batch inference on Spark using pre-trained models from
DL frameworks. Basically, I want to download a model (usually ~500 MB)
onto the workers, load it, and run inference on images fetched from a
source such as S3, something like this:

rdd = sc.parallelize(load_from_s3)
rdd.map(fetch_from_s3).map(read_file).map(predict)

I was able to get this running in local mode on Jupyter. However, I would
like to load the model only once, not on every map operation. A setup hook
that loads the model once per worker would have been nice. I came across
this JIRA https://issues.apache.org/jira/browse/SPARK-650 which suggests
using a singleton and static initialization. I tried to do this with a
singleton metaclass, following the thread at
https://stackoverflow.com/questions/6760685/creating-a-singleton-in-python,
but that failed miserably, complaining that Spark cannot serialize ctypes
objects with pointer references.

After a lot of trial and error, I moved the code to a separate file,
creating a static method for predict that checks whether a class variable
is set and loads the model if it is not (a rough sketch is below). This
approach does not sound thread-safe to me, so I wanted to reach out and see
whether there are established patterns for achieving something like this.
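
Roughly, the pattern in that separate file looks like this, where
load_pretrained_model() and .run() are just placeholders for whatever the
actual framework calls are:

class Predictor:
    _model = None  # class variable; one copy per Python worker process

    @staticmethod
    def predict(image_bytes):
        # lazily load the ~500 MB model the first time this process predicts
        if Predictor._model is None:
            Predictor._model = load_pretrained_model()
        # .run() stands in for the framework's real inference call
        return Predictor._model.run(image_bytes)

and the driver-side pipeline then does rdd.map(Predictor.predict).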


Also, I would like to understand the executor -> tasks -> Python process
mapping: does each task get mapped to a separate Python process? The
reason I ask is that I want to use the mapPartitions method to load a
batch of files and run inference on them per partition, for which I need
to load the model once per task; a rough sketch of what I have in mind is
below. Any pointers would be appreciated.
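
Something like this, reusing fetch_from_s3/read_file from above and the
same placeholder loader:

def predict_partition(paths):
    model = load_pretrained_model()   # loaded once per partition (i.e. per task)
    for path in paths:
        image = read_file(fetch_from_s3(path))
        yield path, model.run(image)  # .run() again stands in for real inference

results = rdd.mapPartitions(predict_partition).collect()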


Thanks for your time in answering my question.

Cheers, Naveen
