Re: Loading objects only once
Something like this:

```
object Model {
  // Executor-side singleton: the lazy val initializes once per JVM
  // (i.e. once per executor); @transient keeps it out of any
  // serialized task closure.
  @transient lazy val modelObject = new ModelLoader("model-filename")

  def get() = modelObject
}

object SparkJob {
  def main(args: Array[String]): Unit = {
    // sc: SparkContext, assumed in scope.
    // Ship the model file to every executor's working directory.
    sc.addFile("s3://bucket/path/model-filename")

    sc.parallelize(…).map(test => {
      Model.get().use(…)
    })
  }
}
```

On Thu, Sep 28, 2017 at 3:49 PM, Vadim Semenov wrote:

> as an alternative:
>
> ```
> spark-submit --files
> ```
>
> the files will be put into each executor's working directory, so you can
> then load the file alongside your `map` function.
>
> Behind the scenes it uses the `SparkContext.addFile` method, which you
> can use directly too:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala?utf8=✓#L1508-L1558
>
> On Wed, Sep 27, 2017 at 10:08 PM, Naveen Swamy wrote:
>
>> Hello all,
>>
>> I am a new user to Spark; please bear with me if this has been
>> discussed earlier.
>>
>> I am trying to run batch inference using DL frameworks' pre-trained
>> models and Spark. Basically, I want to download a model (which is
>> usually ~500 MB) onto the workers, load the model, and run inference on
>> images fetched from a source like S3, something like this:
>>
>> rdd = sc.parallelize(load_from_s3)
>> rdd.map(fetch_from_s3).map(read_file).map(predict)
>>
>> I was able to get it running in local mode on Jupyter. However, I would
>> like to load the model only once and not on every map operation. A
>> setup hook that loads the model once into the JVM would have been nice.
>> I came across the JIRA https://issues.apache.org/jira/browse/SPARK-650,
>> which suggests that I can use a singleton and static initialization. I
>> tried to do this using a singleton metaclass, following the thread at
>> https://stackoverflow.com/questions/6760685/creating-a-singleton-in-python.
>> This failed miserably, with Spark complaining that it cannot serialize
>> ctype objects with pointer references.
>>
>> After a lot of trial and error, I moved the code to a separate file and
>> created a static method for predict that checks whether a class
>> variable is set and loads the model if not. This approach does not
>> sound thread safe to me, so I wanted to reach out and see if there are
>> established patterns for achieving something like this.
>>
>> Also, I would like to understand the executor->tasks->Python process
>> mapping: does each task get mapped to a separate Python process? The
>> reason I ask is that I want to be able to use the mapPartitions method
>> to load a batch of files and run inference on them, for which I need to
>> load the object once per task. Any
>>
>> Thanks for your time in answering my question.
>>
>> Cheers, Naveen
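For PySpark, a rough equivalent of the same idea is a module-level cache filled lazily inside each Python worker process, which sidesteps pickling the model into the task closure. This is only a sketch: `ModelLoader` is the same hypothetical loader as above, and the module is assumed to be shipped to the executors (e.g. via `--py-files`):

```python
# model_singleton.py -- ship to the executors, e.g.:
#   spark-submit --py-files model_singleton.py ...
# from some_dl_lib import ModelLoader  # hypothetical import, as above

_model = None

def get_model():
    """Load the model once per Python worker process and cache it."""
    global _model
    if _model is None:
        _model = ModelLoader("model-filename")  # hypothetical loader
    return _model

def predict(record):
    # Called from rdd.map(predict); only `record` is serialized, never the model.
    return get_model().predict(record)  # hypothetical predict API
```

On the executor->tasks->Python process question: each concurrently running task is served by its own Python worker process, and by default (`spark.python.worker.reuse=true`) those workers are reused across tasks, so the cached model survives from one task to the next on the same worker.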
Re: Loading objects only once
as an alternative:

```
spark-submit --files
```

the files will be put into each executor's working directory, so you can
then load the file alongside your `map` function.

Behind the scenes it uses the `SparkContext.addFile` method, which you can
use directly too:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala?utf8=✓#L1508-L1558

On Wed, Sep 27, 2017 at 10:08 PM, Naveen Swamy wrote:

> [Naveen's original question, quoted in full earlier in the thread]
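A minimal PySpark sketch of this route — assuming your deployment lets `--files`/`addFile` read the s3:// URI, and a hypothetical `load_model` function:

```python
# Submit with (hypothetical paths):
#   spark-submit --files s3://bucket/path/model-filename batch_inference.py
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-inference").getOrCreate()
sc = spark.sparkContext

def predict_partition(records):
    # SparkFiles.get resolves the local path of a file distributed
    # via --files (or SparkContext.addFile) on this executor.
    model = load_model(SparkFiles.get("model-filename"))  # hypothetical loader
    for record in records:
        yield model.predict(record)  # hypothetical predict API

# mapPartitions loads the model once per task instead of once per record.
results = sc.parallelize(["img-key-1", "img-key-2"]).mapPartitions(predict_partition).collect()
```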
RE: Loading objects only once
Maybe load the model onto each executor's disk and load it from there?
Depending on how you use the data/model, using something like Livy and
sharing the same connection may help.

From: Naveen Swamy [mailto:mnnav...@gmail.com]
Sent: Wednesday, September 27, 2017 9:08 PM
To: user@spark.apache.org
Subject: Loading objects only once

[Naveen's original question, quoted in full earlier in the thread]
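One way to sketch the "each executor's disk" idea (boto3 and all names here are assumptions, not something from this thread): download to a fixed local path behind an existence check, so each executor machine fetches the file at most once.

```python
import os
import boto3  # assumed to be installed on the executors

LOCAL_PATH = "/tmp/model-filename"  # hypothetical location

def ensure_model_file():
    # Download once per executor machine; later tasks find it on disk.
    # Caveat: not race-free if several worker processes start simultaneously;
    # downloading to a temp name and os.rename-ing it closes that gap.
    if not os.path.exists(LOCAL_PATH):
        boto3.client("s3").download_file("bucket", "path/model-filename", LOCAL_PATH)
    return LOCAL_PATH
```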
Re: Loading objects only once
Hello,

maybe broadcast can help you here [1]. You can load the model once on the
driver and then broadcast it to the workers with
`bc_model = sc.broadcast(model)`. You can then access the model inside the
map function with `bc_model.value`.

Best,
Eike

[1] https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.broadcast

On Thu, Sep 28, 2017 at 04:09, Naveen Swamy wrote:

> [Naveen's original question, quoted in full earlier in the thread]
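A sketch of the broadcast route, with hypothetical `load_model`/`predict` names. One caveat: `sc.broadcast` pickles the object, so this only works if the model itself is picklable; for the ctypes-backed model described above, broadcasting the raw model bytes and deserializing once per executor may be needed instead.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-model").getOrCreate()
sc = spark.sparkContext

model = load_model("model-filename")  # hypothetical; loaded once, on the driver
bc_model = sc.broadcast(model)        # shipped to each executor once

rdd = sc.parallelize(["img-key-1", "img-key-2"])  # placeholder inputs
predictions = rdd.map(lambda x: bc_model.value.predict(x)).collect()
```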