Yes, I save it to S3 in a different process. It is actually the RandomForestClassificationModel.load method (passed an S3 path) where I run into problems. When you say you load it during map stages, do you mean that you are able to load a model directly from inside a transformation? When I try this, the function is shipped to a worker, and the load method itself appears to attempt to create a new SparkContext there, which causes an NPE downstream (because creating a SparkContext on a worker is not supported, according to various threads I've found).
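For what it's worth, the fallback I'm considering is to load every model on the driver (where a SparkContext already exists) and call transform per profile, then union the results. A rough sketch of what I mean is below; the bucket paths, profile ids, and "profile" column are placeholders, not our real names:

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("predict").getOrCreate()
val input = spark.read.parquet("s3://bucket/input/")  // placeholder path

// Load the models driver-side; transform itself distributes the work,
// so nothing has to call load() inside a map function on a worker.
val profiles = Seq("profileA", "profileB")  // placeholder profile ids
val predictions = profiles.map { p =>
  val model = RandomForestClassificationModel.load(s"s3://bucket/models/$p")
  model.transform(input.filter(input("profile") === p))
}.reduce(_ union _)

That avoids the worker-side load, but it forces the per-profile split up front instead of choosing a model per record inside the map, which is what we actually want.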
Maybe there is a different load function I should be using?

Thanks!
Sumona

On Thu, Jan 12, 2017 at 6:26 PM ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> Given that training and prediction are two different applications, I
> typically save model objects to HDFS and load them back during the
> prediction map stages.
>
> Best
> Ayan
>
> On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh <sumos...@gmail.com> wrote:
>
> Hi all,
> I've been working with the Spark MLlib 2.0.2
> RandomForestClassificationModel.
>
> I encountered two frustrating issues and would really appreciate some
> advice:
>
> 1) RandomForestClassificationModel is effectively not serializable (I
> assume it references something that can't be serialized, since it itself
> extends Serializable), so I ended up with the well-known exception:
> org.apache.spark.SparkException: Task not serializable.
> Basically, my original intention was to pass the model in as a parameter,
> because which model we use is chosen dynamically based on the record we
> are predicting on.
> Has anyone else encountered this? Is this currently being addressed? I
> would expect objects from Spark's own libraries to be usable seamlessly
> in applications without these kinds of exceptions.
>
> 2) The RandomForestClassificationModel.load method appears to hang
> indefinitely when executed from inside a map function (which I assume is
> shipped to the executors), so I basically cannot load a model from a
> worker. We have multiple "profiles" that use differently trained models,
> which are accessed from within a map function to run predictions on
> different sets of data.
> The hanging thread's most pertinent frame is:
> org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:391)
> Looking at the code on GitHub, it appears to be calling sc.textFile. I
> could not find anything stating that this particular function would not
> work from within a map function.
>
> Are there any suggestions as to how I can get this model to work in a
> real production job (either by making it serializable so it can be
> passed around, or by loading it from a worker)?
>
> I've extensively POCed this model (saving, loading, transforming,
> training, etc.); however, this is the first time I'm attempting to use
> it in a real application.
>
> Sumona