Yes, I save it to S3 in a different process. It is actually the
RandomForestClassificationModel.load method (passed an S3 path) where I run
into problems.
When you say you load it during map stages, do you mean that you are able
to directly load a model from inside a transformation? When I try this,
Spark ships the function to a worker, and the load method itself appears to
attempt to create a new SparkContext, which causes an NPE downstream
(because creating a SparkContext on a worker is not an appropriate thing
to do, according to various threads I've found).
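
For reference, this is roughly the shape of what I'm attempting (a
simplified sketch; pathFor is a stand-in for my own record-to-S3-path
lookup):

    import org.apache.spark.ml.classification.RandomForestClassificationModel

    val models = records.map { record =>
      // load() executes on the worker inside this closure; internally it
      // appears to try to create a SparkContext, which is not available
      // on an executor, and that's where the NPE shows up.
      RandomForestClassificationModel.load(pathFor(record))
    }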

Maybe there is a different load function I should be using?

Thanks!
Sumona

On Thu, Jan 12, 2017 at 6:26 PM ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> Given that training and prediction are two different applications, I
> typically save model objects to HDFS and load them back during the
> prediction map stages.
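>
> A minimal sketch of that round trip (paths below are placeholders):
>
>     // Training application: persist the fitted model.
>     model.write.overwrite().save("hdfs:///models/rf")
>
>     // Prediction application: load it back before scoring.
>     val model =
>       RandomForestClassificationModel.load("hdfs:///models/rf")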
>
> Best
> Ayan
>
> On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh <sumos...@gmail.com> wrote:
>
> Hi all,
> I've been working with Spark MLlib 2.0.2's RandomForestClassificationModel.
>
> I encountered two frustrating issues and would really appreciate some
> advice:
>
> 1)  RandomForestClassificationModel is effectively not serializable (I
> assume it references something that can't be serialized, since the class
> itself extends Serializable), so I ended up with the well-known exception:
> org.apache.spark.SparkException: Task not serializable.
> Basically, my original intention was to pass the model as a parameter,
> because which model we use is dynamic based on what record we are
> predicting on.
>
> Has anyone else encountered this? Is this currently being addressed? I
> would expect objects from Spark's own libraries to be usable seamlessly
> in their applications without these types of exceptions.
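>
> For concreteness, this is the shape of the code that triggers it (a
> sketch; modelPath and predictFor stand in for my own lookup and
> prediction logic):
>
>     val model = RandomForestClassificationModel.load(modelPath)
>     // The closure below captures `model`, so Spark attempts to
>     // serialize it when shipping the task, and that is what throws
>     // Task not serializable.
>     val results = records.map(record => predictFor(model, record))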
>
> 2) The RandomForestClassificationModel.load method appears to hang
> indefinitely when executed from inside a map function (which I assume is
> passed to the executor). So, I basically cannot load a model from a worker.
> We have multiple "profiles" that use differently trained models, which are
> accessed from within a map function to run predictions on different sets of
> data.
> The hanging thread's most pertinent stack frame is:
>
> org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:391)
> Looking at the code on GitHub, it appears that it is calling sc.textFile.
> I could not find anything stating that this particular function would not
> work from within a map function.
>
> Are there any suggestions as to how I can get this model to work in a real
> production job (either by making it serializable so it can be passed
> around, or by loading it from a worker)?
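>
> One pattern I'm considering (a sketch only, not yet verified, and
> assuming the input DataFrame df carries a profile column): load every
> profile's model once on the driver, score each profile's slice of the
> data with transform(), and union the results, so that load() never runs
> inside a map:
>
>     import org.apache.spark.sql.functions.col
>
>     // profilePaths: Map[String, String] of profile name -> model path.
>     val models = profilePaths.map { case (profile, path) =>
>       profile -> RandomForestClassificationModel.load(path)
>     }
>     val scored = models.map { case (profile, model) =>
>       // Each model only sees the rows belonging to its profile.
>       model.transform(df.filter(col("profile") === profile))
>     }.reduce(_ union _)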
>
> I've extensively POCed this model (saving, loading, transforming,
> training, etc.); however, this is the first time I'm attempting to use it
> in a real application.
>
> Sumona
>
>
