Is there any update on this problem? I have encountered the same issue mentioned here.

I have CrossValidatorModel.transform(df) running on workers, which requires a DataFrame as input. However, we only have Arrays on the workers. When we deploy our model in cluster mode, we cannot call createDataFrame on the workers; it gives this error:

17/02/13 20:21:27 ERROR Detector$: Error while detecting threats
java.lang.NullPointerException
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:111)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
    at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:270)
    at com.mycompany.analytics.models.app.serializable.AppModeler.detection(modeler.scala:370)

On the other hand, if we run locally, everything works fine. I just want to know whether there is any successful case of running machine learning models on the workers.

Thanks,
Jianhong
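[For reference, a minimal sketch of the driver-side shape that avoids this NPE, assuming Spark 2.x. The model path, the featureArrays RDD, the appName, and the "features" column name are all placeholders, not the poster's actual code. The SparkSession is only usable on the driver, so the DataFrame is built there; transform then distributes the actual scoring to the workers.

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("detector").getOrCreate()
import spark.implicits._

// Load the model once, on the driver (hypothetical path).
val model = CrossValidatorModel.load("s3://my-bucket/models/cv-model")

// featureArrays stands in for whatever upstream stage produced the arrays.
val featureArrays: RDD[Array[Double]] =
  spark.sparkContext.parallelize(Seq(Array(0.1, 0.2), Array(0.3, 0.4)))

// createDataFrame/toDF must run on the driver; only the resulting query plan
// is shipped to the workers, never the SparkSession itself.
val df = featureArrays.map(a => Tuple1(Vectors.dense(a))).toDF("features")

model.transform(df).show()

If the arrays genuinely only exist inside a worker-side computation, the usual fix is to restructure the pipeline so the features arrive at the driver as an RDD or DataFrame first, rather than calling createDataFrame inside a map function.]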
From: Sumona Routh [mailto:sumos...@gmail.com]
Sent: Thursday, January 12, 2017 6:20 PM
To: ayan guha <guha.a...@gmail.com>; user@spark.apache.org
Subject: Re: Can't load a RandomForestClassificationModel in Spark job

Yes, I save it to S3 in a different process. It is actually the RandomForestClassificationModel.load method (passed an S3 path) where I run into problems.

When you say you load it during map stages, do you mean that you are able to load a model directly from inside a transformation? When I try this, the function is passed to a worker, and the load method itself appears to attempt to create a new SparkContext, which causes an NPE downstream (because creating a SparkContext on a worker is not an appropriate thing to do, according to various threads I've found). Maybe there is a different load function I should be using?

Thanks!
Sumona

On Thu, Jan 12, 2017 at 6:26 PM ayan guha <guha.a...@gmail.com> wrote:

Hi,

Given that training and prediction are two different applications, I typically save the model objects to HDFS and load them back during the prediction map stages.

Best,
Ayan
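[A minimal sketch of that save-then-load pattern, assuming Spark 2.x and the spark.ml RandomForestClassifier; all paths and column names below are placeholders. Note that both the save and the load happen on the driver of their respective applications:

import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("train-then-score").getOrCreate()

// Training application: fit and persist the model from the driver.
val training = spark.read.parquet("hdfs:///data/training")  // placeholder path
val model = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .fit(training)
model.write.overwrite().save("hdfs:///models/rf-model")

// Prediction application (a separate job): load on the driver, then transform.
val loaded = RandomForestClassificationModel.load("hdfs:///models/rf-model")
val scored = loaded.transform(spark.read.parquet("hdfs:///data/to-score"))]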
On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh <sumos...@gmail.com> wrote:

Hi all,

I've been working with the Spark MLlib 2.0.2 RandomForestClassificationModel. I encountered two frustrating issues and would really appreciate some advice:

1) RandomForestClassificationModel is effectively not serializable (I assume it references something that can't be serialized, since it itself extends Serializable), so I ended up with the well-known exception:

org.apache.spark.SparkException: Task not serializable

My original intention was to pass the model in as a parameter, because which model we use is chosen dynamically based on the record we are predicting on. Has anyone else encountered this? Is it currently being addressed? I would expect objects from Spark's own libraries to be usable seamlessly in Spark applications without these kinds of exceptions.

2) The RandomForestClassificationModel.load method appears to hang indefinitely when executed from inside a map function (which I assume is passed to the executors). So I basically cannot load a model from a worker. We have multiple "profiles" that use differently trained models, which are accessed from within a map function to run predictions on different sets of data. The hanging thread's latest (most pertinent) frame is:

org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:391)

Looking at the code on GitHub, it appears to be calling sc.textFile, but I could not find anything stating that this particular function would not work from within a map function.

Are there any suggestions as to how I can get this model to work in a real production job (either by allowing it to be serialized and passed around, or by loading it from a worker)? I've extensively POCed this model (saving, loading, transforming, training, etc.), but this is the first time I'm attempting to use it from within a real application.

Sumona
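[One way to support multiple per-profile models without ever loading or serializing a model on a worker is to load them all up front on the driver, then split the data by profile and run transform per model. A minimal sketch, assuming Spark 2.x; the "profile" column, the profile ids, and all paths are assumptions, not the poster's actual schema:

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("multi-profile-scoring").getOrCreate()

// Hypothetical profile ids; each has its own trained model in storage.
val profiles = Seq("profileA", "profileB")

// Load every model on the driver; nothing here runs on a worker.
val models: Map[String, RandomForestClassificationModel] =
  profiles.map(p => p -> RandomForestClassificationModel.load(s"s3://bucket/models/$p")).toMap

val input: DataFrame = spark.read.parquet("s3://bucket/data/to-score")  // placeholder

// Score each profile's slice with its own model and union the results.
val scored: DataFrame = profiles
  .map(p => models(p).transform(input.filter(input("profile") === p)))
  .reduce(_ union _)

The trade-off is one transform per profile rather than row-level dispatch inside a single map function, but it keeps every load and every model reference on the driver, which sidesteps both the Task-not-serializable exception and the worker-side load hang described above.]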