Thanks Hollin.
I will take a look at MLeap and will let you know if I have any questions.
Jianhong
From: Hollin Wilkins [mailto:hol...@combust.ml]
Sent: Tuesday, February 14, 2017 11:48 PM
To: Jianhong Xia
Cc: Sumona Routh; ayan guha; user@spark.apache.org
Subject: Re: Can't load a RandomForestClassificationModel in Spark job
Hey there,
Creating a new SparkContext on workers will not work, only the driver is
allowed to own a SparkContext. Are you trying to distribute your model to
workers so you can create a distributed scoring service? If so, it may be worth
looking into taking your models outside of a SparkContext and serving them
separately.
If this is your use case, take a look at MLeap. We use it in production to
serve high-volume realtime requests from Spark-trained models:
https://github.com/combust/mleap
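To illustrate the "serve them separately" idea, here is a minimal sketch in plain Python (this is not MLeap's actual API; TinyModel is a hypothetical stand-in for an exported model): the model is serialized once by the training job, then deserialized once at service startup and used to score plain feature arrays, with no SparkContext anywhere.

```python
import pickle

# Hypothetical stand-in for a trained model; any serializable object
# with a predict() method illustrates the pattern.
class TinyModel:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, features):
        return 1 if sum(features) > self.threshold else 0

# "Export" the model once (in reality: serialize it from the training job).
blob = pickle.dumps(TinyModel(threshold=1.0))

# Scoring service: deserialize ONCE at startup; no SparkContext involved.
model = pickle.loads(blob)

def score(request_features):
    # Each request is a plain list of features, not a DataFrame.
    return model.predict(request_features)
```

The point is that the scoring path only needs the deserialized model object, so it can run in any JVM or process, not just on the Spark driver.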
Cheers,
Hollin
On Tue, Feb 14, 2017 at 4:46 PM, Jianhong Xia <j...@infoblox.com> wrote:
Is there any update on this problem?
I encountered the same issue that was mentioned here.
I have CrossValidatorModel.transform(df) running on workers, which requires a
DataFrame as input. However, we only have arrays on the workers. When we deploy
our model in cluster mode, we cannot call createDataFrame on the workers; it
gives me this error:
17/02/13 20:21:27 ERROR Detector$: Error while detecting threats
java.lang.NullPointerException
  at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:111)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
  at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:270)
  at com.mycompany.analytics.models.app.serializable.AppModeler.detection(modeler.scala:370)
On the other hand, if we run in local mode, everything works fine.
I just want to know whether there is any successful case of running machine
learning models on the workers.
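What we are effectively trying to do, sketched in plain Python with a hypothetical stand-in scorer (no Spark here, just the per-partition pattern): load the model lazily, once per worker process, and score plain arrays directly instead of calling createDataFrame on the workers.

```python
_model = None  # module-level cache: one copy per executor process

def get_model():
    # Lazily initialize the model once per process; a real loader would
    # read the serialized model from HDFS/S3 (no SparkSession needed).
    global _model
    if _model is None:
        _model = lambda features: sum(features)  # hypothetical scorer
    return _model

def score_partition(rows):
    model = get_model()       # at most one load per worker process
    for features in rows:     # plain arrays, no createDataFrame call
        yield model(features)

# In Spark this would be rdd.mapPartitions(score_partition); locally:
results = list(score_partition([[1, 2], [3, 4]]))
```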
Thanks,
Jianhong
From: Sumona Routh [mailto:sumos...@gmail.com]
Sent: Thursday, January 12, 2017 6:20 PM
To: ayan guha <guha.a...@gmail.com>; user@spark.apache.org
Subject: Re: Can't load a RandomForestClassificationModel in Spark job
Yes, I save it to S3 in a different process. It is actually the
RandomForestClassificationModel.load method (passed an S3 path) where I run
into problems.
When you say you load it during map stages, do you mean that you are able to
directly load a model from inside of a transformation? When I try this, it
passes the function to a worker, and the load method itself appears to attempt
to create a new SparkContext, which causes an NPE downstream (because creating
a SparkContext on the worker is not an appropriate thing to do, according to
various threads I've found).
Maybe there is a different load function I should be using?
Thanks!
Sumona
On Thu, Jan 12, 2017 at 6:26 PM ayan guha <guha.a...@gmail.com> wrote:
Hi
Given that training and prediction are two different applications, I typically
save the model objects to HDFS and load them back during the prediction map
stages.
Best
Ayan
On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh <sumos...@gmail.com> wrote:
Hi all,
I've been working with the Spark MLlib 2.0.2 RandomForestClassificationModel.
I encountered two frustrating issues and would really appreciate some advice:
1) RandomForestClassificationModel is effectively not serializable (I assume
it references something that can't be serialized, since it itself extends
Serializable), so I ended up with the well-known exception:
org.apache.spark.SparkException: Task not serializable.
Basically, my original intention was to pass the model in as a parameter,
because which model we use is chosen dynamically based on the record we are
predicting on.
Has anyone else encountered this? Is it currently being addressed? I would
expect objects from Spark's own libraries to be usable seamlessly in
applications without these kinds of exceptions.
2) The RandomForestClassificationModel.load method appears to hang indefinitely
when executed from inside a map function (which I assume is passed to the
executor). So, I basically cannot load a model from a worker. We have multiple
"profiles" that use differently trained models, which are accessed from within
a map function to run predictions on different sets of data.
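What I'm effectively after is something like the following sketch (plain Python, with a hypothetical stand-in loader, since the real RandomForestClassificationModel.load hangs): a per-executor cache keyed by profile, so each profile's model is loaded at most once per worker rather than inside every map call.

```python
# Per-executor cache of models keyed by profile, so each profile's
# model is loaded at most once per worker process. The loader below
# is a hypothetical stand-in for deserializing a trained model.
_models = {}

def load_model_for(profile):
    # Real code would deserialize the model trained for this profile.
    return {"profile": profile, "predict": lambda features: len(features)}

def get_model(profile):
    # Look up the cached model; load and cache it on first use.
    if profile not in _models:
        _models[profile] = load_model_for(profile)
    return _models[profile]
```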
The thread that is hanging has this as the latest (most pertinent) code:
org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:391)
Looking at the code on GitHub, it appears to be calling sc.textFile. I could
not find anything stating that this particular function would not work from
within a map function.
Are there any suggestions as to how I can get this model to work on a real
production job (either by allowing it to be serializable and passed aro