RE: Can't load a RandomForestClassificationModel in Spark job

2017-02-16 Thread Jianhong Xia
Thanks Hollin.

I will take a look at mleap and will let you know if I have any questions.

Jianhong


From: Hollin Wilkins [mailto:hol...@combust.ml]
Sent: Tuesday, February 14, 2017 11:48 PM
To: Jianhong Xia
Cc: Sumona Routh; ayan guha; user@spark.apache.org
Subject: Re: Can't load a RandomForestClassificationModel in Spark job

Hey there,

Creating a new SparkContext on workers will not work, only the driver is 
allowed to own a SparkContext. Are you trying to distribute your model to 
workers so you can create a distributed scoring service? If so, it may be worth 
looking into taking your models outside of a SparkContext and serving them 
separately.

If this is your use case, take a look at MLeap. We use it in production to 
serve high-volume realtime requests from Spark-trained models: 
https://github.com/combust/mleap
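
For anyone curious what that looks like, below is a rough sketch of exporting a Spark-trained pipeline to an MLeap bundle, based on the MLeap README of that era; the exact imports and signatures vary by MLeap version, and `pipelineModel` and the path are placeholders:

  import ml.combust.bundle.BundleFile
  import ml.combust.mleap.spark.SparkSupport._
  import resource._

  // `pipelineModel` is assumed to be a fitted org.apache.spark.ml.PipelineModel
  // containing the RandomForestClassificationModel; the path is a placeholder.
  for (bundle <- managed(BundleFile("jar:file:/tmp/rf-model.zip"))) {
    pipelineModel.writeBundle.save(bundle).get
  }

  // The resulting bundle can then be loaded by mleap-runtime in a plain JVM
  // process (no SparkContext) and served behind a REST endpoint.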

Cheers,
Hollin

On Tue, Feb 14, 2017 at 4:46 PM, Jianhong Xia <j...@infoblox.com> wrote:
Is there any update on this problem?

I encountered the same issue that was mentioned here.

I have CrossValidatorModel.transform(df) running on workers, which requires a 
DataFrame as input. However, we only have Arrays on the workers. When we deploy 
our model in cluster mode, we cannot call createDataFrame on the workers; it 
gives me this error:


17/02/13 20:21:27 ERROR Detector$: Error while detecting threats
java.lang.NullPointerException
 at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:111)
 at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
 at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:270)
 at com.mycompany.analytics.models.app.serializable.AppModeler.detection(modeler.scala:370)



On the other hand, if we run in local mode, everything works fine.
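
A minimal sketch of the failing pattern versus a driver-side alternative (all names here are placeholders, assuming a SparkSession `spark`, an RDD[Array[Double]] `featuresRdd`, and a fitted `cvModel`):

  import org.apache.spark.ml.linalg.Vectors

  // Fails in cluster mode: the closure runs on an executor, where the
  // SparkSession's internal state is null, producing the NPE above.
  val scored = featuresRdd.map { arr =>
    val df = spark.createDataFrame(Seq(Tuple1(Vectors.dense(arr)))).toDF("features")
    cvModel.transform(df) // never reached on a worker
  }

  // Works: build the DataFrame on the driver and let transform() distribute.
  val input = spark.createDataFrame(featuresRdd.map(a => Tuple1(Vectors.dense(a))))
    .toDF("features")
  val predictions = cvModel.transform(input)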

I just want to know whether there is any successful case of running machine 
learning models on the workers.


Thanks,
Jianhong


From: Sumona Routh [mailto:sumos...@gmail.com]
Sent: Thursday, January 12, 2017 6:20 PM
To: ayan guha <guha.a...@gmail.com>; user@spark.apache.org
Subject: Re: Can't load a RandomForestClassificationModel in Spark job

Yes, I save it to S3 in a different process. It is actually the 
RandomForestClassificationModel.load method (passed an S3 path) where I run 
into problems.
When you say you load it during map stages, do you mean that you are able to 
directly load a model from inside of a transformation? When I try this, it 
passes the function to a worker, and the load method itself appears to attempt 
to create a new SparkContext, which causes an NPE downstream (because creating 
a SparkContext on the worker is not an appropriate thing to do, according to 
various threads I've found).
Maybe there is a different load function I should be using?
Thanks!
Sumona

On Thu, Jan 12, 2017 at 6:26 PM ayan guha <guha.a...@gmail.com> wrote:
Hi

Given that training and prediction are two different applications, I typically 
save model objects to HDFS and load them back during the prediction map stages.
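
Concretely, that flow might look like this (paths and the `rfModel`/`inputDf` names are placeholders):

  import org.apache.spark.ml.classification.RandomForestClassificationModel

  // Training application: persist the fitted model.
  rfModel.write.overwrite().save("hdfs:///models/rf")

  // Prediction application: load on the driver (not inside a task),
  // then transform the input DataFrame.
  val model = RandomForestClassificationModel.load("hdfs:///models/rf")
  val predictions = model.transform(inputDf)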

Best
Ayan

On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh <sumos...@gmail.com> wrote:
Hi all,
I've been working with the Spark MLlib 2.0.2 RandomForestClassificationModel.
I encountered two frustrating issues and would really appreciate some advice:

1) RandomForestClassificationModel is effectively not serializable (I assume 
it's referencing something that can't be serialized, since it itself extends 
Serializable), so I ended up with the well-known exception: 
org.apache.spark.SparkException: Task not serializable.
Basically, my original intention was to pass the model as a parameter, because 
which model we use is dynamic, based on which record we are predicting on.
Has anyone else encountered this? Is this currently being addressed? I would 
expect objects from Spark's own libraries to be usable seamlessly in their 
applications without these types of exceptions.
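
A sketch of the pattern that triggers the exception (the `Record` type, `modelsByProfile`, and `recordsRdd` are hypothetical): the model is captured by the task closure, so Spark attempts to serialize it along with the task.

  import org.apache.spark.ml.classification.RandomForestClassificationModel

  case class Record(profile: String) // hypothetical record type

  val modelsByProfile: Map[String, RandomForestClassificationModel] =
    Map.empty // in reality, loaded on the driver

  // recordsRdd: RDD[Record] is assumed to exist.
  recordsRdd.map { record =>
    // The closure captures modelsByProfile, so Spark tries to serialize
    // every model with the task => SparkException: Task not serializable.
    val model = modelsByProfile(record.profile)
    model.uid // stand-in for per-record scoring
  }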
2) The RandomForestClassificationModel.load method appears to hang indefinitely 
when executed from inside a map function (which I assume is passed to the 
executor). So, I basically cannot load a model from a worker. We have multiple 
"profiles" that use differently trained models, which are accessed from within 
a map function to run predictions on different sets of data.
The thread that is hanging has this as the latest (most pertinent) code:
org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:391)
Looking at the code on GitHub, it appears that it is calling sc.textFile. I 
could not find anything stating that this particular function would not work 
from within a map function.
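
For reference, a sketch of the pattern being described (`pathFor` and `profilesRdd` are hypothetical):

  import org.apache.spark.ml.classification.RandomForestClassificationModel

  // Hypothetical helper mapping a profile name to its saved-model path.
  def pathFor(profile: String): String = s"s3://bucket/models/$profile"

  // profilesRdd: RDD[String] is assumed to exist.
  profilesRdd.map { profile =>
    // load() ends up in DefaultParamsReader.loadMetadata, which calls
    // sc.textFile; executors have no usable SparkContext, so this hangs.
    val model = RandomForestClassificationModel.load(pathFor(profile))
    model.uid
  }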
Are there any suggestions as to how I can get this model to work on a real 
production job (either by allowing it to be serializable and passed around or 
loaded from a worker)?
I've extensively POCed this model (saving, loading, transforming, training, 
etc.); however, this is the first time I'm attempting to use it from within a 
real application.
Sumona
