Is there any update on this problem? I have encountered the same issue mentioned here.

I have CrossValidatorModel.transform(df) running on workers, which requires a DataFrame as input. However, we only have Arrays on the workers. When we deploy our model in cluster mode, we cannot call createDataFrame on the workers; it gives this error:

17/02/13 20:21:27 ERROR Detector$: Error while detecting threats
java.lang.NullPointerException
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:111)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
    at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:270)
    at com.mycompany.analytics.models.app.serializable.AppModeler.detection(modeler.scala:370)

On the other hand, if we run locally, everything works fine. I just want to know whether there is any successful case of running machine learning models on the workers.

Thanks,
Jianhong
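[For reference, a minimal sketch of the driver-side shape that avoids this NPE, assuming Spark 2.x. The model path, the featureArrays RDD, the appName, and the "features" column name are all placeholders, not the poster's actual code. The SparkSession is only usable on the driver, so the DataFrame is built there; transform then distributes the actual scoring to the workers.

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("detector").getOrCreate()
import spark.implicits._

// Load the model once, on the driver (hypothetical path).
val model = CrossValidatorModel.load("s3://my-bucket/models/cv-model")

// featureArrays stands in for whatever upstream stage produced the arrays.
val featureArrays: RDD[Array[Double]] =
  spark.sparkContext.parallelize(Seq(Array(0.1, 0.2), Array(0.3, 0.4)))

// createDataFrame/toDF must run on the driver; only the resulting query plan
// is shipped to the workers, never the SparkSession itself.
val df = featureArrays.map(a => Tuple1(Vectors.dense(a))).toDF("features")

model.transform(df).show()

If the arrays genuinely only exist inside a worker-side computation, the usual fix is to restructure the pipeline so the features arrive at the driver as an RDD or DataFrame first, rather than calling createDataFrame inside a map function.]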
From: Sumona Routh [mailto:sumos...@gmail.com]
Sent: Thursday, January 12, 2017 6:20 PM
To: ayan guha <guha.a...@gmail.com>; user@spark.apache.org
Subject: Re: Can't load a RandomForestClassificationModel in Spark job

Yes, I save it to S3 in a different process. It is actually the RandomForestClassificationModel.load method (passed an S3 path) where I run into problems.

When you say you load it during map stages, do you mean that you are able to load a model directly from inside a transformation? When I try this, the function is passed to a worker, and the load method itself appears to attempt to create a new SparkContext, which causes an NPE downstream (because creating a SparkContext on a worker is not an appropriate thing to do, according to various threads I've found). Maybe there is a different load function I should be using?

Thanks!
Sumona

On Thu, Jan 12, 2017 at 6:26 PM ayan guha <guha.a...@gmail.com> wrote:

Hi,

Given that training and prediction are two different applications, I typically save the model objects to HDFS and load them back during the prediction map stages.

Best,
Ayan
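[A minimal sketch of that save-then-load pattern, assuming Spark 2.x and the spark.ml RandomForestClassifier; all paths and column names below are placeholders. Note that both the save and the load happen on the driver of their respective applications:

import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("train-then-score").getOrCreate()

// Training application: fit and persist the model from the driver.
val training = spark.read.parquet("hdfs:///data/training")  // placeholder path
val model = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .fit(training)
model.write.overwrite().save("hdfs:///models/rf-model")

// Prediction application (a separate job): load on the driver, then transform.
val loaded = RandomForestClassificationModel.load("hdfs:///models/rf-model")
val scored = loaded.transform(spark.read.parquet("hdfs:///data/to-score"))]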
On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh <sumos...@gmail.com> wrote:

Hi all,

I've been working with the Spark MLlib 2.0.2 RandomForestClassificationModel. I encountered two frustrating issues and would really appreciate some advice:

1) RandomForestClassificationModel is effectively not serializable (I assume it references something that can't be serialized, since it itself extends Serializable), so I ended up with the well-known exception:

org.apache.spark.SparkException: Task not serializable

My original intention was to pass the model in as a parameter, because which model we use is chosen dynamically based on the record we are predicting on. Has anyone else encountered this? Is it currently being addressed? I would expect objects from Spark's own libraries to be usable seamlessly in Spark applications without these kinds of exceptions.

2) The RandomForestClassificationModel.load method appears to hang indefinitely when executed from inside a map function (which I assume is passed to the executors). So I basically cannot load a model from a worker. We have multiple "profiles" that use differently trained models, which are accessed from within a map function to run predictions on different sets of data. The hanging thread's latest (most pertinent) frame is:

org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:391)

Looking at the code on GitHub, it appears to be calling sc.textFile, but I could not find anything stating that this particular function would not work from within a map function.

Are there any suggestions as to how I can get this model to work in a real production job (either by allowing it to be serialized and passed around, or by loading it from a worker)? I've extensively POCed this model (saving, loading, transforming, training, etc.), but this is the first time I'm attempting to use it from within a real application.

Sumona
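[One way to support multiple per-profile models without ever loading or serializing a model on a worker is to load them all up front on the driver, then split the data by profile and run transform per model. A minimal sketch, assuming Spark 2.x; the "profile" column, the profile ids, and all paths are assumptions, not the poster's actual schema:

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("multi-profile-scoring").getOrCreate()

// Hypothetical profile ids; each has its own trained model in storage.
val profiles = Seq("profileA", "profileB")

// Load every model on the driver; nothing here runs on a worker.
val models: Map[String, RandomForestClassificationModel] =
  profiles.map(p => p -> RandomForestClassificationModel.load(s"s3://bucket/models/$p")).toMap

val input: DataFrame = spark.read.parquet("s3://bucket/data/to-score")  // placeholder

// Score each profile's slice with its own model and union the results.
val scored: DataFrame = profiles
  .map(p => models(p).transform(input.filter(input("profile") === p)))
  .reduce(_ union _)

The trade-off is one transform per profile rather than row-level dispatch inside a single map function, but it keeps every load and every model reference on the driver, which sidesteps both the Task-not-serializable exception and the worker-side load hang described above.]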