Hey there,

Creating a new SparkContext on the workers will not work; only the driver is
allowed to own a SparkContext. Are you trying to distribute your model to the
workers so you can build a distributed scoring service? If so, it may be worth
taking your models outside of a SparkContext and serving them separately.
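If you do need to score within the same Spark application, here is a minimal
sketch of the driver-side pattern: load the model once on the driver and let
transform() distribute the scoring, so no SparkContext is ever touched inside
a closure. The S3 paths and Parquet input are assumptions for illustration,
not from the thread:

    import org.apache.spark.ml.classification.RandomForestClassificationModel
    import org.apache.spark.sql.SparkSession

    object DriverSideScoring {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("DriverSideScoring").getOrCreate()

        // Load once, on the driver. load() uses the driver's SparkContext,
        // which is why it must not be called from inside a map/filter closure.
        val model = RandomForestClassificationModel.load("s3://my-bucket/models/rf")

        // The input just needs the "features" vector column the model was
        // trained with; transform() plans the distributed work itself.
        val toScore = spark.read.parquet("s3://my-bucket/data/to-score")
        model.transform(toScore).write.parquet("s3://my-bucket/data/scored")

        spark.stop()
      }
    }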
If this is your use case, take a look at MLeap. We use it in production to
serve high-volume real-time requests from Spark-trained models:
https://github.com/combust/mleap

Cheers,
Hollin

On Tue, Feb 14, 2017 at 4:46 PM, Jianhong Xia <j...@infoblox.com> wrote:

> Is there any update on this problem?
>
> I encountered the same issue that was mentioned here.
>
> I have CrossValidatorModel.transform(df) running on workers, which
> requires a DataFrame as input. However, we only have Arrays on the
> workers. When we deploy our model in cluster mode, we cannot call
> createDataFrame on the workers. It gives me this error:
>
> 17/02/13 20:21:27 ERROR Detector$: Error while detecting threats
> java.lang.NullPointerException
>     at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:111)
>     at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
>     at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:270)
>     at com.mycompany.analytics.models.app.serializable.AppModeler.detection(modeler.scala:370)
>
> On the other hand, if we run locally, everything works fine.
>
> I just want to know whether there is any successful case that runs
> machine learning models on the workers.
>
> Thanks,
> Jianhong
>
> From: Sumona Routh [mailto:sumos...@gmail.com]
> Sent: Thursday, January 12, 2017 6:20 PM
> To: ayan guha <guha.a...@gmail.com>; user@spark.apache.org
> Subject: Re: Can't load a RandomForestClassificationModel in Spark job
>
> Yes, I save it to S3 in a different process. It is actually the
> RandomForestClassificationModel.load method (passed an S3 path) where I
> run into problems.
>
> When you say you load it during map stages, do you mean that you are
> able to load a model directly from inside of a transformation? When I
> try this, it passes the function to a worker, and the load method itself
> appears to attempt to create a new SparkContext, which causes an NPE
> downstream (because creating a SparkContext on a worker is not an
> appropriate thing to do, according to various threads I've found).
>
> Maybe there is a different load function I should be using?
>
> Thanks!
> Sumona
>
> On Thu, Jan 12, 2017 at 6:26 PM ayan guha <guha.a...@gmail.com> wrote:
>
> Hi,
>
> Given that training and prediction are two different applications, I
> typically save model objects to HDFS and load them back during the
> prediction map stages.
>
> Best,
> Ayan
>
> On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh <sumos...@gmail.com> wrote:
>
> Hi all,
>
> I've been working with the Spark MLlib 2.0.2
> RandomForestClassificationModel. I encountered two frustrating issues
> and would really appreciate some advice:
>
> 1) RandomForestClassificationModel is effectively not serializable (I
> assume it references something that can't be serialized, since it itself
> extends Serializable), so I ended up with the well-known exception:
> org.apache.spark.SparkException: Task not serializable.
>
> Basically, my original intention was to pass the model as a parameter,
> because which model we use is dynamic, based on which record we are
> predicting on.
>
> Has anyone else encountered this? Is it currently being addressed? I
> would expect objects from Spark's own libraries to be usable seamlessly
> in Spark applications without these kinds of exceptions.
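A minimal sketch of the failure mode in (1): any closure that captures the
model forces Spark to serialize it into the task. The names modelPath,
featuresDf, and the score helper are hypothetical stand-ins, not from the
thread:

    import org.apache.spark.ml.classification.RandomForestClassificationModel
    import org.apache.spark.sql.DataFrame

    def scoreOnDriver(modelPath: String, featuresDf: DataFrame): DataFrame = {
      // Runs on the driver, where the SparkContext legitimately lives.
      val model = RandomForestClassificationModel.load(modelPath)

      // Anti-pattern: capturing `model` in an RDD closure serializes it
      // with the task and throws "Task not serializable":
      //   featuresDf.rdd.map { row => score(model, row) } // hypothetical helper
      //
      // Keeping the model on the driver and calling transform() never ships
      // the model object inside a task closure:
      model.transform(featuresDf)
    }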
> 2) The RandomForestClassificationModel.load method appears to hang
> indefinitely when executed from inside a map function (which I assume is
> passed to the executors), so I basically cannot load a model from a
> worker. We have multiple "profiles" that use differently trained models,
> which are accessed from within a map function to run predictions on
> different sets of data.
>
> The thread that is hanging shows this as the latest (most pertinent)
> frame:
> org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:391)
>
> Looking at the code on GitHub, it appears to be calling sc.textFile. I
> could not find anything stating that this particular function would not
> work from within a map function.
>
> Are there any suggestions as to how I can get this model to work in a
> real production job, either by allowing it to be serializable and passed
> around or by loading it from a worker?
>
> I've extensively POCed this model (saving, loading, transforming,
> training, etc.); however, this is the first time I'm attempting to use
> it from within a real application.
>
> Sumona
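Pulling the thread's suggestions together, here is a sketch of one way to
handle the multi-profile case in (2) without ever loading a model on a
worker: load each profile's model on the driver, route each profile's rows to
its model with filter(), and union the results. The profile names, S3 layout,
and the "profile" column are assumptions for illustration:

    import org.apache.spark.ml.classification.RandomForestClassificationModel
    import org.apache.spark.sql.{DataFrame, SparkSession}

    val profiles = Seq("profileA", "profileB") // hypothetical profile ids

    def scoreAllProfiles(spark: SparkSession, df: DataFrame): DataFrame = {
      import spark.implicits._
      profiles
        .map { p =>
          // load() runs on the driver, where a SparkContext exists -- this
          // is exactly the call that hangs when issued from a worker.
          val model = RandomForestClassificationModel.load(s"s3://my-bucket/models/$p")
          model.transform(df.filter($"profile" === p))
        }
        .reduce(_ union _) // predictions for every profile in one DataFrame
    }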