Re: Can't load a RandomForestClassificationModel in Spark job

2017-02-14 Thread Hollin Wilkins
Hey there,

Creating a new SparkContext on workers will not work; only the driver is
allowed to own a SparkContext. Are you trying to distribute your model to
workers so you can create a distributed scoring service? If so, it may be
worth looking into taking your models outside of a SparkContext and serving
them separately.

If this is your use case, take a look at MLeap. We use it in production to
serve high-volume realtime requests from Spark-trained models:
https://github.com/combust/mleap

Cheers,
Hollin

On Tue, Feb 14, 2017 at 4:46 PM, Jianhong Xia  wrote:

> Is there any update on this problem?
>
>
>
> I encountered the same issue that was mentioned here.
>
>
>
> I have CrossValidatorModel.transform(df) running on workers, which
> requires a DataFrame as input. However, we only have Arrays on workers.
> When we deploy our model in cluster mode, we cannot call
> createDataFrame on workers. It gives this error:
>
>
>
>
>
> 17/02/13 20:21:27 ERROR Detector$: Error while detecting threats
>
> java.lang.NullPointerException
>
>  at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:111)
>
>  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
>
>  at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:270)
>
>  at com.mycompany.analytics.models.app.serializable.AppModeler.detection(modeler.scala:370)
>
>
>
>
>
>
>
> On the other hand, if we run in local mode, everything works fine.
>
>
>
> I just want to know whether anyone has successfully run machine
> learning models on the workers.
>
>
>
>
>
> Thanks,
>
> Jianhong
>
>
>
>
>
> *From:* Sumona Routh [mailto:sumos...@gmail.com]
> *Sent:* Thursday, January 12, 2017 6:20 PM
> *To:* ayan guha ; user@spark.apache.org
> *Subject:* Re: Can't load a RandomForestClassificationModel in Spark job
>
>
>
> Yes, I save it to S3 in a different process. It is actually the
> RandomForestClassificationModel.load method (passed an s3 path) where I
> run into problems.
> When you say you load it during map stages, do you mean that you are able
> to directly load a model from inside of a transformation? When I try this,
> it passes the function to a worker, and the load method itself appears to
> attempt to create a new SparkContext, which causes an NPE downstream
> (because creating a SparkContext on the worker is not an appropriate thing
> to do, according to various threads I've found).
>
> Maybe there is a different load function I should be using?
>
> Thanks!
>
> Sumona
>
>
>
> On Thu, Jan 12, 2017 at 6:26 PM ayan guha  wrote:
>
> Hi
>
>
>
> Given training and predictions are two different applications, I typically
> save model objects to hdfs and load it back during prediction map stages.
>
>
>
> Best
>
> Ayan
>
>
>
> On Fri, 13 Jan 2017 at 5:39 am, Sumona Routh  wrote:
>
> Hi all,
>
> I've been working with Spark mllib 2.0.2 RandomForestClassificationModel.
>
> I encountered two frustrating issues and would really appreciate some
> advice:
>
> 1)  RandomForestClassificationModel is effectively not serializable (I
> assume it's referencing something that can't be serialized, since it itself
> extends serializable), so I ended up with the well-known exception:
> org.apache.spark.SparkException: Task not serializable.
> Basically, my original intention was to pass the model as a parameter
> because which model we use is dynamic based on what record we are
> predicting on.
>
> Has anyone else encountered this? Is this currently being addressed? I
> would expect objects from Spark's own libraries to be usable
> seamlessly in their applications without these types of exceptions.
>
> 2) The RandomForestClassificationModel.load method appears to hang
> indefinitely when executed from inside a map function (which I assume is
> passed to the executor). So, I basically cannot load a model from a worker.
> We have multiple "profiles" that use differently trained models, which are
> accessed from within a map function to run predictions on different sets of
> data.
>
> The thread that is hanging has this as the latest (most pertinent) code:
> org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:391)
>
> Looking at the code in github, it appears that it is calling sc.textFile.
> I could not find anything stating that this particular function would not
> work from within a map function.
>
> Are there any suggestions as to how I can get this model to work on a real
> production job (either by allowing it to be serializable and passed around
> or loaded from a worker)?
>
> I've extensively POCed this model (saving, loading, transforming,
> training, etc.), however this is the first time I'm attempting to use it
> from within a real application.
>
> Sumona
>
>
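
The per-profile setup discussed in this thread (several differently trained models, chosen per record) can be handled without loading anything inside a map function. Below is a minimal sketch in plain Python, assuming each model is loaded once in the driver process and records are grouped by profile before scoring. The names load_models, score_by_profile and fake_loader are hypothetical, and the stand-in loader only imitates a real call such as RandomForestClassificationModel.load(path) made on the driver.

```python
from collections import defaultdict


def load_models(profiles, loader):
    """Load one model per profile, once, in the driver process."""
    return {profile: loader(profile) for profile in profiles}


def score_by_profile(records, models):
    """Group records by profile, then score each group with its own model."""
    groups = defaultdict(list)
    for record in records:
        groups[record["profile"]].append(record)

    results = []
    for profile, group in groups.items():
        model = models[profile]  # looked up once per group, never loaded per record
        for record in group:
            results.append((record["id"], model(record["features"])))
    return results


def fake_loader(profile):
    """Hypothetical stand-in for a real model load performed on the driver."""
    threshold = {"a": 0.5, "b": 2.0}[profile]
    return lambda features: int(sum(features) > threshold)


models = load_models(["a", "b"], fake_loader)
records = [
    {"id": 1, "profile": "a", "features": [0.4, 0.3]},
    {"id": 2, "profile": "b", "features": [0.4, 0.3]},
    {"id": 3, "profile": "a", "features": [0.1, 0.1]},
]
print(sorted(score_by_profile(records, models)))
```

In a Spark job the inner loop would presumably become one model.transform call per profile on a filtered DataFrame, issued from the driver, so that no worker ever needs a SparkContext.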


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-06 Thread Hollin Wilkins
0 but that's critical
>> item on our list.
>>
>> For exposing models out of PipelineModel, let me look into the ML
>> tasks...we should add it since dataframe should not be must for model
>> scoring...many times model are scored on api or streaming paths which don't
>> have micro batching involved...data directly lands from http or kafka/msg
>> queues...for such cases raw access to ML model is essential similar to
>> mllib model access...
>>
>> Thanks.
>> Deb
>> On Feb 4, 2017 9:58 PM, "Aseem Bansal"  wrote:
>>
>>> @Debasish
>>>
>>> I see that the spark version being used in the project that you
>>> mentioned is 1.6.0. I would suggest that you take a look at some blogs
>>> related to Spark 2.0 Pipelines, Models in new ml package. The new ml
>>> package's API as of latest Spark 2.1.0 release has no way to call predict
>>> on single vector. There is no API exposed. It is WIP but not yet released.
>>>
>>> On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das 
>>> wrote:
>>>
>>>> If we expose an API to access the raw models out of PipelineModel can't
>>>> we call predict directly on it from an API ? Is there a task open to expose
>>>> the model out of PipelineModel so that predict can be called on it...there
>>>> is no dependency of spark context in ml model...
>>>> On Feb 4, 2017 9:11 AM, "Aseem Bansal"  wrote:
>>>>
>>>>>
>>>>>- In Spark 2.0 there is a class called PipelineModel. I know that
>>>>>the title says pipeline but it is actually talking about PipelineModel
>>>>>trained via using a Pipeline.
>>>>>- Why PipelineModel instead of pipeline? Because usually there is
>>>>>a series of stuff that needs to be done when doing ML which warrants an
>>>>>ordered sequence of operations. Read the new spark ml docs or one of the
>>>>>databricks blogs related to spark pipelines. If you have used python's
>>>>>sklearn library the concept is inspired from there.
>>>>>- "once model is deserialized as ml model from the store of choice
>>>>>within ms" - The timing of loading the model was not what I was
>>>>>referring to when I was talking about timing.
>>>>>- "it can be used on incoming features to score through
>>>>>spark.ml.Model predict API". The predict API is in the old mllib package
>>>>>not the new ml package.
>>>>>- "why r we using dataframe and not the ML model directly from
>>>>>API" - Because as of now the new ml package does not have the direct API.
>>>>>
>>>>>
>>>>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <
>>>>> debasish.da...@gmail.com> wrote:
>>>>>
>>>>>> I am not sure why I will use pipeline to do scoring...idea is to
>>>>>> build a model, use model ser/deser feature to put it in the row or column
>>>>>> store of choice and provide a api access to the model...we support these
>>>>>> primitives in github.com/Verizon/trapezium...the api has access to
>>>>>> spark context in local or distributed mode...once model is deserialized as
>>>>>> ml model from the store of choice within ms, it can be used on incoming
>>>>>> features to score through spark.ml.Model predict API...I am not clear on
>>>>>> 2200x speedup...why r we using dataframe and not the ML model directly
>>>>>> from API ?
>>>>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal"  wrote:
>>>>>>
>>>>>>> Does this support Java 7?
>>>>>>> What is your timezone in case someone wanted to talk?
>>>>>>>
>>>>>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Aseem,
>>>>>>>>
>>>>>>>> We have built pipelines that execute several string indexers, one
>>>>>>>> hot encoders, scaling, and a random forest or linear regression at the end.
>>>>>>>> Execution time for the linear regression was on the order of 11
>>>>>&

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-03 Thread Hollin Wilkins
Hey Asher,

A phone call may be the best to discuss all of this. But in short:
1. It is quite easy to add custom pipelines/models to MLeap. All of our
out-of-the-box transformers can serve as a good example of how to do this.
We are also putting together documentation on how to do this in our docs
web site.
2. MLlib models are not supported, but it wouldn't be too difficult to add
support for them
3. We have benchmarked this, and with MLeap it was roughly 2200x faster
than SparkContext with a LocalRelation-backed DataFrame. The pipeline we
used for benchmarking included string indexing, one hot encoding, vector
assembly, scaling and a linear regression model. The reason for the speed
difference is that MLeap is optimized for one-off requests. Spark is
incredible for scoring large batches of data, but it takes time to
optimize your pipeline before execution. That optimization time is
noticeable when trying to build services around models.
4. Tensorflow support is early, but we have already built pipelines
including a Spark pipeline and a Tensorflow neural network all served from
one MLeap pipeline, using the same data structures as you would with just a
regular Spark pipeline. Eventually we will offer Tensorflow support as a
module that *just works TM* from Maven Central, but we are not quite there
yet.
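
To make point 3 concrete, here is a rough arithmetic sketch of why a fixed per-call cost dominates one-off requests yet all but disappears in large batches. Both constants are made-up values for illustration, not measurements from this benchmark, and per_row_cost_us is a hypothetical helper.

```python
def per_row_cost_us(fixed_overhead_us, per_row_us, batch_size):
    """Average cost per row when a fixed planning cost is paid once per call."""
    return fixed_overhead_us / batch_size + per_row_us


# Assumed numbers, for illustration only (not taken from the benchmark):
FIXED_OVERHEAD_US = 24_000.0  # hypothetical per-call planning/optimization cost
PER_ROW_US = 10.0             # hypothetical marginal cost of scoring one row

for batch_size in (1, 1_000, 1_000_000):
    cost = per_row_cost_us(FIXED_OVERHEAD_US, PER_ROW_US, batch_size)
    print(f"batch of {batch_size}: {cost:.3f} microseconds per row")
```

With a batch of one, the whole fixed cost lands on a single request. With a large batch it is spread over every row, which is why the same engine can be excellent for batch scoring and still slow for a realtime service.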

Feel free to email me privately if you would like to discuss any of this
more, or join our gitter:
https://gitter.im/combust/mleap

Best,
Hollin

On Fri, Feb 3, 2017 at 10:48 AM, Asher Krim  wrote:

> I have a bunch of questions for you Hollin:
>
> How easy is it to add support for custom pipelines/models?
> Are Spark mllib models supported?
> We currently run spark in local mode in an api service. It's not super
> terrible, but performance is a constant struggle. Have you benchmarked any
> performance differences between MLeap and vanilla Spark?
> What does Tensorflow support look like? I would love to serve models from
> a java stack while being agnostic to what framework was used to train them.
>
> Thanks,
> Asher Krim
> Senior Software Engineer
>
> On Fri, Feb 3, 2017 at 11:53 AM, Hollin Wilkins  wrote:
>
>> Hey Aseem,
>>
>> We have built pipelines that execute several string indexers, one hot
>> encoders, scaling, and a random forest or linear regression at the end.
>> Execution time for the linear regression was on the order of 11
>> microseconds, a bit longer for random forest. If your pipeline is simple,
>> this can be further optimized to around 2-3 microseconds by using row-based
>> transformations. The pipeline operated on roughly 12 input features, and by
>> the time all the processing was done, we had somewhere around 1000 features
>> or so going into the linear regression after one hot encoding and
>> everything else.
>>
>> Hope this helps,
>> Hollin
>>
>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal 
>> wrote:
>>
>>> Does this support Java 7?
>>>
>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal 
>>> wrote:
>>>
>>>> Is computational time for predictions on the order of few milliseconds
>>>> (< 10 ms) like the old mllib library?
>>>>
>>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins 
>>>> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>>
>>>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
>>>>> about MLeap and how you can use it to build production services from your
>>>>> Spark-trained ML pipelines. MLeap is an open-source technology that allows
>>>>> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
>>>>> Models to a scoring engine instantly. The MLeap execution engine has no
>>>>> dependencies on a Spark context and the serialization format is entirely
>>>>> based on Protobuf 3 and JSON.
>>>>>
>>>>>
>>>>> The recent 0.5.0 release provides serialization and inference support
>>>>> for close to 100% of Spark transformers (we don’t yet support ALS and LDA).
>>>>>
>>>>>
>>>>> MLeap is open-source, take a look at our Github page:
>>>>>
>>>>> https://github.com/combust/mleap
>>>>>
>>>>>
>>>>> Or join the conversation on Gitter:
>>>>>
>>>>> https://gitter.im/combust/mleap
>>>>>
>>>>>
>>>>> We have a set of documentation to help get you started here:
>>>>>
>>>>> http://mleap-docs.combust.ml/
>>>>>
>>>>>
>>>>> We even have a set of demos, for training ML Pipelines and linear,
>>>>> logistic and random forest models:
>>>>>
>>>>> https://github.com/combust/mleap-demo
>>>>>
>>>>>
>>>>> Check out our latest MLeap-serving Docker image, which allows you to
>>>>> expose a REST interface to your Spark ML pipeline models:
>>>>>
>>>>> http://mleap-docs.combust.ml/mleap-serving/
>>>>>
>>>>>
>>>>> Several companies are using MLeap in production and even more are
>>>>> currently evaluating it. Take a look and tell us what you think! We hope to
>>>>> talk with you soon and welcome feedback/suggestions!
>>>>>
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> Hollin and Mikhail
>>>>>
>>>>
>>>>
>>>
>>
>


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-03 Thread Hollin Wilkins
Hey Aseem,

We have built pipelines that execute several string indexers, one hot
encoders, scaling, and a random forest or linear regression at the end.
Execution time for the linear regression was on the order of 11
microseconds, a bit longer for random forest. If your pipeline is simple,
this can be further optimized to around 2-3 microseconds by using row-based
transformations. The pipeline operated on roughly 12 input features, and by
the time all the processing was done, we had somewhere around 1000 features
or so going into the linear regression after one hot encoding and
everything else.
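
For readers unfamiliar with the idea, a row-based pipeline of the kind described above can be sketched in plain Python. This is not MLeap's API: the function names, the single "color" category and all parameter values are hypothetical. For simplicity the one-hot step keeps every category.

```python
def string_index(value, labels):
    """Map a category string to its integer index."""
    return labels.index(value)


def one_hot(index, size):
    """Encode an index as a vector with a single 1.0 entry."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def scale(values, means, stds):
    """Standardize each numeric value with a precomputed mean and std."""
    return [(v - m) / s for v, m, s in zip(values, means, stds)]


def linear_regression(features, coefficients, intercept):
    """Dot product of features and coefficients, plus the intercept."""
    return sum(f * c for f, c in zip(features, coefficients)) + intercept


def score_row(row, params):
    """Run one row through indexing, encoding, scaling and the model."""
    labels = params["labels"]
    encoded = one_hot(string_index(row["color"], labels), len(labels))
    scaled = scale(row["numeric"], params["means"], params["stds"])
    return linear_regression(encoded + scaled,
                             params["coefficients"], params["intercept"])


# Hypothetical fitted parameters: 3 categories + 2 numeric inputs = 5 features.
params = {
    "labels": ["red", "green", "blue"],
    "means": [10.0, 0.0],
    "stds": [2.0, 1.0],
    "coefficients": [0.5, -1.0, 2.0, 3.0, 0.25],
    "intercept": 1.0,
}
row = {"color": "blue", "numeric": [14.0, 4.0]}
print(score_row(row, params))
```

Each request touches only plain lists and arithmetic, with no query planning step, which is the property that makes microsecond-scale scoring plausible.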

Hope this helps,
Hollin

On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal  wrote:

> Does this support Java 7?
>
> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal  wrote:
>
>> Is computational time for predictions on the order of few milliseconds (<
>> 10 ms) like the old mllib library?
>>
>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins 
>> wrote:
>>
>>> Hey everyone,
>>>
>>>
>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
>>> about MLeap and how you can use it to build production services from your
>>> Spark-trained ML pipelines. MLeap is an open-source technology that allows
>>> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
>>> Models to a scoring engine instantly. The MLeap execution engine has no
>>> dependencies on a Spark context and the serialization format is entirely
>>> based on Protobuf 3 and JSON.
>>>
>>>
>>> The recent 0.5.0 release provides serialization and inference support
>>> for close to 100% of Spark transformers (we don’t yet support ALS and LDA).
>>>
>>>
>>> MLeap is open-source, take a look at our Github page:
>>>
>>> https://github.com/combust/mleap
>>>
>>>
>>> Or join the conversation on Gitter:
>>>
>>> https://gitter.im/combust/mleap
>>>
>>>
>>> We have a set of documentation to help get you started here:
>>>
>>> http://mleap-docs.combust.ml/
>>>
>>>
>>> We even have a set of demos, for training ML Pipelines and linear,
>>> logistic and random forest models:
>>>
>>> https://github.com/combust/mleap-demo
>>>
>>>
>>> Check out our latest MLeap-serving Docker image, which allows you to
>>> expose a REST interface to your Spark ML pipeline models:
>>>
>>> http://mleap-docs.combust.ml/mleap-serving/
>>>
>>>
>>> Several companies are using MLeap in production and even more are
>>> currently evaluating it. Take a look and tell us what you think! We hope to
>>> talk with you soon and welcome feedback/suggestions!
>>>
>>>
>>> Sincerely,
>>>
>>> Hollin and Mikhail
>>>
>>
>>
>


[ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-02 Thread Hollin Wilkins
Hey everyone,


Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits about
MLeap and how you can use it to build production services from your
Spark-trained ML pipelines. MLeap is an open-source technology that allows
Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
Models to a scoring engine instantly. The MLeap execution engine has no
dependencies on a Spark context and the serialization format is entirely
based on Protobuf 3 and JSON.


The recent 0.5.0 release provides serialization and inference support for
close to 100% of Spark transformers (we don’t yet support ALS and LDA).


MLeap is open-source, take a look at our Github page:

https://github.com/combust/mleap


Or join the conversation on Gitter:

https://gitter.im/combust/mleap


We have a set of documentation to help get you started here:

http://mleap-docs.combust.ml/


We even have a set of demos, for training ML Pipelines and linear, logistic
and random forest models:

https://github.com/combust/mleap-demo


Check out our latest MLeap-serving Docker image, which allows you to expose
a REST interface to your Spark ML pipeline models:

http://mleap-docs.combust.ml/mleap-serving/


Several companies are using MLeap in production and even more are currently
evaluating it. Take a look and tell us what you think! We hope to talk with
you soon and welcome feedback/suggestions!


Sincerely,

Hollin and Mikhail


Re: Question about Multinomial LogisticRegression in spark mllib in spark 2.1.0

2017-02-01 Thread Hollin Wilkins
Hey Aseem,

If you are looking for a full-featured library to execute Spark ML
pipelines outside of Spark, take a look at MLeap:
https://github.com/combust/mleap

Not only does it support transforming single instances of a feature vector,
but you can execute your entire ML pipeline including feature extraction.

Cheers,
Hollin

On Wed, Feb 1, 2017 at 8:49 AM, Seth Hendrickson <
seth.hendrickso...@gmail.com> wrote:

> In Spark.ML the coefficients are not "pivoted" meaning that they do not
> set one of the coefficient sets equal to zero. You can read more about it
> here:
> https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_set_of_independent_binary_regressions
>
> You can translate your set of coefficients to a pivoted version by simply
> subtracting one of the sets of coefficients from all the others. That
> leaves the one you selected, the "pivot", as all zeros. You can then pass
> this into the mllib model, disregarding the "pivot" coefficients. The
> coefficients should be laid out like:
>
> [feature0_class0, feature1_class0, feature2_class0, intercept0,
> feature0_class1, ..., intercept1]
>
> So you have 9 coefficients and 3 intercepts, but you are going to get rid
> of one class's coefficients, leaving you with 6 coefficients and two
> intercepts - so a vector of length 8 for mllib's model.
>
> Note: if you use regularization then it is not exactly correct to convert
> from the non-pivoted version to the pivoted one, since the algorithms will
> give different results in those cases, though it is still possible to do it.
>
> On Wed, Feb 1, 2017 at 3:42 AM, Aseem Bansal  wrote:
>
>> *What I want to do*
>> I have trained an ml.classification.LogisticRegressionModel using the spark
>> ml package.
>>
>> It has 3 features and 3 classes. So the generated model has coefficients
>> in (3, 3) matrix and intercepts in Vector of length (3) as expected.
>>
>> Now, I want to take these coefficients and convert this
>> ml.classification.LogisticRegressionModel model to an instance of
>> mllib.classification.LogisticRegressionModel model.
>>
>> *Why I want to do this*
>> Computational Speed as SPARK-10413 is still in progress and scheduled for
>> Spark 2.2 which is not yet released.
>>
>> *Why I think this is possible*
>> I checked
>> https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression
>> and in that example a multinomial
>> Logistic Regression is trained. So as per this the class
>> mllib.classification.LogisticRegressionModel can encapsulate these
>> parameters.
>>
>> *Problem faced*
>> The only constructor in mllib.classification.LogisticRegressionModel
>> takes a single vector as coefficients and single double as intercept but I
>> have a Matrix of coefficients and Vector of intercepts respectively.
>>
>> I tried converting matrix to a vector by just taking the values (Guess
>> work) but got
>>
>> requirement failed: LogisticRegressionModel.load with numClasses = 3 and
>> numFeatures = 3 expected weights of length 6 (without intercept) or 8 (with
>> intercept), but was given weights of length 9
>>
>> So any ideas?
>>
>
>
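
The pivoting step Seth describes can be checked with a small worked example in plain Python. This is a sketch under assumptions: the coefficient values are made up, class 0 is taken as the pivot, and pivot_weights and softmax_probs are hypothetical helpers rather than Spark APIs. It subtracts the pivot class's parameters from every class, drops the pivot, and confirms that the class probabilities are unchanged.

```python
import math


def pivot_weights(coefficients, intercepts, pivot=0):
    """Subtract the pivot class's parameters from every class, drop the pivot.

    coefficients holds one row of per-feature coefficients per class.
    The result is flat, laid out per remaining class as
    [feature coefficients..., intercept].
    """
    flat = []
    for k, (row, b) in enumerate(zip(coefficients, intercepts)):
        if k == pivot:
            continue
        flat.extend(c - p for c, p in zip(row, coefficients[pivot]))
        flat.append(b - intercepts[pivot])
    return flat


def softmax_probs(coefficients, intercepts, x):
    """Class probabilities of a multinomial logistic regression model."""
    margins = [sum(c * v for c, v in zip(row, x)) + b
               for row, b in zip(coefficients, intercepts)]
    top = max(margins)  # subtract the max margin for numerical stability
    exps = [math.exp(m - top) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up model with 3 classes and 3 features.
coefficients = [[0.2, -0.1, 0.4], [1.0, 0.5, -0.3], [-0.6, 0.8, 0.1]]
intercepts = [0.1, -0.2, 0.3]

weights = pivot_weights(coefficients, intercepts)
print(len(weights))  # 6 coefficients + 2 intercepts

# Rebuild a full model whose pivot class is all zeros, then compare.
pivoted_coefficients = [[0.0, 0.0, 0.0], weights[0:3], weights[4:7]]
pivoted_intercepts = [0.0, weights[3], weights[7]]
x = [1.0, 2.0, -1.0]
difference = max(abs(a - b) for a, b in zip(
    softmax_probs(coefficients, intercepts, x),
    softmax_probs(pivoted_coefficients, pivoted_intercepts, x)))
print(difference)
```

The probabilities agree because subtracting the same vector from every class shifts all margins by the same amount, which softmax ignores. As noted above, this equivalence holds for the prediction function only; with regularization the two formulations would not generally be fitted to the same solution.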


Re: ML version of Kmeans

2017-01-31 Thread Hollin Wilkins
Hey,

You could also take a look at MLeap, which provides a runtime for any Spark
transformer and does not have any dependencies on a SparkContext or Spark
libraries (excepting MLlib-local for linear algebra).

https://github.com/combust/mleap

On Tue, Jan 31, 2017 at 2:33 AM, Aseem Bansal  wrote:

> If you want to predict using dataset then transform is the way to go. If
> you want to predict on vectors then you will have to wait on this issue to
> be completed https://issues.apache.org/jira/browse/SPARK-10413
>
> On Tue, Jan 31, 2017 at 3:01 PM, Holden Karau 
> wrote:
>
>> You most likely want the transform function on KMeansModel (although that
>> works on a dataset input rather than a single element at a time).
>>
>> On Tue, Jan 31, 2017 at 1:24 AM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am not able to find predict method on "ML" version of Kmeans.
>>>
>>> Mllib version has a predict method: KMeansModel.predict(point: Vector).
>>> How to predict the cluster point for new vectors in ML version of kmeans?
>>>
>>> Regards,
>>> Rajesh
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
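
Until a single-vector predict is available, one workaround is to compute the nearest cluster center by hand. The sketch below is plain Python and assumes the fitted centers have already been copied out of the model (the ML KMeansModel exposes them as clusterCenters) into ordinary lists. predict_cluster is a hypothetical helper, and it uses squared Euclidean distance, so it would not match a model trained with a different distance measure.

```python
def predict_cluster(point, centers):
    """Return the index of the center closest to point (squared Euclidean)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


# Hypothetical centers, as if copied from a fitted model.
centers = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
print(predict_cluster([4.0, 6.0], centers))
```

Because this is only a loop over the centers, it can run anywhere, including inside a service that has no SparkContext.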