Re: Deploying ML Pipeline Model

2016-07-01 Thread Saurabh Sardeshpande
Hi Nick,

Thanks for the answer. Do you think an implementation like the one in this
article is infeasible in production for, say, hundreds of queries per minute?
https://www.codementor.io/spark/tutorial/building-a-web-service-with-apache-spark-flask-example-app-part2.
The article uses Flask to define routes and Spark to evaluate requests.
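For reference, the serving pattern there is roughly the following - a minimal
sketch, assuming a pipeline saved with Spark 2.0-style persistence; the model
path, the JSON field names and the "prediction" column are illustrative
assumptions, not details taken from the article:

from flask import Flask, request, jsonify
from pyspark.sql import SparkSession, Row
from pyspark.ml import PipelineModel

app = Flask(__name__)
spark = SparkSession.builder.appName("pipeline-serving").getOrCreate()
model = PipelineModel.load("model_v1")  # illustrative path to a saved pipeline

@app.route("/predict", methods=["POST"])
def predict():
    # Build a one-row DataFrame whose columns match the training schema
    # (the JSON field names are assumed to line up with that schema).
    payload = request.get_json()
    df = spark.createDataFrame([Row(**payload)])
    prediction = model.transform(df).select("prediction").first()[0]
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)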

Regards,
Saurabh

On Fri, Jul 1, 2016 at 10:47 AM, Nick Pentreath wrote:

> Generally there are 2 ways to use a trained pipeline model - (offline)
> batch scoring, and real-time online scoring.
>
> For batch (or even "mini-batch", e.g. on Spark Streaming data), then yes,
> certainly loading the model back into Spark and feeding new data through the
> pipeline for prediction works just fine, and this is essentially what is
> supported in 1.6 (with more or less full coverage in 2.0). For large batch
> cases this can be quite efficient.
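As a rough sketch, that batch-scoring path could look like this (assuming
Spark 2.0's PySpark pipeline persistence; the paths and the "id" column are
illustrative):

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("batch-scoring").getOrCreate()

# Reload the fitted pipeline and score a new batch of records.
model = PipelineModel.load("hdfs:///models/model_v1")        # illustrative path
new_data = spark.read.parquet("hdfs:///data/new_records")    # illustrative path

scored = model.transform(new_data)
scored.select("id", "prediction").write.mode("overwrite").parquet("hdfs:///data/scored")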
>
> However, for real-time use cases the required latency is usually fairly
> low - on the order of a few ms to a few hundred ms per request (some
> examples include recommendations, ad serving, fraud detection, etc.).
>
> In these cases, using Spark has two issues: (1) per-request latency for
> prediction on the pipeline, which is based on DataFrames and therefore
> distributed execution, is usually fairly high; (2) it requires pulling in
> all of Spark for your real-time serving layer (or running a full Spark
> cluster), which is usually overkill - all you really need for serving is a
> bit of linear algebra and some basic transformations.
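As an illustration of that last point (this is not an existing export tool,
just a sketch with made-up numbers): if you export a fitted binary logistic
regression's coefficients and intercept yourself, request-time scoring reduces
to a dot product and a sigmoid in plain NumPy, with no Spark involved:

import numpy as np

# Coefficients and intercept exported from a hypothetical fitted binary
# logistic regression; the values are made up for illustration.
coefficients = np.array([0.42, -1.30, 0.07])
intercept = -0.15

def predict_probability(features):
    # score = sigmoid(w . x + b) - plain linear algebra, no Spark at request time
    margin = float(np.dot(coefficients, features)) + intercept
    return 1.0 / (1.0 + np.exp(-margin))

print(predict_probability([1.0, 0.5, 3.2]))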
>
> So for now, unfortunately, there are not many options for exporting your
> pipelines and serving them outside of Spark - the JPMML-based project
> mentioned on this thread is one. The other option at this point is to write
> your own export functionality and your own serving layer.
>
> There is (very initial) movement towards improving the local serving
> possibilities (see https://issues.apache.org/jira/browse/SPARK-13944 which
> was the "first step" in this process).
>
> On Fri, 1 Jul 2016 at 19:24 Jacek Laskowski  wrote:
>
>> Hi Rishabh,
>>
>> I've just today had a similar conversation about how to do an ML Pipeline
>> deployment and couldn't really answer the question, not least because I
>> don't really understand the use case.
>>
>> What would you expect from ML Pipeline model deployment? You can save
>> your model to a file by model.write.overwrite.save("model_v1").
>>
>> model_v1
>> |-- metadata
>> |   |-- _SUCCESS
>> |   `-- part-0
>> `-- stages
>>     |-- 0_regexTok_b4265099cc1c
>>     |   `-- metadata
>>     |       |-- _SUCCESS
>>     |       `-- part-0
>>     |-- 1_hashingTF_8de997cf54ba
>>     |   `-- metadata
>>     |       |-- _SUCCESS
>>     |       `-- part-0
>>     `-- 2_linReg_3942a71d2c0e
>>         |-- data
>>         |   |-- _SUCCESS
>>         |   |-- _common_metadata
>>         |   |-- _metadata
>>         |   `-- part-r-0-2096c55a-d654-42b2-90d3-5a310101cba5.gz.parquet
>>         `-- metadata
>>             |-- _SUCCESS
>>             `-- part-0
>>
>> 9 directories, 12 files
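For the PySpark side of that round trip, a short sketch (pipeline persistence
from Python assumes Spark 2.0; the stage types are inferred from the directory
names above):

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.getOrCreate()

# Reload the pipeline saved above; its stages mirror the stages/ directories.
model = PipelineModel.load("model_v1")
print([type(s).__name__ for s in model.stages])
# e.g. ['RegexTokenizer', 'HashingTF', 'LinearRegressionModel']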
>>
>> What would you like to have outside the SparkContext? What's wrong with
>> using Spark? Just curious - hoping to understand the use case better.
>> Thanks.
>>
>> Regards,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Jul 1, 2016 at 12:54 PM, Rishabh Bhardwaj wrote:
>> > Hi All,
>> >
>> > I am looking for ways to deploy an ML Pipeline model in production.
>> > Spark has already proved to be one of the best frameworks for model
>> > training and creation, but once the ML pipeline model is ready, how can I
>> > deploy it outside the Spark context?
>> > MLlib models have a toPMML method, but today a Pipeline model cannot be
>> > saved to PMML. There are some frameworks, like MLeap, which are trying to
>> > abstract the Pipeline model and provide ML Pipeline model deployment
>> > outside the Spark context, but currently they don't have most of the ML
>> > transformers and estimators.
>> > I am looking for related work going on in this area.
>> > Any pointers will be helpful.
>> >
>> > Thanks,
>> > Rishabh.
>>
>>
>>


Ideas to put a Spark ML model in production

2016-06-23 Thread Saurabh Sardeshpande
Hi all,

How do you reliably deploy a Spark model in production? Let's say I've done
a lot of analysis and come up with a model that performs great. I have this
"model file" and I'm not sure what to do with it. I want to build some kind
of service around it that takes some inputs, converts them into features,
runs the equivalent of 'transform' to predict the output, and returns the
output.

At the Spark Summit I heard a lot of talk about how this will be easy to do
in Spark 2.0, but I'm looking for a solution sooner. PMML support is limited
and the model I have can't be exported in that format.

I would appreciate any ideas around this, especially from personal
experiences.

Regards,
Saurabh


Re: Explode row with start and end dates into row for each date

2016-06-22 Thread Saurabh Sardeshpande
I don't think there would be any issues, since MLlib is part of Spark rather
than an external package. Most of the problems I've had to deal with were
because of the existence of both versions of Python on a system, not Python 3
itself.

On Wed, Jun 22, 2016 at 3:51 PM, John Aherne <john.ahe...@justenough.com> wrote:

> Thanks Saurabh!
>
> That explode function looks like it is exactly what I need.
>
> We will be using MLlib quite a lot - Do I have to worry about python
> versions for that?
>
> John
>
> On Wed, Jun 22, 2016 at 4:34 PM, Saurabh Sardeshpande <saurabh...@gmail.com> wrote:
>
>> Hi John,
>>
>> If you can do it in Hive, you should be able to do it in Spark. Just make
>> sure you import HiveContext instead of SQLContext.
>>
>> If your intent is to explore rather than just get stuff done, I'm not aware
>> of any RDD operations that do this for you, but there is a DataFrame
>> operation called 'explode' which does -
>> https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.functions.explode.
>> You'll just have to generate the array of dates using something like this -
>> http://stackoverflow.com/questions/7274267/print-all-day-dates-between-two-dates.
>>
>> It's generally recommended to use Python 3 if you're starting a new
>> project and don't have old dependencies. But remember that there is still
>> quite a lot of stuff that has not yet been ported to Python 3.
>>
>> Regards,
>> Saurabh
>>
>> On Wed, Jun 22, 2016 at 3:20 PM, John Aherne <john.ahe...@justenough.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I am pretty new to Spark (and the mailing list), so forgive me if the
>>> answer is obvious.
>>>
>>> I have a dataset, and each row contains a start date and end date.
>>>
>>> I would like to explode each row so that each day between the start and
>>> end dates becomes its own row.
>>> e.g.
>>> row1  2015-01-01  2015-01-03
>>> becomes
>>> row1   2015-01-01
>>> row1   2015-01-02
>>> row1   2015-01-03
>>>
>>> So, my questions are:
>>> Is Spark a good place to do that?
>>> I can do it in Hive, but it's a bit messy, and this seems like a good
>>> problem to use for learning Spark (and Python).
>>>
>>> If so, any pointers on what methods I should use? Particularly how to
>>> split one row into multiples.
>>>
>>> Lastly, I am a bit hesitant to ask, but is there a recommendation on
>>> which version of Python to use? Not interested in which is better, just
>>> want to know if they are both supported equally.
>>>
>>> I am using Spark 1.6.1 (Hortonworks distro).
>>>
>>> Thanks!
>>> John
>>>
>>> --
>>>
>>> John Aherne
>>> Big Data and SQL Developer
>>>
>>>
>>> Cell: +1 (303) 809-9718
>>> Email: john.ahe...@justenough.com
>>> Skype: john.aherne.je
>>> Web: www.justenough.com
>>>
>>>
>>>
>>>
>>
>
>
> --
>
> John Aherne
> Big Data and SQL Developer
>
>
> Cell: +1 (303) 809-9718
> Email: john.ahe...@justenough.com
> Skype: john.aherne.je
> Web: www.justenough.com
>
>
>
>


Re: Explode row with start and end dates into row for each date

2016-06-22 Thread Saurabh Sardeshpande
Hi John,

If you can do it in Hive, you should be able to do it in Spark. Just make
sure you import HiveContext instead of SQLContext.

If your intent is to explore rather than just get stuff done, I'm not aware of
any RDD operations that do this for you, but there is a DataFrame operation
called 'explode' which does -
https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.functions.explode.
You'll just have to generate the array of dates using something like this -
http://stackoverflow.com/questions/7274267/print-all-day-dates-between-two-dates.
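Putting the two together, a rough PySpark 1.6 sketch (the column names, date
format and sample rows are made up for illustration):

from datetime import datetime, timedelta

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType

sc = SparkContext(appName="explode-dates")
sqlContext = HiveContext(sc)

def dates_between(start, end):
    # Return every date from start to end (inclusive) as 'YYYY-MM-DD' strings.
    d0 = datetime.strptime(start, "%Y-%m-%d")
    d1 = datetime.strptime(end, "%Y-%m-%d")
    return [(d0 + timedelta(days=i)).strftime("%Y-%m-%d")
            for i in range((d1 - d0).days + 1)]

dates_between_udf = udf(dates_between, ArrayType(StringType()))

df = sqlContext.createDataFrame(
    [("row1", "2015-01-01", "2015-01-03")], ["id", "start", "end"])

exploded = df.select("id", explode(dates_between_udf("start", "end")).alias("date"))
exploded.show()
# +----+----------+
# |  id|      date|
# +----+----------+
# |row1|2015-01-01|
# |row1|2015-01-02|
# |row1|2015-01-03|
# +----+----------+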

It's generally recommended to use Python 3 if you're starting a new project
and don't have old dependencies. But remember that there is still quite a lot
of stuff that has not yet been ported to Python 3.

Regards,
Saurabh

On Wed, Jun 22, 2016 at 3:20 PM, John Aherne wrote:

> Hi Everyone,
>
> I am pretty new to Spark (and the mailing list), so forgive me if the
> answer is obvious.
>
> I have a dataset, and each row contains a start date and end date.
>
> I would like to explode each row so that each day between the start and
> end dates becomes its own row.
> e.g.
> row1  2015-01-01  2015-01-03
> becomes
> row1   2015-01-01
> row1   2015-01-02
> row1   2015-01-03
>
> So, my questions are:
> Is Spark a good place to do that?
> I can do it in Hive, but it's a bit messy, and this seems like a good
> problem to use for learning Spark (and Python).
>
> If so, any pointers on what methods I should use? Particularly how to
> split one row into multiples.
>
> Lastly, I am a bit hesitant to ask, but is there a recommendation on which
> version of Python to use? Not interested in which is better, just want to
> know if they are both supported equally.
>
> I am using Spark 1.6.1 (Hortonworks distro).
>
> Thanks!
> John
>
> --
>
> John Aherne
> Big Data and SQL Developer
>
>
> Cell: +1 (303) 809-9718
> Email: john.ahe...@justenough.com
> Skype: john.aherne.je
> Web: www.justenough.com
>
>
>
>