How to Integrate Spark mllib Streaming Training Models To Spark Structured Streaming

2019-09-17 Thread Praful Rana
Spark mllib library Streaming Training models work with DStream. So is
there any way to use them with spark structured streaming.


Re: MLLib + Streaming

2016-03-06 Thread Lan Jiang
Thanks, Guru. After reading the implementation of StreamingKMean, 
StreamingLinearRegressionWithSGD and StreamingLogisticRegressionWithSGD, I 
reached the same conclusion. But unfortunately, this distinction between true 
online learning and offline learning are implied in the documentation and I was 
not sure if my understanding was correct or not. Thanks for confirming this!

However, I have a different opinion on your last paragraph —  that we cannot 
hold test data during model training for online learning. Taking 
StreamingLinearRegressionWithSGD for example, you can certainly split the each 
micro-batch as 70% — 30% and do evaluation based on the RMSE. At the very 
beginning, the RMSE will be large. But as more and more micro-batch arrives, 
you should see RMSE becomes smaller as the weights approach optimal. IMHO, I 
don’t see much difference regarding holding test data between online and 
offline learning.  

Lan

> On Mar 6, 2016, at 2:43 AM, Chris Miller  wrote:
> 
> Guru:This is a really great response. Thanks for taking the time to explain 
> all of this. Helpful for me too.
> 
> 
> --
> Chris Miller
> 
> On Sun, Mar 6, 2016 at 1:54 PM, Guru Medasani  > wrote:
> Hi Lan,
> 
> Streaming Means, Linear Regression and Logistic Regression support online 
> machine learning as you mentioned. Online machine learning is where model is 
> being trained and updated on every batch of streaming data. These models have 
> trainOn() and predictOn() methods where you can simply pass in DStreams you 
> want to train the model on and DStreams you want the model to predict on. So 
> when the next batch of data arrives model is trained and updated again. In 
> this case model weights are continually updated and hopefully model performs 
> better in terms of convergence and accuracy over time. What we are really 
> trying to do in online learning case is that we are only showing few examples 
> of the data at a time ( stream of data) and updating the parameters in case 
> of Linear and Logistic Regression and updating the centers in case of 
> K-Means. In the case of Linear or Logistic Regression this is possible due to 
> the optimizer that is chosen for minimizing the cost function which is 
> Stochastic Gradient Descent. This optimizer helps us to move closer and 
> closer to the optimal weights after every batch and over the time we will 
> have a model that has learned how to represent our data and predict well.
> 
> In the scenario of using any MLlib algorithms and doing training with 
> DStream.transform() and DStream.foreachRDD() operations, when the first batch 
> of data arrives we build a model, let’s call this model1. Once you have the 
> model1 you can make predictions on the same DStream or a different DStream 
> source. But for the next batch if you follow the same procedure and create a 
> model, let’s call this model2. This model2 will be significantly different 
> than model1 based on how different the data is in the second DStream vs the 
> first DStream as it is not continually updating the model. It’s like weight 
> vectors are jumping from one place to the other for every batch and we never 
> know if the algorithm is converging to the optimal weights. So I believe it 
> is not possible to do true online learning with other MLLib models in Spark 
> Streaming.  I am not sure if this is because the models don’t generally 
> support this streaming scenarios or if the streaming versions simply haven’t 
> been implemented yet.
> 
> Though technically you can use any of the MLlib algorithms in Spark Streaming 
> with the procedure you mentioned and make predictions, it is important to 
> figure out if the model you are choosing can converge by showing only a 
> subset(batches  - DStreams) of the data over time. Based on the algorithm you 
> choose certain optimizers won’t necessarily be able to converge by showing 
> only individual data points and require to see majority of the data to be 
> able to learn optimal weights.  In these cases, you can still do offline 
> learning/training with Spark bach processing using any of the MLlib 
> algorithms and save those models on hdfs. You can then start a streaming job 
> and load these saved models into your streaming application and make 
> predictions. This is traditional offline learning.
> 
> In general, online learning is hard as it’s hard to evaluate since we are not 
> holding any test data during the model training. We are simply training the 
> model and predicting. So in the initial batches, results can vary quite a bit 
> and have significant errors in terms of the predictions. So choosing online 
> learning vs. offline learning depends on how much tolerance the application 
> can have towards wild predictions in the beginning. Offline training is 
> simple and cheap where as online training can be hard and needs to be 
> constantly monitored to see how it is performing.
> 
> 

Re: MLLib + Streaming

2016-03-06 Thread Chris Miller
Guru:This is a really great response. Thanks for taking the time to explain
all of this. Helpful for me too.


--
Chris Miller

On Sun, Mar 6, 2016 at 1:54 PM, Guru Medasani  wrote:

> Hi Lan,
>
> Streaming Means, Linear Regression and Logistic Regression support online
> machine learning as you mentioned. Online machine learning is where model
> is being trained and updated on every batch of streaming data. These models
> have trainOn() and predictOn() methods where you can simply pass in
> DStreams you want to train the model on and DStreams you want the model to
> predict on. So when the next batch of data arrives model is trained and
> updated again. In this case model weights are continually updated and
> hopefully model performs better in terms of convergence and accuracy over
> time. What we are really trying to do in online learning case is that we
> are only showing few examples of the data at a time ( stream of data) and
> updating the parameters in case of Linear and Logistic Regression and
> updating the centers in case of K-Means. In the case of Linear or Logistic
> Regression this is possible due to the optimizer that is chosen for
> minimizing the cost function which is Stochastic Gradient Descent. This
> optimizer helps us to move closer and closer to the optimal weights after
> every batch and over the time we will have a model that has learned how to
> represent our data and predict well.
>
> In the scenario of using any MLlib algorithms and doing training with
> DStream.transform() and DStream.foreachRDD() operations, when the first
> batch of data arrives we build a model, let’s call this model1. Once you
> have the model1 you can make predictions on the same DStream or a different
> DStream source. But for the next batch if you follow the same procedure and
> create a model, let’s call this model2. This model2 will be significantly
> different than model1 based on how different the data is in the second
> DStream vs the first DStream as it is not continually updating the model.
> It’s like weight vectors are jumping from one place to the other for every
> batch and we never know if the algorithm is converging to the optimal
> weights. So I believe it is not possible to do true online learning with
> other MLLib models in Spark Streaming.  I am not sure if this is because
> the models don’t generally support this streaming scenarios or if the
> streaming versions simply haven’t been implemented yet.
>
> Though technically you can use any of the MLlib algorithms in Spark
> Streaming with the procedure you mentioned and make predictions, it is
> important to figure out if the model you are choosing can converge by
> showing only a subset(batches  - DStreams) of the data over time. Based on
> the algorithm you choose certain optimizers won’t necessarily be able to
> converge by showing only individual data points and require to see majority
> of the data to be able to learn optimal weights.  In these cases, you can
> still do offline learning/training with Spark bach processing using any of
> the MLlib algorithms and save those models on hdfs. You can then start a
> streaming job and load these saved models into your streaming application
> and make predictions. This is traditional offline learning.
>
> In general, online learning is hard as it’s hard to evaluate since we are
> not holding any test data during the model training. We are simply training
> the model and predicting. So in the initial batches, results can vary quite
> a bit and have significant errors in terms of the predictions. So choosing
> online learning vs. offline learning depends on how much tolerance the
> application can have towards wild predictions in the beginning. Offline
> training is simple and cheap where as online training can be hard and needs
> to be constantly monitored to see how it is performing.
>
> Hope this helps in understanding offline learning vs. online learning and
> which algorithms you can choose for online learning in MLlib.
>
> Guru Medasani
> gdm...@gmail.com
>
>
>
> > On Mar 5, 2016, at 7:37 PM, Lan Jiang  wrote:
> >
> > Hi, there
> >
> > I hope someone can clarify this for me.  It seems that some of the MLlib
> algorithms such as KMean, Linear Regression and Logistics Regression have a
> Streaming version, which can do online machine learning. But does that mean
> other MLLib algorithm cannot be used in Spark streaming applications, such
> as random forest, SVM, collaborate filtering, etc??
> >
> > DStreams are essentially a sequence of RDDs. We can use
> DStream.transform() and DStream.foreachRDD() operations, which allows you
> access RDDs in a DStream and apply MLLib functions on them. So it looks
> like all MLLib algorithms should be able to run in the streaming
> application. Am I wrong?
> >
> > Lan
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For 

MLLib + Streaming

2016-03-05 Thread Lan Jiang
Hi, there

I hope someone can clarify this for me.  It seems that some of the MLlib 
algorithms such as KMean, Linear Regression and Logistics Regression have a 
Streaming version, which can do online machine learning. But does that mean 
other MLLib algorithm cannot be used in Spark streaming applications, such as 
random forest, SVM, collaborate filtering, etc??

DStreams are essentially a sequence of RDDs. We can use DStream.transform() and 
DStream.foreachRDD() operations, which allows you access RDDs in a DStream and 
apply MLLib functions on them. So it looks like all MLLib algorithms should be 
able to run in the streaming application. Am I wrong? 

Lan
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLlib + Streaming

2014-12-28 Thread Jeremy Freeman
Hi Fernando,

There’s currently no streaming ALS in Spark. I’m exploring a streaming singular 
value decomposition (JIRA) based on this paper 
(http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf), which might be one way to 
think about it.

There has also been some cool recent work explicitly on streaming ALS w/ SGD 
that we should look into 
(https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf).

— Jeremy

-
jeremyfreeman.net
@thefreemanlab

On Dec 23, 2014, at 2:47 PM, Fernando O. fot...@gmail.com wrote:

 Hey Xiangrui,
   
 Is there any plan to have a streaming compatible ALS version?
 
 Or if it's currently doable, is there any example?
 
 
 
 On Tue, Dec 23, 2014 at 4:31 PM, Xiangrui Meng men...@gmail.com wrote:
 We have streaming linear regression (since v1.1) and k-means (v1.2) in
 MLlib. You can check the user guide:
 
 http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
 http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering
 
 -Xiangrui
 
 On Tue, Dec 23, 2014 at 10:01 AM, Gianmarco De Francisci Morales
 g...@apache.org wrote:
  Hi,
 
  I have recently seen a demo of Spark where different pieces were put
  together (training via MLlib + deploying on Spark Streaming).
  I was wondering if MLlib currently works to directly train on Streaming.
  And, if so, what are the semantics of the algorithms?
  If not, would it be interesting to have ML algorithms developed for the
  streaming setting?
 
  Thanks,
  --
  Gianmarco
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 



Re: MLlib + Streaming

2014-12-23 Thread Xiangrui Meng
We have streaming linear regression (since v1.1) and k-means (v1.2) in
MLlib. You can check the user guide:

http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering

-Xiangrui

On Tue, Dec 23, 2014 at 10:01 AM, Gianmarco De Francisci Morales
g...@apache.org wrote:
 Hi,

 I have recently seen a demo of Spark where different pieces were put
 together (training via MLlib + deploying on Spark Streaming).
 I was wondering if MLlib currently works to directly train on Streaming.
 And, if so, what are the semantics of the algorithms?
 If not, would it be interesting to have ML algorithms developed for the
 streaming setting?

 Thanks,
 --
 Gianmarco

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLlib + Streaming

2014-12-23 Thread Fernando O.
Hey Xiangrui,

Is there any plan to have a streaming compatible ALS version?

Or if it's currently doable, is there any example?



On Tue, Dec 23, 2014 at 4:31 PM, Xiangrui Meng men...@gmail.com wrote:

 We have streaming linear regression (since v1.1) and k-means (v1.2) in
 MLlib. You can check the user guide:


 http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression

 http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering

 -Xiangrui

 On Tue, Dec 23, 2014 at 10:01 AM, Gianmarco De Francisci Morales
 g...@apache.org wrote:
  Hi,
 
  I have recently seen a demo of Spark where different pieces were put
  together (training via MLlib + deploying on Spark Streaming).
  I was wondering if MLlib currently works to directly train on Streaming.
  And, if so, what are the semantics of the algorithms?
  If not, would it be interesting to have ML algorithms developed for the
  streaming setting?
 
  Thanks,
  --
  Gianmarco

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org