Adaryl,

Most ML algorithms are based on some form of numerical optimization, using
something like online gradient descent
<http://en.wikipedia.org/wiki/Stochastic_gradient_descent> or conjugate
gradient
<http://www.math.buffalo.edu/~pitman/courses/cor502/odes/node4.html> (e.g.
in SVM classifiers). In its simplest form it is a nested FOR loop: the outer
loop repeats passes over the data, the inner loop visits each training
example, and on each iteration you update the weights (parameters) of the
model, stopping once the change in prediction error falls below some
convergence threshold (usually the goal is to minimize a loss function
<http://en.wikipedia.org/wiki/Loss_function>, as in a popular least squares
<http://en.wikipedia.org/wiki/Least_squares> technique). You could
parallelize this loop using a brute force divide-and-conquer approach,
mapping a chunk of data to each node and computing a partial sum there,
then aggregating the results from each node into a global sum in a 'reduce'
stage, and repeating this map-reduce cycle until convergence. You can look
up distributed gradient descent
<http://scholar.google.com/scholar?hl=en&q=gradient+descent+with+map-reduc> or
check out Mahout
<https://mahout.apache.org/users/recommender/matrix-factorization.html>
or Spark
MLlib <https://spark.apache.org/docs/latest/mllib-guide.html> for examples.
Alternatively, you can use something like GraphLab
<http://graphlab.com/products/create/docs/graphlab.toolkits.recommender.html>.
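
To make the map-reduce cycle described above concrete, here is a minimal
sketch in plain Python. It fits a least squares line with batch gradient
descent, simulating the "nodes" as list chunks inside one process. The
function names, the learning rate, and the toy data are my own assumptions
for illustration; this is not how Mahout, MLlib, or GraphLab implement it.

```python
# Sketch of map-reduce style batch gradient descent for least squares.
# The "nodes" are simulated as list chunks in a single process; every name
# below is hypothetical and not taken from any of the libraries mentioned.

def partial_gradient(chunk, w, b):
    """'Map' step: sum the squared-error gradient terms over one chunk."""
    gw = gb = 0.0
    for x, y in chunk:
        err = (w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb, len(chunk)

def reduce_gradients(partials):
    """'Reduce' step: add the per-chunk sums, then average over all rows."""
    gw = sum(p[0] for p in partials)
    gb = sum(p[1] for p in partials)
    n = sum(p[2] for p in partials)
    return gw / n, gb / n

def train(chunks, lr=0.05, tol=1e-12, max_iters=10000):
    """Repeat the map-reduce cycle until the gradient is nearly zero."""
    w = b = 0.0
    for _ in range(max_iters):
        partials = [partial_gradient(c, w, b) for c in chunks]  # map
        gw, gb = reduce_gradients(partials)                     # reduce
        w -= lr * gw
        b -= lr * gb
        if gw * gw + gb * gb < tol:  # convergence threshold
            break
    return w, b

# Toy data generated from y = 2x + 1, split across three pretend nodes.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
chunks = [data[0:4], data[4:7], data[7:10]]
w, b = train(chunks)
```

On a real cluster the list comprehension in train() is the part that would
run on the worker nodes, and only the small partial sums would travel over
the network, not the data itself.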

Cassandra can serve as a data store from which you load the training data,
e.g. into Spark using this connector
<https://github.com/datastax/spark-cassandra-connector>, and then train the
model using MLlib or Mahout (it has Spark bindings, I believe). Once you have
trained the model, you could save the parameters back in Cassandra. The
next stage is using the model to classify new data, e.g. to recommend
similar items based on a log of new purchases. There you could once again
use Spark, or Storm with something like this
<https://github.com/pmerienne/trident-ml>.
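
The split between training, storing parameters, and scoring new data can be
sketched roughly like this. A plain dict stands in for a Cassandra table, and
the model is a trivial linear one; in a real setup the reads and writes would
go through a Cassandra driver or the Spark connector. All names here are
hypothetical.

```python
# Sketch of the train / store / score split. A dict stands in for a
# Cassandra table keyed by model name; all names are hypothetical.

param_store = {}

def save_params(store, model_name, params):
    """Persist trained parameters (here: copy them into the dict)."""
    store[model_name] = dict(params)

def load_params(store, model_name):
    """Fetch the stored parameters for the scoring stage."""
    return store[model_name]

def predict(params, x):
    """Apply a linear model y = w * x + b to one new observation."""
    return params["w"] * x + params["b"]

# Batch stage: pretend a training job produced these parameters.
save_params(param_store, "demo_linear", {"w": 2.0, "b": 1.0})

# Scoring stage: load the parameters once, then score new records.
model = load_params(param_store, "demo_linear")
new_events = [0.5, 3.0, 10.0]
scores = [predict(model, x) for x in new_events]
```

The point of the split is that the expensive training job runs in batch,
while the scoring stage only needs the small set of stored parameters.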

Alex




On Fri, Aug 29, 2014 at 10:24 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefi...@hotmail.com> wrote:

>   I’m planning to speak at a local meet-up and I need to know if what I
> have in my head is even possible.
>
>  I want to give an example of working with data in Cassandra. I have data
> coming in through Kafka and Storm and I’m saving it off to Cassandra (this
> is only on paper at this point). I then want to run an ML algorithm over
> the data. My problem here is, while my data is distributed, I don’t know
> how to do the analysis in a distributed manner. I could certainly use R but
> processing the data on a single machine would seem to defeat the purpose of
> all this scalability.
>
>  What is my solution?
>  B.
>
