Re: [ML] - data storage and basic design approach

Thomas Jungblut Mon, 09 Jul 2012 09:14:23 -0700

For the matrix/vector part I would propose my library's interfaces (much like
Mahout's math, but without boundary checks):
https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java

Full Writable for Vector and basic Writable for Matrix:
https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable

That is enough to implement all the machine learning algorithms I've seen so
far, and the builder pattern allows really nice chaining of commands, which
makes it easy to code equations or translate code from MATLAB/Octave.
See, for example, the logistic regression cost function:
https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java
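
To illustrate the chaining idea without depending on the linked code, here is
a minimal sketch. SketchVector and LogisticCostSketch are hypothetical names
made up for this mail and are not the tjungblut-math API; the method names are
assumptions as well. It computes the usual logistic regression cost
J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)).

```java
// Hypothetical minimal dense vector for illustration; NOT the tjungblut-math API.
// Like the proposal above, it performs no boundary checks.
final class SketchVector {
    private final double[] values;

    SketchVector(double... values) {
        this.values = values.clone();
    }

    int length() {
        return values.length;
    }

    // Element-wise product; returns a new vector so calls can be chained.
    SketchVector multiply(SketchVector other) {
        double[] r = new double[values.length];
        for (int i = 0; i < r.length; i++) r[i] = values[i] * other.values[i];
        return new SketchVector(r);
    }

    // Element-wise sum.
    SketchVector add(SketchVector other) {
        double[] r = new double[values.length];
        for (int i = 0; i < r.length; i++) r[i] = values[i] + other.values[i];
        return new SketchVector(r);
    }

    // Computes (scalar - v[i]) for every element.
    SketchVector subtractFrom(double scalar) {
        double[] r = new double[values.length];
        for (int i = 0; i < r.length; i++) r[i] = scalar - values[i];
        return new SketchVector(r);
    }

    // Element-wise natural logarithm.
    SketchVector log() {
        double[] r = new double[values.length];
        for (int i = 0; i < r.length; i++) r[i] = Math.log(values[i]);
        return new SketchVector(r);
    }

    double sum() {
        double s = 0.0;
        for (double v : values) s += v;
        return s;
    }

    double dot(SketchVector other) {
        return multiply(other).sum();
    }
}

final class LogisticCostSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // x holds one training example per row, y the 0/1 labels.
    static double cost(double[][] x, SketchVector y, SketchVector theta) {
        int m = x.length;
        double[] h = new double[m];
        for (int i = 0; i < m; i++) {
            h[i] = sigmoid(new SketchVector(x[i]).dot(theta));
        }
        SketchVector hypothesis = new SketchVector(h);
        // The chained calls mirror the equation term by term.
        SketchVector positive = y.multiply(hypothesis.log());
        SketchVector negative =
            y.subtractFrom(1.0).multiply(hypothesis.subtractFrom(1.0).log());
        return -positive.add(negative).sum() / m;
    }
}
```

With theta set to all zeros every hypothesis value is 0.5, so the cost should
come out as ln(2) regardless of the labels, which makes a handy sanity check.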

For the interfaces of the algorithms:
I guess we need to get some more experience. I cannot tell what the
interfaces for them should look like, mainly because I don't know how the
BSP version of them will call the algorithm logic.

But having stable math interfaces is the key point.
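
Purely as a starting point for the discussion, here is a rough sketch of how
separate learning and predicting interfaces could look, with the model
serialized in between. Everything here is an assumption: Model, Learner,
Predictor, ModelIO, LinearModel and the perceptron example are hypothetical
names and not part of Hama. write/readFields only mirror the shape of Hadoop's
Writable and use plain java.io, and a byte array stands in for a file on HDFS.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical: a model that can be persisted between the two jobs.
interface Model {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical: what the "learning" job would call.
interface Learner<M extends Model> {
    M learn(double[][] features, double[] labels);
}

// Hypothetical: what the "predicting" job would call.
interface Predictor<M extends Model> {
    double predict(M model, double[] features);
}

final class LinearModel implements Model {
    double[] weights = new double[0];

    public void write(DataOutput out) throws IOException {
        out.writeInt(weights.length);
        for (double w : weights) out.writeDouble(w);
    }

    public void readFields(DataInput in) throws IOException {
        weights = new double[in.readInt()];
        for (int i = 0; i < weights.length; i++) weights[i] = in.readDouble();
    }
}

// Stand-in for writing to and reading from HDFS.
final class ModelIO {
    static byte[] toBytes(Model model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            model.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void fromBytes(Model model, byte[] data) {
        try {
            model.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Simple perceptron, chosen only because it is short; labels are 0 or 1.
final class PerceptronLearner implements Learner<LinearModel> {
    public LinearModel learn(double[][] features, double[] labels) {
        LinearModel model = new LinearModel();
        model.weights = new double[features[0].length];
        LinearPredictor predictor = new LinearPredictor();
        for (int epoch = 0; epoch < 10; epoch++) {
            for (int i = 0; i < features.length; i++) {
                double error = labels[i] - predictor.predict(model, features[i]);
                for (int j = 0; j < model.weights.length; j++) {
                    model.weights[j] += error * features[i][j];
                }
            }
        }
        return model;
    }
}

final class LinearPredictor implements Predictor<LinearModel> {
    public double predict(LinearModel model, double[] features) {
        double dot = 0.0;
        for (int j = 0; j < features.length; j++) dot += model.weights[j] * features[j];
        return dot > 0.0 ? 1.0 : 0.0;
    }
}
```

The point of the split is that the predicting side only needs the serialized
model and a Predictor, so a generic BSP could load the model in its setup phase
and never touch the learning code.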

2012/7/9 Tommaso Teofili <[email protected]>

> Ok, so let's sketch up here what these interfaces should look like.
> Any proposal is more than welcome.
> Regards,
> Tommaso
>
> 2012/7/7 Thomas Jungblut <[email protected]>
>
> > Looks fine to me.
> > The key is the interfaces for learning and predicting, so we should define
> > some vectors and matrices.
> > It would be enough to define the algorithms via the interfaces and a
> > generic BSP should just run them based on the given input.
> >
> > 2012/7/7 Tommaso Teofili <[email protected]>
> >
> > > Hi all,
> > >
> > > in my spare time I started writing some basic BSP-based machine learning
> > > algorithms for our ml module. Now I'm wondering, from a design point of
> > > view, where it'd make sense to put the training data / model. I'd assume
> > > the obvious answer would be HDFS, so this makes me think we should come up
> > > with (at least) two BSP jobs for each algorithm: one for learning and one
> > > for "predicting", each to be run separately.
> > > This would allow us to read the training data from HDFS and consequently
> > > create a model (also on HDFS); the created model could then be read
> > > (again from HDFS) in order to predict an output for a new input.
> > > Does that make sense?
> > > I'm just wondering what a general purpose design for Hama-based ML stuff
> > > would look like, so this is just to start the discussion; any opinion is
> > > welcome.
> > >
> > > Cheers,
> > > Tommaso
> > >
> >
>
