Re: [ML] - data storage and basic design approach

Tommaso Teofili Mon, 09 Jul 2012 09:38:28 -0700

2012/7/9 Thomas Jungblut <[email protected]>

> For the matrix/vector types I would propose my library's interfaces (quite
> like Mahout's math, but without boundary checks):
>
> https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
>
>
> https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java
> Full Writable for Vector and basic Writable for Matrix:
>
> https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable
>
> It is enough to implement all the machine learning algorithms I've seen so
> far, and the builder pattern allows really nice chaining of commands to
> easily code equations or translate code from MATLAB/Octave.
> See for example the logistic regression cost function:
>
> https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java



very nice, +1!
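
To illustrate the chaining style mentioned above, here is a minimal sketch of
a logistic regression cost function written against a tiny vector type. The
names Vec, map, multiply, add, dot, sum and cost are assumptions made up for
this example; they are not the tjungblut-math API, which may well look
different. Like the proposed library, the sketch does no boundary checks.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch; these names are NOT the tjungblut-math API. */
final class ChainedMathSketch {

  /** Immutable vector whose operations return new vectors, so calls can be chained. */
  static final class Vec {
    private final double[] values;

    Vec(double... values) { this.values = values.clone(); }

    int size() { return values.length; }

    /** Applies f to every element. */
    Vec map(DoubleUnaryOperator f) {
      double[] r = new double[values.length];
      for (int i = 0; i < r.length; i++) r[i] = f.applyAsDouble(values[i]);
      return new Vec(r);
    }

    /** Element-wise sum (no explicit length check). */
    Vec add(Vec other) {
      double[] r = new double[values.length];
      for (int i = 0; i < r.length; i++) r[i] = values[i] + other.values[i];
      return new Vec(r);
    }

    /** Element-wise product (no explicit length check). */
    Vec multiply(Vec other) {
      double[] r = new double[values.length];
      for (int i = 0; i < r.length; i++) r[i] = values[i] * other.values[i];
      return new Vec(r);
    }

    double dot(Vec other) { return multiply(other).sum(); }

    double sum() {
      double s = 0;
      for (double d : values) s += d;
      return s;
    }
  }

  /** J(theta) = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h)), h = sigmoid(theta . x). */
  static double cost(Vec[] x, Vec y, Vec theta) {
    double[] h = new double[x.length];
    for (int i = 0; i < x.length; i++) {
      h[i] = 1.0 / (1.0 + Math.exp(-theta.dot(x[i])));
    }
    Vec hypothesis = new Vec(h);
    // The chained calls mirror the equation term by term.
    Vec positive = y.multiply(hypothesis.map(Math::log));
    Vec negative = y.map(a -> 1 - a).multiply(hypothesis.map(a -> Math.log(1 - a)));
    return -positive.add(negative).sum() / x.length;
  }

  public static void main(String[] args) {
    Vec[] x = { new Vec(1, 2), new Vec(1, -1) };
    // With theta = 0 every prediction is 0.5, so the cost equals ln 2.
    System.out.println(cost(x, new Vec(1, 0), new Vec(0, 0)));
  }
}
```

Returning a new vector from every operation is what makes the chaining work;
whether the real interfaces should be immutable is a separate design question.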


>
>
> For the interfaces of the algorithms:
> I guess we need to get some more experience. I cannot tell what the
> interfaces should look like, mainly because I don't know how the
> BSP version of them will call the algorithm logic.
>

You're right, it's more reasonable to just proceed bottom-up with this, as
we'll have a clearer idea while developing the different
algorithms.
So for now I'd introduce your library's Writables and then proceed one step at
a time with the more common API.
Thanks,
Tommaso




>
> But having stable math interfaces is the key point.
>
> 2012/7/9 Tommaso Teofili <[email protected]>
>
> > Ok, so let's sketch out here what these interfaces should look like.
> > Any proposal is more than welcome.
> > Regards,
> > Tommaso
> >
> > 2012/7/7 Thomas Jungblut <[email protected]>
> >
> > > Looks fine to me.
> > > The key is the interfaces for learning and predicting, so we should
> > > define some vectors and matrices.
> > > It would be enough to define the algorithms via the interfaces, and a
> > > generic BSP should just run them based on the given input.
> > >
> > > 2012/7/7 Tommaso Teofili <[email protected]>
> > >
> > > > Hi all,
> > > >
> > > > in my spare time I started writing some basic BSP-based machine
> > > > learning algorithms for our ml module. Now I'm wondering, from a
> > > > design point of view, where it'd make sense to put the training
> > > > data / model. I'd assume the obvious answer would be HDFS, so this
> > > > makes me think we should come up with (at least) two BSP jobs for
> > > > each algorithm: one for learning and one for "predicting", each to
> > > > be run separately.
> > > > This would allow us to read the training data from HDFS and
> > > > consequently create a model (also on HDFS); the created model could
> > > > then be read (again from HDFS) in order to predict an output for a
> > > > new input.
> > > > Does that make sense?
> > > > I'm just wondering what a general-purpose design for Hama-based ML
> > > > stuff would look like, so this is just to start the discussion; any
> > > > opinion is welcome.
> > > >
> > > > Cheers,
> > > > Tommaso
> > > >
> > >
> >
>
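
The two-job split from the original message (learn and persist a model, then
load it separately to predict) can be sketched roughly as follows. This is
only an illustration under stated assumptions: local temp files stand in for
HDFS, plain static methods stand in for the two BSP jobs, and a one-variable
least-squares fit stands in for a real algorithm. The names TwoJobSketch,
learn and predict are hypothetical, and no Hama API is used.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical sketch: local files stand in for HDFS, methods stand in for BSP jobs. */
final class TwoJobSketch {

  /** "Learning job": reads "x,y" lines, fits y = slope * x + intercept, writes the model. */
  static void learn(Path trainingData, Path modelOut) throws IOException {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    int n = 0;
    for (String line : Files.readAllLines(trainingData)) {
      if (line.trim().isEmpty()) continue;
      String[] parts = line.split(",");
      double x = Double.parseDouble(parts[0].trim());
      double y = Double.parseDouble(parts[1].trim());
      sx += x;
      sy += y;
      sxx += x * x;
      sxy += x * y;
      n++;
    }
    // Closed-form least squares; assumes at least two distinct x values.
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;
    Files.write(modelOut, (slope + "," + intercept).getBytes(StandardCharsets.UTF_8));
  }

  /** "Predicting job": loads the persisted model and applies it to a new input. */
  static double predict(Path model, double x) throws IOException {
    String text = new String(Files.readAllBytes(model), StandardCharsets.UTF_8).trim();
    String[] parts = text.split(",");
    return Double.parseDouble(parts[0]) * x + Double.parseDouble(parts[1]);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("ml-sketch");
    Path train = dir.resolve("train.csv");
    Path model = dir.resolve("model.txt");
    // Points on the line y = 2x + 1.
    Files.write(train, Arrays.asList("0,1", "1,3", "2,5"));
    learn(train, model);                      // run once, separately
    System.out.println(predict(model, 3.0));  // run later, reading only the model
  }
}
```

The only thing the two steps share is the persisted model file, which is the
point of keeping learning and predicting as separate jobs.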
