Re: [ML] - data storage and basic design approach

Thomas Jungblut Tue, 10 Jul 2012 01:53:36 -0700

Feel free to commit this, but take care to add the Apache license headers.
Also, I wanted to add a few test cases over the next few weekends.
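
For reference, the weighted majority idea from the quoted message below can be sketched roughly as follows. This is a single-JVM illustration only, not BSP code: the class name WeightedMajority, the penalty factor beta and the tie-breaking rule are assumptions made up for the example. In a BSP version each peer would presumably run one expert and the votes would be exchanged at a sync, but that part is not shown.

```java
import java.util.Arrays;

/** Hypothetical sketch: weighted majority vote over binary "experts". */
final class WeightedMajority {
  private final double[] weights;
  private final double beta; // assumed penalty factor, must lie in (0, 1)

  WeightedMajority(int numExperts, double beta) {
    if (numExperts <= 0 || beta <= 0.0 || beta >= 1.0) {
      throw new IllegalArgumentException("need numExperts > 0 and 0 < beta < 1");
    }
    this.weights = new double[numExperts];
    Arrays.fill(this.weights, 1.0); // every expert starts with the same trust
    this.beta = beta;
  }

  /** Returns the weighted vote; ties are resolved to true (an arbitrary choice). */
  boolean predict(boolean[] expertPredictions) {
    double yes = 0.0;
    double no = 0.0;
    for (int i = 0; i < weights.length; i++) {
      if (expertPredictions[i]) {
        yes += weights[i];
      } else {
        no += weights[i];
      }
    }
    return yes >= no;
  }

  /** Penalizes every expert that was wrong by multiplying its weight by beta. */
  void update(boolean[] expertPredictions, boolean truth) {
    for (int i = 0; i < weights.length; i++) {
      if (expertPredictions[i] != truth) {
        weights[i] *= beta;
      }
    }
  }

  /** Returns a copy of the current weights. */
  double[] weights() {
    return weights.clone();
  }
}
```

The usual loop is predict, observe the true label, then update, so experts that keep making mistakes lose influence on later votes.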


2012/7/10 Tommaso Teofili <[email protected]>

> Nice idea. Thinking about it quickly, it looks to me that (C)GD is a good fit
> for BSP.
> Also, I was trying to implement some easy meta-learning algorithm like the
> weighted majority algorithm, where each peer has its own learning algorithm
> and gets penalized for each mistaken prediction.
> Regarding your math library, do you plan to commit it yourself? Otherwise I
> can do it.
> Regards,
> Tommaso
>
>
> 2012/7/10 Thomas Jungblut <[email protected]>
>
> > Maybe a good first step towards algorithms would be to evaluate how
> > we can implement some non-linear optimizers in BSP (BFGS or the conjugate
> > gradient method).
> >
> > 2012/7/9 Tommaso Teofili <[email protected]>
> >
> > > 2012/7/9 Thomas Jungblut <[email protected]>
> > >
> > > > For the matrix/vector I would propose my library interface (quite like
> > > > Mahout's math, but without boundary checks):
> > > >
> > > > https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
> > > >
> > > > https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java
> > > >
> > > > Full Writable for Vector and basic Writable for Matrix:
> > > >
> > > > https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable
> > > >
> > > > It is enough to implement all the machine learning algorithms I've seen
> > > > until now, and the builder pattern allows really nice chaining of
> > > > commands to easily code equations or translate code from MATLAB/Octave.
> > > > See, for example, the logistic regression cost function:
> > > >
> > > > https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java
> > >
> > >
> > > very nice, +1!
> > >
> > >
> > > >
> > > >
> > > > For the interfaces of the algorithms:
> > > > I guess we need to get some more experience. I cannot tell what the
> > > > interfaces for them should look like, mainly because I don't know how
> > > > the BSP version of them will call the algorithm logic.
> > > >
> > >
> > > You're right, it's more reasonable to just proceed bottom-up with this,
> > > as we're going to have a clearer idea while developing the different
> > > algorithms.
> > > So for now I'd introduce your library Writables and then proceed one step
> > > at a time with the more common API.
> > > Thanks,
> > > Tommaso
> > >
> > >
> > >
> > >
> > > >
> > > > But having stable math interfaces is the key point.
> > > >
> > > > 2012/7/9 Tommaso Teofili <[email protected]>
> > > >
> > > > > Ok, so let's sketch up here what these interfaces should look like.
> > > > > Any proposal is more than welcome.
> > > > > Regards,
> > > > > Tommaso
> > > > >
> > > > > 2012/7/7 Thomas Jungblut <[email protected]>
> > > > >
> > > > > > Looks fine to me.
> > > > > > The key is the interfaces for learning and predicting, so we should
> > > > > > define some vectors and matrices.
> > > > > > It would be enough to define the algorithms via the interfaces, and
> > > > > > a generic BSP should just run them based on the given input.
> > > > > >
> > > > > > 2012/7/7 Tommaso Teofili <[email protected]>
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > In my spare time I started writing some basic BSP-based machine
> > > > > > > learning algorithms for our ml module. Now I'm wondering, from a
> > > > > > > design point of view, where it'd make sense to put the training
> > > > > > > data / model. I'd assume the obvious answer would be HDFS, so this
> > > > > > > makes me think we should come up with (at least) two BSP jobs for
> > > > > > > each algorithm: one for learning and one for "predicting", each to
> > > > > > > be run separately.
> > > > > > > This would allow us to read the training data from HDFS and
> > > > > > > consequently create a model (also on HDFS); the created model could
> > > > > > > then be read (again from HDFS) in order to predict an output for a
> > > > > > > new input.
> > > > > > > Does that make sense?
> > > > > > > I'm just wondering what a general-purpose design for Hama-based ML
> > > > > > > stuff would look like, so this is just to start the discussion; any
> > > > > > > opinion is welcome.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Tommaso
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
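
To illustrate the learn/predict split from the original proposal above, here is a minimal sketch of what such interfaces could look like. Everything in it is an assumption for illustration: the names Learner, Predictor, LinearModel, GradientDescentLearner and LinearPredictor are hypothetical, the model is serialized to a byte array instead of a file on HDFS, plain arrays stand in for the proposed vector classes, and nothing here uses the Hama API. The learner is ordinary batch gradient descent on squared error, chosen only because it is short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical contract for the learning job: training data in, model out. */
interface Learner<M> {
  M learn(double[][] features, double[] labels);
}

/** Hypothetical contract for the predicting job: stored model plus new input. */
interface Predictor<M> {
  double predict(M model, double[] input);
}

/** Hypothetical model: a weight vector that can be stored and read back. */
final class LinearModel {
  final double[] weights;

  LinearModel(double[] weights) {
    this.weights = weights.clone();
  }

  /** Serializes the model; on a cluster these bytes would be written to HDFS. */
  byte[] toBytes() {
    try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
         DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(weights.length);
      for (double w : weights) {
        out.writeDouble(w);
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads back a model written by toBytes(). */
  static LinearModel fromBytes(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      double[] w = new double[in.readInt()];
      for (int i = 0; i < w.length; i++) {
        w[i] = in.readDouble();
      }
      return new LinearModel(w);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}

/** Learning side: plain batch gradient descent on mean squared error. */
final class GradientDescentLearner implements Learner<LinearModel> {
  private final double learningRate;
  private final int iterations;

  GradientDescentLearner(double learningRate, int iterations) {
    this.learningRate = learningRate;
    this.iterations = iterations;
  }

  @Override
  public LinearModel learn(double[][] features, double[] labels) {
    int n = features[0].length;
    double[] w = new double[n];
    for (int it = 0; it < iterations; it++) {
      double[] gradient = new double[n];
      for (int i = 0; i < features.length; i++) {
        double error = LinearPredictor.dot(w, features[i]) - labels[i];
        for (int j = 0; j < n; j++) {
          gradient[j] += error * features[i][j];
        }
      }
      for (int j = 0; j < n; j++) {
        w[j] -= learningRate * gradient[j] / features.length;
      }
    }
    return new LinearModel(w);
  }
}

/** Predicting side: only needs the stored model, never the training data. */
final class LinearPredictor implements Predictor<LinearModel> {
  static double dot(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  @Override
  public double predict(LinearModel model, double[] input) {
    return dot(model.weights, input);
  }
}
```

With this shape, one job would call learn and store the result of toBytes, and a second, separately run job would call fromBytes and predict. The gradient sum inside learn is the part that would likely be distributed across peers in a BSP version.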
