Splitting out a math module would be smarter, but let's just keep that in the ML package.

Anyone volunteer to code a simple (mini-) batch gradient descent in BSP?
http://holehouse.org/mlclass/17_Large_Scale_Machine_Learning.html
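To make the ask concrete, here is a rough sketch of what each peer could run: per superstep it computes the gradient over one local mini-batch, broadcasts the partial gradient to all peers, and averages the incoming ones so the replicated parameter stays identical everywhere. The class name, the synthetic data in setup(), and the exact Hama BSP signatures are assumptions for illustration, not committed code.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

/**
 * Sketch of mini-batch gradient descent for a 1-D linear model h(x) = theta * x.
 * Per superstep every peer computes the gradient over one local mini-batch,
 * broadcasts it, then averages the incoming partial gradients, so the
 * replicated theta stays in sync on all peers.
 */
public class MiniBatchGradientDescentBSP extends
    BSP<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> {

  private static final int SUPERSTEPS = 50;  // fixed iteration count for the sketch
  private static final int BATCH_SIZE = 10;
  private static final double ALPHA = 0.5;   // learning rate

  private double[] x, y;  // this peer's partition of the training data
  private double theta;   // model parameter, replicated on every peer
  private int cursor;     // start of the next mini-batch

  @Override
  public void setup(
      BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> peer) {
    // Stand-in for reading the peer's input split: synthetic data y = 2x + noise.
    Random rnd = new Random(peer.getPeerName().hashCode());
    x = new double[1000];
    y = new double[1000];
    for (int i = 0; i < x.length; i++) {
      x[i] = rnd.nextDouble();
      y[i] = 2d * x[i] + rnd.nextGaussian() * 0.01;
    }
  }

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {
    for (int superstep = 0; superstep < SUPERSTEPS; superstep++) {
      // gradient of the squared error over the local mini-batch:
      // (1/b) * sum((theta * x_j - y_j) * x_j)
      double gradient = 0d;
      for (int i = 0; i < BATCH_SIZE; i++) {
        int j = (cursor + i) % x.length;
        gradient += (theta * x[j] - y[j]) * x[j];
      }
      gradient /= BATCH_SIZE;
      cursor = (cursor + BATCH_SIZE) % x.length;

      // all-to-all broadcast of the partial gradient
      for (String other : peer.getAllPeerNames()) {
        peer.send(other, new DoubleWritable(gradient));
      }
      peer.sync();

      // average the incoming gradients and take one descent step
      double sum = 0d;
      int n = 0;
      DoubleWritable msg;
      while ((msg = peer.getCurrentMessage()) != null) {
        sum += msg.get();
        n++;
      }
      theta -= ALPHA * (sum / n);
    }
  }
}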
2012/7/10 Edward J. Yoon <edwardy...@apache.org>
> I would like to move it into the core module so that others can reuse it.
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon

On Tue, Jul 10, 2012 at 7:13 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> I've done the first import, we can start from that now, thanks Thomas.
> Tommaso

2012/7/10 Tommaso Teofili <tommaso.teof...@gmail.com>
> ok, I'll try that, thanks :)
> Tommaso

2012/7/10 Thomas Jungblut <thomas.jungb...@gmail.com>
> I don't know if we need sparse/named vectors for a first cut.
> You can just use the interface and the dense implementations, and remove
> all the uncompilable code in the writables.

2012/7/10 Tommaso Teofili <tommaso.teof...@gmail.com>
> Thomas, while inspecting the code I realized I may need to import
> most/all of the classes inside your math library for the writables to
> compile; is that ok for you, or would you rather I didn't?
> Regards,
> Tommaso

2012/7/10 Thomas Jungblut <thomas.jungb...@gmail.com>
> great, thank you for taking care of it ;)

2012/7/10 Tommaso Teofili <tommaso.teof...@gmail.com>
> Ok, sure, I'll just add the writables along with DoubleMatrix/Vector,
> with the AL2 headers on top.
> Thanks Thomas for the contribution and feedback.
> Tommaso

2012/7/10 Thomas Jungblut <thomas.jungb...@gmail.com>
> Feel free to commit this, but take care to add the Apache license
> headers. Also, I wanted to add a few test cases over the next few
> weekends.

2012/7/10 Tommaso Teofili <tommaso.teof...@gmail.com>
> nice idea; thinking about it quickly, it looks to me that (C)GD is a
> good fit for BSP.
> I was also trying to implement some easy meta-learning algorithm like
> the weighted majority algorithm, where each peer has its own learning
> algorithm and gets penalized for each mistaken prediction.
> Regarding your math library, do you plan to commit it yourself?
> Otherwise I can do it.
> Regards,
> Tommaso
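For reference, a minimal, self-contained sketch of the weighted majority update Tommaso describes: one weight per expert, multiplied by a penalty factor whenever that expert's prediction was wrong, with the ensemble predicting by weighted vote. All names are made up; in a BSP setting each peer would own one expert and the weights could be exchanged via messages.

import java.util.Arrays;

/**
 * Illustrative weighted majority ensemble (names invented for this sketch).
 */
public class WeightedMajority {

  private static final double BETA = 0.5; // penalty factor in (0, 1)

  private final double[] weights; // one weight per expert, initially 1

  public WeightedMajority(int numExperts) {
    weights = new double[numExperts];
    Arrays.fill(weights, 1d);
  }

  /** Weighted vote over the experts' binary predictions. */
  public boolean predict(boolean[] expertPredictions) {
    double yes = 0d, no = 0d;
    for (int i = 0; i < weights.length; i++) {
      if (expertPredictions[i]) {
        yes += weights[i];
      } else {
        no += weights[i];
      }
    }
    return yes >= no;
  }

  /** Penalize every expert whose prediction did not match the true label. */
  public void update(boolean[] expertPredictions, boolean truth) {
    for (int i = 0; i < weights.length; i++) {
      if (expertPredictions[i] != truth) {
        weights[i] *= BETA;
      }
    }
  }
}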
2012/7/10 Thomas Jungblut <thomas.jungb...@gmail.com>
> Maybe a first good step towards algorithms would be to evaluate how we
> can implement some non-linear optimizers in BSP (BFGS or the conjugate
> gradient method).

2012/7/9 Tommaso Teofili <tommaso.teof...@gmail.com>
> 2012/7/9 Thomas Jungblut <thomas.jungb...@gmail.com>
>
> > For the matrix/vector I would propose my library's interface (quite
> > like Mahout's math, but without boundary checks):
> > https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
> > https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java
> >
> > Full Writable for the vector and a basic Writable for the matrix:
> > https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable
> >
> > It is enough to implement all the machine learning algorithms I've
> > seen so far, and the builder pattern allows really nice chaining of
> > commands to easily code equations or translate code from Matlab/Octave.
> > See for example the logistic regression cost function:
> > https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java
>
> very nice, +1!
>
> > For the interfaces of the algorithms:
> > I guess we need to get some more experience; I cannot tell how the
> > interfaces for them should look, mainly because I don't know how the
> > BSP versions will call the algorithm logic.
> >
> > But having stable math interfaces is the key point.
>
> you're right, it's more reasonable to just proceed bottom-up with this,
> as we're going to have a clearer idea while developing the different
> algorithms.
> So for now I'd introduce your library's Writables and then proceed one
> step at a time towards the more common API.
> Thanks,
> Tommaso
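To illustrate the chaining Thomas mentions, here is a toy fluent vector with the Octave logistic-regression cost translated against it. Every name here is invented for the example; the real interface is the linked DoubleVector.

/**
 * Toy fluent vector, only to show the chained style; not the actual library.
 */
final class Vec {
  private final double[] v;

  Vec(double... values) { this.v = values.clone(); }

  /** Element-wise product. */
  Vec multiply(Vec o) {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = v[i] * o.v[i];
    return new Vec(r);
  }

  /** Element-wise sum. */
  Vec add(Vec o) {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = v[i] + o.v[i];
    return new Vec(r);
  }

  /** s - v_i for every component, handy for the (1 - h) terms. */
  Vec subtractFrom(double s) {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = s - v[i];
    return new Vec(r);
  }

  /** Element-wise natural logarithm. */
  Vec log() {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = Math.log(v[i]);
    return new Vec(r);
  }

  double sum() {
    double s = 0d;
    for (double d : v) s += d;
    return s;
  }

  public static void main(String[] args) {
    // Octave: J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h))
    Vec y = new Vec(1, 0, 1);        // labels
    Vec h = new Vec(0.9, 0.2, 0.7);  // hypothesis outputs
    int m = 3;
    double cost = -(y.multiply(h.log())
        .add(y.subtractFrom(1d).multiply(h.subtractFrom(1d).log()))
        .sum()) / m;
    System.out.println(cost);
  }
}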
2012/7/9 Tommaso Teofili <tommaso.teof...@gmail.com>
> Ok, so let's sketch up here what these interfaces should look like.
> Any proposal is more than welcome.
> Regards,
> Tommaso

2012/7/7 Thomas Jungblut <thomas.jungb...@gmail.com>
> Looks fine to me.
> The key is the interfaces for learning and predicting, so we should
> define some vectors and matrices.
> It would be enough to define the algorithms via the interfaces, and a
> generic BSP should just run them based on the given input.

2012/7/7 Tommaso Teofili <tommaso.teof...@gmail.com>
> Hi all,
>
> in my spare time I started writing some basic BSP-based machine
> learning algorithms for our ml module. Now I'm wondering, from a design
> point of view, where it'd make sense to put the training data / model.
> I'd assume the obvious answer is HDFS, which makes me think we should
> come up with (at least) two BSP jobs for each algorithm: one for
> learning and one for "predicting", each to be run separately.
> This would allow reading the training data from HDFS, consequently
> creating a model (also on HDFS); the created model could then be read
> (again from HDFS) in order to predict an output for a new input.
> Does that make sense?
> I'm just wondering what a general-purpose design for Hama-based ML
> stuff would look like, so this is just to start the discussion; any
> opinion is welcome.
>
> Cheers,
> Tommaso
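A minimal driver sketch of the two-job flow Tommaso describes: a training job persists the model on HDFS, then a separate prediction job reads it back. The BSP bodies, the "ml.model.path" key, and the paths are placeholders, and the BSPJob calls follow Hama's examples from memory, so treat this as an assumption-laden sketch rather than a working program.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hama.HamaConfiguration;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPJob;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class TrainThenPredict {

  /** Hypothetical learning BSP: fits a model and writes it to HDFS. */
  public static class TrainingBSP extends
      BSP<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> {
    @Override
    public void bsp(
        BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> peer)
        throws IOException, SyncException, InterruptedException {
      // learn from this peer's input split, then persist the model under
      // peer.getConfiguration().get("ml.model.path")
    }
  }

  /** Hypothetical predicting BSP: loads the model and labels new input. */
  public static class PredictionBSP extends
      BSP<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> {
    @Override
    public void bsp(
        BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> peer)
        throws IOException, SyncException, InterruptedException {
      // read the model from HDFS, predict an output for each new input record
    }
  }

  public static void main(String[] args) throws Exception {
    Path model = new Path("/user/hama/ml/model");

    // Job 1: learn from the training data, write the model to HDFS.
    HamaConfiguration trainConf = new HamaConfiguration();
    trainConf.set("ml.model.path", model.toString());
    BSPJob train = new BSPJob(trainConf, TrainThenPredict.class);
    train.setJobName("model training");
    train.setBspClass(TrainingBSP.class);
    if (!train.waitForCompletion(true)) {
      return; // training failed, don't predict
    }

    // Job 2: read the model back and predict outputs for new inputs.
    HamaConfiguration predictConf = new HamaConfiguration();
    predictConf.set("ml.model.path", model.toString());
    BSPJob predict = new BSPJob(predictConf, TrainThenPredict.class);
    predict.setJobName("prediction");
    predict.setBspClass(PredictionBSP.class);
    predict.waitForCompletion(true);
  }
}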