2012/7/7 Tommaso Teofili <[email protected]>

Hi all,

in my spare time I started writing some basic BSP-based machine learning algorithms for our ml module. Now I'm wondering, from a design point of view, where it would make sense to put the training data / model. I'd assume the obvious answer would be HDFS, which makes me think we should come up with (at least) two BSP jobs for each algorithm: one for learning and one for "predicting", each to be run separately. This would allow reading the training data from HDFS and consequently creating a model (also on HDFS); the created model could then be read (again from HDFS) in order to predict an output for a new input. Does that make sense?
I'm just wondering what a general-purpose design for Hama-based ML stuff would look like, so this is just to start the discussion; any opinion is welcome.

Cheers,
Tommaso

2012/7/7 Thomas Jungblut <[email protected]>

Looks fine to me.
The key is the interfaces for learning and predicting, so we should define some vectors and matrices. It would be enough to define the algorithms via the interfaces, and a generic BSP should just run them based on the given input.

2012/7/9 Tommaso Teofili <[email protected]>

Ok, so let's sketch up here what these interfaces should look like. Any proposal is more than welcome.
Regards,
Tommaso

2012/7/9 Thomas Jungblut <[email protected]>

For the matrix/vector I would propose my library's interfaces (quite like Mahout's math, but without boundary checks):

https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java

Full Writable for Vector and basic Writable for Matrix:

https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable

It is enough to implement all the machine learning algorithms I've seen until now, and the builder pattern allows really nice chaining of commands to easily code equations or translate code from Matlab/Octave. See for example the logistic regression cost function:

https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java

For the interfaces of the algorithms: I guess we need to get some more experience. I can't tell how the interfaces for them should look, mainly because I don't know how the BSP versions of them will call the algorithm logic. But having stable math interfaces is the key point.

2012/7/9 Tommaso Teofili <[email protected]>

Very nice, +1!
You're right, it's more reasonable to just proceed bottom-up with this, as we're going to have a clearer idea while developing the different algorithms. So for now I'd introduce your library's Writables and then proceed one step at a time with the more common API.
Thanks,
Tommaso

2012/7/10 Thomas Jungblut <[email protected]>

Maybe a good first step towards algorithms would be to evaluate how we can implement some non-linear optimizers in BSP (BFGS or the conjugate gradient method).

2012/7/10 Tommaso Teofili <[email protected]>

Nice idea; thinking about it quickly, it looks to me that (C)GD is a good fit for BSP.
Also, I was trying to implement some easy meta-learning algorithm like the weighted majority algorithm, where each peer has its own learning algorithm and gets penalized for each mistaken prediction.
Regarding your math library: do you plan to commit it yourself? Otherwise I can do it.
Regards,
Tommaso

2012/7/10 Thomas Jungblut <[email protected]>

Feel free to commit this, but take care to add the Apache license headers. Also, I wanted to add a few testcases over the next few weekends.

2012/7/10 Tommaso Teofili <[email protected]>

Ok, sure, I'll just add the writables along with DoubleMatrix/Vector with the AL2 headers on top.
Thanks Thomas for the contribution and feedback.
Tommaso

2012/7/10 Thomas Jungblut <[email protected]>

Great, thank you for taking care of it ;)

2012/7/10 Tommaso Teofili <[email protected]>

Thomas, while inspecting the code I realized I may need to import most/all of the classes inside your math library for the writables to compile. Is that ok for you, or would you rather not?
Regards,
Tommaso

2012/7/10 Thomas Jungblut <[email protected]>

I don't know if we need sparse/named vectors for the first scratch. You can just use the interface and the dense implementations and remove all the uncompilable code in the writables.

2012/7/10 Tommaso Teofili <[email protected]>

Ok, I'll try that, thanks :)
Tommaso

2012/7/10 Tommaso Teofili <[email protected]>

I've done the first import; we can start from that now. Thanks Thomas.
Tommaso
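The chained-builder style Thomas describes for coding equations Octave-style can be illustrated with a minimal sketch of the logistic regression cost J = -(1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h)). The `DenseVector` class and its method names below are hypothetical stand-ins, not the actual de.jungblut.math API.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical, minimal stand-in for a chainable vector interface;
// class and method names are assumptions, not the de.jungblut.math API.
final class DenseVector {
    private final double[] values;

    DenseVector(double... values) { this.values = values.clone(); }

    int length() { return values.length; }

    // Element-wise map, e.g. apply(Math::log).
    DenseVector apply(DoubleUnaryOperator op) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) out[i] = op.applyAsDouble(values[i]);
        return new DenseVector(out);
    }

    // Element-wise product.
    DenseVector multiply(DenseVector o) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) out[i] = values[i] * o.values[i];
        return new DenseVector(out);
    }

    // scalar - v, element-wise (e.g. 1 - y).
    DenseVector subtractFrom(double scalar) {
        return apply(v -> scalar - v);
    }

    // Element-wise sum of two vectors.
    DenseVector add(DenseVector o) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) out[i] = values[i] + o.values[i];
        return new DenseVector(out);
    }

    double sum() {
        double s = 0;
        for (double v : values) s += v;
        return s;
    }
}

public class LogisticCostSketch {
    // J = -(1/m) * sum(y .* log(h) + (1 - y) .* log(1 - h)),
    // written as one chain of vector operations, Octave-style.
    static double cost(DenseVector h, DenseVector y) {
        double total = y.multiply(h.apply(Math::log))
            .add(y.subtractFrom(1.0).multiply(h.subtractFrom(1.0).apply(Math::log)))
            .sum();
        return -total / h.length();
    }

    public static void main(String[] args) {
        DenseVector h = new DenseVector(0.9, 0.1); // predicted probabilities
        DenseVector y = new DenseVector(1.0, 0.0); // labels
        System.out.println(cost(h, y)); // ~0.10536 (= -ln(0.9))
    }
}
```

The chain mirrors the Octave expression almost token for token, which is the appeal of the builder style the thread refers to.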

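Tommaso's two-job split (a learning job that persists a model, and a separate predicting job that reads it back) can be sketched with plain Java. Everything here is illustrative: the `Learner`/`Predictor` interfaces, the single-weight least-squares model, and the local temp file standing in for HDFS are assumptions, not Hama APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the two-job design from the thread: the learning
// step writes its model to shared storage (a temp file here, HDFS in Hama),
// and an independent predicting step loads it back.
interface Learner {
    void train(double[] xs, double[] ys, Path modelPath) throws IOException;
}

interface Predictor {
    double predict(double x, Path modelPath) throws IOException;
}

public class TwoJobSketch {
    // "Learning job": fits y = w * x by least squares and persists w.
    static final Learner LEARN = (xs, ys, model) -> {
        double num = 0, den = 0;
        for (int i = 0; i < xs.length; i++) {
            num += xs[i] * ys[i];
            den += xs[i] * xs[i];
        }
        Files.writeString(model, Double.toString(num / den));
    };

    // "Predicting job": reloads the persisted model and applies it.
    static final Predictor PREDICT = (x, model) -> {
        double w = Double.parseDouble(Files.readString(model).trim());
        return w * x;
    };

    public static void main(String[] args) throws IOException {
        Path model = Files.createTempFile("model", ".txt");
        LEARN.train(new double[] {1, 2, 3}, new double[] {2, 4, 6}, model);
        System.out.println(PREDICT.predict(4, model)); // prints 8.0
    }
}
```

Because the two steps share nothing but the persisted model, they can run as two separate BSP jobs exactly as proposed in the thread.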