Re: [ML] - data storage and basic design approach

Edward J. Yoon Tue, 10 Jul 2012 03:28:41 -0700

I'd like to move this into the core module so that others can reuse it.

On Tue, Jul 10, 2012 at 7:13 PM, Tommaso Teofili
<[email protected]> wrote:
> I've done the first import, we can start from that now, thanks Thomas.
> Tommaso
>
> 2012/7/10 Tommaso Teofili <[email protected]>
>
>> ok, I'll try that, thanks :)
>> Tommaso
>>
>> 2012/7/10 Thomas Jungblut <[email protected]>
>>
>>> I don't know if we need sparse/named vectors for the first cut.
>>> You can just use the interface and the dense implementations and remove
>>> all the uncompilable code in the writables.
>>>
>>> 2012/7/10 Tommaso Teofili <[email protected]>
>>>
>>> > Thomas, while inspecting the code I realized I may need to import most/all
>>> > of the classes inside your math library for the writables to compile. Is
>>> > that OK with you, or would you rather I didn't?
>>> > Regards,
>>> > Tommaso
>>> >
>>> > 2012/7/10 Thomas Jungblut <[email protected]>
>>> >
>>> > > great, thank you for taking care of it ;)
>>> > >
>>> > > 2012/7/10 Tommaso Teofili <[email protected]>
>>> > >
>>> > > > Ok, sure, I'll just add the writables along with DoubleMatrix/Vector,
>>> > > > with the AL2 headers on top.
>>> > > > Thanks Thomas for the contribution and feedback.
>>> > > > Tommaso
>>> > > >
>>> > > > 2012/7/10 Thomas Jungblut <[email protected]>
>>> > > >
>>> > > > > Feel free to commit this, but take care to add the Apache license
>>> > > > > headers.
>>> > > > > Also I wanted to add a few test cases over the next few weekends.
>>> > > > >
>>> > > > > 2012/7/10 Tommaso Teofili <[email protected]>
>>> > > > >
>>> > > > > > Nice idea. Thinking about it quickly, it looks to me like (C)GD is
>>> > > > > > a good fit for BSP.
>>> > > > > > I was also trying to implement some easy meta-learning algorithm,
>>> > > > > > like the weighted majority algorithm, where each peer has its own
>>> > > > > > learning algorithm and gets penalized for each mistaken prediction.
>>> > > > > > Regarding your math library, do you plan to commit it yourself?
>>> > > > > > Otherwise I can do it.
>>> > > > > > Regards,
>>> > > > > > Tommaso
>>> > > > > >
>>> > > > > >
>>> > > > > > 2012/7/10 Thomas Jungblut <[email protected]>
>>> > > > > >
>>> > > > > > > Maybe a good first step towards algorithms would be to evaluate
>>> > > > > > > how we can implement some non-linear optimizers in BSP (BFGS or
>>> > > > > > > the conjugate gradient method).
>>> > > > > > >
>>> > > > > > > 2012/7/9 Tommaso Teofili <[email protected]>
>>> > > > > > >
>>> > > > > > > > 2012/7/9 Thomas Jungblut <[email protected]>
>>> > > > > > > >
>>> > > > > > > > > For the matrix/vector I would propose my library interface
>>> > > > > > > > > (quite like Mahout's math, but without boundary checks):
>>> > > > > > > > >
>>> > > > > > > > > https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
>>> > > > > > > > >
>>> > > > > > > > > https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java
>>> > > > > > > > >
>>> > > > > > > > > Full Writable for Vector and basic Writable for Matrix:
>>> > > > > > > > >
>>> > > > > > > > > https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable
>>> > > > > > > > >
>>> > > > > > > > > It is enough to implement all the machine learning algorithms
>>> > > > > > > > > I've seen until now, and the builder pattern allows really nice
>>> > > > > > > > > chaining of commands to easily code equations or translate code
>>> > > > > > > > > from MATLAB/Octave.
>>> > > > > > > > > See for example the logistic regression cost function:
>>> > > > > > > > >
>>> > > > > > > > > https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > very nice, +1!
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > For the interfaces of the algorithms:
>>> > > > > > > > > I guess we need to get some more experience. I cannot tell what
>>> > > > > > > > > the interfaces for them should look like, mainly because I don't
>>> > > > > > > > > know how the BSP version of them will call the algorithm logic.
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > > > you're right, it's more reasonable to just proceed bottom-up with
>>> > > > > > > > this, as we're going to have a clearer idea while developing the
>>> > > > > > > > different algorithms.
>>> > > > > > > > So for now I'd introduce your library Writables and then proceed
>>> > > > > > > > one step at a time with the more common API.
>>> > > > > > > > Thanks,
>>> > > > > > > > Tommaso
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > But having stable math interfaces is the key point.
>>> > > > > > > > >
>>> > > > > > > > > 2012/7/9 Tommaso Teofili <[email protected]>
>>> > > > > > > > >
>>> > > > > > > > > > Ok, so let's sketch out here what these interfaces should look
>>> > > > > > > > > > like.
>>> > > > > > > > > > Any proposal is more than welcome.
>>> > > > > > > > > > Regards,
>>> > > > > > > > > > Tommaso
>>> > > > > > > > > >
>>> > > > > > > > > > 2012/7/7 Thomas Jungblut <[email protected]>
>>> > > > > > > > > >
>>> > > > > > > > > > > Looks fine to me.
>>> > > > > > > > > > > The key is the interfaces for learning and predicting, so we
>>> > > > > > > > > > > should define some vectors and matrices.
>>> > > > > > > > > > > It would be enough to define the algorithms via the interfaces,
>>> > > > > > > > > > > and a generic BSP should just run them based on the given input.
>>> > > > > > > > > > >
>>> > > > > > > > > > > 2012/7/7 Tommaso Teofili <[email protected]>
>>> > > > > > > > > > >
>>> > > > > > > > > > > > Hi all,
>>> > > > > > > > > > > >
>>> > > > > > > > > > > > in my spare time I started writing some basic BSP-based machine
>>> > > > > > > > > > > > learning algorithms for our ml module. Now I'm wondering, from a
>>> > > > > > > > > > > > design point of view, where it would make sense to put the
>>> > > > > > > > > > > > training data / model. I'd assume the obvious answer is HDFS,
>>> > > > > > > > > > > > which makes me think we should come up with (at least) two BSP
>>> > > > > > > > > > > > jobs for each algorithm: one for learning and one for
>>> > > > > > > > > > > > "predicting", each to be run separately.
>>> > > > > > > > > > > > This would allow us to read the training data from HDFS and
>>> > > > > > > > > > > > consequently create a model (also on HDFS); the created model
>>> > > > > > > > > > > > could then be read (again from HDFS) in order to predict an
>>> > > > > > > > > > > > output for a new input.
>>> > > > > > > > > > > > Does that make sense?
>>> > > > > > > > > > > > I'm just wondering what a general-purpose design for Hama-based
>>> > > > > > > > > > > > ML stuff would look like, so this is just to start the
>>> > > > > > > > > > > > discussion; any opinion is welcome.
>>> > > > > > > > > > > >
>>> > > > > > > > > > > > Cheers,
>>> > > > > > > > > > > > Tommaso
>>> > > > > > > > > > > >
>>> > > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>




-- 
Best Regards, Edward J. Yoon
@eddieyoon
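
A minimal sketch of the learn/predict split proposed at the start of this
thread, assuming plain Java with no Hama or HDFS APIs. The names Learner,
Predictor, CentroidModel, CentroidLearner and CentroidPredictor are
hypothetical; they do not come from the thread or from Thomas's library. A
byte array stands in for the model file on HDFS, and a two-class
nearest-centroid classifier stands in for a real algorithm.

```java
import java.nio.ByteBuffer;

/** Hypothetical interface for the "learning" job: training data in, model out. */
interface Learner<M> {
  M learn(double[][] features, int[] labels);
}

/** Hypothetical interface for the "predicting" job: model and new input in, label out. */
interface Predictor<M> {
  int predict(M model, double[] input);
}

/** Toy model: one centroid per class (labels 0 and 1). */
final class CentroidModel {
  final double[][] centroids; // centroids[label][dimension]

  CentroidModel(double[][] centroids) {
    this.centroids = centroids;
  }

  /** Serializes the model; the byte array is a stand-in for a file on HDFS. */
  byte[] toBytes() {
    int dim = centroids[0].length;
    ByteBuffer buf = ByteBuffer.allocate(8 + 8 * centroids.length * dim);
    buf.putInt(centroids.length).putInt(dim);
    for (double[] centroid : centroids) {
      for (double value : centroid) {
        buf.putDouble(value);
      }
    }
    return buf.array();
  }

  /** Restores a model previously written by toBytes(). */
  static CentroidModel fromBytes(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int classes = buf.getInt();
    int dim = buf.getInt();
    double[][] centroids = new double[classes][dim];
    for (int label = 0; label < classes; label++) {
      for (int d = 0; d < dim; d++) {
        centroids[label][d] = buf.getDouble();
      }
    }
    return new CentroidModel(centroids);
  }
}

/** Learning side: averages the feature vectors of each class. */
final class CentroidLearner implements Learner<CentroidModel> {
  @Override
  public CentroidModel learn(double[][] features, int[] labels) {
    int dim = features[0].length;
    double[][] sums = new double[2][dim];
    int[] counts = new int[2];
    for (int i = 0; i < features.length; i++) {
      counts[labels[i]]++;
      for (int d = 0; d < dim; d++) {
        sums[labels[i]][d] += features[i][d];
      }
    }
    for (int label = 0; label < 2; label++) {
      for (int d = 0; d < dim; d++) {
        sums[label][d] /= Math.max(1, counts[label]); // avoid dividing by zero
      }
    }
    return new CentroidModel(sums);
  }
}

/** Predicting side: returns the label of the closest centroid. */
final class CentroidPredictor implements Predictor<CentroidModel> {
  @Override
  public int predict(CentroidModel model, double[] input) {
    int best = 0;
    double bestDistance = Double.MAX_VALUE;
    for (int label = 0; label < model.centroids.length; label++) {
      double distance = 0.0;
      for (int d = 0; d < input.length; d++) {
        double diff = input[d] - model.centroids[label][d];
        distance += diff * diff;
      }
      if (distance < bestDistance) {
        bestDistance = distance;
        best = label;
      }
    }
    return best;
  }
}
```

In this sketch the learning job would call learn() and store toBytes(), and
the separately run predicting job would call fromBytes() and then predict().
How a generic BSP would drive such interfaces is left open, as it is in the
thread.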

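The weighted majority idea mentioned in the thread (each peer has its own
learner and gets penalized for each mistaken prediction) could look roughly
like this. It is a sketch under stated assumptions: labels are binary (0 or
1), the penalty is a multiplicative factor between 0 and 1, and the class
name WeightedMajoritySketch is hypothetical. No Hama API is used; in a BSP
setting each array slot would correspond to one peer's prediction.

```java
import java.util.Arrays;

/** Weighted majority over a fixed set of binary experts (labels are 0 or 1). */
final class WeightedMajoritySketch {
  private final double[] weights;
  private final double penalty; // factor in (0, 1) applied to an expert's weight on a mistake

  WeightedMajoritySketch(int numExperts, double penalty) {
    if (numExperts <= 0 || penalty <= 0.0 || penalty >= 1.0) {
      throw new IllegalArgumentException("need numExperts > 0 and 0 < penalty < 1");
    }
    this.weights = new double[numExperts];
    Arrays.fill(this.weights, 1.0); // every expert starts with the same trust
    this.penalty = penalty;
  }

  /** Combined prediction: the label backed by the larger total weight (ties go to 1). */
  int predict(int[] expertPredictions) {
    double weightForOne = 0.0;
    double weightForZero = 0.0;
    for (int i = 0; i < weights.length; i++) {
      if (expertPredictions[i] == 1) {
        weightForOne += weights[i];
      } else {
        weightForZero += weights[i];
      }
    }
    return weightForOne >= weightForZero ? 1 : 0;
  }

  /** Penalizes every expert whose prediction differs from the true label. */
  void update(int[] expertPredictions, int trueLabel) {
    for (int i = 0; i < weights.length; i++) {
      if (expertPredictions[i] != trueLabel) {
        weights[i] *= penalty;
      }
    }
  }

  /** Returns a copy of the current expert weights. */
  double[] weights() {
    return weights.clone();
  }
}
```

Experts that keep making mistakes lose influence geometrically, so the
combined vote drifts towards the experts that have been right most often.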