My concern is that this looks like duplicated efforts with Miklai. I think it needs to be organized.
On Tue, Jul 10, 2012 at 8:26 PM, Thomas Jungblut <[email protected]> wrote:

Splitting out a math module would be smarter, but let's just keep that in the ML package. Anyone volunteer to code a simple (mini-)batch gradient descent in BSP?
http://holehouse.org/mlclass/17_Large_Scale_Machine_Learning.html

2012/7/10 Edward J. Yoon <[email protected]>:

I would like to move the core module so that others can reuse it.

On Tue, Jul 10, 2012 at 7:13 PM, Tommaso Teofili <[email protected]> wrote:

I've done the first import, we can start from that now. Thanks, Thomas.

2012/7/10 Tommaso Teofili <[email protected]>:

Ok, I'll try that, thanks :)

2012/7/10 Thomas Jungblut <[email protected]>:

I don't know if we need sparse/named vectors for the first scratch. You can just use the interface and the dense implementations, and remove all the uncompilable code in the writables.

2012/7/10 Tommaso Teofili <[email protected]>:

Thomas, while inspecting the code I realized I may need to import most/all of the classes inside your math library for the writables to compile. Is that ok with you, or would you rather I didn't?

2012/7/10 Thomas Jungblut <[email protected]>:

Great, thank you for taking care of it ;)

2012/7/10 Tommaso Teofili <[email protected]>:

Ok, sure, I'll just add the writables along with DoubleMatrix/Vector and the AL2 headers on top. Thanks, Thomas, for the contribution and feedback.

2012/7/10 Thomas Jungblut <[email protected]>:

Feel free to commit this, but take care to add the Apache license headers.
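Thomas's call for a (mini-)batch gradient descent maps naturally onto BSP: each peer computes a partial gradient over its shard of the data during a superstep, the partials are combined at the sync barrier, and every peer applies the same averaged update. A minimal plain-Java sketch of that structure, with no Hama dependency (the "peers" are simulated by a loop, and all class and method names here are invented for illustration):

```java
import java.util.Arrays;

public class MiniBatchGdSketch {

  // Partial gradient of the squared error for y = w0 + w1*x over one
  // peer's data shard (each row is {x, y}).
  static double[] partialGradient(double[] w, double[][] shard) {
    double[] g = new double[2];
    for (double[] p : shard) {
      double err = (w[0] + w[1] * p[0]) - p[1];
      g[0] += err;
      g[1] += err * p[0];
    }
    return g;
  }

  // One "superstep" per epoch: every peer contributes its partial
  // gradient, the barrier sums them, and the averaged update is
  // applied identically on all peers.
  public static double[] train(double[][][] shards, int epochs,
                               double alpha, int n) {
    double[] w = new double[2];
    for (int e = 0; e < epochs; e++) {
      double[] sum = new double[2];
      for (double[][] shard : shards) {   // stands in for parallel peers
        double[] g = partialGradient(w, shard);
        sum[0] += g[0];
        sum[1] += g[1];
      }
      w[0] -= alpha * sum[0] / n;         // "sync": same update everywhere
      w[1] -= alpha * sum[1] / n;
    }
    return w;
  }

  public static void main(String[] args) {
    // y = 2x, split across two peers; w1 should approach 2.0
    double[][][] shards = { { {1, 2}, {2, 4} }, { {3, 6}, {4, 8} } };
    System.out.println(Arrays.toString(train(shards, 2000, 0.05, 4)));
  }
}
```

In a real Hama job the inner loop would disappear: each peer would send its partial gradient as a message and read the others' after sync().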
Also, I wanted to add a few testcases over the next few weekends.

2012/7/10 Tommaso Teofili <[email protected]>:

Nice idea; thinking about it quickly, it looks to me that (C)GD is a good fit for BSP. I was also trying to implement some simple meta-learning algorithms, such as the weighted majority algorithm, where each peer has its own learning algorithm and gets penalized for each mistaken prediction. Regarding your math library, do you plan to commit it yourself? Otherwise I can do it.

2012/7/10 Thomas Jungblut <[email protected]>:

Maybe a good first step towards algorithms would be to evaluate how we can implement some non-linear optimizers in BSP (BFGS or the conjugate gradient method).

2012/7/9 Tommaso Teofili <[email protected]>:

2012/7/9 Thomas Jungblut <[email protected]>:

> For the matrix/vector I would propose my library's interface (quite like Mahout's math, but without boundary checks):
>
> https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
> https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java
>
> A full Writable for Vector and a basic Writable for Matrix:
>
> https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable
>
> It is enough to implement all the machine learning algorithms I've seen until now, and the builder pattern allows really nice chaining of commands to easily code equations or translate code from Matlab/Octave.
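The Writables Thomas links presumably follow the standard Hadoop serialization pattern for a dense vector: an int length prefix followed by the raw doubles. A self-contained sketch of that pattern (the class name is made up, and the real class would implement org.apache.hadoop.io.Writable; the Hadoop dependency is left out here so the sketch compiles on its own):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Dense-vector serialization in the usual Hadoop Writable style.
// write() and readFields() carry the same signatures as the
// org.apache.hadoop.io.Writable contract.
public class DenseVectorWritableSketch {
  private double[] values = new double[0];

  public void set(double[] values) { this.values = values; }
  public double[] get() { return values; }

  // Serialize: length prefix, then each component.
  public void write(DataOutput out) throws IOException {
    out.writeInt(values.length);
    for (double v : values)
      out.writeDouble(v);
  }

  // Deserialize: read the length, then fill the array back in.
  public void readFields(DataInput in) throws IOException {
    values = new double[in.readInt()];
    for (int i = 0; i < values.length; i++)
      values[i] = in.readDouble();
  }
}
```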
> See for example the logistic regression cost function:
>
> https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java

Very nice, +1!

> For the interfaces of the algorithms: I guess we need to get some more experience. I cannot tell how the interfaces for them should look, mainly because I don't know how the BSP versions will call the algorithm logic.

You're right, it's more reasonable to just proceed bottom-up with this, as we're going to have a clearer idea while developing the different algorithms. So for now I'd introduce your library's Writables and then proceed one step at a time towards the more common API.

> But having stable math interfaces is the key point.

2012/7/9 Tommaso Teofili <[email protected]>:

Ok, so let's sketch up here what these interfaces should look like. Any proposal is more than welcome.
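The Octave-style chaining Thomas mentions can make the logistic regression cost, J = -1/m * sum(y .* log(h) + (1-y) .* log(1-h)), read almost like the equation itself. A toy chainable vector standing in for the library's DoubleVector (all names here are hypothetical, not taken from the linked code):

```java
import java.util.function.DoubleUnaryOperator;

public class LogisticCostSketch {

  // Minimal chainable vector: only what the cost function needs.
  static final class Vec {
    final double[] v;
    Vec(double... v) { this.v = v; }

    Vec map(DoubleUnaryOperator f) {
      double[] r = new double[v.length];
      for (int i = 0; i < v.length; i++) r[i] = f.applyAsDouble(v[i]);
      return new Vec(r);
    }
    Vec multiply(Vec o) {               // element-wise product
      double[] r = new double[v.length];
      for (int i = 0; i < v.length; i++) r[i] = v[i] * o.v[i];
      return new Vec(r);
    }
    Vec add(Vec o) {
      double[] r = new double[v.length];
      for (int i = 0; i < v.length; i++) r[i] = v[i] + o.v[i];
      return new Vec(r);
    }
    double sum() {
      double s = 0;
      for (double x : v) s += x;
      return s;
    }
  }

  // J = -1/m * sum( y .* log(h) + (1-y) .* log(1-h) ),
  // written as one chained expression, Octave-style.
  static double cost(Vec h, Vec y) {
    int m = y.v.length;
    return -y.multiply(h.map(Math::log))
        .add(y.map(x -> 1 - x).multiply(h.map(x -> Math.log(1 - x))))
        .sum() / m;
  }
}
```

With h = {0.5, 0.5} and y = {1, 0} the cost is log(2), as expected for an uninformative hypothesis.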
2012/7/7 Thomas Jungblut <[email protected]>:

Looks fine to me. The key is the interfaces for learning and predicting, so we should define some vectors and matrices. It would be enough to define the algorithms via the interfaces; a generic BSP job should just run them on the given input.

2012/7/7 Tommaso Teofili <[email protected]>:

Hi all,

in my spare time I started writing some basic BSP-based machine learning algorithms for our ml module. Now I'm wondering, from a design point of view, where it'd make sense to put the training data / model. I'd assume the obvious answer is HDFS, which makes me think we should come up with (at least) two BSP jobs for each algorithm: one for learning and one for "predicting", each to be run separately.
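The learn/predict split discussed above suggests a pair of small contracts that a generic BSP driver could run without knowing the concrete algorithm. One possible shape (these names are hypothetical, not an agreed API), with a deliberately trivial implementation just to show the contract in use:

```java
public class LearnerSketch {

  // A generic BSP driver would feed train() observations from the
  // peer's input split (syncing model state between supersteps) and
  // later serve predict(). Hypothetical contract, not an agreed API.
  public interface Learner {
    void train(double[] features, double label);
    double predict(double[] features);
  }

  // Trivial example implementation: ignores the features and
  // predicts the running mean of all labels seen so far.
  public static class MeanLearner implements Learner {
    private double sum;
    private int n;

    @Override
    public void train(double[] features, double label) {
      sum += label;
      n++;
    }

    @Override
    public double predict(double[] features) {
      return n == 0 ? 0.0 : sum / n;
    }
  }
}
```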
This would allow reading the training data from HDFS and creating a model (also on HDFS); the created model could then be read (again from HDFS) in order to predict an output for a new input. Does that make sense? I'm just wondering what a general-purpose design for Hama-based ML would look like, so this is just to start the discussion; any opinion is welcome.

--
Best Regards, Edward J. Yoon
@eddieyoon
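Tommaso's two-job layout can be sketched end to end, with the local filesystem standing in for HDFS: the "training job" fits a model and persists it, and a separate "prediction job" loads the model and scores new inputs. Everything here (job names, the one-parameter model, its text format) is invented for illustration:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class TrainPredictPipelineSketch {

  // "Training job": fit y ~= w*x by least squares over the training
  // data and persist w to the model path (a temp file stands in for
  // an HDFS model directory).
  static void trainJob(double[][] data, Path modelPath) throws Exception {
    double num = 0, den = 0;
    for (double[] p : data) {
      num += p[0] * p[1];
      den += p[0] * p[0];
    }
    Files.writeString(modelPath, Double.toString(num / den));
  }

  // "Prediction job": run separately, load w from the model path and
  // score a new input.
  static double predictJob(Path modelPath, double x) throws Exception {
    double w = Double.parseDouble(Files.readString(modelPath));
    return w * x;
  }

  public static void main(String[] args) throws Exception {
    Path model = Files.createTempFile("model", ".txt");
    trainJob(new double[][] { {1, 3}, {2, 6} }, model);
    System.out.println(predictJob(model, 4.0)); // prints 12.0
  }
}
```

The point of the split is exactly what the thread describes: the two jobs share nothing but the persisted model, so they can be scheduled and scaled independently.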
