Splitting out a math module would be smarter, but let's just keep that in the ML package.

Anyone volunteer to code a simple (mini-) batch gradient descent in BSP?
http://holehouse.org/mlclass/17_Large_Scale_Machine_Learning.html
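To make the ask concrete, here is a rough sketch of what each peer could run: per superstep it computes the gradient over one local mini-batch, broadcasts the partial gradient to all peers, and averages the incoming ones so the replicated parameter stays identical everywhere. The class name, the synthetic data in setup(), and the exact Hama BSP signatures are assumptions for illustration, not committed code.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

/**
 * Sketch of mini-batch gradient descent for a 1-D linear model h(x) = theta * x.
 * Per superstep every peer computes the gradient over one local mini-batch,
 * broadcasts it, then averages the incoming partial gradients, so the
 * replicated theta stays in sync on all peers.
 */
public class MiniBatchGradientDescentBSP extends
    BSP<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> {

  private static final int SUPERSTEPS = 50;  // fixed iteration count for the sketch
  private static final int BATCH_SIZE = 10;
  private static final double ALPHA = 0.5;   // learning rate

  private double[] x, y;  // this peer's partition of the training data
  private double theta;   // model parameter, replicated on every peer
  private int cursor;     // start of the next mini-batch

  @Override
  public void setup(
      BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> peer) {
    // Stand-in for reading the peer's input split: synthetic data y = 2x + noise.
    Random rnd = new Random(peer.getPeerName().hashCode());
    x = new double[1000];
    y = new double[1000];
    for (int i = 0; i < x.length; i++) {
      x[i] = rnd.nextDouble();
      y[i] = 2d * x[i] + rnd.nextGaussian() * 0.01;
    }
  }

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {
    for (int superstep = 0; superstep < SUPERSTEPS; superstep++) {
      // gradient of the squared error over the local mini-batch:
      // (1/b) * sum((theta * x_j - y_j) * x_j)
      double gradient = 0d;
      for (int i = 0; i < BATCH_SIZE; i++) {
        int j = (cursor + i) % x.length;
        gradient += (theta * x[j] - y[j]) * x[j];
      }
      gradient /= BATCH_SIZE;
      cursor = (cursor + BATCH_SIZE) % x.length;

      // all-to-all broadcast of the partial gradient
      for (String other : peer.getAllPeerNames()) {
        peer.send(other, new DoubleWritable(gradient));
      }
      peer.sync();

      // average the incoming gradients and take one descent step
      double sum = 0d;
      int n = 0;
      DoubleWritable msg;
      while ((msg = peer.getCurrentMessage()) != null) {
        sum += msg.get();
        n++;
      }
      theta -= ALPHA * (sum / n);
    }
  }
}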
2012/7/10 Edward J. Yoon <edwardy...@apache.org>
> I would like to move it into the core module so that others can reuse it.
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon

On Tue, Jul 10, 2012 at 7:13 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> I've done the first import, we can start from that now, thanks Thomas.
> Tommaso

2012/7/10 Tommaso Teofili <tommaso.teof...@gmail.com>
> ok, I'll try that, thanks :)
> Tommaso

2012/7/10 Thomas Jungblut <thomas.jungb...@gmail.com>
> I don't know if we need sparse/named vectors for a first cut.
> You can just use the interface and the dense implementations, and remove
> all the uncompilable code in the writables.

2012/7/10 Tommaso Teofili <tommaso.teof...@gmail.com>
> Thomas, while inspecting the code I realized I may need to import
> most/all of the classes inside your math library for the writables to
> compile; is that ok for you, or would you rather I didn't?
> Regards,
> Tommaso

2012/7/10 Thomas Jungblut <thomas.jungb...@gmail.com>
> great, thank you for taking care of it ;)

2012/7/10 Tommaso Teofili <tommaso.teof...@gmail.com>
> Ok, sure, I'll just add the writables along with DoubleMatrix/Vector,
> with the AL2 headers on top.
> Thanks Thomas for the contribution and feedback.
> Tommaso

2012/7/10 Thomas Jungblut <thomas.jungb...@gmail.com>
> Feel free to commit this, but take care to add the Apache license
> headers. Also, I wanted to add a few test cases over the next few
> weekends.

2012/7/10 Tommaso Teofili <tommaso.teof...@gmail.com>
> nice idea; thinking about it quickly, it looks to me that (C)GD is a
> good fit for BSP.
> I was also trying to implement some easy meta-learning algorithm like
> the weighted majority algorithm, where each peer has its own learning
> algorithm and gets penalized for each mistaken prediction.
> Regarding your math library, do you plan to commit it yourself?
> Otherwise I can do it.
> Regards,
> Tommaso
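For reference, a minimal, self-contained sketch of the weighted majority update Tommaso describes: one weight per expert, multiplied by a penalty factor whenever that expert's prediction was wrong, with the ensemble predicting by weighted vote. All names are made up; in a BSP setting each peer would own one expert and the weights could be exchanged via messages.

import java.util.Arrays;

/**
 * Illustrative weighted majority ensemble (names invented for this sketch).
 */
public class WeightedMajority {

  private static final double BETA = 0.5; // penalty factor in (0, 1)

  private final double[] weights; // one weight per expert, initially 1

  public WeightedMajority(int numExperts) {
    weights = new double[numExperts];
    Arrays.fill(weights, 1d);
  }

  /** Weighted vote over the experts' binary predictions. */
  public boolean predict(boolean[] expertPredictions) {
    double yes = 0d, no = 0d;
    for (int i = 0; i < weights.length; i++) {
      if (expertPredictions[i]) {
        yes += weights[i];
      } else {
        no += weights[i];
      }
    }
    return yes >= no;
  }

  /** Penalize every expert whose prediction did not match the true label. */
  public void update(boolean[] expertPredictions, boolean truth) {
    for (int i = 0; i < weights.length; i++) {
      if (expertPredictions[i] != truth) {
        weights[i] *= BETA;
      }
    }
  }
}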
2012/7/10 Thomas Jungblut <thomas.jungb...@gmail.com>
> Maybe a first good step towards algorithms would be to evaluate how we
> can implement some non-linear optimizers in BSP (BFGS or the conjugate
> gradient method).

2012/7/9 Tommaso Teofili <tommaso.teof...@gmail.com>
> 2012/7/9 Thomas Jungblut <thomas.jungb...@gmail.com>
>
> > For the matrix/vector I would propose my library's interface (quite
> > like Mahout's math, but without boundary checks):
> > https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
> > https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java
> >
> > Full Writable for the vector and a basic Writable for the matrix:
> > https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable
> >
> > It is enough to implement all the machine learning algorithms I've
> > seen so far, and the builder pattern allows really nice chaining of
> > commands to easily code equations or translate code from Matlab/Octave.
> > See for example the logistic regression cost function:
> > https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java
>
> very nice, +1!
>
> > For the interfaces of the algorithms:
> > I guess we need to get some more experience; I cannot tell how the
> > interfaces for them should look, mainly because I don't know how the
> > BSP versions will call the algorithm logic.
> >
> > But having stable math interfaces is the key point.
>
> you're right, it's more reasonable to just proceed bottom-up with this,
> as we're going to have a clearer idea while developing the different
> algorithms.
> So for now I'd introduce your library's Writables and then proceed one
> step at a time towards the more common API.
> Thanks,
> Tommaso
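To illustrate the chaining Thomas mentions, here is a toy fluent vector with the Octave logistic-regression cost translated against it. Every name here is invented for the example; the real interface is the linked DoubleVector.

/**
 * Toy fluent vector, only to show the chained style; not the actual library.
 */
final class Vec {
  private final double[] v;

  Vec(double... values) { this.v = values.clone(); }

  /** Element-wise product. */
  Vec multiply(Vec o) {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = v[i] * o.v[i];
    return new Vec(r);
  }

  /** Element-wise sum. */
  Vec add(Vec o) {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = v[i] + o.v[i];
    return new Vec(r);
  }

  /** s - v_i for every component, handy for the (1 - h) terms. */
  Vec subtractFrom(double s) {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = s - v[i];
    return new Vec(r);
  }

  /** Element-wise natural logarithm. */
  Vec log() {
    double[] r = new double[v.length];
    for (int i = 0; i < v.length; i++) r[i] = Math.log(v[i]);
    return new Vec(r);
  }

  double sum() {
    double s = 0d;
    for (double d : v) s += d;
    return s;
  }

  public static void main(String[] args) {
    // Octave: J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h))
    Vec y = new Vec(1, 0, 1);        // labels
    Vec h = new Vec(0.9, 0.2, 0.7);  // hypothesis outputs
    int m = 3;
    double cost = -(y.multiply(h.log())
        .add(y.subtractFrom(1d).multiply(h.subtractFrom(1d).log()))
        .sum()) / m;
    System.out.println(cost);
  }
}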
2012/7/9 Tommaso Teofili <tommaso.teof...@gmail.com>
> Ok, so let's sketch up here what these interfaces should look like.
> Any proposal is more than welcome.
> Regards,
> Tommaso

2012/7/7 Thomas Jungblut <thomas.jungb...@gmail.com>
> Looks fine to me.
> The key is the interfaces for learning and predicting, so we should
> define some vectors and matrices.
> It would be enough to define the algorithms via the interfaces, and a
> generic BSP should just run them based on the given input.

2012/7/7 Tommaso Teofili <tommaso.teof...@gmail.com>
> Hi all,
>
> in my spare time I started writing some basic BSP-based machine
> learning algorithms for our ml module. Now I'm wondering, from a design
> point of view, where it'd make sense to put the training data / model.
> I'd assume the obvious answer is HDFS, which makes me think we should
> come up with (at least) two BSP jobs for each algorithm: one for
> learning and one for "predicting", each to be run separately.
> This would allow reading the training data from HDFS, consequently
> creating a model (also on HDFS); the created model could then be read
> (again from HDFS) in order to predict an output for a new input.
> Does that make sense?
> I'm just wondering what a general-purpose design for Hama-based ML
> stuff would look like, so this is just to start the discussion; any
> opinion is welcome.
>
> Cheers,
> Tommaso
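A minimal driver sketch of the two-job flow Tommaso describes: a training job persists the model on HDFS, then a separate prediction job reads it back. The BSP bodies, the "ml.model.path" key, and the paths are placeholders, and the BSPJob calls follow Hama's examples from memory, so treat this as an assumption-laden sketch rather than a working program.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hama.HamaConfiguration;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPJob;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class TrainThenPredict {

  /** Hypothetical learning BSP: fits a model and writes it to HDFS. */
  public static class TrainingBSP extends
      BSP<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> {
    @Override
    public void bsp(
        BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> peer)
        throws IOException, SyncException, InterruptedException {
      // learn from this peer's input split, then persist the model under
      // peer.getConfiguration().get("ml.model.path")
    }
  }

  /** Hypothetical predicting BSP: loads the model and labels new input. */
  public static class PredictionBSP extends
      BSP<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> {
    @Override
    public void bsp(
        BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, NullWritable> peer)
        throws IOException, SyncException, InterruptedException {
      // read the model from HDFS, predict an output for each new input record
    }
  }

  public static void main(String[] args) throws Exception {
    Path model = new Path("/user/hama/ml/model");

    // Job 1: learn from the training data, write the model to HDFS.
    HamaConfiguration trainConf = new HamaConfiguration();
    trainConf.set("ml.model.path", model.toString());
    BSPJob train = new BSPJob(trainConf, TrainThenPredict.class);
    train.setJobName("model training");
    train.setBspClass(TrainingBSP.class);
    if (!train.waitForCompletion(true)) {
      return; // training failed, don't predict
    }

    // Job 2: read the model back and predict outputs for new inputs.
    HamaConfiguration predictConf = new HamaConfiguration();
    predictConf.set("ml.model.path", model.toString());
    BSPJob predict = new BSPJob(predictConf, TrainThenPredict.class);
    predict.setJobName("prediction");
    predict.setBspClass(PredictionBSP.class);
    predict.waitForCompletion(true);
  }
}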