My concern is that this looks like duplicated efforts with Miklai. I think it needs to be organized.
On Tue, Jul 10, 2012 at 8:26 PM, Thomas Jungblut <[email protected]> wrote:

Splitting out a math module would be smarter, but let's just keep that in the ML package. Anyone volunteer to code a simple (mini-)batch gradient descent in BSP?
http://holehouse.org/mlclass/17_Large_Scale_Machine_Learning.html

2012/7/10 Edward J. Yoon <[email protected]>:

I would like to move the core module so that others can reuse it.

On Tue, Jul 10, 2012 at 7:13 PM, Tommaso Teofili <[email protected]> wrote:

I've done the first import, we can start from that now. Thanks, Thomas.

2012/7/10 Tommaso Teofili <[email protected]>:

Ok, I'll try that, thanks :)

2012/7/10 Thomas Jungblut <[email protected]>:

I don't know if we need sparse/named vectors for the first scratch. You can just use the interface and the dense implementations, and remove all the uncompilable code in the writables.

2012/7/10 Tommaso Teofili <[email protected]>:

Thomas, while inspecting the code I realized I may need to import most/all of the classes inside your math library for the writables to compile. Is that ok with you, or would you rather I didn't?

2012/7/10 Thomas Jungblut <[email protected]>:

Great, thank you for taking care of it ;)

2012/7/10 Tommaso Teofili <[email protected]>:

Ok, sure, I'll just add the writables along with DoubleMatrix/Vector and the AL2 headers on top. Thanks, Thomas, for the contribution and feedback.

2012/7/10 Thomas Jungblut <[email protected]>:

Feel free to commit this, but take care to add the Apache license headers.
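Thomas's call for a (mini-)batch gradient descent maps naturally onto BSP: each peer computes a partial gradient over its shard of the data during a superstep, the partials are combined at the sync barrier, and every peer applies the same averaged update. A minimal plain-Java sketch of that structure, with no Hama dependency (the "peers" are simulated by a loop, and all class and method names here are invented for illustration):

```java
import java.util.Arrays;

public class MiniBatchGdSketch {

  // Partial gradient of the squared error for y = w0 + w1*x over one
  // peer's data shard (each row is {x, y}).
  static double[] partialGradient(double[] w, double[][] shard) {
    double[] g = new double[2];
    for (double[] p : shard) {
      double err = (w[0] + w[1] * p[0]) - p[1];
      g[0] += err;
      g[1] += err * p[0];
    }
    return g;
  }

  // One "superstep" per epoch: every peer contributes its partial
  // gradient, the barrier sums them, and the averaged update is
  // applied identically on all peers.
  public static double[] train(double[][][] shards, int epochs,
                               double alpha, int n) {
    double[] w = new double[2];
    for (int e = 0; e < epochs; e++) {
      double[] sum = new double[2];
      for (double[][] shard : shards) {   // stands in for parallel peers
        double[] g = partialGradient(w, shard);
        sum[0] += g[0];
        sum[1] += g[1];
      }
      w[0] -= alpha * sum[0] / n;         // "sync": same update everywhere
      w[1] -= alpha * sum[1] / n;
    }
    return w;
  }

  public static void main(String[] args) {
    // y = 2x, split across two peers; w1 should approach 2.0
    double[][][] shards = { { {1, 2}, {2, 4} }, { {3, 6}, {4, 8} } };
    System.out.println(Arrays.toString(train(shards, 2000, 0.05, 4)));
  }
}
```

In a real Hama job the inner loop would disappear: each peer would send its partial gradient as a message and read the others' after sync().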
Also, I wanted to add a few testcases over the next few weekends.

2012/7/10 Tommaso Teofili <[email protected]>:

Nice idea; thinking about it quickly, it looks to me that (C)GD is a good fit for BSP. I was also trying to implement some simple meta-learning algorithms, such as the weighted majority algorithm, where each peer has its own learning algorithm and gets penalized for each mistaken prediction. Regarding your math library, do you plan to commit it yourself? Otherwise I can do it.

2012/7/10 Thomas Jungblut <[email protected]>:

Maybe a good first step towards algorithms would be to evaluate how we can implement some non-linear optimizers in BSP (BFGS or the conjugate gradient method).

2012/7/9 Tommaso Teofili <[email protected]>:

2012/7/9 Thomas Jungblut <[email protected]>:

> For the matrix/vector I would propose my library's interface (quite like Mahout's math, but without boundary checks):
>
> https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleVector.java
> https://github.com/thomasjungblut/tjungblut-math/blob/master/src/de/jungblut/math/DoubleMatrix.java
>
> A full Writable for Vector and a basic Writable for Matrix:
>
> https://github.com/thomasjungblut/thomasjungblut-common/tree/master/src/de/jungblut/writable
>
> It is enough to implement all the machine learning algorithms I've seen until now, and the builder pattern allows really nice chaining of commands to easily code equations or translate code from Matlab/Octave.
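The Writables Thomas links presumably follow the standard Hadoop serialization pattern for a dense vector: an int length prefix followed by the raw doubles. A self-contained sketch of that pattern (the class name is made up, and the real class would implement org.apache.hadoop.io.Writable; the Hadoop dependency is left out here so the sketch compiles on its own):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Dense-vector serialization in the usual Hadoop Writable style.
// write() and readFields() carry the same signatures as the
// org.apache.hadoop.io.Writable contract.
public class DenseVectorWritableSketch {
  private double[] values = new double[0];

  public void set(double[] values) { this.values = values; }
  public double[] get() { return values; }

  // Serialize: length prefix, then each component.
  public void write(DataOutput out) throws IOException {
    out.writeInt(values.length);
    for (double v : values)
      out.writeDouble(v);
  }

  // Deserialize: read the length, then fill the array back in.
  public void readFields(DataInput in) throws IOException {
    values = new double[in.readInt()];
    for (int i = 0; i < values.length; i++)
      values[i] = in.readDouble();
  }
}
```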
> See for example the logistic regression cost function:
>
> https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/regression/LogisticRegressionCostFunction.java

Very nice, +1!

> For the interfaces of the algorithms: I guess we need to get some more experience. I cannot tell how the interfaces for them should look, mainly because I don't know how the BSP versions will call the algorithm logic.

You're right, it's more reasonable to just proceed bottom-up with this, as we're going to have a clearer idea while developing the different algorithms. So for now I'd introduce your library's Writables and then proceed one step at a time towards the more common API.

> But having stable math interfaces is the key point.

2012/7/9 Tommaso Teofili <[email protected]>:

Ok, so let's sketch up here what these interfaces should look like. Any proposal is more than welcome.
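The Octave-style chaining Thomas mentions can make the logistic regression cost, J = -1/m * sum(y .* log(h) + (1-y) .* log(1-h)), read almost like the equation itself. A toy chainable vector standing in for the library's DoubleVector (all names here are hypothetical, not taken from the linked code):

```java
import java.util.function.DoubleUnaryOperator;

public class LogisticCostSketch {

  // Minimal chainable vector: only what the cost function needs.
  static final class Vec {
    final double[] v;
    Vec(double... v) { this.v = v; }

    Vec map(DoubleUnaryOperator f) {
      double[] r = new double[v.length];
      for (int i = 0; i < v.length; i++) r[i] = f.applyAsDouble(v[i]);
      return new Vec(r);
    }
    Vec multiply(Vec o) {               // element-wise product
      double[] r = new double[v.length];
      for (int i = 0; i < v.length; i++) r[i] = v[i] * o.v[i];
      return new Vec(r);
    }
    Vec add(Vec o) {
      double[] r = new double[v.length];
      for (int i = 0; i < v.length; i++) r[i] = v[i] + o.v[i];
      return new Vec(r);
    }
    double sum() {
      double s = 0;
      for (double x : v) s += x;
      return s;
    }
  }

  // J = -1/m * sum( y .* log(h) + (1-y) .* log(1-h) ),
  // written as one chained expression, Octave-style.
  static double cost(Vec h, Vec y) {
    int m = y.v.length;
    return -y.multiply(h.map(Math::log))
        .add(y.map(x -> 1 - x).multiply(h.map(x -> Math.log(1 - x))))
        .sum() / m;
  }
}
```

With h = {0.5, 0.5} and y = {1, 0} the cost is log(2), as expected for an uninformative hypothesis.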
2012/7/7 Thomas Jungblut <[email protected]>:

Looks fine to me. The key is the interfaces for learning and predicting, so we should define some vectors and matrices. It would be enough to define the algorithms via the interfaces; a generic BSP job should just run them on the given input.

2012/7/7 Tommaso Teofili <[email protected]>:

Hi all,

in my spare time I started writing some basic BSP-based machine learning algorithms for our ml module. Now I'm wondering, from a design point of view, where it'd make sense to put the training data / model. I'd assume the obvious answer is HDFS, which makes me think we should come up with (at least) two BSP jobs for each algorithm: one for learning and one for "predicting", each to be run separately.
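The learn/predict split discussed above suggests a pair of small contracts that a generic BSP driver could run without knowing the concrete algorithm. One possible shape (these names are hypothetical, not an agreed API), with a deliberately trivial implementation just to show the contract in use:

```java
public class LearnerSketch {

  // A generic BSP driver would feed train() observations from the
  // peer's input split (syncing model state between supersteps) and
  // later serve predict(). Hypothetical contract, not an agreed API.
  public interface Learner {
    void train(double[] features, double label);
    double predict(double[] features);
  }

  // Trivial example implementation: ignores the features and
  // predicts the running mean of all labels seen so far.
  public static class MeanLearner implements Learner {
    private double sum;
    private int n;

    @Override
    public void train(double[] features, double label) {
      sum += label;
      n++;
    }

    @Override
    public double predict(double[] features) {
      return n == 0 ? 0.0 : sum / n;
    }
  }
}
```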
This would allow reading the training data from HDFS and creating a model (also on HDFS); the created model could then be read (again from HDFS) in order to predict an output for a new input. Does that make sense? I'm just wondering what a general-purpose design for Hama-based ML would look like, so this is just to start the discussion; any opinion is welcome.

--
Best Regards, Edward J. Yoon
@eddieyoon
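Tommaso's two-job layout can be sketched end to end, with the local filesystem standing in for HDFS: the "training job" fits a model and persists it, and a separate "prediction job" loads the model and scores new inputs. Everything here (job names, the one-parameter model, its text format) is invented for illustration:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class TrainPredictPipelineSketch {

  // "Training job": fit y ~= w*x by least squares over the training
  // data and persist w to the model path (a temp file stands in for
  // an HDFS model directory).
  static void trainJob(double[][] data, Path modelPath) throws Exception {
    double num = 0, den = 0;
    for (double[] p : data) {
      num += p[0] * p[1];
      den += p[0] * p[0];
    }
    Files.writeString(modelPath, Double.toString(num / den));
  }

  // "Prediction job": run separately, load w from the model path and
  // score a new input.
  static double predictJob(Path modelPath, double x) throws Exception {
    double w = Double.parseDouble(Files.readString(modelPath));
    return w * x;
  }

  public static void main(String[] args) throws Exception {
    Path model = Files.createTempFile("model", ".txt");
    trainJob(new double[][] { {1, 3}, {2, 6} }, model);
    System.out.println(predictJob(model, 4.0)); // prints 12.0
  }
}
```

The point of the split is exactly what the thread describes: the two jobs share nothing but the persisted model, so they can be scheduled and scaled independently.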
