I was thinking so too. Most ML frameworks are at least loosely based on the Sklearn paradigm. For those not familiar, at a very abstract level:
    model1 = new Algo            // e.g. K-Means, Random Forest, Neural Net
    model1.fit(trainingData)

    // then, depending on the goal of the algorithm, you have either (or both)
    preds = model1.predict(testData)
    // which returns a vector of predictions, one per observation in the testing data

    // or sometimes
    newVals = model1.transform(testData)
    // which returns a new dataset-like object; this makes more sense for things
    // like neural nets, or when you're not just predicting a single value per
    // observation

In addition to the above, pre-processing operations also have a transform method, such as:

    preprocess1 = new Normalizer
    preprocess1.fit(trainingData)
    // in this phase, calculates the mean and variance of the training data set
    preprocessedTrainingData = preprocess1.transform(trainingData)
    preprocessedTestingData = preprocess1.transform(testingData)

I think this is a reasonable approach because A) it makes sense and B) it is a standard of sorts across ML libraries (because of A).

We have two high-level bucket types, based on what the output is: Predictors and Transformers.

Predictors: anything that returns a single value per observation; this is classifiers and regressors.

Transformers: anything that returns a vector per observation:
- Pre-processing operations
- Classifiers, in that there is usually a probability vector for each observation as to which class it belongs to; the 'predict' method then just picks the most likely class
- Neural nets (though with one small tweak these can be extended to regression or classification)
- Any unsupervised learning application (e.g. clustering)
- etc.

And so really we have something like:

    class LearningFunction:
        def fit()

    class Transformer extends LearningFunction:
        def transform()

    class Predictor extends Transformer:
        def predict()

This paradigm also lends itself nicely to pipelines...
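To make the hierarchy concrete, here is a rough Scala sketch of what those traits could look like. All names and signatures here are hypothetical, not an actual Mahout API; data is modeled as plain in-memory Vector[Vector[Double]] just to illustrate the fit/transform/predict contract.

    // Sketch only: hypothetical names, not a real Mahout API.
    // "Data" is a Vector of rows, each row a Vector of feature values.

    trait LearningFunction {
      def fit(training: Vector[Vector[Double]]): this.type
    }

    trait Transformer extends LearningFunction {
      def transform(data: Vector[Vector[Double]]): Vector[Vector[Double]]
    }

    trait Predictor extends Transformer {
      def predict(data: Vector[Vector[Double]]): Vector[Double]
    }

    // A standardizing pre-processor: fit() learns per-column means and
    // standard deviations from the training data, transform() centers
    // and scales any dataset using those learned statistics.
    class Normalizer extends Transformer {
      private var means: Vector[Double] = Vector.empty
      private var stds: Vector[Double] = Vector.empty

      def fit(training: Vector[Vector[Double]]): this.type = {
        val n = training.length.toDouble
        val cols = training.transpose
        means = cols.map(col => col.sum / n)
        stds = cols.zip(means).map { case (col, m) =>
          math.sqrt(col.map(x => (x - m) * (x - m)).sum / n)
        }
        this
      }

      def transform(data: Vector[Vector[Double]]): Vector[Vector[Double]] =
        data.map(row => row.zip(means).zip(stds).map { case ((x, m), s) =>
          if (s == 0.0) 0.0 else (x - m) / s
        })
    }

    // Trivial stand-in model, just to show the Predictor contract:
    // "predicts" the mean of each (already transformed) row.
    class RowMeanModel extends Predictor {
      def fit(training: Vector[Vector[Double]]): this.type = this
      def transform(data: Vector[Vector[Double]]): Vector[Vector[Double]] = data
      def predict(data: Vector[Vector[Double]]): Vector[Double] =
        data.map(row => row.sum / row.length)
    }

Note that because Predictor extends Transformer, a classifier can expose its per-class probability vector via transform() while predict() picks the most likely class, exactly as described above.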
    pipeline1 = new Pipeline
        .add(transformer1)
        .add(transformer2)
        .add(model1)

    pipeline1.fit(trainingData)
    pipeline1.predict(testingData)

I have to read up on recommenders a bit more to figure out how those play in, or whether we need another class.

In addition to that, I think we would have an optimizers section that allows for the various flavors of SGD, but also allows other types of optimizers altogether.

Again, just moving the conversation forward a bit here. Excited to get to work on this.

Best,
tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things." -Virgil*

On Thu, Jul 21, 2016 at 7:13 AM, Sebastian <s...@apache.org> wrote:
> Hi Andrew,
>
> I think this topic is broader than just defining a few traits. A popular
> way of integrating ML algorithms is via the combination of dataframes and
> pipelines, similar to what scipy and SparkML are offering at the moment.
> Maybe it could make sense to integrate with what they have instead of
> starting our own efforts?
>
> Best,
> Sebastian
>
> On 21.07.2016 04:35, Andrew Palumbo wrote:
>> Hi All,
>>
>> I'd like to draw your attention to MAHOUT-1856:
>> https://issues.apache.org/jira/browse/MAHOUT-1856
>>
>> This is a discussion that has popped up several times over the last
>> couple of years. As we move towards building out our algorithm library, it
>> would be great to nail this down now.
>>
>> Most importantly, so that we can no longer be criticized as "a loose bag
>> of algorithms", as we've sometimes been in the past.
>>
>> The main point being: it would be good to lay out common traits for
>> Classification, Clustering, and Optimization algorithms.
>>
>> This is just a start. I created this issue a few months back, and
>> intentionally left off Recommender, because I was unsure if there were
>> common traits across them.
>> By traits, I am referring to both the
>> literal meaning and, more specifically, actual Scala traits.
>>
>> @pat, @tdunning, @ssc, could you give your thoughts on this?
>>
>> As well, it would be good to add online flavors of different algorithm
>> classes into the mix.
>>
>> @tdunning could you share some thoughts here?
>>
>> Trevor Grant will be heading up this effort, and it would be great if we
>> all as a team could come up with abstract design plans for each class of
>> algorithm (as well as determine the current "classes of algorithms", as
>> each of us has our own unique blend of specializations), and could give our
>> thoughts on this.
>>
>> Currently this is really the opening of the conversation.
>>
>> It would be best to post thoughts on:
>> https://issues.apache.org/jira/browse/MAHOUT-1856
>>
>> Any feedback is welcomed.
>>
>> Thanks,
>>
>> Andy