I was thinking so too.  Most ML frameworks are at least loosely based on the
sklearn paradigm.  For those not familiar, at a very abstract level:

model1 = new Algo  // e.g. k-means, random forest, neural net

model1.fit(trainingData)

// then, depending on the goal of the algorithm, you have either (or both)
preds = model1.predict(testData)  // returns a vector of predictions, one
                                  // for each observation in the test data

// or sometimes
newVals = model1.transform(testData)  // returns a new dataset-like object;
// this makes more sense for things like neural nets, or when you're not
// just predicting a single value per observation
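To make the contract concrete, here is a minimal runnable sketch of the fit/predict pattern. The class name MeanRegressor and the data layout (list of (features, target) pairs) are my own illustrative assumptions, not from any particular library:

```python
# Hypothetical MeanRegressor: fit() learns state from the training data,
# predict() returns one value per observation.
class MeanRegressor:
    def fit(self, training_data):
        # training_data: list of (features, target) pairs
        targets = [y for _, y in training_data]
        self.mean_ = sum(targets) / len(targets)
        return self

    def predict(self, test_data):
        # one prediction per observation
        return [self.mean_ for _ in test_data]

model1 = MeanRegressor()
model1.fit([([1.0], 2.0), ([2.0], 4.0)])
preds = model1.predict([[3.0], [4.0]])  # -> [3.0, 3.0]
```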


In addition to the above, pre-processing operations also have a transform
method, such as:

preprocess1 = new Normalizer

preprocess1.fit(trainingData)  // in this phase, calculates the mean and
                               // variance of the training data set

preprocessedTrainingData = preprocess1.transform(trainingData)
preprocessedTestingData  = preprocess1.transform(testingData)
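A runnable sketch of that preprocessing contract (the class name Standardizer and the data layout are assumptions for illustration). The key point is that fit() computes statistics on the training set only, and transform() applies those same statistics to any dataset, including the test set:

```python
# Hypothetical Standardizer: fit() computes per-feature mean and standard
# deviation on the training data; transform() scales any dataset with the
# statistics learned from the training data.
class Standardizer:
    def fit(self, data):
        n = len(data)
        dims = len(data[0])
        self.means_ = [sum(row[j] for row in data) / n for j in range(dims)]
        self.stds_ = [
            (sum((row[j] - self.means_[j]) ** 2 for row in data) / n) ** 0.5
            for j in range(dims)
        ]
        return self

    def transform(self, data):
        return [
            [(x - m) / s if s else 0.0
             for x, m, s in zip(row, self.means_, self.stds_)]
            for row in data
        ]

train = [[1.0], [3.0]]            # mean 2.0, std 1.0
scaler = Standardizer().fit(train)
print(scaler.transform(train))    # [[-1.0], [1.0]]
print(scaler.transform([[2.0]]))  # [[0.0]] -- uses training statistics
```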

I think this is a reasonable approach because A) it makes sense and B) it is
a standard of sorts across ML libraries (because of A)

We have two high level bucket types, based on what the output is:

Predictors and Transformers

Predictors: anything that returns a single value per observation; this
covers classifiers and regressors

Transformers: anything that returns a vector per observation
- Pre-processing operations
- Classifiers, in that there is usually a probability vector for each
observation indicating which class it belongs to; the 'predict' method then
just picks the most likely class
- Neural nets (though with one small tweak these can be extended to
regression or classification)
- Any unsupervised learning application (e.g. clustering)
- etc.
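The classifier point above can be sketched in a few lines (hypothetical code, no particular library): transform() yields a probability vector per observation, and predict() just takes the argmax of each vector:

```python
# Given one probability vector per observation, pick the index of the
# most likely class for each.
def predict_from_probabilities(prob_rows):
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]

probs = [[0.1, 0.7, 0.2],
         [0.6, 0.3, 0.1]]
print(predict_from_probabilities(probs))  # [1, 0]
```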

And so really we have something like:

class LearningFunction:
  def fit()

class Transformer extends LearningFunction:
  def transform()

class Predictor extends Transformer:
  def predict()
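As a sketch of how that hierarchy might look (using Python abstract base classes here purely for illustration; the eventual implementation would presumably be Scala traits, and the method signatures are my assumptions):

```python
from abc import ABC, abstractmethod

class LearningFunction(ABC):
    @abstractmethod
    def fit(self, training_data):
        ...

class Transformer(LearningFunction):
    @abstractmethod
    def transform(self, data):
        ...

class Predictor(Transformer):
    # A Predictor is also a Transformer: predict() can be defined as a
    # reduction (here, argmax) of the per-observation vector that
    # transform() returns.
    def predict(self, data):
        return [max(range(len(row)), key=row.__getitem__)
                for row in self.transform(data)]
```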


This paradigm also lends itself nicely to pipelines...

pipeline1 = new Pipeline
                   .add( transformer1 )
                   .add( transformer2 )
                   .add( model1 )

pipeline1.fit( trainingData )
pipeline1.predict( testingData )
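A minimal Pipeline sketch under the same assumptions (hypothetical API, not a real library): fit() fits each stage in order, transforming the data before passing it on, and predict() runs the data through every transformer before calling predict() on the final stage:

```python
class Pipeline:
    def __init__(self):
        self.stages = []

    def add(self, stage):
        self.stages.append(stage)
        return self  # allow chained .add() calls as in the sketch above

    def fit(self, data):
        # fit and transform every intermediate stage, then fit the last
        for stage in self.stages[:-1]:
            data = stage.fit(data).transform(data)
        self.stages[-1].fit(data)
        return self

    def predict(self, data):
        for stage in self.stages[:-1]:
            data = stage.transform(data)
        return self.stages[-1].predict(data)
```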

I have to read up on recommenders a bit more to figure out how those play
in, or if we need another class.

In addition to that, I think we would have an optimizers section that
allows for the various flavors of SGD, but also allows other types of
optimizers altogether.

Again, just moving the conversation forward a bit here.

Excited to get to work on this

Best,

tg






Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Thu, Jul 21, 2016 at 7:13 AM, Sebastian <s...@apache.org> wrote:

> Hi Andrew,
>
> I think this topic is broader than just defining a few traits. A popular
> way of integrating ML algorithms is via the combination of dataframes and
> pipelines, similar to what scipy and SparkML are offering at the moment.
> Maybe it could make sense to integrate with what they have instead of
> starting our own efforts?
>
> Best,
> Sebastian
>
>
>
> On 21.07.2016 04:35, Andrew Palumbo wrote:
>
>> Hi All,
>>
>>
>> I'd like to draw your attention to MAHOUT-1856:
>> https://issues.apache.org/jira/browse/MAHOUT-1856
>>
>>
>> This is a discussion that has popped up several times over the last
>> couple of years. as we move towards building out our algorithm library, It
>> would be great  to nail this down now.
>>
>>
>> Most Importantly to not be able to be criticized as "a loose bag of
>> algorithms" as we've sometimes been in the past.
>>
>>
>> The main point being It would be good to lay out  common traits for
>> Classification, Clustering, and Optimization algorithms.
>>
>>
>> This is just a start. I created this issue a few months back, and
>> intentionally left off Recommender, because I was unsure if there were
>> common traits across them.  By traits, I am referring to both the
>> literal meaning and more specifically, actual Scala traits.
>>
>>
>> @pat, @tdunning, @ssc, could you give your thoughts on this?
>>
>>
>> As well, it would be good to add online flavors of different algorithm
>> classes into the mix.
>>
>>
>> @tdunning could you share some thoughts here?
>>
>>
>> Trevor Grant will be heading up this effort, and It would be great if we
>> all as a team could come up with abstract design plans for each class of
>> algorithm (as well as to determine the current "classes of algorithms", as
>> each of us has our own unique blend of specializations.  And could give our
>> thoughts on this.
>>
>>
>> Currently this is really the opening of the conversation.
>>
>>
>> It would be best to post thoughts on:
>> https://issues.apache.org/jira/browse/MAHOUT-1856
>>
>>
>> Any feedback is welcomed.
>>
>>
>> Thanks,
>>
>>
>> Andy
>>
>>
>>
>>
