I was thinking so too. Most ML frameworks are at least loosely based on the
sklearn paradigm. For those not familiar, at a very abstract level:
model1 = new Algo // e.g. K-Means, Random Forest, Neural Net
model1.fit( trainingData )

// then, depending on the goal of the algorithm, you have either (or both):
preds = model1.predict( testData ) // returns a vector of predictions, one
// for each observation in the testing data

// or sometimes
newVals = model1.transform( testData ) // returns a new dataset-like object;
// this makes more sense for things like neural nets, or when you're not
// just predicting a single value per observation
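To make the abstract pattern concrete, here is a tiny Scala sketch; MeanRegressor is a hypothetical toy, not a Mahout class, and the Array-based types are just for illustration:

```scala
// Hypothetical fit/predict example: a "regressor" that learns the mean of
// the training targets and predicts that mean for every test observation.
class MeanRegressor {
  private var mean = 0.0

  // fit: learn state from the training data, return this for chaining
  def fit(targets: Array[Double]): this.type = {
    mean = targets.sum / targets.length
    this
  }

  // predict: one value per observation in the test data
  def predict(testData: Array[Double]): Array[Double] =
    testData.map(_ => mean)
}
```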
In addition to the above, pre-processing operations also have a transform
method, such as:

preprocess1 = new Normalizer
preprocess1.fit( trainingData ) // in this phase, calculates the mean and
// variance of the training data set
preprocessedTrainingData = preprocess1.transform( trainingData )
preprocessedTestingData = preprocess1.transform( testingData )
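A minimal Scala sketch of that fit/transform contract, assuming a hypothetical Standardizer (not an existing Mahout API) and plain arrays for data:

```scala
// Hypothetical pre-processor: fit learns statistics from the training data
// only; transform applies those same statistics to any dataset.
class Standardizer {
  private var mean = 0.0
  private var std = 1.0

  def fit(data: Array[Double]): this.type = {
    mean = data.sum / data.length
    val variance = data.map(x => math.pow(x - mean, 2)).sum / data.length
    std = math.sqrt(variance)
    this
  }

  // note: the test set is transformed with the *training* mean/variance
  def transform(data: Array[Double]): Array[Double] =
    data.map(x => (x - mean) / std)
}
```

The key design point is that no statistics are ever computed from the test set, which is exactly why fit and transform are separate calls.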
I think this is a reasonable approach because (a) it makes sense and (b) it
is a standard of sorts across ML libraries (because of (a)).
We have two high-level bucket types, based on what the output is:
Predictors and Transformers.

Predictors: anything that returns a single value per observation; this
covers classifiers and regressors.

Transformers: anything that returns a vector per observation:
- Pre-processing operations
- Classifiers, in that there is usually a probability vector for each
observation as to which class it belongs to; the 'predict' method then
just picks the most likely class
- Neural nets (though with one small tweak these can be extended to
regression or classification)
- Any unsupervised learning application (e.g. clustering)
- etc.
And so really we have something like:

class LearningFunction:
    def fit()

class Transformer extends LearningFunction:
    def transform()

class Predictor extends Transformer:
    def predict()
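Since the thread is explicitly about Scala traits, that hierarchy might look something like the sketch below; all names and signatures are assumptions for discussion, with a type parameter standing in for whatever dataset abstraction Mahout would actually use:

```scala
// Hypothetical trait hierarchy; D stands in for the dataset type.
trait LearningFunction[D] {
  // learn state from training data; return this so calls can be chained
  def fit(trainingData: D): this.type
}

// A Transformer returns a new dataset-like object
trait Transformer[D] extends LearningFunction[D] {
  def transform(data: D): D
}

// A Predictor additionally collapses each observation to a single value
trait Predictor[D] extends Transformer[D] {
  def predict(data: D): Array[Double]
}
```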
This paradigm also lends itself nicely to pipelines...

pipeline1 = new Pipeline
    .add( transformer1 )
    .add( transformer2 )
    .add( model1 )
pipeline1.fit( trainingData )
pipeline1.predict( testingData )
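A rough Scala sketch of how such a Pipeline could work, with hypothetical Stage/Pipeline names and array-typed stages purely for illustration:

```scala
// A stage is anything with fit and transform
trait Stage {
  def fit(data: Array[Double]): this.type
  def transform(data: Array[Double]): Array[Double]
}

// Example stateless stage that just scales every value
class Scale(factor: Double) extends Stage {
  def fit(data: Array[Double]): this.type = this
  def transform(data: Array[Double]): Array[Double] = data.map(_ * factor)
}

class Pipeline {
  private var stages = Vector.empty[Stage]

  def add(stage: Stage): Pipeline = { stages = stages :+ stage; this }

  // fit each stage on the output of the previous one (sklearn-style)
  def fit(data: Array[Double]): this.type = {
    stages.foldLeft(data) { (d, s) => s.fit(d); s.transform(d) }
    this
  }

  def transform(data: Array[Double]): Array[Double] =
    stages.foldLeft(data) { (d, s) => s.transform(d) }
}
```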
I have to read up on recommenders a bit more to figure out how those play
in, or if we need another class.
In addition to that, I think we would have an optimizers section that
allows for the various flavors of SGD, but also allows other types of
optimizers altogether.
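One way that optimizers section could be shaped, sketched in Scala; the Optimizer trait and both implementations are hypothetical, not an existing Mahout API:

```scala
// An optimizer produces new parameters from the current ones and a gradient
trait Optimizer {
  def step(params: Array[Double], grad: Array[Double]): Array[Double]
}

// Plain SGD: move each parameter against its gradient
class Sgd(lr: Double) extends Optimizer {
  def step(params: Array[Double], grad: Array[Double]): Array[Double] =
    params.zip(grad).map { case (p, g) => p - lr * g }
}

// An SGD flavor with internal state (momentum), showing why the trait
// cannot assume optimizers are stateless
class Momentum(lr: Double, mu: Double) extends Optimizer {
  private var velocity: Array[Double] = Array.empty
  def step(params: Array[Double], grad: Array[Double]): Array[Double] = {
    if (velocity.isEmpty) velocity = Array.fill(params.length)(0.0)
    velocity = velocity.zip(grad).map { case (v, g) => mu * v - lr * g }
    params.zip(velocity).map { case (p, v) => p + v }
  }
}
```

Anything that only needs a step contract like this (L-BFGS, Adam, etc.) could then plug into the same slot.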
Again, just moving the conversation forward a bit here.
Excited to get to work on this
Best,
tg
Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org
*"Fortunate is he, who is able to know the causes of things." -Virgil*
On Thu, Jul 21, 2016 at 7:13 AM, Sebastian <[email protected]> wrote:
> Hi Andrew,
>
> I think this topic is broader than just defining a few traits. A popular
> way of integrating ML algorithms is via the combination of dataframes and
> pipelines, similar to what scipy and SparkML are offering at the moment.
> Maybe it could make sense to integrate with what they have instead of
> starting our own efforts?
>
> Best,
> Sebastian
>
>
>
> On 21.07.2016 04:35, Andrew Palumbo wrote:
>
>> Hi All,
>>
>>
>> I'd like to draw your attention to MAHOUT-1856:
>> https://issues.apache.org/jira/browse/MAHOUT-1856
>>
>>
>> This is a discussion that has popped up several times over the last
>> couple of years. As we move towards building out our algorithm library,
>> it would be great to nail this down now.
>>
>>
>> Most importantly, so that we can no longer be criticized as "a loose bag
>> of algorithms," as we sometimes have been in the past.
>>
>>
>> The main point being that it would be good to lay out common traits for
>> Classification, Clustering, and Optimization algorithms.
>>
>>
>> This is just a start. I created this issue a few months back, and
>> intentionally left off Recommenders, because I was unsure if there were
>> common traits across them. By traits, I am referring to both the literal
>> meaning and, more specifically, actual Scala traits.
>>
>>
>> @pat, @tdunning, @ssc, could you give your thoughts on this?
>>
>>
>> As well, it would be good to add online flavors of different algorithm
>> classes into the mix.
>>
>>
>> @tdunning could you share some thoughts here?
>>
>>
>> Trevor Grant will be heading up this effort, and it would be great if we
>> all as a team could come up with abstract design plans for each class of
>> algorithm (as well as determine the current "classes of algorithms"),
>> since each of us has our own unique blend of specializations and could
>> give our thoughts on this.
>>
>>
>> Currently this is really the opening of the conversation.
>>
>>
>> It would be best to post thoughts on:
>> https://issues.apache.org/jira/browse/MAHOUT-1856
>>
>>
>> Any feedback is welcomed.
>>
>>
>> Thanks,
>>
>>
>> Andy