Awesome - sounds like we've got a good plan going. A pipeline style sounds good
to me.


So, as I understand from reading today's discussions (I've been kind of
following along all day), we really need to figure out an optimal way of keeping
Mahout both (primarily) a "roll your own math/algos" platform and a library
(both in perception and in reality). And a library into which one could plug
their own previously "rolled" math/algos.


Re: DataFrames, we should be able to set these pipelines up in a way that is
abstract enough, i.e., at the math-scala level, so that engine-agnostic
pipelines can be run without DataFrame (or other Spark) dependencies, and
then drop down into the Spark module to add DataFrame, etc. capabilities,
correct?


So we'd have, e.g., o.a.m.library in both, with Spark-specific algos in the
spark module and engine-agnostic ones in math-scala - the same as we do with
everything else.
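As a rough sketch of that split (all names here, including the `Fitter`/`Model` traits and the mean fitter, are hypothetical illustrations, not an agreed-upon Mahout API): the engine-agnostic contract lives in math-scala with no Spark types, and an engine-specific module simply implements the same traits.

```scala
// math-scala side: engine-agnostic contracts, no Spark types anywhere.
trait Model[K] {
  def predict(features: Map[K, Double]): Double
}

trait Fitter[K] {
  def fit(input: Seq[Map[K, Double]], targets: Seq[Double]): Model[K]
}

// engine-module side: an implementation is free to use DataFrames or
// MLlib internally, but only the trait shows through. Here, a trivial
// mean-predictor stands in for a real algorithm.
class MeanModel[K](mean: Double) extends Model[K] {
  def predict(features: Map[K, Double]): Double = mean
}

class MeanFitter[K] extends Fitter[K] {
  def fit(input: Seq[Map[K, Double]], targets: Seq[Double]): Model[K] =
    new MeanModel[K](targets.sum / targets.size)
}
```

Callers then program against `Fitter`/`Model` only, and swapping the Spark implementation for another engine is a packaging decision, not an API change.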


I guess that's basically what Sebastian was suggesting earlier in the thread.


We can also make use of certain MLlib algos in the Spark module, with
conversions to/from the DRM format, and further push the fact that we are a
complement to MLlib rather than competition.


Sorry if I'm just repeating what you guys have hashed out today.


+1 to hyperparameter search that may include feature extraction.

________________________________
From: Dmitriy Lyubimov <[email protected]>
Sent: Thursday, July 21, 2016 4:47:26 PM
To: [email protected]
Cc: Sebastian Schelter
Subject: Re: Traits for a mahout algorithm Library.

On Thu, Jul 21, 2016 at 12:35 PM, Trevor Grant <[email protected]>
wrote:

>
>
> Finally, re data-frames.  Why not leave it as vectors and matrices?
>

Short answer: because (imo) data frames are not vectors and matrices.

Longer argumentation:

Some capabilities expected of data frames are as follows.

DFs are columnar tables where columns are either named vectors or named
factors (in the R sense).

Also, operationally, DFs usually lean more toward providing relational
algebra capabilities (joins, etc.) than numerical algebra (BLAS-3).

A factor (or, perhaps a better term, a categorical feature) is
fundamentally non-numerical data. It is a representation of categorical
data that can be bounded or unbounded in its number of categories.

Furthermore, there is more than one way to vectorize a factor or a group
of factors, which is what formulas and similar constructs are for.
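For illustration, here are two common ways to vectorize a single factor, sketched in plain Scala (the function names are illustrative, not a Mahout API): explicit one-hot encoding for a factor with a bounded, known set of levels, and the hashing trick for an unbounded one.

```scala
// One-hot encoding: requires knowing the full (bounded) set of levels
// up front; the output dimensionality equals the number of levels.
def oneHot(levels: IndexedSeq[String])(value: String): Vector[Double] = {
  val v = Array.fill(levels.size)(0.0)
  val i = levels.indexOf(value)
  if (i >= 0) v(i) = 1.0
  v.toVector
}

// Hashing trick: fixed dimensionality chosen in advance, so it handles
// an unbounded number of categories, at the cost of possible collisions.
def hashed(dim: Int)(value: String): Vector[Double] = {
  val v = Array.fill(dim)(0.0)
  val i = math.floorMod(value.hashCode, dim)
  v(i) = 1.0
  v.toVector
}
```

The choice between the two (and the hash dimensionality itself) is exactly the kind of vectorization decision that ends up being a searchable parameter rather than a fixed preprocessing step.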

Now, you might view all these formulas, factors, and hash tricks as
feature-preparation activity and say that the learning process is not bothered
by them - in the end, every fitting essentially works on numerical input.

Unfortunately, that may not be quite true.

Model search (step-wise GLM, for example) is not necessarily a
numerical-only thing, since it essentially manages factor vectorization.

That said, I think we can safely say that an individual learner can be a
numerical-only thing. But as soon as we go up the chain to transformations,
vectorizations, and searching for the parameters of vectorizations, data frames
are usually the input sources for all of those.

An excellent example of this (one that failed to get properly architected, due
to these concerns, in that other OSS project) is the implicit feedback
recommender.

In fact, there are two problems here: one is parameterized feature
extraction and the other is fitting the decomposition.

Each of these problems has its own parameters. In the vanilla paper
implementation there were two suggested ways of feature extraction, offering
one parameter each, which were then to be searched via CV along with the
fitter's hyperparameters (learning rate, regularization).
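Assuming the vanilla paper in question is the implicit-feedback one (Hu, Koren, Volinsky, "Collaborative Filtering for Implicit Feedback Datasets"), its two confidence-weighting schemes each expose a single feature-extraction parameter, which a sketch makes concrete:

```scala
// Linear confidence: c = 1 + alpha * r, with alpha as the single
// extraction parameter applied to the raw preference count r.
def linearConfidence(alpha: Double)(r: Double): Double =
  1.0 + alpha * r

// Log confidence: c = 1 + alpha * log(1 + r / eps); here eps is usually
// held fixed, leaving alpha as the searched extraction parameter.
def logConfidence(alpha: Double, eps: Double)(r: Double): Double =
  1.0 + alpha * math.log(1.0 + r / eps)
```

Either way, alpha sits outside the numerical fitter, yet it has to be cross-validated jointly with the fitter's own hyperparameters.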

What this means is that hyperparameter search may overarch feature extraction
_and_ fitting, and in the most general case may essentially require a data
frame as input (I have run into such a practical case before).
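A minimal sketch of such an overarching search, in plain Scala with toy stand-ins (the `extract`, `fit`, and `loss` functions are all hypothetical placeholders, not a real pipeline): the grid crosses an extraction parameter (alpha) with a fitter parameter (lambda), so neither stage can be tuned in isolation.

```scala
// Toy feature extraction with one parameter, standing in for a
// confidence/vectorization scheme.
def extract(alpha: Double)(raw: Double): Double = 1.0 + alpha * raw

// Toy ridge-style fitter for a single scale factor w in y ~ w * x;
// lambda is the fitter's own regularization hyperparameter.
def fit(xs: Seq[Double], ys: Seq[Double], lambda: Double): Double = {
  val num = xs.zip(ys).map { case (x, y) => x * y }.sum
  val den = xs.map(x => x * x).sum + lambda
  num / den
}

def loss(w: Double, xs: Seq[Double], ys: Seq[Double]): Double =
  xs.zip(ys).map { case (x, y) => math.pow(y - w * x, 2) }.sum

// The search space is the cross product of extraction and fitter
// parameters; each candidate re-runs extraction before fitting.
def search(raw: Seq[Double], ys: Seq[Double],
           alphas: Seq[Double], lambdas: Seq[Double]): (Double, Double) = {
  val scored = for (a <- alphas; l <- lambdas) yield {
    val xs = raw.map(extract(a))
    val w = fit(xs, ys, l)
    ((a, l), loss(w, xs, ys))
  }
  scored.minBy(_._2)._1
}
```

In a real setting the `raw` input would be the data frame (factors included), which is exactly why the search loop, unlike the individual learner, cannot be numerical-only.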

Finally, some goodness-of-fit metrics work on pre-vectorized factors.

This is all standard, but unfortunately it is all pretty expensive to do. I
have a big problem with discarding the notion of dataframe support as part of
the fitting/search process for some areas of computational statistics.
