Re: Traits for a mahout algorithm Library.

Dmitriy Lyubimov Thu, 21 Jul 2016 13:48:04 -0700

On Thu, Jul 21, 2016 at 12:35 PM, Trevor Grant <[email protected]>
wrote:


>
>
> Finally, re data-frames.  Why not leave it as vectors and matrices?
>

Short answer: because (imo) data frames are not vectors and matrices.

Longer argument:

Some capabilities expected of data frames are as follows.

DFs are columnar tables where columns are either named vectors or named
factors (in the R sense).

Also, operationally, DFs usually lean more toward providing relational
algebra capabilities (joins etc.) than numerical algebra (blas3).
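To make the "relational rather than numerical" point concrete, here is a minimal sketch of an inner join over two columnar frames. Everything in it is an assumption for illustration only: a frame is modeled as a plain dict from column name to list, and `inner_join`, `users` and `plays` are hypothetical names, not any Mahout API.

```python
def inner_join(left, right, key):
    """Inner-join two columnar frames (dict: column name -> list) on `key`.

    Assumes non-key column names do not collide between the two frames.
    """
    # Index the right-hand row positions by key value.
    index = {}
    for j, k in enumerate(right[key]):
        index.setdefault(k, []).append(j)

    # Output columns: all left columns, then right columns except the key.
    out = {name: [] for name in left}
    out.update({name: [] for name in right if name != key})

    # Emit one output row per matching (left row, right row) pair.
    for i, k in enumerate(left[key]):
        for j in index.get(k, []):
            for name in left:
                out[name].append(left[name][i])
            for name in right:
                if name != key:
                    out[name].append(right[name][j])
    return out


# A factor column ("country") and a numeric column ("plays") end up side by side.
users = {"user": ["u1", "u2", "u3"], "country": ["US", "DE", "US"]}
plays = {"user": ["u1", "u1", "u3"], "plays": [3.0, 1.0, 7.0]}
joined = inner_join(users, plays, "user")
```

Note that nothing here is a matrix operation: the join works on keys and mixed column types, which is the kind of work a plain matrix abstraction does not cover.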

A factor (or, perhaps a better term, a categorical feature) is
fundamentally non-numerical data. It is a representation of categorical
data that may be bounded or unbounded in the number of categories.

Furthermore, there is more than one way to vectorize a factor or a group
of factors, which is what formulas and similar tools are for.
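As a rough illustration of "more than one way to vectorize a factor", the sketch below encodes the same categorical column twice: once as one-hot columns (natural for a bounded factor) and once with the hash trick (usable when the number of categories is unbounded). The function names, the sample data and the choice of CRC32 as the hash are assumptions made for this example.

```python
import zlib


def one_hot(values):
    """Bounded factor: one column per observed level (sorted for a stable order)."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = [[1.0 if index[v] == j else 0.0 for j in range(len(levels))]
            for v in values]
    return levels, rows


def hash_trick(values, n_buckets):
    """Unbounded factor: hash each level into a fixed number of columns."""
    rows = []
    for v in values:
        row = [0.0] * n_buckets
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        row[zlib.crc32(v.encode("utf-8")) % n_buckets] = 1.0
        rows.append(row)
    return rows


colors = ["red", "green", "red", "blue"]
levels, dense = one_hot(colors)           # width depends on the data
hashed = hash_trick(colors, n_buckets=8)  # width is a parameter
```

The width of the one-hot encoding is determined by the data, while the width of the hashed encoding is a parameter someone has to choose. That parameter belongs to the vectorization, not to the learner.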

Now you might view all these formulas, factors and hash tricks as a feature
preparation activity and say that the learning process is not bothered by it.
In the end, every fit essentially works on numerical input.

Unfortunately, that may not be quite true.

Model search (step-wise GLM, for example) is not necessarily a
numerical-only thing, since it essentially manages factor vectorization.

That said, i think we can safely say that an individual learner could be a
numerical-only thing. But as soon as we go up the chain to transformations,
vectorizations and searching for parameters of vectorizations, dataframes
are usually the input source for all of those.

An excellent example of this (one that another OSS project failed to
architect with a proper separation of concerns) is the implicit feedback
recommender.

In fact, there are two problems here -- one is parameterized feature
extraction and the other is fitting the decomposition.

Each of the problems has its own parameters. In the vanilla paper
implementation there were two suggested ways of feature extraction, each
offering one parameter, and those parameters were suggested to be searched
for via CV along with the fitter hyperparameters (learning rate,
regularization).

What it means is that hyperparameter search may span both feature extraction
_and_ fitting, and essentially may require a data frame as input in the most
general case (and i have run into such a practical case before).
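A toy sketch of such a joint search follows. It is not the recommender case itself. As a stand-in it assumes a single categorical column `city`, a hash-trick vectorization whose parameter is the bucket count, and a ridge fit whose parameter is `lam`. All names and data are hypothetical. The point is only that the search loop has to start from the raw rows, because one of the parameters it tunes changes the vectorization.

```python
import itertools
import zlib


def bucket(level, n_buckets):
    """Feature extraction: hash a categorical level into a column index."""
    return zlib.crc32(level.encode("utf-8")) % n_buckets


def fit(rows, n_buckets, lam):
    """Ridge fit on one-hot hashed features, no intercept.

    With exactly one active column per row the solution is closed-form:
    w_j = sum(y in bucket j) / (count_j + lam). Requires lam > 0.
    """
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for r in rows:
        j = bucket(r["city"], n_buckets)
        sums[j] += r["y"]
        counts[j] += 1
    return [s / (c + lam) for s, c in zip(sums, counts)]


def mse(rows, weights, n_buckets):
    """Mean squared error of the fitted weights on raw (unvectorized) rows."""
    errs = [(r["y"] - weights[bucket(r["city"], n_buckets)]) ** 2 for r in rows]
    return sum(errs) / len(errs)


def search(train, valid, bucket_grid, lam_grid):
    """Grid search over extraction AND fitting parameters together."""
    best = None
    for n_buckets, lam in itertools.product(bucket_grid, lam_grid):
        weights = fit(train, n_buckets, lam)     # re-vectorizes every time
        score = mse(valid, weights, n_buckets)
        if best is None or score < best[0]:
            best = (score, n_buckets, lam)
    return best


train = [{"city": c, "y": y} for c, y in
         [("a", 1.0), ("a", 1.2), ("b", 3.0), ("b", 2.8), ("c", 5.0)]]
valid = [{"city": c, "y": y} for c, y in [("a", 1.1), ("b", 2.9), ("c", 5.1)]]
bucket_grid = [1, 2, 4, 16]
lam_grid = [0.1, 1.0, 10.0]
best_score, best_buckets, best_lam = search(train, valid, bucket_grid, lam_grid)
```

If `search` were handed a pre-vectorized matrix instead of `train` and `valid`, the bucket count would already be fixed and could not be part of the grid.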

Finally, some goodness of fit metrics work on pre-vectorized factors.

This is all standard, but unfortunately it is all pretty expensive to do. I
have a big problem with discarding the notion of dataframe support as part
of the fitting/search process for some areas of computational statistics.
