On Thu, Jul 21, 2016 at 12:35 PM, Trevor Grant <[email protected]> wrote:
> Finally, re data-frames. Why not leave it as vectors and matrices?

Short answer: because (imo) data frames are not vectors and matrices.

Longer argumentation:

Some capabilities expected of data frames are as follows. DFs are columnar tables where columns are either named vectors or named factors (in the R sense). Also, operationally DFs usually lean more toward providing relational algebra capabilities (joins etc.) than numerical algebra (blas3).

A factor (or, perhaps a better term, a categorical feature) is fundamentally non-numerical data. It is a representation of categorical data, which could be bounded or unbounded in the number of categories. Furthermore, there is more than one way to vectorize a factor or a group of factors, which is what formulas and similar things are for.

Now you might view all these formulas, factors and hash tricks as a feature preparation activity and say that the learning process is not bothered by that; in the end, every fitting is essentially working on numerical input. Unfortunately, that may not be quite true. Model search (step-wise GLM, for example) is not necessarily a numerical-only thing, since it essentially manages factor vectorization.

That said, i think we can safely say that an individual learner could be a numerical-only thing. But as soon as we go up the chain to transformations, vectorizations and searching for parameters of vectorizations, dataframes are usually the input sources for all of those.

An excellent example of those (which failed to get properly architected by concerns in that other OSS project) is the implicit feedback recommender. In fact, there are two problems here: one is parameterized feature extraction and the other is fitting the decomposition. Each of the problems has its own parameters.
In the vanilla paper implementation there were two suggested ways of feature extraction, each offering one parameter, and it was suggested that these be searched for via CV along with the fitter hyperparameters (learning rate, regularization). What this means is that hyperparameter search may overarch feature extraction _and_ fitting, and essentially may require a data frame as input in the most general case (and i ran into such a practical case before).

Finally, some goodness-of-fit metrics work on pre-vectorized factors.

This is all standard, but unfortunately it is all pretty expensive to do. I have a big problem discarding the notion of dataframe support as part of the fitting/search process for some areas of computational statistics.
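
To make the "search overarches extraction and fitting" point concrete, here is a minimal sketch in plain Python (not any Mahout API, and not the recommender from the paper). Everything in it is an assumption made up for illustration: the toy rows, the column names `city` and `y`, the hashing-trick vectorizer, and the closed-form ridge fitter. The only thing it is meant to show is that the search loop has to start from the raw categorical column, because the vectorization parameter (`n_buckets`) is searched together with the fitter parameter (`lam`).

```python
import itertools
import zlib

# Hypothetical toy "data frame": one categorical column and a numeric target.
ROWS = [
    {"city": "paris", "y": 10.0}, {"city": "oslo", "y": 2.0},
    {"city": "lima", "y": 6.0}, {"city": "paris", "y": 11.0},
    {"city": "oslo", "y": 3.0}, {"city": "lima", "y": 5.0},
    {"city": "paris", "y": 9.0}, {"city": "oslo", "y": 1.0},
    {"city": "lima", "y": 7.0},
]


def hash_vectorize(value, n_buckets):
    """Hashing trick: map a category to one of n_buckets indicator columns."""
    return zlib.crc32(value.encode("utf-8")) % n_buckets


def fit(rows, n_buckets, lam):
    """Ridge fit without intercept on one-active-indicator features.

    Each row activates exactly one column, so the ridge solution is
    w_j = (sum of y in bucket j) / (count in bucket j + lam).
    """
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for row in rows:
        j = hash_vectorize(row["city"], n_buckets)
        sums[j] += row["y"]
        counts[j] += 1
    # Empty buckets with lam == 0 get weight 0.0 instead of dividing by zero.
    return [s / (c + lam) if c + lam > 0 else 0.0 for s, c in zip(sums, counts)]


def cv_error(rows, n_buckets, lam, k=3):
    """Mean squared error over k folds; rows are re-vectorized per candidate."""
    total = 0.0
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        test = [r for i, r in enumerate(rows) if i % k == fold]
        w = fit(train, n_buckets, lam)
        total += sum(
            (r["y"] - w[hash_vectorize(r["city"], n_buckets)]) ** 2 for r in test
        )
    return total / len(rows)


def joint_search(rows, bucket_grid, lam_grid):
    """Grid search over the vectorizer parameter and the fitter parameter together."""
    return min(
        itertools.product(bucket_grid, lam_grid),
        key=lambda params: cv_error(rows, *params),
    )


if __name__ == "__main__":
    best = joint_search(ROWS, [1, 4, 16], [0.1, 1.0])
    print("best (n_buckets, lam):", best)
```

Note that `fit` on its own only ever sees bucket indices, i.e. it is the "numerical-only" individual learner, while `joint_search` cannot be handed a pre-built matrix because each candidate `n_buckets` produces a different one.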
