One more word on row labels. Historically, the DRM interpretation of row keys (as indices vs. labels) has been a bit unfortunate.
But in the end it turned out that it often has some strange synergy. E.g., if you compute a big SVD, val (U, V, s) = dssvd(A, ...), then it doesn't matter whether the rows of A are labeled by strings or by their ordinal Int indices. It is all transparent to the underlying pipeline. All it means is that matrix U will have the same type of keys, with the same semantics, as the keys of A (e.g., either document labels of a String type or matrix row indices of Int type).

Moreover, not only is dssvd's user-facing API oblivious of the key type of A, but it turns out its implementation is oblivious of the true semantics of the row keys of A as well. This mostly comes down to the simple notion that the self-square A'A is logically oblivious of the row index type as well, and that any matrix A inside an optimization plan can actually be formed as A' if needed, as long as it doesn't hit an optimization barrier (i.e., get collected or saved).

On Wed, Mar 29, 2017 at 9:37 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> On Wed, Mar 29, 2017 at 9:26 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> The other missing bit is dataframes. R and Spark have them in different
>> forms but Mahout largely ignores the issue of real world object ids.
>
> Mahout only supports matrices and vectors, not data frames.
>
> Data frames imply a mix of various types of data which have yet to be
> converted to numerical data to be consumed by an algebraic algorithm (in R,
> usually done via a formula). Unfortunately, Mahout has no extension for
> formulas. As for data frames, native data frames (e.g., Spark data frames
> specifically) usually work reasonably well for vectorization of
> non-numerical data.
>
> Distributed matrices indeed do not support column labels, and row labels
> are quasi-supported, meaning they share their nature with the unordered
> row index for transposition purposes, i.e., one can either have row labels
> and limited transposition semantics, or one can have integer labels
> interpreted as a column index for transposition purposes, but not both.
>
> Another way is to use Mahout NamedVectors for the purposes of row
> labeling, but this is not supported consistently in any given elementary
> solver.
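
P.S. To illustrate the A'A point from the top of this message, here is a minimal sketch in plain Scala collections. It deliberately does not use Mahout's DRM API; GramKeyDemo, gram, byIndex and byLabel are made-up names for illustration only. It just shows that computing A'A never has to look at the row keys, so String labels and Int indices give the same result, and so does any reordering of the rows.

```scala
// Illustration only: plain Scala collections, NOT Mahout's DRM API.
// All names here are hypothetical and exist just for this example.
object GramKeyDemo {
  type Row = Vector[Double]

  // Compute A'A from keyed rows. The key type K is generic and the key
  // itself is never inspected: each row contributes its outer product.
  def gram[K](rows: Seq[(K, Row)]): Vector[Vector[Double]] = {
    val n = rows.head._2.length
    Vector.tabulate(n, n) { (i, j) =>
      rows.map { case (_, r) => r(i) * r(j) }.sum
    }
  }

  // A small 3 x 2 matrix A, stored as rows.
  val data: Seq[Row] =
    Seq(Vector(1.0, 2.0), Vector(3.0, 4.0), Vector(5.0, 6.0))

  // The same rows keyed by ordinal Int index ...
  val byIndex: Seq[(Int, Row)] =
    data.zipWithIndex.map { case (r, i) => (i, r) }

  // ... and keyed by String labels (e.g., document ids).
  val byLabel: Seq[(String, Row)] =
    Seq("doc-a", "doc-b", "doc-c").zip(data)

  def main(args: Array[String]): Unit = {
    // Same A'A regardless of key type.
    println(gram(byIndex) == gram(byLabel))
    // Same A'A regardless of row order, since labeled rows are unordered.
    println(gram(byLabel.reverse) == gram(byLabel))
  }
}
```

Both comparisons hold for this data, which is the sense in which the self-square is "logically oblivious" of the row key type: the keys only matter again when something row-aligned with A, such as U, is handed back to the caller.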