Hi, I'm researching a massive dataset (say, ~100M rows) in which URLs are mapped to word counts. These counts refer to words from a rather big dictionary, say in the 5M range (obviously, the dataset is extremely sparse).
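To make the setup concrete, here is roughly how I hold a single row at the moment (plain Java, illustrative names only, no Mahout types assumed):

    import java.util.HashMap;
    import java.util.Map;

    // One row of the dataset: a URL and its sparse word counts,
    // keyed by word id into the ~5M-word dictionary.
    public class UrlWordCounts {
      private final String url;
      // only the words that actually occur on the page are stored
      private final Map<Integer, Integer> countsByWordId =
          new HashMap<Integer, Integer>();

      public UrlWordCounts(String url) {
        this.url = url;
      }

      public void increment(int wordId) {
        Integer current = countsByWordId.get(wordId);
        countsByWordId.put(wordId, current == null ? 1 : current + 1);
      }

      public Map<Integer, Integer> counts() {
        return countsByWordId;
      }
    }

The raw rows themselves are manageable, but a dense 5M x 5M covariance matrix of doubles would be on the order of 200 TB, which is why I'm looking for a distributed, sparse-aware decomposition.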
My research aims to predict some variable associated with these URLs, based on latent information in the word counts. I wish to start with basic models, namely linear regression of some sort (probably with some kind of regularization). Since fitting such models over 5M predictor variables is rather, hmm, unpleasant, I wish to reduce the dataset's dimensionality first. Selecting features based on PCA of the dataset seems like a reasonable approach.

I'm wondering if Mahout, at its current stage, is suitable for this kind of analysis. Specifically, I considered using Jake Mannix's recent port <http://issues.apache.org/jira/browse/MAHOUT-180> of the excellent decomposer library to Mahout, but encountered two problems:

1) There is no org.apache.mahout.Matrix implementation that "lives" in HDFS (my covariance matrix is far too big to fit on a single machine).
2) The org.apache.mahout.math.decomposer classes currently do not use MapReduce, rendering them unfit for computations at this scale.

Am I missing something here? If not, is there a plan to address these issues in Mahout (can anyone provide an estimated time frame)? I'm also wondering whether there are plans to implement other matrix decomposition algorithms in Mahout, e.g. the stochastic approximation algorithms reviewed here <http://arxiv.org/abs/0909.4061>.

Best,
Danny
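P.S. For concreteness, this is the kind of projection step I'd like to end up with once the top-k right singular vectors are available (plain Java, hypothetical names, no Mahout API assumed); for k in the low hundreds the basis is a few GB and should still fit on one machine:

    import java.util.Map;

    public class Projector {
      // basis[j] is the j-th right singular vector, of length ~5M (the dictionary size).
      private final double[][] basis;

      public Projector(double[][] basis) {
        this.basis = basis;
      }

      // Project a sparse row of word counts onto the k-dimensional latent space,
      // touching only the nonzero entries of the row.
      public double[] project(Map<Integer, Integer> countsByWordId) {
        double[] reduced = new double[basis.length];
        for (Map.Entry<Integer, Integer> e : countsByWordId.entrySet()) {
          int wordId = e.getKey();
          double count = e.getValue();
          for (int j = 0; j < basis.length; j++) {
            reduced[j] += count * basis[j][wordId];
          }
        }
        return reduced;
      }
    }

The reduced k-dimensional vectors (one per URL) would then be the inputs to the regularized regression.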
