Hi,

I'm working with a massive dataset (say, ~100M rows) in which URLs are mapped
to word counts. These counts refer to words from a rather big dictionary,
say in the 5M range (so, obviously, the dataset is extremely sparse).
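
Just to be concrete about the representation I have in mind (plain Java,
names made up by me, purely illustrative): each URL keeps only its nonzero
(termId, count) pairs rather than a dense 5M-wide row.

import java.util.HashMap;
import java.util.Map;

// Toy sparse representation of one URL's word counts: only nonzero entries are stored.
public class UrlTermCounts {

  private final String url;
  private final Map<Integer, Integer> counts = new HashMap<Integer, Integer>(); // termId -> count

  public UrlTermCounts(String url) {
    this.url = url;
  }

  public void increment(int termId) {
    Integer current = counts.get(termId);
    counts.put(termId, current == null ? 1 : current + 1);
  }

  public int count(int termId) {
    Integer c = counts.get(termId);
    return c == null ? 0 : c; // absent terms are implicitly zero
  }

  public String url() {
    return url;
  }

  public int numNonZeroTerms() {
    return counts.size();
  }
}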

My research aims to predict some variable related to these URLs, based on
the latent information in these word counts. I wish to start with basic
models, namely linear regression of some sort (probably with some kind of
regularization). Since fitting such models over 5M predictor variables is
rather, hmm, unpleasant, I wish to reduce the dataset's dimensionality
first. Selecting features based on a PCA of the dataset seems like a
reasonable approach.
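
To make the intended pipeline concrete (all names below are mine, just to
fix ideas): given the top k principal directions, each URL's sparse counts
get projected down to k dense features, and the regularized regression is
then fit on those k features only. The projection costs O(nonzeros * k) per
URL, which is what makes this attractive despite the 5M-wide dictionary.

import java.util.Map;

// Projects one URL's sparse word counts onto k principal directions
// (k dense vectors of dictionary length, e.g. the top right singular vectors).
public class PcaProjection {

  /**
   * @param termCounts  sparse counts for one URL, termId -> count (nonzero entries only)
   * @param components  k principal directions; components[j] has length = dictionary size
   * @return the k-dimensional dense feature vector for this URL
   */
  public static double[] project(Map<Integer, Integer> termCounts, double[][] components) {
    int k = components.length;
    double[] features = new double[k];
    // Only the terms that actually occur in the document are touched: O(nnz * k).
    for (Map.Entry<Integer, Integer> e : termCounts.entrySet()) {
      int termId = e.getKey();
      double count = e.getValue();
      for (int j = 0; j < k; j++) {
        features[j] += count * components[j][termId];
      }
    }
    return features;
  }
}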

I'm wondering if Mahout, at its current stage, is suitable for this kind of
analysis.

Specifically, I considered using Jake Mannix's recent port
<http://issues.apache.org/jira/browse/MAHOUT-180> of the excellent
decomposer library to Mahout, but encountered two problems:
1) There is no org.apache.mahout.math.Matrix implementation that "lives" in
HDFS (my covariance matrix is way too big to fit on a single machine).
2) The org.apache.mahout.math.decomposer classes currently do not use
MapReduce, rendering them unfit for computations at this scale (I've
sketched below the kind of MapReduce primitive I think this would require).
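
Specifically, the main primitive a Lanczos-style pass over the (implicit)
covariance matrix needs is y = A^T (A x) over the row-distributed count
matrix A, and that maps naturally onto a single MapReduce job. A rough
sketch (plain Hadoop, names made up by me; x is passed through the
Configuration only for brevity, and mean-centering is ignored):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Computes y = A^T (A x) where A is stored row-wise in HDFS as text lines of the
// form "termId:count termId:count ...". Each row contributes (a_i . x) * a_i to y.
public class AtAxSketch {

  public static class RowMapper
      extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {

    private double[] x;

    @Override
    protected void setup(Context ctx) {
      // The dense vector x, serialized as a comma-separated string in the job
      // configuration (a real job would broadcast it via the DistributedCache).
      String[] parts = ctx.getConfiguration().get("atax.x").split(",");
      x = new double[parts.length];
      for (int i = 0; i < parts.length; i++) {
        x[i] = Double.parseDouble(parts[i]);
      }
    }

    @Override
    protected void map(LongWritable offset, Text value, Context ctx)
        throws IOException, InterruptedException {
      String line = value.toString().trim();
      if (line.isEmpty()) {
        return;
      }
      // Parse the sparse row a_i and compute s = a_i . x
      String[] entries = line.split("\\s+");
      int[] ids = new int[entries.length];
      double[] vals = new double[entries.length];
      double s = 0.0;
      for (int k = 0; k < entries.length; k++) {
        String[] kv = entries[k].split(":");
        ids[k] = Integer.parseInt(kv[0]);
        vals[k] = Double.parseDouble(kv[1]);
        s += vals[k] * x[ids[k]];
      }
      // Emit this row's contribution s * a_i, one partial sum per nonzero term.
      for (int k = 0; k < entries.length; k++) {
        ctx.write(new IntWritable(ids[k]), new DoubleWritable(s * vals[k]));
      }
    }
  }

  public static class SumReducer
      extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {

    @Override
    protected void reduce(IntWritable termId, Iterable<DoubleWritable> partials, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable p : partials) {
        sum += p.get();
      }
      ctx.write(termId, new DoubleWritable(sum)); // y[termId]
    }
  }
}

The Lanczos iteration would then just call a job like this once per step,
with only the small dense vectors living in memory on the driver.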

Am I missing something here?
If not, is there a plan to address these issues in Mahout (can anyone
provide an estimated time frame)?

I'm also wondering whether there are plans to implement other matrix
decomposition algorithms in Mahout, e.g. the stochastic approximation
algorithms reviewed here <http://arxiv.org/abs/0909.4061>.
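
For reference, the core of those methods is a randomized range finder:
multiply A by a Gaussian test matrix, orthonormalize the result, and then
decompose a much smaller k x n matrix; the passes over A are plain matrix
multiplications, which seems like a natural fit for MapReduce. A toy
in-core sketch (plain Java, names made up by me, no Mahout dependencies),
just to show the structure:

import java.util.Random;

// Toy, in-core version of the randomized range finder from arXiv:0909.4061.
// On the real data each multiplication by A would be a MapReduce pass over the
// row-distributed sparse matrix; dense arrays are used here only to show the steps.
public class RandomizedRangeFinder {

  /** Returns an m x k orthonormal basis Q whose span approximates the range of A (m x n, rows a[i]). */
  public static double[][] approximateRange(double[][] a, int k, long seed) {
    int m = a.length;
    int n = a[0].length;
    Random rnd = new Random(seed);

    // Omega: an n x k Gaussian test matrix.
    double[][] omega = new double[n][k];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < k; j++) {
        omega[i][j] = rnd.nextGaussian();
      }
    }

    // Y = A * Omega (one pass over A).
    double[][] y = new double[m][k];
    for (int i = 0; i < m; i++) {
      for (int j = 0; j < k; j++) {
        double s = 0.0;
        for (int t = 0; t < n; t++) {
          s += a[i][t] * omega[t][j];
        }
        y[i][j] = s;
      }
    }

    // Orthonormalize the columns of Y (modified Gram-Schmidt) to get Q.
    for (int j = 0; j < k; j++) {
      for (int p = 0; p < j; p++) {
        double dot = 0.0;
        for (int i = 0; i < m; i++) {
          dot += y[i][j] * y[i][p];
        }
        for (int i = 0; i < m; i++) {
          y[i][j] -= dot * y[i][p];
        }
      }
      double norm = 0.0;
      for (int i = 0; i < m; i++) {
        norm += y[i][j] * y[i][j];
      }
      norm = Math.sqrt(norm);
      for (int i = 0; i < m; i++) {
        y[i][j] /= norm;
      }
    }
    return y; // Q; the remaining work is on B = Q^T * A, a k x n (much smaller) matrix.
  }
}

What makes this attractive at our scale is that it needs only a small,
fixed number of passes over A (plus possibly a few power iterations),
instead of the many sequential matrix-vector products Lanczos requires.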

Best,
Danny
