Implement Stochastic Decomposition
----------------------------------

                 Key: MAHOUT-309
                 URL: https://issues.apache.org/jira/browse/MAHOUT-309
             Project: Mahout
          Issue Type: New Feature
          Components: Math
    Affects Versions: 0.4
            Reporter: Jake Mannix
            Assignee: Jake Mannix
             Fix For: 0.4


Techniques reviewed in Halko, Martinsson, and Tropp 
(http://arxiv.org/abs/0909.4061).

The basic idea of the implementation is as follows. The input matrix is 
represented as a DistributedSparseRowMatrix, backed by a sequence-file of 
<Writable,VectorWritable> pairs whose values should be 
SequentialAccessSparseVector instances for best performance. Optionally, you 
supply a kernel function f(v) which maps sparse numColumns-dimensional vectors 
to sparse numKernelizedFeatures-dimensional vectors (both dimensions are 
unconstrained in size); this is the case where you want to do kernel-PCA, for 
example, with a kernel k(u,v) = f(u).dot( f(v) ). The MurmurHash (from 
MAHOUT-228) is then used to project the numKernelizedFeatures-dimensional 
vectors down to a reasonably-sized numHashedFeatures-dimensional space (no 
more than roughly 10^2 to 10^4 dimensions).
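The hashing step above can be sketched roughly as follows. This is only an 
illustration, not the Mahout code: hash_project is a hypothetical name, a 
sparse vector is assumed to be a plain dict of index -> value, and zlib.crc32 
stands in for MurmurHash, which is not in the Python standard library.

```python
import zlib


def hash_project(sparse_vec, num_hashed_features):
    """Fold a sparse {index: value} vector into num_hashed_features buckets.

    Hypothetical helper for illustration only. zlib.crc32 is a stand-in
    for MurmurHash (assumption); any well-mixed hash could be swapped in.
    """
    hashed = [0.0] * num_hashed_features
    for index, value in sparse_vec.items():
        # Hash the original feature index to choose a target bucket.
        bucket = zlib.crc32(str(index).encode("utf-8")) % num_hashed_features
        # Colliding features are simply summed into the same bucket.
        hashed[bucket] += value
    return hashed


# Usage: three non-zero features out of a huge space, folded into 8 buckets.
kernelized = {3: 1.0, 1000003: 2.0, 42: 0.5}
print(hash_project(kernelized, 8))
```

Colliding features are added together here. Whether the real implementation 
would also apply a sign hash, or some other collision handling, is not stated 
in this issue.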

This is all done in the Mapper, and there are two outputs. The first is the 
numHashedFeatures-dimensional vector itself, which does not need to be Reduced 
and is only needed if the left-singular vectors are ever desired. The second is 
the outer-product of this vector with itself; the Reducer/Combiner just does 
the matrix sum on these partial outputs, eventually producing the kernel / Gram 
matrix of your hashed features. That matrix can then be run through a simple 
eigen-decomposition. Its eigenvectors, each scaled by 1/sqrt(eigenvalue) (the 
eigenvalues of the Gram matrix are the squared singular values), can be applied 
to the (optional) numHashedFeatures-dimensional vectors from the first output 
to get the left-singular vectors / reduced projections (which can then be run 
through clustering, etc.).
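A minimal single-machine sketch of that pipeline, under stated assumptions: 
all function names are hypothetical, the "Mapper" and "Reducer" are plain 
Python functions rather than Hadoop jobs, and power iteration stands in for 
whatever eigen-solver the real implementation would use. It recovers only the 
top eigenpair, and it divides by the square root of the eigenvalue (the 
singular value), which is what makes the resulting left-singular vector 
unit-length.

```python
import math


def outer(v):
    """Mapper side: outer product of one hashed row vector with itself."""
    return [[a * b for b in v] for a in v]


def gram_matrix(rows):
    """Reducer/Combiner side: sum the per-row outer products."""
    n = len(rows[0])
    total = [[0.0] * n for _ in range(n)]
    for row in rows:
        partial = outer(row)
        total = [[x + y for x, y in zip(r1, r2)]
                 for r1, r2 in zip(total, partial)]
    return total


def top_eigenpair(matrix, iterations=200):
    """Power iteration for the dominant eigenpair of a symmetric matrix.

    Stand-in for a full eigen-decomposition (assumption). It expects a
    non-zero matrix and a start vector not orthogonal to the top eigenvector.
    """
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        nxt = [sum(m * x for m, x in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # The Rayleigh quotient gives the eigenvalue for the converged vector.
    mv = [sum(m * x for m, x in zip(row, vec)) for row in matrix]
    eigenvalue = sum(a * b for a, b in zip(mv, vec))
    return eigenvalue, vec


def left_singular_vector(rows, eigenvalue, eigenvector):
    """Project each hashed row onto the eigenvector, scaled by 1/sqrt(eigenvalue)."""
    sigma = math.sqrt(eigenvalue)
    return [sum(a * b for a, b in zip(row, eigenvector)) / sigma
            for row in rows]


# Usage: three hashed rows with numHashedFeatures = 2.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gram = gram_matrix(rows)
eigenvalue, eigenvector = top_eigenpair(gram)
u = left_singular_vector(rows, eigenvalue, eigenvector)
print(gram, eigenvalue, u)
```

Because the Gram matrix is only numHashedFeatures x numHashedFeatures, this 
last stage fits comfortably on one machine even when the input has a very 
large number of rows.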

Good fun will be had by all.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
