On Thu, Feb 25, 2010 at 7:48 PM, Steven Buss <[email protected]> wrote:
>
> I still don't see this. Hopefully I can explain my confusion. It seems
> to me that the complexity you wrote uses a kernel that does not
> require pair-wise computations. It doesn't look like this is actually
> operating on the Gram matrix, H.
>
This is correct: it's not using the full Gram matrix at all, so it
won't yield the same results as what you're doing for a general
kernel K(u,v).
But it *does* work for kernels of the form K(u,v) = <k(u), k(v)>, where
the inner product is the euclidean inner product, and k() is a (possibly
complex, nonlinear) function from R^(X.numCols) to R^(dim_Kernel).
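As a concrete example of such a kernel (my own illustration, nothing
specific to your setup): the degree-2 polynomial kernel
K(u,v) = <u,v>^2 factors this way, with k(u) the flattened outer
product of u with itself:

import numpy as np

def k(u):
    # k: R^n -> R^(n^2), so here dim_Kernel = numCols^2
    return np.outer(u, u).ravel()

u, v = np.array([1.0, 2.0]), np.array([3.0, 1.0])
assert np.dot(k(u), k(v)) == np.dot(u, v) ** 2
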
Let's go over in a little detail how this works, and see if I'm
smoking crack:
> (call the data matrix X)
> I think a more accurate complexity would be O(k * numUsers *
> (kernelCost * numUsers)), which is the same as precomputing H. For
> every row, i, of X that the Lanczos algorithm comes to, we need to
> find k(X_i, X_{1...N}).
>
No, this is not what Lanczos does. Let's drop the kernel for a
bit, and stick with the euclidean one. Using Lanczos to get the SVD
of X can be done in several ways, because the SVD really only needs
either the left singular vectors or the right ones: once you have one
set, you can reconstruct the other in a single pass over your
original data.
X*X^{t} (which are N-dimensional), or eigenvectors of X^{t}*X
(which are M-dimensional). The most straightforward way to look
for eigenvectors of X*X^{t} is to first compute the full X*X^{t}
matrix, and then run (symmetric) Lanczos on the result to pull
out N-dimensional eigenvectors.
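As an aside, here's a tiny Python/scipy sketch (purely illustrative,
not Mahout code) of that "reconstruct the other set in one pass"
claim, using U = X * V * S^{-1}:

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

X = sparse_random(1000, 200, density=0.05, format='csr', random_state=0)
U, s, Vt = svds(X, k=10)    # left and right singular vectors at once
U_rebuilt = (X @ Vt.T) / s  # one pass over X: U = X * V * S^{-1}
assert np.allclose(U, U_rebuilt)
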
If instead we look for the M-dimensional eigenvectors of X^{t}*X,
we can compute the matrix-vector products Lanczos needs directly
from X, by noting the following:
for each i = 1..M,
[ (X^{t}*X)*v ]_i = sum_{j=1..N} sum_{k=1..M} (X_ji * X_jk * v_k)
Ostensibly, this looks like an O(N * M^2) computation, and
would be, if X were dense.
But since it's sparse, reordering this sum makes clear how cheap
these dot products really are:
sum_k (X_jk * v_k) is just <X_j, v>, and if X_j only has a few
nonzero entries, this computation is O(numEntries(X_j)), not
O(dim(X_j)) = M.
So to compute the vector
v' = (X^{t}*X)*v = vectorSum_{i=1..N} ( X_i * <X_i, v> )
you are just taking repeated sparse dot products of X_i with
your input (possibly dense: it's an eigenvector!) vector v,
scaling that (sparse!) row X_i by the result, and accumulating
it onto your final (probably dense) output vector v'. If the
average density of each row X_i is "d", then this is O(d * N)
FLOPs (4*d*N to be exact).
Lanczos then does this process k times:
O(d*N*k) operations
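Concretely, here's a minimal Python sketch of that one-pass product
(my own illustration of the technique, not Mahout's actual code; the
dict-of-{column: value} rows are a stand-in for whatever sparse
vector class you're using):

import numpy as np

def ata_times_v(rows, v, num_cols):
    """One pass of v' = (X^t X) v over the sparse rows of X."""
    out = np.zeros(num_cols)
    for row in rows:
        # <X_i, v> costs O(numEntries(X_i)), since v is dense
        dot = sum(val * v[col] for col, val in row.items())
        # accumulate <X_i, v> * X_i onto the dense output
        for col, val in row.items():
            out[col] += dot * val
    return out

Each row costs about 4*numEntries(X_i) FLOPs, which is where the
4*d*N count above comes from.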
To compute the full Gram matrix, you can also play some
tricks to find only the pairs of rows of X which intersect
(and thus have a nontrivial dot product). But unless this process
is very clever (maybe the MapReduce passes you describe
are this clever, I haven't dug into them in any detail yet),
you may end up with a bunch of pairs of rows, X_i and X_j,
which are known to overlap, and to compute their inner
product, you still need to iterate over all d entries of the
shorter of X_i and X_j, to find out where they actually
overlap.
So let's say we know which pairs of X_i and X_j
we need to compute dot products on. If the distribution
of sparsity were fairly uniform (not the case, but it's only
going to be worse if it's non-uniform), then the
probability that X_i and X_j don't overlap at all is
(1 - (d/M))^d =approx 1 - d^2/M (sparse: d/M is small),
so the probability that they do overlap is about d^2/M.
Then there are about N^2 * d^2/M nonzero entries in the Gram
matrix (it's still sparse, but in comparison to the N*d
nonzero entries in X, it's denser by a multiplicative
factor of d*(N/M)), and computing each one can take
O(d) operations (not FLOPs, but still cpu cycles
of walking a sparse vector's iterator). Lanczos
takes O(k * nonzeroEntries(A)) operations to get
k eigenvectors of a matrix A, so all in all, in the
sparse case, it looks like compute-Gram-then-decompose
will take something like
O(k * N^2 * d^2 * d / M) = O(k * N * (d^3 * (N/M)))
operations, while Lanczos in one pass is
O(k * N * d).
Now if we get clever enough to turn that d^3 into
more like d^2, and we observe that at least in your case,
M is a bit more than N, and d is really small, then
this is probably a wash. But in the usual case, where
N is big, and d is not tiny (maybe 10s to 100s), then
(N/M) * d is a big multiplicative factor in cpu
cycles (as well as in space, which is often neglected,
unless we really are talking about a factor of 1000
bigger, in which case your Hadoop admin may want
to have a word with you).
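To put some made-up (purely hypothetical) numbers on that factor:

N, M, d, k = 10**7, 10**5, 100, 50      # users, items, avg row density, rank
one_pass  = k * N * d                   # Lanczos directly on X
gram_path = k * N * d * (d**2 * N / M)  # compute-Gram-then-decompose
print(gram_path / one_pass)             # d^2 * (N/M) = 10^6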
The intuitive reason why folding the kernel into
X itself and computing the SVD beats computing
the Gram matrix and eigendecomposing that,
is that essentially, eigendecomposition of a matrix A
is taking A, A^2, A^3, ... and using these in some
algebraic fashion to pull out eigenvectors. The SVD
is getting the same vectors, but it's basically doing it
over X itself, the "square root" of A = X*X^{t}, instead
of over A. Eigendecomposition is iterating over a
denser matrix than the SVD is.
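You can see that density difference directly (arbitrary toy numbers,
just to illustrate):

from scipy.sparse import random as sparse_random

X = sparse_random(2000, 500, density=0.02, format='csr', random_state=0)
G = X @ X.T            # the Gram matrix, still stored sparsely
print(X.nnz, G.nnz)    # G has roughly d*(N/M) = 40x more nonzeros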
All of this was using the euclidean kernel, and does not
generalize to *all kernels*. But as long as you can represent
your kernel as just mapping the rows of X to some space
where the inner product can be computed as a euclidean
dot product, the entire argument carries over as is, with
M replaced by dim(KernelSpace), and d replaced by
numNonZeroEntries( k(X_i) ).
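In terms of the earlier ata_times_v sketch, kernelizing then just
means mapping each row through k() before the same one-pass product
(again hypothetical code; feature_map stands in for whatever your
kernel's factorization gives you):

def kernel_ata_times_v(rows, v, feature_map, kernel_dim):
    # identical to the euclidean case, but over k(X_i) instead of X_i;
    # v must now live in kernel space, i.e. have kernel_dim entries
    return ata_times_v([feature_map(r) for r in rows], v, kernel_dim)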
> Let's say that I am completely misunderstanding what you're saying and
> that your suggestion works. We would then still have the problem of
> centering H. There may be tricks to do this row-wise, but I am pretty
> sure that the entire matrix has to be computed first, meaning that
> your suggestion wouldn't work out after all.
>
Centering may be problematic, depending on what you mean by it.
Do you mean you're centering the rows of the Gram matrix? If I knew
how and why you were centering, then maybe there'd be a way to
salvage the above technique, but I don't know.
Ok, I'm not sure if any of this made any sense, but either way, code
will soon be available in Mahout to allow for comparisons between
all these various techniques. It may turn out that I/O costs are so
dominant that none of this matters except how much you can minimize
the number of passes over your data, in which case going with the
stochastic decomposition work will certainly be the way to go, as it
allows you to do everything in *one* monster pass over the data.
-jake