[Senseclusters-users] changes coming in svdpackout.pl, sqrt and negative values

Ted Pedersen Mon, 31 Mar 2008 17:24:54 -0700

Greetings all,

As I've been working on our migration to CPAN, I've also been working
on svdpackout.pl, in order to
do some comparisons with SVDLIBC, and also just to understand the
inner workings of svdpackout.pl
again before making any big changes.


There are a couple of issues that I've focused attention on one more
time. This is a very belated follow-up prompted by some notes that
Richard Wicentowski sent to the developers list in November 2006.
Those notes raised a couple of issues that other users and developers
had also raised from time to time in the past.

Now, before we get to those, a bit of review. svdpackout.pl takes the
output from SVDPACKC and essentially "recombines" the decomposed input
matrix in order to build a new matrix that represents the k most
significant dimensions in the original data. SVD decomposes the input
matrix into three matrices; U, S, and V is a common notation for those.

svdpackout.pl does this recombination two different ways. The first
uses the --rowonly option. This just takes the M x k matrix U and
combines it with the k x k matrix S. M is the number of rows in the
original matrix, so we get an M x k matrix that represents the
original M x N data. Now, as a part of this operation, when we were
doing this recombination we would take the square root of the values
in S. Despite my best efforts, I really don't know why we did that,
and I don't find much evidence to support the use of this technique in
the literature, so I believe we'll stop doing that, and will simply
provide a --sqrt option to allow for backwards compatibility. Now, was
it a bad thing to take this square root? I don't know if it was bad,
although I think the effect of it would be to minimize the differences
between the k singular values that we find in S. So if those values
were originally (25, 16, 9, 4) (in a 4 x 4 diagonal matrix) then of
course the resulting values after the square root would be
(5, 4, 3, 2), which essentially causes these k values to come
together, and in the end possibly makes our resulting M x k
recombination harder to cluster.
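
To make that arithmetic concrete, here is a minimal sketch in Python
(not the svdpackout.pl code itself) of scaling the columns of U by the
diagonal of S, with and without the square root. The function name
scale_rows and the toy 2 x 4 matrix U are made up for illustration;
only the singular values (25, 16, 9, 4) come from the example above.

```python
import math

def scale_rows(U, singular_values, use_sqrt=False):
    """Compute U * S for a diagonal S given as a list of its values.

    Column j of the M x k matrix U is multiplied by singular value j,
    or by its square root when use_sqrt is True (hypothetical flag
    mirroring the old behavior).
    """
    weights = [math.sqrt(s) if use_sqrt else s for s in singular_values]
    return [[u * w for u, w in zip(row, weights)] for row in U]

S = [25.0, 16.0, 9.0, 4.0]          # singular values from the example
U = [[0.5, -0.5, 0.5, 0.5],         # toy values, not real SVD output
     [0.5, 0.5, -0.5, 0.5]]

plain = scale_rows(U, S)                   # weights 25, 16, 9, 4
rooted = scale_rows(U, S, use_sqrt=True)   # weights 5, 4, 3, 2

# Ratio of the largest to the smallest weight in each case.
ratio_plain = S[0] / S[-1]                          # 25 / 4
ratio_rooted = math.sqrt(S[0]) / math.sqrt(S[-1])   # 5 / 2
```

With the square root, the first dimension outweighs the last by a
factor of 2.5 rather than 6.25, which is the "coming together" of
the k values described above.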

Now, it's important to point out that we do use --rowonly as the
default in discriminate, which means
that it is also the default in the web interface. However,
svdpackout.pl defaulted to a full M x N matrix
reconstruction, which did not have the square root operation.
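
For comparison, the full M x N reconstruction can be sketched in
Python as the product of U (M x k), the diagonal of S (k values), and
the transpose of V (k x N). This assumes the standard SVD product and
is not taken from svdpackout.pl; the function name reconstruct and the
toy matrices are hypothetical.

```python
def reconstruct(U, singular_values, Vt):
    """Return the M x N matrix U * S * Vt, with S given as its diagonal.

    U is M x k, singular_values has k entries, and Vt is k x N.
    No square root is applied to the singular values.
    """
    k = len(singular_values)
    n_cols = len(Vt[0])
    return [[sum(row[j] * singular_values[j] * Vt[j][c] for j in range(k))
             for c in range(n_cols)]
            for row in U]

# Toy example with M = 3, N = 4, k = 2 (values chosen for readability).
U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
S = [3.0, 2.0]
Vt = [[0.0, 1.0, 0.0, 0.0],
      [1.0, 0.0, 0.0, 0.0]]

full = reconstruct(U, S, Vt)   # 3 rows of 4 values each
```

The result has the shape of the original M x N data, whereas
--rowonly stops at M x k.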

So, I think that in our next release we'll require that a user specify
--sqrt if they want this particular
feature turned on (and it would only have an effect on --rowonly).
Otherwise, we won't take the square root
of S (k x k) but will instead use the original values.

If anyone knows what we were thinking, speak now. :)

The next item is some rather powerful and perhaps unwise smoothing
that we applied to the recombined matrix in both the --rowonly and the
full recombination (in other words, this would happen for all runs of
svdpackout.pl). If the value of a cell in the recombined M x k or
M x N matrix was less than 0, we would smooth it to 0, thereby
eliminating any negative values. I don't have a good explanation for
why we chose to do that, and I'm inclined to think it was a bug.
Negative values are a natural byproduct of SVD, so simply removing
them does not have a good justification (or not one that I can think
of at least). The problem with removing them of course is that it
changes the nature of the result, and causes a fairly significant loss
of information in that "direction".
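
As a rough illustration of that loss, here is a small Python sketch
(again not the svdpackout.pl code; the names clamp_negatives and
euclidean and the two toy rows are made up). Two rows that differ only
in a negative component become indistinguishable once negatives are
set to 0.

```python
import math

def clamp_negatives(matrix):
    """Replace every negative cell with 0.0, leaving other cells alone."""
    return [[max(0.0, v) for v in row] for row in matrix]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy recombined rows: they differ only in the second dimension,
# where the first row is negative.
rows = [[1.0, -2.0],
        [1.0, 0.0]]

clamped = clamp_negatives(rows)

dist_before = euclidean(rows[0], rows[1])        # rows are distinct
dist_after = euclidean(clamped[0], clamped[1])   # rows now coincide
```

Before clamping the two rows are 2.0 apart; afterwards their distance
is 0, so a clustering algorithm can no longer separate them.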

Richard provided some code to the developers list quite a while ago
that includes a --negatives option
to turn off that smoothing, and I think we will make that the default
behavior, and only have the smoothing
of negative values done by request.

Thus, you can anticipate some fairly fundamental changes coming to
svdpackout.pl - first, no more square
root operation being performed on the S matrix, and second, no more
smoothing of negative values to
zero in the recombined matrix. In the interests of backwards
compatibility we'll maintain this functionality
via options that a user can request, but I think in general the
default behavior should be that we don't take
square roots and we don't smooth negative values to 0.

So, stay tuned, and the good news is that I think this might actually
cause our results from SVD to be
a bit more dramatic than they have been thus far. Our experience over
the years has been that SVD
does not seem to have too much of an effect on overall results, but I
think that might have been because
we were diluting the resulting information somewhat via these operations.

Now, this still does not address the issue of SVDLIBC versus SVDPACKC,
but no matter which direction we go there, svdpackout.pl will remain a
part of the picture. In fact my goal is that any change in how we
compute SVD would be largely invisible to the user.

Comments and questions are of course welcome on this.

Thanks,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
