Greetings all,

As I've been working on our migration to CPAN, I've also been working on svdpackout.pl, in order to do some comparisons with SVDLIBC, and also just to understand the inner workings of svdpackout.pl again before making any big changes.
There are a couple of issues that I've focused attention on one more time. This is a very belated follow-up prompted by some notes from Richard Wicentowski to the developers list in November 2006, which raised a couple of issues that other users and developers had also raised from time to time in the past.

Now, before we get to those, a bit of review. svdpackout.pl takes the output from SVDPACKC and essentially "recombines" the decomposed input matrix in order to build a new matrix that represents the k most significant dimensions in the original data. SVD decomposes the input matrix into three matrices; U, S, and V is a common notation for those.

svdpackout.pl does this recombination two different ways. The first uses the --rowonly option. This just takes the M x k matrix U and combines it with the k x k matrix S. M is the number of rows in the original matrix, so we get an M x k matrix that represents the original M x N data.

Now, as a part of this operation, when we were doing this recombination we would take the square root of the values in S. Despite my best efforts, I really don't know why we did that, and I don't find much evidence in the literature to support this technique, so I believe we'll stop doing that, and will simply provide a --sqrt option to allow for backwards compatibility.

Now, was it a bad thing to take this square root? I don't know if it was bad, although I think the effect of it would be to minimize the differences between the k singular values that we find in S. So if those values were originally (25, 16, 9, 4) (in a 4 x 4 diagonal matrix), then of course the resulting values after the square root would be (5, 4, 3, 2), and the ratio of the largest to the smallest drops from 6.25 to 2.5. This essentially causes these k values to come together, and in the end possibly makes our resulting M x k recombination harder to cluster.

Now, it's important to point out that we do use --rowonly as the default in discriminate, which means that it is also the default in the web interface.
However, svdpackout.pl defaulted to a full M x N matrix reconstruction, which did not have the square root operation. So, I think that in our next release we'll require that a user specify --sqrt if they want this particular feature turned on (and it would only have an effect on --rowonly). Otherwise, we won't take the square root of S (k x k) but will instead use the original values. If anyone knows what we were thinking, speak now. :)

Next item: some rather powerful and perhaps unwise smoothing that we applied to the recombined matrix in both the --rowonly and the full recombination (in other words, this would happen for all runs of svdpackout.pl). If the value of a cell in the recombined M x k or M x N matrix was less than 0, we would smooth it to 0, thereby eliminating any negative values. I don't have a good explanation for why we chose to do that, and I'm inclined to think it was a bug. Negative values are a natural byproduct of SVD, so simply removing them does not have a good justification (or not one that I can think of, at least). The problem with removing them, of course, is that it changes the nature of the result, and causes a fairly significant loss of information in that "direction".

Richard provided some code to the developers list quite a while ago that includes a --negatives option to turn off that smoothing, and I think we will make that the default behavior, and only have the smoothing of negative values done by request.

Thus, you can anticipate some fairly fundamental changes coming to svdpackout.pl: first, no more square root operation being performed on the S matrix, and second, no more smoothing of negative values to zero in the recombined matrix. In the interests of backwards compatibility we'll maintain this functionality via options that a user can request, but I think in general the default behavior should be that we don't take square roots and we don't smooth negative values to 0.
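To make the two changes concrete, here is a minimal sketch in Python of a row-only recombination. This is not the actual Perl code in svdpackout.pl. The function name, the parameter names (use_sqrt, clamp_negatives), and the values in U are all made up for illustration; only the singular values (25, 16, 9, 4) come from the example above.

```python
import math

def recombine_rows(U, s, use_sqrt=False, clamp_negatives=False):
    """Scale column j of the M x k matrix U by the j-th singular value s[j].

    use_sqrt        -- scale by sqrt(s[j]) instead (the old --rowonly behavior)
    clamp_negatives -- set negative cells to 0 (the old smoothing)
    Both names are hypothetical; they are not svdpackout.pl options.
    """
    weights = [math.sqrt(v) if use_sqrt else v for v in s]
    rows = []
    for u_row in U:
        row = [u * w for u, w in zip(u_row, weights)]
        if clamp_negatives:
            row = [max(0.0, x) for x in row]
        rows.append(row)
    return rows

# Made-up 2 x 4 slice of U, with the singular values from the example above.
U = [[0.5, -0.5, 0.5, -0.5],
     [0.5, 0.5, -0.5, -0.5]]
s = [25.0, 16.0, 9.0, 4.0]

new_default = recombine_rows(U, s)
old_behavior = recombine_rows(U, s, use_sqrt=True, clamp_negatives=True)

print(new_default)   # [[12.5, -8.0, 4.5, -2.0], [12.5, 8.0, -4.5, -2.0]]
print(old_behavior)  # [[2.5, 0.0, 1.5, 0.0], [2.5, 2.0, 0.0, 0.0]]
```

With these made-up values, the two rows differ in two columns under the proposed defaults but in only one under the old behavior, because the clamping wipes out the other difference and the square root shrinks the spread between columns.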
So, stay tuned. The good news is that I think this might actually cause our results from SVD to be a bit more dramatic than they have been thus far. Our experience over the years has been that SVD does not seem to have too much of an effect on overall results, but I think that might have been because we were diluting the resulting information somewhat via these operations.

Now, this still does not address the issue of SVDLIBC versus SVDPACKC, but no matter which direction we go there, svdpackout.pl will remain a part of the picture. In fact, my goal is that if we make a change in how we are computing SVD, it would be largely invisible to the user.

Comments and questions are of course welcome on this.

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
