I just did an exercise of implementing a faster sparse vector. In the process, I uncovered a bunch of warts. I will be filing some JIRA issues and patches as soon as I can get to them, but here is a preview:
a) Most serialization code for vectors would write vectors with null names as if they had names of "". This causes grief in tests and seems wrong.

b) The Vector/SparseVector hierarchy was oddly split. I added a HashVector and moved the current SparseVector into IntDoubleMappingVector, with SparseVector remaining as an abstract class. This unfortunately caused lots of upheaval all the way up into the Vector class, which I have yet to sort out cleanly.

c) The squared distance functions were defined in multiple places. I centralized them into SquaredEuclideanDistance as a static function. These definitions were also wrong: they would ignore any components that had opposite sign and equal magnitude. In fixing this, I went ahead and wrote an implementation that makes use of any sparsity present. To do that, I added a method to Vector so that I could tell whether a vector is sparse. This subsumes all of the distance optimizations that we have discussed, and it makes it very easy for new code to use these optimizations without knowing about them.

d) The nonZeroIterator had to be substantially refactored to work in an abstract class without knowing too much about the internals of everything.

e) To make sure that all metrics work reasonably with all types, I have heavily refactored the testing structure for metrics, and I will be doing more of this. The goal is to test all vector types against all distance metrics to make sure that they give the same results as DenseVector. Then I will build tests for DenseVector to verify that it produces the correct result. This is important because so many vectors have special code to help metric computation.

f) I have uncovered some strangeness in the ARFF I/O code that I introduced with the SparseVector abstraction. The old code will work as it did, but it won't understand the new HashVector.

So... I will be trying to break this down into pieces as small as I can, but the total will be a bunch of interdependent patches.
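To illustrate the bug in item c): a sparse distance computation that only walks one vector's non-zero entries silently drops components where the two vectors cancel (e.g. a[i] = 3, b[i] = -3 contributes 36, not 0). Below is a minimal standalone sketch, using plain index-to-value maps rather than the actual Mahout Vector API (the class and method names here are illustrative, not the real patch): it iterates over the union of non-zero indices, so opposite-sign components are counted correctly, while still doing work proportional to the sparsity rather than the dimensionality.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only -- sparse vectors modeled as index -> value maps.
public class SquaredEuclideanDistance {
    public static double distance(Map<Integer, Double> a, Map<Integer, Double> b) {
        // Union of non-zero indices: skipping indices present only in b
        // is exactly the opposite-sign/equal-magnitude bug.
        Set<Integer> indices = new HashSet<>(a.keySet());
        indices.addAll(b.keySet());
        double sum = 0;
        for (int i : indices) {
            double d = a.getOrDefault(i, 0.0) - b.getOrDefault(i, 0.0);
            sum += d * d;
        }
        return sum;
    }
}
```

Cost is proportional to the combined number of non-zeros, which is why a vector needs to advertise its own sparsity so the metric can pick this path over a dense scan.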
If anybody can help me apply these as quickly as possible, we should have minimal problems. If it drags out, it will get hard to keep rebasing the patch sequence all the time. -- Ted Dunning, CTO DeepDyve