I just did an exercise of implementing a faster sparse vector.  In the
process, I uncovered a bunch of warts.  I will be filing some Jiras and
patches as soon as I can get to them, but here is a preview:

a) most serialization code for vectors would write vectors with null names
out as if they had names of "".  This causes grief in tests and seems wrong.

b) the Vector/SparseVector hierarchy was oddly split out.  I added a
HashVector and moved the current SparseVector into IntDoubleMappingVector
with SparseVector remaining as an abstract class.  This unfortunately caused
lots of upheaval even up into the Vector class.  I have yet to sort this out
cleanly.

c) the squared distance functions were defined in multiple places.  I
centralized these into SquaredEuclideanDistance as a static function.  These
definitions also were wrong and would ignore any components that had
opposite sign and equal magnitude.  In fixing this, I went ahead and wrote
an implementation that makes use of any sparsity present.  To do that, I
added a method to Vector so that I could tell if a vector is sparse.  This
subsumes all of the distance optimizations that we have discussed.  It also
makes it very easy for new code to use these optimizations without knowing
about them.

d) the nonZeroIterator had to be substantially refactored to work in an
abstract class without knowing to much about the internals of everything.

e) in order ot make sure that all metrics worked reasonably with all types,
I have heavily refactored the testing structure for metrics.  I will be
doing more of this as well.  The goal is to test all vector types against
all distance metrics to make sure that they give the same results as
DenseVector.  Then, I will build tests for DenseVector to verify that it
produces the correct result.  This is important because so many vectors have
special code to help metric computation.

f) I have uncovered some strangeness in the ARFF I/O code that I introduced
with the SparseVector abstraction.  The old code will work as it did, but it
won't understand the new HashVector.


Soo....

I will be trying to break this down into as small a pieces as I can, but the
total will be a bunch of interdependent patches.  If anybody can help me
apply these as quickly as possible, we should have minimal problems with it
all.  If it drags out, it will get hard to keep rebasing the patch sequence
all the time.



-- 
Ted Dunning, CTO
DeepDyve

Reply via email to