On Thu, Oct 1, 2009 at 10:10 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Btw... the other think that the HashVector does better is inserts. The > sorted vector could do much better on average if it deferred sorting until > an access or iteration was done. Even iteration doesn't necessarily need > sorting, but it could by seen as part of the contract. > Iteration doesn't *need* sorting, but I agree that it should be part of the contract because doing fast dot products of two OrderedIntDoublePair instances really needs ordering, so you can just walk them both in parallel. In decomposer, I ended up having an ImmutableSparseVector subclass which did the sorting in the constructor (if it wasn't already sorted) and then didn't allow further inserts/deletes/modifications. Speaking of other little niceties: caching the norm is something I found pretty useful for a vector, and keeping track of a boolean "isDirty" which lets you know whether you've invalidated it and will need to recalculate on the next call to norm(). Of course, to do this right in Mahout, we'd probably need to have some way of telling the Vector instance what to use for its norm (so it knows whether to cache it's L^2 norm, L^p norm, or some other inner product applied to itself). Which gets me thinking: maybe we should have an InnerProduct interface, similar to DistanceMeasure, which defined how to compute dot(). As it is, we basically assume in our API that while you may want norm(int p) for various p, and you may want different DistanceMeasure impls of various types, we say that dot() is always the Euclidean dot(). Should that be pluggable as well?