I am sure that our SparseVector stuff is at least twice as slow as it
should be.

My preliminary experiments indicate that we should be able to get
within about a factor of 2 of the speed of a DenseVector. I am pretty
sure our current SparseVector is a long way from there. Shashi's
experience indicates that replacing SV can give us a near doubling of
overall performance in some cases.

Remember how much difference the distance optimization made? By using
sparsity well, we can probably get a 100x speedup in k-means for some
datasets, because the distance computation uses cached values for the
squared magnitudes of the centroids and only iterates over the
non-zero data values.

I also expect that there *has* to be code that we have that conses way
too much. And I am sure you remember how much difference it makes to
have better data storage in Taste.

For many applications, we can go further and have binary sparse
matrices. These don't even need to store the data, because if we know
a value is non-zero, we also know its value. With integer compression
and no data storage cost, we might be able to store 10x more data in
memory and get an astronomical speedup for large data sets.

And then there are on-line algorithms like vowpal. Using these often
results in many orders of magnitude speedup (5 orders is not
impossible for some problems). Even if the number of operations for an
on-line algorithm is twice as large, the effect of passing through the
data completely sequentially can make four orders of magnitude of
difference in speed.

So, my answer to this is that I can't imagine that we don't have 2-10x
speedups in just the low-hanging fruit. I have put rough sketches of
the k-means trick, the binary sparse idea, and the on-line pass below
my sig.

On Fri, Sep 4, 2009 at 3:54 PM, Sean Owen <sro...@gmail.com> wrote:
> (Do you really think some implementations are twice as slow as they
> need be? can't tell if that's good or bad news. )

-- 
Ted Dunning, CTO
DeepDyve
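
The k-means trick, concretely. This is only a sketch, not our actual
Vector API: assume a sparse point stored as parallel sorted index/value
arrays and a dense centroid. Expanding ||x - c||^2 = ||x||^2 - 2 x.c +
||c||^2 lets us cache ||c||^2 once per centroid per iteration, so the
per-point work is just a dot product over the non-zeros of x:

public class SparseDistanceSketch {
  // Sparse dot product: only touches the non-zero entries of x.
  static double dot(int[] idx, double[] val, double[] centroid) {
    double sum = 0;
    for (int i = 0; i < idx.length; i++) {
      sum += val[i] * centroid[idx[i]];
    }
    return sum;
  }

  // Computed once per centroid per iteration, not once per point.
  static double squaredNorm(double[] centroid) {
    double sum = 0;
    for (double v : centroid) {
      sum += v * v;
    }
    return sum;
  }

  // Squared distance in O(non-zeros of x) given the cached norms.
  static double squaredDistance(int[] idx, double[] val, double xNormSq,
                                double[] centroid, double cNormSq) {
    return xNormSq - 2 * dot(idx, val, centroid) + cNormSq;
  }

  public static void main(String[] args) {
    int[] idx = {0, 3};
    double[] val = {1.0, 2.0};
    double[] centroid = {0.5, 0.0, 0.0, 1.5};
    double xNormSq = 1.0 * 1.0 + 2.0 * 2.0;
    double cNormSq = squaredNorm(centroid);
    // Brute force check: (1-0.5)^2 + 0 + 0 + (2-1.5)^2 = 0.5
    System.out.println(squaredDistance(idx, val, xNormSq, centroid, cNormSq));
  }
}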
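
For the binary sparse idea, here is a toy (I am not proposing this
exact layout): keep the indices sorted, make the values implicitly 1,
and varint-code the gaps between successive indices so that the common
small gaps cost one byte each. The dot product decodes on the fly and
never multiplies, because the value is always 1:

import java.io.ByteArrayOutputStream;

public class BinarySparseSketch {
  // Gap-encode sorted indices with a simple varint (7 bits per byte).
  static byte[] compress(int[] sortedIndices) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int index : sortedIndices) {
      int gap = index - prev;      // small gaps -> few bytes
      prev = index;
      while (gap >= 0x80) {
        out.write((gap & 0x7f) | 0x80);
        gap >>>= 7;
      }
      out.write(gap);
    }
    return out.toByteArray();
  }

  // Dot product with a dense vector: decode gaps on the fly and just
  // add the corresponding weights (no stored values, no multiplies).
  static double dot(byte[] packed, double[] dense) {
    double sum = 0;
    int index = 0;
    for (int i = 0; i < packed.length; ) {
      int gap = 0, shift = 0, b;
      do {
        b = packed[i++] & 0xff;
        gap |= (b & 0x7f) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      index += gap;
      sum += dense[index];
    }
    return sum;
  }

  public static void main(String[] args) {
    byte[] v = compress(new int[] {2, 5, 130});   // 3 bytes total
    double[] w = new double[200];
    w[2] = 1.0; w[5] = 2.0; w[130] = 3.0;
    System.out.println(dot(v, w));                // prints 6.0
  }
}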
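
And the on-line point, again as a toy. This is not vowpal, just the
shape of the idea: one strictly sequential pass over the examples in
storage order, with a logistic regression SGD update that touches only
the non-zero features of each example. The learning rate and data here
are made up:

public class OnlinePassSketch {
  public static void main(String[] args) {
    double[] w = new double[1000];    // model weights
    double rate = 0.1;                // learning rate (made up)

    // Stand-in for streaming examples off disk in storage order;
    // each example is its non-zero feature indices, all values = 1.
    int[][] examples = {{2, 5}, {3, 7}, {2, 7}};
    int[] labels = {1, 0, 1};

    for (int n = 0; n < examples.length; n++) {   // single sequential pass
      double score = 0;
      for (int j : examples[n]) {
        score += w[j];                            // sparse dot product
      }
      double p = 1.0 / (1.0 + Math.exp(-score));  // logistic prediction
      double g = labels[n] - p;                   // per-example gradient
      for (int j : examples[n]) {
        w[j] += rate * g;                         // update only non-zeros
      }
    }
    System.out.println(w[2] + " " + w[7]);
  }
}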
My preliminary experiments indicate that we should be able to get within about a factor of 2 of the speed of a DenseVector (ish). I am pretty sure our current SparseVector is a long ways from there. Shashi's experience indicates that replacing SV can give us a near doubling of overall performance in some cases. Remember how much difference the distance optimization made? By using sparsity well, we can probably get 100x speedup in k-means for some datasets because the distance computation uses cached values for the squared magnitude of the centroids and only iterates over the non-zero data values. I also expect that there *has* to be code that we have that conses way too much. And I am sure you remember how much difference it makes to have better data storage in Taste. For many applications, we can go further and have binary sparse matrices. These don't even need to store the data because if we know a value is non-zero, we also know its value. With integer compression and no data storage cost, we might be able to store 10x more data in memory and get an astronomical speed up for large data sets. And then there are on-line algorithms like vowpal. Using these often results in many orders of magnitude speedup (5 orders is not impossible for some probablems). Even if the number of operations for an on-line algorithm is twice as large, the effect of passing through the data completely sequentially can make four orders of magnitude difference in speed. So, my answer to this is that I can't imagine that we don't have 2-10x speedups in just the low-hanging fruit. On Fri, Sep 4, 2009 at 3:54 PM, Sean Owen <sro...@gmail.com> wrote: > > (Do you really think some implementations are twice as slow as they > need be? can't tell if that's good or bad news. ) -- Ted Dunning, CTO DeepDyve