On Sat, Jul 20, 2013 at 5:52 AM, John R. Frank <[email protected]> wrote:

> Hi scikit-learn experts,
>
> I am using the sparse matrices generated by
> sklearn.feature_extraction.FeatureHasher
>
> and want to compute cosine distances between feature vectors.  What is the
> best way to do this?
>
> When I was hacking on this a month ago, I found that the logic in the CSR
> sparse matrix was causing slow dot products, so I wrote the version below
> that uses element-wise multiplication.
>
> Is this a good way to compute cosine distances between feature hashed
> counters?
>
>
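> import math
> import scipy.sparse
> from sklearn.feature_extraction import FeatureHasher
>
> class FeatureHashingCounter(object):
>      _default_num_features = 2**31 - 1
>      _hasher = FeatureHasher(_default_num_features, input_type='dict',
>                              non_negative=False)
>
>      def __init__(self, data):
>          # one-row sparse matrix as wide as the hashed feature space
>          self._matrix = scipy.sparse.csr_matrix(
>              (1, self._default_num_features))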
>
>
> def smart_dot( fhc1, fhc2 ):
>      ## use element-wise multiplication
>      return fhc1._matrix.multiply( fhc2._matrix ).sum()
>
>
> def cosine( fhc1, fhc2 ):
>      dot = smart_dot(fhc1, fhc2)
>
>      norm1 = math.sqrt(smart_dot(fhc1, fhc1))
>      norm2 = math.sqrt(smart_dot(fhc2, fhc2))
>
>      result = float(dot)/norm1/norm2
>
>      return max(result, 0)
>


It looks like you've implemented cosine for two single vectors. Often one
wants the cosines between every pair of vectors drawn from two collections.
That is the case handled by sklearn.metrics.pairwise.cosine_similarity.
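
As a rough sketch of that pairwise case: the sample dicts and the
n_features value below are made up for illustration, and the hasher is
left at its default settings rather than the ones in your class.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical token counts, one dict per document.
counts = [
    {"apple": 2, "banana": 1},
    {"apple": 4, "banana": 2},
    {"cherry": 3},
]

# n_features is an arbitrary illustrative choice.
hasher = FeatureHasher(n_features=2**20, input_type="dict")
X = hasher.transform(counts)      # scipy.sparse matrix, one row per dict

# All-pairs cosine similarity; sparse input is accepted directly.
sims = cosine_similarity(X)       # dense array of shape (3, 3)

# Cosine distance is 1 minus cosine similarity.
dists = 1.0 - sims
```

Passing a second matrix, cosine_similarity(X, Y), compares every row of X
with every row of Y instead of X with itself.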


> Also, I'm curious about truncation: is there a clean way to delete features
> that have low counts?  My current implementation involves sorting on
> count, truncating, and then sorting again on the sparse matrix indices to
> make a valid sparse matrix again.  Is there a better way?
>

I'm not certain I understand what your approach involves, but I assume you
mean to keep the same column indices so that feature hashing still works.
There are many ways to do it; which is best comes down to what you mean by
clean (clean code? low copying?) and what the input format is.
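
One possible reading, sketched under the assumption that the counts are
already in a scipy CSR matrix and that "low" means an absolute value below
some threshold: zero out the small stored entries and let scipy compact the
matrix. The function name drop_low_counts and the min_count parameter are
mine, not part of any library.

```python
import numpy as np
import scipy.sparse as sp


def drop_low_counts(X, min_count):
    """Return a CSR copy of X without entries whose |value| < min_count."""
    X = sp.csr_matrix(X, copy=True)          # leave the caller's matrix untouched
    # abs() because hashed features may carry a negative sign.
    X.data[np.abs(X.data) < min_count] = 0
    X.eliminate_zeros()                      # drop the stored zeros in place
    return X


# Hypothetical 1 x 10 count vector with three stored entries.
row = sp.csr_matrix(([5.0, 1.0, -3.0], ([0, 0, 0], [2, 4, 7])), shape=(1, 10))
pruned = drop_low_counts(row, min_count=2)
```

The column indices of the surviving entries are unchanged, so the result
stays comparable with other hashed vectors, and no sorting is needed.
Keeping only the k largest counts would need a different threshold per row.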

- Joel