Dear Ian,

I guess this comes from the assumption that these types of pairwise
similarity matrices are dense in many use cases. E.g. a Gaussian
kernel matrix is completely dense. Many kernel methods also expect dense
input. But it is true that this latter fact shouldn't necessarily be
imposed on all similarity measures if there is a possibility of sparse
output...
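
For what it's worth, here is a minimal sketch of the sparse variant described
in the quoted message below: normalize the rows, then take the dot product
with dense_output=False. The helper name sparse_cosine_similarity and the toy
matrix are only illustrative and are not part of scikit-learn. I am also
assuming the input is a scipy CSR matrix. The result only stays small if most
pairs of rows share no non-zero columns.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize
from sklearn.utils.extmath import safe_sparse_dot


def sparse_cosine_similarity(X):
    """Cosine similarity between the rows of X, returned as a sparse matrix.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # L2-normalize each row; for sparse input the result stays sparse.
    X_normalized = normalize(X, norm="l2", axis=1, copy=True)
    # dense_output=False keeps the sparse-by-sparse product sparse.
    return safe_sparse_dot(X_normalized, X_normalized.T, dense_output=False)


# Toy input: rows 0 and 2 point in the same direction, row 1 is orthogonal.
X = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0],
                                [0.0, 2.0, 0.0],
                                [3.0, 0.0, 0.0]]))
K = sparse_cosine_similarity(X)
print(sparse.issparse(K))
print(K.toarray())
```

Anything downstream that expects a dense array would still need K.toarray(),
which brings the memory cost back.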

Michael

On Thu, Nov 27, 2014 at 5:26 PM, Ian Ozsvald <[email protected]> wrote:

> Hey all. I'm working on distance measurement in a reasonably
> high-dimensional but very sparse space (a 1.3 million x 35k matrix). At
> this size my 16GB laptop runs out of memory. I've walked back through the
> code and noticed something I don't understand.
>
> sklearn.metrics.pairwise_distances('cosine') calls
> pairwise.cosine_similarity, which takes sparse inputs and preserves their
> sparsity until the final call:
>
> def cosine_similarity(X, Y):
>     # both inputs are CSR sparse, from a DictVectorizer(..., sparse=True)
>     X_normalized = normalize(...)  # sparse result
>     Y_normalized = X_normalized  # as both inputs are the same, still sparse
>     K = linear_kernel(X_normalized, Y_normalized)
>
> -> linear_kernel(X_normalized, Y_normalized)
>    calls safe_sparse_dot(X, Y.T, dense_output=True)
>
> and then the result is forced to be dense.
>
> If safe_sparse_dot is called with dense_output=False then I get a sparse
> result and everything looks sensible with low RAM usage.
>
> I'm using 0.15; the current GitHub master shows the line:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L692
>
> Was there a design decision to force dense matrices at this point? Maybe
> some call paths assume a dense result?
>
> Ian.
>
> --
> Ian Ozsvald (A.I. researcher)
> [email protected]
>
> http://IanOzsvald.com
> http://ModelInsight.io
> http://MorConsulting.com
> http://Annotate.IO
> http://SocialTiesApp.com
> http://TheScreencastingHandbook.com
> http://FivePoundApp.com
> http://twitter.com/IanOzsvald
> http://ShowMeDo.com
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>