Hi Lars,

So does that mean you agree that sorting the indices in
FeatureHasher.transform is inconvenient, and are you suggesting that we
use a dot product to sum the duplicates instead?
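
To make the question concrete, here is a minimal sketch on toy data. It is
not the FeatureHasher code itself: the values, indices and the weight
vector w are made up for illustration, and it assumes NumPy and SciPy are
installed. It builds the same CSR row twice, calls sum_duplicates() on one
copy only, and compares a sparse-times-dense dot product on both.

```python
import numpy as np
import scipy.sparse as sp


def make_row():
    # One sample with three hashed tokens; column 3 is hit twice.
    # (Toy data, chosen only for illustration.)
    values = np.array([1.0, 2.0, 4.0])
    indices = np.array([3, 1, 3], dtype=np.int32)
    indptr = np.array([0, 3], dtype=np.int32)
    return sp.csr_matrix((values, indices, indptr), shape=(1, 5))


X_raw = make_row()        # indices stay in insertion order, duplicates kept
X_canon = make_row()
X_canon.sum_duplicates()  # in place: merges duplicates and sorts the indices

w = np.arange(5, dtype=np.float64)  # hypothetical dense weight vector

print("raw indices:      ", X_raw.indices.tolist())
print("canonical indices:", X_canon.indices.tolist())
print("raw dot w:        ", X_raw.dot(w))
print("canonical dot w:  ", X_canon.dot(w))
```

This only checks one sparse-times-dense product on a tiny matrix. It says
nothing about speed, or about other operations that may expect sorted
indices, so it is not a substitute for the report you asked for.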

--Terry

On Apr 10, 2013, at 5:47 PM, Lars Buitinck <[email protected]> wrote:

> 2013/4/10 Terry Peng <[email protected]>:
>> Hi Lars Buitinck,
> 
> Replying to the ML; please send this kind of message there next time.
> 
>> I thought the order of the words would be the same as the order of the
>> indices after FeatureHasher.transform, but it turns out it's not. The
>> reason is the sum_duplicates call in FeatureHasher:
>> 
>>        X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
>>                          shape=(n_samples, self.n_features))
>>        X.sum_duplicates()  # also sorts the indices
>> 
>> which was added by your change f5a4ad2bfc3c7e487c3855abfc8f83b670d89d0c
>> (ENH speed up hashing and reduce memory usage by 1/3).
>> sum_duplicates not only sums the values of duplicated indices, it also
>> sorts the indices in ascending order. I think it would be more
>> convenient not to sort the indices, so that we can easily get the
>> features back from the indices.
> 
> I'm not sure what effect that would have on dot products performed
> with FeatureHasher output. In the best case, they'd be much slower. In
> the worst case, they'd break. Before we implement anything, I'd like
> to see how slow/broken the resulting CSR matrices become. Feel free to
> try it out and send us a report.
> 
> -- 
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general