Hi Lars, so does this mean you agree that sorting the indices in FeatureHasher.transform is inconvenient, and are you suggesting that a dot product be used to sum the duplicates instead?
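To make the behaviour concrete, here is a minimal sketch using only scipy. The toy values, indices, and the weight vector `w` are made up for illustration and do not come from FeatureHasher itself; the point is only to show what `sum_duplicates` does to a CSR matrix and that a matrix-vector product gives the same result before and after on this small case.

```python
import numpy as np
import scipy.sparse as sp

# Toy single-row input, made up for illustration: column 5 appears twice,
# and the columns are deliberately not in ascending order.
values = np.array([1.0, 2.0, 3.0])
indices = np.array([5, 2, 5], dtype=np.int32)
indptr = np.array([0, 3], dtype=np.int32)

X = sp.csr_matrix((values, indices, indptr), shape=(1, 8))
indices_before = X.indices.tolist()  # insertion order, duplicates kept

w = np.arange(8, dtype=float)        # hypothetical weight vector
dot_before = X.dot(w)

X.sum_duplicates()                   # merges duplicates in place; also sorts the indices
indices_after = X.indices.tolist()
data_after = X.data.tolist()
dot_after = X.dot(w)

print(indices_before, indices_after, data_after)
print(np.allclose(dot_before, dot_after))
```

On this input the column indices go from [5, 2, 5] to [2, 5], so the original token order is lost, while the product with `w` is unchanged. That says nothing about speed or about other operations on unsorted matrices, which would still need to be measured.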
--Terry

On Apr 10, 2013, at 5:47 PM, Lars Buitinck <[email protected]> wrote:

> 2013/4/10 Terry Peng <[email protected]>:
>> Hi Lars Buitinck,
>
> Replying to the ML, please send this kind of message there next time.
>
>> I thought the order of the words was the same as the order of the
>> indices after FeatureHasher.transform, but it turns out it's not. The
>> reason is sum_duplicates in FeatureHasher:
>>
>>     X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
>>                       shape=(n_samples, self.n_features))
>>     X.sum_duplicates()  # also sorts the indices
>>
>> which was added by your change f5a4ad2bfc3c7e487c3855abfc8f83b670d89d0c
>> (ENH speed up hashing and reduce memory usage by 1/3).
>> sum_duplicates not only sums the values of duplicated indices, it also
>> sorts the indices in natural order (from small to large). I think it's
>> more convenient not to sort the indices, so we can easily get the
>> features back from the indices.
>
> I'm not sure what effect that would have on dot products performed
> with FeatureHasher output. In the best case, they'd be much slower. In
> the worst case, they'd break. Before we implement anything, I'd like
> to see how slow/broken the resulting CSR matrices become. Feel free to
> try it out and send us a report.
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
