Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-30 Thread Joel Nothman
A very good point! (Although augmented and log-average tf both do some kind of normalisation of the tf distribution before IDF weighting.)

Re: [scikit-learn] DBSCAN Border Points

2018-01-30 Thread Joel Nothman
It includes non-core points, but not points that are out of eps from any core point. You can modify eps and min_samples. But perhaps you should just choose a different clustering algorithm if this is behaviour you absolutely do not want. On 30 January 2018 at 23:24, AMIR SHANEHSAZZADEH < amir.p.sh
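A minimal sketch of the behaviour described in this reply, on toy one-dimensional data chosen only for illustration (the points and the eps and min_samples values are assumptions, not taken from the thread). It shows a border point receiving a cluster label even though it is not a core sample, while a point farther than eps from every core point is labelled as noise (-1).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (illustrative assumption): three close points, one point near
# the edge of that group, and one isolated point.
X = np.array([[0.0], [0.4], [0.8], [1.6], [10.0]])

db = DBSCAN(eps=1.0, min_samples=3).fit(X)

labels = db.labels_                     # cluster label per point, -1 means noise
core_idx = set(db.core_sample_indices_.tolist())

# Border points: assigned to a cluster but not core samples themselves.
border_idx = [i for i, lab in enumerate(labels)
              if lab != -1 and i not in core_idx]
# Noise points: not within eps of any core point.
noise_idx = [i for i, lab in enumerate(labels) if lab == -1]

print("labels:", labels.tolist())
print("core:", sorted(core_idx))
print("border:", border_idx)
print("noise:", noise_idx)
```

Here the point at 1.6 has only two neighbours within eps (itself and 0.8), so it is not a core sample, but it lies within eps of the core point at 0.8 and therefore joins that cluster. Raising eps or lowering min_samples changes which points count as core, border, or noise.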

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-30 Thread Roman Yurchak
Hi Yacine, On 29/01/18 16:39, Yacine MAZARI wrote: >> I wouldn't hate if length normalisation was added to if it was shown that normalising before IDF multiplication was more effective than (or complementary >> to) norming afterwards. I think this is one of the most important points here. T
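To make the "before versus after" distinction concrete, here is a rough sketch comparing the two orderings with existing scikit-learn pieces. The toy documents are an assumption for illustration, and applying `normalize` to the counts by hand stands in for the proposed option; it is not the change discussed in the thread.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import normalize

# Toy corpus (illustrative assumption).
docs = [
    "apple apple banana",
    "banana cherry",
    "apple cherry cherry cherry date",
]

counts = CountVectorizer().fit_transform(docs)

# Norming afterwards: multiply counts by IDF, then L1-normalise each row.
after = TfidfTransformer(norm="l1").fit_transform(counts)

# Normalising before: L1-normalise the counts (length-normalised tf),
# then multiply by IDF without any further row normalisation.
tf = normalize(counts, norm="l1")
before = TfidfTransformer(norm=None).fit_transform(tf)

print(after.toarray().round(3))
print(before.toarray().round(3))
```

With the same norm used in both places, each row of `before` is a positive rescaling of the corresponding row of `after`, so the two differ in row magnitude but not in direction. Whether that difference matters depends on the downstream model, which is the empirical question raised in the thread.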

Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-30 Thread Yacine MAZARI
Okay, thanks for the replies. @Joel: Should I go ahead and send a PR with the change to TfidfTransformer? On Tue, Jan 30, 2018 at 5:27 AM, Joel Nothman wrote: > I don't think you will do this without an O(N) cost. The fact that it's > done with a second pass is moot. > > My position stands: if

[scikit-learn] DBSCAN Border Points

2018-01-30 Thread AMIR SHANEHSAZZADEH
Hello, I am working with the latest implementation of DBSCAN. I believe that scikit-learn's implementation does not include non-core points in clusters. This results in border points not being included in clusters. Is there any way to remedy this issue so that border points are included in their r