Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-06 Thread Lars Buitinck
2013/6/6 Joel Nothman : > Or perhaps the docs should consider including a glossary that translates > some of these meanings and specifies what is preferred for sklearn > development/documentation. Something like this? https://github.com/scikit-learn/scikit-learn/wiki/Glossary (On the wiki for now

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-05 Thread Joel Nothman
Or perhaps the docs should consider including a glossary that translates some of these meanings and specifies what is preferred for sklearn development/documentation. On Thu, Jun 6, 2013 at 2:17 AM, Andreas Mueller wrote: > On 06/04/2013 08:27 PM, Tom Fawcett wrote: > > On Jun 4, 2013, at 2:38 A

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-05 Thread Andreas Mueller
On 06/04/2013 08:27 PM, Tom Fawcett wrote: > On Jun 4, 2013, at 2:38 AM, Lars Buitinck wrote: > >> 2013/6/4 Joel Nothman : >>> NLP folks pass the blame to IR folks :P >> ... and IR folks always mean absolute frequency, unless stated otherwise. > Coming from ML, I’ve seen it used as both absolute a

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-04 Thread Tom Fawcett
On Jun 4, 2013, at 2:38 AM, Lars Buitinck wrote: > 2013/6/4 Joel Nothman : >> NLP folks pass the blame to IR folks :P > > ... and IR folks always mean absolute frequency, unless stated otherwise. Coming from ML, I’ve seen it used as both absolute and relative. ML (and sklearn) is at the junct

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-04 Thread Lars Buitinck
2013/6/4 Joel Nothman : > NLP folks pass the blame to IR folks :P ... and IR folks always mean absolute frequency, unless stated otherwise. -- Lars Buitinck Scientific programmer, ILPS University of Amsterdam

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Joel Nothman
On Tue, Jun 4, 2013 at 12:14 AM, Andreas Mueller wrote: > On 06/03/2013 04:09 PM, Lars Buitinck wrote: > > 2013/6/3 Andreas Mueller : > >> I named the variable, I think, and it is a bad name :-( > >> Should we rename it? > >> > >> I think giving a count makes more sense than giving a frequency: yo

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Andreas Mueller
On 06/03/2013 04:09 PM, Lars Buitinck wrote: > 2013/6/3 Andreas Mueller : >> I named the variable, I think, and it is a bad name :-( >> Should we rename it? >> >> I think giving a count makes more sense than giving a frequency: you want to >> exclude outliers that appear only once or twice for exam

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Lars Buitinck
2013/6/3 Andreas Mueller : > I named the variable, I think, and it is a bad name :-( > Should we rename it? > > I think giving a count makes more sense than giving a frequency: you want to > exclude outliers that appear only once or twice for example. I actually hadn't seen this reply. It's not a
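
The difference between a count and a frequency threshold discussed in this part of the thread can be illustrated with a small sketch. It is not scikit-learn code: the toy documents and the helper names `document_counts`, `keep_by_count`, and `keep_by_proportion` are hypothetical, chosen only to show how an absolute threshold (a number of documents) and a relative one (a fraction of documents) select terms.

```python
from collections import Counter

# Hypothetical toy corpus; each string is one whitespace-tokenized document.
docs = [
    "apple banana cherry",
    "apple banana",
    "apple durian",
    "apple banana elderberry",
]


def document_counts(documents):
    """Return, for each term, the number of documents that contain it."""
    counts = Counter()
    for doc in documents:
        counts.update(set(doc.split()))  # set(): count each term once per document
    return counts


def keep_by_count(documents, min_count):
    """Keep terms that occur in at least `min_count` documents (absolute)."""
    counts = document_counts(documents)
    return {term for term, n in counts.items() if n >= min_count}


def keep_by_proportion(documents, min_fraction):
    """Keep terms that occur in at least `min_fraction` of the documents (relative)."""
    counts = document_counts(documents)
    total = len(documents)
    return {term for term, n in counts.items() if n / total >= min_fraction}


# Absolute threshold: drops terms seen in only one document.
frequent_abs = keep_by_count(docs, 2)

# Relative threshold: the same cut-off expressed as a fraction of the corpus.
frequent_rel = keep_by_proportion(docs, 0.8)
```

With an absolute threshold of 2, the terms that occur in a single document are dropped regardless of corpus size. The relative threshold changes meaning as the corpus grows, which is one reason the two readings of "frequency" are worth keeping apart.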

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-03 Thread Lars Buitinck
2013/6/2 Harold Nguyen : > http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html > Does TfidfVectorizer take a sequence of filenames, where each file is just a > plain text file ? Depends on the parameter input (the first in the list). In the example, I
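
A minimal sketch of how the `input` parameter changes what `TfidfVectorizer` expects, assuming a recent scikit-learn release. The toy texts and the temporary files are made up for illustration; only the `input="filename"` and `input="content"` settings come from the scikit-learn API.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, written to temporary files for illustration.
texts = ["the cat sat on the mat", "the dog chased the cat"]

with tempfile.TemporaryDirectory() as tmpdir:
    filenames = []
    for i, text in enumerate(texts):
        path = os.path.join(tmpdir, "doc%d.txt" % i)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        filenames.append(path)

    # input="filename": each item is a path that the vectorizer opens and reads.
    X_from_files = TfidfVectorizer(input="filename").fit_transform(filenames)

# input="content" (the default): each item is the document text itself.
X_from_text = TfidfVectorizer(input="content").fit_transform(texts)
```

Both calls should produce the same tf-idf matrix here, since the files contain exactly the strings in `texts`.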

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-02 Thread Andreas Mueller
On 06/02/2013 08:48 PM, Harold Nguyen wrote: Hi Lars, Thank you very much for this response. Please excuse my questions since I'm new. From here the document on TfidfVectorizer here: http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Does Tfidf

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-02 Thread Harold Nguyen
Hi Lars, Thank you very much for this response. Please excuse my questions since I'm new. From here the document on TfidfVectorizer here: http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html Does TfidfVectorizer take a sequence of filenames, where

Re: [Scikit-learn-general] Clustering of Text Documents

2013-06-02 Thread Lars Buitinck
2013/6/1 Harold Nguyen : > I was wondering if anyone can point me to a tutorial on clustering text > documents, but then also displaying the results in a graph ? I see some > examples on clustering text documents, but I'd like to be able to visualize > the clusters. You'll need dimensionality redu
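
A rough sketch of the general idea of reducing tf-idf vectors to two dimensions so that clusters can be drawn on a scatter plot. The reply above is cut off, so the specific choices here (`TruncatedSVD` for the reduction, `KMeans` for the clustering, and the toy documents) are assumptions made for illustration, not necessarily what was recommended in the thread.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two loose topics.
docs = [
    "the cat sat on the mat",
    "cats and dogs are pets",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
]

# Turn the documents into a sparse tf-idf matrix (one row per document).
X = TfidfVectorizer().fit_transform(docs)

# Cluster in the original high-dimensional space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Project to 2 dimensions for plotting; TruncatedSVD accepts sparse input.
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# coords[:, 0] and coords[:, 1] can now be passed to a scatter plot,
# e.g. matplotlib.pyplot.scatter(coords[:, 0], coords[:, 1], c=labels).
```

The projection is only a visual aid: the clustering is done on the full tf-idf matrix, and points that look close in two dimensions may be further apart in the original space.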

[Scikit-learn-general] Clustering of Text Documents

2013-06-01 Thread Harold Nguyen
Hi all, I was wondering if anyone can point me to a tutorial on clustering text documents, but then also displaying the results in a graph ? I see some examples on clustering text documents, but I'd like to be able to visualize the clusters. Any help would be appreciated! Thank you, Harold