Ken Williams <[EMAIL PROTECTED]> wrote:
> I had a chance last week to read Yiming Yang's paper on feature set
> reduction:
> 
>   http://www.cs.cmu.edu/~yiming/papers.yy/icml97.ps.gz
> 
> It contains the startling conclusion that the single biggest factor in
> getting good results by reducing feature sets is to keep frequently-used
> features (after getting rid of a stopword set) and throw away rare
> features.  This algorithm was called "Document Frequency", because each
> term's "frequency" is defined as the number of corpus documents in which
> the term appears. 
[etc.]

Just a casual comment on this.  There has been a fair amount of work on text
classification in the past few years, comparing different representations and
algorithms.  I wouldn't take any individual study's conclusions as definitive,
since various papers have conflicting conclusions.  As one example, most
people think stopword elimination and stemming are effective, but Riloff makes
a case against doing them:

http://citeseer.nj.nec.com/riloff97little.html

I have no reason to question Yang's results; I'm just pointing out that text
classification is a big ball of wax.
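
For what it's worth, the Document Frequency criterion in the quoted paper is
simple to sketch.  Here's a minimal illustration (the corpus, stopword set,
and cutoff below are made up for the example, not taken from Yang's paper):
rank each term by the number of documents it appears in, drop stopwords, and
keep only terms at or above a cutoff.

```python
from collections import Counter

def df_select(docs, stopwords=frozenset(), min_df=2):
    """Keep terms appearing in at least min_df documents, minus stopwords."""
    df = Counter()
    for doc in docs:
        # Count each term once per document, ignoring within-document frequency.
        for term in set(doc.lower().split()) - stopwords:
            df[term] += 1
    return {t for t, n in df.items() if n >= min_df}

# Toy corpus for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
features = df_select(docs, stopwords={"the", "a", "on", "and"}, min_df=2)
# -> {'cat', 'sat', 'dog'}: 'mat' and 'log' each occur in only one document.
```

The appeal Yang notes is that this needs no class labels at all, unlike
information gain or chi-square selection.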

Regards,
-Tom
