Mark Miller wrote:
The only commercial options I have seen do not have a web presence (that I know of or can find), and I don't recall the company names (I was only peripherally involved).

Are we talking about Yahoo's Buzz Index and
Amazon's SIPs (statistically improbable phrases)
or CAPs (capitalized phrases)?

I actually think the most interesting application
of this is in the Scirus.com search engine, built by
Fast Search (a Lucene competitor) and Elsevier (the
publisher).  They extract phrases relative to a query
(I have guesses as to how they do this quickly) and
show them for query refinement.  For instance, the
query "text data mining" returns the following
"keywords" (among others):

        association rules       
        case-based reasoning    
        computational biology   
        data integration        
        data visualization      
        genomics        
        information access      
        information filtering   
        information integration

The standard way to tackle this problem (see, e.g.,
Manning and Schuetze's 1999 NLP textbook) is to
look for collocations -- terms whose co-occurrence
doesn't look random according to standard
independence tests (e.g. a t-test or chi-squared
test).  That is, do "data" and "visualization" show
up together more often than you would expect them to
in the results of the query "text data mining"?

Although Manning and Schuetze
don't really discuss it, you can also compare one
corpus to another (e.g. today's news to the last
month's to see what's newly hot today, or the top
1000 hits for a query relative to a whole collection).
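
Here's that corpus-vs-corpus idea as a toy Python
sketch (the smoothing constant and the word lists
are made up for illustration -- this is nobody's
production code):

    from collections import Counter
    from math import log

    def hot_terms(foreground, background, alpha=0.5, top=10):
        # Rank terms by the smoothed log-ratio of their relative
        # frequencies in the foreground vs. the background corpus.
        fg, bg = Counter(foreground), Counter(background)
        v = len(set(fg) | set(bg))          # combined vocabulary size
        fg_n, bg_n = sum(fg.values()), sum(bg.values())
        def score(term):
            p_fg = (fg[term] + alpha) / (fg_n + alpha * v)
            p_bg = (bg[term] + alpha) / (bg_n + alpha * v)
            return log(p_fg / p_bg)
        return sorted(fg, key=score, reverse=True)[:top]

    today = "genomics genomics data mining text mining genomics".split()
    last_month = "data mining text mining data integration".split()
    print(hot_terms(today, last_month))   # "genomics" ranks first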

You can find pretty much every version ever
put forward implemented in Ted Pedersen's
Ngram Statistics Package:

     http://www.d.umn.edu/~tpederse/nsp.html

which is in Perl, with lots of documentation,
manuals, and papers covering all the (very
easy) math.

These techniques are also very easy to implement --
as in first exercise in an undergrad computer
science class easy.  The only real issues are
(a) scaling and (b) heuristic pruning.  Popular
pruning options include keeping only nouns (as
determined by a part-of-speech tagger), only
capitalized phrases, or even only phrases appearing
after "the".  With enough pruning, scaling's easy.
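
For instance, the capitalization and after-"the"
filters might look like this in Python (a toy sketch
with deliberately naive tokenization; the noun-only
filter is omitted because it needs a real
part-of-speech tagger):

    import re

    def bigrams(tokens):
        return list(zip(tokens, tokens[1:]))

    def capitalized_only(cands):
        # Keep bigrams in which every word is capitalized
        return [c for c in cands if all(w[:1].isupper() for w in c)]

    def after_the(tokens):
        # Keep bigrams whose first word immediately follows "the"
        return [(a, b)
                for prev, a, b in zip(tokens, tokens[1:], tokens[2:])
                if prev.lower() == "the"]

    text = "The Data Mining group met the text mining group yesterday"
    tokens = re.findall(r"\w+", text)
    print(capitalized_only(bigrams(tokens)))
    # [('The', 'Data'), ('Data', 'Mining')]
    print(after_the(tokens))
    # [('Data', 'Mining'), ('text', 'mining')]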

We provide a tutorial in LingPipe:

    http://www.alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html

And here's a blog entry comparing our
hypothesis-testing approach to a standard
mutual-information-based method (discussed by
Matthew Hurst when he was at Nielsen BuzzMetrics):

    http://www.alias-i.com/blog/?p=14

- Bob Carpenter
  Alias-i
