Mark Miller wrote:
The only commercial options that I have seen do not have a web presence (that I know of or can find), and I don't recall the company names (I was only peripherally involved).
Are we talking about Yahoo's Buzz Index and Amazon's SIPs or CAPs? I actually think the most interesting application of this is in the Scirus.com search engine, built by Fast Search (a Lucene competitor) and Elsevier (the publisher). They extract phrases relative to a query (I have guesses as to how they do this quickly) and show those for query refinement. For instance, the query "text data mining" finds the following "keywords" (among others):

    association rules
    case-based reasoning
    computational biology
    data integration
    data visualization
    genomics
    information access
    information filtering
    information integration

The standard way to tackle this problem (see, e.g., Manning and Schuetze's 1999 NLP textbook) is to look for collocations -- terms that don't look random according to standard independence tests (e.g. a t-test or chi-squared test). That is, do "data" and "visualization" show up together more often than you would expect them to in the results of the query "text data mining"? (There's a quick sketch of the chi-squared version in the P.S. below.)

Although Manning and Schuetze don't really discuss it, you can also compare one corpus to another (e.g. today's news to the last month's to see what's newly hot today, or the top 1000 hits for a query relative to the whole collection).

You can find pretty much every variant ever put forward implemented in Ted Pedersen's Ngram Statistics Package:

    http://www.d.umn.edu/~tpederse/nsp.html

It's written in Perl, with plenty of documentation, manuals, and papers covering all the (very easy) math.

These techniques are also very, very easy to implement, as in first-exercise-in-an-undergrad-computer-science-class easy. The only real issues are (a) scaling and (b) heuristic pruning. Popular pruning options include keeping only nouns (as determined by a part-of-speech tagger), only capitalized phrases, or even only phrases appearing after "the". With enough pruning, scaling's easy.

We provide a tutorial in LingPipe:

    http://www.alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html

And here's a blog entry comparing our hypothesis-testing approach to a standard mutual-information-based method (discussed by Matthew Hurst when he was at Nielsen BuzzMetrics):

    http://www.alias-i.com/blog/?p=14

- Bob Carpenter
  Alias-i
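P.S. Since the math really is that easy, here's a minimal sketch of the bigram chi-squared test in Python (NSP is Perl and LingPipe is Java, but Python keeps the sketch short; the function names are mine, and none of this is LingPipe code). The example counts are the "new companies" numbers from Manning and Schuetze's collocations chapter.

    # Minimal sketch: Pearson's chi-squared independence test for a bigram.
    # c12 = count of the bigram "w1 w2", c1 = count of w1 as first word,
    # c2 = count of w2 as second word, n = total bigrams in the corpus.

    def chi_squared_2x2(o11, o12, o21, o22):
        """Pearson's chi-squared statistic for a 2x2 contingency table."""
        n = o11 + o12 + o21 + o22
        num = n * (o11 * o22 - o12 * o21) ** 2
        den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        return num / den if den else 0.0

    def bigram_chi_squared(c12, c1, c2, n):
        """Does w1 followed by w2 occur more often than independence predicts?"""
        return chi_squared_2x2(
            c12,                # w1 followed by w2
            c1 - c12,           # w1 followed by something else
            c2 - c12,           # something else followed by w2
            n - c1 - c2 + c12)  # neither

    # Manning & Schuetze's "new companies" example: the statistic comes out
    # around 1.55, under the 3.84 cutoff (p = 0.05, 1 df), so the pair looks
    # independent -- not a collocation.
    print(bigram_chi_squared(c12=8, c1=15828, c2=4675, n=14307668))

Swap in a t-test or a log-likelihood ratio and the bookkeeping is identical; only the statistic changes.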
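Comparing one corpus to another is the same 2x2 table with different margins: a term's count in the foreground (today's news, or a query's top hits) against its count in the background (last month's news, or the whole collection). A sketch, reusing chi_squared_2x2 from above; the over-representation check is just one simple heuristic among several you could use.

    from collections import Counter

    def hot_terms(foreground_tokens, background_tokens, top=10):
        """Rank terms by over-representation in the foreground corpus
        relative to the background, scored by chi_squared_2x2 (above)."""
        fg = Counter(foreground_tokens)
        bg = Counter(background_tokens)
        fg_total = sum(fg.values())
        bg_total = sum(bg.values())
        scored = []
        for term, fg_count in fg.items():
            bg_count = bg.get(term, 0)
            # Skip terms that are actually rarer in the foreground
            # (cross-multiplied rates avoid any division).
            if fg_count * bg_total <= bg_count * fg_total:
                continue
            score = chi_squared_2x2(fg_count, fg_total - fg_count,
                                    bg_count, bg_total - bg_count)
            scored.append((score, term))
        return [term for _, term in sorted(scored, reverse=True)[:top]]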
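And the pruning heuristics are just cheap filters on the candidate stream. Here's a toy version of two of them (capitalized phrases, and phrases right after "the") over whitespace tokens; the nouns-only filter would need a real part-of-speech tagger.

    def pruned_bigrams(tokens):
        """Yield (w1, w2) candidates that pass either cheap filter:
        both words capitalized, or the bigram directly follows 'the'."""
        for i in range(len(tokens) - 1):
            w1, w2 = tokens[i], tokens[i + 1]
            capitalized = w1[:1].isupper() and w2[:1].isupper()
            after_the = i > 0 and tokens[i - 1].lower() == "the"
            if capitalized or after_the:
                yield w1, w2

    print(list(pruned_bigrams(
        "we ran the association rules miner from Fast Search".split())))
    # -> [('association', 'rules'), ('Fast', 'Search')]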