Re: How do i get a text summary

2008-02-29 Thread Bob Carpenter
.pdf Both LingPipe and Kea are able to find significant phrases, which is useful for query refinement or summarizing sets of search results, but not so useful for individual documents. It can be a huge help to add part-of-speech information to these kinds of approaches. -

Re: Lucene for Sentiment Analysis

2008-03-07 Thread Bob Carpenter
Gathering more data like that from Amazon, C-net, etc. should be easy. That's what everyone's doing for evaluations. But these are all at the review level, not at the sentence level. We've actually had customers annotating at the sen

Book: Building Search Applications: Lucene, LingPipe and Gate

2008-06-12 Thread Bob Carpenter
ssary, but the book never loses sight of its goal of providing a practical introduction. In that way, it’s like the Manning "in Action" series. About the author: Manu Konchady has a home page/blog on Amazon: http://www.amazon.com/gp/blog/A2TWRNMTU6T9TW/ref=cm_blog_dp_artist_blog - Bob C

Re: a "fair" similarity

2006-11-21 Thread Bob Carpenter
because the doc vectors remain stable as new docs are added. Then, in general: score(doc,doc) < score(doc,doc') if IDF(doc) = doc'. That is, the inversely IDF-scaled query matches a document better than the document itself. - Bob Carpenter Alias-i -

Re: Analysis/tokenization of compound words (German, Chinese, etc.)

2006-11-21 Thread Bob Carpenter
recall is much easier than fine-grained linguistic morphology. Often the best solution is a combination of best-guess based on linguistic rules/statistical models/heuristics combined with weaker substring measures. For better solutions that would cover fuzzy errors, contact Bob Carpenter from Al

Re: Analyzers and multiple languages (language detection)

2006-11-21 Thread Bob Carpenter
out our tutorial at: http://www.alias-i.com/lingpipe/demos/tutorial/langid/read-me.html Accuracy depends on the pair of languages (some are more confusable than others), as well as length of input (it's very hard with only one or two words, especially if it's a nam

Re: does anyone know of a 'smart' categorizing text pattern finder?

2006-11-21 Thread Bob Carpenter
oblems, you might want to check out Weka. - Bob Carpenter Alias-i - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

Re: is there any n-gram analyzer available??

2006-11-21 Thread Bob Carpenter
and maximum n-gram length. You might want to put them in different fields if you want weighting between them to be easy. - Bob Carpenter Alias-i
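A minimal sketch of the idea in this reply, assuming character n-grams (the snippet does not say whether character or token n-grams are meant). It is not Lucene code: the function name `ngrams_by_length` and its parameters are hypothetical. Grouping the output by n-gram length mirrors the suggestion to keep each length in its own field so the lengths can be weighted separately.

```python
def ngrams_by_length(text, min_n, max_n):
    """Return {n: [character n-grams of length n]} for min_n <= n <= max_n.

    Hypothetical helper for illustration only; each key could be indexed
    as a separate field so that each length gets its own weight.
    """
    if min_n < 1 or max_n < min_n:
        raise ValueError("require 1 <= min_n <= max_n")
    grams = {}
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the text; empty if n > len(text).
        grams[n] = [text[i:i + n] for i in range(len(text) - n + 1)]
    return grams


if __name__ == "__main__":
    for n, values in ngrams_by_length("lucene", 2, 3).items():
        print(n, values)
```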

Re: Language detection library

2007-05-07 Thread Bob Carpenter
22.59%, 2: 34.82%, 4: 58.55%, 8: 81.17%, 16: 92.45%, 32: 97.33%, 64: 98.99%, 128: 99.67%. The end of the tutorial has references to other popular language ID packages online (e.g. TextCat, which is Gertjan van Noord's Perl package). And it also has

Re: Keyphrase Extraction

2007-05-08 Thread Bob Carpenter
here's a blog entry comparing our hypothesis testing approach to a standard mutual-info based method (discussed by Matthew Hurst, when he was at Nielsen BuzzMetrics): http://www.alias-i.com/blog/?p=14 - Bob Carpenter Alias-i

Re: Content Summarization

2007-06-19 Thread Bob Carpenter
arch in this area is coming out of Kathy McKeown's group at Columbia, not to mention the horde of students she's graduated over the last ten years, such as Drago Radev, the author of the second tutorial and software above. - Bob Carpenter Alias-i ---

Re: question with spellchecker

2006-06-13 Thread Bob Carpenter
et al. a lot of problems with false positives (correcting things that were right) and false negatives (missing corrections). This is especially obvious once you drop into a specialized domain that's not computer science (which is over-represented proportionally on the web), or a language that

Re: JVM Crash

2006-06-13 Thread Bob Carpenter
in the JVM until I replaced my memory with ECC memory a couple of years ago, and haven't seg-faulted since. - Bob Carpenter Ross Rankin wrote: We keep getting JVM crashes on 1.4.3. I found in the archive that setting a JVM parameter solved the problem for a few users. We've tried that a

Re: Lucene and SIPs

2006-06-22 Thread Bob Carpenter
"t1")*probFG("t2") to both find things that are new and that are phrase-like. I'm going to be writing this all up in a bit longer form in a case study for the revised Lucene in Action, with explanations of how to find the significant terms relative to a query, like Scirus.com doe

Re: Phrase Frequency For Analysis

2006-06-22 Thread Bob Carpenter
nt("t") / collectionSize, where collectionCount("t") = count of term "t" in the collection, and collectionSize = number of term instances (not types) in the collection. - Bob Carpenter Alias-i Andrzej Bialecki wrote: Nader Akhnoukh wrote: Yes, Chris is correct, the goal is to det
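The definitions in this reply can be sketched as a simple relative-frequency estimate. The start of the formula is cut off in the snippet, so this assumes it is the ratio of collectionCount("t") to collectionSize. The function name `term_probability` and the token-list input are hypothetical, not part of Lucene or LingPipe.

```python
from collections import Counter


def term_probability(tokens, term):
    """Estimate a term's probability as its count over all token instances.

    `tokens` is the whole collection as a flat list of term instances, so
    len(tokens) counts instances (not distinct types), as in the reply.
    """
    collection_size = len(tokens)
    if collection_size == 0:
        return 0.0  # avoid dividing by zero on an empty collection
    collection_count = Counter(tokens)[term]
    return collection_count / collection_size


if __name__ == "__main__":
    collection = "the cat sat on the mat".split()
    print(term_probability(collection, "the"))
```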

Re: lucene in combination with pattern recognition...

2006-06-22 Thread Bob Carpenter
iness is a testament to how hard this problem is in general. - Bob Carpenter Alias-i i'm looking at a problem and i can't figure out how to "easily" solve it... basically, i'm trying to figure out if there's a way to use lucene/nutch with some form of pattern match