Re: Tags generation?

2012-08-07 Thread SAMIK CHAKRABORTY
Hi All, We have developed an auto tagging system for our micro-blogging platform. Here is what we have done: The purpose of the system was to automatically look for tags in an article when someone posts a link on our micro-blogging site. The goal was to allow us to follow a tag instead (in

Re: Tags generation?

2012-08-07 Thread Ted Dunning
Nice stuff. And glad that Mahout was able to help! On Tue, Aug 7, 2012 at 7:37 AM, SAMIK CHAKRABORTY sam...@gmail.com wrote: Hi All, We have developed an auto tagging system for our micro-blogging platform. Here is what we have done: The purpose of the system was to look for tags in an

Re: Tags generation?

2012-08-06 Thread Lance Norskog
I used the OpenNLP Parts-Of-Speech tool to label all words as 'noun', 'verb', etc. I removed all words that were not nouns or verbs. In my use case, this is a total win. In other cases, maybe not: Twitter has a quite varied non-grammar. On Sun, Aug 5, 2012 at 10:11 AM, Pat Ferrel

Re: Tags generation?

2012-08-03 Thread Ted Dunning
tf-idf is a good approximation of the LLR score for many applications and often gives useful signatures, although not always super pretty. It helps to have an overall minimum document frequency for terms to be considered as tags. This is the same as an IDF maximum. On Fri, Aug 3, 2012
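
A minimal sketch of the idea above, assuming documents are already tokenized into lists of terms. The function name `tfidf_tags` and the parameters `min_df` and `top_n` are hypothetical names chosen for illustration; this is plain Python, not Mahout's implementation. It scores terms by tf-idf and drops any term whose document frequency is below `min_df`, which is equivalent to capping idf at log(N / min_df):

```python
import math
from collections import Counter

def tfidf_tags(docs, min_df=2, top_n=3):
    """Return up to top_n candidate tags per document, ranked by tf-idf.

    docs   -- list of documents, each a list of term strings (assumed input shape)
    min_df -- hypothetical cutoff: terms in fewer documents are never tags
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    tags = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
            if df[term] >= min_df  # same as requiring idf <= log(n_docs / min_df)
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        tags.append(ranked[:top_n])
    return tags

docs = [
    ["mahout", "tags", "tags", "cluster"],
    ["mahout", "tags", "stemming"],
    ["mahout", "stemming", "lucene"],
]
print(tfidf_tags(docs, min_df=2, top_n=2))
```

Terms that occur in only one document ("cluster", "lucene") are excluded by the document-frequency floor, and a term that occurs in every document ("mahout") gets a score of zero, so it ranks last.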

Re: Tags generation?

2012-08-03 Thread Pat Ferrel
We do what Ted describes by tossing frequently used terms with the IDF max, tossing stop words and stemming with a Lucene analyzer. The stemming makes the tags less readable for sure, but without it the near-duplicate terms make for a strange-looking tag list. With or without stemming the top

Re: Tags generation?

2012-08-03 Thread Ted Dunning
Unstemming is pretty simple. Just build an unstemming dictionary based on seeing what word forms have led to a stemmed form. Include frequencies. When unstemming in the context of a document, pick the most popular (corpus-wide) version that actually appears in the document. On Fri, Aug 3,
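
The unstemming dictionary described above could be sketched roughly as follows. Everything here is an illustrative assumption: `toy_stem` is a crude stand-in for a real stemmer (for example one from a Lucene analyzer), and `build_unstem_dict` and `unstem` are hypothetical helper names, not part of any library.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    """Crude suffix stripper; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_unstem_dict(corpus, stem=toy_stem):
    """Map each stem to a Counter of the surface forms that produced it."""
    forms = defaultdict(Counter)
    for doc in corpus:
        for word in doc:
            forms[stem(word)][word] += 1
    return forms

def unstem(stemmed, doc, forms):
    """Pick the corpus-wide most frequent form of `stemmed` that occurs in `doc`."""
    present = set(doc)
    candidates = [
        (count, form)
        for form, count in forms.get(stemmed, {}).items()
        if form in present
    ]
    if not candidates:
        return stemmed  # no known form appears in this document; keep the stem
    return max(candidates)[1]

corpus = [
    ["clustering", "clusters", "clustering"],
    ["clusters", "clustered"],
    ["clustering", "cluster"],
]
forms = build_unstem_dict(corpus)
print(unstem("cluster", corpus[1], forms))
```

Frequencies are counted over the whole corpus, but the choice is restricted to forms that actually occur in the document being tagged, so a tag is never displayed in a form the reader cannot find in the text.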

Re: Tags generation?

2012-08-03 Thread Dawid Weiss
Unstemming is pretty simple. Just build an unstemming dictionary based on seeing what word forms have led to a stemmed form. Include frequencies. This can lead to very funny (or not, depending on how you look at it) mistakes when different lemmas stem to the same token. How frequent and important

Re: Tags generation?

2012-08-03 Thread Ted Dunning
This is definitely just the first step. Similar goofs happen with inappropriate stemming. For instance, AIDS should not stem to aid. A reasonable way to find and classify exceptional cases is to look at cooccurrence statistics. The contexts of original forms can be examined to find cases where

Re: Tags generation?

2012-08-03 Thread Dawid Weiss
I know, I know. :) Just wanted to mention that it could lead to funny results, that's all. There are lots of ways of doing proper form disambiguation, including shallow tagging, which then allows retrieving correct base forms for lemmas, not stems. Stemming is typically good enough (and fast) so

Re: Tags generation?

2012-08-03 Thread Lance Norskog
Thanks, everyone. I hadn't considered the stem/synonym problem. I have code for regularizing a doc/term matrix with tf, binary, log and augmented norm for the cells and idf, gfidf, entropy, normal (term vector) and probabilistic inverse. Running any of these, and then SVD, on a Reuters article may