Hi All,
We have developed an auto tagging system for our micro-blogging platform.
Here is what we have done:
The purpose of the system was to look for tags in articles automatically
when someone posts a link in our micro-blogging site. The goal was to allow
us to follow a tag instead (in
Nice stuff. And glad that Mahout was able to help!
On Tue, Aug 7, 2012 at 7:37 AM, SAMIK CHAKRABORTY sam...@gmail.com wrote:
> Hi All,
> We have developed an auto tagging system for our micro-blogging platform.
> Here is what we have done:
> The purpose of the system was to look for tags in an
I used the OpenNLP Parts-Of-Speech tool to label all words as 'noun',
'verb', etc. I removed all words that were not nouns or verbs. In my
use case, this is a total win. In other cases, maybe not: Twitter has
quite a varied non-grammar.
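The noun/verb filtering step described above can be sketched in Python. The (token, tag) pairs would come from a POS tagger such as OpenNLP; here they are hard-coded for illustration, using Penn Treebank tags (NN* for nouns, VB* for verbs).

```python
def keep_nouns_and_verbs(tagged_tokens):
    """Drop every token whose tag is not a noun (NN*) or verb (VB*)."""
    return [tok for tok, tag in tagged_tokens
            if tag.startswith("NN") or tag.startswith("VB")]

# Hand-tagged sample standing in for real POS-tagger output.
tagged = [("Mahout", "NNP"), ("builds", "VBZ"), ("scalable", "JJ"),
          ("machine", "NN"), ("learning", "NN"), ("libraries", "NNS"),
          ("quickly", "RB")]
print(keep_nouns_and_verbs(tagged))
# ['Mahout', 'builds', 'machine', 'learning', 'libraries']
```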
On Sun, Aug 5, 2012 at 10:11 AM, Pat Ferrel
tf-idf is a good approximation of the LLR score for many applications and
often gives useful signatures although not always super pretty.
It helps to have an overall minimum document frequency for terms to be
considered as tags. This is the same as an IDF maximum.
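The DF-threshold / IDF-cap equivalence can be sketched like this (the names `tag_candidates` and `min_df` are mine, for illustration):

```python
import math
from collections import Counter

def tag_candidates(docs, min_df=2):
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    # Requiring df >= min_df is the same as capping idf at log(n / min_df).
    return {t: math.log(n / c) for t, c in df.items() if c >= min_df}

docs = [["mahout", "tags", "news"],
        ["mahout", "tags"],
        ["news", "sports"]]
print(sorted(tag_candidates(docs)))  # ['mahout', 'news', 'tags']
```

"sports" appears in only one document, so it never becomes a tag candidate no matter how often it repeats within that document.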
On Fri, Aug 3, 2012
We do what Ted describes by tossing frequently used terms with the IDF max,
tossing stop words, and stemming with a Lucene analyzer. The stemming makes the
tags less readable for sure but without it the near duplicate terms make for a
strange looking tag list. With or without stemming the top
Unstemming is pretty simple. Just build an unstemming dictionary based on
seeing what word forms have led to a stemmed form. Include frequencies.
When unstemming in the context of a document, pick the most popular
(corpus-wide) version that actually appears in the document.
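That recipe can be sketched in a few lines; `toy_stem` here is a stand-in for a real stemmer (e.g. Porter via a Lucene analyzer), used only to keep the example self-contained.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    # Stand-in stemmer for illustration only.
    return word[:-1] if word.endswith("s") else word

def build_unstem_dict(corpus_tokens):
    # Map each stem to the corpus-wide frequency of every
    # surface form that produced it.
    forms = defaultdict(Counter)
    for tok in corpus_tokens:
        forms[toy_stem(tok)][tok] += 1
    return forms

def unstem(stemmed, doc_tokens, forms):
    # Pick the most popular (corpus-wide) form that actually appears
    # in this document; fall back to the stem itself.
    in_doc = set(doc_tokens)
    candidates = [(n, form) for form, n in forms[stemmed].items()
                  if form in in_doc]
    return max(candidates)[1] if candidates else stemmed

corpus = ["tags", "tags", "tag", "news", "feeds"]
forms = build_unstem_dict(corpus)
print(unstem("tag", ["tags", "news"], forms))  # tags
```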
On Fri, Aug 3,
> Unstemming is pretty simple. Just build an unstemming dictionary based on
> seeing what word forms have led to a stemmed form. Include frequencies.
This can lead to very funny (or not, depends how you look at it)
mistakes when different lemmas stem to the same token. How frequent
and important
This is definitely just the first step. Similar goofs happen with
inappropriate stemming. For instance, AIDS should not stem to aid.
A reasonable way to find and classify exceptional cases is to look at
cooccurrence statistics. The contexts of original forms can be examined to
find cases where
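One cheap guard against bad stems like AIDS -> aid is an explicit exception set checked before the stemmer runs. The set could be seeded from cooccurrence statistics as described above; here it is hard-coded, and the stemmer is a toy stand-in, not Porter.

```python
# Forms that must never be stemmed.
STEM_EXCEPTIONS = {"AIDS", "news", "species"}

def toy_stem(word):
    # Toy suffix-stripping stemmer, for illustration only.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def safe_stem(word):
    # Exceptions pass through untouched; everything else is stemmed.
    return word if word in STEM_EXCEPTIONS else toy_stem(word)

print(safe_stem("AIDS"), safe_stem("tags"))  # AIDS tag
```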
I know, I know. :) Just wanted to mention that it could lead to funny
results, that's all. There are lots of ways of doing proper form
disambiguation, including shallow tagging, which then allows retrieving
correct base forms (lemmas, not stems). Stemming is
typically good enough (and fast) so
Thanks everyone- I hadn't considered the stem/synonym problem. I have
code for regularizing a doc/term matrix with tf, binary, log and
augmented norm for the cells and idf, gfidf, entropy, normal (term
vector) and probabilistic inverse. Running any of these, and then SVD,
on a Reuters article may
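The poster's code is not shown; as an illustration of one cell/term weighting combination from that list, here is log tf for the cells times idf for the terms, in plain Python. The other combinations (binary, augmented norm, entropy, ...) drop into the same structure.

```python
import math
from collections import Counter

def log_tf_idf(docs):
    # idf per term over the corpus.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / c) for t, c in df.items()}
    # log tf for each cell: 1 + log(raw frequency), scaled by idf.
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: (1 + math.log(f)) * idf[t] for t, f in tf.items()})
    return weighted

docs = [["tag", "tag", "news"], ["news", "sport"]]
m = log_tf_idf(docs)
print(round(m[0]["tag"], 3))  # 1.174 -- term unique to doc 0
print(m[0]["news"])           # 0.0   -- term in every doc, idf = 0
```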