Hi All,

We have developed an auto-tagging system for our micro-blogging platform. Here is what we have done:

The purpose of the system is to look for tags in an article automatically when someone posts a link on our micro-blogging site. The goal is to allow us to follow a tag instead of (or in addition to) a person. We used some custom code on top of Mahout, UIMA, OpenNLP, etc. If you are interested in seeing how it works, take a look at: http://www.scoopspot.com/

One more thing: we also created a robot which visits some well-known web sites such as ReadWriteWeb, Hacker News, and TechCrunch, fetches the articles from the web, and publishes them to our micro-blog. As we already have tag following, we get the information without any problem. That's very cool (to us at least). You can see the output of the robot at: http://news.scoopspot.com/

I thought this might be an example of what Mahout can do, and it seemed related to this thread, so I felt like sharing it with you. Sorry if it looks off-topic.
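For a flavor of the approach, here is a minimal sketch of the tagging step, assuming OpenNLP's pretrained English POS model (en-pos-maxent.bin) and article text already extracted from the posted link; the class name, model-file handling, and ranking are illustrative only, not our production code:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.*;
    import java.util.stream.Collectors;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.SimpleTokenizer;

    // Illustrative sketch: suggest tags for an article by POS-tagging it,
    // keeping nouns and verbs (Penn Treebank NN*/VB*), and ranking by frequency.
    public class TagSuggester {
        public static List<String> suggestTags(String articleText, int topN) throws Exception {
            try (InputStream in = new FileInputStream("en-pos-maxent.bin")) {
                POSTaggerME tagger = new POSTaggerME(new POSModel(in));
                String[] tokens = SimpleTokenizer.INSTANCE.tokenize(articleText);
                String[] tags = tagger.tag(tokens);

                // Count surviving tokens; everything that is not a noun or verb is dropped.
                Map<String, Integer> counts = new HashMap<>();
                for (int i = 0; i < tokens.length; i++) {
                    if (tags[i].startsWith("NN") || tags[i].startsWith("VB")) {
                        counts.merge(tokens[i].toLowerCase(), 1, Integer::sum);
                    }
                }
                return counts.entrySet().stream()
                        .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                        .limit(topN)
                        .map(Map.Entry::getKey)
                        .collect(Collectors.toList());
            }
        }
    }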
Regards,
Samik

On Tue, Aug 7, 2012 at 6:49 AM, Lance Norskog <goks...@gmail.com> wrote:
> I used the OpenNLP parts-of-speech tool to label all words as 'noun',
> 'verb', etc. I removed all words that were not nouns or verbs. In my
> use case, this is a total win. In other cases, maybe not: Twitter has
> a quite varied non-grammar.
>
> On Sun, Aug 5, 2012 at 10:11 AM, Pat Ferrel <p...@farfetchers.com> wrote:
> > The way back from stem to tag is interesting from the standpoint of
> > making tags human-readable. I had assumed a lookup, but this seems much
> > more satisfying and flexible. In order to keep frequencies it will take
> > something like a dictionary-creation step in the analyzer. This in turn
> > seems to imply a join, so a whole new MapReduce job; maybe not
> > completely trivial?
> >
> > It seems that NLP can be used in two very different ways here: first as
> > a filter (keep only nouns and verbs?), second to differentiate semantics
> > (can:verb, can:noun). One method is a dimensionality-reduction
> > technique; the other increases dimensions but can lead to orthogonal
> > dimensions from the same term. I suppose both could be used together,
> > as the above example indicates.
> >
> > It sounds like you are using it to filter (only?). Can you explain what
> > you mean by:
> > "One thing came through: parts-of-speech selection for nouns and verbs
> > helped 5-10% in every combination of regularizers."
> >
> > On Aug 3, 2012, at 6:31 PM, Lance Norskog <goks...@gmail.com> wrote:
> >
> > Thanks, everyone. I hadn't considered the stem/synonym problem. I have
> > code for regularizing a doc/term matrix, with tf, binary, log, and
> > augmented norm for the cells, and idf, gfidf, entropy, normal (term
> > vector), and probabilistic inverse for the global term weights. Running
> > any of these, and then SVD, on a Reuters article may take 10-20 ms. This
> > uses a sentence/term matrix for document summarization. After doing all
> > of this, I realized that maybe just the regularized matrix was good
> > enough.
> >
> > One thing came through: parts-of-speech selection for nouns and verbs
> > helped 5-10% in every combination of regularizers. All across the
> > board. If you want good tags, select your parts of speech!
> >
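As an aside, those weighting functions are standard in the LSA literature, so here is a compact sketch of what they might look like; the formulas follow the usual published definitions (Dumais-style local and global weights), not Lance's actual code:

    // Illustrative local (cell) and global (term) weights for a doc/term
    // count matrix; a weighted cell is local(f) * global(term), and the SVD
    // is then run on the weighted matrix.
    public class Weights {
        // Local weights for a cell with raw count f (maxF = max count in doc).
        static double tf(int f)                { return f; }
        static double binary(int f)            { return f > 0 ? 1.0 : 0.0; }
        static double logTf(int f)             { return f > 0 ? 1.0 + Math.log(f) : 0.0; }
        static double augNorm(int f, int maxF) { return f > 0 ? 0.5 + 0.5 * f / maxF : 0.0; }

        // Global weights for a term in df of n documents, gf occurrences overall.
        static double idf(int n, int df)         { return Math.log((double) n / df); }
        static double gfIdf(int gf, int df)      { return (double) gf / df; }
        static double probInverse(int n, int df) { return Math.log((double) (n - df) / df); }

        // "Normal": inverse L2 norm of the term's count vector.
        static double normal(int[] fPerDoc) {
            double ss = 0.0;
            for (int f : fPerDoc) ss += (double) f * f;
            return 1.0 / Math.sqrt(ss);
        }

        // Entropy: 1 + sum over docs of (p log p) / log n, where p = f_dt / gf_t.
        static double entropy(int[] fPerDoc, int gf, int n) {
            double h = 0.0;
            for (int f : fPerDoc) {
                if (f > 0) {
                    double p = (double) f / gf;
                    h += p * Math.log(p);
                }
            }
            return 1.0 + h / Math.log(n);
        }
    }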
> > On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss
> > <dawid.we...@cs.put.poznan.pl> wrote:
> >> I know, I know. :) Just wanted to mention that it could lead to funny
> >> results, that's all. There are lots of ways of doing proper form
> >> disambiguation, including shallow tagging, which then allows you to
> >> retrieve correct base forms for lemmas, not stems. Stemming is
> >> typically good enough (and fast), so your advice was 100% fine.
> >>
> >> Dawid
> >>
> >> On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>> This is definitely just the first step. Similar goofs happen with
> >>> inappropriate stemming. For instance, AIDS should not stem to aid.
> >>>
> >>> A reasonable way to find and classify exceptional cases is to look at
> >>> cooccurrence statistics. The contexts of original forms can be
> >>> examined to find cases where there is a clear semantic mismatch
> >>> between the original and the set of all forms that stem to the same
> >>> form.
> >>>
> >>> But just picking the most common form that is present in the document
> >>> is a pretty good step, for all that it produces some oddities. The
> >>> results are much better than showing a user the stemmed forms.
> >>>
> >>> On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss
> >>> <dawid.we...@cs.put.poznan.pl> wrote:
> >>>
> >>>>> Unstemming is pretty simple. Just build an unstemming dictionary
> >>>>> based on seeing what word forms have led to a stemmed form. Include
> >>>>> frequencies.
> >>>>
> >>>> This can lead to very funny (or not, depending on how you look at it)
> >>>> mistakes when different lemmas stem to the same token. How frequent
> >>>> and important this phenomenon is varies from language to language
> >>>> (and can be calculated a priori).
> >>>>
> >>>> Dawid
> >>>>
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
>
> --
> Lance Norskog
> goks...@gmail.com
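P.S. The unstemming dictionary Ted describes above is easy to sketch. Here is a minimal illustration, with a toy suffix-stripper standing in for a real stemmer such as Porter; the class and method names are made up for the example:

    import java.util.*;

    // Pass 1: record which surface forms led to each stem, with frequencies.
    // Pass 2: map a stem back to its most frequent surface form, preferring
    // forms that actually occur in the current document, as Ted suggests.
    public class Unstemmer {
        private final Map<String, Map<String, Integer>> formsByStem = new HashMap<>();

        // Toy stand-in for a real stemmer; it only strips a trailing 's'.
        // The lemma-collision caveat Dawid raises (e.g. AIDS vs. aid)
        // applies whatever stemmer is used here.
        static String stem(String word) {
            String w = word.toLowerCase();
            return w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
        }

        public void observe(String word) {
            formsByStem.computeIfAbsent(stem(word), k -> new HashMap<>())
                       .merge(word.toLowerCase(), 1, Integer::sum);
        }

        public String unstem(String stem, Set<String> wordsInDoc) {
            Map<String, Integer> forms =
                formsByStem.getOrDefault(stem, Collections.emptyMap());
            return forms.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .map(Map.Entry::getKey)
                    .filter(wordsInDoc::contains) // most common form seen in this doc
                    .findFirst()
                    .orElse(stem);                // fall back to the raw stem
        }
    }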