Re: LDA in Mahout

2011-01-06 Thread Neal Richter
ooccur. > Alternatively, you can look at dot product between test documents and the tags that are on the test document. Then you can define AUC as the probability that tags that are actually present have higher dot product than randomly selected tags. Higher AUC is good.
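The AUC idea quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the thread or from Mahout: documents and tags are assumed to be already embedded in the same vector space, "randomly selected tags" is approximated by every tag that is not on the document, and ties count as half a win. The names `dot`, `tag_auc`, `doc_vectors`, `tag_vectors` and `doc_tags` are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))


def tag_auc(doc_vectors, tag_vectors, doc_tags):
    """Fraction of (present tag, absent tag) pairs, per document, in which
    the present tag has the higher dot product with the document vector.

    Assumption: absent tags stand in for "randomly selected tags".
    Ties count as 0.5, the usual convention for AUC.
    """
    wins = 0.0
    pairs = 0
    for doc, vec in doc_vectors.items():
        present = doc_tags.get(doc, set())
        for pos in present:
            pos_score = dot(vec, tag_vectors[pos])
            for neg in tag_vectors:
                if neg in present:
                    continue  # only compare against tags not on this document
                neg_score = dot(vec, tag_vectors[neg])
                if pos_score > neg_score:
                    wins += 1.0
                elif pos_score == neg_score:
                    wins += 0.5
                pairs += 1
    if pairs == 0:
        raise ValueError("no (present, absent) tag pairs to compare")
    return wins / pairs


# Toy data, made up for illustration only.
doc_vectors = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
tag_vectors = {"sports": [1.0, 0.0], "cooking": [0.0, 1.0], "misc": [0.5, 0.5]}
doc_tags = {"d1": {"sports"}, "d2": {"cooking"}}
print(tag_auc(doc_vectors, tag_vectors, doc_tags))
```

A value near 0.5 would mean the dot product ranks true tags no better than chance; values closer to 1.0 are better.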

Re: LDA in Mahout

2011-01-06 Thread Neal Richter
> > > My point is exactly that this evaluation will lead to nonsense. The size of the extracted topics vector isn't even necessarily the same as the size of the labels vector. There is also no guarantee that it would be in the same order. > > If order is not important in the comparison.

Re: LDA in Mahout

2011-01-06 Thread Neal Richter
rence step is run on the document vector prior to LDA input. So it's not really supervised, as there is no training, just the 2nd-stage testing part of supervised learning. - Neal > > On Wed, Jan 5, 2011 at 11:57 PM, Neal Richter wrote: > > > What about gauging it's ab

Re: LDA in Mahout

2011-01-05 Thread Neal Richter
What about gauging its ability to predict the topics of labeled data? 1) Grab RSS feeds of blog posts and use the tags as labels 2) Delicious bookmarks & their content versus user tags 3) other examples abound... On Tue, Jan 4, 2011 at 10:33 AM, Jake Mannix wrote: > Saying we have hashing is d

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Neal Richter
On Thu, Sep 30, 2010 at 8:37 AM, Neil Ghosh wrote: > Does anybody have examples/reference how to use TF-IDF weights in mahout cbayes for particular words and phrases while doing text classification? http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf - Neal

Re: Text Classification using Mahout

2010-09-28 Thread Neal Richter
use Mahout and Hadoop for scalability anyway. Does the Perl implementation work with Mahout and Hadoop? > > Thanks > Neil > > On Tue, Sep 28, 2010 at 1:21 AM, Neal Richter wrote: >> >> Neil, >> >> Is your classification task online or offline? Ie w

Re: Text Classification using Mahout

2010-09-27 Thread Neal Richter
Neil, Is your classification task online or offline? I.e., will you need a classification for a piece of text live within some web-service? IF OFFLINE: I've put up a very easy-to-use implementation of NaiveBayes here: http://github.com/nealrichter/ddj_naivebayes It's an extension of a per

Re: PFP Growth

2010-09-18 Thread Neal Richter
anks. I'll give this a try and see how it performs >> >> >> On 9/18/10 12:01 PM, Neal Richter wrote: >> >>> I suggest you take a sample of your data and run it on these non-hadoop implementations of itemset miners, FPGrowth is one of the avai

Re: PFP Growth

2010-09-18 Thread Neal Richter
I suggest you take a sample of your data and run it on these non-Hadoop implementations of itemset miners; FPGrowth is one of the available algorithms. http://www.borgelt.net/fpm.html If you have success on a small sample, then start upscaling the sample as well as investigate the distributions of
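The "start small, then upscale the sample" step could be sketched as follows. This is a hypothetical helper, not part of Mahout or of the miners linked above: it assumes the transactions fit in memory as a list of item lists, and the names `sample_transactions`, `fraction` and `seed` are made up for illustration.

```python
import random


def sample_transactions(transactions, fraction, seed=0):
    """Keep each transaction independently with probability `fraction`.

    A fixed seed makes the sample repeatable, so a run of the external
    itemset miner on the sample can be reproduced later.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    return [t for t in transactions if rng.random() < fraction]


# Toy transactions, made up for illustration only.
transactions = [["milk", "bread"], ["bread", "eggs"], ["milk"], ["eggs", "tea"]] * 250

# Grow the sample step by step; each sample would be written out and fed
# to the non-Hadoop miner before moving on to the next, larger fraction.
for fraction in (0.01, 0.1, 0.5, 1.0):
    sample = sample_transactions(transactions, fraction, seed=42)
    print(fraction, len(sample))
```

Sampling whole transactions rather than individual items keeps item co-occurrence within each transaction intact, which is what an itemset miner needs.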