> Alternatively, you can look at the dot product between test documents and
> the tags that are on the test document. Then you can define AUC as the
> probability that tags that are actually present have a higher dot product
> than randomly selected tags. Higher AUC is good.
>
>
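That AUC can be sketched in a few lines of Python (a toy illustration, not Mahout code; the vectors, names, and tag set below are made up, and it assumes documents and tags live in one shared vector space):

```python
import numpy as np

def tag_auc(doc_vec, true_tag_vecs, all_tag_vecs, n_samples=10000, seed=0):
    """AUC: probability that a tag actually on the document scores a
    higher dot product against the document than a randomly drawn tag
    (ties count as half)."""
    rng = np.random.default_rng(seed)
    true_scores = np.asarray(true_tag_vecs) @ doc_vec
    rand_scores = np.asarray(all_tag_vecs) @ doc_vec
    t = rng.choice(true_scores, n_samples)   # scores of present tags
    r = rng.choice(rand_scores, n_samples)   # scores of random tags
    return np.mean(t > r) + 0.5 * np.mean(t == r)

# toy example: 3-dim document vector, two true tags, five candidate tags
doc = np.array([1.0, 0.2, 0.0])
true_tags = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]]
all_tags = true_tags + [[0.0, 1.0, 0.0], [0.1, 0.0, 1.0], [0.0, 0.2, 0.9]]
print(tag_auc(doc, true_tags, all_tags))
```

An AUC near 0.5 means the tags score no better than chance; near 1.0 means present tags reliably outscore random ones.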
> My point is exactly that this evaluation will lead to nonsense. The size
> of the extracted topics vector isn't even necessarily the same as the
> size of the labels vector. There is also no guarantee that it would be in
> the same order.
>
>
Order is not important in the comparison. An inference step is run on the
document vector prior to LDA input.
So it's not really supervised, as there is no training; it's just the
second-stage testing part of supervised learning.
- Neal
>
> On Wed, Jan 5, 2011 at 11:57 PM, Neal Richter wrote:
>
What about gauging its ability to predict the topics of labeled data?
1) Grab RSS feeds of blog posts and use the tags as labels
2) Delicious bookmarks & their content versus user tags
3) other examples abound...
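Step 1 above can be sketched with stdlib XML parsing (a toy, hardcoded feed string for illustration; a real run would fetch the RSS over HTTP, and the post title and tags here are invented):

```python
import xml.etree.ElementTree as ET

FEED = """<rss><channel><item>
  <title>Intro to LDA</title>
  <category>machine-learning</category>
  <category>nlp</category>
</item></channel></rss>"""

def labeled_posts(feed_xml):
    """Extract (title, tags) pairs from an RSS string; the <category>
    elements serve as the ground-truth labels for evaluation."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"),
             [c.text for c in item.findall("category")])
            for item in root.iter("item")]

print(labeled_posts(FEED))
```

Each (title, tags) pair then becomes one labeled test document for the evaluation being discussed.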
On Tue, Jan 4, 2011 at 10:33 AM, Jake Mannix wrote:
> Saying we have hashing is d
On Thu, Sep 30, 2010 at 8:37 AM, Neil Ghosh wrote:
> Does anybody have examples/references for how to use TF-IDF weights in
> Mahout cbayes for particular words and phrases while doing text
> classification?
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
- Neal
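The preprocessing that paper applies before complement naive Bayes (log-TF, IDF, then per-document length normalization) can be sketched roughly like this in plain NumPy. This is an illustration, not the Mahout API, and the IDF variant shown (log of N over document frequency) is just one common choice:

```python
import numpy as np

def tfidf_transform(counts):
    """Log-TF * IDF, then L2 length normalization of each document row.
    `counts` is a docs x terms matrix of raw term frequencies."""
    counts = np.asarray(counts, dtype=float)
    tf = np.log1p(counts)                      # d_ij = log(1 + f_ij)
    df = np.count_nonzero(counts, axis=0)      # docs containing each term
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    x = tf * idf
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)        # unit-length rows

X = tfidf_transform([[3, 0, 1],
                     [0, 2, 2]])
print(X)
```

Note that a term appearing in every document gets IDF zero and drops out entirely, which is the intended effect of the IDF factor.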
> use Mahout and Hadoop for scalability anyway. Does the
> Perl implementation work with Mahout and Hadoop?
>
> Thanks
> Neil
>
> On Tue, Sep 28, 2010 at 1:21 AM, Neal Richter wrote:
Neil,
Is your classification task online or offline? I.e., will you need a
classification for a piece of text live within some web-service?
IF OFFLINE:
I've put up a very easy-to-use implementation of Naive Bayes here:
http://github.com/nealrichter/ddj_naivebayes
It's an extension of a per
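For flavor, a bare-bones multinomial Naive Bayes over token counts looks roughly like this in Python (a minimal sketch with Laplace smoothing, not the code in that repository; the spam/ham toy data is made up):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing over whitespace tokens."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> token counts
        self.label_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        v = len(self.vocab)
        total = sum(self.label_counts.values())
        best, best_lp = None, float("-inf")
        for label, n_docs in self.label_counts.items():
            lp = math.log(n_docs / total)         # log prior
            denom = sum(self.word_counts[label].values()) + v
            for tok in tokens:                    # log likelihood, add-one smoothed
                lp += math.log((self.word_counts[label][tok] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes().fit(
    ["cheap pills buy now", "meeting notes attached", "buy cheap now"],
    ["spam", "ham", "spam"])
print(nb.predict("buy pills"))   # -> spam
```

The offline case is exactly this shape: fit once on labeled text, then score batches of documents.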
Thanks. I'll give this a try and see how it performs.
>>
>>
>> On 9/18/10 12:01 PM, Neal Richter wrote:
>>
I suggest you take a sample of your data and run it on these non-Hadoop
implementations of itemset miners; FPGrowth is one of the available
algorithms.
http://www.borgelt.net/fpm.html
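This is not Borgelt's FPGrowth itself, just a brute-force toy in Python to illustrate what a frequent-itemset miner returns on a small sample (the baskets and the minimum-support threshold below are invented):

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support=2, max_size=3):
    """Brute-force frequent itemset counting (toy stand-in for FPGrowth):
    returns every itemset of up to max_size items that appears in at
    least min_support transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))                    # dedupe, canonical order
        for k in range(1, min(max_size, len(items)) + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

baskets = [["milk", "bread", "eggs"],
           ["milk", "bread"],
           ["bread", "eggs"],
           ["milk", "eggs"]]
print(frequent_itemsets(baskets))
```

The exponential blow-up of this brute-force pass is exactly why FPGrowth exists, but on a small sample it gives you the same answer and a feel for the support distribution.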
If you have success on a small sample, then start upscaling the sample
as well as investigating the distributions of