Hi,

I am currently working on the classification of pages according to DMOZ :-)
I have been planning to give Mahout a serious try but never got around to
it, so this could be a good opportunity.

We have downloaded and parsed the latest DMOZ snapshot. Everything is
currently stored in a DB, with the following fields for each document:
- URL
- category (level 1 from DMOZ)
- content
- title
- description (taken from the HTML meta tags)
- keywords (taken from the HTML meta tags)
- status (unavailable|fetched)

We are using our own API to convert the information for each document into a
vector, with a choice of weighting scheme (tf-idf, raw frequency, etc.). The
weighting takes the fields into account: with tf-idf, for instance, the
weight of a given term depends on its frequency in the specific field where
it occurs (say, the title).
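
To make the field-aware weighting concrete, here is a minimal Python sketch.
It is not our actual API: the function name field_tfidf, the "field:term"
attribute labels and the exact tf and idf formulas are assumptions chosen
for illustration. Each (field, term) pair is treated as its own attribute,
so "news" in the title and "news" in the content get separate weights.

```python
import math
from collections import Counter


def field_tfidf(docs):
    """Hypothetical sketch of field-aware tf-idf.

    docs: list of dicts mapping a field name to a list of tokens.
    Returns one dict per document mapping "field:term" to its weight.
    """
    n_docs = len(docs)

    # Document frequency of each (field, term) attribute: the number of
    # documents in which the term occurs at least once in that field.
    df = Counter()
    for doc in docs:
        seen = {(field, term)
                for field, tokens in doc.items()
                for term in tokens}
        df.update(seen)

    vectors = []
    for doc in docs:
        vec = {}
        for field, tokens in doc.items():
            counts = Counter(tokens)
            for term, count in counts.items():
                # Term frequency is relative to the length of this field.
                tf = count / len(tokens)
                # Assumed idf variant: log(N / df), computed per field.
                idf = math.log(n_docs / df[(field, term)])
                vec[f"{field}:{term}"] = tf * idf
        vectors.append(vec)
    return vectors
```

With this variant, a term that occurs in the same field of every document
gets a weight of zero, since log(N / N) = 0.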

I could describe the whole process on a Wiki page, but that would be quite
long (especially if we need to go through all the details of Nutch). Maybe I
could simply generate a textual representation of the matrix and put it
somewhere people could download it? That could be the starting point of
the use case. There would also be a lexicon file containing the mapping
between the attribute labels and their indices.
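
As a rough idea of what the two files could look like, here is a small
Python sketch. The layout is only an assumption, not a format we have
settled on: one sparse line per document ("category index:weight ..."),
1-based indices, and a tab-separated lexicon. The function names and the
sample labels are hypothetical.

```python
def build_lexicon(vectors):
    """Assign a 1-based index to every attribute label, in sorted order."""
    labels = sorted({label for vec in vectors for label in vec})
    return {label: index for index, label in enumerate(labels, start=1)}


def matrix_lines(categories, vectors, lexicon):
    """Render one sparse text line per document, skipping zero weights."""
    lines = []
    for category, vec in zip(categories, vectors):
        cells = sorted((lexicon[label], weight)
                       for label, weight in vec.items()
                       if weight != 0)
        row = " ".join(f"{index}:{weight:.4f}" for index, weight in cells)
        lines.append(f"{category} {row}".rstrip())
    return lines


def lexicon_lines(lexicon):
    """Render the lexicon as "index<TAB>label" lines, ordered by index."""
    ordered = sorted(lexicon.items(), key=lambda item: item[1])
    return [f"{index}\t{label}" for label, index in ordered]


# Made-up sample data, for illustration only.
categories = ["Computers", "Sports"]
vectors = [
    {"title:news": 0.5, "content:hadoop": 1.25},
    {"title:sports": 2.0},
]
lexicon = build_lexicon(vectors)
print("\n".join(matrix_lines(categories, vectors, lexicon)))
print("\n".join(lexicon_lines(lexicon)))
```

Keeping the labels in a separate lexicon keeps the matrix file compact while
still letting people map an index back to its field and term.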

There could be all sorts of possible experiments from there, e.g. trying to
see which attributes are the most discriminative.

Does that make sense?

Julien


2008/9/19 Grant Ingersoll <[EMAIL PROTECTED]>

> Amazon has generously donated some credits, so I plan on putting Mahout up
> and doing some testing.  Was wondering if people had suggestions on things
> they would like to see from Mahout.  For starters, I'm going to put up a
> public image containing 0.1 when it's ready, but I'd also like to wiki up
> some examples.  I.e. go here, get this data, put it in this format and then
> do X.  We have some simple examples, but I think it would be cool to show
> how to do something a bit more complex, like maybe classify web pages
> according to DMOZ or to cluster on stuff, or maybe put in a large traveling
> salesman problem using the GA stuff Deneche did.
>
> Thoughts?  Anyone else interested in setting up some use cases?
>
> -Grant
>
