If you have:

--a set of snippets
--a set of articles

and for each snippet you want to find the "matching" set of articles, then you could:

--treat this as an IR task (a snippet becomes a query)
--treat this as co-clustering (e.g. http://citeseer.ist.psu.edu/447871.html)

Nutch could do the first for you; right now there is no support in Mahout that I know about for co-clustering.

Miles

2008/6/4 Marcus Persson Lindqvist <[EMAIL PROTECTED]>:
> Hi list!
>
> I've been looking at mahout since the start and am very excited. However,
> I'm a ML-noob and need some introductory pointers before I can start play.
>
> What I want to do fairly simple: I have small set of text snippets which I
> now match a smaller set of articles, so that an article consists of one or
> more of the text snippets. So I need to group those snippets into articles.
> Preferably would I like to be able to detect "noise" as well (snippet has
> too little or dirty information and is not classified as an article.)
>
> I have access to large training sets of "complete" articles.
>
> Now, anyone got any tip on how to achieve this? Which of the algos discussed
> here would be sufficient?
>
> Any help much appreciated.
>
> /Marcus
>

-- 
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
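To make the IR framing above a bit more concrete, here is a minimal sketch in plain Python. It is not Nutch or Mahout code. It assumes TF-IDF weighting with cosine similarity as the ranking function, and a hypothetical `threshold` parameter below which a snippet is treated as "noise"; the function names and the default threshold value are made up for illustration and would need tuning on real data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens (a deliberately crude tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(docs_tokens):
    # Smoothed inverse document frequency computed over the article collection.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    # Sparse TF-IDF vector; terms never seen in any article are dropped.
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def match_snippets(snippets, articles, threshold=0.1):
    # Each snippet acts as a query against the articles.
    # Returns, per snippet, the index of the best article, or None for "noise".
    # `threshold` is an assumed cut-off, not a recommended value.
    article_tokens = [tokenize(a) for a in articles]
    idf = build_idf(article_tokens)
    article_vecs = [tfidf(t, idf) for t in article_tokens]
    result = []
    for s in snippets:
        query = tfidf(tokenize(s), idf)
        scores = [cosine(query, v) for v in article_vecs]
        best = max(range(len(scores)), key=scores.__getitem__) if scores else None
        if best is not None and scores[best] >= threshold:
            result.append(best)
        else:
            result.append(None)
    return result


if __name__ == "__main__":
    articles = [
        "the cat sat on the mat with another cat",
        "stock markets fell sharply on monday",
    ]
    snippets = ["cat on mat", "markets fell", "zzz qqq"]
    print(match_snippets(snippets, articles))
```

Grouping the snippets by the article index they map to then gives the snippet-to-article assignment; a real search engine would replace the linear scan over articles with an inverted index.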
