Karl Wettin wrote:
Grant Ingersoll wrote:
Now that we have some code in place for clustering, I think it would be cool to put together some examples/demos of real-world problems. Things like clustering text (perhaps we can use the Wikipedia download or the Reuters download that Lucene contrib/benchmark uses) or clustering other pieces of data.

We could set up a demo area of code and use Lucene's analysis code to create document vectors.
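
As a rough illustration of what "document vectors" could mean here, the sketch below builds term-frequency vectors from raw text and compares them with cosine similarity. It is plain Python and does not use Lucene: `analyze` is a hypothetical stand-in for an analyzer (lowercasing and splitting on non-alphanumerics), and the sample texts are made up for the example.

```python
import math
import re
from collections import Counter


def analyze(text):
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_vector(text):
    """Sparse term-frequency vector: maps each token to its count."""
    return Counter(analyze(text))


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample documents, only to exercise the functions.
docs = {
    "a": "Oil prices rise as markets react",
    "b": "Markets react as oil prices rise again",
    "c": "Local team wins the championship",
}
vectors = {name: term_vector(text) for name, text in docs.items()}

for x, y in [("a", "b"), ("a", "c")]:
    print(x, y, round(cosine(vectors[x], vectors[y]), 3))
```

A clustering demo would feed vectors like these (probably with some weighting such as TF-IDF, which is left out here) into the clustering code; the similarity function is only included to show that related texts end up close together.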

Ideas and/or thoughts or volunteers?

Should a demo be clear enough that people who have never heard of machine learning understand what's going on? Or should it mainly show how to use the API? Or is it something built just to show off that the code works, or that it handles a large data set?


Wikinews is, generally speaking, not as good as the Reuters data, but some articles exist in multiple languages, and they often link parts of their text to Wikipedia articles.

I can't think of any clustering use case with the mentioned data sets that makes much sense. Grouping articles or stories that are the same but come from different sources would make sense, but we only have this one source, which often tries to merge things that are the same.

There are tags describing categories and whatnot, but testing against them feels more like a classification problem than a clustering problem.

There are many other corpora that are free and good enough for a demo. For example: "20 newsgroups" for clustering, EuroParl for multilingual IR (language detection, machine translation, etc.), WebKB for web page clustering, the Acquis corpus (http://wt.jrc.it/lt/Acquis/), and so on.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
