Re: Demos/Tutorials

Andrzej Bialecki Thu, 20 Mar 2008 03:06:11 -0700

Karl Wettin wrote:

Grant Ingersoll wrote:
Now that we have some code in place for clustering, I think it would be cool to put together some examples/demos of real-world problems. Things like clustering text (perhaps we can use the Wikipedia download or the Reuters download that Lucene contrib/benchmark uses) or clustering other pieces of data.
We could set up a demo area of code and use Lucene's analysis code to create document vectors.
Ideas and/or thoughts or volunteers?
Should a demo make enough sense that people who have never heard of machine learning understand what's going on? Or should it mainly show how to use the API? Or is it something built just to show off working code or a large data set?
Wikinews is generally speaking not as good as the Reuters data, but some articles exist in multiple languages, and they often link parts of the text to Wikipedia articles.
I can't think of any clustering use case with the mentioned data sets that makes that much sense. Something that groups articles or stories that are the same but come from different sources makes sense, but we only have this one source, which often tries to merge things that are the same.
There are these tags describing categories and whatnot, but testing this feels more like a classification problem than a clustering problem.

There are many other corpora that are free and good enough for a demo. For example, "20 Newsgroups" for clustering, EuroParl for multilingual IR (language detection, machine translation, etc.), WebKB for web page clustering, the Acquis corpus (http://wt.jrc.it/lt/Acquis/), etc.



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
