Karl Wettin wrote:
Grant Ingersoll wrote:
Now that we have some code in place for clustering, I think it would
be cool to put together some examples/demos of real-world problems.
Things like clustering text (perhaps we can use the Wikipedia download
or the Reuters download that Lucene contrib/benchmark uses) or
clustering other pieces of data.
We could set up a demo area of code and use Lucene's analysis code to
create document vectors.
Ideas and/or thoughts or volunteers?
Should a demo be clear enough that people who have never heard of machine
learning understand what's going on? Or should it mainly show how
to use the API? Or is it something built just to show off the code
working on a large data set?
Wikinews is, generally speaking, not as good as the Reuters data, but some
articles exist in multiple languages, and they often link parts of
their text to Wikipedia articles.
I can't think of any clustering use case for the mentioned data sets
that makes much sense. Grouping articles or stories that are the same
but come from different sources would make sense, but we only have this
one source, which often tries to merge things that are the same.
There are tags describing categories and whatnot, but testing against
them feels more like a classification problem than a clustering problem.
There are many other corpora that are free and good enough for a demo.
For example, "20 newsgroups" for clustering, EuroParl for
multilingual IR (language detection, machine translation, etc.), WebKB
for web page clustering, the Acquis corpus
(http://wt.jrc.it/lt/Acquis/), etc.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com