Grant Ingersoll wrote:
Now that we have some code in place for clustering, I think it would be cool to put together some examples/demos of real-world problems. Things like clustering text (perhaps we can use the Wikipedia download or the Reuters download that Lucene contrib/benchmark uses) or clustering other pieces of data.

We could set up a demo area of code and use Lucene's analysis code to create document vectors.

Ideas and/or thoughts or volunteers?

Should a demo be simple enough that people who have never heard of machine learning can understand what's going on? Or should it mainly show how to use the API? Or is it just something built to show the code working on a large data set?


Wikinews is, generally speaking, not as good as the Reuters data, but some articles exist in multiple languages, and they often link parts of the text to Wikipedia articles.

I can't think of any clustering use case with the mentioned data sets that makes much sense. Grouping articles or stories that are the same but come from different sources would make sense, but we only have this one source, which often tries to merge things that are the same.

There are tags describing categories and so on, but testing against them feels more like a classification problem than a clustering problem.

I suppose text mining means Lucene tokenization, so clustering search results is not too far-fetched. But it is still clustering the one source we have.


Wikibooks:cookbook could be a great source for fun examples (clustering applicable recipes, feature selection on a shopping list, a product ethnicity classifier, market basket analysis, collaborative filtering, etc.), but I fear it would take a bit of work to parse the recipes.


    karl
