Re: Demos/Tutorials

Karl Wettin Wed, 19 Mar 2008 18:56:57 -0700

Grant Ingersoll wrote:

Now that we have some code in place for clustering, I think it would be cool to put together some examples/demos of real world problems. Things like clustering text (perhaps we can use the wikipedia download or the reuters download that Lucene contrib/benchmark uses) or clustering other pieces of data.
We could set up a demo area of code and use Lucene's analysis code to create document vectors.
Ideas and/or thoughts or volunteers?

Should a demo make enough sense that people who have never heard of machine learning understand what's going on? Or should it mainly show how to use the API? Or is it something built just to show off working code or a large data set?

Wikinews is, generally speaking, not as good as the Reuters data, but some articles exist in multiple languages, and they often link parts of their text to Wikipedia articles.

I can't think of any clustering use case with the mentioned data sets that makes much sense. Grouping articles or stories that are the same but come from different sources would make sense, but we only have this one source, which often tries to merge things that are the same.

There are these tags describing categories and whatnot, but testing against them feels more like a classification problem than a clustering problem.

I suppose text mining means Lucene tokenization, so clustering search results is not too far-fetched. But it is still clustering the one source we have.
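To make the idea concrete, here is a minimal sketch of grouping result snippets by term overlap. It is only an illustration under stated assumptions: the regex tokenizer is a stand-in for a Lucene analyzer, and the names `tokenize`, `term_vector`, `cosine` and `group_results`, as well as the 0.5 threshold, are hypothetical and not part of any existing API. The grouping is a naive greedy pass, not a proper clustering algorithm.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Stand-in for a Lucene analyzer (assumption): lowercase, keep letter runs.
    return re.findall(r"[a-z]+", text.lower())


def term_vector(text):
    # Sparse term-frequency vector: term -> count.
    return Counter(tokenize(text))


def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_results(texts, threshold=0.5):
    # Greedy single pass: a text joins the first group whose first member
    # is similar enough, otherwise it starts a new group.
    vectors = [term_vector(t) for t in texts]
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```

Calling `group_results` on a list of snippets returns lists of indexes into that list, one list per group. A real demo would swap the tokenizer for Lucene's analysis chain and the greedy pass for one of the clustering implementations.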

Wikibooks:cookbook could be a great source for fun examples (clustering applicable recipes, feature-selecting a shopping list, a product ethnicity classifier, market basket analysis, collaborative filtering, etc.), but I fear it would take a bit of work to parse the recipes.



    karl
