Hi,

We have a local news aggregation (and news search engine) web site that shows news stories as clusters: each cluster groups articles from different news sites that cover the same (or a very similar) story. For clustering the news from the last crawl (the news articles themselves, not search results), we use Carrot2, and it works pretty well.
However, we sometimes need to publish a summary of the week/month/year. I am not experienced with clustering, but from what I have read about clustering on this mailing list, I guess that applying k-means after intelligently selecting the initial clusters with canopy would fulfill our needs. I have some questions about this:

- Could anyone experienced with clustering suggest the most suitable way to detect news stories? Does the method mentioned above seem reasonable? (A toy sketch of what I mean by canopy-seeded k-means is included below, after my signature.)
- Do I need to do any preparatory work before clustering? For example, should I partition the data into daily groups first? (Again, in our case a news story is an aggregated view of nearly identical stories from different sources.)

Finally, our search engine is built on Lucene/Solr, and I've read on the Wiki pages that our index can easily be converted to Mahout's vector format with the Lucene driver.

- Are the documents about the clustering jobs on the Wiki pages applicable to trunk? If they are out of date, is there somewhere I can find documentation for trunk?

Thanks.

--
Gökhan Çapan
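P.S. To be concrete about the method I am asking about, here is a tiny, self-contained toy sketch in plain Java. This is not Mahout code: the class and method names are made up for illustration, the Euclidean distance and the T2 threshold are arbitrary placeholders, and the canopy pass is simplified to a single tight-threshold seed selection. The real job would run Mahout's canopy and k-means drivers over vectors generated from our Lucene index.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class CanopyKMeansSketch {

        // Euclidean distance; the real job would likely use a cosine-based measure.
        static double dist(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                s += d * d;
            }
            return Math.sqrt(s);
        }

        // Canopy-style seed selection: greedily take a point as a center and
        // discard everything closer than the tight threshold t2, so the next
        // seed is guaranteed to be reasonably far from the previous ones.
        static List<double[]> canopySeeds(List<double[]> points, double t2) {
            List<double[]> seeds = new ArrayList<double[]>();
            List<double[]> remaining = new ArrayList<double[]>(points);
            while (!remaining.isEmpty()) {
                double[] center = remaining.remove(0);
                seeds.add(center.clone());
                for (Iterator<double[]> it = remaining.iterator(); it.hasNext();) {
                    if (dist(it.next(), center) < t2) {
                        it.remove();
                    }
                }
            }
            return seeds;
        }

        // Plain k-means, seeded with the canopy centers (centers are updated in place).
        static void kmeans(List<double[]> points, List<double[]> centers, int iterations) {
            int k = centers.size();
            int dim = points.get(0).length;
            for (int iter = 0; iter < iterations; iter++) {
                double[][] sums = new double[k][dim];
                int[] counts = new int[k];
                for (double[] p : points) {
                    int best = 0;
                    for (int c = 1; c < k; c++) {
                        if (dist(p, centers.get(c)) < dist(p, centers.get(best))) {
                            best = c;
                        }
                    }
                    counts[best]++;
                    for (int i = 0; i < dim; i++) {
                        sums[best][i] += p[i];
                    }
                }
                for (int c = 0; c < k; c++) {
                    if (counts[c] > 0) {
                        for (int i = 0; i < dim; i++) {
                            centers.get(c)[i] = sums[c][i] / counts[c];
                        }
                    }
                }
            }
        }

        public static void main(String[] args) {
            // Four fake 2-d "document vectors"; in practice these would be
            // TF-IDF vectors from the Lucene index.
            List<double[]> docs = new ArrayList<double[]>();
            docs.add(new double[] {0.10, 0.20});  // two articles about one story
            docs.add(new double[] {0.12, 0.22});
            docs.add(new double[] {0.90, 0.80});  // two articles about another story
            docs.add(new double[] {0.88, 0.85});

            List<double[]> seeds = canopySeeds(docs, 0.3); // T2 chosen arbitrarily
            kmeans(docs, seeds, 10);
            System.out.println("stories found: " + seeds.size()); // expect 2
        }
    }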
