Hi,

We have a local news aggregation (and news search engine) web site that shows news stories as clusters: each cluster groups articles from different news sites that cover the same (or a very similar) story. For clustering the news from the last crawl (the news articles themselves, not search results), we use Carrot2, and it works pretty well.
However, we sometimes need to publish a summary of the week/month/year. I am not experienced with clustering, but from what I have read about clustering on this mailing list, I guess that applying k-means after intelligently selecting the initial clusters with canopy would fulfill our needs. I have some questions about this:

- Could anyone experienced with clustering suggest the most suitable way to detect news stories? Does the method mentioned above seem reasonable? (A toy sketch of what I mean by canopy-seeded k-means is included below, after my signature.)
- Do I need to do any preparatory work before clustering? For example, should I partition the data into daily groups first? (Again, in our case a news story is an aggregated view of nearly identical stories from different sources.)

Finally, our search engine is built on Lucene/Solr, and I've read on the Wiki pages that our index can easily be converted to Mahout's vector format with the Lucene driver.

- Are the documents about the clustering jobs on the Wiki pages applicable to trunk? If they are out of date, is there somewhere I can find documentation for trunk?

Thanks.

--
Gökhan Çapan
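P.S. To be concrete about the method I am asking about, here is a tiny, self-contained toy sketch in plain Java. This is not Mahout code: the class and method names are made up for illustration, the Euclidean distance and the T2 threshold are arbitrary placeholders, and the canopy pass is simplified to a single tight-threshold seed selection. The real job would run Mahout's canopy and k-means drivers over vectors generated from our Lucene index.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class CanopyKMeansSketch {

        // Euclidean distance; the real job would likely use a cosine-based measure.
        static double dist(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                s += d * d;
            }
            return Math.sqrt(s);
        }

        // Canopy-style seed selection: greedily take a point as a center and
        // discard everything closer than the tight threshold t2, so the next
        // seed is guaranteed to be reasonably far from the previous ones.
        static List<double[]> canopySeeds(List<double[]> points, double t2) {
            List<double[]> seeds = new ArrayList<double[]>();
            List<double[]> remaining = new ArrayList<double[]>(points);
            while (!remaining.isEmpty()) {
                double[] center = remaining.remove(0);
                seeds.add(center.clone());
                for (Iterator<double[]> it = remaining.iterator(); it.hasNext();) {
                    if (dist(it.next(), center) < t2) {
                        it.remove();
                    }
                }
            }
            return seeds;
        }

        // Plain k-means, seeded with the canopy centers (centers are updated in place).
        static void kmeans(List<double[]> points, List<double[]> centers, int iterations) {
            int k = centers.size();
            int dim = points.get(0).length;
            for (int iter = 0; iter < iterations; iter++) {
                double[][] sums = new double[k][dim];
                int[] counts = new int[k];
                for (double[] p : points) {
                    int best = 0;
                    for (int c = 1; c < k; c++) {
                        if (dist(p, centers.get(c)) < dist(p, centers.get(best))) {
                            best = c;
                        }
                    }
                    counts[best]++;
                    for (int i = 0; i < dim; i++) {
                        sums[best][i] += p[i];
                    }
                }
                for (int c = 0; c < k; c++) {
                    if (counts[c] > 0) {
                        for (int i = 0; i < dim; i++) {
                            centers.get(c)[i] = sums[c][i] / counts[c];
                        }
                    }
                }
            }
        }

        public static void main(String[] args) {
            // Four fake 2-d "document vectors"; in practice these would be
            // TF-IDF vectors from the Lucene index.
            List<double[]> docs = new ArrayList<double[]>();
            docs.add(new double[] {0.10, 0.20});  // two articles about one story
            docs.add(new double[] {0.12, 0.22});
            docs.add(new double[] {0.90, 0.80});  // two articles about another story
            docs.add(new double[] {0.88, 0.85});

            List<double[]> seeds = canopySeeds(docs, 0.3); // T2 chosen arbitrarily
            kmeans(docs, seeds, 10);
            System.out.println("stories found: " + seeds.size()); // expect 2
        }
    }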
