Hello, (I have posted this to the Solr list as well.) I am indexing newspaper articles as an exercise in Solr. In previous work with newspaper articles, I always tried to locate the div or table that contains the actual news, using NekoHTML to traverse the DOM tree and pull the text out of that element. When dealing with many newspapers, it is a hassle to write custom code for each one to extract the relevant information. There is usually a lot of garbage in the HTML, from categories to ads, and furthermore the layouts change, so static, hard-coded extraction is problematic.
I have been wondering whether I could measure the frequency or uniqueness of each node and find the news automatically, but I have not come up with an implementation. Has anyone done, contemplated, or used something similar? Maybe there is already a way to do this using Lucene, or even Hadoop. Otis from the Solr mailing list suggested a NovelAnalyzer from the Lucene development code. I think the Hadoop people might have an idea about this...

Best,
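
P.S. To make the node frequency idea a bit more concrete, here is a minimal Python sketch, not a tested solution. It assumes several pages from the same newspaper are available, treats each run of text between tags as a "node", and drops any chunk that recurs on more than half of the pages (menus, category lists, ads). The names TextChunks, chunks_of, unique_chunks and the max_share threshold are made up for illustration, and it uses only Python's standard html.parser rather than NekoHTML or Lucene.

```python
from collections import Counter
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect non-empty text chunks, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._skip:
            self.chunks.append(text)


def chunks_of(html):
    """Return the text chunks of one HTML page, in document order."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return parser.chunks


def unique_chunks(pages, max_share=0.5):
    """For each page, keep only chunks seen on at most max_share of all pages.

    max_share is an assumed threshold; it would need tuning on real data.
    """
    per_page = [chunks_of(page) for page in pages]
    # Document frequency: on how many pages does each chunk appear?
    df = Counter()
    for chunks in per_page:
        df.update(set(chunks))
    limit = max_share * len(pages)
    return [[c for c in chunks if df[c] <= limit] for chunks in per_page]
```

Calling unique_chunks on a list of HTML strings from one site returns, per page, the chunks that are rare across that site. Exact string matching is the crudest possible uniqueness measure; ads that rotate or menus that differ slightly between pages would slip through, which is where something smarter would be needed.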