I have never tried this method myself. The concept came from a research paper I ran into. The goal was to detect the language of a piece of text by looking at several factors: average word length, average sentence length, average number of vowels per word, etc. The author used these to score an article, and it worked well in determining the language of the text.
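To make the idea concrete, here is a rough sketch of how those factors could be computed and compared. This is my own illustration in Python, not the paper's method: the function names (`text_features`, `closest_profile`), the simple regex tokenizing, the ASCII-only vowel set, and the squared-distance comparison are all assumptions, and the reference profiles would have to be measured from real sample text.

```python
import re

# Assumption: plain ASCII vowels only; accented vowels are not counted.
VOWELS = set("aeiouAEIOU")


def text_features(text):
    """Compute the three per-text averages mentioned above."""
    # Naive sentence split on ., ! and ? (abbreviations will confuse it).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # A "word" here is any run of letters (digits and underscores excluded).
    words = re.findall(r"[^\W\d_]+", text)
    if not words or not sentences:
        return {"avg_word_len": 0.0,
                "avg_sentence_len": 0.0,
                "avg_vowels_per_word": 0.0}
    vowel_count = sum(ch in VOWELS for w in words for ch in w)
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sentence_len": len(words) / len(sentences),  # words per sentence
        "avg_vowels_per_word": vowel_count / len(words),
    }


def closest_profile(features, profiles):
    """Return the name of the reference profile nearest to `features`.

    `profiles` maps a label to a feature dict of the same shape. The
    score is a plain sum of squared differences (hypothetical choice).
    """
    def distance(profile):
        return sum((features[k] - profile[k]) ** 2 for k in features)

    return min(profiles, key=lambda name: distance(profiles[name]))
```

You would build one profile per language (or per kind of text block) by running `text_features` over known samples, then call `closest_profile` on the unknown text. Since the features sit on different scales, you would probably want to normalize or weight them before trusting the distance.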
This is a fairly basic kind of program that you might see in an Artificial Intelligence course: you compute a score and use it to decide which block of text is the one you are looking for. The answer is not going to be perfect, and I cannot imagine many out-of-the-box solutions will do exactly what you need. (Just a guess.) The one plus about this is that you can take HTML right out of the equation. I believe the Java HTML tag parsers have some quick 'toText' method that will dump the text of a web page. Also, you would think most online newspapers carry a NewsML XML version or RSS version of their paper.
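As a sketch of the "dump the text of a web page" step, here is one way to do it using only Python's standard `html.parser` module. It is written in Python rather than Java just to keep it short, and it is not the 'toText' method I mentioned; the class and function names (`TextExtractor`, `html_to_text`) are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The plain text that comes out can then be scored without the markup getting in the way. It will still include navigation links, captions and the like, so the scoring step is what has to pick out the article itself.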