All,

I finally got around to adding tika-eval[1] to Apache Tika. If you have any interest in comparing the text extraction output of different tools, versions, or parameters, give it a try. You don't need to use Tika or put the output in any specific format; plain UTF-8 text will work.

Tilman, I generalized your common word count methodology. The code now runs language id on the text and then counts the common words for that language.

Lots more work remains. Thank you, all, for contributing to the methodologies!

Cheers,

Tim

[1] https://wiki.apache.org/tika/TikaEval
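
As a rough illustration of the common-word idea described above (language id first, then counting the common words for that language), here is a minimal Python sketch. It is not the tika-eval code: the word lists, the overlap-based language guess, and the names COMMON_WORDS, guess_language, and common_word_count are all hypothetical stand-ins chosen for this example.

```python
import re

# Hypothetical, tiny common-word lists for illustration only;
# a real evaluation would use much larger per-language lists.
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "mit", "für"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def guess_language(tokens):
    """Toy language id (an assumption, not a real detector):
    pick the language whose common-word list matches the most tokens."""
    return max(
        COMMON_WORDS,
        key=lambda lang: sum(1 for t in tokens if t in COMMON_WORDS[lang]),
    )


def common_word_count(text):
    """Return (language, common-word hits, total tokens) for plain text."""
    tokens = tokenize(text)
    lang = guess_language(tokens)
    hits = sum(1 for t in tokens if t in COMMON_WORDS[lang])
    return lang, hits, len(tokens)


if __name__ == "__main__":
    # Compare two hypothetical extractions of the same document.
    clean = "The cat is in the garden and that is fine"
    garbled = "Th3 c@t 1s 1n th3 g@rd3n"
    print(common_word_count(clean))
    print(common_word_count(garbled))
```

Running the same function over the output of two extractors for the same file gives a crude way to compare them: the extraction with more common-word hits for the detected language is less likely to be garbled. That reading is an assumption of this sketch rather than a statement about how tika-eval scores output.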