On 21.05.2017 at 18:20, Andreas Lehmkuehler wrote:
On 17.02.2017 at 17:58, Allison, Timothy B. wrote:
All,
I finally got around to adding tika-eval[1] to Apache Tika. If you
have any interest in comparing the output of different
tools/versions/parameters on text extraction, give it a try. You
don't need to use Tika or format the output in a specific format;
plain UTF-8 text will work.
Tilman, I generalized your common word count methodology. The code
now runs language id on the text and then counts the common words for
that language.
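
As a rough illustration of the idea described above (not tika-eval's actual code), a minimal Python sketch might look like the following. The tiny word lists, the overlap-based language guess, and the function names are all assumptions made up for this example; a real tool would use a proper language identifier and much larger per-language lists.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny common-word lists (assumption for this sketch).
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "zu", "es"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def detect_language(tokens):
    """Stand-in for real language id: pick the language whose
    common words occur most often among the tokens."""
    counts = Counter(tokens)
    scores = {
        lang: sum(counts[word] for word in words)
        for lang, words in COMMON_WORDS.items()
    }
    return max(scores, key=scores.get)


def common_word_count(text):
    """Return (language, number of common-word tokens, total tokens)."""
    tokens = tokenize(text)
    lang = detect_language(tokens)
    hits = sum(1 for token in tokens if token in COMMON_WORDS[lang])
    return lang, hits, len(tokens)


if __name__ == "__main__":
    print(common_word_count("The cat is in the house and it is happy"))
    print(common_word_count("Der Hund ist nicht in der Stadt und es regnet"))
```

A low share of common words relative to total tokens would hint that the extracted text is garbled, which is what makes the count useful when comparing two extraction runs on the same file.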
Lots more work remains. Thank you, all, for contributing to the
methodologies!
And here is the talk about it that Tim gave at ApacheCon:
https://youtu.be/vRPTPMwI53k?list=PLbzoR-pLrL6pLDCyPxByWQwYTL-JrF5Rp
I enjoyed it (the video).
So did I!
Tilman