Use dictionary lookups to increase text extraction accuracy
-----------------------------------------------------------
Key: PDFBOX-1153
URL: https://issues.apache.org/jira/browse/PDFBOX-1153
Project: PDFBox
Issue Type: New Feature
Components: Text extraction
Reporter: Jukka Zitting
There are still some cases where the text extraction code incorrectly inserts
spaces inside words extracted from a PDF document. We could increase extraction
accuracy with an optional dictionary lookup mechanism that checks each
extracted word or token against a dictionary of common words. If the lookup
fails (and the amount of empty space after the token is small), the token is
concatenated with the next one. If that concatenated token matches a word in
the dictionary, the intervening space can very likely be dropped.
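
The proposed heuristic could be sketched roughly as follows. This is only an illustration of the idea described above, not existing PDFBox code: the class DictionaryTokenMerger, its merge method, the maxGap threshold and the gapsAfter array (the width of the empty space following each token, in whatever unit the extractor uses) are all hypothetical names and assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: merges wrongly split tokens
// using a dictionary of known words.
final class DictionaryTokenMerger {
    private final Set<String> dictionary; // assumed to hold lower-case words
    private final float maxGap;           // assumed threshold for a "small" gap

    DictionaryTokenMerger(Set<String> dictionary, float maxGap) {
        this.dictionary = dictionary;
        this.maxGap = maxGap;
    }

    // gapsAfter[i] is the empty space following tokens.get(i); it needs
    // at least tokens.size() - 1 entries.
    List<String> merge(List<String> tokens, float[] gapsAfter) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String token = tokens.get(i);
            boolean hasNext = i + 1 < tokens.size();
            // Only try a merge if the lookup fails and the gap is small.
            if (hasNext && !isWord(token) && gapsAfter[i] <= maxGap) {
                String joined = token + tokens.get(i + 1);
                if (isWord(joined)) {
                    // The concatenation is a known word: drop the space.
                    result.add(joined);
                    i += 2;
                    continue;
                }
            }
            result.add(token);
            i++;
        }
        return result;
    }

    private boolean isWord(String s) {
        return dictionary.contains(s.toLowerCase(Locale.ROOT));
    }
}
```

With a dictionary containing "text", "extraction" and "works" and a maxGap of 1.0, the tokens "text", "extrac", "tion", "works" with gaps 3.0, 0.5, 3.0 would come out as "text", "extraction", "works". The sketch only merges adjacent pairs; words split into three or more fragments would need a repeated or wider lookup.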