[ https://issues.apache.org/jira/browse/TIKA-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413701#comment-15413701 ]
Tim Allison commented on TIKA-2052:
-----------------------------------

Sorry. I suspect this is a PDF issue, rather than PDFBox, but if they can fix it, great! Thank you for opening this.

> Words are separated where there the letters are spaced together in the PDF document
> -----------------------------------------------------------------------------------
>
>                 Key: TIKA-2052
>                 URL: https://issues.apache.org/jira/browse/TIKA-2052
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Sebastian Landwehr
>
> For example in the following document:
> https://www.g-ba.de/downloads/39-261-2062/2014-08-21_QSKH-RL_Q-Report_2013.pdf
> Searching for "onsimpulse des Herzschrittmachers" finds the location where
> "Herzschrittmacher" is separated into "Herzschrittma chers". This is
> especially problematic when using the PDF for full text search because often
> such end syllables are found which are not really part of the content. The
> whitespace config parameter did not help.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)