[ 
https://issues.apache.org/jira/browse/TIKA-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413701#comment-15413701
 ] 

Tim Allison commented on TIKA-2052:
-----------------------------------

Sorry.  I suspect this is an issue with the PDF itself rather than with 
PDFBox, but if they can fix it, great!  Thank you for opening this.

> Words are separated where the letters are spaced together in the PDF 
> document
> -----------------------------------------------------------------------------------
>
>                 Key: TIKA-2052
>                 URL: https://issues.apache.org/jira/browse/TIKA-2052
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Sebastian Landwehr
>
> For example in the following document:
> https://www.g-ba.de/downloads/39-261-2062/2014-08-21_QSKH-RL_Q-Report_2013.pdf
> Searching for "onsimpulse des Herzschrittmachers" finds the location where 
> "Herzschrittmacher" is split into "Herzschrittma chers". This is 
> especially problematic when using the PDF for full-text search, because 
> searches then often match trailing fragments such as "chers" that are not 
> really part of the content. The whitespace config parameter did not help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
