[ https://issues.apache.org/jira/browse/TIKA-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-584.
--------------------------------

    Resolution: Duplicate
      Assignee: Jukka Zitting

Like TIKA-583, this is a duplicate of TIKA-548, fixed in trunk.

> Tika parse of some PDF files removes all spaces between words
> -------------------------------------------------------------
>
>                 Key: TIKA-584
>                 URL: https://issues.apache.org/jira/browse/TIKA-584
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.8
>         Environment: Windows XP 3, OpenSuse 11.2
>            Reporter: Ajay Vohra
>            Assignee: Jukka Zitting
>         Attachments: JavaEE6Tutorial.pdf
>
>
> In the case of some pdf files (not all), when Tika.parse(InputStream) method
> is used, the content extracted from the returned reader has all spaces
> removed. This only happens for some PDF files: An example where this happens
> is: JavaEE6Tutorial.pdf (available from Oracle). There are many such files
> where this bug can be seen. I have even tried Tika snapshot 0.9 and the bug
> remains.
> When PDFTextStripper is directly used, the extracted content is correct, with
> the spaces between words retained.