[ https://issues.apache.org/jira/browse/TIKA-559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935788#action_12935788 ]
Staffan Olsson commented on TIKA-559:
-------------------------------------

Isn't this a duplicate of TIKA-548? Try trunk.

> [PDF Parser] New paragraph not taken into account sometime
> ----------------------------------------------------------
>
>                 Key: TIKA-559
>                 URL: https://issues.apache.org/jira/browse/TIKA-559
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>         Environment: Windows 7 x86 Pro
>            Reporter: Antoine L.
>            Priority: Minor
>             Fix For: 0.9
>
>         Attachments: partition.pdf
>
>
> The document attached to this issue has some parsing problems.
> The extracted text is the following (using Tika.parseToString(...)):
> ----
> MiserereAntonio Lotti (1666 - 1740) Mi-se-re-
> Mi-se-re--re [...]
> ----
> If you open the file, you will see that "Miserere" and "Antonio" are not
> placed on the same line.
> I was expecting at least a white space between "Miserere" and
> "Antonio".
> I don't have the tools to analyze the PDF, but could it be that the text in
> the file uses absolute alignment? (Or this may be completely irrelevant.)
> Thank you.