[ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Hewson updated PDFBOX-800:
-------------------------------
    Affects Version/s: 1.7.0

> Wrong text extract from vertical textboxes in pdf files
> -------------------------------------------------------
>
>                 Key: PDFBOX-800
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-800
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.0
>         Environment: Windows 7, VS 2010 C#, Tika Library
>            Reporter: Sandor Dj
>             Fix For: 2.0.0
>
>         Attachments: problemdoc.doc, problemdoc.pdf
>
>
> Vertical textboxes in PDF files are not extracted correctly (using the Tika
> library in C#).
> For example, if there is a vertical textbox "hello" in a PDF file (!WITHOUT!
> line breaks):
> H
> E
> L
> L
> O
> the parser returns 5 strings, each with a single letter, even though there
> is NO line break after every letter.
> Is there an option to avoid this problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
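
The symptom described in the report (one string per letter for a vertical textbox) can be worked around after extraction. The following is only a minimal sketch of such a post-processing heuristic, not a PDFBox or Tika option: the function name `merge_single_char_lines` and the `min_run` parameter are hypothetical, and the input is assumed to be the list of strings the parser returned. Note that it will also merge genuine runs of one-letter lines, so it may not suit every document.

```python
def merge_single_char_lines(lines, min_run=2):
    """Join runs of consecutive one-character lines into a single string.

    Hypothetical helper: `lines` is assumed to be the list of strings
    returned by the text extractor, one entry per extracted line.
    """
    merged = []
    run = []  # pending single-character lines

    def flush():
        # Merge the pending run only if it is long enough to look like
        # vertical text; otherwise keep the entries as they were.
        if len(run) >= min_run:
            merged.append("".join(run))
        else:
            merged.extend(run)
        run.clear()

    for line in lines:
        text = line.strip()
        if len(text) == 1:
            run.append(text)
        else:
            flush()
            merged.append(line)
    flush()
    return merged


extracted = ["Title", "H", "E", "L", "L", "O", "Next paragraph"]
print(merge_single_char_lines(extracted))
```

Raising `min_run` makes the heuristic more conservative, since short runs of single-character lines are then left untouched.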