[ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903435#action_12903435 ]
Jukka Zitting commented on PDFBOX-800:
--------------------------------------

One possible approach would be to divide the characters on a page into different "layers" depending on the orientation in which they are drawn. One layer would contain only horizontal characters, while others would contain vertical and diagonal ones. With appropriate rotation we could then apply the normal horizontal text extraction algorithm to the vertically and diagonally drawn characters as well.

> Wrong text extract from vertical textboxes in pdf files
> -------------------------------------------------------
>
>                 Key: PDFBOX-800
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-800
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>         Environment: Win 7, VS 2010 C#
>            Reporter: Sandor Dj
>         Attachments: problemdoc.doc, problemdoc.pdf
>
>
> I was told to move this issue to the pdfbox parser, so I hope this is the
> right section.
> Vertical textboxes in pdf files are not extracted correctly (using the tika
> library in C#).
> For example, if there is a vertical textbox "hello" in a pdf file (!WITHOUT!
> line breaks):
> H
> E
> L
> L
> O
> the parser returns 5 strings, each with a single letter, even though there is
> NO line break after each letter.
> Is there an option to avoid this problem?
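
The layering idea from the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions, not PDFBox code: the `Glyph` class, the `OrientationLayers` class and all of its methods are hypothetical names, each glyph is assumed to carry its drawing direction in degrees, and `extractHorizontal` is a simplified stand-in for the real horizontal extraction algorithm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class OrientationLayers {

    /** Hypothetical glyph: text, origin in page coordinates, text direction in degrees. */
    static final class Glyph {
        final String text;
        final double x, y, angleDegrees;
        Glyph(String text, double x, double y, double angleDegrees) {
            this.text = text; this.x = x; this.y = y; this.angleDegrees = angleDegrees;
        }
    }

    /** Snap the direction to whole degrees in [0, 360) so equal directions share a layer. */
    static int layerKey(double angleDegrees) {
        int a = (int) Math.round(angleDegrees) % 360;
        return a < 0 ? a + 360 : a;
    }

    /** Step 1: one layer per drawing direction. */
    static Map<Integer, List<Glyph>> splitIntoLayers(List<Glyph> glyphs) {
        Map<Integer, List<Glyph>> layers = new TreeMap<>();
        for (Glyph g : glyphs) {
            layers.computeIfAbsent(layerKey(g.angleDegrees), k -> new ArrayList<>()).add(g);
        }
        return layers;
    }

    /** Step 2: rotate by -angle so the layer's text direction becomes the +x axis. */
    static Glyph unrotate(Glyph g, int angle) {
        double r = Math.toRadians(-angle);
        double x = g.x * Math.cos(r) - g.y * Math.sin(r);
        double y = g.x * Math.sin(r) + g.y * Math.cos(r);
        return new Glyph(g.text, x, y, 0);
    }

    /** Step 3 (stand-in): group glyphs into lines by baseline, then read left to right. */
    static String extractHorizontal(List<Glyph> glyphs, double lineTolerance) {
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> -g.y)); // top of page first
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : byY) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - g.y) <= lineTolerance) {
                last.add(g);
            } else {
                List<Glyph> line = new ArrayList<>();
                line.add(g);
                lines.add(line);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble(g -> g.x));
            if (sb.length() > 0) sb.append('\n');
            for (Glyph g : line) sb.append(g.text);
        }
        return sb.toString();
    }

    /** Run the three steps and return the extracted text per layer angle. */
    static Map<Integer, String> extract(List<Glyph> glyphs, double lineTolerance) {
        Map<Integer, String> out = new TreeMap<>();
        for (Map.Entry<Integer, List<Glyph>> e : splitIntoLayers(glyphs).entrySet()) {
            List<Glyph> upright = new ArrayList<>();
            for (Glyph g : e.getValue()) upright.add(unrotate(g, e.getKey()));
            out.put(e.getKey(), extractHorizontal(upright, lineTolerance));
        }
        return out;
    }
}
```

With this sketch, five glyphs stacked downward at the same x position and tagged with a direction of 270 degrees end up on a single baseline after rotation, so they come out as one string instead of five. Real PDFs would need some tolerance when bucketing angles; whole-degree rounding is just the simplest choice here.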