Wrong text extract from vertical textboxes in pdf files
-------------------------------------------------------

                 Key: PDFBOX-800
                 URL: https://issues.apache.org/jira/browse/PDFBOX-800
             Project: PDFBox
          Issue Type: Bug
         Environment: Win 7, VS 2010 C#
            Reporter: Sandor Dj
            Priority: Critical


I was told to move this issue to the PDFBox parser, so I hope this is the right 
section.
Vertical text boxes in PDF files are not extracted correctly (using the Tika 
library from C#).
For example, if a PDF file contains a vertical text box "hello" (WITHOUT 
line breaks):
H
E
L
L
O
the parser returns 5 strings, each containing a single letter, even though 
there is NO line break after any of the letters.
Is there an option to avoid this problem?
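
To make the expected result concrete, here is a minimal sketch of a 
post-processing step that would give the output I am after. It is only an 
illustration and not a PDFBox or Tika API: the class VerticalRunMerger and 
the method mergeSingleCharLines are hypothetical names, and the sketch 
assumes the extracted text arrives as one string with a line break after 
each letter. It joins consecutive one-character lines into a single line, so 
it would also merge lines that really do contain a single character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or Tika.
final class VerticalRunMerger {

    // Joins each run of consecutive one-character lines into a single line.
    // All other lines are passed through unchanged.
    static String mergeSingleCharLines(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        // The -1 limit keeps trailing empty lines, so a final newline survives.
        for (String line : text.split("\\r?\\n", -1)) {
            String trimmed = line.trim();
            if (trimmed.length() == 1) {
                run.append(trimmed);      // still inside a vertical run
            } else {
                flush(run, out);          // the run, if any, ends here
                out.add(line);
            }
        }
        flush(run, out);
        return String.join("\n", out);
    }

    // Emits the collected run as one line and resets the buffer.
    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Assumed shape of the extracted text for the example above.
        String extracted = "H\nE\nL\nL\nO";
        System.out.println(mergeSingleCharLines(extracted)); // prints HELLO
    }
}
```

A fix inside the parser would be preferable, because it can use the glyph 
positions to decide whether the letters belong to one text box, which a 
string-level step like this cannot do.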

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
