[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Sandor Dj (JIRA) Tue, 31 Aug 2010 01:03:21 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904550#action_12904550 ]


Sandor Dj commented on PDFBOX-800:
----------------------------------

Okay, I see the problem.
But what about the textbox "Hallo das ist ein vertikales TEXTFELD" (the 
first vertical one on the left side)? Why is this one not extracted correctly? 
Its text is simply rotated by 90°... 
We have some other PDF files with similar textboxes, and they are also 
extracted incorrectly.

> Wrong text extract from vertical textboxes in pdf files
> -------------------------------------------------------
>
>                 Key: PDFBOX-800
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-800
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>         Environment: Win 7, VS 2010 C#
>            Reporter: Sandor Dj
>         Attachments: problemdoc.doc, problemdoc.pdf
>
>
> I was told to move this issue to the PDFBox parser, so I hope this is the 
> right place.
> Vertical textboxes in PDF files are not extracted correctly (using the Tika 
> library in C#).
> For example, if a PDF file contains a vertical textbox "hello" (!WITHOUT! 
> line breaks):
> H
> E
> L
> L
> O
> the parser returns 5 strings, each with a single letter, even though there 
> is NO line break after every letter.
> Is there an option to avoid this problem?
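
Until the extraction itself handles such textboxes, a caller-side stopgap is to re-join runs of one-letter lines after extraction. The following is only a minimal sketch under that assumption: the class `VerticalTextMerger` and the method `mergeSingleCharLines` are hypothetical names, not part of PDFBox or Tika, and the input is assumed to be the extracted text already split into lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox or Tika.
public class VerticalTextMerger {

    // Joins each run of consecutive one-character lines into a single string.
    // All other lines (including empty ones) are passed through unchanged.
    public static List<String> mergeSingleCharLines(List<String> lines) {
        List<String> merged = new ArrayList<String>();
        StringBuilder run = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.length() == 1) {
                // Part of a suspected vertical textbox: collect the letter.
                run.append(trimmed);
            } else {
                // A normal line ends the current run, if there is one.
                if (run.length() > 0) {
                    merged.add(run.toString());
                    run.setLength(0);
                }
                merged.add(line);
            }
        }
        // Flush a run that reaches the end of the input.
        if (run.length() > 0) {
            merged.add(run.toString());
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> extracted = Arrays.asList("H", "E", "L", "L", "O", "next paragraph");
        System.out.println(mergeSingleCharLines(extracted));
    }
}
```

This heuristic cannot tell a vertical textbox from lines that legitimately hold one character, and it loses any word spacing inside the vertical text, so it is a workaround rather than a fix for the parser.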

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
