[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Jukka Zitting (JIRA) Fri, 27 Aug 2010 07:56:40 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903439#action_12903439 ]


Jukka Zitting commented on PDFBOX-800:
--------------------------------------

Hmm, actually PDFBox already extracts the "Hallo das ist ein 
anderes vertikales TEXTFELD" and "Hallo das ist ein horizontales TEXTFELD" 
sentences from the example document correctly.

Handling the vertical "Hallo" text boxes, where the letters are stacked top to 
bottom but each character is still horizontally oriented, is probably 
impossible unless there's some external hint that the text should be treated 
like vertical writing in Chinese or Japanese.
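
For illustration only, here is a rough sketch of how a caller could supply that
kind of hint on its own side, as a post-processing step over the already
extracted text. It is not part of PDFBox or Tika; the class name
VerticalTextMerger, the method mergeSingleCharLines and the MIN_RUN threshold
are all made up for this example. It assumes that a run of three or more
consecutive single-character lines is a stacked word, which will be wrong for
some documents (for example lists labelled a, b, c).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not a PDFBox or Tika class. */
public class VerticalTextMerger {

    // Assumed threshold: shorter runs are left alone to limit false positives.
    private static final int MIN_RUN = 3;

    /** Joins runs of consecutive one-character lines into a single line. */
    public static String mergeSingleCharLines(String text) {
        String[] lines = text.split("\\r?\\n", -1);
        List<String> out = new ArrayList<>();
        List<String> pending = new ArrayList<>(); // current run of 1-char lines

        for (String line : lines) {
            // Note: length() counts UTF-16 units, so characters outside the
            // Basic Multilingual Plane are not treated as single characters.
            if (line.trim().length() == 1) {
                pending.add(line);
            } else {
                flush(pending, out);
                out.add(line);
            }
        }
        flush(pending, out);
        return String.join("\n", out);
    }

    // Emits the pending run, merged if long enough, otherwise unchanged.
    private static void flush(List<String> pending, List<String> out) {
        if (pending.size() >= MIN_RUN) {
            StringBuilder merged = new StringBuilder();
            for (String s : pending) {
                merged.append(s.trim());
            }
            out.add(merged.toString());
        } else {
            out.addAll(pending);
        }
        pending.clear();
    }

    public static void main(String[] args) {
        // Sample input modelled on the reporter's description, not real output.
        String extracted = "Hallo das ist ein horizontales TEXTFELD\nH\nA\nL\nL\nO";
        System.out.println(mergeSingleCharLines(extracted));
    }
}
```

This only works on the extracted string, so it cannot tell a stacked word from
genuinely separate one-letter lines; any real fix would need layout information
from the PDF itself.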

> Wrong text extract from vertical textboxes in pdf files
> -------------------------------------------------------
>
>                 Key: PDFBOX-800
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-800
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>         Environment: Win 7, VS 2010 C#
>            Reporter: Sandor Dj
>         Attachments: problemdoc.doc, problemdoc.pdf
>
>
> I was told to move this issue to the pdfbox parser, so I hope this is the 
> right section.
> Vertical textboxes in PDF files are not extracted correctly (using the Tika 
> library in C#).
> For example, if there is a vertical textbox "hello" in a PDF file (!WITHOUT! 
> line breaks):
> H
> E
> L
> L
> O
> the parser returns 5 strings, each with a single letter, even though there 
> is NO line break after every letter.
> Is there an option to avoid this problem?

