[jira] Updated: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Sandor Dj (JIRA) Tue, 24 Aug 2010 01:18:47 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sandor Dj updated PDFBOX-800:
-----------------------------

    Attachment: problemdoc.pdf
                problemdoc.doc

As you can see, there are some vertical textboxes in the middle of the page (PDF 
file).
In the Office document from which the PDF file was created, there 
are NO line breaks.
But the text extraction returns separate strings, one for each letter.
Is it possible to avoid this?

Hope my problem is now comprehensible :)

> Wrong text extract from vertical textboxes in pdf files
> -------------------------------------------------------
>
>                 Key: PDFBOX-800
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-800
>             Project: PDFBox
>          Issue Type: Bug
>         Environment: Win 7, VS 2010 C#
>            Reporter: Sandor Dj
>         Attachments: problemdoc.doc, problemdoc.pdf
>
>
> I was told to move this issue to the pdfbox parser, so I hope this is the 
> right section.
> Vertical textboxes in PDF files are not extracted correctly (using the Tika 
> library in C#).
> For example, if there is a vertical textbox "hello" in a PDF file (!WITHOUT! 
> line breaks):
> H
> E
> L
> L
> O
> the parser returns 5 strings, each with a single letter, even though there is 
> NO line break after every letter.
> Is there an option to avoid this problem?
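
To make the symptom concrete, here is a minimal sketch of a post-processing
workaround for the H/E/L/L/O case described above. It is written in Java only
because PDFBox is a Java library. It does not use any PDFBox or Tika API and is
not a fix for the parser itself: the class VerticalTextJoiner and the method
joinSingleCharLines are hypothetical names, and the rule "consecutive lines
holding exactly one character belong to one vertical word" is an assumption
that may wrongly merge genuine one-character lines (list markers, for example).

```java
import java.util.ArrayList;
import java.util.List;

public class VerticalTextJoiner {

    // Hypothetical helper (not part of PDFBox or Tika): merges each run of
    // consecutive single-character lines into one line.
    public static String joinSingleCharLines(String extracted) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        // Limit -1 keeps trailing empty strings, so a final newline survives.
        for (String line : extracted.split("\\r?\\n", -1)) {
            String trimmed = line.trim();
            if (trimmed.length() == 1) {
                // Assumption: a one-character line is part of a vertical word.
                run.append(trimmed);
            } else {
                // Any other line (including an empty one) ends the current run.
                if (run.length() > 0) {
                    out.add(run.toString());
                    run.setLength(0);
                }
                out.add(line);
            }
        }
        if (run.length() > 0) {
            out.add(run.toString());
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        String extracted = "Title\nH\nE\nL\nL\nO\nNext paragraph";
        System.out.println(joinSingleCharLines(extracted));
    }
}
```

With the sample input in main, the five one-letter lines come back as the
single line HELLO, and the surrounding lines are left unchanged.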

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
