[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Mel Martinez (JIRA) Fri, 27 Aug 2010 07:36:18 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903430#action_12903430
 ]


Mel Martinez commented on PDFBOX-800:
-------------------------------------

This is a tricky problem to resolve.

This occurs because PDF is fundamentally not a structured data format.  It is a 
page rendering format.  That means that the letters:

H
E
L
L
O

may not be stored within the PDF as a contiguous character sequence (and 
probably are not).  Instead they exist as commands to 'render' each character 
on the page at the desired location and with the specified attributes (size, 
color, font, etc.).

The fact that you did not type a carriage return between them when creating 
the document doesn't really mean anything, as PDF has no concept of carriage 
returns.  In the PDF, text starts on the next line down simply because the 
next text object is rendered at coordinates that _look_ as if a carriage 
return were there.

When PDFBox 'extracts' text, all it is really doing is _rendering_ the PDF to a 
text file.  So it tries to guess, based on the character coordinates (and their 
proximity to other characters being rendered), when to insert whitespace 
characters such as spaces and carriage returns.

Basically, it is 'drawing' each page under the constraint that the only drawing 
tool is plain characters!
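
To make the guessing concrete, here is a minimal sketch of that kind of 
coordinate-based rendering.  It is not PDFBox's actual code: the Glyph record, 
the NaiveTextRenderer class and both thresholds are hypothetical names and 
values chosen only for illustration.  It groups characters into lines by their 
y coordinate and inserts a space wherever the horizontal gap is large.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NaiveTextRenderer {
    // Assumed thresholds, picked for illustration only.
    static final double LINE_TOLERANCE = 2.0; // max y difference within one line
    static final double SPACE_GAP = 3.0;      // horizontal gap treated as a space

    static String render(List<Glyph> glyphs) {
        List<Glyph> byRow = new ArrayList<>(glyphs);
        // PDF y coordinates grow upward, so larger y means higher on the page.
        byRow.sort((a, b) -> Double.compare(b.y(), a.y()));

        // Group glyphs whose y is close to the first glyph of the current line.
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : byRow) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y() - g.y()) <= LINE_TOLERANCE) {
                last.add(g);
            } else {
                List<Glyph> line = new ArrayList<>();
                line.add(g);
                lines.add(line);
            }
        }

        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble(Glyph::x)); // left to right
            Glyph prev = null;
            for (Glyph g : line) {
                if (prev != null && g.x() - (prev.x() + prev.width()) > SPACE_GAP) {
                    out.append(' ');
                }
                out.append(g.ch());
                prev = g;
            }
            out.append('\n'); // the "carriage return" is purely a guess from y
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A vertical "HELLO": same x, each letter 12 units lower than the last.
        List<Glyph> vertical = List.of(
                new Glyph('H', 100, 700, 7), new Glyph('E', 100, 688, 7),
                new Glyph('L', 100, 676, 7), new Glyph('L', 100, 664, 7),
                new Glyph('O', 100, 652, 7));
        System.out.print(render(vertical));
    }
}

// Hypothetical model of one "draw this character here" command.
record Glyph(char ch, double x, double y, double width) {}
```

With the vertical input in main, every letter lands on its own y coordinate, so 
the sketch emits each one on a separate line, which is the behaviour reported 
in this issue.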

So, a piece of vertical text like you have here is tricky because there is no 
inherent way for PDFBox to know for certain that the characters are meant to 
be contiguous within a single word.  For example, one could also have a page 
with the characters:

1
2
3
4
...

where that is meant to be a template for a list.  PDFBox can't really know the 
difference, and you wouldn't want that text to be extracted as "1234...".
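
The ambiguity can be shown with a small sketch.  This is not a proposed fix and 
not PDFBox code: PlacedChar, VerticalJoinSketch and the two thresholds are 
hypothetical, and the coordinates are made up.  It applies the obvious 
heuristic, "join characters stacked in the same column", and that heuristic 
treats the word and the list template identically.

```java
import java.util.ArrayList;
import java.util.List;

class VerticalJoinSketch {
    // Assumed thresholds, for illustration only.
    static final double X_TOLERANCE = 1.0;  // same column if x differs by at most this
    static final double MAX_LEADING = 15.0; // largest vertical step still joined

    // Joins characters stacked in one column into one string per run.
    static List<String> joinColumns(List<PlacedChar> chars) {
        List<PlacedChar> sorted = new ArrayList<>(chars);
        sorted.sort((a, b) -> Double.compare(b.y(), a.y())); // top of page first

        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        PlacedChar prev = null;
        for (PlacedChar c : sorted) {
            boolean sameRun = prev != null
                    && Math.abs(c.x() - prev.x()) <= X_TOLERANCE
                    && prev.y() - c.y() <= MAX_LEADING;
            if (!sameRun && current.length() > 0) {
                runs.add(current.toString());
                current.setLength(0);
            }
            current.append(c.ch());
            prev = c;
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    // Builds a column of characters at a fixed x, stepping down 12 units each.
    static List<PlacedChar> column(String text, double x) {
        List<PlacedChar> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(new PlacedChar(text.charAt(i), x, 700 - 12.0 * i));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(joinColumns(column("HELLO", 100))); // intended as one word
        System.out.println(joinColumns(column("1234", 100)));  // intended as list markers
    }
}

// Hypothetical positioned character; not a PDFBox type.
record PlacedChar(char ch, double x, double y) {}
```

Both inputs have the same geometry, so the heuristic joins both.  The word 
comes out as desired, but the list markers are merged into a single string as 
well, and nothing in the coordinates alone tells the two cases apart.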

Someone else might have an idea for a solution here, but I don't see an obvious 
one.


> Wrong text extract from vertical textboxes in pdf files
> -------------------------------------------------------
>
>                 Key: PDFBOX-800
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-800
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>         Environment: Win 7, VS 2010 C#
>            Reporter: Sandor Dj
>         Attachments: problemdoc.doc, problemdoc.pdf
>
>
> I was told to move this issue to the pdfbox parser, so I hope this is the 
> right section.
> Vertical textboxes in pdf files are not extracted correctly (using the tika 
> library in C#).
> For example if there is a vertical textbox "hello" in a pdf file (!WITHOUT! 
> line breaks):
> H
> E
> L
> L
> O
> the parser returns 5 strings, each with a single letter, even though there 
> is NO line break after every letter.
> Is there an option to avoid this problem?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
