[jira] [Comment Edited] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

Paul Slootweg (Jira) Thu, 22 Aug 2019 07:29:40 -0700


    [ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913364#comment-16913364 ]


Paul Slootweg edited comment on PDFBOX-4313 at 8/22/19 2:28 PM:
----------------------------------------------------------------

I am currently seeing a similar problem: a line of bold text is followed by a 
line of standard text, and the stripper appends the second line to the 
first.

This appears to happen because the bold font height is used when testing the 
standard line for overlap.

See the attached file `details.pdf`:

{{protected void writeString(String text, List<TextPosition> textPositions)}} 
receives `text` as "Quote / Invoice Number: AT-82081073PO Number: CS-20167 " 
even though the two fields are on separate lines.

The overlap() method should also look at the x position to determine what, if 
any, the overlap is.
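
To make the suggestion concrete, here is a minimal, self-contained sketch. It is not PDFBox code: the {{Chunk}} class, the method names, and all coordinates are made up for illustration, and it assumes a baseline y that grows downward with each chunk spanning [y - height, y]. It shows how testing the lower line with the bold height can report an overlap, and how an extra x check could reject the merge.

```java
/** Hypothetical sketch, not PDFBox code. Baseline y grows downward; a chunk spans [y - height, y]. */
class LineOverlapSketch {

    /** Minimal stand-in for a positioned text chunk (assumed fields). */
    static final class Chunk {
        final float x;
        final float y;
        final float height;

        Chunk(float x, float y, float height) {
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    /** True when baseline y2 lies inside [y1 - height1, y1] or baseline y1 lies inside [y2 - height2, y2]. */
    static boolean overlapsVertically(float y1, float height1, float y2, float height2) {
        return (y2 <= y1 && y2 >= y1 - height1)
                || (y1 <= y2 && y1 >= y2 - height2);
    }

    /** Same-line test that additionally requires the next chunk to start to the right of the previous one. */
    static boolean sameLine(Chunk previous, Chunk next, float heightUsedForNext) {
        boolean vertical = overlapsVertically(next.y, heightUsedForNext, previous.y, previous.height);
        boolean movesRight = next.x > previous.x;
        return vertical && movesRight;
    }

    public static void main(String[] args) {
        Chunk bold = new Chunk(300f, 100f, 14f);   // last chunk of the bold line (made-up numbers)
        Chunk regular = new Chunk(50f, 112f, 8f);  // first chunk of the line below

        // Testing the lower line with the bold height reports a vertical overlap ...
        System.out.println(overlapsVertically(regular.y, bold.height, bold.y, bold.height));
        // ... testing it with its own height does not ...
        System.out.println(overlapsVertically(regular.y, regular.height, bold.y, bold.height));
        // ... and the x check rejects the merge even when the bold height is used.
        System.out.println(sameLine(bold, regular, bold.height));
    }
}
```

The x check here is deliberately crude (it only asks whether the next chunk starts further right); a real fix would likely need a tolerance and would have to cope with right-to-left text.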

*PDFBox 2.0.16* using {{setSortByPosition(true)}}


> PDFTextStripper groups unrelated chunks into words
> --------------------------------------------------
>
>                 Key: PDFBOX-4313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.11
>            Reporter: Emilian Bold
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>         Attachments: 1536938716546.pdf, PDFBOX-4313-Test.pdf, 
> PDFBOX-4313-Test_sorted.txt, PDFBOX-4313-Test_unsorted.txt, PDFBOX-4313.pdf, 
> PDFBOX4313Test.java, PDFBOX4313Test.java, crop-fisa-sintetica.png, 
> details.pdf, pdfbox-words.png
>
>
> I have the text "10" and "11" and they get merged into the word "1110".
> Coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in this PDFTextStripper chunk:
> {{
>                     // test if our TextPosition starts after a new word would be expected to start
>                     if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
>                             && expectedStartOfNextWordX < positionX &&
>                             // only bother adding a space if the last character was not a space
>                             lastPosition.getTextPosition().getUnicode() != null
>                             && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
>                     {
>                         line.add(LineItem.getWordSeparator());
>                     }
> }}
> which seems to add a word separator only if the next char is "after" the 
> current word. It never expects that the next char might be "before" the 
> current word.
> I guess this could also be framed as an RTL problem, but the PDF is a plain 
> PDF, it just seems that Oracle Reports generates these chunks in the reverse 
> order.
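
The behaviour described in the issue can be reproduced outside PDFBox with a small sketch. Everything below is an assumption made for illustration: the {{Glyph}} class, the {{join}} method, the gap tolerance of 1.0, and the backward-jump rule are hypothetical and are not the PDFTextStripper implementation. The coordinates are the four x positions and the width listed in the issue description.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not PDFBox code: word separation with and without a backward-jump check. */
class WordSeparatorSketch {

    /** Minimal stand-in for a positioned glyph (not PDFBox's TextPosition). */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Joins glyphs in stream order, inserting a space on a forward gap and, optionally, on a backward jump. */
    static String join(List<Glyph> glyphs, float gapTolerance, boolean detectBackwardJump) {
        StringBuilder sb = new StringBuilder();
        Glyph last = null;
        for (Glyph g : glyphs) {
            if (last != null) {
                float expectedStartOfNextWordX = last.x + last.width + gapTolerance;
                // Forward-only rule: the glyph starts after a new word would be expected to start.
                boolean forwardGap = expectedStartOfNextWordX < g.x;
                // Assumed extra rule: the glyph ends before the previous glyph even starts.
                boolean backwardJump = detectBackwardJump && g.x + g.width < last.x;
                if ((forwardGap || backwardJump) && !last.text.endsWith(" ")) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            last = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        float w = 4.447998f;
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("1", 575.36f, w),
                new Glyph("1", 579.752f, w),
                new Glyph("1", 526.2f, w),
                new Glyph("0", 530.59204f, w));

        System.out.println(join(glyphs, 1.0f, false)); // forward-only check
        System.out.println(join(glyphs, 1.0f, true));  // with backward-jump check
    }
}
```

With the forward-only rule the four glyphs come out as "1110"; with the backward-jump rule they come out as "11 10". The sketch only separates the words and keeps stream order, so it does not by itself put "10" before "11".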


