[jira] [Updated] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-19 Thread Carlos Alfonso Maya (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Alfonso Maya updated PDFBOX-5529:

Attachment: image-2022-10-19-16-48-36-198.png

> Wrong Text Extraction - Unwanted Extra Spaces in the middle of words
> 
>
> Key: PDFBOX-5529
> URL: https://issues.apache.org/jira/browse/PDFBOX-5529
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.0.6, 2.0.7,
> 2.0.8, 2.0.9, 2.0.10, 2.0.11, 2.0.12, 2.0.13, 2.0.14, 2.0.15, 2.0.16, 2.0.17,
> 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22, 2.0.23, 2.0.24, 2.0.25, 2.0.26, 2.0.27
> Reporter: Carlos Alfonso Maya
> Priority: Major
> Attachments: image-2022-10-18-15-53-06-512.png,
> image-2022-10-18-16-23-00-123.png, image-2022-10-18-16-26-15-001.png,
> image-2022-10-19-16-48-36-198.png
>
>
> *Overview:* 
> We use PDFBox as a third-party library to extract text from financial PDF
> documents.
> We have used PDFBox for a long time, and we have found a text extraction
> problem in PDFs from a customer.
> Because the PDFs contain customer data, we cannot share them. They are also
> signed, so we cannot edit them.
> *Description of the problem:*
> Opening the PDF in Adobe Reader shows several cases like the one in the
> following screenshot:
> !image-2022-10-18-15-53-06-512.png|width=221,height=211!
> Visually, the text appears to contain extra spaces, but when we copy the text
> from Adobe Reader and paste it into a text editor, there are no extra spaces.
> The following is the output that PDFBox generates when extracting the text:
> {code}
> Da te
> In v oice number
> Ou r r eference
> You r reference
> Con tact person{code}
> (!) *Important note: this behavior is present in all versions of PDFBox.*
> *Analysis:*
> We downloaded the PDFBox 2.0.27 source code (and also checked 2.0.26, 2.0.25
> and 2.0.24). While testing and debugging, we found that the method
> _*writePage()* in *PDFTextStripper.java*_ declares a list of objects:
> {code:java}
> List<LineItem> line = new ArrayList<LineItem>();{code}
> The code subsequently adds elements to the list:
> {code:java}
> line.add(LineItem.getWordSeparator());
> // ...
> line.add(new LineItem(position));{code}
>  
> At some point, it passes the list as an argument in the following
> statement:
> {code:java}
> writeLine(normalize(line));{code}
> (!) *The important point about this list, "line", is that "LineItem" objects
> with NULL values are somehow being inserted into it, and these values are at
> some point interpreted as blank spaces, causing the behavior described above.*
> Here are screenshots of how this appears in the debugger:
> !image-2022-10-18-16-23-00-123.png|width=621,height=195!
> !image-2022-10-18-16-26-15-001.png|width=620,height=431!
>  
> We looked for a method that manipulates this list and that we could
> override, but all the methods that modify or access the list are
> protected.
>  
> (!) *This is an example of how the content stream is displayed in the PDF Debugger:*
> {code}
>     q
>       94.525 545.32 141 11.2 re
>       W*
>       n
>       BT
>         /F3 8.8 Tf
>         1 0 0 1 99.325 547.72 Tm
>         0 g
>         0 G
>         [ (D) 22 (a) -131 (t) -109 (e) ] TJ
>       ET
>     Q 
>     q
>       94.525 530.9 141 11.225 re
>       W*
>       n
>       BT
>         /F3 8.8 Tf
>         1 0 0 1 99.325 533.3 Tm
>         0 G
>         [ (I) 26 (n) -135 (v) -229 (o) -5 (i) 20 (ce) -62 ( ) 59 (n) -44 (u) 30 (m) -27 (b) -75 (e) 28 (r) ] TJ
>       ET
>     Q
>     q
>       94.525 516.5 141 11.2 re
>       W*
>       n
>       BT
>         /F3 8.8 Tf
>         1 0 0 1 99.325 519.7 Tm
>         0 G
>         [ (O) -73 (u) -151 (r) -44 ( ) 59 (r) -134 (e) 28 (f) -38 (e) 28 (r) -44 (e) 28 (n) -44 (ce) ] TJ
>       ET
>     Q{code}
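
The TJ arrays above suggest where the extra spaces may come from. In the PDF specification, each number in a TJ array is subtracted from the current horizontal position, in thousandths of a text-space unit, so a large negative number leaves a visible gap before the next glyph. The sketch below is a simplified model, not PDFBox's actual algorithm: the class name `TjGapSketch` and the cut-off `GAP_THRESHOLD` of 120 are hypothetical values chosen for illustration. With that cut-off, the three arrays shown above yield the same strings that the report lists as extraction output.

```java
public class TjGapSketch {
    // Hypothetical cut-off in thousandths of a text-space unit.
    // This is an assumption for illustration, NOT the rule PDFBox applies.
    static final double GAP_THRESHOLD = 120.0;

    /** Horizontal shift caused by one TJ number at the given font size. */
    static double shift(double adjustment, double fontSize) {
        // The number is subtracted from the current position, so a negative
        // adjustment moves the next glyph to the right (a wider gap).
        return -adjustment / 1000.0 * fontSize;
    }

    /** Joins the strings of a TJ array, adding a space wherever a gap exceeds the cut-off. */
    static String extract(Object[] tjArray) {
        StringBuilder out = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                out.append((String) element);
            } else if (element instanceof Number) {
                double adjustment = ((Number) element).doubleValue();
                if (-adjustment > GAP_THRESHOLD) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Object[] date = {"D", 22, "a", -131, "t", -109, "e"};
        Object[] invoice = {"I", 26, "n", -135, "v", -229, "o", -5, "i", 20, "ce", -62,
                " ", 59, "n", -44, "u", 30, "m", -27, "b", -75, "e", 28, "r"};
        Object[] reference = {"O", -73, "u", -151, "r", -44, " ", 59, "r", -134, "e", 28,
                "f", -38, "e", 28, "r", -44, "e", 28, "n", -44, "ce"};

        System.out.println(extract(date));
        System.out.println(extract(invoice));
        System.out.println(extract(reference));
        // Gap in text-space units after "v" at the 8.8 font size used above.
        System.out.println(shift(-229, 8.8));
    }
}
```

If gaps of this kind are the cause, the tolerance setters on `PDFTextStripper` (`setSpacingTolerance` and `setAverageCharTolerance`) may be worth experimenting with. Whether they help for this particular document is untested here.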
>  
>  
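
One possible reading of the NULL entries described in the analysis is that they are separator markers rather than corrupt data. The snippet in the report calls `LineItem.getWordSeparator()`, and a marker item with nothing attached to it would look like a NULL entry in a debugger. The sketch below models that idea only. The `Item` class and its fields are hypothetical stand-ins, and the assumption that the real `LineItem` works this way is not confirmed by the report.

```java
import java.util.ArrayList;
import java.util.List;

public class LineItemSketch {
    /** Hypothetical stand-in for LineItem: a null payload marks a word separator. */
    static final class Item {
        static final Item WORD_SEPARATOR = new Item(null);
        final String text;

        Item(String text) {
            this.text = text;
        }

        boolean isWordSeparator() {
            return text == null;
        }
    }

    /** Writes the items in order, emitting the separator string for each marker item. */
    static String writeLine(List<Item> line, String wordSeparator) {
        StringBuilder out = new StringBuilder();
        for (Item item : line) {
            out.append(item.isWordSeparator() ? wordSeparator : item.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Item> line = new ArrayList<>();
        line.add(new Item("D"));
        line.add(new Item("a"));
        line.add(Item.WORD_SEPARATOR); // added when a gap is judged to be a word break
        line.add(new Item("t"));
        line.add(new Item("e"));
        System.out.println(writeLine(line, " "));
    }
}
```

Under this reading, the NULL entries are a symptom: the real question is why a gap inside a word was classified as a word break before the marker was added.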



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-18 Thread Carlos Alfonso Maya (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Alfonso Maya updated PDFBOX-5529:

Description: 

[jira] [Updated] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-18 Thread Carlos Alfonso Maya (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Alfonso Maya updated PDFBOX-5529:

Description: 