[ https://issues.apache.org/jira/browse/PDFBOX-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carlos Alfonso Maya updated PDFBOX-5529:
----------------------------------------
    Attachment: image-2022-10-19-16-48-36-198.png

> Wrong Text Extraction - Unwanted Extra Spaces in the middle of words
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-5529
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5529
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.0.6, 2.0.7,
> 2.0.8, 2.0.9, 2.0.10, 2.0.11, 2.0.12, 2.0.13, 2.0.14, 2.0.15, 2.0.16, 2.0.17,
> 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22, 2.0.23, 2.0.24, 2.0.25, 2.0.26, 2.0.27
>            Reporter: Carlos Alfonso Maya
>            Priority: Major
>         Attachments: image-2022-10-18-15-53-06-512.png,
> image-2022-10-18-16-23-00-123.png, image-2022-10-18-16-26-15-001.png,
> image-2022-10-19-16-48-36-198.png
>
>
> *Overview:*
> We use PDFBox as a third-party library to extract text from financial PDF
> documents.
> We have used PDFBox for a long time, and we have detected a text extraction
> problem with PDFs from a customer.
> Because the PDFs contain customer data, we cannot share them. They are also
> signed, so we cannot edit them.
> *Description of the problem:*
> When we open the PDF in Adobe Reader, we see several cases like the
> following screenshot:
> !image-2022-10-18-15-53-06-512.png|width=221,height=211!
> Visually the text appears to contain extra spaces, but if we copy the text
> from Adobe Reader and paste it into a text editor, there are no extra spaces.
> The following is the output that PDFBox generates during text extraction:
> {code:java}
> Da te
> In v oice number
> Ou r r eference
> You r reference
> Con tact person{code}
> (!)
> *Important note: this behavior is present in all versions of PDFBox.*
> *Analysis:*
> By downloading the PDFBox 2.0.27 source code (we also checked 2.0.26,
> 2.0.25 and 2.0.24) and debugging it, we found that the method
> _*writePage()* inside *PDFTextStripper.java*_ declares a list of objects:
> {code:java}
> List<LineItem> line = new ArrayList<LineItem>();{code}
> The code subsequently adds elements to the list:
> {code:java}
> line.add(LineItem.getWordSeparator());
> .
> .
> .
> line.add(new LineItem(position));{code}
>
> At some point it passes the list as a parameter in the following
> statement:
> {code:java}
> writeLine(normalize(line));{code}
> (!) *The important point about this list called "line" is that somehow
> "LineItem" objects holding NULL values are inserted into it, and these values
> are at some point interpreted as blank spaces, causing the behavior
> described above.*
> Here is a screenshot of how it appears in the debugger:
> !image-2022-10-18-16-23-00-123.png|width=621,height=195!
> !image-2022-10-18-16-26-15-001.png|width=620,height=431!
>
> We tried to find a method that manipulates this list and that we could
> override, but all of the methods that modify or access the list are
> protected.
>
> (!)
> *This is an example of how it is displayed in the PDF Debugger:*
> {code:java}
> q
> 94.525 545.32 141 11.2 re
> W*
> n
> BT
> /F3 8.8 Tf
> 1 0 0 1 99.325 547.72 Tm
> 0 g
> 0 G
> [ (D) 22 (a) -131 (t) -109 (e) ] TJ
> ET
> Q
> q
> 94.525 530.9 141 11.225 re
> W*
> n
> BT
> /F3 8.8 Tf
> 1 0 0 1 99.325 533.3 Tm
> 0 G
> [ (I) 26 (n) -135 (v) -229 (o) -5 (i) 20 (ce) -62 ( ) 59 (n) -44 (u) 30 (m) -27 (b) -75 (e) 28 (r) ] TJ
> ET
> Q
> q
> 94.525 516.5 141 11.2 re
> W*
> n
> BT
> /F3 8.8 Tf
> 1 0 0 1 99.325 519.7 Tm
> 0 G
> [ (O) -73 (u) -151 (r) -44 ( ) 59 (r) -134 (e) 28 (f) -38 (e) 28 (r) -44 (e) 28 (n) -44 (ce) ] TJ
> ET
> Q{code}
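
The content stream above hints at where the extra spaces may come from. In a TJ array, each number shifts the next glyph by -(n / 1000) x font size in text space, so a negative number opens a gap and a positive one tightens the spacing. The following is a minimal sketch of that arithmetic, not PDFBox code: the class `TjGapSketch`, its methods, and the 1.0-unit gap threshold are hypothetical, and the threshold was picked only because it happens to reproduce the reported output for the three arrays shown. It assumes 100% horizontal scaling and the unscaled text matrix used in the stream.

```java
public class TjGapSketch {

    /**
     * Horizontal shift caused by one TJ number, in text space units.
     * A negative adjustment moves the next glyph to the right (a gap).
     * Assumes horizontal scaling of 100%.
     */
    public static double shiftInTextUnits(double adjustment, double fontSize) {
        return -adjustment / 1000.0 * fontSize;
    }

    /**
     * Rebuilds the text of a TJ array, inserting a space wherever the gap
     * opened by an adjustment exceeds gapThreshold (hypothetical heuristic,
     * not the one PDFTextStripper actually uses).
     */
    public static String rebuild(Object[] tjArray, double fontSize, double gapThreshold) {
        StringBuilder text = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                text.append((String) element);
            } else if (element instanceof Number) {
                double shift = shiftInTextUnits(((Number) element).doubleValue(), fontSize);
                if (shift > gapThreshold) {
                    text.append(' ');
                }
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // TJ arrays copied from the content stream in the report; font size 8.8 from "/F3 8.8 Tf".
        Object[] date = {"D", 22, "a", -131, "t", -109, "e"};
        Object[] invoice = {"I", 26, "n", -135, "v", -229, "o", -5, "i", 20, "ce", -62, " ",
                59, "n", -44, "u", 30, "m", -27, "b", -75, "e", 28, "r"};

        System.out.println(rebuild(date, 8.8, 1.0));     // Da te
        System.out.println(rebuild(invoice, 8.8, 1.0));  // In v oice number
        System.out.println(rebuild(date, 8.8, 2.5));     // Date (a looser threshold ignores the kerning gaps)
    }
}
```

With this threshold, -131 (about 1.15 units at 8.8 pt) counts as a word gap while -109 (about 0.96 units) does not, which matches "Da te". If the real cause is along these lines, the tolerances exposed by `PDFTextStripper.setSpacingTolerance(float)` and `PDFTextStripper.setAverageCharTolerance(float)` may be worth experimenting with; whether they help with this particular file is untested here.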