[ https://issues.apache.org/jira/browse/PDFBOX-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897067#comment-15897067 ]
Roman edited comment on PDFBOX-3710 at 3/6/17 10:38 AM:
--------------------------------------------------------

OK, I found an ugly solution: I've overridden the whole *showGlyph()* method from the *LegacyPDFStreamEngine* class. (I also had to redeclare 4 more class member fields, one function for calculating them, and a constructor.)
So this solution has a performance overhead and involves a lot of copy-pasting. At the same time, it is intended to do very little, just to avoid returning in this piece of code:
{code}
        // when there is no Unicode mapping available, Acrobat simply coerces the character code
        // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
        // this, which is why we leave it until this point in PDFTextStreamEngine.
        if (unicode == null)
        {
            if (font instanceof PDSimpleFont)
            {
                char c = (char) code;
                unicode = new String(new char[] { c });
            }
            else
            {
                // Acrobat doesn't seem to coerce composite font's character codes, instead it
                // skips them. See the "allah2.pdf" TestTextStripper file.
                return;
            }
        }
{code}

now changed to:
{code}
        // when there is no Unicode mapping available, Acrobat simply coerces the character code
        // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
        // this, which is why we leave it until this point in PDFTextStreamEngine.
        if (unicode == null)
        {
//            if (font instanceof PDSimpleFont)
//            {
                char c = (char) code;
                unicode = new String(new char[] { c });
//            }
//            else
//            {
//                // Acrobat doesn't seem to coerce composite font's character codes, instead it
//                // skips them. See the "allah2.pdf" TestTextStripper file.
//                // return;
//            }
        }
{code}

My only remaining question: can you tweak the LegacyPDFStreamEngine class to be more flexible? For example, we could add a new public overridable boolean method *deepLegacy*, as here:
{code}
        // when there is no Unicode mapping available, Acrobat simply coerces the character code
        // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
        // this, which is why we leave it until this point in PDFTextStreamEngine.
        if (unicode == null)
        {
            if (deepLegacy() || font instanceof PDSimpleFont)
            {
                char c = (char) code;
                unicode = new String(new char[] { c });
            }
            else
            {
                // Acrobat doesn't seem to coerce composite font's character codes, instead it
                // skips them. See the "allah2.pdf" TestTextStripper file.
                return;
            }
        }
{code}


> Text Stripper in 2.0 lost some texts - regression
> -------------------------------------------------
>
>                 Key: PDFBOX-3710
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3710
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>         Attachments: highlight19.pdf_page1-marked-1.png, highlight19.pdf_page1.pdf, regression_in_blue.png
>
>
> After migrating our App from pdfbox 1.8 to 2.0, we noticed a regression: 4
> lines of text have disappeared. These are the texts that follow a black bullet
> (3 lines) and also the word "OVERALL", which is placed above them in a table.
> Problematic PDF attached - [^highlight19.pdf_page1.pdf]
> Also attached is the result of the
> [DrawPrintTextLocations|https://apache.googlesource.com/pdfbox/+/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/DrawPrintTextLocations.java]
> example -
> [highlight19.pdf_page1-marked-1.png|https://issues.apache.org/jira/secure/attachment/12856229/highlight19.pdf_page1-marked-1.png]
> Notice that the unicodes and the red and blue boxes are missing for the problematic text. The
> main problem is that these glyphs are absent from the *textPositions* parameter which
> is passed to the *writeString* function, line #275.
> In the 1.8 version these characters ARE present, so their positions along with their char codes could
> be extracted fine in our App.
> Also attached is a picture of the regression in our App - [^regression_in_blue.png].
> Here, blue boxes are drawn where text WAS present and disappeared afterwards.
> (The purple boxes are OK and should be ignored.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
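
The *deepLegacy* hook proposed in the comment could be sketched in isolation roughly as follows. This is only an illustration under stated assumptions: `Font`, `SimpleFont`, `CompositeFont`, `GlyphDecoder` and `CoercingGlyphDecoder` are hypothetical stand-ins invented here, not PDFBox classes, and `deepLegacy()` does not exist in PDFBox; it is the method the comment asks for. The stand-in decoder returns an empty `Optional` where the real *showGlyph()* would `return` early and skip the glyph.

```java
import java.util.Optional;

// Hypothetical stand-ins for PDFont / PDSimpleFont / composite fonts (not PDFBox classes).
interface Font {}
final class SimpleFont implements Font {}
final class CompositeFont implements Font {}

class GlyphDecoder {
    // Proposed hook (an assumption, not an existing PDFBox method): subclasses
    // return true to coerce character codes into Unicode for every font type.
    protected boolean deepLegacy() {
        return false;
    }

    // Returns the text for a glyph, or empty when the glyph should be skipped.
    Optional<String> resolveUnicode(String unicode, int code, Font font) {
        if (unicode != null) {
            return Optional.of(unicode); // a Unicode mapping exists, use it
        }
        if (deepLegacy() || font instanceof SimpleFont) {
            // No mapping: coerce the character code into a one-char string.
            return Optional.of(String.valueOf((char) code));
        }
        // Default behaviour for composite fonts: skip the glyph.
        return Optional.empty();
    }
}

// A subclass opts in to coercion by overriding a single method,
// instead of copying the whole glyph-handling method.
class CoercingGlyphDecoder extends GlyphDecoder {
    @Override
    protected boolean deepLegacy() {
        return true;
    }
}
```

With a hook of this shape, a subclass that wants the coerced characters for composite fonts would only override `deepLegacy()`, while the default behaviour for everyone else stays unchanged.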