[ https://issues.apache.org/jira/browse/PDFBOX-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897725#comment-15897725 ]
Tilman Hausherr commented on PDFBOX-3710:
-----------------------------------------

It's not a fault, it's a feature: in 2.0 only entries with unicode are used.
{quote}
But this worked in 1.8.
{quote}
That is an illusion: it didn't. That some of it worked in 1.8 is an excellent example of why that extraction can't be trusted: look at the G1 font in PDFDebugger. Code 33 displays as "A", but unicode 33 is "!". Code 65 of the same font displays as "a", but unicode 65 is "A", so you can't just use the code.

Now your problem is that you want the dimensions and don't get them if there is no text extraction. There are two things you can do:
- use only what you named "a separate cycle"; that is a bit slower but brings the most accurate results on size;
- similar to what you did, clone PDFTextStripper and LegacyPDFStreamEngine, and change the part that skips glyphs whose unicode is missing. I wonder if it wouldn't be better to remove almost everything from PDFTextStripper when only the sizes are needed.

Re your suggestion - yes, this would make sense. I need a better name for the method. "deepLegacy" feels weird to me, and this would be for 2.0.6 so that this isn't done at the last minute. Alternatively, just a mention in the FAQ.

> Text Stripper in 2.0 lost some texts - regression
> -------------------------------------------------
>
>                 Key: PDFBOX-3710
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3710
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>         Attachments: highlight19.pdf_page1-marked-1.png, highlight19.pdf_page1.pdf, regression_in_blue.png
>
>
> After migration of our App from pdfbox 1.8 to 2.0, we noticed a regression: 4 lines of text have disappeared. Those are the texts followed by a black bullet (3 lines) and also the word "OVERALL", which is placed above in the table.
> Problematic PDF attached - [^highlight19.pdf_page1.pdf]
> Also attached is the result of the [DrawPrintTextLocations|https://apache.googlesource.com/pdfbox/+/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/DrawPrintTextLocations.java] example - [highlight19.pdf_page1-marked-1.png|https://issues.apache.org/jira/secure/attachment/12856229/highlight19.pdf_page1-marked-1.png]
> Notice that the unicodes and the red and blue boxes are missing for the problematic text. The main problem is that these glyphs are absent from the *textPositions* parameter which is passed to the *writeString* function, line #275. In the 1.8 version these characters ARE present, so their positions along with their char codes could be extracted fine in our App.
> Also attached is a picture of the regression in our App - [^regression_in_blue.png]. Here, blue boxes are drawn where text WAS present and disappeared afterwards. (The purple boxes are OK and should be ignored.)
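
The point in the comment about character codes versus unicode can be illustrated with a small, self-contained Java sketch. It does not use PDFBox at all: the class `CodeVersusUnicode`, the table `DRAWN_GLYPH` and both helper methods are hypothetical names made up for this illustration, and the two table entries are simply the values quoted for the G1 font. `mappedOnly` is only a rough model of "only entries with unicode are used", not the actual PDFBox code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CodeVersusUnicode {

    /** Hypothetical table: what the font draws for each code (values quoted in the comment). */
    static final Map<Integer, String> DRAWN_GLYPH = Map.of(33, "A", 65, "a");

    /** Naive approach: treat the character code as if it were a unicode code point. */
    static String naiveText(int code) {
        return new String(Character.toChars(code));
    }

    /** Rough model of the stricter behaviour: keep only codes that have a unicode mapping. */
    static List<String> mappedOnly(List<Integer> codes, Map<Integer, String> toUnicode) {
        List<String> out = new ArrayList<>();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode != null) {
                out.add(unicode);
            }
            // Codes without a mapping are skipped, so nothing is reported for them.
        }
        return out;
    }

    public static void main(String[] args) {
        for (int code : List.of(33, 65)) {
            System.out.println("code " + code
                    + ": drawn as " + DRAWN_GLYPH.get(code)
                    + ", naive reading gives " + naiveText(code));
        }
        // An empty mapping models a font that carries no unicode information.
        System.out.println("mapped only: " + mappedOnly(List.of(33, 65), Map.of()));
    }
}
```

In this sketch the naive reading never agrees with what is drawn, and the mapped-only reading returns nothing for a font without unicode information. That matches the trade-off described in the comment: either untrustworthy text, or no text (and therefore no positions) for such glyphs.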