[jira] [Commented] (PDFBOX-3710) Text Stripper in 2.0 lost some texts - regression

Tilman Hausherr (JIRA) Mon, 06 Mar 2017 09:46:02 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897725#comment-15897725
 ]


Tilman Hausherr commented on PDFBOX-3710:
-----------------------------------------

It's not a fault, it's a feature: in 2.0, only entries that have a Unicode 
mapping are used.

{quote}
But this worked in 1.8.
{quote}

That is an illusion: it didn't. That some of it worked in 1.8 is an excellent 
example of why that extraction can't be trusted. Look at the G1 font in 
PDFDebugger: code 33 displays as "A", but Unicode 33 is "!". Code 65 of the 
same font displays as "a", but Unicode 65 is "A". So you can't just use the 
code.
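
The mismatch can be shown with a small standalone sketch. It does not use 
PDFBox; the class and method names are hypothetical, and the table holds only 
the two codes of font G1 mentioned above.

```java
import java.util.Map;

/**
 * Toy model of a simple font whose character codes do not match Unicode.
 * The two entries are the ones described above for font G1; the class and
 * method names are hypothetical and not part of PDFBox.
 */
final class CodeVsUnicode {

    // Character code -> glyph actually drawn, as observed in PDFDebugger.
    private static final Map<Integer, String> DISPLAYED_GLYPH = Map.of(
            33, "A",
            65, "a");

    /** What you get by treating the character code as a Unicode code point. */
    static String coerceCodeToUnicode(int code) {
        return new String(Character.toChars(code));
    }

    /** What the page actually shows, or null if the code is not in this toy table. */
    static String displayedGlyph(int code) {
        return DISPLAYED_GLYPH.get(code);
    }

    public static void main(String[] args) {
        for (int code : new int[] {33, 65}) {
            System.out.printf("code %d: coerced=%s displayed=%s%n",
                    code, coerceCodeToUnicode(code), displayedGlyph(code));
        }
    }
}
```

For both codes the coerced value differs from what is displayed ("!" versus 
"A", and "A" versus "a"), which is why text built from the raw codes of such a 
font is wrong.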

Now your problem is that you want the dimensions and don't get them when no 
text is extracted. There are two things you can do:
- use only what you called "a separate cycle"; that is a bit slower but gives 
the most accurate sizes;
- similar to what you did, clone PDFTextStripper and LegacyPDFStreamEngine, and 
change the part that skips glyphs whose Unicode is missing. I wonder whether 
it wouldn't be better to remove almost everything from PDFTextStripper when 
only the sizes are needed. 
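
The difference the second option makes can be modelled roughly as follows. 
This is only a toy sketch under stated assumptions: it does not use PDFBox, 
every name in it is hypothetical, and the widths and heights are made-up 
illustration values.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model of the two behaviours; none of these names exist in PDFBox.
 */
final class GlyphSizeSketch {

    /** Hypothetical glyph event: unicode is null when the font has no mapping. */
    static final class Glyph {
        final int code;
        final String unicode;
        final float width;
        final float height;

        Glyph(int code, String unicode, float width, float height) {
            this.code = code;
            this.unicode = unicode;
            this.width = width;
            this.height = height;
        }
    }

    /** Models the text path described above: glyphs without Unicode are skipped. */
    static List<Glyph> keepForText(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (g.unicode != null) {
                kept.add(g);
            }
        }
        return kept;
    }

    /** Models the modified clone: every glyph is kept so that its size is available. */
    static List<Glyph> keepForSizes(List<Glyph> glyphs) {
        return new ArrayList<>(glyphs);
    }

    public static void main(String[] args) {
        List<Glyph> page = List.of(
                new Glyph(65, "A", 7.2f, 10.0f),    // has a Unicode mapping (made-up size)
                new Glyph(33, null, 6.1f, 10.0f));  // no Unicode mapping (made-up size)
        System.out.println("text path keeps " + keepForText(page).size() + " glyph(s)");
        System.out.println("size path keeps " + keepForSizes(page).size() + " glyph(s)");
    }
}
```

In the size path the glyph without Unicode still carries its width and height, 
but its text remains unknown and must not be guessed from the code.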

Re your suggestion: yes, this would make sense. I need a better name for the 
method; "deepLegacy" feels weird to me. This would be for 2.0.6, so that it 
isn't done at the last minute. Alternatively, just a mention in the FAQ.

> Text Stripper in 2.0 lost some texts - regression
> -------------------------------------------------
>
>                 Key: PDFBOX-3710
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3710
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Roman
>         Attachments: highlight19.pdf_page1-marked-1.png, 
> highlight19.pdf_page1.pdf, regression_in_blue.png
>
>
> After migrating our app from PDFBox 1.8 to 2.0, we noticed a regression: 4 
> lines of text have disappeared. Those are the texts followed by a black 
> bullet (3 lines) and also the word "OVERALL", which is placed above in the 
> table.
> Problematic PDF attached - [^highlight19.pdf_page1.pdf]
> Also attached is the result of the 
> [DrawPrintTextLocations|https://apache.googlesource.com/pdfbox/+/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/DrawPrintTextLocations.java]
>  example - 
> [highlight19.pdf_page1-marked-1.png|https://issues.apache.org/jira/secure/attachment/12856229/highlight19.pdf_page1-marked-1.png]
> Notice that the unicodes and the red and blue boxes are missing for the 
> problematic text. The main problem is that these glyphs are absent from the 
> *textPositions* parameter which is passed to the *writeString* function, 
> line #275. In version 1.8 these characters ARE present, so their positions 
> along with their char codes could be extracted fine in our app.
> Also attached is a picture of the regression in our app - 
> [^regression_in_blue.png]. Here, blue boxes are drawn where text WAS present 
> and disappeared afterwards. 
> (The purple boxes are OK and should be ignored.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
