[jira] [Comment Edited] (PDFBOX-4355) PDFTextStripperByArea dies on Chinese/Japanese files

Tilman Hausherr (JIRA) Thu, 18 Oct 2018 23:15:20 -0700


    [ https://issues.apache.org/jira/browse/PDFBOX-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656318#comment-16656318 ]


Tilman Hausherr edited comment on PDFBOX-4355 at 10/19/18 6:14 AM:
-------------------------------------------------------------------

A Type 3 font defines each glyph as its own PDF content stream. In this file, 
some glyphs have a completely empty content stream, which is unexpected. The 
error occurs when the width of the space character is gathered. Text extraction 
needs that width to decide whether two glyphs belong to different words when no 
space character is present. The width table gives a width of 0, so the code 
tried to read the width from the glyph's content stream itself (where the first 
operator usually gives the width), and that stream was empty.
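
To illustrate why the space width matters for word separation, here is a 
minimal sketch. It is not PDFBox's actual code: the class WordGapSketch, the 
Glyph type and the 0.5 factor are all assumptions made for illustration. It 
inserts a space whenever the gap between two glyphs exceeds a fraction of the 
space width.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the PDFBox implementation.
class WordGapSketch {
    /** A glyph with its text, start position and advance width (same units). */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Assumed factor: a gap wider than this fraction of the space width
    // is treated as a word break.
    static final float GAP_FACTOR = 0.5f;

    /** Joins glyphs left to right, adding a space where the gap is large. */
    static String joinGlyphs(List<Glyph> glyphs, float spaceWidth) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth * GAP_FACTOR) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("a", 0f, 5f),
                new Glyph("b", 5.5f, 5f),    // gap of 0.5 after "a"
                new Glyph("c", 13.5f, 5f));  // gap of 3 after "b"
        System.out.println(joinGlyphs(glyphs, 4f)); // prints "ab c"
        System.out.println(joinGlyphs(glyphs, 0f)); // prints "a b c"
    }
}
```

With a space width of 0, every positive gap looks like a word break, which is 
why a zero entry in the width table cannot be used as is.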

It is possible that the order in which the width sources are consulted was 
changed between versions.
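
The lookup order described above (width table first, then the glyph's content 
stream, then a default) could be sketched roughly as follows. This is a 
simplified illustration and not the code in PDFont or PDType3CharProc: the 
class SpaceWidthSketch and its methods are hypothetical, the content stream is 
treated as a plain whitespace-separated string, and the default of 250 is taken 
from the log message in this issue. It relies on the PDF rule that a Type 3 
glyph stream starts with either "wx wy d0" or "wx wy llx lly urx ury d1".

```java
import java.io.IOException;

// Hypothetical sketch, not the PDFBox implementation.
class SpaceWidthSketch {
    // Fallback value, as reported in the log message ("assuming 250").
    static final float DEFAULT_SPACE_WIDTH = 250f;

    static boolean isNumber(String token) {
        return token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    /** Reads wx from the leading d0 or d1 operator of a glyph stream. */
    static float widthFromCharProc(String stream) throws IOException {
        String trimmed = stream.trim();
        if (trimmed.isEmpty()) {
            throw new IOException("Unexpected end of stream");
        }
        String[] tokens = trimmed.split("\\s+");
        int operands = 0;
        while (operands < tokens.length && isNumber(tokens[operands])) {
            operands++;
        }
        if (operands == tokens.length) {
            throw new IOException("Unexpected end of stream");
        }
        String op = tokens[operands];
        // d0 takes 2 operands, d1 takes 6; wx is the first one in both cases.
        boolean valid = (op.equals("d0") && operands == 2)
                || (op.equals("d1") && operands == 6);
        if (!valid) {
            throw new IOException("First operator must be d0 or d1");
        }
        return Float.parseFloat(tokens[0]);
    }

    /** Width table first, then the glyph stream, then the default. */
    static float spaceWidth(float tableWidth, String spaceCharProc) {
        if (tableWidth > 0) {
            return tableWidth;
        }
        try {
            return widthFromCharProc(spaceCharProc);
        } catch (IOException e) {
            // An empty or malformed glyph stream ends up here.
            return DEFAULT_SPACE_WIDTH;
        }
    }

    public static void main(String[] args) {
        System.out.println(spaceWidth(0f, "600 0 d0")); // prints 600.0
        System.out.println(spaceWidth(0f, ""));         // prints 250.0
    }
}
```

In this sketch an empty glyph stream is not fatal: the lookup falls through to 
the default width and extraction can continue.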

The real question is: does your code do what you expected it to do, or not? If 
it does, then don't panic. I just changed the code so that the exception and 
the error message no longer appear.





> PDFTextStripperByArea dies on Chinese/Japanese files
> ----------------------------------------------------
>
>                 Key: PDFBOX-4355
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4355
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.12
>            Reporter: Ilya Kantor
>            Priority: Critical
>         Attachments: output.pdf, output.txt
>
>
> I'm using PDFTextStripperByArea, and this code makes it die:
> stripper.extractRegions(page);
> When the language is Chinese or Japanese, this is the error:
> ===========
> Oct 19, 2018 1:24:50 AM org.apache.pdfbox.pdmodel.font.PDFont getSpaceWidth
>  SEVERE: Can't determine the width of the space character, assuming 250
>  java.io.IOException: Unexpected end of stream
>  at org.apache.pdfbox.pdmodel.font.PDType3CharProc.getWidth(PDType3CharProc.java:170)
>  at org.apache.pdfbox.pdmodel.font.PDType3Font.getWidthFromFont(PDType3Font.java:165)
>  at org.apache.pdfbox.pdmodel.font.PDFont.getSpaceWidth(PDFont.java:547)
>  at org.apache.pdfbox.text.LegacyPDFStreamEngine.showGlyph(LegacyPDFStreamEngine.java:265)
>  at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:734)
>  at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:595)
> =========
> Should I use something else instead of PDFTextStripperByArea?
> Let me know if more information is needed. I attached the PDF file (output.pdf).
>
> To reproduce, use the attached output.pdf and the standard PDFBox app:
>  java -jar pdfbox-app-2.0.12.jar ExtractText output.pdf
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)