[
https://issues.apache.org/jira/browse/PDFBOX-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656318#comment-16656318
]
Tilman Hausherr edited comment on PDFBOX-4355 at 10/19/18 6:14 AM:
-------------------------------------------------------------------
In Type 3 fonts, each glyph is defined by its own PDF content stream. Some
glyphs in this file have a completely empty content stream, which is
unexpected. The exception is raised while the width of the space character is
being determined, which text extraction needs in order to decide whether two
glyphs belong to different words when no space character is used. The width
table gave a width of 0, so PDFBox tried to read the width from the glyph's
content stream itself (where the first operator usually gives the width), and
that stream was empty.
It is possible that the order in which the width sources are consulted changed
between versions.
The real question is: does your code do what you expected it to do, or not? If
it does, then don't panic. I just changed the code so that the exception and
the error message no longer appear.
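To illustrate the lookup order described above, here is a minimal sketch in
plain Java. It is a simplified model and not the actual PDFBox code: the class
and method names (SpaceWidthSketch, widthFromCharProc, spaceWidth) are
hypothetical, the glyph stream is passed as a plain string, and the default of
250 is taken from the log message in the report. It assumes the width is the
first operand of a leading d0 or d1 operator.

```java
import java.io.IOException;

// Simplified model of the space-width fallback; NOT PDFBox's implementation.
// All names in this class are hypothetical and exist only for illustration.
final class SpaceWidthSketch {
    // Default taken from the log message "assuming 250".
    static final float DEFAULT_SPACE_WIDTH = 250f;

    // Reads the width from the first operator of a Type 3 glyph stream.
    // Expected forms: "wx wy d0" or "wx wy llx lly urx ury d1".
    static float widthFromCharProc(String charProcStream) throws IOException {
        String trimmed = charProcStream.trim();
        if (trimmed.isEmpty()) {
            // An empty glyph stream: the case seen in the report.
            throw new IOException("Unexpected end of stream");
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].matches("[-+]?\\d*\\.?\\d+")) {
                continue; // numeric operand, keep looking for the operator
            }
            boolean widthOperator = tokens[i].equals("d0") || tokens[i].equals("d1");
            if (widthOperator && i > 0) {
                return Float.parseFloat(tokens[0]); // wx is the first operand
            }
            throw new IOException("First operator does not give a width: " + tokens[i]);
        }
        throw new IOException("Unexpected end of stream");
    }

    // Width table first, then the glyph stream, then the default.
    static float spaceWidth(float tableWidth, String charProcStream) {
        if (tableWidth > 0) {
            return tableWidth;
        }
        try {
            float width = widthFromCharProc(charProcStream);
            if (width > 0) {
                return width;
            }
        } catch (IOException e) {
            // Fall through quietly to the default instead of logging an error.
        }
        return DEFAULT_SPACE_WIDTH;
    }
}
```

With a table width of 0 and an empty glyph stream, spaceWidth falls back to
the default, which corresponds to the situation in the attached file.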
> PDFTextStripperByArea dies on Chinese/Japanese files
> ----------------------------------------------------
>
> Key: PDFBOX-4355
> URL: https://issues.apache.org/jira/browse/PDFBOX-4355
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.12
> Reporter: Ilya Kantor
> Priority: Critical
> Attachments: output.pdf, output.txt
>
>
> I'm using PDFTextStripperByArea, and this code makes it die:
> stripper.extractRegions(page);
> When the language is Chinese or Japanese, this is the error:
> ===========
> Oct 19, 2018 1:24:50 AM org.apache.pdfbox.pdmodel.font.PDFont getSpaceWidth
> SEVERE: Can't determine the width of the space character, assuming 250
> java.io.IOException: Unexpected end of stream
> at org.apache.pdfbox.pdmodel.font.PDType3CharProc.getWidth(PDType3CharProc.java:170)
> at org.apache.pdfbox.pdmodel.font.PDType3Font.getWidthFromFont(PDType3Font.java:165)
> at org.apache.pdfbox.pdmodel.font.PDFont.getSpaceWidth(PDFont.java:547)
> at org.apache.pdfbox.text.LegacyPDFStreamEngine.showGlyph(LegacyPDFStreamEngine.java:265)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:734)
> at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:595)
> =========
> Should I use something else instead of PDFTextStripperByArea?
> Let me know if more information is needed. I attached the PDF file (output.pdf).
> To reproduce, use the attached output.pdf and the standard PDFBox app:
> java -jar pdfbox-app-2.0.12.jar ExtractText output.pdf
>