Igor Santos created PDFBOX-3742:
-----------------------------------
Summary: Unknown dir object c='>' cInt=62 peek='>' peekInt=62
Key: PDFBOX-3742
URL: https://issues.apache.org/jira/browse/PDFBOX-3742
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.5
Environment: Based on Tika Docker image: logicalspark/docker-tikaserver
Reporter: Igor Santos
Attachments: buggy.pdf, screenshot_002.png
I originally ran into this when running a 69-page PDF through Tika, and
could narrow the issue down to somewhere between two of those pages. Tika ends
up responding with malformed XML, as the attached screenshot shows, and logs a
stack trace that includes the PDFBox exception. The same exception is
reproduced below with the standalone CLI tool.
I'm using Tika 1.1.4, although I'm not exactly sure what version of PDFBox it
uses. Here's the base
[Dockerfile|https://github.com/LogicalSpark/docker-tikaserver/blob/master/Dockerfile].
{code}
$ java -jar pdfbox-app-2.0.5.jar ExtractText buggy.pdf
Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Using fallback font 'LiberationSans-Bold' for 'Arial-BoldMT'
Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Using fallback font 'LiberationSans' for 'ArialMT'
Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
WARNING: Corrupt object reference at offset 150196
Exception in thread "main" java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62 at offset 150196
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:954)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:654)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:502)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:237)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
{code}
Seems related to PDFBOX-1327.
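For anyone trying to narrow a similar failure down to specific pages, here is a minimal sketch of a page-by-page isolation loop. It is not how the pages in this report were actually found; the `PageIsolation` class, the `PageExtractor` callback and the stand-in extractor in `main` are all hypothetical names introduced for illustration, and the sketch deliberately has no PDFBox dependency so it runs on its own.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PageIsolation {

    /** Hypothetical callback: extracts the text of one 1-based page. */
    @FunctionalInterface
    public interface PageExtractor {
        String extract(int page) throws IOException;
    }

    /**
     * Runs the extractor on every page individually and collects the
     * page numbers for which extraction throws an IOException.
     */
    public static List<Integer> findFailingPages(int pageCount, PageExtractor extractor) {
        List<Integer> failing = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                extractor.extract(page);
            } catch (IOException e) {
                // Remember the page and keep going instead of aborting the whole run.
                failing.add(page);
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        // Stand-in extractor (assumption): simulates a parse error on pages 2 and 3.
        List<Integer> failing = findFailingPages(4, page -> {
            if (page == 2 || page == 3) {
                throw new IOException("simulated parse error on page " + page);
            }
            return "text of page " + page;
        });
        System.out.println("Failing pages: " + failing);
    }
}
```

With PDFBox 2.0.x on the classpath, the extractor could presumably wrap a `PDFTextStripper`: call `setStartPage(page)` and `setEndPage(page)`, then `getText(document)` on a `PDDocument` opened with `PDDocument.load(new File("buggy.pdf"))`, and pass `document.getNumberOfPages()` as the page count.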