[
https://issues.apache.org/jira/browse/PDFBOX-3677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859644#comment-15859644
]
Manuel Gübeli commented on PDFBOX-3677:
---------------------------------------
In another file (that failed as well before the Type1Font fix), there are the
following messages:
{quote}
Feb 09, 2017 4:10:41 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 200, length: 3818,
expected end position: 4018
Feb 09, 2017 4:10:41 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 9338, length: 34466,
expected end position: 43804
Feb 09, 2017 4:10:41 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 43674, length: 6275,
expected end position: 49949
Feb 09, 2017 4:10:41 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 49711, length: 34466,
expected end position: 84177
Feb 09, 2017 4:10:41 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 84029, length: 5188,
expected end position: 89217
Feb 09, 2017 4:10:41 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 89158, length: 34466,
expected end position: 123624
Feb 09, 2017 4:10:41 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 123487, length: 12032,
expected end position: 135519
Feb 09, 2017 4:10:41 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 138048, length: 34466,
expected end position: 172514
Feb 09, 2017 4:10:41 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 172379, length: 12235,
expected end position: 184614
Feb 09, 2017 4:10:41 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 184447, length: 34466,
expected end position: 218913
{quote}
Is that something you want to look into as well, or is this another
issue?
Note: Text extraction now works for this file as well.
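
A quick way to sanity-check the numbers in the warnings quoted above is sketched below. In every quoted warning the reported end position equals start position plus length, and in several cases that end lies beyond the start of the next reported stream (for example 43804 versus 43674), which may indicate that the declared stream lengths in the file are too large. The class `StreamLengthCheck`, its `Entry` type, and its methods are hypothetical helpers written for this illustration; they are not part of PDFBox, and the regular expression only assumes the message wording shown in the log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of PDFBox) for checking the logged offsets. */
final class StreamLengthCheck {

    /** One parsed warning: start offset, declared length, reported end offset. */
    static final class Entry {
        final long start;
        final long length;
        final long expectedEnd;

        Entry(long start, long length, long expectedEnd) {
            this.start = start;
            this.length = length;
            this.expectedEnd = expectedEnd;
        }
    }

    // Matches the numeric part of each warning; \s* also tolerates the
    // line breaks introduced by the e-mail wrapping.
    private static final Pattern WARNING = Pattern.compile(
            "stream start position:\\s*(\\d+),\\s*length:\\s*(\\d+),"
            + "\\s*expected end position:\\s*(\\d+)");

    /** Extracts all (start, length, end) triples from a block of log text. */
    static List<Entry> parse(String log) {
        List<Entry> entries = new ArrayList<>();
        Matcher m = WARNING.matcher(log);
        while (m.find()) {
            entries.add(new Entry(
                    Long.parseLong(m.group(1)),
                    Long.parseLong(m.group(2)),
                    Long.parseLong(m.group(3))));
        }
        return entries;
    }

    /** True if the reported end is exactly start + length. */
    static boolean endMatchesStartPlusLength(Entry e) {
        return e.start + e.length == e.expectedEnd;
    }

    /** Counts entries whose reported end lies beyond the next entry's start. */
    static int countOverlaps(List<Entry> entries) {
        int overlaps = 0;
        for (int i = 0; i + 1 < entries.size(); i++) {
            if (entries.get(i).expectedEnd > entries.get(i + 1).start) {
                overlaps++;
            }
        }
        return overlaps;
    }

    public static void main(String[] args) {
        // First three warnings from the log above, numeric part only.
        String log =
                "stream start position: 200, length: 3818,\n"
                + "expected end position: 4018\n"
                + "stream start position: 9338, length: 34466,\n"
                + "expected end position: 43804\n"
                + "stream start position: 43674, length: 6275,\n"
                + "expected end position: 49949\n";

        List<Entry> entries = parse(log);
        for (Entry e : entries) {
            System.out.println(e.start + " + " + e.length + " = " + e.expectedEnd
                    + " ? " + endMatchesStartPlusLength(e));
        }
        System.out.println("overlapping neighbours: " + countOverlaps(entries));
    }
}
```

Pasting the full quoted log into `parse` instead of the shortened sample applies the same check to all ten warnings.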
> NullPointerException in Type1Parser.read
> ----------------------------------------
>
> Key: PDFBOX-3677
> URL: https://issues.apache.org/jira/browse/PDFBOX-3677
> Project: PDFBox
> Issue Type: Bug
> Components: FontBox
> Affects Versions: 2.0.3, 2.0.4
> Environment: Windows 10, java version "1.8.0_25"
> Reporter: Manuel Gübeli
> Assignee: Tilman Hausherr
> Labels: type1, type1font
> Fix For: 2.0.5, 2.1.0
>
> Attachments: F1.PFB, F1.txt, F2.PFB, F2.txt,
> Resources_ScreenShot.GIF, StackTrace.txt
>
>
> Text extraction from certain PDFs is not possible; PDFBox responds with a
> NullPointerException. Text extraction from the same PDFs works with version
> 1.8.13.
> The issue was originally discovered while using the newest Apache Tika 1.14
> library. I cannot downgrade to PDFBox 1.8.13 with Apache Tika 1.14.
> Unfortunately, I cannot provide the failing PDFs to you. However, I did
> some testing and found out that “Token token = lexer.nextToken();” returns
> null.
> Feb 07, 2017 12:17:40 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
> SEVERE: Can't read the embedded Type1 font AAAAAB+Arial-BoldMT
> java.io.IOException: Found token=null but expected NAME
> Caused by: java.io.EOFException
> at
> org.apache.pdfbox.io.ScratchFileBuffer.seek(ScratchFileBuffer.java:302)
> at
> org.apache.pdfbox.pdfparser.COSParser.checkXRefOffset(COSParser.java:1177)
> at org.apache.pdfbox.pdfparser.COSParser.parseXref(COSParser.java:202)
>