RE: Getting IOException: expected: 'endstream' actual: '' at offset X

Poisson, David (DGRI) Mon, 25 Aug 2025 13:45:56 -0700

Hi Tilman,
        You may disregard my last comment. I copied the wrong file and passed 
it through our modified legacy system, wrongly thinking it had solved our 
problem. Resetting the inputStream had no impact on anything.


David Poisson

-----Original Message-----
From: Poisson, David (DGRI) <[email protected]>
Sent: 25 August 2025 10:37
To: [email protected]
Subject: RE: Getting IOException: expected: 'endstream' actual: '' at offset X

Hi Tilman,
        Sorry for the late reply.

I did more digging around. I've added a reset() call to the inputStream before 
parsing the document, and preliminary tests show that all the problematic PDFs 
can now be opened without any problems.

I have a feeling that there are some IF-ELSE branches in the code that 
conditionally reset the input stream in most, but not all, cases.
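
Whether reset() actually rewinds a stream depends on the concrete stream class 
and on whether mark() was called earlier, which matters whenever one stream is 
shared between two passes. The following is only a minimal sketch using 
java.io (Java 9+ for readAllBytes); the class name, the helper methods, and the 
placeholder bytes are illustrative assumptions, not code from the legacy system 
or from PDFBox.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class StreamResetDemo {

    /** Reads the stream to its end and returns how many bytes were still available. */
    public static int drain(InputStream in) {
        try {
            return in.readAllBytes().length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Calls reset(); returns false when the stream refuses to rewind. */
    public static boolean tryReset(InputStream in) {
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1000]; // placeholder bytes, not a real PDF

        // ByteArrayInputStream: reset() with no prior mark() rewinds to the start.
        InputStream inMemory = new ByteArrayInputStream(data);
        System.out.println(drain(inMemory));    // first pass sees every byte
        System.out.println(drain(inMemory));    // second pass finds the stream exhausted
        System.out.println(tryReset(inMemory)); // rewinding succeeds for this class
        System.out.println(drain(inMemory));    // the bytes are readable again

        // BufferedInputStream: reset() without a prior mark() throws IOException.
        InputStream buffered = new BufferedInputStream(new ByteArrayInputStream(data));
        drain(buffered);
        System.out.println(tryReset(buffered)); // rewinding fails here
    }
}
```

The sketch only shows that a consumed stream yields nothing on a second read 
and that reset() is not guaranteed to work; it says nothing about what the 
legacy code does internally.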

David Poisson


-----Original Message-----
From: Tilman Hausherr <[email protected]>
Sent: 15 August 2025 10:13
To: [email protected]
Subject: Re: Getting IOException: expected: 'endstream' actual: '' at offset X

On 15.08.2025 at 15:57, Poisson, David (DGRI) wrote:
> Hi Tilman, appreciate your reply!
>
> I'm not that familiar with the internal structure of PDF files, so I 
> appreciate the fact that you confirm that they are valid.
> When I use the "file" command in Linux, it reports all the files as 
> application/pdf.
>
> I was also able to write a sample program which loads the PDF as per the 
> PDFBox documentation. It works well.
>
> I pulled out all the statements that work with PDFBox objects from our legacy 
> system's code and put it in a self-contained project and I think I can 
> reproduce the problem.
> The input stream is used for 2 passes through the document.
> The first pass goes through all pages and determines the text location on 
> each page.
> The second pass extracts all the text, which is then cleaned up (removing 
> whatever is in the margins and at the top/bottom of pages).
>
> When we re-used the input stream in the second pass to create the Parser, 
> that's when we get the error.
>
> I've added an inputStream.reset() in between the two passes in my 
> self-contained project and the error goes away.
> I'm in the process of making the modification in our legacy system and will 
> test it to see if that helps us with the PDF files that cannot be opened.
> What I can't explain though, is why some PDF files are going through without 
> this reset()?

I don't know, and I'm not curious enough to find out why incorrect code 
sometimes works.

You should be able to work with the same PDDocument object for both passes. 
That second pass is a bit weird. You don't need to call "new PDFParser(", this 
is very "old style". If you have a file in your production code, use that one. 
If you have a byte array, use that one directly (in PDFBox 2.0 and 3.0). 3.0 should 
be faster with a file because it does parse on demand.
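
A minimal sketch of the "parse once, reuse the same object for both passes" 
idea, written against the JDK only so that it runs without PDFBox. ParsedDoc 
is a hypothetical stand-in for the role PDDocument plays, and the "pages" are 
just lines of text. In real code the byte array would presumably go to 
PDDocument.load(byte[]) in PDFBox 2.0 or Loader.loadPDF(byte[]) in 3.0 instead 
of the toy parse() below.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ParseOnceDemo {

    /** Hypothetical stand-in for a parsed document (the role PDDocument plays). */
    public static final class ParsedDoc {
        final List<String> pages;
        ParsedDoc(List<String> pages) { this.pages = pages; }
    }

    /** Consumes the stream exactly once; each input line counts as one "page". */
    public static ParsedDoc parse(InputStream in) {
        try {
            String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            return new ParsedDoc(List.of(text.split("\n")));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Pass 1: records a per-page measurement (here just the trimmed length). */
    public static List<Integer> measurePages(ParsedDoc doc) {
        List<Integer> lengths = new ArrayList<>();
        for (String page : doc.pages) {
            lengths.add(page.trim().length());
        }
        return lengths;
    }

    /** Pass 2: extracts cleaned-up text from the same parsed object. */
    public static String extractText(ParsedDoc doc) {
        StringBuilder sb = new StringBuilder();
        for (String page : doc.pages) {
            sb.append(page.trim()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "  page one \n page two  ".getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(bytes);
        ParsedDoc doc = parse(in);             // the stream is read only here
        System.out.println(measurePages(doc)); // pass 1 works on the parsed object
        System.out.print(extractText(doc));    // pass 2 reuses it, no second parse
    }
}
```

Because neither pass touches the stream, no reset() is needed between them.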

Tilman


>
> Not sure what's the best way to share this project. I've put it up on my 
> Google Drive (where the PDFs were).
> It's a Java Maven project called PDFLoader.zip (for convenience, I
> have the 3 PDF files at the root of the project):
> https://drive.google.com/drive/folders/1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z?usp=sharing
> When you run the main(), you'll get an exception with v4.PDF.
> Simply uncomment line 62 and it should work.
>
> David Poisson
>
>
> -----Original Message-----
> From: Tilman Hausherr <[email protected]>
> Sent: 14 August 2025 11:59
> To: [email protected]
> Subject: Re: Getting IOException: expected: 'endstream' actual: '' at offset X
>
> On 14.08.2025 at 16:24, Poisson, David (DGRI) wrote:
>> Here are the PDFs in question (I didn't want to attach 3 PDFs to the email, 
>> so here's a link to the Google Drive folder that has all 3):
>> https://drive.google.com/drive/folders/1Tb136kzA5mMy5R2ti0Cy7UXWT2PQVS5z?usp=sharing
>> v3.PDF: conversion result using version 3 of our conversion library,
>> works well in PDFBox 1.8.12
>> v4.PDF: conversion result using version 4 of our conversion library,
>> gives errors in PDFBox
>> v4-fixedByAcrobat.pdf: v4.PDF opened and exported by Acrobat: works
>> well in PDFBox 1.8.12
> I had no trouble doing a text extraction with 1.8.12, 1.8.16 and
> 1.8.17 on v3 and v4 using pdfbox-app. Makes me wonder if there's
> either a problem with PDFBox when using an input stream, or if
> something goes wrong when you read the file (maybe wrong mime type so
> it's passed as text)
>
> Re the PDF/A problems:
>
> Your file is a (correct) PDF/A-2a, but you checked it against PDF/A-1b, 
> which it isn't.
>
>      Checking against conformance level PDF/A-2a
>      True
>
>      Checking against conformance level PDF/A-2b
>      True
>
>      Checking against conformance level PDF/A-2u
>      True
>
> That's all you need!
>
> Tilman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>


