Re: [Podofo-users] Patch for pdfParser - findToken function

Francesco Pretto Wed, 27 Apr 2022 10:57:46 -0700

My report on pdfmm:

512.pdf -> OK
513.pdf -> OK
514.pdf -> OK
rev.pdf -> FAIL
big.pdf -> OK
false.pdf -> OK


I also created a big2.pdf (attached), where the garbage is placed
between the numeric offset and the %%EOF marker. It also fails on pdfmm
but opens in Adobe. As you say, a better backward search function should
be created to handle such edge cases.
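
To illustrate the chunked backward search suggested in the quoted
message below, here is a minimal sketch in Python rather than in the
PoDoFo/pdfmm C++ code. The names rfind_token and chunk_size are
hypothetical and not part of either library; the 16 kB default is just
the figure from the quoted suggestion. It only looks for raw bytes, so
it does not address the "false.pdf" case of a token inside a comment.

```python
import io

def rfind_token(stream, token, chunk_size=16 * 1024):
    """Return the offset of the last occurrence of token in a seekable
    binary stream, or -1. Only one chunk (plus a small overlap) is held
    in memory at a time. Hypothetical helper, not PoDoFo/pdfmm API."""
    if not token:
        raise ValueError("token must not be empty")
    overlap = len(token) - 1      # longest part of a token that can straddle a boundary
    stream.seek(0, io.SEEK_END)
    end = stream.tell()           # exclusive end of the region not searched yet
    tail = b""                    # first `overlap` bytes of the chunk read before
    while end > 0:
        start = max(0, end - chunk_size)
        stream.seek(start)
        chunk = stream.read(end - start)
        # Append the head of the later chunk so that "trai" + "ler"
        # split across the boundary is still seen as one token.
        window = chunk + tail
        pos = window.rfind(token)
        if pos != -1:
            return start + pos
        tail = window[:overlap]
        end = start
    return -1
```

For example, rfind_token(open("big.pdf", "rb"), b"startxref") would
return the offset of the last "startxref" without loading the whole
file, assuming the file is opened in binary mode.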

I think for PoDoFo 0.9.8 we could focus on just handling the specific
issue reported by Dennis, if possible with a few lines of code and
without breaking other PDFs.

I'm sorry, but I'm not available to work on the PoDoFo 0.9.x codebase.
I will, however, create test cases using the PDFs you created and fix
it in pdfmm (which is a candidate for merging into PoDoFo).

Regards,
Francesco


On Wed, 27 Apr 2022 at 18:27, Michal Sudolsky <sudols...@gmail.com> wrote:
>
> Attached are 6 PDF files and all of them open well in 3 pdf viewers I tested.
>
>>
>> so the backward search is correct, but it's better to limit it to 
>> "startxref".
>>
>> > Seems you are searching for a trailer right after xref (if I read that 
>> > part well).
>> >
>>
>> Yes, correct, that was a cleaner solution: in my case it was useful to
>> fix some spurious warnings as the commit message says. It also
>> improved parsing performance.
>
>
> Btw I noticed some typo here "Ooffset read position to the EOF marker if it 
> is not the last thing in the file".
>
>>
>>
>> > So is there actually some reason that for "i == 0" it is internal logic? 
>> > What if startxref is precisely PDF_XREF_BUF bytes before the last EOF 
>> > offset (m_LastEOFOffset)?
>> >
>>
>> I didn't modify that code, but I believe this was a kind of intended
>> safeguard, since the backward search is slow. Assuming one also puts a
>> large amount of garbage between "startxref" and "%%EOF", then yes, what
>> you say is true.
>
>
> Yes, searching backward may be slow unless the whole file is loaded into 
> memory (which is not really good), but this can also be done in parts (see 
> the bottom of this message). It can also search for both the trailer and 
> startxref at once.
>
> 512.pdf gives error:
>
> PoDoFo encountered an error. Error: 8 ePdfError_InternalLogic
> Error Description: An internal error occurred.
>
> That will be the "if( !i )" check, and it will probably throw the same error 
> in pdfmm as well. I still do not believe this is really intentional (rather, 
> it is just a bug).
>
> 513.pdf surprisingly works in podofo (the trailer is not found by FindToken, 
> but i is -1, so it seeks 513 bytes backward, where the trailer is 
> subsequently found by IsNextToken after the call to FindToken in ReadTrailer).
>
> 514.pdf gives the same error as big.pdf.
>
>> We should test whether Adobe handles an arbitrary amount of
>> garbage.
>>
>
> big.pdf gives error (it has 1 MB of garbage so it is zipped):
>
> PoDoFo encountered an error. Error: 15 ePdfError_NoNumber
> Error Description: A number was expected but not found.
>
> At the bottom of the call stack there is "Information: Unable to find trailer 
> in file."
>
> I also tested 1 GB of garbage (comments), and this also worked fine in the 
> 3 viewers mentioned.
>
>> Going back to the reporter's issue: I don't know how to fix it in PoDoFo
>> with a patch of a few lines, but if you can't think of anything safe
>> enough, a better fix is to do as I did in pdfmm and not read "trailer"
>> backward. Of course, such a change won't need to be merged into pdfmm.
>
>
> rev.pdf works fine in podofo, but when the patch from this email thread is 
> applied, it gives this error:
>
> PoDoFo encountered an error. Error: 15 ePdfError_NoNumber
> Error Description: A number was expected but not found.
>
> It cannot find the trailer.
>
> Also, I suppose rev.pdf cannot be opened in pdfmm. It has the xref and 
> trailer reordered. Note that nothing in the PDF specification says that the 
> trailer and xref must be in a particular order, only that the trailer comes 
> before startxref. It also does not say how far from the end the trailer or 
> startxref can be (only that %%EOF must be within 1024 bytes).
>
> Maybe the best approach would be to load chunks of the file into memory, 
> working backward from the end. Let's say it first loads the last 16 kB and 
> searches for a token; if the token is not found, it discards this chunk and 
> loads the preceding 16 kB, and so on, so even with GBs of garbage it will 
> not exhaust memory. Of course, these chunks should overlap somewhat, because 
> there can be "trai" at the end of one chunk and "ler" at the start of the 
> previously loaded one.
>
> But there is another case:
> false.pdf gives error on podofo:
>
> PoDoFo encountered an error. Error: 20 ePdfError_InvalidDataType
>
> There is a "false" trailer in a comment. This means that it is not enough to 
> just search for a specific string; the search needs to be aware of context, 
> i.e. whether that string is in a comment or not (this applies to both trailer 
> and startxref).
>
>
>>
>> Cheers,
>> Francesco

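On the false.pdf case quoted above, a context-aware backward search
could look roughly like the following Python sketch. It is a
simplification under stated assumptions: a match counts as "in a
comment" if a '%' appears earlier on the same line, which ignores '%'
characters inside string literals or stream data. The function names
in_comment and rfind_outside_comment are hypothetical and not PoDoFo or
pdfmm API.

```python
def in_comment(buf, pos):
    """True if a '%' occurs between the start of the line containing
    pos and pos itself. Assumption: '%' always starts a comment, which
    is not true inside PDF strings or streams."""
    # PDF lines may end with CR, LF or CRLF, so check both characters.
    line_start = max(buf.rfind(b"\n", 0, pos), buf.rfind(b"\r", 0, pos)) + 1
    return buf.find(b"%", line_start, pos) != -1

def rfind_outside_comment(buf, token):
    """Offset of the last occurrence of token that is not preceded by
    '%' on its own line, or -1 if there is none."""
    pos = buf.rfind(token)
    while pos != -1 and in_comment(buf, pos):
        # Continue with matches that start strictly before pos.
        pos = buf.rfind(token, 0, pos + len(token) - 1)
    return pos
```

With a buffer that has a real "trailer" followed by a line such as
"% trailer in a comment", a plain rfind lands on the commented copy,
while rfind_outside_comment skips it and returns the real one.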

