End of line heuristic

Petr Slabý Fri, 15 Sep 2017 02:28:50 -0700

Hi,
I have a PDF which cannot be read by PDFBox 1.8.x, but works fine in PDFBox 
2.x. Before I create an issue I need to internally clarify whether I can share 
the PDF. And I would like to clarify with you whether it makes sense to create 
an issue at all.


The problem in the PDF is the following.
The PDF consistently uses the line feed (0x0A) character for end of line. Its 
streams are encrypted. There is a stream having 

<</Length 9 0 R/Filter/FlateDecode/N 3/Range[0 1 0 1 0 1 ]>>

so the length attribute is stored in a separate object and the sequential 
parser cannot use it. Hence it searches for the “endstream” keyword. The 
misfortune is that the stream itself ends with the byte 0x0D, which is 
interpreted by the EndstreamOutputStream as carriage return and stripped off 
from the stream. The stream content is then one byte short and the decryption 
fails...
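
To make the failure concrete, here is a simplified model of the stripping step. It is only a sketch: the class and the helper stripAnyEol are hypothetical and are not the actual EndstreamOutputStream code. It assumes the parser drops any trailing CR, LF, or CR LF found before the “endstream” keyword.

```java
import java.util.Arrays;

class EolStripDemo {
    // Hypothetical helper, not PDFBox code: removes a trailing LF,
    // then a trailing CR, regardless of which EOL the file really uses.
    static byte[] stripAnyEol(byte[] data) {
        int end = data.length;
        if (end > 0 && data[end - 1] == 0x0A) {
            end--;
        }
        if (end > 0 && data[end - 1] == 0x0D) {
            end--;
        }
        return Arrays.copyOf(data, end);
    }

    public static void main(String[] args) {
        // 16 bytes of (fake) encrypted data whose last byte happens to be 0x0D,
        // followed by the LF that precedes "endstream".
        byte[] raw = new byte[17];
        raw[15] = 0x0D;
        raw[16] = 0x0A;
        byte[] body = stripAnyEol(raw);
        System.out.println(body.length); // prints 15, one byte short
    }
}
```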

Do you see a chance to improve this? Could the EndstreamOutputStream learn the 
line ending to search for from the PDF content? I mean, my PDF starts with 
%PDF-1.7<0x0A>, so could the EndstreamOutputStream search just for this 
character in such a case? Or are there PDFs which use a mixture of both line 
endings?
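
A rough sketch of the idea, under the assumption that the first line ending in the file is representative of the whole file. The names detectEol and stripDetectedEol are hypothetical, not PDFBox API, and the approach would break for files that mix line endings.

```java
import java.util.Arrays;

class HeaderEolSketch {
    // Returns the first end-of-line sequence found in the file
    // (CR LF, CR, or LF), or an empty array if there is none.
    static byte[] detectEol(byte[] file) {
        for (int i = 0; i < file.length; i++) {
            if (file[i] == 0x0D) {
                boolean crlf = i + 1 < file.length && file[i + 1] == 0x0A;
                return crlf ? new byte[] {0x0D, 0x0A} : new byte[] {0x0D};
            }
            if (file[i] == 0x0A) {
                return new byte[] {0x0A};
            }
        }
        return new byte[0];
    }

    // Removes the detected EOL from the end of the stream data,
    // and leaves the data untouched if it ends with anything else.
    static byte[] stripDetectedEol(byte[] data, byte[] eol) {
        int start = data.length - eol.length;
        if (eol.length == 0 || start < 0) {
            return data;
        }
        for (int i = 0; i < eol.length; i++) {
            if (data[start + i] != eol[i]) {
                return data;
            }
        }
        return Arrays.copyOf(data, start);
    }
}
```

With a header ending in 0x0A, only the final 0x0A before “endstream” would be removed, so a stream ending in 0x0D would keep its last byte.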

The only other solution I found is to add the missing byte in the method 
encryptData() of SecurityHandler. There I know that the data length has to be 
divisible by 16, so I add 0x0D if one byte is missing. But it is rather a hack 
and I am not sure whether the missing byte might not be 0x0A in some cases. And 
this only helps for AES encrypted streams anyway.
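
For reference, the workaround amounts to something like the following. This is a simplified sketch with a hypothetical helper name, not the actual change inside SecurityHandler, and it simply guesses that the lost byte was 0x0D.

```java
import java.util.Arrays;

class AesPadHackSketch {
    // If the data is exactly one byte short of a multiple of 16,
    // assume the parser stripped a trailing 0x0D and put it back.
    static byte[] restoreStrippedByte(byte[] data) {
        if (data.length % 16 == 15) {
            byte[] fixed = Arrays.copyOf(data, data.length + 1);
            fixed[data.length] = 0x0D;
            return fixed;
        }
        return data;
    }
}
```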

Please do not suggest moving to the non-sequential parser of PDFBox 2.x (I 
guess that is the reason why it works there). I would love to move on to the 
new version, but we are not that far in our software yet. And our customers 
will move to it one or two years after we are ready...

Best regards,
Petr.
