Re: How to flatedecode and find all acroform fields in a compressed PDF

Balaji Venkatamohan Thu, 21 May 2015 17:01:30 -0700

Hi,

> How was the decompressing of the PDF from your customer done - did your
> customer also use PDFBox? Or something else?


I could not get an answer to this question from the customer; I am waiting
for a response from them. It is not the pdfbox API, the itextpdf API, or any
of the online compression tools. I verified this by compressing the
uncompressed file sent by the customer with the pdfbox API, itextpdf, and the
online tools. The compressed file from each of these three methods is close
to 1.60 MB, but the compressed file sent by the customer is only 21 KB!
I would have to use Adobe Acrobat Pro to compress the uncompressed PDF and
see whether the result is close to 21 KB, but I do not have the full Adobe
software.

> And I read in the first post that the decompressed customer file was OK,
> but not the compressed file...  so the problem is to find if something is
> missing in the compressed file, or if PDFBox has a bug causing it to be missed.

Both the decompressed (original) and the compressed file are okay when
opened with PDF reader software; that is, I am able to edit the acroform
fields manually and save them too. The problem is that with the pdfbox
API, only the decompressed (original) file's acroform fields are read. When
I use the compressed file, pdfbox is not able to retrieve any of the
acroform fields, and the API call PDDocumentCatalog.getAcroForm() returns
null.

Today, I used iTextPDF API to read acroform fields from the compressed PDF
sent by the customer and even their API could not locate the acroform
fields in the compressed PDF.
I will now try with pdfbox 2.0.0 API. I will share with you the compression
technique used by the customer as soon as they get back to me.
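
For anyone following the thread, here is a minimal sketch of what
"FlateDecode" means at the byte level, using only java.util.zip and not
pdfbox. It assumes the stream body (the bytes between "stream" and
"endstream") has already been cut out of the file and uses no /Predictor
or additional filters. The class name FlateSketch and its helper methods
are made up for illustration; this is not how pdfbox does it internally.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, for illustration only.
final class FlateSketch {

    /** Inflates zlib-wrapped data, the format used by /FlateDecode streams. */
    public static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input or preset dictionary: stop with what we have
                }
                out.write(buffer, 0, n);
            }
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Compresses data the same way, only so the sketch can test itself. */
    public static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** True if the decoded bytes mention the /AcroForm key (names are case-sensitive). */
    public static boolean mentionsAcroForm(byte[] decoded) {
        return new String(decoded, StandardCharsets.ISO_8859_1).contains("/AcroForm");
    }

    public static void main(String[] args) throws DataFormatException {
        // Made-up sample data standing in for a real stream body.
        byte[] sample = "<< /Type /Catalog /AcroForm 12 0 R >>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] decoded = inflate(deflate(sample));
        System.out.println(mentionsAcroForm(decoded));
    }
}
```

Since /AcroForm is an entry of the document catalog and not of a page
content stream, decoding page streams this way would not be expected to
reveal the fields; the sketch is only meant for inspecting individual
streams by hand when comparing the two files.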

Thank you,
Balaji


On Tue, May 19, 2015 at 10:52 PM, Tilman Hausherr <[email protected]>
wrote:

> Hi,
>
> "How to properly decompress the PDF using pdfbox APIs" - see the source
> code of WriteDecodedDoc:
>
> https://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/main/java/org/apache/pdfbox/WriteDecodedDoc.java?view=markup&sortby=date
>
> How was the decompressing of the PDF from your customer done - did your
> customer also use PDFBox? Or something else?
>
> And I read in the first post that the decompressed customer file was OK,
> but not the compressed file...  so the problem is to find if something is
> missing in the compressed file, or if PDFBox has a bug causing it to be missed.
>
> Tilman
>
> PS: image didn't go through. Maybe upload it to imageshack.us.
>
>
> On 20.05.2015 at 03:24, Balaji Venkatamohan wrote:
>
>> Thank you for your pointers and sorry about the image. I am attaching it
>> with this email.
>>
>> The point I am trying to make is that the PDF, which was decompressed
>> using WriteDecodedDoc, is smaller in size than the original PDF given to us
>> by our customers.
>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did
>> not have any AcroForm fields, whereas the decompressed PDF given to us by
>> the customers does contain AcroForm fields. Hence I wanted to know how to
>> properly decompress the PDF using pdfbox APIs. The reason why I was
>> analyzing COSStream was to check if the decompression of the compressed PDF
>> was happening correctly while using PDFBox APIs.
>> I know it would have been difficult for you to help me without the actual
>> PDFs. For that, I would like to thank you for your time and pointers.
>>
>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <[email protected]> wrote:
>>
>>     Hi,
>>
>>     The image doesn't appear in the mailing list.
>>
>>     This is all very confusing... /AcroForm is in the document
>>     catalog. I don't see how the page content stream is related to it.
>>     The best is that you either go through the source code, or read
>>     the spec and then look at the pdf.
>>
>>     To find out what's going on, you'd have to start from that
>>     /AcroForm entry and then compare the two files.
>>
>>     It is really difficult to help you without the files. The cause
>>     could be a bug in pdfbox, or a malformed pdf...
>>
>>     Some more ideas:
>>     - use loadNonSeq(file, null) instead of load(file)
>>     - try the unreleased 2.0 version, that one has some improvements
>>     in the acroform stuff. Note that the API is different.
>>     https://pdfbox.apache.org/download.cgi#scm
>>     https://pdfbox.apache.org/2.0/getting-started.html
>>
>>     If you still need help, one possibility would be 1) post the
>>     smallest possible code that fails, and 2) post a small part of the
>>     raw PDF, i.e. the objects relevant to the field in your code.
>>
>>
>>     Tilman
>>
>>
>>     On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>
>>         Moreover, for every page of the compressed PDF (there are 3
>>         pages), I tried getting the COSStream of the page:
>>
>>         PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
>>         PDStream pdStream = firstPage.getContents();
>>         COSStream stream = pdStream.getStream();
>>
>>         In the above code snippet, the object stream, when analyzed in
>>         debug mode, has the following:
>>
>>
>>         The line from the compressed PDF as opened with Notepad++ is :
>>
>>         <</Filter/FlateDecode/Length 5675>>stream
>>
>>         From this point on, using the COSStream object for every page,
>>         how can I decompress and find out the acroform fields given
>>         that the unFilteredStream object is null for COSStream?
>>         
>>
>>         On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>         <[email protected]> wrote:
>>
>>             Thank you for your response Tilman.
>>
>>             I had previously tried using WriteDecodedDoc on my compressed
>>             PDF, and I tried to get the number of acro form fields present
>>             in the output file generated by WriteDecodedDoc. The API still
>>             could not find the acro form fields in the generated
>>             decompressed file. Also, the decompressed file generated is
>>             75 KB, which is far less than the original decompressed file
>>             which I have (1.6 MB), though I could edit the acro form
>>             fields using acrobat reader.
>>
>>             Thanks,
>>             Balaji
>>
>>
>>
>>             On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>             <[email protected]> wrote:
>>
>>                 On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>
>>                     My question is: how do I flatedecode a PDF so that I can
>>                     find all the acroform fields within it? Any help or
>>                     pointers would be highly appreciated.
>>
>>
>>                 You could try the WriteDecodedDoc option of the command line app
>>                 https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>
>>                 Maybe you can have further ideas by comparing the two files
>>                 with NOTEPAD++.... however the two files might have their
>>                 objects in different order.
>>
>>                 Tilman
>>
>>
>>
>>
>>                 ---------------------------------------------------------------------
>>                 To unsubscribe, e-mail: [email protected]
>>                 For additional commands, e-mail: [email protected]
>>
>
>
