Re: How to flatedecode and find all acroform fields in a compressed PDF

Balaji Venkatamohan Sat, 23 May 2015 07:38:48 -0700

Hi,

> So AcroForms/Fields is an empty Array?


Yes. In the filled-in interview_compressed.pdf, the AcroForm is not null, but
its Fields array is empty (size zero).
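
For anyone who wants to repeat the Notepad++ check without an editor, a
minimal sketch in plain Java (no PDFBox) could count how often a name such as
/AcroForm or /Widget appears in the raw bytes of a file. The class and method
names below are made up for illustration. The scan only sees uncompressed
text, so a name stored inside a compressed object stream would not be counted,
which may be one reason no 'AcroForm' text showed up before the form was
filled in and saved.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of PDFBox.
class RawTokenScan {

    // Counts non-overlapping occurrences of token in data, comparing bytes.
    static int count(byte[] data, String token) {
        byte[] t = token.getBytes(StandardCharsets.ISO_8859_1);
        if (t.length == 0) {
            return 0;
        }
        int hits = 0;
        int i = 0;
        while (i + t.length <= data.length) {
            boolean match = true;
            for (int j = 0; j < t.length; j++) {
                if (data[i + j] != t[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                hits++;
                i += t.length;   // skip past the match
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RawTokenScan <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("/AcroForm: " + count(data, "/AcroForm"));
        System.out.println("/Widget:   " + count(data, "/Widget"));
    }
}
```

A count of zero here does not prove the entry is absent; it only means the
name is not visible in the uncompressed parts of the file.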

Also, I tried the qpdf command-line tool to compress interview.pdf, and the
resulting file was 1.6 MB, nowhere near the size of
interview_compressed.pdf (21 KB).
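
On the original question of how to "flatedecode": FlateDecode data is
zlib/deflate compressed, so the body of a single stream can in principle be
inflated with the JDK's java.util.zip.Inflater once its raw bytes have been
cut out of the file. The following is only a sketch under that assumption,
with made-up class and method names. It ignores /DecodeParms predictors and
filter chains, and it is not meant to describe how PDFBox or qpdf work
internally.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration; not part of PDFBox.
class FlateSketch {

    // Inflates zlib-wrapped deflate data, e.g. the body of a FlateDecode stream.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Compresses data the same way, only so that the demo has something to inflate.
    static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] roundTrip = inflate(deflate(content));
        System.out.println(new String(roundTrip, StandardCharsets.ISO_8859_1));
    }
}
```

Inflating a page content stream this way yields drawing operators, not form
fields, so it would not by itself reveal anything about the AcroForm entry in
the document catalog.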

Thanks,
Balaji

On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <[email protected]>
wrote:

> Hi,
>
> > On 22.05.2015 at 23:00, Balaji Venkatamohan <[email protected]> wrote:
> >
> > I opened interview_compressed.pdf in Notepad++ and did not see any
> > 'AcroForm' text anywhere.
> > However, as Maruan suggested, I entered some data into what look like
> > form fields in interview_compressed.pdf and saved it. When I opened this
> > file in Notepad++, I did see 'AcroForm' text in it. I also noticed an
> > increase in file size from 21 KB to ~530 KB.
> >
> > I then opened this filled-in, saved PDF in PDFDebugger and saw that the
> > field values were being stored, not under the AcroForm fields but under
> > Annotations.
>
>
>
> So AcroForms/Fields is an empty Array?
>
> > Please refer to this image:
> >
> > http://imageshack.com/a/img540/9951/QGLDtS.jpg
> >
> > So, whatever the compression technique was, it simply made all the
> > AcroForm fields disappear from the original PDF but retained all the
> > annotations, which also contain the interactive forms, and this helped
> > reduce the file size so much? If so, can the PDFBox API use a similar
> > technique to compress such a huge file into a smaller one?
> >
> >
> >
> >
> > On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <[email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >>> On 22.05.2015 at 21:54, Tilman Hausherr <[email protected]> wrote:
> >>>
> >>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
> >>>> Hello,
> >>>>
> >>>> I used PDFDebugger to make the internal PDF structure of the two files,
> >>>> (1) interview.pdf and (2) interview_compressed.pdf, visually available,
> >>>> and I have uploaded my images to imageshack. Here are the five links:
> >>>>
> >>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
> >>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
> >>>> http://imageshack.com/a/img903/8644/mk15As.jpg
> >>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
> >>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
> >>>>
> >>>> The first two links show the internal structure of interview.pdf
> >>>> (original uncompressed file).
> >>>> The third and fourth links show the internal structure of
> >>>> interview_compressed.pdf (compressed file).
> >>>> The fifth link compares the sizes of the two files and, as you can
> >>>> also see, the difference is huge.
> >>>>
> >>>> As you might notice, the file interview_compressed.pdf has no acroform
> >>>
> >>> Indeed... but this is needed - from the spec:
> >>>
> >>> "The contents and properties of a document’s interactive form shall be
> >>> defined by an interactive form dictionary that shall be referenced from
> >>> the AcroForm entry in the document catalogue (see 7.7.2, “Document
> >>> Catalog”). Table 218 shows the contents of this dictionary."
> >>>
> >>
> >> correct
> >>
> >>>> fields listed even though opening the PDF in a PDF reader allows me to
> >>>> enter values in places which look like AcroForm fields and also save
> >>>> them. Are there any other PDF 'types' similar to AcroForm fields which
> >>>> would enable users to fill in data and which can be accessed in PDFBox
> >>>> APIs without having to go through PDAcrofield?
> >>>
> >>> Yes, annotations... there are some common parts, but this is just a
> >>> vague observation from me, I'm not the acroform specialist.
> >>
> >> At first glance it looks like all the entries necessary to (re-)generate
> >> the form fields are there. That's likely what is happening for this
> >> document in Adobe Reader. It would be interesting to see what is saved
> >> after the form has been filled out and saved using Acrobat. We'd need a
> >> test form to come up with an enhancement like this.
> >>
> >> BR
> >> Maruan
> >>
> >>
> >>>
> >>> What you should do: use NOTEPAD++ to check whether there's "/AcroForm"
> >>> in the "compressed" file.
> >>> - if it is missing, tell the client (or your boss) just that
> >>> - if it isn't missing, then there's some problem in PDFBox (try also
> >>> the loadNonSeq I mentioned earlier)
> >>>
> >>> Tilman
> >>>
> >>>>
> >>>> You can use qpdf, then use these options:
> >>>>
> >>>> I will now try using this link to compress the original file.
> >>>>
> >>>> Another strategy to think about - can your client generate a
> >>>> non-confidential file, so that you can share it, and the "compressed"
> >>>> file?
> >>>>
> >>>> I wish I had direct communication with the clients, but due to
> >>>> bureaucracy I have to go through multiple layers to get my message
> >>>> across to them. I will share more information as soon as I have it.
> >>>>
> >>>> PS: I sent these image links to my personal email first to make sure
> >>>> that I can open them. I could, so I am hoping you all can too. If you
> >>>> are unable to open them, please let me know.
> >>>>
> >>>> Thanks,
> >>>> Balaji
> >>>>
> >>>>
> >>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 20 May 2015 at 03:24, Balaji Venkatamohan <[email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> Thank you for your pointers and sorry about the image. I am
> >>>>>>> attaching it with this email.
> >>>>>>>
> >>>>>>> The point I am trying to make is that the PDF which was decompressed
> >>>>>>> using WriteDecodedDoc is smaller than the original PDF given to us
> >>>>>>> by our customers.
> >>>>>>> Also, the decompressed PDF generated by PDFBox's WriteDecodedDoc did
> >>>>>>> not have any PDAcroform fields, whereas the decompressed PDF given
> >>>>>>> to us by the customers does contain AcroForm fields. Hence I wanted
> >>>>>>> to know how to properly decompress the PDF using PDFBox APIs. The
> >>>>>>> reason I was analyzing COSStream was to check whether the
> >>>>>>> decompression of the compressed PDF was happening correctly when
> >>>>>>> using PDFBox APIs.
> >>>>>>> I know it would have been difficult for you to help me without the
> >>>>>>> actual PDFs. For that, I would like to thank you for your time and
> >>>>>>> pointers.
> >>>>>>>
> >>>>>> Maybe it's worth trying to share the files "visually" with us. Open
> >>>>>> both files (compressed and decompressed) with PDFDebugger [1], post a
> >>>>>> screenshot of each somewhere (Dropbox etc.), and share the link with
> >>>>>> us. Maybe that could shed some light on your issue.
> >>>>>>
> >>>>> @Balaji: here's an example of what such a screenshot would look like:
> >>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
> >>>>>
> >>>>> Tilman
> >>>>>
> >>>>>
> >>>>>
> >>>>>> BR
> >>>>>> Andreas Lehmkühler
> >>>>>>
> >>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
> >>>>>>
> >>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>> The image doesn't appear in the mailing list.
> >>>>>>>>
> >>>>>>>> This is all very confusing... /AcroForm is in the document catalog.
> >>>>>>>> I don't see how the page content stream is related to it. The best
> >>>>>>>> approach is to either go through the source code, or read the spec
> >>>>>>>> and then look at the PDF.
> >>>>>>>>
> >>>>>>>> To find out what's going on, you'd have to start from that
> >>>>>>>> /AcroForm entry and then compare the two files.
> >>>>>>>>
> >>>>>>>> It is really difficult to help you without the files. The cause
> >>>>>>>> could be a bug in PDFBox, or a malformed PDF...
> >>>>>>>>
> >>>>>>>> Some more ideas:
> >>>>>>>> - use loadNonSeq(file, null) instead of load(file)
> >>>>>>>> - try the unreleased 2.0 version, that one has some improvements
> >>>>>>>> in the acroform stuff. Note that the API is different.
> >>>>>>>> https://pdfbox.apache.org/download.cgi#scm
> >>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
> >>>>>>>>
> >>>>>>>> If you still need help, one possibility would be to 1) post the
> >>>>>>>> smallest possible code that fails, and 2) post a small part of the
> >>>>>>>> raw PDF, i.e. the objects relevant to the field in your code.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Tilman
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
> >>>>>>>>
> >>>>>>>>> Moreover, for every page of the compressed PDF (there are 3
> >>>>>>>>> pages), I tried getting the COSStream for each page:
> >>>>>>>>>
> >>>>>>>>> PDPage firstPage =
> >>>>>>>>>     (PDPage) document.getDocumentCatalog().getAllPages().get(0);
> >>>>>>>>> pdStream = firstPage.getContents();
> >>>>>>>>> COSStream stream = pdStream.getStream();
> >>>>>>>>>
> >>>>>>>>> In the above code snippet, the object stream, when analyzed in
> >>>>>>>>> debug mode, has the following:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> The line from the compressed PDF as opened with Notepad++ is :
> >>>>>>>>>
> >>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
> >>>>>>>>>
> >>>>>>>>> From this point on, using the COSStream object for every page,
> >>>>>>>>> how can I decompress and find the acroform fields, given that the
> >>>>>>>>> unFilteredStream object is null for COSStream?
> >>>>>>>>> 
> >>>>>>>>>
> >>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>     Thank you for your response Tilman.
> >>>>>>>>>
> >>>>>>>>>     I had previously tried using WriteDecodedDoc on my compressed
> >>>>>>>>>     PDF, and I tried to get the number of acro form fields present
> >>>>>>>>>     in the output file it generated. The API still could not find
> >>>>>>>>>     the acro form fields in the generated decompressed file.
> >>>>>>>>>     Also, the decompressed file generated is 75 KB, which is far
> >>>>>>>>>     less than the original decompressed file which I have (1.6 MB),
> >>>>>>>>>     though I could edit the acro form fields using Acrobat Reader.
> >>>>>>>>>
> >>>>>>>>>     Thanks,
> >>>>>>>>>     Balaji
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
> >>>>>>>>>     <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>         On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
> >>>>>>>>>
> >>>>>>>>>             My question is: how do I flatedecode a PDF so that I
> >>>>>>>>>             can find all the acroform fields within it? Any help
> >>>>>>>>>             or pointers would be highly appreciated.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>         You could try the WriteDecodedDoc option of the command
> >>>>>>>>>         line app
> >>>>>>>>>         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
> >>>>>>>>>
> >>>>>>>>>         Maybe you can get further ideas by comparing the two
> >>>>>>>>>         files with NOTEPAD++.... however the two files might have
> >>>>>>>>>         their objects in different order.
> >>>>>>>>>
> >>>>>>>>>         Tilman
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >> ---------------------------------------------------------------------
> >>>>>>>>>         To unsubscribe, e-mail:
> >> [email protected]
> >>>>>>>>>         <mailto:[email protected]>
> >>>>>>>>>         For additional commands, e-mail:
> >> [email protected]
> >>>>>>>>>         <mailto:[email protected]>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>
>
>
>
>
