Hi,

> So AcroForms/Fields is an empty Array?

Yes. In the filled interview_compressed.pdf, the AcroForm fields array is
not null but empty: its size is zero.

Also, I tried the qpdf command-line tool to compress interview.pdf, and the
resulting file (1.6 MB) was nowhere near the size of
interview_compressed.pdf (21 KB).

Thanks,
Balaji

On Fri, May 22, 2015 at 11:58 PM, Maruan Sahyoun <[email protected]> wrote:

> Hi,
>
> > On 22.05.2015 at 23:00, Balaji Venkatamohan <[email protected]> wrote:
> >
> > I opened the interview_compressed in notepad++ and did not see any
> > 'Acroform' text anywhere.
> > However, as Maruan suggested, I entered some data into what looks like
> > form fields of interview_compressed.pdf and saved it. When I opened this
> > file in notepad++, I did see 'Acroform' text in it. I also noticed an
> > increase in file size from 21 KB to ~530 KB.
> >
> > I then ran this filled, saved, compressed PDF in pdfdebugger.java and saw
> > that the field values were getting stored, not under Acroform fields but
> > under Annotations.
> >
>
> So AcroForms/Fields is an empty Array?
>
> > Please refer to this image:
> >
> > http://imageshack.com/a/img540/9951/QGLDtS.jpg
> >
> > So, whatever the compression technique was, it simply made all the
> > Acroform fields disappear from the original PDF but retained all
> > annotations, which also contain the interactive forms, and this helped
> > reduce the file size so much? If this is the case, can the pdfbox API also
> > use a similar compression technique to compress such a huge file into a
> > smaller one?
> >
> > On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <[email protected]> wrote:
> >
> >> Hi,
> >>
> >>> On 22.05.2015 at 21:54, Tilman Hausherr <[email protected]> wrote:
> >>>
> >>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
> >>>> Hello,
> >>>>
> >>>> I used PdfDebugger to make the internal PDF structure of the two files,
> >>>> (1) interview.pdf and (2) interview_compressed.pdf, visually available,
> >>>> and I have uploaded my images to imageshack. Here are the five links:
> >>>>
> >>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
> >>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
> >>>> http://imageshack.com/a/img903/8644/mk15As.jpg
> >>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
> >>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
> >>>>
> >>>> The first two links are from the internal structure of interview.pdf
> >>>> (original uncompressed file).
> >>>> The third and fourth links are from the internal structure of
> >>>> interview_compressed.pdf (compressed file).
> >>>> The fifth link compares the file sizes of the two files and, as you can
> >>>> also see, the difference is huge.
> >>>>
> >>>> As you might notice, the file interview_compressed.pdf has no acroform
> >>>
> >>> Indeed... but this is needed - from the spec:
> >>>
> >>> "The contents and properties of a document’s interactive form shall be
> >>> defined by an interactive form dictionary that shall be referenced from
> >>> the AcroForm entry in the document catalogue (see 7.7.2, “Document
> >>> Catalog”). Table 218 shows the contents of this dictionary."
> >>>
> >>
> >> correct
> >>
> >>>> fields listed even though opening the PDF in pdf reader allows me to
> >>>> enter values in places which look like AcroForm fields and also save
> >>>> them.
> >>>> Are there any other PDF 'types' similar to Acroform fields which would
> >>>> enable users to fill data and which can be accessed in PdfBox APIs
> >>>> without having to go through PDAcrofield?
> >>>
> >>> Yes, annotations... there are some common parts, but this is just a
> >>> vague observation from me, I'm not the acroform specialist.
> >>
> >> At first glance it looks like all the entries necessary to (re-)generate
> >> the form fields are there. That's what's likely happening for this
> >> document in Adobe Reader. It would be interesting to see what's being
> >> saved after the form has been filled out and saved using Acrobat. We'd
> >> need a test form to come up with an enhancement like this.
> >>
> >> BR
> >> Maruan
> >>
> >>>
> >>> What you should do: use NOTEPAD++ to look whether there's "/AcroForm"
> >>> in the "compressed" file.
> >>> - if it is missing, tell the client (or your boss) just that
> >>> - if it isn't missing, then there's some problem in PDFBox (try also
> >>>   the loadNonSeq I mentioned earlier)
> >>>
> >>> Tilman
> >>>
> >>>>
> >>>> You can use qpdf , then use these options:
> >>>>
> >>>> I will now try using this link to compress the original file.
> >>>>
> >>>> Another strategy to think about - can your client generate a
> >>>> non-confidential file, so that you can share it, and the "compressed"
> >>>> file?
> >>>>
> >>>> I wish I had direct communication with the clients, but due to
> >>>> bureaucracy I am having to go through multiple layers to get my message
> >>>> across to them. I will share more information as soon as I have it.
> >>>>
> >>>> PS: I sent these image links to my personal email first to make sure
> >>>> that I can open them. I could, and so I am hoping you all can too. If
> >>>> you are unable to open them, please let me know.
> >>>>
> >>>> Thanks,
> >>>> Balaji
> >>>>
> >>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Balaji Venkatamohan <[email protected]> wrote on 20 May 2015 at 03:24:
> >>>>>>>
> >>>>>>> Thank you for your pointers and sorry about the image. I am
> >>>>>>> attaching it with this email.
> >>>>>>>
> >>>>>>> The point I am trying to make is that the PDF, which was
> >>>>>>> decompressed using WriteDecodedDoc, is smaller in size than the
> >>>>>>> original PDF given to us by our customers.
> >>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox
> >>>>>>> did not have any PDAcroform fields, whereas the decompressed PDF
> >>>>>>> given to us by the customers does contain Acroform fields. Hence I
> >>>>>>> wanted to know how to properly decompress the PDF using pdfbox APIs.
> >>>>>>> The reason why I was analyzing COSStream was to check if the
> >>>>>>> decompression of the compressed PDF was happening correctly while
> >>>>>>> using PDFBox APIs.
> >>>>>>> I know it would have been difficult for you to help me without the
> >>>>>>> actual PDFs. For that, I would like to thank you for your time and
> >>>>>>> pointers.
> >>>>>>>
> >>>>>> Maybe it's worth trying to share the file "visually" with us. Open
> >>>>>> both files (compressed and decompressed) with PDFDebugger [1], post a
> >>>>>> screenshot of both somewhere (dropbox etc.) and share the link with
> >>>>>> us. Maybe that could shed some light on your issue.
> >>>>>>
> >>>>> @Balaji: here's an example of what such a screenshot would look like:
> >>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
> >>>>>
> >>>>> Tilman
> >>>>>
> >>>>>> BR
> >>>>>> Andreas Lehmkühler
> >>>>>>
> >>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
> >>>>>>
> >>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <[email protected]>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>> The image doesn't appear in the mailing list.
> >>>>>>>>
> >>>>>>>> This is all very confusing... /acroform is in the document catalog.
> >>>>>>>> I don't see how the page content stream is related to it. The best
> >>>>>>>> is that you either go through the source code, or read the spec and
> >>>>>>>> then look at the pdf.
> >>>>>>>>
> >>>>>>>> To find out what's going on, you'd have to start from that
> >>>>>>>> /acroform entry and then compare the two files.
> >>>>>>>>
> >>>>>>>> It is really difficult to help you without the files. The cause
> >>>>>>>> could be a bug in pdfbox, or a malformed pdf...
> >>>>>>>>
> >>>>>>>> Some more ideas:
> >>>>>>>> - use loadNonSeq(file, null) instead of load(file)
> >>>>>>>> - try the unreleased 2.0 version, that one has some improvements in
> >>>>>>>>   the acroform stuff. Note that the API is different.
> >>>>>>>>   https://pdfbox.apache.org/download.cgi#scm
> >>>>>>>>   https://pdfbox.apache.org/2.0/getting-started.html
> >>>>>>>>
> >>>>>>>> If you still need help, one possibility would be 1) post the
> >>>>>>>> smallest possible code that fails, and 2) post a small part of the
> >>>>>>>> raw PDF, i.e. the objects relevant to the field in your code.
> >>>>>>>>
> >>>>>>>> Tilman
> >>>>>>>>
> >>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
> >>>>>>>>
> >>>>>>>>> Moreover, for every page of the compressed PDF (there are 3
> >>>>>>>>> pages), I tried getting the COSStream for each of the pages:
> >>>>>>>>>
> >>>>>>>>> PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
> >>>>>>>>> PDStream pdStream = firstPage.getContents();
> >>>>>>>>> COSStream stream = pdStream.getStream();
> >>>>>>>>>
> >>>>>>>>> In the above code snippet, the object stream, when analyzed in
> >>>>>>>>> debug mode, has the following:
> >>>>>>>>>
> >>>>>>>>> The line from the compressed PDF as opened with Notepad++ is:
> >>>>>>>>>
> >>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
> >>>>>>>>>
> >>>>>>>>> From this point on, using the COSStream object for every page, how
> >>>>>>>>> can I decompress and find out the acroform fields, given that the
> >>>>>>>>> unFilteredStream object is null for COSStream?
> >>>>>>>>>
> >>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
> >>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
> >>>>>>>>>
> >>>>>>>>>     Thank you for your response Tilman.
> >>>>>>>>>
> >>>>>>>>>     I had previously tried using WriteDecodedDoc for my compressed
> >>>>>>>>>     PDF, and I tried to get the number of acro form fields present
> >>>>>>>>>     in the output file generated by WriteDecodedDoc. The API still
> >>>>>>>>>     could not find the acro form fields in the generated
> >>>>>>>>>     decompressed file. Also, the decompressed file generated is
> >>>>>>>>>     75 KB, which is far less than the original decompressed file
> >>>>>>>>>     which I have (1.6 MB), though I could edit the acro form fields
> >>>>>>>>>     using acrobat reader.
> >>>>>>>>>
> >>>>>>>>>     Thanks,
> >>>>>>>>>     Balaji
> >>>>>>>>>
> >>>>>>>>>     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
> >>>>>>>>>     <[email protected] <mailto:[email protected]>> wrote:
> >>>>>>>>>
> >>>>>>>>>         On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
> >>>>>>>>>
> >>>>>>>>>             My question is: how do I flatedecode a PDF so that I
> >>>>>>>>>             can find all the acroform fields within it? Any help or
> >>>>>>>>>             pointers would be highly appreciated.
> >>>>>>>>>
> >>>>>>>>>         You could try the WriteDecodedDoc option of the command
> >>>>>>>>>         line app
> >>>>>>>>>         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
> >>>>>>>>>
> >>>>>>>>>         Maybe you can have further ideas by comparing the two
> >>>>>>>>>         files with NOTEPAD++.... however the two files might have
> >>>>>>>>>         their objects in different order.
> >>>>>>>>>
> >>>>>>>>>         Tilman
> >>>>>>>>>
> >>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>> To unsubscribe, e-mail: [email protected] <mailto:[email protected]>
> >>>>>>>>> For additional commands, e-mail: [email protected] <mailto:[email protected]>
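On the "how do I flatedecode" question quoted above: a stream dictionary such as <</Filter/FlateDecode/Length 5675>> means the stream body is zlib/deflate data. The sketch below only illustrates that decoding step with the JDK's java.util.zip classes. It does not use PDFBox, the class and method names (FlateDemo, inflate, deflate) are made up for illustration, and the sample bytes are fabricated rather than taken from interview.pdf. As Tilman points out in the thread, the form is referenced from /AcroForm in the document catalog, so inflating a page content stream should not be expected to reveal form fields.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateDemo {
    /** Inflates zlib-wrapped data, the encoding a /FlateDecode filter denotes. */
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                break; // truncated or unsupported input: stop instead of looping forever
            }
            out.write(buffer, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }

    /** Deflates data; used here only to fabricate a sample stream body. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        // Hypothetical page content; a real stream body sits between the
        // "stream" and "endstream" keywords in the PDF file.
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        byte[] streamBody = deflate(content);
        String decoded = new String(inflate(streamBody), StandardCharsets.US_ASCII);
        System.out.println(decoded);
    }
}
```

Running main prints the original content operators back, which is all that "decoding" a FlateDecode stream amounts to. With PDFBox itself this step is normally left to the library or to the WriteDecodedDoc tool mentioned in the thread.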

