Hi,

> On 22.05.2015 at 23:00, Balaji Venkatamohan <[email protected]> wrote:
>
> I opened interview_compressed.pdf in Notepad++ and did not see any
> 'Acroform' text anywhere.
> However, as Maruan suggested, I entered some data into what look like form
> fields in interview_compressed.pdf and saved it. When I opened this file in
> Notepad++, I did see 'Acroform' text in it. I also noticed an increase in
> file size from 21 KB to ~530 KB.
>
> I then ran this filled-in, saved, compressed PDF in pdfdebugger.java and saw
> that the field values were being stored, not under the AcroForm fields but
> under Annotations.
So AcroForm/Fields is an empty array?

> Please refer to this image:
>
> http://imageshack.com/a/img540/9951/QGLDtS.jpg
>
> So, whatever the compression technique was, it simply made all the AcroForm
> fields disappear from the original PDF but retained all the annotations, which
> also contain the interactive forms, and this helped reduce the file size so
> much? If this is the case, can the PDFBox API also use a similar compression
> technique to compress such a huge file into a smaller one?
>
> On Fri, May 22, 2015 at 1:25 PM, Maruan Sahyoun <[email protected]>
> wrote:
>
>> Hi,
>>
>>> On 22.05.2015 at 21:54, Tilman Hausherr <[email protected]> wrote:
>>>
>>> On 22.05.2015 at 17:53, Balaji Venkatamohan wrote:
>>>> Hello,
>>>>
>>>> I used PdfDebugger to make the internal PDF structure of the two files,
>>>> (1) interview.pdf and (2) interview_compressed.pdf, visually available, and I
>>>> have uploaded my images to imageshack. Here are the five links:
>>>>
>>>> http://imageshack.com/a/img538/8277/JghCpG.jpg
>>>> http://imageshack.com/a/img909/6140/KsYNGR.jpg
>>>> http://imageshack.com/a/img903/8644/mk15As.jpg
>>>> http://imageshack.com/a/img901/8610/NXe3mJ.jpg
>>>> http://imageshack.com/a/img673/8633/0GMdjQ.jpg
>>>>
>>>> The first two links are from the internal structure of interview.pdf
>>>> (original uncompressed file).
>>>> The third and fourth links are from the internal structure of
>>>> interview_compressed.pdf (compressed file).
>>>> The fifth link compares the file sizes of the two files and, as you can
>>>> also see, the difference is huge.
>>>>
>>>> As you might notice, the file interview_compressed.pdf has no acroform
>>>
>>> Indeed... but this is needed - from the spec:
>>>
>>> "The contents and properties of a document’s interactive form shall be
>>> defined by an interactive form dictionary that shall be referenced from the
>>> AcroForm entry in the document catalogue (see 7.7.2, “Document Catalog”).
>>> Table 218 shows the contents of this dictionary."
>>>
>>
>> correct
>>
>>>> fields listed even though opening the PDF in a PDF reader allows me to
>>>> enter values in places which look like AcroForm fields and also save them.
>>>> Are there any other PDF 'types' similar to AcroForm fields which would
>>>> enable users to fill in data and which can be accessed in the PDFBox APIs
>>>> without having to go through PDAcrofield?
>>>
>>> Yes, annotations... there are some common parts, but this is just a
>>> vague observation from me, I'm not the acroform specialist.
>>
>> At first glance it looks like all the entries necessary to (re-)generate
>> the form fields are there. That's what's likely happening for this document
>> in Adobe Reader. It would be interesting to see what is saved after the
>> form has been filled out and saved using Acrobat. We'd need a test form to
>> come up with an enhancement like this.
>>
>> BR
>> Maruan
>>
>>>
>>> What you should do: use NOTEPAD++ to check whether there's "/AcroForm" in
>>> the "compressed" file.
>>> - if it is missing, tell the client (or your boss) just that
>>> - if it isn't missing, then there's some problem in PDFBox (also try the
>>>   loadNonSeq I mentioned earlier)
>>>
>>> Tilman
>>>
>>>>
>>>> You can use qpdf , then use these options:
>>>>
>>>> I will now try using this link to compress the original file.
>>>>
>>>> Another strategy to think about - can your client generate a
>>>> non-confidential file, so that you can share it, and the "compressed"
>>>> file?
>>>>
>>>> I wish I had direct communication with the clients, but due to
>>>> bureaucracy I have to go through multiple layers to get my message across
>>>> to them. I will share more information as soon as I have it.
>>>>
>>>> PS: I sent these image links to my personal email first to make sure
>>>> that I can open them. I could, so I am hoping you all can too. If you are
>>>> unable to open them, please let me know.
>>>>
>>>> Thanks,
>>>> Balaji
>>>>
>>>>
>>>> On Fri, May 22, 2015 at 6:45 AM, Tilman Hausherr <[email protected]>
>>>> wrote:
>>>>
>>>>> On 22.05.2015 at 08:28, Andreas Lehmkühler wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 20 May 2015 at 03:24, Balaji Venkatamohan <[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Thank you for your pointers and sorry about the image. I am
>>>>>>> attaching it with this email.
>>>>>>>
>>>>>>> The point I am trying to make is that the PDF which was decompressed
>>>>>>> using WriteDecodedDoc is smaller in size than the original PDF given
>>>>>>> to us by our customers.
>>>>>>> Also, the decompressed PDF generated by WriteDecodedDoc of PDFBox did
>>>>>>> not have any PDAcroform fields, whereas the decompressed PDF given to
>>>>>>> us by the customers does contain Acroform fields. Hence I wanted to
>>>>>>> know how to properly decompress the PDF using the PDFBox APIs. The
>>>>>>> reason I was analyzing COSStream was to check whether the
>>>>>>> decompression of the compressed PDF was happening correctly when
>>>>>>> using the PDFBox APIs.
>>>>>>> I know it would have been difficult for you to help me without the
>>>>>>> actual PDFs. For that, I would like to thank you for your time and
>>>>>>> pointers.
>>>>>>>
>>>>>> Maybe it's worth trying to share the file "visually" with us. Open
>>>>>> both files (compressed and decompressed) with PDFDebugger [1], post a
>>>>>> screenshot of both somewhere (dropbox etc.) and share the link with
>>>>>> us. Maybe that could shed some light on your issue.
>>>>>>
>>>>> @Balaji: here's an example of what such a screenshot would look like:
>>>>> http://home.snafu.de/tilman/tmp/pdfdebugger-screenshot.png
>>>>>
>>>>> Tilman
>>>>>
>>>>>> BR
>>>>>> Andreas Lehmkühler
>>>>>>
>>>>>> [1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger
>>>>>>
>>>>>> On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> The image doesn't appear in the mailing list.
>>>>>>>>
>>>>>>>> This is all very confusing... /acroform is in the document catalog.
>>>>>>>> I don't see how the page content stream is related to it. The best
>>>>>>>> is that you either go through the source code, or read the spec and
>>>>>>>> then look at the pdf.
>>>>>>>>
>>>>>>>> To find out what's going on, you'd have to start from that /acroform
>>>>>>>> entry and then compare the two files.
>>>>>>>>
>>>>>>>> It is really difficult to help you without the files. The cause
>>>>>>>> could be a bug in pdfbox, or a malformed pdf...
>>>>>>>>
>>>>>>>> Some more ideas:
>>>>>>>> - use loadNonSeq(file, null) instead of load(file)
>>>>>>>> - try the unreleased 2.0 version, that one has some improvements in
>>>>>>>>   the acroform stuff. Note that the API is different.
>>>>>>>> https://pdfbox.apache.org/download.cgi#scm
>>>>>>>> https://pdfbox.apache.org/2.0/getting-started.html
>>>>>>>>
>>>>>>>> If you still need help, one possibility would be 1) post the
>>>>>>>> smallest possible code that fails, and 2) post a small part of the
>>>>>>>> raw PDF, i.e. the objects relevant to the field in your code.
>>>>>>>>
>>>>>>>>
>>>>>>>> Tilman
>>>>>>>>
>>>>>>>>
>>>>>>>> On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
>>>>>>>>
>>>>>>>>> Moreover, for every page of the compressed PDF (there are 3
>>>>>>>>> pages), I tried getting the COSStream for each page:
>>>>>>>>>
>>>>>>>>>     PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
>>>>>>>>>     PDStream pdStream = firstPage.getContents();
>>>>>>>>>     COSStream stream = pdStream.getStream();
>>>>>>>>>
>>>>>>>>> In the above code snippet, the object stream, when analyzed in
>>>>>>>>> debug mode, has the following:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The line from the compressed PDF as opened with Notepad++ is:
>>>>>>>>>
>>>>>>>>> <</Filter/FlateDecode/Length 5675>>stream
>>>>>>>>>
>>>>>>>>> From this point on, using the COSStream object for every page, how
>>>>>>>>> can I decompress and find the acroform fields, given that the
>>>>>>>>> unFilteredStream object is null for COSStream?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>     Thank you for your response, Tilman.
>>>>>>>>>
>>>>>>>>>     I had previously tried using WriteDecodedDoc on my compressed
>>>>>>>>>     PDF, and I tried to get the number of acro form fields present
>>>>>>>>>     in the output file generated by WriteDecodedDoc. The API still
>>>>>>>>>     could not find the acro form fields in the generated
>>>>>>>>>     decompressed file. Also, the decompressed file generated is
>>>>>>>>>     75 KB, which is far less than the original decompressed file
>>>>>>>>>     which I have (1.6 MB), though I could edit the acro form fields
>>>>>>>>>     using Acrobat Reader.
>>>>>>>>>
>>>>>>>>>     Thanks,
>>>>>>>>>     Balaji
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>     On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
>>>>>>>>>     <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>         On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
>>>>>>>>>
>>>>>>>>>             My question is: how do I flatedecode a PDF so that I
>>>>>>>>>             can find all the acroform fields within it? Any help
>>>>>>>>>             or pointers would be highly appreciated.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>         You could try the WriteDecodedDoc option of the command
>>>>>>>>>         line app
>>>>>>>>>         https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
>>>>>>>>>
>>>>>>>>>         Maybe you can get further ideas by comparing the two
>>>>>>>>>         files with NOTEPAD++.... however the two files might have
>>>>>>>>>         their objects in a different order.
>>>>>>>>>
>>>>>>>>>         Tilman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>> For additional commands, e-mail: [email protected]
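
The Notepad++ check discussed in this thread (searching the raw file for "/AcroForm") can also be scripted. One caveat, offered as an assumption because the actual file was never shared: if a "compressed" PDF keeps its catalog inside a Flate-compressed stream, a plain-text search of the raw bytes could miss the key even though it is present. The sketch below uses only java.util.zip rather than PDFBox, works on an in-memory sample instead of a real PDF, and all names in it (AcroFormScan, containsToken, inflate, deflate) are hypothetical, chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox.
final class AcroFormScan {

    /** Returns true if the ASCII token occurs anywhere in the raw bytes. */
    static boolean containsToken(byte[] data, String token) {
        byte[] needle = token.getBytes(StandardCharsets.ISO_8859_1);
        outer:
        for (int i = 0; i + needle.length <= data.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    /** Inflates zlib-wrapped data, which is what a /FlateDecode stream holds. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    // Avoid looping forever on truncated input.
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Demo-only helper: zlib-compresses bytes so the sample needs no real PDF. */
    static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up catalog dictionary, standing in for real file content.
        byte[] catalog = "<</Type/Catalog/AcroForm 5 0 R/Pages 2 0 R>>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] packed = deflate(catalog);
        System.out.println("plain:    " + containsToken(catalog, "/AcroForm"));
        System.out.println("deflated: " + containsToken(packed, "/AcroForm"));
        System.out.println("inflated: " + containsToken(inflate(packed), "/AcroForm"));
    }
}
```

For a real file, the bytes between "stream" and "endstream" of each FlateDecode stream would have to be extracted first, and streams using predictors or other filters need more than a plain inflate. Running WriteDecodedDoc, as suggested earlier in the thread, and then searching the decoded output is probably the simpler route.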

