[Podofo-users] Splicing PDFs with AcroForms, NeedsAppearances, mysterious file size shrinkage, Adobe Reader behavior

Dennis Jenkins Tue, 06 May 2014 23:01:28 -0700

Hello all (but mostly directed to Leonard),

   A few days ago I described [1] some odd behavior that I am having with
Adobe Reader consuming PDFs generated by my project.  To avoid hijacking
Christophe's original thread, I am starting a new one.


   At a high level, my goal is to use PoDoFo to splice together pages from
various PDFs which are US tax forms, fill in the data, save the resulting
PDF and have the filled-in form fields "just work" in Adobe Reader (e.g., be
visible and still editable) and have Adobe Reader NOT prompt the user to
save the file when the user attempts to exit.  Secondly, I noticed that if
I allow Adobe Reader to save the PDF, it shrinks in half (sometimes).  I
want to know why, so that I can optimize the size of my PDFs without
needing Adobe Reader (my code runs on Linux as part of a web service).

   Leonard suggested that my PDF is malformed and that Adobe Reader is
offering to repair/save it in this case.  After much experimentation and
staring at "podofobrowser" views and "podofopdfinfo" diffs of the PDFs
before and after Adobe Reader saves them, I am not 100% convinced that this
is the case.

  In my code, I must set the "NeedAppearances" dictionary element of the
"/AcroForm" to "true", or my fields will not be visible in Adobe Reader.
Otherwise I would need to populate the appearance streams myself, per
section 12.7.3.3 of ISO 32000-1:2008 (herein referred to as "the spec").
When Adobe Reader saves my PDF, this dictionary key disappears, and every
field element gains a key called "AP", with a child key of "N".  This is
discussed in 12.7.3.3 of the spec on page #435, first complete paragraph.

  If I omit the "NeedAppearances" key from the AcroForm, Adobe Reader will
no longer offer to save my PDF, but my field values are no longer visible.
Therefore, I suspect that Adobe Reader wants to save the PDF in order to
apply/generate the per-field appearance streams.

QUESTION 1: Is the above hypothesis valid?

  I generate my PDFs by creating an empty PDF in memory, and "inserting"
pages from other PDFs.  This results in a PDF with no "Fields" in the
"/AcroForm/Fields" array.  Adobe Reader populates the "Fields" array when
it saves the PDF.  However, the count of elements in the "Fields" array
does not match the actual count of fields.  For example, Adobe Reader
places 176 elements into this array, but when I enumerate all fields on all
pages using the PoDoFo API (with my patch to handle inherited fields), I
count 212.  I have not completed an exhaustive comparison of the "Fields"
arrays yet to determine if the discrepancy is due to the inherited form
fields (typically check boxes) or not.  I wrote a routine to populate the
"Fields" array myself (with all 212 items), but Adobe Reader rebuilds it
with only 176 items.  If I do not set the "NeedAppearances" flag, Adobe
Reader never offers to save the PDF on exit, so this array is not rebuilt
in this case.
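
A possible explanation for the 176 versus 212 discrepancy, not verified
against these files: section 12.7.2 of the spec defines "Fields" as the
array of the document's root fields (those with no ancestors in the field
hierarchy), so fields reachable through a parent's "Kids" array would not be
listed individually.  The sketch below shows that selection on a simplified
model.  The FieldNode struct and rootFields function are hypothetical names
for illustration and are not part of the PoDoFo API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of a field dictionary:
// only the /Parent link matters for this selection.
struct FieldNode {
    int parent;  // index of the /Parent field in the same vector, or -1 if none
};

// Return the indices of the root fields, i.e. the entries that would
// belong in /AcroForm/Fields under the reading of 12.7.2 described above.
std::vector<std::size_t> rootFields(const std::vector<FieldNode>& fields) {
    std::vector<std::size_t> roots;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        // A parent index must refer to an entry of the same vector.
        assert(fields[i].parent < static_cast<int>(fields.size()));
        if (fields[i].parent < 0) {
            roots.push_back(i);  // no /Parent: this is a root field
        }
    }
    return roots;
}
```

If this reading applies, counting only the fields without a "/Parent" key
should give a number closer to what Adobe Reader writes.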

QUESTION 2: How does Adobe Reader determine which fields need to be in the
"/AcroForm/Fields" array?

    Adobe Reader seems to not care that the "/AcroForm" is missing (its
presence or absence does not affect when Adobe Reader offers to save the
form).  Yet section 12.7.2 of the spec states that the "/AcroForm" is
required.

QUESTION 3: How do we reconcile section 12.7.2 with Adobe Reader's
behavior?  Which is "correct" (or did I misunderstand the ISO)?

    The content of the "Fields -> element -> AP -> N" key is an
"/XObject".  The data stream created by Adobe Reader for it looks
complicated.

QUESTION 4: Assuming the answer to Question #1 is "yes", do you have any
suggestions on how I can compute the required XObject in code?  I just want
to check a checkbox or place simple text into a text field.
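
For the simple text case, a minimal sketch of the content such an appearance
stream could carry, modelled on the example in 12.7.3.3 of the spec.  The
function names, the font resource name "Helv", and the fixed "2 4" text
offset are assumptions for illustration.  A real implementation would take
the font and size from the field's "DA" string and position the text from
the widget's "Rect".

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Escape the characters that are special inside a PDF literal string:
// '(' , ')' and backslash.
std::string escapePdfString(const std::string& text) {
    std::string out;
    for (char c : text) {
        if (c == '(' || c == ')' || c == '\\') {
            out += '\\';
        }
        out += c;
    }
    return out;
}

// Build the content of a normal-appearance (/AP /N) form XObject for a
// single-line text field. fontName must be a key in the XObject's
// /Resources /Font dictionary; the "2 4" offset is an assumed padding.
std::string buildTextAppearance(const std::string& value,
                                const std::string& fontName,
                                int fontSize) {
    assert(fontSize > 0);
    std::ostringstream cs;
    cs << "/Tx BMC\n"                                  // begin marked content for text fields
       << "q\n"                                        // save graphics state
       << "BT\n"                                       // begin text object
       << "/" << fontName << " " << fontSize << " Tf\n"  // select font and size
       << "0 g\n"                                      // black fill color
       << "2 4 Td\n"                                   // move to the text origin
       << "(" << escapePdfString(value) << ") Tj\n"    // show the field value
       << "ET\n"
       << "Q\n"
       << "EMC\n";
    return cs.str();
}
```

The returned string would become the stream data of a form XObject
("/Type /XObject", "/Subtype /Form", with a "/BBox" and "/Resources"), which
the field's "AP" dictionary references under "N".  For a check box, the
source form typically already carries on and off appearance streams under
"AP/N", so setting "V" and "AS" to the name of the on state is usually
enough.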

    When Adobe Reader does save the PDF, and depending on which source
form(s) are in it, the resulting PDF might shrink in size considerably.  A
cursory look with podofobrowser shows that Adobe Reader has heavily
modified "Pages -> Kids[page] -> Contents[]".  In my current testing PDF,
the original has one element in page #0 Contents, with a compressed length
of 20443.  Adobe Reader's version has 8 array elements, each with
approximately 2K of compressed XObject data.

QUESTION 5:  Why does Adobe Reader tinker with this part of a PDF when
saving it?  Ok, that was rhetorical - I assume that it does so in order to
make the file smaller, and it also sets the "linearized" flag.  The question
should be stated: What rules does Adobe Reader follow when deciding if/how
to refactor the actual page layout?

QUESTION 6: Why does refactoring the XObject components make the file so
much smaller (200K vs 450K for example)?

   In some cases, the file size savings are significant.  If I knew what
rules Adobe Reader followed, I might attempt to write a routine to apply
the same changes using PoDoFo (and share it with the community).
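
One candidate for such a routine, offered as an assumption and not as a
description of what Adobe Reader actually does: drop objects that are no
longer reachable from the document catalog, which can accumulate when pages
are copied in from other documents.  The sketch below marks the reachable
objects on a simplified reference graph.  The graph model and the function
name are hypothetical and not part of PoDoFo.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: refs[i] lists the indices of the objects that
// object i references. Returns one flag per object, true if the object
// can be reached from `root` (for a PDF, the trailer or catalog).
std::vector<bool> reachableObjects(
        const std::vector<std::vector<std::size_t>>& refs,
        std::size_t root) {
    assert(root < refs.size());
    std::vector<bool> seen(refs.size(), false);
    std::vector<std::size_t> stack{root};
    seen[root] = true;
    while (!stack.empty()) {
        std::size_t cur = stack.back();
        stack.pop_back();
        for (std::size_t next : refs[cur]) {
            assert(next < refs.size());
            if (!seen[next]) {      // visit each object once, so cycles are safe
                seen[next] = true;
                stack.push_back(next);
            }
        }
    }
    return seen;
}
```

Objects whose flag stays false could be omitted when writing the file.
Whether that accounts for the 450K to 200K difference would have to be
checked against the actual files.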

   Thank you for your time.

[1] http://sourceforge.net/p/podofo/mailman/message/32302847/
