Fwd: Migrate form field entries from one pdf to another

Roberto Nibali Sun, 28 Jun 2015 10:15:51 -0700

Hi

I'm working on a project that involves the migration of existing PDFs (with
filled forms) to the new template PDFs. The new templates should contain
the same fully qualified field names, so basically my naive approach was to:


1. Scan the original PDF (input PDF) and put all PDField entries found into
a map
2. Open the empty template PDF and, for each PDField entry (the key is the
field's fully qualified name), fill in the field accordingly
3. Save the modified template to an output PDF.
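
To make the three steps concrete without tying them to a PDF library, here is a minimal plain-Java sketch. It is an illustration only: the class name, the method name and the field names are hypothetical, and both the old values and the template are modelled as simple maps keyed by fully qualified field name rather than as PDFBox objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the scan / fill / report steps; no PDFBox types are involved. */
final class FieldMigrationSketch {

    /**
     * Copies each old value into the template when the template has a field
     * with the same fully qualified name. Returns the names found in the old
     * document that have no counterpart in the template.
     */
    static List<String> migrate(Map<String, String> oldValues, Map<String, String> template) {
        List<String> unmatched = new ArrayList<>();
        for (Map.Entry<String, String> entry : oldValues.entrySet()) {
            if (template.containsKey(entry.getKey())) {
                template.put(entry.getKey(), entry.getValue());
            } else {
                unmatched.add(entry.getKey());
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // Step 1: values scanned from the old document (hypothetical names).
        Map<String, String> oldValues = new LinkedHashMap<>();
        oldValues.put("applicant.name", "Jane Doe");
        oldValues.put("applicant.visaType", "B2");
        oldValues.put("legacy.notes", "n/a");

        // Step 2: the empty template, keyed by the same fully qualified names.
        Map<String, String> template = new LinkedHashMap<>();
        template.put("applicant.name", null);
        template.put("applicant.visaType", null);

        List<String> unmatched = migrate(oldValues, template);

        // Step 3 would save the template; here the result is only printed.
        System.out.println("filled: " + template);
        System.out.println("unmatched: " + unmatched);
    }
}
```

Returning the unmatched names rather than silently skipping them is a design choice for this sketch: it makes it visible when the new template does not contain a name that was present in the old document.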

In theory this sounds great, but it's not working as I expect it to. All
PDTextbox type fields are correctly migrated in the output PDF, however the
PDCheckbox, PDRadioCollection, and PDPushButton type fields are not
migrated. Do I have to write specific code for that?

I have tried three different approaches to migrating the fields and none
works. I have not found anything helpful in the usual places, such as Google,
Stack Overflow, the PDFBox source code or its examples. I'm using PDFBox
1.8.9 and the tool is designed to be a CLI, so the only other Maven
dependency I have (not that this should matter) is argparse4j. The PDFs do
not contain XFA data.

This is the code that fills up the map:

private void extractFields() throws IOException {
    // Printing and scaffolding form map
    PDDocument oldPDF = null;
    try {
        logerr("DEBUG: Opening " + inputPDF);
        oldPDF = pdfFormMigrator.loadPDFhandlingEncryption(inputPDF, PDFPassword, true);
        @SuppressWarnings("unchecked")
        List<PDField> fields = oldPDF.getDocumentCatalog().getAcroForm().getFields();
        for (PDField pdField : fields) {
            traverseFields(pdField);
        }
    } catch (Exception e) {
        logerr(e.getMessage());
    } finally {
        if (oldPDF != null) {
            oldPDF.close();
        }
    }
}

private void traverseFields(PDField field) throws IOException {
    List<COSObjectable> kids = field.getKids();
    if (kids != null) {
        for (COSObjectable pdfObj : kids) {
            if (pdfObj instanceof PDField) {
                traverseFields((PDField) pdfObj);
            }
        }
    } else {
        analyseAndPrintFields(field);
        if (!(field instanceof PDSignatureField) && field.getValue() != null) {
            //TODO: maybe field.getActions();
            PDFFormElement pdfFE = new PDFFormElement(
                    field.getValue(), field.getClass(), field.getFieldFlags());
            pdfFormElement.put(field.getFullyQualifiedName(), pdfFE);
        }
    }
}

private PDDocument loadPDFhandlingEncryption(String filename, String password,
        boolean removeSecurity)
        throws IOException, CryptographyException, BadSecurityHandlerException,
        ParserConfigurationException, SAXException {
    PDDocument pdDocument;
    pdDocument = PDDocument.load(filename);
    if (pdDocument.isEncrypted()) {
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial(password);
        pdDocument.openProtection(sdm);
    }
    pdDocument.setAllSecurityToBeRemoved(removeSecurity);
    PDXFA xr = pdDocument.getDocumentCatalog().getAcroForm().getXFA();
    if (xr != null) {
        logmsg("Found XFA data:");
        logmsg(xr.getDocument().toString());
    } else {
        logmsg("No XFA data in stream");
    }
    return pdDocument;
}

Option CLONE (with two variants):

private void executeCmdClone() throws IOException {
    // Mapping the forms from the old to the new pdf
    PDDocument templatePDF = null;
    PDDocument oldPDF = null;

    try {
        logerr("DEBUG: Opening template: " + formTemplatePDF);
        templatePDF = pdfFormMigrator.loadPDFhandlingEncryption(formTemplatePDF, PDFPassword, true);

        logerr("DEBUG: Opening old: " + inputPDF);
        oldPDF = pdfFormMigrator.loadPDFhandlingEncryption(inputPDF, PDFPassword, true);

        /* TODO: Does not work!!!!
        HashMap<String, PDAcroForm> pdAcroFormMap = new HashMap<>();
        @SuppressWarnings("unchecked")
        List<PDField> oldFields = oldPDF.getDocumentCatalog().getAcroForm().getFields();
        for (PDField pdField : oldFields) {
            pdAcroFormMap.put(pdField.getFullyQualifiedName(), pdField.getAcroForm());
            pdField.getAcroForm().exportFDF();
        }

        @SuppressWarnings("unchecked")
        List<PDField> templateFields = oldPDF.getDocumentCatalog().getAcroForm().getFields();
        for (PDField pdField : templateFields) {
            pdField.setAcroForm(pdAcroFormMap.get(pdField.getFullyQualifiedName()));
        }*/

        FDFDocument exportFDF = oldPDF.getDocumentCatalog().getAcroForm().exportFDF();
        templatePDF.getDocumentCatalog().getAcroForm().importFDF(exportFDF);
        exportFDF.close();

        templatePDF.encrypt(PDFPassword, "");
        templatePDF.save(outputPDF);
    } catch (Exception e) {
        logerr(e.getMessage());
    } finally {
        if (templatePDF != null) {
            templatePDF.close();
        }
        if (oldPDF != null) {
            oldPDF.close();
        }
    }
}

Option MIGRATE:

private void executeCmdMigrate() throws IOException {
    // Mapping the forms from the old to the new pdf
    PDDocument templatePDF = null;
    try {
        logerr("DEBUG: Opening " + formTemplatePDF);
        templatePDF = pdfFormMigrator.loadPDFhandlingEncryption(formTemplatePDF, PDFPassword, true);
        for (String keyEntry : pdfFormElement.keySet()) {
            PDFFormElement pdfFormElement = PDFFormMigrator.pdfFormElement.get(keyEntry);
            pdfFormMigrator.setField(templatePDF, keyEntry,
                    (String) pdfFormElement.getValue(),
                    pdfFormElement.getFieldFlags());
        }
        // TODO: Figure out how to map access permissions for the user
        // that currently opened the PDF
        //templatePDF.protect(pdfFormMigrator.applyDefaultProtection());
        //templatePDF.protect(pdfFormMigrator.applyProtection(accessPermissions));
        templatePDF.encrypt(PDFPassword, "");
        templatePDF.save(outputPDF);
    } catch (Exception e) {
        logerr(e.getMessage());
    } finally {
        if (templatePDF != null) {
            templatePDF.close();
        }
    }
}

public void setField(PDDocument pdfDocument, String name, String value, int flags)
        throws IOException {
    PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    PDField field = acroForm.getField(name);
    if (field != null) {
        logmsg("Setting field: " + name + " to value: " + value
                + " with flags: " + flags);
        field.setValue(value);
        if (setFieldFlags) {
            field.setFieldFlags(flags);
        }
    } else {
        logerr("No field found with name: " + name);
    }
}

Option REPLACE:

This option really is different from the others: instead of migrating
fields, I try to apply the text changes that were made between the old
template and the new template directly to the filled PDF. This would probably
be the most elegant way, however the output PDF looks completely illegible
after the search and replace. The fonts are out of control and the text is
all over the page after the transformation. Here is the code:

private void executeCmdReplace() throws IOException {
    PDDocument oldPDF = null;

    try {
        logerr("DEBUG: Opening old: " + inputPDF);
        oldPDF = pdfFormMigrator.loadPDFhandlingEncryption(inputPDF, PDFPassword, true);

        List pages = oldPDF.getDocumentCatalog().getAllPages();
        for (int i = 0; i < pages.size(); i++) {
            PDPage page = (PDPage) pages.get(i);
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream());
            parser.parse();
            List tokens = parser.getTokens();
            for (int j = 0; j < tokens.size(); j++) {
                Object next = tokens.get(j);
                if (next instanceof PDFOperator) {
                    PDFOperator op = (PDFOperator) next;
                    if (op.getOperation().equals("Tj")) {
                        COSString previous = (COSString) tokens.get(j - 1);
                        String string = previous.getString();
                        string = string.replaceFirst("VISA", "Visa");
                        previous.reset();
                        previous.append(string.getBytes("ISO-8859-1"));
                    } else if (op.getOperation().equals("TJ")) {
                        COSArray previous = (COSArray) tokens.get(j - 1);
                        for (int k = 0; k < previous.size(); k++) {
                            Object arrElement = previous.getObject(k);
                            if (arrElement instanceof COSString) {
                                COSString cosString = (COSString) arrElement;
                                String string = cosString.getString();
                                string = string.replaceFirst("VISA", "Visa");
                                // clear the old bytes first, as in the Tj branch above
                                cosString.reset();
                                cosString.append(string.getBytes("ISO-8859-1"));
                            }
                        }
                    }
                }
            }
            // now that the tokens are updated we will replace the
            // page content stream.
            PDStream updatedStream = new PDStream(oldPDF);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens(tokens);
            // close the stream so that everything written is flushed
            out.close();
            page.setContents(updatedStream);
        }
        oldPDF.save(outputPDF);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (oldPDF != null) {
            try {
                oldPDF.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
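
One general property of PDF content streams is relevant to a per-token search and replace like the one above: the operand of TJ is an array of string fragments interleaved with numeric position adjustments, so a word can be split across several fragments. The following plain-Java sketch models such an array as a list of strings and numbers, with no PDFBox types; the class and method names are hypothetical, and it uses a literal `String.replace` rather than the regex-based `replaceFirst` from the code above. Whether the PDFs in question actually split text this way is not known, so this only illustrates one way a per-fragment replacement can miss its target.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Models a TJ operand as a list of String fragments and numeric adjustments. */
final class SplitTextSketch {

    /** Applies the replacement to each string fragment on its own. */
    static List<Object> replacePerFragment(List<Object> tjArray, String search, String replacement) {
        List<Object> result = new ArrayList<>();
        for (Object element : tjArray) {
            if (element instanceof String) {
                result.add(((String) element).replace(search, replacement));
            } else {
                result.add(element); // numeric adjustment, kept unchanged
            }
        }
        return result;
    }

    /** Concatenates only the string fragments, i.e. the text a reader sees. */
    static String visibleText(List<Object> tjArray) {
        StringBuilder sb = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                sb.append((String) element);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "VISA card" split into two fragments around an adjustment value.
        List<Object> split = Arrays.<Object>asList("VI", -20, "SA card");
        List<Object> whole = Arrays.<Object>asList("VISA card");

        System.out.println(visibleText(replacePerFragment(split, "VISA", "Visa")));
        System.out.println(visibleText(replacePerFragment(whole, "VISA", "Visa")));
    }
}
```

In the split case no single fragment contains the search string, so the per-fragment loop leaves the text unchanged even though the page visibly reads "VISA card".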


This is also the least desirable solution, since the client cannot
guarantee that the changes in templates were only text-based.

Next steps on my side: I'll try to extend the migration code to deal with
those special type fields now, just to see if I can get better results.

Due to contractual NDAs, I am unable to share any PDFs; however, I'll try to
come up with an alternative if needed. I'm also open to having a private
Skype or TeamViewer session with someone knowledgeable about this.
Unfortunately, the project has to be delivered by the end of the month, and
I just took over from my colleague on Friday, since he's had a family
emergency.

Any pointers would be greatly appreciated.

Thanks and best regards

Roberto
