[
https://issues.apache.org/jira/browse/PDFBOX-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185135#comment-14185135
]
Laurent Richard commented on PDFBOX-2419:
-----------------------------------------
By the way, here is the code we use as a workaround until the problem is fixed
in PDFBox:
{code}
public String extractXFDF() {
    try {
        // @Cleanup (Lombok) closes the resource when it goes out of scope
        @Cleanup
        PDDocument pdf = PDDocument.load(pdfFileName);
        pdf.setAllSecurityToBeRemoved(true);
        PDAcroForm form = pdf.getDocumentCatalog().getAcroForm();
        if (form == null) {
            throw new Pdf2OpxException("PDF file contains no Acroform");
        }
        @Cleanup
        FDFDocument fdf = form.exportFDF();
        @SuppressWarnings("unchecked")
        List<FDFField> fields = fdf.getCatalog().getFDF().getFields();
        sanitize(fields); // cf. https://issues.apache.org/jira/browse/PDFBOX-2419
        @Cleanup
        StringWriter writer = new StringWriter();
        fdf.saveXFDF(writer);
        return writer.toString();
    } catch (COSVisitorException e) {
        throw new Pdf2OpxException("exception while extracting XFDF", e);
    } catch (IOException e) {
        throw new Pdf2OpxException("exception while reading PDF", e);
    }
}

// Recursively XML-escapes every field value before the XFDF is written
private void sanitize(List<FDFField> fields) throws IOException {
    if (fields != null) {
        for (FDFField field : fields) {
            Object value = field.getValue();
            // Fields without a value would otherwise cause a NullPointerException
            if (value != null) {
                field.setValue(XmlEscapers.xmlContentEscaper().escape(value.toString()));
            }
            sanitize(field.getKids());
        }
    }
}
{code}
The interesting part is the sanitize method. We use Guava's XmlEscapers, but it
is simply a matter of replacing the three characters mentioned ('<', '>' and
'&') with their XML-escaped equivalents. It would be better to write each field
value correctly in the first place rather than modifying the values recursively
afterwards.
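For anyone without Guava on the classpath, the replacement can be sketched by
hand. This is only an illustration of the three-character substitution
described above; the class name XmlContentEscape and its escape method are
made up for this example and are not part of PDFBox or Guava.

```java
// Hypothetical helper, not part of PDFBox or Guava.
final class XmlContentEscape {
    private XmlContentEscape() {
    }

    /** Replaces '&', '<' and '>' with their XML entity equivalents. */
    static String escape(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Working character by character in a single pass means the '&' of an entity that
was just appended is never escaped a second time, which is the usual pitfall
when chaining String.replace calls in the wrong order.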
This means that org.apache.pdfbox.pdmodel.fdf.FDFField.writeXML could be
adjusted (although a better approach would be to avoid building the XML from
plain Strings at all).
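To illustrate what "not using Strings directly" could look like, here is a
rough sketch based on the JDK's javax.xml.stream.XMLStreamWriter, which escapes
character data itself. It is not how PDFBox is implemented: the class
XfdfFieldWriter and the method writeField are hypothetical, and the
field/value element layout is assumed from the XFDF format.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of PDFBox.
final class XfdfFieldWriter {
    private XfdfFieldWriter() {
    }

    /** Writes one field element and lets the XML writer handle the escaping. */
    static String writeField(String name, String value) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("field");
            xml.writeAttribute("name", name);
            xml.writeStartElement("value");
            xml.writeCharacters(value); // escaping is done by the writer
            xml.writeEndElement(); // closes the value element
            xml.writeEndElement(); // closes the field element
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XFDF field", e);
        }
        return out.toString();
    }
}
```

With this approach a value such as "A & B <Ltd>" never reaches the output
unescaped, so no separate sanitize pass is needed.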
I would have liked to suggest a patch, but I was unable to compile PDFBox (Maven
could not resolve dependencies such as
com.levigo.jbig2:levigo-jbig2-imageio:jar:1.6.3).
> XFDF export is not XML compliant
> --------------------------------
>
> Key: PDFBOX-2419
> URL: https://issues.apache.org/jira/browse/PDFBOX-2419
> Project: PDFBox
> Issue Type: Bug
> Components: AcroForm
> Affects Versions: 1.8.7
> Reporter: Laurent Richard
> Labels: FDF
> Fix For: 1.8.8
>
> Attachments: SampleForm.pdf
>
>
> The XFDF content is written as a simple string instead of XML nodes.
> As a result, field values containing special characters (&, <, >, ...) are
> not escaped and the resulting XML is invalid.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)