[jira] [Created] (TIKA-973) PDF form data isn't included in extracted content.

Michael Graessle (JIRA) Thu, 09 Aug 2012 12:54:22 -0700

Michael Graessle created TIKA-973:
-------------------------------------

             Summary: PDF form data isn't included in extracted content.
                 Key: TIKA-973
                 URL: https://issues.apache.org/jira/browse/TIKA-973
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 1.2
            Reporter: Michael Graessle
            Priority: Minor



When Tika extracts content from a PDF, the values of PDF form (AcroForm) fields
are not included in the extracted text.

The following code extracts this data directly via PDFBox, but it seems like
something Tika should be doing itself.

// "load" is the already-loaded PDDocument; "documentContent" collects the extracted text.
PDDocumentCatalog docCatalog = load.getDocumentCatalog();
if (docCatalog != null) {
  PDAcroForm acroForm = docCatalog.getAcroForm();
  if (acroForm != null) {
    @SuppressWarnings("unchecked")
    List<PDField> fields = acroForm.getFields();
    if (fields != null && !fields.isEmpty()) {
      // Separate the form values from the preceding page text.
      documentContent.append(" ");
      for (PDField field : fields) {
        // Skip fields that were never filled in.
        if (field.getValue() != null) {
          documentContent.append(field.getValue());
          documentContent.append(" ");
        }
      }
    }
  }
}
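
For anyone who wants to check the appending behaviour without a PDF or PDFBox on the classpath, here is a minimal, self-contained sketch of the same loop. The FormFieldText class, its Field type and the appendFieldValues method are hypothetical stand-ins invented for illustration; they are not part of PDFBox or Tika. The sketch only models "skip null values, append the rest separated by spaces".

```java
import java.util.Arrays;
import java.util.List;

public class FormFieldText {
    // Hypothetical stand-in for a PDF form field: a name plus a possibly-null value.
    static final class Field {
        final String name;
        final String value;

        Field(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // Appends every non-null field value to the content, each followed by a space,
    // mirroring the loop in the report above.
    static String appendFieldValues(String content, List<Field> fields) {
        StringBuilder sb = new StringBuilder(content);
        if (fields != null && !fields.isEmpty()) {
            sb.append(" ");
            for (Field f : fields) {
                if (f.value != null) {
                    sb.append(f.value).append(" ");
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Field> fields = Arrays.asList(
                new Field("name", "Ada"),
                new Field("unfilled", null),
                new Field("city", "London"));
        // Brackets make the trailing space visible.
        System.out.println("[" + appendFieldValues("Body text.", fields) + "]");
        // prints "[Body text. Ada London ]"
    }
}
```

Note that, like the snippet in the report, this leaves a trailing space after the last value and drops the field names, so a real fix might want to trim the output or emit name/value pairs instead.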


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
