Michael Graessle created TIKA-973:
-------------------------------------

             Summary: PDF form data isn't included in extracted content.
                 Key: TIKA-973
                 URL: https://issues.apache.org/jira/browse/TIKA-973
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 1.2
            Reporter: Michael Graessle
            Priority: Minor


When Tika extracts content from a PDF, the values of PDF form (AcroForm) 
fields are not included in the extracted text.

The following code extracts this data directly via PDFBox, but it seems like 
something Tika should be doing itself.

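import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

// "load" is the already-loaded PDDocument; "documentContent" is the
// text buffer that the extracted content is appended to.
PDDocumentCatalog docCatalog = load.getDocumentCatalog();
if (docCatalog != null) {
  // The AcroForm is null when the PDF contains no interactive form.
  PDAcroForm acroForm = docCatalog.getAcroForm();
  if (acroForm != null) {
    @SuppressWarnings("unchecked")
    List<PDField> fields = acroForm.getFields();
    if (fields != null && !fields.isEmpty()) {
      documentContent.append(" ");
      for (PDField field : fields) {
        // Skip fields that have no value.
        if (field.getValue() != null) {
          documentContent.append(field.getValue());
          documentContent.append(" ");
        }
      }
    }
  }
}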
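
The snippet above appends only the field values, so the extracted text loses the information about which field each value came from. One possible refinement, shown here as a minimal dependency-free sketch, is to emit each filled field as a "name: value" line. The FormText class and its render method are hypothetical names used only for illustration, and the Map of field names to values is an assumed stand-in for PDFBox's PDField objects. This is not Tika's or PDFBox's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: not part of Tika or PDFBox.
final class FormText {

    // Renders each filled field as a "name: value" line.
    // The map (field name -> value) is an assumed stand-in for PDField objects.
    static String render(Map<String, String> fields) {
        if (fields == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String value = field.getValue();
            if (value == null || value.isEmpty()) {
                continue; // field has no value, so there is nothing to extract
            }
            out.append(field.getKey()).append(": ").append(value).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("first", "Ada");
        fields.put("middle", null);
        fields.put("last", "Lovelace");
        System.out.print(render(fields));
    }
}
```

Keeping the field name next to the value is a design choice: it makes the extracted text easier to search by field, at the cost of adding the names to the indexed content.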


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
