Michael Graessle created TIKA-973:
-------------------------------------
Summary: PDF form data isn't included in extracted content.
Key: TIKA-973
URL: https://issues.apache.org/jira/browse/TIKA-973
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 1.2
Reporter: Michael Graessle
Priority: Minor
When Tika extracts content from a PDF, the values of the PDF's form fields (AcroForm data) are not included in the extracted text.
The following code extracts this data by calling PDFBox directly, but it seems like something
Tika should be doing itself.
    // load: the PDDocument being processed; documentContent: the buffer receiving the extracted text
    PDDocumentCatalog docCatalog = load.getDocumentCatalog();
    if (docCatalog != null) {
        PDAcroForm acroForm = docCatalog.getAcroForm();
        if (acroForm != null) {
            @SuppressWarnings("unchecked")
            List<PDField> fields = acroForm.getFields();
            if (fields != null && !fields.isEmpty()) {
                // Separate the form values from the preceding body text
                documentContent.append(" ");
                for (PDField field : fields) {
                    // Skip fields that were never filled in
                    if (field.getValue() != null) {
                        documentContent.append(field.getValue());
                        documentContent.append(" ");
                    }
                }
            }
        }
    }
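The loop above reads only the fields returned at the top level of the form. The PDF format lets form fields be nested, so a fuller extraction would probably need to descend into child fields as well. The following is a minimal sketch of that traversal, not PDFBox code: FormField and FormFieldText are hypothetical stand-ins invented for illustration, and a real implementation would use whatever child-field accessor the PDFBox version in use provides.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a PDF form field; this is NOT a PDFBox class.
final class FormField {
    final String value;          // null for unfilled or purely structural fields
    final List<FormField> kids;  // child fields; empty for terminal fields

    FormField(String value, FormField... kids) {
        this.value = value;
        this.kids = Arrays.asList(kids);
    }
}

final class FormFieldText {
    // Appends every non-null value in the field tree, depth first,
    // each followed by a single space (same separator as the snippet above).
    static void appendValues(List<FormField> fields, StringBuilder out) {
        for (FormField field : fields) {
            if (field.value != null) {
                out.append(field.value).append(' ');
            }
            appendValues(field.kids, out);
        }
    }

    public static void main(String[] args) {
        List<FormField> fields = Arrays.asList(
            new FormField("Alice"),
            // A container field with no value of its own and two children
            new FormField(null, new FormField("Springfield"), new FormField(null)),
            new FormField("42"));

        StringBuilder documentContent = new StringBuilder("Body text.");
        documentContent.append(' ');
        appendValues(fields, documentContent);
        System.out.println(documentContent.toString().trim());
    }
}
```

Writing into a caller-supplied StringBuilder mirrors how the original snippet appends to documentContent, so the form values end up in the same buffer as the rest of the extracted text.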