[
https://issues.apache.org/jira/browse/TIKA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694774#comment-13694774
]
Tim Allison commented on TIKA-973:
----------------------------------
Agreed on both. I would also appreciate feedback on what the output should be.
The current code extracts this unseemly XHTML:
<div class="acroform">
  <ol>
    <li partialName="form1[0]" fullName="form1[0]"/>
    <ol>
      <li partialName="#subform[6]" fullName="form1[0].#subform[6]"/>
      <li partialName="MiddleInitial[0]" fullName="form1[0].#subform[6].MiddleInitial[0]" altName="Enter Middle Initial (MI)">X</li>
      <li partialName="FamilyName[0]" fullName="form1[0].#subform[6].FamilyName[0]" altName="Section 1. Employee Information and Attestation. Family Name (Last Name)">Doe</li>
      <li partialName="GivenName[0]" fullName="form1[0].#subform[6].GivenName[0]" altName="Given Name (First Name)">John</li>
      <li partialName="OtherNamesUsed[0]" fullName="form1[0].#subform[6].OtherNamesUsed[0]" altName="Maiden Name">Mr. Doe</li>
      <li partialName="StreetNumberName[0]" fullName="form1[0].#subform[6].StreetNumberName[0]" altName=" Street Number and Name">123 Main St.</li>
...
Another idea I had was to include the partialName in the contents and not fill
out the attrs:
<li>StreetNumberName[0]: 123 Main St</li>
More unit tests on the way...
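
To make the second idea concrete, here is a rough sketch of how the "name: value" list items might be produced. It is only an illustration under stated assumptions: the Field class below is a hypothetical stand-in for a form field (it is not a PDFBox or Tika class), and nesting child lists inside their parent item is my own choice, not what the current patch does.

```java
import java.util.ArrayList;
import java.util.List;

class AcroFormListSketch {

    /** Hypothetical stand-in for a form field (not a PDFBox or Tika class). */
    static final class Field {
        final String partialName;
        final String value; // null for fields that only group children
        final List<Field> kids = new ArrayList<>();

        Field(String partialName, String value) {
            this.partialName = partialName;
            this.value = value;
        }

        Field add(Field kid) {
            kids.add(kid);
            return this;
        }
    }

    /** Escapes the characters that would break element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders one field, and any children, as a list item. */
    static String render(Field field) {
        StringBuilder sb = new StringBuilder();
        append(field, sb);
        return sb.toString();
    }

    private static void append(Field field, StringBuilder sb) {
        // The partial name goes into the content instead of an attribute.
        sb.append("<li>").append(escape(field.partialName));
        if (field.value != null) {
            sb.append(": ").append(escape(field.value));
        }
        // Children are nested inside the parent item (an assumption of this sketch).
        if (!field.kids.isEmpty()) {
            sb.append("<ol>");
            for (Field kid : field.kids) {
                append(kid, sb);
            }
            sb.append("</ol>");
        }
        sb.append("</li>");
    }
}
```

Calling render on a Field with partial name "StreetNumberName[0]" and value "123 Main St" yields the list item shown above; a field without a value prints only its name followed by its children.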
> PDF form data isn't included in extracted content.
> --------------------------------------------------
>
> Key: TIKA-973
> URL: https://issues.apache.org/jira/browse/TIKA-973
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 1.2
> Reporter: Michael Graessle
> Priority: Minor
> Attachments: TIKA-973-patch.tar.gz
>
>
> When extracting content from PDFs, PDF form data isn't extracted.
> The following code extracts this data via PDFBox, but it seems like
> something Tika should be doing.
> // "load" is presumably an open PDDocument and "documentContent" a StringBuilder.
> PDDocumentCatalog docCatalog = load.getDocumentCatalog();
> if (docCatalog != null) {
>     PDAcroForm acroForm = docCatalog.getAcroForm();
>     if (acroForm != null) {
>         @SuppressWarnings("unchecked")
>         List<PDField> fields = acroForm.getFields();
>         if (fields != null && fields.size() > 0) {
>             documentContent.append(" ");
>             // Append each non-null field value, separated by spaces.
>             for (PDField field : fields) {
>                 if (field.getValue() != null) {
>                     documentContent.append(field.getValue());
>                     documentContent.append(" ");
>                 }
>             }
>         }
>     }
> }