[ https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128603#comment-16128603 ]
Tim Allison commented on TIKA-2442:
-----------------------------------

Ok. And, right, I cited that exact table in the summary of PDFBOX-3898. :)

[~msahyoun] has fixed PDFBOX-3898. When we next upgrade PDFBox, the fix will come into Tika.

Thank you, again, for raising this issue, and thank you [~msahyoun] for fixing it in PDFBox!

> Non-terminal interactive form fields not handled recursively
> ------------------------------------------------------------
>
>                 Key: TIKA-2442
>                 URL: https://issues.apache.org/jira/browse/TIKA-2442
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Christopher Creutzig
>         Attachments: simple-form.pdf
>
>
> (I am not sure if this is a Tika or a PDFBox problem; I tried finding a form
> extractor in PDFBox, but the app api does not have one. PDFDebugger does show
> me the expected tree structure.)
> The attached PDF has a non-terminal field named “parent” and two children,
> “child1” and “child2.” According to the PDF spec in section 8.6, the fully
> qualified field names should be parent.child1 and parent.child2. That is the
> output given by pdftk:
> > pdftk simple-form.pdf dump_data_fields
> ---
> FieldType: Text
> FieldName: parent.child1
> FieldFlags: 0
> FieldValue: child1 value
> FieldJustification: Left
> ---
> FieldType: Text
> FieldName: parent.child2
> FieldFlags: 0
> FieldValue: child2 value
> FieldJustification: Left
> Tika with the ToXMLContentHandler seems to silently ignore the children,
> however, returning only a parent with no value.
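
For reference, the naming rule quoted above (a field's fully qualified name is its ancestors' partial names joined with periods, and only terminal fields carry values) can be sketched as below. This is a minimal illustration under stated assumptions, not the PDFBOX-3898 fix and not Tika's code: FormField and FieldWalker are hypothetical stand-ins that do not exist in PDFBox or Tika, and the sketch treats any field without kids as terminal, ignoring the case where a terminal field's kids are widget annotations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a PDF form field; not a PDFBox or Tika class.
final class FormField {
    final String partialName;   // the field's own (partial) name
    final String value;         // null for non-terminal fields
    final List<FormField> kids = new ArrayList<>();

    FormField(String partialName, String value) {
        this.partialName = partialName;
        this.value = value;
    }

    // Appends a child field and returns this field so calls can be chained.
    FormField add(FormField kid) {
        kids.add(kid);
        return this;
    }
}

// Hypothetical helper that walks the field tree recursively.
final class FieldWalker {
    // Maps fully qualified name -> value for every terminal field,
    // in the order the fields are encountered.
    static Map<String, String> collect(List<FormField> roots) {
        Map<String, String> out = new LinkedHashMap<>();
        for (FormField root : roots) {
            walk(root, "", out);
        }
        return out;
    }

    private static void walk(FormField field, String prefix, Map<String, String> out) {
        String name = prefix.isEmpty()
                ? field.partialName
                : prefix + "." + field.partialName;
        if (field.kids.isEmpty()) {
            // Terminal field: record it under its fully qualified name.
            out.put(name, field.value);
            return;
        }
        // Non-terminal field: emit nothing for it, descend into its children.
        for (FormField kid : field.kids) {
            walk(kid, name, out);
        }
    }
}
```

Built with a "parent" field holding "child1" and "child2", FieldWalker.collect yields the two entries parent.child1 and parent.child2, matching the pdftk listing; a walker that stops at the top level would report only "parent" with no value, which is the behaviour described in this issue.
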
> Calling code:
> import java.io.FileInputStream;
> import org.apache.tika.detect.DefaultDetector;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.ToXMLContentHandler;
> class readAsXHTML {
>   public static String readAsXHTML(String filename) throws Exception {
>     ToXMLContentHandler handler = new ToXMLContentHandler();
>     Detector detector = new DefaultDetector();
>     Parser parser = new AutoDetectParser(detector);
>     ParseContext context = new ParseContext();
>     Metadata metadata = new Metadata();
>     FileInputStream fh = null;
>     try {
>       fh = new FileInputStream(filename);
>       // Parse the file and collect the XHTML output in the handler.
>       parser.parse(fh, handler, metadata, context);
>
>       return(handler.toString());
>     }
>     finally {
>       if (fh != null) {
>         fh.close();
>       }
>     }
>   }
> }
> Abbreviated output:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol> <li>parent: </li>
> </ol>
> </div>
> </body>
> Expected:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol>
> <li>parent.child1: child1 value</li>
> <li>parent.child2: child2 value</li>
> </ol>
> </div>
> </body>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)