Hi Tim,

> On 15.08.2017 at 17:31, Allison, Timothy B. <talli...@mitre.org> wrote:
>
> All,
> I can't tell if the triggering file is corrupt or how we want to handle it
> on the PDFBox side. The problem is that the parent node is a PDTextField --
> a PDTerminalField -- so we don't/can't look for children, even though it
> actually does have pointers in Kids.
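To make the structure described above concrete, here is a rough, self-contained sketch of a parent field that carries /FT and has Kids without one. It is not PDFBox code: `Field`, `effectiveType`, `fullyQualifiedName`, `collect` and `sampleTree` are hypothetical names made up for illustration, and it assumes every kid is a field (in a real PDF, kids can also be widget annotations). It relies only on /FT being an inheritable entry and on fully qualified names being the partial names joined with periods.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldTreeSketch {

    /** Hypothetical stand-in for a field dictionary; this is not a PDFBox class. */
    static final class Field {
        final String partialName; // the /T entry
        final String fieldType;   // the /FT entry, null when absent
        final String value;       // the /V entry, null when absent
        final List<Field> kids = new ArrayList<>();
        Field parent;

        Field(String partialName, String fieldType, String value) {
            this.partialName = partialName;
            this.fieldType = fieldType;
            this.value = value;
        }

        Field addKid(Field kid) {
            kid.parent = this;
            kids.add(kid);
            return this;
        }
    }

    /** /FT is inheritable: take it from the nearest ancestor that defines it. */
    static String effectiveType(Field field) {
        for (Field cur = field; cur != null; cur = cur.parent) {
            if (cur.fieldType != null) {
                return cur.fieldType;
            }
        }
        return null;
    }

    /** Joins the partial names from the root down to this field with periods. */
    static String fullyQualifiedName(Field field) {
        return field.parent == null
                ? field.partialName
                : fullyQualifiedName(field.parent) + "." + field.partialName;
    }

    /** A node with kids is treated as non-terminal, whether or not it has /FT. */
    static void collect(Field field, List<String> out) {
        if (field.kids.isEmpty()) {
            out.add(fullyQualifiedName(field) + ": " + field.value
                    + " [" + effectiveType(field) + "]");
            return;
        }
        for (Field kid : field.kids) {
            collect(kid, out);
        }
    }

    /** Mirrors the reported form: a text parent (/FT /Tx) with two kids lacking /FT. */
    static Field sampleTree() {
        return new Field("parent", "Tx", null)
                .addKid(new Field("child1", null, "child1 value"))
                .addKid(new Field("child2", null, "child2 value"));
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        collect(sampleTree(), lines);
        lines.forEach(System.out::println);
    }
}
```

The point of the sketch is that the terminal/non-terminal decision is made from the presence of kids, not from the presence of /FT on the parent.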
I had a quick look with the debugger and the file looks fine. There is nothing
wrong with a non-terminal field having a field type /FT while its kids
(terminal fields) have none. In that case the kids inherit the field type
from the parent.

Which version of PDFBox is Tika 1.14 on?

BR
Maruan

>
> The output from PrintFields is:
>
> 1 top-level fields were found on the form
> |--parent.parent = , type=org.apache.pdfbox.pdmodel.interactive.form.PDTextField
>
> -----Original Message-----
> From: Tim Allison (JIRA) [mailto:j...@apache.org]
> Sent: Monday, August 14, 2017 10:36 AM
> To: d...@tika.apache.org
> Subject: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields
> not handled recursively
>
>
> [
> https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125756#comment-16125756
> ]
>
>> Non-terminal interactive form fields not handled recursively
>> ------------------------------------------------------------
>>
>> Key: TIKA-2442
>> URL: https://issues.apache.org/jira/browse/TIKA-2442
>> Project: Tika
>> Issue Type: Bug
>> Components: parser
>> Affects Versions: 1.14
>> Reporter: Christopher Creutzig
>> Attachments: simple-form.pdf
>>
>>
>> (I am not sure if this is a Tika or a PDFBox problem; I tried finding
>> a form extractor in PDFBox, but the app api does not have one. PDFDebugger
>> does show me the expected tree structure.) The attached PDF has a
>> non-terminal field named “parent” and two children, “child1” and “child2.”
>> According to the PDF spec in section 8.6, the fully qualified field names
>> should be parent.child1 and parent.child2.
>> That is the output given by pdftk:
>>> pdftk simple-form.pdf dump_data_fields
>> ---
>> FieldType: Text
>> FieldName: parent.child1
>> FieldFlags: 0
>> FieldValue: child1 value
>> FieldJustification: Left
>> ---
>> FieldType: Text
>> FieldName: parent.child2
>> FieldFlags: 0
>> FieldValue: child2 value
>> FieldJustification: Left
>> Tika with the ToXMLContentHandler seems to silently ignore the children,
>> however, returning only a parent with no value.
>> Calling code:
>> import java.io.FileInputStream;
>> import org.apache.tika.detect.DefaultDetector;
>> import org.apache.tika.detect.Detector;
>> import org.apache.tika.metadata.Metadata;
>> import org.apache.tika.parser.AutoDetectParser;
>> import org.apache.tika.parser.ParseContext;
>> import org.apache.tika.parser.Parser;
>> import org.apache.tika.parser.PasswordProvider;
>> import org.apache.tika.sax.ToXMLContentHandler;
>> class readAsXHTML {
>>     public static String readAsXHTML(String filename) throws Exception {
>>         ToXMLContentHandler handler = new ToXMLContentHandler();
>>         Detector detector = new DefaultDetector();
>>         Parser parser = new AutoDetectParser(detector);
>>         ParseContext context = new ParseContext();
>>         Metadata metadata = new Metadata();
>>         FileInputStream fh = null;
>>         try {
>>             fh = new FileInputStream(filename);
>>             parser.parse(fh, handler, metadata, context);
>>
>>             return(handler.toString());
>>         }
>>         finally {
>>             if (fh != null) {
>>                 fh.close();
>>             }
>>         }
>>     }
>> }
>> Abbreviated output:
>> <body><div class="page"><p />
>> </div>
>> <div class="acroform"><ol>
>> <li>parent: </li>
>> </ol>
>> </div>
>> </body>
>> Expected:
>> <body><div class="page"><p />
>> </div>
>> <div class="acroform"><ol>
>> <li>parent.child1: child1 value</li>
>> <li>parent.child2: child2 value</li>
>> </ol>
>> </div>
>> </body>
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.4.14#64029)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
dev-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: dev-h...@pdfbox.apache.org