Hi Tim,

> On 15.08.2017 at 17:31, Allison, Timothy B. <talli...@mitre.org> wrote:
> 
> All,
>  I can't tell if the triggering file is corrupt or how we want to handle it 
> on the PDFBox side.  The problem is that the parent node is a PDTextField -- 
> a PDTerminalField -- so we don't/can't look for children, even though it 
> actually does have pointers in Kids.

I had a quick look with the debugger and the file looks fine. There is nothing 
wrong with a non-terminal field having a field type /FT while its kids (the 
terminal fields) have none. In that case the kids inherit the field type from 
the parent.
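
As a rough illustration of that inheritance rule, here is a minimal sketch in 
plain Java. It does not use the PDFBox API: FieldTreeSketch, FieldNode and 
their methods are made up for this example, and a node without kids is simply 
treated as a terminal field (a real PDF can also have widget annotations as 
kids, which this ignores).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldTreeSketch {

    /** Hypothetical stand-in for a field dictionary; not a PDFBox class. */
    static final class FieldNode {
        final FieldNode parent;
        final String partialName; // the /T entry
        final String fieldType;   // the /FT entry, null when absent
        final String value;       // the /V entry, null when absent
        final List<FieldNode> kids = new ArrayList<>();

        FieldNode(FieldNode parent, String partialName, String fieldType, String value) {
            this.parent = parent;
            this.partialName = partialName;
            this.fieldType = fieldType;
            this.value = value;
            if (parent != null) {
                parent.kids.add(this);
            }
        }

        /** Walks up the tree until some ancestor (or the node itself) carries /FT. */
        String inheritedFieldType() {
            for (FieldNode n = this; n != null; n = n.parent) {
                if (n.fieldType != null) {
                    return n.fieldType;
                }
            }
            return null;
        }

        /** Joins the partial names from the root down with periods. */
        String fullyQualifiedName() {
            return parent == null
                    ? partialName
                    : parent.fullyQualifiedName() + "." + partialName;
        }
    }

    /** Recurses into the kids and records only the leaves (assumed terminal). */
    static void collectTerminals(FieldNode node, Map<String, String> out) {
        if (node.kids.isEmpty()) {
            out.put(node.fullyQualifiedName(),
                    "[" + node.inheritedFieldType() + "] " + node.value);
            return;
        }
        for (FieldNode kid : node.kids) {
            collectTerminals(kid, out);
        }
    }

    public static void main(String[] args) {
        // Mirrors the structure described in the report: /FT only on the parent.
        FieldNode parent = new FieldNode(null, "parent", "Tx", null);
        new FieldNode(parent, "child1", null, "child1 value");
        new FieldNode(parent, "child2", null, "child2 value");

        Map<String, String> fields = new LinkedHashMap<>();
        collectTerminals(parent, fields);
        fields.forEach((name, info) -> System.out.println(name + " " + info));
    }
}
```

With the tree built in main, the walk reports the two children under their 
fully qualified names, each with the text field type taken from the parent, 
and does not report the parent itself.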

Which version of PDFBox is Tika 1.14 on?

BR
Maruan


> 
> The output from PrintFields is:
> 
> 1 top-level fields were found on the form
> |--parent.parent = ,  type=org.apache.pdfbox.pdmodel.interactive.form.PDTextField
> 
> -----Original Message-----
> From: Tim Allison (JIRA) [mailto:j...@apache.org] 
> Sent: Monday, August 14, 2017 10:36 AM
> To: d...@tika.apache.org
> Subject: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields 
> not handled recursively
> 
> 
>    [ 
> https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125756#comment-16125756
>  ] 
> 
>> Non-terminal interactive form fields not handled recursively
>> ------------------------------------------------------------
>> 
>>                Key: TIKA-2442
>>                URL: https://issues.apache.org/jira/browse/TIKA-2442
>>            Project: Tika
>>         Issue Type: Bug
>>         Components: parser
>>   Affects Versions: 1.14
>>           Reporter: Christopher Creutzig
>>        Attachments: simple-form.pdf
>> 
>> 
>> (I am not sure if this is a Tika or a PDFBox problem; I tried finding 
>> a form extractor in PDFBox, but the app api does not have one. PDFDebugger 
>> does show me the expected tree structure.) The attached PDF has a 
>> non-terminal field named “parent” and two children, “child1” and “child2.” 
>> According to the PDF spec in section 8.6, the fully qualified field names 
>> should be parent.child1 and parent.child2. That is the output given by pdftk:
>>> pdftk simple-form.pdf dump_data_fields
>> ---
>> FieldType: Text
>> FieldName: parent.child1
>> FieldFlags: 0
>> FieldValue: child1 value
>> FieldJustification: Left
>> ---
>> FieldType: Text
>> FieldName: parent.child2
>> FieldFlags: 0
>> FieldValue: child2 value
>> FieldJustification: Left
>> Tika with the ToXMLContentHandler seems to silently ignore the children, 
>> however, returning only a parent with no value.
>> Calling code:
>> import java.io.FileInputStream;
>> import org.apache.tika.detect.DefaultDetector;
>> import org.apache.tika.detect.Detector;
>> import org.apache.tika.metadata.Metadata;
>> import org.apache.tika.parser.AutoDetectParser;
>> import org.apache.tika.parser.ParseContext;
>> import org.apache.tika.parser.Parser;
>> import org.apache.tika.parser.PasswordProvider;
>> import org.apache.tika.sax.ToXMLContentHandler;
>> class readAsXHTML {
>>  public static String readAsXHTML(String filename) throws Exception {
>>    ToXMLContentHandler handler = new ToXMLContentHandler();
>>    Detector detector = new DefaultDetector();
>>    Parser parser = new AutoDetectParser(detector);
>>    ParseContext context = new ParseContext();
>>    Metadata metadata = new Metadata();
>>    FileInputStream fh = null;
>>    try {
>>      fh = new FileInputStream(filename);
>>      parser.parse(fh, handler, metadata, context);
>> 
>>      return(handler.toString());
>>    }
>>    finally {
>>      if (fh != null) {
>>        fh.close();
>>      }
>>    }
>>  }
>> }
>> Abbreviated output:
>> <body><div class="page"><p />
>> </div>
>> <div class="acroform"><ol>   <li>parent: </li>
>> </ol>
>> </div>
>> </body>
>> Expected:
>> <body><div class="page"><p />
>> </div>
>> <div class="acroform"><ol>
>>  <li>parent.child1: child1 value</li>
>>  <li>parent.child2: child2 value</li>
>> </ol>
>> </div>
>> </body>
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.4.14#64029)
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: dev-h...@pdfbox.apache.org
> 

