[jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively

Christopher Creutzig (JIRA) Wed, 16 Aug 2017 06:24:32 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128787#comment-16128787
 ]


Christopher Creutzig commented on TIKA-2442:
--------------------------------------------

[~msahyoun]: Sure.

> Non-terminal interactive form fields not handled recursively
> ------------------------------------------------------------
>
>                 Key: TIKA-2442
>                 URL: https://issues.apache.org/jira/browse/TIKA-2442
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Christopher Creutzig
>         Attachments: simple-form.pdf
>
>
> (I am not sure whether this is a Tika or a PDFBox problem; I looked for a form 
> extractor in PDFBox, but the app API does not have one. PDFDebugger does show 
> me the expected tree structure.)
> The attached PDF has a non-terminal field named “parent” with two children, 
> “child1” and “child2”. According to section 8.6 of the PDF spec, the fully 
> qualified field names should be parent.child1 and parent.child2. That is the 
> output given by pdftk:
> > pdftk simple-form.pdf dump_data_fields
> ---
> FieldType: Text
> FieldName: parent.child1
> FieldFlags: 0
> FieldValue: child1 value
> FieldJustification: Left
> ---
> FieldType: Text
> FieldName: parent.child2
> FieldFlags: 0
> FieldValue: child2 value
> FieldJustification: Left
> Tika with the ToXMLContentHandler, however, seems to silently ignore the 
> children and returns only the parent, with no value.
> Calling code:
> import java.io.FileInputStream;
> import java.io.InputStream;
> import org.apache.tika.detect.DefaultDetector;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.ToXMLContentHandler;
> class readAsXHTML {
>   // Parses the file with Tika and returns the generated XHTML as a string.
>   public static String readAsXHTML(String filename) throws Exception {
>     ToXMLContentHandler handler = new ToXMLContentHandler();
>     Detector detector = new DefaultDetector();
>     Parser parser = new AutoDetectParser(detector);
>     ParseContext context = new ParseContext();
>     Metadata metadata = new Metadata();
>     // try-with-resources closes the stream even if parsing fails.
>     try (InputStream fh = new FileInputStream(filename)) {
>       parser.parse(fh, handler, metadata, context);
>       return handler.toString();
>     }
>   }
> }
> Abbreviated output:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol>    <li>parent: </li>
> </ol>
> </div>
> </body>
> Expected:
> <body><div class="page"><p />
> </div>
> <div class="acroform"><ol>
>   <li>parent.child1: child1 value</li>
>   <li>parent.child2: child2 value</li>
> </ol>
> </div>
> </body>
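
For illustration, the recursive handling the issue title asks for could look roughly like the sketch below. It is a minimal, self-contained model, not Tika or PDFBox code: the classes FieldNameSketch and Field and the method collect are hypothetical names made up for this example. It only shows how fully qualified names such as parent.child1 can be built by joining each ancestor's partial name with a period while descending into non-terminal fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a form field tree; not a Tika or PDFBox class.
final class FieldNameSketch {

    static final class Field {
        final String partialName;              // the field's own name, e.g. "child1"
        final String value;                    // null for non-terminal fields
        final List<Field> kids = new ArrayList<>();

        Field(String partialName, String value) {
            this.partialName = partialName;
            this.value = value;
        }

        Field add(Field kid) {
            kids.add(kid);
            return this;
        }
    }

    // Returns terminal fields keyed by fully qualified name, in document order.
    static Map<String, String> collect(List<Field> roots) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Field root : roots) {
            walk(root, "", out);
        }
        return out;
    }

    // Depth-first walk: non-terminal fields only contribute a name prefix.
    private static void walk(Field field, String prefix, Map<String, String> out) {
        String name = prefix.isEmpty() ? field.partialName : prefix + "." + field.partialName;
        if (field.kids.isEmpty()) {
            out.put(name, field.value);
        } else {
            for (Field kid : field.kids) {
                walk(kid, name, out);
            }
        }
    }

    public static void main(String[] args) {
        // Same shape as the attached simple-form.pdf described above.
        Field parent = new Field("parent", null)
            .add(new Field("child1", "child1 value"))
            .add(new Field("child2", "child2 value"));
        collect(Arrays.asList(parent))
            .forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

Run against the tree from the report, this prints one line per terminal field (parent.child1 and parent.child2 with their values) and no entry for the bare parent, which matches the expected acroform list.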



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
