[ https://issues.apache.org/jira/browse/TIKA-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15774207#comment-15774207 ]
ASF GitHub Bot commented on TIKA-2222:
--------------------------------------

GitHub user essiembre opened a pull request:

    https://github.com/apache/tika/pull/143

    New XFDL parser for TIKA-2222 contributed by pascal.essiembre

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/essiembre/tika TIKA-2222

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/143.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #143

----
commit f6acb7c9b509e98c76c520123e79941071a08ea6
Author: Pascal Essiembre <pascal.essiem...@norconex.com>
Date:   2016-12-24T03:58:48Z

    New XFDL parser for TIKA-2222 contributed by pascal.essiembre

----

> Contributing a XFDL Parser
> --------------------------
>
>                 Key: TIKA-2222
>                 URL: https://issues.apache.org/jira/browse/TIKA-2222
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>
> I am considering contributing an XFDL parser, but I first have a few questions.
> Feel free to close and let me know if this is not the proper channel for
> asking such questions.
> XFDL files are XML-based forms that can be regular text or base64 encoded.
> They contain form field labels, field values, formulas, screen coordinates,
> etc. Not everything is relevant, so the default XML parser will extract too
> much text.
> My question is about what to store as metadata vs content.
> Some people may want to capture the form field values only, while others may
> feel capturing the field labels is just as important. Because people may be
> interested in different things, I am thinking of storing each as separate
> metadata entries. Doing so may mean that little or no "content" is
> extracted, which can also be a nuisance to some. So...
> would it be acceptable to store specific values as both metadata entries
> (structured) and content (unstructured)? That approach would be the most
> flexible for users, but is there a concern with having information stored in
> two locations?
> As for the metadata entries themselves, I am thinking this XFDL XML
> representation...
> {code:xml}
> <field sid="FieldID">
>   …
>   <value>This is a value.</value>
>   <label>This is a label</label>
>   …
> </field>
> {code}
> …could be stored as flattened metadata like this:
> {code}
> field.FieldID.value= This is a value.
> field.FieldID.label= This is a label
> {code}
> Any comments on this approach?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
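
For illustration only, the flattening proposed in the issue description
(one "field.<sid>.value" and one "field.<sid>.label" entry per form field)
could be sketched roughly as below. This is not the code from the pull
request and it does not use the Tika API: the class name XfdlFlattenSketch
and the flatten() helper are hypothetical, it relies only on the JDK's DOM
parser, and it assumes plain (not base64-encoded) XFDL input whose <field>
elements carry a "sid" attribute as in the sample above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Tika or of pull request #143.
class XfdlFlattenSketch {

    /**
     * Collects the direct <value> and <label> children of every <field>
     * element into "field.<sid>.value" / "field.<sid>.label" entries.
     * Other children (formulas, coordinates, ...) are ignored.
     */
    static Map<String, String> flatten(String xml) {
        Map<String, String> entries = new LinkedHashMap<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so external entities cannot be resolved.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            NodeList fields = doc.getElementsByTagName("field");
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                String sid = field.getAttribute("sid");
                for (Node child = field.getFirstChild(); child != null;
                        child = child.getNextSibling()) {
                    if (child.getNodeType() != Node.ELEMENT_NODE) {
                        continue; // skip whitespace text nodes and comments
                    }
                    String name = child.getNodeName();
                    if (name.equals("value") || name.equals("label")) {
                        entries.put("field." + sid + "." + name,
                                child.getTextContent().trim());
                    }
                }
            }
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch easy to call.
            throw new IllegalArgumentException("Could not parse XFDL input", e);
        }
        return entries;
    }

    public static void main(String[] args) {
        String xml = "<XFDL><page><field sid=\"FieldID\">"
                + "<value>This is a value.</value>"
                + "<label>This is a label</label>"
                + "</field></page></XFDL>";
        // Print each flattened entry in the key= value style used in the issue.
        flatten(xml).forEach((key, value) -> System.out.println(key + "= " + value));
    }
}
```

In a real parser the same entries would presumably be written to the
metadata object instead of a Map, and the same strings could also be
emitted as content, which is the "both locations" option the issue asks
about.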