[ 
https://issues.apache.org/jira/browse/TIKA-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15774207#comment-15774207
 ] 

ASF GitHub Bot commented on TIKA-2222:
--------------------------------------

GitHub user essiembre opened a pull request:

    https://github.com/apache/tika/pull/143

    New XFDL parser for TIKA-2222 contributed by pascal.essiembre

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/essiembre/tika TIKA-2222

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tika/pull/143.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #143
    
----
commit f6acb7c9b509e98c76c520123e79941071a08ea6
Author: Pascal Essiembre <pascal.essiem...@norconex.com>
Date:   2016-12-24T03:58:48Z

    New XFDL parser for TIKA-2222 contributed by pascal.essiembre

----


> Contributing an XFDL Parser
> ---------------------------
>
>                 Key: TIKA-2222
>                 URL: https://issues.apache.org/jira/browse/TIKA-2222
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>
> I am considering contributing an XFDL parser, but I have a few questions first.
> Feel free to close this issue and let me know if this is not the proper
> channel for such questions.
> XFDL files are XML-based forms that can be stored as plain text or base64
> encoded. They contain form field labels, field values, formulas, screen
> coordinates, etc. Not all of this is relevant, so the default XML parser
> extracts too much text.
> My question is about what to store as metadata vs. content.
> Some people may want to capture only the form field values, while others may
> feel that capturing the field labels is just as important. Because people may
> be interested in different things, I am thinking of storing each as separate
> metadata entries. Doing so may mean that little or no "content" is
> extracted, which can also be a nuisance to some. So, would
> it be acceptable to store specific values as both metadata entries
> (structured) and content (unstructured)? That approach would be the most
> flexible for users, but is there a concern with having information stored in
> two locations?
> As for the metadata entries themselves, I am thinking this XFDL XML 
> representation...
> {code:xml}
>       <field sid="FieldID">
>           …
>          <value>This is a value.</value>
>          <label>This is a label</label>
>          …
>       </field>
> {code}
> …could be stored as flattened metadata like this:
> {code}
> field.FieldID.value=This is a value.
> field.FieldID.label=This is a label
> {code}
> Any comments on this approach?
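
To make the proposal above concrete, here is a minimal sketch of the flattening step. It is written in Python for brevity and is not the code from the pull request, nor Tika's Java Parser API. The function name `flatten_fields`, the `wanted` parameter, and the `<page>` wrapper in the sample are assumptions made for illustration; real XFDL files use XML namespaces and many more elements, which this sketch ignores. It returns the same text twice: as structured `field.<sid>.<child>` entries and as one unstructured content string, which is the dual storage the question asks about.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the snippet in the issue description.
SAMPLE = """
<page>
  <field sid="FieldID">
    <value>This is a value.</value>
    <label>This is a label</label>
  </field>
</page>
"""


def flatten_fields(xml_text, wanted=("value", "label")):
    """Return (metadata, content) for every <field> element.

    metadata maps "field.<sid>.<child>" to the child's text; content is the
    same text joined into a single unstructured string.
    """
    root = ET.fromstring(xml_text.strip())
    metadata = {}
    parts = []
    for field in root.iter("field"):
        sid = field.get("sid")
        if sid is None:
            continue  # no identifier to build a metadata key from
        for name in wanted:
            child = field.find(name)
            if child is not None and child.text:
                text = child.text.strip()
                metadata[f"field.{sid}.{name}"] = text
                parts.append(text)
    return metadata, "\n".join(parts)


metadata, content = flatten_fields(SAMPLE)
for key, val in metadata.items():
    print(f"{key}={val}")
```

Passing `wanted=("value",)` would keep only the field values, which is one way to serve users who do not care about labels.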



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
