[jira] [Comment Edited] (TIKA-1903) Allow for more flexibility in handling embedded metadata objects (e.g. XMP)

Tim Allison (JIRA) Tue, 15 Mar 2016 12:38:12 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196045#comment-15196045
 ]


Tim Allison edited comment on TIKA-1903 at 3/15/16 7:37 PM:
------------------------------------------------------------

Copied from [~rgauss] on TIKA-1607:

Yes, and we could do that in addition to the above, but if I'm understanding 
correctly that alone would still force users to write 'Tika-based' XMP parsers 
rather than allowing them access to the RAW XMP encoded bytes you're referring 
to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way 
that hopefully doesn't require sweeping changes to the parsers (I'm thinking of 
this with an eye towards all types of embedded resources, not just XMP).

The EmbeddedDocumentExtractor interface's parseEmbedded method currently takes 
a Metadata object which is only associated with the embedded resource (not the 
same metadata object associated with the 'container' file) and is populated 
with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:


import java.util.Map;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;

/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources, or null if no resources
     * were stored during parsing.
     *
     * @return the embedded resources
     */
    Map<Metadata, byte[]> getEmbeddedResources();

}


then modify ParsingEmbeddedDocumentExtractor to implement it with an option 
which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor 
that users could set in the context?

Option 3. Just pull FileEmbeddedDocumentExtractor out of TikaCLI and make them 
use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to 
include some EmbeddedResources object to be optionally populated along with the 
Metadata in the Parser.parse method?
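
To make Options 1 and 2 a bit more concrete, here is a minimal, self-contained 
sketch of a 'storing' extractor that keeps each embedded resource's raw bytes 
in memory. Everything in it is an assumption for illustration: ResourceInfo and 
SimpleEmbeddedExtractor are hypothetical stand-ins for Tika's Metadata and 
EmbeddedDocumentExtractor so that the example runs without Tika on the 
classpath, and the real interface's parseEmbedded method takes more arguments 
than shown here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Tika's Metadata: just a name and a type.
final class ResourceInfo {
    final String name;
    final String type;

    ResourceInfo(String name, String type) {
        this.name = name;
        this.type = type;
    }
}

// Hypothetical, simplified stand-in for EmbeddedDocumentExtractor.
interface SimpleEmbeddedExtractor {
    boolean shouldParseEmbedded(ResourceInfo info);

    void parseEmbedded(InputStream stream, ResourceInfo info) throws IOException;
}

// Instead of parsing the embedded stream, copy its raw bytes into a map
// keyed by the per-resource info object.
final class StoringExtractorSketch implements SimpleEmbeddedExtractor {
    private final Map<ResourceInfo, byte[]> resources = new LinkedHashMap<>();

    @Override
    public boolean shouldParseEmbedded(ResourceInfo info) {
        return true; // store everything; a real option could filter by type
    }

    @Override
    public void parseEmbedded(InputStream stream, ResourceInfo info) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = stream.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        resources.put(info, out.toByteArray());
    }

    // Mirrors the proposed contract: null if nothing was stored.
    Map<ResourceInfo, byte[]> getEmbeddedResources() {
        return resources.isEmpty() ? null : resources;
    }
}
```

A caller would hand an instance to the parser, run the parse, and then read 
the stored bytes back from getEmbeddedResources(). Holding every resource in 
memory is the obvious cost of this approach for large containers.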

Other options? Maybe they don't need the RAW XMP?
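
For comparison, the temp-file approach of Option 3 could look roughly like the 
sketch below. It is not the FileEmbeddedDocumentExtractor from TikaCLI; 
TempFileSpooler and its method names are hypothetical, and it uses only the 
JDK's java.nio.file API to spool a stream to a temporary file and clean up 
afterwards.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: writes each embedded stream to its own temp file.
final class TempFileSpooler {
    private final List<Path> files = new ArrayList<>();

    // Copies the stream to a new temp file and remembers the path.
    Path spool(InputStream stream, String suffix) throws IOException {
        Path tmp = Files.createTempFile("embedded-", suffix);
        // createTempFile already created the file, so overwrite it.
        Files.copy(stream, tmp, StandardCopyOption.REPLACE_EXISTING);
        files.add(tmp);
        return tmp;
    }

    List<Path> getFiles() {
        return files;
    }

    // Deletes all spooled files; callers are responsible for invoking this.
    void cleanup() throws IOException {
        for (Path p : files) {
            Files.deleteIfExists(p);
        }
        files.clear();
    }
}
```

This keeps memory use flat regardless of resource size, at the price of disk 
I/O and making the user responsible for cleanup.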



was (Author: [email protected]):
Copied from [~rgauss] on TIKA-1607:

bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor 
as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above, but if I'm understanding 
correctly that alone would still force users to write 'Tika-based' XMP parsers 
rather than allowing them access to the RAW XMP encoded bytes you're referring 
to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way 
that hopefully doesn't require sweeping changes to the parsers (I'm thinking of 
this with an eye towards all types of embedded resources, not just XMP).

The EmbeddedDocumentExtractor interface's parseEmbedded method currently takes 
a Metadata object which is only associated with the embedded resource (not the 
same metadata object associated with the 'container' file) and is populated 
with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:


import java.util.Map;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;

/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources, or null if no resources
     * were stored during parsing.
     *
     * @return the embedded resources
     */
    Map<Metadata, byte[]> getEmbeddedResources();

}


then modify ParsingEmbeddedDocumentExtractor to implement it with an option 
which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor 
that users could set in the context?

Option 3. Just pull FileEmbeddedDocumentExtractor out of TikaCLI and make them 
use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to 
include some EmbeddedResources object to be optionally populated along with the 
Metadata in the Parser.parse method?

Other options? Maybe they don't need the RAW XMP?


> Allow for more flexibility in handling embedded metadata objects (e.g. XMP)
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-1903
>                 URL: https://issues.apache.org/jira/browse/TIKA-1903
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> On TIKA-1607, we veered a bit from allowing flexible metadata structures to 
> how to handle embedded metadata documents, such as XMP.  Let's use this issue 
> to discuss and design.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
