[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196030#comment-15196030
 ] 

Ray Gauss II commented on TIKA-1607:
------------------------------------

bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor 
as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above. If I'm understanding 
correctly, though, that alone would still force users to write 'Tika-based' XMP 
parsers rather than giving them access to the raw XMP-encoded bytes you refer 
to in your last sentence, which I agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way 
that hopefully doesn't require sweeping changes to the parsers (I'm thinking of 
this with an eye towards all types of embedded resources, not just XMP).

The {{EmbeddedDocumentExtractor}} interface's {{parseEmbedded}} method 
currently takes a {{Metadata}} object which is only associated with the 
embedded resource (not the same metadata object associated with the 'container' 
file) and is populated with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:
{code}
import java.util.Map;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;

/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for later retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources, or null if no resources
     * were stored during parsing.
     *
     * @return the embedded resources, keyed by each resource's own metadata
     */
    Map<Metadata, byte[]> getEmbeddedResources();

}
{code}

then modify ParsingEmbeddedDocumentExtractor to implement it with an option 
which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor 
that users could set in the context?

Option 3. Just pull {{FileEmbeddedDocumentExtractor}} out of {{TikaCLI}} and 
make them use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to 
include some {{EmbeddedResources}} object to be optionally populated along with 
the {{Metadata}} in the {{Parser.parse}} method?
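
As a rough illustration of what Options 1 and 2 would amount to, here is a 
minimal sketch of a storing extractor. To keep it free of any Tika dependency, 
{{ResourceInfo}} and {{SimpleEmbeddedExtractor}} are hypothetical stand-ins for 
{{Metadata}} and {{EmbeddedDocumentExtractor}} (the real {{parseEmbedded}} also 
takes a {{ContentHandler}} and an {{outputHtml}} flag), and {{StoringExtractor}} 
is just an assumed name, not an existing class:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Tika's Metadata: only the embedded resource's
// name and content type, so the sketch compiles without Tika.
final class ResourceInfo {
    final String name;
    final String contentType;

    ResourceInfo(String name, String contentType) {
        this.name = name;
        this.contentType = contentType;
    }
}

// Simplified stand-in for EmbeddedDocumentExtractor (the real parseEmbedded
// also takes a ContentHandler and an outputHtml flag).
interface SimpleEmbeddedExtractor {
    boolean shouldParseEmbedded(ResourceInfo info);

    void parseEmbedded(InputStream stream, ResourceInfo info) throws IOException;
}

// Stores the raw bytes of each embedded resource instead of parsing them.
final class StoringExtractor implements SimpleEmbeddedExtractor {
    // Keys use identity equality; insertion order is preserved.
    private final Map<ResourceInfo, byte[]> resources = new LinkedHashMap<>();

    @Override
    public boolean shouldParseEmbedded(ResourceInfo info) {
        return true; // store everything; a real option could filter by type
    }

    @Override
    public void parseEmbedded(InputStream stream, ResourceInfo info) throws IOException {
        // Copy the stream into memory; the caller keeps ownership of the stream.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        resources.put(info, buffer.toByteArray());
    }

    // Mirrors the proposed getEmbeddedResources(): null if nothing was stored.
    Map<ResourceInfo, byte[]> getEmbeddedResources() {
        return resources.isEmpty() ? null : Collections.unmodifiableMap(resources);
    }
}

class StoringExtractorDemo {
    public static void main(String[] args) throws IOException {
        StoringExtractor extractor = new StoringExtractor();
        byte[] xmp = "<x:xmpmeta/>".getBytes(StandardCharsets.UTF_8);
        extractor.parseEmbedded(new ByteArrayInputStream(xmp),
                new ResourceInfo("packet.xmp", "application/rdf+xml"));
        for (Map.Entry<ResourceInfo, byte[]> e : extractor.getEmbeddedResources().entrySet()) {
            System.out.println(e.getKey().name + ": " + e.getValue().length + " bytes");
        }
    }
}
```

With the real interfaces, users would presumably register such an extractor via 
{{ParseContext.set(EmbeddedDocumentExtractor.class, extractor)}} and read the 
map back after {{Parser.parse}} returns. Holding every resource in memory is 
the obvious cost of this approach, which is part of why Option 3's temp files 
might still be attractive for large attachments.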

Other options?  Or maybe users don't need the raw XMP at all?

I'm also aware that we've strayed a bit from the original issue here of 
structured metadata.  Should we create a separate issue?

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.13
>
>         Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on more comprehensive extraction and enhancement of 
> Tika's support for phone number extraction and metadata modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, 
> HashMap<String/Property, String/Int/Long>>>, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach. 
> Additionally, it is a fundamental change to the core Metadata API. I hope, 
> however, that the <String, Object> mapping is flexible enough to allow me to 
> model Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
