[ 
https://issues.apache.org/jira/browse/TIKA-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192200#comment-13192200
 ] 

Nick Burch commented on TIKA-675:
---------------------------------

We could probably do this with a wrapper parser that tracks the name, writes 
the nested name to the metadata, and then delegates to a different parser for 
the actual processing.

If we added this, we'd need to decide which metadata key to put it in (a new 
one, or change the resource name?) and how to separate the parts (maybe a ! 
like in VFS?).

It should be very quick to do, though, once those are decided.
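
As a rough illustration of the name-tracking part only, here is a minimal 
sketch in plain Java. It does not use the Tika API: NestedNameTracker, enter, 
leave, current and SEPARATOR are hypothetical names, and the "!/" separator is 
just the VFS-style option floated above, not a decision. A wrapper parser 
could call enter() before delegating to the real parser, write the returned 
name to whichever metadata key is chosen, and call leave() afterwards.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper for illustration; not part of Tika. */
final class NestedNameTracker {
    // Assumed separator between a container and its entry (VFS-style).
    static final String SEPARATOR = "!/";

    // Names from the outermost container (first) to the current resource (last).
    private final Deque<String> names = new ArrayDeque<>();

    /** Call when starting to parse a resource; returns its full nested name. */
    String enter(String resourceName) {
        names.addLast(resourceName);
        return current();
    }

    /** Call once the delegate parser has finished with the current resource. */
    void leave() {
        names.removeLast();
    }

    /** Nested name of the resource currently being parsed ("" if none). */
    String current() {
        return String.join(SEPARATOR, names);
    }
}
```

Keeping the names on a stack means a sibling entry parsed after leave() gets 
the correct parent prefix without the wrapper having to re-derive it.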
                
> PackageExtractor should track names of recursively nested resources
> -------------------------------------------------------------------
>
>                 Key: TIKA-675
>                 URL: https://issues.apache.org/jira/browse/TIKA-675
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Andrzej Bialecki 
>
> When parsing archive formats, the hierarchy of names is not tracked; only the 
> current embedded component's name is preserved under 
> Metadata.RESOURCE_NAME_KEY. It would be nice to build pseudo-URLs for nested 
> resources, similar to the VFS model. For the Tika API, which uses streams, 
> this could look like 
> {code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code} 
> Alternatively, Tika could track the parent-child relationship in some other 
> way. Some applications need this information, for example to determine which 
> composite documents to delete from the index after a container archive has 
> been deleted.
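
The index-cleanup use case can be sketched as a simple prefix match, assuming 
nested names are stored with a "!/" separator as in the pseudo-URL above. 
This is only an illustration: NestedNames, isContainedIn and toDelete are 
hypothetical names, not Tika API, and a real index would likely run this as a 
prefix query rather than scanning a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Tika. */
final class NestedNames {
    // Assumed separator between a container and its entry.
    static final String SEPARATOR = "!/";

    /** True if nestedName refers to something inside the given container. */
    static boolean isContainedIn(String nestedName, String container) {
        return nestedName.startsWith(container + SEPARATOR);
    }

    /** Indexed names to remove when the given container has been deleted. */
    static List<String> toDelete(List<String> indexed, String deletedContainer) {
        List<String> result = new ArrayList<>();
        for (String name : indexed) {
            if (name.equals(deletedContainer) || isContainedIn(name, deletedContainer)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Matching on the container name plus the separator, rather than on the bare 
name, avoids treating example.tar.gz.bak as a child of example.tar.gz.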
