PackageExtractor should track names of recursively nested resources
-------------------------------------------------------------------

                 Key: TIKA-675
                 URL: https://issues.apache.org/jira/browse/TIKA-675
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.0
            Reporter: Andrzej Bialecki 


When parsing archive formats, the hierarchy of names is not tracked: only the 
current embedded component's name is preserved under 
Metadata.RESOURCE_NAME_KEY. It would be useful to build pseudo-URLs for nested 
resources, similar to the VFS model. Since the Tika API works on streams, such 
a URL could look like 
{code}tar:gz:stream://example.tar.gz!/example.tar!/example.html{code}
Alternatively, the parent-child relationship could be tracked in some other 
way. Some applications need this information, for example to determine which 
composite documents to delete from the index after a container archive has 
been deleted.
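
To make the proposal concrete, here is a minimal sketch of how such a 
pseudo-URL could be assembled while descending into nested containers. It is 
not existing Tika code: the class NestedResourcePath, its methods, and the 
"stream" root scheme handling are all hypothetical names chosen for 
illustration, and the sketch deliberately uses only the JDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tika): records the chain of resource
// names seen while recursing into nested archives.
final class NestedResourcePath {
    private final List<String> names = new ArrayList<>();   // outermost first
    private final List<String> schemes = new ArrayList<>(); // null for non-containers

    // Call when the parser starts processing a resource. The scheme is the
    // container type ("gz", "tar", ...) or null for a plain document.
    void enter(String name, String scheme) {
        names.add(name);
        schemes.add(scheme);
    }

    // Call when the parser has finished with the current resource.
    void leave() {
        if (names.isEmpty()) {
            throw new IllegalStateException("no resource to leave");
        }
        names.remove(names.size() - 1);
        schemes.remove(schemes.size() - 1);
    }

    // Names joined with "!/", e.g. example.tar.gz!/example.tar
    String path() {
        return String.join("!/", names);
    }

    // Innermost container scheme comes first, as in the example URL above.
    String toPseudoUrl() {
        StringBuilder sb = new StringBuilder();
        for (int i = schemes.size() - 1; i >= 0; i--) {
            if (schemes.get(i) != null) {
                sb.append(schemes.get(i)).append(':');
            }
        }
        return sb.append("stream://").append(path()).toString();
    }

    // True if resourcePath lies strictly inside containerPath.
    static boolean isInside(String resourcePath, String containerPath) {
        return resourcePath.startsWith(containerPath + "!/");
    }
}
```

Under these assumptions, calling enter("example.tar.gz", "gz"), 
enter("example.tar", "tar") and enter("example.html", null) in that order 
makes toPseudoUrl() return the URL shown above. An indexing application could 
store path() with each document and use a prefix check such as isInside() to 
find the documents to delete when a container archive is removed.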

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
