[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693097#comment-13693097 ]
Nick Burch commented on TIKA-1109:
----------------------------------
Some parsers fetch the metadata first, some do it after the text, some populate
the metadata as they make their way through the file, and some do a mixture!
The current parser contract is only that the metadata will be populated by the
end of the call to parse, not that it will be available during parsing.
It's up to the person writing the parser to do what makes most sense for their
format.
If you need all the metadata before you process the text, you'll need to buffer
the SAX events yourself, sorry.
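
For anyone wanting to try the buffering approach, here is a rough sketch of
what it could look like. It uses only the JDK's org.xml.sax classes. The
BufferingHandler class and its replay method are hypothetical names made up
for this example, not part of Tika, and the sketch records only element and
character events (no prefix mappings, ignorable whitespace or processing
instructions).

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not a Tika class): records SAX events for later replay. */
class BufferingHandler extends DefaultHandler {

    /** One recorded event that can be replayed against a downstream handler. */
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Copy the attributes, since the producer may reuse the Attributes object.
        final Attributes copy = new AttributesImpl(atts);
        events.add(target -> target.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(target -> target.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the characters, since the array belongs to the producer.
        final char[] copy = new char[length];
        System.arraycopy(ch, start, copy, 0, length);
        events.add(target -> target.characters(copy, 0, copy.length));
    }

    /** Replays everything recorded so far, wrapped in startDocument/endDocument. */
    public void replay(ContentHandler target) throws SAXException {
        target.startDocument();
        for (Event event : events) {
            event.replay(target);
        }
        target.endDocument();
    }
}
```

The idea would be to hand a BufferingHandler to the parse call in place of
your real handler, read the Metadata object once parse has returned, and only
then call replay with the real handler. The whole document's events are held
in memory, so this may not suit very large files.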
> Metadata not extracted before the context in OOXML (pptx)
> ---------------------------------------------------------
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Fix For: 1.5
>
>
> It seems that when processing OOXML documents, the metadata is only read
> after the text. This means it's impossible to use the metadata while processing
> the text. I think it would be more useful to have the metadata populated
> first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
> <meta name="Content-Length" content="36518"/>
> <meta name="Content-Type"
> content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> <meta name="resourceName" content="testPPT.pptx"/>
> while there is more metadata in the file (e.g. <dc:title>Attachment
> Test</dc:title>).