[
https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977887#comment-13977887
]
Rupert Westenthaler commented on TIKA-1109:
-------------------------------------------
To work around the reported issues, I created my own bundle for Tika 1.5 [1].
This bundle does not embed commons-compress, xz, commons-codec, or commons-io, as
those are required by other Apache Stanbol modules anyway and are therefore
guaranteed to be present in the OSGi environment. I am not sure whether Tika would
prefer to embed them to avoid dependencies on other bundles.
If you would like me to create a patch for 1.5 or 1.6-SNAPSHOT, just leave a short
comment.
[1] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/commons/tikabundle/pom.xml
> Metadata not extracted before the content in OOXML (pptx)
> ---------------------------------------------------------
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Labels: patch
> Fix For: 1.5
>
> Attachments: TIKA-1109.patch
>
>
> It seems that when processing OOXML documents, the metadata is only read
> after the text. This means it's impossible to use the metadata while processing
> the text. I think it would be more useful to have the metadata populated
> first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only the following metadata:
> <meta name="Content-Length" content="36518"/>
> <meta name="Content-Type"
> content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> <meta name="resourceName" content="testPPT.pptx"/>
> while there is more metadata in the file (e.g. <dc:title>Attachment
> Test</dc:title>).
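
For illustration, the sketch below shows that the title mentioned in the description lives in its own part of the OOXML package and can be read without touching the slide text. It uses only the Java standard library and is not Tika's implementation. The class name CorePropsPeek and the helper samplePackage are hypothetical, and the code assumes the core properties sit at the conventional part name docProps/core.xml (a strict reader would resolve that part through the package relationships instead).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class CorePropsPeek {
    // Dublin Core namespace used by <dc:title> in the core properties part.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the dc:title from docProps/core.xml, or null if it is absent. */
    public static String readTitle(InputStream ooxml) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                // Assumption: conventional part name for the core properties.
                if (!"docProps/core.xml".equals(e.getName())) {
                    continue;
                }
                DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
                f.setNamespaceAware(true);
                // Copy the entry first so the XML parser cannot close the zip stream.
                byte[] xml = zip.readAllBytes();
                Document doc = f.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
                NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
                return titles.getLength() > 0 ? titles.item(0).getTextContent() : null;
            }
        }
        return null;
    }

    /** Hypothetical helper: builds a tiny zip holding only a core properties part. */
    public static byte[] samplePackage(String title) throws Exception {
        String xml = "<cp:coreProperties"
                + " xmlns:cp=\"http://schemas.openxmlformats.org/package/2006/metadata/core-properties\""
                + " xmlns:dc=\"" + DC_NS + "\">"
                + "<dc:title>" + title + "</dc:title>"
                + "</cp:coreProperties>";
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buf)) {
            out.putNextEntry(new ZipEntry("docProps/core.xml"));
            out.write(xml.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] pkg = samplePackage("Attachment Test");
        System.out.println(readTitle(new ByteArrayInputStream(pkg)));
    }
}
```

Pointing readTitle at a FileInputStream for a real .pptx should work the same way, provided the file follows the conventional layout assumed above.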
--
This message was sent by Atlassian JIRA
(v6.2#6252)