[ https://issues.apache.org/jira/browse/TIKA-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13694659#comment-13694659 ]
Daniel Bonniot de Ruisselet commented on TIKA-1109:
---------------------------------------------------
I tried it. It broke two tests (same cause): as you mentioned, in Excel the
metadata for TikaMetadataKeys.PROTECTED is populated during parsing. I changed
how that is implemented, and the build now passes:
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] Building Apache Tika 1.5-SNAPSHOT}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO]}}
{{[INFO] --- maven-remote-resources-plugin:1.2.1:process (default) @ tika ---}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] Reactor Summary:}}
{{[INFO]}}
{{[INFO] Apache Tika parent ................................ SUCCESS [0.806s]}}
{{[INFO] Apache Tika core .................................. SUCCESS [8.418s]}}
{{[INFO] Apache Tika parsers ............................... SUCCESS [26.857s]}}
{{[INFO] Apache Tika XMP ................................... SUCCESS [0.789s]}}
{{[INFO] Apache Tika application ........................... SUCCESS [3.336s]}}
{{[INFO] Apache Tika OSGi bundle ........................... SUCCESS [1.204s]}}
{{[INFO] Apache Tika server ................................ SUCCESS [5.312s]}}
{{[INFO] Apache Tika ....................................... SUCCESS [0.014s]}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] BUILD SUCCESS}}
{{[INFO] ------------------------------------------------------------------------}}
{{[INFO] Total time: 47.498s}}
{{[INFO] Finished at: Thu Jun 27 14:10:50 CEST 2013}}
{{[INFO] Final Memory: 27M/1930M}}
{{[INFO] ------------------------------------------------------------------------}}
{{dbonniot@naming:~/world/tika$ svn diff | diffstat}}
{{ main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java | 11 -}}
{{ main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java | 36 ++----}}
{{ test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java | 56 ++++++++++}}
{{ 3 files changed, 74 insertions(+), 29 deletions(-)}}
{{dbonniot@naming:~/world/tika$ svn diff > /tmp/TIKA-1109.patch}}
The logic in OOXMLExtractorFactory is now simpler, since I could remove the
extra shielding, which I suppose was made necessary by the previous ordering.
And the metadata for OOXML formats is now available at parse time, as tested by
the added test to OOXMLParserTest :)
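To illustrate why the ordering matters, here is a minimal, self-contained sketch. It does not use Tika's API: the class MetadataOrderingSketch, its parse method, the "title" key and the "<unset>" placeholder are all hypothetical stand-ins, chosen only to show what a content handler can observe depending on whether metadata is populated before or after the text events.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

public class MetadataOrderingSketch {

    /**
     * Hypothetical stand-in for a parser (NOT Tika's API).
     * It fills in metadata either before or after emitting the text event.
     */
    static void parse(boolean metadataFirst,
                      Map<String, String> metadata,
                      BiConsumer<String, Map<String, String>> onText) {
        if (metadataFirst) {
            metadata.put("title", "Attachment Test");
        }
        // The handler is called while content is streamed; it can only see
        // the metadata that has been populated up to this point.
        onText.accept("slide text", metadata);
        if (!metadataFirst) {
            metadata.put("title", "Attachment Test");
        }
    }

    /** Returns the "title" value a handler observes while text is streamed. */
    static String titleSeenDuringParse(boolean metadataFirst) {
        Map<String, String> metadata = new LinkedHashMap<>();
        List<String> seen = new ArrayList<>();
        parse(metadataFirst, metadata,
              (text, md) -> seen.add(md.getOrDefault("title", "<unset>")));
        return seen.get(0);
    }

    public static void main(String[] args) {
        System.out.println("metadata last:  " + titleSeenDuringParse(false));
        System.out.println("metadata first: " + titleSeenDuringParse(true));
    }
}
```

With metadata populated last, the handler sees no title while processing the text, which corresponds to the symptom reported in the issue; with metadata populated first, the title is already there.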
> Metadata not extracted before the content in OOXML (pptx)
> ---------------------------------------------------------
>
> Key: TIKA-1109
> URL: https://issues.apache.org/jira/browse/TIKA-1109
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Daniel Bonniot de Ruisselet
> Priority: Critical
> Fix For: 1.5
>
>
> It seems that when processing OOXML documents, the metadata is only read
> after the text. This means it's impossible to use the metadata while processing
> the text. I think it would be more useful to have the metadata populated
> first.
> As a symptom:
> java -jar tika-app-1.3.jar test-classes/test-documents/testPPT.pptx
> outputs only as metadata:
> <meta name="Content-Length" content="36518"/>
> <meta name="Content-Type"
> content="application/vnd.openxmlformats-officedocument.presentationml.presentation"/>
> <meta name="resourceName" content="testPPT.pptx"/>
> while there is more metadata in the file (e.g. <dc:title>Attachment
> Test</dc:title>).