[ 
https://issues.apache.org/jira/browse/TIKA-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097319#comment-13097319
 ] 

Nick Burch commented on TIKA-705:
---------------------------------

I'll need to read the spec to be sure, but I have a feeling it could be our 
issue with not removing anchors before fetching parts.
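
As a rough illustration of what "removing anchors" could look like (a sketch 
only, not the actual POI or Tika code): '#' is not a pchar, so a relationship 
target that still carries a fragment would be expected to fail the part-name 
check shown in the trace below. The helper name stripAnchor and the example 
target path are hypothetical, and only java.net.URI is used.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class AnchorStripper {
    // Hypothetical helper, not part of POI or Tika: returns the URI
    // without its fragment ("anchor"), or the same URI if it has none.
    static URI stripAnchor(URI target) throws URISyntaxException {
        if (target.getRawFragment() == null) {
            return target;
        }
        // In a parsed URI the first '#' is always the fragment delimiter.
        String raw = target.toString();
        return new URI(raw.substring(0, raw.indexOf('#')));
    }

    public static void main(String[] args) throws URISyntaxException {
        // Illustrative target only; not taken from the attached file.
        URI target = new URI("/ppt/slides/slide1.xml#anchor");
        System.out.println(stripAnchor(target));
    }
}
```

The stripped URI is what would then be handed to the part-name lookup, 
assuming the anchor really is the cause.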

Either way, we probably want to make it easier for people to get related 
parts, as the current method is a bit more fiddly than we really want.

This will probably mostly be done on the POI side though, with the only 
Tika bit being the move to the new, simpler code once it is available.

> Valid OOXML PPT file hits InvalidFormatException thrown in POI
> --------------------------------------------------------------
>
>                 Key: TIKA-705
>                 URL: https://issues.apache.org/jira/browse/TIKA-705
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Michael McCandless
>             Fix For: 1.0
>
>         Attachments: testPPT_various.pptx
>
>
> I took the "testRTFVarious.rtf" test case from TIKA-683, and saved it as 
> various other doc types, to generate more test cases.
> But when I did this for PPTX, the resulting file hits this exception:
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: Broken OOXML file
>       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:141)
>       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:95)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:70)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:363)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: A segment shall not hold any characters other than pchar characters. [M1.6]
>       at org.apache.poi.openxml4j.opc.PackagePartName.checkPCharCompliance(PackagePartName.java:370)
>       at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfPartNameHaveInvalidSegments(PackagePartName.java:270)
>       at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfInvalidPartUri(PackagePartName.java:185)
>       at org.apache.poi.openxml4j.opc.PackagePartName.<init>(PackagePartName.java:83)
>       at org.apache.poi.openxml4j.opc.PackagingURIHelper.createPartName(PackagingURIHelper.java:490)
>       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:124)
>       ... 9 more
> {noformat}
> All I did was open Office 2007, copy/paste over the text from the Word doc, 
> and save it.  Ie, it should be a valid OOXML file, unless Office 2007 is 
> buggy?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
