[jira] [Commented] (TIKA-2408) ZipException in text extraction from DOCX file

Tim Allison (JIRA) Fri, 30 Jun 2017 08:27:19 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070271#comment-16070271
 ]


Tim Allison commented on TIKA-2408:
-----------------------------------

[~Giorgy], thank you for opening this and sharing a triggering document!

MSWord also fails to open the document because of numbering.xml, and Winzip 
also has a problem with numbering.xml.  There really does appear to be a 
problem with the zipped numbering.xml.
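
If you want to confirm which zip entry is damaged without going through Tika 
or POI, something like the following rough sketch may help.  It uses only 
java.util.zip; the class name DocxZipCheck and the method 
findUnreadableEntries are made up for illustration and are not part of Tika 
or POI.  It only catches entries that fail to decompress (it does not verify 
CRCs), so treat it as a quick diagnostic rather than a full validator.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Tika or POI.
class DocxZipCheck {

    /** Returns the names of entries whose data cannot be read to the end. */
    static List<String> findUnreadableEntries(String path) throws IOException {
        List<String> unreadable = new ArrayList<>();
        byte[] buffer = new byte[8192];
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                try (InputStream in = zip.getInputStream(entry)) {
                    // Drain the entry; a damaged deflate stream fails here.
                    while (in.read(buffer) != -1) {
                        // nothing to do with the bytes
                    }
                } catch (IOException e) {
                    // ZipException is a subclass of IOException.
                    unreadable.add(entry.getName());
                }
            }
        }
        return unreadable;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DocxZipCheck <file.docx>");
            return;
        }
        for (String name : findUnreadableEntries(args[0])) {
            System.out.println("unreadable entry: " + name);
        }
    }
}
```

Run it with the path to the .docx as the only argument; on a healthy file it 
should print nothing.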

But wait, there's good news!  Our new, experimental SAX-based docx parser 
ignores problems with numbering and extracts text from this document.  Any 
list numbers that rely on numbering.xml are nearly guaranteed to be wrong in 
the output, but you at least get something.

To tell Tika to use that parser instead of our legacy DOM-based parser, do 
something like this:

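{noformat}
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.parser.microsoft.OfficeParserConfig;

        ParseContext parseContext = new ParseContext();
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        officeParserConfig.setUseSAXDocxExtractor(true);
        parseContext.set(OfficeParserConfig.class, officeParserConfig);
        // Then pass parseContext as the last argument to your parser's parse(...) call.
{noformat}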

Longer term, I'm not sure whether we want to move the SAX parser into POI or 
leave it in Tika.  IMHO, POI is right to throw an exception and stop because 
POI's XWPFDocument is read/write, and I'm not sure there's a correct behavior 
for writing to a corrupt document.  However, if your goal is to extract as 
much as you can even if there are problems, then our new SAX parser is for you!

Let me know if you need help turning on that parser via tika-config.xml.  I 
should update our wiki...probably...if I haven't.

> ZipException in text extraction from DOCX file
> ----------------------------------------------
>
>                 Key: TIKA-2408
>                 URL: https://issues.apache.org/jira/browse/TIKA-2408
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Jorge Spinsanti
>         Attachments: ZipException.docx
>
>
> I got a ZipException when trying to extract text from a DOCX file (attached):
> {code}
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 16 more
> Caused by: java.util.zip.ZipException: invalid literal/lengths set
>       at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>       at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:122)
>       at org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.read(ZipSecureFile.java:210)
>       at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
>       at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
>       at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>       at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>       at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>       at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>       at org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>       at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>       at org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown Source)
>       at org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>       at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>       at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>       at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>       at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>       at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 23 more
> {code}
> OpenOffice extracts text successfully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
