[
https://issues.apache.org/jira/browse/TIKA-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662467#comment-16662467
]
Nick Burch commented on TIKA-2765:
----------------------------------
Oracle hid all the useful Zip security stuff in recent Java releases, without
apparent replacement, so we had to switch from `java.util.zip` to Commons
Compress to be able to guard against (deliberate or corruption-induced) zip
bombs and the like.

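To make the zip-bomb concern concrete, here is a minimal sketch of one common guard: cap the total number of bytes inflated from an archive. It uses plain `java.util.zip.ZipInputStream` for brevity rather than Commons Compress, and it is not POI's actual implementation. The class name, the `inflateWithLimit` and `zipOfZeros` helpers, and the 1 MiB limit are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipBombGuardSketch {

    // Hypothetical limit; choose a value that suits the documents you expect.
    static final long MAX_TOTAL_UNCOMPRESSED_BYTES = 1024L * 1024L;

    // Inflates every entry and returns the total number of inflated bytes.
    // Throws IOException as soon as the running total passes the limit.
    static long inflateWithLimit(InputStream in, long limit) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (ZipInputStream zis = new ZipInputStream(in)) {
            while (zis.getNextEntry() != null) {
                int n;
                while ((n = zis.read(buf)) != -1) {
                    total += n;
                    if (total > limit) {
                        throw new IOException(
                                "Inflated size exceeds limit of " + limit + " bytes");
                    }
                }
            }
        }
        return total;
    }

    // Builds an in-memory zip holding one entry of `size` zero bytes,
    // which compresses extremely well (a tiny stand-in for a zip bomb).
    static byte[] zipOfZeros(int size) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("zeros.bin"));
            zos.write(new byte[size]);
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] small = zipOfZeros(1000);
        System.out.println("small archive inflates to "
                + inflateWithLimit(new ByteArrayInputStream(small),
                        MAX_TOTAL_UNCOMPRESSED_BYTES) + " bytes");

        byte[] big = zipOfZeros(4 * 1024 * 1024);
        try {
            inflateWithLimit(new ByteArrayInputStream(big), MAX_TOTAL_UNCOMPRESSED_BYTES);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The check counts bytes as they are actually inflated instead of trusting the sizes declared in the archive, since a hostile or corrupt file can declare anything.
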
I'm guessing that the old JDK stuff used to skip over any incomplete zip
entries towards the end of the file, and Commons Compress is now more
explicitly flagging up the issue.

Either way, this probably needs handling at the Apache POI level, so the bug
should be shuffled over there!

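As an illustration of the lenient behaviour guessed at above, here is a minimal sketch that keeps every entry read completely before a truncated archive runs out and drops the incomplete tail. It uses `java.util.zip.ZipInputStream` only, is not how POI, Tika, or Commons Compress handle the problem, and every name in it (`LenientZipReadSketch`, `readWhatWeCan`, `twoEntryZip`) is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class LenientZipReadSketch {

    // Returns the entries that could be read in full, in archive order.
    // Reading stops at the first I/O error (EOFException and ZipException
    // are both IOExceptions); the entry being read at that point is dropped.
    static Map<String, byte[]> readWhatWeCan(InputStream in) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream data = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(buf)) != -1) {
                    data.write(buf, 0, n);
                }
                // Only stored once the whole entry has been inflated.
                entries.put(entry.getName(), data.toByteArray());
            }
        } catch (IOException e) {
            System.err.println("Stopped early, keeping " + entries.size()
                    + " complete entries: " + e);
        }
        return entries;
    }

    // Builds a zip with a short text entry followed by 10,000 bytes of
    // seeded pseudo-random (poorly compressible) data.
    static byte[] twoEntryZip() throws IOException {
        byte[] noise = new byte[10_000];
        new Random(42).nextBytes(noise);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("first.txt"));
            zos.write("hello".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("second.bin"));
            zos.write(noise);
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] full = twoEntryZip();
        // Simulate corruption: chop off the central directory and part of
        // the second entry's compressed data.
        byte[] truncated = Arrays.copyOf(full, full.length - 5000);
        System.out.println(readWhatWeCan(new ByteArrayInputStream(truncated)).keySet());
    }
}
```

Whether silently accepting a partial archive is the right default is a separate question: a caller that does this still has to cope with parts of the document being missing.
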
> Regression extracting text from corrupted docx files
> ----------------------------------------------------
>
> Key: TIKA-2765
> URL: https://issues.apache.org/jira/browse/TIKA-2765
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.19.1
> Reporter: Luis Filipe Nassif
> Priority: Minor
> Attachments: DX IMPORTADORA E EXPORTADORA LTDA.docx
>
>
> Tika 1.19.1 throws the following exception with some corrupt docx files (MS
> Word complains but repairs them) that Tika 1.18 previously handled without
> problems. Stack trace below:
> {code:java}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@79efa1ad
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
> at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
> at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
> at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
> at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
> at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
> at javax.swing.TransferHandler.importData(Unknown Source)
> at javax.swing.TransferHandler$DropHandler.drop(Unknown Source)
> at java.awt.dnd.DropTarget.drop(Unknown Source)
> at javax.swing.TransferHandler$SwingDropTarget.drop(Unknown Source)
> at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(Unknown Source)
> at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(Unknown Source)
> at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(Unknown Source)
> at sun.awt.dnd.SunDropTargetEvent.dispatch(Unknown Source)
> at java.awt.Component.dispatchEventImpl(Unknown Source)
> at java.awt.Container.dispatchEventImpl(Unknown Source)
> at java.awt.Component.dispatchEvent(Unknown Source)
> at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
> at java.awt.LightweightDispatcher.processDropTargetEvent(Unknown Source)
> at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
> at java.awt.Container.dispatchEventImpl(Unknown Source)
> at java.awt.Window.dispatchEventImpl(Unknown Source)
> at java.awt.Component.dispatchEvent(Unknown Source)
> at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
> at java.awt.EventQueue.access$500(Unknown Source)
> at java.awt.EventQueue$3.run(Unknown Source)
> at java.awt.EventQueue$3.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
> at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
> at java.awt.EventQueue$4.run(Unknown Source)
> at java.awt.EventQueue$4.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
> at java.awt.EventQueue.dispatchEvent(Unknown Source)
> at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
> at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
> at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
> at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
> at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
> at java.awt.EventDispatchThread.run(Unknown Source)
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: Could not open the specified zip entry source stream
> at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:214)
> at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:196)
> at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:170)
> at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:151)
> at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:123)
> at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:234)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:81)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 43 more
> Caused by: java.io.EOFException
> at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:803)
> at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:795)
> at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.skipRemainderOfArchive(ZipArchiveInputStream.java:1014)
> at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:257)
> at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:139)
> at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:47)
> at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:212)
> ... 51 more
> {code}