[ 
https://issues.apache.org/jira/browse/TIKA-2765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724333#comment-16724333
 ] 

Tim Allison commented on TIKA-2765:
-----------------------------------

Another improvement would be to whitelist known default part names and dump 
those all to one "file" (i.e. not treat them as attachments) so that we're not 
getting 18 xml attachments...  Worth it?
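
To make the idea concrete, here is a minimal sketch of the whitelist split, assuming the part names are already available as strings. The class name, the `split` method and the contents of `DEFAULT_PARTS` are all hypothetical (the list just shows typical .docx part names, it is not taken from Tika):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PartNameWhitelist {
    // Illustrative whitelist only: typical parts of a .docx package,
    // not a list taken from Tika or POI.
    static final Set<String> DEFAULT_PARTS = new HashSet<>(Arrays.asList(
            "[Content_Types].xml",
            "_rels/.rels",
            "docProps/app.xml",
            "docProps/core.xml",
            "word/document.xml",
            "word/styles.xml",
            "word/settings.xml",
            "word/webSettings.xml",
            "word/fontTable.xml",
            "word/numbering.xml",
            "word/footnotes.xml",
            "word/endnotes.xml",
            "word/_rels/document.xml.rels"));

    // Result holder: parts to merge into one "file" vs. real attachments.
    static final class Split {
        final List<String> merged = new ArrayList<>();
        final List<String> attachments = new ArrayList<>();
    }

    static Split split(List<String> partNames) {
        Split result = new Split();
        for (String name : partNames) {
            // Accept both "/word/document.xml" and "word/document.xml".
            String normalized = name.startsWith("/") ? name.substring(1) : name;
            if (DEFAULT_PARTS.contains(normalized)) {
                result.merged.add(name);
            } else {
                result.attachments.add(name);
            }
        }
        return result;
    }
}
```

Anything not on the list (embedded media, OLE objects, unknown parts) would still come through as an attachment, so nothing is silently dropped.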

> Regression extracting text from corrupted docx files
> ----------------------------------------------------
>
>                 Key: TIKA-2765
>                 URL: https://issues.apache.org/jira/browse/TIKA-2765
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.19.1
>            Reporter: Luis Filipe Nassif
>            Priority: Minor
>         Attachments: DX IMPORTADORA  E  EXPORTADORA  LTDA.docx, 
> TIKA-2765.patch
>
>
> Tika 1.19.1 throws the following exception on some corrupt docx files (MS 
> Word complains but repairs them) that Tika 1.18 handled without problems. 
> Stack trace below:
> {code:java}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@79efa1ad
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
> at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
> at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
> at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
> at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
> at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
> at javax.swing.TransferHandler.importData(Unknown Source)
> at javax.swing.TransferHandler$DropHandler.drop(Unknown Source)
> at java.awt.dnd.DropTarget.drop(Unknown Source)
> at javax.swing.TransferHandler$SwingDropTarget.drop(Unknown Source)
> at sun.awt.dnd.SunDropTargetContextPeer.processDropMessage(Unknown Source)
> at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchDropEvent(Unknown Source)
> at sun.awt.dnd.SunDropTargetContextPeer$EventDispatcher.dispatchEvent(Unknown Source)
> at sun.awt.dnd.SunDropTargetEvent.dispatch(Unknown Source)
> at java.awt.Component.dispatchEventImpl(Unknown Source)
> at java.awt.Container.dispatchEventImpl(Unknown Source)
> at java.awt.Component.dispatchEvent(Unknown Source)
> at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
> at java.awt.LightweightDispatcher.processDropTargetEvent(Unknown Source)
> at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
> at java.awt.Container.dispatchEventImpl(Unknown Source)
> at java.awt.Window.dispatchEventImpl(Unknown Source)
> at java.awt.Component.dispatchEvent(Unknown Source)
> at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
> at java.awt.EventQueue.access$500(Unknown Source)
> at java.awt.EventQueue$3.run(Unknown Source)
> at java.awt.EventQueue$3.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
> at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
> at java.awt.EventQueue$4.run(Unknown Source)
> at java.awt.EventQueue$4.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(Unknown Source)
> at java.awt.EventQueue.dispatchEvent(Unknown Source)
> at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
> at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
> at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
> at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
> at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
> at java.awt.EventDispatchThread.run(Unknown Source)
> Caused by: org.apache.poi.openxml4j.exceptions.InvalidOperationException: Could not open the specified zip entry source stream
> at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:214)
> at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:196)
> at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:170)
> at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:151)
> at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:123)
> at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:234)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:81)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> ... 43 more
> Caused by: java.io.EOFException
> at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:803)
> at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:795)
> at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.skipRemainderOfArchive(ZipArchiveInputStream.java:1014)
> at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:257)
> at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:139)
> at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:47)
> at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:212)
> ... 51 more{code}
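
The trace bottoms out in an EOFException while streaming the zip, which suggests a truncated archive. As a rough illustration of salvaging whatever can still be read from such a file, here is a sketch that streams entries and stops at the first read error. It uses the JDK's `java.util.zip.ZipInputStream` rather than the commons-compress and POI classes in the trace; `TruncatedZipSalvager` and `salvage` are hypothetical names, and this is not the patch attached to the issue:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class TruncatedZipSalvager {
    // Streams the zip front to back and keeps every entry that could be
    // read completely. Reading stops at the first IOException (including
    // EOFException and ZipException) instead of failing the whole parse.
    static Map<String, byte[]> salvage(InputStream in) {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        byte[] buf = new byte[8192];
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                // Only recorded once the entry was read to its end, so a
                // half-read entry is never returned.
                entries.put(entry.getName(), out.toByteArray());
            }
        } catch (IOException e) {
            // Truncated or corrupt archive: keep what was salvaged so far.
        }
        return entries;
    }
}
```

The design choice here is to drop a partially read entry rather than return partial bytes; a caller that prefers partial text over nothing could keep the bytes read before the exception instead.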



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
