https://bz.apache.org/bugzilla/show_bug.cgi?id=62886
Bug ID: 62886
Summary: Regression extracting text from corrupted docx files
Product: POI
Version: 4.0.0-FINAL
Hardware: PC
Status: NEW
Severity: regression
Priority: P2
Component: OPC
Assignee: [email protected]
Reporter: [email protected]
Target Milestone: ---
Created attachment 36245
--> https://bz.apache.org/bugzilla/attachment.cgi?id=36245&action=edit
Example file
While testing Tika-1.19.1, POI throws the following exception on some corrupt
docx files (MS Word complains but repairs them) that POI-3.17 previously
handled without problems. See TIKA-2765 for more info. Stack trace below:
org.apache.poi.openxml4j.exceptions.InvalidOperationException: Could not open the specified zip entry source stream
    at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:214)
    at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:196)
    at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:170)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:151)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:123)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:234)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:81)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 43 more
Caused by: java.io.EOFException
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:803)
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.readFully(ZipArchiveInputStream.java:795)
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.skipRemainderOfArchive(ZipArchiveInputStream.java:1014)
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.getNextZipEntry(ZipArchiveInputStream.java:257)
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.getNextEntry(ZipArchiveThresholdInputStream.java:139)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:47)
    at org.apache.poi.openxml4j.opc.ZipPackage.openZipEntrySourceStream(ZipPackage.java:212)
    ... 51 more
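
The EOFException above is raised in skipRemainderOfArchive, which suggests
(this is an assumption, not something verified against the attached file) that
the entries themselves were read and only the tail of the archive is damaged.
The sketch below illustrates that idea with plain java.util.zip, not with POI
or commons-compress: it streams entries and keeps whatever was fully read
before an error. The class LenientZipScan and its methods are hypothetical
names made up for this illustration, and the entry names only mimic a docx
layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; not part of POI or Tika.
final class LenientZipScan {

    /** Builds a small two-entry zip in memory (names mimic a docx). */
    static byte[] sampleZip() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
                zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
                zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
                zip.putNextEntry(new ZipEntry("word/document.xml"));
                zip.write("<document/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns a copy of data with the last dropFromEnd bytes removed. */
    static byte[] truncate(byte[] data, int dropFromEnd) {
        return Arrays.copyOf(data, data.length - dropFromEnd);
    }

    /** Streams the archive and returns the names of entries read in full. */
    static List<String> readableEntries(byte[] data) {
        List<String> names = new ArrayList<>();
        byte[] buffer = new byte[4096];
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Drain the entry so a truncated body fails here,
                // before its name is recorded.
                while (zip.read(buffer) != -1) {
                    // discard the content; only readability matters here
                }
                names.add(entry.getName());
            }
        } catch (IOException e) {
            // EOFException and ZipException both land here;
            // keep the entries that were read before the failure.
        }
        return names;
    }
}
```

With the sample archive, readableEntries(truncate(sampleZip(), 10)) still lists
both entries, because the cut falls in the trailing end-of-archive record that
a streaming reader never needs. Cutting inside the first entry's data instead
yields an empty list.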