https://bz.apache.org/bugzilla/show_bug.cgi?id=60374
Bug ID: 60374
Summary: Extracting text from some older Word documents fails with ArrayIndexOutOfBoundsException due to unicode/non-unicode mismatch
Product: POI
Version: 3.16-dev
Hardware: PC
Status: NEW
Severity: normal
Priority: P2
Component: HWPF
Assignee: [email protected]
Reporter: [email protected]
Target Milestone: ---
Created attachment 34447
--> https://bz.apache.org/bugzilla/attachment.cgi?id=34447&action=edit
Sample file
The regression testing at
http://people.apache.org/~centic/poi_regression/reportsAll/ shows the exception
below for some files.
It seems the text pieces in these files are stored as non-unicode, but the class
PieceDescriptor sets unicode = true. If I set unicode = false manually, then
extracting text works for these documents as well.
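To illustrate why such a mismatch could end in an ArrayIndexOutOfBoundsException, here is a minimal standalone sketch. It is not POI's actual code: the class PieceEncodingMismatch and the helper readPiece are hypothetical, and the rule of two bytes per character for unicode pieces and one byte otherwise is an assumption about how the byte length of a piece is derived. ISO-8859-1 stands in for whatever 8-bit codepage the document really uses.

```java
import java.nio.charset.StandardCharsets;

public class PieceEncodingMismatch {

    /**
     * Copies the bytes of one text piece out of the document stream.
     * Hypothetical helper, not part of POI: a piece flagged as unicode is
     * assumed to occupy two bytes per character, otherwise one byte.
     */
    static byte[] readPiece(byte[] stream, int offset, int charCount, boolean unicode) {
        int byteCount = unicode ? charCount * 2 : charCount;
        byte[] buf = new byte[byteCount];
        // Fails with an IndexOutOfBoundsException if byteCount runs past the stream end
        System.arraycopy(stream, offset, buf, 0, byteCount);
        return buf;
    }

    public static void main(String[] args) {
        // The text is stored as 8-bit characters, one byte each
        byte[] stream = "Hello, Word 6".getBytes(StandardCharsets.ISO_8859_1);
        int charCount = stream.length;

        // Correct flag: the piece is read in full
        byte[] ok = readPiece(stream, 0, charCount, false);
        System.out.println(new String(ok, StandardCharsets.ISO_8859_1));

        // Wrong flag: twice as many bytes are requested as the stream holds
        try {
            readPiece(stream, 0, charCount, true);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unicode = true overruns the stream: " + e);
        }
    }
}
```

With the flag set to false the same piece is read without error, which matches the observation that forcing unicode = false makes extraction work.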
public void testException() throws IOException, OpenXML4JException, XmlException {
    final POITextExtractor extractor = ExtractorFactory.createExtractor(
            POIDataSamples.getDocumentInstance().openResourceAsStream(
                    "cn.orthodox.www_divenbog_APRIL_30-APRIL.DOC"));
    // Check it gives text without error
    System.out.println(extractor.getText());
    extractor.close();
}
java.lang.IllegalArgumentException: Error creating Scratchpad Extractor
    at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:197)
    at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:119)
    at o.a.p.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:276)
    at o.a.p.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:129)
    at o.a.p.stress.AbstractFileHandler.handleExtractingInternal(AbstractFileHandler.java:81)
    at o.a.p.stress.AbstractFileHandler.handleExtracting(AbstractFileHandler.java:60)
    at org.dstadler.commoncrawl.FileHandlingRunnable.run(FileHandlingRunnable.java:62)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor4560.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at o.a.p.extractor.OLE2ExtractorFactory.createExtractor(OLE2ExtractorFactory.java:192)
    ... 12 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at o.a.p.hwpf.model.TextPieceTable.<init>(TextPieceTable.java:109)
    at o.a.p.hwpf.model.ComplexFileTable.<init>(ComplexFileTable.java:70)
    at o.a.p.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:68)
    at o.a.p.hwpf.extractor.Word6Extractor.<init>(Word6Extractor.java:74)
    at o.a.p.extractor.OLE2ScratchpadExtractorFactory.createExtractor(OLE2ScratchpadExtractorFactory.java:62)
    ... 16 more