Hello,
I'm digging into a possibly corrupt MS Word (.doc) document, under
https://issues.apache.org/jira/browse/TIKA-1072
POI is throwing an exception inside Ole10Native.java:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
I don't understand the \u0001Ole10Native entry format, so I wanted to
ask you all whether 1) this looks corrupt (i.e. a bad document), or 2) it's
possible POI is mis-parsing the bytes.
Here's a hex dump of the 40 bytes:
00000000 24 00 00 00 02 00 01 01 00 0a 01 12 83 46 02 86 |$............F..|
00000010 3d 12 83 49 12 83 6c 12 83 42 12 82 73 12 82 69 |=..I..l..B..s..i|
00000020 12 82 6e 02 84 71 00 00 |..n..q..|
00000028
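
For what it's worth, here is a small standalone sketch (plain JDK, no POI
calls) that decodes the first few fields of the dump above. It assumes the
first 4 bytes are a little-endian length prefix, which is only my guess at
the layout; the class and method names are made up for illustration. Under
that assumption the prefix is 0x24 = 36, and 36 + 4 equals the 40 bytes in
the stream, so the entry at least looks self-consistent. Any 2-byte read
starting at offset 40 would then run past the end of the array, which matches
the index in the exception.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of POI: decodes the leading fields of the dump.
class Ole10HeaderSketch {
    // The 40 bytes from the hex dump above.
    static final byte[] DUMP = hex(
        "24 00 00 00 02 00 01 01 00 0a 01 12 83 46 02 86 " +
        "3d 12 83 49 12 83 6c 12 83 42 12 82 73 12 82 69 " +
        "12 82 6e 02 84 71 00 00");

    // Turns a whitespace-separated hex string into bytes.
    static byte[] hex(String s) {
        String digits = s.replaceAll("\\s+", "");
        byte[] out = new byte[digits.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Reads a little-endian 32-bit int at the given offset.
    static int intAt(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(offset);
    }

    // Reads a little-endian unsigned 16-bit value at the given offset.
    static int shortAt(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getShort(offset) & 0xFFFF;
    }

    public static void main(String[] args) {
        int declared = intAt(DUMP, 0);  // assumed length prefix
        System.out.println("stream length       = " + DUMP.length);
        System.out.println("declared size       = " + declared);
        System.out.println("declared size + 4   = " + (declared + 4));
        System.out.println("short at offset 4   = " + shortAt(DUMP, 4));
        // A 2-byte read at offset DUMP.length would touch index 40, which is out of range.
        System.out.println("last valid index    = " + (DUMP.length - 1));
    }
}
```

If the length-prefix reading is right, the stream is not truncated, so the
question becomes whether the bytes after the prefix follow the layout POI
expects.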
Thanks,
Mike McCandless
http://blog.mikemccandless.com