[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann updated TIKA-1072:
------------------------------------
    Fix Version/s:     (was: 1.7)
                   1.8

- push to 1.8

> AIOOBE when handling embedded document in .doc file
> ---------------------------------------------------
>
>                 Key: TIKA-1072
>                 URL: https://issues.apache.org/jira/browse/TIKA-1072
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>             Fix For: 1.8
>
>         Attachments: 20-Force-on-a-current-S00.doc, Ole10NativeEntry.bin
>
>
> I have a Word (.doc) document that hits an exception when I run:
> {noformat}
> java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc
> {noformat}
> Here's the exception:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
>     at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
>     at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
>     at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
>     at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> {noformat}
> It happens when we try to parse an OLE10 embedded object ... the code
> that does this parsing captures and ignores Ole10NativeException and
> skips the entry ... so I'm wondering if we should also catch AIOOBE
> and skip the entry? Ie, maybe this entry really is not OLE10, and the
> Ole10Native code is failing to throw Ole10NativeException for it?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
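
For illustration only, the catch-and-skip idea raised in the report might look roughly like the sketch below. It does not use the real Tika or POI classes: `SkipMalformedEntries`, `EntryFormatException`, `parseEntry`, and `extractAll` are hypothetical names invented for this example, and `EntryFormatException` merely stands in for a checked "this is not a valid entry" exception such as the Ole10NativeException mentioned above. The parser deliberately omits a bounds check so that a truncated buffer raises ArrayIndexOutOfBoundsException, which the caller then treats the same way as the declared format exception.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipMalformedEntries {

    /** Hypothetical checked exception meaning "not a valid entry". */
    public static class EntryFormatException extends Exception {
        public EntryFormatException(String message) {
            super(message);
        }
    }

    /**
     * Hypothetical parser: reads a 2-byte little-endian length at offset 0
     * and returns that many payload bytes as a string.
     */
    public static String parseEntry(byte[] data) throws EntryFormatException {
        if (data.length == 0) {
            throw new EntryFormatException("empty entry");
        }
        // Deliberately no bounds check: a truncated buffer makes the array
        // accesses below throw ArrayIndexOutOfBoundsException.
        int length = (data[0] & 0xFF) | ((data[1] & 0xFF) << 8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.append((char) data[2 + i]);
        }
        return sb.toString();
    }

    /** Parses every entry, skipping those that fail in either way. */
    public static List<String> extractAll(List<byte[]> entries) {
        List<String> results = new ArrayList<>();
        for (byte[] entry : entries) {
            try {
                results.add(parseEntry(entry));
            } catch (EntryFormatException | ArrayIndexOutOfBoundsException e) {
                // Treat both as "not a parseable entry" and move on.
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<byte[]> entries = new ArrayList<>();
        entries.add(new byte[] {2, 0, 'o', 'k'}); // well-formed
        entries.add(new byte[] {});               // rejected by the parser
        entries.add(new byte[] {5, 0, 'x'});      // truncated payload
        System.out.println(extractAll(entries));
    }
}
```

Catching the runtime exception at the call site keeps extraction going for the rest of the document, but it can also hide a genuine parser bug; whether the better fix is in the caller or in having the parser throw its own format exception is the open question the report raises.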