[ https://issues.apache.org/jira/browse/TIKA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570305#comment-13570305 ]
Michael McCandless commented on TIKA-1072:
------------------------------------------

Thanks Nick, I'll try asking on dev@poi.

I'll open a separate issue about continuing parsing even when an embedded doc hits an exception ...

> AIOOBE when handling embedded document in .doc file
> ---------------------------------------------------
>
>                 Key: TIKA-1072
>                 URL: https://issues.apache.org/jira/browse/TIKA-1072
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 1.4
>
>         Attachments: 20-Force-on-a-current-S00.doc
>
>
> I have a Word (.doc) document that hits an exception when I run:
> {noformat}
> java -jar tika-app/target/tika-app-1.4-SNAPSHOT.jar /x/tmp/20-Force-on-a-current-S00.doc
> {noformat}
> Here's the exception:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 40
>         at org.apache.poi.util.LittleEndian.getShort(LittleEndian.java:225)
>         at org.apache.poi.poifs.filesystem.Ole10Native.<init>(Ole10Native.java:139)
>         at org.apache.poi.poifs.filesystem.Ole10Native.createFromEmbeddedOleObject(Ole10Native.java:89)
>         at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedOfficeDoc(AbstractPOIFSExtractor.java:149)
>         at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:135)
>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:186)
>         at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> {noformat}
> It happens when we try to parse an OLE10 embedded object ... the code
> that does this parsing captures and ignores Ole10NativeException and
> skips the entry ... so I'm wondering if we should also catch AIOOBE
> and skip the entry? Ie, maybe this entry really is not OLE10, and the
> Ole10Native code is failing to throw Ole10NativeException for it?
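
For illustration only, the "also catch AIOOBE and skip the entry" idea from the issue description could look roughly like the sketch below. This is not the actual Tika or POI code: FakeOle10NativeException, parseEntry and extractAll are hypothetical stand-ins, invented so the example compiles without POI on the classpath. The toy parser deliberately has a bounds bug so that a too-short entry raises ArrayIndexOutOfBoundsException instead of the declared exception, which is the situation the issue suspects in Ole10Native.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for POI's Ole10NativeException (assumption, not the real class).
class FakeOle10NativeException extends Exception {
    FakeOle10NativeException(String msg) { super(msg); }
}

class EmbeddedEntrySkipSketch {

    // Hypothetical stand-in for the OLE10 parsing step: a 2-byte little-endian
    // length followed by that many payload bytes.
    static byte[] parseEntry(byte[] data) throws FakeOle10NativeException {
        if (data.length == 0) {
            throw new FakeOle10NativeException("empty entry");
        }
        // Deliberate bug for the demo: a 1-byte entry makes data[1] throw
        // ArrayIndexOutOfBoundsException rather than the declared exception.
        int len = (data[0] & 0xFF) | ((data[1] & 0xFF) << 8);
        if (2 + len > data.length) {
            throw new FakeOle10NativeException("truncated payload");
        }
        return Arrays.copyOfRange(data, 2, 2 + len);
    }

    // Parses every entry, skipping any entry that fails, so one bad embedded
    // object does not abort the whole document.
    static List<byte[]> extractAll(List<byte[]> entries) {
        List<byte[]> payloads = new ArrayList<>();
        for (byte[] entry : entries) {
            try {
                payloads.add(parseEntry(entry));
            } catch (FakeOle10NativeException e) {
                // Already handled today per the issue: not a valid OLE10 object, skip it.
            } catch (ArrayIndexOutOfBoundsException e) {
                // The proposed addition: treat the bounds failure the same way and skip.
            }
        }
        return payloads;
    }

    public static void main(String[] args) {
        List<byte[]> entries = Arrays.asList(
                new byte[] {2, 0, 'h', 'i'},  // well-formed
                new byte[] {5},               // too short: triggers the AIOOBE path
                new byte[0],                  // triggers the declared exception
                new byte[] {1, 0, 'x'});      // well-formed
        System.out.println("parsed entries: " + extractAll(entries).size());
    }
}
```

Catching the unchecked exception in the caller only works around the symptom. If the entry really is not OLE10, the cleaner fix would presumably be for the parsing code itself to report that via its declared exception, which is the question being taken to dev@poi.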