Pascal Essiembre created TIKA-2530:
--------------------------------------

             Summary: OutlookExtractor "buffer underrun" when parsing .msg with embedded .msg
                 Key: TIKA-2530
                 URL: https://issues.apache.org/jira/browse/TIKA-2530
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.17, 1.16
         Environment: Reproduced with both Tika 1.16 and Tika 1.17 on Windows, but the problem likely affects all platforms.
            Reporter: Pascal Essiembre

When parsing certain .msg files that contain certain attachments (e.g. other .msg files), I get this error:

{noformat}
...
Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
    at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:662)
    at org.apache.poi.hmef.CompressedRTF.decompress(CompressedRTF.java:73)
    at org.apache.poi.util.LZWDecompresser.decompress(LZWDecompresser.java:81)
    at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:42)
    at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:270)
...
{noformat}

I think the issue is that {{MAPIRtfAttribute}} fails when it receives an empty byte array from {{OutlookExtractor}}. I was able to eliminate the error at around line 269 of {{OutlookExtractor}} in the Tika 1.16 code (or around line 322 in Tika 1.17) with the following change:

{code:java}
//--- START FIX ---
ByteChunk chunk = (ByteChunk) rtfChunk;
if (chunk != null && chunk.getValue() != null
        && chunk.getValue().length > 0 && !doneBody) {
    //ByteChunk chunk = (ByteChunk) rtfChunk;
//--- END FIX ---
{code}

I am not sure whether this is a real fix, or whether more needs to be done than just getting rid of the error, to make sure everything is extracted properly from all files.

I cannot share the sample file I used for testing, since it was given to me as sensitive content, and I could not recreate a faulty .msg file.

Thanks

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
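
For readers without a failing file, the guard described above can be illustrated in isolation. The following is only a minimal sketch and does not use POI or Tika: the class {{RtfChunkGuardSketch}} and its methods {{hasPayload}}, {{readHeader}} and {{readHeaderIfPresent}} are hypothetical names, and {{java.nio.ByteBuffer}} stands in for a decompressor that begins by reading a 4-byte little-endian integer, which is what the stack trace suggests happens in {{CompressedRTF.decompress}}.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.OptionalInt;

class RtfChunkGuardSketch {

    // Hypothetical helper mirroring the condition in the report:
    // the chunk value must be non-null and contain at least one byte.
    static boolean hasPayload(byte[] value) {
        return value != null && value.length > 0;
    }

    // Stand-in for a decompressor whose first step is reading a 4-byte
    // little-endian header. This is NOT POI code; ByteBuffer throws
    // BufferUnderflowException when fewer than 4 bytes remain.
    static int readHeader(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // Reads the header only when the guard passes; otherwise skips the chunk.
    static OptionalInt readHeaderIfPresent(byte[] value) {
        if (!hasPayload(value)) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(readHeader(value));
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        byte[] fourBytes = {0x2A, 0x00, 0x00, 0x00}; // 42 in little-endian

        try {
            readHeader(empty); // unguarded read on an empty array
            System.out.println("no exception");
        } catch (BufferUnderflowException e) {
            System.out.println("unguarded read failed on empty input");
        }

        // Guarded reads: the empty chunk is skipped, the 4-byte one is read.
        System.out.println(readHeaderIfPresent(empty).isPresent());
        System.out.println(readHeaderIfPresent(fourBytes).getAsInt());
    }
}
```

In this stand-in, an array of one to three bytes would pass the guard and still fail on the header read. Whether such chunks occur in real .msg files, and whether skipping an empty RTF chunk loses any body content, is not something this sketch can answer.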