Pascal Essiembre created TIKA-2530:
--------------------------------------

             Summary: OutlookExtractor "buffer underrun" when parsing .msg with embedded .msg
                 Key: TIKA-2530
                 URL: https://issues.apache.org/jira/browse/TIKA-2530
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.17, 1.16
         Environment: Reproduced with both Tika 1.16 and Tika 1.17 on Windows, 
but the problem likely affects all platforms.
            Reporter: Pascal Essiembre


When parsing certain .msg files containing certain attachments (e.g. other .msg 
files), I get this error:

{noformat}
...
Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
        at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:662)
        at org.apache.poi.hmef.CompressedRTF.decompress(CompressedRTF.java:73)
        at org.apache.poi.util.LZWDecompresser.decompress(LZWDecompresser.java:81)
        at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:42)
        at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:270)
...
{noformat}

I think the issue is that {{MAPIRtfAttribute}} fails when it receives an empty 
byte array from {{OutlookExtractor}}.  I was able to eliminate the error by 
changing the code at around line 269 of {{OutlookExtractor}} in Tika 1.16 (or 
around line 322 in Tika 1.17) as follows:

{code:java}
            //--- START FIX ---
            ByteChunk chunk = (ByteChunk) rtfChunk;
            if (chunk != null && chunk.getValue() != null 
                    && chunk.getValue().length > 0 && !doneBody) {
                //ByteChunk chunk = (ByteChunk) rtfChunk;
            //--- END FIX ---
{code}
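
To make the condition easier to check in isolation, here is a minimal standalone sketch of the same guard. The class and method names ({{RtfBodyGuard}}, {{hasUsableRtfBody}}) are hypothetical and are not part of Tika or POI; the byte array stands in for the result of {{chunk.getValue()}}. It only illustrates the check itself and says nothing about whether skipping the RTF body is the right behaviour.

```java
// Hypothetical standalone sketch: RtfBodyGuard and hasUsableRtfBody are
// illustrative names, not Tika or POI APIs.
class RtfBodyGuard {

    /**
     * Mirrors the condition in the proposed fix: only attempt RTF
     * decompression when the body has not been handled yet and the
     * chunk actually carries bytes.
     *
     * @param rtfBytes stand-in for chunk.getValue(); may be null or empty
     * @param doneBody true if the message body was already extracted
     */
    static boolean hasUsableRtfBody(byte[] rtfBytes, boolean doneBody) {
        return !doneBody && rtfBytes != null && rtfBytes.length > 0;
    }

    public static void main(String[] args) {
        System.out.println(hasUsableRtfBody(null, false));                    // false
        System.out.println(hasUsableRtfBody(new byte[0], false));             // false
        System.out.println(hasUsableRtfBody(new byte[] {1, 2, 3, 4}, false)); // true
        System.out.println(hasUsableRtfBody(new byte[] {1, 2, 3, 4}, true));  // false
    }
}
```

The empty-array case is the one that appears to trigger the exception above; the null check is only defensive.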

I am not sure whether this is a real fix, or whether more needs to be done 
beyond suppressing the error to make sure everything is extracted properly 
from all files.

I cannot share the sample file I tested with, since it was given to me as 
sensitive content, and I have not been able to recreate a faulty .msg file.

Thanks



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
