Caleb Postlethwait created TIKA-3152:
----------------------------------------
Summary: Calling autoDetectParser.parse results in Unexpected
RuntimeException on .msg file with large attachment.
Key: TIKA-3152
URL: https://issues.apache.org/jira/browse/TIKA-3152
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.24
Environment: Running on ubuntu machines in AWS Cloud
Reporter: Caleb Postlethwait
When calling parse on a .msg file stream, I get a RuntimeException from Tika.
The .msg file contains a MOV attachment of approximately 22 MB. Unfortunately,
I can't share the file because it is client data. My QA team is trying to
reproduce the problem with another file but hasn't had any luck so far. I can
open the .msg file in Outlook, along with the attached MOV file, and both seem
fine. I'm including the stack trace, the code leading up to the parse call, and
the tika-config we're using.
Code Snippet:
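// config is the TikaConfig built from the tika-config shown further down
TikaConfig config = TikaConfigFactory.getTikaConfig();
Parser autoDetectParser = new AutoDetectParser(config);
ParseContext context = new ParseContext();
context.set(TikaConfig.class, config);
// input is the InputStream of the .msg file; handler and metadata are the
// usual ContentHandler and Metadata arguments
autoDetectParser.parse(input, handler, metadata, context);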
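One way to rule out a stream that arrives cut short before Tika sees it is to
spool it to a temporary file first and inspect the result. The following is a
minimal sketch using only the JDK, not what the code above currently does;
MsgStreamCheck and its method names are hypothetical and are not part of Tika
or POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

/** Hypothetical diagnostic helper; not part of Tika or POI. */
final class MsgStreamCheck {

    // Signature at the start of an OLE2 compound file, the container used by .msg.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Copies the whole stream to a temp file so its size can be inspected. */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("msg-check-", ".msg");
        // createTempFile already created the file, so it must be replaced.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Returns true if the file starts with the OLE2 signature. */
    static boolean hasOle2Header(Path file) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int n = 0;
            int r;
            // Loop because a single read() may return fewer bytes than requested.
            while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
                n += r;
            }
            return n == head.length && Arrays.equals(head, OLE2_MAGIC);
        }
    }
}
```

Files.size() on the spooled path can then be compared with the size of the
original .msg file; if the two differ, the bytes were lost before parsing. The
spooled path could also be handed to the parser instead of the raw stream, for
example through TikaInputStream.get(Path).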
Stacktrace:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.officepar...@bdef9db
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@bdef9db
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at com.stormed.processing.TikaInfor.getInfor(TikaInfor.java:102)
    at com.stormed.processing.AbstractFileProduct.addNatural_Metadata(AbstractFileProduct.java:114)
    at com.stormed.processing.ProcessingMain.processing(ProcessingMain.java:280)
    at com.stormed.processing.ProcessingMain.<init>(ProcessingMain.java:93)
    at com.stormed.processing.common.ProcessingBuilder.run(ProcessingBuilder.java:45)
    at com.stormed.proxy.AppRunner.run(AppRunner.java:21)
    at com.stormed.proxy.ProxyMain.runApp(ProxyMain.java:228)
    at com.stormed.proxy.ProxyMain.lambda$main$0(ProxyMain.java:120)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IndexOutOfBoundsException: Block 45824 not found
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:429)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.readCoreContents(POIFSFileSystem.java:362)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:316)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:123)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 14 more
Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 23462400 in stream of length 23462400
    at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read(ByteArrayBackedDataSource.java:47)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:427)
    ... 18 more
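The numbers in the two IndexOutOfBoundsException messages appear to line up.
Assuming the usual OLE2 layout of a 512-byte header followed by 512-byte
blocks, so that block n starts at byte offset (n + 1) * 512, block 45824 would
start at offset 23462400, which is exactly the reported stream length. If that
assumption holds, either the data handed to POI is at least one block shorter
than the file's own structures expect, or the file references a block past its
end. The sketch below only redoes that arithmetic; PoifsBlockMath and its
method names are hypothetical and are not POI API.

```java
/** Hypothetical helper that redoes the offset arithmetic; not POI API. */
final class PoifsBlockMath {

    /** Byte offset of a block, assuming the header occupies one block up front. */
    static long blockOffset(long blockIndex, int blockSize) {
        return (blockIndex + 1L) * blockSize;
    }

    /** Returns true if the whole block lies inside a stream of the given length. */
    static boolean blockFits(long blockIndex, int blockSize, long streamLength) {
        return blockOffset(blockIndex, blockSize) + blockSize <= streamLength;
    }

    public static void main(String[] args) {
        int blockSize = 512;            // from "Unable to read 512 bytes"
        long streamLength = 23462400L;  // from "stream of length 23462400"
        long block = 45824L;            // from "Block 45824 not found"

        System.out.println(blockOffset(block, blockSize));                 // 23462400
        System.out.println(blockFits(block, blockSize, streamLength));     // false
        System.out.println(blockFits(block - 1, blockSize, streamLength)); // true
    }
}
```

Under this reading, block 45823 is the last one that fits completely, and the
failing read is for the block immediately after the end of the available data.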
Tika Config:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<service-loader dynamic="false" loadErrorHandler="IGNORE" initializableProblemHandler="IGNORE"/>
<encodingDetectors>
<encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector"/>
<encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector"/>
<encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector"/>
</encodingDetectors>
<detectors>
<detector class="org.apache.tika.detect.OverrideDetector"/>
<detector class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
<detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
<detector class="org.gagravarr.tika.OggDetector"/>
<detector class="org.apache.tika.mime.MimeTypes"/>
</detectors>
<parsers>
<parser class="org.apache.tika.parser.apple.AppleSingleFileParser"/>
<parser class="org.apache.tika.parser.asm.ClassParser"/>
<parser class="org.apache.tika.parser.audio.AudioParser"/>
<parser class="org.apache.tika.parser.audio.MidiParser"/>
<parser class="org.apache.tika.parser.chm.ChmParser"/>
<parser class="org.apache.tika.parser.code.SourceCodeParser"/>
<parser class="org.apache.tika.parser.crypto.Pkcs7Parser"/>
<parser class="org.apache.tika.parser.crypto.TSDParser"/>
<parser class="org.apache.tika.parser.csv.TextAndCSVParser"/>
<parser class="org.apache.tika.parser.dbf.DBFParser"/>
<parser class="org.apache.tika.parser.dif.DIFParser"/>
<parser class="org.apache.tika.parser.dwg.DWGParser"/>
<parser class="org.apache.tika.parser.epub.EpubParser"/>
<parser class="org.apache.tika.parser.executable.ExecutableParser"/>
<parser class="org.apache.tika.parser.feed.FeedParser"/>
<parser class="org.apache.tika.parser.font.AdobeFontMetricParser"/>
<parser class="org.apache.tika.parser.font.TrueTypeParser"/>
<parser class="org.apache.tika.parser.gdal.GDALParser"/>
<parser class="org.apache.tika.parser.geoinfo.GeographicInformationParser"/>
<parser class="org.apache.tika.parser.grib.GribParser"/>
<parser class="org.apache.tika.parser.hdf.HDFParser"/>
<parser class="org.apache.tika.parser.html.HtmlParser"/>
<parser class="org.apache.tika.parser.hwp.HwpV5Parser"/>
<parser class="org.apache.tika.parser.image.BPGParser"/>
<parser class="org.apache.tika.parser.image.ICNSParser"/>
<parser class="org.apache.tika.parser.image.ImageParser"/>
<parser class="org.apache.tika.parser.image.PSDParser"/>
<parser class="org.apache.tika.parser.image.TiffParser"/>
<parser class="org.apache.tika.parser.image.WebPParser"/>
<parser class="org.apache.tika.parser.iptc.IptcAnpaParser"/>
<parser class="org.apache.tika.parser.isatab.ISArchiveParser"/>
<parser class="org.apache.tika.parser.iwork.IWorkPackageParser"/>
<parser class="org.apache.tika.parser.jdbc.SQLite3Parser"/>
<parser class="org.apache.tika.parser.jpeg.JpegParser"/>
<parser class="org.apache.tika.parser.mail.RFC822Parser"/>
<parser class="org.apache.tika.parser.mat.MatParser"/>
<parser class="org.apache.tika.parser.mbox.MboxParser"/>
<parser class="org.apache.tika.parser.mbox.OutlookPSTParser"/>
<parser class="org.apache.tika.parser.microsoft.EMFParser"/>
<parser class="org.apache.tika.parser.microsoft.JackcessParser"/>
<parser class="org.apache.tika.parser.microsoft.MSOwnerFileParser"/>
<parser class="org.apache.tika.parser.microsoft.OfficeParser"/>
<parser class="org.apache.tika.parser.microsoft.OldExcelParser"/>
<parser class="org.apache.tika.parser.microsoft.TNEFParser"/>
<parser class="org.apache.tika.parser.microsoft.WMFParser"/>
<parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<parser class="org.apache.tika.parser.microsoft.ooxml.xwpf.ml2006.Word2006MLParser"/>
<parser class="org.apache.tika.parser.microsoft.xml.SpreadsheetMLParser"/>
<parser class="org.apache.tika.parser.microsoft.xml.WordMLParser"/>
<parser class="org.apache.tika.parser.mp3.Mp3Parser"/>
<parser class="org.apache.tika.parser.mp4.MP4Parser"/>
<parser class="org.apache.tika.parser.netcdf.NetCDFParser"/>
<parser class="org.apache.tika.parser.odf.OpenDocumentParser"/>
<parser class="org.apache.tika.parser.pdf.PDFParser"/>
<parser class="org.apache.tika.parser.pkg.CompressorParser"/>
<parser class="org.apache.tika.parser.pkg.PackageParser"/>
<parser class="org.apache.tika.parser.pkg.RarParser"/>
<parser class="org.apache.tika.parser.rtf.RTFParser"/>
<parser class="org.apache.tika.parser.sas.SAS7BDATParser"/>
<parser class="org.apache.tika.parser.video.FLVParser"/>
<parser class="org.apache.tika.parser.wordperfect.QuattroProParser"/>
<parser class="org.apache.tika.parser.wordperfect.WordPerfectParser"/>
<parser class="org.apache.tika.parser.xliff.XLIFF12Parser"/>
<parser class="org.apache.tika.parser.xliff.XLZParser"/>
<parser class="org.apache.tika.parser.xml.DcXMLParser"/>
<parser class="org.apache.tika.parser.xml.FictionBookParser"/>
<parser class="org.gagravarr.tika.FlacParser"/>
<parser class="org.gagravarr.tika.OggParser"/>
<parser class="org.gagravarr.tika.OpusParser"/>
<parser class="org.gagravarr.tika.SpeexParser"/>
<parser class="org.gagravarr.tika.TheoraParser"/>
<parser class="org.gagravarr.tika.VorbisParser"/>
</parsers>
</properties>