[ https://issues.apache.org/jira/browse/TIKA-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Adler updated TIKA-577:
------------------------------
    Attachment: X'd Out Doc for Tika.doc

Here's the Word document that causes the exception. After my hex-editing foray, one thing I noticed is that it contains some SharePoint info (it originally came from a SharePoint site).

> IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no pictures
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-577
>                 URL: https://issues.apache.org/jira/browse/TIKA-577
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>         Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
>            Reporter: Dennis Adler
>         Attachments: X'd Out Doc for Tika.doc
>
>
> When cracking a Word 03 document (which, unfortunately, I cannot upload -- it
> has client-confidential data), an index out of bounds exception occurs in the
> POI code used by the WordExtractor. To try to make up for the unavailable doc
> file, I've included the results of a couple of hours stepping through the
> code to find the failure point. The error occurs because point[0] = point[1]
> = 301, while the upper bound of _paragraphs is 301, so index 301 is one past
> the last valid entry. This is in the method
> org.apache.poi.hwpf.usermodel.Range.getCharacterRun(int).
> The method + line numbers are:
>   public CharacterRun getCharacterRun(int index)
>   line 792: int[] point = findRange(_paragraphs, _parStart,
>             Math.max(chpx.getStart(), _start), chpx.getEnd());
>   line 794: PAPX papx = _paragraphs.get(point[0]); // <<< This is the source of the exception
> STACK at time of exception:
>   Range.getCharacterRun(int) line 794
>   PicturesTable.getAllPictures() line 191
>   WordExtractor$PicturesSource.<init>(HWPFDocument) line 429
>   WordExtractor$PicturesSource.<init>(HWPFDocument, WordExtractor#1) line 419
>   WordExtractor.parse(POIFSFileSystem, XHTMLContentHandler) line 75
>   OfficeParser.parse(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext) line 187
>   DefaultParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext) line 197
>   AutoDetectParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext) line 197
>   AutoDetectParser.parse(InputStream, ContentHandler, Metadata, ParseContext) line 137
>   ... (my project) ...
> As noted, this occurs in a Word 2003 doc which has no pictures (the content
> is a table); 147 character runs (0 - 146) are found in the first pass. The
> problem occurs on the first pass (not sure if there will be others), in this
> code section from org.apache.poi.hwpf.model.PicturesTable.getAllPictures(),
> lines 186-191:
>   public List<Picture> getAllPictures() {
>     ArrayList<Picture> pictures = new ArrayList<Picture>();
>     Range range = _document.getOverallRange();
>     for (int i = 0; i < range.numCharacterRuns(); i++) {
>       CharacterRun run = range.getCharacterRun(i);
> The error occurs on getCharacterRun(i) when i = 146, which is the last run in
> the range. If I change point[0] to 300 (in getCharacterRun), the call returns
> nicely to WordExtractor$PicturesSource.<init>(HWPFDocument) line 429, setting
> the List all to an empty List. It fails again later, on a subsequent call to
> getAllPictures, with the same error.
> POTENTIAL FIX: if point[0] >= the length of _paragraphs (the index falls
> outside the list), then return an EMPTY CharacterRun for the paragraph in
> question.
> Cannot send repro document - contains confidential client data.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
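
The bounds check suggested under POTENTIAL FIX can be sketched roughly as below. This is only an illustration of the index arithmetic from the report, not POI's code: the class ParagraphLookupSketch, the method lookupParagraph, and the String stand-ins for PAPX entries are all hypothetical, and the list size of 301 is an assumption based on reading "upperbound of _paragraphs = 301" as the list length (index 300 worked in the debugger, index 301 did not).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the paragraph lookup done in Range.getCharacterRun(int).
// None of these names come from POI; they only model the indices from the report.
final class ParagraphLookupSketch {

    // Returns the entry at point[0], or null when that index is outside the list.
    // A caller could treat null as "no paragraph" and produce an empty run instead.
    static String lookupParagraph(List<String> paragraphs, int[] point) {
        int index = point[0];
        if (index < 0 || index >= paragraphs.size()) {
            return null;
        }
        return paragraphs.get(index);
    }

    public static void main(String[] args) {
        // Assumption: 301 entries, so the valid indices are 0 through 300.
        List<String> paragraphs = new ArrayList<>();
        for (int i = 0; i < 301; i++) {
            paragraphs.add("papx-" + i);
        }
        int[] point = {301, 301}; // values observed in the debugger

        // Unguarded access, as on line 794 of the report, throws.
        try {
            paragraphs.get(point[0]);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unguarded lookup failed: " + e.getMessage());
        }

        // Guarded access returns null for 301 and the last entry for 300.
        System.out.println("guarded, index 301: " + lookupParagraph(paragraphs, point));
        System.out.println("guarded, index 300: "
                + lookupParagraph(paragraphs, new int[] {300, 300}));
    }
}
```

Note the comparison is >= rather than >: with 301 entries, an index equal to the list length is already out of range, which matches the point[0] = 301 failure described above.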