IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no pictures
----------------------------------------------------------------------------------
Key: TIKA-577
URL: https://issues.apache.org/jira/browse/TIKA-577
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.8
Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
Reporter: Dennis Adler

When cracking a Word 03 document (which, unfortunately, I cannot upload because it has client-confidential data), an index out of bounds exception occurs in the POI code used by the WordExtractor. To try to make up for the unavailable doc file, I've included the results of a couple of hours stepping through the code to find the failure point.

The error occurs because point[0] = point[1] = 30; upperbound of _paragraphs = 301.

This is in the method org.apache.poi.hwpf.usermodel.Range.getCharacterRun(). The method + line numbers are:

    public CharacterRun getCharacterRun(int index)
    line 792: int[] point = findRange(_paragraphs, _parStart, Math.max(chpx.getStart(), _start), chpx.getEnd());
    line 794: PAPX papx = _paragraphs.get(point[0]); // <<< This is the source of the exception

STACK at time of exception:

    Range.getCharacterRun(int) line 794
    PicturesTable.getAllPictures() line 191
    WordExtractor$PicturesSource.<init>(HWPFDocument) line 429
    WordExtractor$PicturesSource.<init>(HWPFDocument, WordExtractor#1) line 419
    WordExtractor.parse(POIFSFileSystem, XHTMLContentHandler) line 75
    OfficeParser.parse(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext) line 187
    DefaultParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext) line 197
    AutoDetectParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, ParseContext) line 197
    AutoDetectParser.parse(InputStream, ContentHandler, Metadata, ParseContext) line 137
    ... (my project) ...

As noted, this occurs in a Word 2003 doc which has no pictures (it is a table); 147 character runs (0 - 146) were found in the first pass. The problem occurs on the first pass (not sure if there will be others) on this run.
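
The failing lookup described above comes down to calling get() on a list with an index at or past the list's size. The following is only a minimal sketch of that failure mode and of a defensive bounds check. It does not use POI: a plain List<String> stands in for _paragraphs, and the names BoundsGuardSketch and safeGet are hypothetical, chosen for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration class; not part of POI or Tika.
final class BoundsGuardSketch {

    /**
     * Returns the element at the given index, or an empty Optional when
     * the index falls outside the list. Assumed helper for illustration.
     */
    static <T> Optional<T> safeGet(List<T> items, int index) {
        if (index < 0 || index >= items.size()) {
            return Optional.empty();
        }
        return Optional.ofNullable(items.get(index));
    }

    public static void main(String[] args) {
        // Stand-in for _paragraphs; the real list holds PAPX objects.
        List<String> paragraphs = Arrays.asList("papx0", "papx1", "papx2");

        // An index equal to the list size is one past the last valid index.
        int outOfRange = paragraphs.size();

        try {
            paragraphs.get(outOfRange); // unguarded lookup, as on line 794
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unguarded lookup failed");
        }

        // The guarded lookup reports "nothing found" instead of throwing.
        System.out.println("guarded lookup present: "
                + safeGet(paragraphs, outOfRange).isPresent());
    }
}
```

With a guard of this kind the caller has to decide what an empty result means; whether skipping the run is the right behaviour for POI is a separate question that this sketch does not answer.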
The last run is reached in this code section from org.apache.poi.hwpf.model.PicturesTable.getAllPictures(), lines 186-191:

    public List<Picture> getAllPictures() {
        ArrayList<Picture> pictures = new ArrayList<Picture>();
        Range range = _document.getOverallRange();
        for (int i = 0; i < range.numCharacterRuns(); i++) {
            CharacterRun run = range.getCharacterRun(i);

The error occurs on getCharacterRun(146), which is the last run in the range. If I change point[0] to 300, the call returns nicely to WordExtractor$PicturesSource.<init>(HWPFDocument) line 429, setting <all> to an empty list. It fails again later, on a subsequent call to getAllPictures, with the same error.

POTENTIAL FIX: if point[0] > papx.Length then return an EMPTY CharacterRun for the paragraph in question.

Cannot send repro document - contains confidential client data.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.