[jira] Created: (TIKA-577) IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no pictures

Dennis Adler (JIRA) Tue, 21 Dec 2010 11:53:26 -0800

IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no 
pictures
----------------------------------------------------------------------------------


                 Key: TIKA-577
                 URL: https://issues.apache.org/jira/browse/TIKA-577
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.8
         Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
            Reporter: Dennis Adler


When cracking a Word 03 document (which, unfortunately, I cannot upload -- it 
has client-confidential data), an index out of bounds exception occurs in the 
POI code used by the WordExtractor. To try to make up for the unavailable doc 
file, I've included the results of a couple of hours stepping through the code 
to find the failure point. The error occurs because point[0] = point[1] = 30; 
upperbound of _paragraphs = 301. This is in the method 
org.apache.poi.hwpf.usermodel.Range.getCharacterRun() .

The method + line numbers are:

public CharacterRun getCharacterRun(int index)

line 792:       int[] point = findRange(_paragraphs, _parStart, 
Math.max(chpx.getStart(), _start), chpx.getEnd());
line 794:       PAPX papx = _paragraphs.get(point[0]);  // <<< This is the 
source of the exception

STACK at time of exception:

Range.getCharacterRun(int) line 794
PicturesTable.getAllPictures() line 191
WordExtractor$PicturesSource.<init>(HWPFDocument) line 429
WordExtractor$PicturesSource.<init>(HWPFDocument, WordExtractor#1) line 419
WordExtractor.parse(POIFSFileSystem, XHTMLContentHandler) line 75
OfficeParser.parse(InputStream, ContentHandler, Metadata, ParseContext) 
line 187
DefaultParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, 
ParseContext) line 197
AutoDetectParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, 
ParseContext) line 197
AutoDetectParser.parse(InputStream, ContentHandler, Metadata, ParseContext) 
line 137
... (my project) ...


As noted, this occurs in a Word 2003 doc that has no pictures (its content is a 
table). The first pass finds 147 character runs (0 - 146), and the problem 
occurs on that first pass (I am not sure whether there will be others), on the 
last run, in this code section from 
org.apache.poi.hwpf.model.PicturesTable.getAllPictures(), lines 186-191:


  public List<Picture> getAllPictures() {
    ArrayList<Picture> pictures = new ArrayList<Picture>();

    Range range = _document.getOverallRange();
    for (int i = 0; i < range.numCharacterRuns(); i++) {
        CharacterRun run = range.getCharacterRun(i);

The error occurs on getCharacterRun(146), which is the last run in the range. If 
I change point[0] to 300, the call returns nicely to 
WordExtractor$PicturesSource.<init>(HWPFDocument) line 429, setting <all> to an 
empty list. It then fails again with the same error on a subsequent call to 
getAllPictures.

POTENTIAL FIX: if point[0] >= _paragraphs.size(), then return an EMPTY 
CharacterRun for the paragraph in question.
Cannot send repro document - contains confidential client data.
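
To illustrate the kind of guard suggested above, here is a minimal standalone 
sketch. It does not use or patch POI: the class BoundsGuardSketch, the helper 
getOrNull, and the 301-entry list standing in for _paragraphs are all 
hypothetical, and the size of 301 assumes that "upperbound" above means the 
number of entries. A real fix inside Range.getCharacterRun() would have to 
decide what an "empty" CharacterRun looks like; here the out-of-range case 
simply yields null.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch; all names are hypothetical and nothing here is POI code.
final class BoundsGuardSketch {

    // Returns the element at index, or null when index is outside [0, size).
    static <T> T getOrNull(List<T> items, int index) {
        if (index < 0 || index >= items.size()) {
            return null;
        }
        return items.get(index);
    }

    public static void main(String[] args) {
        // Stand-in for _paragraphs: assumed 301 entries, valid indices 0..300.
        List<String> paragraphs = new ArrayList<String>();
        for (int i = 0; i < 301; i++) {
            paragraphs.add("papx-" + i);
        }
        // Index 300 is the last valid entry.
        System.out.println(getOrNull(paragraphs, 300));
        // Index 301 is past the end; the guard returns null instead of throwing.
        System.out.println(getOrNull(paragraphs, 301));
    }
}
```

A caller would then check for null (or an empty placeholder) and skip that 
run rather than letting the exception abort the whole parse.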

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
