[jira] Updated: (TIKA-577) IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no pictures

Dennis Adler (JIRA) Tue, 18 Jan 2011 18:23:10 -0800

     [ 
https://issues.apache.org/jira/browse/TIKA-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dennis Adler updated TIKA-577:
------------------------------

    Attachment: X'd Out Doc for Tika.doc

Here's the Word document that causes the exception. One thing I noticed while 
hex-editing it is that it contains some SharePoint info (it originally came 
from a SharePoint site).

> IndexOutOfBounds Exception looking for Picture in Word 03 doc that has no 
> pictures
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-577
>                 URL: https://issues.apache.org/jira/browse/TIKA-577
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>         Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
>            Reporter: Dennis Adler
>         Attachments: X'd Out Doc for Tika.doc
>
>
> When parsing a Word 03 document (which, unfortunately, I cannot upload -- it 
> has client-confidential data), an IndexOutOfBoundsException occurs in the 
> POI code used by the WordExtractor. To try to make up for the unavailable doc 
> file, I've included the results of a couple of hours of stepping through the 
> code to find the failure point. The error occurs because point[0] = point[1] 
> = 301, while _paragraphs holds only 301 entries (valid indices 0 - 300). This 
> is in the method org.apache.poi.hwpf.usermodel.Range.getCharacterRun().
> The method + line numbers are:
> public CharacterRun getCharacterRun(int index)
> line 792:     int[] point = findRange(_paragraphs, _parStart, 
> Math.max(chpx.getStart(), _start), chpx.getEnd());
> line 794:     PAPX papx = _paragraphs.get(point[0]);  // <<< This is the 
> source of the exception
> STACK at time of exception:
> Range.getCharacterRun(int) line 794
> PicturesTable.getAllPictures() line 191
> WordExtractor$PicturesSource.<init>(HWPFDocument) line 429
> WordExtractor$PicturesSource.<init>(HWPFDocument, WordExtractor$1) line 419
> WordExtractor.parse(POIFSFileSystem, XHTMLContentHandler) line 75
> OfficeParser.parse(InputStream, ContentHandler, Metadata, ParseContext) 
> line 187
> DefaultParser(CompositeParser).parse(InputStream, ContentHandler, Metadata, 
> ParseContext) line 197
> AutoDetectParser(CompositeParser).parse(InputStream, ContentHandler, 
> Metadata, ParseContext) line 197
> AutoDetectParser.parse(InputStream, ContentHandler, Metadata, ParseContext) 
> line 137
> ... (my project) ...
> As noted, this occurs in a Word 2003 doc which has no pictures (its content 
> is a table). 147 character runs (0 - 146) are found on the first pass. The 
> problem occurs on the first pass (not sure if there will be others), on the 
> last run, in this code section from 
> org.apache.poi.hwpf.model.PicturesTable.getAllPictures(), lines 186-191:
>   public List<Picture> getAllPictures() {
>     ArrayList<Picture> pictures = new ArrayList<Picture>();
>     Range range = _document.getOverallRange();
>     for (int i = 0; i < range.numCharacterRuns(); i++) {
>       CharacterRun run = range.getCharacterRun(i);
> The error occurs on getCharacterRun(i) when i = 146, which is the last run in 
> the range. If I change point[0] to 300 (in getCharacterRun), the call returns 
> normally to WordExtractor$PicturesSource.<init>(HWPFDocument) line 429, 
> setting the List "all" to an empty List. It fails again later, on a 
> subsequent call to getAllPictures, with the same error.
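
One way to picture a caller-side workaround for the loop quoted above is to skip any run whose lookup throws, rather than letting the whole scan fail. The sketch below is only an illustration under stated assumptions: TolerantRunScan, collectRuns and the fetch callback are hypothetical names, the callback stands in for range.getCharacterRun(i), and none of this is POI or Tika code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class TolerantRunScan {
    // Hypothetical helper: calls fetch for every index in [0, count) and
    // keeps the results, skipping any index whose lookup is out of range.
    static <T> List<T> collectRuns(int count, IntFunction<T> fetch) {
        List<T> runs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            try {
                runs.add(fetch.apply(i));
            } catch (IndexOutOfBoundsException e) {
                // Assumption: a run that cannot be resolved is simply skipped.
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Simulate the reported case: 147 runs, only the last one (146) fails.
        List<Integer> runs = collectRuns(147, i -> {
            if (i == 146) {
                throw new IndexOutOfBoundsException("Index: 301, Size: 301");
            }
            return i;
        });
        System.out.println(runs.size()); // prints 146
    }
}
```

This hides the symptom rather than fixing the range calculation, so it would only make sense as a stopgap.
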
> POTENTIAL FIX: if point[0] >= _paragraphs.size() then return an EMPTY 
> CharacterRun for the paragraph in question.
> Cannot send repro document - contains confidential client data.
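
The guard suggested under POTENTIAL FIX can be sketched in isolation. The example below assumes _paragraphs behaves like a java.util.List with 301 entries and uses strings in place of PAPX objects; GuardedLookup and getOrDefault are hypothetical names, not part of POI. Note that the check has to cover the case where the index equals the size, since that is the value (301) reported in the failing call.

```java
import java.util.ArrayList;
import java.util.List;

public class GuardedLookup {
    // Hypothetical helper: returns the element at index, or fallback when
    // index falls outside 0..size-1 (this includes index == size).
    static <T> T getOrDefault(List<T> items, int index, T fallback) {
        if (index < 0 || index >= items.size()) {
            return fallback;
        }
        return items.get(index);
    }

    public static void main(String[] args) {
        // Stand-in for _paragraphs: 301 entries, valid indices 0..300.
        List<String> paragraphs = new ArrayList<>();
        for (int i = 0; i < 301; i++) {
            paragraphs.add("papx-" + i);
        }
        System.out.println(getOrDefault(paragraphs, 300, "EMPTY")); // prints papx-300
        System.out.println(getOrDefault(paragraphs, 301, "EMPTY")); // prints EMPTY
    }
}
```

Whether an empty CharacterRun is the right fallback inside POI is a separate question; the sketch only shows the bounds check itself.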

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
