https://issues.apache.org/bugzilla/show_bug.cgi?id=52863

             Bug #: 52863
           Summary: java.lang.ArrayIndexOutOfBoundsException in
                    org.apache.poi.hwpf.sprm.SprmOperation.initSize
           Product: POI
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: blocker
          Priority: P2
         Component: HWPF
        AssignedTo: [email protected]
        ReportedBy: [email protected]
    Classification: Unclassified


1. When I convert a batch of Microsoft Word documents with the command

    java -jar tika-app-1.1-SNAPSHOT.jar -v -t

I get the following exception. The same happens with the Tika 1.1 release candidate.

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5d3ac0
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
    at org.apache.tika.cli.TikaCLI$TikaServer$1.run(TikaCLI.java:735)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 487
    at org.apache.poi.hwpf.sprm.SprmOperation.initSize(SprmOperation.java:174)
    at org.apache.poi.hwpf.sprm.SprmOperation.<init>(SprmOperation.java:80)
    at org.apache.poi.hwpf.sprm.SprmIterator.next(SprmIterator.java:48)
    at org.apache.poi.hwpf.sprm.ParagraphSprmUncompressor.uncompressPAP(ParagraphSprmUncompressor.java:67)
    at org.apache.poi.hwpf.usermodel.Paragraph.newParagraph(Paragraph.java:103)
    at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:943)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:146)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:97)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 4 more

A user, Nick Burch, has advised me to raise this as a POI bug.

2. Here's the output of the BFF Validator tool:

<BFFValidation path="failing.doc" datetime="03/08/12 07:14:27" result="FAILED">
<ParseStack>
<Type builtinType="Docfile" docName="MS-DOC" sectionTitle="File Structure"
msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737">
<Info>Built-in type "Docfile": The root storage object of an OLE compound file.
For more information, see
http://msdn.microsoft.com/en-us/library/dd942138.aspx.</Info>
</Type>
<Type builtinType="Stream" docName="MS-DOC" sectionTitle="File Structure"
msdnLink="http://msdn.microsoft.com/en-us/library/4eaddc8f-4abd-43bb-8fd4-aef9c6121737"
streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0">
<Info>Built-in type "Stream": Any stream object for OLE compound files. The
entire file contents for other files.</Info>
</Type>
<Type docName="MS-DOC" sectionTitle="Fib" sectionNumber="2.5.1"
msdnLink="http://msdn.microsoft.com/en-us/library/9AEAA2E7-4A45-468E-AB13-3F6193EB9394"
streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
<Type docName="MS-DOC" sectionTitle="FibBase" sectionNumber="2.5.2"
msdnLink="http://msdn.microsoft.com/en-us/library/26FB6C06-4E5C-4778-AB4E-EDBF26A545BB"
streamName="WordDocument" streamOffset="0" hexStreamOffset="0x0"/>
<Type builtinType="USHORT" streamName="WordDocument" bitfield="True"
bitOffsetWithinStruct="84" hexBitOffsetWithinStruct="0x54" bitCount="4"
streamOffsetOfStruct="0" hexStreamOffsetOfStruct="0x0" streamOffset="10"
hexStreamOffset="0xa" childId="10" hexChildId="0xa">
<Info>Built-in type "USHORT": Unsigned 2-byte integer.</Info>
</Type>
</ParseStack>
<LastData><![CDATA[
EC A5 01 01 4D 20 09 04  00 00 08 12 BF 00 00 00  ....M...........
00 00 00 30 00 00 00 00  00 08 00 00 66 EF 00 00  ...0........f...
]]></LastData>
</BFFValidation>
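
For anyone reading the dump by hand: the validator stops at stream offset 10 (0xa), on a 4-bit field at bit offset 84 of the FibBase structure. The following is a minimal sketch, not part of the validator output, that decodes the first 16 bytes of the LastData dump at those positions. The class name `FibHeaderPeek` and its helper methods are made up for illustration. Reading 16-bit values as little-endian and numbering bits from the least significant bit of each byte are assumptions based on how the binary Word format is normally laid out. If I read the MS-DOC FibBase layout correctly, bytes 0-1 are `wIdent`, bytes 2-3 are `nFib`, and the 4-bit field is `cQuickSaves`, but treat those names as assumptions too.

```java
public class FibHeaderPeek {
    // First 16 bytes of the <LastData> dump from the validator output above.
    static final int[] HEADER = {
        0xEC, 0xA5, 0x01, 0x01, 0x4D, 0x20, 0x09, 0x04,
        0x00, 0x00, 0x08, 0x12, 0xBF, 0x00, 0x00, 0x00
    };

    /** Reads an unsigned 16-bit value at the given byte offset (little-endian: assumption). */
    static int readUShortLE(int[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    /** Extracts bitCount bits starting at bitOffset, counting from the LSB of each byte (assumption). */
    static int readBits(int[] data, int bitOffset, int bitCount) {
        int value = 0;
        for (int i = 0; i < bitCount; i++) {
            int bit = bitOffset + i;
            int b = (data[bit / 8] >> (bit % 8)) & 1;
            value |= b << i;
        }
        return value;
    }

    public static void main(String[] args) {
        // Bytes 0-1 and 2-3 (assumed to be wIdent and nFib).
        System.out.printf("bytes 0-1   = 0x%04X%n", readUShortLE(HEADER, 0));
        System.out.printf("bytes 2-3   = 0x%04X%n", readUShortLE(HEADER, 2));
        // The 16-bit flags word at streamOffset="10" from the validator output.
        System.out.printf("bytes 10-11 = 0x%04X%n", readUShortLE(HEADER, 10));
        // bitOffsetWithinStruct="84", bitCount="4" from the validator output.
        System.out.printf("bits 84-87  = 0x%X%n", readBits(HEADER, 84, 4));
    }
}
```

With the bytes from the dump this prints 0xA5EC, 0x0101, 0x1208 and 0x0. Whether the value 0 in that 4-bit field is what the validator objects to, and whether it has anything to do with the exception in SprmOperation.initSize, is not established by this report.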
--------------------------------------------

I would greatly appreciate a timely fix, as I have 2000+ documents that
POI/Tika fail on, and I cannot proceed any further.
