Hi,

I'm using Nutch 0.9 to crawl part of my intranet, and I am getting the
following error when it attempts to parse PPT files:

2009-03-11 16:30:47,000 ERROR mspowerpoint.ContentReaderListener - extractClientTextBoxes
java.lang.ArrayIndexOutOfBoundsException: -55133188
        at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
        at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:64)
        at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.extractTextBoxes(ContentReaderListener.java:201)
        at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:111)
        at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:260)
        at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:96)
        at org.apache.nutch.parse.mspowerpoint.PPTExtractor.extractText(PPTExtractor.java:50)
        at org.apache.nutch.parse.ms.MSExtractor.extract(MSExtractor.java:78)
        at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:82)
        at org.apache.nutch.parse.mspowerpoint.MSPowerPointParser.getParse(MSPowerPointParser.java:45)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:308)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:153)

Does anyone know what is happening here and how I can fix it?

Thanks

Luke
