Hi, I'm using Nutch 0.9 to crawl part of my intranet, and I'm getting the following error when it attempts to parse PPT files:
2009-03-11 16:30:47,000 ERROR mspowerpoint.ContentReaderListener - extractClientTextBoxes
java.lang.ArrayIndexOutOfBoundsException: -55133188
    at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
    at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:64)
    at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.extractTextBoxes(ContentReaderListener.java:201)
    at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:111)
    at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:260)
    at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:96)
    at org.apache.nutch.parse.mspowerpoint.PPTExtractor.extractText(PPTExtractor.java:50)
    at org.apache.nutch.parse.ms.MSExtractor.extract(MSExtractor.java:78)
    at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:82)
    at org.apache.nutch.parse.mspowerpoint.MSPowerPointParser.getParse(MSPowerPointParser.java:45)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:308)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:153)

Does anyone know what is happening here and how I can fix it?

Thanks,
Luke