Anyone ever get any closer to a solution to this?
We are encountering the same error parsing some PowerPoint documents too. Am I right in assuming this stems from the POI library rather than Nutch?

Best,
Trym

WebDev Freak wrote:
>
> Hi when I'm crawling some Powerpoint documents some work and some give me
> the following error:
>
> 2006-09-27 17:12:29,044 ERROR mspowerpoint.ContentReaderListener - extractClientTextBoxes
> java.lang.ArrayIndexOutOfBoundsException: 1611976644
>     at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
>     at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:64)
>     at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.extractTextBoxes(ContentReaderListener.java:200)
>     at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:110)
>     at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:260)
>     at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:96)
>     at org.apache.nutch.parse.mspowerpoint.PPTExtractor.extractText(PPTExtractor.java:49)
>     at org.apache.nutch.parse.ms.MSExtractor.extract(MSExtractor.java:77)
>     at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:81)
>     at org.apache.nutch.parse.mspowerpoint.MSPowerPointParser.getParse(MSPowerPointParser.java:44)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:283)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)
>
> Any help is appreciated.
>

--
View this message in context: http://www.nabble.com/Need-Help....Problem-Crawling%2C-tf2354599.html#a7217888
Sent from the Nutch - User mailing list archive at Nabble.com.
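
A note on reading the trace: the exception is thrown inside POI's LittleEndian.getNumber, but the offset it receives comes from ContentReaderListener.extractTextBoxes on the Nutch side, and the reported index (1611976644) is far larger than any plausible record buffer. So the trace alone does not settle which side is at fault. The sketch below is not the Nutch or POI code. It only illustrates, with a hypothetical helper named readUShortChecked, how validating an offset before a little-endian unsigned-short read lets the caller skip a malformed record instead of failing with ArrayIndexOutOfBoundsException.

```java
import java.util.OptionalInt;

// Illustrative only: a hypothetical helper, not part of POI or Nutch.
class CheckedLittleEndian {

    // Reads an unsigned 16-bit little-endian value at the given offset.
    // Returns an empty OptionalInt when the two bytes are not fully
    // inside the buffer, instead of throwing.
    static OptionalInt readUShortChecked(byte[] data, int offset) {
        if (data == null || offset < 0 || offset > data.length - 2) {
            return OptionalInt.empty();
        }
        int low = data[offset] & 0xFF;       // least significant byte first
        int high = data[offset + 1] & 0xFF;  // most significant byte second
        return OptionalInt.of((high << 8) | low);
    }

    public static void main(String[] args) {
        byte[] record = {0x34, 0x12, (byte) 0xFF, (byte) 0xFF};

        // In-range read: bytes 0x34, 0x12 decode to 0x1234.
        System.out.println(readUShortChecked(record, 0));

        // The index from the stack trace: far outside the buffer, so the
        // result is empty and the caller can skip the record.
        System.out.println(readUShortChecked(record, 1611976644));
    }
}
```

Whether skipping such a record is acceptable depends on the documents involved; the check only avoids the crash, it does not recover the text.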
