Did anyone ever get any closer to a solution to this?

We are encountering the same error parsing some PowerPoint documents too.

Am I right in assuming this stems from the POI library rather than Nutch?
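
For what it's worth, here is a minimal sketch of how I read the top of the trace. It is not POI's actual code: the class and method names below are hypothetical, and I am only assuming that getUShort ends up indexing into a byte array at whatever offset the caller hands it. If that is right, the exception is thrown inside POI, but the bogus offset (1611976644) would have been computed by the caller, which in this trace is Nutch's ContentReaderListener.

```java
// Hypothetical sketch, not the POI source: a little-endian unsigned short
// read with and without a bounds check on the caller-supplied offset.
public class LittleEndianSketch {

    /** Reads an unsigned 16-bit little-endian value; no bounds check. */
    public static int getUShortUnchecked(byte[] data, int offset) {
        // An offset past the end of the array fails here with
        // ArrayIndexOutOfBoundsException.
        int low = data[offset] & 0xFF;
        int high = data[offset + 1] & 0xFF;
        return (high << 8) | low;
    }

    /** Same read, but returns -1 when two bytes are not available at offset. */
    public static int getUShortChecked(byte[] data, int offset) {
        if (offset < 0 || offset > data.length - 2) {
            return -1; // caller can skip the record instead of aborting the parse
        }
        return getUShortUnchecked(data, offset);
    }

    public static void main(String[] args) {
        byte[] record = {0x34, 0x12, (byte) 0xFF, (byte) 0xFF};

        // Valid offset: bytes 0x34, 0x12 decode to 0x1234.
        System.out.println(getUShortUnchecked(record, 0)); // prints 4660

        // Offset taken from the log line in the trace.
        int badOffset = 1611976644;
        System.out.println(getUShortChecked(record, badOffset)); // prints -1

        try {
            getUShortUnchecked(record, badOffset);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("out of bounds, as in the trace");
        }
    }
}
```

If the real code behaves like the unchecked version, a guard like the checked one in the calling code would at least let the parser skip the malformed record rather than fail the whole document.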


Best,
Trym



WebDev Freak wrote:
> 
> Hi, when I'm crawling some PowerPoint documents, some work and some give me
> the following error:
> 
> 2006-09-27 17:12:29,044 ERROR mspowerpoint.ContentReaderListener - extractClientTextBoxes
> java.lang.ArrayIndexOutOfBoundsException: 1611976644
>     at org.apache.poi.util.LittleEndian.getNumber(LittleEndian.java:491)
>     at org.apache.poi.util.LittleEndian.getUShort(LittleEndian.java:64)
>     at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.extractTextBoxes(ContentReaderListener.java:200)
>     at org.apache.nutch.parse.mspowerpoint.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:110)
>     at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:260)
>     at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:96)
>     at org.apache.nutch.parse.mspowerpoint.PPTExtractor.extractText(PPTExtractor.java:49)
>     at org.apache.nutch.parse.ms.MSExtractor.extract(MSExtractor.java:77)
>     at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:81)
>     at org.apache.nutch.parse.mspowerpoint.MSPowerPointParser.getParse(MSPowerPointParser.java:44)
>     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:283)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)
> 
> 
> Any help is appreciated.
> 
> 


