[ http://issues.apache.org/jira/browse/NUTCH-21?page=comments#action_64520 ]
David Spencer commented on NUTCH-21:
------------------------------------
This may be of some use:
I needed a PPT parser in the context of Lucene, so I copied the code from here,
commented out a few nutch-specific things (e.g. the logging calls), and tested
it on some local PPT files. I'm using POI-2.5.1-final.
The code is not perfect, nor is the PPT I have :) but it's pretty good.
When it works it works well.
When it fails it sometimes says there is no content, but in the doc I spot
checked there seemed to be textual content. I have only spot checked a few docs,
but I did run it through my disk:
In a test run:
[a] I had 195 PPT files
[b] In 36 files it said there was no body
[c] With one file it threw an exception
[d] With 158 files it found content
Wrt [b], this is not necessarily wrong, e.g. if there are only images; however,
in the 1 file I spot checked there was apparently textual content.
Wrt [d], I didn't spot check many files, but the ones I did seemed fine.
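A run like the one above (36 + 1 + 158 = 195 files) could be tallied roughly as
follows. This is only a sketch, not the harness that produced those numbers:
PptTally, TextExtractor and Outcome are hypothetical names, and the extractor in
main() is a fake that stands in for the real PPT-to-text code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical harness; none of these names come from Nutch, Lucene, or POI.
final class PptTally {

    /** Stand-in for the real parser, e.g. a thin wrapper around the PPT-to-text code. */
    interface TextExtractor {
        String extract(Path file) throws IOException;
    }

    /** The three outcomes counted in the test run: [d], [b], and [c]. */
    enum Outcome { CONTENT, NO_BODY, EXCEPTION }

    /** Classify one file: parser threw, parser returned no text, or text was found. */
    static Outcome classify(TextExtractor extractor, Path file) {
        try {
            String text = extractor.extract(file);
            boolean empty = text == null || text.trim().isEmpty();
            return empty ? Outcome.NO_BODY : Outcome.CONTENT;
        } catch (IOException | RuntimeException e) {
            // One broken file should not abort the whole run.
            return Outcome.EXCEPTION;
        }
    }

    /** Walk root recursively and count the outcome for every *.ppt file. */
    static Map<Outcome, Integer> tally(TextExtractor extractor, Path root) throws IOException {
        Map<Outcome, Integer> counts = new EnumMap<>(Outcome.class);
        for (Outcome o : Outcome.values()) {
            counts.put(o, 0);
        }
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".ppt"))
                        .collect(Collectors.toList());
        }
        for (Path f : files) {
            counts.merge(classify(extractor, f), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Demo on a throwaway directory; the files are dummies, not real PPTs.
        Path dir = Files.createTempDirectory("ppt-tally-demo");
        Files.write(dir.resolve("ok.ppt"), new byte[] {1});
        Files.write(dir.resolve("empty.ppt"), new byte[] {1});
        Files.write(dir.resolve("bad.ppt"), new byte[] {1});

        // Fake extractor keyed on the file name, purely for illustration.
        TextExtractor fake = file -> {
            String name = file.getFileName().toString();
            if (name.startsWith("bad")) {
                throw new IOException("simulated parse failure");
            }
            return name.startsWith("empty") ? "" : "slide text";
        };
        System.out.println(tally(fake, dir));
    }
}
```

To use it for real, replace the fake extractor with a call into the actual
parser and pass the directory to scan to tally(). Note that NO_BODY only means
the parser returned nothing; as with [b], such files still need a manual spot
check to tell image-only decks from missed text.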
Personally I would advocate using this esp if someone verifies this within
nutch - but I'm confident it will work as I didn't change much to use it in
Lucene.
This was the "bug" that happened in 1 file
Caused by: java.io.IOException: Cannot remove block[ 18805 ]; out of range
at org.apache.poi.poifs.storage.BlockListImpl.remove(BlockListImpl.java:103)
at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:92)
at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:83)
at com.tropo.ppt.PPT2Text.init(PPT2Text.java:92)
-- Dave
> parser plugin for MS PowerPoint slides
> --------------------------------------
>
> Key: NUTCH-21
> URL: http://issues.apache.org/jira/browse/NUTCH-21
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Reporter: Stefan Groschupf
> Priority: Trivial
> Attachments: build.xml.patch.txt, parse-mspowerpoint.zip
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1109321&group_id=59548&atid=491356
> submitted by:
> Stephan Strittmatter
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers