* at 27/05 13:30 +0100 Pierre Denis said:
> I'd like to count the number of words in any type of documents.
> I have a processor that transform the initial document into plain text and
> then counting the words is a piece of cake.
> No problems so far to do it for plain text and html documents.
> 
> The problem is for MS Word documents and pdf. Is there a perl module I've
> missed that could do it? Maybe something that can transform MS word and pdf
> docs into rtf?
> It would be nice also to be able to extract the text from Excel
> spreadsheets.

Spreadsheet::ParseExcel will do the Excel bit (as long as the files
aren't password protected) and seemed OK for the quick hack I used it for.
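
For what it's worth, a minimal sketch of the Excel route. It assumes a
reasonably recent Spreadsheet::ParseExcel (one that has the parse(),
worksheets() and get_cell() methods); count_words and excel_text are
just names I made up for illustration, and cells are simply joined with
spaces before counting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_words and excel_text are illustrative helper names, not part of
# any module.

# Count whitespace-separated words in a string.
sub count_words {
    my ($text) = @_;
    my @words = split ' ', $text;
    return scalar @words;
}

# Collect the formatted value of every non-empty cell in an .xls file.
sub excel_text {
    my ($file) = @_;
    # Loaded here so the rest of the script compiles without the module.
    require Spreadsheet::ParseExcel;
    my $parser   = Spreadsheet::ParseExcel->new();
    my $workbook = $parser->parse($file)
        or die "cannot parse $file: " . $parser->error() . "\n";

    my @values;
    for my $sheet ($workbook->worksheets()) {
        my ($row_min, $row_max) = $sheet->row_range();
        my ($col_min, $col_max) = $sheet->col_range();
        for my $row ($row_min .. $row_max) {
            for my $col ($col_min .. $col_max) {
                my $cell = $sheet->get_cell($row, $col);
                push @values, $cell->value() if $cell;
            }
        }
    }
    return join ' ', @values;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    print count_words(excel_text($ARGV[0])), "\n";
}
```

Run it as "perl xlswords.pl report.xls" (both names are placeholders).
Numbers and dates count as words here, which may or may not be what
you want.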

I'm sure there's a stack of PDF modules. A quick search of CPAN shows
several.

Word might be a bit trickier. There are a few libraries out there that
do the Word -> text dance (the one antiword uses seems to be pretty
good) but none of them (as far as I've ever found) has a Perl
interface, so you'd either have to write one or do something ugly like
shell out to the relevant program.
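
The ugly shell-out version might look roughly like this. It's a sketch
that assumes antiword is installed and on the PATH, and that it writes
the plain text to stdout when given a .doc file; text_from_command and
count_words are hypothetical helper names. The list form of open avoids
passing the file name through a shell.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# text_from_command and count_words are illustrative helper names.

# Count whitespace-separated words in a string.
sub count_words {
    my ($text) = @_;
    my @words = split ' ', $text;
    return scalar @words;
}

# Run an external command (given as a list, so no shell is involved)
# and return everything it prints on stdout.
sub text_from_command {
    my (@cmd) = @_;
    open(my $fh, '-|', @cmd)
        or die "cannot run $cmd[0]: $!\n";
    local $/;                      # slurp mode: read all output at once
    my $text = <$fh>;
    close($fh)
        or die "$cmd[0] failed (exit status " . ($? >> 8) . ")\n";
    return defined $text ? $text : '';
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    # Assumption: "antiword FILE" prints the document text to stdout.
    my $text = text_from_command('antiword', $ARGV[0]);
    print count_words($text), "\n";
}
```

The same text_from_command call should work for any converter that
prints text to stdout, so a PDF-to-text tool could be swapped in the
same way.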

I do seem to recall a talk at YAPC::Europe 2000 about doing some sort
of Word -> text conversion, but I think they used some sort of
Windows machine with some sort of server that then used the Perl OLE
stuff to take a Word file and save it as text. This is obviously
non-optimal :)

s
