* at 27/05 13:30 +0100 Pierre Denis said:
> I'd like to count the number of words in any type of documents.
> I have a processor that transform the initial document into plain text and
> then counting the words is a piece of cake.
> No problems so far to do it for plain text and html documents.
>
> The problem is for MS Word documents and pdf. Is there a perl module I've
> missed that could do it? Maybe something that can transform MS word and pdf
> docs into rtf?
> It would be nice also to be able to extract the text from Excel
> spreadsheets.

Spreadsheet::ParseExcel will do the Excel bit (as long as the files aren't password protected) and seemed OK for the quick hack I used it for. I'm sure there's a stack of PDF modules; a quick search of CPAN shows several.

Word might be a bit trickier. There are a few libraries out there that do the Word -> text dance (the one antiword uses seems to be pretty good), but none of them, as far as I've ever found, has a Perl interface, so you'd either have to write one or do something ugly like shell out to the relevant program.

I do seem to recall a talk at YAPC::Europe 2000 about doing Word -> text conversion, but as I remember it they used a Windows machine running some sort of server that then used the Perl OLE stuff to take a Word file and save it as text. This is obviously non-optimal :)

s
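
For what it's worth, here is a rough sketch of how the Excel route and the ugly shell-out route might look. It is untested against real documents and makes a few assumptions: the helper names (count_words, text_from_command, text_from_excel) are made up for illustration, antiword is assumed to be on your PATH and to print the converted text on stdout, and the Spreadsheet::ParseExcel calls assume a reasonably recent release of the module.

```perl
use strict;
use warnings;

# Count whitespace-separated words in a string of plain text.
# split ' ' ignores leading whitespace, so no empty first field.
sub count_words {
    my ($text) = @_;
    my @words = split ' ', $text;
    return scalar @words;
}

# Run an external converter and return whatever it prints on stdout.
# The list form of open avoids passing the filename through a shell.
sub text_from_command {
    my (@command) = @_;
    open(my $fh, '-|', @command)
        or die "cannot run $command[0]: $!\n";
    local $/;                       # slurp all output in one read
    my $text = <$fh>;
    close($fh)
        or die "$command[0] failed (exit status $?)\n";
    return defined $text ? $text : '';
}

# Collect the formatted value of every cell in an .xls workbook.
# The module is loaded lazily so the other helpers work without it.
sub text_from_excel {
    my ($file) = @_;
    require Spreadsheet::ParseExcel;
    my $parser   = Spreadsheet::ParseExcel->new;
    my $workbook = $parser->parse($file)
        or die $parser->error(), "\n";
    my @values;
    for my $sheet ($workbook->worksheets()) {
        my ($row_min, $row_max) = $sheet->row_range();
        my ($col_min, $col_max) = $sheet->col_range();
        for my $row ($row_min .. $row_max) {
            for my $col ($col_min .. $col_max) {
                my $cell = $sheet->get_cell($row, $col) or next;
                push @values, $cell->value();
            }
        }
    }
    return join ' ', @values;
}
```

Usage would then be along the lines of count_words(text_from_command('antiword', 'report.doc')) for Word, or count_words(text_from_excel('figures.xls')) for Excel, where both filenames are placeholders. Note that joining cell values with spaces means a cell holding several words counts as several words, which may or may not be what you want.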