Re: [Ferret-talk] indexing mostly-binary documents (.ppt)

Andreas Korth Sun, 01 Apr 2007 01:47:32 -0800

On Apr 1, 2007, at 3:09 AM, John Bachir wrote:

> Here's an interesting problem: In my app, we are indexing various
> types of documents, including microsoft powerpoint. Powerpoint
> documents are mostly binary, but have a bunch of text (all of the
> text in the document?) as well.


Are you serious? You're adding raw, unprocessed PPT files to your index?

Now this is just wrong. PPT files may contain all sorts of binary  
data, such as images and videos. I just had a look at the sample  
presentation that came with my Office installation. This file is  
3.5MB in size with a (plain text) payload of less than 1KB.

I'm sure there's a tool available that converts PPT to plain text,
and I strongly recommend you find one and index its output instead.
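
Something along these lines might do, as a rough sketch rather than a tested recipe. It assumes a command-line converter such as catppt (from the catdoc package) is installed and on your PATH; the helper names extract_text and index_presentation are made up for illustration, and the index can be anything that accepts <<.

```ruby
require 'open3'

# Hypothetical helper: run an external converter and return the plain text.
# Assumption: the default command, catppt (catdoc package), is on the PATH.
def extract_text(path, converter = ['catppt'])
  output, status = Open3.capture2(*converter, path)
  raise "conversion failed for #{path}" unless status.success?
  # Replace invalid bytes, then collapse whitespace so only the words remain.
  output.scrub.gsub(/\s+/, ' ').strip
end

# Hypothetical helper: add the extracted text, not the raw file, to the index.
# Works with any object that accepts <<, so an Array is enough to try it out.
def index_presentation(index, path, converter = ['catppt'])
  index << { :file => path, :content => extract_text(path, converter) }
end
```

With Ferret you would pass in the index you already have (for example one opened with Ferret::Index::Index.new(:path => ...)) and call index_presentation(index, 'talk.ppt'). That way only the text payload ends up in the index, not the megabytes of images around it.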

Cheers,
Andy
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
