On 4/2/07, Andreas Korth <[EMAIL PROTECTED]> wrote:
> On Apr 1, 2007, at 6:11 PM, John Joseph Bachir wrote:
> >> Now this is just wrong. PPT files may contain all sorts of binary
> >> data, such as images and videos. I just had a look at the sample
> >> presentation that came with my Office installation. This file is
> >> 3.5MB in size with a (plain text) payload of less than 1KB.
> >
> > As I stated in my previous email, I am conjecturing that indexing
> > these documents will not affect search performance. Do you disagree?
>
> I couldn't disagree more. Question is to what extent does it affect
> performance.

Andy is right. Indexing binary data like this can really blow out the size of an index. When you index natural language you get a lot of common terms, so even in an index with millions of documents you may have only tens of thousands of distinct terms. This has a natural compression effect on the index, so it will be a lot smaller than the collection of data being indexed. This doesn't work with binary data, so your index will be much larger and will contain far more search terms.

So it will definitely have an effect on search performance, but perhaps not as much as you'd expect. Nevertheless, you'd be much better off extracting the text, as others have already said.

Cheers,
Dave

-- 
Dave Balmain
http://www.davebalmain.com/
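
As a rough illustration of the term-count effect described in the reply above, here is a minimal Python sketch. It does not use Ferret; the tokenizer (runs of ASCII letters and digits, lowercased), the tiny word list, and the helper names `unique_terms`, `sample_text`, and `sample_binary` are all assumptions made for the example. It compares how many distinct terms come out of repetitive natural-language-like text versus the same number of random bytes.

```python
import random
import re

# Hypothetical tokenizer: runs of ASCII letters/digits. A real analyzer differs.
TOKEN = re.compile(r"[A-Za-z0-9]+")

# Small stand-in vocabulary; real text has more words but still repeats heavily.
WORDS = ["the", "index", "search", "term", "document", "file",
         "text", "data", "size", "query", "of", "and"]


def unique_terms(data: bytes) -> set:
    """Decode bytes as Latin-1 and return the set of lowercased tokens."""
    text = data.decode("latin-1")  # Latin-1 maps every byte to a character
    return {tok.lower() for tok in TOKEN.findall(text)}


def sample_text(n_words: int, rng: random.Random) -> bytes:
    """Build text by drawing words from the small fixed vocabulary."""
    return " ".join(rng.choice(WORDS) for _ in range(n_words)).encode("ascii")


def sample_binary(n_bytes: int, rng: random.Random) -> bytes:
    """Build random bytes as a stand-in for embedded images or video."""
    return bytes(rng.getrandbits(8) for _ in range(n_bytes))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    text = sample_text(20000, rng)
    binary = sample_binary(len(text), rng)
    print("bytes in each sample:   ", len(text))
    print("distinct terms (text):  ", len(unique_terms(text)))
    print("distinct terms (binary):", len(unique_terms(binary)))
```

The text sample can never yield more distinct terms than the vocabulary it was drawn from, however long it gets, while the random bytes keep producing new accidental "terms" as the input grows. That growth in the term dictionary is what inflates the index.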

