On Apr 6, 2007, at 4:02 AM, David Balmain wrote:

> On 4/2/07, Andreas Korth <[EMAIL PROTECTED]> wrote:
>> On Apr 1, 2007, at 6:11 PM, John Joseph Bachir wrote:
>>>> Now this is just wrong. PPT files may contain all sorts of binary
>>>> data, such as images and videos. I just had a look at the sample
>>>> presentation that came with my Office installation. This file is
>>>> 3.5MB in size with a (plain text) payload of less than 1KB.
>>>
>>> As I stated in my previous email, I am conjecturing that indexing
>>> these documents will not affect search performance. Do you disagree?
>>
>> I couldn't disagree more. Question is to what extent does it affect
>> performance.
>
> Andy is right. Indexing binary data like this can really blow out the
> size of an index. Indexing natural language you get a lot of common
> terms so even in an index with millions of documents, you may have
> only tens of thousands of terms. This has a natural compression effect
> on the index so it will be a lot smaller than the collection of data
> that is being indexed. This doesn't work with binary data so the size
> of your index will be much larger and you'll have far more search
> terms in the index. So it will definitely have an effect on search
> performance but perhaps not as much as you'd expect.
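To make David's point about the number of terms concrete, here is a rough sketch that counts distinct terms in text drawn from a small vocabulary versus the same amount of random bytes. Everything in it is an assumption for illustration: the tokenizer is a naive regex rather than Ferret's analyzer, the random bytes only stand in for embedded images or video, and the names (VOCAB, unique_terms, sample_text, sample_binary) are made up.

```ruby
# Hypothetical illustration only; this is not Ferret's analyzer.
VOCAB = %w[the index search document term query file text size data
           ferret ruby result field store word common million small large].freeze

# Naive tokenizer: lowercased runs of ASCII letters and digits.
def unique_terms(data)
  data.scan(/[A-Za-z0-9]+/).map(&:downcase).uniq.size
end

# Pseudo natural language: words drawn from a small fixed vocabulary.
def sample_text(bytes, rng)
  out = +""
  out << VOCAB[rng.rand(VOCAB.size)] << " " while out.bytesize < bytes
  out
end

# Stand-in for embedded images or video: seeded random bytes.
def sample_binary(bytes, rng)
  rng.bytes(bytes)
end

rng = Random.new(42)
puts "text terms:   #{unique_terms(sample_text(200_000, rng))}"
puts "binary terms: #{unique_terms(sample_binary(200_000, rng))}"
```

The text sample can never produce more distinct terms than its vocabulary holds, while the random bytes keep producing new short "terms", which is the effect that inflates the term dictionary.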
For the record, by performance I meant the quality of the search (i.e., the results of a search query), not the speed. I now realize that there was no way for anyone to have known that :)

Thanks again for all the ideas. I'm happy as a clam with catdoc/catppt.

John
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk

