On Apr 6, 2007, at 4:02 AM, David Balmain wrote:

> On 4/2/07, Andreas Korth <[EMAIL PROTECTED]> wrote:
>> On Apr 1, 2007, at 6:11 PM, John Joseph Bachir wrote:
>>>> Now this is just wrong. PPT files may contain all sorts of binary
>>>> data, such as images and videos. I just had a look at the sample
>>>> presentation that came with my Office installation. This file is
>>>> 3.5MB in size with a (plain text) payload of less than 1KB.
>>>
>>> As I stated in my previous email, I am conjecturing that indexing
>>> these documents will not affect search performance. Do you disagree?
>> I couldn't disagree more. Question is to what extent does it affect
>> performance.
> Andy is right. Indexing binary data like this can really blow out the
> size of an index. Indexing natural language you get a lot of common
> terms so even in an index with millions of documents, you may have
> only tens of thousands of terms. This has a natural compression effect
> on the index so it will be a lot smaller than the collection of data
> that is being indexed. This doesn't work with binary data so the size
> of your index will be much larger and you'll have far more search
> terms in the index. So it will definitely have an effect on search
> performance but perhaps not as much as you'd expect.


For the record, by performance I meant the quality of the search
(i.e., the results of a search query), and not the speed. I now
realize that there is no way for anyone to have known that :)

Thanks again for all the ideas, I'm happy as a clam with catdoc/catppt.

John

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
