On 4/2/07, Andreas Korth <[EMAIL PROTECTED]> wrote:
> On Apr 1, 2007, at 6:11 PM, John Joseph Bachir wrote:
> >> Now this is just wrong. PPT files may contain all sorts of binary
> >> data, such as images and videos. I just had a look at the sample
> >> presentation that came with my Office installation. This file is
> >> 3.5MB in size with a (plain text) payload of less than 1KB.
> >
> > As I stated in my previous email, I am conjecturing that indexing
> > these documents will not affect search performance. Do you disagree?
>
> I couldn't disagree more. The question is to what extent it affects
> performance.

Andy is right. Indexing binary data like this can really blow out the
size of an index. When you index natural language you get a lot of
repeated terms, so even an index with millions of documents may have
only tens of thousands of distinct terms. This has a natural
compression effect, so the index ends up a lot smaller than the
collection of data being indexed. Binary data doesn't repeat that way:
tokenizing it yields mostly unique byte sequences, so the index will
be much larger and will hold far more terms. That will definitely have
an effect on search performance, though perhaps not as much as you'd
expect. Either way, you'd be much better off extracting the text
first, as others have already said.
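
To make the term-count argument concrete, here is a rough sketch in
Python (not Ferret code). It compares how many distinct terms a toy
tokenizer finds in repetitive prose versus random bytes. The tokenizer,
the ten-word vocabulary and the use of random bytes as a stand-in for a
binary payload are all assumptions made for illustration; a real
analyzer and a real PPT file will behave differently in detail.

```python
import random
import re


def distinct_terms(text):
    """Return the set of lowercase ASCII alphanumeric tokens in text.

    This is a toy tokenizer for illustration, not Ferret's analyzer.
    """
    return set(re.findall(r"[a-z0-9]+", text.lower()))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Prose-like input: 20,000 words drawn from a tiny, hypothetical vocabulary.
vocabulary = ["the", "index", "search", "term", "document",
              "text", "file", "size", "of", "a"]
prose = " ".join(rng.choice(vocabulary) for _ in range(20000))

# Binary-like input: random bytes decoded as Latin-1, as if a raw file
# had been handed to the indexer without any text extraction.
blob = bytes(rng.randrange(256) for _ in range(100000)).decode("latin-1")

prose_terms = distinct_terms(prose)
blob_terms = distinct_terms(blob)

print("distinct terms in prose:", len(prose_terms))
print("distinct terms in blob: ", len(blob_terms))
```

The prose can never yield more terms than its vocabulary holds, no
matter how long it gets, while the random bytes keep producing new
"terms" as the input grows. That growth in the term dictionary is what
inflates the index.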

Cheers,
Dave
-- 
Dave Balmain
http://www.davebalmain.com/