Hello Erik

> Thanks for the feedback.  If you don't mind elaborating further, what kind
> of documents are you indexing (database rows?  file system files?  other?),
> how many documents do you have, and how are you indexing it?
>
> Thanks,
>
>        Erik
>


  We are indexing file system files: HTML pages (85%), images (10%; we index
only their meta information), PDF (2%), Word (2%) and plain text (1%). We have
100,000,000 documents to index, of which 10% is already done. As
for the last question, I didn't exactly understand what you mean by "how
we are indexing". What I can say is that before we index documents that are
not plain text (like PDF, Word and HTML), we run a content extraction step
(using pdftotext, antiword and the 'hpricot' Ruby library). We also extract
the metadata related to each document we index.
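
To make the extraction step more concrete, here is a rough sketch of how such
a dispatch by file type could look. It is not our actual code: the names
EXTRACTORS, extractor_command, strip_html and extract_text are hypothetical,
and strip_html is only a crude regex stand-in for the hpricot step so that the
sketch needs nothing beyond the Ruby standard library. It assumes pdftotext
and antiword are on the PATH and write plain text to stdout when called as
shown.

```ruby
require 'open3'

# Hypothetical mapping from file extension to the external extractor command.
# `pdftotext FILE -` and `antiword FILE` both write plain text to stdout.
EXTRACTORS = {
  '.pdf' => ->(path) { ['pdftotext', path, '-'] },
  '.doc' => ->(path) { ['antiword', path] }
}.freeze

# Returns the command (as an argv array) for a path, or nil when the file
# type needs no external tool.
def extractor_command(path)
  builder = EXTRACTORS[File.extname(path).downcase]
  builder && builder.call(path)
end

# Crude stand-in for the hpricot step (an assumption, not the real parser):
# drop script/style blocks, drop tags, collapse whitespace.
def strip_html(html)
  html.gsub(/<(script|style)\b.*?<\/\1>/mi, ' ')
      .gsub(/<[^>]+>/, ' ')
      .gsub(/\s+/, ' ')
      .strip
end

# Returns the plain text of one file, ready to be handed to the indexer.
def extract_text(path)
  if (cmd = extractor_command(path))
    out, status = Open3.capture2(*cmd)
    raise "#{cmd.first} failed on #{path}" unless status.success?
    out
  elsif %w[.html .htm].include?(File.extname(path).downcase)
    strip_html(File.read(path))
  else
    File.read(path) # plain text is indexed as-is
  end
end
```

The string returned by extract_text would then be added to the index together
with the metadata fields of the document.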







-- 
===========
 |   Lyes Amazouz
 |   USTHB, Algiers
===========
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
