Hi,

       There is a simple split strategy for indexing. The document name
(the URL in this case) is hashed with an MD5 function, and the target
machine is the resulting number modulo the number of machines (modulo 2
if you have two machines, modulo 4 if you have four, etc.). This
strategy saves you from dividing your document set manually (.org sites
on one machine, .com sites on another, etc.). My practical experience
shows that the distribution you get for a large number of documents
(> 1 million) is even (i.e. with four machines, each of them holds
about 25% of the documents).
To be really useful, it must be paired with a migration procedure
that can move part of an index from one machine to another. If you
start with 2 machines and later want to add two more, this is
absolutely necessary.
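
       A minimal sketch of the split and of what a migration would have
to move, in Python. The function names (machine_for, documents_to_migrate)
are made up for illustration and are not part of ht://Dig; the only
assumption taken from the description above is "MD5 of the URL, modulo
the number of machines".

```python
import hashlib


def machine_for(url: str, n_machines: int) -> int:
    """Return the index (0-based) of the machine that should hold `url`."""
    # Interpret the 128-bit MD5 digest as one big integer, then reduce it.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n_machines


def documents_to_migrate(urls, old_n: int, new_n: int):
    """Yield (url, old_machine, new_machine) for documents whose home changes
    when the cluster goes from `old_n` to `new_n` machines."""
    for url in urls:
        old = machine_for(url, old_n)
        new = machine_for(url, new_n)
        if old != new:
            yield url, old, new
```

With this scheme, going from 2 to 4 machines leaves each document either
where it was or moves it to machine old + 2, because (h mod 4) mod 2 is
always h mod 2. So each old machine hands off part of its index to
exactly one new machine, which is the case a migration procedure has to
handle. Growing by something other than doubling scatters documents more
widely.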

       The problem then is searching. The simple way is to 1) perform
the search on each machine and 2) send all the results to a machine
whose only task is to sort/merge/synchronize them and return a ranked
list of results.
       I know that Fulcrum uses this method to scale big indexes, and it
works fairly well. 
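
       The merge step could look roughly like the Python sketch below.
It assumes each machine returns its hits as (score, url) pairs already
sorted by descending score, and that scores from different machines are
comparable; both are assumptions of this sketch, not something I can
state about Fulcrum or ht://Dig. The per-machine search functions are
hypothetical stand-ins.

```python
import heapq
from itertools import islice
from operator import itemgetter


def merge_results(per_machine_results, limit=10):
    """Merge result lists (each sorted by descending score) into one
    ranked list of at most `limit` hits."""
    merged = heapq.merge(*per_machine_results, key=itemgetter(0), reverse=True)
    return list(islice(merged, limit))


def search_all(query, machines, limit=10):
    """Run `query` on every machine and merge the partial results.

    `machines` is a list of callables taking (query, limit) and returning
    a sorted list of (score, url) pairs; they stand in for real index nodes.
    """
    # Each machine only needs to return its own top `limit` hits.
    partial = [search(query, limit) for search in machines]
    return merge_results(partial, limit)
```

The queries are issued one after another here to keep the sketch short;
in practice they would be sent in parallel, since the whole point is to
spread the work.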

       I'm no expert on the subject, however. So much to do and so little
time :-)

       Cheers,

-- 
                Loic Dachary

                ECILA
                100 av. du Gal Leclerc
                93500 Pantin - France
                Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
                e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/

