I'm running into performance problems with the indexer.
Specifically, because all of my pages are dynamically generated, usually
from DB backends, the HEAD/date checks are of no use to the indexer,
and it ends up re-indexing the entire site on every run.  This takes an
inordinate length of time (I have a couple hundred thousand pages
online, with the indexer running nightly via a cron job).

Suggestion:

  In addition to doing the HEAD/date check, also compute a CRC of
the page content and store it.  Then add a command-line or config-file
option that tells indexer to reindex only those pages whose CRC
differs from the stored value.
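
  Roughly what I have in mind, sketched in Python rather than in the
indexer's own code.  The names here are made up for illustration:
`needs_reindex` is not an existing indexer function, and `stored_crcs`
is a plain dict standing in for wherever the indexer keeps its per-URL
data.

```python
import zlib


def needs_reindex(url, body, stored_crcs):
    """Return True if the CRC-32 of body differs from the one stored for url.

    stored_crcs is a hypothetical dict (url -> CRC) standing in for the
    indexer's database.  It is updated in place when the CRC changes.
    """
    crc = zlib.crc32(body) & 0xFFFFFFFF  # force an unsigned 32-bit value
    if stored_crcs.get(url) == crc:
        return False  # content unchanged since the last run: skip it
    stored_crcs[url] = crc
    return True


# Usage: the first sighting and any changed content are reindexed.
crcs = {}
print(needs_reindex("http://example.com/page", b"<html>v1</html>", crcs))
print(needs_reindex("http://example.com/page", b"<html>v1</html>", crcs))
print(needs_reindex("http://example.com/page", b"<html>v2</html>", crcs))
```

  The page still has to be fetched to compute the CRC, so the saving
is in the parsing and index update, not in the download.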

  If you are concerned about the window for CRC collisions (a changed
page that happens to produce the same CRC as before), then two CRCs
of different sizes, or a CRC plus a hash, should narrow that window
acceptably.
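
  As an illustration of the CRC-plus-hash variant, again in Python and
again with invented names: `fingerprint` is hypothetical, and SHA-256
is just my pick for the example; any reasonable hash would do.

```python
import hashlib
import zlib


def fingerprint(body):
    """Return a (crc32, sha256-hex) pair for body.

    A changed page goes unnoticed only if BOTH values collide with the
    stored pair, which is far less likely than a CRC-32 collision alone.
    """
    crc = zlib.crc32(body) & 0xFFFFFFFF  # unsigned 32-bit CRC
    digest = hashlib.sha256(body).hexdigest()
    return (crc, digest)


# Usage: store the pair per URL and reindex when it differs.
old = fingerprint(b"<html>v1</html>")
new = fingerprint(b"<html>v2</html>")
print(old != new)
```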

-- 
J C Lawrence                                 Home: [EMAIL PROTECTED]
----------(*)                              Other: [EMAIL PROTECTED]
--=| A man is as sane as he is dangerous to his environment |=--