Hi,

I am using UDMSearch 3.0.23 with MySQL at the moment and
have a few suggestions regarding the software.  Please note
that I have only briefly looked at the development version
so some of these may already be implemented.

I am using crc-multi for data storage and the indexer seems to
slow down quite a lot when there are about 75000 URLs in
the database.  This site has about 125000 to index.  I
noticed it deletes from the various ndict tables even when
the URL had not previously been indexed.  Is that really
necessary or couldn't it just delete if status != 0?
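
A minimal sketch of the check I have in mind, in Python for
brevity.  The table names and the url_id column are
hypothetical placeholders, not the actual crc-multi schema,
and I am assuming status 0 means "never indexed":

```python
from typing import List

# Hypothetical table names, used only for illustration.
NDICT_TABLES = ["ndict2", "ndict4", "ndict8"]


def delete_statements(url_id: int, status: int) -> List[str]:
    """Return the DELETE statements to run before (re)indexing a URL."""
    if status == 0:
        # Assumed meaning: the URL was never indexed, so no word rows
        # can exist for it and the deletes can be skipped entirely.
        return []
    return [
        f"DELETE FROM {table} WHERE url_id = {int(url_id)}"
        for table in NDICT_TABLES
    ]
```

With this, a first-time URL costs no DELETE queries at all,
while a re-indexed URL still gets one per ndict table.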

Also, MySQL seems to really slow down when a table has more
than 2,000,000 rows.  Do you think there would be any
performance increase in splitting each ndict table into,
say, four tables?  For example ndict4-1, ndict4-2, ndict4-3,
and ndict4-4 and just use the first two bits of the crc32
value to determine where it goes.  I'm sure there's a point
where you have too many tables and performance can suffer
the other way, but that should cut the size of each ndict
table to about a quarter.
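
A rough sketch of the table selection, in Python.  It uses
zlib.crc32 as a stand-in; whether that matches the crc32
UDMSearch computes is an assumption, and ndict_table is a
hypothetical helper name:

```python
import zlib


def ndict_table(word: str, base: str = "ndict4") -> str:
    """Pick one of four tables from the top two bits of the word's crc32."""
    # Mask to an unsigned 32-bit value, then keep the two highest bits.
    crc = zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF
    shard = crc >> 30  # 0, 1, 2 or 3
    return f"{base}-{shard + 1}"
```

The same word always maps to the same table, so a search
only has to query one of the four.  Note that a table name
containing a hyphen has to be quoted with backticks in
MySQL, so a name like ndict4_1 may be more convenient.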

My URLs primarily look like this:
http://www.cm.nu/~shane/lists/destin/2001-03/xxxxxxx.html
Do you think there is a way to store the host/path part of
a URL separately from the referenced file?  It seems
redundant to be storing protocol://host/path/filename over
and over when often, parts of these keep repeating.
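
To illustrate the idea, here is a small in-memory sketch in
Python.  UrlStore and its fields are hypothetical names; in
the database this would presumably be a separate table of
prefixes referenced by an integer id from the url table:

```python
from typing import Dict, List, Tuple


class UrlStore:
    """Store each protocol://host/path/ prefix once, plus per-URL filenames."""

    def __init__(self) -> None:
        self.prefix_ids: Dict[str, int] = {}   # prefix -> prefix id
        self.prefixes: List[str] = []          # prefix id -> prefix
        self.urls: List[Tuple[int, str]] = []  # url id -> (prefix id, filename)

    def add(self, url: str) -> int:
        # Split after the last slash: everything before is the shared prefix.
        cut = url.rfind("/") + 1
        prefix, filename = url[:cut], url[cut:]
        pid = self.prefix_ids.get(prefix)
        if pid is None:
            pid = len(self.prefixes)
            self.prefix_ids[prefix] = pid
            self.prefixes.append(prefix)
        self.urls.append((pid, filename))
        return len(self.urls) - 1

    def get(self, url_id: int) -> str:
        # Rebuild the full URL from its two stored parts.
        pid, filename = self.urls[url_id]
        return self.prefixes[pid] + filename
```

For a mailing list archive with thousands of files per
directory, each long prefix would then be stored once
instead of once per message.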

Last thing, as far as I know, there is no way I can search
for a quoted string; for example, "Mail User Agent" won't
necessarily come up with pages containing that exact
string.  I think this could be implemented fairly simply by
adding an int to the ndict and dict tables specifying the
word number of the referenced word.  It would mean storing
a word once per occurrence when it appears several times in
a document, but would allow this to work.
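
A small sketch of how the word number could be used, in
Python with an in-memory index standing in for the dict
tables.  build_index and phrase_search are hypothetical
names, and splitting on whitespace is a simplification of
real word parsing:

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Index = Dict[str, Set[Tuple[int, int]]]  # word -> {(doc id, word number)}


def build_index(docs: Dict[int, str]) -> Index:
    """Record every occurrence of every word together with its position."""
    index: Index = defaultdict(set)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].add((doc_id, pos))
    return index


def phrase_search(index: Index, phrase: str) -> Set[int]:
    """Return ids of documents containing the words at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return set()
    hits: Set[int] = set()
    # Start from each occurrence of the first word and check that the
    # following words appear at the next positions in the same document.
    for doc_id, pos in index.get(words[0], ()):
        if all(
            (doc_id, pos + offset) in index.get(word, ())
            for offset, word in enumerate(words[1:], start=1)
        ):
            hits.add(doc_id)
    return hits
```

In SQL terms this would be a self-join on the same url id
with the word number increasing by one for each word in
the phrase.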

Best regards,
Shane

-- 
Shane Wegner: [EMAIL PROTECTED]
              http://www.cm.nu/~shane/
PGP:          1024D/FFE3035D
              A0ED DAC4 77EC D674 5487
              5B5C 4F89 9A4E FFE3 035D