Hello,

I need to index approximately 50-70 GB of text. All of the data is local, so 
there is no need to go through HTTP or anything. The text is in a variety of formats 
(doc, txt, html, ps, pdf, etc.), and I need to be able to run rather precise searches 
(strange non-word item codes and such).

Udmsearch handles the last two requirements (mixed formats and precise searches) with 
no problem. External parsers are easy enough, and it's hard to beat udmsearch for 
indexing everything.

The size of the data set is the real problem. I have tried glimpse, swish++, ht://Dig, 
and several others, but none of them has given me either the exhaustive indexing 
(numbers, non-words, random stuff, etc.) or the performance that I want (they seem to 
choke on far less data).

I am using udmsearch 3.0.22 with MySQL 3.22.32 on a Dell 420 running Debian. It has 
two PIII 677s, 256 MB of RAM, and 5 Ultra ATA/100 drives.

Currently I have 1 dedicated system drive, 1 dedicated data drive, and the other 3 
drives in a RAID 0 array for the MySQL udmsearch db.

MySQL seems to be the bottleneck: as the database grows, indexing slows down 
noticeably. I have only indexed about 1 GB in my tests, but I am worried that it is 
going to get exponentially slower as I index more and more data. I have tuned MySQL's 
variables, and it does virtually no key reads (the key cache is sufficient).

I am not looking for lightning-fast indexing. I just want to be able to do it in a 
reasonable amount of time and not have it die on me. 3-4 days to index the data is not 
objectionable. I just want to know it's going to work.
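
To put a number on that budget, here is a rough back-of-the-envelope sketch. The sizes 
and day counts are the ones above; the helper name is made up for illustration, and 
the calculation assumes a constant indexing rate, which is exactly what is in doubt.

```python
# Back-of-the-envelope check: average throughput the indexer must sustain
# to finish the whole corpus within the time budget.
# required_mb_per_sec is a hypothetical helper, not part of udmsearch.

SECONDS_PER_DAY = 24 * 60 * 60


def required_mb_per_sec(size_gb: float, days: float) -> float:
    """Average MB/s needed to index size_gb gigabytes in the given days."""
    return (size_gb * 1024) / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    for size_gb in (50, 70):
        for days in (3, 4):
            rate = required_mb_per_sec(size_gb, days)
            print(f"{size_gb} GB in {days} days: {rate:.2f} MB/s")
```

That works out to roughly 0.15 to 0.28 MB/s sustained, or about one to two hours per 
GB, so the target only holds if the rate does not fall much below that as the 
database grows.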

I have already shrunk UDM_MAXTEXTSIZE down to 4, since I plan to do my own display 
parsing for searches.

I am relatively new to udmsearch, so I thought I would turn to the experts.

Does anyone have any suggestions? MySQL tuning? Is udmsearch even what I should be 
using to do this? Should I try a proper SCSI RAID? More RAM?

Thanks,
-sean