Hi all,

 

A couple of years ago we implemented a website for one of our clients.
The website serves as an information portal for regular visitors seeking
information, but it also has workgroups for specialists who share
information about different subjects. When we created the website we used
htdig as a search engine to spider the public information as well as the
information in the workgroups. Workgroups contain news and agenda, forums,
webmail and documents.

 

The problem is that over the past few months the load on the server has
increased. During indexing the server load average climbs to 3.00 - 4.00 and
above, and a full indexing run takes up to 12 hours! Now I know there is a
lot of information on the website plus 600+ workgroups to index, so I'd
expect htdig to take some time. We end up with the usual htdig files:
docdb = 206 MB, docs.index = 11 MB, wordlist = 222 MB and words db = 169 MB.
Compared with the index of another (much smaller) website, these file sizes
are not much different: indexing the smaller website gives us a docdb file
of 195 MB, a wordlist file of 195 MB and a words db file of 140 MB, yet that
indexing run takes only about 2 hours from start to finish.

 

I've tried to set up htdig to index incrementally, but that doesn't work
well. After a while htdig stops indexing with an error: the work files grow
past the 2GB boundary. Cleaning up the work files before indexing works, but
then it's no longer an incremental index, is it? :-) Besides, indexing still
takes about 5-6 hours when run incrementally.

 

Can anyone help me with this problem? Or does anyone have an idea what the
problem might be? I've added the server information at the bottom of this
e-mail, followed by the configuration file.

 

Thanks in advance!

 

Greetings,

 

Marco Houtman

Ecommany B.V.

 

 

 

PS: Server information

 

SuSE Linux (I believe it's 8.2, but our service provider installed it for us
and I do not have enough privileges to log in to the server and look up the
exact version)

Htdig 3.1.6 has been installed from source (I can see the source file and
there's no RPM with the name htdig installed afaik).

Server hardware is about 3 years old now. I can't tell you exactly what the
components are but I know it was state of the art equipment back then :-)

 

PS2: configuration

 

# common

root_dir:                        /home/www/domain.nl/htdig

 

common_dir:                 ${root_dir}/common

database_dir:                 ${root_dir}/db

template_dir:                 ${root_dir}/templates

 

# htdig

bad_extensions:            .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \

.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm \

.mpg .mov .avi .css .js .inc

bad_word_list:               ${common_dir}/bad_words

create_url_list:               yes

exclude_urls:                 /bestel/ /zoek/ /download/ /pdf/ /uploaded/ \

selectie= regio= type= letter= auteur= rubriek= start= \

tumor= uitgever= sort= regtoev= meth= status= fase= \

aktie= thesaurus

external_parsers:           application/msword /usr/local/bin/parse_doc.pl \

application/pdf /usr/local/bin/parse_pdf.pl

limit_urls_to:                  http://www.domain.nl/

maintainer:                    [EMAIL PROTECTED]

max_doc_size:              5000000

max_head_length:          10000

max_hop_count:            10

start_url:                       http://www.domain.nl/index.php

user_agent:                   domain-digger

 

# htmerge

 

# htdump

 

# htload

 

# htfuzzy

endings_affix_file:           ${common_dir}/nederlands.aff

endings_dictionary:        ${common_dir}/nederlands.0

 

# htnotify

 

# htsearch

max_prefix_matches:     100

minimum_prefix_length:  2

no_excerpt_show_top:    true

nothing_found_file:         ${template_dir}/nomatch.html

prefix_match_character: *

search_algorithm:          exact:1 prefix:0.5 endings:0.1

search_results_footer:    ${template_dir}/footer.html

search_results_header:  ${template_dir}/header.html

syntax_error_file:           ${template_dir}/syntax.html

sort:                              score

template_map:               Long builtin-long builtin-long \

Short builtin-short builtin-short \

Website website ${template_dir}/website.html

template_name:             website

 
