Hello all,

I've recently had some success tackling some problems that I had created for
myself, and my guess is that one or two of you may benefit from my suffering
<grin>. I had a fairly large development project that required a search
engine capable of handling approximately 150,000 internal pages and 10,000+
pages on thousands of external web servers. Despite being warned by the FAQ
that htdig wasn't built for this task, I couldn't resist giving it a go,
especially since the source code for the entire engine was written in C++ and
readily available.

My first problem was that I was encapsulating the output of htsearch within my
own cgi engine, a group collaboration tool written entirely in C, which was
driving this particular online community. So I had to comment out the output
of the "Content-type: text/html" string in the relevant locations within the
htsearch source (Display.cc?). Whew, that was easy.

My second problem was related to the first. Since my cgi engine needed its own
query string to work its magic, and indirectly controlled htsearch, I had to
modify the htsearch source so that it didn't detect the presence of the query
string (I renamed REQUEST_METHOD in htlib/cgi.cc). In this manner, I could
then pass the goods directly as arguments to an external call to htsearch
during the execution of my cgi (e.g. htsearch -c /my/custom.conf
"page=2&words=woah&cmd=command&searchtype=mysearch"). I then had to modify the
portion of Display.cc that built the hrefs for the next, previous and page
number links so that my own special query string name/value pairs were
piggy-backed. Whew, that was pretty easy too. After whipping up my own set of
html templates for htsearch (simple ones really, since the interface framework
was actually being provided by my group collaboration engine), I was ready to
start the real fun stuff.

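In case it helps anyone wrapping htsearch the same way, the hand-off from my
cgi boils down to something like the sketch below. The paths, the helper name
and the lack of error handling are simplifications, so treat it as an
illustration rather than a copy of my real code:

    #include <stdio.h>

    /*
     * Rough sketch: run the patched htsearch as an external command and
     * copy its output into the page my own cgi engine is building.
     * /usr/local/bin/htsearch and /my/custom.conf are placeholders.
     */
    static void run_htsearch(const char *query)
    {
        char cmd[2048];
        char line[4096];
        FILE *fp;

        /* htsearch -c <config> "<query string>" -- the query goes in as a
         * command line argument because REQUEST_METHOD was renamed in
         * htlib/cgi.cc, so htsearch no longer grabs it from the CGI
         * environment.  Real code should sanitize the query before
         * handing it to a shell. */
        snprintf(cmd, sizeof(cmd),
                 "/usr/local/bin/htsearch -c /my/custom.conf \"%s\"", query);

        fp = popen(cmd, "r");
        if (fp == NULL)
            return;

        /* With the Content-type output commented out of Display.cc, the
         * results can be dropped straight into my existing page. */
        while (fgets(line, sizeof(line), fp) != NULL)
            fputs(line, stdout);

        pclose(fp);
    }

A call like run_htsearch("page=2&words=woah&cmd=command&searchtype=mysearch")
then produces just the formatted results, ready to be wrapped by the
collaboration engine's own header and footer.
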
My third problem was figuring out how to hand 150,000 unique URLs to htdig
without having it spider the web site to find them. I already knew the URLs I
wanted it to index and, in fact, I didn't want it to go any further than the
URLs I specified. Luckily htdig lets you hand it a file of proper URLs to get
it going, so I wrote a program to create the list file for me. I then
specified a maximum hop count of zero in the htdig configuration file. Lest
some of you disbelievers think that htdig can't handle the big stuff, I can
testify that it did just fine when I force-fed it a fourteen-megabyte text
file containing no fewer than 149,974 unique URLs! It swallowed them up in
well under four hours on a puny little 256 MB PII-400 running Red Hat Linux
5.0 and Apache, using an insignificant amount of CPU time.

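For reference, the relevant part of my configuration ended up looking roughly
like the excerpt below. I'm quoting from memory, so double-check the attribute
names and the backquote file-inclusion syntax against the htdig documentation:

    # Pull the ~150,000 start URLs from the list file my program
    # generates (a value in backquotes tells htdig to read it from
    # that file):
    start_url:          `/my/url_list.txt`

    # Index only the URLs in the list -- don't follow links any further:
    max_hop_count:      0
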
My fourth problem was the 10,000+ pages on 2,000+ external web servers. Again,
I only wanted the pages I specified, so I built a program to create the list
and then force-fed it to htdig. This wasn't the problem. The problem was that
htdig would go to sleep during the indexing process and, seemingly, never wake
up. I ran it in debug mode and saw that it would eventually hit a web page on
a new server and stall. And stall. And stall. After about twenty or thirty
minutes htdig would finally time out and continue until it eventually hit
another web page on a dead server and stall again. When you're dealing with
2,000 web servers there are bound to be dozens of dead machines (likely in
direct correlation with the number of NT servers. Heh heh). I tried without
luck to find an explanation of why the "timeout: 20" (seconds) in my htdig
configuration file was being translated into 30 minutes. I spent an entire day
researching on the Net to uncover possible causes. The author of htdig
indicated in an earlier mailing list digest that he couldn't recreate the
problem and wasn't sure why the timeout setting wasn't working for some
people. This was NOT encouraging, but I'm stubborn, so I kept hammering away.
During a stall, I found that netstat indicated the htdig process owned a
socket connection stuck in the SYN_SENT state. I went searching for info on
that (I'm not an IP guru) and found some Linux kernel tweaking notes. I peeked
at the value in my /proc/sys/net/ipv4/tcp_syn_retries file and found "10". I
peeked at the value in my /proc/sys/net/ipv4/tcp_fin_timeout file and found
"180" seconds. Using my superior math skills (heh) I determined that 10
retries at 180 seconds each is 30 minutes, which was pretty close to how long
each htdig stall lasted. So I crossed my fingers, changed the timeout to 30
seconds and the number of retries to 2, and voila! The htdig index process
still stalled, but each stall took only a minute or so and the entire index
was built quite quickly.

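For anyone who wants to try the same thing, the change itself was just a
matter of writing the new values into those proc files as root (they revert at
the next reboot, so if they work out for you, put them in an rc script):

    echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
    echo 2  > /proc/sys/net/ipv4/tcp_syn_retries
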
I'd be interested in comments from any IP / Linux gurus regarding my
tcp_fin_timeout / tcp_syn_retries tweaking. Are 30 seconds and 2 retries too
limiting or dangerous for a production machine?

All the best,
Sean.

# Digital Spinner, Inc.
# Web Design, Development and Consulting.
# Phone: 802.948.2020
# Fax: 802.948.2749
# http://www.digitalspinner.com
