On Wed, 23 Feb 2000, Geoff Hutchison wrote:
> Date: Wed, 23 Feb 2000 08:31:22 -0600
> From: Geoff Hutchison <[EMAIL PROTECTED]>
> To: J Kinsley <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Subject: Re: [htdig] ht://Dig 3.2.0b1 and 3.2.0b2-022000 Extremely Slooooow
Ok, I shall attempt to provide some hard numbers here showing the
index speed difference between 3.2.0b2-022000 and 3.1.2. First
though I will clear up the rpm confusion. When I installed the beta
series, I used RPM to build and install it. However, last spring
when I first installed 3.1.2, I did not use RPM and the binaries went
into /opt/www/bin. When installing the beta rpm, I moved the binary
location to /opt/www/sbin and since 3.1.2 was manually installed, RPM
did not remove those binaries. The first time I built the index two
days ago, I called htdig from the command line and the 3.1.2 binaries
were used instead of the betas. I did not realize this until trying
to determine why htsearch (3.1.2 version was overwritten by beta
version) failed to recognize the database. Although I had previously
installed ht://Dig, I had never used it due to disk space
limitations.
Anyway, on with the numbers....
Server:
Intel PII 233MHz
64MB SDRAM
Kernel 2.2.14
Customized RedHat 6.0-6.2
Apache 1.3.6
Archive:
44101 Files - 1290 Directories
Smallest: 190 B
Largest: 9.40 MB
Average: 30.20 KB
Total: 1.35 GB
NOTE: ht://Dig is running on the same physical host as the web server
it is indexing, so network bandwidth is not a factor here.
ht://Dig version: 3.1.2
htdig -l -s -v -c /etc/www/htdig/bti.conf > /tmp/htdig.log 2>&1
Index time: 01:52:00
Index size: 634MB wordlist
325MB documents
URLs indexed according to /tmp/htdig.log: 52100
(higher than the file total because the ?[MNSD]=[AD] sort links
were indexed for each directory)
CPU time: 00:39:00
RSS: unknown
htmerge -c /etc/www/htdig/bti.conf
Merge time: 00:42:00
Index size: 504MB wordlist.db
CPU time: unknown
RSS: unknown
Note: the above numbers are from my memory and thus are
close approximations.
ht://Dig version: 3.2.0b1
htdig -l -s -v -c /etc/www/htdig/bti.conf > /tmp/htdig.log 2>&1
Exited after 3 hours / ~2200 files to attempt to speed up
ht://Dig version: 3.2.0b2-022000
htdig -l -s -v -c /etc/www/htdig/bti.conf > /tmp/htdig.log 2>&1
Index time:
Start: Feb 23 05:07:18 EST 2000
Current: Feb 23 16:38:25 EST 2000
Est. End: Feb 24 10:00:00 EST 2000
URLs processed according to /tmp/htdig.log: 19111
CPU time: 00:52:42
RSS: 31MB
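
For anyone who wants to check the arithmetic behind these figures, here
is a minimal sketch. It assumes the beta run ends up processing the same
52100 URLs as the 3.1.2 run, and that the average rate so far holds
until the end; neither assumption comes from the report, and the
"Est. End" above may have been derived differently. A plain linear
projection lands somewhat later than the estimate given above.

```python
from datetime import datetime, timedelta

# Figures copied from the report above.
start = datetime(2000, 2, 23, 5, 7, 18)
current = datetime(2000, 2, 23, 16, 38, 25)
author_est_end = datetime(2000, 2, 24, 10, 0, 0)
urls_done = 19111
urls_total = 52100  # assumption: same URL count as the 3.1.2 run
old_index_time = timedelta(hours=1, minutes=52)  # 3.1.2 htdig run

elapsed = current - start
rate = urls_done / elapsed.total_seconds()  # average URLs per second

# Linear projection: assumes the average rate so far holds to the end.
linear_total = timedelta(seconds=urls_total / rate)
linear_end = start + linear_total

# Total run time implied by the estimated end time in the report,
# and how much longer that is than the 3.1.2 index time.
author_total = author_est_end - start
increase = author_total - old_index_time

print("elapsed so far:      ", elapsed)
print("average URLs/second: ", round(rate, 2))
print("linear projected end:", linear_end.replace(microsecond=0))
print("est. total run time: ", author_total)
print("increase over 3.1.2: ", increase)
```

The average works out to well under one URL per second, against
roughly seven to eight per second for the 3.1.2 run (52100 URLs in
01:52:00).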
<snip>
> Now, as far as the speed of indexing in 3.2.0b1 (and current
> snapshots), I probably need to make this a FAQ. Right now, it's
> probably not going to be faster than 3.1.x versions and is quite
> likely to be slow. We rewrote the whole layout of databases and in
> the process made quite a few trade-offs against the indexer.
Using my estimated end time above, we're looking at a 27 hour
increase in index time on ~50,000 URLs. I do not think this is what
you mean by 'a few trade-offs', so I am guessing it is a bug. Although
I do not fully understand how to detect memory leaks, I suspect that
is the problem. When I first start htdig, it indexes the first 1000
URLs in about 6 minutes and the RSS creeps up to around 18-19MB and
it starts to slow down.
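
One way to put numbers on the RSS growth would be to log the resident
size of the running htdig process next to a rough progress measure at a
fixed interval. The sketch below is only an illustration: it assumes a
Linux /proc filesystem that exposes a VmRSS line in /proc/<pid>/status,
and it uses the line count of the verbose log as a stand-in for
progress, which is a guess about how the log grows. The function names
are hypothetical and not part of ht://Dig.

```python
import time

def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like "VmRSS:     31744 kB"; the number is field 1.
            return int(line.split()[1])
    return None

def count_lines(path):
    """Count lines in the log file as a rough progress indicator."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def sample(pid, log_path, interval=60, samples=10):
    """Print timestamp, RSS in kB, and log line count at a fixed interval."""
    for _ in range(samples):
        with open(f"/proc/{pid}/status") as f:
            rss_kb = parse_vm_rss_kb(f.read())
        print(time.strftime("%H:%M:%S"), rss_kb, count_lines(log_path))
        time.sleep(interval)
```

Calling something like sample(1234, "/tmp/htdig.log"), with 1234
replaced by the real htdig pid, would give a table to plot. RSS that
keeps climbing while the line count flattens out would be consistent
with a leak, though it would not prove one.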
<snip>
> But the important thing to remember is that these are *betas*--we're
> looking for feedback. We'd love to have accurate performance and
> requirement feedback. The new database layout is probably going to
> require more disk space (especially if compression is off), but you
> won't need as much memory for htmerge. So hard numbers would be
> wonderful. This will help us target what needs improvement. Further,
> if anyone wants to help improve indexing performance, I'm sure we can
> come up with a list.
ht://Dig is just one of many bleeding edge packages I currently
have installed, so I'll do what I can to help solve the problems.
J. Kinsley