I suggest moving this discussion to the htdig3-dev list. Anyone who 
wants to follow it should feel free to do so.

At 5:20 PM -0500 2/23/00, J Kinsley wrote:
>NOTE: ht://Dig is running on the same physical host as the web server
>it is indexing, so network bandwidth is not a factor here.

First off, I'd suggest using local_urls, which you don't mention. 
That would certainly give you a rather significant speed boost, but I 
digress.
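
If memory serves, the htdig.conf setup looks roughly like this--the
hostname and document root below are just placeholders for your own
setup:

    local_urls: http://www.yourserver.com/=/usr/local/apache/htdocs/

With that in place, htdig reads the files straight off the disk
instead of fetching every page through the HTTP server.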

>Using my estimated end time above, we're looking at a 27 hour
>increase in index time on ~50,000 URL's.  I do not think this is what
>you mean by 'a few trade-offs', so I am guessing it is a bug.  Although
>I do not fully understand how to detect memory leaks, I suspect that is
>the problem.  When I first start htdig, it indexes the first 1000
>URL's in about 6 minutes and the RSS creeps up to around 18-19MB and
>it starts to slow down.

It isn't a bug. I also doubt it's an actual leak--we've run the 
source through Purify a few times. There may still be leaks--we 
clearly haven't hit all the code in testing--but I don't think that's 
what you're seeing. (My favorite quick-and-dirty memory debugger is 
called LeakTracer; you can find it on Freshmeat.net.)

The explanation is going to be a long one, so don't say I didn't warn 
you. I'd guess the biggest performance hit in indexing right now 
comes from a deliberate trade-off. And when I say "trade-off," I mean 
it in the same sense that a compression algorithm might take much 
longer to compress than to decompress: we've shifted work from one 
side of the system to the other.

Previous versions stored the document DB keyed by URL. This was great 
for indexing: you could just check whether a given URL existed and 
retrieve it easily. The snag comes with the word database, which 
stored words by DocID. So when doing a search, htsearch had to go 
look up the URLs in a DocID->URL index (db.docs.index). This is really 
silly--we'd much rather have htdig do the work and have htsearch 
fly--so 3.2 keys the document DB by DocID as well, and htsearch 
doesn't do any additional lookups.
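
To make that concrete, here's a toy sketch--plain std::map stand-ins,
not the actual ht://Dig classes or file formats--of what htsearch has
to do per match under the old layout versus the new one:

    #include <map>
    #include <string>

    struct DocRecord { std::string title; /* excerpt, size, etc. */ };

    // 3.1-style layout: the document DB is keyed by URL, so htsearch
    // needs an extra DocID -> URL hop through db.docs.index.
    std::map<int, std::string>       docsIndex;   // DocID -> URL
    std::map<std::string, DocRecord> docdbByURL;  // URL   -> record

    // 3.2-style layout: the document DB is keyed by DocID.
    std::map<int, DocRecord>         docdbByID;   // DocID -> record

    DocRecord lookup_3_1(int docID) {
        // two lookups per matching document
        return docdbByURL[docsIndex[docID]];
    }

    DocRecord lookup_3_2(int docID) {
        // one lookup per matching document
        return docdbByID[docID];
    }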

But now htdig is stuck: as it stands, it looks up the DocID in a 
URL->DocID database every time it wants to retrieve a document or 
check whether a document is already in the database. This is bad. 
(Remember, that lookup gets slower as the number of documents grows.)
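
Here's the flip side of the same sketch--again just illustrative
std::maps, not the real classes or file names--showing the extra hop
htdig now pays on every "have we seen this URL?" check:

    #include <map>
    #include <string>

    struct DocRecord { std::string title; /* ... */ };

    std::map<std::string, int> urlToID;    // URL   -> DocID index
    std::map<int, DocRecord>   docdbByID;  // DocID -> record (3.2)

    // 3.1: the document DB itself was URL-keyed, so this was one lookup.
    bool seen_3_1(const std::map<std::string, DocRecord>& docdbByURL,
                  const std::string& url) {
        return docdbByURL.count(url) != 0;
    }

    // 3.2: first URL -> DocID, then DocID -> record. htdig pays this
    // on every document it retrieves or re-checks, and the lookups
    // only get slower as the databases grow.
    bool seen_3_2(const std::string& url) {
        std::map<std::string, int>::const_iterator it = urlToID.find(url);
        return it != urlToID.end() && docdbByID.count(it->second) != 0;
    }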

There's also the issue discussed a while ago about the "Need2Get" 
list of URLs. It's big because it's a full hash table of all the URLs 
we've visited, and since it's a hash, it carries a lot of free space 
on top of that. If you don't have the memory for it, you're going to 
start paging.
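
As a purely back-of-the-envelope illustration (guesses, not
measurements): at, say, 100 bytes per stored URL including string and
bucket overhead, the ~50,000 URLs you mention come to about 5 MB, and
a hash table kept half empty pushes that toward 10 MB--a noticeable
chunk on top of the 18-19 MB RSS you're already seeing.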

Look, I can point to lots of places throughout the code where there 
are really bad speed hits. The main developers right now are all 
volunteers and we all have our hands full--if you'd like to help with 
optimizations, please do!

-Geoff

