I suggest moving this discussion to the htdig3-dev list. Anyone who wants to follow it should feel free to do so.

At 5:20 PM -0500 2/23/00, J Kinsley wrote:
>NOTE: ht://Dig is running on the same physical host as the web server
>it is indexing, so network bandwidth is not a factor here.

First off, I'd suggest using local_urls, which you don't mention. That would certainly give you a rather significant speed boost, but I digress.

>Using my estimated end time above, we're looking at a 27 hour
>increase in index time on ~50,000 URLs. I do not think this is what you
>mean by 'a few trade-offs', so I am guessing it is a bug. Although I
>do not fully understand how to detect memory leaks, I suspect that is
>the problem. When I first start htdig, it indexes the first 1000
>URLs in about 6 minutes and the RSS creeps up to around 18-19MB and
>it starts to slow down.

It isn't a bug. I also doubt it's an actual leak--we've run the source through Purify a few times. Now, there may still be leaks--we clearly haven't hit all the code in testing--but I don't think it's that. (My favorite quick-and-dirty memory debugger is called LeakTracer; you can find it on Freshmeat.net.)

The explanation is going to be a long one, so don't say I didn't warn you.

I'd guess the biggest performance hit in indexing right now comes from a trade-off. And when I say "trade-off," I mean it in the same sense that a compression algorithm might take much longer to compress than to decompress.

Previous versions stored the document DB keyed by URL. This was great for indexing: you'd just check to see if a given URL existed and could retrieve it easily. The snag comes with the word database, which stores words by DocID. So when doing a search, htsearch had to go look up the URLs in a DocID->URL index (db.docs.index). This is really silly--we'd much rather have htdig do the work and have htsearch fly--so 3.2 keys the document DB by DocID as well, and htsearch doesn't do any additional lookups.

But now htdig is stuck: as it stands, it looks up the DocID in a URL->DocID database every time it wants to retrieve a document or check to see if the document is in the database. This is bad. (Remember, the lookup is going to slow down as we get more documents.)

There's also the matter discussed a while ago of the "Need2Get" list of URLs. It's big, since it's a full hash table of all the URLs we've visited. Since it's a hash, it has a lot of free space. So if you don't have the memory for that, you're going to start paging.

Look, I can point to lots of places throughout the code where there are really bad speed hits. The main developers right now are all volunteers and we all have our hands full--if you'd like to help with optimizations, please do!

-Geoff
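
For anyone following up on the local_urls suggestion above: the attribute maps URL prefixes to local filesystem paths so that htdig can read documents straight off disk instead of fetching them over HTTP from the co-located server. Here is a minimal sketch of a configuration fragment; the hostname and document root are placeholders rather than anything from the original report, so check the attribute documentation for your version before relying on the exact syntax.

    # htdig.conf fragment -- hypothetical host and document root, sketch only
    local_urls:  http://www.example.com/=/usr/local/apache/htdocs/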
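
A minimal sketch of the keying trade-off described above, with in-memory std::map objects standing in for the on-disk Berkeley DB files; the structure and function names are illustrative, not the actual ht://Dig classes.

    // Sketch only: std::map stands in for the on-disk Berkeley DB files.
    #include <iostream>
    #include <map>
    #include <string>

    struct DocumentRef { std::string url; std::string title; };

    // Old layout: document DB keyed by URL.
    //   htdig:    one lookup to see whether a URL is already indexed
    //   htsearch: word DB yields DocIDs, so it needs a DocID->URL index (db.docs.index)
    std::map<std::string, DocumentRef> docs_by_url;
    std::map<int, std::string>         docid_to_url;

    // 3.2 layout: document DB keyed by DocID.
    //   htsearch: a DocID from the word DB goes straight to the document
    //   htdig:    must consult a URL->DocID index on every "have we seen this?" check
    std::map<int, DocumentRef>         docs_by_id;
    std::map<std::string, int>         url_to_docid;

    bool AlreadyIndexedOld(const std::string &url) {
        return docs_by_url.count(url) != 0;            // single lookup
    }

    bool AlreadyIndexed32(const std::string &url) {
        auto it = url_to_docid.find(url);              // the extra lookup htdig now pays
        return it != url_to_docid.end() && docs_by_id.count(it->second) != 0;
    }

    DocumentRef LookupForSearch32(int docid) {
        return docs_by_id.at(docid);                   // htsearch: no URL translation needed
    }

    int main() {
        // One hypothetical document, inserted into both layouts for comparison.
        docs_by_url["http://www.example.com/"] = {"http://www.example.com/", "Example"};
        docid_to_url[1] = "http://www.example.com/";
        docs_by_id[1]   = {"http://www.example.com/", "Example"};
        url_to_docid["http://www.example.com/"] = 1;

        std::cout << AlreadyIndexedOld("http://www.example.com/") << " "
                  << AlreadyIndexed32("http://www.example.com/") << " "
                  << LookupForSearch32(1).title << "\n";   // prints: 1 1 Example
        return 0;
    }

Under the old layout htdig pays one lookup per URL; under the 3.2 layout htsearch goes straight from DocID to document, but htdig pays an extra URL->DocID lookup on every visited-check, which is the per-document cost being described.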
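
And a rough back-of-the-envelope for the in-memory table of visited URLs: every figure below except the ~50,000 URL count from the report is an assumption, not a measurement of ht://Dig itself.

    // Rough memory estimate for a hash table of visited URLs; sketch only.
    #include <cstdio>

    int main() {
        const double urls          = 50000;  // figure from the original report
        const double avg_url_bytes = 60;     // assumed average URL length
        const double per_entry     = 32;     // assumed bucket/pointer overhead per entry
        const double load_factor   = 0.5;    // assume the table is kept about half full

        const double bytes = urls * (avg_url_bytes + per_entry) / load_factor;
        std::printf("~%.1f MB just to remember which URLs have been visited\n",
                    bytes / (1024 * 1024));
        return 0;
    }

Under these guesses the table alone comes out to roughly 9 MB before counting anything else htdig keeps in memory; a half-empty hash table is exactly the kind of free space the message points at.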
