The message was returned because it exceeded this mailing list's size
limit, so I have placed the attachments on the patch site:
ftp://ftp.ccsf.org/htdig-patches/Bench/
On Fri, 28 Sep 2001, Gilles Detillieux wrote:
> Date: Fri, 28 Sep 2001 11:27:03 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: Joe R. Jah <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
>
> OK, let's get any 3.1.6 problems nailed down first, then when it's out we
> can hopefully figure out all the strangeness with 3.2.0b4. Let me know
> what, if anything, comes of profiling in 3.1.5.
Attached are basic block profiles of htdig-3.1.6-092301 with the Armstrong
patch and with Geoff's version of the patch, bb.out-Andy.gz and
bb.out-Geoff.gz respectively. Those are huge files and differ very little in
most blocks, except in regex.d, where the numbers for Geoff's version break
the scale;) To save you time I have also attached the regex.d blocks:
bb.out.regex-Andy.gz and bb.out.regex-Geoff.gz.
> OK, I've done some more testing on my site, with over 300 HTML documents,
> and didn't notice any problems. The biggest difference I've noticed is
> that the changes to the HTML parser made it about 15% slower, but it's
> much more robust so I think it's worth it.
Would you please elaborate on what you mean by "more robust"? What are
the specific problems with the Armstrong patch that warrant a performance
hit of 15% in a limited, sanitized environment on your system, and 400% in a
realistic environment on mine? I do not think testing in a sanitized
environment with a few hundred HTML documents is adequate to arrive at a
realistic conclusion.
> I can think of two changes that might account for getting fewer documents
> indexed in the post-Aug 29 snapshots:
>
> * htlib/URL.cc (URL): Fixed to call normalizePath() even if URL
> is relative but with absolute path. Should fix bug #408586.
>
> and
>
> * htdig/HTML.h, htdig/HTML.cc (HTML, parse, do_tag): Fixed buggy
> handling of nested tags that independently turn off indexing, so
> </script> doesn't cancel <meta name=robots ...> tag. Add handling
> of <noindex follow> tag.
>
> both on Aug. 31. The URL class change will get rid of some double slashes
> that were previously missed, which can reduce the number of duplicates.
> The HTML class change may prevent the parser from following links in
> documents that have meta robots tags, i.e. that it wasn't supposed
> to follow.
>
> If you get a chance to run old and new snapshots of htdig with -vvv and
> compare the outputs, you may be able to track down the source of the
> different URLs that are parsed in both cases. To do this in a meaningful
> way, though, you'll need to try a static site, or perhaps a snapshot of
> your site, so you don't get thrown off in your comparisons by updates
> to the site between digs.
Yes, I have kept that snapshot for a happy occasion like this;)
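For the comparison itself, something along these lines might do. It is only
a rough sketch: it assumes the two -vvv outputs have been saved as text and
that pulling every http:// token out of them is good enough. The sample log
lines, the URL_RE pattern and the function names are made up for
illustration and are not htdig's actual output format.

```python
import re

# Assumption: any whitespace-delimited http:// token in the log is a URL.
URL_RE = re.compile(r"http://[^\s\"'<>]+")


def urls_in_log(text):
    """Return the set of distinct URLs mentioned anywhere in a log."""
    return set(URL_RE.findall(text))


def compare_runs(old_text, new_text):
    """Return (URLs only in the old run, URLs only in the new run)."""
    old_urls = urls_in_log(old_text)
    new_urls = urls_in_log(new_text)
    return sorted(old_urls - new_urls), sorted(new_urls - old_urls)


# Made-up log excerpts standing in for the output of the two snapshots.
old_log = "seen http://example.org/a.html\nseen http://example.org//b.html\n"
new_log = "seen http://example.org/a.html\nseen http://example.org/b.html\n"

only_old, only_new = compare_runs(old_log, new_log)
print("only in old run:", only_old)
print("only in new run:", only_new)
```

Using sets means the order in which the two digs visit documents does not
matter, only which URLs turn up in one run and not the other.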
> If you don't have meta robots tags on your site, though, it's almost
> certainly going to be the URL class that accounts for the differences.
> A quick test would be to run htdig -t with an old snapshot, then grep for
> "http://.*//" in db.docs.
That grep gives me a great many hits, because it also matches any line that
holds more than one URL; "[^:]//" gives more accurate results:
grep -c "http://.*//" db.docs = 537
grep -c "[^:]//" db.docs = 88
There are still around 100 documents unaccounted for;-/
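To show why the two patterns disagree, here is a small sketch with made-up
lines. The sample data is hypothetical and does not reproduce the real
db.docs layout; it only assumes that one line can hold several URLs.

```python
import re

# Hypothetical lines; not the real db.docs record format.
sample_lines = [
    "http://example.org/a.html http://example.org/b.html",  # two clean URLs
    "http://example.org//c.html",                            # real double slash
    "http://example.org/d.html",                             # one clean URL
    "http://example.org/x/ http://example.org/y//z.html",    # real double slash
]

loose = re.compile(r"http://.*//")  # pattern from the quoted suggestion
strict = re.compile(r"[^:]//")      # double slash not preceded by a colon


def count_matching_lines(pattern, lines):
    """Count lines with at least one match, as grep -c does."""
    return sum(1 for line in lines if pattern.search(line))


loose_count = count_matching_lines(loose, sample_lines)
strict_count = count_matching_lines(strict, sample_lines)
print(loose_count, strict_count)
```

The first line has no double slash in any path, yet the loose pattern
matches it, because the "//" of the second "http://" satisfies the trailing
"//". The strict pattern skips it, since every "//" there follows a colon.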
> I just answered my own question here. The 082601 snapshot was the truncated
> one, so I take it that 082901 was a manual snapshot.
That was the very first 3.1.6 snapshot, right after you left for vacation.
I believe it was a manual snapshot.
> Yes, that last comparison is the one I wanted to see. An almost 3-fold
> increase in indexing time is dramatic. A comparison of profiling output
> for these two builds would really be informative.
Right you are;)
Regards,
Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah [EMAIL PROTECTED]