According to Joe R. Jah:
> Because of the size limitation of this mailing list the message was
> returned.  I have placed the attachments on the patch site:
> 
>       ftp://ftp.ccsf.org/htdig-patches/Bench/
...
> Attached are basic block profiles of htdig-3.1.6-092301 with the Armstrong
> patch and Geoff's version of the patch, bb.out-Andy.gz and bb.out-Geoff.gz
> respectively.  Those are huge files and have very little difference in
> most blocks except in regex.d where Geoff's version numbers break the
> scale;)  To save your time I have attached regex.d blocks also:
> bb.out.regex-Andy.gz and bb.out.regex-Geoff.gz.

The numbers don't really make sense in either case.  In
bb.out.regex-Geoff.gz it doesn't make sense that the counts would be
that high.  In bb.out.regex-Andy.gz the counts are lower, but why is
the regex code being called at all?  Andy's patch uses rx calls, not
regex calls.  Besides, I thought on BSD systems you weren't supposed
to use the htlib/regex.c code because of conflicts with your libraries.
Shouldn't you be using the C library's regex code?  Maybe the automatic
configure test isn't working correctly.  Try the manual solution as
for older htdig versions, and see if that clears up some of these weird
regex-related problems, in both 3.1.6 and 3.2.0b4 snapshots.  If that
helps, we'll need to work out a better test.
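
If it helps, here's a quick way to confirm which regex implementation
actually got linked in.  This is only a sketch: it assumes the freshly
built binary sits at htdig/htdig in your build tree, that nm is
available, and that the binary is dynamically linked (symbol names may
also carry a leading underscore on some BSD systems):

    # regcomp/regexec reported as defined ("T"/"t") means htlib/regex.c
    # was compiled in; reported as undefined ("U") means the C library's
    # regex is being used
    nm htdig/htdig | egrep 'regcomp|regexec'

If the bundled code shows up even though configure should have chosen
the system library, that points squarely at the configure test.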

> > OK, I've done some more testing on my site, with over 300 HTML documents,
> > and didn't notice any problems.  The biggest difference I've noticed is
> > that the changes to the HTML parser made it about 15% slower, but
> > it's much more robust so I think it's worth it.
> 
> Would you please elaborate on what you mean by "more robust"?  What are
> the specific problems with the Armstrong patch that warrant a performance
> hit of 15% in a limited, sanitized environment on your system, and 400%
> in a realistic environment on my system?  I do not think testing in a
> sanitized environment with a few hundred HTML documents is adequate to
> arrive at a realistic conclusion.

You misunderstand.  My tests above didn't involve Andy's or Geoff's code
for url_rewrite_rules at all.  The 15% difference was solely attributable
to the changes in htdig/HTML.cc, which use a different technique for
parsing tag attributes.  The old code used a StringMatch object to search
for certain attributes, like href, src, etc., but that search could get
thrown off when those words appeared inside attribute value strings in
tags (e.g. an "href" buried in an alt="..." value).  The new code instead
creates a Configuration object for each tag and uses that class's Add
method to load all of the tag's attributes into the object.  This greatly
simplifies the HTML parser, makes it easier to extend to handle new tag
attributes, and makes it more reliable.  It should NOT make it much more
than 15% slower on ANY system, including yours.

The problems with regex handling are a completely separate issue, and are
not tied to the HTML parser in any way.  I do want to resolve this issue
too, if we can ever get to the bottom of it.

> > I can think of two changes that might account for getting fewer documents
> > indexed in the post-Aug 29 snapshots:
> > 
> >     * htlib/URL.cc (URL): Fixed to call normalizePath() even if URL
> >     is relative but with absolute path. Should fix bug #408586.
> > 
> > and
> > 
> >     * htdig/HTML.h, htdig/HTML.cc (HTML, parse, do_tag): Fixed buggy
> >     handling of nested tags that independently turn off indexing, so
> >     </script> doesn't cancel <meta name=robots ...> tag. Add handling
> >     of <noindex follow> tag.
> > 
> > both on Aug. 31.  The URL class change will get rid of some double slashes
> > that were previously missed, which can reduce the number of duplicates.
> > The HTML class change may prevent the parser from following links in
> > documents that have meta robots tags, i.e. links that it wasn't supposed
> > to follow.
> > 
> > If you get a chance to run old and new snapshots of htdig with -vvv and
> > compare the outputs, you may be able to track down the source of the
> > different URLs that are parsed in both cases.  To do this in a meaningful
> > way, though, you'll need to try a static site, or perhaps a snapshot of
> > your site, so you don't get thrown off in your comparisons by updates
> > to the site between digs.
> 
> Yes, I have kept that snapshot for a happy occasion like that;)

Keep me posted if you get a chance to run this test with both snapshots.
I can't think of any changes to 3.1.6 that would cause it to lose valid
URLs, but it would be good to confirm without a doubt that the lost URLs
on your system are all indeed URLs that should not have been indexed.

> > If you don't have meta robots tags on your site, though, it's almost
> > certainly going to be the URL class that accounts for the differences.
> > A quick test would be to run htdig -t with an old snapshot, then grep for
> > "http://.*//"; in db.docs.
> 
> That grep would give me a great many hits, where multiple URLs are on
> the same line; "[^:]//" gives more accurate results:
> 
>       grep -c "http://.*//"   db.docs  =  537
>       grep -c "[^:]//"        db.docs  =   88
> 
> There still are around 100 documents unaccounted for;-/

You're right, I was forgetting that URLs can appear in the body text of
a document, and therefore in the excerpt field of db.docs.  This does
suggest that the change to URL.cc on Aug. 31 would account for almost
half of the missing URLs.  Presumably a grep of "[^:]//" in a db.docs
from a recent 3.1.6 snapshot wouldn't find any matches, unless the
double slashes are in URLs within the body text of documents.
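
If you want to pin down exactly which documents went missing, rather
than just count them, one rough approach (only a sketch: it assumes the
document's own URL is the first tab-separated field on each db.docs
line, so please check a line of your dumps before trusting it) is to
compare the URL lists from the two runs:

    # db.docs.old and db.docs.new stand for the -t dumps from the old
    # and new snapshots; cut's default delimiter is a tab
    cut -f1 db.docs.old | sort > urls.old
    cut -f1 db.docs.new | sort > urls.new
    diff urls.old urls.new

The URLs that appear only in the old list are the ones to account for.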

So, I guess the next question is: do you have any documents with meta
robots tags followed by script tags?
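
One rough way to check, if your pages can be grepped as files on disk
(this is just a sketch; it assumes the whole meta tag sits on one line
and that the pages are plain .html files in one directory):

    # list files that contain a meta robots tag, then narrow that list
    # down to the ones that also contain a script tag
    grep -li 'meta[^>]*robots' *.html | xargs grep -li '<script'

Any file that turns up is a candidate for the old nested-tag bug.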

> > I just answered my own question here.  The 082601 snapshot was the truncated
> > one, so I take it that 082901 was a manual snapshot.
> 
> That was the very first 3.1.6 snapshot, right after you left for vacation.  
> I believe it was a manual snapshot.

No, I left for vacation in late July and got back in mid August.  Geoff
started making 3.1.6 snapshots in July, but the process failed Aug. 26
because the SourceForge project FTP server ran out of disk space.

> > Yes, that last comparison is the one I wanted to see.  An almost 3-fold
> > increase in indexing time is dramatic.  A comparison of profiling output
> > for these two builds would really be informative.
> 
> Right you are;)

I'd like to see the results after you take out the htlib regex code.
Could you run them through gprof the way you did a few months ago with
3.2.0b*?
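
In case it saves you some digging, here's roughly the procedure I have
in mind.  It's only a sketch, assuming gcc/gprof and the usual
build-tree location of the binary; substitute your own configure
options and config file path:

    # rebuild with profiling enabled
    CFLAGS='-g -pg' CXXFLAGS='-g -pg' ./configure
    make clean && make

    # run the dig as usual; gmon.out lands in the current directory
    htdig/htdig -i -c /path/to/htdig.conf

    # turn the raw counts into a readable flat and call-graph profile
    gprof htdig/htdig gmon.out > htdig-3.1.6.prof

Doing the same run with and without the htlib regex code should make it
obvious where the extra time is going.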

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
