On Mon, 1 Oct 2001, Gilles Detillieux wrote:

> Date: Mon, 1 Oct 2001 17:05:58 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: Joe R. Jah <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
> 
> According to Joe R. Jah:
> > Because of the size limitation of this mailing list the message was
> > returned.  I have placed the attachments on the patch site:
> > 
> >     ftp://ftp.ccsf.org/htdig-patches/Bench/
> ...
> > Attached are basic block profiles of htdig-3.1.6-092301 with Armstrong
> > patch and Geoff's version of the patch, bb.out-Andy.gz and bb.out-Geoff.gz
> > respectively.  Those are huge files and have very little difference in
> > most blocks except in regex.d where Geoff's version numbers break the
> > scale;)  To save your time I have attached regex.d blocks also:
> > bb.out.regex-Andy.gz and bb.out.regex-Geoff.gz.
> 
> The numbers don't really make sense in either case.  In
> bb.out.regex-Geoff.gz it doesn't make sense that the counts would be
> that high.  In bb.out.regex-Andy.gz the counts are lower, but why is
> the regex code being called at all?  Andy's patch uses rx calls, not
> regex calls.  Besides, I thought on BSD systems you weren't supposed
> to use the htlib/regex.c code because of conflicts with your libraries.
> Shouldn't you be using the C library's regex code?  Maybe the automatic
> configure test isn't working correctly.  Try the manual solution as
> for older htdig versions, and see if that clears up some of these weird
> regex-related problems, in both 3.1.6 and 3.2.0b4 snapshots.  If that
> helps, we'll need to work out a better test.

Yessss;)  That helped a lot indeed:
_________________3.1.6-093001 + ssl.4__________________
htdig:    Start digging:   Sun Sep 30 02:27:48 PDT 2001
htmerge:  Start merging:   Sun Sep 30 03:56:51 PDT 2001  89 minutes;(
htmerge:  Total word count: 108939
htmerge:  Total documents: 7322
htmerge:  Total doc db size (in K): 117859
htnotify: Start notifying: Sun Sep 30 03:59:01 PDT 2001
htfuzzy:  Start fuzzying:  Sun Sep 30 03:59:09 PDT 2001
rundig:   end rundig:      Sun Sep 30 04:00:09 PDT 2001
____________3.1.6-093001 + ssl.4 & FAQ#5.14____________
htdig:    Start digging:   Mon Oct  1 16:15:46 PDT 2001
htmerge:  Start merging:   Mon Oct  1 16:59:05 PDT 2001  44 minutes;)
htmerge:  Total word count: 109171
htmerge:  Total documents: 7342
htmerge:  Total doc db size (in K): 118138
htnotify: Start notifying: Mon Oct  1 17:01:36 PDT 2001
htfuzzy:  Start fuzzying:  Mon Oct  1 17:01:49 PDT 2001
rundig:   end rundig:      Mon Oct  1 17:02:52 PDT 2001
_______________________________________________________

In pre-4.2 versions of BSDi, htdig would segfault without FAQ#5.14.
After upgrading to BSDi-4.2, that problem never occurred until this patch.
Thank you very much, Gilles; I have removed the files from the
htdig-patches/Bench folder because they were irrelevant.

> You misunderstand.  My tests above didn't involve Andy's or Geoff's code
> for url_rewrite_rules at all.  The 15% difference was solely attributable
> to the changes in htdig/HTML.cc, to use a different technique for
> parsing tag attributes.  The old code used a StringMatch object to
> search for certain attributes, like href, src, etc., but the search
> could get thrown off by the existence of these words within attribute
> value strings in tags.  The new code instead creates a Configuration
> object for each tag, and uses the code for this class to Add all the
> attributes in the tag to this object.  This greatly simplifies the
> HTML parser, makes it easier to extend it to handle new tag attributes,
> and makes it more reliable.  It should NOT make it much more than 15%
> slower on ANY system, including yours.
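
To illustrate the difference Gilles describes, here is a rough sketch of
parsing every attribute of a tag into a name/value table before looking
anything up. This is not the code in htdig/HTML.cc: the function name
parseTagAttributes is hypothetical, and std::map merely stands in for
ht://Dig's Configuration class.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper, NOT ht://Dig code: split the inside of a tag such as
//   a title="see href=wrong.html" href="real.html"
// into lower-cased attribute names and their values.
std::map<std::string, std::string> parseTagAttributes(const std::string& tag) {
    std::map<std::string, std::string> attrs;
    const std::size_t n = tag.size();
    std::size_t i = 0;
    auto isSpace = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };

    // Skip the tag name itself.
    while (i < n && !isSpace(tag[i])) ++i;

    while (i < n) {
        while (i < n && isSpace(tag[i])) ++i;
        if (i >= n) break;

        // The attribute name runs up to '=' or whitespace.
        std::string name;
        while (i < n && tag[i] != '=' && !isSpace(tag[i])) {
            name += static_cast<char>(
                std::tolower(static_cast<unsigned char>(tag[i])));
            ++i;
        }
        while (i < n && isSpace(tag[i])) ++i;

        // The value is optional and may be quoted or bare.
        std::string value;
        if (i < n && tag[i] == '=') {
            ++i;
            while (i < n && isSpace(tag[i])) ++i;
            if (i < n && (tag[i] == '"' || tag[i] == '\'')) {
                const char quote = tag[i++];
                while (i < n && tag[i] != quote) value += tag[i++];
                if (i < n) ++i;  // step over the closing quote
            } else {
                while (i < n && !isSpace(tag[i])) value += tag[i++];
            }
        }
        if (!name.empty()) attrs[name] = value;
    }
    return attrs;
}
```

With the example tag in the comment, a plain substring search for "href"
first hits the text inside the title value, whereas looking up "href" in
the parsed table yields real.html.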

Sorry; I thought you were still on the same subject.

> The problems with regex handling are a completely separate issue, and are
> not tied to the HTML parser in any way.  I do want to resolve this issue
> too, if we can ever get to the bottom of it.

I believe it is resolved, thanks to you.  I will try FAQ#5.14 on 3.2
snapshot as soon as I get a chance to clarify this point once and for all.
Is it possible to set a test in the configure program to take care of
FAQ#5.14 automatically?

> > > If you get a chance to run old and new snapshots of htdig with -vvv and
> > > compare the outputs, you may be able to track down the source of the
> > > different URLs that are parsed in both cases.  To do this in a meaningful
> > > way, though, you'll need to try a static site, or perhaps a snapshot of
> > > your site, so you don't get thrown off in your comparisons by updates
> > > to the site between digs.
> > 
> > Yes, I have kept that snapshot for a happy occasion like that;)
> 
> Keep me posted if you get a chance to run this test with both snapshots.
> I can't think of any changes to 3.1.6 that would cause it to lose valid
> URLs, but it would be good to confirm without a doubt that the lost URLs
> on your system are all indeed URLs that should not have been indexed.

In the happy hour;)))

> You're right, I was forgetting that URLs can appear in the body text
> of a document, and therefore in the excerpt field of db.docs.  This
> does suggest that the change to URL.cc on Aug. 31 would account for
> almost half of the missing URLs.  Presumably a grep of "[^:]//" in a
> db.docs from a recent 3.1.6 snapshot wouldn't find any matches, unless
> the double slashes are in URLs within the body text of documents.
> 
> So, I guess the next question is do you have any documents that have
> meta robots tags followed by script tags?

Yes; most of the 88 documents in my previous post.
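
For reference, the "[^:]//" pattern from the grep suggested above matches
a double slash preceded by any character other than a colon, so the "//"
in "http://" is skipped. A minimal sketch of the same check with
std::regex follows; the function name hasStrayDoubleSlash and the
example.com URLs in the comments are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the text contains "//" preceded by a
// character other than ':', i.e. a double slash that is not the one
// following the URL scheme.
bool hasStrayDoubleSlash(const std::string& text) {
    static const std::regex pattern("[^:]//");
    return std::regex_search(text, pattern);
}

// Illustrative inputs (example.com is a placeholder host):
//   hasStrayDoubleSlash("http://www.example.com/dir//page.html")  is true
//   hasStrayDoubleSlash("http://www.example.com/dir/page.html")   is false
```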

> No, I left for vacation in late July and got back in mid August.  Geoff
> started making 3.1.6 snapshots in July, but the process failed Aug. 26
> because the SourceForge project FTP server ran out of disk space.

You are right.

> > > Yes, that last comparison is the one I wanted to see.  An almost 3-fold
> > > increase in indexing time is dramatic.  A comparison of profiling output
> > > for these two builds would really be informative.
> > 
> > Right you are;)
> 
> I'd like to see the results after you take out the htlib regex code.
> Could you run them through gprof the way you did a few months ago with
> 3.2.0b*?

I must have changed the options; I will try to remember what options I used
then;)

Regards,

Joe
-- 
     _/   _/_/_/       _/              ____________    __o
     _/   _/   _/      _/         ______________     _-\<,_
 _/  _/   _/_/_/   _/  _/                     ......(_)/ (_)
  _/_/ oe _/   _/.  _/_/ ah        [EMAIL PROTECTED]


_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
