On Mon, 1 Oct 2001, Gilles Detillieux wrote:
> Date: Mon, 1 Oct 2001 17:05:58 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: Joe R. Jah <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
>
> According to Joe R. Jah:
> > Because of the size limitation of this mailing list the message was
> > returned. I have placed the attachments on the patch site:
> >
> > ftp://ftp.ccsf.org/htdig-patches/Bench/
> ...
> > Attached are basic block profiles of htdig-3.1.6-092301 with Armstrong
> > patch and Geoff's version of the patch, bb.out-Andy.gz and bb.out-Geoff.gz
> > respectively. Those are huge files and have very little difference in
> > most blocks except in regex.d where Geoff's version numbers break the
> > scale;) To save your time I have attached regex.d blocks also:
> > bb.out.regex-Andy.gz and bb.out.regex-Geoff.gz.
>
> The numbers don't really make sense in either case. In
> bb.out.regex-Geoff.gz it doesn't make sense that the counts would be
> that high. In bb.out.regex-Andy.gz the counts are lower, but why is
> the regex code being called at all? Andy's patch uses rx calls, not
> regex calls. Besides, I thought on BSD systems you weren't supposed
> to use the htlib/regex.c code because of conflicts with your libraries.
> Shouldn't you be using the C library's regex code? Maybe the automatic
> configure test isn't working correctly. Try the manual solution as
> for older htdig versions, and see if that clears up some of these weird
> regex-related problems, in both 3.1.6 and 3.2.0b4 snapshots. If that
> helps, we'll need to work out a better test.
Yessss;) That helped a lot indeed:
_________________3.1.6-093001 + ssl.4__________________
htdig: Start digging: Sun Sep 30 02:27:48 PDT 2001
htmerge: Start merging: Sun Sep 30 03:56:51 PDT 2001 89 minutes;(
htmerge: Total word count: 108939
htmerge: Total documents: 7322
htmerge: Total doc db size (in K): 117859
htnotify: Start notifying: Sun Sep 30 03:59:01 PDT 2001
htfuzzy: Start fuzzying: Sun Sep 30 03:59:09 PDT 2001
rundig: end rundig: Sun Sep 30 04:00:09 PDT 2001
____________3.1.6-093001 + ssl.4 & FAQ#5.14____________
htdig: Start digging: Mon Oct 1 16:15:46 PDT 2001
htmerge: Start merging: Mon Oct 1 16:59:05 PDT 2001 44 minutes;)
htmerge: Total word count: 109171
htmerge: Total documents: 7342
htmerge: Total doc db size (in K): 118138
htnotify: Start notifying: Mon Oct 1 17:01:36 PDT 2001
htfuzzy: Start fuzzying: Mon Oct 1 17:01:49 PDT 2001
rundig: end rundig: Mon Oct 1 17:02:52 PDT 2001
_______________________________________________________
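The dig durations noted next to the two "Start merging" lines can be checked from the timestamps themselves. The following is a minimal sketch; the helper names are made up for illustration, and it assumes both stamps of a run are in the same time zone, so the zone name is simply dropped.

```python
from datetime import datetime

def parse_stamp(stamp):
    """Parse a stamp such as 'Sun Sep 30 02:27:48 PDT 2001'.

    The weekday and the zone name are discarded; this is only safe
    because both stamps being compared share the same zone.
    """
    _weekday, month, day, clock, _zone, year = stamp.split()
    return datetime.strptime(f"{month} {day} {clock} {year}",
                             "%b %d %H:%M:%S %Y")

def dig_seconds(dig_start, merge_start):
    """Seconds between 'Start digging' and 'Start merging'."""
    delta = parse_stamp(merge_start) - parse_stamp(dig_start)
    return int(delta.total_seconds())

before = dig_seconds("Sun Sep 30 02:27:48 PDT 2001",
                     "Sun Sep 30 03:56:51 PDT 2001")
after = dig_seconds("Mon Oct 1 16:15:46 PDT 2001",
                    "Mon Oct 1 16:59:05 PDT 2001")
print(before // 60, "min", before % 60, "s")
print(after // 60, "min", after % 60, "s")
```

By this arithmetic the first dig took 89 minutes 3 seconds and the second 43 minutes 19 seconds, so the dig phase takes a little under half the time with FAQ#5.14 applied.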
In pre-4.2 versions of BSDi, htdig would segfault unless FAQ#5.14 was applied.
After upgrading to BSDi-4.2 that problem never occurred until this patch.
Thank you very much Gilles; I have removed the files from the
htdig-patches/Bench folder because they were irrelevant.
> You misunderstand. My tests above didn't involve Andy's or Geoff's code
> for url_rewrite_rules at all. The 15% difference was solely attributable
> to the changes in htdig/HTML.cc, to use a different technique for
> parsing tag attributes. The old code used a StringMatch object to
> search for certain attributes, like href, src, etc., but the search
> could get thrown off by the existence of these words within attribute
> value strings in tags. The new code instead creates a Configuration
> object for each tag, and uses the code for this class to Add all the
> attributes in the tag to this object. This greatly simplifies the
> HTML parser, makes it easier to extend it to handle new tag attributes,
> and makes it more reliable. It should NOT make it much more than 15%
> slower on ANY system, including yours.
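The difference between the two parsing techniques described above can be illustrated with a small sketch. This is not htdig's code: `naive_has_attr` and `parse_tag` are hypothetical names, a plain dict stands in for the Configuration object, and the regular expression is a simplified assumption about attribute syntax.

```python
import re

# name, optionally followed by = and a quoted or unquoted value
ATTR_RE = re.compile(
    r'''([A-Za-z_:][-\w:.]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?'''
)
TAG_NAME_RE = re.compile(r'\s*<\s*([A-Za-z][\w:-]*)')

def naive_has_attr(tag, attr):
    """Old-style approach: look for the attribute word anywhere in the tag."""
    return attr in tag.lower()

def parse_tag(tag):
    """New-style approach: collect every attribute of the tag into a dict."""
    m = TAG_NAME_RE.match(tag)
    if m is None:
        return None, {}
    attrs = {}
    for name, value in ATTR_RE.findall(tag[m.end():]):
        # Strip matching quotes around the value, if any.
        if value[:1] in ('"', "'"):
            value = value[1:-1]
        attrs[name.lower()] = value
    return m.group(1).lower(), attrs

tag = '<img alt="see href=foo" src="a.png">'
print(naive_has_attr(tag, "href"))  # the word search is fooled by the alt text
print(parse_tag(tag))
```

Because quoted values are consumed whole, the word "href" inside the alt text never shows up as an attribute name, which is the failure mode of the plain word search.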
Sorry; I thought you were still on the same subject.
> The problems with regex handling are a completely separate issue, and are
> not tied to the HTML parser in any way. I do want to resolve this issue
> too, if we can ever get to the bottom of it.
I believe it is resolved, thanks to you. I will try FAQ#5.14 on a 3.2
snapshot as soon as I get a chance, to settle this point once and for all.
Is it possible to set a test in the configure program to take care of
FAQ#5.14 automatically?
> > > If you get a chance to run old and new snapshots of htdig with -vvv and
> > > compare the outputs, you may be able to track down the source of the
> > > different URLs that are parsed in both cases. To do this in a meaningful
> > > way, though, you'll need to try a static site, or perhaps a snapshot of
> > > your site, so you don't get thrown off in your comparisons by updates
> > > to the site between digs.
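The comparison suggested above could be sketched roughly as follows. The exact layout of htdig's -vvv output is not assumed here; the sketch only pulls anything that looks like an http or https URL out of each line, and the function names are hypothetical.

```python
import re

# Loose pattern for URLs embedded in arbitrary log lines (an assumption,
# not a description of htdig's actual log format).
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def urls_in(lines):
    """Collect the set of URLs mentioned anywhere in the given lines."""
    found = set()
    for line in lines:
        found.update(URL_RE.findall(line))
    return found

def compare_runs(old_lines, new_lines):
    """Return (URLs only in the old run, URLs only in the new run)."""
    old, new = urls_in(old_lines), urls_in(new_lines)
    return sorted(old - new), sorted(new - old)

old_log = ["pushing http://example.org/a.html", "pushing http://example.org//b.html"]
new_log = ["pushing http://example.org/a.html"]
only_old, only_new = compare_runs(old_log, new_log)
print(only_old, only_new)
```

In practice the two line sources would be the saved -vvv logs of the old and new snapshots, both run against the same static copy of the site.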
> >
> > Yes, I have kept that snapshot for a happy occasion like that;)
>
> Keep me posted if you get a chance to run this test with both snapshots.
> I can't think of any changes to 3.1.6 that would cause it to lose valid
> URLs, but it would be good to confirm without a doubt that the lost URLs
> on your system are all indeed URLs that should not have been indexed.
In the happy hour;)))
> You're right, I was forgetting that URLs can appear in the body text
> of a document, and therefore in the excerpt field of db.docs. This
> does suggest that the change to URL.cc on Aug. 31 would account for
> almost half of the missing URLs. Presumably a grep of "[^:]//" in a
> db.docs from a recent 3.1.6 snapshot wouldn't find any matches, unless
> the double slashes are in URLs within the body text of documents.
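The pattern "[^:]//" matches a double slash that is preceded by any character other than a colon, so the "//" right after "http:" is ignored. A minimal sketch of the same check, applied to text lines however they are obtained from db.docs (the function name is hypothetical):

```python
import re

# Double slash not immediately preceded by a colon.
DOUBLE_SLASH = re.compile(r"[^:]//")

def lines_with_double_slash(lines):
    """Return the lines that contain '//' anywhere other than after a colon."""
    return [line for line in lines if DOUBLE_SLASH.search(line)]

sample = [
    "http://example.org/dir/page.html",
    "http://example.org//dir/page.html",
    "see http://example.org/a//b for details",
]
hits = lines_with_double_slash(sample)
print(hits)
```

Note that, like the grep, this needs a character before the slashes, so a "//" at the very start of a line is not reported.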
>
> So, I guess the next question is do you have any documents that have
> meta robots tags followed by script tags?
Yes; most of the 88 documents in my previous post.
> No, I left for vacation in late July and got back in mid August. Geoff
> started making 3.1.6 snapshots in July, but the process failed Aug. 26
> because the SourceForge project FTP server ran out of disk space.
You are right.
> > > Yes, that last comparison is the one I wanted to see. An almost 3-fold
> > > increase in indexing time is dramatic. A comparison of profiling output
> > > for these two builds would really be informative.
> >
> > Right you are;)
>
> I'd like to see the results after you take out the htlib regex code.
> Could you run them through gprof the way you did a few months ago with
> 3.2.0b*?
I must have changed the options; I will try to remember what options I used
then;)
Regards,
Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah [EMAIL PROTECTED]