On Wed, 17 Oct 2001, Gilles Detillieux wrote:
> Date: Wed, 17 Oct 2001 15:35:53 -0500 (CDT)
> From: Gilles Detillieux <[EMAIL PROTECTED]>
> To: Joe R. Jah <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Subject: Re: [htdig-dev] Re: URL Rewrite patch for 3.1.6 snapshots
>
> > I found 82 links from one document with META ROBOT: Noindex tag;) I could
> > not find an efficient way of hunting down the other 138 links that were
> > unaccounted for in two 20 meg+ files; however, I must assume that they are
> > some sort of duplicates;-/
>
> Hmm. Too bad we couldn't get something more definitive. I'm fairly
> confident that the changes to the HTML parser didn't break anything, but
> I'd feel much more comfortable if we could explain the missing files you
> discovered rather than just assuming it's OK. If I recall, there were
> 88 URLs with doubled slashes that were eliminated in an earlier test,
> but that still leaves around 50 URLs unaccounted for.
>
> If there's any way you can take a snapshot of your site, or a few major
> subdirectories, and duplicate them somewhere else where they won't get
> modified, it would be a big help in getting conclusive results. If you
> index the exact same files with 3.1.5 and 3.1.6, you should be able to
> diff the output of htdig -vvv from both, and pinpoint exactly where the
> differences are happening. I know this is asking a lot, but it would be
> a shame to release 3.1.6 after all the work that's gone into it, only to
> discover afterward that it introduced a serious bug.
Sorry it took such a long time to respond, but I have been very busy
lately. It is not easy to prove a negative; however, I have tried a few
times to make 3.1.6 miss indexing files in stable snapshots of my site
without success;)
Here is a comparison of the latest 3.1.6 snapshot on a snapshot of my site
-- 163 HTML-only documents -- with 3.1.6-072901:
_______3.1.6-072901 + Armstrong patch + ssl.4_______
htdig: Start digging: Sun Nov 11 18:15:43 PST 2001
htmerge: Start merging: Sun Nov 11 18:16:16 PST 2001 33 seconds
htmerge: Total word count: 13171
htmerge: Total documents: 163
htmerge: Total doc db size (in K): 1888
-------------------------8<-------------------------
__________3.1.6-111101 + ssl.5 + FAQ#5.14___________
htdig: Start digging: Sun Nov 11 18:19:19 PST 2001
htmerge: Start merging: Sun Nov 11 18:20:58 PST 2001 99 seconds
htmerge: Total word count: 13171
htmerge: Total documents: 163
htmerge: Total doc db size (in K): 1888
-------------------------8<-------------------------
CPU: 350 MHz Pentium
RAM: 384 Megs
OS: BSDi-4.2
They both index exactly the same number of documents, with identical word
counts and doc db sizes; this is as conclusive a result as I can produce.
The only difference is the time they take: 33 seconds versus 99 seconds.
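Matching totals do not strictly prove that the same URLs were indexed, so a
per-URL comparison of the two htdig -vvv logs would be a stronger check. Here
is a minimal sketch of one way to do that. It assumes only that the verbose
output mentions each URL as a plain http:// or https:// string somewhere on a
line; it does not rely on the exact htdig log layout, and the function names
and log file names are hypothetical:

```python
import re
import sys

# Assumption: each URL appears in the log as a plain http(s):// string.
# This pattern is a rough heuristic, not a parser for htdig's log format.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def urls_in_log(text):
    """Return the set of distinct URLs mentioned anywhere in the log text."""
    return set(URL_RE.findall(text))


def compare_logs(old_text, new_text):
    """Return (only_in_old, only_in_new) as sorted lists of URLs."""
    old_urls = urls_in_log(old_text)
    new_urls = urls_in_log(new_text)
    return sorted(old_urls - new_urls), sorted(new_urls - old_urls)


def main(old_path, new_path):
    # Read both logs, tolerating stray bytes in large verbose dumps.
    with open(old_path, errors="replace") as f:
        old_text = f.read()
    with open(new_path, errors="replace") as f:
        new_text = f.read()
    only_old, only_new = compare_logs(old_text, new_text)
    for url in only_old:
        print("only in old log:", url)
    for url in only_new:
        print("only in new log:", url)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Run it with the two saved log files as arguments (say, old.log and new.log).
Since it compares sets of URLs rather than lines, ordering and timing
differences between the two runs do not show up as noise the way they would
in a plain diff.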
Incidentally, ssl.4 fails to apply to the latest snapshot because of the
recent changes to Connection.cc. I have modified the patch to apply
cleanly to the latest snapshot of 3.1.6:
ftp://ftp.ccsf.org/htdig-patches/3.1.6/ssl.5
Regards,
Joe
--
_/ _/_/_/ _/ ____________ __o
_/ _/ _/ _/ ______________ _-\<,_
_/ _/ _/_/_/ _/ _/ ......(_)/ (_)
_/_/ oe _/ _/. _/_/ ah [EMAIL PROTECTED]
_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev