On Mon, 21 Aug 2000, Gilles Detillieux wrote:
> > One flaw I've realised is that if there are identical pages that
> > point to different relative pages in two different parts of the tree,
> > then it doesn't pick up the second lot of pages.
>
> I must say I saw this coming, and I seem to remember mentioning this
> way back when the whole concept of duplicate detection was discussed.
> At the time, I didn't think it would be a serious problem, because I
> didn't think you'd rely on relative links in a document that is intended
That was my thinking too. I also felt that if duplicate detection were a
configuration attribute (potentially on a per-URL or per-server basis),
it wouldn't be as big an issue.
> URL you encounter a document at may not be the preferred location for it.
I've been giving some thought to the question of "preferred location",
and I keep coming back to the HTML 4.0 standard. The BASE tag is meant
for exactly this purpose
<http://www.w3.org/TR/html4/struct/links.html#h-12.4>, and when it's
used, it also solves the relative link issue.
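To make that concrete, here's a rough sketch (a hypothetical helper, not
the actual HTML.cc code): once a BASE tag has been seen, relative hrefs
resolve against the BASE href instead of whichever URL we happened to
fetch the document from, so both copies of a duplicated tree yield the
same absolute links:

    #include <string>

    // Resolve an href against the base URL (the BASE href if one was
    // seen, otherwise the document's own URL).  Deliberately
    // simplified: assumes the base includes a path component, and
    // ignores "..", queries, and fragments.
    std::string resolve(const std::string &base, const std::string &href)
    {
        if (href.find("://") != std::string::npos)
            return href;                          // already absolute
        std::string::size_type slash = base.rfind('/');
        return base.substr(0, slash + 1) + href;  // replace last segment
    }
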
However, some people may want to designate the preferred location without
using the BASE tag. So I suggest using the LINK tag for this purpose:
<http://www.w3.org/TR/html4/struct/links.html#h-12.3.3>
Something like <LINK rel="start" href="http://www.foo.com"> might work.
The HTML spec, for better or worse, doesn't standardise the behaviour of
the REL and REV attributes. So my suggestion is that the code use the
first URL it sees as the canonical URL for a set of duplicates, unless
one of them carries a LINK or BASE tag.
The only snag is that if we hit one of these tags, it essentially
functions as a redirect, so we'll have to handle that carefully (expire
the current URL, etc.).
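Concretely, I picture something like this (all the names here are made
up for illustration, none of this is in the tree): the first URL seen
wins by default, but a BASE or LINK rel="start" overrides it and gets
handled like a redirect:

    #include <string>

    // Hypothetical per-document state in the retriever.
    struct DocRef {
        std::string canonical;   // URL we index the content under
        bool        pinned;      // true once BASE/LINK has fixed it
        DocRef() : pinned(false) {}
    };

    // First URL seen wins by default...
    void sawDuplicate(DocRef &doc, const std::string &url)
    {
        if (doc.canonical.empty())
            doc.canonical = url;
    }

    // ...unless the document names a preferred location, which we
    // treat like a redirect: expire whatever URL we were filing the
    // document under and adopt the new one.
    void sawPreferredLocation(DocRef &doc, const std::string &url)
    {
        if (doc.pinned || url == doc.canonical)
            return;
        // expire(doc.canonical);  // placeholder: drop the old db entry
        doc.canonical = url;
        doc.pinned = true;
    }
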
> This also brings to light a bug in the current HTML.cc code. Any of the
> following tags will turn indexing back on, regardless of what turned it
> off: </noindex>, </style>, </script>, <meta htdig-index>. That means
> that if a document begins with a <meta name="robots" content="noindex">
> tag, and later includes style or script tags, then indexing will get
> turned back on, which is almost certainly not the desired behaviour.
No, but in this case the document will get cleaned out later. Not great,
but it's better than the other ways the bug can surface.
> It seems that the parser should keep a set of flags to track all the
> reasons indexing has been turned off, and only turn it on when all of
> these conditions have been reversed. There should be flags for noindex,
> style, script, meta (possibly track different metas, although that may
> be overkill or even undesirable), and duplicate documents (which would
> be irreversible). The same problem exists with its dofollow variable.
Or we may want some sort of stack, as sketched below. That would also
make implementing noindex_start as a list of strings easier: you'd keep
track of which tags turned off indexing (or following) and pop them off
at the appropriate times. It's certainly more flexible than a series of
flags, and it solves the problem of nested tags as well.
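A rough sketch of what I mean (nothing like the real HTML.cc, just the
idea): push the tag that turned indexing off, pop only on the matching
close, and index while the stack is empty:

    #include <stack>
    #include <string>

    class IndexState {
        std::stack<std::string> off;  // tags that turned indexing off
    public:
        void turnOff(const std::string &tag) { off.push(tag); }
        void turnOn(const std::string &tag)
        {
            // Only the tag on top may turn indexing back on, so a
            // stray </style> inside <noindex> is simply ignored.
            if (!off.empty() && off.top() == tag)
                off.pop();
        }
        bool indexing() const { return off.empty(); }
    };

A <meta name="robots" content="noindex"> has no closing tag, so its
entry never gets popped, which gives us the irreversible case for free;
a parallel stack (or a second field per entry) would do the same job
for following.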
-Geoff