According to Geoff Hutchison:
> On Mon, 21 Aug 2000, Gilles Detillieux wrote:
> > > One flaw I've realised is that if there are identical pages that
> > > point to different relative pages in two different parts of the tree,
> > > then it doesn't pick up the second lot of pages.
> > 
> > I must say I saw this coming, and I seem to remember mentioning this
> > way back when the whole concept of duplicate detection was discussed.
> > At the time, I didn't think it would be a serious problem, because I
> > didn't think you'd rely on relative links in a document that is intended
> 
> That was my thinking too. I also felt that if duplicate detection was a
> configuration attribute (potentially on a per-URL or per-server basis),
> then it wouldn't be as big an issue.

Oh, right.  It didn't occur to me when reading Toivo's patches that this
was not an optional feature.  Making this feature selectable by a config
attribute is a must, IMHO!
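
Something along these lines in the config file, say (the attribute name
here is made up, just to show the shape; it could default either way):

    # Hypothetical attribute -- turn duplicate detection on or off.
    detect_duplicates: false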

> > URL you encounter a document at may not be the preferred location for it.
> 
> I've been giving some thought to the question of "preferred location",
> and I keep coming back to the HTML 4.0 standard. The BASE tag is meant
> for exactly this purpose <http://www.w3.org/TR/html4/struct/links.html#h-12.4>
> and, when used, it solves the relative link issue.
> 
> However, some people may want to designate the preferred location w/o
> using the BASE tag. So I suggest using the LINK tag for this purpose:
> <http://www.w3.org/TR/html4/struct/links.html#h-12.3.3>
> 
> Something like <LINK rel="start" href="http://www.foo.com"> might work.
> The HTML spec, for better or worse, doesn't standardize the values of
> the REL and REV attributes. So my suggestion is that the code should use
> the first URL it sees as the canonical URL for a set of duplicates unless
> there's a LINK or BASE tag.
> 
> The only snag is that if we hit one of these tags, it essentially
> functions as a redirect, so we'll have to handle that carefully (expire
> the current URL, etc.).

Sounds reasonable.  However, if I'm reading the situation correctly from
previous discussions, most instances of duplicate documents are completely
unplanned, so in most circumstances users aren't going to want to add tags
to all their documents just to make indexing behave.  This means it'll be
important for duplicate handling to work correctly in the default case,
with no user intervention.  Symbolic links were probably the most frequent
cause of duplicates, from the reports I've seen, and I got the impression
that these links were usually added without any thought to their impact
on indexing.
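
For the minority of sites that do plan their mirrors, though, the markup
ought to be as simple as one of these in the document head (the rel value
being our own convention, since the spec leaves REL open):

    <head>
      <base href="http://www.foo.com/docs/">
      <!-- or, without changing how relative links resolve: -->
      <link rel="start" href="http://www.foo.com/docs/index.html">
    </head>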

I think that reparsing documents at each location where they occur, so
that relative URLs are resolved against all of those locations, will be
necessary (or should at least be available as an option).  Otherwise, the
first encounter with a document can yield many broken relative links,
which only lead to their intended targets when the document is re-parsed
at another of its locations.
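
To make the failure mode concrete: the same relative href resolves to a
different absolute URL at each location the duplicate lives at, so
parsing it only once misses one set of targets.  A minimal sketch (naive
resolution, not the real htdig URL class):

    #include <iostream>
    #include <string>

    // Naive resolver: replace everything after the base's last '/'
    // with the relative path.  Real resolution also handles "../",
    // absolute paths, etc.; this is just to show the point.
    std::string resolve(const std::string &base, const std::string &rel)
    {
        return base.substr(0, base.rfind('/') + 1) + rel;
    }

    int main()
    {
        // The same document served from two parts of the tree...
        std::cout << resolve("http://www.foo.com/a/index.html", "details.html")
                  << '\n';   // http://www.foo.com/a/details.html
        std::cout << resolve("http://www.foo.com/b/index.html", "details.html")
                  << '\n';   // http://www.foo.com/b/details.html
        return 0;
    }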

> > This also brings to light a bug in the current HTML.cc code.  Any of the
> > following tags will turn indexing back on, regardless of what turned it
> > off:  </noindex>, </style>, </script>, <meta htdig-index>.  That means
> > that if a document begins with a <meta name="robots" content="noindex">
> > tag, and later includes style or script tags, then indexing will get
> > turned back on, which is almost certainly not the desired behaviour.
> 
> No, but in this case the document will get cleaned out later. Not great,
> but it's better than the other ways the bug can surface.
> 
> > It seems that the parser should keep a set of flags to track all the
> > reasons indexing has been turned off, and only turn it on when all of
> > these conditions have been reversed.  There should be flags for noindex,
> > style, script, meta (possibly track different metas, although that may
> > be overkill or even undesirable), and duplicate documents (which would
> > be irreversible).  The same problem exists with its dofollow variable.
> 
> Or we may want to have some sort of stack. This would make implementing
> noindex_start as a list of strings easier. You'd keep track of which tags
> turned off indexing (or following) and pop them off at appropriate times.
> Certainly more flexible than a series of flags, plus it solves the problem
> of nested tags.

I don't know that a stack is easier to implement than a bitmask, but both
approaches have their merits, and either would be better than what the
code does now.  Using a stack raises the question of how the code should
deal with tags that are not properly nested.  Should it pop off everything
when faced with a closing tag, until it finds the matching opening tag
on the stack?  What about <meta htdig-noindex> and <meta htdig-index>?
These aren't strictly opening and closing tags, so can any nesting rules
be imposed on them?
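
For what it's worth, the bitmask version I have in mind would look
roughly like this (flag names illustrative, not the actual HTML.cc
members):

    // Illustrative only -- not the real HTML.cc code.
    enum {
        FLAG_NOINDEX = 1 << 0,  // inside <noindex>...</noindex>
        FLAG_STYLE   = 1 << 1,  // inside <style>...</style>
        FLAG_SCRIPT  = 1 << 2,  // inside <script>...</script>
        FLAG_META    = 1 << 3,  // <meta name="robots" content="noindex">
        FLAG_DUP     = 1 << 4   // duplicate document; never cleared
    };

    int noindex = 0;            // indexing is on iff noindex == 0

    // On <style>:         noindex |= FLAG_STYLE;
    // On </style>:        noindex &= ~FLAG_STYLE;
    // On robots noindex:  noindex |= FLAG_META;
    //
    // Each closing tag now clears only its own bit, so </style> can
    // no longer turn indexing back on after the meta tag turned it
    // off.  The same scheme would work for dofollow.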

I see the whole noindex_start thing as a separate issue, though, because
it's parsed at an earlier stage, and actually causes sections of the
HTML to be stripped out, rather than just flipping flags.  One of the
advantages of doing it this way is that the start and end strings don't
have to be complete tags, and we probably don't want to lose that
feature.  Probably the best way to implement this as a list of choices
would be to use a StringMatch object for noindex_start, and use the
"which" value from a match to index a StringList object for noindex_end.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
