According to Toivo Pedaste:
> One flaw I've realised is that if there are identical pages that
> point to different relative pages in two different parts of the tree,
> then it doesn't pick up the second lot of pages.

I must say I saw this coming, and I seem to remember mentioning this
way back when the whole concept of duplicate detection was discussed.
At the time, I didn't think it would be a serious problem, because I
didn't think you'd rely on relative links in a document that is intended
to appear at many different places in the URL space, but I imagine that
in some cases this could be a real problem - especially since the first
URL you encounter a document at may not be the preferred location for it.

> Is there a simple way to get htdig to extract the URLs from a
> page without recording it in the index?

If you look at the htdig/HTML.cc code, you'll see it has two variables,
doindex and dofollow, which get turned on or off based on the presence
of certain tags.  I guess what you'd need to do for duplicate documents
is parse them with the doindex variable turned off.  You'd also need
to do likewise in ExternalParser.cc, which currently doesn't track this
variable (expecting the external parser to look after any such details).

This also brings to light a bug in the current HTML.cc code.  Any of the
following tags will turn indexing back on, regardless of what turned it
off:  </noindex>, </style>, </script>, <meta htdig-index>.  That means
that if a document begins with a <meta name="robots" content="noindex">
tag, and later includes style or script tags, then indexing will get
turned back on, which is almost certainly not the desired behaviour.
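To make the failure concrete, here is a minimal sketch of the problem.
It is not the actual HTML.cc code: the NaiveParser struct, the handle()
function and the simplified tag strings are all hypothetical, and only
the single doindex boolean mirrors what the real parser does.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model of a parser that tracks indexing with a
// single boolean. This is NOT the real HTML.cc code; tag strings are
// simplified stand-ins for the real tags.
struct NaiveParser {
    bool doindex = true;

    void handle(const std::string &tag) {
        if (tag == "<meta robots noindex>" || tag == "<noindex>" ||
            tag == "<style>" || tag == "<script>") {
            doindex = false;
        } else if (tag == "</noindex>" || tag == "</style>" ||
                   tag == "</script>") {
            // The bug: indexing is switched back on without checking
            // what switched it off in the first place.
            doindex = true;
        }
    }
};

// Feed a sequence of tags to a fresh parser and report the final state.
bool indexingAfter(const std::vector<std::string> &tags) {
    NaiveParser parser;
    for (const std::string &tag : tags)
        parser.handle(tag);
    return parser.doindex;
}

// Example: a robots noindex document that also contains a style block.
void naiveParserExample() {
    // The closing </style> wrongly re-enables indexing.
    assert(indexingAfter({"<meta robots noindex>", "<style>", "</style>"}));
}
```

With a single boolean, the closing style tag has no way of knowing that
the robots meta tag had already turned indexing off.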

It seems that the parser should keep a set of flags to track all the
reasons indexing has been turned off, and only turn it on when all of
these conditions have been reversed.  There should be flags for noindex,
style, script, meta (possibly track different metas, although that may
be overkill or even undesirable), and duplicate documents (which would
be irreversible).  The same problem exists with the dofollow variable.
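A rough sketch of what such a set of flags might look like follows.
The names here (NoIndexReason, IndexState, suppress, restore) are my own
inventions for illustration and do not exist in the htdig sources; the
idea is simply one bit per reason, with indexing enabled only when no
bit is set.

```cpp
#include <cassert>

// Hypothetical names, not taken from the htdig sources.
// One bit per reason that indexing may have been turned off.
enum NoIndexReason : unsigned {
    REASON_NOINDEX_TAG = 1u << 0,  // <noindex> ... </noindex>
    REASON_STYLE       = 1u << 1,  // <style> ... </style>
    REASON_SCRIPT      = 1u << 2,  // <script> ... </script>
    REASON_META        = 1u << 3,  // robots or htdig meta tags
    REASON_DUPLICATE   = 1u << 4   // duplicate document, never cleared
};

class IndexState {
public:
    // Record one more reason for indexing to be off.
    void suppress(NoIndexReason reason) { reasons_ |= reason; }

    // Clear one reason; the duplicate flag is deliberately irreversible.
    void restore(NoIndexReason reason) {
        if (reason == REASON_DUPLICATE)
            return;
        reasons_ &= ~static_cast<unsigned>(reason);
    }

    // Indexing is on only when every reason has been reversed.
    bool doindex() const { return reasons_ == 0; }

private:
    unsigned reasons_ = 0;
};

// Example: a robots noindex document that also contains a style block.
void indexStateExample() {
    IndexState state;
    state.suppress(REASON_META);
    state.suppress(REASON_STYLE);
    state.restore(REASON_STYLE);   // closing </style>
    assert(!state.doindex());      // still off because of the meta tag
}
```

A second instance of the same class could presumably track dofollow in
the same way, though the set of reasons would likely differ.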

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
