According to Jamie Anstice:
> Here's a quickie that someone else might like to verify if they've 
> run into the same problem.  When htdig encounters an entity that it 
> doesn't know about (say &#146; - which should really be &#8217; but 
> that's another issue) it copies it verbatim to the extract - so far
> so good.  When the extract is sent out in Display::hilight, the
> extract is decoded with HtSGMLCodec to transform the unsigned char
> characters to entities, and as well as the characters above 160 it
> translates & to &amp;, which is fine except when & is the start of 
> an entity.  This is what leaves things like &146; in extracts.
> Here's a patch to HtSGMLCodec::decode to make sure that it doesn't break
> real entities.

The problem with this is it doesn't make a distinction between an
"&" character in the excerpt that came from a "&" or from a "&amp;"
in the source document.  For example, if a document gives an example of
SGML encoding, and therefore contains something like "&amp;lt;" in the
source HTML document, it goes into the excerpt in the database as "&lt;".
With your patch, that &lt; in the excerpt doesn't get expanded back to
&amp;lt; in the resulting HTML output.  I guess it comes down to which
is the lesser of two evils, and probably your patch is the better choice.
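
To make the trade-off concrete, here is a minimal standalone sketch.  It
is not the htdig code and not the patch itself: looksLikeEntity and
escapeAmpersands are hypothetical names, and the sketch only handles "&",
not the characters above 160 that HtSGMLCodec also translates.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not htdig code: true if the '&' at text[pos] begins
// something shaped like an entity reference (&name; or &#digits;).
bool looksLikeEntity(const std::string &text, std::size_t pos)
{
    std::size_t i = pos + 1;
    if (i < text.size() && text[i] == '#') {
        // Numeric form: at least one digit, then ';'.
        std::size_t start = ++i;
        while (i < text.size() && std::isdigit(static_cast<unsigned char>(text[i])))
            ++i;
        return i > start && i < text.size() && text[i] == ';';
    }
    // Named form: at least one alphanumeric character, then ';'.
    std::size_t start = i;
    while (i < text.size() && std::isalnum(static_cast<unsigned char>(text[i])))
        ++i;
    return i > start && i < text.size() && text[i] == ';';
}

// Escape '&' as "&amp;" for HTML output.  With keepEntities set, an '&'
// that already looks like the start of an entity is passed through as-is.
std::string escapeAmpersands(const std::string &excerpt, bool keepEntities)
{
    std::string out;
    for (std::size_t i = 0; i < excerpt.size(); ++i) {
        if (excerpt[i] == '&' && !(keepEntities && looksLikeEntity(excerpt, i)))
            out += "&amp;";
        else
            out += excerpt[i];
    }
    return out;
}
```

With keepEntities off, an excerpt holding "&#146;" comes out as
"&amp;#146;", which is the breakage Jamie describes.  With it on, that
case is fixed, but an excerpt holding a literal "&lt;" is also passed
through unchanged, so the browser shows "<" where the source document
showed "&lt;".  That is the other evil.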

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
