According to Gilles Detillieux:
>> Lennart Almkvist wrote:
>> >
>> > Some more testing gave the following results:
>> >
>> > The German flower word "Stiefmütterchen" and the Icelandic
>> > "þrenningarfjóla" are treated differently in meta content
>> > than in the body or title part of an HTML document.
>> >
>> > When in the body or in the title, the "ü", "þ" and "ó"
>> > are decoded to one-byte characters in the .wordlist and .words.db files.
>> >
>> > In meta content, however, these words end up as "stiefmuuml;t"
>> > and "thorn;rennin" in the .wordlist and .words.db files. That is, the "&" is
>> > removed and the rest is kept as letters ("&" is in valid_punctuation but
>> > the ";" is not, by default).
>> >
>> > Shouldn't they be decoded the same way as in the title or body?
>
>OK, we do clearly have a problem with SGML entities in 3.1.2, as well
>as 3.2. (3.2 has some more serious problems, which I was hoping to
>tackle, but that's another story.) So, right now, it only translates
>&foo; entities outside of any HTML tags. I think there are reasons
>not to translate them in all tags, but where is it valid to do so?
>Certainly in keywords text, alt text in img tags, and meta description
>text. How about htdig-email-subject? Any others I've missed?
- HTML 4.0 "title" attribute (not yet handled by ht://Dig, but would be
nice to improve search results)
- Most Dublin Core META information contents (it would be nice if ht://Dig
could directly support this META standard).
- Alt text in client side image maps.
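
To make the idea concrete, here is a rough sketch in Python (not ht://Dig's C++, and not its actual parser code). It assumes a small, hypothetical whitelist of tag/attribute pairs whose values are human-readable text, and decodes entities only in those. The names TEXT_ATTRIBUTES and attribute_text are made up for illustration; the whitelist just mirrors the attributes mentioned in this thread.

```python
from html import unescape

# Hypothetical whitelist: (tag, attribute) pairs whose values hold text
# that should be indexed with entities decoded.
TEXT_ATTRIBUTES = {
    ("meta", "content"),  # keywords, description, Dublin Core values
    ("img", "alt"),       # alt text of images
    ("area", "alt"),      # alt text in client-side image maps
}


def attribute_text(tag, attr, raw_value):
    """Decode entities in raw_value if (tag, attr) is known to hold text.

    The HTML 4.0 "title" attribute is treated as text on any tag.
    Values of all other attributes are returned unchanged.
    """
    if attr == "title" or (tag, attr) in TEXT_ATTRIBUTES:
        return unescape(raw_value)
    return raw_value


if __name__ == "__main__":
    # Entities in whitelisted attributes are decoded before word splitting.
    print(attribute_text("meta", "content", "Stiefm&uuml;tterchen"))
    print(attribute_text("img", "alt", "&thorn;renningarfj&oacute;la"))
```

Attributes that are not on the list pass through untouched in this sketch; which attributes belong on the list is exactly the open question above.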
cheers,
Torsten
--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14 Tel: +49-4101-403605
D-25474 Ellerbek Fax: +49-4101-403606
E-Mail: [EMAIL PROTECTED] Internet: http://www.inwise.de