According to Hans-Peter Nilsson:
> I plan to add a new attribute: extra_word_characters.
> It is the opposite (or something) to valid_punctuation, it marks a
> (possibly) non-alphanumeric as a valid word-character.

It's like valid_punctuation, in that it's taken as part of the word, but
unlike valid_punctuation in that it's not stripped out before the word
is put in the database, if I understand you correctly.
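
To make sure we mean the same thing, here is a minimal sketch of how I read
the distinction. It is not the actual HTML.cc code: the WordRules struct and
the is_word_char() and tokenize() helpers are hypothetical names made up for
illustration, and I am assuming that a valid_punctuation character keeps a
word together while being dropped from the stored form, whereas an
extra_word_characters character is kept in the stored form.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for the two attributes; not ht://Dig's real types.
struct WordRules {
    std::string valid_punctuation;      // joins a word, stripped before storing
    std::string extra_word_characters;  // joins a word, kept when storing
};

// A character counts as a word character if the locale says it is
// alphanumeric, or if it is listed in extra_word_characters.
static bool is_word_char(unsigned char c, const WordRules& rules) {
    return std::isalnum(c) != 0 ||
           rules.extra_word_characters.find(static_cast<char>(c)) !=
               std::string::npos;
}

// Split text into the words that would be stored in the database.
std::vector<std::string> tokenize(const std::string& text,
                                  const WordRules& rules) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : text) {
        if (is_word_char(c, rules)) {
            current += static_cast<char>(c);   // kept in the stored word
        } else if (rules.valid_punctuation.find(static_cast<char>(c)) !=
                   std::string::npos) {
            // Does not end the word, but is not stored either.
        } else {
            // Any other character ends the current word.
            if (!current.empty()) words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

With "_" in extra_word_characters, "foo_bar" would be stored and searchable
as "foo_bar"; with "_" in valid_punctuation instead, it would be stored as
"foobar"; with "_" in neither, it would split into "foo" and "bar". Listing
a character the locale already treats as alphanumeric changes nothing, which
matches the no-op remark in the proposal.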

> This way (and no other I know of), I can make "_" characters part of
> words, and searchable as such.
> 
> A (hopefully) positive side-effect is that people having problems making
> their systems understand their locale (i.e. it is broken in that it
> handles everything as the "C" locale) can state characters here that the
> locale would normally handle.
> 
> Examples:
>  extra_word_characters: _
>  extra_word_characters: "������"
> 
> (If you didn't get the last one, don't worry.)
> Specifying characters handled by the locale as isalpha would be a no-op.
> 
> Comments welcome.

Sounds like a good idea to me.  I'm planning a round of changes to
HTML.cc next week, especially dealing with space handling, but also with
word handling, so it would be a good idea to avoid stepping on each
other's toes.  If you get your changes in by Monday or Tuesday, then I
can follow with mine.  I want to get my changes into 3.1.2, which will
eventually get merged into 3.2.  My concern is that if I change the same
part of the code in 3.1.2 that you change in 3.2, the CVS merge may not
put it all together correctly.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930