Re: on parrot strings

Bryan C. Warnock Mon, 21 Jan 2002 14:45:34 -0800

On Monday 21 January 2002 17:11, Russ Allbery wrote:
> No, pretty much all of the time.  There are differences between proper
> nouns and common nouns, but those are differences routinely quashed as a
> typesetting decision; if you write both proper nouns and common nouns in
> all caps as part of a headline, the lack of distinction is not considered
> a misspelling.  Similarly, if you capitalize the common noun because it
> occurs at the beginning of the sentence, that doesn't transform its
> meaning.


That doesn't mitigate the fact that they are different words.  Sure, English 
is forgiving, as it's filled with heteronyms and homographs.  But it's all 
moot because regexes are character-oriented, not word-oriented.  

Given that they're character-oriented, we only need to provide character 
transformations between upper, lower, and title case.  But is that the 
dividing line?
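
To make the three case mappings concrete, here is a minimal sketch.  It 
uses Python's built-in string methods purely for illustration; nothing here 
is Parrot's API, and the helper name case_forms is made up for the example.

```python
# Illustrative only: Python's str methods stand in for the per-character
# case mappings discussed above. This is not Parrot code.

def case_forms(ch: str) -> dict:
    """Return the upper, lower, and title case mappings of one character."""
    return {"upper": ch.upper(), "lower": ch.lower(), "title": ch.title()}

# For most letters, title case and upper case coincide.
plain = case_forms("a")

# U+01C6 (LATIN SMALL LETTER DZ WITH CARON) is one of the few characters
# whose title case form differs from its upper case form.
digraph = case_forms("\u01c6")
```

The digraph case is the reason title case is listed separately from upper 
case: for a handful of characters the two mappings give different results.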

>
> Whereas adding or removing an accent is always considered a misspelling,
> at least in some languages.  It's like adding or removing random letters
> from the word.

No, it's substituting letters in a word.  It's adding or removing random 
characters from the string representation of the word.

>
> re'sume' and resume are two different words.  It so happens that in
> English re'sume' is a variant spelling for one meaning of resume.  I don't
> believe that regexes should try to automatically pick up variant
> spellings.  Should the regex /aerie/ match /eyrie/?  That makes as much
> sense as a search for /resume/ matching /re'sume'/.

Variant spellings imply word-oriented searches.  We're talking about 
character-oriented transformations, and the question is whether or not 
there's enough justification - which I feel won't come from grammatical 
rationales, but from the 7-bit ASCII storage of words with accents - to 
provide a transformation from a base letter with accents to just the base 
letter.  
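
For reference, the kind of transformation in question could look like the 
sketch below.  It assumes Unicode canonical decomposition (NFD) as the 
mechanism, which is one possible approach and not something Parrot is known 
to provide; the function name strip_accents is hypothetical.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Map each accented letter to its base letter (hypothetical helper).

    Assumption: the accent can be separated by canonical decomposition.
    NFD splits a precomposed letter into base letter + combining mark(s),
    and the combining marks are then dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

base = strip_accents("r\u00e9sum\u00e9")  # e with acute accent, twice
```

Note that this is purely character-oriented: it knows nothing about words, 
so it cannot tell whether the result is a valid spelling in any language.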

Do you feel that altering accented letters to better represent them within 
the facilities provided isn't done, or is wrong?  I'm not sure what 
you're typing as your example word, and whether or not it's getting munged 
in the meantime, but "résumé"  (r, e accent, s, u, m, e accent) is coming 
across "re'sume'" (r, e, apostrophe, s, u, m, e, apostrophe).  (The incoming 
message was encoded ISO-8859-1, so presumably it should have preserved 
character 233, which is what I'm sending out.)
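
The difference between the two forms is easy to check at the byte level.  
The following is only a sketch in Python (the helper fits_in_ascii is made 
up for the example), showing that the accented form needs character 233 in 
ISO-8859-1 while the apostrophe form fits in 7-bit ASCII.

```python
# The word with real accents, written with escapes so that this snippet
# itself survives a 7-bit transport.
accented = "r\u00e9sum\u00e9"
munged = "re'sume'"

# ISO-8859-1 stores each of these characters in a single byte.
latin1_bytes = list(accented.encode("iso-8859-1"))

def fits_in_ascii(text: str) -> bool:
    """Return True if every character of text is representable in ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False
```

Here latin1_bytes holds the value 233 in the second and last positions, and 
fits_in_ascii is true only for the apostrophe form.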

This isn't a ridiculous question.  Personally, I don't think that we should 
provide such a transformation. 
The facilities are quickly coming into place to be able to do proper 
character encodings, and I think that we should lead from the front and 
encourage folks to be proper - not only in their searches, but in their text 
production. 


-- 
Bryan C. Warnock
[EMAIL PROTECTED]
