On Monday 21 January 2002 17:11, Russ Allbery wrote:
> No, pretty much all of the time. There are differences between proper
> nouns and common nouns, but those are differences routinely quashed as a
> typesetting decision; if you write both proper nouns and common nouns in
> all caps as part of a headline, the lack of distinction is not considered
> a misspelling. Similarly, if you capitalize the common noun because it
> occurs at the beginning of the sentence, that doesn't transform its
> meaning.

That doesn't mitigate the fact that they are different words. Sure, English is forgiving, as it's filled with heteronyms and homographs. But it's all moot because regexes are character-oriented, not word-oriented. Given that they're character-oriented, we only need to provide character transformations between upper, lower, and title case. But is that the dividing line?

> Whereas adding or removing an accent is always considered a misspelling,
> at least in some languages. It's like adding or removing random letters
> from the word.

No, it's substituting letters in a word. It's adding or removing random characters from the string representation of the word.

> re'sume' and resume are two different words. It so happens that in
> English re'sume' is a variant spelling for one meaning of resume. I don't
> believe that regexes should try to automatically pick up variant
> spellings. Should the regex /aerie/ match /eyrie/? That makes as much
> sense as a search for /resume/ matching /re'sume'/.

Variant spellings imply word-oriented searches. We're talking about character-oriented transformations, and the question is whether or not there's enough justification - which I feel won't come from grammatical rationales, but from the 7-bit ASCII storage of words with accents - to provide a transformation from a base letter with accents to just the base letter. Do you feel that altering accented letters to better represent them within the facilities provided isn't done, or is wrong?

I'm not sure what you're typing as your example word, and whether or not it's getting munged in the meantime, but "résumé" (r, e accent, s, u, m, e accent) is coming across as "re'sume'" (r, e, apostrophe, s, u, m, e, apostrophe). (The incoming message was encoded ISO-8859-1, so presumably it should have preserved character 233, which is what I'm sending out.)

This isn't a ridiculous question. Personally, I don't think that we should.
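
For concreteness, the per-character accent transformation being debated could be sketched roughly as below. This is only an illustration in Python, not anything proposed in this thread: the helper name strip_accents is hypothetical, and it assumes that Unicode NFD decomposition is an acceptable way to find the "base letter" of an accented character.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Map each accented letter to its base letter (hypothetical helper)."""
    # NFD splits U+00E9 (e with acute) into 'e' plus the combining acute U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = "r\u00e9sum\u00e9"  # "résumé"; U+00E9 is character 233 in ISO-8859-1

assert ord("\u00e9") == 233
assert word.encode("iso-8859-1") == b"r\xe9sum\xe9"

# Character for character, /resume/ does not match the accented spelling.
assert re.search("resume", word) is None
# Case folding is a per-character transformation that regexes already offer.
assert re.search("RESUME", "resume", re.IGNORECASE) is not None
# Accent folding would be the analogous per-character transformation.
assert re.search("resume", strip_accents(word)) is not None
```

Note that the folding is applied to the text before matching; the pattern itself stays purely character-oriented, which is the distinction drawn above between character transformations and word-level variant spellings such as /aerie/ versus /eyrie/.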
The facilities are quickly coming into place to be able to do proper character encodings, and I think that we should lead from the front and encourage folks to be proper - not only in their searches, but in their text production.

--
Bryan C. Warnock
[EMAIL PROTECTED]