On 14 Apr 2004, at 20:16, Larry Wall wrote:


> I think the idea of tagging complete strings with "language" is not
> terribly useful.  If it's to be of much use at all, then it should
> be generalized to a metaproperty system for applying any property to
> any range of characters within a string, such that the properties
> float along with the characters they modify.  The whole point of
> doing such properties is to be able to ignore them most of the time,
> and then later, after you've constructed your entire XML document,
> you can say, "Oh, by the way, does this character have the "toetsch"
> property?"  There's no point in tagging text with language if 99%
> of it gets turned into "Dunno", or "English, but not really."


It seems natural to associate language with utterances. When these utterances are written down - or as I'm doing here, skipping the speaking part and uttering straight to text - then the association still works. But once we start emitting written things (strings) in a less aural way, then the notion of an associated language can easily become forced or inaccurate.


The process whereby we read a string like

"Is <b>this</b> string in Englisch?"

is generally a kind of lossy conversion to our language of preference for that particular string. It's very difficult for us to do otherwise. This natural generalization means that there will always be a demand for strings to have language associated with them, no matter how illogical it may seem to those who reflect upon it a bit.

I think it is this user state that Dan is trying to support. And, in so far as it models natural and common perception, I think I agree with him.

Lossy conversion is a kind of info-sin, especially when it could have been avoided. There are circumstances where it would be more natural to read the above string as

"Is open-bold-tag this close-bold-tag string in the-German-word-for-English question mark"

i.e. when we are being more precise.

It is for this more precise user state that we would be preserving information on substrings.

There are plenty of strings which are simply never intended to be uttered, and are therefore effectively language-less. And many strings that are obviously in a particular language are often treated as if they weren't. It would be odd to subject the processing of such strings to a requirement to preserve nonexistent or useless information. Any sensible user would want to turn off language processing in such cases.

So, we need to ask the user their state, and have the necessary level of support in place to be able to behave accordingly.

Looking at this from an object-oriented perspective, I can't help but wonder why we don't have a hierarchy of Parrot string types

        String
        LanguageString
        MultiLanguageString

with a "left wins" rule for composition.
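To make that concrete, here is a rough sketch in Python rather than anything Parrot-specific. Everything in it is an assumption for illustration: the class bodies, the spans_of helper, and in particular my reading of "left wins" as "the left operand of a concatenation decides what kind of string the result is". The flat inheritance (both language-aware types deriving directly from String) is also just one possible arrangement.

```python
class String:
    """Plain text with no language information."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # Left wins: a plain String on the left discards any language
        # information carried by the right operand.
        return String(self.text + other.text)


class LanguageString(String):
    """Text tagged with one language for the whole string."""

    def __init__(self, text, language):
        super().__init__(text)
        self.language = language

    def __add__(self, other):
        # Left wins: the result keeps the left operand's language.
        return LanguageString(self.text + other.text, self.language)


class MultiLanguageString(String):
    """Text whose language is tracked per substring."""

    def __init__(self, spans):
        # spans: list of (text, language) pairs; language is None if unknown.
        self.spans = list(spans)
        super().__init__("".join(text for text, _ in self.spans))

    def __add__(self, other):
        # Left wins: the result stays multi-language, so the right
        # operand's language information (if any) is preserved per span.
        return MultiLanguageString(self.spans + spans_of(other))


def spans_of(s):
    """Hypothetical helper: view any of the three types as a span list."""
    if isinstance(s, MultiLanguageString):
        return list(s.spans)
    return [(s.text, getattr(s, "language", None))]
```

Under this reading, a user who doesn't care about language starts from a plain String and never pays for the bookkeeping, while a user in the more precise state starts from a MultiLanguageString and loses nothing on concatenation.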

Mike





