Re: Plans for string processing

2004-04-16 Thread Jeff Clites
On Apr 14, 2004, at 11:16 AM, Larry Wall wrote:

I think the idea of tagging complete strings with language is not
terribly useful.  If it's to be of much use at all, then it should
be generalized to a metaproperty system for applying any property to
any range of characters within a string, such that the properties
float along with the characters they modify.  The whole point of
doing such properties is to be able to ignore them most of the time,
and then later, after you've constructed your entire XML document,
you can say, "Oh, by the way, does this character have the toetsch
property?"  There's no point in tagging text with language if 99%
of it gets turned into "Dunno", or "English, but not really".
I tend to agree, and BTW that's exactly what an NSAttributedString does  
on Mac OS X. To quote the docs:

	An attributed string identifies attributes by name, storing a value
	under the name in an NSDictionary. You can assign any attribute
	name/value pair you wish to a range of characters, in addition to
	the standard attributes described in the Constants section.

See:
http://developer.apple.com/documentation/Cocoa/Reference/Foundation/ObjC_classic/Classes/NSAttributedString.html

(Of course, an NSDictionary is the Cocoa version of a hash.)

This is the basis of styled text handling on Mac OS X, but you can  
toetsch-ify XML documents as well.
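The per-range attribute model can be sketched in a few lines of Python (illustrative only; the class and method names here are hypothetical, not Cocoa's actual API):

```python
# A minimal sketch of per-range string attributes in the spirit of
# NSAttributedString: arbitrary name/value pairs attached to character
# ranges, floating along with the characters they modify.
class AttributedString:
    def __init__(self, text):
        self.text = text
        self.runs = []  # list of (start, end, {name: value})

    def set_attribute(self, start, end, name, value):
        # attach an arbitrary name/value pair to a character range
        self.runs.append((start, end, {name: value}))

    def attributes_at(self, index):
        # collect every attribute whose range covers this character
        found = {}
        for start, end, attrs in self.runs:
            if start <= index < end:
                found.update(attrs)
        return found

s = AttributedString("Is this string in Englisch?")
s.set_attribute(18, 26, "language", "de")   # tag just the word "Englisch"
print(s.attributes_at(20))   # {'language': 'de'}
print(s.attributes_at(0))    # {}
```

The point Larry makes holds here too: the attributes cost nothing until you ask for them.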

Jeff



Re: Plans for string processing

2004-04-15 Thread Leopold Toetsch
Aaron Sherman [EMAIL PROTECTED] wrote:

 So, why is that:

   "my dog Fiffi":language(blah) eq "my dog Fi\x{fb03}":language(blah)

 and not

   use language blah;
   "my dog Fiffi" eq "my dog Fi\x{fb03}"

What if this is:

$dog eq "my dog Fi\x{fb03}"

and C<$dog> has no language info attached?

leo


Re: Plans for string processing

2004-04-15 Thread Michael Scott
On 14 Apr 2004, at 20:16, Larry Wall wrote:

I think the idea of tagging complete strings with language is not
terribly useful.  If it's to be of much use at all, then it should
be generalized to a metaproperty system for applying any property to
any range of characters within a string, such that the properties
float along with the characters they modify.  The whole point of
doing such properties is to be able to ignore them most of the time,
and then later, after you've constructed your entire XML document,
you can say, "Oh, by the way, does this character have the toetsch
property?"  There's no point in tagging text with language if 99%
of it gets turned into "Dunno", or "English, but not really".
It seems natural to associate language with utterances. When these 
utterances are written down - or as I'm doing here, skipping the 
speaking part and uttering straight to text - then the association 
still works. But once we start emitting written things (strings) in a 
less aural way, then the notion of an associated language can easily 
become forced or inaccurate.

The process whereby we read a string like

Is <b>this</b> string in Englisch?

is generally a kind of lossy conversion to our language of preference 
for that particular string. It's very difficult for us to do otherwise. 
This natural generalization means that there will always be a demand 
for strings to have language associated with them, no matter how 
illogical it may seem to those who reflect upon it a bit.

I think it is this user state that Dan is trying to support. And, in so 
far as it models natural and common perception, I think I agree with 
him.

Lossy conversion is a kind of info-sin, especially when it should be 
avoided. There are circumstances where it would be more natural to read 
the above string as

Is open-bold-tag this close-bold-tag string in 
the-German-word-for-English question mark

i.e. when we are being more precise.

It is for this more precise user state that we would be preserving 
information on substrings.

There are plenty of strings which are simply never intended to be 
uttered, and therefore are effectively language-less. And many strings 
that are obviously in particular languages are often treated as if they 
weren't. It would be odd to subject the processing of such strings to a 
requirement of preserving nonexistent or useless information. Any 
sensible user would want to turn off language processing in such cases.

So, we need to ask the user their state, and have the necessary level 
of support in place to be able to behave accordingly.

Looking at this from an object-oriented perspective I can't help but 
wonder why we don't have a hierarchy of Parrot string types

	String
		LanguageString
			MultiLanguageString

with a "left wins" rule for composition.
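The proposed hierarchy with "left wins" composition can be sketched roughly as follows (Python for illustration; the class names follow Mike's suggestion, but the API is invented, not actual Parrot types):

```python
# A rough sketch of a string hierarchy where language-tagged strings
# compose under a "left wins" rule: the result of concatenation
# carries the left operand's language.
class String(str):
    pass

class LanguageString(String):
    def __new__(cls, text, language="Dunno"):
        obj = super().__new__(cls, text)
        obj.language = language
        return obj

    def concat(self, other):
        # left wins: the result keeps self's language
        return LanguageString(str(self) + str(other), self.language)

a = LanguageString("tea", language="en")
b = LanguageString("Tee", language="de")
c = a.concat(b)
print(str(c), c.language)   # teaTee en
```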

Mike






Re: Plans for string processing

2004-04-15 Thread Aaron Sherman
On Thu, 2004-04-15 at 05:00, Leopold Toetsch wrote:
 Aaron Sherman [EMAIL PROTECTED] wrote:
 
  So, why is that:
 
  "my dog Fiffi":language(blah) eq "my dog Fi\x{fb03}":language(blah)
 
  and not
 
  use language blah;
  "my dog Fiffi" eq "my dog Fi\x{fb03}"
 
 What if this is:
 
   $dog eq "my dog Fi\x{fb03}"
 
 and C<$dog> has no language info attached?

Looks good to me. Great example!

Seriously, why is that a problem? That was my entry-point to this
conversation: I just don't see any case in which performing a comparison
of ANY two strings according to whatever arbitrary SINGLE language's
rules is a problem. I cannot imagine the case where you need two or more
language rules AND could start off with any sense of what that would
mean, and even if you could contrive such a case, I would suggest that
its rarity should dictate it being attached to a class that defines a
string-like object which mutates its behavior based on the language
spoken by the maintainer of the database from which it was fetched or
somesuch.

-- 
Aaron Sherman [EMAIL PROTECTED]
Senior Systems Engineer and Toolsmith
It's the sound of a satellite saying, 'get me down!' -Shriekback




Re: Plans for string processing

2004-04-15 Thread Leopold Toetsch
Aaron Sherman [EMAIL PROTECTED] wrote:
 On Thu, 2004-04-15 at 05:00, Leopold Toetsch wrote:
  $dog eq "my dog Fi\x{fb03}"

 and C<$dog> has no language info attached?

 Looks good to me. Great example!

 Seriously, why is that a problem?

It's Dan's problem to come up with better examples--or explanations :)

leo - refraining from further utterances WRT that topic in the absence of
The Plan(tm).


Re: Plans for string processing

2004-04-15 Thread Dan Sugalski
At 11:55 PM +0200 4/15/04, Leopold Toetsch wrote:
Aaron Sherman [EMAIL PROTECTED] wrote:
 On Thu, 2004-04-15 at 05:00, Leopold Toetsch wrote:
	$dog eq "my dog Fi\x{fb03}"

 and C<$dog> has no language info attached?

 Looks good to me. Great example!

 Seriously, why is that a problem?
Dan's problem to come up with better examples--or explanations :)
Nah, that turns out not to be the case. It's my plan, and it's 
reasonable to say I'm OK with it. :) While I'd prefer to have 
everyone agree, I can live with it if people don't.

leo - refraining from further utterances WRT that topic in the absence of
The Plan(tm).
The Plan is in progress, though I admit I'm tempted to hit easier and 
less controversial things (like, say, threads or events) first.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Plans for string processing

2004-04-15 Thread Aaron Sherman
On Thu, 2004-04-15 at 23:13, Dan Sugalski wrote:

 Nah, that turns out not to be the case. It's my plan, and it's 
 reasonable to say I'm OK with it. :) While I'd prefer to have 
 everyone agree, I can live with it if people don't.

Perhaps, as usual, I've been too verbose and everyone just skipped over
what I thought were useful questions, but I came into this thinking I
must just not get it... now I'm left with the feeling that there are
some basic questions no one is asking here. Don't respond to this
message, but please keep these questions in mind as you start to
implement... whatever it is that you're going to implement for this.

 1. People have referred to comparing names, but most of the things
that make comparing names hard exist with respect to NAMES, and
not arbitrary strings (e.g. "McLean" is very different from
substr("358dsMcLeannbv35d",5,6)). That is not something that
attaching metadata to a string is likely to resolve.
 2. There is no universal interchange rule-set (that I have ever
heard of) for operating on sequences of characters with respect
to two or more different languages at once, you have to pick a
language's (or culture's) rules to use, otherwise you are
comparing (or operating on) apples and oranges.
 3. In any given comparison type operation, one side's rules will
have to become dominant for that operation. Woefully, you have
no realistic way to decide this at run-time (e.g. because going
with LHS-wins would result in sorts potentially getting C<($a
cmp $b) == 1> and C<($b cmp $a) == 1>, which can result in
infinite sort times).
 4. Given 1..3, you will probably have to implement some kind of
language context system (in most languages, this is handled by
locale) at some point, and it may need to take priority over the
language property of the strings that it operates on in certain
cases.
 5. Given 4, all unary operators become, for example,
{
set_current_locale($s.language);
uc($s.data)
}
Which is, after all what most languages do anyway, but they keep
that language information as a piece of global state. Allowing
just for lexical scoping of such things would be very nice.
 6. Separate from 1..5, language is an interesting property to
associate with strings, but so are a vast number of other
properties. Why are all of them second class citizens WRT
parrot, but not language? Why not build a class one level of
abstraction above raw strings which can bear arbitrary
properties?
 7. Which programming language does Parrot wish to host which
requires unique language tagging of all string data? Would this
perhaps be better left for a 2.0 feature, once the needs of the
client languages are better understood?
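Point 3 can be made concrete: under an LHS-wins rule, swapping the operands swaps which language's rules apply, and the comparator can violate antisymmetry. The language names and rank tables below are invented purely for illustration:

```python
# Invented per-language collation tables: "blah" and "bletch" rank the
# two names oppositely.  Under an LHS-wins rule the comparator becomes
# inconsistent: both orderings claim the left operand is greater.
RANK = {
    "blah":   {"james": 0, "jim": 1},
    "bletch": {"james": 1, "jim": 0},
}

def lhs_wins_cmp(a, b):
    # a and b are (text, language) pairs; the left side's rules win
    table = RANK[a[1]]
    ra, rb = table[a[0]], table[b[0]]
    return (ra > rb) - (ra < rb)

a = ("james", "bletch")
b = ("jim", "blah")
print(lhs_wins_cmp(a, b))   # 1
print(lhs_wins_cmp(b, a))   # 1 -- antisymmetry violated; sort can't terminate sensibly
```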

Ok, that's my piece. Thanks for taking the time. I'll be over here
watching now.

 easier and less controversial things (like, say, threads or events) first.

Hah! That's rich!

-- 
Aaron Sherman [EMAIL PROTECTED]
Senior Systems Engineer and Toolsmith
It's the sound of a satellite saying, 'get me down!' -Shriekback





Re: Plans for string processing

2004-04-14 Thread Michael Scott
On 13 Apr 2004, at 23:43, Dan Sugalski wrote:

I've been assuming it's a left-side wins, as you're tacking onto an 
existing string, so you'd get English in all cases. Alternately you 
could get an exception. The end result of a mixed-language operation 
could certainly be the Dunno language or the current default--both'd 
be reasonable.

Would I be right in thinking that *language* in the context of Parrot 
strings is not necessarily an accurate description of the actual 
language of the string, but rather a means of specifying a particular 
set of idiosyncratic behavior normally associated with an actual 
language?

An English string continues to behave in an English way regardless of 
what I append to or insert into it.

Is there ever a situation where the contents of the appended/inserted 
strings are altered because of the change in *language*? In other 
words, are there any *language* (as distinct from character set) 
transforms? And, can new *languages* be defined?

For example, will there be a way to define a *language* toetsch where 
'ro' becomes '0r' in 'b0rken', and 'see' becomes 's.'?

Mike



Re: Plans for string processing

2004-04-14 Thread Larry Wall
On Wed, Apr 14, 2004 at 01:39:17PM +0200, Michael Scott wrote:
: 
: On 13 Apr 2004, at 23:43, Dan Sugalski wrote:
: 
: I've been assuming it's a left-side wins, as you're tacking onto an 
: existing string, so you'd get English in all cases. Alternately you 
: could get an exception. The end result of a mixed-language operation 
: could certainly be the Dunno language or the current default--both'd 
: be reasonable.
: 
: 
: Would I be right in thinking that *language* in the context of Parrot 
: strings is not necessarily an accurate description of the actual 
: language of the string, but rather a means of specifying a particular 
: set of idiosyncratic behavior normally associated with an actual 
: language?
: 
: An English string continues to behave in an English way regardless of 
: what I append to or insert into it.
: 
: Is there ever a situation where the contents of the appended/inserted 
: strings are altered because of the change in *language*? In other 
: words, are there any *language* (as distinct from character set) 
: transforms? And, can new *languages* be defined?
: 
: For example, will there be a way to define a *language* toetsch where 
: 'ro' becomes '0r' in 'b0rken', and 'see' becomes 's.'?

I think the idea of tagging complete strings with language is not
terribly useful.  If it's to be of much use at all, then it should
be generalized to a metaproperty system for applying any property to
any range of characters within a string, such that the properties
float along with the characters they modify.  The whole point of
doing such properties is to be able to ignore them most of the time,
and then later, after you've constructed your entire XML document,
you can say, "Oh, by the way, does this character have the toetsch
property?"  There's no point in tagging text with language if 99%
of it gets turned into "Dunno", or "English, but not really".

Larry


Re: Plans for string processing

2004-04-14 Thread Dan Sugalski
At 1:39 PM +0200 4/14/04, Michael Scott wrote:
On 13 Apr 2004, at 23:43, Dan Sugalski wrote:

I've been assuming it's a left-side wins, as you're tacking onto an 
existing string, so you'd get English in all cases. Alternately you 
could get an exception. The end result of a mixed-language 
operation could certainly be the Dunno language or the current 
default--both'd be reasonable.

Would I be right in thinking that *language* in the context of 
Parrot strings is not necessarily an accurate description of the 
actual language of the string, but rather a means of specifying a 
particular set of idiosyncratic behavior normally associated with an 
actual language?
Basically, yes.

Is there ever a situation where the contents of the 
appended/inserted strings are altered because of the change in 
*language*? In other words, are there any *language* (as distinct 
from character set) transforms? And, can new *languages* be defined?
New language code could certainly be defined, yes. I'm not sure you'd 
see too many explicit transforms from one to another past some sort 
of initial classification.

For example, will there be a way to define a *language* toetsch 
where 'ro' becomes '0r' in 'b0rken', and 'see' becomes 's.'?
Probably not, no, unless you really wanted to mangle the 
upcase/downcase/titlecase transformations.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Plans for string processing

2004-04-14 Thread Aaron Sherman
On Tue, 2004-04-13 at 18:23, Leopold Toetsch wrote:
 Aaron Sherman [EMAIL PROTECTED] wrote:
  For example, in Perl5/Ponie:
 
  @names = <NAMES>;
  print "Phone Book: ", sort(@names), "\n";
 
  In this example, I don't see why I would care that NAMES might be a
  pseudo-handle that iterates over several databases, and returns strings
  in the 7 different languages
 
 I already did show an example where uc('i') isn't 'I'. Collating is still
 more complex than a »simple« uc().

Correct. I agree, and I don't think anything I said contradicted that,
did it?

 Well, we don't know what the caller expects. The caller has to decide.
 There are basically at least two ways: Treat all strings language
 independent (from their origin) or append more information to each
 string.

Hmmm... or the third, and far more common approach in all languages that
I've seen that deal with these issues: deal with the comparison
according to the rules set out by the language in which the comparison
is being done. Why is that option being passed over? Is it considered to
be, in some way, identical to ignoring language distinctions? How?

  *) Provides language-sensitive character overrides ('ll' treated as a
  single character, for example, in Spanish if that's still desired)
  *) Provides language-sensitive grouping overrides.
 
  Ah, and here we come to my biggest point of confusion.
 
 Another example:
 
  "my dog Fiffi" eq "my dog Fi\x{fb03}"
 
 When my program is doing typographical computations, the above equation is
 true. And useful. The characters f, f, i are going to be printed.
 But the ligature ffi takes less space when printed as such.
 This is the same character string, though, when I'm a reader of this dog
 newspaper.

Ok, so here you essentially say, in the typographical context this
statement has one result, in a string data context it has another.

So, why is that:

"my dog Fiffi":language(blah) eq "my dog Fi\x{fb03}":language(blah)

and not

use language blah;
"my dog Fiffi" eq "my dog Fi\x{fb03}"

and what in Parrot's name does

"james":language(blah) eq "jim":language(bletch)

mean? Should blah's language rules (in which james and jim are the
same name) or bletch's language rules (in which they are not) take
priority? The comparison of two different languages would have to be
done in a third context of culture (e.g. culture foo holds that
blah's rules for names are used and bletch's rules for everything else
are used except when a word in bletch was derived from a word used in
blah during the third invasion and swap meet of 1233).

Then, of course, we can get into how I feel about my program telling me
(in any context) that ffi and \x{fb03} are the same for any number
of reasons, not the least of which is that I consider such
representations to be markup, not text... but that's just me, and
perhaps I'll just have to put use language 'ignorant American geek' at
the start of all of my programs ;)
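Incidentally, the "typographical context" equality leo describes is exactly what Unicode compatibility normalization provides: NFKC expands U+FB03 LATIN SMALL LIGATURE FFI to the three characters "ffi". A quick check in Python:

```python
import unicodedata

s1 = "my dog Fiffi"
s2 = "my dog Fi\ufb03"   # U+FB03 LATIN SMALL LIGATURE FFI

print(s1 == s2)   # False: distinct at the codepoint level

# NFKC compatibility normalization expands the ligature to "ffi",
# giving the "same text, different typography" reading of equality:
print(unicodedata.normalize("NFKC", s2) == s1)   # True
```

Which of the two answers is "the" answer is precisely the context question being argued here.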

  b) Strings will have different languages and behave according to their
  sources regardless of the native rules of the user.

Again, I have never seen any source of information that suggests that
there is a universally known way to implement the above. Don't get me
started on the impact of going to southeast Asia and suggesting that
ok, one of your language rules have to win when comparing characters of
differing languages... ha! IMHO, the only thing that CAN be done at
such a low level as Parrot is to do the work according to the language
rules that govern the rest of this execution of the program, and if a
string makes no sense in that context, an exception is thrown.

But otherwise, how do you sort \x{6728} in Japanese vs Mandarin Chinese?
The two languages have different answers, and you HAVE to pick one.

  IW: Mush together (either concatenate or substr replacement) two
  strings of different languages but same charset
 
  According to whose rules?
 
 User level - what do you want to achieve. At codepoint level the
 operation is fine. It doesn't make sense above that, though.

So, you seem to be suggesting that a single language (that of the user,
not the 2+ involved if you tag every string) should decide? If so, why
have strings tagged with language?

  This means that someone's rules must become dominant,
 
 It doesn't make much sense to do
 
bors S0, S1   # stringwise bit not
 
 to anything that isn't singlebyte encoded. It depends.

Sorry, you lost me. Did I bring that up? I was asking if:

$a cmp $b

would have a result in which $b was considered with respect to $a's
language or visa versa. Most commonly (always?) there is an incomplete
intersection of rules between the two, so someone's rules will have to
win. So you have choices:

  * If you go with LHS vs. RHS, then sort gets borked because sort
will reverse the sides repeatedly as it executes. This can and
would result in infinite sort times.
  * If you come up with a list of languages in descending order 

Re: Plans for string processing

2004-04-13 Thread Jarkko Hietaniemi
Matt Fowles wrote:

 Dan~
 
 I know that you are not technically required to defend your position, 
 but I would like an explanation of one part of this plan.
 
 Dan Sugalski wrote:
 
4) We will *not* use ICU for core functions. (string to number or number 
to string conversions, for example)
 
 
 Why not?  It seems like we would just be reinventing a rather large 
 wheel here.

Without having looked at what ICU supplies in this department I would
guess it's simply because of the overhead.  atoi() is probably quite a
bit faster than pulling in the full support for TIBETAN HALF THREE.

(Though to be honest I think Parrot shouldn't trust atoi() or any
of those guys: Perl 5 has taught us not to put too much trust in them.
Perl 5 these days parses all the integer formats itself.)
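The distrust of atoi() is easy to demonstrate: it silently ignores trailing junk and returns 0 on failure, with no way to report an error. A Python sketch of that behavior next to a strict parser (the helper name is invented):

```python
import re

def lenient_atoi(s):
    """Roughly what C's atoi() does: parse an optional leading integer,
    silently drop trailing junk, and return 0 when nothing parses --
    with no way to report an error."""
    m = re.match(r"\s*[-+]?\d+", s)
    return int(m.group()) if m else 0

print(lenient_atoi("42abc"))   # 42 -- trailing junk silently dropped
print(lenient_atoi("abc"))     # 0  -- indistinguishable from parsing "0"

# A strict parser refuses the same input outright:
try:
    int("42abc")
except ValueError:
    print("strict parse failed")
```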



Re: Plans for string processing

2004-04-13 Thread Dan Sugalski
At 10:42 AM +0300 4/13/04, Jarkko Hietaniemi wrote:
Matt Fowles wrote:

 Dan~

 I know that you are not technically required to defend your position,
 but I would like an explanation of one part of this plan.
 Dan Sugalski wrote:

4) We will *not* use ICU for core functions. (string to number or number
to string conversions, for example)


 Why not?  It seems like we would just be reinventing a rather large
 wheel here.
Without having looked at what ICU supplies in this department I would
guess it's simply because of the overhead.  atoi() is probably quite a
bit faster than pulling in the full support for TIBETAN HALF THREE.
(Though to be honest I think Parrot shouldn't trust atoi() or any
of those guys: Perl 5 has taught us not to put too much trust in them.
Perl 5 these days parses all the integer formats itself.)
That's part of it, yep--if we want it done the way we want it, we'll 
need to do it ourselves, and it'll likely be significantly faster.

Also, there's the issue of not requiring ICU, which makes it 
difficult to do string conversion if it isn't there... :)
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Plans for string processing

2004-04-13 Thread Aaron Sherman
Ok, I'm still lost on the language thing. I'm not arguing, I just don't
get it, and I feel that if I'm going to do some of the things that I
want to for Perl 6, I'm going to have to get it.

On Mon, 2004-04-12 at 11:43, Dan Sugalski wrote:

 Language
 
 *) Provides language-sensitive manipulation of characters (case mangling)
 *) Provides language-sensitive comparisons

Those two things do not seem to me to need language-specific strings at
all. They certainly need to understand the language in which they are
operating (avoiding the use of the word locale here, as per Larry's
concerns), but why does the language of origin of the string matter?

For example, in Perl5/Ponie:

@names = <NAMES>;
print "Phone Book: ", sort(@names), "\n";

In this example, I don't see why I would care that NAMES might be a
pseudo-handle that iterates over several databases, and returns strings
in the 7 different languages that those databases happen to contain. I
want my Phone Book sorted in a way that is appropriate to the language
of my phone book, with whatever special-case rules MY language has for
sorting funky foreign letters (and that might mean that even though a
comparison of two strings is POSSIBLE, in the current language it might
yield an exception, e.g. because Chinese and Japanese share a great many
characters that can be roughly converted, but neither have meaning in my
American English comparison).
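The "caller's rules" approach Aaron describes is essentially what POSIX-style collation gives you: one set of rules, the user's, applied uniformly to every string in the sort. A sketch in Python (the exact ordering depends on the system's configured locale, so none is asserted here):

```python
import locale

# Collation follows the *caller's* current LC_COLLATE setting -- one
# set of rules for the whole phone book, regardless of which database
# or language each name originally came from.  The empty string means
# "the user's own configured locale".
locale.setlocale(locale.LC_COLLATE, "")

names = ["Zimmer", "Adler", "Ärger", "McLean"]
phone_book = sorted(names, key=locale.strxfrm)
print(phone_book)
```

Note this is exactly the global-state design Dan objects to later in the thread: the language context lives in the process, not in the strings.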

More generally, an operation performed on a string (be it read
(comparison) or write (upcase, etc)) should be done in the way that the
*caller* expects, regardless of what legacy source the string came from
(I daren't even guess where that string that I got over a Parrot-enabled
CORBA might have been fetched from or if the language is still used
since it was stored in a cache somewhere 200 years ago, and it damn well
better not affect my sorting, no?)

Ok, so that's my take... what am I missing?

 *) Provides language-sensitive character overrides ('ll' treated as a 
 single character, for example, in Spanish if that's still desired)
 *) Provides language-sensitive grouping overrides.

Ah, and here we come to my biggest point of confusion.

You describe logic that surrounds a given language, but you'll never
need cmp to know how to compare Spanish ll to English ll, for
example. In fact, that doesn't even make sense to me. What you will need
is for cmp to know the Spanish comparison rules so that when it gets two
strings to compare, and it is asked to do so in Spanish, the proper
thing will happen.

I guess this boils down to two choices:

a) All strings will have the user's language by default

or

b) Strings will have different languages and behave according to their
sources regardless of the native rules of the user.

b seems to me to yield very surprising results, and not at all justify
the baggage placed inside a string. If I can be forgiven for saying so,
it's even close to Perl 4's $[, which allowed you to change the
semantics of arrays, only here, you're doing it as a property on a
string so that I can't trust that any string will behave the way I
expect unless I untaint it.

Again, I'm asking for corrections here.

 IW: Mush together (either concatenate or substr replacement) two 
 strings of different languages but same charset

According to whose rules? Does it make sense to merge an American
English string with a Japanese string unless you have a target language?

This means that someone's rules must become dominant, and as a
programmer, I'm expecting that to be neither string a nor string b, but
the user's. If the user happens to be Portuguese, then I would expect
that some kind of exception is going to emerge, but if the user is
Japanese, then it makes sense, and American English can be treated as
romaji, and an exception thrown if non-romaji ascii characters are used.
Again, this is not something that the STRING can really have much of a
clue about. It's all context.

What is the reason for every string value carrying around such context?
Certainly numbers don't carry around their base as context, and yet
that's critical when converting to a string!

-- 
Aaron Sherman [EMAIL PROTECTED]
Senior Systems Engineer and Toolsmith
It's the sound of a satellite saying, 'get me down!' -Shriekback




Re: Plans for string processing

2004-04-13 Thread Dan Sugalski
At 1:55 PM -0400 4/13/04, Aaron Sherman wrote:
Ok, I'm still lost on the language thing. I'm not arguing, I just don't
get it, and I feel that if I'm going to do some of the things that I
want to for Perl 6, I'm going to have to get it.
On Mon, 2004-04-12 at 11:43, Dan Sugalski wrote:

 Language
 
 *) Provides language-sensitive manipulation of characters (case mangling)
 *) Provides language-sensitive comparisons
Those two things do not seem to me to need language-specific strings at
all. They certainly need to understand the language in which they are
operating (avoiding the use of the word locale here, as per Larry's
concerns), but why does the language of origin of the string matter?
Because the way a string is upcased/downcased/titlecased depends on 
the language the string came from. The treatment of accents and a 
number of specific character sequences depends on the language the 
string came from. Ignore it and, well, you're going to find that 
you're messing up the display of someone's name. That strikes me as 
rather rude.

You also don't always have the means of determining what's right. 
It's particularly true of library code.
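The classic instance of Dan's point is Turkish casing: language-blind Unicode upcasing maps "i" to "I", but Turkish distinguishes dotted and dotless i, so Turkish text wants "i" upcased to İ (U+0130). A hedged sketch of a language-aware upcase, with a hand-rolled override table (a real implementation would consult Unicode's SpecialCasing data):

```python
# Invented per-language override table, illustrating why upcasing
# needs to know the language a string came from.
TURKISH_UPPER = {"i": "\u0130",   # i -> İ (LATIN CAPITAL LETTER I WITH DOT ABOVE)
                 "\u0131": "I"}   # ı -> I

def upcase(s, language=None):
    if language == "tr":
        s = "".join(TURKISH_UPPER.get(c, c) for c in s)
    return s.upper()

print(upcase("istanbul"))         # ISTANBUL -- language-blind default
print(upcase("istanbul", "tr"))   # İSTANBUL -- language-aware result
```

Ignore the tag and, as Dan says, you mess up the display of someone's name.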

For example, in Perl5/Ponie:

@names = <NAMES>;
print "Phone Book: ", sort(@names), "\n";
In this example, I don't see why I would care that NAMES might be a
pseudo-handle that iterates over several databases, and returns strings
in the 7 different languages that those databases happen to contain.
Then *you* don't. That's fine. Why, though, do you assume that 
*nobody* will? That's the point.

You may decide that all strings shall be treated as if they were in 
character set X, and language Y, whatever that is. Fine. You may 
decide that the language you're designing will treat all strings as 
if they're in character set X and language Y. That's fine too. Parrot 
must support the capability of forcing the decision, and we will.

What I don't want to do is *force* uniformity. Some of us do care. If 
we do it the way I want, then we can ultimately both do what we want. 
If we do it the way you want, though, we can't--I'm screwed since the 
data is just not there and can't *be* there.

We've tried the whole monoculture thing before. That didn't work with 
ASCII, EBCDIC, any of the Latin-x, ISO-whatever, and it's not working 
for a lot of folks with Unicode. (Granted, only a couple of billion, 
so it's not *that* big a deal...) We've also tried the whole global 
setting thing, and if you think that worked I dare you to walk up to 
Jarkko and whisper "Locale" in his ear.

If you want to force a simplified view of things as either an app 
programmer or language designer, well, great. I am OK with that. More 
than OK, really, and I do understand the desire. What I'm not OK with 
is mandating that simplified view on everyone.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Plans for string processing

2004-04-13 Thread Brent 'Dax' Royal-Gordon
Dan Sugalski wrote:
 1) Parrot will *not* require Unicode. Period. Ever.
My old 8MB Visor Prism thanks you.

 *) Transform stream of bytes to and from a set of 32-bit integers
 *) Manages byte buffer (so buffer positioning and manipulation by code
 point offset is handled here)
What's wrong with, *as an internal optimization only*, storing the
string in the more efficient-to-access format of the patch?  I mean,
yeah, you don't want it to be externally visible, but if you're going to
treat a string as a series of ints, why not store it that way?
I really see no reason to store strings as UTF-{8,16,32} and waste CPU
cycles on decoding it when we can do a lossless conversion to a format
that's both more compact (in the most common cases) and faster.
--
Brent Dax Royal-Gordon [EMAIL PROTECTED]
Perl and Parrot hacker
Oceania has always been at war with Eastasia.


Re: Plans for string processing

2004-04-13 Thread Dan Sugalski
At 12:44 PM -0700 4/13/04, Brent 'Dax' Royal-Gordon wrote:
Dan Sugalski wrote:
1) Parrot will *not* require Unicode. Period. Ever.
My old 8MB Visor Prism thanks you.
:) As does my gameboy.

*) Transform stream of bytes to and from a set of 32-bit integers
*) Manages byte buffer (so buffer positioning and manipulation by 
code point offset is handled here)
What's wrong with, *as an internal optimization only*, storing the 
string in the more efficient-to-access format of the patch?  I mean, 
yeah, you don't want it to be externally visible, but if you're 
going to treat a string as a series of ints, why not store it that 
way?

I really see no reason to store strings as UTF-{8,16,32} and waste 
CPU cycles on decoding it when we can do a lossless conversion to a 
format that's both more compact (in the most common cases) and 
faster.
Erm... UTF-32 is a fixed-width encoding. (That Unicode is inherently 
a variable-width character set is a separate issue, though given the 
scope of the project a correct decision) I'm fine with leaving ICU to 
store unicode data internally any damn way it wants, though--partly 
because the IBM folks are Darned Clever and I trust their judgement, 
and partly because it means we don't have to write all the code to 
properly handle Unicode.

Other variable-width encodings will likely be stored internally as 
fixed-width buffers, at least once the data gets manipulated some. 
Assuming I'm not convinced that Unicode is the true way to go... :)
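The fixed-width versus variable-width distinction Dan draws is easy to check concretely (Python's codecs used purely for illustration):

```python
s = "Fi\ufb03"   # three code points; the ligature is outside Latin-1

# UTF-32 is fixed width: every code point takes exactly four bytes,
# so positioning by code point offset is simple arithmetic.
assert len(s.encode("utf-32-be")) == 4 * len(s)

# UTF-8 is variable width: ASCII takes one byte each, U+FB03 takes three.
assert len(s.encode("utf-8")) == 1 + 1 + 3

# UTF-16 is also variable width once you leave the BMP:
# one code point, but a surrogate pair -- two 16-bit code units.
assert len("\U0001D11E".encode("utf-16-be")) == 4
```

This is why manipulated data tends to get promoted to a fixed-width buffer internally: offset math stays O(1).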
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Plans for string processing

2004-04-13 Thread Michael Scott
On 12 Apr 2004, at 17:43, Dan Sugalski wrote:

IW: Mush together (either concatenate or substr replacement) two 
strings of different languages but same charset
TP: Checks to see if that's allowed. If not, an exception is thrown. 
If so, we do the operation. If one string is manipulated the language 
stays whatever that string was. If a new string is created either the 
left side wins or the default language is used, depending on the 
interpreter setting.

Does that mean that a Parrot string will always have a specific 
language associated with it?

Mike



Re: Plans for string processing

2004-04-13 Thread Dan Sugalski
At 10:44 PM +0200 4/13/04, Michael Scott wrote:
On 12 Apr 2004, at 17:43, Dan Sugalski wrote:

IW: Mush together (either concatenate or substr replacement) two 
strings of different languages but same charset
TP: Checks to see if that's allowed. If not, an exception is 
thrown. If so, we do the operation. If one string is manipulated 
the language stays whatever that string was. If a new string is 
created either the left side wins or the default language is used, 
depending on the interpreter setting.

Does that mean that a Parrot string will always have a specific 
language associated with it?
Yes.

Note that the language might be Dunno. :) There'll be a default 
that's assigned to input data and suchlike things, and the language 
markers in the strings can be overridden by code.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Plans for string processing

2004-04-13 Thread Michael Scott
On 13 Apr 2004, at 22:48, Dan Sugalski wrote:

Note that the language might be Dunno. :) There'll be a default 
that's assigned to input data and suchlike things, and the language 
markers in the strings can be overridden by code.

Would this be right?

English + English = English
English + Chinese = Dunno
English + Dunno = Dunno
+ being symmetric.

How does a Dunno string know how to change case?

Mike



Re: Plans for string processing

2004-04-13 Thread Dan Sugalski
At 11:28 PM +0200 4/13/04, Michael Scott wrote:
On 13 Apr 2004, at 22:48, Dan Sugalski wrote:

Note that the language might be Dunno. :) There'll be a default 
that's assigned to input data and suchlike things, and the language 
markers in the strings can be overridden by code.

Would this be right?

English + English = English
English + Chinese = Dunno
English + Dunno = Dunno
+ being symmetric.
I've been assuming it's a left-side wins, as you're tacking onto an 
existing string, so you'd get English in all cases. Alternately you 
could get an exception. The end result of a mixed-language operation 
could certainly be the Dunno language or the current default--both'd 
be reasonable.

How does a Dunno string know how to change case?
It uses the defaults provided by the character set.
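
A hypothetical sketch of that fallback, in Python rather than anything Parrot actually ships: language overrides are consulted first, and a string tagged Dunno simply falls through to the character set's default casing. The override table and function names are invented for illustration.

```python
# Hypothetical: per-language case overrides with a charset-default fallback.
LANG_UPPER_OVERRIDES = {
    # Turkish/Azeri: dotted and dotless "i" case differently.
    "tr": {"i": "\u0130", "\u0131": "I"},   # i -> İ, ı -> I
}

def upcase(s: str, lang: str = "Dunno") -> str:
    overrides = LANG_UPPER_OVERRIDES.get(lang, {})
    # Per-character language override first, then the charset default.
    return "".join(overrides.get(ch, ch.upper()) for ch in s)

print(upcase("istanbul"))        # Dunno -> charset default: ISTANBUL
print(upcase("istanbul", "tr"))  # Turkish override: İSTANBUL
```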
--
Dan
--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Plans for string processing

2004-04-13 Thread Leopold Toetsch
Brent 'Dax' Royal-Gordon [EMAIL PROTECTED] wrote:

 I really see no reason to store strings as UTF-{8,16,32} and waste CPU
 cycles on decoding it when we can do a lossless conversion to a format
 that's both more compact (in the most common cases) and faster.

The default format now isn't UTF-8. It's a series of fixed-sized entries
of either uint_8, uint_16, or uint_32. These reflect the most common
encodings, which are: char*, UCS-2, and UCS-4/UTF-32 (or possibly other
32-bit encodings). This should cover common cases.

No cycles are wasted for storing straight encodings.
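
Leo's width selection can be sketched in a few lines of Python (illustrative only; the real code is C): scan for the widest code point and pick the narrowest buffer that holds every entry losslessly.

```python
# Illustrative sketch, not Parrot source: pick the narrowest fixed-width
# entry size (1, 2, or 4 bytes per code point) for lossless storage.
def buffer_width(s: str) -> int:
    m = max(map(ord, s), default=0)
    if m < 0x100:
        return 1          # fits uint_8 (char* / Latin-1 range)
    if m < 0x10000:
        return 2          # fits uint_16 (UCS-2 range)
    return 4              # uint_32 (UCS-4/UTF-32)

print(buffer_width("hello"))         # 1
print(buffer_width("h\u00e9llo"))    # 1 (é is U+00E9, still one byte)
print(buffer_width("\u65e5\u672c"))  # 2 (BMP CJK)
print(buffer_width("\U0001d11e"))    # 4 (above the BMP)
```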

leo


Re: Plans for string processing

2004-04-13 Thread Leopold Toetsch
Aaron Sherman [EMAIL PROTECTED] wrote:
 For example, in Perl5/Ponie:

 @names=NAMES;
 print Phone Book: , sort(@names), \n;

 In this example, I don't see why I would care that NAMES might be a
 pseudo-handle that iterates over several databases, and returns strings
 in the 7 different languages

I already did show an example where uc("i") isn't "I". Collating is still
more complex than a »simple« uc().

 More generally, an operation performed on a string (be it read
 (comparison) or write (upcase, etc)) should be done in the way that the
 *caller* expects,

Well, we don't know what the caller expects. The caller has to decide.
There are basically at least two ways: Treat all strings language
independent (from their origin) or append more information to each
string.

 *) Provides language-sensitive character overrides ('ll' treated as a
 single character, for example, in Spanish if that's still desired)
 *) Provides language-sensitive grouping overrides.

 Ah, and here we come to my biggest point of confusion.

Another example:

 my dog Fiffi eq my dog Fi\x{fb03}

When my program is doing typographical computations, the above equation is
true. And useful. The characters f, f, i are going to be printed,
but the ligature ffi takes less space when printed as such.
This is the same character string, though, when I'm a reader of this dog
newspaper.

When I do an analysis of counting fs in dog names, I don't care if
it's written in one of these forms, it should be the same - or when I
search for ffi in the text.

It just depends who's using these features in which context.
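
Python's unicodedata makes the two readings concrete: the typographer's comparison is a plain code-point comparison, while the reader's (and the f-counter's) comparison applies compatibility normalization first.

```python
import unicodedata

plain = "my dog Fiffi"
ligature = "my dog Fi\ufb03"   # U+FB03 LATIN SMALL LIGATURE FFI

# At the code-point level the strings differ:
print(plain == ligature)                                 # False

# Under compatibility normalization (NFKC) the ligature unfolds to f+f+i:
print(unicodedata.normalize("NFKC", ligature) == plain)  # True
```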

 I guess this boils down to two choices:

 a) All strings will have the user's language by default

 or

 b) Strings will have different languages and behave according to their
 sources regardless of the native rules of the user.

and/or either the string's or the user's default comes in, depending on the
desired action.

 IW: Mush together (either concatenate or substr replacement) two
 strings of different languages but same charset

 According to whose rules?

User level - what do you want to achieve. At codepoint level the
operation is fine. It doesn't make sense above that, though.

 This means that someone's rules must become dominant,

It doesn't make much sense to do

   bors S0, S1   # stringwise bit not

to anything that isn't single-byte encoded. It depends.

The rules - how and when they apply - still have to be laid out.

leo


Re: Plans for string processing

2004-04-13 Thread Aaron Sherman

Thanks for your response. I'm not sure that you and I are speaking about
exactly the same things, since you state that the logical extensions, if
not outright goals, of an alternate approach would be an exclusionary
monoculture. I'm not sure that's quite right.

On Tue, 2004-04-13 at 15:06, Dan Sugalski wrote:

   *) Provides language-sensitive manipulation of characters (case mangling)
   *) Provides language-sensitive comparisons
 
 Those two things do not seem to me to need language-specific strings at
 all. They certainly need to understand the language in which they are
 operating (avoiding the use of the word locale here, as per Larry's
 concerns), but why does the language of origin of the string matter?
 
 Because the way a string is upcased/downcased/titlecased depends on 
 the language the string came from. The treatment of accents and a 
 number of specific character sequences depends on the language the 
 string came from.

 Ignore it and, well, you're going to find that 
 you're messing up the display of someone's name. That strikes me as 
 rather rude.

For proper names, you may have a point (though the ordering of names in
a phone book, for example, is often according to the language of the
book, not the origin of the names), and in some forms of string
processing, that kind of deference to the origin of a word may turn out
to be useful. I do get that much.

What I'm not getting is

  * Why do we assume that the language property of a string will be
the language from which the word correctly originates rather
than the locale of the database / web site / file server /
whatever that we received it from? That could actually result in
dealing with native words according to the rules of foreign
languages, and boy-howdy is that going to be fun to debug.
  * Why is it so valuable as to attach a value to every string ever
created for it rather than creating an abstraction at a higher
level (e.g. a class)
  * Why wouldn't you do the same thing for MIME type, as strings may
also (and perhaps more often) contain data which is more
appropriately tagged that way? The SpamAssassin guys would love
you for this!

 What I don't want to do is *force* uniformity. Some of us do care.

Hey, that's a bit of a low blow. I care quite a bit, or I would not ask.
I'm not saying that the guy who wants to sort names according to their
source language is wrong, I'm saying that he doesn't need core support
in Parrot to do it, so I'm curious why it's in there.

 We've tried the whole monoculture thing before.

I just don't think that moving language up a layer or two of abstraction
enforces a monoculture... again, I'm willing to see the light if someone
can explain it.

A lot of your response is about enforcing, and I'm not sure how I gave
the impression of this being an enforcement issue (or perhaps you think
that non-localization is something that needs to be enforced?) I just
can't see how every string needs to carry around this kind of
world-view-altering context when 99% of programs that use string data
(even those that use mixed encodings) won't want to apply said context,
but rather perform all operations according to their locale. Am I wrong
about that?

One thing that was not answered, though is what happens in terms of
dominance. When sorting French and Norwegian Unicode strings, who loses
(wins?) when you try to compare them? Comparing across language
boundaries would be a monumental task, and would be instantly reviled as
wrong by every language purist in the world (to my knowledge no one has
ever published a uniform way to compare two words, much less arbitrary
text, unless you are willing to do so using the rules of one and only
one culture (and I say culture because often the rules of a culture are
mutually incompatible with those of any one source language's strict
rules)). So, if you have to convert in order to compare, whose language
do you do the comparison in? You can't really rely on LHS vs. RHS, since
a sort will reverse these many times (and $a cmp $b had better be
-($b cmp $a) or your sort may never terminate!)
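
That termination requirement holds automatically if both operands are compared under one and the same collation, because the comparison then reduces to comparing keys. A Python sketch, with casefolding standing in for a real locale collation (which this is not):

```python
import functools

# One shared collation key guarantees cmp(a, b) == -cmp(b, a),
# which a comparison sort needs in order to terminate.
def collation_key(s: str) -> str:
    return s.casefold()   # stand-in for a real locale collation

def cmp(a: str, b: str) -> int:
    ka, kb = collation_key(a), collation_key(b)
    return (ka > kb) - (ka < kb)

words = ["Øl", "abc", "Zebra", "ábc"]
assert all(cmp(a, b) == -cmp(b, a) for a in words for b in words)
print(sorted(words, key=functools.cmp_to_key(cmp)))
```

The hard part Aaron identifies remains: *choosing* that single collation when the strings claim different languages is a policy decision, not something the comparison operator can infer.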

-- 
Aaron Sherman [EMAIL PROTECTED]
Senior Systems Engineer and Toolsmith
It's the sound of a satellite saying, 'get me down!' -Shriekback




Plans for string processing

2004-04-12 Thread Dan Sugalski
Okay, I've not dug through all the fallout from the ICU checkin, but 
I can see there's an awful lot. I'll dig through that in a bit, but...

Here's the plan. We've gone over it in the past, but I'm not sure 
everything's been gathered together, so it's time to do so.

Some declarations:

1) Parrot will *not* require Unicode. Period. Ever. (Well, upon 
release, at least) We will strongly recommend it, however, and use it 
if we have it
2) Parrot *will* support multiple encodings (the bytes-to-code-points 
stuff), character sets (code-points-to-meaning, of a sort), and 
language-specific overrides of character set behaviour.
3) All string data can be dealt with as either a series of bytes, 
code points, or characters. (Characters are potentially multiple code 
points--basically combining character stuff from those standards that 
do so)
4) We will *not* use ICU for core functions. (string to number or 
number to string conversions, for example)
5) Parrot will autoconvert strings as needed. If a string can't be 
converted, parrot will throw an exception. This goes for language, 
character set, or encoding.
6) There *may* be an overriding set of rules for throwing conversion 
exceptions. (They may be suppressed on lossy conversions, or required 
for any conversions)
7) There *may* be an overriding language used for language-specific 
operations (case folding or sorting).

I know ICU's got all sorts of nifty features, but bluntly we're not 
going to use most of them.

The original split of encoding, character set, and language is one 
that I want to keep. I know we've lost a good chunk of that with the 
latest ICU patch, but that's only temporary and the breakage is worth 
it to get Unicode actually in use. I expect I need to step up to the 
plate and get an alternate encoding and charset in, so I'll probably 
take a shot at JIS X 0208:1997 or CNS11643-1992. (Or whatever the 
current version of those is)

As far as Parrot is concerned, a string is a series of bytes which 
may, via its encoding, be turned into a series of 32-bit integer code 
points. Those 32-bit integer code points can be turned, via its 
character set, into a series of characters where each character is 
one or more code points. Those characters may be classified and 
transformed based on the language of the string.

The responsibilities of the three layers are:

Encoding

*) Transform stream of bytes to and from a set of 32-bit integers
*) Manages byte buffer (so buffer positioning and manipulation by 
code point offset is handled here)

Character set
=
*) Provides default manipulation and comparison behaviour (sorting 
and case mangling)
*) Provides default character classifications (digit, word char, 
space, punctuation, whatever)
*) Provides code point and character manipulation. (substring 
functionality, basically)
*) Provides integrity features (exceptions if a string would be invalid)

Language

*) Provides language-sensitive manipulation of characters (case mangling)
*) Provides language-sensitive comparisons
*) Provides language-sensitive character overrides ('ll' treated as a 
single character, for example, in Spanish if that's still desired)
*) Provides language-sensitive grouping overrides.
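
A hypothetical Python sketch of how the three layers might divide the work (all class and method names are invented for illustration; Parrot's actual interface is C):

```python
class Latin1Encoding:
    """Encoding layer: stream of bytes <-> 32-bit code points."""
    def decode(self, buf: bytes) -> list[int]:
        return list(buf)          # each byte is one code point
    def encode(self, cps: list[int]) -> bytes:
        return bytes(cps)         # raises ValueError above 255: lossy

class SimpleCharset:
    """Character-set layer: default classification and case mangling."""
    def is_digit(self, cp: int) -> bool:
        return 0x30 <= cp <= 0x39
    def upcase(self, cps: list[int]) -> list[int]:
        return [cp - 32 if 0x61 <= cp <= 0x7A else cp for cp in cps]

class Language:
    """Language layer: overrides, deferring to the charset by default."""
    def __init__(self, charset, overrides=None):
        self.charset, self.overrides = charset, overrides or {}
    def upcase(self, cps: list[int]) -> list[int]:
        return [self.overrides.get(cp, up)
                for cp, up in zip(cps, self.charset.upcase(cps))]

enc, cs = Latin1Encoding(), SimpleCharset()
cps = enc.decode(b"abc123")
print(Language(cs).upcase(cps))   # [65, 66, 67, 49, 50, 51]
```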

Since examples are good, here are a few. They're in an If we/Then 
Parrot format.

IW: Mush together (either concatenate or substr replacement) two 
strings of different languages but same charset
TP: Checks to see if that's allowed. If not, an exception is thrown. 
If so, we do the operation. If one string is manipulated the language 
stays whatever that string was. If a new string is created either the 
left side wins or the default language is used, depending on the 
interpreter setting.

IW: Mush together two strings of different charsets
TP: If the two strings can be losslessly converted to one of the two 
charsets, do so, otherwise transform to Unicode and mush together. If 
transformation is lossy optionally throw an exception (or warning) 
Language rules above still apply.

IW: Force a conversion to a different character set
TP: Does it. An exception or warning may be thrown if the conversion 
is not lossless.
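
The convert-and-maybe-throw behaviour maps naturally onto strict versus lossy codec modes. A Python sketch using its codecs as a stand-in for Parrot's charset converters:

```python
# Sketch of the "convert, throw if lossy" rule: strict mode raises on
# any character the target charset can't represent; non-strict mode
# suppresses the exception and converts lossily.
def convert(s: str, charset: str, strict: bool = True) -> bytes:
    try:
        return s.encode(charset)                 # lossless conversion
    except UnicodeEncodeError:
        if strict:
            raise                                # exception thrown
        return s.encode(charset, "replace")      # lossy, suppressed

print(convert("caf\u00e9", "latin-1"))              # b'caf\xe9'
print(convert("caf\u00e9", "ascii", strict=False))  # b'caf?'
try:
    convert("caf\u00e9", "ascii")
except UnicodeEncodeError:
    print("lossy conversion rejected")
```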

Please note that in most cases parrot deals with string data as 
*strings* in S registers (or hiding behind PMCs) not as integers in I 
registers (even though we treat strings as a series of abstract 
integer code points). This is because even something as simple as 
"give me character 5" may return a series of code points if character 
5 is a combining character sequence. We may (possibly, but possibly not) 
get a bit dirtier for the regex code for speed reasons, but we'll see 
about that.
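
The "give me character 5" point can be illustrated in Python (with `unicodedata.combining` standing in for the charset layer's classification): a character here is a base code point plus its trailing combining marks.

```python
import unicodedata

# Group each base code point with its trailing combining marks, so
# indexing by character can yield more than one code point.
def characters(s: str) -> list[str]:
    chars: list[str] = []
    for cp in s:
        if chars and unicodedata.combining(cp):
            chars[-1] += cp        # attach combining mark to its base
        else:
            chars.append(cp)
    return chars

s = "resume\u0301"                 # final "e" + COMBINING ACUTE ACCENT
chars = characters(s)
print(chars[5])                    # character 5: base + combining mark
print(len(chars[5]))               # 2 -- one character, two code points
```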

Also note that some languages, such as Perl 6, have a more restricted 
view of things. That's fine, but we don't really care much as long as 
everything that they need is provided, so the fact that Larry's 
mandated the Ux levels is fine, but as they're a (possibly 
excessively) restricted subset of what we're going to do 

Re: Plans for string processing

2004-04-12 Thread Michael Scott
Just thought I'd mention that I'm in the process of trying to get 
strings.pod updated to reflect the current state of affairs.

Mike



Re: Plans for string processing

2004-04-12 Thread Matt Fowles
Dan~

I know that you are not technically required to defend your position, 
but I would like an explanation of one part of this plan.

Dan Sugalski wrote:
4) We will *not* use ICU for core functions. (string to number or number 
to string conversions, for example)
Why not?  It seems like we would just be reinventing a rather large 
wheel here.

Matt