Hmm, this evolved into an editorial when I wasn't looking :) was: RE: Inappropriate Proposals FAQ

2002-07-12 Thread Barry Caplan

At 05:13 PM 7/12/2002 -0400, Suzanne M. Topping wrote:
 -Original Message-
 From: Barry Caplan [mailto:[EMAIL PROTECTED]]
 
 At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
 Unicode is a character set. Period. 
 
 Each character has numerous 
 properties in Unicode, whereas they generally don't in legacy 
 character sets.

Each character, or some characters?


For all intents and purposes, each character. Chapter 4.5 of my Unicode 3.0 book says: 
"The Unicode Character Database on the CD-ROM defines a General Category for all 
Unicode characters."

So, each character has at least one attribute. One could easily say that each 
character also has an attribute for isUpperCase of either true or false, and so on.

There are usually no corresponding features in other character sets.
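
To make that concrete, here is a minimal sketch in Python, whose standard 
unicodedata module exposes a slice of the Unicode Character Database (just an 
illustration, nothing the standard itself mandates):

    import unicodedata

    # Every assigned character carries a General Category from the UCD.
    for ch in "Aß٣":
        print(ch, unicodedata.category(ch), unicodedata.name(ch))

    # Boolean-style attributes like isUpperCase fall out of the same data:
    # General Category "Lu" means letter, uppercase.
    print(unicodedata.category("A") == "Lu")   # True
    print(unicodedata.category("a") == "Lu")   # False

A bare legacy code page gives you none of this; the mapping table is all there is.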


 Maybe Unicode is more of a shared set of rules that apply to 
 low-level data structures surrounding text and its algorithms 
 than a character set.

Sounds like the start of a philosophical debate. 

Not really. I have been giving presentations for years, and I have seen many others 
give similar presentations. A common definition of "character set" is a list of the 
characters you are interested in, assigned to code points. That fits most legacy 
character sets pretty well, but Unicode is sooo much more than that.



If Unicode is described as a set of rules, we'll be in a world of hurt.


Yeah, one of the heaviest books I own is Unicode 3.0. I keep it on a low shelf so the 
book of rules describing Unicode doesn't fall on me for just that reason. This is 
earthquake country, after all. :)


I choose to look at this stuff as the exceptions that make the rule.


I don't really know if it is possible to break down Unicode into more fundamental 
units if you started over. Its complexity is inherent in the nature of the task. My 
own interest is more in getting things done with data and algorithms that use the type 
of material represented by the Unicode standard, more so than the arcana of the 
standard itself. So it doesn't bother me so much that there are exceptions - as long 
as we have the exceptions that everyone agrees on, that is fine by me because it means 
my data and at least some of my algorithms are likely to be preservable across systems.


(On a serious note, these exceptions are exactly what make writing some
sort of "is and isn't" FAQ pretty darned hard. 

<humor>
Be careful what you ask for :)
</humor>

I can't very well say
that Unicode manipulates characters given certain historical/legacy
conditions and under duress. 

Why not? It is true.

But what if we took a look at it from a different point of view: that the standard is 
an agreed-upon set of rules and building blocks for text-oriented algorithms? Would 
people start to publish algorithms that extend the base data provided, so we don't 
have to reinvent wheels all the time?
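
Normalization is an existing case of exactly that pattern: the standard publishes the 
character data and an algorithm over it, and nobody has to reinvent that particular 
wheel. A quick sketch in Python, whose unicodedata module wraps the standard 
normalization forms:

    import unicodedata

    # Two different code point sequences for the same text element.
    precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"    # e + COMBINING ACUTE ACCENT

    print(precomposed == decomposed)    # False: raw code point comparison
    print(unicodedata.normalize("NFC", precomposed)
          == unicodedata.normalize("NFC", decomposed))    # True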

I'm just brainstorming here, this is all just coming to me now. 

If I were to stand in front of a college comp sci class, where the future is all ahead 
of the students, what proportion of time would I want to invest in how much they knew 
about legacy encodings versus how much I could inspire them to build from and extend 
what Unicode provides them?

Seriously, most of the folks on this list that I know personally, and I include myself 
in this category, are approaching or past the halfway point in our careers. What would 
we want the folks who are just starting their careers now to know about Unicode and do 
with it by the time they reach the end of theirs, long after we have stopped working?

For many applications, people are not going to specialize in i18n/l10n issues. They 
need to know what the appropriate text-based building blocks are, and how they can 
expand on them while still building whatever they are working on.

Unicode at least hints at this with the bidi algorithm. Moving forward, should other 
algorithms be codified into Unicode, or as separate standards or de facto standards? I 
am thinking of a Japanese word-splitting algorithm. There are proprietary products 
that do this today with reasonable but not perfect results. Are they good enough that 
the rules can be encoded into a standard? If so, then someone would build an open 
implementation, and then there would always be this building block available for 
people to use.
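
To sketch what I mean, here is a toy greedy longest-match segmenter in Python. The 
four-entry dictionary is purely hypothetical; real products add large lexicons plus 
statistics and heuristics for unknown words, which is where all the hard work is:

    # Toy longest-match segmenter; the mini dictionary is hypothetical.
    DICTIONARY = {"私", "は", "学生", "です"}
    MAX_LEN = max(len(w) for w in DICTIONARY)

    def segment(text):
        words, i = [], 0
        while i < len(text):
            # Try the longest dictionary entry starting at position i.
            for length in range(min(MAX_LEN, len(text) - i), 0, -1):
                if text[i:i + length] in DICTIONARY:
                    words.append(text[i:i + length])
                    i += length
                    break
            else:
                # Unknown character: emit it as a token of its own.
                words.append(text[i])
                i += 1
        return words

    print(segment("私は学生です"))   # ['私', 'は', '学生', 'です']

If rules at roughly that level of precision could be agreed on and published, the 
building block would always be there.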

I am sure everyone on this list can think of their own favorite algorithms of this 
type, based on the part of Unicode that interests you the most. My point is that the 
raw information already in Unicode *does* suggest the next level of usage, and the 
repeated newbie questions that inspired this thread suggest the need for a 
comprehensive solution at a higher level than a character set provides. Maybe part of 
this means including or at least facilitating the description of low-level text 
handling algorithms.

If I did, people would be scurrying around
trying to figure out how to foment the duress.)


The accomplishments of the Unicode 

Re: Hmm, this evolved into an editorial when I wasn't looking :) was: RE: Inappropriate Proposals FAQ

2002-07-12 Thread Kenneth Whistler

Barry Caplan wrote:

  At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
  Unicode is a character set. Period. 
  
  Each character has numerous 
  properties in Unicode, whereas they generally don't in legacy 
  character sets.
 
 Each character, or some characters?
 
 
 For all intents and purposes, each character. 
 So, each character has at least one attribute. 

Yes. The implications of the Unicode Character Database include
the determination that the UTC has normatively assigned properties
(multiple) to all Unicode encoded characters.

Actually, it is a little more subtle than that. There are some
properties which accrue to code points. The General Category and
the Bidirectional Category are good examples, since they constitute
enumerated partitions of the entire codespace, and APIs need to 
return meaningful values for any code point, including unassigned ones.
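
Python's unicodedata module, to pick one such API, behaves this way today (a 
sketch, not a normative statement):

    import unicodedata

    # U+0378 is unassigned, yet the General Category still answers:
    # 'Cn', other-not-assigned.
    print(unicodedata.category("\u0378"))       # 'Cn'

    # The Bidirectional Category is another enumerated property.
    print(unicodedata.bidirectional("A"))       # 'L', left-to-right
    print(unicodedata.bidirectional("\u05d0"))  # 'R', HEBREW LETTER ALEF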

Other properties accrue more directly to characters, per se.
They attach to the abstract character, and get associated with
a code point more indirectly by virtue of the encoding of that
character. The numeric value of a character would be a good example
of this. No one expects an unassigned code point or an assigned
dingbat character or a left bracket to have a numeric value property
(except perhaps a future generation of Unicabbalists).
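
In Python terms again (a sketch; that module signals the absence of the 
property with an exception):

    import unicodedata

    # Characters that encode numbers carry a numeric value property.
    print(unicodedata.numeric("\u00bd"))   # 0.5, VULGAR FRACTION ONE HALF
    print(unicodedata.numeric("五"))       # 5.0, CJK numeral five

    # A left bracket has no such property, as expected.
    try:
        unicodedata.numeric("[")
    except ValueError as err:
        print(err)                         # not a numeric character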
 
 There are usually no corresponding features in other character sets.

Correct. Before the development of the Unicode Standard, character
encoding committees tended to leave property assignments
either up to implementations (considering them obvious) or up
to standardization committees whose charter was character
processing -- e.g. SC22/WG15 POSIX in the ISO context.

The development of a Universal character encoding necessitated
changing that, bringing character property development and
standardization under the same roof as character encoding.

Note that not everyone agrees about that, however. We are
still having some rather vigorous disagreements in SC22 about
who owns the problem of standardization of character properties.

 A common definition of "character set" is a list of the 
 characters you are interested in, assigned to code points. That 
 fits most legacy character sets pretty well, but Unicode is sooo 
 much more than that.

Roughly the distinction I was drawing between the Unicode CCS
and the Unicode Standard.

 But what if we took a look at it from a different point of view: 
 that the standard is an agreed-upon set of rules and building 
 blocks for text-oriented algorithms? Would people start to 
 publish algorithms that extend the base data provided, so 
 we don't have to reinvent wheels all the time?

Well, the Unicode Standard isn't that, although it contains
both formal and informal algorithms for accomplishing various
tasks with text, and even more general guidelines for how to
do things.

The members of the Unicode Technical Committee are always
casting about for areas of Unicode implementation behavior
where commonly defined, public algorithms would be mutually
beneficial for everyone's implementations and would assist
general interoperability with Unicode data.

To date, it seems to me that the members, as well as other
participants in the larger effort of implementing the Unicode
Standard, have been rather generous in contributing time
and brainpower to this development of public algorithms. The
fact that ICU is an Open Source development effort is enormously
helpful in this regard.

 If I were to stand in front of a college comp sci class, 
 where the future is all ahead of the students, what proportion 
 of time would I want to invest in how much they knew about legacy 
 encodings versus how much I could inspire them to build from and 
 extend what Unicode provides them?

This problem, of Unicode in the computer science curriculum,
intrigues me -- and I don't think it has received enough attention
on this list.

One of my concerns is that even now it seems to me that CS
curricula not only don't teach enough about Unicode -- they basically
don't teach much about characters, or text handling, or anything
in the field of internationalization. It just isn't an area that
people get Ph.D.'s in or do research in, and it tends to get
overlooked in people's education until they go out, get a job
in industry and discover that in the *real* world of software
development, they have to learn about that stuff to make software
work in real products. (Just like they have to do a lot of
seat-of-the-pants learning about a lot of other topics: building,
maintaining, and bug-fixing for large, legacy systems; software
life cycle; large team cooperative development process;
backwards compatibility -- almost nothing is really built from
scratch!)

 
 The major work ahead is no longer in the context of building 
 a character standard. The time is fast approaching to decide whether to 
 keep it small and apply a bit of polish, or to focus on the use and usage 
 of what is already there in Unicode by those who