RE: No Invisible Character - NBSP at the start of a word

2004-11-29 Thread Jony Rosenne


> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Peter Constable
> Sent: Tuesday, November 30, 2004 1:20 AM
> To: Unicode Mailing List
> Subject: RE: No Invisible Character - NBSP at the start of a word
> 
> 
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
> > Of Jony Rosenne
> 
...

> 
> Jony, where you and I have had a different worldview is that, it seems
> to me, you view characters as encoding language, and I view characters
> as encoding letterforms; or, put another way, for you, text is
> necessarily linguistic, whereas for me text is text, independent of
> linguistic interpretation. To make this concrete, the fact that a qere
> sequence involves the vowel points of word A rather than word B is
> linguistically interesting, but irrelevant as far as encoding is
> concerned. If the displayed letterforms consist of a lamed with two
> vowel points, then the encoded character sequence IMO should be lamed
> with two vowel points -- and I would not consider that a hack. 

When I look at the text, even with a magnifying glass, I do not see a Lamed
with two points. The displayed form, from my point of view, is a Lamed with
a single point and another point without a base character. The Hiriq is not
under the Lamed, it is between the Lamed and the Mem. The linguistic
approach is just the explanation, the displayed letterforms are quite clear.

Even when I look at old Latin manuscripts, which I did once again when I
visited the flea market in Milan a few months ago, they are not plain text
and they cannot be faithfully reproduced in Unicode without markup. Although
the nature of Hebrew manuscripts is different, I do not understand the
desire to make Hebrew different, and I cannot accept it if it makes the
computerized handling of Hebrew unnecessarily more complicated than it is
already.

To make it very clear: The use of CGJ approved by the UTC is fine by me, and
I have no objection to anyone using it, but it is not required for Hebrew,
and we do not have a standard plain text solution for Qere and Ketiv and for
Yerushala(y)im.  Regarding the latter, the UTC discussion was based on a
mistaken or incomplete presentation of the problem. Yes, for those who need two
vowels for a single letter, CGJ would do it, but since this is not my
question, CGJ is not the answer. The hack needed here is an invisible base
character.

If anyone wants to use CGJ or any other Unicode characters that are not
included in the standard Hebrew subset (Unicode does not define subsets, but
other bodies do and implementers necessarily have to) to encode Hebrew
texts, they should do their users a favor and explain to them that they
require specific implementations, operating systems and fonts.

Jony

...

> 
> 
> Peter Constable
> 
> 
> 
> 





Re: Relationship between Unicode and 10646

2004-11-29 Thread Doug Ewell
Peter Kirk  wrote:

> But what happens when a proposal put forward by the UTC is rejected by
> voting members of WG2, which are ISO member bodies worldwide?...
>
> So what does WG2 do? Does it follow its fixed policy of agreeing with
> the UTC despite negative votes? Does "self-abnegation" trump
> democracy? Or is the UTC put in the position that it is forced to
> retract or amend its proposals?

Maybe they sit down and talk about it?

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Keyboard Cursor Keys

2004-11-29 Thread Doug Ewell
Robert Finch wrote:

> I see that there are no Unicode characters assigned for cursor/edit
> keys other than that which were originally in ascii ('return', 'tab',
> 'backspace', 'delete'). Could keys like 'cursor left', 'cursor up',
> 'Home', etc. be incorporated somewhere within the standard ?

These are not in Unicode because they are not related to representing
text, as such.  Unicode is, above all, a standard for encoding
characters that are used to write text.

The only control characters in ASCII that directly relate to text
representation are CR, LF, HT, VT, FF, perhaps BS and DEL, and arguably
BEL.  In the C1 zone, NL is also important in some environments.  ESC
and maybe CSI are needed for implementing ISO 6429-style control
sequences.  And of course, you have to have NUL!  But that's pretty much
it for text.

The entire C0 and C1 repertoires were grandfathered into Unicode for
compatibility, but the only other control codes that have been added to
Unicode deal directly with text representation.  These include
unambiguous line and paragraph separators (intended to replace CR and/or
LF, but not in common use), formatting characters such as joiners and
bidirectional aids, variation selectors, and a few script-specific
things.
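
As a rough illustration of the distinction drawn above, the following sketch asks Python's standard unicodedata module for the general category of one grandfathered C0 control and a few of the text-related characters added by Unicode. The choice of sample characters and the helper name general_categories are assumptions made for this example only.

```python
import unicodedata

# A few code points of the kinds mentioned above. The selection is
# illustrative, not an exhaustive list of Unicode format characters.
SAMPLES = {
    "CARRIAGE RETURN (C0 control)": "\u000D",
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
    "ZERO WIDTH JOINER": "\u200D",
    "RIGHT-TO-LEFT MARK": "\u200F",
    "VARIATION SELECTOR-1": "\uFE00",
}

def general_categories(samples):
    """Map each label to the Unicode general category of its character."""
    return {label: unicodedata.category(ch) for label, ch in samples.items()}

if __name__ == "__main__":
    for label, cat in general_categories(SAMPLES).items():
        print(f"{label}: {cat}")
```

The grandfathered control reports as Cc, while the separators, joiner, and bidirectional mark report as Zl, Zp, and Cf.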

Character-based cursor command languages such as "curses" were clever
and handy in the days of character-based user interfaces, but have
become less popular as GUIs have largely taken over.  The character
model has never really been extended to include mouse movement, clicking
and double-clicking, etc.

If you are working with terminals or other character-only systems, it's
hard to beat the ISO 6429 model.  You can download a free copy of the
equivalent European standard ECMA-48 at:

http://www.ecma-international.org/publications/files/ecma-st/ECMA-048.pdf
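
For anyone who wants to see what the ISO 6429 / ECMA-48 approach looks like in practice, here is a minimal sketch of the cursor control sequences (CUU, CUD, CUF, CUB, CUP) built from ESC and the 7-bit form of CSI. The Python function names are made up for this example; only the final bytes A, B, C, D and H come from the standard.

```python
ESC = "\x1b"
CSI = ESC + "["  # 7-bit representation of CSI (ESC followed by "[")

def cursor_up(n=1):
    """CUU: move the cursor up n lines."""
    return f"{CSI}{n}A"

def cursor_down(n=1):
    """CUD: move the cursor down n lines."""
    return f"{CSI}{n}B"

def cursor_forward(n=1):
    """CUF: move the cursor right n columns."""
    return f"{CSI}{n}C"

def cursor_back(n=1):
    """CUB: move the cursor left n columns."""
    return f"{CSI}{n}D"

def cursor_position(row, col):
    """CUP: move the cursor to a 1-based row and column."""
    return f"{CSI}{row};{col}H"

if __name__ == "__main__":
    # On a terminal that honors these sequences, this prints "X"
    # at row 5, column 10.
    print(cursor_position(5, 10) + "X")
```

Note that these are sequences of ordinary characters interpreted by the terminal, not single character codes for keys, which is the point of the model.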

> I know this probably goes against the ideal that Unicode is simply a
> font (ug wrong word here) mapping. But it would make the standard more
> practically applicable.

No.  Not a font mapping at all.  What you are probably trying to say is
that Unicode deals primarily with visible characters, which is closer to
the truth.

> I'm trying to implement a Unicode keyboard device, and I'd rather have
> keyboard processing dealing with genuine Unicode characters for the
> cursor keys, rather than having to use a mix of keyboard scan codes
> and Unicode characters.

This will quickly spiral out of control as you move past the "easy"
cases like adding character codes for cursor control functions.  What
about Shift and Caps Lock?  That would make text representation
ambiguous; you don't want an 'A' created by pressing the A key while
holding Shift to be different from an 'A' created by pressing A with
Caps Lock enabled.  (How would you represent "enabled"?)

What about the Ctrl and Alt keys (or equivalents in Mac and other
platforms)?  What about Num Lock and Scroll Lock?  F1 through F12 (or
whatever)?  And (dare I mention them?) the Windows and Windows Menu
keys?  This last example shows that the set of possible keyboard keys is
open-ended and subject to manufacturer whims.  Laptops have all sorts of
unusual shifting keys not seen on "conventional" keyboards.

If you want to use characters to accomplish cursor control, you really
should take a look at the ECMA standard mentioned above.

> If there is an extended standard of some kind (eg UTF-16 ?) that
> supports this, could someone please point me to it.

Don't understand the reference to UTF-16 here.  UTF-N, for any value of
N, is a way of representing the character codes of Unicode using
sequences of N-bit units.  None is an "extension" of the Unicode
standard.
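
To make the "sequences of N-bit units" point concrete, this small sketch splits the encoded form of a character into its code units for N = 8, 16 and 32. The helper name code_units is an assumption for the example; the codecs are the ones built into Python.

```python
def code_units(ch, bits):
    """Return the code units (as hex strings) of ch in UTF-8, UTF-16 or UTF-32."""
    codec = {8: "utf-8", 16: "utf-16-be", 32: "utf-32-be"}[bits]
    data = ch.encode(codec)
    step = bits // 8  # bytes per code unit
    return [data[i:i + step].hex() for i in range(0, len(data), step)]

if __name__ == "__main__":
    lamed = "\u05DC"  # HEBREW LETTER LAMED
    for bits in (8, 16, 32):
        print(f"UTF-{bits}:", code_units(lamed, bits))
```

The same character code comes out as two 8-bit units, one 16-bit unit, or one 32-bit unit; the character repertoire is identical in all three cases.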

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Radicals and Ideographs

2004-11-29 Thread Edward H. Trager
On Monday 2004.11.29 16:30:06 -0800, Allen Haaheim wrote:
> >they often (not always) combine 1 or more radicals, with 1 or more strokes
> >that are not radicals themselves.
> 
> Sorry Philippe, this is simply not true, and your email follows this with a
> few dubious statements. A Han character has one radical. That is, it can be
> catalogued under only one radical, exceptions before codification
> notwithstanding. The fact that other components in a given character may be
> used as radicals in other contexts is irrelevant and can only confuse
> matters here. 

 To clarify:

 A Han character will always be classified under just one radical in,
 for example, a dictionary.  But there can be differences between
 dictionaries.  For most characters, such as the previously-mentioned
 妊 (ren4 ㄖㄣˋ "pregnant"), it is very obvious to a literate speaker
 of Chinese or Japanese that the radical is 女 (nu:3 ㄋㄩˇ woman).  But for
 a subset of characters, it is not so obvious, so much so that 
 dictionaries may contain a "Table of Characters that are difficult
 to locate" (難檢字表).  For example, "男" (nan2 ㄋㄢˊ "male") is a
 simple character, but it is difficult to know whether the radical
 used to find this character in a dictionary is "田" (field) or "力"
 (power/strength) -- in this case, the radical is "田".  Of course
 a lot of modern dictionaries use pinyin or a similar phonetic system
 which is great *if* you know the pronunciation: When you do not
 know the pronunciation, then lookup by radical followed by a count
 of the remaining strokes after the radical is a traditional and 
 still commonly-used method.  
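
 The radical-plus-remaining-strokes lookup described above can be sketched roughly as follows. The two table entries are written in the "radical.strokes" style of the Unihan kRSUnicode field, but the table itself and the helper names are assumptions for illustration, not an extract of any real dictionary or database.

```python
import unicodedata

# Illustrative entries only: "radical number.additional strokes".
RS_INDEX = {
    "\u598A": "38.4",   # ren4 "pregnant": radical 38 (woman) + 4 strokes
    "\u7537": "102.2",  # nan2 "male": radical 102 (field) + 2 strokes
}

def radical_and_strokes(char):
    """Split an index entry into (radical number, remaining stroke count)."""
    radical_no, extra = RS_INDEX[char].split(".")
    return int(radical_no), int(extra)

def kangxi_radical_name(radical_no):
    """Name of the Kangxi radical; radicals 1..214 occupy U+2F00..U+2FD5."""
    return unicodedata.name(chr(0x2F00 + radical_no - 1))

if __name__ == "__main__":
    for ch in RS_INDEX:
        radical_no, extra = radical_and_strokes(ch)
        print(f"U+{ord(ch):04X}: {kangxi_radical_name(radical_no)}, +{extra} strokes")
```

 A reader who cannot pronounce a character first picks the radical, then counts the strokes left over, and scans only the short list filed under that pair.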

- Ed Trager

> Allen Haaheim



Keyboard Cursor Keys

2004-11-29 Thread Robert Finch



Hi,
 
This issue has probably been brought up before, but 
I was wondering how it was resolved. I see that there are no Unicode characters 
assigned for cursor/edit keys other than that which were originally in ascii 
('return', 'tab', 'backspace', 'delete'). Could keys like 'cursor left', 
'cursor up', 'Home', etc. be incorporated somewhere within the standard ? I know 
this probably goes against the ideal that Unicode is simply a font (ug 
wrong word here) mapping. But it would make the standard more practically 
applicable. I'm trying to implement a Unicode keyboard device, and I'd rather 
have keyboard processing dealing with genuine Unicode characters for the cursor 
keys, rather than having to use a mix of keyboard scan codes and Unicode 
characters.
 
If there is an extended standard of some kind (eg 
UTF-16 ?) that supports this, could someone please point me to it.
 
 
Thanks,
Rob
 


RE: Ideograph?!?

2004-11-29 Thread Kenneth Whistler
Allen Haaheim provided some further detailed clarification:

> Note that Han characters are logographic, not ideographic. That is, 
> they are graphemes that represent words (or at least morphemes), 
> not ideas.

This correctly states the situation for the normal case for
Chinese characters used in writing the Chinese language in most
instances. But as is not unusual for real writing systems, the
situation gets blurred all around the edges.

For one thing, Chinese has characters which are simply used for
their sound, as syllabics. In some instances, they are characters
in dual use, as logographs *or* as syllabics, but in either
instance they are used to "spell out" foreign words irrespective
of the morphemic status of the original characters -- or the
morphemes of the foreign word, for that matter.

And the situation is also not so clear when considered in
the dynamic context of the historical borrowing of the Chinese
writing system to write unrelated languages such as Japanese,
Korean, and Vietnamese. Much of the writing system borrowing
was *attached* to words -- in other words, the vocabulary itself was
borrowed in from Chinese, using the Chinese characters to
write it. But Japanese and other languages faced the problem of how
to adapt the writing system for preexisting, *native* vocabulary,
as well as for all the borrowed words from Chinese. And a
variety of strategies evolved, some of which involved
abstracting the *meaning* of a Chinese character, and then
reapplying the character to write an unrelated word in Japanese
(for example) which had a similar meaning. This semantic-based
transference of Chinese characters completely ignored
morphemic status in Chinese, as the whole point was to simply
find the appropriate character to express the lexical semantics
of the historically unrelated (but semantically similar)
word(s) in the borrowing language.

During such a borrowing transition, you can conceive of
the process as many Chinese characters temporarily "floating off"
their morphemic anchors in Chinese, being considered
purely semantically, and then reattaching to a new set of
morphemic anchors in Japanese, where they subsequently
evolve with new lexical histories in another language.

> But somehow "ideograph" has become the standard term in use outside
> the field of experts in Chinese linguistics (because of Ezra 
> Pound et al., perhaps?). 

I don't think you have to look to Ezra Pound's poetic
misrepresentations of the nature of Chinese to find
reasons here.

"East Asian ideograph" and "CJK ideograph" caught on as
acceptable compromise alternatives for "Chinese character"
or "Japanese character", which were language-specific and
misleading (in the Japanese case), or for transliterations
such as kanji or hanzi (also language-specific), or for
sinogram or sinograph, which were too little known (and
also too Chinese-biassed for some). "East Asian logograph"
would have been technically a little more correct, but
not absolutely right, either. "Ideograph" wasn't used because
the standardizers were confused about how Chinese and
Japanese writing systems worked, but simply because it
was a usable term in the right ballpark, available for
a specialized technical usage, and less objectionable
than most of the alternatives.

As Asmus and Richard implied, "ideograph" should simply
be treated as polysemous now. It has a narrow technical
sense applying to the character encoding world, where it
effectively is equivalent to kanji/hanzi/hanja. And it
has a separate graphological sense where it refers to
signs (like symbols marking restroom doors) that represent
ideas directly without being attached to specific words
or morphemes of a particular language.

--Ken

> 
> I hope this doesn't confuse matters.




Re: Spammed by a list member!

2004-11-29 Thread Sarasvati
Kevin Brown, James Kass, and others: Please take this off-topic issue
up privately, not on the mail list. People wishing to engage in the
discussion have been alerted, and may do so elsewhere.

Regards,
-- Sarasvati



Re: Spammed by a list member!

2004-11-29 Thread James Kass

Kevin Brown wrote,

> Dear Dean 
> 
> I personally have not followed the Phoenician thread. While I can understand 
> the 
> frustration of having a discussion blocked (however valid or invalid the 
> reason) 
> I think the method you are choosing to continue it is unprofessional. 

We disagree.  Dean Snyder has started an ad-hoc private discussion
list on the subject of Phoenician.  People on his distribution list
who choose to opt-out should do so privately.

Responding to a private post on a public list is considered ill-mannered.

Spam is generally taken to mean unwanted *commercial* e-mail.  Dean's
private posting, whatever else it might be, wasn't commercial.

Best regards,

James Kass



RE: Ideograph?!?

2004-11-29 Thread Allen Haaheim
>they often (not always) combine 1 or more radicals, with 1 or more strokes
>that are not radicals themselves.

Sorry Philippe, this is simply not true, and your email follows this with a
few dubious statements. A Han character has one radical. That is, it can be
catalogued under only one radical, exceptions before codification
notwithstanding. The fact that other components in a given character may be
used as radicals in other contexts is irrelevant and can only confuse
matters here. 

Allen Haaheim


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Philippe Verdy
Sent: November 29, 2004 2:01 PM
To: Flarn
Cc: [EMAIL PROTECTED]
Subject: Re: Ideograph?!?

From: Michael Norton (a.k.a. Flarn) <[EMAIL PROTECTED]>
> What's an ideograph? Also, what's a radical?
> Are they the same thing?

Some radicals (in the Han script) may be ideographs, but most ideographs are
not radicals: they often (not always) combine 1 or more radicals, with 1 or 
more strokes that are not radicals themselves.

Radicals in the Han script serve to classify ideographs, and help users to 
locate them in dictionaries, but lookup also considers the additional 
strokes (radicals are themselves made of a well-known number of strokes).

Ideographs rarely represent a concept or word alone, but most often a single
syllable. In Chinese many words are short and consist of 2 syllables, and so
are written with two ideographs.

We should call these characters "syllabographs" instead of "ideographs", but
this may conflict with the concept of "syllabaries" that are much simpler, 
unlike Han ideographs that can each represent very complex syllables (with 
diphthongs, multiple consonants, and distinctive tones), and sometimes (in 
fact rarely) a concept or word (which may be spelled with more than one 
syllable, depending on local dialects).

Many words are created from two ideographs, and the concept behind each 
ideograph is unrelated or sometimes very far from the meaning of the whole 
word. In that case, the pair of ideographs is chosen mostly because the 
concepts are pronounced similarly in some dialect of Chinese (sometimes old 
dialects), and so they can be read phonetically (For example, "Beijing" is 
written with the two ideographs for "bei" and "jing", but you may wonder why
"bei" and "jing" were used, and which concepts they represent, and their 
relation to the name of the city...).

For these reasons, some linguists prefer to speak about "sinographs" 
(reference to Chinese), or sometimes "pictographs" (because of their visual 
form, instead of their meaning)...






Re: No Invisible Character - NBSP at the start of a word

2004-11-29 Thread Kenneth Whistler
John Hudson responded to Jony Rosenne:

> The idea that the position of such text on a page -- as a marginal 
> note -- somehow demotes 
> it from being text, is particularly nonsensical.

I think you two (Jony and John) are talking at cross-purposes
on this particular point.

The *content* of a marginal note can be represented as plain text.
It is the fact of its being a marginal note, its
positioning in the margin in textual layout, and its reference
anchoring to the text that it annotates that do *not* consist
of plain text, and for which we should not expect a plain text
representation.

Peter Kirk summed up the main point of the thread:

> 2) Allowing floating vowel points (and sometimes accents) with a blank 
> base character. This usually, but not always, happens at the beginning 
> of a word. The mechanism for doing this seems to have been clarified by 
> the UTC: use NBSP as the base character.

Correct.

> So can't we leave it that these mechanisms can be used for 
> representation of these forms by those who wish to represent them in 
> plain text, whereas those who want to use other mechanisms are free to 
> do so?

I agree. That is precisely the intent.

And Asmus clarified whatever linebreaking issues there may be.
Those should be dealt with in the context of the revision of
UAX #14 for Unicode 4.1, which takes into account the changed
recommendations regarding NBSP versus SPACE as base for
nonspacing marks.

--Ken







RE: Ideograph?!?

2004-11-29 Thread Allen Haaheim
Note that Han characters are logographic, not ideographic. That is, they are 
graphemes that represent words (or at least morphemes), not ideas. In the west, 
Peter du Ponceau first argued this in the nineteenth century, and the likes of 
Bernhard Karlgren, Peter A. Boodberg, Y.R. Chao and Edward H. Schafer 
established it in the twentieth. But somehow "ideograph" has become the 
standard term in use outside the field of experts in Chinese linguistics 
(because of Ezra Pound et al., perhaps?). 

I hope this doesn't confuse matters.

In the early stages of the development of the written language, selected 
characters were added to homophonous characters to distinguish them 
graphically. A semantically significant character was used. Hsü Shen (died ca. 
AD 149) arranged his dictionary under 540 of these "graphic classifiers" now 
called "radicals," in the Han dynasty. The K'ang Hsi dictionary codifies a list 
of 214. This was reduced to 189 for mainland China's simplified characters.

Thus, the radical is commonly referred to as the "signific" and can provide a 
reminder of the meaning of the character. The character's second component 
(commonly above or to the right of the radical) is called the "phonetic," as it 
can provide a clue to one (or more) likely pronunciation(s). (This is by no 
means foolproof.) In Clark's example the radical 女 (U+5973) is the signific, 
and the phonetic is 壬 (U+58EC), which is pronounced "jen" (Pinyin "ren"). 
Indeed, the character has something to do with "woman" semantically, and is 
pronounced "jen" in modern Mandarin.

Allen Haaheim



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Clark Cox
Sent: November 29, 2004 1:09 PM
To: Flarn
Cc: [EMAIL PROTECTED]
Subject: Re: Ideograph?!?

On Mon, 29 Nov 2004 16:06:42 -0500, Clark Cox <[EMAIL PROTECTED]> wrote
> and contains, as a radical, the character 女 (U+5973), which means
> "woman".

That, of course, should have been ⼥ (U+2F25)


-- 
Clark S. Cox III
[EMAIL PROTECTED]
http://www.livejournal.com/users/clarkcox3/
http://homepage.mac.com/clarkcox3/






Re: No Invisible Character - NBSP at the start of a word

2004-11-29 Thread Peter Kirk
On 29/11/2004 19:06, Jony Rosenne wrote:
...
> Qere and Ketiv are not malformed. I don't think anyone disagrees that they
> are the juxtaposition of the letters of one word with the vowel points of
> another.
> That most cases can be visibly reproduced by Unicode is a hack, and is not a
> sufficient justification to extend Unicode to support cases that cannot be
> reproduced. 
 

I don't think there are in fact any cases which cannot be reproduced, 
since NBSP may be used to carry combining marks, and the CGJ mechanism 
has been approved by the UTC. So this discussion is rather pointless. If 
anyone knows of any cases which cannot be represented properly by 
current Unicode, please let us know, and then perhaps we can reopen the 
discussion.

> There is the case of Yerushala(y)im, for which the plain text hack would
> require an invisible RTL letter to represent the omitted Yod, or to allow
> pointing an RLM. The CGJ hack may work too but it is based on a
> misunderstanding, as if the Lamed has two vowels.
 

Unicode represents text as written or printed, not pronunciation. Sure, 
Yerushala(y)im is pronounced with a yod which is not written. But this 
letter is not part of the written word form, not part of the spelling. 
It is like many cases in many languages where a letter is pronounced but 
not written. Irrespective of the pronunciation, the Lamed is *written* 
with two vowels. And so Unicode correctly encodes it with two vowels, 
and inserts the CGJ to prevent inappropriate reordering.
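
The reordering in question can be seen with a short sketch using Python's standard unicodedata module. Patah is used here as the first vowel purely as an example, and the helper name nfc_codepoints is an assumption. Because hiriq has a lower canonical combining class (14) than patah (17), normalization swaps them unless a CGJ (class 0) sits in between.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED
PATAH = "\u05B7"  # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, combining class 0

def nfc_codepoints(text):
    """Normalize to NFC and list the resulting code points."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]

if __name__ == "__main__":
    print("without CGJ:", nfc_codepoints(LAMED + PATAH + HIRIQ))
    print("with CGJ:   ", nfc_codepoints(LAMED + PATAH + CGJ + HIRIQ))
```

Without the CGJ the two points come back in the order hiriq, patah; with it, the written order patah, hiriq survives normalization.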

> Also, these hacks foil searching and sorting, since neither the Qere or the
> Ketiv words will be handled correctly.
 

True, it is not possible to search the text and sort on the Qere form, 
for the simple reason that this is not part of the plain text; its 
consonants appear only in the margin. The Qere form can be added to the 
text with markup that it should be invisible, and in this way it can be 
searched and sorted.

It is possible to search and sort on the Ketiv form as this is 
unpointed, by setting the search or sort to use the base characters 
only. But this might require tailoring of collation to ignore NBSP.
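
As a rough sketch of what "base characters only" might mean for a simple search key (this is not a collation tailoring, and the helper name base_letters is an assumption), one can drop nonspacing marks, which covers the Hebrew points, accents and CGJ, and also drop NBSP:

```python
import unicodedata

NBSP = "\u00A0"  # NO-BREAK SPACE, used as a carrier for floating points

def base_letters(text):
    """Strip nonspacing marks (category Mn) and NBSP, keeping base characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn" and ch != NBSP
    )

if __name__ == "__main__":
    pointed = "\u05DC\u05B7\u034F\u05B4\u05DD"  # lamed, patah, CGJ, hiriq, final mem
    print([f"U+{ord(c):04X}" for c in base_letters(pointed)])
```

A real implementation would more likely express this through collation strength and tailoring than through string filtering, but the effect on the Ketiv letters is the same.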

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



RE: Relationship between Unicode and 10646

2004-11-29 Thread Peter Constable
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
> Of Peter Kirk


> But what happens when a proposal put forward by the UTC is rejected by
> voting members of WG2...

We cannot categorize what has happened as voting members of WG2
rejecting a UTC proposal. First, what has happened is that voting
members of WG2 balloted an amendment to ISO 10646. That amendment
included many items, some proposed by the US, having first been accepted
by UTC, and some coming from other sources. Secondly, the results of the
voting do not terminate the process: the voting was not simply yea/nay
with more nay votes. Rather, both yea and nay votes came with comments;
all 5 of the negative votes were contingent: the NBs indicated their
vote would change to positive if comments are accepted.

So, what will happen? WG2 will meet and work out how all of these
comments should be resolved so that the amendment can move forward.
Resolving the issues could mean that something gets changed from what is
currently proposed. It could mean something gets removed from the
amendment. It could also mean that nothing whatsoever is changed, but
that WG2 decides (after discussing together) that each item proposed in
the amendment should be left as it is.

As for the impact on the relationship of Unicode to ISO 10646, if WG2
ends up changing or removing something from the amendment, then UTC will
have to evaluate those revisions and decide what they want to do.

One thing to keep in mind: the five NBs that voted negatively did so
mostly for different reasons (the one proposal that had items of common
concern to several NBs was N'ko: Canada, Japan and US all commented on
the apostrophe in the script name). If something is really contentious,
then WG2 can choose to split up an amendment, making the contentious
item a separate amendment. If that were to happen, I don't think there's
anything proposed in amendment 2 that wouldn't eventually get approved
(after outstanding details had been worked out).



Peter Constable





Re: Ideograph?!?

2004-11-29 Thread Asmus Freytag
At 02:14 PM 11/29/2004, Kenneth Whistler wrote:
> By the way, Google is your friend. If you want to get
> information about such things, googling for it is a
> good way to start. I suggest reading:
> http://encyclopedia.thefreedictionary.com/Chinese%20writing%20system

As Richard Cook has pointed out, the definitions that are
used in the Unicode Standard are not always identical to
the ones for more general use.
A good source for the former is the Unicode glossary
http://www.unicode.org/glossary/
A./
PS: You should probably read Chapters 2 and 12 of the Unicode
Standard as well. 




Spammed by a list member!

2004-11-29 Thread Kevin Brown
On Monday, 29 November 2004 at 8:52 AM, Dean Snyder wrote:

>You are getting this email directly because Rick McGowan, the moderator
>of the Unicode email list, sent me the following response concerning my
>attempt to post the appended message to the Unicode email list:
>
>>All threads on Phoenician have been closed on this mail list.
>>  -- Sarasvati
>
>Personal email traffic could be avoided if we were merely allowed to
>discuss this legitimate and timely Unicode issue on the Unicode email
>list. In particular I would like responses from members of the Unicode
>list to the 3 questions I raise in light of the German DIN rejection of a
>separate encoding for Phoenician.
>
>[The recipients of this email are merely some of those whose names I
>recognize from reading their posts to the Unicode email list in 2003 or 2004.]

Dear Dean

I personally have not followed the Phoenician thread. While I can understand the
frustration of having a discussion blocked (however valid or invalid the reason)
I think the method you are choosing to continue it is unprofessional.

Grabbing email addresses from our public list is something which I'm sure scores
of spammers do every day. In your desperate enthusiasm for your cause, you have
reduced yourself to their level. You have not even limited yourself to those
involved in the original Phoenician thread which might have been slightly less
unethical.

The correct procedure would have been for you to announce on the Unicode list
that you are starting a separate list moderated by you and invite people to join
it on a voluntary basis. I'm sure Rick McGowan and Sarasvati would have had no
objection to such a posting. (I'm sure they will also allow this one!)

Kevin Brown




RE: No Invisible Character - NBSP at the start of a word

2004-11-29 Thread Peter Constable
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
> Of Jony Rosenne

> > But it *is* a
> > piece of text, however
> > malformed it might seem from normal lexicographic
> > understanding. It may not be a word. It
> > may, in fact, be two words merged into a unit. But it is most
> > certainly text.
> 
> Sure it is text, but it is not plain text.
> 
> Qere and Ketiv are not malformed. I don't think anyone disagrees that they
> are the juxtaposition of the letters of one word with the vowel points of
> another.
> That most cases can be visibly reproduced by Unicode is a hack...

Jony, where you and I have had a different worldview is that, it seems
to me, you view characters as encoding language, and I view characters
as encoding letterforms; or, put another way, for you, text is
necessarily linguistic, whereas for me text is text, independent of
linguistic interpretation. To make this concrete, the fact that a qere
sequence involves the vowel points of word A rather than word B is
linguistically interesting, but irrelevant as far as encoding is
concerned. If the displayed letterforms consist of a lamed with two
vowel points, then the encoded character sequence IMO should be lamed
with two vowel points -- and I would not consider that a hack. 


> and is not a
> sufficient justification to extend Unicode to support cases that cannot be
> reproduced.
> 
> There is the case of Yerushala(y)im, for which the plain text hack would
> require an invisible RTL letter to represent the omitted Yod, or to allow
> pointing an RLM. The CGJ hack may work too but it is based on a
> misunderstanding, as if the Lamed has two vowels.

The only hackish thing about needing CGJ is that the combining classes
for vowel points that occupy the same space relative to a base should
never have been different from one another, but since we cannot revise
that detail, we need to come up with another mechanism to deal with it.
I agree that using CGJ is a hack, but not because the text involves one
base letterform with two combining vowel points.


> > But I'm now, as always, happy to hear alternate suggestions
> > as to how things might be
> > handled in either encoding or display. So if you think merged
> > Ketiv/Qere forms should be
> > handled by markup, perhaps you can explain how, so that I
> > might better understand. Thank you.
> 
> This is the Unicode list, not the markup - SGML etc. list. And I do not know
> too much about markup.

It's not a list dedicated to discussion of markup, but if people contend
that a solution to a problem lies in something other than plain text,
then it is germane to this list to have that alternative solution
elaborated.



Peter Constable




Re: Relationship between Unicode and 10646

2004-11-29 Thread Peter Kirk
On 27/11/2004 06:29, John Cowan wrote:
...
> > But formally these other bodies do have the right to 
> > outvote Unicode, and in effect to force Unicode to reverse its decisions 
> > - or else to reverse its policy of maintaining compatibility.
>
> Formally, yes.  However, by acts of self-abnegation, WG2 has a fixed
> policy of not overriding the UTC or vice versa.
 

But what happens when a proposal put forward by the UTC is rejected by 
voting members of WG2, which are ISO member bodies worldwide? For 
example, I note from http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2876.pdf 
that the latest proposed amendments to 10646 have been approved by only 
6 of 31 voting members and disapproved by 5, with 7 abstentions and 13 
votes not received after the deadline. Now it may be that the issues 
which caused the votes against will be resolved; and I don't know what 
the voting rules are, whether the 6 votes are enough for approval as 
they are more than the 5. But this certainly shows that it is by no 
means certain that ISO member bodies will approve amendments proposed by 
the UTC.

So what does WG2 do? Does it follow its fixed policy of agreeing with 
the UTC despite negative votes? Does "self-abnegation" trump democracy? 
Or is the UTC put in the position that it is forced to retract or amend 
its proposals?

Presumably also UTC members could decide to reverse their policy on this 
one as well. And with new voting members joining a rather small group, 
you never know what might happen.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Ideograph?!?

2004-11-29 Thread Kenneth Whistler
Michael Norton (a.k.a. Flarn) asked:

> What's an ideograph? Also, what's a radical?
> Are they the same thing?

No, they aren't.

In the Unicode context, the simplest answer is that
an "ideograph" or a "CJK ideograph" is simply to be
taken as a synonym for "a Chinese character".

A "radical" is one of a list of traditional pieces
of Chinese characters that are used in indices for
looking up Chinese characters in a dictionary.

By the way, Google is your friend. If you want to get
information about such things, googling for it is a
good way to start. I suggest reading:

http://encyclopedia.thefreedictionary.com/Chinese%20writing%20system

--Ken




Re: Ideograph?!?

2004-11-29 Thread Richard Cook
The term ideograph has special meaning in Unicode/ISO usage. "Ideograph"
is short for "CJK Unified Ideograph", and is one of the characters with
mapping or reference data in the Unihan.txt database.

Likewise, "Radical" has special meaning. CJK Radicals are found in two
places, in the "Kangxi Radicals" block, and in the "CJK Radicals
Supplement".  (Actually, there is also a third block of radicals, "Yi
Radicals", but these are not CJK).

CDL provides a way for precise description of any CJK Unified Ideograph or
Radical. Please see , and the "Jargon Notes".

In other contexts (beyond Unicode) both of these terms have different or
broader usages. Radical, for example, is a 'lexicographic indexing
component' (used in Radical/Stroke indexes), and ideograph is 'idea
writing' ...


On Mon, 29 Nov 2004, Clark Cox wrote:

> On Mon, 29 Nov 2004 15:13:51 -0500, Flarn <[EMAIL PROTECTED]> wrote:
> > What's an ideograph?
>
> An ideograph (aka ideogram) is (from www.m-w.com):
>
> "a picture or symbol used in a system of writing to represent a thing
> or an idea but not a particular word or phrase for it"
>
> > Also, what's a radical?
>
> A radical is, in the set of Han characters, a symbol that occurs as
> part of other ideographic characters that often serves to show common
> meaning or history to the character. In many ways, radicals are to Han
> characters as Greek and Latin roots are to English words.
>
> for instance, in Japanese, the character 妊 (U+598A) means "pregnant",
> and contains, as a radical the character 女(U+5973), which means
> "woman".
>
> --
> Clark S. Cox III
> [EMAIL PROTECTED]
> http://www.livejournal.com/users/clarkcox3/
> http://homepage.mac.com/clarkcox3/
>
>
>




Re: Ideograph?!?

2004-11-29 Thread Philippe Verdy
From: Michael Norton (a.k.a. Flarn) <[EMAIL PROTECTED]>
> What's an ideograph? Also, what's a radical?
> Are they the same thing?

Some radicals (in the Han script) may be ideographs, but most ideographs are 
not radicals: they often (not always) combine 1 or more radicals, with 1 or 
more strokes that are not radicals themselves.

Radicals in the Han script serve to classify ideographs and help users 
locate them in dictionaries, which also take into account the number of 
additional strokes (radicals are themselves made of a well-known number of 
strokes).

An ideograph rarely represents a concept or word on its own; most often it 
represents a single syllable. In Chinese many words are short and consist of 
two syllables, and so are written with two ideographs.

We should call these characters "syllabographs" instead of "ideographs", but 
this may conflict with the concept of "syllabaries", which are much simpler, 
unlike Han ideographs, each of which can represent a very complex syllable 
(with diphthongs, multiple consonants, and distinctive tones), and sometimes 
(in fact rarely) a concept or word (which may be spelled with more than one 
syllable, depending on local dialects).

Many words are created from two ideographs, and the concept behind each 
ideograph is unrelated to, or sometimes very far from, the meaning of the 
whole word. In that case, the pair of ideographs is chosen mostly because the 
concepts are pronounced similarly in some dialect of Chinese (sometimes old 
dialects), and so they can be read phonetically (For example, "Beijing" is 
written with the two ideographs for "bei" and "jing", but you may wonder why 
"bei" and "jing" were used, and which concepts they represent, and their 
relation to the name of the city...).

For these reasons, some linguists prefer to speak about "sinographs" 
(reference to Chinese), or sometimes "pictographs" (because of their visual 
form, instead of their meaning)...




Re: CGJ , RLM

2004-11-29 Thread Kenneth Whistler
Mark Davis said (in reference to a long set of comments by
Philippe Verdy on this thread):

> The statements below are incorrect

And Philippe asked:

> Which "statements"? My message is mostly a read as a question, not as an 
> affirmation...

And I will attempt the fact-finding...

> CGJ is a combining character that extends the grapheme cluster started 
> before it, 

True but misleading. CGJ is a combining character, and like *all*
other nonspacing combining characters it has the property
Grapheme_Extend=True. CGJ's *function* is not to extend the grapheme
cluster before it; that just happens automatically, as for any
character with gc=Mn.

And that was a statement.
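
The general-category part of this is easy to check. What follows is only a minimal sketch in Python, assuming nothing beyond the standard unicodedata module; that module exposes the general category and the canonical combining class but not Grapheme_Extend, so that property is not tested here. The helper name describe is made up for the illustration.

```python
import unicodedata

CGJ = "\u034f"    # COMBINING GRAPHEME JOINER
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT, an ordinary nonspacing mark

def describe(ch):
    """Return (name, general category, canonical combining class) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

cgj_info = describe(CGJ)
acute_info = describe(ACUTE)
print(cgj_info)
print(acute_info)
```

Both characters report gc=Mn. They differ in combining class: 0 for CGJ, 230 for the acute accent.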

> but it does not imply any linking with the next grapheme cluster 
> starting at a base character.

True. Another statement.

> So, even if one encodes, A+CGJ+E, there will still be two distinct grapheme 
> clusters A+CGJ and E, and the exact role of the trailing CGJ in the A+CGJ is 
> probably just a pollution, given that this CGJ has no influence on the 
> collation order, so that the sequence A+CGJ+E will collate like A+E, 

Misconstrued. Whether CGJ influences the collation order or not
depends on how it is weighted in a tailored collation table. And
the main *point* of having a CGJ is to provide a target for tailored
collation, so that it *can* make a difference. Statements, by the way.

> and it 
> does not influence the rendering as well.

True. Another statement.

> A "correct" ligaturing would be A+ZWJ+E, 

A matter of opinion, neither obviously true nor false. And a statement.

> with the effect of creating three 
> default grapheme clusters,

False. The correct value is 2.

> that can be rendered as a single ligature, or as 
> separate A and E glyphs if the ZWJ is ignored.

True. And a statement.

> For example, a ligaturing opportunity can be encoded explicitly in the 
> French word "efficace":
> "ef"+ZWJ+"f"+ZWJ+"icace".

True (although superfluous). And a statement.

> Note however that the ZWJ prohibits breaking, 

False. ZWJ is lb=CM, which prevents a break *before*, but not
a break *after*.

> despite in French there's a 
> possible hyphenation at the first occurrence, where it is also a syllable 
> break, but not for the second occurrence that occurs in the middle of the 
> second syllable.

True (I assume) statements about French.

> I don't know how one can encode an explicit ligaturing opportunity, while 
> also encoding the possibility of an hyphenation (where the sequence above 
> would be rendered as if the first ZWJ had been replaced by an hyphen 
> followed a newline.)

True (I assume) statements about Philippe's state of knowledge.

> To encode the hyphenation opportunity, normally I would use the SHY format 
> control (soft hyphen):
> "ef"+SHY+"fi"+SHY+"ca"+SHY+"ce"

True (I assume) statements about Philippe's practice in text representation.

> 
> If I want to encode explicit ligatures for the "ffi" cluster, if it is not 
> hyphenated, I need to add ZWJ:

False (at least existentially, although I cannot comment on
your personal wants and needs). And a statement.

> "ef"+ZWJ+SHY+"f"+ZWJ+"i"+SHY+"ca"+SHY+"ce"(1)

And as Doug pointed out, this is an incredibly baroque (and obtuse)
way of attempting to represent the word "efficace" in plain text.

> 
> The problem is whether ZWJ will have the expected role of enabling a ligature 
> if it is inserted between a letter and a SHY, instead of the two ligated 
> glyphs. In any case, the ligature should not be rendered if hyphenation does 
> occur, else the SHY should be ignored. So two rendering are to be generated 
> depending on the presence or absence of the conditional syllable break:
> - syllable break occurs, render as: "ef-"+NL+"f"+ZWJ+"icace", i.e. with a 
> ligature only for the "fi" pair, but not for the "ff" pair and not even for 
> the generated "f"+hyphen...
> - syllable break does not occur, render as "ef"+ZWJ+"f"+ZWJ+"icace", i.e. 
> with the 3-letter "ffi" ligature...

A whole series of statements. Together somewhat of a muddle for the
simple observation that "ffi" is not rendered with a single ligature
if there is a line break in the middle of it.

> 
> I am not sure if the string coded as (1) above has the expected behavior, 
> including for collation where it should still collate like the unmarked word 
> "efficace"...

True (I assume) statement about Philippe's state of knowledge.

Reading to the end, I find *only* statements here, and no question
actually posed.

In the future, if you want a message to be taken *as* a question,
it would be best to 1. Make it short, and 2. Actually pose a
question in it, preferably terminating the sentence to be so
interpreted with a "?"

--Ken




Dutch malarkey (was: Re: (base as a combing char))

2004-11-29 Thread Kenneth Whistler
Philippe Verdy responded to John Cowan:

> From: "John Cowan" <[EMAIL PROTECTED]>
> > the need to encode Dutch
> > ij as a single character, which is neither necessary nor practical.
> > (U+0132 and U+0133 are encoded for compatibility only.)  In cases where
> > ij is a digraph in Dutch text, i+ZWNJ+j will be effective.

> Those that want a 
> strong distinction will more likely use U+0132 and U+0133 in their word 
> processors, 

Those Dutch typists who do this will simply be introducing problems
into their text.

> assisted by Dutch lexical correctors so that they will just need 
> to enter "i" then "j", and let the word processor substitute the two letters 
> appropriately by the ij ligated letter when it is appropriate, leaving other 
> instances unchanged.
> 
> As the ij ligated letter is most certainly the most frequent case for 
> entering Dutch text, it may be the default behavior of a Dutch input method,

This is just malarkey. Dutch can and should continue using sequences
of "i" and "j", as they have been for decades.
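
To illustrate why U+0132 and U+0133 count as compatibility characters, here is a minimal sketch in Python using the standard unicodedata module. The function name fold_ij and the sample words are chosen only for the example. Compatibility normalization (NFKC) maps each ligature character to the plain two-letter sequence, so text typed either way can be compared after folding.

```python
import unicodedata

IJ_UPPER = "\u0132"  # LATIN CAPITAL LIGATURE IJ
IJ_LOWER = "\u0133"  # LATIN SMALL LIGATURE IJ

def fold_ij(text):
    """Apply NFKC, which replaces the IJ compatibility characters by I+J / i+j."""
    return unicodedata.normalize("NFKC", text)

# The compatibility mapping recorded in the Unicode Character Database.
lower_mapping = unicodedata.decomposition(IJ_LOWER)

folded_small = fold_ij(IJ_LOWER + "s")            # "ijs" typed with U+0133
folded_capital = fold_ij(IJ_UPPER + "sselmeer")   # "IJsselmeer" typed with U+0132
print(lower_mapping, folded_small, folded_capital)
```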
 
We went around and around on this topic a number of months ago
already, and the baloney is still baloney, even if Philippe
has attempted to slice it a little thinner this time around.

--Ken




Re: Ideograph?!?

2004-11-29 Thread Clark Cox
On Mon, 29 Nov 2004 16:06:42 -0500, Clark Cox <[EMAIL PROTECTED]> wrote:
> and contains, as a radical the character 女 (U+5973), which means
> "woman".

That, of course, should have been ⼥ (U+2F25)
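
The distinction between the two code points can be seen in the character database itself. A small sketch in Python, assuming only the standard unicodedata module: U+2F25 is the Kangxi radical, U+5973 is the unified ideograph, and only compatibility normalization maps the first to the second.

```python
import unicodedata

radical = "\u2f25"    # character from the Kangxi Radicals block
ideograph = "\u5973"  # CJK unified ideograph meaning "woman"

radical_name = unicodedata.name(radical)
ideograph_name = unicodedata.name(ideograph)

# Canonical normalization keeps the radical distinct from the ideograph;
# compatibility normalization folds the radical onto the ideograph.
nfc_form = unicodedata.normalize("NFC", radical)
nfkc_form = unicodedata.normalize("NFKC", radical)
print(radical_name, ideograph_name, nfc_form == radical, nfkc_form == ideograph)
```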


-- 
Clark S. Cox III
[EMAIL PROTECTED]
http://www.livejournal.com/users/clarkcox3/
http://homepage.mac.com/clarkcox3/




fl/fi ligature examples

2004-11-29 Thread Philippe Verdy
From: "Otto Stolz" <[EMAIL PROTECTED]>
> Just because the “st” ligature is so uncommon (and the long “ſ” with its
> “ſt” ligature is almost extinct), I was looking for an example involving
> “fl” or “fi”.

with ff :
   affable, baffe, biffer, Buffy, affriolant, effaroucher, effacer, ...
with ffl :
   effleurer, baffle, affligeant, ...
with fl :
   afleurer, flower, fleur, floral, floraison, inflation, déflation, flic, 
infliger...
with ffi :
   traffic, efficace, effilocher, officier, affiche, affine, ...
with fi :
   fi, fin, final, fil, fils, filature, filin, firme, firmament, 
aficionados, défi, figure...

Many more examples exist among modern, widely used words (at least in English 
and French, and probably also in most Romance languages and in other European 
languages that draw on Latin roots)...
Other widely used ligatures include "st" and "ct": est, test, acte, octet...
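
For readers who want to experiment with such word lists: Unicode also encodes a few of these ligatures as compatibility presentation forms (U+FB00 to U+FB04 for ff, fi, fl, ffi, ffl). The following is only a sketch in Python, using the standard unicodedata module, of how text containing those precomposed forms can be folded back to plain letters; the names LIGATURES and expand are made up for the illustration.

```python
import unicodedata

# Latin ligature presentation forms and the letter sequences they stand for.
LIGATURES = {
    "ff": "\ufb00",
    "fi": "\ufb01",
    "fl": "\ufb02",
    "ffi": "\ufb03",
    "ffl": "\ufb04",
}

def expand(text):
    """Replace ligature presentation forms by their plain letters via NFKC."""
    return unicodedata.normalize("NFKC", text)

expanded = {letters: expand(lig) for letters, lig in LIGATURES.items()}
word = expand("e" + LIGATURES["ffi"] + "cace")
print(expanded, word)
```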




Re: CGJ , RLM

2004-11-29 Thread Kenneth Whistler
Otto Stolz asked:

> In German, however, a ligature must not span a syllable break.
> How should I code plain text, w.r.t. hyphenation and ligatures?
> - "Huf" + ZWNJ + "lattich"
> - "Huf" + SHY + "lattich"
> - "Huf" + SHY + ZWNJ + "lattich"
> - "Huf" + ZWNJ + SHY + "lattich"

You should code it as:

   "Huflattich"
   
You then handle ligation suppression in a high-end rendering
system either by exception dictionaries or by simply
selecting "Huflattich" and setting a "ligaturing Off" property
to suppress the "fl" ligation in that particular instance.

There is no reason whatsoever to be agonizing about how to
*convey* this level of detail -- in German, French, or
whatever -- in plain text.

If you are working with high end rendering where the exact
details of the ligature rendering are of concern for your
final presentation, then you are working with a rich text
system, anyway, which has all kinds of tools for dealing with
such issues. And hyphenation is typically done via higher
level protocol (hyphenation engines using dictionaries), rather
than trying to indicate hyphenation points explicitly in
plain text that way.
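
If text does arrive with such format controls embedded, a process that only cares about the letters (searching, say) can drop them before comparing. This is a minimal sketch in Python, not a description of any particular rendering or search system; the names FORMAT_CONTROLS and strip_controls are hypothetical.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
SHY = "\u00ad"   # SOFT HYPHEN

# Translation table that deletes the three format controls discussed here.
FORMAT_CONTROLS = {ord(c): None for c in (ZWNJ, ZWJ, SHY)}

def strip_controls(text):
    """Remove ZWNJ, ZWJ and SHY, leaving only the visible letters."""
    return text.translate(FORMAT_CONTROLS)

# The four encodings proposed in the question.
variants = [
    "Huf" + ZWNJ + "lattich",
    "Huf" + SHY + "lattich",
    "Huf" + SHY + ZWNJ + "lattich",
    "Huf" + ZWNJ + SHY + "lattich",
]
stripped = [strip_controls(v) for v in variants]
categories = [unicodedata.category(c) for c in (ZWNJ, ZWJ, SHY)]
print(stripped, categories)
```

All four variants reduce to the same ten letters, and all three controls have general category Cf (format).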

--Ken




Re: Ideograph?!?

2004-11-29 Thread Clark Cox
On Mon, 29 Nov 2004 15:13:51 -0500, Flarn <[EMAIL PROTECTED]> wrote:
> What's an ideograph?

An ideograph (aka ideogram) is (from www.m-w.com):

"a picture or symbol used in a system of writing to represent a thing
or an idea but not a particular word or phrase for it"

> Also, what's a radical?

A radical is, in the set of Han characters, a symbol that occurs as
part of other ideographic characters that often serves to show common
meaning or history to the character. In many ways, radicals are to Han
characters as Greek and Latin roots are to English words.

for instance, in Japanese, the character 妊 (U+598A) means "pregnant",
and contains, as a radical the character 女 (U+5973), which means
"woman".

-- 
Clark S. Cox III
[EMAIL PROTECTED]
http://www.livejournal.com/users/clarkcox3/
http://homepage.mac.com/clarkcox3/




Re: CGJ , RLM

2004-11-29 Thread Asmus Freytag

> Wachs-tube (growth tube)

Not the common reading of this. However, a "growth tube" or "growing tube" 
might be an implement in some specialized context. But note that such 
compounds might also be formed with 'Wuchs-', perhaps even preferentially so.

Therefore, reading 'Wachs-' as "wax", as Otto pointed out, is probably better.
A./
PS: If you search for this word and pair it with "Wachs", you will see a 
long list of sites discussing Wachs-tube vs Wachstube. 




Re: CGJ , RLM

2004-11-29 Thread Philippe Verdy
From: "Otto Stolz" <[EMAIL PROTECTED]>
> Note that there is no algorithm to reliably derive the position of the
> syllable break from the spelling of a Word. You could even concoct pairs
> of homographs that differ only in the position of the syllable break
> (and, consequently, in their respective meaning). So far, I have only
> found the somewhat silly example
> - "Brief"+SHY+"lasche" (letter flap) vs.
> - "Brie"+SHY+"flasche" (bottle to keep Brie cheese in),
> but I am sure I could find better examples if I would try in earnest.

French hyphenation does not work reliably from orthographic rules alone.
It works quite well, but with many exceptions that require a
hyphenation dictionary. I think the same is true of almost all alphabet-based
languages, and even of some languages written with so-called "syllabic"
scripts, probably as a matter of style, where a break between two spoken
syllables must sometimes be avoided because it is not the best one according
to meaning (notably in compound words).
The case of German is that there are many possible compound words, and
breaks preferably occur between root words rather than between syllables,
with exceptions:
- due to other stylistic constraints, or
- on short particles that are better not detached from their
respective root (but where do you best break the verbs "hereinzugehen" or
simply "zugehen"?),
- also because not all verb particles are detachable, as some belong to the
root (many examples with the "be" particle or prefix).
Even if you allow hyphenation only between lexical units, there will be
some exceptions that cannot be resolved without understanding the semantics.
Such compound words with no separator are extremely rare in English, and
very rare in French.
(French examples: there's a clear spoken syllable break in "millionce" after 
"-li-" and before "-on-", pronounced with separate vowels, but in "million" 
no break occurs within "-lions", which is a single syllable pronounced with 
a diphthong; none of these examples are compound words.)

But hyphenation is still preferable in German to breaking only between words
(at spaces), because of the average length of compound words: without it,
margin alignment may look ugly and be hard to read in narrow columns such as
those of newspapers or dictionaries. In Dutch, there's more freedom in the
creation of compounds, which can often be written with or without a separator
(modern Dutch style prefers using separators, or not creating a compound at
all and separating the words with a space, but historically Dutch used the
German style, which is still in use today despite its possible semantic
ambiguities).
I think that a German writer who sees a possible ambiguity will often
accept using an unconditional hyphen to create compound words (in your
example, he would write "Brief-Lasche" or "Brie-Flasche" but not
"Brieflasche", whose interpretation is problematic because there is no easy
way to determine it, even given the funny semantics of the two alternatives;
unless the author is sure that ligatures are handled correctly, with a
ligature on "fl" for the interpretation as "Brie-Flasche", and no ligature,
and a narrow spacing, between f and l for the interpretation as
"Brief-Lasche").
(Historically, German texts were full of ligatures -- much more so than 
texts in other languages written in Latin script -- and those ligatures now 
tend to disappear from most modern publications. Given the German rule that a 
ligature should not occur across a syllable break and should be present 
within a single root, it is easy to see how ligatures are part of the 
orthographic system and have a semantic value that helps the 
correct understanding of text. So it would be even more important to use 
ZWNJ or ZWJ in German words, rather than letting a renderer do this job 
automatically but inaccurately. For simplicity, I think that a ZWNJ inserted 
between roots to prevent a ligature would be easier to manage than a ZWJ 
between two ligaturable letters that must be kept in the same syllable.)




Ideograph?!?

2004-11-29 Thread Flarn
What's an ideograph? Also, what's a radical?
Are they the same thing?
- Michael Norton (a.k.a. Flarn)
E-mail address: [EMAIL PROTECTED]



RE: No Invisible Character - NBSP at the start of a word

2004-11-29 Thread Jony Rosenne


> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of John Hudson
> Sent: Sunday, November 28, 2004 2:55 AM
> To: 'Unicode Mailing List'
> Subject: Re: No Invisible Character - NBSP at the start of a word
> 
> 
> Jony Rosenne wrote:
> 
> >>Jony, what do you think plain text is? Why should the 
> >>arrangement of text on a page as a 
> >>marginal note be considered any differently from text 
> >>anywhere else *in its encoding*? Are 
> >>you suggesting that Unicode is only relevant to ... what? 
> >>totally unformatted text in a 
> >>text editor?
> 
> > Basically, yes. Except for the control codes in Unicode - 
> spaces, line feed,
> > carriage return, etc.
> 
> > To indicate formatting one uses markup.
> 
> And markup is applied to what? Obviously, to text.

Certainly.

> 
> It seems to me that the primary purpose of the plain text 
> limitation in Unicode is to 
> maintain the character/glyph distinction, so that it is 
> clearly unnecessary to encode 
> display entities such as variant glyphs, ligatures, etc. 
> separately from the underlying 
> character codes that they visibly represent in various ways. 

I believe this is not the only purpose, but the purpose is not as important
as is respecting the scope of Unicode.

> On this basis, I think there 
> is a sound argument to be made against encoding an 'invisible 
> letter', if there is an 
> existing characters -- such as NBSP -- that logically and 
> effectively serves the same 
> purpose in encoding a particular piece of text. But it *is* a 
> piece of text, however 
> malformed it might seem from normal lexicographic 
> understanding. It may not be a word. It 
> may, in fact, be two words merged into a unit. But it is most 
> certainly text.

Sure it is text, but it is not plain text.

Qere and Ketiv are not malformed. I don't think anyone disagrees that they
are the juxtaposition of the letters of one word with the vowel points of
another.

That most cases can be visibly reproduced by Unicode is a hack, and is not a
sufficient justification to extend Unicode to support cases that cannot be
reproduced. 

There is the case of Yerushala(y)im, for which the plain text hack would
require an invisible RTL letter to represent the omitted Yod, or to allow
pointing an RLM. The CGJ hack may work too but it is based on a
misunderstanding, as if the Lamed has two vowels.

Also, these hacks foil searching and sorting, since neither the Qere nor the
Ketiv words will be handled correctly.

> 
> The idea that the position of such text on a page -- as a 
> marginal note -- somehow demotes 
> it from being text, is particularly nonsensical.

Promotes, not demotes.

> 
> But I'm now, as always, happy to hear alternate suggestions 
> as to how things might be 
> handled in either encoding or display. So if you think merged 
> Ketiv/Qere forms should be 
> handled by markup, perhaps you can explain how, so that I 
> might better understand. Thank you.

This is the Unicode list, not the markup - SGML etc. list. And I do not know
too much about markup. 

Jony

> 
> John Hudson
> 
> -- 
> 
> Tiro Typeworks        www.tiro.com
> Vancouver, BC        [EMAIL PROTECTED]
> 
> Currently reading:
> The Peasant of the Garonne, by Jacques Maritain
> Art and faith, by Jacques Maritain & Jean Cocteau
> Difficulties, by Ronald Knox & Arnold Lunn
> 
> 
> 





Re: CGJ , RLM

2004-11-29 Thread Peter Kirk
On 29/11/2004 14:52, Otto Stolz wrote:
> ...
> Note that there is no algorithm to reliably derive the position of the
> syllable break from the spelling of a Word. You could even concoct pairs
> of homographs that differ only in the position of the syllable break
> (and, consequently, in their respective meaning). So far, I have only
> found the somewhat silly example
> - "Brief"+SHY+"lasche" (letter flap) vs.
> - "Brie"+SHY+"flasche" (bottle to keep Brie cheese in),
> but I am sure I could find better examples if I would try in earnest.

Before our French members get upset at the idea that anyone might keep 
their famous cheese in bottles, let me remind the list of a similar pair 
we had before, although this affects only the less common st ligature:

Wach-stube (watch house)
Wachs-tube (growth tube)
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: CGJ , RLM

2004-11-29 Thread Otto Stolz
Hello,
I had written:
> Note that there is no algorithm to reliably derive the position of the
> syllable break from the spelling of a Word. You could even concoct pairs
> of homographs that differ only in the position of the syllable break
> (and, consequently, in their respective meaning). So far, I have only
> found the somewhat silly example
> - "Brief"+SHY+"lasche" (letter flap) vs.
> - "Brie"+SHY+"flasche" (bottle to keep Brie cheese in),
> but I am sure I could find better examples if I would try in earnest.

Peter Kirk wrote:
> Before our French members get upset at the idea that anyone might keep 
> their famous cheese in bottles, let me remind the list of a similar pair 
> we had before, although this affects only the less common st ligature:

Just because the “st” ligature is so uncommon (and the long “ſ” with its
“ſt” ligature is almost extinct), I was looking for an example involving
“fl” or “fi”.

> ...
> Wachs-tube (growth tube)

 (wax tube)
Best wishes,
   Otto Stolz



Re: CGJ , RLM

2004-11-29 Thread Otto Stolz
Hi,
Philippe Verdy had written:
> For example, a ligaturing opportunity can be encoded explicitly
> in the French word "efficace": "ef"+ZWJ+"f"+ZWJ+"icace". [...]
> in French there's a possible hyphenation at the first occurrence,
> where it is also a syllable break, but not for the second occurrence
> that occurs in the middle of the second syllable.

Doug Ewell wrote:
> a system that is capable of high-quality typography [...]
> should generate ff-type ligatures and perform sensible hyphenation by default.
> You can then use ZWNJ to turn ligation *off* where it is not desired.

In German, however, a ligature must not span a syllable break.
How should I code plain text, w.r.t. hyphenation and ligatures?
- "Huf" + ZWNJ + "lattich"
- "Huf" + SHY + "lattich"
- "Huf" + SHY + ZWNJ + "lattich"
- "Huf" + ZWNJ + SHY + "lattich"
Note that there is no algorithm to reliably derive the position of the
syllable break from the spelling of a Word. You could even concoct pairs
of homographs that differ only in the position of the syllable break
(and, consequently, in their respective meaning). So far, I have only
found the somewhat silly example
- "Brief"+SHY+"lasche" (letter flap) vs.
- "Brie"+SHY+"flasche" (bottle to keep Brie cheese in),
but I am sure I could find better examples if I would try in earnest.
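
The homograph pair can be made concrete with a short sketch in Python. It assumes, purely for illustration, that the syllable break is recorded with a soft hyphen (U+00AD) as in the two spellings above; the function names break_candidates and plain are made up for the example.

```python
SHY = "\u00ad"  # SOFT HYPHEN marking the intended syllable break

def break_candidates(word):
    """Split a word at its soft hyphens, giving the pieces between break points."""
    return word.split(SHY)

def plain(word):
    """Drop the soft hyphens, leaving the spelling a reader would see unbroken."""
    return word.replace(SHY, "")

letter_flap = "Brief" + SHY + "lasche"
brie_bottle = "Brie" + SHY + "flasche"

same_spelling = plain(letter_flap) == plain(brie_bottle)
flap_parts = break_candidates(letter_flap)
bottle_parts = break_candidates(brie_bottle)
print(same_spelling, flap_parts, bottle_parts)
```

The visible spelling is identical in both cases. Only the recorded break position tells the two readings apart, which is why no algorithm working from the spelling alone can recover it.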
Best wishes,
   Otto Stolz



[Fwd: Re: Re: Relationship between Unicode and 10646]]

2004-11-29 Thread Patrick Andries






-------- Original Message --------
Subject: Re: Re: Relationship between Unicode and 10646]
Date: Mon, 29 Nov 2004 10:17:34 +0100
From: Philippe Verdy <[EMAIL PROTECTED]>

From: "Patrick Andries" <[EMAIL PROTECTED]>
> Finally, I am no longer so sure that American companies still regard
> Unicode as something strategic; it is mostly a matter of individual
> efforts by passionate technical people in those companies, enthusiasts
> who are probably still left to get on with it because it builds a useful
> stock of multicultural goodwill.

[PA] This was extracted from a longer and private message to Philippe. 
It is out of context here. Unicode is still strategic, the new scripts
may be less so to the major software companies although major software
companies will most probably not be able to ignore the new versions of 
Unicode which will contain more than simply new rare scripts. 

Anyway, this was a private discussion. Thanks, Philippe. That will teach me.

P. A.









International CALIBER2005 - Call for Paper Date extended till 15th December 2004!

2004-11-29 Thread Rajesh Chandrakar




Hurry Up! Last Chance for the Professionals!

Keeping in view the requests received from various professionals
across the country, the editorial committee has decided to extend the
last date for receipt of full papers for International CALIBER 2005 from
30th November 2004 to 15th December 2004.

The website for International CALIBER-2005 is available at
http://web.inflibnet.ac.in/caliber2005/index.jsp

CALL FOR PAPERS

International CALIBER 2005 invites contributed papers and case studies on
various aspects of the conference theme, Multilingual Computing and
Information Management in Networked Digital Environment, and the
sub-themes of the conference listed below:

Multilingual Computing and Natural Language Processing
  Multilingual Computing - Encoding issues, Scripts, Fonts, Character
  recognition and other challenges, Natural Language Processing,
  Multilingual Indexing Tool, Multilingual and Multimedia Access

Content Management and Information Management
  Content Structuring and XML - Web Cataloguing, Metadata, Dublin Core,
  EAD, TEI, XML, Crosswalk, Automatic Classification, Document
  Clustering, Methodologies, Techniques, Indexing, Retrieval, Content
  Management, Multimedia

Digital Information Processing and Interoperability
  Archiving, Digitization, Preservation, Information Retrieval, Digital
  Library Architecture, System Scalability, Systems Design, Distributed
  solutions, Web Services and Interoperability - Agents related to
  digital Information, Cross-domain searching, Browsing, Metasearch,
  OpenURL Framework OAI-PMH

Digital Libraries and Services
  Digital Library Development Issues, Digital Library Services, Digital
  Library and E-learning, Digital Library Applications, Subject Gateways,
  Library Portals, Digital Library Consortia, Wide area and high-speed
  Networks

Detailed Guidelines and Editorial policy for CALIBER 2005 are available
on the conference website:
http://web.inflibnet.ac.in/caliber2005/callforpapers.jsp

Important dates
  Receipt of Full Papers: December 15, 2004
  Intimation to Authors: December 30, 2004
  Last Date for Registration: January 7, 2005
  Convention Dates: February 2-4, 2005

Contacts

For Paper Submission
  Dr. T.A.V. Murthy
  Editor-in-Chief
  Director
  INFLIBNET Centre (An IUC of UGC)
  Gujarat University Campus
  PB 4116, Navrangpura
  Ahmedabad - 380 009, (Gujarat) India
  Phone: +91-79-26304695/8528/5971/0002
  Fax: +91-79-26300990/26307816
  E-Mail: [EMAIL PROTECTED] or [EMAIL PROTECTED]

Registration and Accommodation
  Dr. (Mrs.) M.D. Baby
  Organizing Secretary, CALIBER-2005
  Librarian
  Cochin University of Science & Technology,
  Cochin University P.O.
  Kochi - 682 022, (Kerala) India
  Phone: +91-484-2577595
  E-mail: [EMAIL PROTECTED] or [EMAIL PROTECTED]

Queries Can Also be Sent to
  Mr. S.M. Salgar
  Chairman, Caliber-2005,
  Scientist -G, INFLIBNET Centre (An IUC of UGC)
  Gujarat University Campus
  PB 4116, Navrangpura
  Ahmedabad - 380 009, (Gujarat) India
  Phone: +91-79-26304695/8528/5971/0002
  Fax: +91-79-26300990/26307816
  E-Mail: [EMAIL PROTECTED]

For more details on International CALIBER-2005 visit
http://web.inflibnet.ac.in/caliber2005/index.jsp

With best regards,
Rajesh Chandrakar
Joint-Convener
International CALIBER2005

Re: Re: Relationship between Unicode and 10646]

2004-11-29 Thread Philippe Verdy
From: "Patrick Andries" <[EMAIL PROTECTED]>
> Finally, I am no longer so sure that American companies still regard
> Unicode as something strategic; it is mostly a matter of individual
> efforts by passionate technical people in those companies, enthusiasts
> who are probably still left to get on with it because it builds a useful
> stock of multicultural goodwill.

That is in fact what makes me doubt more and more the value of 
continuing to support Unicode, if it no longer even serves economic 
objectives judged useful by the only American members able to 
sustain its development solely from the United States, while 
Unicode is still not ready for a good number of other countries which, for 
their part, have economic imperatives to support their own languages.

If there is not much left to do for the Latin or Cyrillic scripts, and 
if the Chinese ideographs are now left to the management of the 
Ideographic Rapporteur working in the Far East, it might be good to 
consider having the development of Unicode for the African or Middle 
Eastern scripts take place somewhere more appropriate than the United 
States, particularly as regards decisions.

Europe seems to offer more appropriate meeting places for these 
alphabets that are poorly supported by Unicode, where decisions are based 
on reports from a distance, without serious economic involvement from the 
companies still participating (if they continue to support and pay their 
colleagues still engaged in this work of "enthusiasts").

It seems that many companies or organizations from Europe, the 
Middle East, or Africa could participate more easily on the 
languages they care about if these decision-making meetings were held 
in a more central location.

It is moreover a pity, in the age of virtual communications, 
that Unicode still insists, for the final vote, on 
doing this only at restricted committee meetings in the United States, as 
if electronic voting did not exist! That would not prevent holding 
discussion or arbitration meetings in various places, but Unicode and those 
who support it would save quite a bit of money by working in a less 
centralized way and by agreeing to delegate part of its work.

It is symptomatic, for example, that half of Unicode's potential 
voters never use the online electronic resources (which nothing 
prevents from being formatted according to Unicode's own administrative 
procedures), making their decisions only on the 
basis of printed documents (expensive to produce and distribute) at 
"conventions" (also expensive to attend, because of travel and 
accommodation costs and the extra working hours paid 
solely for this purpose!), and that important documents can as a 
result escape their analysis...