Re: LATIN LETTER N WITH DIAERESIS?

2003-02-03 Thread Asmus Freytag
Thanks for the many replies, I'll comment on a few of them:

At 05:46 PM 2/2/03 +0100, Lukas Pietsch wrote:


Your F725 Unknown-2, to me, looks like a German SCRIPT CAPITAL S,
(compare with U+2112;SCRIPT CAPITAL L). Yes, we were taught to write an
S like this in school. Perhaps it's used somewhere in mathematics?

Your F7AA Unknown-8 could then be a SCRIPT CAPITAL C.


I wish the same font had contained a glyph for 2112, but it doesn't.
I'm not used to thinking in terms of attempting a 'sans-serif'
interpretation of a 'script' style, and I've not been able to find any
font that does things that way. For example, Arial Unicode MS does
not attempt that, but gives true script shapes for 2112 and similar
letters.

In favor of your hypothesis is the fact that the font duplicates
many other Letterlike Symbols in its private use area, while it is
quite good at mapping IPA and Extended Latin characters to their correct
code points, reserving the PUA for additional precomposed or novel symbols.

Did you (or anyone else) have a guess for F759? It can't be 013D, since that
character exists in the same font, with a more expected rendering.

Thank you for your detailed responses. Most of the issues seem to be clerical
errors on my part. I'll fix them at the next opportunity (not tonight, though ;-).

There was one other remark I wanted to comment on:


Your combinations with latin small letter dotless i (e.g. F704, F731,
F77A) seem to be designed for use in phonetic transcriptions, and hence
are probably intended as IPA U+026A;LATIN LETTER SMALL CAPITAL I


In this font, a capital I has serifs (cross bars) to distinguish it from
lower case letter L. You can see this in F752 and the small capital form
would then be part of F753. That means that the vertical bar in F703 etc.
would have to be either the dotless i or something else altogether.

I don't have experience with phonetic transcriptions, so I couldn't spot
a nonsense mapping, but in this case there seem to be issues of internal
consistency in the set.



At 11:55 AM 2/2/03 -0500, John Cowan wrote:
I strongly suspect that your various DIGRAPHS WITH BREVE BELOW are
actually underties.  In addition, U+F7A1 looks like a glyph variant
of the glyph often used in American dictionaries to represent edh,
though I have more often seen it with the stroke passing through both
legs of the h portion.  U+F776 and U+F777 are probably also American
dictionary characters representing the so-called short and long
sounds of English oo, though I have more often seen them without ligaturing.


A family member, watching me prepare the charts, suggested I name them
LATIN LETTER SMALL EYEGLASSES WITH MACRON ABOVE, etc. I thought that would
be a neat name for them.

A./







Re: LATIN LETTER N WITH DIAERESIS?

2003-02-03 Thread Otto Stolz
Asmus Freytag had written:


I have updated my document at http://www.unicode.org/~asmus/what_is_this_character.pdf


...


I welcome [...] any help anyone could provide in identifying the characters
or in locating places they are used.


Lukas Pietsch wrote:

Your F725 Unknown-2, to me, looks like a German SCRIPT CAPITAL S,
(compare with U+2112;SCRIPT CAPITAL L). Yes, we were taught to write an
S like this in school. Perhaps it's used somewhere in mathematics?


 Your F7AA Unknown-8 could then be a SCRIPT CAPITAL C.

Cf. the Ausgangsschrift taught at German schools, viz.
http://www.dietschweiler.de/SUETTER/schrift.gif
(1915 through 1941), and
http://www.pelikan-lehrerinfo.de/lehrerinfo/shoppix/shopitem151big.gif
(1953 through the present; there have been more recent alternatives, viz.
shopitem150big.gif, shopitem152big.gif, shopitem155big.gif).


I am not entirely convinced that S and C are the intended meanings.
The left-hand stroke of F725 is far too high for a capital S,
and also the position of the left-hand stroke of F7AA does not look
quite right for a C.

Based on their code positions, I think the F725 and F7AA characters
are meant as variants of d and T, respectively.

F725 resembles U+20B0 GERMAN PENNY SIGN, which is probably a script d,
derived from the Latin word denarius. (Just add an upstroke on the
left-hand side of the Verdana PUA character.)

This is not convincing either, I know. Just my 0,02 ¤.

Best wishes,
  Otto Stolz





Re: LATIN LETTER N WITH DIAERESIS?

2003-02-03 Thread Curtis Clark
Lukas Pietsch wrote:

Your F725 Unknown-2, to me, looks like a German SCRIPT CAPITAL S,
(compare with U+2112;SCRIPT CAPITAL L). Yes, we were taught to write an
S like this in school. Perhaps it's used somewhere in mathematics?


Looks to me like the proofreader's marginal deletion mark. F7AA might 
also be a proofreader's mark.


--
Curtis Clark  http://www.csupomona.edu/~jcclark/
Mockingbird Font Works  http://www.mockfont.com/




MDMP -- Unicode migration in SAP R/3

2003-02-03 Thread malandrinos . a
Hello,

Has any of you performed an MDMP to Unicode migration in an Oracle database
used as the database for SAP R/3? Any ideas on how feasible or difficult it is,
or any pointers to documentation, would be very welcome.

Thanks

Andreas





RE: Suggestions in Unicode Indic FAQ

2003-02-03 Thread Kent Karlsson

 --- Kent Karlsson [EMAIL PROTECTED] wrote:
   
   No fallback rendering is coming into picture with your explanation. 
  
  Yes, there is.  A character sequence FULL STOP, VOWEL SIGN E (say)
  is very unlikely to have a ligature, specially adapted (and fitting)
  adjustment points, or similar.  The rendering would in that sense
  need to use a fallback mechanism that renders an approximation
  for this rare combination.
 
 Do you mean to say that an application has to take care of combination of

s/has to/should, also in display,/

 all other Unicode characters with each combining mark in the fallback

Including multiple combining marks on one base character.

 mechanism for such approximation? Can you count the number of combinations
 which may result in millions!?

Many, many more.  Which is why you need a fallback mechanism (rather
than ligatures, adjustment points, etc. which cannot handle that many
combinations).

In the case of Indic postfix and prefix matras, the general handling is
in principle simple: for the postfix ones, nothing special need be done;
for the prefix ones (i.e. the reordrant ones), do the reordering (before
the preceding base character at least; for certain Indic combinations,
move it even earlier). Then you have the visual order. I'm ignoring
ligature formation here, but that has to be done as well. For the
superscript, subscript, and split matras (and other combining marks)
the general approach is a bit more complicated. See
http://www.unicode.org/notes/tn2/ for hints.
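
As a rough illustration of the prefix-matra reordering just described, here
is a minimal C++ sketch. It is my own simplification, not code from TN #2:
the helper names are invented, only U+093F is recognized, and ligature
formation, consonant clusters, and split matras are all ignored.

#include <cstdint>
#include <vector>

using CodePoint = std::uint32_t;

// Illustration only: a real implementation would use a full property table.
bool isReordrantMatra(CodePoint cp) {
    return cp == 0x093F;  // DEVANAGARI VOWEL SIGN I
}

// Move each reordrant (prefix) matra before its preceding base character,
// turning logical order into visual order.
std::vector<CodePoint> toVisualOrder(const std::vector<CodePoint>& logical) {
    std::vector<CodePoint> visual;
    for (CodePoint cp : logical) {
        if (isReordrantMatra(cp) && !visual.empty()) {
            visual.insert(visual.end() - 1, cp);
        } else {
            visual.push_back(cp);
        }
    }
    return visual;
}

For example, the logical sequence 0915, 093F (ka, i-matra) comes out as
093F, 0915, i.e. the i-matra glyph ends up to the left of the ka.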

/Kent K





RE: Suggestions in Unicode Indic FAQ

2003-02-03 Thread Kent Karlsson



  No, with proper reordering (and normal display mode), the e-matra at
  the beginning of the second word would appear to be last glyph of the
  first word.  Similarly, for the second case, the e-matra glyph would
  have come to the left of the pa.  The fluent reader (ok, not me...)
  would then see those errors anyway, just like I can find spelling
  errors in Swedish, most often without any kind of special marking. (I'm
  assuming throughout that reordrant combining characters 
 are reordered.)
 
 Illegal sequences

There are no illegal sequences.

 are not reordered as you indicated.

Then that is a problem with the display software you are using.

 Also, as far as I
 know there is no mention of reordering of illegal input sequence (or
 invalid combining mark) in Unicode standard.

Again, there are no illegal input sequences.

 Consider the last set of glyphs (left-to-right, top-to-bottom) in the
 attached image. It is the rendering effect of illegal input sequence

See above.

 Devanagari Vowel Sign I [U+093F] + Devanagari Letter Ka 
 [U+0915] and without any dotted circle.

Let's see if I understand you. 093F, 0915 is the input.  Since
093F is a combining character, one should (not must, but should)
treat this *as if* the input was 0020, 093F, 0915.  Since 093F
is also reordrant, one must reorder it before the preceding base
character (at least; more for consonant clusters), so the output
glyphs would be: glyph for 093F, space, glyph for 0915.
(But your image does not show that.)
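
To make that concrete, here is a minimal sketch of my own (not normative
text, and the helper names are invented): supply a SPACE base for a
combining mark that has no preceding base character, then let the usual
reordering run.

#include <cstddef>
#include <cstdint>
#include <vector>

using CodePoint = std::uint32_t;

// Illustration only: just the Devanagari dependent-vowel range; a real
// implementation would consult the General_Category property.
bool isCombiningMark(CodePoint cp) {
    return cp >= 0x093E && cp <= 0x094C;
}

// A combining mark at the start of the text has no base character, so it
// is treated *as if* a SPACE (U+0020) base had been supplied before it.
std::vector<CodePoint> supplyMissingBase(const std::vector<CodePoint>& in) {
    std::vector<CodePoint> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i == 0 && isCombiningMark(in[i])) {
            out.push_back(0x0020);
        }
        out.push_back(in[i]);
    }
    return out;
}

With this, the input 093F, 0915 becomes 0020, 093F, 0915, and the
reordering step then yields the glyph order 093F, space, 0915 described
above.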

 As you might know, the correct input
 sequence should be U+0915 followed by U+093F.

That would be a different input (whether that is correct or
not depends on the author's intent).

 In that case the result would
 have been similar to what appears right now. 

Similar ONLY if you disregard the space glyph that should
have been there.

 (Though some more
 sophisticated font/application may want to replace the 
 appearing glyph for
 U+093F to be substituted by some other glyph with proper 
 attachment point).

That may be.

 Now there is no way that user can identify this illegal input sequence
 without dotted circle.

Yes, there is.  Don't disregard the space glyph.

 In the worst case even this rendered glyph is
 attached to the character from a class (for example, 
 consonant cluster of
 Ka Virama Ma) for which the glyph has been designed to 
 render with.
 In such case even a fluent reader can not identify the error.
 
  
  There are spelling errors, yes.  But there are other ways 
 of indicating
  spelling errors, that are (by now) fairly conventional for 
 any language
  (as long as there is an appropriate dictionary installed), 
 and that also
  are more general (in catching more spelling errors) and 
 less obtrusive
  (the author really wants to write it that way, for some reason).
  
   Apparently, Michka used a non-OpenType Bengali Unicode font when
   he embedded the fonts into the page.  As long as you are looking
   at the page on-line, with the embedded fonts, these errors are
   invisible.  
   
   It may be typographically horrible.  It *should* be 
 typographically
   horrible in order to illustrate bad sequences clearly.
  
  I'd prefer little red wiggly lines under the word, or 
 yellow background
  or some such (just for screen display, not for printing; 
 screen grabs
  not counted).  And that for any spelling error.
 
 Spelling mistakes can be categorized into two different classes.

???

 One
 arising from illegal input sequence (e.g., Vowel Sign E as the first
 character in a word)

There are no illegal input sequences.

 and the other one is legal input sequence with no
 contextual meaning in the dictionary.

A simple spell checker just checks if the word is in the 
dictionary or not (without worrying about the context).
That would catch what you call illegal input sequences too.

 While indication of the  second type
 of mistake is generally used only in sophisticated 
 applications like word processor, 

Why?  There is nothing in principle hindering a spell checker
from being used in a plain text editor.

 everyone wants to know the first kind of mistake.

Without a spell checker, but with proper rendering, spelling
errors can be detected by a fluent reader, since they look
different even without any dotted circles. For some ambiguous
Indic cases, like a prefix matra, consonant, postfix matra, all
possible character sequences for them are misspellings (as far
as I know).

 With your
 explanation it seems that even plain text editor is not 
 useful at all to identify such common typing mistakes!

Consider English.  If I write nnnn, that may well be a spelling error.
Do I deserve to have the rendering of that string littered with
dotted circles just because a sequence of four n's has to be
a spelling error?

/Kent K

 - Keyur





Re: How is glyph shaping done?

2003-02-03 Thread Deborah Goldsmith
For information on how this is handled on Mac OS, please see:

http://developer.apple.com/fonts/

Deborah Goldsmith
Manager, Fonts & Unicode
Apple Computer, Inc.
[EMAIL PROTECTED]

On Friday, January 31, 2003, at 11:03  AM, John Hudson wrote:


On Windows, the shaping engines for complex scripts are part of 
Uniscribe (usp10.dll) and make use of OpenType font technology. An 
Arabic OpenType font will contain layout features for Initial ('init'), 
Medial ('medi') and Final ('fina') substitutions (and possibly Isolated 
('isol'), e.g. to handle contextual variation of the letter heh). 
Uniscribe analyses strings of Arabic text, keeps track of the position 
of letters and their neighbours, and implements the appropriate layout 
feature for each letter.
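
By way of illustration only (this is not Uniscribe's code, and the joining
data here is a tiny invented subset standing in for ArabicShaping.txt), the
per-letter choice of feature could look roughly like this in C++:

#include <cstddef>
#include <string>
#include <vector>

enum class Joining { NonJoining, RightJoining, DualJoining };

// Tiny illustrative subset; real data comes from ArabicShaping.txt.
Joining joiningType(char32_t cp) {
    switch (cp) {
        case U'\u0627': return Joining::RightJoining;                 // ALEF
        case U'\u0628': case U'\u0647': return Joining::DualJoining;  // BEH, HEH
        default:        return Joining::NonJoining;
    }
}

// Pick 'isol', 'init', 'medi' or 'fina' for each letter from whether its
// neighbours join towards it.
std::vector<std::string> chooseForms(const std::vector<char32_t>& text) {
    std::vector<std::string> forms(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        bool joinsPrev = i > 0 &&
            joiningType(text[i - 1]) == Joining::DualJoining &&
            joiningType(text[i]) != Joining::NonJoining;
        bool joinsNext = i + 1 < text.size() &&
            joiningType(text[i]) == Joining::DualJoining &&
            joiningType(text[i + 1]) != Joining::NonJoining;
        if (joinsPrev && joinsNext)      forms[i] = "medi";
        else if (joinsPrev)              forms[i] = "fina";
        else if (joinsNext)              forms[i] = "init";
        else                             forms[i] = "isol";
    }
    return forms;
}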

For more information, see 
http://www.microsoft.com/typography/developers/opentype/default.htm, 
and the MS Arabic font specification at 
http://www.microsoft.com/typography/specs/default.htm






Re: LATIN LETTER N WITH DIAERESIS?

2003-02-03 Thread Peter_Constable

F7C7: A palatalised y is pretty unlikely (it's already palatal). Sure it's
not a palatalised v?



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







Re: compatibility between unicode 2.0 and 3.0

2003-02-03 Thread Kenneth Whistler
Erik Ostermueller asked:

 We have a large amount of C++ that currently has Unicode 2.0 support.
 
 Could you all help me figure out what types of operations will fail
 if we attempt to pass Unicode 3.0 thru this code?
 
 I can start the list off with 
 
 -sorting 
 -searching for text 

This depends greatly on what implementation you did for
sorting and searching, and how it handles unassigned code points
in your Unicode 2.0 code. If the code was designed to be
forward compatible, it should do reasonable things with
unassigned code points, and getting Unicode 3.0 data which
is actually using those code points should not disturb your
existing code. But, on the other hand, if you have built
in a bunch of range checks or have used tables which cannot
gracefully handle the appearance of unassigned code points
in your data, then it could well blow up.
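
As a minimal sketch of that forward-compatible approach (my own
illustration; the table here is a toy, not real UnicodeData, and the names
are invented):

#include <cstdint>
#include <map>

enum class Category { Unassigned, Letter, Digit, Space /* ... */ };

// Toy table standing in for whatever UnicodeData the code shipped with.
const std::map<std::uint32_t, Category>& knownProperties() {
    static const std::map<std::uint32_t, Category> table = {
        {0x0020, Category::Space},   // SPACE
        {0x0030, Category::Digit},   // DIGIT ZERO
        {0x0041, Category::Letter},  // LATIN CAPITAL LETTER A
    };
    return table;
}

// Degrade gracefully: code points the table does not know about are
// reported as Unassigned instead of being rejected by a range check,
// so data from a newer Unicode version still flows through.
Category generalCategory(std::uint32_t cp) {
    const auto& table = knownProperties();
    auto it = table.find(cp);
    return it != table.end() ? it->second : Category::Unassigned;
}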

The Unicode Collation Algorithm was not defined until after
Unicode 2.0, and was first synched with Unicode 2.1. It has
also been considerably updated since then -- the current version
is aimed at Unicode 3.1. You should take a look at the
current version to check for gotchas you may have in your
current code.

 -text comparison

I assume here you are not talking about language-specific
collation comparisons, but just Unicode analogs of strcmp()
and the like. If so, those should behave well -- they aren't
usually programmed in ways which make them sensitive to
particular code point assignments.
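
For instance, a plain binary comparison of this kind (a sketch, assuming
UTF-32 strings) never consults any property table, so new assignments
cannot affect it:

#include <cstddef>
#include <cstdint>
#include <vector>

// Code-point-wise comparison of two UTF-32 strings, analogous to strcmp():
// returns <0, 0 or >0.  No character properties are involved, so it is
// insensitive to which code points happen to be assigned.
int compareCodePoints(const std::vector<std::uint32_t>& a,
                      const std::vector<std::uint32_t>& b) {
    std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}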

 -other character classification (isSpace, isDigit, etc...).

Again, these depend on what kinds of forward compatibility
assumptions your original code made. If it provides
meaningful results for unassigned code points in Unicode 2.0,
then tossing Unicode 3.0 data at such APIs shouldn't cause
any problem to existing code, other than not getting the
right results for Unicode 3.0 additions until you have
modified and updated your property tables.

 
 I understand that these operations probably won't work in ALL cases.
 But how about basic plumbing code -- creating and copying strings?

Constructors and copy constructors ought to work fine, unless
you've done something odd.

What you should be more concerned about, however, is
how your code is going to get from Unicode 3.0 to
Unicode 3.1 (or higher), because then you will have to
deal with supplementary characters. Any assumptions that
characters don't lie outside the range U+0000..U+FFFF
will be broken. Whether this will be a small problem
or a big problem for your code depends on whether you
are effectively processing Unicode in UTF-8, UTF-16,
or UTF-32 (or combinations of those). The biggest hit,
when moving from Unicode 3.0 to Unicode 3.1 (or higher)
is for UTF-16 APIs. See Unicode Technical Note #7,
Migrating Software to Supplementary Characters, for some
ideas:
http://www.unicode.org/notes/tn7/
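
As an illustration of the UTF-16 issue (a sketch of my own, not taken from
UTN #7): code that walks a string code point by code point has to combine
surrogate pairs, otherwise every supplementary character shows up as two
unrelated code units.

#include <cstddef>
#include <cstdint>
#include <vector>

// Decode UTF-16 code units into code points, pairing surrogates so that
// supplementary characters (beyond U+FFFF) are seen as single characters.
std::vector<std::uint32_t> codePoints(const std::vector<std::uint16_t>& u16) {
    std::vector<std::uint32_t> cps;
    for (std::size_t i = 0; i < u16.size(); ++i) {
        std::uint32_t u = u16[i];
        bool highSurrogate = u >= 0xD800 && u <= 0xDBFF;
        if (highSurrogate && i + 1 < u16.size() &&
            u16[i + 1] >= 0xDC00 && u16[i + 1] <= 0xDFFF) {
            cps.push_back(0x10000u + ((u - 0xD800u) << 10) +
                          (u16[i + 1] - 0xDC00u));
            ++i;  // the low surrogate was consumed as well
        } else {
            cps.push_back(u);  // BMP character (or an unpaired surrogate)
        }
    }
    return cps;
}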

--Ken

 
 As I mentioned in my last post, I've enjoyed
 listening in on this forum -- I've learned a whole lot.
 
 Thanks,
 
 --Erik Ostermueller
 





Re: Public Review Issues update

2003-02-03 Thread Rick McGowan
Please note that the Issues for Public Review have been updated with a new  
review item regarding tailoring of normalization. Please see issue number  
7 on this page:

http://www.unicode.org/review/

Instructions for discussion and submission of formal comments are provided  
on that page. The closing date for review comments is February 28, 2003
(2003-02-28).

Also, the review period for issues 1, 4, 5, and 6 is quickly approaching,  
and these items are expected to be discussed at the next UTC meeting March  
4-7 2003.

Regards,
Rick McGowan
Unicode, Inc.





Re: compatibility between unicode 2.0 and 3.0

2003-02-03 Thread Keyur Shroff

--- Kenneth Whistler [EMAIL PROTECTED] wrote:
 
 This depends greatly on what implementation you did for
 sorting and searching, and how it handles unassigned code points
 in your Unicode 2.0 code. If the code was designed to be
 forward compatible, it should do reasonable things with
 unassigned code points, and getting Unicode 3.0 data which
 is actually using those code points should not disturb your
 existing code. But, on the other hand, if you have built
 in a bunch of range checks or have used tables which cannot
 gracefully handle the appearance of unassigned code points
 in your data, then it could well blow up.

Can you please explain the best practice for handling unassigned code
points so that applications can easily become forward compatible? If we
just ignore unassigned code points, will that make it easier for
applications to migrate to a later version of Unicode?

- Keyur






Register now - IUC23 in Prague

2003-02-03 Thread Lisa Moore




Folks,

Time to register to get the early bird rates for the conference and
hotel. Please find the details below.

Hope to see you in Prague!

Lisa

  Register now! Don't miss out on early bird conference and hotel rates!
*
 Twenty-third Internationalization and Unicode Conference (IUC23)

  Early bird registration rate valid to March 1.
  Hotel guest room group rate valid to March 1.
**
 Twenty-third Internationalization and Unicode Conference (IUC23)
  Unicode, Internationalization, the Web: The Global Connection
http://www.unicode.org/iuc/iuc23
   March 24-26, 2003
Prague, Czech Republic
*

NEWS

  Check out the updated Conference program and register now via the
   Conference Web site ( http://www.unicode.org/iuc/iuc23 ).
   The web site includes abstracts of talks and speakers' biographies
   so you can see the industry leaders that will be there and the hot
   topics for internationalization and Unicode in 2003!

  Sign up for the Workshop on Managing Localization Projects, organized
   by XenCraft, and taking place in the same venue on 27 March -- See:
   http://www.unicode.org/iuc/iuc23

  Attend the new Showcase to find out more about products supporting
   the Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.

  Be an Exhibitor! Show off your product at the premier technical
   conference worldwide for both software and Web internationalization.
   See: http://www.unicode.org/iuc/iuc23/showcase.html

CONFERENCE PROGRAM

   The conference features tutorials, lectures, and panel discussions
   that provide coverage of standards, best practices, and recent
   advances in the globalization of software and the Internet.
   See the program: http://www.unicode.org/iuc/iuc23/program.html

GLOBAL COMPUTING SHOWCASE

   For the first time, we will have an Exhibitors' track as part of the
   Conference, in addition to the updated Showcase Exhibition.

   For more information, please visit the Web site at:
   http://www.unicode.org/iuc/iuc23/showcase.html

   Showcase participants include:

   Agfa Monotype Corporation
   Alchemy Software Development Ltd.
   Basis Technology Corporation
   Moravia IT
   Multilingual Computing, Inc.

   Don't be left out!! Sign up now!

CONFERENCE VENUE

The Conference will take place in lovely, historic Prague:

 Marriott Prague Hotel
 V Celnici 8
 Prague, 110 00
 Czech Republic

 Tel:  (+420 2) 2288 
 Fax:  (+420 2) 2288 8889

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   Microsoft Corporation
   Sun Microsystems, Inc.
   World Wide Web Consortium (W3C)

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   8949 Lombard Place, #416
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
+1 858 638 0504 (fax)

   Email: [EMAIL PROTECTED]
  or: [EMAIL PROTECTED]

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991.
It is dedicated to the development, maintenance and promotion of The
Unicode Standard, a worldwide character encoding. The Unicode Standard
encodes the characters of the world's principal scripts and languages,
and is code-for-code identical to the international standard ISO/IEC
10646. In addition to cooperating with ISO on the future development of
ISO/IEC 10646, the Consortium is responsible for providing character
properties and algorithms for use in implementations. Today the
membership base of the Unicode Consortium includes major computer
corporations, software producers, database vendors, research
institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web
site at http://www.unicode.org or e-mail [EMAIL PROTECTED]

   *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc. Used with permission.








Re: 4701

2003-02-03 Thread Andrew Cunningham
From memory, although my memory may be faulty, there are some slight 
differences between the animals assigned in the Chinese calendars and 
the animals assigned in the Vietnamese calendar.

In the Vietnamese sequence, it is goat, while most Chinese sources
indicate sheep (occasionally they say ram, but sheep is most common).

At least that's what I seem to remember. But then there have been so many
firecrackers going off over the three days of Tet that something might
have rattled loose in my memory.

Andrew

Michael Everson wrote:
At 10:19 -0800 2003-02-01, Eric Muller wrote:


Michael Everson wrote:


Happy New Year of the Yáng to everybody! (I can't work out whether 
it's the Year of the Sheep, the Goat, or the Ram.)


Ram.



europe.cnn.com (which I was looking at for other, sadder reasons), says 
Goat. My local Superquinn's (a large grocery chain) has had signs on all 
the Chinese food for weeks which say Ram. My Chinese dictionary says 
Sheep.







Re: Public Review Issues update

2003-02-03 Thread Doug Ewell
Rick McGowan rick at unicode dot org wrote:

 Please note that the Issues for Public Review have been updated with a
 new review item regarding tailoring of normalization. Please see issue
 number 7 on this page:

 http://www.unicode.org/review/

This is hardly a formal comment, but allowing limited tailoring of
normalization forms sounds really, really scary to me.  This would
basically apply the concept of CESU-8 -- which is not an official UTF,
but is close enough to cause confusion, and is specified on the Unicode
Web site as though it were official -- to normalization forms.  There
are already four normalization forms, which is probably greater than the
number of people outside this mailing list who understand all of them.
Is it worthwhile to add more special-purpose variations?

 Also, the review period for issues 1, 4, 5, and 6 is quickly approaching,
 and these items are expected to be discussed at the next UTC meeting
 March 4-7, 2003.

As a reminder, my paper arguing against the proposed deprecation of
Plane 14 language tags is available in PDF and HTML formats:

http://home.adelphia.net/~dewell/plane14.pdf
http://home.adelphia.net/~dewell/plane14.html

As for Issue #6, Unicode 4.0 Alpha data, there hasn't been much new to
review so far.  The first UnicodeData.txt file to contain the new
character assignments in Unicode 4.0 was posted only a few hours ago!
Eleven days might not be much time to check through 1200+ new
characters.  I will, however, state the obvious by pointing out that the
Scripts.txt file (among others) needs to be updated to reflect the new
characters.

-Doug Ewell
 Fullerton, California





Re: compatibility between unicode 2.0 and 3.0

2003-02-03 Thread Doug Ewell
Keyur Shroff keyur_shroff at yahoo dot com wrote:

 Can you please explain the best practice for handling unassigned
 code points so that applications can easily become forward compatible?
 If we just ignore unassigned code points, will that make it easier
 for applications to migrate to a later version of Unicode?

I should probably wait for someone like Ken to come by and provide an
authoritative answer, but until then:

The basic rule is that unassigned code points cannot be interpreted or
modified in any way.  In particular, they cannot simply be thrown away,
or converted to an assigned code point such as U+003F or U+FFFD.

That said, there are certain conventions for certain ranges of code
points.  For example, the range from U+0590 through U+08FF is marked in
the Roadmap as being reserved for right-to-left scripts, and IIRC there
are ranges reserved for invisible formatting and control characters
(U+206x and U+FFFx).  But I really don't know how advisable it is to,
say, render a string of unassigned code points like ࠁࠂࠃ as RTL just
because they fall within the RTL range.

Better wait for the experts.

-Doug Ewell
 Fullerton, California