OT: Haikus for Unicode-Haters

2003-02-02 Thread Shlomi Tal
Unicode is shit!
What a dreadful encoding.
Who thought up this crap?

UTF-16
Has those pesky surrogates
Very bad design.

Arabic shaping
Difficult to implement
It's a complex script.

One should circumvent
Endian related issues.
UTF-8 does.
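
The surrogate and endianness points in the stanzas above can be checked directly. This is a minimal sketch using Python's built-in codecs; the sample character U+10400 is an arbitrary choice for illustration and does not come from the thread.

```python
# One code point outside the Basic Multilingual Plane (arbitrary example).
ch = "\U00010400"

# UTF-16 needs a surrogate pair (D801 DC00) for this code point, and the
# byte order of each 16-bit unit depends on the chosen endianness.
be = ch.encode("utf-16-be")
le = ch.encode("utf-16-le")

# UTF-8 is a byte-oriented encoding, so it has only one possible byte order.
u8 = ch.encode("utf-8")

print(be.hex(), le.hex(), u8.hex())
```

The two UTF-16 byte strings carry the same surrogate pair with the bytes of each unit swapped, while the UTF-8 form has a single serialization.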





Re: OT: Haikus for Unicode-Haters

2003-02-02 Thread Roozbeh Pournader
On Sun, 2 Feb 2003, Shlomi Tal wrote:

> Arabic shaping
> Difficult to implement
> It's a complex script.

I can't understand how Arabic suddenly appears in your list. The
complexities are in the script itself, and not in Unicode. I have yet to
see any sound standard for Arabic information interchange that doesn't use 
the same model Unicode uses for Arabic. ISO 8859-6 does, CP1256 does, and 
ISIRI 3342 does. Even the weird UZT standard almost uses the same model.

If you only wanted some complex script, pick something more complex next
time. Mongolian, for example...

roozbeh





Re: OT: Haikus for Unicode-Haters

2003-02-02 Thread Shlomi Tal
You're right, but neither Mongolian nor Indic fits the 5-7-5 syllable
constraint of haiku. Ben-ga-li-Sha-ping maybe? :-)

But anyway, as I've been reading in Thomas Milo's (Decotype) paper on Arabic,
recently referred to here, Arabic typography isn't so simple once you get out
of the simplified printing-Arabic paradigm.

I have been using Arabic on computers since 1993, on Accent Software's word 
processor Dagesh (a multiscript word processor for Windows 3.x). The shaping 
mechanism for Arabic hasn't changed since. And I read this implementation 
goes back to the Apple Mac Arabic word processor Al-Kaatib Ad-Dawli, in 
the late 1980s.

ST





Re: LATIN LETTER N WITH DIAERESIS?

2003-02-02 Thread jcowan
Asmus Freytag scripsit:

> However, there are some unidentified characters, or ones that could be
> considered missing from Unicode 4.0, or which have mappings that for one
> or the other reason could be considered not ideal. These have been
> highlighted. I welcome suggestions for additions to or subtractions from
> this list, plus any help anyone could provide in identifying the characters
> or in locating places they are used.

I strongly suspect that your various DIGRAPHS WITH BREVE BELOW are
actually underties.  In addition, U+F7A1 looks like a glyph variant
of the glyph often used in American dictionaries to represent edh,
though I have more often seen it with the stroke passing through both
legs of the h portion.  U+F776 and U+F777 are probably also American
dictionary characters representing the so-called short and long
sounds of English oo, though I have more often seen them without ligaturing.




Re: LATIN LETTER N WITH DIAERESIS?

2003-02-02 Thread Lukas Pietsch
> All characters are now mapped to Unicode characters or character sequences
> where I felt that this was possible. If there are obvious errors, please
> point them out and I'll update the listing.
>
> However, there are some unidentified characters, or ones that could be
> considered missing from Unicode 4.0, or which have mappings that for one
> or the other reason could be considered not ideal. These have been
> highlighted. I welcome suggestions for additions to or subtractions from
> this list, plus any help anyone could provide in identifying the characters
> or in locating places they are used.

Your F725 Unknown-2, to me, looks like a German SCRIPT CAPITAL S
(compare with U+2112 SCRIPT CAPITAL L). Yes, we were taught to write an
S like this in school. Perhaps it's used somewhere in mathematics?

Your F7AA Unknown-8 could then be a SCRIPT CAPITAL C.

Your F747, spacing left hook below - doesn't it look very much like the
palatalization hooks used elsewhere in the list (which you mapped to
U+0321)?

Your combinations with Latin small letter dotless i (e.g. F704, F731,
F77A) seem to be designed for use in phonetic transcriptions, and hence
are probably intended as IPA U+026A LATIN LETTER SMALL CAPITAL I.

F737: the description in your list doesn't match the glyph shown, which
has a triangular colon.

F70F Latin small letter a with colon shows a triangular colon glyph
and should hence be mapped to U+02D0, not U+003A.

F70E Latin small letter a with tilde with modifier letter triangular
colon shows a U+0251 Latin small letter alpha glyph.

F750 Latin small letter i with palatalized hook below shows an
inverted breve glyph, not a hook.

F751 Latin small letter i with tilde with tilde shows a macron and a
tilde.

F754 and F755 Latin small letter J with... show i, not j glyphs.

F79B Latin small letter S with retroflex hook below shows not a
retroflex hook, but something more like an ogonek. A retroflex hook
should be attached to the left side of the S, not in the middle below,
and has its own precomposed IPA codepoint U+0282.

F7AC Latin small letter u with dot below with diaeresis shows an
acute, not a diaeresis.
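
The standard code points cited in the remarks above can be looked up in the character database to confirm which characters they name. This is a minimal sketch that assumes Python's standard unicodedata module, which is not something the listing itself uses; the private-use values such as F70F are left out because they have no standard names.

```python
import unicodedata

# Standard code points mentioned in the remarks above.
cited = [0x2112, 0x0321, 0x026A, 0x02D0, 0x003A, 0x0251, 0x0282]

# Map each code point to its formal Unicode character name.
names = {cp: unicodedata.name(chr(cp)) for cp in cited}

for cp, name in names.items():
    print(f"U+{cp:04X} {name}")
```

The lookup makes the F70F point concrete: U+02D0 is the triangular colon used in phonetic transcription, whereas U+003A is the ordinary punctuation colon.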

Lukas






RE: 4701

2003-02-02 Thread Greenwood, Timothy

And the Boston Globe has it as the year of the ghost

Stolen from Mathews, as it happens.

On Google, year of the goat has the lead.


RE: Suggestions in Unicode Indic FAQ

2003-02-02 Thread Keyur Shroff

--- Kent Karlsson [EMAIL PROTECTED] wrote:

  
>> Without that dotted circle appearing, the e-matra would appear to
>> have been properly encoded,
>
> No, with proper reordering (and normal display mode), the e-matra at
> the beginning of the second word would appear to be last glyph of the
> first word.  Similarly, for the second case, the e-matra glyph would
> have come to the left of the pa.  The fluent reader (ok, not me...)
> would then see those errors anyway, just like I can find spelling
> errors in Swedish, most often without any kind of special marking. (I'm
> assuming throughout that reordrant combining characters are reordered.)

Illegal sequences are not reordered as you indicated. Also, as far as I
know, the Unicode standard does not mention reordering of an illegal input
sequence (or an invalid combining mark).

Consider the last set of glyphs (left-to-right, top-to-bottom) in the
attached image. It is the rendering of the illegal input sequence
Devanagari Vowel Sign I [U+093F] + Devanagari Letter Ka [U+0915],
without any dotted circle. As you may know, the correct input
sequence is U+0915 followed by U+093F. In that case the result would
have been similar to what appears right now. (A more sophisticated
font/application may want to substitute the glyph for U+093F with another
glyph that has a proper attachment point.)
Without the dotted circle there is no way for the user to identify this
illegal input sequence. In the worst case the rendered glyph even appears
attached to a character from a class (for example, the consonant cluster
Ka Virama Ma) that the glyph was designed to be rendered with.
In such a case even a fluent reader cannot identify the error.
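
The dotted-circle behaviour argued for above can be sketched roughly as follows. This is an illustration under strong assumptions, not how any real shaping engine works: the function name mark_orphans is hypothetical, and the rule "a combining mark needs a preceding letter or mark" is a simplification based only on Unicode general categories, not on script-specific syllable rules.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def mark_orphans(text: str) -> str:
    """Insert U+25CC before any combining mark that has no base letter."""
    out = []
    has_base = False  # True while a letter (or an already based mark) precedes
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("M"):          # Mn, Mc, Me: combining marks
            if not has_base:
                out.append(DOTTED_CIRCLE)
                has_base = True          # the dotted circle now acts as base
        else:
            has_base = cat.startswith("L")
        out.append(ch)
    return "".join(out)

illegal = mark_orphans("\u093F\u0915")   # Vowel Sign I before Ka
legal = mark_orphans("\u0915\u093F")     # Ka followed by Vowel Sign I
```

With this rule the illegal order U+093F U+0915 gains a visible dotted circle, while the legal order U+0915 U+093F passes through unchanged.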

 
> There are spelling errors, yes.  But there are other ways of indicating
> spelling errors, that are (by now) fairly conventional for any language
> (as long as there is an appropriate dictionary installed), and that also
> are more general (in catching more spelling errors) and less obtrusive
> (the author really wants to write it that way, for some reason).
>
>> Apparently, Michka used a non-OpenType Bengali Unicode font when
>> he embedded the fonts into the page.  As long as you are looking
>> at the page on-line, with the embedded fonts, these errors are
>> invisible.
>>
>> It may be typographically horrible.  It *should* be typographically
>> horrible in order to illustrate bad sequences clearly.
>
> I'd prefer little red wiggly lines under the word, or yellow background
> or some such (just for screen display, not for printing; screen grabs
> not counted).  And that for any spelling error.

Spelling mistakes fall into two classes. One arises from an illegal
input sequence (e.g., Vowel Sign E as the first character in a word); the
other is a legal input sequence that has no meaning in the dictionary.
While the second type of mistake is generally flagged only by sophisticated
applications such as word processors, everyone wants to know about the
first kind. By your explanation, it seems that even a plain text editor
would be of no use in identifying such common typing mistakes!

- Keyur


inline: img1.jpg

RE: Suggestions in Unicode Indic FAQ

2003-02-02 Thread Keyur Shroff

--- Kent Karlsson [EMAIL PROTECTED] wrote:
  
>> No fallback rendering is coming into picture with your explanation.
>
> Yes, there is.  A character sequence FULL STOP, VOWEL SIGN E (say)
> is very unlikely to have a ligature, specially adapted (and fitting)
> adjustment points, or similar.  The rendering would in that sense
> need to use a fallback mechanism that renders an approximation
> for this rare combination.

Do you mean to say that, for such an approximation, an application's
fallback mechanism has to handle the combination of every other Unicode
character with each combining mark? Have you counted the combinations?
They may run into the millions!
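
The size of that combination space can be estimated with a rough count. This is a sketch under assumptions: it uses the Unicode version bundled with whatever Python runs it (not Unicode 4.0), treats every character with a general category starting with M as a combining mark, and counts every other assigned, non-surrogate, non-private-use character as a possible predecessor.

```python
import sys
import unicodedata

marks = 0    # characters with general category Mn, Mc or Me
others = 0   # every other assigned character

for cp in range(sys.maxunicode + 1):
    cat = unicodedata.category(chr(cp))
    if cat.startswith("M"):
        marks += 1
    elif cat not in ("Cn", "Cs", "Co"):  # skip unassigned, surrogate, private use
        others += 1

# Number of two-character sequences of the form (other character, mark).
pairs = marks * others
print(marks, others, pairs)
```

The count says nothing about whether each pair needs its own handling; a generic fallback that positions any mark on any base would treat them all with one rule.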

- Keyur

