Unicode List" <[EMAIL PROTECTED]>
Sent: Monday, February 12, 2001 20:30
Subject: Re: Korean linebreaking and UTR14 (was Re: extracting words)
>
>
> On Mon, 12 Feb 2001, Mark Davis wrote:
>
> Thank you for your answer.
>
> > Asmus Freytag is the one to talk to; he
tracting words)
>
>
>
> On Sun, 11 Feb 2001, Mark Davis wrote:
>
> MD> Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
> MD> recommended in my last message. The Unicode standard is online, as is the
> MD> TR. Both can be found by going t
I agree with Tex that the algorithm is small, if implemented in the
straightforward way. I also agree with his #1, #2, and #3. I will add two
things:
1. Where performance is important, and where people start adding options
(e.g. uppercase < lowercase vs. the reverse), the implementation of collati
Please read TUS Chapter 5 and the Linebreak TR before proceeding, as I
recommended in my last message. The Unicode standard is online, as is the
TR. Both can be found by going to www.unicode.org, and selecting the right
topic. The TR in particular discusses the recommended approach to line break
i
Word break is *very* different than linebreak; see Chapter 5 of TUS, and the
Linebreak TR. For linebreak the only tricky language is Thai, since it
requires a dictionary lookup (much like hyphenation in English). Java (and
ICU) supply linebreak mechanisms as a part of the standard API. They also
s
I have a few JavaScript pages for doing code charts, UTF conversion, and
displaying Unicode glyphs. These work on IE, and on NN 4.7 (although the
layout is not great), but someone complained that on NN 6 they don't work at
all. Anyone have an idea what is happening? There seems to be a problem
wri
I have not been following this discussion up until now. Typically the issue
with syllables is like that with word-sorting. With word sorting, no matter
what is in the second word, any difference in the first word swamps it.
Example:
ab xyz ghi
abc def ghi
In many cases, UCA does handle syllabic
The whole principle of tagging individual
strings with NF* is a bit odd to me; not sure I like it. The K forms in
particular are really a folding operation, much like casing. I would not
expect to find a model where someone tagged every string in a database with its
Case, and then had some e
a Unicode Standard Annex. However, it
has not undergone final editorial review: it is not a stable document and
may not be used as reference material nor cited as a normative reference
from another document.
Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], [EMAIL PROTECTED
It is the set of code points that can be addressed using surrogate code
points. For more information, see the glossary at www.unicode.org.
Mark
- Original Message -
From: "nikita k" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Tuesday, February 06, 2001 01:51
Subject:
The topic came up in a UTC meeting some time ago, a "UTF-8S". The motivation
was for performance (having a form that reproduces the binary order of
UTF-16). We have yet to see a formal proposal for this, though.
Mark
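The ordering difference that motivates such a form can be seen with a minimal sketch. It assumes only Python's built-in codecs; the function names and the two sample strings are illustrative, not part of any proposal. Sorting by UTF-8 bytes puts U+FF5E before U+10000, while sorting by UTF-16 code units puts the supplementary character first, because its lead surrogate (D800) is below FF5E.

```python
def utf8_order(strings):
    """Sort by the raw bytes of the UTF-8 encoding."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))

def utf16_order(strings):
    """Sort by UTF-16 code units (big-endian bytes compare like code units)."""
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))

# One BMP character above the surrogate range, one supplementary character.
SAMPLE = ["\uFF5E", "\U00010000"]

print(utf8_order(SAMPLE) == utf16_order(SAMPLE))
```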
- Original Message -
From: "J M Sykes" <[EMAIL PROTECTED]>
To: "Unicode
u, Feb 01, 2001 at 10:14:04AM -0800, Mark Davis wrote:
> > If you had made almost any reasonable attempt whatsoever you would have
> > found this. To find out about a character you first look in the charts and
> > block descriptions. In this particular case, there is an annotati
ng a strftime to ICU date format conversion
routine and noticed that ICU has no week based year support. Fortunately I
don't think my client needs it.
Carl
-Original Message-
From: Mark Davis [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 25, 2001 9:18 PM
To: Carl W. Brown; Unicode L
John,
> It's interesting how we find ways to get around rules that bother us
This is a misrepresentation. The symbol was always intended to be the
Weierstrass elliptic function. It was misnamed, and is thus annotated with
the correct information. Nobody is winking.
> ... If I had read the U
This is not an omission. This issue was debated at great length in the
Unicode technical committee, and the precise wording was agreed to by the
committee.
Mark
- Original Message -
From: "John Cowan" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Wednesday, January 31,
Title: Unicode Benefits
>Allows for multilingual documents
using any or all the languages you desire. Invoice or ticketing applications can
print native language names.
*"multilingual documents" are rare --
as most people understand the term 'documents'. What more people care about is
that
RE: Time Intervals
Mark,
Date calculations are much easier if you start on a March 1 date such as
March 1, 1900. This is because the months are 31,30,31,30,31 31,30,31,30,31
31,xx. Putting February last makes leap year calculations easier.
Carl
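A minimal sketch of the March-based layout described above, in Python. The function and table names are hypothetical. Because February comes last, the offsets to the first of every month are the same in leap and non-leap years; only the length of the final month varies.

```python
# Month lengths when the year is taken to start on March 1:
# Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan (February is last).
MONTH_LENGTHS = [31, 30, 31, 30, 31, 31, 30, 31, 30, 31, 31]

def days_before_month(month):
    """Days from March 1 to the first of `month` (1 = Jan ... 12 = Dec)."""
    shifted = (month - 3) % 12  # Mar -> 0, Apr -> 1, ..., Jan -> 10, Feb -> 11
    return sum(MONTH_LENGTHS[:shifted])

def day_in_march_year(month, day):
    """Zero-based day number within the March-based year."""
    return days_before_month(month) + (day - 1)
```

For instance, `days_before_month(2)` is the offset of February 1 from the preceding March 1, and it does not depend on whether that February has 28 or 29 days.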
-Original Message-
From: Mark Davis [mailto
It doesn't add any value to insert joiners. Just add the IDS itself to the
font table.
Mark
- Original Message -
From: "John Cowan" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Wednesday, January 24, 2001 11:21
Subject: Re: Unicode 3.1: IDS and ZW(N)J
> John Jenkins
This appears to have bounced the first time I sent it.
- Original Message -
From: "Mark Davis" <[EMAIL PROTECTED]>
To: "Unicore" <[EMAIL PROTECTED]>; "Unicode" <[EMAIL PROTECTED]>
Sent: Monday, January 22, 2001 08:04
Subject: Time Interva
BTW, we have settled on a term for characters with code points above U+FFFF.
See
http://www.unicode.org/glossary/#supplementary_character
http://www.unicode.org/glossary/#supplementary_code_point
Mark
- Original Message -
From: "David Starner" <[EMAIL PROTECTED]>
To: "Unicode List" <[EM
Yes, I have already proposed an agenda item for the next UTC, to get this
fix into 3.1.
Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], [EMAIL PROTECTED], [EMAIL PROTECTED]
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014
Roozbeh P
Unicode is always serialized in a UTF: UTF-8, UTF-16*, or UTF-32*. The
definition of each of these is invariant across systems: in UTF-8 an 'a' is
always stored as 0x61. There is a special UTF for use on EBCDIC systems.
Check out the technical reports and FAQs on www.unicode.org.
Mark
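As a small illustration of that invariance, the sketch below prints the serialized bytes of a string in several UTFs using Python's built-in codecs. The helper name is hypothetical; the byte values come from the encoding definitions, not from the platform the code runs on.

```python
def utf_bytes(text):
    """Hex of the serialized form of `text` in several UTFs."""
    return {
        "UTF-8": text.encode("utf-8").hex(),
        "UTF-16BE": text.encode("utf-16-be").hex(),
        "UTF-16LE": text.encode("utf-16-le").hex(),
        "UTF-32BE": text.encode("utf-32-be").hex(),
    }

for name, hexed in utf_bytes("a").items():
    print(name, hexed)
```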
- Orig
optimization strategy. (we here don't use
that strategy, by the way). We think that the implementation strategy could
be changed to still work, but for now we would recommend removing the
characters.
Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], [EMAIL PROTECTED], [
;
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Monday, January 15, 2001 07:39
Subject: RE: Transcriptions of "Unicode"
> {Notice: way off-topic}
>
> Mark Davis wrote:
> > There was a period well after the Norman invasion where a
> > large number of w
- Original Message -
From: "Marco Cimarosti" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Monday, January 15, 2001 00:15
Subject: RE: Transcriptions of "Unicode"
> Mark Davis wrote:
> >Much as I admire and appre
Ar 07:28 -0800 2001-01-12, scríobh Mark Davis:
>According to the references I have, the prefix "uni" is directly from
>Latin while the word "code" is through French. The Indo-European would
>have been *oi-no-kau-do ("give one strike"): *kau apparently
": Still Missing scripts
> On Thu, 11 Jan 2001, Mark Davis wrote:
>
> > By the way, I am still missing the following. If anyone can supply them, I'd
> > appreciate it.
> >
> > [BOPOMOFO]
> [snip]
> >[MONGOLIAN]
> [snip]
> > See http://www.macchia
Thanks for your detailed note; I'll have to think it over.
...
> But there's another inconsistency in the transcription: the vowels in the
> first ("u-") and third ("-code") syllable are both phonemically long.
> Either you put the length mark on both (recommended for *phonetic*
> transcription),
d.
Mark
- Original Message -
From: "Marco Cimarosti" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Friday, January 12, 2001 03:11
Subject: Re: Transcriptions of "Unicode"
> Hallo everybody!> > I don't
full
By the way, I am still missing the following. If anyone can supply them, I'd
appreciate it.
[BOPOMOFO]
[KHMER]
[MONGOLIAN]
[MYANMAR]
[SINHALA]
[SYRIAC]
[THAANA]
[THAI]
[TIBETAN]
[YI]
See http://www.macchiato.com/unicode/Unicode_transcriptions.html for
details.
ICU offers a reverse BIDI algorithm. (http://oss.software.ibm.com/icu/)
Mark
- Original Message -
From: "Roozbeh Pournader" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Cc: "Behdad Esfahbod" <[EMAIL PROTECTED]>
Sent: Monday, January 08, 2001 20:12
Subject: Reverse Bidi Algor
In specific cases you may use one character conversion mapping instead of
two, but you should be very careful about that. See
http://www.unicode.org/unicode/reports/tr22/, especially "1.2.1 Best-Fit
Mappings"
Mark
- Original Message -
From: "Lars Marius Garshol" <[EMAIL PROTECTED]>
To: "
Those have been added, and their weights are now
> >reasonable. (Look under the respective Arabic letters.)
> >
> > I have a question outstanding among Inuktitut experts regarding the
> > ordering of some elements of UCAS for Nunavut and Nunavik. More
> > on that later.
We'd like to call people's attention to a few recent items on the Unicode
site.
UTF-8 Corrigendum
- The Unicode Technical Committee has modified the definition of UTF-8 to
forbid conformant implementations from interpreting non-shortest forms for
BMP characters, and clarified some of the conform
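A minimal sketch of what rejecting non-shortest forms looks like in practice, assuming Python's built-in strict UTF-8 decoder as the conformant implementation. The helper name is hypothetical. The two-byte sequence C0 AF is a non-shortest encoding of "/" (U+002F), whose only well-formed encoding is the single byte 2F.

```python
def is_well_formed_utf8(data):
    """True if `data` decodes under a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_well_formed_utf8(b"\x2f"))      # shortest form of U+002F
print(is_well_formed_utf8(b"\xc0\xaf"))  # non-shortest form of U+002F
```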
Magda, for questions like this, it would be helpful if you ask people to
read the following. If they still have questions afterwards, you could
forward them on to this list.
http://www.unicode.org/help/display_problems.html
http://www.unicode.org/unicode/faq/ (relevant "FAQ Pages")
Mark
- O
he
*opposite* of the embedding; the embedding marks that the embedded text is
to be given a *different* direction than the surrounding text.
Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], [EMAIL PROTECTED], [EMAIL PROTECTED]
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10
I am swamped right now -- I will have more time after the 25th to comment.
Mark
- Original Message -
From: "Roozbeh Pournader" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Cc: "Unicode List" <[EMAIL PROTECTED]>
Sent: Tuesday, December 19, 2000 02:32
Subject: Bug in Bidi
e/timesens/calendar.html)
Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], [EMAIL PROTECTED], [EMAIL PROTECTED]
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014
That matches what I have on
http://www.macchiato.com/unicode/Unicode_transcriptions.html, right?
(circle?)
Mark
- Original Message -
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Unicode List" <[
D]>
Sent: Tuesday, December 12, 2000 09:01
Subject: Re: Transcriptions of Unicode
Ar 07:11 -0800 2000-12-12, scríobh Mark Davis:
>ARMENIAN
BULGARIAN
>CHEROKEE
>ETHIOPIC
GREEK
>GUJARATI
>GURMUKHI
INUKTITUT
>OGHAM
>RUNIC
RUSSIAN
>SINHALA
>UCAS
See http://www.egt
Some people were kind enough to send me extra transcriptions for
http://www.macchiato.com/unicode/Unicode_transcriptions.html
I am still missing confirmation on the Russian and Greek, and (at least one
language in) the following scripts. Any help from native speakers would be
appreciated.
ARMEN
07, 2000 00:30
Subject: Re: displaying Unicode text (was Re: Transcriptions of "Unicode")
> Mark Davis wrote:
> >
> > Let's take an example.
> >
> > - The page is UTF-8.
> > - It contains a mixture of German, dingbats and Hindi text.
> > - My l
<[EMAIL PROTECTED]>
Sent: Monday, December 04, 2000 22:08
Subject: Re: Transcriptions of "Unicode"
> Mark Davis wrote:> > > > What wasn't clear from
his message> > is whether Mozilla picks a reasonable font if the
language is not there.> > Sorry
As per the instructions of the Unicode Technical Committee, TR#22: Character
Mapping Markup Language (CharMapML) has been advanced from draft TR to full
TR. See http://www.unicode.org/unicode/reports/tr22/ for more information.
Note: The UTC intends to continue development this TR to also encomp
Gatos, CA, USA mailto:[EMAIL PROTECTED]
>
> +1 408.210.3569 (mobile) +1 408.904.4762 (fax)
> =======
> Globalization Engineering & Consulting Services
>
> On Sat, 2 Dec 2000, Mark Davis wrote:
>
> > Won't Mozilla pick fonts based on charact
ibutes, Mozilla/Netscape 6 will use
> the fonts that have been set up for those languages. E.g.:
>
> ...
>
> Erik
>
> Mark Davis wrote:
> >
> > Done.
> >
> > From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
> > >
> > > I would suggest adding a
> > >
> > > > Mark Davis wrote:
> > > >
> > > > > http://www.macchiato.com/unicode/Unicode_transcriptions.html
>
PROTECTED]>
Cc: "Unicode List" <[EMAIL PROTECTED]>
Sent: Friday, December 01, 2000 22:46
Subject: Re: Transcriptions of "Unicode"
> Cool. Now if you also add LANG attributes, Mozilla/Netscape 6 will use
> the fonts that have been set up for those languages.
> To: "Unicode List" <[EMAIL PROTECTED]>
> Cc: "Unicode List" <[EMAIL PROTECTED]>
> Sent: Friday, December 01, 2000 2:30 PM
> Subject: Re: Transcriptions of "Unicode"
>
>
> > Sad to report, my browser (Netscape 4.7) shows the Yidd
I am interested in collecting transcriptions of the word "Unicode" in
different scripts (and languages). If you are fluent in a language other
than Unicode, I'd appreciate any suggestions. What I have so far is at:
http://www.macchiato.com/unicode/Unicode_transcriptions.html
Mark
Have you tried looking at the Unicode home page, at "Display Problems", or
the FAQ "Unicode on the Web"?
- Original Message -
From: "sreekant" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Thursday, November 30, 2000 22:27
Subject: display problems on browser
> hi,
>
The soft hyphen is not sufficient, since in other languages the case where
two letters must be distinguished in collation may not fall on a syllable
boundary, or allow hyphenation between them.
The UTC looked at all the possible existing boundary-control characters;
none of them really work for t
"G. Adam Stanislav" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Wednesday, November 29, 2000 22:42
Subject: Re: UTF-8 Corrigendum, new Glossary
> At 21:08 29-11-2000 -0800, Mark Davis wrote:
> >1. The Unicode Technical Committee has modified t
We would like to call two items to people's attention.
1. The Unicode Technical Committee has modified the definition of UTF-8 to
forbid conformant implementations from interpreting non-shortest forms for
BMP characters, and clarified some of the conformance clauses. For more
information, see
htt
These are good points.
TR 21 deliberately does not specify the language conventions for using
titlecase, which as you note will change the effect of its use (see
http://www.unicode.org/unicode/reports/tr21/#TitlecaseCaveats). Most
products will have some smarts, but also leave it up to the user w
The UTC will be using the terms "supplementary code points", "supplementary
characters" and "supplementary planes". The term it is "deprecating with
extreme prejudice" is "surrogate characters".
See http://www.unicode.org/glossary/ for more information.
Mark
- Original Message -
From: "
I haven't had time to read this list recently, so here is a somewhat belated
response.
>But, even if you do so, we are left with a "wrong" canonical decomposition:
>1FBC;GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI;Lt;0;L;0391
0345N1FB3;
>According to James' statement (which is not to
We have found that it works pretty well to have a uchar32 datatype, with
uchar16 storage in strings. In ICU (C version) we use macros for efficient
access; in ICU (C++) version we use method calls, and for ICU (Java version)
we have a set of utility static methods (since we can't add to the Java
S
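The general idea of 32-bit code point values over 16-bit storage can be sketched as follows. This is an illustration in Python, not ICU's API: the function names are hypothetical, and an `array("H")` of unsigned 16-bit values stands in for the string storage. Unpaired surrogates are returned as-is, which is one possible policy among several.

```python
from array import array

def code_point_at(units, i):
    """Return (code_point, next_index) for the UTF-16 code units at index i."""
    lead = units[i]
    if 0xD800 <= lead <= 0xDBFF and i + 1 < len(units):
        trail = units[i + 1]
        if 0xDC00 <= trail <= 0xDFFF:
            # Combine a lead/trail surrogate pair into one code point.
            cp = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
            return cp, i + 2
    # BMP code unit, or an unpaired surrogate passed through unchanged.
    return lead, i + 1

def code_points(units):
    """All code points in a sequence of UTF-16 code units."""
    out, i = [], 0
    while i < len(units):
        cp, i = code_point_at(units, i)
        out.append(cp)
    return out

storage = array("H", [0x0061, 0xD800, 0xDD23])  # 'a' followed by U+10123
print([hex(cp) for cp in code_points(storage)])
```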
That agrees with the results I get on http://www.macchiato.com/unicode/convert.html.
Mark
- Original Message -
From: J. William Semich
To: Mark Davis ; Rick H Wesson
Cc: Unicore ; Unicode ; w3c-i18n-ig
Sent: Wednesday, November 15, 2000 22:46
Subject: Re
programmatically, the program is wrong.
Mark
- Original Message -
From: J. William Semich
To: Rick H Wesson ; Mark Davis
Cc: Unicore ; Unicode ; [EMAIL PROTECTED] ; w3c-i18n-ig
Sent: Wednesday, November 15, 2000 09:32
Subject: Re: [idn] Javascript code charts
I just made some fixes in my Javascript Unicode
pages (insomnia again) that may be of interest.
http://www.macchiato.com/unicode/convert.html has
UTF, RACE and LACE conversions, with a bit better error checking.
http://www.macchiato.com/unicode/charts.html has
Unicode charts, plus a new "
The Unicode Standard does define the rendering of such combinations, which
is in the absence of any other information to stack outwards.
Implementations that can't do that will either overstrike, or use some other
fallback rendering.
A sophisticated rendering will use positioning such as control
Doug is right, if you are counting *encoded characters*. This is fine for
programmers, so if that is the purpose, you can use that method. (If the
text is not well-formed, then you probably want to filter (e.g. not count)
isolated half-surrogates, ill-formed UTF-8, and noncharacters.)
However, if
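A minimal sketch of counting encoded characters with the filtering mentioned above. It assumes the text has already been decoded into a Python string, so ill-formed UTF-8 would have been handled at decode time; the function names are hypothetical. Python strings iterate by code point, so a supplementary character counts once.

```python
def is_noncharacter(cp):
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def count_encoded_characters(text):
    """Count code points, skipping isolated surrogates and noncharacters."""
    count = 0
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:  # an isolated surrogate in a Python str
            continue
        if is_noncharacter(cp):
            continue
        count += 1
    return count

print(count_encoded_characters("a\U00010123b"))
```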
ICU has a list of these. If you take a look at
http://oss.software.ibm.com/icu/charset/CharMaps-HTML/windows-1252-2000.html
, for example, you will see some other interesting cases.
Mark
- Original Message -
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL P
OTECTED]>
Sent: Monday, November 06, 2000 04:03
Subject: Re: Normative vs Informative
Ar 01:57 -0800 2000-11-06, scríobh Mark Davis:
>I was looking for a different message from Michael, and ran across this one.
>This issue was not followed up on this list, so I thought I would report the
>
t: Thursday, October 26, 2000 03:15
Subject: Re: Normative vs Informative
Ar 00:04 -0700 2000-10-26, scríobh Mark Davis:
>I am leery of using normative your way unless we find strong evidence of
>this.
Well, that's just wrong, Mark. (Sorry, it's beat-up Mark day I guess.)
Ken explained
We appreciate any submissions of FAQ questions, and this is a good one.
Reformat it as a
Q...
A...
pair (plain text is fine), and send to [EMAIL PROTECTED], with the title "FAQ
submission". The editorial committee will then look at it.
Mark
- Original Message -
From: "Marco Cimarosti"
Can someone write up a description of the proposed change, with the
attendant glyphs? There is a UTC meeting next week in San Diego, so now's
the time.
Mark
- Original Message -
From: "Antoine Leca" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Tuesday, October 31, 2000
Thanks, Misha and Ian. Two quick notes on my papers: The title of "What's
New in Unicode 3.0" should be "What's New in Unicode 3.0.1". My keynote is
also on the same site: "Unicode Myths".
Mark
- Original Message -
From: "Misha Wolf" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECT
>At least half of the names in the country can be pronounced one way or another
> - depending on what the bearer of the name wishes.
Much the same in America; you very often don't know how someone's last name
is pronounced (or spelt):
Stein => shtyn? styn? steen?
- Original Message -
Fro
One of the main features of XML is that it has quite strict rules about how
to handle errors. The goal, I believe, is to ensure that we are not awash in
malformed files that have no clear interpretation.
And this is clearly an error: the acceptable code points are quite clearly
stated:
http://ww
You may already be aware of this, but there is an eGroups list that archives the
Unicode mail. (It is also searchable: for example, "Etruscan" comes up with
about 15 messages. "Help" comes up with many, many screenfuls.)
It is described on http://www.unicode.org/unicode/consortium/distlist.html
Mark
Here is my take on the way Unicode general categories should be mapped to
POSIX ones.
1. As a reminder, the Unicode General Categories are:
L* (letters): Lu, Ll, Lt , Lm, Lo
M* (marks): Mn, Mc, Me
N* (numbers): Nd, Nl, No
P* (punctuation): Pc, Pd, Ps, Pe, Pi, Pf, Po
S* (symbols): Sm, Sc, Sk, So
In general, you can't depend on any of the following:
toUppercase(x) == x iff Cat(x) == Lu
toTitlecase(x) == x iff Cat(x) == Lt
toLowercase(x) == x iff Cat(x) == Ll
There are counterexamples to these, even using the simple 1-1 mappings. Take
a look at the casing charts, at
http://www.unicode.org
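One such counterexample can be checked with a short sketch using Python's `unicodedata` module. Two assumptions to note: the helper name is hypothetical, and `str.upper()` applies full case mappings rather than the simple 1-1 mappings discussed above, although the two agree for the character used here. U+0138 LATIN SMALL LETTER KRA has category Ll but no uppercase mapping, so uppercasing leaves it unchanged even though it is not Lu.

```python
import unicodedata

def uppercase_iff_lu_fails(ch):
    """True when 'ch.upper() == ch' and 'category is Lu' disagree."""
    unchanged = ch.upper() == ch
    is_lu = unicodedata.category(ch) == "Lu"
    return unchanged != is_lu

KRA = "\u0138"  # LATIN SMALL LETTER KRA
print(unicodedata.category(KRA), uppercase_iff_lu_fails(KRA))
```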
For the purpose specified, isLatin1 should just test for <= 0xFF. After all,
one would not want to exclude TAB, CR or LF ☺
Mark
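The range test suggested above is short enough to write out. This is a sketch with a hypothetical function name; it accepts every code point from 0 to 0xFF, so TAB, CR and LF pass along with the printable Latin-1 characters.

```python
def is_latin1(text):
    """True if every code point in `text` is at most 0xFF."""
    return all(ord(ch) <= 0xFF for ch in text)

print(is_latin1("caf\u00e9\t\r\n"), is_latin1("\u0100"))
```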
- Original Message -
From: "John Cowan" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Thursday, October 05, 2000 10:33
Subject: Re: Correct de
UTF-8, UTF-16, and UTF-32 all support exactly the same character repertoire.
Please look at www.unicode.org, on the front page is a link to the FAQs.
Mark
- Original Message -
From: "George Zeigler" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Thursday, October 05, 20
Thanks to the industriousness of volunteer translators and to Magda and
Julie's editorial work, we have many more translations of "What is Unicode"
on www.unicode.org (all in UTF-8, of course).
Check out http://www.unicode.org/unicode/standard/WhatIsUnicode.html. If you
have problems displaying a
Please take a look at the FAQ and material on www.unicode.org to see if it
answers your questions.
- Original Message -
From: "Jennifer Nguyen" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Wednesday, October 04, 2000 08:02
Subject: Starter questions
> Hi,
>
> I'm new
Please take a look at www.unicode.org
- Original Message -
From: "Karambir Rohilla" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Tuesday, October 03, 2000 21:17
Subject: help me !!!
>
>
>
> hello
>
> Please help me anyone
> waht is UTF8 & UTF16 ?
> regard
> kara
going
forward.
Mark
- Original Message -
From: Mark Davis
To: Unicode List
Sent: Tuesday, October 03, 2000 07:30
Subject: Re: [OT] Word select in Microsoft products?
Thanks for the detailed message. I tried it out,
and if I have a sentence like
If there are specific areas where the BIDI algorithm has flaws, that should
be communicated to the UTC bidi subcommittee, ideally with a proposal to fix
the problem.
Mark
- Original Message -
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent:
eems to work. Thanks!
Mark
- Original Message -
From: Chris Pratley
To: 'Mark Davis' ; Unicode List
Sent: Monday, October 02, 2000 13:12
Subject: RE: [OT] Word select in Microsoft products?
There are two edit
controls in the apps you mention
It would be more accurate to say that it does not support all of Unicode
3.0. Just using the phrase "doesn't support 3.0" suggests that it is not
compliant. A product can be compliant to a particular version of Unicode
while only supporting a subset of the characters.
Even compliant products with
[Off topic -- just looking for information from a
broad audience.]
Anyone know how to turn off the extremely annoying
automatic word select (AWS) in Microsoft products? This is the "feature" that
causes dragging outside of a word to behave like double-click. I often want
to select part of
There are a number of similarities between this XNS and IDN, so
http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-00.txt would be
worth reading.
On locales: using them is dangerous for matching. The only reason to add
locale is if it were to make a difference which letters match. But th
Are you sure about that?
On http://www.unicode.org/charts/PDF/U0400.pdf there is
0483 COMBINING CYRILLIC TITLO
Mark
- Original Message -
From: "Valeriy E. Ushakov" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Cc: "Aleksandar Poposki" <[EMAIL PROTECTED]>
Sent: Friday, Sept
If those can be confirmed, then the SpecialCasing file should be modified to add
them. Could you verify this in time for the next UTC?
Mark
Cathy Wissink wrote:
> I believe Azeri also uses the dotless i/dotted i Turkish-style casing.
>
> Cathy
>
> -Original Message-
> From: Carl W. Brow
ar need for other scripts such as
> > Arabic?
>
> Mark Davis replied
> > UCA (#10) already handles that. You will get a "fuzzy" compare if you
> > mask off less important weights, and you will get a much
> > better ordering
> > than binary compare
UCA (#10) already handles that. You will get a "fuzzy" compare if you
mask off less important weights, and you will get a much better ordering
than binary compare as well.
Mark
Hart, Edwin F. wrote:
> Is there a need for a "fuzzy" comparison where names with and without
> points in Hebrew? Is
Controlling Ligatures", in TUC 3.0, p. 318.
>
> Am 2000-09-15 um 14:40 UCT hat Mark Davis geschrieben:
> > I'd like to remind everyone to look at the latest version of the Unicode
> > Standard, especially when looking at fine points. To cite Unicode 3.0.1
> > (ht
more
>on the pronunciation rather than the exact spelling.
I didn't quite get the last sentence. I had thought that the vowel marks were used to
get the exact pronunciation. If that is not true, it may be part of my
misunderstanding of the situation.
> Jony
>
> > -Orig
> the Ethnologue staff created separate entries for "Allemanisch,"
> "Alsatian," and "Schwyzerdütsch," which *may* appease nationalistic
> preferences but definitely *does* result in inconsistency and
> confusion.
Interesting example. Some time ago I lived in eastern Switzerland for 4
years, and
I am curious why you feel so strongly that the Hebrew points should be ignored
in domain names. Prima facie, it seems that there is little harm in treating
them no differently from other characters. What problem would arise if the
domain was ABC.COM and I could not get it by typing AB*C.COM? (Here
I'd like to remind everyone to look at the latest version of the Unicode
Standard, especially when looking at fine points. To cite Unicode 3.0.1
(http://www.unicode.org/unicode/standard/versions/Unicode3.0.1.html)
"Section 13.2 Controlling Ligatures, page 318: the text is superseded by the
follow
I share the concern about combinatorial explosions. Look at Spanish, Arabic or
English, for example:
http://oss.software.ibm.com/developerworks/opensource/icu/localeexplorer/
I agree that de-*-sp1996 makes more sense. For us, the variant should go before
the country only if the variant is -- in ge
Not all code points are assigned (or even assignable) to characters. U+xx
is used to refer to code points, which range from 0 to 10FFFF. Of these code
points, some are assigned to characters (including regular characters, control
characters, format characters, and private use characters [whose
In general, we've found it far better to have low-level routines always
have APIs in terms of code units that they implement (e.g. bytes in this
case), and add higher-level routines that provide other interesting
boundary information (e.g. code point boundaries, grapheme boundaries,
word boundarie
Good point. In the past, I have used "surrogate characters" to refer to the
characters encoded above FFFF, and surrogate code units to refer to the UTF-16
units D800-DFFF. However, I think that leads to confusion. Nobody has come up
with a good term for all characters above FFFF. "Plane 1-16 chara
Take a look at the Unicode FAQ on the web, at www.unicode.org
"Gary P. Grosso" wrote:
> Hi Unicoders,
>
> I am working on software to emit HTML in the encoding
> and character set of the user's choice, from SGML/XML
> documents which can contain any Plane 1 Unicode character.
> The question is w
Mark Davis wrote:
> >
>
> > Hello all,
> > I have been trying to input unicode from a browser and store it in a database.
>The problem is the different encodings used to represent the unicode.
> > The input text is in the UTF-8 format. I have read on the Mic
In HTML or XML you always use the code point (e.g. UTF-32), not a series of
code units (UTF-8 or UTF-16). Thus you would use:
&#x10123;
not &#xD800;&#xDD23; from UTF-16
nor &#xF0;&#x90;&#x84;&#xA3; from UTF-8
Mark
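The distinction can be sketched in Python for a supplementary character such as U+10123. The helper names are hypothetical. The numeric character reference is built from the code point itself; the UTF-16 code units and UTF-8 bytes are computed only to show what the reference should not contain.

```python
def numeric_character_reference(ch):
    """Hex numeric character reference built from the code point of `ch`."""
    return "&#x{:X};".format(ord(ch))

def utf16_code_units(ch):
    """UTF-16 code units of `ch`, for comparison only."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

ch = "\U00010123"
print(numeric_character_reference(ch))          # the form to use
print([hex(u) for u in utf16_code_units(ch)])   # not these code units
print(ch.encode("utf-8").hex())                 # nor these bytes
```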
Brendan Murray/DUB/Lotus wrote:
> How can one encode a surrogate character as an entity in HTML/XML? Should
> it be as t