Why is this surprising?
Encoding a script is many, many orders of magnitude more complex than
encoding emoji. This is especially true given that the scripts that remain
unencoded are largely used by small populations (or, in the case of historic
scripts, by *no* population at all). It is a complex
> A few months ago I asked a class of 140+ first year Computer Science
> programme and Joint programme students -
>
> Who has heard of Unicode?
I do a similar survey whenever I teach the remedial I18N and Unicode classes at
Amazon. When I ask if software developers *ever* received any formal education
I agree, although I note that sometimes the additional (redundant) specificity
of "non-7-bit-ASCII characters" is needed when talking to people unclear on
what "ASCII" means.
Addison
What you might be looking for would be the CLDR project’s “exemplar sets” (see
for example [1]), which describe which characters are customarily used for a
given language and which are sometimes used. However, this is not the same
thing as statistical distribution. One of the points of Unicode
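For illustration, here is a minimal TypeScript sketch of working with an
exemplar set. The French sample string is abbreviated and hand-copied, and the
toy parser ignores the ranges and {multi-character} sequences that real CLDR
data can contain.

    // Toy parser for a CLDR-style exemplar set string (abbreviated French sample).
    const frExemplars =
      "[a à â æ b c ç d e é è ê ë f g h i î ï j k l m n o ô œ p q r s t u ù û ü v w x y ÿ z]";

    function parseExemplarSet(raw: string): Set<string> {
      const inner = raw.replace(/^\[|\]$/g, "");
      return new Set(inner.split(/\s+/).filter(Boolean));
    }

    const fr = parseExemplarSet(frExemplars);
    console.log(fr.has("é"), fr.has("ß")); // true false: ß is not customary in French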
A "Unicode set" in this context means "a set of code points". This is discussed
in section 1.2:
--
This is done by providing syntax for sets of characters based on the Unicode
character properties, and allowing them to be mixed with lists and ranges of
individual code points.
--
More generally
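ECMAScript offers something loosely analogous in u-flagged regular expressions.
This is not ICU UnicodeSet syntax, just an illustration of mixing a property
with ranges and individual code points in one set:

    // A character class mixing a Unicode property (\p{Letter}), an explicit
    // range (0-9), and individual code points (U+00B7 and the hyphen).
    const identifierish = /^[\p{Letter}0-9\u00B7-]+$/u;

    console.log(identifierish.test("año·2013")); // true
    console.log(identifierish.test("a b"));      // false: the space is not in the set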
The thread on serif.com discusses formatting of poetry in a Kindle book. The
problem is that the author would like to indent two lines.
You don't want to do that by using a character that "looks like a space" yet
isn't seen by the software to be a space. This would break features like
dictionary lookup.
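A quick TypeScript/JavaScript check makes the difference concrete (U+2800
BRAILLE PATTERN BLANK is used here purely as an example of a character that
merely looks blank):

    // U+2003 EM SPACE is genuine whitespace; U+2800 BRAILLE PATTERN BLANK only
    // looks blank. Anything that finds word boundaries via whitespace will
    // treat the two very differently.
    console.log(/\s/u.test("\u2003"));           // true
    console.log(/\s/u.test("\u2800"));           // false
    console.log("two\u2003words".split(/\s+/u)); // [ "two", "words" ]
    console.log("two\u2800words".split(/\s+/u)); // [ "two⠀words" ]: one "word"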
Actually, that's my bad: I meant to type scalar value.
Stephan Stiller wrote:
On 9/15/2013 3:07 PM, Phillips, Addison wrote:
Not if the limit is counted in characters and not in bytes. Twitter, for
example, counts code points in the NFC representation of a tweet.
"character&quo
Not if the limit is counted in characters and not in bytes. Twitter, for
example, counts code points in the NFC representation of a tweet.
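For concreteness, counting that way in TypeScript (the limit itself and
Twitter's exact rules are outside this sketch):

    // Code points of the NFC form, as opposed to UTF-16 code units or UTF-8 bytes.
    const tweet = "cafe\u0301"; // "café" spelled with COMBINING ACUTE ACCENT
    console.log(tweet.length);                           // 5 UTF-16 code units
    console.log([...tweet.normalize("NFC")].length);     // 4 code points after NFC
    console.log(new TextEncoder().encode(tweet).length); // 6 UTF-8 bytes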
Doug Ewell wrote:
Andre Schappo wrote:
> U+2026 is useful for microblogs when one is looking to save characters
Not if the microblog is in UTF-8, as almost all are: in UTF-8, U+2026 takes
three bytes, the same as three full stops.
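The byte arithmetic behind that remark, for illustration:

    // In UTF-8, U+2026 HORIZONTAL ELLIPSIS encodes as three bytes (E2 80 A6),
    // exactly the same as three full stops, so it saves nothing when the limit
    // is counted in bytes. It does save two code points.
    const utf8Bytes = (s: string) => new TextEncoder().encode(s).length;
    console.log(utf8Bytes("\u2026"), utf8Bytes("..."));   // 3 3
    console.log([..."\u2026"].length, [..."..."].length); // 1 3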
What kind of document do you mean?
For Web formats (HTML, etc.), the answer is "no".
Addison
Addison Phillips
Globalization Architect (Amazon Lab126)
Chair (W3C I18N WG)
Internationalization is not a feature.
It is an architecture.
"Back up" here refers to decrementing the pointer in the string.
If you have a string consisting of the following UTF-16 code units, for example:
00C0 0020 20AC D800 DC00 00C5
   0    1    2    3    4    5
If you set the pointer to code unit number 4 (counting from 0), you'll be
pointing at the low surrogate DC00, in the middle of the pair D800 DC00 (which
encodes U+10000); to land on a code point boundary you have to back up to code
unit 3.
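The same idea as a small TypeScript sketch, with the string represented as an
array of UTF-16 code units:

    // Backing up to a code point boundary: if the unit at the index is a low
    // (trail) surrogate, the boundary is one unit earlier.
    const units = [0x00c0, 0x0020, 0x20ac, 0xd800, 0xdc00, 0x00c5];

    function backUpToBoundary(u: number[], i: number): number {
      // Low surrogates are 0xDC00..0xDFFF; landing on one means we are inside
      // the pair D800 DC00 (here encoding U+10000).
      return u[i] >= 0xdc00 && u[i] <= 0xdfff ? i - 1 : i;
    }

    console.log(backUpToBoundary(units, 4)); // 3: back up to the lead surrogate
    console.log(backUpToBoundary(units, 2)); // 2: already on a boundary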
Martin wrote:
>
> Quite a few people might expect their Japanese filenames to appear with a
> Japanese font/with Japanese glyph variants, and their Chinese filenames to
> appear with a Chinese font/Chinese glyph variants. But that's never how this
> was planned, and that's not how it works today.
Hi Roger,
(This is a personal response, with chair hat off)
It is very useful to read the big yellow box at the start of that document,
which says:
--
This version of this document was published to indicate the
Internationalization Core Working Group's intention to substantially alter or
replace it.
> Code points 2066, 2067, and 2068 are unassigned. I presume you mean
> U+202B RIGHT-TO-LEFT EMBEDDING (RLE) and U+202C POP DIRECTIONAL
> FORMATTING.
As Roozbeh pointed out, he means the characters added that provide bidi
isolation.
The W3C Internationalization WG recommends that you use markup
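A minimal sketch of the plain-text fallback, for contexts where markup is not
available (in HTML, the dir attribute or the bdi element is the markup route):

    // Wrapping an opposite-direction or untrusted piece of text in
    // FIRST STRONG ISOLATE ... POP DIRECTIONAL ISOLATE (U+2068/U+2069).
    const FSI = "\u2068";
    const PDI = "\u2069";

    function isolate(text: string): string {
      return FSI + text + PDI;
    }

    // Hebrew letters used as a stand-in user name.
    console.log(`User ${isolate("\u05D0\u05D1\u05D2")} posted 3 comments`);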
"Unicode processor"??
If what you're looking for is code that breaks text into grapheme
clusters/words/lines/etc., that's called "text segmentation" and is described
in:
http://www.unicode.org/reports/tr29/
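Recent ECMAScript engines expose UAX #29-style segmentation directly through
Intl.Segmenter; a short sketch (elsewhere, a library such as ICU provides the
equivalent):

    // Word segmentation per UAX #29; "grapheme" and "sentence" granularities
    // also exist.
    const words = new Intl.Segmenter("en", { granularity: "word" });
    for (const { segment, isWordLike } of words.segment("They're coming, aren't they?")) {
      if (isWordLike) console.log(segment); // They're, coming, aren't, they
    }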
But you go on to talk about characters and their properties... if you're
looking
The block is actually named for it. See:
http://www.unicode.org/charts/PDF/U0080.pdf
This FAQ talks about it:
http://www.unicode.org/faq/blocks_ranges.html
Finally, p217 of the standard actually says so explicitly:
http://www.unicode.org/versions/Unicode6.2.0/ch07.pdf
Addison
Doug opined:
>
> >>> I can state that for Israel the scripts in common use are Hebrew,
> >>> Latin (mainly for English but also for several other languages),
> >>> Arabic and Cyrillic.
> >>
> >> I do believe that Israel and Palestine (the Gaza Strip and West Bank
> >> areas) also use the Greek alphabet
Asmus opined:
I think Yucca has a point.
When the document is in English, it doesn't make sense to display the footer
date in the system locale.
The locale used for this function should either be that of the site or that of
the page.
AP> And hence the work to internationalize JavaScript and provide
AP> locale-aware APIs.
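For illustration, formatting the footer date in the page's declared language
rather than the system locale, using the ECMAScript Intl API (reading the
locale from document.documentElement.lang is an assumption, not a requirement):

    // Browser environment assumed; falls back to "en" if the page declares no lang.
    const pageLocale = document.documentElement.lang || "en";
    const footerDate = new Intl.DateTimeFormat(pageLocale, { dateStyle: "long" })
      .format(new Date());
    console.log(footerDate); // formatted per the page's language, not the viewer's OS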
That's for search analysis, not rendering.
Sent from my iPhone
On Oct 8, 2011, at 7:45 AM, "Andreas Prilop" wrote:
> On Fri, 7 Oct 2011, Gerrit wrote:
>
>> So if somebody from Google reads this,
>> [...]
>> Additionally, if the standard Android web browser could then
>> use the html “lang” tag
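A small DOM sketch of what that would look like from script (browser
environment assumed; whether distinct glyphs actually appear depends on the
fonts installed and the renderer):

    // The same Han ideograph (U+9AA8) tagged with different lang values, so the
    // renderer may choose language-appropriate glyph variants.
    for (const lang of ["ja", "zh-Hans", "zh-Hant"]) {
      const span = document.createElement("span");
      span.lang = lang;       // equivalent to the HTML lang attribute
      span.textContent = "\u9AA8 ";
      document.body.appendChild(span);
    }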
> sowmya satyanarayana wrote:
>
> > Taking this, what is the best way to define the _T(x) macro of the
> > UNICODE version, so that my strings will always be 2-byte wide
> > characters?
>
> Unicode characters aren't always 2 bytes wide. Characters with
> values of U+10000 and greater take two 16-bit code units (four bytes) in
> UTF-16.
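The same point in a UTF-16-based scripting environment, for illustration (the
Windows _T/TCHAR details are a separate question):

    // Characters above U+FFFF occupy two 16-bit code units (a surrogate pair),
    // so "two bytes per character" does not hold in general.
    const s = "A\u{1D11E}"; // LATIN CAPITAL LETTER A + MUSICAL SYMBOL G CLEF
    console.log(s.length);                       // 3 UTF-16 code units
    console.log([...s].length);                  // 2 code points
    console.log(s.charCodeAt(1).toString(16));   // "d834": lead surrogate only
    console.log(s.codePointAt(1)!.toString(16)); // "1d11e": the full code point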
Hello,
UAX #29 (Unicode Text Segmentation) discusses this at length. See especially
the section on grapheme cluster boundaries:
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
Certainly a function that returns the first code point of a string is different
from one that finds the first grapheme cluster.
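For illustration, the two "first character" notions side by side, using
Intl.Segmenter where available:

    // First code point versus first grapheme cluster (user-perceived character).
    const text = "g\u0308rau"; // "g" + COMBINING DIAERESIS, then "rau"
    const firstCodePoint = String.fromCodePoint(text.codePointAt(0)!);
    const graphemes = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    const firstGrapheme = [...graphemes.segment(text)][0].segment;
    console.log(firstCodePoint); // "g": the base letter alone
    console.log(firstGrapheme);  // "g̈": base letter plus its combining mark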