Consonant shifters and ZWNJ in Khmer

2018-07-20 Thread Norbert Lindenberg via Unicode
The section on consonant shifters in the Khmer section of the Unicode standard 
(page 647 of Unicode 11 [1]) isn’t entirely clear on where the zero width 
non-joiner should be placed to prevent a consonant shifter that’s followed by 
an above-base vowel from being changed to a below-base glyph.

First, it says “U+200C zero width non-joiner should be inserted before the 
consonant shifter” to prevent the change. Then it continues “in such cases, 
U+200C zero width non-joiner is inserted before the vowel sign”, which could be 
interpreted as “after the consonant shifter”. Finally, the examples show ZWNJ 
inserted before the consonant shifter.

The OpenType Khmer shaping description [2], on the other hand, expects ZWNJ to 
be inserted between the consonant shifter (here called RegShift) and the 
above-base vowel.
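
To make the two readings concrete, here is a minimal Python sketch that builds both candidate sequences. The characters used (U+1794 KHMER LETTER BA, U+17CA KHMER SIGN TRIISAP, U+17B7 KHMER VOWEL SIGN I) are an illustrative assumption and are not taken from the examples in the standard; the helper name zwnj_position is hypothetical.

```python
ZWNJ = "\u200C"                  # ZERO WIDTH NON-JOINER
SHIFTERS = {"\u17C9", "\u17CA"}  # KHMER SIGN MUUSIKATOAN, KHMER SIGN TRIISAP

BA = "\u1794"        # KHMER LETTER BA (illustrative choice)
TRIISAP = "\u17CA"   # consonant shifter
VOWEL_I = "\u17B7"   # above-base vowel sign

# Reading 1: ZWNJ before the consonant shifter (as in the standard's examples).
zwnj_before_shifter = BA + ZWNJ + TRIISAP + VOWEL_I
# Reading 2: ZWNJ between shifter and vowel (as the OpenType description expects).
zwnj_after_shifter = BA + TRIISAP + ZWNJ + VOWEL_I


def zwnj_position(text):
    """Report where ZWNJ sits relative to the first consonant shifter in text."""
    for i, ch in enumerate(text):
        if ch in SHIFTERS:
            if i > 0 and text[i - 1] == ZWNJ:
                return "before shifter"
            if i + 1 < len(text) and text[i + 1] == ZWNJ:
                return "after shifter"
            return "no ZWNJ"
    return "no shifter"
```

A helper like this could be run over existing Khmer text to see which placement actually occurs; it says nothing about how a given font or shaping engine renders either sequence.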

Questions to the people here who have dealt with Khmer: How is this handled in 
real life?

Thanks,
Norbert

[1] https://www.unicode.org/versions/Unicode11.0.0/ch16.pdf
[2] https://docs.microsoft.com/en-us/typography/script-development/khmer


Re: metric for block coverage

2018-02-23 Thread Norbert Lindenberg via Unicode

> On Feb 18, 2018, at 3:26 , Khaled Hosny via Unicode wrote:
> 
> On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass via Unicode wrote:
>> Adam Borowski wrote,
>> 
>>> I'm looking for a way to determine a font's coverage of available scripts.
>>> It's probably reasonable to do this per Unicode block.  Also, it's a safe
>>> assumption that a font which doesn't know a codepoint can do no complex
>>> shaping of such a glyph, thus looking at just codepoints should be adequate
>>> for our purposes.
>> 
>> You probably already know that basic script coverage information is
>> stored internally in OpenType fonts in the OS/2 table.
>> 
>> https://docs.microsoft.com/en-us/typography/opentype/spec/os2
>> 
>> Parsing the bits in the "ulUnicodeRange..." entries may be the
>> simplest way to get basic script coverage info.
> 
> Though this might not be very reliable since OpenType does not have a
> definition of what it means for a Unicode block to be supported; some
> font authoring tools use a percentage, others use the presence of any
> characters in the range, and fonts might even provide incorrect data for
> any reason.
> 
> However, I don’t think script or block coverage is that useful, what
> users are usually interested in is the language coverage.
> 
> Regards,
> Khaled


All true. In addition, ulUnicodeRange ran out of bits around Unicode 5.1, so 
scripts/blocks added to Unicode after that, such as Javanese, Tangut, or Adlam, 
cannot be represented. 
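
For reference, the four 32-bit ulUnicodeRange fields together provide only 128 bit positions, which is why the space could run out. The following Python sketch lists which bit numbers are set, assuming the four field values have already been read from the OS/2 table by some other means; the function name is hypothetical, and mapping bit numbers to block names is left to the table in the OpenType specification.

```python
def unicode_range_bits(range1, range2, range3, range4):
    """Return the bit numbers (0-127) set across the four ulUnicodeRange fields.

    Bit 0 is the least significant bit of ulUnicodeRange1, bit 32 the least
    significant bit of ulUnicodeRange2, and so on.
    """
    bits = []
    for index, field in enumerate((range1, range2, range3, range4)):
        for bit in range(32):
            if field & (1 << bit):
                bits.append(index * 32 + bit)
    return bits
```

As noted in the quoted message, a set bit only reflects what the font's authoring tool chose to declare, so the result is a hint rather than a coverage measurement.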

Norbert




Re: Unicode education in Schools

2017-08-26 Thread Norbert Lindenberg via Unicode
ECMAScript 6 fixed that, largely along the lines of my proposal:
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

Norbert


> On Aug 24, 2017, at 22:14 , Peter Constable via Unicode wrote:
> 
> I thought Javascript had a UCS-2 understanding of Unicode strings. Has it 
> managed to progress beyond that?
>
> Peter
>
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of David Starner 
> via Unicode
> Sent: Thursday, August 24, 2017 5:18 PM
> To: Unicode Mailing List 
> Subject: Fwd: Unicode education in Schools
>
> -- Forwarded message -
> From: David Starner 
> Date: Thu, Aug 24, 2017, 6:16 PM
> Subject: Re: Unicode education in Schools
> To: Richard Wordingham 
>
> On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode wrote:
> 
> Just steer them away from UTF-16!  (And vigorously prohibit the very
> concept of UCS-2).
> 
> Richard.
>
> Steer them away from reinventing the wheel. If they use Java, use Java 
> strings. If they're using GTK, use strings compatible with GTK. If they're 
> writing JavaScript, use JavaScript strings. There's basically no system 
> without Unicode strings or that they would be better off rewriting the wheel.
> 




Re: Northern Khmer on iPhone

2017-03-02 Thread Norbert Lindenberg
On iOS, applications can and do install custom fonts for system-wide use, 
although the installation user experience is pretty bad:
http://norbertlindenberg.com/2015/06/installing-fonts-on-ios/index.html

Norbert


> On Mar 1, 2017, at 18:43 , Alastair Houghton wrote:

[…]

> (Also, FYI, iOS applications can - and some do - install and use their own 
> fonts.  It’s per-application, though; you can’t install them system-wide.)




Re: Suppressing Ligation of Spacing Marks

2016-11-09 Thread Norbert Lindenberg
The part of the specification of the Universal Shaping Engine [1] that deals 
with ZWNJ is a bit unclear, but I read it to mean that ZWNJ should not cause 
the insertion of a dotted circle if the character following it has general 
category Mn or Mc.

The USE specification says: “The zero-width non-joiner is used to prevent a 
fusion of two characters. It continues a preceding cluster but causes a cluster 
break after itself when the following character is not a mark character (gc=Mn 
or gc=Mc).”

The specification does not say how this character should be handled in cluster 
validation. I assume first that the statement about the combining grapheme 
joiner also applies to ZWNJ: “CGJ has been omitted from the above schema in 
order to avoid unnecessary complexity”. I further interpret the little the spec 
does say about ZWNJ to imply that it should be allowed before any character 
with general category Mn or Mc, without affecting the validity of the cluster. 
Inserting a dotted circle would be equivalent to causing a cluster break, which 
the spec rules out when the following character has general category Mn or Mc.

 U+1A63 has gc=Mc, so it shouldn’t be preceded by a dotted circle in the 
sequence . Note that I omitted the first “…” from the 
sequence you provided, because an intervening character might trigger the 
dotted circle.
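
The reading described above can be sketched in a few lines of Python using the standard unicodedata module. This is only an illustration of my interpretation of the quoted sentence, not the USE's actual cluster model, and the function name is hypothetical.

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def zwnj_breaks_cluster(following):
    """True if, under this reading, a ZWNJ causes a cluster break before `following`.

    The quoted sentence says the break happens only when the following
    character is not a mark, that is, not gc=Mn and not gc=Mc.
    """
    return unicodedata.category(following) not in ("Mn", "Mc")
```

With this rule, zwnj_breaks_cluster("\u1A63") is false because U+1A63 has gc=Mc, so no cluster break (and hence no dotted circle) would be expected before it.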

So this may just be a bug in the implementation of the USE that you’re using. I 
see this bug in Safari (CoreText), but not in Firefox (HarfBuzz); I haven’t 
tried Edge. Which one are you using?

[1] http://www.microsoft.com/typography/OpenTypeDev/USE/intro.htm

Best regards,
Norbert


> On Nov 8, 2016, at 18:09 , Richard Wordingham wrote:
> 
> Should it be possible to suppress the ligation of a base character and
> a visually following spacing mark in plain text?
> 
> The example I have in mind is the sequence  U+1A63 TAI THAM VOWEL SIGN AA>.  It may be desirable to suppress the
> ligation because both ligands have subscript consonants.  However, if
> I write , the Universal Shaping Engine
> decides that the ZWNJ triggers a new syllable, and inserts a dotted
> circle before SIGN AA.  (The dotted circle after SIGN AA results from a
> failure to read the proposal for the Lanna script as it was then
> called.)
> 
> Richard.
> 




Re: Unicode in passwords

2015-10-06 Thread Norbert Lindenberg

> On Oct 6, 2015, at 6:04 , Philippe Verdy  wrote:
> 
> In those conditions, normalizing the Java string will leave those lone 
> surrogates (and non-characters) as is, or will throw an exception, depending 
> on the API used. Java strings do not have any implied encoding (their "char" 
> members are also unrestricted 16-bit code units, they have some basic 
> properties but only in BMP, defined in the builtin Character class API: 
> properties for non-BMP characters require using a library to provide them, 
> such as ICU4J).

The Java Character class was enhanced in J2SE 5.0 to support supplementary 
characters. The String class was specified to be based on UTF-16, and string 
processing throughout the platform was updated to support supplementary 
characters based on UTF-16. These changes have been available to the public 
since 2004. For a summary, see
http://www.oracle.com/technetwork/articles/java/supplementary-142654.html

Norbert


Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-09 Thread Norbert Lindenberg
RFC 7158 section 7 [1] provides not only the \uXXXX notation for Unicode code 
points in the Basic Multilingual Plane, but also a 12-character sequence 
encoding the UTF-16 surrogate pair (i.e. \uXXXX\uYYYY with 0xD800 ≤ XXXX < 
0xDC00 ≤ YYYY ≤ 0xDFFF) for supplementary Unicode code points. A tool checking 
for escape sequences that don’t correspond to any Unicode character must be 
aware of this, because neither \uXXXX nor \uYYYY by itself would correspond to 
any Unicode character, but their combination may well do so.
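
A checker along these lines might look like the following Python sketch. It is a rough illustration rather than an existing tool: the function name is hypothetical, "does not correspond to a Unicode character" is interpreted here as general category Cn (unassigned or noncharacter) plus unpaired surrogates, and the answer depends on the Unicode version of the Python runtime's character database.

```python
import re
import unicodedata

# Matches every backslash escape. Group 1 is set only for "u" plus four hex
# digits. Consuming other escapes (such as an escaped backslash) keeps the
# text after them from being misread as the start of a Unicode escape.
ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|.)", re.DOTALL)


def find_suspicious_escapes(json_text):
    """Return (offset, reason) pairs for escapes naming no assigned character."""
    escapes = [
        (m.start(), m.end(), int(m.group(1), 16))
        for m in ESCAPE.finditer(json_text)
        if m.group(1) is not None
    ]
    problems = []
    i = 0
    while i < len(escapes):
        start, end, value = escapes[i]
        i += 1
        if 0xD800 <= value < 0xDC00:
            # High surrogate: valid only if a low surrogate follows immediately.
            nxt = escapes[i] if i < len(escapes) else None
            if nxt and nxt[0] == end and 0xDC00 <= nxt[2] <= 0xDFFF:
                value = 0x10000 + ((value - 0xD800) << 10) + (nxt[2] - 0xDC00)
                i += 1
            else:
                problems.append((start, "lone high surrogate"))
                continue
        elif 0xDC00 <= value <= 0xDFFF:
            problems.append((start, "lone low surrogate"))
            continue
        if unicodedata.category(chr(value)) == "Cn":
            problems.append((start, "unassigned U+%04X" % value))
    return problems
```

The sketch scans raw text rather than parsing JSON, so it assumes the input is otherwise well-formed; private-use code points are deliberately not reported.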

Norbert

[1] https://tools.ietf.org/html/rfc7158#section-7


 On May 7, 2015, at 5:46 , Costello, Roger L. coste...@mitre.org wrote:
 
 Hi Folks,
 
 The JSON specification says that a character may be escaped using this 
 notation: \uXXXX (XXXX are four hex digits)
 
 However, not every four hex digits corresponds to a Unicode character. 
 
 Are there tools to scan a JSON document to detect the presence of \uXXXX, 
 where XXXX does not correspond to any Unicode character?
 
 /Roger
 




Re: Swift

2014-06-08 Thread Norbert Lindenberg
It does allow some usage that may surprise code reviewers – for example, this 
is a valid Swift program:

let s = 
let s︀ = 
let ︀ = 
let all = s + s︀ + ︀

The value of the constant “all” is . Or at least it is as long as mail 
software doesn’t harm the variation selectors…
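
Why a variation selector can stand where an identifier starts can be checked against the identifier-head ranges quoted further down in this message. The Python sketch below transcribes those ranges by hand; it reads the supplementary-plane entries as U+10000–U+1FFFD through U+E0000–U+EFFFD and includes the underscore because the prose description mentions it. The function name is hypothetical, and the table reflects the 2014 prerelease grammar as quoted, not necessarily later Swift versions.

```python
# Identifier-head ranges transcribed from the grammar quoted in this thread.
_BMP_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A),          # A-Z, a-z
    (0x5F, 0x5F),                        # underscore, per the prose description
    (0xA8, 0xA8), (0xAA, 0xAA), (0xAD, 0xAD), (0xAF, 0xAF),
    (0xB2, 0xB5), (0xB7, 0xBA),
    (0xBC, 0xBE), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF),
    (0x100, 0x2FF), (0x370, 0x167F), (0x1681, 0x180D), (0x180F, 0x1DBF),
    (0x1E00, 0x1FFF),
    (0x200B, 0x200D), (0x202A, 0x202E), (0x203F, 0x2040), (0x2054, 0x2054),
    (0x2060, 0x206F),
    (0x2070, 0x20CF), (0x2100, 0x218F), (0x2460, 0x24FF), (0x2776, 0x2793),
    (0x2C00, 0x2DFF), (0x2E80, 0x2FFF),
    (0x3004, 0x3007), (0x3021, 0x302F), (0x3031, 0x303F), (0x3040, 0xD7FF),
    (0xF900, 0xFD3D), (0xFD40, 0xFDCF), (0xFDF0, 0xFE1F), (0xFE30, 0xFE44),
    (0xFE47, 0xFFFD),
]
# Planes 1 through 14, each from xx0000 to xxFFFD.
_SUPPLEMENTARY = [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 0xF)]
HEAD_RANGES = _BMP_RANGES + _SUPPLEMENTARY


def is_identifier_head(code_point):
    """True if code_point falls in one of the quoted identifier-head ranges."""
    return any(lo <= code_point <= hi for lo, hi in HEAD_RANGES)
```

The variation selectors U+FE00–U+FE0F lie inside the quoted range U+FDF0–U+FE1F, so is_identifier_head(0xFE00) is true, which is consistent with a variation selector being usable on its own as a name.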

Norbert


On Jun 5, 2014, at 9:06 , Mark Davis ☕️ m...@macchiato.com wrote:

 I haven't done any analysis, but on first glance it looks like it is based on 
 
 http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax
 
 
 Mark
 
 — Il meglio è l’inimico del bene —
 
 
 On Thu, Jun 5, 2014 at 5:46 PM, Jeff Senn s...@maya.com wrote:
 Has anyone figured out whether character sequences that are non-canonical 
 (de)compositions but could be recomposed to the same result
 are the same identifier or not?
 
 That is: are identifiers merely sequences of characters or intended to be 
 comparable as “Unicode strings” (under some sort of compatibility rule)?
 
 On Jun 5, 2014, at 11:27 AM, Martin v. Löwis mar...@v.loewis.de wrote:
 
  Am 04.06.14 11:28, schrieb Andre Schappo:
  The restrictions seem a little like IDNA2008. Anyone have links to
  info giving a detailed explanation/tabulation of allowed and non
  allowed Unicode chars for Swift Variable and Constant names?
 
  The language reference is at
 
  https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html
 
  For reference, the definition of identifier-character is (read each
  line as an alternative)
 
  identifier-character → Digit 0 through 9
  identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or
  U+FE20–U+FE2F
  identifier-character → identifier-head
 
  where identifier-head is
 
  identifier-head → Upper- or lowercase letter A through Z
  identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or
  U+00B7–U+00BA
  identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or
  U+00F8–U+00FF
  identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or
  U+180F–U+1DBF
  identifier-head → U+1E00–U+1FFF
  identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054,
  or U+2060–U+206F
  identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or
  U+2776–U+2793
  identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF
  identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or
  U+3040–U+D7FF
  identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or
  U+FE30–U+FE44
  identifier-head → U+FE47–U+FFFD
  identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or
  U+40000–U+4FFFD
  identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or
  U+80000–U+8FFFD
  identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or
  U+C0000–U+CFFFD
  identifier-head → U+D0000–U+DFFFD or U+E0000–U+EFFFD
 
  As the construction principle for this list, they say
 
  Identifiers begin with an upper case or lower case letter A through Z,
  an underscore (_), a noncombining alphanumeric Unicode character in the
  Basic Multilingual Plane, or a character outside the Basic Multilingual
  Plane that isn’t in a Private Use Area. After the first character, digits
  and combining Unicode characters are also allowed.
 
  Regards,
  Martin
  ___
  Unicode mailing list
  Unicode@unicode.org
  http://unicode.org/mailman/listinfo/unicode
 
 




Re: data for cp1252

2012-12-08 Thread Norbert Lindenberg
On Dec 7, 2012, at 17:48 , Buck Golemon wrote:

 It's also correct. *All* browsers have this behavior. The W3C has found this 
 behavior to be correct. Opera at one point in time implemented the current 
 unicode.org cp1252 spec, but was forced to change to the W3C spec by 
 real-world requirements.

Correction: The W3C has not said anything on this matter. The proposed encoding 
specification was written by Anne under the WHATWG umbrella. The W3C 
Internationalization working group (of which I'm a member) and Anne met during 
TPAC 2012 in October and agreed to kick off the process of turning the spec 
into a W3C recommendation by publishing it as a working draft. It may well 
change somewhat on the way.

This discussion actually makes me think of one necessary change: The 
specification should clarify that it does not redefine existing encodings, and 
not label the mappings provided by the spec with existing encoding names. The 
spec is targeting web user agents, but the encodings are also used in many 
software systems that are not and don't directly interact with web user agents, 
and the spec shouldn't be interpreted to interfere with those uses.
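
To illustrate why the label matters: as far as I know, Python's "cp1252" codec follows the unicode.org mapping and leaves five bytes undefined, whereas the web encoding specification maps those bytes to the corresponding C1 control code points. The sketch below keeps the two behaviours under separate names; the function name decode_cp1252_web is hypothetical and not part of any library.

```python
# Bytes that the unicode.org cp1252 mapping leaves undefined.
UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def decode_cp1252_web(data):
    """Decode bytes as cp1252, mapping the five undefined bytes to C1 controls.

    This approximates the behaviour attributed to web browsers; the strict
    behaviour remains available as data.decode("cp1252").
    """
    return "".join(
        chr(b) if b in UNDEFINED else bytes([b]).decode("cp1252")
        for b in data
    )
```

Software that relies on the strict mapping keeps calling the strict codec and still gets an error for undefined bytes, while code that needs browser-compatible behaviour opts in explicitly.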

Norbert





Supplementary characters in the Java(TM) platform

2004-02-27 Thread Norbert Lindenberg
I know a number of you are curious about how the Java platform will 
support the supplementary characters of the Unicode standard. The JSR 
204 expert group, consisting of experts of ten companies, has today 
published the Public Review Draft of its specification at:
http://jcp.org/aboutJava/communityprocess/review/jsr204/index.html

Highlights:
- Low-level APIs use the primitive type int to represent Unicode code 
points.
- Higher-level APIs rely on char sequences, such as String and char[], 
which are interpreted as UTF-16 sequences.
- There are methods to easily convert between various char and code 
point based representations.
- Supplementary characters are allowed in Java programming language 
identifiers.
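
The relationship between int code points and UTF-16 char sequences can be illustrated independently of the Java APIs. The following Python sketch shows the underlying arithmetic only; it is not the JSR 204 API, and the function names are hypothetical.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (as ints) that encode one code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (code_point,))
    if code_point < 0x10000:
        return [code_point]              # BMP: a single 16-bit unit
    offset = code_point - 0x10000        # 20 bits split across two units
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate into a supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, the supplementary character U+1D456 occupies two 16-bit units, 0xD835 and 0xDC56, while U+0041 occupies one.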

Almost all of the specified functionality is implemented in the beta 
release of J2SE 1.5, which is available at:
http://java.sun.com/j2se/1.5.0/index.jsp

If you have comments, please send them to the official feedback address:
[EMAIL PROTECTED]
You can hear more about the supplementary character support in the J2SE 
platform in session B6 of the upcoming Unicode conference:
http://www.unicode.org/iuc/iuc25/a338.html

Sorry if you see this message twice - my posting yesterday made it into 
the digest, but not to all subscribers or into the archive.

Best regards,
Norbert



Re: Does Java 1.5 support Unicode math alphanumerics as variable names?

2004-02-26 Thread Norbert Lindenberg
Murray,

Yes, starting from J2SE 1.5 the Java programming language allows 
supplementary characters in identifiers if they meet the specifications 
of the new methods java.lang.Character.isJavaIdentifierStart(int) and 
java.lang.Character.isJavaIdentifierPart(int).

Sorry for the late reply - under the rules of the Java Community 
Process it had to wait until the Public Review Draft was published.

Best regards,
Norbert
On Jan 23, 2004, at 17:46, Murray Sargent wrote:

E.g., math italic i (U+1D456)? With such usage, Java mathematical
programs could look more like the original math.
Thanks
Murray




Re: Traditional dollar sign

2003-10-27 Thread Norbert Lindenberg
The holographic strip on the Euro notes shows the Euro symbol when
viewed at certain angles.

Norbert


Peter Kirk wrote:

 The latest issue of UK banknotes do carry the pound sterling sign (with
 one crossbar), but this is quite new. At least the more recent former
 issues did not, if I remember correctly.
 
 I was surprised to find no Euro symbol on Euro notes or coins.