Consonant shifters and ZWNJ in Khmer
The section on consonant shifters in the Khmer section of the Unicode standard (page 647 of Unicode 11 [1]) isn’t entirely clear on where the zero width non-joiner should be placed to prevent a consonant shifter that’s followed by an above-base vowel from being changed to a below-base glyph. First, it says “U+200C zero width non-joiner should be inserted before the consonant shifter” to prevent the change. Then it continues “in such cases, U+200C zero width non-joiner is inserted before the vowel sign”, which could be interpreted as “after the consonant shifter”. Finally, the examples show ZWNJ inserted before the consonant shifter.

The OpenType Khmer shaping description [2], on the other hand, expects ZWNJ to be inserted between the consonant shifter (here called RegShift) and the above-base vowel.

Questions to the people here who have dealt with Khmer: How is this handled in real life?

Thanks,
Norbert

[1] https://www.unicode.org/versions/Unicode11.0.0/ch16.pdf
[2] https://docs.microsoft.com/en-us/typography/script-development/khmer
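To make the two readings concrete, the following Python sketch spells out both candidate code point sequences. The choice of U+1794 KHMER LETTER BA, U+17C9 KHMER SIGN MUUSIKATOAN and U+17B7 KHMER VOWEL SIGN I is an illustrative assumption, not an example taken from the standard or from the OpenType document, and all variable and function names are made up for this sketch.

```python
import unicodedata

ZWNJ = "\u200C"          # ZERO WIDTH NON-JOINER
BASE = "\u1794"          # KHMER LETTER BA (illustrative choice)
SHIFTER = "\u17C9"       # KHMER SIGN MUUSIKATOAN, a consonant shifter
ABOVE_VOWEL = "\u17B7"   # KHMER VOWEL SIGN I, an above-base vowel

# Reading 1: ZWNJ before the consonant shifter (as in the standard's examples).
zwnj_before_shifter = BASE + ZWNJ + SHIFTER + ABOVE_VOWEL

# Reading 2: ZWNJ between shifter and vowel (as the OpenType description expects).
zwnj_after_shifter = BASE + SHIFTER + ZWNJ + ABOVE_VOWEL


def describe(text):
    """Return the Unicode character names of a string, one per code point."""
    return [unicodedata.name(ch) for ch in text]


for label, seq in [("ZWNJ before shifter", zwnj_before_shifter),
                   ("ZWNJ after shifter", zwnj_after_shifter)]:
    print(label, describe(seq))
```

Both strings contain the same four code points and differ only in the position of ZWNJ, which is exactly the ambiguity described above.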
Re: metric for block coverage
> On Feb 18, 2018, at 3:26 , Khaled Hosny via Unicode wrote:
>
> On Sun, Feb 18, 2018 at 02:14:46AM -0800, James Kass via Unicode wrote:
>> Adam Borowski wrote,
>>
>>> I'm looking for a way to determine a font's coverage of available scripts.
>>> It's probably reasonable to do this per Unicode block. Also, it's a safe
>>> assumption that a font which doesn't know a codepoint can do no complex
>>> shaping of such a glyph, thus looking at just codepoints should be adequate
>>> for our purposes.
>>
>> You probably already know that basic script coverage information is
>> stored internally in OpenType fonts in the OS/2 table.
>>
>> https://docs.microsoft.com/en-us/typography/opentype/spec/os2
>>
>> Parsing the bits in the "ulUnicodeRange..." entries may be the
>> simplest way to get basic script coverage info.
>
> Though this might not be very reliable since OpenType does not have a
> definition of what it means for a Unicode block to be supported; some
> font authoring tools use a percentage, others use the presence of any
> characters in the range, and fonts might even provide incorrect data for
> any reason.
>
> However, I don’t think script or block coverage is that useful, what
> users are usually interested in is the language coverage.
>
> Regards,
> Khaled

All true. In addition, ulUnicodeRange ran out of bits around Unicode 5.1, so scripts/blocks added to Unicode after that, such as Javanese, Tangut, or Adlam, cannot be represented.

Norbert
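As a rough illustration of what "parsing the bits" involves, the sketch below decodes the four 32-bit ulUnicodeRange fields into bit numbers 0 to 127. It assumes the four values have already been read from a font's OS/2 table by some other means; the function name, the sample value and the three-entry label table are hypothetical, and the OS/2 specification remains the authority for what each bit means.

```python
def set_unicode_range_bits(range1, range2, range3, range4):
    """Return the sorted bit numbers (0-127) set across the four 32-bit fields."""
    fields = (range1, range2, range3, range4)
    return [
        32 * index + bit
        for index, value in enumerate(fields)
        for bit in range(32)
        if (value >> bit) & 1
    ]


# Partial, illustrative label table; the OS/2 specification defines the full list.
KNOWN_BITS = {0: "Basic Latin", 1: "Latin-1 Supplement", 9: "Cyrillic"}

# Made-up sample value with bits 0, 1 and 9 set in the first field.
bits = set_unicode_range_bits(0x00000203, 0, 0, 0)
print([KNOWN_BITS.get(b, f"bit {b}") for b in bits])
```

With only 128 bits available, a decoder like this cannot report anything about blocks that were never assigned a bit, and a set bit says nothing about how completely the block is covered.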
Re: Unicode education in Schools
ECMAScript 6 fixed that, largely along the lines of my proposal:
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

Norbert

> On Aug 24, 2017, at 22:14 , Peter Constable via Unicode wrote:
>
> I thought Javascript had a UCS-2 understanding of Unicode strings. Has it
> managed to progress beyond that?
>
> Peter
>
> From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of David Starner
> via Unicode
> Sent: Thursday, August 24, 2017 5:18 PM
> To: Unicode Mailing List
> Subject: Fwd: Unicode education in Schools
>
> -- Forwarded message -
> From: David Starner
> Date: Thu, Aug 24, 2017, 6:16 PM
> Subject: Re: Unicode education in Schools
> To: Richard Wordingham
>
> On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode wrote:
>
> Just steer them away from UTF-16! (And vigorously prohibit the very
> concept of UCS-2).
>
> Richard.
>
> Steer them away from reinventing the wheel. If they use Java, use Java
> strings. If they're using GTK, use strings compatible with GTK. If they're
> writing JavaScript, use JavaScript strings. There's basically no system
> without Unicode strings or that they would be better off rewriting the wheel.
Re: Northern Khmer on iPhone
On iOS, applications can and do install custom fonts for system-wide use, although the installation user experience is pretty bad:
http://norbertlindenberg.com/2015/06/installing-fonts-on-ios/index.html

Norbert

> On Mar 1, 2017, at 18:43 , Alastair Houghton wrote:
[…]
> (Also, FYI, iOS applications can - and some do - install and use their own
> fonts. It’s per-application, though; you can’t install them system-wide.)
Re: Suppressing Ligation of Spacing Marks
The part of the specification of the Universal Shaping Engine [1] that deals with ZWNJ is a bit unclear, but I read it to mean that ZWNJ should not cause the insertion of a dotted circle if the character following it has general category Mn or Mc.

The USE specification says: “The zero-width non-joiner is used to prevent a fusion of two characters. It continues a preceding cluster but causes a cluster break after itself when the following character is not a mark character (gc=Mn or gc=Mc).”

The specification does not say how this character should be handled in cluster validation. I assume first that the statement about the combining grapheme joiner also applies to ZWNJ: “CGJ has been omitted from the above schema in order to avoid unnecessary complexity”. I further interpret the little the spec does say about ZWNJ to imply that it should be allowed before any character with general category Mn or Mc, without affecting the validity of the cluster. Inserting a dotted circle would be equivalent to causing a cluster break, which the spec rules out when the following character has general category Mn or Mc.

U+1A63 has gc=Mc, so it shouldn’t be preceded by a dotted circle in the sequence. Note that I omitted the first “…” from the sequence you provided, because an intervening character might trigger the dotted circle.

So this may just be a bug in the implementation of the USE that you’re using. I see this bug in Safari (CoreText), but not in Firefox (HarfBuzz); haven’t tried Edge. Which one are you using?

[1] http://www.microsoft.com/typography/OpenTypeDev/USE/intro.htm

Best regards,
Norbert

> On Nov 8, 2016, at 18:09 , Richard Wordingham wrote:
>
> Should it be possible to suppress the ligation of a base character and
> a visually following spacing mark in plain text?
>
> The example I have in mind is the sequence U+1A63 TAI THAM VOWEL SIGN AA>. It may be desirable to suppress the
> ligation because both ligands have subscript consonants. However, if
> I write , the Universal Shaping Engine
> decides that the ZWNJ triggers a new syllable, and inserts a dotted
> circle before SIGN AA. (The dotted circle after SIGN AA results from a
> failure to read the proposal for the Lanna script as it was then
> called.)
>
> Richard.
>
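The reading of the USE rule given in the reply above can be sketched in a few lines of Python. This is only an interpretation of the quoted sentence, not code from any shaping engine; the function names are hypothetical, and U+1A20 TAI THAM LETTER HIGH KA is used purely as a convenient non-mark character for the usage example.

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def zwnj_breaks_cluster(following):
    """True if, under this reading, ZWNJ causes a cluster break before `following`."""
    # The quoted rule: no break when the next character is a mark (gc=Mn or gc=Mc).
    return unicodedata.category(following) not in ("Mn", "Mc")


def zwnj_break_positions(text):
    """Indexes just after each ZWNJ where this reading implies a cluster break.

    A ZWNJ at the very end of the text has no following character and is
    not reported.
    """
    return [
        i + 1
        for i, ch in enumerate(text[:-1])
        if ch == ZWNJ and zwnj_breaks_cluster(text[i + 1])
    ]


# ZWNJ followed by U+1A63 (gc=Mc): no break expected under this reading.
print(zwnj_break_positions("\u1A20\u200C\u1A63"))
# ZWNJ followed by a letter: a break is expected after the ZWNJ.
print(zwnj_break_positions("\u1A20\u200C\u1A20"))
```

Under this interpretation, a dotted circle before U+1A63 would amount to a break that the rule does not call for.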
Re: Unicode in passwords
> On Oct 6, 2015, at 6:04 , Philippe Verdy wrote:
>
> In those conditions, normalizing the Java string will leave those lone
> surrogates (and non-characters) as is, or will throw an exception, depending
> on the API used. Java strings do not have any implied encoding (their "char"
> members are also unrestricted 16-bit code units, they have some basic
> properties but only in BMP, defined in the builtin Character class API:
> properties for non-BMP characters require using a library to provide them,
> such as ICU4J).

The Java Character class was enhanced in J2SE 5.0 to support supplementary characters. The String class was specified to be based on UTF-16, and string processing throughout the platform was updated to support supplementary characters based on UTF-16. These changes have been available to the public since 2004. For a summary, see
http://www.oracle.com/technetwork/articles/java/supplementary-142654.html

Norbert
Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
RFC 7158 section 7 [1] provides not only the \uXXXX notation for Unicode code points in the Basic Multilingual Plane, but also a 12-character sequence encoding the UTF-16 surrogate pair (i.e. \uXXXX\uYYYY with 0xD800 ≤ XXXX < 0xDC00 ≤ YYYY ≤ 0xDFFF) for supplementary Unicode code points. A tool checking for escape sequences that don’t correspond to any Unicode character must be aware of this, because neither \uXXXX nor \uYYYY by itself would correspond to any Unicode character, but their combination may well do so.

Norbert

[1] https://tools.ietf.org/html/rfc7158#section-7

On May 7, 2015, at 5:46 , Costello, Roger L. coste...@mitre.org wrote:

> Hi Folks,
>
> The JSON specification says that a character may be escaped using this
> notation: \uXXXX (XXXX are four hex digits)
>
> However, not every four hex digits corresponds to a Unicode character.
>
> Are there tools to scan a JSON document to detect the presence of \uXXXX,
> where XXXX does not correspond to any Unicode character?
>
> /Roger
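A minimal sketch of such a check in Python follows. It scans raw JSON text for \u escapes with four hex digits, combines an adjacent high and low surrogate escape into one supplementary code point before judging it, and reports lone surrogates separately. Treating general category Cn (as reported by the Python version's own Unicode database) as "does not correspond to a character" is an assumption of this sketch, and the function name and result format are hypothetical.

```python
import re
import unicodedata

# Matches any backslash escape; group 1 is set only for a u escape with 4 hex digits.
# Consuming every escape in order keeps an escaped backslash followed by "uXXXX"
# from being misread as a Unicode escape.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|.)", re.DOTALL)


def find_suspect_escapes(json_text):
    """Return (offset, escape_text, reason) tuples for suspicious Unicode escapes."""
    escapes = [(m.start(), m.group(0), int(m.group(1), 16))
               for m in _ESCAPE.finditer(json_text) if m.group(1)]
    problems = []
    i = 0
    while i < len(escapes):
        pos, text, unit = escapes[i]
        i += 1
        if 0xD800 <= unit <= 0xDBFF:
            nxt = escapes[i] if i < len(escapes) else None
            # A pair only counts if the low surrogate escape follows immediately.
            if nxt and nxt[0] == pos + 6 and 0xDC00 <= nxt[2] <= 0xDFFF:
                code_point = 0x10000 + ((unit - 0xD800) << 10) + (nxt[2] - 0xDC00)
                text += nxt[1]
                i += 1
            else:
                problems.append((pos, text, "lone high surrogate"))
                continue
        elif 0xDC00 <= unit <= 0xDFFF:
            problems.append((pos, text, "lone low surrogate"))
            continue
        else:
            code_point = unit
        if unicodedata.category(chr(code_point)) == "Cn":
            problems.append((pos, text, "unassigned or noncharacter"))
    return problems


print(find_suspect_escapes(r'{"ok": "\uD83D\uDE00", "bad": "\uD83D"}'))
```

Because the assigned/unassigned answer comes from the Unicode version bundled with the interpreter, results for recently added characters can differ between Python releases.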
Re: Swift
It does allow some usage that may surprise code reviewers – for example, this is a valid Swift program: let s = let s︀ = let ︀ = let all = s + s︀ + ︀ The value of the constant “all” is . Or at least it is as long as mail software doesn’t harm the variation selectors… Norbert On Jun 5, 2014, at 9:06 , Mark Davis ☕️ m...@macchiato.com wrote: I haven't done any analysis, but on first glance it looks like it is based on http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax Mark — Il meglio è l’inimico del bene — On Thu, Jun 5, 2014 at 5:46 PM, Jeff Senn s...@maya.com wrote: Has anyone figured out whether character sequences that are non-canonical (de)compositions but could be recomposed to the same result are the same identifier or not? That is: are identifiers merely sequences of characters or intended to be comparable as “Unicode strings” (under some sort of compatibility rule)? On Jun 5, 2014, at 11:27 AM, Martin v. Löwis mar...@v.loewis.de wrote: Am 04.06.14 11:28, schrieb Andre Schappo: The restrictions seem a little like IDNA2008. Anyone have links to info giving a detailed explanation/tabulation of allowed and non allowed Unicode chars for Swift Variable and Constant names? 
The language reference is at
https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html

For reference, the definition of identifier-character is (read each line as an alternative)

identifier-character → Digit 0 through 9
identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or U+FE20–U+FE2F
identifier-character → identifier-head

where identifier-head is

identifier-head → Upper- or lowercase letter A through Z
identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or U+00B7–U+00BA
identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or U+00F8–U+00FF
identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or U+180F–U+1DBF
identifier-head → U+1E00–U+1FFF
identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054, or U+2060–U+206F
identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or U+2776–U+2793
identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF
identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or U+3040–U+D7FF
identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or U+FE30–U+FE44
identifier-head → U+FE47–U+FFFD
identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or U+40000–U+4FFFD
identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or U+80000–U+8FFFD
identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or U+C0000–U+CFFFD
identifier-head → U+D0000–U+DFFFD or U+E0000–U+EFFFD

As the construction principle for this list, they say

Identifiers begin with an upper case or lower case letter A through Z, an underscore (_), a noncombining alphanumeric Unicode character in the Basic Multilingual Plane, or a character outside the Basic Multilingual Plane that isn’t in a Private Use Area. After the first character, digits and combining Unicode characters are also allowed.
Regards, Martin ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
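The identifier grammar quoted in this thread can be turned into a small range lookup, sketched here in Python. The range table is transcribed from the quoted alternatives, reading the supplementary ranges as planes 1 through 14 without their last two code points (U+10000–U+1FFFD through U+E0000–U+EFFFD); the underscore is included because the quoted construction principle mentions it. The function names are hypothetical, and this is not the Swift compiler's actual lexer.

```python
# Inclusive code point ranges for identifier-head, transcribed from the quoted grammar.
IDENTIFIER_HEAD_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0x5F, 0x5F),  # A-Z, a-z, underscore
    (0xA8, 0xA8), (0xAA, 0xAA), (0xAD, 0xAD), (0xAF, 0xAF),
    (0xB2, 0xB5), (0xB7, 0xBA), (0xBC, 0xBE), (0xC0, 0xD6),
    (0xD8, 0xF6), (0xF8, 0xFF), (0x100, 0x2FF), (0x370, 0x167F),
    (0x1681, 0x180D), (0x180F, 0x1DBF), (0x1E00, 0x1FFF),
    (0x200B, 0x200D), (0x202A, 0x202E), (0x203F, 0x2040), (0x2054, 0x2054),
    (0x2060, 0x206F), (0x2070, 0x20CF), (0x2100, 0x218F), (0x2460, 0x24FF),
    (0x2776, 0x2793), (0x2C00, 0x2DFF), (0x2E80, 0x2FFF),
    (0x3004, 0x3007), (0x3021, 0x302F), (0x3031, 0x303F), (0x3040, 0xD7FF),
    (0xF900, 0xFD3D), (0xFD40, 0xFDCF), (0xFDF0, 0xFE1F), (0xFE30, 0xFE44),
    (0xFE47, 0xFFFD),
]
# Planes 1 through 14, each without its last two code points (assumed reading).
IDENTIFIER_HEAD_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 15)]

# Combining ranges allowed only after the first character.
COMBINING_RANGES = [(0x300, 0x36F), (0x1DC0, 0x1DFF), (0x20D0, 0x20FF), (0xFE20, 0xFE2F)]


def _in_ranges(code_point, ranges):
    return any(low <= code_point <= high for low, high in ranges)


def is_identifier_head(code_point):
    """True if the code point may start an identifier under the quoted grammar."""
    return _in_ranges(code_point, IDENTIFIER_HEAD_RANGES)


def is_identifier_character(code_point):
    """True if the code point may appear after the first character."""
    return (0x30 <= code_point <= 0x39
            or _in_ranges(code_point, COMBINING_RANGES)
            or is_identifier_head(code_point))


def is_identifier(text):
    """True if every code point of a non-empty string fits the quoted grammar."""
    return (bool(text)
            and is_identifier_head(ord(text[0]))
            and all(is_identifier_character(ord(ch)) for ch in text[1:]))


# U+FE00 VARIATION SELECTOR-1 falls inside U+FDF0-U+FE1F, so "s" and "s" + VS1
# are both accepted, and they are different strings.
print(is_identifier("s"), is_identifier("s\uFE00"), "s" != "s\uFE00")
```

The last line illustrates why identifiers that differ only by a variation selector can coexist under these ranges.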
Re: data for cp1252
On Dec 7, 2012, at 17:48 , Buck Golemon wrote:

> It's also correct. *All* browsers have this behavior. The W3C has found this behavior to be correct. Opera at one point in time implemented the current unicode.org cp1252 spec, but was forced to change to the W3C spec by real-world requirements.

Correction: The W3C has not said anything on this matter. The proposed encoding specification was written by Anne under the WHATWG umbrella. The W3C Internationalization working group (of which I'm a member) and Anne met during TPAC 2012 in October and agreed to kick off the process of turning the spec into a W3C recommendation by publishing it as a working draft. It may well change somewhat on the way.

This discussion actually makes me think of one necessary change: The specification should clarify that it does not redefine existing encodings, and not label the mappings provided by the spec with existing encoding names. The spec is targeting web user agents, but the encodings are also used in many software systems that are not and don't directly interact with web user agents, and the spec shouldn't be interpreted to interfere with those uses.

Norbert
Supplementary characters in the Java(TM) platform
I know a number of you are curious about how the Java platform will support the supplementary characters of the Unicode standard. The JSR 204 expert group, consisting of experts from ten companies, has today published the Public Review Draft of its specification at:
http://jcp.org/aboutJava/communityprocess/review/jsr204/index.html

Highlights:
- Low-level APIs use the primitive type int to represent Unicode code points.
- Higher-level APIs rely on char sequences, such as String and char[], which are interpreted as UTF-16 sequences.
- There are methods to easily convert between various char and code point based representations.
- Supplementary characters are allowed in Java programming language identifiers.

Almost all of the specified functionality is implemented in the beta release of J2SE 1.5, which is available at:
http://java.sun.com/j2se/1.5.0/index.jsp

If you have comments, please send them to the official feedback address:
[EMAIL PROTECTED]

You can hear more about the supplementary character support in the J2SE platform in session B6 of the upcoming Unicode conference:
http://www.unicode.org/iuc/iuc25/a338.html

Sorry if you see this message twice - my posting yesterday made it into the digest, but not to all subscribers or into the archive.

Best regards,
Norbert
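The relationship between code points and UTF-16 char sequences described in the highlights comes down to the standard surrogate pair arithmetic, sketched here in Python. The function names are illustrative only and are not the JSR 204 API; U+1D456 is used as a sample supplementary code point.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    # The top 10 bits go into the high surrogate, the bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def to_code_point(high, low):
    """Combine a valid surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_length(text):
    """Number of UTF-16 code units needed to represent a Python string."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


high, low = to_surrogates(0x1D456)
print(hex(high), hex(low), hex(to_code_point(high, low)))
print(utf16_length("a\U0001D456"))
```

A code-point-based API works with the single integer, while a char-sequence API sees the two code units; converting between the two views is the job of functions like these.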
Re: Does Java 1.5 support Unicode math alphanumerics as variable names?
Murray,

Yes, starting from J2SE 1.5 the Java programming language allows supplementary characters in identifiers if they meet the specifications of the new methods java.lang.Character.isJavaIdentifierStart(int) and java.lang.Character.isJavaIdentifierPart(int).

Sorry for the late reply - under the rules of the Java Community Process it had to wait until the Public Review Draft was published.

Best regards,
Norbert

On Jan 23, 2004, at 17:46, Murray Sargent wrote:

> E.g., math italic i (U+1D456)? With such usage, Java mathematical programs could look more like the original math.
>
> Thanks
> Murray
Re: Traditional dollar sign
The holographic strip on the Euro notes shows the Euro symbol when viewed at certain angles.

Norbert

Peter Kirk wrote:

> The latest issue of UK banknotes do carry the pound sterling sign (with one crossbar), but this is quite new. At least the more recent former issues did not, if I remember correctly. I was surprised to find no Euro symbol on Euro notes or coins.